[CUWiN-Dev] hslsd updates coming

Tue Sep 27 11:30:09 CDT 2005

On Mon, 26 Sep 2005, Bill Comisky wrote:

> On Thu, 22 Sep 2005, David Young wrote:
>
>>  On Thu, Sep 22, 2005 at 04:17:20PM -0500, Bill Comisky wrote:
>> >  On Wed, 21 Sep 2005, David Young wrote:
>> > 
>> > >  I have found some hslsd bugs by watching the Race Street network,
>> > >  which keeps growing with Tom Wiltzius' help, and by watching the
>> > >  indoor testbed. I have some fixes under development.
>> > > 
>> > >  Dave
>> > 
>> >  I've seen recently a few occasions where a node will reboot frequently,
>> >  though the intervals vary; sometimes hours between reboots and sometimes
>> >  minutes.  The ETX metric and beacon strength to the gateway from the
>> >  node in question looks like a solid link, and there is typically a fair
>> >  amount of traffic on the wireless network at the time... I have some
>> >  cron jobs fetching files on a few nodes, including the rebooting one.
>> > 
>> >  Once the node has rebooted, the evidence for what happened is gone, but
>> >  I have seen an hslsd segfault before, in dmesg output and
>> >  /var/core/hslsd.* files.  Is this symptomatic of the bugs you've found?
>>
>>  I know of a rare condition where hslsd will segfault.  I'm working on
>>  a fix in the ls-refcnt-hsls branch.  There may be other conditions, too.
>>
>>  If hellowdog finds that hslsd isn't running, it should not stop the
>>  watchdog tickle, but it should restart hslsd.  I guess it's possible hslsd
>>  will fail to restart if, say, the memory disk is full of core files....
>> 
>> >  I could tweak hellowdog to scp over some information before rebooting
>> >  (core files, dmesg output, etc); I guess you'd need the unstripped
>> >  binaries too.  Let me know if this would be useful.
>>
>>  That would be very useful.
>
> I've upgraded since the last time this happened (CUWiN and NetBSD srcs 
> rsync'd to yours), and haven't seen it since; though the traffic pattern on 
> the testbed may have changed as well.

Ok, since my last upgrade I've seen a node in the testbed rebooting 
frequently again.  This time I had it dump a bunch of things if hellowdog 
sends STOP to wdogctl.  No hslsd core files, but from the daemon log it 
looks like the tickle process is hanging or not being reaped or something:

$ grep tickle daemon
daemon:Sep  5 18:55:37 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
daemon:Sep  5 18:55:52 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
daemon:Sep  5 18:56:07 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
daemon:Sep  5 18:56:22 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
daemon:Sep  5 18:56:37 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
daemon:Sep  5 18:56:53 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
daemon:Sep  5 18:57:08 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
daemon:Sep  5 18:57:23 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
daemon:Sep  5 18:57:38 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
daemon:Sep  5 18:57:53 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
daemon:Sep  5 18:58:08 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
daemon:Sep  5 18:58:23 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
daemon:Sep  5 18:58:38 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
daemon:Sep  5 18:58:54 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
daemon:Sep  5 18:59:09 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
daemon:Sep  5 18:59:24 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
daemon:Sep  5 18:59:39 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
daemon:Sep  5 18:59:54 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
daemon:Sep  5 19:00:10 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
daemon:Sep  5 19:00:25 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
daemon:Sep  5 19:00:40 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs

So 5 minutes go by without a successful tickle and hellowdog sends STOP.
I just caught it again, and tickled the watchdog manually.  The tickle PID
this time is 11107, and ps shows that it's a zombie:

# ps auxw | grep 11107
root   11107  0.0  0.0    0    0 ?      ZW         - 0:00.00 (sh)

I peeked in hsls_watchdog.c, but am not familiar enough with signal 
handlers to know where something could go wrong.  Ideas?

Bill

--
Bill Comisky
bcomisky at pobox.com