[CUWiN-Dev] hslsd updates coming

David Young dyoung at pobox.com
Wed Oct 12 00:55:44 CDT 2005


On Tue, Sep 27, 2005 at 11:30:09AM -0500, Bill Comisky wrote:
> On Mon, 26 Sep 2005, Bill Comisky wrote:
> 
> >On Thu, 22 Sep 2005, David Young wrote:
> >
> >> On Thu, Sep 22, 2005 at 04:17:20PM -0500, Bill Comisky wrote:
> >>>  On Wed, 21 Sep 2005, David Young wrote:
> >>> 
> >>> >  I have found some hslsd bugs by watching the Race Street network,
> >>> >  which keeps growing with Tom Wiltzius' help, and by watching the
> >>> >  indoor testbed. I have some fixes under development.
> >>> > 
> >>> >  Dave
> >>> 
> >>>  I've seen recently a few occasions where a node will reboot frequently,
> >>>  though the intervals vary; sometimes hours between reboots and
> >>>  sometimes minutes.  The ETX metric and beacon strength to the gateway
> >>>  from the node in question looks like a solid link, and there is
> >>>  typically a fair amount of traffic on the wireless network at the
> >>>  time... I have some cron jobs fetching files on a few nodes, including
> >>>  the rebooting one.
> >>> 
> >>>  Once the node has rebooted, the evidence for what happened is gone,
> >>>  but I have seen an hslsd segfault before, in dmesg output and
> >>>  /var/core/hslsd.* files.  Is this symptomatic of the bugs you've found?
> >>
> >> I know of a rare condition where hslsd will segfault.  I'm working on
> >> a fix in the ls-refcnt-hsls branch.  There may be other conditions, too.
> >>
> >> If hellowdog finds that hslsd isn't running, it should not stop the
> >> watchdog tickle, but it should restart hslsd.  I guess it's possible
> >> hslsd will fail to restart if, say, the memory disk is full of core
> >> files....
> >>
> >>>  I could tweak hellowdog to scp over some information before rebooting
> >>>  (core files, dmesg output, etc); I guess you'd need the unstripped
> >>>  binaries too.  Let me know if this would be useful.
> >>
> >> That would be very useful.
> >
> >I've upgraded since the last time this happened (CUWiN and NetBSD srcs 
> >rsync'd to yours), and haven't seen it since; though the traffic pattern 
> >on the testbed may have changed as well.
> 
> Ok, since my last upgrade I've seen a node in the testbed rebooting 
> frequently again.  This time I had it dump a bunch of things if hellowdog 
> sends STOP to wdogctl.  No hslsd core files, but from the daemon log it 
> looks like the tickle process is hanging or not being reaped or something:
> 
> $ grep tickle daemon
> daemon:Sep  5 18:55:37 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
> daemon:Sep  5 18:55:52 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
> daemon:Sep  5 18:56:07 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
> daemon:Sep  5 18:56:22 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
> daemon:Sep  5 18:56:37 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
> daemon:Sep  5 18:56:53 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
> daemon:Sep  5 18:57:08 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
> daemon:Sep  5 18:57:23 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
> daemon:Sep  5 18:57:38 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
> daemon:Sep  5 18:57:53 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
> daemon:Sep  5 18:58:08 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
> daemon:Sep  5 18:58:23 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
> daemon:Sep  5 18:58:38 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
> daemon:Sep  5 18:58:54 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
> daemon:Sep  5 18:59:09 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
> daemon:Sep  5 18:59:24 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
> daemon:Sep  5 18:59:39 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
> daemon:Sep  5 18:59:54 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
> daemon:Sep  5 19:00:10 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
> daemon:Sep  5 19:00:25 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs
> daemon:Sep  5 19:00:40 cuw hslsd: hsls_shell_tickle: tickle cancelled, tickle process 6648 still runs

Bill,

I believe this patch will make the problem go away.

Dave

-- 
David Young             OJC Technologies
dyoung at ojctech.com      Urbana, IL * (217) 278-3933
-------------- next part --------------
Index: hsls/hsls_watchdog.c
===================================================================
--- hsls/hsls_watchdog.c	(revision 3562)
+++ hsls/hsls_watchdog.c	(working copy)
@@ -140,6 +140,7 @@
 static int
 hsls_shell_tickle(struct hsls_watchdog *hw)
 {
+	int status;
 	struct hsls_shell_watchdog *hsw;
 	struct timeval now;
 
@@ -155,7 +156,8 @@
 	if (!ratecheck(&now, &hsw->hsw_lasttime, &hsw->hsw_mininterval))
 		return 0;
 
-	if (hsw->hsw_pid != 0) {
+	if (hsw->hsw_pid != 0 &&
+	    waitpid(hsw->hsw_pid, &status, WNOHANG) == -1) {
 		loglib_warnx("%s: tickle cancelled, "
 		    "tickle process %u still runs", __func__, hsw->hsw_pid);
 		return 0;
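
For reference, here is a minimal sketch of the non-blocking reap that the
patch relies on, written as a standalone helper rather than as hslsd code.
The names poll_child and enum child_state are hypothetical, and *pidp only
stands in for hsw->hsw_pid. Under POSIX, waitpid() with WNOHANG returns 0
while the child is still running, the child's pid once it has been
collected, and -1 on error (for example ECHILD), so the sketch treats a
return of 0 as "still runs".

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Possible states of a previously forked helper process. */
enum child_state { CHILD_NONE, CHILD_RUNNING, CHILD_REAPED, CHILD_ERROR };

/*
 * Poll a child without blocking.  *pidp is a hypothetical stand-in for
 * hsw->hsw_pid; it is reset to 0 once the child no longer needs tracking.
 */
static enum child_state
poll_child(pid_t *pidp)
{
	int status;
	pid_t rc;

	if (*pidp == 0)
		return CHILD_NONE;	/* nothing was started */

	rc = waitpid(*pidp, &status, WNOHANG);
	if (rc == 0)
		return CHILD_RUNNING;	/* still alive: skip this tickle */

	*pidp = 0;			/* exited or unknown: stop tracking */
	return (rc == -1) ? CHILD_ERROR : CHILD_REAPED;
}
```

A caller would cancel the tickle only on CHILD_RUNNING. Clearing the stored
pid after a successful reap is what keeps a long-dead process from being
reported as running on every later call.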

