[CUWiN-Dev] hslsd updates coming
David Young
dyoung at pobox.com
Wed Oct 12 00:55:44 CDT 2005
On Tue, Sep 27, 2005 at 11:30:09AM -0500, Bill Comisky wrote:
> On Mon, 26 Sep 2005, Bill Comisky wrote:
>
> >On Thu, 22 Sep 2005, David Young wrote:
> >
> >> On Thu, Sep 22, 2005 at 04:17:20PM -0500, Bill Comisky wrote:
> >>> On Wed, 21 Sep 2005, David Young wrote:
> >>>
> >>> > I have found some hslsd bugs by watching the Race Street network,
> >>> > which keeps growing with Tom Wiltzius' help, and by watching the
> >>> > indoor testbed. I have some fixes under development.
> >>> >
> >>> > Dave
> >>>
> >>> I've recently seen a few occasions where a node reboots frequently,
> >>> though the intervals vary: sometimes hours between reboots and
> >>> sometimes minutes. The ETX metric and beacon strength to the gateway
> >>> from the node in question look like a solid link, and there is
> >>> typically a fair amount of traffic on the wireless network at the
> >>> time... I have some cron jobs fetching files on a few nodes,
> >>> including the rebooting one.
> >>>
> >>> Once the node has rebooted, the evidence for what happened is gone,
> >>> but I have seen an hslsd segfault before, in dmesg output and
> >>> /var/core/hslsd.* files. Is this symptomatic of the bugs you've found?
> >>
> >> I know of a rare condition where hslsd will segfault. I'm working on
> >> a fix in the ls-refcnt-hsls branch. There may be other conditions, too.
> >>
> >> If hellowdog finds that hslsd isn't running, it should not stop the
> >> watchdog tickle, but it should restart hslsd. I guess it's possible
> >> hslsd will fail to restart if, say, the memory disk is full of core
> >> files....
> >>
> >>> I could tweak hellowdog to scp over some information before rebooting
> >>> (core files, dmesg output, etc); I guess you'd need the unstripped
> >>> binaries too. Let me know if this would be useful.
> >>
> >> That would be very useful.
> >
> >I've upgraded since the last time this happened (CUWiN and NetBSD srcs
> >rsync'd to yours), and haven't seen it since; though the traffic pattern
> >on the testbed may have changed as well.
>
> Ok, since my last upgrade I've seen a node in the testbed rebooting
> frequently again. This time I had it dump a bunch of things if hellowdog
> sends STOP to wdogctl. No hslsd core files, but from the daemon log it
> looks like the tickle process is hanging or not being reaped or something:
>
> $ grep tickle daemon
> daemon:Sep 5 18:55:37 cuw hslsd: hsls_shell_tickle: tickle cancelled,
> tickle process 6648 still runs
> daemon:Sep 5 18:55:52 cuw hslsd: hsls_shell_tickle: tickle cancelled,
> tickle process 6648 still runs
> daemon:Sep 5 18:56:07 cuw hslsd: hsls_shell_tickle: tickle cancelled,
> tickle process 6648 still runs
> daemon:Sep 5 18:56:22 cuw hslsd: hsls_shell_tickle: tickle cancelled,
> tickle process 6648 still runs
> daemon:Sep 5 18:56:37 cuw hslsd: hsls_shell_tickle: tickle cancelled,
> tickle process 6648 still runs
> daemon:Sep 5 18:56:53 cuw hslsd: hsls_shell_tickle: tickle cancelled,
> tickle process 6648 still runs
> daemon:Sep 5 18:57:08 cuw hslsd: hsls_shell_tickle: tickle cancelled,
> tickle process 6648 still runs
> daemon:Sep 5 18:57:23 cuw hslsd: hsls_shell_tickle: tickle cancelled,
> tickle process 6648 still runs
> daemon:Sep 5 18:57:38 cuw hslsd: hsls_shell_tickle: tickle cancelled,
> tickle process 6648 still runs
> daemon:Sep 5 18:57:53 cuw hslsd: hsls_shell_tickle: tickle cancelled,
> tickle process 6648 still runs
> daemon:Sep 5 18:58:08 cuw hslsd: hsls_shell_tickle: tickle cancelled,
> tickle process 6648 still runs
> daemon:Sep 5 18:58:23 cuw hslsd: hsls_shell_tickle: tickle cancelled,
> tickle process 6648 still runs
> daemon:Sep 5 18:58:38 cuw hslsd: hsls_shell_tickle: tickle cancelled,
> tickle process 6648 still runs
> daemon:Sep 5 18:58:54 cuw hslsd: hsls_shell_tickle: tickle cancelled,
> tickle process 6648 still runs
> daemon:Sep 5 18:59:09 cuw hslsd: hsls_shell_tickle: tickle cancelled,
> tickle process 6648 still runs
> daemon:Sep 5 18:59:24 cuw hslsd: hsls_shell_tickle: tickle cancelled,
> tickle process 6648 still runs
> daemon:Sep 5 18:59:39 cuw hslsd: hsls_shell_tickle: tickle cancelled,
> tickle process 6648 still runs
> daemon:Sep 5 18:59:54 cuw hslsd: hsls_shell_tickle: tickle cancelled,
> tickle process 6648 still runs
> daemon:Sep 5 19:00:10 cuw hslsd: hsls_shell_tickle: tickle cancelled,
> tickle process 6648 still runs
> daemon:Sep 5 19:00:25 cuw hslsd: hsls_shell_tickle: tickle cancelled,
> tickle process 6648 still runs
> daemon:Sep 5 19:00:40 cuw hslsd: hsls_shell_tickle: tickle cancelled,
> tickle process 6648 still runs
Bill,
I believe this patch will make the problem go away.
Dave
--
David Young OJC Technologies
dyoung at ojctech.com Urbana, IL * (217) 278-3933
-------------- next part --------------
Index: hsls/hsls_watchdog.c
===================================================================
--- hsls/hsls_watchdog.c (revision 3562)
+++ hsls/hsls_watchdog.c (working copy)
@@ -140,6 +140,7 @@
 static int
 hsls_shell_tickle(struct hsls_watchdog *hw)
 {
+	int status;
 	struct hsls_shell_watchdog *hsw;
 	struct timeval now;
 
@@ -155,7 +156,8 @@
 	if (!ratecheck(&now, &hsw->hsw_lasttime, &hsw->hsw_mininterval))
 		return 0;
 
-	if (hsw->hsw_pid != 0) {
+	if (hsw->hsw_pid != 0 &&
+	    waitpid(hsw->hsw_pid, &status, WNOHANG) == -1) {
 		loglib_warnx("%s: tickle cancelled, "
 		    "tickle process %u still runs", __func__, hsw->hsw_pid);
 		return 0;
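
For anyone reading the diff without the full source, here is a minimal standalone sketch of the non-blocking reap that the patch relies on. It is not hslsd code: poll_child(), enum child_state and demo() are hypothetical names made up for illustration. The only behavior assumed is the POSIX one: waitpid() with WNOHANG returns 0 while the child is still running, the child's pid once it has been reaped, and -1 on error (for example ECHILD when there is nothing left to wait for).

```c
#define _POSIX_C_SOURCE 200809L

#include <sys/types.h>
#include <sys/wait.h>
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical result codes for illustration; not part of hslsd. */
enum child_state { CHILD_RUNNING, CHILD_REAPED, CHILD_GONE };

/* Check on one child without blocking the caller. */
static enum child_state
poll_child(pid_t pid)
{
	int status;
	pid_t rc;

	assert(pid > 0);	/* pid <= 0 would make waitpid match other children */

	do {
		rc = waitpid(pid, &status, WNOHANG);
	} while (rc == -1 && errno == EINTR);

	if (rc == 0)
		return CHILD_RUNNING;	/* child exists and has not exited */
	if (rc == pid)
		return CHILD_REAPED;	/* exit status collected just now */
	return CHILD_GONE;		/* rc == -1, e.g. ECHILD */
}

/*
 * Fork a child that blocks, then observe all three states in turn.
 * Returns 0 if the observed sequence is running, reaped, gone.
 */
int
demo(void)
{
	const struct timespec pause_1ms = { 0, 1000000L };
	enum child_state s;
	pid_t pid = fork();

	if (pid == -1)
		return -1;
	if (pid == 0) {
		pause();		/* child: block until signalled */
		_exit(0);
	}
	if (poll_child(pid) != CHILD_RUNNING)
		return -1;
	kill(pid, SIGKILL);
	while ((s = poll_child(pid)) == CHILD_RUNNING)
		nanosleep(&pause_1ms, NULL);
	if (s != CHILD_REAPED)
		return -1;
	return poll_child(pid) == CHILD_GONE ? 0 : -1;
}
```

Under these semantics a child that is still running shows up as a return value of 0, not -1, so the `== -1` comparison in the hunk above may be worth checking against the rest of hsls_shell_tickle() before applying.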