[CUWiN-Dev] hslsd updates coming

David Young dyoung at pobox.com
Mon Oct 17 16:38:27 CDT 2005


On Mon, Oct 17, 2005 at 02:46:06PM -0500, Bill Comisky wrote:
> On Wed, 12 Oct 2005, David Young wrote:
> 
> >On Tue, Sep 27, 2005 at 11:30:09AM -0500, Bill Comisky wrote:
>    [ snip ]
> >>Ok, since my last upgrade I've seen a node in the testbed rebooting
> >>frequently again.  This time I had it dump a bunch of things if hellowdog
> >>sends STOP to wdogctl.  No hslsd core files, but from the daemon log it
> >>looks like the tickle process is hanging or not being reaped or something:
> >>
> >>$ grep tickle daemon
> >>daemon:Sep  5 18:55:37 cuw hslsd: hsls_shell_tickle: tickle cancelled,
> >>tickle process 6648 still runs
>    [ snip ]
> >>daemon:Sep  5 18:55:52 cuw hslsd: hsls_shell_tickle: tickle cancelled,
> >>tickle process 6648 still runs
> >
> >Bill,
> >
> >I believe this patch will make the problem go away.
> >
> >Dave
> 
> Dave,
> 
> Good seeing you last week.. Attached is a diff of my patched 
> hsls_watchdog.c against HEAD.  When I read the waitpid() manpage I thought 
> that 0 was returned when WNOHANG was specified if the process was still 
> running (a condition it would've waited for without WNOHANG), and -1 was 
> for processes not found.  I don't think I've hit that IF condition since 
> adding the waitpid() though..

Bill,

It looks more complicated all the time.  My new interpretation of the
manual page is this:

        pid_t rc;
        rc = waitpid(pid, ..., WNOHANG);

        rc == -1, errno == ECHILD: no children
                  errno == EINTR:  interrupted by a signal
                  errno == EINVAL: bad arguments
                  errno == EFAULT: bad address
        rc == 0:                   child `pid' exists, and it still runs
                                   (in the man page, "no stopped or
                                   exited children")
        rc == pid:                 child `pid' has exited or stopped

Under these conditions, the old tickle process no longer runs---perhaps
we should warn if the process was stopped:

        rc == -1, errno == ECHILD: no children
        rc == pid:                 child `pid' has exited or stopped

Under this condition, the old tickle process continues to run:

        rc == 0:                   child `pid' exists, and it still runs
                                   (in the man page, "no stopped or
                                   exited children")

In this case, loglib_warn() and try again?

        rc == -1, errno == EINTR:  interrupted by a signal

In these cases, hslsd is probably FUBAR.  May as well
loglib_err(EXIT_FAILURE).  The watchdog script will restart hslsd.

        rc == -1, errno == EINVAL: bad arguments
        rc == -1, errno == EFAULT: bad address
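
As a rough sketch of the case analysis above (not the actual hslsd code),
the dispatch could look something like the following.  The names
tickle_state, classify_wait() and poll_tickle() are made up for
illustration.  I assume WUNTRACED is passed along with WNOHANG, since
otherwise waitpid() does not report stopped children at all, and I lump
any unexpected errno in with EINVAL/EFAULT.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical names for illustration; none of these exist in hslsd. */
enum tickle_state {
	TICKLE_GONE,	/* child exited, or ECHILD: nothing left to wait for */
	TICKLE_STOPPED,	/* child reported as stopped: caller may want to warn */
	TICKLE_RUNNING,	/* rc == 0: child exists and still runs */
	TICKLE_RETRY,	/* EINTR: interrupted by a signal, try again */
	TICKLE_FATAL	/* EINVAL, EFAULT, or anything unexpected */
};

/*
 * Map the result of waitpid(pid, &status, WNOHANG | WUNTRACED) onto the
 * cases listed above.  `err' is the value of errno right after the call;
 * it is only consulted when rc == -1.
 */
enum tickle_state
classify_wait(pid_t pid, pid_t rc, int status, int err)
{
	assert(pid > 0);	/* we always wait for one specific child */

	if (rc == pid)
		return WIFSTOPPED(status) ? TICKLE_STOPPED : TICKLE_GONE;
	if (rc == 0)
		return TICKLE_RUNNING;
	if (rc == -1 && err == ECHILD)
		return TICKLE_GONE;
	if (rc == -1 && err == EINTR)
		return TICKLE_RETRY;
	return TICKLE_FATAL;
}

/* Poll the old tickle process once without blocking. */
enum tickle_state
poll_tickle(pid_t pid)
{
	int status = 0;
	pid_t rc = waitpid(pid, &status, WNOHANG | WUNTRACED);

	return classify_wait(pid, rc, status, errno);
}
```

The caller would then start a new tickle on TICKLE_GONE, warn first on
TICKLE_STOPPED, leave the old process alone on TICKLE_RUNNING, warn and
call again on TICKLE_RETRY, and bail out with EXIT_FAILURE on
TICKLE_FATAL so that the watchdog script restarts hslsd.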

Dave

-- 
David Young             OJC Technologies
dyoung at ojctech.com      Urbana, IL * (217) 278-3933

