[CUWiN-Dev] hslsd updates coming
David Young
dyoung at pobox.com
Mon Oct 17 16:38:27 CDT 2005
On Mon, Oct 17, 2005 at 02:46:06PM -0500, Bill Comisky wrote:
> On Wed, 12 Oct 2005, David Young wrote:
>
> >On Tue, Sep 27, 2005 at 11:30:09AM -0500, Bill Comisky wrote:
> [ snip ]
> >>Ok, since my last upgrade I've seen a node in the testbed rebooting
> >>frequently again. This time I had it dump a bunch of things if hellowdog
> >>sends STOP to wdogctl. No hslsd core files, but from the daemon log it
> >>looks like the tickle process is hanging or not being reaped or something:
> >>
> >>$ grep tickle daemon
> >>daemon:Sep 5 18:55:37 cuw hslsd: hsls_shell_tickle: tickle cancelled,
> >>tickle process 6648 still runs
> [ snip ]
> >>daemon:Sep 5 18:55:52 cuw hslsd: hsls_shell_tickle: tickle cancelled,
> >>tickle process 6648 still runs
> >
> >Bill,
> >
> >I believe this patch will make the problem go away.
> >
> >Dave
>
> Dave,
>
> Good seeing you last week.. Attached is a diff of my patched
> hsls_watchdog.c against HEAD. When I read the waitpid() manpage I thought
> that 0 was returned when WNOHANG was specified if the process was still
> running (a condition it would've waited for without WNOHANG), and -1 was
> for processes not found. I don't think I've hit that IF condition since
> adding the waitpid() though..
Bill,
It looks more complicated all the time. My new interpretation of the
manual page is this:

        pid_t rc;

        rc = waitpid(pid, ..., WNOHANG);

        rc == -1, errno == ECHILD: no children
                  errno == EINTR:  interrupted by a signal
                  errno == EINVAL: bad arguments
                  errno == EFAULT: bad address
        rc == 0:   child `pid' exists, and it still runs
                   (in the man page, "no stopped or
                   exited children")
        rc == pid: child `pid' has exited or stopped

Under these conditions, the old tickle process no longer runs. Perhaps
we should warn if the process was stopped:

        rc == -1, errno == ECHILD: no children
        rc == pid: child `pid' has exited or stopped

Under this condition, the old tickle process continues to run:

        rc == 0: child `pid' exists, and it still runs
                 (in the man page, "no stopped or
                 exited children")

In this case, loglib_warn() and try again?

        rc == -1, errno == EINTR: interrupted by a signal

In these cases, hslsd is probably FUBAR. May as well
loglib_err(EXIT_FAILURE). The watchdog script will restart hslsd.

        rc == -1, errno == EINVAL: bad arguments
        rc == -1, errno == EFAULT: bad address
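
The cases above could be sketched in C roughly as follows. This is only
an illustration of the classification, not the code in hsls_watchdog.c:
the names tickle_state, classify_wait() and poll_tickle() are made up
for the example, and the caller is assumed to do the actual
loglib_warn() or loglib_err(EXIT_FAILURE) based on the result.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical names; none of these appear in hsls_watchdog.c. */
enum tickle_state {
	TICKLE_GONE,	/* rc == pid, or rc == -1 with ECHILD */
	TICKLE_RUNNING,	/* rc == 0: child exists and still runs */
	TICKLE_RETRY,	/* rc == -1 with EINTR: warn and try again */
	TICKLE_FATAL	/* EINVAL, EFAULT, or anything unexpected */
};

/*
 * Map a waitpid(pid, ..., WNOHANG) return value and the errno that
 * accompanied it onto the four cases discussed above.
 */
enum tickle_state
classify_wait(pid_t rc, pid_t pid, int err)
{
	if (rc == 0)
		return TICKLE_RUNNING;
	if (rc == pid)
		return TICKLE_GONE;
	if (rc == -1 && err == ECHILD)
		return TICKLE_GONE;
	if (rc == -1 && err == EINTR)
		return TICKLE_RETRY;
	return TICKLE_FATAL;
}

/* Poll the old tickle process once, without blocking. */
enum tickle_state
poll_tickle(pid_t pid, int *status)
{
	pid_t rc;

	assert(pid > 0);	/* we only ever wait for one specific child */
	rc = waitpid(pid, status, WNOHANG);
	/* errno is only meaningful when rc == -1; classify_wait
	 * ignores it otherwise. */
	return classify_wait(rc, pid, errno);
}
```

One caveat on the "stopped" case: as far as I can tell, waitpid() only
reports stopped children when WUNTRACED is passed along with WNOHANG,
so warning about a stopped tickle process would need that flag plus a
WIFSTOPPED(status) check in the caller.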
Dave
--
David Young OJC Technologies
dyoung at ojctech.com Urbana, IL * (217) 278-3933