[Commotion-dev] Stress test results

Tue Nov 12 07:56:16 UTC 2013

Ben

As always, thank you for your detailed response! I will put some comments below.

On Monday, November 11, 2013 22:05 EST, Ben West <ben at gowasabi.net> wrote: 

> I recently saw a WasabiNet node whose adhoc interface died due to memory
> exhaustion, and I should point out that this node was *not* running serval
> or commotiond.  (I.e., the firmware I'm writing about isn't
> Commotion-OpenWRT, but it is very similar.)  I've sent the relevant dmesg
> output to the OpenWRT-devel list, which you can read here:
> 
> https://lists.openwrt.org/pipermail/openwrt-devel/2013-November/022398.html
> https://lists.openwrt.org/pipermail/openwrt-devel/2013-November/022399.html
> 
> Of particular interest is that, since the node didn't spontaneously reboot
> or become inaccessible, I SSH'ed in and found a 240Kbyte dump file that
> wpa_supplicant had written to /tmp , with about the same timestamp as when
> memory errors began appearing in syslog.  Possibly this points to
> wpa_supplicant itself as the source of intermittent memory leaks?  I'm using
> the wpad package, and I haven't yet had a chance to try out the version of
> wpad-mini modified to include IBSS-RSN support.  I would be curious if
> switching to wpad-mini has any effect on the memory errors you're seeing,
> Will.

This is very interesting. And relevant. We've actually got some code waiting to be merged that will move us to wpad-mini. I will try to build an image with that code and see if I can make the node panic in the same way. Thanks for the suggestion!

Are you able to send along that dump file? Your second message to the lists says that you retained it, but I want to see just how good your archive system is :-)

> 
> Besides that, do please note that I run the coovachilli captive portal,
> instead of NDS, and coova is definitely a memory hog of questionable
> stability.  That is, coova may end up being my problem, making this
> unrelated to Commotion.
> 
> Finally, check out these recommended kernel tweaks from OpenWRT-devel for
> having the node just spontaneously reboot upon OOM error or kernel oops.
> Naturally, these wouldn't fix the memory problem itself, but good to know
> for reference.
> 
> " for routers in production i prefer setting
> /proc/sys/vm/panic_on_oom = 2
> /proc/sys/kernel/panic = 10
> 
> also if you like
> /proc/sys/kernel/panic_on_oops = 1 "
> 
> 

Thanks for sending this along. We actually disabled several of these things (like rebooting automatically on panic) so that we could attempt to debug the problem when it happened. It turned out that the most efficient way to get a dump like this was to have the node hooked to a console.

The settings that you are using seem like the way to go for production nodes. I think that's what you're suggesting, right? With those settings, when the node starts to have trouble and eventually falls over, it will automatically restart without any user intervention. 
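
For reference, here is a minimal sketch (in Go, since that is what the stress test was written in) of how the three quoted settings might be turned into sysctl.conf-style lines. The helper names (sysctlKey, renderSysctlConf) are hypothetical, and it assumes the firmware applies a sysctl configuration file at boot, which I have not verified on our images. As far as I understand the values, vm.panic_on_oom=2 makes the kernel always panic on OOM, kernel.panic=10 reboots ten seconds after a panic, and kernel.panic_on_oops=1 turns an oops into a panic.

```go
package main

import (
	"fmt"
	"strings"
)

// setting pairs a /proc/sys path with the integer value to write to it.
type setting struct {
	path  string
	value int
}

// productionSettings holds the values quoted from the OpenWRT-devel thread.
var productionSettings = []setting{
	{"/proc/sys/vm/panic_on_oom", 2},
	{"/proc/sys/kernel/panic", 10},
	{"/proc/sys/kernel/panic_on_oops", 1},
}

// sysctlKey converts a /proc/sys path into the dotted key form used by
// sysctl, e.g. /proc/sys/vm/panic_on_oom becomes vm.panic_on_oom.
func sysctlKey(procPath string) string {
	trimmed := strings.TrimPrefix(procPath, "/proc/sys/")
	return strings.ReplaceAll(trimmed, "/", ".")
}

// renderSysctlConf returns one "key=value" line per setting, in order.
func renderSysctlConf(settings []setting) string {
	var b strings.Builder
	for _, s := range settings {
		fmt.Fprintf(&b, "%s=%d\n", sysctlKey(s.path), s.value)
	}
	return b.String()
}

func main() {
	// Print the lines; where they get stored on the node is left open.
	fmt.Print(renderSysctlConf(productionSettings))
}
```

On a debugging node we would leave these lines out, so that the node stays in its failed state long enough for the serial console to capture the crash.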

Again, thanks for the detailed response. I look forward to figuring this out together!

Will

> 
> On Mon, Nov 11, 2013 at 6:53 PM, Will Hawkins <
> hawkinsw at opentechinstitute.org> wrote:
> 
> > Using go (yes, that's right!) I was able to create a test program that
> > opened enough simultaneous HTTP connections to force a crash.
> >
> > Thanks to the fact that we were running a serial console that was
> > logging Pico station console output, we were able to capture the crash
> > information. I am attaching that here.
> >
> > Overall, it looks like the node simply runs out of memory. The first
> > errors are when malloc()s in servald fail. Then, when things get really
> > bad, there are errors from the wireless driver saying that it cannot
> > allocate buffer space.
> >
> > Obviously the failures from the wireless driver are bad. They are
> > probably ultimately what causes the node to reboot.
> >
> > I wonder, though, about the servald malloc() failures. I'm not sure if
> > they are purely a symptom (i.e., servald just happens to be the
> > application most commonly allocating memory when the crash happens and so
> > its malloc()s fail first), or if they are part of the problem (i.e.,
> > servald causes memory usage to skyrocket under heavy load and *then* these
> > larger memory problems start to occur).
> >
> > In any event, we got some logs, which is a good first step!
> >
> > Will
> >
> >
> 
> 
> -- 
> Ben West
> http://gowasabi.net
> ben at gowasabi.net
> 314-246-9434