[Commotion-dev] [OTI-Tech] LTS Testing Update

Thu Apr 25 02:09:37 UTC 2013

Ben,

Thanks so much for your thoughtful reply re: IBSS-RSN. It looks like we are seeing the same behavior. When I am back at my primary workspace tomorrow I am going to go through the script that you sent. I've left a test running overnight with debugging enabled to see if I can get some "bad output" from wpa_supplicant. I will let you know what I find, if anything!

Thanks again!
Will

On Wednesday, April 24, 2013 19:43 EDT, Ben West <ben at gowasabi.net> wrote: 

> Hi Will,
> 
> Glad that an apparent bug with dhcp was caught.
> 
> Please see my response about IBSS-RSN issues inline below.
> 
> On Wed, Apr 24, 2013 at 5:49 PM, Will Hawkins <
> hawkinsw at opentechinstitute.org> wrote:
> 
> > A few notes to consider after some testing today with LTS and at the
> > office:
> >
> > - Collectd "stalled" one of the nodes at LTS (where collectd is still
> > enabled)
> >
> > - Stations seem to lose their mind w.r.t IBSS RSN and authorization. I'm
> > watching debugging output from wpa_supplicant on one of the nodes in the
> > office to see if I can determine the problem. Of course, it's working
> > great now :-)
> >
> >
> I've observed a definite issue with nodes not reliably 'authorizing'
> themselves when joining an IBSS-RSN adhoc network, at least since r33202 or
> so.  No idea on a cause, besides IBSS-RSN simply being buggy.
> 
> Besides doing something crude like putting 'sleep 30 ; wifi restart' in the
> file /etc/rc.local, I have also written a slightly less crude hotplug.d
> script that attempts to restart the wifi interface if the node finds
> itself 'authenticated' but not 'authorized' on the adhoc network.  I've
> attached that script to this email, and you can save it on a node as
> /etc/hotplug.d/firewall/20_mesh_auth_check.
> 
> Please note this is not a 100% effective solution, as the problem is not
> just with a newly powered-on node not consistently authorizing itself, but
> *also* with some of the existing nodes in the adhoc network not consistently
> authorizing the new node on their end too.  That is, I've seen instances
> where node A lists node B as both 'authenticated' and 'authorized', but
> where node B lists node A as 'authenticated' and *not* 'authorized.'  Oy.
> 
> So, when a new node joins an RSN-encrypted adhoc network, it looks like the
> following steps must happen to ensure all nodes are authorized:
> 
> A. The newly powered-up node inspects the output of 'iw wlan0 station dump',
> looking for entries where a remote node is 'authenticated' but *not*
> 'authorized.'  If it finds one, it restarts the wifi and retries the test.
> Ideally, the node would repeat this process X times before giving up.  The
> script I'm attaching tries to do this, albeit without the option to give up
> after X times.
> 
> B. All existing nodes periodically (via cronjob) check their own local output
> of 'iw wlan0 station dump', looking for new entries where a new remote node
> is 'authenticated' but *not* 'authorized.'  If such an entry is found,
> restart the *local* node's wifi and retry the test.
> 
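The check that steps A and B both rely on can be sketched as a small shell function. This is only an illustration, not the attached script. It assumes the 'iw wlan0 station dump' output has one "Station <MAC> ..." header per peer, followed by "authenticated:" and "authorized:" lines with yes/no values. The function name unauthorized_peers is made up for this example.

```shell
# Reads 'iw wlan0 station dump' output on stdin and prints the MAC address
# of every peer that is 'authenticated' but not 'authorized'.
# Assumption: each peer block starts with a "Station <MAC> ..." line.
unauthorized_peers() {
    awk '
        # Print the peer collected so far if it is stuck half-way.
        function report() {
            if (mac != "" && authn == "yes" && authz == "no") print mac
        }
        $1 == "Station"        { report(); mac = $2; authn = ""; authz = "" }
        $1 == "authenticated:" { authn = $2 }
        $1 == "authorized:"    { authz = $2 }
        END                    { report() }
    '
}
```

On a node this might be called as 'iw wlan0 station dump | unauthorized_peers', restarting the wifi only when the output is non-empty.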
> Having both of these steps occur simultaneously on all nodes clearly could
> lead to lots of ugly race conditions, so it's not ideal.  Likewise, it's
> even less ideal for a particular node with active clients to restart its
> wifi just because a new node powered on but didn't get successfully
> authorized.  Maybe an alternate way to achieve step B is just to have the
> newly powered-up node repeatedly restart its wifi *until* it can
> successfully ping all other nodes that appear on the adhoc network, though
> that would be tedious and make startup very slow.
> 
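The "repeat X times until giving up" behaviour that step A asks for could look roughly like the loop below. It is a sketch under assumptions: restart_until_clean and its arguments are hypothetical names, the check command is expected to exit 0 once no peer is stuck as 'authenticated' but not 'authorized', and the restart command would be something like the 'wifi restart' from the rc.local example.

```shell
# restart_until_clean MAX_TRIES CHECK_CMD RESTART_CMD [DELAY_SECONDS]
# Runs CHECK_CMD; while it fails, runs RESTART_CMD and tries again,
# giving up after MAX_TRIES restarts. Returns 0 if the check ever passes.
restart_until_clean() {
    max=$1
    check=$2
    restart=$3
    delay=${4:-0}
    tries=0
    while [ "$tries" -lt "$max" ]; do
        # Stop as soon as the check passes.
        if $check; then
            return 0
        fi
        $restart
        tries=$((tries + 1))
        # Give the interface time to rejoin before checking again.
        sleep "$delay"
    done
    # One last check after the final restart.
    $check
}
```

For example: restart_until_clean 5 peers_all_authorized 'wifi restart' 30, where peers_all_authorized is a hypothetical wrapper around the station dump check. Capping the retries keeps a node from restarting its wifi forever, but it does nothing about the races between nodes described above.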
> > - luci_splash got itself into a nice "wedge" on one of the LTS nodes. I
> > am going to do my best to get it unstuck. If we continue to see the
> > problem, we'll have to take a hard look at pushing the splash rewrite to
> > a higher priority.
> >
> > Will
> >
> > On 04/24/2013 04:26 PM, Ben West wrote:
> > > Hi Will,
> > >
> > > I just checked config on a node again (running Attitude Adjustment circa
> > > r35xxx), and I found these lines in /etc/crontabs/root which had been
> > > commented out:
> > >
> > > #* * * * *      /usr/sbin/ff_olsr_test_gw.sh
> > > #*/5 * * * *    /usr/sbin/ff_olsr_watchdog
> > >
> > > So, be on the lookout for ff packages that deploy these scripts,
> > > although unfortunately it's not clear /which/ package includes these
> > > particular files.  Maybe freifunk-common?
> > >
> > > On Wed, Apr 24, 2013 at 2:28 PM, Will Hawkins
> > > <hawkinsw at opentechinstitute.org> wrote:
> > >
> > >     Thanks for your response, Ben. We just recompiled an image w/o most of
> > >     the ff software. We are now testing that image to see if things are any
> > >     better. We will definitely note how ff-watchdog may be useful and how
> > >     ff-gw-check is the likely culprit ;-)
> > >
> > >     Will
> > >
> > >     On 04/24/2013 02:17 PM, Ben West wrote:
> > >     > The Freifunk watchdog package is actually a rather handy package, since
> > >     > it will monitor any process you want (via periodic cronjob) and restart
> > >     > that service if the active process disappears (aka crashes).  To my
> > >     > knowledge, it doesn't directly start/stop any network interfaces.  But,
> > >     > ff-watchdog does need to be configured to monitor the processes you care
> > >     > about, and to not conflict with any other watchdog-style task.  That
> > >     > conflict may be indirectly causing interfaces to go down or even olsrd
> > >     > to stop in absence of a needed interface.
> > >     >
> > >     > Its config file is /etc/config/freifunk-watchdog, and here is an example
> > >     > config I've used (for node using coovachilli):
> > >     >
> > >     > config process
> > >     >     option process 'dropbear'
> > >     >     option initscript '/etc/init.d/dropbear'
> > >     >
> > >     > config process
> > >     >     option process 'crond'
> > >     >     option initscript '/etc/init.d/cron'
> > >     >
> > >     > config process
> > >     >     option process 'olsrd'
> > >     >     option initscript '/etc/init.d/olsrd'
> > >     >
> > >     > config process
> > >     >     option process 'chilli'
> > >     >     option initscript '/etc/init.d/coovachilli'
> > >     >
> > >     > Are you sure you weren't having problems with the ff-gw-check package
> > >     > instead?  I.e. un-installed that package at the same time as
> > >     > un-installing ff-watchdog?  I think the gw-check package /will muck/
> > >     > with default routes and possibly also restart active network interfaces
> > >     > if it can't get a successful ping to freifunk.net <http://freifunk.net>
> > >     > or something.
> > >     >
> > >     > On Wed, Apr 24, 2013 at 8:07 AM, Dan Staples
> > >     > <danstaples at opentechinstitute.org> wrote:
> > >     >
> > >     >     Moving this discussion to commotion-dev...
> > >     >
> > >     >     When I was previously setting the wireless interfaces to use channel 9
> > >     >     instead of channel 5, the freifunk watchdog would routinely bring down
> > >     >     the wireless interfaces. And I have no idea why. The only way I got it
> > >     >     to work was uninstalling ff-watchdog. So see if that may be a reason why
> > >     >     wireless interfaces are unavailable...there should be a note about it in
> > >     >     logread.
> > >     >
> > >     >     I've also noticed that something is killing olsrd on DR1 nodes, without
> > >     >     any clue in the log. The routing table will still have stale routes in
> > >     >     it, indicating that olsrd isn't exiting cleanly. I wonder if it's being
> > >     >     killed by the out-of-memory watchdog. When I was troubleshooting this
> > >     >     before, I wrote a quick script that ran as a cronjob every minute, and
> > >     >     it would pgrep olsrd. If olsrd was running, it would redirect the output
> > >     >     of top into ~/top.out. If olsrd wasn't running, it would move the last
> > >     >     ~/top.out as well as logread into a separate directory. That way,
> > >     >     whenever olsrd was killed, there would be a record of top the minute
> > >     >     before it crashed, as well as the log. Would this be useful for
> > >     >     troubleshooting the LTS nodes?
> > >     >
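The once-a-minute olsrd snapshot described above might be sketched as follows. This is a reconstruction from the description, not Dan's actual script: the function name snapshot_olsrd, its two directory arguments, the timestamped subdirectory, and the 'top -b -n 1' batch-mode invocation are all assumptions.

```shell
# snapshot_olsrd STATE_DIR CRASH_DIR
# While olsrd runs, keep a rolling copy of top in STATE_DIR/top.out.
# Once olsrd is gone, move the last snapshot plus the system log into
# a timestamped subdirectory of CRASH_DIR.
snapshot_olsrd() {
    state_dir=$1
    crash_dir=$2
    if pgrep olsrd >/dev/null 2>&1; then
        # olsrd is alive: overwrite the rolling snapshot.
        top -b -n 1 > "$state_dir/top.out"
    elif [ -f "$state_dir/top.out" ]; then
        # olsrd has died since the last run: preserve the evidence.
        stamp=$(date +%Y%m%d-%H%M%S)
        mkdir -p "$crash_dir/$stamp"
        mv "$state_dir/top.out" "$crash_dir/$stamp/top.out"
        logread > "$crash_dir/$stamp/logread.out"
    fi
}
```

A wrapper script calling snapshot_olsrd with two directories could then be run from a '* * * * *' crontab entry. Because the snapshot is moved rather than copied, the record is saved only once per crash.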
> > >     >
> > >     >
> > >     > --
> > >     > Ben West
> > >     > http://gowasabi.net
> > >     > ben at gowasabi.net
> > >     > 314-246-9434
> > >     >
> > >     >
> > >     > _______________________________________________
> > >     > Commotion-dev mailing list
> > >     > Commotion-dev at lists.chambana.net
> > >     > https://lists.chambana.net/mailman/listinfo/commotion-dev
> > >     >
> > >
> > >
> > >
> > >
> > > --
> > > Ben West
> > > http://gowasabi.net
> > > > ben at gowasabi.net
> > > 314-246-9434
> >
> 
> 
> 
> -- 
> Ben West
> http://gowasabi.net
> ben at gowasabi.net
> 314-246-9434