[Commotion-dev] [OTI-Tech] LTS Testing Update
Will Hawkins
hawkinsw at opentechinstitute.org
Thu Apr 25 02:09:37 UTC 2013
Ben,
Thanks so much for your thoughtful reply re: IBSS-RSN. It looks like we are seeing the same behavior. When I am back at my primary workspace tomorrow I am going to go through the script that you sent. I've left a test running overnight with debugging enabled to see if I can get some "bad output" from wpa_supplicant. I will let you know what I find, if anything!
Thanks again!
Will
On Wednesday, April 24, 2013 19:43 EDT, Ben West <ben at gowasabi.net> wrote:
> Hi Will,
>
> Glad that an apparent bug with dhcp was caught.
>
> Please see my response about IBSS-RSN issues inline below.
>
> On Wed, Apr 24, 2013 at 5:49 PM, Will Hawkins <
> hawkinsw at opentechinstitute.org> wrote:
>
> > A few notes to consider after some testing today with LTS and at the
> > office:
> >
> > - Collectd "stalled" one of the nodes at LTS (where collectd is still
> > enabled)
> >
> > - Stations seem to lose their mind w.r.t IBSS RSN and authorization. I'm
> > watching debugging output from wpa_supplicant on one of the nodes in the
> > office to see if I can determine the problem. Of course, it's working
> > great now :-)
> >
> >
> I've observed a definite issue with nodes not reliably 'authorizing'
> themselves when joining an IBSS-RSN adhoc network, at least since r33202 or
> so. No idea on a cause, besides IBSS-RSN simply being buggy.
>
> Besides doing something crude like putting 'sleep 30 ; wifi restart' in the
> file /etc/rc.local, I have also written a slightly less crude hotplug.d
> script that attempts to restart the wifi interface, if the node finds
> itself 'authorized' but not 'authenticated' on the adhoc network. I've
> attached that script to this email, and you can save it on a node as *
> /etc/hotplug.d/firewall/20_mesh_auth_check*.
>
> Please note this is not a 100% effective solution, as the problem is not
> just with a newly powered on node not consistently authorizing itself, but *
> also* with some of the existing nodes in the adhoc network not consistently
> authorizing the new node on their end too. That is, I've seen instances
> where node A lists node B as both 'authenticated' and 'authorized', but
> where node B lists node A as 'authenticated' and *not* 'authorized.' Oy.
>
> So, when a new node joins an RSN-encrypted adhoc network, it looks like the
> following steps must happen to ensure all nodes are authorized:
>
> A. Newly powered-up node inspects output of 'iw wlan0 station dump' looking
> for entries where a remote node is 'authenticated' but *not* 'authorized,'
> and if so, restart the wifi and retry test. Ideally, the node would repeat
> this process X times until giving up. The script I'm attaching tries to do
> this, albeit without the option to give up after X times.
>
> B. All existing nodes periodically via cronjob check their own local output
> of 'iw wlan0 station dump', looking for new entries where a new remote node
> is 'authenticated' but *not* 'authorized.' If such an entry is found,
> restart the *local* node's wifi and retry test.
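The check described in steps A and B could be sketched roughly as follows. This is not the script attached to the original email. The function name unauthorized_stations is hypothetical, and the sketch assumes the station dump prints one "authorized:" and one "authenticated:" line per "Station" entry.

```shell
#!/bin/sh
# Hypothetical helper: reads 'iw wlan0 station dump' output on stdin and
# prints the MAC address of every station that is authenticated but not
# authorized. Assumes one "authorized:" and one "authenticated:" line per
# "Station" entry, in either order.
unauthorized_stations() {
    awk '
        # Print the station collected so far if it is in the bad state.
        function report() {
            if (mac != "" && authn == "yes" && authz != "yes") print mac
        }
        $1 == "Station"        { report(); mac = $2; authn = ""; authz = "" }
        $1 == "authorized:"    { authz = $2 }
        $1 == "authenticated:" { authn = $2 }
        END                    { report() }
    '
}
```

On a node this would be fed with "iw wlan0 station dump | unauthorized_stations". Any output means at least one peer is authenticated but not authorized, which is the condition that would trigger a wifi restart.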
>
> Having both of these steps occur simultaneously on all nodes clearly could
> lead to lots of ugly race conditions, so it's not ideal. Likewise, it's
> even less ideal for a particular node with active clients to restart its
> wifi just because a new node powered on, but didn't get successfully
> authorized. Maybe an alternate way to achieve step B is just to have the
> newly powered-up node repeatedly restart its wifi *until* it can
> successfully ping all other nodes that appear on the adhoc network, though
> that would be tedious and make startup very slow.
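The "restart until every peer answers a ping" alternative might look something like the sketch below, with an upper bound on retries so the node eventually gives up. All names here (PING_CMD, RESTART_CMD, SETTLE_SECS, all_peers_reachable, restart_until_reachable) are hypothetical, and the peer IP addresses are assumed to be known already and passed in as arguments; how a node would discover them is not covered.

```shell
#!/bin/sh
# Hypothetical sketch. The commands are variables so they can be swapped
# out for a dry run.
PING_CMD=${PING_CMD:-"ping -c 1 -W 2"}
RESTART_CMD=${RESTART_CMD:-"wifi"}
SETTLE_SECS=${SETTLE_SECS:-30}

# Succeeds only if every peer address given as an argument answers one ping.
all_peers_reachable() {
    for peer in "$@"; do
        $PING_CMD "$peer" >/dev/null 2>&1 || return 1
    done
    return 0
}

# restart_until_reachable MAX_TRIES PEER...
# Restarts wifi until all peers answer, giving up after MAX_TRIES restarts.
restart_until_reachable() {
    max_tries=$1
    shift
    tries=0
    while ! all_peers_reachable "$@"; do
        [ "$tries" -ge "$max_tries" ] && return 1
        tries=$((tries + 1))
        $RESTART_CMD
        sleep "$SETTLE_SECS"
    done
    return 0
}
```

A call such as "restart_until_reachable 5 10.0.0.1 10.0.0.2" (addresses made up) would restart wifi at most five times. With the default 30-second settle time that is already several minutes of startup delay, which illustrates the slowness concern above.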
>
> > - luci_splash got itself into a nice "wedge" on one of the LTS nodes. I
> > am going to do my best to get it unstuck. If we continue to see the
> > problem, we'll have to take a hard look at pushing the splash rewrite to
> > a higher priority.
> >
> > Will
> >
> > On 04/24/2013 04:26 PM, Ben West wrote:
> > > Hi Will,
> > >
> > > I just checked config on a node again (running Attitude Adjustment circa
> > > r35xxx), and I found these lines in /etc/crontabs/root which had been
> > > commented out:
> > >
> > > #* * * * * /usr/sbin/ff_olsr_test_gw.sh
> > > #*/5 * * * * /usr/sbin/ff_olsr_watchdog
> > >
> > > So, be on the lookout for ff packages that deploy these scripts,
> > > although unfortunately it's not clear /which/ package includes these
> > > particular files. Maybe freifunk-common?
> > >
> > > On Wed, Apr 24, 2013 at 2:28 PM, Will Hawkins
> > > <hawkinsw at opentechinstitute.org> wrote:
> > >
> > > Thanks for your response, Ben. We just recompiled an image w/o most of
> > > the ff software. We are now testing that image to see if things are any
> > > better. We will definitely note how ff-watchdog may be useful and how
> > > ff-gw-check is the likely culprit ;-)
> > >
> > > Will
> > >
> > > On 04/24/2013 02:17 PM, Ben West wrote:
> > > > The Freifunk watchdog package is actually a rather handy package, since
> > > > it will monitor any process you want (via periodic cronjob) and restart
> > > > that service if the active process disappears (aka crashes). To my
> > > > knowledge, it doesn't directly start/stop any network interfaces. But,
> > > > ff-watchdog does need to be configured to monitor the processes you care
> > > > about, and to not conflict with any other watchdog-style task. That
> > > > conflict may be indirectly causing interfaces to go down or even olsrd
> > > > to stop in absence of a needed interface.
> > > >
> > > > Its config file is /etc/config/freifunk-watchdog, and here is an example
> > > > config I've used (for node using coovachilli):
> > > >
> > > > config process
> > > >     option process 'dropbear'
> > > >     option initscript '/etc/init.d/dropbear'
> > > >
> > > > config process
> > > >     option process 'crond'
> > > >     option initscript '/etc/init.d/cron'
> > > >
> > > > config process
> > > >     option process 'olsrd'
> > > >     option initscript '/etc/init.d/olsrd'
> > > >
> > > > config process
> > > >     option process 'chilli'
> > > >     option initscript '/etc/init.d/coovachilli'
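Going by the description above, what the watchdog does for each "config process" section amounts to something like the following. This is only a sketch of the idea, not the package's actual code, and ensure_running is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the freifunk-watchdog package.
# ensure_running PROCESS INITSCRIPT
# Restarts the service through its init script when no process with
# exactly that name is found.
ensure_running() {
    proc=$1
    initscript=$2
    if ! pgrep -x "$proc" >/dev/null 2>&1; then
        "$initscript" restart
    fi
}
```

Run from cron, a line such as "ensure_running olsrd /etc/init.d/olsrd" would mirror the third section of the example config. Two watchdog-style jobs doing this for the same service at the same time is the kind of conflict mentioned above.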
> > > >
> > > > Are you sure you weren't having problems with the ff-gw-check package
> > > > instead? I.e. un-installed that package at the same time as
> > > > un-installing ff-watchdog? I think the gw-check package /will muck/
> > > > with default routes and possibly also restart active network interfaces
> > > > if it can't get a successful ping to freifunk.net or something.
> > > >
> > > > On Wed, Apr 24, 2013 at 8:07 AM, Dan Staples
> > > > <danstaples at opentechinstitute.org> wrote:
> > > >
> > > > Moving this discussion to commotion-dev...
> > > >
> > > > > When I was previously setting the wireless interfaces to use channel 9
> > > > > instead of channel 5, the freifunk watchdog would routinely bring down
> > > > > the wireless interfaces. And I have no idea why. The only way I got it
> > > > > to work was uninstalling ff-watchdog. So see if that may be a reason why
> > > > > wireless interfaces are unavailable...there should be a note about it in
> > > > > logread.
> > > >
> > > > > I've also noticed that something is killing olsrd on DR1 nodes, without
> > > > > any clue in the log. The routing table will still have stale routes in
> > > > > it, indicating that olsrd isn't exiting cleanly. I wonder if it's being
> > > > > killed by the out-of-memory watchdog. When I was troubleshooting this
> > > > > before, I wrote a quick script that ran as a cronjob every minute, and
> > > > > it would pgrep olsrd. If olsrd was running, it would redirect the output
> > > > > of top into ~/top.out. If olsrd wasn't running, it would move the last
> > > > > ~/top.out as well as logread into a separate directory. That way,
> > > > > whenever olsrd was killed, there would be a record of top the minute
> > > > > before it crashed, as well as the log. Would this be useful for
> > > > > troubleshooting the LTS nodes?
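A script along the lines described above might look roughly like this. It is a reconstruction from the description only, not the original script. The names SNAP_DIR, CRASH_DIR and snapshot_or_archive are hypothetical, and the timestamped subdirectory is an assumption added so that repeated crashes do not overwrite each other.

```shell
#!/bin/sh
# Hypothetical reconstruction; intended to be called once a minute from cron.
SNAP_DIR=${SNAP_DIR:-"$HOME"}
CRASH_DIR=${CRASH_DIR:-"$HOME/olsrd-crashes"}

snapshot_or_archive() {
    if pgrep olsrd >/dev/null 2>&1; then
        # olsrd is alive: keep a fresh one-shot snapshot of top.
        top -b -n 1 > "$SNAP_DIR/top.out"
    elif [ -f "$SNAP_DIR/top.out" ]; then
        # olsrd is gone: preserve the last snapshot together with the log.
        dest="$CRASH_DIR/$(date +%Y%m%d-%H%M%S)"
        mkdir -p "$dest"
        mv "$SNAP_DIR/top.out" "$dest/top.out"
        logread > "$dest/logread.out"
    fi
}
```

Because top.out is moved away on the first run after a crash, later runs do nothing until olsrd is started again, so each crash leaves exactly one directory holding the top output from the minute before and the log at the time of detection.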
> > > >
> > > >
> > > >
> > > > --
> > > > Ben West
> > > > http://gowasabi.net
> > > > ben at gowasabi.net
> > > > 314-246-9434
> > > >
> > > >
> > > > _______________________________________________
> > > > Commotion-dev mailing list
> > > > Commotion-dev at lists.chambana.net
> > > > https://lists.chambana.net/mailman/listinfo/commotion-dev
> > > >
> > >
> > >
> > >
> > >
> > > --
> > > Ben West
> > > http://gowasabi.net
> > > ben at gowasabi.net
> > > 314-246-9434
> >
>
>
>
> --
> Ben West
> http://gowasabi.net
> ben at gowasabi.net
> 314-246-9434