[CUWiN-Dev] Re: ath0: hardware error

David Young dyoung at pobox.com
Tue Aug 30 18:17:29 CDT 2005


On Mon, Aug 29, 2005 at 07:18:28PM -0500, David Young wrote:
> On Mon, Aug 29, 2005 at 06:27:12PM -0500, Bill Comisky wrote:
> > On Mon, 29 Aug 2005, Bill Comisky wrote:
> > 
> > >On Thu, 25 Aug 2005, David Young wrote:
> > >
> > >> On Thu, Aug 25, 2005 at 03:29:16PM -0500, Bill Comisky wrote:
> > >>>  On Sun, 21 Aug 2005, David Young wrote:
> > >>> 
> > >>> >  I added some debug messages as I tried to track down the source of
> > >>> >  "ath0: hardware error; resetting".  The debug messages have not shown
> > >>> >  me any obvious problem, but they do seriously slow down some nodes,
> > >>> >  so I am taking them out.
> > >>> > 
> > >>> >  Apply these patches to your NetBSD sources.  They take out the noisy
> > >>> >  debug messages:
> > >>> > 
> > >>> >  Apply ath-undo-1 in sys/dev/ic/, ath-undo-2 in sys/dev/:
> > >>> > 
> > >>> > %  cd your-netbsd-sources/src/sys/dev/ic
> > >>> > %  patch < ath-undo-1
> > >>> > %  cd -
> > >>> > %  cd your-netbsd-sources/src/sys/dev
> > >>> > %  patch < ath-undo-2
> > >>> > 
> > >>> >  Dave
> > >>> 
> > >>>  Should I be seeing any "ath0: hardware error..." messages after these
> > >>>  patches have been reverse applied?  I'm still seeing them, sometimes
> > >>>  a lot of them.  I can see the CPU "% interrupt" in top shoot up to
> > >>>  about 60% when they're spewing (with hw.ath0.debug=0x80000000
> > >>>  commented out in sysctl.conf).  I feel like my build broke somehow,
> > >>>  though I deleted the build directory and unpacked the source fresh
> > >>>  before "patch -R".
> > >>
> > >> You may still see "ath0: hardware error...", but they should not come
> > >> with the Rn and Tn lines that you used to see.
> > >>
> > >> I don't know what causes ath0: hardware error.  The author of the driver
> > >> tells me that I may need a PCI bus analyzer to figure it out. :-(
> > >>
> > >> I am a PCI novice, but I strongly suspect that the error has something
> > >> to do with PCI bus contention:
> > >>
> > >> 1) I get a lot more "ath0: hardware error" indications on my Soekris
> > >> net4521 when it carries one or two Cardbus WiFi cards in addition to
> > >> the MiniPCI Atheros card.  I scarcely get any such indications when the
> > >> Cardbus cards are not in there.
> > >>
> > >> 2) The madwifi driver (Linux version of ath, also by
> > >> Sam Leffler) sets an unusually large PCI Latency Timer,
> > >> <http://cvs.sourceforge.net/viewcvs.py/madwifi/madwifi/ath/if_ath_pci.c?only_with_tag=HEAD&view=markup>.
> > >>
> > >> Madwifi may also set some other parameters differently from NetBSD,
> > >> which accepts defaults---I will check a little later.
> > >>
> > >> 3) A discussant at
> > >> <http://www.broadbandreports.com/forum/remark,9086546~mode=flat> mentions
> > >> that changing his PCI Latency Timer from 32 to 64 helped prevent some
> > >> system lockups when his Atheros card was activated.
> > >>
> > >> 4) The Soekris BIOS does not set PCI minimum grant, maximum latency,
> > >> or latency timer that make any sense according to the explanation given
> > >> at <http://www.reric.net/linux/pci_latency.html>.
> > >>
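For anyone checking these registers by hand, here is a small Python sketch that splits the two relevant PCI configuration dwords into their fields.  The byte layout follows the standard PCI configuration header.  The function names and the sample values are hypothetical, and the sketch assumes you already have the raw 32-bit values at offsets 0x0c and 0x3c from a config-space dump.

```python
# Decode the two PCI configuration dwords that affect bus arbitration.
# Field layout is the standard PCI configuration header.  The sample
# values at the bottom are made up for illustration; they are not taken
# from a real Soekris or Atheros dump.

def decode_bhlc(dword):
    """Split the dword at config offset 0x0c into its byte-wide fields."""
    return {
        "cache_line_size": dword & 0xff,
        "latency_timer": (dword >> 8) & 0xff,   # in PCI clocks
        "header_type": (dword >> 16) & 0xff,
        "bist": (dword >> 24) & 0xff,
    }

def decode_interrupt_dword(dword):
    """Split the dword at config offset 0x3c into its byte-wide fields."""
    return {
        "interrupt_line": dword & 0xff,
        "interrupt_pin": (dword >> 8) & 0xff,
        "min_gnt": (dword >> 16) & 0xff,        # in units of 250 ns
        "max_lat": (dword >> 24) & 0xff,        # in units of 250 ns
    }

if __name__ == "__main__":
    # Hypothetical register values, for illustration only.
    print(decode_bhlc(0x00004008))
    print(decode_interrupt_dword(0x1c0a010a))
```

With the latency timer, Min_Gnt, and Max_Lat side by side, it is easier to judge whether the values a BIOS left behind are consistent with one another.
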
> > >> Bill, will you do me a favor, and send me both dmesg(8) and pcictl(8)
> > >> output for your Atheros card?  Here is how I got the pcictl info I wanted:
> > >
> > >Dave,
> > >
> > >After many upgrade/reboot cycles, I've found out a few things about the 
> > >ath0 errors I'm seeing.  I found that only our local builds with our local 
> > >modifications were going into the endless "ath0: hardware error" loop. 
> > >After some more experimenting, I found that changing the SSID can 
> > >trigger these errors.  We had changed our local cuw_config_ssid setting to 
> > >'cntwireless.net', and those builds were getting ath0 errors as soon as 
> > >the devices were brought up.  I found I could set off the errors and stop 
> > >them by calling 'ifconfig ath0 ssid' with a new SSID from the command 
> > >line. 'cuwireless.net' or anything of the same length always seems ok, 
> > >anything longer triggered the errors.  Sometimes shorter SSIDs would 
> > >trigger the error as well, and the same SSID would sometimes cause the 
> > >error and sometimes not; both seemed to depend on what you were changing 
> > >from.  Perhaps some important chunk of memory is being written over 
> > >somewhere?  The repeatability of 
> > >errors at boot time wasn't 100%; occasionally (maybe only on 1st reboot 
> > >after an upgrade?) I would get results that confounded my expectation 
> > >(errors when not expecting them, or no errors when expecting them).
> > >
> > >My first set of experiments I did on source with the ath_undo patches 
> > >reverse applied.  In this case, I either got no console messages when 
> > >switching SSIDs (to cuwireless.net for example), or I got repeated groups 
> > >of the following when changing to a "bad" ssid, such as cntwireless.org:
> > >
> > >ath0: hardware error; resetting
> > >ath_stoprecv: rx queue 0x10f14fc, link 0xc5dff4d0
> > >[followed by a bunch of R0 lines]
> > >
> > >To make sure it wasn't something with our build process or source, I 
> > >downloaded and installed the CUWiN 0.5.8 release; which I don't think has 
> > >the ath_undo patches.  In this case, I got ath0: console messages (with 
> > >more stuff than above, see attachments) whenever I set the SSID, but the 
> > >"bad" cases set off an endless loop of these messages.  When I repeated the 
> > >experiments on a net4526 (had been working on a net4511), I couldn't 
> > >reproduce the endless loop.  I've attached files with console dumps of 
> > >these last experiments showing dmesg, pcictl dumps, and the output from 
> > >"ifconfig ath0 ssid".  From the dmesg output I noticed that the two 
> > >radios I used had different radio firmware versions (3.6 vs. 4.6), so I 
> > >repeated the net4526 experiment after switching the radio.  I still didn't 
> > >get the endless stream of ath0 errors. For all of these tests, there were 
> > >no other nodes up.
> > >
> > >Can you reproduce the repeated ath0: errors on a net4511 with an "ifconfig 
> > >ath0 ssid somebigssidhere"?  Does this give you any clues as to where the 
> > >problem may be?
> > 
> > An addendum:  With my local build (with ath_undo patches and some 
> > customizations) I do see the same stream of ath0: hardware errors on the 
> > net4526.  I can stop them by setting the SSID to cuwireless.net and start 
> > them by setting it to something longer.
> 
> Great work!
> 
> Perhaps "ath0: hardware error; resetting" is caused by Tx FIFO underruns
> during beacon transmissions.  That would fit some of the evidence.
> We would be worse off with a longer SSID (makes a longer beacon) than
> with shorter SSIDs.  Underruns could be connected with bad PCI setup,
> which would tend to delay or fragment PCI bus transfers.
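
To put rough numbers on the beacon-size argument, here is a minimal Python sketch of the SSID information element alone.  It assumes the standard 802.11 layout (one byte of element ID, one byte of length, then up to 32 bytes of SSID) and ignores every other beacon field, since those do not change with the SSID.  The function name is made up for illustration.

```python
# Build the 802.11 SSID information element and report its size.
# Only the SSID element is modelled; the rest of the beacon is ignored.

SSID_ELEMENT_ID = 0   # element ID that 802.11 assigns to the SSID
MAX_SSID_LEN = 32     # longest SSID that 802.11 allows, in bytes

def ssid_element(ssid):
    """Return the encoded SSID element (ID, length, SSID bytes)."""
    data = ssid.encode("ascii")
    if len(data) > MAX_SSID_LEN:
        raise ValueError("SSID longer than 32 bytes")
    return bytes([SSID_ELEMENT_ID, len(data)]) + data

if __name__ == "__main__":
    for name in ("cuwireless.net", "cntwireless.net"):
        print(name, len(ssid_element(name)))
```

For the two SSIDs in this thread the element comes to 16 and 17 bytes, so the two beacons differ in length by a single byte.
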

I was able to semi-reliably reproduce the lossage you report with
'ifconfig ath0 ssid cntwireless.net'.  I have attached a patch I applied
that may have fixed the problem.

I see a lot of these messages scrolling up the console when I set ssid
cntwireless.net.  No doubt it's because the SSIDs don't agree.  I remember
you mentioned that before.  I will send a second patch that quiets things.

ieee80211_ibss_merge: merge failed, capabilities mismatch
ieee80211_ibss_merge: merge failed, capabilities mismatch
ieee80211_ibss_merge: merge failed, capabilities mismatch
ieee80211_ibss_merge: merge failed, capabilities mismatch
ieee80211_ibss_merge: merge failed, capabilities mismatch
ieee80211_ibss_merge: merge failed, capabilities mismatch
ieee80211_ibss_merge: merge failed, capabilities mismatch

Dave

-- 
David Young             OJC Technologies
dyoung at ojctech.com      Urbana, IL * (217) 278-3933