Ticket #2333 (in_testing defect)

Opened 8 years ago

Last modified 8 years ago

Intermittent kernel oops when using wifi

Reported by: bt4 Owned by: openmoko-devel
Priority: normal Milestone:
Component: kernel Version: unspecified
Severity: normal Keywords: wlan oops
Cc: werner@…, nicola.mfb@… Blocked By:
Blocking: Estimated Completion (week):
HasPatchForReview: no PatchReviewResult:


I'm running SHR-Unstable which is up to date as of today (1st March 2010).

The Kernel version is :-
Linux om-gta02 2.6.29-rc3 #1 PREEMPT Fri Feb 5 18:47:47 CET 2010 armv4tl GNU/Linux

SHR version number in OPKG is :-

I have stated that this is an intermittent problem as I am still working on a reliable method to reproduce it. I use the NWA (GUI) wifi manager and this error regularly occurs forcing me to remove the battery from my Freerunner in order to get it working again.

So far I have been able to reproduce the issue at will using NWA like this :-

  1. I create two profiles in NWA :-
     a) A WPA-Enterprise network with PEAP and RADIUS server cert defined and no MAC address defined
     b) An Open network with a MAC address defined
  2. I then enable these profiles (in NWA) and also enable the profile called "any_open".
  3. Then I site the Freerunner in a location where none of the profiles specified in (1) above are within range, but there is a third open network with a weak signal (Quality=4/94 Signal level=-91 dBm Noise level=-95 dBm).
  4. I let NWA associate with the weak access point (I don't know what this AP is; it's just a signal I picked up from my house). I see this in dmesg :-

[ 887.860000] AR6000 connected event on freq 2437 with bssid 00:22:75:dd:76:51 listenInterval=100, beaconInterval = 100, beaconIeLen = 0 assocReqLen=41 assocRespLen =59
[ 887.860000] Network: Infrastructure

  5. The signal is weak and I do not receive a DHCP lease (perhaps the UDP does not go well over weak wifi) but NWA says "Connected".
  6. I now disable the "any_open" profile in NWA and see this in dmesg :-

[ 931.290000] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 941.610000] eth0: no IPv6 routers present

  7. Now I quit NWA and although it appears to exit OK, this is when I see the Kernel Oops.

Others have reported this issue to the mailing list so I am hoping we can come up with a more concrete method to reproduce this. I will keep working on it and if I come up with anything better I will update this bug report.


dmesg_out.txt (32.0 KB) - added by bt4 8 years ago.
dmesg4mar2010.txt (34.6 KB) - added by bt4 8 years ago.
dmesg5mar2010.txt (33.8 KB) - added by bt4 8 years ago.
ar6000.o (2.2 MB) - added by bt4 8 years ago.
ar6000_nwa.patch (592 bytes) - added by PaulFertser 8 years ago.
0001-ar6000-minimise-possibility-of-race-in-ar6000_ioctl_.patch (1.2 KB) - added by PaulFertser 8 years ago.
0002-ar6000-fix-compilation-with-DEBUG.patch (861 bytes) - added by PaulFertser 8 years ago.

Change History

Changed 8 years ago by bt4

comment:2 Changed 8 years ago by PaulFertser

  • Cc werner@…, nicola.mfb@… added
  • Status changed from new to in_testing

Please try the attached patch.

comment:3 Changed 8 years ago by PaulFertser

  • Component changed from unknown to System Software

Also please do not forget to set Component to System Software when you report bugs. I guess OM trac is not and will not be used for anything else, so this is the only section we should pay attention to.

comment:4 Changed 8 years ago by bt4

Thanks for the patch, however I'm still seeing the problem. Niko supplied me with a replacement kernel module, which I tested using the method above and it failed in exactly the same way. The dmesg output looks similar too.

Changed 8 years ago by bt4

comment:5 Changed 8 years ago by PaulFertser

Are you really sure you used a kernel module with the patch and that dmesg output is really identical?

Please, next time you do any testing,

  1. #define DEBUG at the top of ar6000_drv.c (it might change the timing though, so if you can't reproduce with it, disable it again; i still hope it's possible to reproduce it with this option, please try hard).
  2. post dmesg
  3. post kernel module binary you used (preferably the .o, not .ko file) so i can locate (instead of guessing) the op that caused NULL pointer dereference.

Also you might speed up the process a lot if you enable me to reproduce the issue. Trace all calls to wpa_supplicant and try to reproduce with wpa_cli (though it might be hard due to timing issues). Alternatively, i can install nwa myself but i'm unsure i can reproduce your network conditions, so try to minimise the requirements (it's better to have a way to reproduce it once in 10 times than having to find "an ap with weak signal etc etc").

comment:6 Changed 8 years ago by bt4

Thanks again for the info.

That previous test was using a module which Niko sent me as I am not sure how to build the SHR kernel.

However, I have just built a kernel myself using the instructions at: http://wiki.openmoko.org/wiki/Kernel
I inserted #define DEBUG and I could still reproduce the problem so I will attach the files you mentioned.

Kernel details :-

root@om-gta02 ~ $ uname -a
Linux om-gta02 2.6.29-GTA02_andy-tracking-mokodev #1 PREEMPT Fri Mar 5 21:28:42 GMT 2010 armv4tl GNU/Linux


I haven't been able to reproduce this using a scripted method, only by actually using NWA itself. It might be worth you installing it to try, as I am sure you can reproduce the problem with different settings and conditions also.

Changed 8 years ago by bt4

Changed 8 years ago by bt4

comment:7 Changed 8 years ago by PaulFertser

Can't find NWA source anywhere. It's neither mentioned on the corresponding wiki page nor present in OE recipes. :-|

I'll try to take a deeper look at the code in question and will provide you with additional debugging patches. But the round-trip times of this remote debugging thing are high :/

comment:8 Changed 8 years ago by PaulFertser

And it looks very much like you didn't define DEBUG the way i expected. Are you sure you inserted #define DEBUG _before_ including ar6000_drv.h?
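
Why the position of the define matters can be shown with a small sketch. It assumes, as the comment above implies but the ticket does not show, that the header chooses its debug macros according to whether DEBUG is already defined when it is included. The section between the marker comments is a hypothetical stand-in for such a header, and DBG_ENABLED, DBG_ASSERT, debug_enabled and count_chars are illustrative names, not taken from the AR6000 driver.

```c
#include <assert.h>
#include <stddef.h>

/* The define has to appear before the header contents are seen. */
#define DEBUG

/* ---- hypothetical stand-in for a header such as ar6000_drv.h ---- */
#ifdef DEBUG
#define DBG_ENABLED 1
#define DBG_ASSERT(cond) assert(cond)   /* debug checks compiled in */
#else
#define DBG_ENABLED 0
#define DBG_ASSERT(cond) ((void)0)      /* debug checks compiled out */
#endif
/* ---- end of stand-in header ---- */

/* Reports which branch the stand-in header took. */
static int debug_enabled(void)
{
    return DBG_ENABLED;
}

/* Example user of the debug-only check. */
static size_t count_chars(const char *s)
{
    size_t n = 0;

    DBG_ASSERT(s != NULL);
    while (s[n] != '\0')
        n++;
    return n;
}
```

If the `#define DEBUG` line were moved below the stand-in header section, the `#else` branch would be taken and debug_enabled() would return 0, even though DEBUG is defined later in the same file.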

I think i found the race though: ar6000_ioctl_siwscan checks arWmiReady in the beginning but the device gets destroyed while waiting for scan results. And the second wmi_bssfilter_cmd is called with ar->arWmi==NULL :-/
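
The shape of that race can be sketched in plain C. This is a simplified, single-threaded simulation under stated assumptions, not the driver code and not the attached patch: fake_dev, fake_wmi, fake_destroy, fake_bssfilter_cmd and fake_scan are hypothetical stand-ins, with wmi_ready playing the role of arWmiReady and wmi the role of ar->arWmi. It shows a readiness check that is repeated after the wait, so the second filter command is skipped when the device has gone away in the meantime.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the real AR6000 types. */
struct fake_wmi {
    int filter;
};

struct fake_dev {
    int wmi_ready;          /* plays the role of arWmiReady */
    struct fake_wmi *wmi;   /* plays the role of ar->arWmi  */
};

/* Simulates the device being torn down while a scan is pending. */
static void fake_destroy(struct fake_dev *dev)
{
    dev->wmi_ready = 0;
    dev->wmi = NULL;
}

/* Stand-in for a filter command: dereferences the WMI handle unguarded. */
static int fake_bssfilter_cmd(struct fake_wmi *wmi, int filter)
{
    wmi->filter = filter;   /* a NULL wmi here is the crash scenario */
    return 0;
}

/*
 * Scan sketch. destroy_during_wait simulates teardown happening while
 * waiting for scan results. Returns 0 on success, -1 if the device is
 * not ready at either check.
 */
static int fake_scan(struct fake_dev *dev, int destroy_during_wait)
{
    assert(dev != NULL);

    /* First check, done once at the start of the call. */
    if (!dev->wmi_ready || dev->wmi == NULL)
        return -1;
    fake_bssfilter_cmd(dev->wmi, 1);    /* first filter command */

    /* Waiting for scan results; the device may disappear here. */
    if (destroy_during_wait)
        fake_destroy(dev);

    /* Second check: without it, the call below would receive NULL. */
    if (!dev->wmi_ready || dev->wmi == NULL)
        return -1;

    return fake_bssfilter_cmd(dev->wmi, 0);    /* second filter command */
}
```

In real concurrent code a re-check like this only narrows the window, because the device can still be destroyed between the second check and the call; closing it completely would need locking or reference counting around the handle.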

Changed 8 years ago by PaulFertser

comment:9 Changed 8 years ago by PaulFertser

Please try the new patch (the old one is no longer relevant). Even if it solves the problem, it still sucks (other parts of the driver have a lot of similar "solutions" to avoid races) but i do not think anyone's going to rewrite the driver :|

BTW, providing the binary helped a lot; yesterday i erroneously thought that it's the first bssfilter_cmd invocation that fails, and that drove me away from more fruitful thoughts.

comment:10 Changed 8 years ago by bt4

Thanks for the patch. I am recompiling now.

I tried putting #define DEBUG before ar6000_drv.h but it would not compile :-

/kernelbuild/linux-2.6/drivers/ar6000/ar6000/ar6000_drv.c: In function 'ar6000_dbglog_event':
/kernelbuild/linux-2.6/drivers/ar6000/ar6000/ar6000_drv.c:505: error: implicit declaration of function 'ar6000_send_event_to_app'

comment:11 Changed 8 years ago by bt4

The patch looks good :-)

I now cannot reproduce the problem using my method so I think you have fixed it. I will continue testing for the rest of the week and if there are no further problems I will let you know via another update to this ticket.

Many Thanks.

Changed 8 years ago by PaulFertser

comment:12 Changed 8 years ago by bt4

This problem has not occurred again after one week of testing, so as far as I am concerned this ticket can be closed.
