Ticket #1983 (closed defect: fixed)

Opened 8 years ago

Last modified 8 years ago

eth0 doesn't exist / Oops during bootup

Reported by: Weiss
Owned by: openmoko-devel
Priority: normal
Milestone:
Component: unknown
Version:
Severity: normal
Keywords: wifi kernel
Cc:
Blocked By:
Blocking:
Estimated Completion (week):
HasPatchForReview: no
PatchReviewResult:
Reproducible: always

Description

Rootfs: Om2008.8-gta02-20080904.rootfs.jffs2
Kernel: Om2008.8-gta02-20080903.uImage.bin

I find the wifi interface (eth0) doesn't exist - only lo and usb0 (when connected) appear in 'ifconfig'. This leads to wifi not working at all (iwconfig, ifconfig and friends all fail with "No such device" or similar). Accordingly, I get "Wifi: unknown" in Settings.

While studying 'dmesg' (or just watching the bootup messages), I noticed a kernel oops in a module, which I suspect may be related. The attached log is the output of 'dmesg' after bootup, before the device was allowed to suspend. I suspect that useful messages may have scrolled off the top of the ring buffer; please let me know if there's anything I can do to capture this information without using a debug board, if it'd be useful. The backtrace, however, is there.

Other features: GPS, voice calls, SMS, suspend etc all work very nicely indeed for me.

Attachments

dmesg.log (15.0 KB) - added by Weiss 8 years ago.
dmesg just after bootup
dmesg-beforejiggling.log (27.3 KB) - added by Weiss 8 years ago.
Full version of dmesg before disassembling and poking wifi module
dmesg-afterjiggling.log (23.0 KB) - added by Weiss 8 years ago.
dmesg after disassembling and poking wifi module
dmesg-afterjiggling-withtimeouts.log (32.0 KB) - added by Weiss 8 years ago.
dmesg from later boot, after poking wifi module, with timeouts having returned
dmesg-debug.log (33.0 KB) - added by Weiss 8 years ago.
dmesg with SDIO bus driver debug messages

Change History

Changed 8 years ago by Weiss

dmesg just after bootup

comment:1 Changed 8 years ago by werner

Seems that the WLAN module isn't responding to the driver. The Oops
is just sloppy error handling. Did WLAN ever work on this device ?

comment:2 Changed 8 years ago by Weiss

Thanks for the very quick reply. I haven't ever seen the WLAN working here, but the device is very new (in my hands only since Saturday), so I haven't tested thoroughly with other images or kernels. Do you suspect a hardware fault? I was surprised that no one else had found the same problem.

comment:3 Changed 8 years ago by werner

I would consider the possibility of this being a hardware problem,
yes.

Perhaps first, we could have a look at whether there is something
else going wrong in the kernel that causes the driver to fail to
communicate and then ultimately leads to the Oops.

If you enter u-boot, you can increase the kernel's log buffer size
as follows:

GTA02v6 # setenv bootcmd setenv bootargs \${bootargs_base} \${mtdparts} \${extra}\; nand read.e 0x32000000 kernel 0x200000\; bootm 0x32000000
GTA02v6 # setenv extra log_buf_len=2M
GTA02v6 # saveenv

This will increase the kernel log buffer size to 2MB, which should be
plenty. To return it to its default size later, enter:

GTA02v6 # setenv extra
GTA02v6 # saveenv

Then boot
GTA02v6 # boot
and retrieve all the information with
pc% ssh neo dmesg -s 2000000 >log
The -s option is important, because dmesg by default only retrieves
16kB.
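
A side note on the sizes above, as a minimal sketch: to my knowledge the
kernel reads size parameters such as log_buf_len with binary suffixes,
so 2M would be 2097152 bytes, slightly more than the 2000000 bytes
requested with dmesg -s. The helper name parse_size is hypothetical and
only illustrates the arithmetic; it is not the kernel's own parser.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: parse a size such as "2M" into bytes, treating
 * K, M and G as binary multiples (1024, 1024^2, 1024^3). A number with
 * no suffix is returned unchanged.
 */
unsigned long long parse_size(const char *s)
{
    char *end;
    unsigned long long n = strtoull(s, &end, 0);

    switch (*end) {
    case 'G': case 'g':
        n <<= 10;
        /* fall through */
    case 'M': case 'm':
        n <<= 10;
        /* fall through */
    case 'K': case 'k':
        n <<= 10;
        break;
    default:
        break;
    }
    return n;
}
```

Under that assumption, parse_size("2M") - parse_size("2000000") leaves
97152 bytes that dmesg -s 2000000 may not fetch if the buffer fills up
completely; passing -s 2097152 would cover the whole buffer.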

In case nothing suspicious shows up:

If you don't mind disassembling the device, it may be worth checking
if the WLAN module is properly seated. It's glued to the top of the
main shield with some conductive adhesive tape, so it has some wiggle
room.

There is a small white connector close to the edge of the main PCB,
which connects the WLAN module. If you look at it from the side,
you should be able to see if the connector is fully inserted. If it
seems loose, you could gently press from the top of the WLAN PCB down
towards the connector.

Changed 8 years ago by Weiss

Full version of dmesg before disassembling and poking wifi module

Changed 8 years ago by Weiss

dmesg after disassembling and poking wifi module

Changed 8 years ago by Weiss

dmesg from later boot, after poking wifi module, with timeouts having returned

comment:4 Changed 8 years ago by Weiss

Thanks for the reply again. I've obtained a full 'dmesg' trace (dmesg-beforejiggling.log) which shows the timeouts from the start with nothing else which looks relevant to me. I opened the case and checked the WLAN module, but it seems nicely seated and is very firmly stuck in place. I tried to wiggle it around a little but it really wasn't moving more than a fraction of a millimetre.

After poking around, the timeouts in dmesg seemed to have gone (dmesg-afterjiggling.log), but there was still no wifi (same 'no such device' errors). On a subsequent bootup, the errors had returned (dmesg-afterjiggling-withtimeouts.log). I'm afraid I couldn't swear to the timeout messages not having been intermittent before the poking around - I wasn't watching too closely until I began to realise that the messages during bootup could be related to the wifi problems.

comment:5 Changed 8 years ago by werner

Thanks for the logs! It seems that the complaints only start after the
SDIO stack has already decided that there is a WLAN device, i.e., after
it has accessed the module.

The command it was trying to issue is a read operation that normally
follows a sequence of other read operations, and is roughly the 146th
command sent to the device. So basic communication appears to work.

The switch from a 1-bit bus to the 4-bit bus happens much earlier,
around the 36th command. So it seems that this was uneventful.

What does happen around that time is that the function is activated.
(IOE1 is set in CCCR.) The failing access is still for the CCCR,
though.

So it appears that enabling the function upsets your WLAN module such
that it either completely fails to communicate over SDIO, or that it
sends information the SDIO stack is unhappy with (after which it tries
to remove the device, causing the Oops).

So the operation that sometimes begins the string of timeouts would be
the call to Cmd52ReadByteCommon that follows the call to
Cmd52WriteByteCommon in the function SDEnableFunction in
drivers/sdio/stack/busdriver/sdio_bus_misc.c.

The "silent" failure path (i.e., no timeout messages) is probably
the error exit where SDEnableFunction sets status =
SDIO_STATUS_FUNC_ENABLE_TIMEOUT. This theory could be confirmed by
putting a printk there.

If this is only a temporary confusion, e.g., a race condition inside
the WLAN module, perhaps adding a delay(10) after Cmd52WriteByteCommon
might help.

Otherwise, this still looks like a defective WLAN module.
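
To make the two failure paths described above easier to tell apart,
here is a rough, self-contained model of an enable-then-poll sequence.
It is only a sketch under stated assumptions: the register offsets
(I/O Enable at 0x02, I/O Ready at 0x03 in the CCCR) follow the SDIO
specification, but struct mock_card, mock_cmd52_read, mock_cmd52_write,
enable_function and the status names are all hypothetical and are not
the driver's actual code or signatures.

```c
#include <stdbool.h>
#include <stdint.h>

/* CCCR offsets in function 0, per the SDIO specification. */
#define CCCR_IO_ENABLE 0x02 /* IOEx bits: host asks for a function */
#define CCCR_IO_READY  0x03 /* IORx bits: card reports it is ready */

/* Hypothetical status codes for this sketch only. */
enum status { ST_OK = 0, ST_BUS_TIMEOUT = -1, ST_FUNC_ENABLE_TIMEOUT = -2 };

/* Hypothetical mock card; it stands in for the real WLAN module. */
struct mock_card {
    uint8_t cccr[256];
    int polls_until_ready; /* negative: function never becomes ready */
    bool dies_on_enable;   /* card stops answering once IOEx is set */
    bool bus_dead;         /* every further CMD52 times out */
};

static enum status mock_cmd52_write(struct mock_card *c, uint8_t addr,
                                    uint8_t val)
{
    if (c->bus_dead)
        return ST_BUS_TIMEOUT;
    c->cccr[addr] = val;
    if (addr == CCCR_IO_ENABLE && c->dies_on_enable)
        c->bus_dead = true;
    return ST_OK;
}

static enum status mock_cmd52_read(struct mock_card *c, uint8_t addr,
                                   uint8_t *val)
{
    if (c->bus_dead)
        return ST_BUS_TIMEOUT;
    if (addr == CCCR_IO_READY && c->polls_until_ready >= 0) {
        if (c->polls_until_ready == 0)
            c->cccr[CCCR_IO_READY] = c->cccr[CCCR_IO_ENABLE];
        else
            c->polls_until_ready--;
    }
    *val = c->cccr[addr];
    return ST_OK;
}

/* Set the IOEx bit for func, then poll IORx up to max_polls times. */
enum status enable_function(struct mock_card *c, unsigned func,
                            int max_polls)
{
    uint8_t mask = (uint8_t)(1u << func);
    uint8_t reg;
    enum status st;

    st = mock_cmd52_read(c, CCCR_IO_ENABLE, &reg);
    if (st != ST_OK)
        return st;
    st = mock_cmd52_write(c, CCCR_IO_ENABLE, reg | mask);
    if (st != ST_OK)
        return st;

    for (int i = 0; i < max_polls; i++) {
        st = mock_cmd52_read(c, CCCR_IO_READY, &reg);
        if (st != ST_OK)
            return st;     /* "loud" path: the bus itself timed out */
        if (reg & mask)
            return ST_OK;  /* function reported ready */
    }
    return ST_FUNC_ENABLE_TIMEOUT; /* "silent" path: never ready */
}
```

In this model, a card with dies_on_enable set returns ST_BUS_TIMEOUT on
the first read after the write, which would correspond to the logged
timeouts, while a card that keeps answering but never sets its ready
bit ends in ST_FUNC_ENABLE_TIMEOUT without any bus error.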

Changed 8 years ago by Weiss

dmesg with SDIO bus driver debug messages

comment:6 Changed 8 years ago by Weiss

Ok - here's the dmesg log with the SDIO bus driver messages enabled. I can see the initial messages working properly, the switch to four-bit I/O, some power settings etc. The errors start, as you suggested, when the function is enabled. There are some response and error codes in the log, but I don't know what they mean.

I also added a debug message just before the SDIO_STATUS_FUNC_ENABLE_TIMEOUT exit, and it looks like this happens multiple times before the timeout messages appear.

I tried adding a udelay(10) just after the Cmd52WriteByteCommon in SDEnableFunction (after creating the current log), but this didn't appear to make any difference.

So, unless it'd be useful to probe any deeper (which I'd be very willing to do - I'm enjoying learning about how to poke around in the kernel), I'll initiate the warranty repair/replacement procedure with my reseller and this ticket can be closed. Thanks for all the help and advice.

comment:7 Changed 8 years ago by werner

Sorry for the delay - I missed your update among my mails.

This looks bad indeed - I don't know what exactly is wrong, but it doesn't
look like any other problem I'm aware of, and the fact that the failure
happens just after the function gets enabled strongly suggests a hardware
issue - and one that isn't just transitory.

For further debugging, one would have to record the communication on the
SDIO bus, but even that would probably only show that it just stops or
that there's perhaps some garbled last message, without shedding any light
on the real reason.

I hope the replacement goes well. Thanks for the testing, and sorry for
the inconvenience.

comment:8 Changed 8 years ago by werner

  • Status changed from new to closed
  • HasPatchForReview unset
  • Resolution set to fixed

Seems that this one was successfully resolved through warranty replacement,
so let's close it.
