Ticket #2217 (new defect)

Opened 6 years ago

Last modified 6 years ago

Noise screen of death: Freerunner loses SDIO connection

Reported by: xbaldauf
Owned by: openmoko-kernel
Priority: normal
Component: kernel
Severity: major
HasPatchForReview: no

Description

Hello,

When the Freerunner is under load (especially while booting), the SDIO connection between the CPU and the Glamo chip, as well as the connection between the CPU and the SDHC card, sometimes breaks. Often this break is only transient. While it lasts, the Freerunner emits kernel messages like these:

[20377.570000] glamo-mci glamo-mci.0: Error after cmd: 0x8020
[20377.570000] mmcblk0: error -110 sending read/write command
[20377.595000] end_request: I/O error, dev mmcblk0, sector 11183840
[20377.600000] glamo-mci glamo-mci.0: Error after cmd: 0x20
[20377.600000] mmcblk0: error -110 sending read/write command
[20377.605000] end_request: I/O error, dev mmcblk0, sector 11183840
[20377.620000] glamo-mci glamo-mci.0: Error after cmd: 0x8020
[20377.620000] mmcblk0: error -110 sending read/write command
[20377.625000] end_request: I/O error, dev mmcblk0, sector 11183840
[20377.650000] glamo-mci glamo-mci.0: Error after cmd: 0x20
[20377.650000] mmcblk0: error -110 sending read/write command
[20377.655000] end_request: I/O error, dev mmcblk0, sector 11183840
[20377.680000] glamo-mci glamo-mci.0: Error after cmd: 0x8020
[20377.680000] mmcblk0: error -110 sending read/write command
[20377.690000] end_request: I/O error, dev mmcblk0, sector 11183840
[20377.715000] glamo-mci glamo-mci.0: Error after cmd: 0x20
[20377.715000] mmcblk0: error -110 sending read/write command
[20377.720000] end_request: I/O error, dev mmcblk0, sector 11183840
[20377.765000] glamo-mci glamo-mci.0: Error after cmd: 0x8020
[20377.765000] mmcblk0: error -110 sending read/write command
[20377.775000] end_request: I/O error, dev mmcblk0, sector 11183840
[20377.800000] glamo-mci glamo-mci.0: Error after cmd: 0x20
[20377.800000] mmcblk0: error -110 sending read/write command
[20377.805000] end_request: I/O error, dev mmcblk0, sector 11183840
[20377.845000] glamo-mci glamo-mci.0: Error after cmd: 0x8020
[20377.845000] mmcblk0: error -110 sending read/write command
[20377.850000] end_request: I/O error, dev mmcblk0, sector 11183840
[20377.865000] glamo-mci glamo-mci.0: Error after cmd: 0x20
[20377.865000] mmcblk0: error -110 sending read/write command
[20377.880000] end_request: I/O error, dev mmcblk0, sector 11183840
[20377.885000] glamo-mci glamo-mci.0: Error after cmd: 0x8020
[20377.885000] mmcblk0: error -110 sending read/write command
[20377.890000] end_request: I/O error, dev mmcblk0, sector 11183840
[20377.910000] glamo-mci glamo-mci.0: Error after cmd: 0x20
[20377.910000] mmcblk0: error -110 sending read/write command
[20377.915000] end_request: I/O error, dev mmcblk0, sector 11183840
[20377.950000] glamo-mci glamo-mci.0: Error after cmd: 0x8020
[20377.950000] mmcblk0: error -110 sending read/write command
[20377.955000] end_request: I/O error, dev mmcblk0, sector 11183840
[20377.985000] glamo-mci glamo-mci.0: Error after cmd: 0x120
[20377.985000] mmcblk0: error -110 sending read/write command
[20377.990000] end_request: I/O error, dev mmcblk0, sector 11183840
[20378.015000] glamo-mci glamo-mci.0: Error after cmd: 0x8020
[20378.015000] mmcblk0: error -110 sending read/write command
[20378.050000] end_request: I/O error, dev mmcblk0, sector 11183840
[20378.055000] glamo-mci glamo-mci.0: Error after cmd: 0x120
[20378.055000] mmcblk0: error -110 sending read/write command
[20378.060000] end_request: I/O error, dev mmcblk0, sector 11183840
[20378.070000] glamo-mci glamo-mci.0: Error after cmd: 0x8020
[20378.070000] mmcblk0: error -110 sending read/write command
[20378.075000] end_request: I/O error, dev mmcblk0, sector 11183840
[20378.110000] glamo-mci glamo-mci.0: Error after cmd: 0x20
[20378.110000] mmcblk0: error -110 sending read/write command
[20378.115000] end_request: I/O error, dev mmcblk0, sector 11183840
[20378.160000] glamo-mci glamo-mci.0: Error after cmd: 0x8020
[20378.160000] mmcblk0: error -110 sending read/write command
[20378.165000] end_request: I/O error, dev mmcblk0, sector 11183840
[20378.185000] glamo-mci glamo-mci.0: Error after cmd: 0x20
[20378.185000] mmcblk0: error -110 sending read/write command
[20378.190000] end_request: I/O error, dev mmcblk0, sector 11183840
[20378.200000] glamo-mci glamo-mci.0: Error after cmd: 0x8020
[20378.200000] mmcblk0: error -110 sending read/write command
[20378.205000] end_request: I/O error, dev mmcblk0, sector 11183840
[20378.250000] glamo-mci glamo-mci.0: Error after cmd: 0x20
[20378.250000] mmcblk0: error -110 sending read/write command
[20378.315000] end_request: I/O error, dev mmcblk0, sector 11183840
[20378.320000] glamo-mci glamo-mci.0: Error after cmd: 0x8020
[20378.320000] mmcblk0: error -110 sending read/write command
[20378.325000] end_request: I/O error, dev mmcblk0, sector 11183840
[20378.340000] glamo-mci glamo-mci.0: Error after cmd: 0x20
[20378.340000] mmcblk0: error -110 sending read/write command
[20378.345000] end_request: I/O error, dev mmcblk0, sector 11183840
[20378.380000] glamo-mci glamo-mci.0: Error after cmd: 0x8020
[20378.380000] mmcblk0: error -110 sending read/write command
[20378.395000] end_request: I/O error, dev mmcblk0, sector 11183840
[20378.455000] glamo-mci glamo-mci.0: Error after cmd: 0x20
[20378.455000] mmcblk0: error -110 sending read/write command
[20378.460000] end_request: I/O error, dev mmcblk0, sector 9979414
[20378.460000] end_request: I/O error, dev mmcblk0, sector 9979422
[20378.465000] glamo-mci glamo-mci.0: Error after cmd: 0x8020
[20378.470000] mmcblk0: error -110 sending read/write command
[20378.480000] end_request: I/O error, dev mmcblk0, sector 9979438
[20378.480000] end_request: I/O error, dev mmcblk0, sector 9979446
[20378.490000] glamo-mci glamo-mci.0: Error after cmd: 0x20
[20378.490000] mmcblk0: error -110 sending read/write command
[20378.495000] end_request: I/O error, dev mmcblk0, sector 9979462
[20378.495000] end_request: I/O error, dev mmcblk0, sector 9979470
[20378.495000] end_request: I/O error, dev mmcblk0, sector 9979478
[20378.495000] end_request: I/O error, dev mmcblk0, sector 9979486
[20378.495000] end_request: I/O error, dev mmcblk0, sector 9979494
[20378.495000] end_request: I/O error, dev mmcblk0, sector 9979502
[20378.500000] glamo-mci glamo-mci.0: Error after cmd: 0x8120
[20378.520000] mmcblk0: error -110 sending read/write command
[20378.520000] end_request: I/O error, dev mmcblk0, sector 9979518
[20378.525000] glamo-mci glamo-mci.0: Error after cmd: 0x20
[20378.530000] mmcblk0: error -110 sending read/write command
[20378.580000] end_request: I/O error, dev mmcblk0, sector 9979542
[20378.580000] end_request: I/O error, dev mmcblk0, sector 9979550
[20378.585000] glamo-mci glamo-mci.0: Error after cmd: 0x8020
[20378.590000] mmcblk0: error -110 sending read/write command
[20378.615000] end_request: I/O error, dev mmcblk0, sector 9979566
[20378.615000] glamo-mci glamo-mci.0: Error after cmd: 0x20
[20378.620000] mmcblk0: error -110 sending read/write command
[20378.625000] end_request: I/O error, dev mmcblk0, sector 9979542
[20378.665000] glamo-mci glamo-mci.0: Error after cmd: 0x8120
[20378.665000] mmcblk0: error -110 sending read/write command
[20378.670000] end_request: I/O error, dev mmcblk0, sector 1992380
[20378.670000] end_request: I/O error, dev mmcblk0, sector 1992388
[20378.670000] end_request: I/O error, dev mmcblk0, sector 1992396
[20378.670000] end_request: I/O error, dev mmcblk0, sector 1992404
[20378.670000] end_request: I/O error, dev mmcblk0, sector 1992412
[20378.670000] end_request: I/O error, dev mmcblk0, sector 1992420
[20378.685000] glamo-mci glamo-mci.0: Error after cmd: 0x20
[20378.685000] mmcblk0: error -110 sending read/write command
[20378.690000] end_request: I/O error, dev mmcblk0, sector 1992508
[20378.695000] glamo-mci glamo-mci.0: Error after cmd: 0x8020
[20378.700000] mmcblk0: error -110 sending read/write command
[20378.710000] end_request: I/O error, dev mmcblk0, sector 1992540
[20378.710000] glamo-mci glamo-mci.0: Error after cmd: 0x20
[20378.715000] mmcblk0: error -110 sending read/write command
[20378.720000] end_request: I/O error, dev mmcblk0, sector 1992572
[20378.730000] glamo-mci glamo-mci.0: Error after cmd: 0x8020
[20378.730000] mmcblk0: error -110 sending read/write command
[20378.735000] end_request: I/O error, dev mmcblk0, sector 1992508
[20379.270000] glamo-mci glamo-mci.0: Error after cmd: 0x20
[20379.270000] mmcblk0: error -110 sending read/write command
[20379.280000] end_request: I/O error, dev mmcblk0, sector 11162288
[20379.285000] glamo-mci glamo-mci.0: Error after cmd: 0x8020
[20379.285000] mmcblk0: error -110 sending read/write command
[20379.295000] end_request: I/O error, dev mmcblk0, sector 11162288
[20384.005000] glamo-mci glamo-mci.0: Error after cmd: 0x20
[20384.005000] mmcblk0: error -110 sending read/write command
[20384.010000] glamo-mci glamo-mci.0: Error after cmd: 0x8022
[20384.015000] end_request: I/O error, dev mmcblk0, sector 1333568
[20384.050000] glamo-mci glamo-mci.0: Error after cmd: 0x22
[20384.050000] mmcblk0: error -84 sending read/write command
[20384.060000] glamo-mci glamo-mci.0: Error after cmd: 0x8202
[20384.065000] glamo-mci glamo-mci.0: Error after cmd: 0x202
[20384.070000] glamo-mci glamo-mci.0: Error after cmd: 0x8202
[20384.075000] glamo-mci glamo-mci.0: Error after cmd: 0x202
[20384.080000] glamo-mci glamo-mci.0: Error after cmd: 0x8202
[20384.085000] glamo-mci glamo-mci.0: Error after cmd: 0x202
[20384.090000] mmcblk0: error -84 requesting status
[20384.095000] glamo-mci glamo-mci.0: Error after cmd: 0x8202
[20384.100000] end_request: I/O error, dev mmcblk0, sector 7880864
[20384.100000] end_request: I/O error, dev mmcblk0, sector 7880866
[20384.100000] end_request: I/O error, dev mmcblk0, sector 7880868
[20384.100000] end_request: I/O error, dev mmcblk0, sector 7880870
[20384.100000] end_request: I/O error, dev mmcblk0, sector 7880872
[20384.100000] end_request: I/O error, dev mmcblk0, sector 7880874
[20384.100000] end_request: I/O error, dev mmcblk0, sector 7880876
[20384.100000] end_request: I/O error, dev mmcblk0, sector 7880878
[20384.100000] end_request: I/O error, dev mmcblk0, sector 7880880
[20384.100000] end_request: I/O error, dev mmcblk0, sector 7880882
[20384.100000] end_request: I/O error, dev mmcblk0, sector 7880884
[20384.100000] end_request: I/O error, dev mmcblk0, sector 7880886
[20384.100000] end_request: I/O error, dev mmcblk0, sector 7880888
[20384.100000] end_request: I/O error, dev mmcblk0, sector 7880890
[20384.100000] end_request: I/O error, dev mmcblk0, sector 7880892
[20384.100000] end_request: I/O error, dev mmcblk0, sector 7880894
[20384.100000] end_request: I/O error, dev mmcblk0, sector 7880896
[20384.100000] end_request: I/O error, dev mmcblk0, sector 7880898
[20384.100000] end_request: I/O error, dev mmcblk0, sector 7880900
[20384.135000] glamo-mci glamo-mci.0: Error after cmd: 0x22
[20384.135000] mmcblk0: error -84 sending read/write command
[20384.140000] glamo-mci glamo-mci.0: Error after cmd: 0x8202
[20384.145000] glamo-mci glamo-mci.0: Error after cmd: 0x202
[20384.155000] glamo-mci glamo-mci.0: Error after cmd: 0x8202
[20384.155000] glamo-mci glamo-mci.0: Error after cmd: 0x202
[20384.160000] glamo-mci glamo-mci.0: Error after cmd: 0x8202
[20384.165000] glamo-mci glamo-mci.0: Error after cmd: 0x202
[20384.170000] mmcblk0: error -84 requesting status
[20384.175000] glamo-mci glamo-mci.0: Error after cmd: 0x8202
[20384.180000] end_request: I/O error, dev mmcblk0, sector 7867966
[20389.945000] glamo-mci glamo-mci.0: Error after cmd: 0x20
[20389.945000] mmcblk0: error -110 sending read/write command
[20389.955000] end_request: I/O error, dev mmcblk0, sector 7949666
[20389.970000] glamo-mci glamo-mci.0: Error after cmd: 0x8120
[20389.970000] mmcblk0: error -110 sending read/write command
[20389.980000] glamo-mci glamo-mci.0: Error after cmd: 0x120
[20389.980000] end_request: I/O error, dev mmcblk0, sector 1051962
[20393.490000] glamo-mci glamo-mci.0: Error after cmd: 0x8020
[20393.490000] mmcblk0: error -110 sending read/write command
[20393.500000] end_request: I/O error, dev mmcblk0, sector 7949666

In the best case, this makes the filesystem driver remount the filesystem read-only, after which certain files can no longer be found. As a result, the system almost always crashes because files are unavailable.

When the error happens, the screen is filled with noise, i.e. it looks like colorful snow.

I'm using the self-compiled testing kernel "Linux om-gta02 2.6.28-GTA02_master2_4ff379a06a70e179-mokodev #2 PREEMPT Sun Jan 18 23:40:09 CET 2009 armv4tl unknown" (as of http://git.openmoko.org/?p=kernel.git;a=log;h=4ff379a06a70e17997e196c9c393bc7c8648e42a ).

My system boots from the SDHC card, so all I/O traffic goes over the SDIO connection.

I can reproduce the problem reliably: about 40% of all reboots fail. (Simply rebooting is enough to show it.)
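
For triage, kernel messages like the ones quoted above can be tallied with a short script. The following is only a minimal sketch: the helper name `summarize` and the sample excerpt are illustrative, and the errno names are the generic Linux values (110 is ETIMEDOUT, 84 is EILSEQ), hard-coded here rather than looked up. The meaning of the individual "Error after cmd" status bits is not decoded, since the ticket does not document them.

```python
import re
from collections import Counter

# Generic Linux errno values for the two codes seen in the log.
# Hard-coded (an assumption of this sketch) so the result does not
# depend on the platform the script runs on.
ERRNO_NAMES = {110: "ETIMEDOUT", 84: "EILSEQ"}

MMC_ERROR = re.compile(r"mmcblk0: error -(\d+) ")
GLAMO_STATUS = re.compile(r"glamo-mci glamo-mci\.0: Error after cmd: (0x[0-9a-f]+)")
IO_SECTOR = re.compile(r"end_request: I/O error, dev mmcblk0, sector (\d+)")


def summarize(log_text):
    """Count errno values, glamo-mci status words and distinct failing sectors."""
    errnos = Counter()
    statuses = Counter()
    sectors = set()
    for line in log_text.splitlines():
        m = MMC_ERROR.search(line)
        if m:
            code = int(m.group(1))
            errnos[ERRNO_NAMES.get(code, str(code))] += 1
        m = GLAMO_STATUS.search(line)
        if m:
            statuses[m.group(1)] += 1
        m = IO_SECTOR.search(line)
        if m:
            sectors.add(int(m.group(1)))
    return errnos, statuses, sectors


# A few lines copied from the log in the description.
SAMPLE = """\
[20377.570000] glamo-mci glamo-mci.0: Error after cmd: 0x8020
[20377.570000] mmcblk0: error -110 sending read/write command
[20377.595000] end_request: I/O error, dev mmcblk0, sector 11183840
[20384.050000] glamo-mci glamo-mci.0: Error after cmd: 0x22
[20384.050000] mmcblk0: error -84 sending read/write command
[20384.100000] end_request: I/O error, dev mmcblk0, sector 7880864
"""

if __name__ == "__main__":
    errnos, statuses, sectors = summarize(SAMPLE)
    print(dict(errnos))
    print(dict(statuses))
    print(sorted(sectors))
```

Feeding the full `dmesg` output to `summarize` instead of `SAMPLE` would show at a glance whether timeouts (-110) or sequence/CRC-type errors (-84) dominate a given crash.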

Attachments

OpenMoko Freerunner crash with kernel 9029dff1f370018665a6e2999632a34fd0518f4d.2.jpeg (225.8 KB) - added by xbaldauf 6 years ago.
This is a screenshot of what happens during the bug. Note that the overwriting sometimes stops before it has covered the whole screen. (I'm blind... I did not see the "Attach file" button, because it was 4 pages from the beginning and many pages from the end...)

Change History

comment:1 Changed 6 years ago by andy

I think this can be due to process spread in Glamo chips, meaning that not every device can handle the 89MHz memory clock that is used in 2.6.28.

I have seen this symptom when working with Glamo registers, but I don't see it on devices with kernels from the last few months. If you can reproduce it fairly reliably, I think it will be something about your GTA02.

The actual problem is that communication with the Glamo's internal SDRAM is broken, and since the MMC traffic travels through that memory, it is broken too.

I'll add a kernel commandline option to push this back to 80MHz from 89MHz (although the 89MHz was recommended by SMedia) so we can test and see if it makes any difference for you.

comment:2 Changed 6 years ago by andy

I added a patch

http://git.openmoko.org/?p=kernel.git;a=commitdiff;h=302131b55d1f922fe73b238202795a4cd4537ad3

that allows us to start debugging this by appending

glamo3362.slow_memory=1

on your kernel commandline.

comment:3 Changed 6 years ago by xbaldauf

Well,

I compiled http://git.openmoko.org/?p=kernel.git;a=commit;h=5b3137236f5d5665774f9705220b3fe6c1e23692 and booted this kernel (with "glamo3362.slow_memory=1" appended to the kernel command line). However, there is no audio: not during phone calls, not in mplayer, and not when the phone should ring on an incoming call. This looks like bug https://docs.openmoko.org/trac/ticket/2216. I therefore cannot use this kernel day to day, and so I cannot reliably test whether the bug still appears.

So for practical reasons, I'd recommend marking this bug as blocked by #2216.

comment:4 in reply to: ↑ description Changed 6 years ago by TAsn

Using the kernel from http://build.shr-project.org/tests/uImage-2.6.28-rc4-ms5-fixes_34240a1c06ae3618.bin
with glamo3362.slow_memory=1,
everything works great: no NSOD (so far).

comment:5 Changed 6 years ago by andy

That's good news, but I can't find sources for that kernel. git.shr-project.org doesn't seem to have it, so I can't confirm what is actually in there. I guess it's a backport of cherry-picked patches to a build just before the 2.6.29-rc2 uplevel. If it pays any attention to glamo3362.slow_memory=1, then it must have the patch.

It would be very handy if we have found the tap to turn the problem off for Glamo chips that can't handle the faster bus speed.

We're still trying to figure out the audio issue on 2.6.29-rc2; at the moment it seems to be due to ALSA changes upstream.

comment:6 Changed 6 years ago by mrmoku

that's 34240a1c06ae36180dee695aa25bbae869b2aa26 with the patches from
https://paulfertser.is-a-geek.org/files/FSO-stable-patchset/
which were cherry-picked / backported from Paul

comment:7 Changed 6 years ago by andy

Ah it's fine then, they're just backports from andy-tracking including the magic one for this.

So if we don't hear about anything to the contrary I guess we know how to work around it.

comment:8 Changed 6 years ago by andy

Actually, thinking about this, there is probably some middle ground between the two choices. I added a patch that allows the in-between settings to be selected as well:

http://git.openmoko.org/?p=kernel.git;a=commitdiff;h=7f859d161097631a3c76ed1dbb1cfcb08ebe9759

If someone with a "bad Glamo" can try the other settings, it may be possible to still get some improvement over the 2.6.24 settings without triggering this issue.

comment:9 Changed 6 years ago by xbaldauf

I've been running http://git.openmoko.org/?p=kernel.git;a=commit;h=9029dff1f370018665a6e2999632a34fd0518f4d ( Thu, 5 Feb 2009 17:01:56 +0000 (17:01 +0000) ) with the kernel parameter "glamo3362.slow_memory=1" for about 24 hours now. I have experienced 3 crashes/lockups (all while the screen was blanked, so I do not know the cause or whether they had anything to do with this bug), but no crash of the "Noise screen of death" type. Judging from how often the bug occurred before, I think your workaround works. :-)
I'll change the setting from time to time to try out the other clock and wait-state settings and find an optimal one.

I did not notice any speed difference. Is there a howto or document describing which operations should change speed depending on the number of wait states and the frequency used? Maybe there is even some kind of benchmark program? (E.g., is 50 MHz with 0 wait states faster than 90 MHz with 1 wait state?)
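
The 50 MHz versus 90 MHz question can at least be framed with back-of-the-envelope arithmetic. The sketch below assumes a deliberately simple model that is not taken from any Glamo documentation: one access costs a fixed number of base cycles plus the configured wait states, all at the memory clock. The `base_cycles` value is unknown for the Glamo and is a free parameter here.

```python
def access_time_ns(clock_mhz, wait_states, base_cycles=1):
    """Time for one memory access in nanoseconds.

    Assumed model (not from Glamo documentation): an access takes
    base_cycles + wait_states clock cycles at clock_mhz.
    """
    cycles = base_cycles + wait_states
    return cycles * 1000.0 / clock_mhz


if __name__ == "__main__":
    # Compare the two settings from the question for several guesses
    # at the unknown base cycle count.
    for base in (1, 2, 4):
        slow_clock = access_time_ns(50, 0, base)
        fast_clock = access_time_ns(90, 1, base)
        winner = "50 MHz / 0 ws" if slow_clock < fast_clock else "90 MHz / 1 ws"
        print("base=%d: 50MHz/0ws %.1f ns, 90MHz/1ws %.1f ns -> %s"
              % (base, slow_clock, fast_clock, winner))
```

Under this model the answer flips with the base cycle count: if an access is a single cycle, the extra wait state costs more than the higher clock gains, while for longer accesses the 90 MHz setting wins. A real comparison would need a measurement on the device.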

comment:10 Changed 6 years ago by andy

I think the workaround probably works fine... I've given you the wrong kernel commandline though, it's

glamo_core.slow_memory=1

as someone pointed out to me.

comment:11 Changed 6 years ago by xbaldauf

Well, why does it (apparently) work then if I have not given the correct kernel parameter?

comment:12 Changed 6 years ago by xbaldauf

Unfortunately, with either glamo_core.slow_memory=1 or glamo3362.slow_memory=1, the problem is still there. :-(

comment:13 Changed 6 years ago by xbaldauf

With "glamo_core.slow_memory=1 glamo3362.slow_memory=1" on the kernel command line and kernel http://git.openmoko.org/?p=kernel.git;a=commit;h=4e9be3539e402cb2b9aa9caf5050756916b9345e (Thu, 5 Feb 2009 21:11:28 +0000 (21:11 +0000)), I still get the noise screen of death and this in the log file:

[ 3879.530000] glamo-mci glamo-mci.0: Error after cmd: 0x302
[ 3879.530000] mmcblk0: retrying using single block read
[ 3879.540000] glamo-mci glamo-mci.0: Error after cmd: 0x8020
[ 3879.540000] glamo-mci glamo-mci.0: Error after cmd: 0x8122
[ 3879.550000] glamo-mci glamo-mci.0: Error after cmd: 0x4302
[ 3879.550000] mmcblk0: error -84 sending status comand<3>mmcblk0: error -84 sending read/write command, response 0x0, card status 0x0
[ 3879.565000] end_request: I/O error, dev mmcblk0, sector 14732172
[ 3879.575000] glamo-mci glamo-mci.0: Error after cmd: 0x8020
[ 3879.575000] glamo-mci glamo-mci.0: Error after cmd: 0x8122
[ 3879.595000] glamo-mci glamo-mci.0: Error after cmd: 0x4302

:-(

comment:14 Changed 6 years ago by tilman

This just looks like the problems created by GSM interference... try setting glamo_mci.sd_max_clk=1000000 on the kernel command line to rule out that it has anything to do with it.

comment:15 Changed 6 years ago by xbaldauf

Well... running the kernel with the parameter "glamo_mci.sd_max_clk=1000000" makes booting painfully slow. Moreover, it does not fix the problem.

After entering the PIN (i.e. connecting to GSM), I get these error messages:

[ 939.205000] glamo-mci glamo-mci.0: Error after cmd: 0x8302
[ 939.205000] mmcblk0: retrying using single block read
[ 939.220000] glamo-mci glamo-mci.0: Error after cmd: 0x120
[ 939.220000] mmcblk0: error -110 sending read/write command, response 0x0, card status 0x40400b00
[ 939.230000] end_request: I/O error, dev mmcblk0, sector 11188690
[ 939.240000] glamo-mci glamo-mci.0: Error after cmd: 0x120
[ 939.240000] mmcblk0: error -110 sending read/write command, response 0x0, card status 0x400b00
[ 939.260000] end_request: I/O error, dev mmcblk0, sector 11188691
[ 939.260000] glamo-mci glamo-mci.0: Error after cmd: 0x120
[ 939.275000] mmcblk0: error -110 sending read/write command, response 0x0, card status 0x400b00
[ 939.295000] end_request: I/O error, dev mmcblk0, sector 11188692
[ 939.295000] glamo-mci glamo-mci.0: Error after cmd: 0x120
[ 939.310000] mmcblk0: error -110 sending read/write command, response 0x0, card status 0x400b00
[ 939.315000] end_request: I/O error, dev mmcblk0, sector 11188693
[ 939.330000] glamo-mci glamo-mci.0: Error after cmd: 0x120

So the problem may indeed be GSM-related. However, your suggested fix does not seem to work.

(My relevant kernel command line parameters are now "glamo_core.slow_memory=1 glamo3362.slow_memory=1 glamo_mci.sd_max_clk=1000000".)
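
Since the thread went back and forth over which module prefix is the right one, it can help to list exactly which `module.param=value` tokens a command line contains. This is a minimal sketch; `module_params` is a hypothetical helper, and it only parses a string. It does not tell whether the kernel actually recognized a parameter.

```python
def module_params(cmdline):
    """Return {module: {param: value}} for 'module.param=value' tokens.

    Tokens without '=' or without a 'module.' prefix (for example
    'root=...' or 'console=...') are skipped.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if not sep or "." not in name:
            continue
        module, _, param = name.partition(".")
        params.setdefault(module, {})[param] = value
    return params


if __name__ == "__main__":
    # The parameters quoted in this comment.
    cmdline = ("glamo_core.slow_memory=1 glamo3362.slow_memory=1 "
               "glamo_mci.sd_max_clk=1000000")
    for module, values in sorted(module_params(cmdline).items()):
        print(module, values)
```

On a running system the same function could be applied to the contents of `/proc/cmdline`.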

comment:16 follow-up: ↓ 18 Changed 6 years ago by xbaldauf

Interestingly, however, it looks like using glamo_mci.sd_max_clk=1000000 raises the probability of a crash like the one above to 100%. The crash always happens at GSM login time.

comment:17 Changed 6 years ago by andy

Holy crap, not liking the sound of this.

GSM taking a dump on SDIO communication is one thing, but this problem has to do with either CPU <-> Glamo communication or Glamo <-> internal memory communication, which is another matter.

It's a new thing in the world if we are saying GSM TX impacts either of those.

comment:18 in reply to: ↑ 16 ; follow-up: ↓ 19 Changed 6 years ago by andy

Replying to xbaldauf:

> Interestingly, however, it looks like using glamo_mci.sd_max_clk=1000000 raises the probability of a crash like the one above to 100%. The crash always happens at GSM login time.

To be clear, at the time of this "crash", you have the permanent noisy screen business?

comment:19 in reply to: ↑ 18 ; follow-up: ↓ 21 Changed 6 years ago by xbaldauf

Replying to andy:

> Replying to xbaldauf:

> > Interestingly, however, it looks like using glamo_mci.sd_max_clk=1000000 raises the probability of a crash like the one above to 100%. The crash always happens at GSM login time.

> To be clear, at the time of this "crash", you have the permanent noisy screen business?

Sometimes, sometimes not. Sometimes the noisy screen is busy (i.e. updated), sometimes even normal redraws happen (so the noisy screen is overwritten with the correct content), and sometimes everything stops, including screen updates.

The triggering of the bug seems to be strongly GSM-related. AFAIK, it has only happened when I log in to the GSM network, place a phone call, or receive a phone call or short message. Thus, when I have no GSM traffic for some hours or even days, the system seems to remain stable.

comment:20 Changed 6 years ago by werner

Cool, the evil twin of "MMC kills GPS" :-(

A few things to try:

When this happens, is the battery fully charged? Does it also happen with USB connected and providing 500mA?

Does this also happen with a different SIM, a different SD card, or at a different location?

Also, if there is the possibility to test with another GTA02 in the same settings (same battery, SD card, SIM, location), that may shed some light.

  • Werner

comment:21 in reply to: ↑ 19 ; follow-up: ↓ 22 Changed 6 years ago by andy

Replying to xbaldauf:

> Replying to andy:

> > Replying to xbaldauf:

> > > Interestingly, however, it looks like using glamo_mci.sd_max_clk=1000000 raises the probability of a crash like the one above to 100%. The crash always happens at GSM login time.

I wonder if we simply delay SD access enough that we are still using it intensively by the time the GSM stuff starts up.

> > To be clear, at the time of this "crash", you have the permanent noisy screen business?

> Sometimes, sometimes not. Sometimes the noisy screen is busy (i.e. updated), sometimes even normal redraws happen (so the noisy screen is overwritten with the correct content), and sometimes everything stops, including screen updates.

Ah... that's something totally different from what I understood until now. I thought we were talking about a dynamic, jittering, permanently changing full-display "snow" of noise. I have seen this many times when working with Glamo internal memory parameters.

It would imply we crapped on the control registers by accident.

But when you say the "noisy screen is overwritten with the correct content", it sounds instead like the bitmap placed there is wrong data, and later it might come and invalidate and redraw that area perfectly well.

> The triggering of the bug seems to be strongly GSM-related. AFAIK, it has only happened when I log in to the GSM network, place a phone call, or receive a phone call or short message. Thus, when I have no GSM traffic for some hours or even days, the system seems to remain stable.

I hope only some "lucky" devices can suffer from this, or we would have heard about it long before.

comment:22 in reply to: ↑ 21 Changed 6 years ago by xbaldauf

Replying to andy:

> Replying to xbaldauf:

> > Replying to andy:

> > > Replying to xbaldauf:

> > > > Interestingly, however, it looks like using glamo_mci.sd_max_clk=1000000 raises the probability of a crash like the one above to 100%. The crash always happens at GSM login time.

> I wonder if we simply delay SD access enough that we are still using it intensively by the time the GSM stuff starts up.

Well, it turns out that conclusion was premature. Since then, I've managed to log in twice without such a crash. Maybe the crash probability depends on the frequency used between my device and the GSM tower, or something like that.

> > > To be clear, at the time of this "crash", you have the permanent noisy screen business?

> > Sometimes, sometimes not. Sometimes the noisy screen is busy (i.e. updated), sometimes even normal redraws happen (so the noisy screen is overwritten with the correct content), and sometimes everything stops, including screen updates.

> Ah... that's something totally different from what I understood until now. I thought we were talking about a dynamic, jittering, permanently changing full-display "snow" of noise. I have seen this many times when working with Glamo internal memory parameters.

> It would imply we crapped on the control registers by accident.

> But when you say the "noisy screen is overwritten with the correct content", it sounds instead like the bitmap placed there is wrong data, and later it might come and invalidate and redraw that area perfectly well.

Yes. It looks like the bug is initially transient in nature. It is just so severe that the damage done is persistent (until the next power cycle). Sometimes I can continue working with the device; sometimes I can continue only as long as there is no write access (because the filesystem was remounted read-only due to I/O errors, even though the I/O errors are now gone); and sometimes I cannot continue at all (because of some crash; sometimes I see the red lights of the AUX button blinking fast, sometimes not). Being able to continue implies that the correct screen is redrawn (e.g. changing the currently visible application redraws the screen).

> > The triggering of the bug seems to be strongly GSM-related. AFAIK, it has only happened when I log in to the GSM network, place a phone call, or receive a phone call or short message. Thus, when I have no GSM traffic for some hours or even days, the system seems to remain stable.

> I hope only some "lucky" devices can suffer from this, or we would have heard about it long before.

So you are suggesting a hardware failure...? I've never experienced this bug under kernel 2.6.24 AFAIK.

comment:23 Changed 6 years ago by xbaldauf

By the way, the bug also appears under kernel http://git.openmoko.org/?p=kernel.git;a=commit;h=e515295f2d76beff3986cc681b5e4da78bd9f484 (the "stable" series of openmoko kernels, Mon, 9 Feb 2009 13:35:23 +0000 (13:35 +0000) ). Logging in to the GSM network was okay, but placing a call triggered the bug.

[ 936.865000] glamo-mci glamo-mci.0: Error after cmd: 0x8302
[ 936.870000] mmcblk0: retrying using single block read
[ 936.870000] glamo-mci glamo-mci.0: Error after cmd: 0x20
[ 936.870000] glamo-mci glamo-mci.0: Error after cmd: 0x122
[ 936.885000] glamo-mci glamo-mci.0: Error after cmd: 0x8302
[ 936.885000] mmcblk0: error -84 sending status comand<3>mmcblk0: error -84 sending read/write command, response 0x0, card status 0x0
[ 936.900000] end_request: I/O error, dev mmcblk0, sector 11196910
[ 936.910000] glamo-mci glamo-mci.0: Error after cmd: 0x20
[ 936.910000] glamo-mci glamo-mci.0: Error after cmd: 0x122
[ 936.920000] glamo-mci glamo-mci.0: Error after cmd: 0xc302
[ 936.920000] mmcblk0: error -84 sending status comand<3>mmcblk0: error -84 sending read/write command, response 0x0, card status 0x0

Note that when a part of the screen is "moved" after the bug has been triggered (e.g. by pushing the "middle button" in the Enlightenment UI to get a menu), the noise moves with that part of the screen; it is not overwritten immediately.

comment:24 follow-up: ↓ 25 Changed 6 years ago by werner

xbaldauf, perhaps a screenshot could help to clarify the pattern. If you can still access the Neo with SSH, you could take screenshots with

dd if=/dev/fb0 bs=4096 count=150 of=filename

They can then be converted with
http://svn.openmoko.org/developers/werner/scr2ppm.pl
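
For readers without the Perl script at hand, the conversion can be sketched in a few lines. This is not the linked scr2ppm.pl, whose contents are not shown in this ticket. It assumes the dump is a single 480x640 frame of little-endian RGB565 pixels, which matches the byte count of the dd command (4096 * 150 = 614400 = 480 * 640 * 2) but is otherwise an assumption; `rgb565_to_ppm` is an illustrative name.

```python
import struct
import sys

WIDTH, HEIGHT = 480, 640          # assumed panel geometry
FRAME_BYTES = WIDTH * HEIGHT * 2  # 614400 bytes == 4096 * 150


def rgb565_to_ppm(raw):
    """Convert one raw little-endian RGB565 frame into binary PPM (P6) bytes."""
    if len(raw) != FRAME_BYTES:
        raise ValueError("expected %d bytes, got %d" % (FRAME_BYTES, len(raw)))
    out = bytearray(b"P6\n%d %d\n255\n" % (WIDTH, HEIGHT))
    for (pixel,) in struct.iter_unpack("<H", raw):
        r = (pixel >> 11) & 0x1F
        g = (pixel >> 5) & 0x3F
        b = pixel & 0x1F
        # Scale the 5- and 6-bit channels up to 8 bits per channel.
        out += bytes(((r * 255) // 31, (g * 255) // 63, (b * 255) // 31))
    return bytes(out)


if __name__ == "__main__":
    # Usage: python fb2ppm.py <raw dump> <output.ppm>
    if len(sys.argv) == 3:
        with open(sys.argv[1], "rb") as src:
            frame = src.read()
        with open(sys.argv[2], "wb") as dst:
            dst.write(rgb565_to_ppm(frame))
```

If the colors come out swapped or the image looks sheared, the pixel format or geometry assumption is wrong for the kernel in use.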

  • Werner

comment:25 in reply to: ↑ 24 Changed 6 years ago by xbaldauf

Replying to werner:

> xbaldauf, perhaps a screenshot could help to clarify the pattern. If you can still access the Neo with SSH, you could take screenshots with

> dd if=/dev/fb0 bs=4096 count=150 of=filename

> They can then be converted with
> http://svn.openmoko.org/developers/werner/scr2ppm.pl

>   • Werner

Good idea. I already took digicam photographs some days ago, but I did not find any way to attach files to this bug report. However, it seems that I can attach files to new bug reports. Can you edit some setting of this bug report so that I can attach files, or should I create a new bug report for these attachments?

comment:26 follow-up: ↓ 27 Changed 6 years ago by werner

I don't think I have any special powers in trac. Don't you have an "Attach file" button above the change history?

  • Werner

comment:27 in reply to: ↑ 26 Changed 6 years ago by xbaldauf

Replying to werner:

> I don't think I have any special powers in trac. Don't you have an "Attach file" button above the change history?

>   • Werner

I'm blind... I did not see the "Attach file" button, because it was 4 pages from the beginning and many pages from the end...

There is a screenshot available here:

https://docs.openmoko.org/trac/attachment/ticket/2217/OpenMoko%20Freerunner%20crash%20with%20kernel%209029dff1f370018665a6e2999632a34fd0518f4d.2.jpeg

comment:28 follow-ups: ↓ 29 ↓ 30 Changed 6 years ago by andy

That randomish noise is not that easy to generate. Plus, it comes from the top and ends in a linear way.

I wonder if what has happened is that too much data is written from the SD Card unit in the Glamo (which is done by local Glamo DMA), so that it blows through the allocation of Glamo internal memory and wraps into an alias that is the framebuffer.
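
To make the wrap-around idea concrete, here is a toy model of it. Every number in it is hypothetical: the memory size, the framebuffer sitting at offset 0, and the DMA buffer placement are chosen only for illustration and are not taken from the Glamo driver or datasheet. The model just shows how a write that runs past the end of an address space with ignored upper address bits would land at the start of the framebuffer, i.e. at the top of the screen.

```python
# All values below are hypothetical, chosen only to illustrate aliasing.
MEM_SIZE = 8 * 1024 * 1024   # assumed size of the aliased address window
ROW_BYTES = 480 * 2          # one 480-pixel scanline at 16 bits per pixel
FB_START = 0                 # assumed framebuffer offset
FB_SIZE = ROW_BYTES * 640    # assumed framebuffer size


def wrapped_segments(start, length, mem_size=MEM_SIZE):
    """Split a write into the physical [lo, hi) ranges it touches when
    addresses beyond mem_size wrap around to 0."""
    segments = []
    pos = start % mem_size
    remaining = length
    while remaining > 0:
        chunk = min(remaining, mem_size - pos)
        segments.append((pos, pos + chunk))
        remaining -= chunk
        pos = 0
    return segments


def framebuffer_rows_hit(start, length):
    """Return (first_row, last_row) pairs of scanlines overwritten by a write."""
    rows = []
    for lo, hi in wrapped_segments(start, length):
        lo = max(lo, FB_START)
        hi = min(hi, FB_START + FB_SIZE)
        if lo < hi:
            rows.append(((lo - FB_START) // ROW_BYTES,
                         (hi - 1 - FB_START) // ROW_BYTES))
    return rows


if __name__ == "__main__":
    buf_start = MEM_SIZE - 64 * 1024   # hypothetical DMA buffer at the end
    # A write that overruns the 64 KiB buffer by ten scanlines' worth of data.
    print(framebuffer_rows_hit(buf_start, 64 * 1024 + 10 * ROW_BYTES))
    # A write that stays within the buffer touches no scanline.
    print(framebuffer_rows_hit(buf_start, 64 * 1024))
```

In this model an overrun always starts corrupting at row 0 and proceeds downward for as many rows as the overrun is long, which would be consistent with noise that starts at the top and stops partway down. Whether the real hardware aliases memory this way is exactly the open question.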

Is the noise always appearing from the "top" when it comes?

comment:29 in reply to: ↑ 28 Changed 6 years ago by xbaldauf

Replying to andy:

Is the noise always appearing from the "top" when it comes?

The overwrite direction is always from top to bottom.
I think it happens almost always that it starts at the top.
In rare cases (but I do not quite remember), it may be that it starts, stops, then starts again in a lower region, so there may be gaps.

Maybe I should add something to the system setup:

That randomish noise is not that easy to generate.

My complete filesystem is encrypted. That is, I run ext3 over dm-crypt over /dev/mmcblk0p2. So, in my case, the encrypted filesystem data is a good explanation of why the data visible is actually noisy.

Triggering the bug may also have something to do with the CPU load due to encryption or with the effects of encryption on timing (e.g. loading takes longer).

I wonder if what has happened is that too much data is written from the SD Card unit in the Glamo (which is done by local Glamo DMA), so that it blows through the allocation of Glamo internal memory and wraps into an alias that is the framebuffer.

This looks like an interesting theory, but as I'm not a hardware hacker, I cannot really comment. However, I'm eager to compile some special kernels with debugging messages and the like in order to help you diagnose and fix.

I may add that, when running from internal flash memory instead of SD-card, I almost never experienced such crashes. However, I cannot properly compare, because I cannot run dm-crypt over the internal flash memory, because the internal flash memory is not a normal Linux block device.

comment:30 in reply to: ↑ 28 Changed 6 years ago by xbaldauf

Replying to andy:

I wonder if what has happened is that too much data is written from the SD Card unit in the Glamo (which is done by local Glamo DMA), so that it blows through the allocation of Glamo internal memory and wraps into an alias that is the framebuffer.

Please also note that the noise does not always come. Often, the communication between CPU and SD card just fails, without exhibiting the screen corruption. For example, I've now tested kernel 2.6.24 (http://git.openmoko.org/?p=kernel.git;a=commit;h=a1e97c611253511ffc2d8c45e3e6d6894fa03fa3 , Sat, 23 Aug 2008 10:01:50 +0000 (11:01 +0100) ), and I now also experienced this bug:

glamo-mci glamo-mci.0: Error after cmd: 0x8121
glamo-mci glamo-mci.0: Error after cmd: 0x8123
mmcblk0: error -84 sending read/write command
end_request: I/O error, dev mmcblk0, sector 1992412
glamo-mci glamo-mci.0: Error after cmd: 0x121
glamo-mci glamo-mci.0: Error after cmd: 0x123
mmcblk0: error -84 sending read/write command
end_request: I/O error, dev mmcblk0, sector 1992412
glamo-mci glamo-mci.0: Error after cmd: 0x8121
glamo-mci glamo-mci.0: Error after cmd: 0x8123
mmcblk0: error -84 sending read/write command
end_request: I/O error, dev mmcblk0, sector 1992412
glamo-mci glamo-mci.0: Error after cmd: 0x121
glamo-mci glamo-mci.0: Error after cmd: 0x123
mmcblk0: error -84 sending read/write command

However, I did not see any screen corruption; I just could not tap the on-screen buttons anymore (the application "qpe" did not react). I could kill "qpe" from the command line successfully, but the screen did not change either (maybe the X server was affected, too). I could not access the filesystem anymore for new commands, but old commands like "dmesg" (which I presume were fully cached in RAM) still worked.

Having tested this under 2.6.24, the comment
Replying to xbaldauf:

I've never experienced this bug under kernel 2.6.24 AFAIK.

is now wrong. :-(

So maybe the screen corruption and the presumed DMA-wise overwrite is just a consequence of the communication problem between CPU and SD-card, and not the cause, but I do not know...
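The two error numbers seen in this ticket differ: -110 is ETIMEDOUT and -84 is EILSEQ on Linux, which the MMC layer uses for a card response timeout and for a data format problem such as a failed CRC check, respectively. A small script can tally them from a saved dmesg output. This is a sketch; the `summarize` function and its regular expressions are illustrative and only match the message formats quoted in this ticket.

```python
import re
from collections import Counter

# Linux errno values seen in this ticket, with their MMC-layer meaning.
ERRNO_MEANING = {
    110: "ETIMEDOUT (card took too long to respond)",
    84: "EILSEQ (format problem with the data, e.g. CRC check failed)",
}

MMC_ERR = re.compile(r"(mmcblk\d+): error (-?\d+) sending read/write command")
IO_ERR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def summarize(log_text):
    """Count MMC errno values and failing (device, sector) pairs in a log."""
    errnos = Counter()
    sectors = Counter()
    for line in log_text.splitlines():
        m = MMC_ERR.search(line)
        if m:
            errnos[abs(int(m.group(2)))] += 1
            continue
        m = IO_ERR.search(line)
        if m:
            sectors[(m.group(1), int(m.group(2)))] += 1
    return errnos, sectors


if __name__ == "__main__":
    import sys
    errnos, sectors = summarize(sys.stdin.read())
    for code, count in errnos.most_common():
        print(count, "x", ERRNO_MEANING.get(code, f"errno {code}"))
    for (dev, sector), count in sectors.most_common(5):
        print(count, "x", dev, "sector", sector)
```

Run as `dmesg | python summarize.py` (the file name is arbitrary). A log dominated by a single sector suggests the same request being retried rather than many independent failures.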

comment:31 Changed 6 years ago by xbaldauf

I've been thinking about this issue even further, and moved my root filesystem from my 8GB "Kingston SDC4/8GB" to the 512MB card which was included in the Freerunner shipping box. This was quite a pain because I had to omit or delete a lot.

However, after logging in and placing and receiving calls, I have not got the crash, so far...

Could it be that the communication problem somewhere between CPU and SDHC-card is, indeed, at the SDHC-card end?

I've been looking for reports of problems with this SDHC card, did not find any, but I found this report: http://wiki.openmoko.org/wiki/Supported_microSD_cards/SD-C02G

This looks pretty familiar... "unreliable behavior if the GSM modem is activated".

So it seems that there is at least one other user with the same problem. Could it be that there is electromagnetic noise coming from GSM impacting the card?
