Ticket #2057 (new defect)

Opened 5 years ago

Last modified 5 years ago

1bit errors in files

Reported by: Richard.Kralovic Owned by: openmoko-devel
Priority: normal Milestone:
Component: unknown Version:
Severity: normal Keywords: file corruption, 1bit errors
Cc: riso@… Blocked By:
Blocking: Estimated Completion (week):
HasPatchForReview: PatchReviewResult:
Reproducible: rarely

Description

After some time of usage, I notice 1-bit errors in some binaries/libraries (the affected application crashes; restarting the application does not help, of course). The files are correct again after rebooting the Neo, even without a reflash.

I experienced this bug with different distributions (FSO, ASU...) and different kernels (downloaded Om2008.9-gta02-20080916.uImage.bin, custom built uImage-2.6.24+git0+a1e97c611253511ffc2d8c45e3e6d6894fa03fa3-r1.01-om-gta02.bin, etc.).

There is no relevant information in the dmesg output. I also added warning messages to the software ECC correction code (drivers/mtd/nand/nand_ecc.c) in the custom-built kernel, but none of them appeared in dmesg.

The bug is hard to reproduce on demand, but usually occurs after several hours/days of usage.

Observed file corruption was always a single bit flip from 0 to 1, at offset 0x????0c2 or 0x????8c2 of the affected file, e.g. as follows:

< 00008c0: 0200 0200 0000 0000 0300 0300 0100 0000 ................
---
> 00008c0: 0200 2200 0000 0000 0300 0300 0100 0000 ..".............
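The flipped bit in the dump above can be located by XOR-ing the two lines byte by byte. The following is only a minimal sketch: the helper name `flipped_bits` is hypothetical, and mapping the byte position to a bit inside a 32-bit word assumes little-endian byte order, which the ticket does not state.

```python
def flipped_bits(good_hex: str, bad_hex: str, base_offset: int = 0):
    """Compare two hex dumps byte by byte.

    Returns a list of (file_offset, bit_in_byte, direction) tuples,
    one per differing bit.
    """
    good = bytes.fromhex(good_hex)
    bad = bytes.fromhex(bad_hex)
    result = []
    for i, (g, b) in enumerate(zip(good, bad)):
        diff = g ^ b  # set bits mark positions where the bytes differ
        for bit in range(8):
            if diff & (1 << bit):
                direction = "0->1" if b & (1 << bit) else "1->0"
                result.append((base_offset + i, bit, direction))
    return result


# The two dump lines from the description (offset 0x8c0).
good = "0200 0200 0000 0000 0300 0300 0100 0000"
bad = "0200 2200 0000 0000 0300 0300 0100 0000"

flips = flipped_bits(good, bad, base_offset=0x8C0)
for offset, bit, direction in flips:
    # Assumption: little-endian 32-bit words, so byte 2 holds bits 16..23.
    word_bit = (offset % 4) * 8 + bit
    print(f"offset {offset:#x}: bit {bit} flipped {direction}, word bit {word_bit}")
```

Under that byte-order assumption, the corrupted byte at offset 0x8c2 corresponds to bit 21 of the 32-bit word starting at 0x8c0.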

Attachments

memtest.c (1.1 KB) - added by Richard.Kralovic 5 years ago.
memtest.log (560 bytes) - added by Richard.Kralovic 5 years ago.

Change History

Changed 5 years ago by Richard.Kralovic

comment:1 Changed 5 years ago by Richard.Kralovic

Ok, it looks like a hardware bug in RAM. I ran the attached simple memory testing application and it indeed found 1-bit errors... I just hope I'll be able to get a replacement...

comment:2 Changed 5 years ago by cedric.berger

If you have to live with faulty RAM, I guess "badram" might help.
http://rick.vanrein.org/linux/badram/
I have not checked it in a while, but I do not think it is included in the standard kernel yet.

comment:3 Changed 5 years ago by andy

It's the same bit wrong in both the memtest and the error seen in the actual file too. I would contact your vendor about it for a replacement. 64MB of the RAM is in the CPU module and the other 64MB is external, so it's quite possible that half the physical memory is fine and half is broken. As you allocate more virtual memory, it might start to use the second, broken half after some time.

comment:4 Changed 5 years ago by Richard.Kralovic

Thank you very much for your replies. I have just realized that there is also a built-in memtest in U-Boot, but I am not able to figure out the correct address ranges for gta02 (the default ranges just crash the Neo). What is the correct way of running the U-Boot memtest?

comment:5 Changed 5 years ago by andy

Hmm, it seems U-Boot doesn't take care of its own memory footprint during the memory test. I found I could run it in two halves, leaving a critical bit out:

mtest 30000000 33c00000
mtest 34000000 38000000

comment:6 Changed 5 years ago by Richard.Kralovic

Thanks. I still had problems running mtest on the full 34000000 38000000 range, but 34100000 37e00000 works fine. After a few hours, mtest found the bad bits, too:

GTA02v6 # mtest 34100000 37e00000 55555555
Pattern 555555DD Writing... Reading...
Mem error @ 0x351000C0: found 55B5560D, expected 5595560D
Mem error @ 0x351198C0: found 55B5BC0D, expected 5595BC0D
Mem error @ 0x351278C0: found 55B5F40D, expected 5595F40D
Mem error @ 0x351360C0: found 55B62E0D, expected 55962E0D
Mem error @ 0x351398C0: found 55B63C0D, expected 55963C0D
Mem error @ 0x351440C0: found 55B6660D, expected 5596660D
Mem error @ 0x3515C0C0: found 55B6C60D, expected 5596C60D
Mem error @ 0x3515D8C0: found 55B6CC0D, expected 5596CC0D
Mem error @ 0x351620C0: found 55B6DE0D, expected 5596DE0D
Mem error @ 0x3516D8C0: found 55B70C0D, expected 55970C0D
Mem error @ 0x351758C0: found 55B72C0D, expected 55972C0D
Mem error @ 0x3517D8C0: found 55B74C0D, expected 55974C0D
Mem error @ 0x3517E0C0: found 55B74E0D, expected 55974E0D
Mem error @ 0x351CF8C0: found 55B8940D, expected 5598940D
Mem error @ 0x351E40C0: found 55B8E60D, expected 5598E60D
Mem error @ 0x351EC0C0: found 55B9060D, expected 5599060D
Mem error @ 0x351ED8C0: found 55B90C0D, expected 55990C0D
Mem error @ 0x351F00C0: found 55B9160D, expected 5599160D

So it definitely is bad sdram.
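To check that every reported error is the same bit, each `found`/`expected` pair in the mtest output can be XOR-ed and the results collected. This is a rough sketch only: the function name `summarize` is hypothetical, the sample is three lines copied from the log above, and taking the low 12 bits of the address as a page offset assumes 4 KiB pages.

```python
import re

# Matches lines of the form printed by mtest in the log above.
MTEST_ERROR = re.compile(
    r"Mem error @ 0x([0-9A-Fa-f]+): found ([0-9A-Fa-f]+), expected ([0-9A-Fa-f]+)"
)


def summarize(log: str):
    """Return (set of XOR masks, set of in-page offsets) for all errors."""
    masks = set()
    page_offsets = set()
    for match in MTEST_ERROR.finditer(log):
        addr, found, expected = (int(x, 16) for x in match.groups())
        masks.add(found ^ expected)    # bits that differ in this word
        page_offsets.add(addr & 0xFFF)  # assumption: 4 KiB pages
    return masks, page_offsets


sample = """\
Mem error @ 0x351000C0: found 55B5560D, expected 5595560D
Mem error @ 0x351198C0: found 55B5BC0D, expected 5595BC0D
Mem error @ 0x351CF8C0: found 55B8940D, expected 5598940D
"""

masks, page_offsets = summarize(sample)
print([hex(m) for m in sorted(masks)])
print([hex(o) for o in sorted(page_offsets)])
```

For these three lines the only mask is 0x200000 (bit 21) and the in-page offsets are 0x0c0 and 0x8c0, which lines up with the 0x????0c2 / 0x????8c2 file offsets from the description.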

comment:7 Changed 5 years ago by andy

And, as expected, it is in the second half, so in the external SDRAM chip. I think it is probably related that you had to trim the extent of the test as well; I did not have to do that. I would request a swapout from your vendor. I'm not sure how that's handled, but it does seem to be a faulty device from our side.

comment:8 Changed 5 years ago by Vladimir.Koutny

I have a unit from the same (pulster.de) delivery as Richard and I don't see any memory issues (yet). However, I also can't run mtest in the full range - the range that doesn't cause a crash/hang is:

30000040 - 33e80000
34008000 - 38000000

(btw. the first 0x40 bytes contain the vector table - I guess you don't want to mtest that area :) )

That might be influenced by uboot version - in my case it is 1.3.2-rc2-dirty-moko12 (NAND) and 1.3.2-moko12 (NOR) (both as shipped), not sure where uboot code/data are mapped.

Btw., when looking at Freerunner memory map at http://wiki.openmoko.org/wiki/Neo_FreeRunner_Memory_Mapping, I would guess that external ram chip would be mapped starting at 38000000...

comment:9 Changed 5 years ago by werner

The external RAM is at 0x34..., using the variable bank 6/7 size feature.
See table 5-1 of the 2442 manual. Ignore the left-hand side of figure
5-1 :-)
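Putting the layout from this thread together (64MB in the CPU module, 64MB external, external RAM at 0x34...), a physical address can be mapped to one of the two halves. This is a sketch under assumptions: the base address 0x30000000 is inferred from the mtest ranges used above, and the names `SDRAM_BASE`, `BANK_SIZE` and `sdram_half` are hypothetical.

```python
SDRAM_BASE = 0x30000000        # assumption: inferred from the mtest ranges
BANK_SIZE = 64 * 1024 * 1024   # 64MB per half, as stated in comment 3


def sdram_half(addr: int) -> str:
    """Classify a physical address as 'internal' or 'external' SDRAM."""
    offset = addr - SDRAM_BASE
    if not 0 <= offset < 2 * BANK_SIZE:
        raise ValueError(f"{addr:#x} is outside the 128MB SDRAM window")
    return "internal" if offset < BANK_SIZE else "external"


# First failing address from the mtest log in comment 6.
print(sdram_half(0x351000C0))
```

With these assumptions the failing addresses from the mtest log fall into the external half, which matches the conclusion in comment 7.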

comment:10 Changed 5 years ago by Vladimir.Koutny

Ok, thanks - clear now (I didn't check the specs before). I've updated the wiki accordingly.
