Ticket #567 (closed defect: fixed)

Opened 12 years ago

Last modified 11 years ago

I/O errors on flash after heavy flash access

Reported by: elrond+bugzilla.openmoko.org@…
Owned by: michael@…
Priority: high
Milestone:
Component: kernel
Version: unspecified
Severity: major
Keywords:
Cc: buglog@…, werner@…
Blocked By:
Blocking:
Estimated Completion (week):
HasPatchForReview:
PatchReviewResult:
Reproducible:

Description

Hardware: GTA01Bv03 (phase 0)
Kernel: Linux version 2.6.20.2-moko8 (stefan@fairlight) (gcc version 4.1.1) #56
PREEMPT Sun Apr 8 13:58:03 CEST 2007
u-boot: I think 1.2.0-moko7 (getting to the u-boot prompt isn't strictly easy)

The whole thing started when I downloaded openmoko-theme-standard.ipk to tmpfs
and tried to install it using ipkg install *.ipk. I got a few Input/Output
errors on files and the ipkg command hung.

Logging in via ssh still worked, and all commands that were still in RAM
worked fine. But any command that accessed jffs2 locked up.
I could grab dmesg output, which contains a kernel oops (attached to this bug
as write-oops.dmesg).

/sbin/halt.sysvinit (for "halt -p -d -i -p") was not in RAM, so I copied a
fresh one to tmpfs via scp, but some other things it needed were not available
either. I had to hard power off the machine by removing the battery; pressing
the power button for half a minute did not work.

The next day, the machine seemed to boot fine, until the fs gave me I/O errors
again (this time I did not explicitly try to write to it). This time I have a
bigger dmesg, which goes back to boot (attached to this bug as
boot_and_jffs2notices.dmesg).
Again I had to remove the battery, as I still haven't found the node in /sys
that shuts down the phone IMMEDIATELY when something is echoed into it.

Attachments

write-oops.dmesg (14.1 KB) - added by elrond+bugzilla.openmoko.org@… 12 years ago.
boot_and_jffs2notices.dmesg (20.3 KB) - added by elrond+bugzilla.openmoko.org@… 12 years ago.

Change History

Changed 12 years ago by elrond+bugzilla.openmoko.org@…

write-oops.dmesg

Changed 12 years ago by elrond+bugzilla.openmoko.org@…

boot_and_jffs2notices.dmesg

comment:1 Changed 12 years ago by laforge@…

I think this might be similar to #245. We've had a memory initialization
bug that resulted in memory corruption as soon as the kernel used the upper 64MB
of RAM. That memory corruption can obviously result in filesystem corruption as
soon as pages get written out to disk.

I'd recommend installing a much more up-to-date u-boot, kernel and rootfs image
and testing again.

comment:2 Changed 12 years ago by elrond+bugzilla.openmoko.org@…

According to http://wiki.openmoko.org/wiki/ChangeLog#2007-03-11 1.2.0-moko6
fixes this problem and this phone has moko7.

I don't have a debug board so if you still recommend updating u-boot I'd kindly
ask you to name a known-to-be-working-on-P0 version of u-boot and its SHA1 (so I
can verify it after downloading).

comment:3 Changed 12 years ago by elrond+bugzilla.openmoko.org@…

  • Priority changed from high to low

Okay, I have a new u-boot (rev2040) on it.
Still the old kernel, because the new prebuilt kernels have a backlight bug
(new bug on its way).

I have reduced the Priority to P4, because this is now a "wait for it to happen
again".

I'll leave the bug open for a while (two weeks to a month), so people can look
at the Oops and maybe make the kernel more stable in this area.

comment:4 Changed 12 years ago by elrond+bugzilla.openmoko.org@…

I hit this issue again a few days ago.

I have now updated the kernel (2.6.21.3, rev 2xxx; I'll post the precise
revision if needed). I'll see if this is now better.

I have some ideas on how to reproduce it, so I hope to either give this bug more
info or close it on my own within a few weeks.

comment:5 Changed 12 years ago by elrond+bugzilla.openmoko.org@…

  • Priority changed from low to normal

Okay, I was just able to reproduce it:

u-boot rev2040
kernel rev2118

phone$ dd if=/dev/mtdblock2 of=/tmp/kernel.bin

This command alone seems to be enough to trigger the problem for me.
(Just tried it four times in a row.)
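
For anyone who wants to repeat the read several times in a row, the command
above could be wrapped roughly like this. This is only a sketch: the function
name read_repeat and its arguments are made up for illustration and are not
part of any OpenMoko tooling. On the phone, SRC would be /dev/mtdblock2 and
DEST /tmp/kernel.bin.

```shell
#!/bin/sh
# Hypothetical helper (not from the ticket): copy SRC to DEST with dd
# COUNT times and stop at the first failing read.
#   SRC   - device or file to read (e.g. /dev/mtdblock2 on the phone)
#   DEST  - where dd writes the data (e.g. /tmp/kernel.bin on tmpfs)
#   COUNT - number of passes, defaults to 4
read_repeat() {
    src=$1
    dest=$2
    count=${3:-4}
    i=1
    while [ "$i" -le "$count" ]; do
        # dd prints its transfer statistics on stderr; they are discarded here
        if ! dd if="$src" of="$dest" 2>/dev/null; then
            echo "read $i of $count failed for $src"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $count reads of $src succeeded"
}
```

A call such as "read_repeat /dev/mtdblock2 /tmp/kernel.bin 4" would then match
the four runs mentioned above; the helper has not been run on real hardware.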

comment:6 Changed 12 years ago by elrond+bugzilla.openmoko.org@…

  • Priority changed from normal to high

"XorA on #openmoko" reproduced this just a few minutes ago on

Model: GTA01Bv04
Kernel: uImage-2.6.21.6-moko10-r1_0_0_2360

I'm raising Priority back to the default P2, because this problem is relevant.

comment:7 Changed 12 years ago by laforge@…

  • Status changed from new to closed
  • Resolution set to fixed

I think this might be related to the bug we had (#419) that didn't fully erase
the rootfs when flashing a new rootfs. This could introduce all kinds of
inconsistencies into the JFFS2 file system.

I therefore recommend trying this with a new u-boot version, and using that new
version to install a new rootfs image. I'm confident the problem will disappear
at that point.

Please re-open if it still occurs.

comment:8 Changed 12 years ago by elrond+bugzilla.openmoko.org@…

  • Status changed from closed to reopened
  • Resolution fixed deleted
  • Summary changed from I/O errors on flash and frozen fs to I/O errors on flash after heavy flash access

Okay, next reproduction:

1) u-boot 2040
2) upload kernel via DFU
3) nand erase rootfs
4) upload rootfs via DFU
5) boot
6) dd if=/dev/mtd2 of=/tmp/kernel.bin
7) ls -la /usr/bin

(gives I/O errors)
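
To tell an I/O error apart from some other ls failure after the dd step, the
final listing could be checked with a small helper like the one below. It is a
sketch with assumed names (check_listing is hypothetical), and it relies on ls
printing the usual "Input/output error" text for EIO.

```shell
#!/bin/sh
# Hypothetical helper (not from the ticket): list DIR the same way as
# "ls -la" in the steps above and report whether ls hit an I/O error.
check_listing() {
    dir=$1
    # Capture only stderr of ls; the listing itself goes to /dev/null.
    # LC_ALL=C keeps the error text untranslated.
    if err=$(LC_ALL=C ls -la "$dir" 2>&1 >/dev/null); then
        echo "ok: $dir"
        return 0
    fi
    case $err in
        *"Input/output error"*) echo "I/O error: $dir" ;;
        *)                      echo "ls failed: $dir" ;;
    esac
    return 1
}
```

On the phone this would be called as "check_listing /usr/bin" right after the
dd command.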

kernel and rootfs are from http://people.openmoko.org/roh/

  • uImage-2.6.22.5-moko11+svnr2937-r2-fic-gta01.bin
  • OpenMoko-openmoko-devel-image-glibc-ipk-P1-September-Snapshot-20070919-fic-gta01.rootfs.jffs2

I have retitled the Bug to better describe the problem.

comment:9 Changed 12 years ago by willie_chen@…

  • Status changed from reopened to new
  • Owner changed from laforge@… to michael@…

comment:10 Changed 11 years ago by werner@…

  • Status changed from new to closed
  • Cc werner@… added
  • Resolution set to fixed

If CONFIG_MTD_NAND_S3C2410_CLKSTOP is set, opening an MTD device with
open(2) will cause all hell to break loose. See also:

http://lists.infradead.org/pipermail/linux-mtd/2007-July/019010.html

This should be fixed by now in the OE kernel configuration (revision
0238eff8862126ac83c3f05d7a6fb094feff89e9, according to my files).
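
A quick way to see whether a given kernel config still has the option set
might look like this. It is only a sketch: clkstop_enabled is a made-up helper
name, and /proc/config.gz is only present when the kernel was built with
CONFIG_IKCONFIG_PROC, which may not be the case for the OpenMoko kernels.

```shell
#!/bin/sh
# Hypothetical helper (not from the ticket): succeed (exit status 0) if
# CONFIG_MTD_NAND_S3C2410_CLKSTOP is enabled in the given kernel config.
# CFG may be a plain .config file or a gzip-compressed one such as
# /proc/config.gz.
clkstop_enabled() {
    cfg=$1
    case $cfg in
        *.gz) gzip -dc "$cfg" ;;
        *)    cat "$cfg" ;;
    esac | grep -q '^CONFIG_MTD_NAND_S3C2410_CLKSTOP=y$'
}
```

For example, "clkstop_enabled /proc/config.gz && echo 'CLKSTOP is set'" on the
phone, or the same call against the .config used to build the kernel.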

This explains #7 and #10. Not sure if #1 was also caused by this or not,
but I give it the benefit of the doubt, and close the bug :-)

Please reopen if there are more gremlins lurking.
