Ticket #788 (closed defect: fixed)

Opened 10 years ago

Last modified 9 years ago

Starting or stopping gsmd completely locks up the Neo

Reported by: mwester@… Owned by: openmoko-kernel
Priority: normal Milestone:
Component: kernel Version: 2007.2
Severity: critical Keywords:
Cc: sean_chiang@…, buglog@…, werner@…, rod@…, elrond+bugzilla.openmoko.org@…, bluelightning@…, cw@…, jserv@…, balrogg@…, tick@…, cesarb@…, mcl@…, hozer@…, fabchevalier@…, willie_chen@… Blocked By:
Blocking: Estimated Completion (week):
HasPatchForReview: no PatchReviewResult:
Reproducible:

Description

Tested on several 2007.2 images, this problem was only observed on those with
the 2.6.22 kernel. Most recently tested:

  • uImage-2.6.22.5-moko11-r2-fic-gta01.bin

*
OpenMoko?-openmoko-devel-image-glibc-ipk-P1-August-Snapshot-20070901-fic-gta01.rootfs.jffs2

Observations:

  • The indication on the panel is the antenna with a question-mark -- unable to

communicate with the modem. This is similar to other bugs reported, but the
difference with the problem is that one cannot shutdown the running gsmd, nor
can one start a new gsmd if the gsmd process exits.

Specifically, shutting down the phone will result in the device hanging in the

shutdown. At that point, it is completely unresponsive to any input, including
the power button. Recovery requires that the battery be removed.

If, instead of shutting down, one examines the last lines in the /tmp/gsm.log,

the text indicates that the modem was declared dead by gsmd, and gsmd exited.
Restarting the gsmd process by executing /etc/init.d/gsmd start will result in a
total lockup of the neo, as described above -- even the power button is
non-responsive.

uname: Linux fic-gta01 2.6.22.5-moko11 #1 PREEMPT Sat Sep 1 14:02:08 CDT 2007
armv4tl unknown

Contents of /tmp/gsm.log:
Sat Sep 1 18:49:13 2007 <1> machine.c:131:gsmd_machine_plugin_init() detected
'GTA01' hardware
Sat Sep 1 18:49:13 2007 <1> machine.c:72:gsmd_machine_plugin_load() loading
machine plugin "generic"
Sat Sep 1 18:49:13 2007 <1> vendor.c:75:gsmd_vendor_plugin_load() loading
vendor plugin "ti"
Sat Sep 1 18:49:13 2007 <1> machine.c:56:gsmd_machine_plugin_find() selecting
machine plugin "generic"
Sat Sep 1 18:49:14 2007 <1> atcmd.c:561:atcmd_drain() c_iflag = 0x00000500,
c_oflag = 0x00000005, c_cflag = 0x800018b2, c_lflag = 0x00008a3b
Sat Sep 1 18:49:14 2007 <1> vendor.c:59:gsmd_vendor_plugin_find() selecting
vendor plugin "TI Calypso"
Sat Sep 1 18:49:14 2007 <1> atcmd.c:545:atcmd_submit() submitting command `ATZ'
Sat Sep 1 18:49:26 2007 <1> atcmd.c:545:atcmd_submit() submitting command
`AT+CFUN=1'
Sat Sep 1 18:49:46 2007 <1> atcmd.c:545:atcmd_submit() submitting command
`AT+COPS=0'
Sat Sep 1 18:54:14 2007 <1> gsmd.c:124:alive_interval_tmr_cb() interval
expired, starting next alive inquiry
Sat Sep 1 18:54:14 2007 <1> atcmd.c:545:atcmd_submit() submitting command `AT'
Sat Sep 1 18:54:44 2007 <1> gsmd.c:79:alive_tmr_cb() gsmd_alive timer expired
Sat Sep 1 18:54:44 2007 <8> gsmd.c:82:alive_tmr_cb() modem dead!
gsmd - (C) 2006-2007 by OpenMoko?, Inc. and contributors
This program is FREE SOFTWARE under the terms of GNU GPL

Attachments

gsmd (1.3 KB) - added by mwester@… 10 years ago.
/etc/init.d/gsmd script
gta01_serial_flowcontrol.patch (4.9 KB) - added by mwester@… 10 years ago.
Proposed patch to disable/enable flowcontrol as GSM is disabled/enabled
gsm_serial_console_lockup.patch (3.7 KB) - added by fabchevalier@… 10 years ago.
gsm_serial_console_lockup patch using unregister/register serial
fix-gta01-flowcontrol2-2.6.22.5.patch (5.3 KB) - added by mwester@… 10 years ago.
Updated the original flowcontrol kernel patch

Change History

comment:1 Changed 10 years ago by mwester@…

UPDATE:

Kernel version is a red herring; the problem is with the power-off of the GSM

modem. Commenting out the power-off and reset operations recently added to
/etc/init.d/gsmd, with the exception of the first power-on event in the startup,
is sufficient to prevent the lockups from occurring.

(Note: $GSM_RES is not defined in /etc/defaults/*, also the syntax for the "[

-n $GSM_RES ]" lines needs to enclose the variable in quotes, otherwise the
expression is true, resulting in a syntax error. Since testing with the reset
enabled/disabled doesn't change anything, I'm not bothering to attach any
patches at this time.)

comment:2 Changed 10 years ago by fabchevalier@…

Hi Mike,

Looks like this bug is a regression introduced by the rewriting of gsmd init
script (see #704)

Cheers,

Fabien

comment:3 Changed 10 years ago by cesarb@…

  • Cc cesarb@… added

Stopping a hand-started foreground gsmd (after the previous one quit due to the
alive timer timeout) with Ctrl-C also completely locked up my Neo. So, I'd guess
it's not the powering off of the modem, but the killing of the gsmd that's
locking up the Neo.

comment:4 Changed 10 years ago by fabchevalier@…

More details on the issue:
In fact neo crashes during gsmd init script stop.

The line that triggers neo crashes is :
echo "0" > /sys/bus/platform/devices/gta01-pm-gsm.0/power_on

I managed to reproduce it quite reliably doing the folloging:
echo "1" > /sys/bus/platform/devices/gta01-pm-gsm.0/power_on
start gsmd
stop gsmd
echo "0" > /sys/bus/platform/devices/gta01-pm-gsm.0/power_on

The funny thing is that if gsmd is not started and stopped (step 2 and 3), then
the 4th step does not trigger the crash.

I attached the debug board and did a bt to see where we are:

(gdb) target remote localhost:3333
Remote debugging using localhost:3333
warning: shared library handler failed to enable breakpoint
s3c24xx_serial_console_putchar (port=0xc033b080, ch=48) at
drivers/serial/s3c2410.c:1706
1706 ufstat = rd_regl(port, S3C2410_UFSTAT);
(gdb) l
1701 unsigned long ufstat, utrstat;
1702
1703 if (ufcon & S3C2410_UFCON_FIFOMODE) {
1704 /* fifo mode - check ammount of data in fifo registers... */
1705
1706 ufstat = rd_regl(port, S3C2410_UFSTAT);
1707 return (ufstat & info->tx_fifofull) ? 0 : 1;
1708 }
1709
1710 /* in non-fifo mode, we go and use the tx buffer empty */
(gdb) bt
#0 s3c24xx_serial_console_putchar (port=0xc033b080, ch=48) at
drivers/serial/s3c2410.c:1706
#1 0xc01851bc in uart_console_write (port=0xc033b080,

s=0xc034cfa3 "gta01-pm-gsm gta01-pm-gsm.0: powered down GSM, thus enabhling

seial console\n", count=76,

putchar=0xc018829c <s3c24xx_serial_console_putchar>) at

drivers/serial/serial_core.c:1785
#2 0xc01888d0 in s3c24xx_serial_console_write (co=<value optimized out>, s=0x30
<Address 0x30 out of bounds>,

count=520) at drivers/serial/s3c2410.c:1729

Cannot access memory at address 0xe89da7fc
(gdb)

Does any experienced (==> not me ;-) ) kernel hacker see where that could come
from ?

Cheers,

Fabien

comment:5 Changed 10 years ago by fabchevalier@…

Yet again some more precisions.

This time i had to add a fifth step : echo "1" > .....
to be able to crash the phone.

openocd gives me:
(gdb) target remote localhost:3333
Remote debugging using localhost:3333
warning: shared library handler failed to enable breakpoint
0xc01882f0 in s3c24xx_serial_console_putchar (port=0xc033b080, ch=97) at
drivers/serial/s3c2410.c:1720
1720 while (!s3c24xx_serial_console_txrdy(port, ufcon))
(gdb) bt
#

Looks like we're busy looping on calling the s3c24xx_serial_console_txrdy that
never returns true !!

comment:6 Changed 10 years ago by mcl@…

  • Cc mcl@… added

comment:7 Changed 10 years ago by fabchevalier@…

  • Cc fabchevalier@… added

comment:8 Changed 10 years ago by jserv@…

  • Cc jserv@… added

comment:9 Changed 10 years ago by sean_chiang@…

  • Cc sean_chiang@… added

comment:10 Changed 10 years ago by cw@…

  • Cc cw@… added
  • Component changed from gsmd to kernel

the hardware does NOT lock up, it's not a gsmd issue but a kernel/driver issue

the issue is that when the gsm modem is powered down, the usb endpoint changes
from usbnet to a uart or something and doesn't like to flip back

for now tweak the gsmd script to avoid writing 0 to
/sys/bus/platform/devices/gta01-pm-gsm.0/power_on

comment:11 Changed 10 years ago by thomas@…

I altered my /etc/init.d/gsmd script as suggested to prevent the modem being
powered down, but I still experience the original issue when running gsmd restart.

comment:12 Changed 10 years ago by fabchevalier@…

Chris,
Could you explain what you think usb has to do with it ?
I would have though it was only an UART related issue.

I altered my /etc/init.d/gsmd script as suggested to prevent the modem being
powered down, but I still experience the original issue when running gsmd restart.

Yes, that's definately not a fix for this issue, it's more a "half working
workaround" ;-)

Cheers,

Fabien

comment:13 Changed 10 years ago by roh@…

* Bug 789 has been marked as a duplicate of this bug. *

comment:14 Changed 10 years ago by roh@…

  • Priority changed from high to normal

comment:15 Changed 10 years ago by laforge@…

this looks to me as if for whatever reason the kernel still wants to use ttySAC0
as serial console, rather than as non-console port for the GSM modem.

Due to UART shortage, FIC made the early design choice to multiplex the UART0
between GSM modem and console.

The gsm power control kernel driver unregisters the port from the kernel console
switch and then uses it as GSM uart. Now if that fails, and the kernel still
wants to use it as console, we could expect some problems.

Now maybe this unregistration fails for whatever reason.

I suggest booting without the "console=ttySAC0,115200" part in the bootcmd.
Then the serial port will never be used as console port in the first place.

comment:16 Changed 10 years ago by Tuukka.Hastrup@…

Indeed, booting without "console=ttySAC0,115200", I can't reproduce the crashes
anymore. I'm testing on uImage-2.6.22.5-moko11-r2-fic-gta01.bin and
OpenMoko?-openmoko-devel-image-glibc-ipk-P1-September-Snapshot-20070906-fic-gta01.rootfs.jffs2
.

With "console=ttySAC0,115200", it was enough to wait for the booting to finish
and run echo "0" > /sys/bus/platform/devices/gta01-pm-gsm.0/power_on . It would
take the kernel a second or two to freeze after that (assumingly when it first
tried to write to the console). Now, I can run that and restart gsmd and
xserver-noxdm and successfully shut down the phone unlike before.

comment:17 Changed 10 years ago by mickey@…

Harald, is removing the environment the only way to workaround or can we do
something in kernel, too? If not, then we need to release a new u-boot
environment for people to download via dfu-util. If we are going to do that, we
can use the chance to add some more boot flags (nosplash for psplash) and also
booting from NFS.

comment:18 Changed 10 years ago by mwester@…

Experimentation shows that removing or commenting out the following lines of
code in gta01_pm_gsm.c appears to workaround the problem by simply not
re-enabling the console during the gsm power-off:

092: if (gta01_gsm.con) {
093: console_start(gta01_gsm.con);
094:
095: dev_info(dev, "powered down GSM, thus enabling "
096: "serial console\n");
097: }

This is not a fix, just a workaround, and is noted here as this additional
information might be helpful for someone to find the root cause.

Additional info: simple testing of performing the three steps involved in the
power-off operation (power-off of the GSM modem, switching of the MUX to the
console, and the re-enabling of the serial console itself) in sequence from
user-space (by exposing the three individual operations via sysfs) permits one
to duplicate the sequence without triggering the hang/lock-up. This may
indicate a timing sensitivity in the operations, perhaps?

Investigation continues.

comment:19 Changed 10 years ago by elrond+bugzilla.openmoko.org@…

  • Cc elrond+bugzilla.openmoko.org@… added

comment:20 Changed 10 years ago by rod@…

  • Cc rod@… added

comment:21 Changed 10 years ago by bluelightning@…

  • Cc bluelightning@… added

comment:22 Changed 10 years ago by mwester@…

The root problem has been determined to be that enabling hardware flow control
on the console (not the GSM modem) will result in the system becoming
unresponsive to user input.

OVERVIEW: Upon starting, gsmd will set the the ttySAC0 port to enable hardware
flow control. When gsmd exits, or is killed, it leaves the port with hardware
flow control enabled. In and of itself, this situation is not a problem, as
long as the port remains switched to the GSM modem. However, as soon as the
/etc/init.d/gsmd script (or the user) turns off the GSM modem, the port is
switched to the console, hardware flow control is left enabled, and the first
kernel message printed to the console port will result in the device locking up
in a loop in the serial driver, presumably waiting for the correct modem control
lines to signal that it is ok to send data -- an event that won't happen on a
standard P1 device.


WORKAROUND: the simple workaround is to explicitly set the port to disable
hardware flow control before the GSM modem is powered off and switched back to
the console. This can be accomplished by editing the /etc/init.d/gsmd script,
and changing the startup as follows:

case "$1" in

start)

stty -F /dev/ttySAC0 -crtscts
[ -n "$GSM_POW" ] && ( echo "0" >$GSM_POW; sleep 1 )
[ -n "$GSM_POW" ] && ( echo "1" >$GSM_POW; sleep 1 )

It is also necessary to change the shutdown as follows:

stop)

echo -n "Stopping GSM daemon: "
start-stop-daemon -K -x /usr/sbin/gsmd
stty -F /dev/ttySAC0 -crtscts
[ -n "$GSM_POW" ] && echo "0" >$GSM_POW
echo "gsmd."

For convenience, a suitably modified (and tested) gsmd script is attached to
this bug.


LONG-TERM SOLUTION: It seems probable that there are multiple issues here. In
the first place, it seems unreasonable that the device becomes completely
unresponsive in the event that the console is flow-controlled. This may be a
bug in the console driver, or the serial driver.

However, the contention between the console and gsmd over the port settings is
an issue that needs resolution. Disabling the console completely is one
solution, but that makes kernel debugging work difficult at best. Adding logic
to ensure that gsmd resets the flow control on the port before exiting is a poor
solution, as there are numerous failure modes for gsmd that would not permit it
to perform such a reset, resulting in the same problem. In addition, since gsmd
does not actually do the power-on of the GSM modem, simply starting gsmd without
powering on the GSM modem will result in the same lock-up of the device. Thus,
it seems logical that the code that switches the gsm modem should also ensure
that the port flow-control settings (at least) are switched as appropriate,
rather than rely on separate user actions, or on gsmd. This would mean that
powering up the GSM modem would also require that flow control be enabled, and
conversely powering down would disable flow control.


OTHER CONSIDERATIONS: The current console/GSM modem switch code is extremely
limited. It might be worthwhile considering a more ambitious architecture, one
that would be more flexible. For example, in the event of a kernel oops, it
would not be unreasonable to forcibly switch the console on and power off the
GSM modem, thus enabling the kernel to display the oops output.

comment:23 Changed 10 years ago by mickey@…

ok, i have applied this stty stuff as a workaround in OE. I really don't like it
though and I sincerely hope this will be a temporary thing.

comment:24 Changed 10 years ago by balrogg@…

  • Cc balrogg@… added

comment:25 Changed 10 years ago by balrogg@…

The kernel should not allow user space to lock-up the system by enabling or
disabling hardware flow control (by mistake or not). This wouldn't happen if the
UART txrdy polling was in a separate kernel thread, but it would still be wrong.
The situation here is that hardware flow control should never be enabled - in
fact the entire tty should not be present (as if UART0 wasn't listed in
gta01_uartcfgs[] in arch/arm/mach-s3c2410/mach-gta01.c) whenevet there is
nothing connected to UART0, i.e. the modem is off and no debug board is
connected. I'm not sure if this could be achieved by calling
platform_device_unregister(&s3c24xx_uart_src[0]) when the modem is being
disabled, and at the same time enabling the clear-to-send pin as an IRQ source.
When the IRQ is asserted it would re-register UART0 by calling
platform_device_register(&s3c24xx_uart_src[0]).

comment:26 Changed 10 years ago by laforge@…

I disagree with balrogg's comment that the ttySAC device / UART should only be
present once there is a device connected.

Just compare this to your PC. The serial port device exists all the time, if
you decide to connect something to it or not. It's the same with the GTA01
UART0. It is there, if you connect a debug board (or something completely else)
to it, or not.

As for the actual bugfix: I believe it should be easy to solve this problem by
using register_console() and unregister_console() instead of the
console_{start,stop}() calls from gta01_pm_gsm.c

The difference is that register_console will in re-configure the serial port
with the parameters given in the kernel command line. This sounds like the
right thing to do, considering that it is the same set of parameters that were
used during boot.

I have cooked up a patch and will give it some testing later.

Changed 10 years ago by mwester@…

Proposed patch to disable/enable flowcontrol as GSM is disabled/enabled

comment:27 Changed 10 years ago by mwester@…

  • attachments.isobsolete changed from 0 to 1

comment:28 Changed 10 years ago by mwester@…

Proposed new patch is attached.

This proposed patch changes the behavior of serial port 0 on the GTA01 device
such that hardware flow control is not permitted when in console mode, and
permitted when switched to the GSM modem. If an application (such as gsmd)
turns on hardware flow control, the operation will succeed as expected.
However, the serial driver will only actually enable hardware flow control when
the serial port is switched to the GSM modem.

The behavior is exactly as if the hardware RTS line on the console port was in
the on state (as opposed to the current GTA01 hardware, where the RTS line is off.

This patch resolves the issue originally reported by this bug. It also resolves
potential issues that might arise from mis-use or operator error, such as
starting gsmd while the GSM modem is not turned on and enabled (currently, this
action will lock up the device if the user does not use the /etc/init.d/gsmd
script to start/stop the gsmd service.)

[Re: Harald's comment #18: Changing the console to register/unregister may
resolve the issue by properly resetting the flow control state of the port when
the GSM modem changes state, and if it works, it is a fairly simple solution
that will address most cases. However, this proposed patch also protects in the
case where flow control is enabled but the GSM modem has not been switched, and
also permits the serial port to be switched between the modem and the console
and back again without loss of any port settings (consider the ability to switch
on the fly to print a kernel "oops" output and switch back again, without gsmd
noticing anything more than a few dropped characters).]

comment:29 Changed 10 years ago by hozer@…

  • Cc hozer@… added

comment:30 Changed 10 years ago by hozer@…

* Bug 880 has been marked as a duplicate of this bug. *

Changed 10 years ago by fabchevalier@…

gsm_serial_console_lockup patch using unregister/register serial

comment:31 Changed 10 years ago by mwester@…

  • attachments.isobsolete changed from 0 to 1

Changed 10 years ago by mwester@…

Updated the original flowcontrol kernel patch

comment:32 Changed 10 years ago by jserv@…

  • Cc tick@… added

comment:33 Changed 10 years ago by tick@…

I think change /etc/init.d/gsmd as

  • [ -n "$GSM_POW" ] && ( echo "0" >$GSM_POW; sleep 1 )


+ [ -n "$GSM_POW" ] && [ cat $GSM_POW -eq 0 ] &&( echo "1" >$GSM_POW;
sleep 1 )

may avoid this lock but not solving this problem.

/etc/init.d/gsmd will be start and stop for many times when boots up. (This
will cause system hang)
I think there should be only one place to start it, wherever it is.
:-P

comment:34 Changed 10 years ago by tick@…

  • blocked set to 707

comment:35 Changed 10 years ago by mickey@…

Agreed. In my opinion, the proper place for this is the phone service (dbus)
daemon. Until we actually have one, I'd like to make this phone service daemon
acting as a watchdog for gsmd.

comment:36 Changed 10 years ago by tick@…

  • Owner changed from laforge@… to tick@…

comment:37 Changed 10 years ago by tick@…

  • Status changed from new to assigned

comment:38 Changed 10 years ago by mwester@…

Respectfully, I disagree. The correct place to fix this is in the kernel, with
the posted kernel patches.

User-space commands, especially ones that are required to be issued in order to
make certain vital components functional, should NEVER cause the kernel to crash.

comment:39 Changed 10 years ago by tick@…

In my opinion. Both Mickey and Mike are right.
In the point of view of robust, phone service device should has a watchdog
scheme to monitor if gsmd is still working.
In the point of view of finding root cause, it's obviously something wrong in
kernel. And we should find out the root cause and release a kernel patch.

However, in the point of view of making whole project runs well, I think we
should put the Neo into a safe area, which let Neo can work well and that
further develop can go on, if we can, and mark all the bugs we found. Enlarge
the safe area step by step by solving each bug.

I think we can solving the issue with the following steps.

  1. Since we know pull down the power pin will crash the system, we shall not

pull down this pin in the script. (For Neo's stability) (For not blocking
non-gsmd developers)

  1. Phone service daemon shall monitor if gsmd is still alive.
  2. Let gsmd work for the first initialization
  3. Gsmd introduces timeout and error detection (handling) scheme. (For stability

and robust)

  1. Fix the system hang bug from kernel. (For Kernel and GSMD developers)
  2. Adding the pull pin scheme back to script if needed. (Enlarge the safe area)

All the error detection and handling schemes shall be left there, but warnings
and errors should be reported so that we have clues to solve them.

I may be wrong, it's just my opinion and proposal. How you guys think?
We can improve this together.

comment:40 Changed 10 years ago by mwester@…

I think there's been some confusion, probably because there are just too many
comments on this bug - 28 of them.

This *IS* a bug in the kernel, it is well understood, and has been fixed. A
patch is attached to this bug, and is currently applied in OE. The patch has
not been accepted into the Openmoko svn repository, presumably because Harald
indicated that he would review this or do something with this.

I think the only thing preventing this bug from being closed is the review of
the patch, and acceptance of same into the openmoko svn tree.

comment:41 Changed 10 years ago by balrogg@…

I agree that this is solely a kernel bug and trying to fix it elsewhere is a
wrong approach. However, Mike's patch not only makes changes outside GTA01 code
but also adds platform code in generic drivers, something to avoid if possible.

The remembering of flowcontrol bit also looks like a workaround (and shouldn't

be platform specific).

If Fabien Chevalier's patch works then I'm in favor of this patch, although it
looks to me like there should be an easier possible fix.

There seem to be two issues:
one is that the console and flowcontrol can be enabled even when we know that
there's nothing on the receiving end of the port, nothing that will toggle the
CTS bit. (This is partally addressed by both patches)

second issue is that the s3c serial driver does busy-waiting. It should never
hang the whole system *even* when the CTS bit stays low and flowcontrol is
enabled. It can block the tty input or do whatever but it shouldn't lock-up the
system. This issue is not GTA01 specific.

That said something should really be done to "unfreeze" the kernel tree and get
either of the patches in to prevent user-space from being able to lock-up the
system.

comment:42 Changed 10 years ago by mickey@…

Kernel tree will be "unfrozen" again soon, sorry for the delay.

comment:43 Changed 10 years ago by mwester@…

  • blocked set to 1003

comment:44 Changed 10 years ago by willie_chen@…

  • Cc willie_chen@… added

comment:45 Changed 10 years ago by willie_chen@…

  • Owner changed from tick@… to willie_chen@…
  • Status changed from assigned to new

comment:46 Changed 10 years ago by andy@…

What is the plan to resolve this, then: there seem to be two patches talked
about, Harald/Fabien?'s and Mike's. But I didn't see a response from testing
them or a resolution of which is best accepted into the actual working
patchset.

comment:47 Changed 10 years ago by laforge@…

I completely agree with Andy and Andrew's comments. This is why I never merged
any of those patches so far. They all seem to not solve the real problem and
just implement workarounds with questionable changes breaking layers of abstraction.

comment:48 Changed 10 years ago by mwester@…

Hehe! The classic developer vs support clash develops!

So the developer mindset says it's better to leave it broken until someone
rewrites the serial driver to correct its behavior. The support mindset says
that the users need something to keep this horrible crash from occurring, even
if that fix is less than technically pure and elegant.

May I suggest that *SOMEBODY* pick up the task of (at least) providing guidance
on the "Non-Questionable" solution so that said solution can be at least
investigated, and in the meantime, perhaps it might be possible to incorporate
one of these patches so that these crashes cease?

Just FYI, as "Questionable" as these patches might be, one of them is applied to
all 2.6.22.5 kernels built with OE, and as "Questionable" as it might be it has
nicely resolved the problem with nary an issue or side-effect reported. I fully
realize that does not qualify the fix as "elegant" or "correct", it merely
points out that for those who are "support-minded", the fix qualifies as a
solution that keeps the device running. :-)

C'mon folks -- we have a phone to build. If we let ourselves be distracted into
re-writing stock parts of the Linux kernel to *elegantly* support the inelegant
GTA01 hardware hackery, we'll never get anything done. One vote here for
pragmatism.

comment:49 Changed 10 years ago by werner@…

  • Cc werner@… added

Shouldn't just resetting flow control when switching away from / powering down
the modem solve the problem at hand ?

I don't think we absolutely have to preserve flow control state, since gsmd
can very well take care of setting it up upon restart. If someone sets flow
control by accident, well, that's called tragic pilot error :-)

If I had to choose between the two patches, I'd prefer Fabien's, because it's
less intrusive. However, I'd be even happier with a patch that just clears
the darn flow control bit (in the platform-specific code).

Any takers ?

Getting rid of the potential endless loop is a different issue, and certainly
a worthwhile endeavour. (I'd suggest to find a suitable time source, then
loop until a configurable amount of time has passed, and then setting a flag
that disables further looping, until the UART becomes operational again).

I.e.,

static int stuck = 0;

do ok = ready();
while (!stuck && !ok && !timeout())
if (!ok) {

stuck = 1;
return;

}
stuck = 0;

comment:50 Changed 10 years ago by cesarb@…

(in reply to comment #34)

If it's just to avoid an infinite loop, there's no need to find a timesource;
just loop a maximum number of times, with a suitable udelay() within the loop.
See for instance s3c24xx_serial_rx_enable() on drivers/serial/s3c2410.c, which
does exactly that.

comment:51 Changed 10 years ago by werner@…

re #35: very nice idea, thanks ! Keep the patches coming ! :-)

comment:52 Changed 10 years ago by mwester@…

If I had to choose between the two patches, I'd prefer Fabien's, because it's

less intrusive.
Yes, but see comment #20 -- that approach still leaves a vulnerability where a
lockup can be triggered by something as simple as someone messing with startup
scripts or boot-time ordering issues.

However, I'd be even happier with a patch that just clears the darn flow

control bit (in the platform-specific code).
I tried. The way the tty stuff is handled at the highest levels in the kernel
means that we just can't find all the possible places where flow control might
be set. The fact that the current ttys structures say that it isn't set is only
good for as long as the tty is held open; it reverts as soon as the tty is
closed. And then, of course, another process can open it and change the flow
control at any time. Overall, it meant mucking about in code several levels
higher in the tty subsystem, and I decided that was not the place to try to fix
this.

Getting rid of the potential endless loop is a different issue, and certainly

a worthwhile endeavour.

Agreed. But that doesn't solve the problem, really. It just replaces the
kernel lockup with total lack of any console output on the serial device; might
as well just disable the console (a la Fabien's fix) if you're going to do that.

I remain firmly convinced that the correct solution is one that:
a) ensures that no combination of setting flow control and diddling with the
GTA01 serial port mux should *ever* result in the crash of the kernel -- this is
even more important now that efforts are underway to streamline the boot time.
b) ensures that in any combination of setting flow control and diddling with the
GTA01 serial port mux, the minimum of kernel messages are lost to the serial
port. This also is important as work continues on boot time issues, but also as
effort continues to simply debug 2.6.24.

Perhaps the problem is the way the patch is written? Does anyone have any
concrete suggestions on how to rewrite it such that it's more palatable?
Someone had concerns about the use of the "unused[]" elements in the
datastructure to record state (although there is precedent for this elsewhere in
the driver). Would it make it better if elements were added to the structure,
or if the data were recorded elsewhere? I guess I'm still not really clear on
the objections to the solution.

comment:53 Changed 10 years ago by werner@…

Let's take this to openmoko-kernel. Two quick remarks to #37:

a) Agreed, hanging the kernel is cruel and unusual punishment for merely
turning on flow control.

b) I don't think kernel messages should be printed at any price. In
particular, interrupting communication with GSM arbitrarily when the
kernel feels like mumbling something will not help towards making
gsmd more reliable :) I could see an exception for messages indicating
a clearly fatal condition.

comment:54 Changed 10 years ago by mwester@…

  • blocked set to 1221

comment:55 Changed 10 years ago by roh

  • Owner changed from willie_chen@… to willie_chen

comment:56 Changed 9 years ago by john_lee

  • Status changed from new to assigned
  • Owner changed from willie_chen to openmoko-kernel
  • HasPatchForReview unset

willie_chen doesn't work for us anymore. Reassign to openmoko-kernel. last update was 9 months ago, any development on this?

comment:57 Changed 9 years ago by andy

"flowcontrolled" /sys option was added to neo1973-pm-gsm for this, it was enough of a solution that we can close this?

comment:58 Changed 9 years ago by mickey

  • Status changed from assigned to closed
  • Resolution set to fixed

With regards to the original problem of starting or stopping gsmd locking the Neo, I'll close this bug now for several reasons:

a) ogsmd, fso-abyss, gsm0710muxd all seem to start, stop, restart GSM without problems (apart from the fancy sysfs dance that is required to reliably reset it).
a) gsmd is dead.

With regards to the other problems, please open dedicated bugs for that as you see fit.

Note: See TracTickets for help on using tickets.