Ticket #2328 (new defect)

Opened 5 years ago

Last modified 3 years ago

touchscreen sometimes stops generating events

Reported by: lindi Owned by: openmoko-kernel
Priority: normal Milestone:
Component: kernel Version:
Severity: normal Keywords: touchscreen
Cc: Blocked By:
Blocking: Estimated Completion (week):
HasPatchForReview: no PatchReviewResult:
Reproducible: rarely

Description

Steps to reproduce:
1) hexdump -C /dev/input/event1
2) touch the touchscreen

Expected results:
2) some events are generated

Actual results:
2) nothing is generated
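
For reading the hexdump output, here is a minimal sketch that decodes the raw records into fields. It assumes the 16-byte struct input_event layout of a 32-bit little-endian kernel (two 32-bit timestamp fields, u16 type, u16 code, s32 value); on a 64-bit host the timestamp fields are wider and the format string would need to change. The names decode_events and dump are made up for this illustration.

```python
import struct

# Assumed layout of struct input_event on a 32-bit little-endian kernel:
# tv_sec (s32), tv_usec (s32), type (u16), code (u16), value (s32).
EVENT_FORMAT = "<iiHHi"
EVENT_SIZE = struct.calcsize(EVENT_FORMAT)  # 16 bytes

# A few event types from linux/input.h; unknown types stay numeric.
EV_NAMES = {0: "EV_SYN", 1: "EV_KEY", 3: "EV_ABS"}


def decode_events(data, fmt=EVENT_FORMAT):
    """Decode packed input events into (sec, usec, type, code, value) tuples."""
    size = struct.calcsize(fmt)
    events = []
    # A trailing partial record, if any, is ignored.
    for offset in range(0, len(data) - size + 1, size):
        sec, usec, ev_type, code, value = struct.unpack_from(fmt, data, offset)
        events.append((sec, usec, EV_NAMES.get(ev_type, ev_type), code, value))
    return events


def dump(path="/dev/input/event1"):
    """Print events as they arrive; blocks while the driver delivers nothing."""
    with open(path, "rb", buffering=0) as dev:
        while True:
            chunk = dev.read(EVENT_SIZE)
            if len(chunk) < EVENT_SIZE:
                break
            print(decode_events(chunk)[0])
```

With a working touchscreen, dump() should print a stream of EV_ABS and EV_SYN records while the screen is touched; when the bug has hit, it just blocks.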

More info:
1) This started to happen when I upgraded from andy-tracking a3587e4ed77974ad to andy-tracking a15608f241a40b41 and disabled CONFIG_DEBUG_KERNEL and CONFIG_PREEMPT.
2) sudo lsof | grep event1 shows that only Xorg keeps the device open => no other process could have used EVIOCGRAB to grab the device and block events.
3) open files of Xorg:

lindi@ginger:~$ sudo ls -l /proc/$(pidof X)/fd
total 0
l-wx------ 1 root root 64 Jan 15 10:38 0 -> /var/log/Xorg.0.log
lrwx------ 1 root root 64 Jan 15 10:38 1 -> socket:[1745]
lrwx------ 1 root root 64 Jan 15 10:38 10 -> socket:[1781]
lrwx------ 1 root root 64 Jan 15 10:38 11 -> socket:[1965]
lrwx------ 1 root root 64 Jan 15 10:38 12 -> socket:[3064]
lrwx------ 1 root root 64 Jan 15 10:38 13 -> socket:[49650]
lrwx------ 1 root root 64 Jan 15 10:38 14 -> socket:[3082]
lrwx------ 1 root root 64 Jan 15 10:38 15 -> socket:[51145]
lrwx------ 1 root root 64 Jan 15 10:38 17 -> socket:[62017]
lrwx------ 1 root root 64 Jan 15 10:38 18 -> socket:[43528]
lrwx------ 1 root root 64 Jan 15 10:38 19 -> socket:[64051]
l-wx------ 1 root root 64 Jan 15 10:38 2 -> /var/log/xdm.log
lrwx------ 1 root root 64 Jan 15 10:38 20 -> socket:[64120]
lrwx------ 1 root root 64 Jan 15 10:38 23 -> socket:[64206]
lrwx------ 1 root root 64 Jan 15 10:38 3 -> socket:[1746]
lr-x------ 1 root root 64 Jan 15 10:38 4 -> /usr/lib/xorg/protocol.txt
lrwx------ 1 root root 64 Jan 15 10:38 5 -> /dev/tty2
lrwx------ 1 root root 64 Jan 15 10:38 6 -> /dev/apm_bios
lrwx------ 1 root root 64 Jan 15 10:38 7 -> /dev/fb0
lr-x------ 1 root root 64 Jan 15 10:38 8 -> /dev/input/event1
lrwx------ 1 root root 64 Jan 15 10:38 9 -> socket:[1778]

3) strace shows that Xorg does

select(256, [1 3 5 6 8 9 10 11 12 13 14 15 16 17 18 19 20 23], NULL, NULL, NULL) = 1 (in [11])

every second but it never returns that there is something to read from fd 8 (the event1 device).
4) This happened immediately after resume so suspend/resume code might have something to do with this.
5) Stopping Xorg did not fix this
6) Starting Xorg did not fix this
7) Suspending/resuming again did not fix this
8) the interrupt count of s3c2410_action in /proc/interrupts does increase every time I touch the screen so the hardware is not totally dead

9) Just in case this happens next, what extra information should I gather to make this easier to fix?
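
One data point that is cheap to collect next time is the interrupt counter before and after a touch. The following is a rough sketch, not a tested diagnostic tool: it assumes the usual /proc/interrupts layout (IRQ label, one counter per CPU, then chip and handler names), and the function names irq_count and read_irq_count are invented here.

```python
def irq_count(interrupts_text, action):
    """Sum the per-CPU counters of every /proc/interrupts row naming `action`."""
    total = 0
    for line in interrupts_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].endswith(":"):
            continue  # header line or blank line
        # Shared IRQs list several handlers separated by commas.
        names = [field.rstrip(",") for field in fields]
        if action not in names:
            continue
        for field in fields[1:]:
            if not field.isdigit():
                break  # counters end where the chip/handler names start
            total += int(field)
    return total


def read_irq_count(action="s3c2410_action", path="/proc/interrupts"):
    """Read the live counter for `action` from the running kernel."""
    with open(path) as handle:
        return irq_count(handle.read(), action)
```

Calling read_irq_count() before and after touching the screen, while hexdump stays silent, would record that interrupts still arrive even though no events reach /dev/input/event1.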

Attachments

syslog1.log (12.7 KB) - added by lindi 5 years ago.
syslog showing last suspend/resume
syslog_and_devices.txt (27.2 KB) - added by md2k7 5 years ago.
dmesg output, and cat /proc/bus/input/devices

Change History

Changed 5 years ago by lindi

syslog showing last suspend/resume

comment:1 Changed 5 years ago by TimoJyrinki

I hit this now as well for the first time. It would seem likely the touch screen driver(s) also need some larger wait somewhere, similar to WLAN which broke with the faster kernel...

comment:2 follow-up: ↓ 3 Changed 5 years ago by md2k7

Reproduction: You need to touch the screen exactly after it has blanked (in SHR, something - I guess FSO - is turning screen off after some time to save battery).

Kernel: 2.6.34, dd1225cc08c3375bf80289ac1965c724881b149a

Changed 5 years ago by md2k7

dmesg output, and cat /proc/bus/input/devices

comment:3 in reply to: ↑ 2 Changed 5 years ago by md2k7

Replying to md2k7:

Reproduction: You need to touch the screen exactly after it has blanked (in SHR, something - I guess FSO - is turning screen off after some time to save battery).

Kernel: 2.6.34, dd1225cc08c3375bf80289ac1965c724881b149a

Though it's not as easy to reproduce as I previously thought. I don't know how I managed it two reboots in a row.

comment:4 Changed 5 years ago by jama

I can confirm also with
SHR Kernel: 2.6.34, dd1225cc08c3375bf80289ac1965c724881b149a (same as in #2337) and IIRC I didn't even suspend/resume before this happened yesterday.

comment:5 Changed 4 years ago by gena2x

I spent some time trying to investigate this (or similar?) issue.

I ran into problems when I changed the optimization setting from optimizing for size to optimizing for speed (for andy-tracking).

Perfectly reproducible.

An interesting thing I found while investigating is that our touchscreen sometimes starts sending absolutely correct events (no need to filter). This happens when the screen has been blanked for some period, and it is perfectly reproducible too.

Reloading the touchscreen module restores functionality.

I can't recall everything exactly (I did the investigation in February), but the real causes of the problem were as follows.

dmesg looks like this:

...
Jan 1 03:49:06 debian-gta02 kernel: [ 2539.550000] s3c2440-ts s3c2440-ts: Stylus timer, down state, samples: 1, 1, 4
Jan 1 03:49:06 debian-gta02 kernel: [ 2539.550000] s3c2440-ts s3c2440-ts: Stylus irq, down state: 0, 0
Jan 1 03:49:06 debian-gta02 kernel: [ 2539.550000] s3c2440-ts s3c2440-ts: stylus_irq: count=4
Jan 1 03:49:06 debian-gta02 kernel: [ 2539.560000] s3c2440-ts s3c2440-ts: Stylus irq, down state: 0, 0
Jan 1 03:49:06 debian-gta02 kernel: [ 2539.560000] s3c2440-ts s3c2440-ts: stylus_irq: count=4
Jan 1 03:49:06 debian-gta02 kernel: [ 2539.565000] s3c2440-ts s3c2440-ts: Stylus timer, down state, samples: 0, 0, 4
Jan 1 03:49:06 debian-gta02 kernel: [ 2540.005000] s3c2440-ts s3c2440-ts: Stylus irq, down state: 1, 1
Jan 1 03:49:06 debian-gta02 kernel: [ 2540.010000] s3c2440-ts s3c2440-ts: Stylus timer, down state, samples: 1, 1, 4
Jan 1 03:49:06 debian-gta02 kernel: [ 2540.020000] s3c2440-ts s3c2440-ts: Stylus timer, down state, samples: 1, 1, 4
[...same...]
Jan 1 03:49:06 debian-gta02 kernel: [ 2540.135000] s3c2440-ts s3c2440-ts: Stylus timer, down state, samples: 1, 1, 4
Jan 1 03:49:06 debian-gta02 kernel: [ 2540.140000] s3c2440-ts s3c2440-ts: Stylus timer, down state, samples: 1, 1, 4
Jan 1 03:49:06 debian-gta02 kernel: [ 2540.145000] s3c2440-ts s3c2440-ts: Stylus irq, down state: 0, 0
Jan 1 03:49:06 debian-gta02 kernel: [ 2540.145000] s3c2440-ts s3c2440-ts: stylus_irq: count=4
Jan 1 03:49:06 debian-gta02 kernel: [ 2540.150000] s3c2440-ts s3c2440-ts: Stylus irq, down state: 1, 0

'Stylus irq, down state: 0, 0' is an 'up' interrupt.
An 'up' interrupt also prints 'stylus_irq: count=?', which is the count of measurements ready at this point.

'Stylus irq, down state: 1, ?' is a 'down' interrupt.
A 'down' interrupt starts the 'timer'.

The main source of information about an interrupt is reading the registers after each conversion and on each interrupt. If the registers return 1, the touchscreen is down, and we start the timer and the ADC. When it is 'up' (0), we only start waiting for 'down'. (This seems unnecessary to me, since we already know what we are waiting for and could just say state=!state, but somehow that does not work.)

The down state '1, 1' is the normal situation; '1, 0' is what appears when the bug has occurred.
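
To find such lines in a long log, a small scanner along these lines may help. It is only a sketch: it relies on the exact 'Stylus irq, down state: a, b' debug message format shown in the dmesg excerpt and flags every line where the two values differ (so it would also report a '0, 1' line, should one exist). The name find_mismatches is made up.

```python
import re

# Matches the kernel timestamp and the two values of a 'Stylus irq' debug line.
IRQ_LINE = re.compile(r"\[\s*(\d+\.\d+)\].*Stylus irq, down state: (\d), (\d)")


def find_mismatches(log_text):
    """Return (timestamp, first, second) for IRQ lines whose two values differ."""
    hits = []
    for line in log_text.splitlines():
        match = IRQ_LINE.search(line)
        if match and match.group(2) != match.group(3):
            hits.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return hits
```

Running it over the excerpt above reports only the final line, the one at 2540.150000.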

So the interrupt happens, but the ADC registers report that the stylus is 'up', and from some point on they do so forever. I rewrote the driver to check only interrupts, but somehow this didn't help. I tried other ideas as well, but without success.

All the tests are from the .34-backported driver (as it was in February) without filtering.

In fact I also did some investigation of the touchscreen buzz. I tried different combinations of delays and scaling, all without luck too.

comment:6 Changed 4 years ago by gena2x

The issue is caused by something I call 'unexpected interrupts'. When the device is touched, an up/down interrupt is received, but the original drivers do not rely on the interrupt cause; they check the ADC registers for the current state instead. As a result, a 'down' interrupt may be received while the handler thinks it is 'up', which leads to a situation where the driver no longer watches for down interrupts and the touchscreen generates no events until the driver is reloaded. In the other situation, an 'up' interrupt is interpreted as 'down'; this sometimes leads to data corruption in the ADC conversion, and the number of remaining samples needed can go below 0, so ADC conversion is requested an infinite number of times. In this case an attempt to reload the module will hang the system. As the driver is written with certain 'states' in mind, it can't handle such interrupts, and in fact they are not interesting for us (if we already have pen down, there is no need for another interrupt informing us about it), so we have to ignore them.

Some things remain unexplained for me: why we sometimes receive interrupts when we do not expect them at all, and why we see this problem only on a kernel without debugging information.

My solution is to accept only expected interrupts.
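
The idea of accepting only expected interrupts can be illustrated with a toy model. This is not the code from the patches linked below, and it says nothing about how the real driver programs the hardware; the class ExpectedInterruptFilter and the helper replay are hypothetical names used only for this sketch.

```python
class ExpectedInterruptFilter:
    """Toy model: accept an interrupt only if it is the one being waited for."""

    def __init__(self):
        self.expecting = "down"  # an idle panel waits for pen-down
        self.ignored = 0         # number of unexpected interrupts dropped

    def handle(self, kind):
        """Return True if the interrupt was accepted, False if it was ignored."""
        if kind not in ("down", "up"):
            raise ValueError("kind must be 'down' or 'up'")
        if kind != self.expecting:
            # Unexpected interrupt: leave the state untouched.
            self.ignored += 1
            return False
        # Expected interrupt: flip to waiting for the opposite one.
        self.expecting = "up" if kind == "down" else "down"
        return True


def replay(kinds):
    """Feed a sequence of interrupt kinds through a fresh filter."""
    filt = ExpectedInterruptFilter()
    return [filt.handle(kind) for kind in kinds], filt
```

For example, replay(["down", "down", "up", "up", "down"]) accepts the first, third and fifth interrupts and ignores the two repeats, so the state never gets out of step with the sequence of accepted events.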

some logs of failures with added debug info:
http://www.bsdmn.com/openmoko/kernel/touchscreen/34failexpectupgotdown.log
http://www.bsdmn.com/openmoko/kernel/touchscreen/34failexpectdowngotup.log
http://www.bsdmn.com/openmoko/kernel/touchscreen/34unknowunexpected.log

patches for .34 and .29 kernel:
http://www.bsdmn.com/openmoko/kernel/touchscreen/touchscreen_ignoreunexpectedintr29.patch
http://www.bsdmn.com/openmoko/kernel/touchscreen/touchscreen_ignoreunexpectedintr34.patch

The .34 patch was tested much more than the .29 version, for which only basic testing was done.

I hope this bug fix is a good step towards a kernel optimized for speed.

comment:7 Changed 4 years ago by lars

Hi

Could you write down your findings and send them with your patch to the upstream maintainers for the driver? (scripts/get_maintainer.pl touchscreen_ignoreunexpectedintr34.patch)

  • Lars

comment:8 Changed 3 years ago by lindi

Since udev dropped support for 2.6.29 I finally tried to use 2.6.34 in "production". After a few days I hit this bug again even though I have applied

http://www.bsdmn.com/openmoko/kernel/touchscreen/touchscreen_ignoreunexpectedintr34.patch

comment:9 Changed 3 years ago by purg

The bug should be resolved now.
