Ticket #2328 (new defect)
touchscreen sometimes stops generating events
| Reported by: | lindi | Owned by: | openmoko-kernel |
|---|---|---|---|
| Priority: | normal | Milestone: | |
| Component: | kernel | Version: | |
| Severity: | normal | Keywords: | touchscreen |
| Cc: | | Blocked By: | |
| Blocking: | | Estimated Completion (week): | |
| HasPatchForReview: | no | PatchReviewResult: | |
| Reproducible: | rarely | | |
Description
Steps to reproduce:
1) hexdump -C /dev/input/event1
2) touch the touchscreen
Expected results:
2) some events are generated
Actual results:
2) nothing is generated
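To make the hexdump output from step 1 easier to read, the raw bytes could be decoded with a few lines of Python. This is only a sketch: it assumes the device delivers `struct input_event` records in the native layout from `<linux/input.h>` (two longs for the timestamp, then type, code and value) and that it is run on the device itself. `decode_events` and `dump_device` are hypothetical helper names, not part of any existing tool.

```python
import struct

# Assumed native layout of struct input_event:
# struct timeval (two longs), __u16 type, __u16 code, __s32 value.
EVENT_FORMAT = "llHHi"
EVENT_SIZE = struct.calcsize(EVENT_FORMAT)


def decode_events(data):
    """Decode a buffer of whole input_event records into tuples.

    Each tuple is (timestamp_seconds, type, code, value).
    Trailing bytes that do not form a whole record are ignored.
    """
    events = []
    for offset in range(0, len(data) - EVENT_SIZE + 1, EVENT_SIZE):
        sec, usec, ev_type, code, value = struct.unpack_from(
            EVENT_FORMAT, data, offset)
        events.append((sec + usec / 1e6, ev_type, code, value))
    return events


def dump_device(path="/dev/input/event1"):
    """Print decoded events; blocks while the device produces nothing."""
    with open(path, "rb", buffering=0) as dev:
        while True:
            chunk = dev.read(EVENT_SIZE)
            if len(chunk) < EVENT_SIZE:
                break
            for when, ev_type, code, value in decode_events(chunk):
                print("%.6f type=%d code=%d value=%d"
                      % (when, ev_type, code, value))
```

Calling `dump_device()` as root and touching the screen should print one line per event when the driver works, and stay silent when the bug has hit.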
More info:
1) This started to happen when I upgraded from andy-tracking a3587e4ed77974ad to andy-tracking a15608f241a40b41 and disabled CONFIG_DEBUG_KERNEL and CONFIG_PREEMPT.
2) sudo lsof | grep event1 shows that only Xorg keeps the device open => there is nothing running EVIOCGRAB that could disable events.
3) open files of Xorg:
lindi@ginger:~$ sudo ls -l /proc/$(pidof X)/fd
total 0
l-wx------ 1 root root 64 Jan 15 10:38 0 -> /var/log/Xorg.0.log
lrwx------ 1 root root 64 Jan 15 10:38 1 -> socket:[1745]
lrwx------ 1 root root 64 Jan 15 10:38 10 -> socket:[1781]
lrwx------ 1 root root 64 Jan 15 10:38 11 -> socket:[1965]
lrwx------ 1 root root 64 Jan 15 10:38 12 -> socket:[3064]
lrwx------ 1 root root 64 Jan 15 10:38 13 -> socket:[49650]
lrwx------ 1 root root 64 Jan 15 10:38 14 -> socket:[3082]
lrwx------ 1 root root 64 Jan 15 10:38 15 -> socket:[51145]
lrwx------ 1 root root 64 Jan 15 10:38 17 -> socket:[62017]
lrwx------ 1 root root 64 Jan 15 10:38 18 -> socket:[43528]
lrwx------ 1 root root 64 Jan 15 10:38 19 -> socket:[64051]
l-wx------ 1 root root 64 Jan 15 10:38 2 -> /var/log/xdm.log
lrwx------ 1 root root 64 Jan 15 10:38 20 -> socket:[64120]
lrwx------ 1 root root 64 Jan 15 10:38 23 -> socket:[64206]
lrwx------ 1 root root 64 Jan 15 10:38 3 -> socket:[1746]
lr-x------ 1 root root 64 Jan 15 10:38 4 -> /usr/lib/xorg/protocol.txt
lrwx------ 1 root root 64 Jan 15 10:38 5 -> /dev/tty2
lrwx------ 1 root root 64 Jan 15 10:38 6 -> /dev/apm_bios
lrwx------ 1 root root 64 Jan 15 10:38 7 -> /dev/fb0
lr-x------ 1 root root 64 Jan 15 10:38 8 -> /dev/input/event1
lrwx------ 1 root root 64 Jan 15 10:38 9 -> socket:[1778]
3) strace shows that Xorg does
select(256, [1 3 5 6 8 9 10 11 12 13 14 15 16 17 18 19 20 23], NULL, NULL, NULL) = 1 (in [11])
every second but it never returns that there is something to read from fd 8 (the event1 device).
4) This happened immediately after resume so suspend/resume code might have something to do with this.
5) Stopping Xorg did not fix this
6) Starting Xorg did not fix this
7) Suspending/resuming again did not fix this
8) the interrupt count of s3c2410_action in /proc/interrupts does increase every time I touch the screen so the hardware is not totally dead
9) Just in case this happens next, what extra information should I gather to make this easier to fix?
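For the check in item 8, comparing the interrupt counter before and after a touch could be scripted rather than done by eye. The sketch below assumes the usual /proc/interrupts layout (an IRQ label ending in a colon, one count column per CPU, then the chip and handler names); `irq_count` and `read_irq_count` are hypothetical helper names.

```python
def irq_count(interrupts_text, name):
    """Return the summed per-CPU count for the IRQ line that lists
    the handler `name`, or None if no such line exists."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        # Skip the CPU header line and anything without an "NN:" label.
        if len(fields) < 2 or not fields[0].endswith(":"):
            continue
        # Shared IRQs list several handler names separated by commas.
        if name not in [field.rstrip(",") for field in fields[1:]]:
            continue
        total = 0
        for field in fields[1:]:
            if not field.isdigit():
                break  # the count columns end where the chip name starts
            total += int(field)
        return total
    return None


def read_irq_count(name="s3c2410_action", path="/proc/interrupts"):
    """Read the current count for `name` from the running kernel."""
    with open(path) as handle:
        return irq_count(handle.read(), name)
```

Calling `read_irq_count()` once, touching the screen, and calling it again gives two numbers to attach to the report. A rising count with no events on /dev/input/event1 would match the behaviour described in item 8.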
Attachments
Change History
Changed 3 years ago by lindi
- Attachment syslog1.log added
comment:1 Changed 3 years ago by TimoJyrinki
I hit this now as well for the first time. It would seem likely the touch screen driver(s) also need some larger wait somewhere, similar to WLAN which broke with the faster kernel...
comment:2 follow-up: ↓ 3 Changed 3 years ago by md2k7
Reproduction: You need to touch the screen exactly after it has blanked (in SHR, something - I guess FSO - is turning screen off after some time to save battery).
Kernel: 2.6.34, dd1225cc08c3375bf80289ac1965c724881b149a
Changed 3 years ago by md2k7
- Attachment syslog_and_devices.txt added
dmesg output, and cat /proc/bus/input/devices
comment:3 in reply to: ↑ 2 Changed 3 years ago by md2k7
Replying to md2k7:
Reproduction: You need to touch the screen exactly after it has blanked (in SHR, something - I guess FSO - is turning screen off after some time to save battery).
Kernel: 2.6.34, dd1225cc08c3375bf80289ac1965c724881b149a
Though it's not as easy to reproduce as I previously thought. I don't know how I managed it two reboots in a row.
comment:4 Changed 3 years ago by jama
I can confirm also with
SHR Kernel: 2.6.34, dd1225cc08c3375bf80289ac1965c724881b149a (same as in #2337) and IIRC I didn't even suspend/resume before this happened yesterday.
comment:5 Changed 3 years ago by gena2x
I spent some time trying to investigate this (or similar?) issue.
I ran into problems when I changed the optimization setting from optimizing for size to optimizing for speed (for andy-tracking).
Perfectly reproducible.
An interesting thing I found while investigating is that our touchscreen sometimes starts sending absolutely correct events (no need to filter). This happens when the screen has been blanked for some period, and it is perfectly reproducible too.
Reloading the touchscreen module restores functionality.
I can't recall exactly (I did the investigation in February), but the real causes of the problem were as follows.
The dmesg output looks like this:
...
Jan 1 03:49:06 debian-gta02 kernel: [ 2539.550000] s3c2440-ts s3c2440-ts: Stylus timer, down state, samples: 1, 1, 4
Jan 1 03:49:06 debian-gta02 kernel: [ 2539.550000] s3c2440-ts s3c2440-ts: Stylus irq, down state: 0, 0
Jan 1 03:49:06 debian-gta02 kernel: [ 2539.550000] s3c2440-ts s3c2440-ts: stylus_irq: count=4
Jan 1 03:49:06 debian-gta02 kernel: [ 2539.560000] s3c2440-ts s3c2440-ts: Stylus irq, down state: 0, 0
Jan 1 03:49:06 debian-gta02 kernel: [ 2539.560000] s3c2440-ts s3c2440-ts: stylus_irq: count=4
Jan 1 03:49:06 debian-gta02 kernel: [ 2539.565000] s3c2440-ts s3c2440-ts: Stylus timer, down state, samples: 0, 0, 4
Jan 1 03:49:06 debian-gta02 kernel: [ 2540.005000] s3c2440-ts s3c2440-ts: Stylus irq, down state: 1, 1
Jan 1 03:49:06 debian-gta02 kernel: [ 2540.010000] s3c2440-ts s3c2440-ts: Stylus timer, down state, samples: 1, 1, 4
Jan 1 03:49:06 debian-gta02 kernel: [ 2540.020000] s3c2440-ts s3c2440-ts: Stylus timer, down state, samples: 1, 1, 4
[...same...]
Jan 1 03:49:06 debian-gta02 kernel: [ 2540.135000] s3c2440-ts s3c2440-ts: Stylus timer, down state, samples: 1, 1, 4
Jan 1 03:49:06 debian-gta02 kernel: [ 2540.140000] s3c2440-ts s3c2440-ts: Stylus timer, down state, samples: 1, 1, 4
Jan 1 03:49:06 debian-gta02 kernel: [ 2540.145000] s3c2440-ts s3c2440-ts: Stylus irq, down state: 0, 0
Jan 1 03:49:06 debian-gta02 kernel: [ 2540.145000] s3c2440-ts s3c2440-ts: stylus_irq: count=4
Jan 1 03:49:06 debian-gta02 kernel: [ 2540.150000] s3c2440-ts s3c2440-ts: Stylus irq, down state: 1, 0
'Stylus irq, down state: 0, 0' is an 'up' interrupt.
An 'up' interrupt also prints 'stylus_irq: count=?', which is the count of measurements ready at that point.
'Stylus irq, down state: 1, ?' is a 'down' interrupt.
A 'down' interrupt starts the timer.
The main source of information about an interrupt is reading the registers after each conversion and on each interrupt. If the registers return 1, the touchscreen is down, and we start the timer and the ADC. If they return 0 ('up'), we only start waiting for 'down'. (This seems unnecessary to me, since we already know what we are waiting for and could just set state = !state, but somehow that does not work.)
The down state '1, 1' is the normal situation; '1, 0' is when the bug occurred.
So the interrupt happens, but the ADC registers report that the stylus is 'up', and from some point on they do so forever. I rewrote the driver to check only interrupts, but somehow this didn't help. I tried other ideas too, without success.
All the tests are from the .34-backported driver (as it was in February) without filtering.
In fact I also did some investigation into touchscreen buzz; I tried different combinations of delays and scaling, all without luck too.
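The reading of the 'Stylus irq, down state' lines given above can be applied to a whole log mechanically. The following is a rough sketch that relies only on the interpretation in this comment ('0, 0' is up, '1, 1' is a normal down, '1, 0' is the failure). Any other pair is reported as unknown because the comment does not describe it. `classify` and `first_mismatch` are hypothetical helper names.

```python
import re

# Matches the debug line printed by the driver on a stylus interrupt.
IRQ_LINE = re.compile(r"Stylus irq, down state: (\d), (\d)")


def classify(line):
    """Return 'up', 'down', 'mismatch' or 'unknown' for a stylus irq
    line, or None if the line is not a stylus irq line at all."""
    match = IRQ_LINE.search(line)
    if match is None:
        return None
    pair = (int(match.group(1)), int(match.group(2)))
    if pair == (0, 0):
        return "up"
    if pair == (1, 1):
        return "down"
    if pair == (1, 0):
        return "mismatch"  # the state described above as the bug
    return "unknown"


def first_mismatch(lines):
    """Return the index of the first '1, 0' line, or None if absent."""
    for index, line in enumerate(lines):
        if classify(line) == "mismatch":
            return index
    return None
```

Running `first_mismatch` over the lines of a saved dmesg or syslog file points at the place where the driver state and the interrupt first disagree.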
comment:6 Changed 3 years ago by gena2x
The issue is caused by what I call 'unexpected interrupts'. When the device is touched, an up/down interrupt is received, but the original drivers do not rely on the interrupt cause; they check the ADC registers for the current state instead. This means a 'down' interrupt can be received while the handler thinks it is 'up', which leads to a situation where the driver no longer watches for down interrupts and the touchscreen generates no events until the driver is reloaded. In the other situation, where an 'up' interrupt is interpreted as 'down', the ADC conversion data can sometimes be corrupted and the number of remaining samples needed can go below 0, so ADC conversions are requested an infinite number of times. In this case an attempt to reload the module will hang the system. As the driver is written with certain states in mind, it can't handle such interrupts, and in fact they are not interesting for us (if we already have pen down, there is no need for another interrupt informing us about it), so we have to ignore them.
For me some things are left unexplained: why we sometimes receive interrupts when we do not expect them at all, and why we see this problem only on a kernel without debugging information.
My solution is to accept only expected interrupts.
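The idea of accepting only expected interrupts can be illustrated with a toy model. This is not the patch itself and not kernel code, only a user-space sketch of the state handling described above. The class name `StylusModel` and its fields are invented for illustration, and the real driver also has to manage the timer and ADC conversions, which are left out here.

```python
class StylusModel:
    """Toy model of a driver that tracks which pen interrupt it is
    waiting for and drops any interrupt that does not match."""

    def __init__(self):
        self.expect_down = True  # assumed start state: waiting for pen-down
        self.ignored = 0         # number of unexpected interrupts dropped
        self.events = []         # accepted transitions, in order

    def on_irq(self, is_down):
        """Handle one interrupt; return True if it was accepted."""
        if is_down != self.expect_down:
            # Unexpected interrupt: ignore it and keep the current state,
            # so the model never stops waiting for the right edge.
            self.ignored += 1
            return False
        self.events.append("down" if is_down else "up")
        self.expect_down = not is_down
        return True
```

Feeding the model a sequence with a repeated 'down' or a repeated 'up' leaves it in a consistent state: the duplicate is counted in `ignored` and the next expected edge is still accepted.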
some logs of failures with added debug info:
http://www.bsdmn.com/openmoko/kernel/touchscreen/34failexpectupgotdown.log
http://www.bsdmn.com/openmoko/kernel/touchscreen/34failexpectdowngotup.log
http://www.bsdmn.com/openmoko/kernel/touchscreen/34unknowunexpected.log
patches for .34 and .29 kernel:
http://www.bsdmn.com/openmoko/kernel/touchscreen/touchscreen_ignoreunexpectedintr29.patch
http://www.bsdmn.com/openmoko/kernel/touchscreen/touchscreen_ignoreunexpectedintr34.patch
The .34 patch was tested much more than the .29 version, for which only a basic test was done.
I hope this bug fix is a good step toward a kernel optimized for speed.
comment:7 Changed 3 years ago by lars
Hi
Could you write down your findings and send them with your patch to the upstream maintainers for the driver? (scripts/get_maintainer.pl touchscreen_ignoreunexpectedintr34.patch)
- Lars
comment:8 Changed 2 years ago by lindi
Since udev dropped support for 2.6.29 I finally tried to use 2.6.34 in "production". After a few days I hit this bug again even though I have applied
http://www.bsdmn.com/openmoko/kernel/touchscreen/touchscreen_ignoreunexpectedintr34.patch

syslog showing last suspend/resume