Ticket #1315 (new defect)

Opened 7 years ago

Last modified 6 years ago

xglamo: Performance decreases after a while

Reported by: zecke@… Owned by: openmoko-devel
Priority: highest Milestone:
Component: Xfbdev (kdrive), Xglamo Version: Om2008.8
Severity: critical Keywords:
Cc: buglog@…, will@…, tick@… Blocked By:
Blocking: Estimated Completion (week):
HasPatchForReview: PatchReviewResult:
Reproducible:

Description

I have Xglamo, elightenment with illume and Qtopia running. If I use the slide,
show and hide the application. Xglamo starts to crawl. The load is fine, it
looks like Xglamo is still going to the event loop. But enlightenment seems to
be stuck in XQueryPointer (poll) for more than a minute. Allow new clients take
ages to connect to the Xserver.

This behavior can easily be provoked by clicking a lot with a Qtopia/X11 image
and the cause is not really known. I will upload my image somewhere.

Change History

comment:1 Changed 6 years ago by zecke@…

  • Summary changed from Stability issues with illume/e and Qtopia to Xglamo issues with illume/e

comment:2 Changed 6 years ago by zecke@…

  • Priority changed from high to highest
  • Severity changed from normal to critical

Using the revision 9a7f5bc90c3e385e9b0d82d5a49af65dca0e73a6 makes Xglamo
unusable. It has serve issues updating the regions it touched with the cursor,
all screen moves result in a black background. I'm pretty sure this is not an
enlightenment issue.

How to reproduce:

  • build e-wm
  • install it
  • start "enlightenment_starts"
  • move the cursor and see black rects
  • start the files application
  • move it around

Result:

You end up with a black screen. The device is not usable.

comment:3 Changed 6 years ago by dodji@…

Oops. Must be a mistake of mine. Could you please try
http://folks.o-hand.com/dodji/omoko/xglamo/xserver-kdrive-glamo_1.3.0.0+git-r4_armv4t.ipk
and tell me if that fixes the problem for you ?

comment:4 Changed 6 years ago by roh

  • Owner changed from openmoko-devel@… to openmoko-devel

comment:5 Changed 6 years ago by zecke

  • Cc will@… added
  • Summary changed from Xglamo issues with illume/e to xglamo: Performance decreases after a while
  • Milestone set to ASU

Turn this bug into the glamo performance bug. After a while even answering simple requests take a long time and client processes seem to be stuck. The tricky thing is how to reproduce this. I think the expedite suite might be a good help here.

comment:6 Changed 6 years ago by zecke

From an old mail to openmoko-devel:

e/illume:
(gdb) bt
#0  0x4050f250 in poll () from /lib/libc.so.6
#1  0x4032d8a0 in _XWaitForReadable () from /usr/lib/libX11.so.6
#2  0x4032dc4c in _XRead () from /usr/lib/libX11.so.6
#3  0x4032e634 in _XReply () from /usr/lib/libX11.so.6
#4  0x4032332c in XQueryPointer () from /usr/lib/libX11.so.6
#5  0x4004a7dc in from /usr/lib/libecore_x.so.0
#6  0x00046eb4 in _e_pointer_cb_idle_timer_pre ()
#7  0x4041fa00 in _ecore_timer_call () from /usr/lib/libecore.so.0
#8  0x4041f9e4 in _ecore_timer_call () from /usr/lib/libecore.so.0
#9  0x4041c8bc in _ecore_main_loop_iterate_internal ()
from /usr/lib/libecore.so.0
#10 0x4041ccd4 in ecore_main_loop_begin () from /usr/lib/libecore.so.0
#11 0x000291f4 in main ()

Xglamo:
0x4028328c in gettimeofday () from /lib/libc.so.6
(gdb) bt
#0  0x4028328c in gettimeofday () from /lib/libc.so.6
#1  0x00014fd4 in GLAMOEngineWait ()
#2  0x000156d0 in GLAMOWaitMarker ()
#3  0x00043504 in kaaWaitSync ()
#4  0x00015118 in GLAMODoneCopy ()
#5  0x00044aec in kaaCopyNtoN ()
#6  0x00056140 in fbCopyRegion ()
#7  0x000564b0 in fbDoCopy ()
#8  0x00043d0c in kaaCopyArea ()
#9  0x000cc840 in damageCopyArea ()
#10 0x0007a66c in miDCRestoreUnderCursor ()
#11 0x000856d0 in miSpriteRemoveCursor ()
#12 0x000862c8 in miSpriteSetCursor ()
#13 0x0007fc90 in miPointerUpdate ()
#14 0x0004a060 in ProcessInputEvents? ()
#15 0x0002e988 in Dispatch ()
#16 0x00018d44 in main ()
#17 0x4021ca00 in libc_start_main () from /lib/libc.so.6
#18 0x00012388 in _start ()

comment:7 Changed 6 years ago by zecke

Also looking at the cmd queue changes in the kernel (locking, waiting for it to be empty..) would be helpful. Maybe even working on dodji's #1115 issue?

comment:8 Changed 6 years ago by andy

Last time I looked at it XGlamo has obvious problem that it cut and pasted the kernel fb code and pressed the Super MMAP button and "did it in Userspace": it does not cooperate with locking and modal command queue handling in kernel world. Maybe that can make the trouble, in which case the answer is put XGlamo guts into kernel world and talk to it on IOCTL or something.

comment:9 Changed 6 years ago by tick

  • Cc tick@… added

comment:10 Changed 6 years ago by olv

I am still trying to reproduce the problem.

Following comment #2, I encounter https://docs.openmoko.org/trac/ticket/1066. It might not be the cause, or, I might be very lucky so far that glamo happens to finish its job before each Copy command is issued.

comment:11 Changed 6 years ago by zecke

  • Version changed from unspecified to Om2008.8
  • Milestone Om2008.8 deleted

#1296 is another such bug.

comment:12 Changed 6 years ago by denver

Is there some way I can log the commands sent by Xglamo? I am seeing this bug periodically (more specifically I'm seeing #1296, but this is similar) and I'd like to help debug it. Looking at the command log was suggested by cjb on #openmoko.

comment:13 Changed 6 years ago by zecke

Well, the log will be quite big and you would need the exact point where things got wrong. So currently the most wanted thing is an easy way to reproduce this.

comment:14 Changed 6 years ago by denver

Here are some more details on the specific problem I am encountering:

When running a recently-updated 2007.2 build, the device becomes unresponsive to touchscreen taps after a few hours of being on (the exact number of hours varies). This seems to occur at random; it even occurs when there is no apparent activity on the unit (screen has not been updated and touchscreen has not being pressed for several minutes). The unit still responds to SSH while unresponsive to the touchscreen. Running top shows that Xglamo is consuming a lot of CPU (presumably spinning). After a few minutes (about 5), Xglamo stops consuming CPU. The unit will then respond with an audible click when pressing the touchscreen (and the screen turns on if dimmed), but Xglamo immediately starts hogging the CPU again and the screen becomes unresponsive. After a few minutes, Xglamo stops hogging the CPU again and the cycle continues.

Even if one does not press the touchscreen once Xglamo stops hogging the CPU, Xglamo can start spinning again at random. Again, it will stop hogging the CPU after a few minutes but starts spinning immediately after the touchscreen is pressed.

The problem can be corrected by stopping and starting the X server. However, you must either send Xglamo a SIGKILL or wait for it to stop spinning (at which point it exits) following "/etc/init.d/xserver-nodm stop".

Attaching to Xglamo with gdb while it's spinning tends to show Xglamo in gettimeofday or GLAMOEngineWait, much like in the gdb output of comment 6.

comment:15 Changed 6 years ago by denver

zecke: It depends what you mean by reproducible. As you can see in my problem description, Xglamo stops spinning for a while and then starts again when you press the touchscreen. So activating the spinning is reproducible. If there's an easy way to turn on logging after the initial spinning, then I can get a concise log around the point where it starts spinning again. But if you need me to run a modified Xglamo binary that has logging turned on all the time, then the reproducibility I have doesn't help much. Hopefully there will be a way to turn on the logging while Xglamo is running.

comment:16 Changed 6 years ago by zecke

Small temporary summary:

  • We see Xglamo busy waiting for the command queue to empty but have no idea why
  • There is #1066 but I could not find much trace inside the glamo docu abut it
  • We know that the kernel might enable/disable engines and cmd queue without Xglamo knowing that.

Resolutions:

  • Find out if #1066 is the case and adjust Xglamo to wait for the queue to empty if there is something else queued
  • Share implementation with the kernel... provide ioctl's for certain operation to make sure that Xglamo and kernel agree... this is on my list

Testing:

  • Any testing will be quite hard without really understanding the issue (indeed a manageable cmd-queue log would come in handy). Maybe I will prepare patches for that first.

comment:17 Changed 6 years ago by andy

One thing about Glamo command queue code in kernel anyway, despite what may be there it is not sync'd to anything that matters, the only element of it that allows entry and exit from command queue mode "apparently OK" is the big mdleay(100). I spent some time a few weeks ago trying to optimize it to "wait for LCD ready" and "wait for command mode ready" but nothing worked without the big mdelay.

So I would check the bit that says you are actually in command mode -- maybe this pretty random not-understood delay thing failed and that's where the insanity about queue empty started.

Also, kernel won't take down engines randomly, only in suspend.

comment:18 Changed 6 years ago by TimoJyrinki

Hi. I tested Andy's kernel (compiled myself) briefly yesterday evening, and there was a visible improvement in speed. Unfortunately I couldn't charge my Neo while using the kernel (modules should also be rebuild, or a bug?), so I went back.

Could the two glamo patches be put to stable kernel show that it would show up on regular updates?

Anyway, the speed was not that great still, though better, and I wonder if the mdelay(100) mentioned here is now the main culprit for rather low frames per second everywhere? Eg. scrolling a long contacts list or simply playing Numpty Physics. 90MHz clock for core/memory should definitely somehow be able to update ca. 480x480 area (eg. contacts list main area) smoothly.

comment:19 Changed 6 years ago by sillern

I'm experiencing this issue quite often, I haven't been able to figure out how to reproduce it quick enough though.
I just let it suspend and wake up a few times during a longer period, I notice it after a few hours.

What can I do to debug this?

From what I can see, as others also have pointed out, it's spinning heavily in gettimeofday()

root@om-gta02:~# time strace -cp pidof Xglamo
Process 2090 attached - interrupt to quit
Process 2090 detached
% time seconds usecs/call calls errors syscall


99.86 15.487621 76 204940 gettimeofday

0.05 0.007351 18 412 nanosleep
0.04 0.006493 63 103 select
0.03 0.004623 22 206 setitimer
0.02 0.003630 35 103 sigreturn


100.00 15.509718 205764 total
real 8m 40.40s
user 0m 27.72s
sys 3m 33.95s
root@om-gta02:~#

Note: See TracTickets for help on using tickets.