Closed
Bug 821379
Opened 12 years ago
Closed 9 years ago
Pandaboard will become unresponsive after idling
Categories
(Core Graveyard :: Widget: Gonk, defect)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: mdas, Unassigned)
If you let the pandaboard idle, it will eventually become unresponsive. If you do 'adb devices' it will just hang. You have to reboot the board for it to work again.
I find that this problem happens intermittently, over a random period of idle time.
Comment 1•12 years ago
On the releng side, this shows up as a Nagios check declaring the board as ping down.
This can be easily handled by a script that checks whether the board's mozpool status is free and its ping is down; if both are true, we can just reboot the device without asking any further questions.
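A minimal sketch of the check described in this comment, not releng's actual tooling. The names `should_reboot`, `check_board`, `get_state` and `reboot` are hypothetical, the state string "free" is assumed from the wording above, and no real mozpool API is called: the caller has to supply the state lookup and the reboot action. The ping helper assumes the Linux iputils `ping` flags `-c` and `-W`.

```python
import subprocess


def ping_is_down(host, count=3, timeout_s=2):
    """Return True if `host` does not answer ICMP echo requests.

    Assumes Linux iputils ping: -c is the packet count, -W the
    per-reply timeout in seconds.
    """
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode != 0


def should_reboot(mozpool_state, ping_down):
    """Reboot only boards that are free in mozpool AND not answering ping."""
    return mozpool_state == "free" and ping_down


def check_board(host, get_state, reboot, is_down=ping_is_down):
    """Reboot `host` if it qualifies; return True if a reboot was requested.

    `get_state` and `reboot` are hypothetical callables provided by the
    caller (for example thin wrappers around whatever mozpool exposes).
    """
    if should_reboot(get_state(host), is_down(host)):
        reboot(host)
        return True
    return False
```

Keeping the decision in `should_reboot` separate from the side effects means a board that is busy running a job is never rebooted just because it stopped answering ping.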
Reporter
Comment 2•12 years ago
Hmm, after building and flashing today's build, I haven't seen this problem yet. I'll keep this open for a while to make sure it's not a fluke.
Reporter
Comment 3•12 years ago
New problem!
It's now powered on, but not listed in adb devices. Weird.
Reporter
Comment 4•12 years ago
This was after running a few gaia smoketests, waiting an hour or so, then running them again. Mid-test, it went into this state.
Reporter
Comment 5•12 years ago
And now it just came back up all by itself. Hmm.
Reporter
Comment 6•12 years ago
After coming back online, it is unresponsive to adb shell and logcat. It is listed in adb devices, and lsusb, but there isn't much I can do other than that.
Comment 7•12 years ago
This just happened now. Logcat only shows lots of:
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ec1880) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ec1e80) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ecf7c0) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ecfd80) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ecffc0) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ed3780) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ed3cc0) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ef61c0) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ef6800) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ef6e80) failed -22 (Invalid argument)
(which happens before the freeze, too). The serial port shows only:
[ 32.051666] omapdss HDMI: ENTER hdmi_display_enable
[ 32.158691] omapdss DISPC error: timeout waiting for EVSYNC
[ 33.508697] misc dsscomp: [eceb4c00] ignoring set failure -22
[ 177.840911] adb_release
[ 177.841766] android_work: sent uevent USB_STATE=DISCONNECTED
[ 177.856170] adb_open
[ 177.877624] android_work: sent uevent USB_STATE=CONNECTED
[ 177.887542] android_work: sent uevent USB_STATE=DISCONNECTED
[ 177.945343] android_work: sent uevent USB_STATE=CONNECTED
[ 178.496856] android_usb gadget: high speed config #1: android
[ 178.505981] android_work: sent uevent USB_STATE=CONFIGURED
Comment 8•12 years ago
Note that the display is also frozen, not blanked. I see the Gaia lock screen.
Comment 9•12 years ago
Here's what I see on serial:
[ 6338.648010] omapdss HDMI: ENTER hdmi_display_enable
[ 6338.751373] omapdss DISPC error: timeout waiting for EVSYNC
[ 6338.759490] omap_thermal_unthrottle: temperature reduced, ending cpu throttling
[ 6338.940246] misc dsscomp: [ecbd3800] ignoring set failure -22
[ 6346.758758] omap_thermal_throttle: temperature too high, cpu throttle at max 90
[ 6347.766448] throttle_delayed_work_fn: OMAP temp read 66200 exceeds the threshod
[ 6347.782135] omap_thermal_throttle: temperature too high, cpu throttle at max 70
[ 6358.759307] omap_thermal_unthrottle: temperature reduced, ending cpu throttling
Not sure whether this means we have two issues, or that one or both sets of serial console output don't offer a clue to the problem.
In case these messages are relevant: My panda seems to be relatively warm to the touch, although the room it's in isn't hot by any means (I'd estimate maybe 22 or 23 degrees centigrade).
Comment 10•12 years ago
I just had this problem recur; I got no output on either serial port or logcat when it happened. :(
Comment 11•12 years ago
(In reply to Jonathan Griffin (:jgriffin) from comment #7)
> This just happened now. Log cat only shows lots of:
>
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ec1880) failed -22
> (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ec1e80) failed -22
> (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ecf7c0) failed -22
> (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ecfd80) failed -22
> (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ecffc0) failed -22
> (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ed3780) failed -22
> (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ed3cc0) failed -22
> (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ef61c0) failed -22
> (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ef6800) failed -22
> (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ef6e80) failed -22
> (Invalid argument)
>
> (which happens before the freeze, too). The serial port shows only:
This has already been reported in bug 801658.
(In reply to William Lachance (:wlach) from comment #9)
> Here's what I see on serial:
>
> [ 6338.648010] omapdss HDMI: ENTER hdmi_display_enable
>
> [ 6338.751373] omapdss DISPC error: timeout waiting for EVSYNC
>
> [ 6338.759490] omap_thermal_unthrottle: temperature reduced, ending cpu
> throttling
> [ 6338.940246] misc dsscomp: [ecbd3800] ignoring set failure -22
>
> [ 6346.758758] omap_thermal_throttle: temperature too high, cpu throttle at
> max 90
> [ 6347.766448] throttle_delayed_work_fn: OMAP temp read 66200 exceeds the
> threshod
> [ 6347.782135] omap_thermal_throttle: temperature too high, cpu throttle at
> max 70
> [ 6358.759307] omap_thermal_unthrottle: temperature reduced, ending cpu
> throttling
I've seen this too, but it doesn't seem critical. The value is reported in /sys/bus/platform/drivers/omap_temp_sensor/omap_temp_sensor.0/temperature. Normally my board runs between 50000 and 55000. Throttling the CPU is just a safety measure.
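For anyone who wants to log that value while waiting for a hang, here is a small sketch that reads the sysfs file named above. It assumes the raw number is in millidegrees Celsius (so 66200 would be 66.2 °C); that unit is an assumption and is not confirmed anywhere in this bug. The function names are hypothetical.

```python
from pathlib import Path

# Path as quoted in the comment above.
TEMP_PATH = (
    "/sys/bus/platform/drivers/omap_temp_sensor/"
    "omap_temp_sensor.0/temperature"
)


def parse_temperature(raw):
    """Convert the raw sysfs string to degrees Celsius.

    ASSUMPTION: the driver reports millidegrees Celsius.
    """
    return int(raw.strip()) / 1000.0


def read_temperature(path=TEMP_PATH):
    """Read and convert the current sensor value from `path`."""
    return parse_temperature(Path(path).read_text())
```

On the board itself, calling `read_temperature()` in a loop with a timestamp would show whether a freeze lines up with a temperature spike.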
Comment 12•12 years ago
I just managed to reproduce the problem and got this at the serial console:
> [ 290.180847] hub 1-1:1.0: port 1 disabled by hub (EMI?), re-enabling...
> [ 290.188720] usb 1-1.1: USB disconnect, device number 3
> [ 290.195404] smsc95xx 1-1.1:1.0: eth0: unregister 'smsc95xx' usb-ehci-omap.0-1.1, smsc95xx USB 2.0 Ethernet
> [ 290.409790] init: untracked pid 1413 exited
> [ 295.366882] hub 1-1:1.0: hub_port_status failed (err = -110)
> [ 295.375610] hub 1-1:1.0: connect-debounce failed, port 1 disabled
It looks like the USB port fails after some time.
I checked the reported temperature, but it was only ~45000.
Comment 13•12 years ago
Pid 1413 is the DHCP client, errno number 110 is ETIMEDOUT.
USB suspending is enabled in the kernel. Maybe we'll just need to disable it...
An EMI problem is reported here:
http://softsolder.com/2009/01/10/mysterious-usb-disconnects/
and solved here
http://softsolder.com/2009/01/28/usb-disconnects-nobody-moves-nobody-gets-hurt/
/me is wondering if we need to ground the PandaBoards or put them into metal boxes...
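As a reading aid for the negative error codes in the logs in this bug, the sketch below maps them to their Linux errno names. The table is hard-coded with the Linux generic values (22 EINVAL, 71 EPROTO, 110 ETIMEDOUT) instead of using Python's `errno` module, because that module reflects the host running the script, which may number the codes differently from the board's kernel. The regular expression and function name are hypothetical and only cover the message shapes seen here.

```python
import re

# Linux generic errno values; only the codes that appear in this bug.
LINUX_ERRNO = {22: "EINVAL", 71: "EPROTO", 110: "ETIMEDOUT"}

# Matches "err = -110", "status -71", "failed -22" and "failure -22".
ERR_RE = re.compile(r"(?:err = |status |failed |failure )-(\d+)")


def decode_errors(line):
    """Return the errno names for every negative code found in `line`."""
    return [LINUX_ERRNO.get(int(n), "unknown") for n in ERR_RE.findall(line)]
```

For example, `decode_errors("hub 1-1:1.0: hub_port_status failed (err = -110)")` yields the ETIMEDOUT name mentioned above.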
Comment 14•12 years ago
(In reply to Thomas Zimmermann [:tzimmermann] from comment #13)
> USB suspending is enabled in the kernel. Maybe we'll just need to disable
> it...
Nope, didn't help.
Comment 15•12 years ago
I looked deeper into this today and it really seems to be a problem in the USB chipset. After the USB port failed, I got a number of debugging messages like the ones below.
> [ 523.298522] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 1
> [ 523.306427] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 2
> [ 523.314239] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 3
> [ 523.322052] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 4
> [ 523.329681] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 5
> [ 523.337402] ehci-omap ehci-omap.0: detected XactErr len 0/16 retry 1
> [ 523.344696] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 6
> [ 523.352447] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 7
> [ 523.359985] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 8
> [ 523.367736] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 9
> [ 523.375427] ehci-omap ehci-omap.0: detected XactErr len 0/16 retry 2
> [ 523.382751] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 10
> [ 523.390563] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 11
> [ 523.398162] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 12
> [ 523.406005] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 13
> [ 523.413848] ehci-omap ehci-omap.0: detected XactErr len 0/16 retry 3
> [ 523.421112] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 14
> [ 523.428924] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 15
> [ 523.430114] hub 1-1:1.0: state 7 ports 5 chg 0000 evt 0002
> [ 523.442962] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 16
> [ 524.452880] usb 1-1: khubd timed out on ep0in len=0/4
> [ 525.468566] usb 1-1: khubd timed out on ep0in len=4/4
> [ 526.484100] usb 1-1: khubd timed out on ep0in len=4/4
> [ 527.500488] usb 1-1: khubd timed out on ep0in len=4/4
> [ 528.516143] usb 1-1: khubd timed out on ep0in len=4/4
> [ 528.522857] hub 1-1:1.0: hub_port_status failed (err = -110)
> [ 875.439697] ehci-omap ehci-omap.0: detected XactErr len 0/8 retry 31
> [ 875.440368] ehci-omap ehci-omap.0: devpath 1.2 ep0out 3strikes
> [ 875.440368] usb 1-1: clear tt buffer port 2, a4 ep0 t00080248
> [ 875.441040] ehci-omap ehci-omap.0: reused qh e42f2d00 schedule
> [ 875.441101] usb 1-1.2: link qh8-0e01/e42f2d00 start 3 [1/2 us]
> [ 875.441101] generic-usb 0003:046D:C03E.0001: can't reset device, ehci-omap.0-1.2/input0, status -71
It's not predictable when this happens, but it is always reproducible. I tried various changes to the kernel config, but none made a difference.
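Since the failure is intermittent, it may help to scan captured serial logs for the USB messages quoted in this bug. The sketch below counts a few of those signatures per log. The pattern names and the `summarize` function are hypothetical, and the patterns are taken only from the excerpts above, so they are not a complete list of what the kernel can print.

```python
import re

# Signatures copied from the serial console excerpts in this bug.
SIGNATURES = {
    "xact_err": re.compile(r"detected XactErr"),
    "hub_port_status_failed": re.compile(r"hub_port_status failed"),
    "usb_disconnect": re.compile(r"USB disconnect"),
    "port_disabled": re.compile(r"port \d+ disabled"),
}


def summarize(lines):
    """Count how many lines match each known USB failure signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Running `summarize(open("serial.log"))` on a capture (the file name is just an example) gives a quick indication of whether a given hang shows the same USB pattern or something else, such as the silent hang in comment 10.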
Comment 16•12 years ago
I'm removing bug 802317 from the blocking list.
Regardless of the state that a panda gets into, our automation will force a re-image (thanks to mozpool) before assigning a job and running tests. This bug does not block releng's setup.
If you believe I'm missing something please re-add the bug and let me know what I missed.
No longer blocks: 802317
Updated•12 years ago
Comment 17•9 years ago
We are no longer using pandas at Mozilla.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Updated•6 years ago
Product: Core → Core Graveyard