821379 - Pandaboard will become unresponsive after idling

Malini Das [:mdas] - Away, not checking bugmail

Reporter

Description

•

12 years ago

If you let the pandaboard idle, it will eventually become unresponsive. If you run 'adb devices', it just hangs. You have to reboot the board for it to work again. The problem is intermittent and shows up after a random period of idle time.

Armen [:armenzg]

Comment 1

•

12 years ago

On the releng side, this shows up as a nagios check reporting the board as ping down. It can be easily fixed with a script: if the board's mozpool status is free and ping is down, we can just reboot the device without asking any further questions.
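A minimal sketch of the check described above, in Python. The `get_mozpool_state`, `is_pingable` and `reboot` callables are hypothetical placeholders: the real mozpool API and the power-cycle mechanism are not shown in this bug and would have to be wired in by whoever deploys it.

```python
from typing import Callable


def should_reboot(mozpool_state: str, ping_ok: bool) -> bool:
    """Reboot only when the board is unassigned ("free") and not answering ping."""
    return mozpool_state == "free" and not ping_ok


def check_board(
    name: str,
    get_mozpool_state: Callable[[str], str],  # hypothetical: returns e.g. "free"
    is_pingable: Callable[[str], bool],       # hypothetical: True if ping succeeds
    reboot: Callable[[str], None],            # hypothetical: power-cycles the board
) -> bool:
    """Reboot the board if it is free and ping-down; return True if rebooted."""
    if should_reboot(get_mozpool_state(name), is_pingable(name)):
        reboot(name)
        return True
    return False
```

Keeping the decision in `should_reboot` separate from the side effects makes it possible to test the rule without touching a real board.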

Malini Das [:mdas] - Away, not checking bugmail

Reporter

Comment 2

•

12 years ago

Hmm, after building and flashing today's build, I haven't seen this problem yet. I'll keep this open for a while to make sure it's not a fluke.

Malini Das [:mdas] - Away, not checking bugmail

Reporter

Comment 3

•

12 years ago

New problem! It's now powered on, but not listed in adb devices. Weird.

Malini Das [:mdas] - Away, not checking bugmail

Reporter

Comment 4

•

12 years ago

This was after running a few gaia smoketests, waiting an hour or so, then running them again. Mid-test, it went into this state.

Malini Das [:mdas] - Away, not checking bugmail

Reporter

Comment 5

•

12 years ago

And now it just came back up all by itself. Hmm.

Malini Das [:mdas] - Away, not checking bugmail

Reporter

Comment 6

•

12 years ago

After coming back online, it is unresponsive to adb shell and logcat. It is listed in adb devices, and lsusb, but there isn't much I can do other than that.

Jonathan Griffin (:jgriffin)

Comment 7

•

12 years ago

This just happened now. Logcat only shows lots of:

W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ec1880) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ec1e80) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ecf7c0) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ecfd80) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ecffc0) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ed3780) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ed3cc0) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ef61c0) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ef6800) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ef6e80) failed -22 (Invalid argument)

(which happens before the freeze, too). The serial port shows only:

[ 32.051666] omapdss HDMI: ENTER hdmi_display_enable
[ 32.158691] omapdss DISPC error: timeout waiting for EVSYNC
[ 33.508697] misc dsscomp: [eceb4c00] ignoring set failure -22
[ 177.840911] adb_release
[ 177.841766] android_work: sent uevent USB_STATE=DISCONNECTED
[ 177.856170] adb_open
[ 177.877624] android_work: sent uevent USB_STATE=CONNECTED
[ 177.887542] android_work: sent uevent USB_STATE=DISCONNECTED
[ 177.945343] android_work: sent uevent USB_STATE=CONNECTED
[ 178.496856] android_usb gadget: high speed config #1: android
[ 178.505981] android_work: sent uevent USB_STATE=CONFIGURED

Jonathan Griffin (:jgriffin)

Comment 8

•

12 years ago

Note that the display is also frozen, not blanked. I see the Gaia lock screen.

William Lachance (:wlach)

Comment 9

•

12 years ago

Here's what I see on serial:

[ 6338.648010] omapdss HDMI: ENTER hdmi_display_enable
[ 6338.751373] omapdss DISPC error: timeout waiting for EVSYNC
[ 6338.759490] omap_thermal_unthrottle: temperature reduced, ending cpu throttling
[ 6338.940246] misc dsscomp: [ecbd3800] ignoring set failure -22
[ 6346.758758] omap_thermal_throttle: temperature too high, cpu throttle at max 90
[ 6347.766448] throttle_delayed_work_fn: OMAP temp read 66200 exceeds the threshod
[ 6347.782135] omap_thermal_throttle: temperature too high, cpu throttle at max 70
[ 6358.759307] omap_thermal_unthrottle: temperature reduced, ending cpu throttling

Not sure whether this means we have two issues, or that one or both sets of serial console output don't offer a clue to the problem.

In case these messages are relevant: My panda seems to be relatively warm to the touch, although the room it's in isn't hot by any means (I'd estimate maybe 22 or 23 degrees centigrade).

Jonathan Griffin (:jgriffin)

Comment 10

•

12 years ago

I just had this problem recur; I got no output on either serial port or logcat when it happened. :(

Thomas Zimmermann [:tzimmermann] [:tdz]

Comment 11

•

12 years ago

(In reply to Jonathan Griffin (:jgriffin) from comment #7)
> This just happened now. Log cat only shows lots of:
>
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ec1880) failed -22 (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ec1e80) failed -22 (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ecf7c0) failed -22 (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ecfd80) failed -22 (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ecffc0) failed -22 (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ed3780) failed -22 (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ed3cc0) failed -22 (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ef61c0) failed -22 (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ef6800) failed -22 (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ef6e80) failed -22 (Invalid argument)
>
> (which happens before the freeze, too). The serial port shows only:

This has already been reported in bug 801658.

(In reply to William Lachance (:wlach) from comment #9)
> Here's what I see on serial:
>
> [ 6338.648010] omapdss HDMI: ENTER hdmi_display_enable
> [ 6338.751373] omapdss DISPC error: timeout waiting for EVSYNC
> [ 6338.759490] omap_thermal_unthrottle: temperature reduced, ending cpu throttling
> [ 6338.940246] misc dsscomp: [ecbd3800] ignoring set failure -22
> [ 6346.758758] omap_thermal_throttle: temperature too high, cpu throttle at max 90
> [ 6347.766448] throttle_delayed_work_fn: OMAP temp read 66200 exceeds the threshod
> [ 6347.782135] omap_thermal_throttle: temperature too high, cpu throttle at max 70
> [ 6358.759307] omap_thermal_unthrottle: temperature reduced, ending cpu throttling

I've seen this too, but it doesn't seem critical. The value is reported in /sys/bus/platform/drivers/omap_temp_sensor/omap_temp_sensor.0/temperature. Normally my board runs between 50000 and 55000. Throttling the CPU is just a safety measure.
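For anyone who wants to log the sensor value while waiting for a freeze, here is a rough Python sketch. It assumes the sysfs file contains a single integer, as the readings quoted in this bug suggest; the units are not stated in this thread, so the value is left unconverted. The `usual_max` default of 55000 is only the upper end of the range mentioned above, not a kernel threshold.

```python
from pathlib import Path

# Path quoted in the comment above; it exists only on a board running this kernel.
TEMP_PATH = Path(
    "/sys/bus/platform/drivers/omap_temp_sensor/omap_temp_sensor.0/temperature"
)


def parse_temperature(raw: str) -> int:
    """Parse the sensor file contents (assumed to be one integer) into an int."""
    return int(raw.strip())


def read_temperature(path: Path = TEMP_PATH) -> int:
    """Read the current sensor value from sysfs."""
    return parse_temperature(path.read_text())


def is_hotter_than_usual(reading: int, usual_max: int = 55000) -> bool:
    """True if the reading is above the range this board normally runs in."""
    return reading > usual_max
```

Calling `read_temperature()` in a loop from a host-side script (for example over `adb shell cat`) would need extra plumbing that is not shown here.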

Thomas Zimmermann [:tzimmermann] [:tdz]

Comment 12

•

12 years ago

I just managed to reproduce the problem and got this at the serial console:

> [ 290.180847] hub 1-1:1.0: port 1 disabled by hub (EMI?), re-enabling...
> [ 290.188720] usb 1-1.1: USB disconnect, device number 3
> [ 290.195404] smsc95xx 1-1.1:1.0: eth0: unregister 'smsc95xx' usb-ehci-omap.0-1.1, smsc95xx USB 2.0 Ethernet
> [ 290.409790] init: untracked pid 1413 exited
> [ 295.366882] hub 1-1:1.0: hub_port_status failed (err = -110)
> [ 295.375610] hub 1-1:1.0: connect-debounce failed, port 1 disabled

It looks like the USB port fails after some time. I checked the reported temperature, but it was only ~45000.

Thomas Zimmermann [:tzimmermann] [:tdz]

Comment 13

•

12 years ago

Pid 1413 is the DHCP client, errno number 110 is ETIMEDOUT.

USB suspending is enabled in the kernel. Maybe we'll just need to disable it...

An EMI problem is reported here: http://softsolder.com/2009/01/10/mysterious-usb-disconnects/ and solved here: http://softsolder.com/2009/01/28/usb-disconnects-nobody-moves-nobody-gets-hurt/

/me is wondering if we need to ground the PandaBoards or put them into metal boxes...
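As a quick reference for the negative status codes that show up in the logs in this bug (-22, -71, -110), here is a small Python lookup. The table is hard-coded with the Linux errno values rather than taken from the host's `errno` module, since the numbers differ on other platforms; `describe_status` is just an illustrative helper name.

```python
# Linux errno values for the status codes seen in the logs in this bug.
LINUX_ERRNO = {
    22: ("EINVAL", "Invalid argument"),
    71: ("EPROTO", "Protocol error"),
    110: ("ETIMEDOUT", "Connection timed out"),
}


def describe_status(status: int) -> str:
    """Translate a kernel-style negative status (e.g. -110) into its errno name."""
    name, text = LINUX_ERRNO.get(abs(status), ("UNKNOWN", "not in table"))
    return f"{status}: {name} ({text})"


if __name__ == "__main__":
    for status in (-22, -71, -110):
        print(describe_status(status))
```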

Thomas Zimmermann [:tzimmermann] [:tdz]

Comment 14

•

12 years ago

(In reply to Thomas Zimmermann [:tzimmermann] from comment #13)
> USB suspending is enabled in the kernel. Maybe we'll just need to disable it...

Nope, didn't help.

Thomas Zimmermann [:tzimmermann] [:tdz]

Comment 15

•

12 years ago

I looked deeper into this today and it really seems to be a problem in the USB chipset. After the USB port failed, I get a number of debugging messages like the ones below.

> [ 523.298522] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 1
> [ 523.306427] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 2
> [ 523.314239] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 3
> [ 523.322052] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 4
> [ 523.329681] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 5
> [ 523.337402] ehci-omap ehci-omap.0: detected XactErr len 0/16 retry 1
> [ 523.344696] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 6
> [ 523.352447] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 7
> [ 523.359985] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 8
> [ 523.367736] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 9
> [ 523.375427] ehci-omap ehci-omap.0: detected XactErr len 0/16 retry 2
> [ 523.382751] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 10
> [ 523.390563] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 11
> [ 523.398162] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 12
> [ 523.406005] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 13
> [ 523.413848] ehci-omap ehci-omap.0: detected XactErr len 0/16 retry 3
> [ 523.421112] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 14
> [ 523.428924] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 15
> [ 523.430114] hub 1-1:1.0: state 7 ports 5 chg 0000 evt 0002
> [ 523.442962] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 16
> [ 524.452880] usb 1-1: khubd timed out on ep0in len=0/4
> [ 525.468566] usb 1-1: khubd timed out on ep0in len=4/4
> [ 526.484100] usb 1-1: khubd timed out on ep0in len=4/4
> [ 527.500488] usb 1-1: khubd timed out on ep0in len=4/4
> [ 528.516143] usb 1-1: khubd timed out on ep0in len=4/4
> [ 528.522857] hub 1-1:1.0: hub_port_status failed (err = -110)

> [ 875.439697] ehci-omap ehci-omap.0: detected XactErr len 0/8 retry 31
> [ 875.440368] ehci-omap ehci-omap.0: devpath 1.2 ep0out 3strikes
> [ 875.440368] usb 1-1: clear tt buffer port 2, a4 ep0 t00080248
> [ 875.441040] ehci-omap ehci-omap.0: reused qh e42f2d00 schedule
> [ 875.441101] usb 1-1.2: link qh8-0e01/e42f2d00 start 3 [1/2 us]
> [ 875.441101] generic-usb 0003:046D:C03E.0001: can't reset device, ehci-omap.0-1.2/input0, status -71

I can't predict when it will happen, but it always happens eventually. I tried various changes to the kernel config, but none made a difference.
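When comparing several serial captures like the one above, a small script can help to summarize them. The Python sketch below is only an illustration: the regular expressions are written against the message formats quoted in this comment, and the `summarize` helper and its output keys are made up for this example.

```python
import re
from collections import Counter

# "[ 523.298522] message" with an optional leading "> " quote marker.
LINE_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+(.*)")
XACT_RE = re.compile(r"detected XactErr len (\d+)/(\d+) retry (\d+)")


def summarize(log_text: str) -> dict:
    """Count XactErr messages per transfer length and find the first hub failure."""
    xact_by_len = Counter()
    first_hub_failure = None
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue  # not a timestamped kernel message
        timestamp, message = float(match.group(1)), match.group(2)
        xact = XACT_RE.search(message)
        if xact:
            xact_by_len[int(xact.group(2))] += 1
        elif "hub_port_status failed" in message and first_hub_failure is None:
            first_hub_failure = timestamp
    return {"xact_by_len": dict(xact_by_len), "first_hub_failure": first_hub_failure}
```

Feeding it the text of a capture returns the number of XactErr lines grouped by the second number in `len a/b`, plus the timestamp of the first `hub_port_status failed` line, if any.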

Armen [:armenzg]

Comment 16

•

12 years ago

I'm removing bug 802317 from the blocking list. Regardless of the state that a panda gets into, our automation will force a re-image (thanks to mozpool) before assigning a job and running tests. This bug does not block releng's setup. If you believe I'm missing something please re-add the bug and let me know what I missed.

No longer blocks: 802317

Thomas Zimmermann [:tzimmermann] [:tdz]

Updated

•

12 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=840109

Justin Wood (:Callek)

Comment 17

•

9 years ago

No longer using pandas at Mozilla.

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → WONTFIX

BMO Automation

Updated

•

6 years ago

Product: Core → Core Graveyard

Categories

(Core Graveyard :: Widget: Gonk, defect)

Tracking

(Not tracked)

People

(Reporter: mdas, Unassigned)
