Closed
Bug 1141962
Opened 10 years ago
Closed 9 years ago
[MTBF] Device random reboot with reason wdog_bark
Categories
(Firefox OS Graveyard :: Stability, defect)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: ashiue, Assigned: ting)
References
Details
Attachments
(1 file)
(deleted),
text/x-vhdl
|
Details |
MTBF testing aborted(test device has been reboot and marionette connection lost).
After checking log, and found lots of "E MP-Decision: Error 1 setting online status to 1 forcpuX" error in logcat.
Also see some log "Sending message to frozen Nuwa", not sure if this is expected?
<build info>
Build ID 20150310162504
Gaia Revision 5af6f8d5d6161dea02002634c6d0a570a122e5dd
Gaia Date 2015-03-10 19:17:12
Gecko Revision https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/ec87adb8cf13
Gecko Version 37.0
Device Name hammerhead
Firmware(Release) 5.0
Firmware(Incrmental) eng.cltbld.20150310.201846
Firmware Date Tue Mar 10 20:19:04 EDT 2015
Bootloader HHZ12d
Reporter | ||
Updated•10 years ago
|
QA Whiteboard: [COM=MTBF-Nexus5L]
Assignee | ||
Updated•10 years ago
|
Assignee: nobody → janus926
Status: NEW → ASSIGNED
Assignee | ||
Comment 3•10 years ago
|
||
A crash dump with symbol will be helpful. I'll check what are the messages about.
Reporter | ||
Comment 4•10 years ago
|
||
The device did not crash, so there was no crash report.
Assignee | ||
Comment 5•10 years ago
|
||
(In reply to Alison Shiue from comment #0)
> MTBF testing aborted(test device has been reboot and marionette connection
> lost).
QA doubt it has been reboot, but there's no crash dump.
"Sending message to frozen Nuwa" means someone is sending message to Nuwa after it's frozen which is not expected. But the log is not sufficient to see what the message is, I'll try to add some more logs.
"MP-Decision: Error 1 setting online status to 1 forcpuX" should be from QC's proprietary module mpdecision, it seems failed setting cpuX online. We need their input for how to debug this.
Assignee | ||
Comment 6•10 years ago
|
||
(In reply to Ting-Yu Chou [:ting] from comment #5)
> "MP-Decision: Error 1 setting online status to 1 forcpuX" should be from
> QC's proprietary module mpdecision, it seems failed setting cpuX online. We
> need their input for how to debug this.
Bhavana, I'm not sure whom to ask. But how can I have QC's comment regarding this?
Flags: needinfo?(bbajaj)
Assignee | ||
Comment 7•10 years ago
|
||
(In reply to Ting-Yu Chou [:ting] from comment #5)
> "Sending message to frozen Nuwa" means someone is sending message to Nuwa
> after it's frozen which is not expected. But the log is not sufficient to
> see what the message is, I'll try to add some more logs.
The message is PHal::Msg_NotifyNetworkChange, and happens whenever enable/disable WiFi. It does not cause any reboot or crash so don't care for MTBF.
Comment 8•10 years ago
|
||
(In reply to Ting-Yu Chou [:ting] from comment #6)
> (In reply to Ting-Yu Chou [:ting] from comment #5)
> > "MP-Decision: Error 1 setting online status to 1 forcpuX" should be from
> > QC's proprietary module mpdecision, it seems failed setting cpuX online. We
> > need their input for how to debug this.
>
> Bhavana, I'm not sure whom to ask. But how can I have QC's comment regarding
> this?
NI :inder from CAF who should be able to here.
Inder, it seems we are seeing this error message in a CAF module that could be relating to aborts, any thoughts ?
Flags: needinfo?(bbajaj) → needinfo?(ikumar)
Bhavana - No, we haven't seen such an error in our setup and not sure what could be causing it.
Flags: needinfo?(ikumar)
Updated•10 years ago
|
blocking-b2g: --- → 2.2?
Assignee | ||
Comment 10•10 years ago
|
||
Since there's no crash dump nor noticable logs, I tried to attach gdb to b2g process see if I can get anything. The device shutdown at the first try, and in dmesg there're a lot of:
_cpu_up: attempt to bring up CPU 3 failed
So maybe this does not cause the bug. Also I saw this:
healthd: battery l=70 v=3963 t=44.3 h=3 st=2 c=181 chg=u
But still nothing noticable, gdb showed "Remote connection closed". Will try another run.
Comment 11•10 years ago
|
||
Lets try to catch in gdb on the next run and we can try to get help from CAF as necessary.
For now, I am blocking on this as we basically cannot get ahead on MTBF hours on n-5L
blocking-b2g: 2.2? → 2.2+
Assignee | ||
Comment 12•10 years ago
|
||
I hit the issue at 2nd run, but gdb doesn't catch anything, still "Remote conneciton closed"...
Assignee | ||
Comment 13•10 years ago
|
||
I have observed twice that the device ran till out of battery, but I haven't seen it back alive.
I placed MOZ_CRASH() at PowerManager::Shutdown() and PowerManager::Reboot() to make sure it's not our code to reboot/shutdown the device also added some logs to BatteryManager. Alice is trying another run.
Assignee | ||
Comment 14•10 years ago
|
||
(In reply to Ting-Yu Chou [:ting] from comment #13)
> logs to BatteryManager. Alice is trying another run.
It's Alison...
Assignee | ||
Comment 15•10 years ago
|
||
There's a file /sys/class/power_supply/battery/capacity shows the battery's level.
Assignee | ||
Comment 16•10 years ago
|
||
2 devices were out of battery and both in shutdown state when entered the office this morning. Will try another run.
Reporter | ||
Comment 17•10 years ago
|
||
After running about 9.9 hours last Friday, the debug build hit the issue of device connection lost.
Please refer the log info: https://drive.google.com/a/mozilla.com/folderview?id=0B_lQ-lNDWe2oZS1saVM2QmdDVFk&usp=drive_web
Assignee | ||
Comment 18•10 years ago
|
||
The device I left last Firday is off this morning, but it has a crash report, not sure does it relate to this bug:
Operating system: Android
0.0.0 Linux 3.4.0-g42750ec-00031-gfa25ffd #9 SMP PREEMPT Wed Nov 19 15:22:54 CST 2014 armv7l Android/full_hammerhead/hammerhead:5.0/LRX21T/ting03171018:eng/test-keys
CPU: arm
4 CPUs
Crash reason: SIGSEGV
Crash address: 0x5a5a5a5e
Thread 160 (crashed)
0 bluetooth.default.so!fixed_queue_enqueue [fixed_queue.c : 88 + 0x0]
r4 = 0x5a5a5a5a r5 = 0x5a5a5a66 r6 = 0xb1fc4a99 r7 = 0x00000000
r8 = 0x00000000 r9 = 0x00000c03 r10 = 0x00000c35 fp = 0xb1fff9ed
sp = 0x952e3c58 lr = 0xb1fc7599 pc = 0xb1fc76a8
Found by: given as instruction pointer in context
1 bluetooth.default.so!thread_post [thread.c : 126 + 0x7]
r4 = 0x8d19fa98 r5 = 0x84e7fbb0 r6 = 0xb1fc4a99 r7 = 0x00000000
r8 = 0x00000000 r9 = 0x00000c03 r10 = 0x00000c35 fp = 0xb1fff9ed
sp = 0x952e3c68 pc = 0xb1fc7599
Found by: call frame info
2 bluetooth.default.so!bthc_tx [bt_hci_bdroid.c : 233 + 0xf]
r4 = 0x8b408388 r5 = 0xb2024738 r6 = 0x00000001 r7 = 0x00000000
r8 = 0x00000000 r9 = 0x00000c03 r10 = 0x00000c35 fp = 0xb1fff9ed
sp = 0x952e3c88 pc = 0xb1fc5003
Found by: call frame info
3 bluetooth.default.so!transmit_buf [bt_hci_bdroid.c : 428 + 0x3]
r4 = 0x8b408388 r5 = 0xb210108c r6 = 0x00000001 r7 = 0x00000000
r8 = 0x00000000 r9 = 0x00000c03 r10 = 0x00000c35 fp = 0xb1fff9ed
sp = 0x952e3c98 pc = 0xb1fc502b
Found by: call frame info
4 bluetooth.default.so!bte_main_hci_send [bte_main.c : 597 + 0x9]
r4 = 0x8b408388 r5 = 0xb210108c r6 = 0x00000001 r7 = 0x00000000
r8 = 0x00000000 r9 = 0x00000c03 r10 = 0x00000c35 fp = 0xb1fff9ed
sp = 0x952e3ca0 pc = 0xb1f1a1c7
Found by: call frame info
5 bluetooth.default.so!btu_hcif_send_cmd [btu_hcif.c : 502 + 0x9]
r4 = 0x8b408388 r5 = 0xb210108c r6 = 0x00000001 r7 = 0x00000000
r8 = 0x00000000 r9 = 0x00000c03 r10 = 0x00000c35 fp = 0xb1fff9ed
sp = 0x952e3ca8 pc = 0xb1fa4c69
Found by: call frame info
6 bluetooth.default.so!btu_hcif_process_event [btu_hcif.c : 1531 + 0x7]
r4 = 0x84e80015 r5 = 0x00002006 r6 = 0x00000000 r7 = 0x00000000
r8 = 0xb21010a8 r9 = 0xb210108c r10 = 0xb2101098 fp = 0x00000001
sp = 0x952e3cd8 pc = 0xb1fa52cb
Found by: call frame info
7 bluetooth.default.so!btu_task [btu_task.c : 259 + 0x7]
r4 = 0x84e80008 r5 = 0xb2012ad8 r6 = 0x00000005 r7 = 0xb1fffda4
r8 = 0xb1fffd0f r9 = 0xb1fffd45 r10 = 0xb2101008 fp = 0x00000002
sp = 0x952e3d48 pc = 0xb1fa5d63
Found by: call frame info
8 bluetooth.default.so!gki_task_entry [gki_ulinux.c : 263 + 0x5]
r4 = 0xb2101508 r5 = 0xb210025c r6 = 0x00000000 r7 = 0xb1ff079e
r8 = 0xade24e88 r9 = 0xade24e88 r10 = 0xb6eab2c5 fp = 0x952e3db0
sp = 0x952e3d78 pc = 0xb1f754eb
Found by: call frame info
9 libc.so!__pthread_start [pthread_create.cpp : 141 + 0x5]
r4 = 0xade24e80 r5 = 0xade24ec0 r6 = 0xade24e80 r7 = 0x00000078
r8 = 0xade24e88 r9 = 0xade24e88 r10 = 0xb6eab2c5 fp = 0x952e3db0
sp = 0x952e3d98 pc = 0xb6eab2e5
Found by: call frame info
10 libc.so!__start_thread [clone.cpp : 41 + 0x3]
r4 = 0xade24e88 r5 = 0xb6eab2c5 r6 = 0xade24e80 r7 = 0x00000078
r8 = 0xade24e88 r9 = 0xade24e88 r10 = 0xb6eab2c5 fp = 0x952e3db0
sp = 0x952e3da8 pc = 0xb6ea92d5
Found by: call frame info
Assignee | ||
Comment 19•10 years ago
|
||
(In reply to Alison Shiue from comment #17)
> After running about 9.9 hours last Friday, the debug build hit the issue of
> device connection lost.
> Please refer the log info:
> https://drive.google.com/a/mozilla.com/folderview?id=0B_lQ-
> lNDWe2oZS1saVM2QmdDVFk&usp=drive_web
Since there's no crash dump, so it is not a reboot/poweroff from PowerManager. Also, the last BatteryManager::Notify() call still has 1/4 battery:
logcat204
03-21 00:05:53.839 8885 8885 D x : BatteryManager::Notify pLevel=0.240000 pCharging=1 mLevel=0.230000 mCharging=1
which is about 18 mins ago before the last line of logcat:
logcat209
03-21 00:23:01.152 869 884 E MP-Decision: Error 1 setting online status to 1 forcpu3
So it also seems not out of battery.
Inder, are there any suggestions for such a random device reboot that dmesg shows no panic log also there's no crash dump?
Flags: needinfo?(ikumar)
Assignee | ||
Comment 20•10 years ago
|
||
(In reply to Ting-Yu Chou [:ting] from comment #18)
> The device I left last Firday is off this morning, but it has a crash
> report, not sure does it relate to this bug:
This should be a different issue since the bug is device random reboot.
Assignee | ||
Comment 21•10 years ago
|
||
(In reply to Ting-Yu Chou [:ting] from comment #19)
> Inder, are there any suggestions for such a random device reboot that dmesg
> shows no panic log also there's no crash dump?
This is the last message from /proc/last_kmsg:
[36170.628019] PDN configure_ldo_or_hs_one: could not switch krait3 to ldo rc = -6
[36170.675940] _cpu_up: attempt to bring up CPU 1 failed
[36170.677219] _cpu_up: attempt to bring up CPU 2 failed
[36171.745205] Watchdog bark! Now = 36171.745199
[36171.745259] Watchdog last pet at 36160.744938
[36171.745351] cpu alive mask from last pet 0
[36171.745397] Causing a watchdog bite!
No errors detected
Boot info:
Last boot reason: wdog_bark
Assignee | ||
Updated•10 years ago
|
Summary: [MTBF][Nexus5-L] MTBF testing aborted and found lots of "E MP-Decision: Error 1 setting online status to 1 forcpuX" error in logcat → [MTBF][Nexus5-L] MTBF testing aborted and the device is random reboot with last boot reason wdog_bark
Assignee | ||
Updated•10 years ago
|
Summary: [MTBF][Nexus5-L] MTBF testing aborted and the device is random reboot with last boot reason wdog_bark → [MTBF][Nexus5-L] MTBF testing aborted and the device random reboot with reason wdog_bark
Comment 22•10 years ago
|
||
According to comment 21, should we ask partner to check this bug?
Assignee | ||
Comment 23•10 years ago
|
||
(In reply to Ting-Yu Chou [:ting] from comment #21)
> [36171.745205] Watchdog bark! Now = 36171.745199
> [36171.745259] Watchdog last pet at 36160.744938
> [36171.745351] cpu alive mask from last pet 0
> [36171.745397] Causing a watchdog bite!
> No errors detected
> Boot info:
> Last boot reason: wdog_bark
There're no files under /data/dontpanic.
Assignee | ||
Comment 24•10 years ago
|
||
(In reply to Ken Chang[:ken] from comment #22)
> According to comment 21, should we ask partner to check this bug?
NI'd Inder in comment 19 already.
Comment 25•10 years ago
|
||
Sorry, I am not sure what branch Nexus uses nor do I have any clues from the info here. Let me know if you are able to reproduce on the ZTE L based device or on Flame and we can look into it.
Flags: needinfo?(ikumar)
Reporter | ||
Comment 26•10 years ago
|
||
2 devices with Ting-Yu's debug build hit this issue again, please refer log information for both devices: https://drive.google.com/a/mozilla.com/folderview?id=0B_lQ-lNDWe2oaHhmVTVhbHQzOVU&usp=drive_web
Assignee | ||
Comment 27•10 years ago
|
||
(In reply to Alison Shiue from comment #26)
> 2 devices with Ting-Yu's debug build hit this issue again, please refer log
> information for both devices:
> https://drive.google.com/a/mozilla.com/folderview?id=0B_lQ-
> lNDWe2oaHhmVTVhbHQzOVU&usp=drive_web
So I put a dump_stack() call in wdog_bark_handler(), and the last_kmsg Alison attached are:
1.
16035.716846] [<c010e294>] (unwind_backtrace+0x0/0x140) from [<c0a3fec4>] (dump_stack+0x20/0x24)
[16035.716902] [<c0a3fec4>] (dump_stack+0x20/0x24) from [<c016d994>] (wdog_bark_handler+0x128/0x1d0)
[16035.716997] [<c016d994>] (wdog_bark_handler+0x128/0x1d0) from [<c01f9a20>] (handle_irq_event_percpu+0x94/0x2b4)
[16035.717088] [<c01f9a20>] (handle_irq_event_percpu+0x94/0x2b4) from [<c01f9c8c>] (handle_irq_event+0x4c/0x6c)
[16035.717181] [<c01f9c8c>] (handle_irq_event+0x4c/0x6c) from [<c01fcaa4>] (handle_fasteoi_irq+0xb0/0x180)
[16035.717272] [<c01fcaa4>] (handle_fasteoi_irq+0xb0/0x180) from [<c01f9278>] (generic_handle_irq+0x3c/0x4c)
[16035.717366] [<c01f9278>] (generic_handle_irq+0x3c/0x4c) from [<c0107d88>] (handle_IRQ+0x6c/0xc8)
[16035.717418] [<c0107d88>] (handle_IRQ+0x6c/0xc8) from [<c0100560>] (gic_handle_irq+0x50/0xf4)
[16035.717470] [<c0100560>] (gic_handle_irq+0x50/0xf4) from [<c0107000>] (__irq_svc+0x40/0x70)
[16035.717555] Exception stack(0xebf37d58 to 0xebf37da0)
[16035.717603] 7d40: c10c9d40 60000013
[16035.717692] 7d60: 47e447e4 00010000 c16fee00 00000400 c5a05ffc 00000400 ebf37dc8 00100100
[16035.717778] 7d80: c10c9ac0 ebf37dac 00000000 ebf37da0 c023bec8 c0a5210c 60000013 ffffffff
[16035.717871] [<c0107000>] (__irq_svc+0x40/0x70) from [<c0a5210c>] (_raw_spin_unlock_irqrestore+0x3c/0x74)
[16035.717930] [<c0a5210c>] (_raw_spin_unlock_irqrestore+0x3c/0x74) from [<c023bec8>] (release_pages+0x1b0/0x1e8)
[16035.718023] [<c023bec8>] (release_pages+0x1b0/0x1e8) from [<c02533a4>] (unmap_single_vma+0x428/0x6e4)
[16035.718113] [<c02533a4>] (unmap_single_vma+0x428/0x6e4) from [<c0253e70>] (unmap_vmas+0x68/0x7c)
[16035.718203] [<c0253e70>] (unmap_vmas+0x68/0x7c) from [<c0253f38>] (zap_page_range+0xb4/0x15c)
[16035.718292] [<c0253f38>] (zap_page_range+0xb4/0x15c) from [<c0251e30>] (sys_madvise+0x260/0x5d0)
[16035.718345] [<c0251e30>] (sys_madvise+0x260/0x5d0) from [<c0107400>] (ret_fast_syscall+0x0/0x30)
[16035.718430] Causing a watchdog bite!
No errors detected
Boot info:
Last boot reason: wdog_bark
2.
10885.616958] [<c010e294>] (unwind_backtrace+0x0/0x140) from [<c0a3fec4>] (dump_stack+0x20/0x24)
[10885.617089] [<c0a3fec4>] (dump_stack+0x20/0x24) from [<c016d994>] (wdog_bark_handler+0x128/0x1d0)
[10885.617223] [<c016d994>] (wdog_bark_handler+0x128/0x1d0) from [<c01f9a20>] (handle_irq_event_percpu+0x94/0x2b4)
[10885.617355] [<c01f9a20>] (handle_irq_event_percpu+0x94/0x2b4) from [<c01f9c8c>] (handle_irq_event+0x4c/0x6c)
[10885.617434] [<c01f9c8c>] (handle_irq_event+0x4c/0x6c) from [<c01fcaa4>] (handle_fasteoi_irq+0xb0/0x180)
[10885.617565] [<c01fcaa4>] (handle_fasteoi_irq+0xb0/0x180) from [<c01f9278>] (generic_handle_irq+0x3c/0x4c)
[10885.617697] [<c01f9278>] (generic_handle_irq+0x3c/0x4c) from [<c0107d88>] (handle_IRQ+0x6c/0xc8)
[10885.617827] [<c0107d88>] (handle_IRQ+0x6c/0xc8) from [<c0100560>] (gic_handle_irq+0x50/0xf4)
[10885.617955] [<c0100560>] (gic_handle_irq+0x50/0xf4) from [<c01071c0>] (__irq_usr+0x40/0x60)
[10885.618024] Exception stack(0xebd37fb0 to 0xebd37ff8)
[10885.618146] 7fa0: 00100000 ad38ff60 00002b4e 0000ff60
[10885.618271] 7fc0: ad3f9520 b2defcc0 00000001 00110000 00000001 b1e2a6e0 be7fd188 000054dc
[10885.618342] 7fe0: 00000000 be7fd120 b5b40b37 b5b40be2 00070030 ffffffff
[10885.618463] Causing a watchdog bite!
No errors detected
Boot info:
Last boot reason: wdog_bark
Assignee | ||
Comment 28•10 years ago
|
||
I should use abort() instead, will try another run.
Assignee | ||
Comment 29•10 years ago
|
||
(In reply to Ting-Yu Chou [:ting] from comment #28)
> I should use abort() instead, will try another run.
I am not sure how to generate core dump in kernel mode, I placed panic() and die() (with NULL registers) in the watchdog handler.
Reporter | ||
Comment 30•10 years ago
|
||
2 devices with new debug build hit reboot issue again, but the last boot reason are different, please refer: https://drive.google.com/a/mozilla.com/folderview?id=0B_lQ-lNDWe2odXBsZU1FaHBoLTA&usp=drive_web
Assignee | ||
Comment 31•10 years ago
|
||
(In reply to Alison Shiue from comment #30)
> 2 devices with new debug build hit reboot issue again, but the last boot
> reason are different, please refer:
> https://drive.google.com/a/mozilla.com/folderview?id=0B_lQ-
> lNDWe2odXBsZU1FaHBoLTA&usp=drive_web
The one with panic() have some output but still nothing helpful, there's no user space stack:
[17117.263699] Watchdog bark! Now = 17117.263684
[17117.263917] Watchdog last pet at 17103.677383
[17117.264280] cpu alive mask from last pet 0-3
[17117.264469] Causing a watchdog bite!
[17117.264662] Kernel panic - not syncing: Panic!
[17117.265231] [<c010e294>] (unwind_backtrace+0x0/0x140) from [<c0a3fe68>] (dump_stack+0x20/0x24)
[17117.265612] [<c0a3fe68>] (dump_stack+0x20/0x24) from [<c0a4076c>] (panic+0xb0/0x224)
[17117.265842] [<c0a4076c>] (panic+0xb0/0x224) from [<c016d9a4>] (wdog_bark_handler+0x138/0x164)
[17117.266222] [<c016d9a4>] (wdog_bark_handler+0x138/0x164) from [<c01f99b4>] (handle_irq_event_percpu+0x94/0x2b4)
[17117.266603] [<c01f99b4>] (handle_irq_event_percpu+0x94/0x2b4) from [<c01f9c20>] (handle_irq_event+0x4c/0x6c)
[17117.266991] [<c01f9c20>] (handle_irq_event+0x4c/0x6c) from [<c01fca38>] (handle_fasteoi_irq+0xb0/0x180)
[17117.267381] [<c01fca38>] (handle_fasteoi_irq+0xb0/0x180) from [<c01f920c>] (generic_handle_irq+0x3c/0x4c)
[17117.267766] [<c01f920c>] (generic_handle_irq+0x3c/0x4c) from [<c0107d88>] (handle_IRQ+0x6c/0xc8)
[17117.267990] [<c0107d88>] (handle_IRQ+0x6c/0xc8) from [<c0100560>] (gic_handle_irq+0x50/0xf4)
[17117.268367] [<c0100560>] (gic_handle_irq+0x50/0xf4) from [<c01071c0>] (__irq_usr+0x40/0x60)
[17117.268718] Exception stack(0xeb46ffb0 to 0xeb46fff8)
[17117.268918] ffa0: a34bc000 74800000 00000001 b4c97e01
[17117.269285] ffc0: a34bc030 00000000 00000000 002e001b 00000000 beaa33b4 beaa337c 0000000c
[17117.269645] ffe0: b6fb3e18 beaa32b8 b4c5b93f b4c97e00 600f0030 ffffffff
[17117.269850] CPU2: stopping
[17117.269899] [<c010e294>] (unwind_backtrace+0x0/0x140) from [<c0a3fe68>] (dump_stack+0x20/0x24)
[17117.269945] [<c0a3fe68>] (dump_stack+0x20/0x24) from [<c010cd0c>] (handle_IPI+0x1c0/0x1d4)
[17117.270021] [<c010cd0c>] (handle_IPI+0x1c0/0x1d4) from [<c01005c4>] (gic_handle_irq+0xb4/0xf4)
[17117.270096] [<c01005c4>] (gic_handle_irq+0xb4/0xf4) from [<c01071c0>] (__irq_usr+0x40/0x60)
[17117.270168] Exception stack(0xea441fb0 to 0xea441ff8)
[17117.270209] 1fa0: 00000000 b6fc7ec8 00000000 00000000
[17117.270284] 1fc0: b6858800 00000010 b6fc7ec8 00000001 beaa4ef0 b6804670 beaa4e98 b5f3d5d3
[17117.270325] 1fe0: b6fb3e54 beaa4db0 b6f855df b6f855d6 600e0030 ffffffff
[17117.270399] CPU1: stopping
[17117.270448] [<c010e294>] (unwind_backtrace+0x0/0x140) from [<c0a3fe68>] (dump_stack+0x20/0x24)
[17117.270523] [<c0a3fe68>] (dump_stack+0x20/0x24) from [<c010cd0c>] (handle_IPI+0x1c0/0x1d4)
[17117.270599] [<c010cd0c>] (handle_IPI+0x1c0/0x1d4) from [<c01005c4>] (gic_handle_irq+0xb4/0xf4)
[17117.270643] [<c01005c4>] (gic_handle_irq+0xb4/0xf4) from [<c01071c0>] (__irq_usr+0x40/0x60)
[17117.270716] Exception stack(0xe57d1fb0 to 0xe57d1ff8)
[17117.270789] 1fa0: 00000004 20000000 00000000 b6762dcd
[17117.270831] 1fc0: beaa48c4 b10c47a0 9e3779b9 00000001 b10c47a0 beaa49c4 b32aae40 b32aae6c
[17117.270904] 1fe0: 00000031 beaa48a0 b4ade257 b4ade8c0 000e0030 ffffffff
[17118.271469] Rebooting in 5 seconds..
[17123.273753] Going down for restart now
[17123.274942] Calling SCM to disable SPMI PMIC arbiter
No errors detected
Boot info:
Last boot reason: kernel_panic
The one with die(), even it is accessing invalid address, there's no core dump:
[21376.329037] Watchdog bark! Now = 21376.329031
[21376.329127] Watchdog last pet at 21365.328771
[21376.329175] cpu alive mask from last pet 0
[21376.329220] Causing a watchdog bite!
[21376.329307] Unable to handle kernel NULL pointer dereference at virtual address 00000040
[21376.329397] pgd = ebcd0000
[21376.329481] [00000040] *pgd=00000000
No errors detected
Boot info:
Last boot reason: wdog_bite
Assignee | ||
Comment 32•10 years ago
|
||
BTW, not sure if it is related, the message "_cpu_up: attempt to bring up CPU * failed" is because msm_thermal_cpu_callback() returns NOTIFY_BAD due to cpu core overheating (>80°c).
Assignee | ||
Comment 33•10 years ago
|
||
pet_watchdog_work() is enqueued to cpu0 and is expected to be executed after 10s (qcom,pet-time) delay. The watchdog barks if it is not pet for 11s (qcom,bark-time), which I don't think user space can cause this. So I'd stop trying to get a core dump.
Assignee | ||
Comment 34•10 years ago
|
||
(In reply to Ting-Yu Chou [:ting] from comment #33)
> (qcom,bark-time), which I don't think user space can cause this. So I'd stop
> trying to get a core dump.
Scratch this, https://code.google.com/p/android/issues/detail?id=63385 shows it's possible that user space causes it.
Assignee | ||
Comment 35•10 years ago
|
||
Paul & Alison have setup video recording with 3 devices, we will check what are they doing right before reboot. I will also trace workqueue and cpu hotplug in kernel.
Reporter | ||
Comment 36•10 years ago
|
||
Hi Ting-Yu,
3 devices all encountered reboot issue, and unfortunately, the video recording was not long enough that we only recorded one device reboot behaviour.
Here are log information for 3 devices: https://drive.google.com/a/mozilla.com/folderview?id=0B_lQ-lNDWe2oWVNhNktKdlV4QkE&usp=drive_web
Video (device 0ae0abef0337dcbf reboot at about 9:00): https://vreplay.mozilla.com/replay/showRecordDetails.html?sortBy=date&viewCount=1¤tPage=1&groupBy=combo&roomFilter=&usernameFilter=&searchFilter=&usernameFullFilter=&myManager=-1&adminManager=0&webCast=0&command=&recId=1745&auxMessage=&auxMessage1=&lang=en&langChanged=&tenantFilter=&securityTab=
Reporter | ||
Comment 37•10 years ago
|
||
Executed 2 devices with debug again and both encountered reboot issue.
Besides, one device (0ae0abef0337dcbf) hit one crash when browser play video, and I think this issue is tracked at bug 1146861.
Here are log information and videos for 2 devices: https://drive.google.com/a/mozilla.com/folderview?id=0B_lQ-lNDWe2oSXdSdGVZVkZVNTA&usp=drive_web
* Note:
[devices 075a1a2000d22892 (left, red)] reboot at capture-2.avi 1:16:39
[devices 0ae0abef0337dcbf (right, white)] reboot at capture-4.avi 4:50:10, and this device executed camera multiple shots before reboot.
Comment 38•10 years ago
|
||
This case also found on Flame KK. Update title.
Summary: [MTBF][Nexus5-L] MTBF testing aborted and the device random reboot with reason wdog_bark → [MTBF]MTBF testing aborted and the device random reboot with reason wdog_bark
Assignee | ||
Comment 39•10 years ago
|
||
(In reply to Alison Shiue from comment #36)
> Video (device 0ae0abef0337dcbf reboot at about 9:00):
> https://vreplay.mozilla.com/replay/showRecordDetails.
> html?sortBy=date&viewCount=1¤tPage=1&groupBy=combo&roomFilter=&username
> Filter=&searchFilter=&usernameFullFilter=&myManager=-
> 1&adminManager=0&webCast=0&command=&recId=1745&auxMessage=&auxMessage1=&lang=
> en&langChanged=&tenantFilter=&securityTab=
The device was idle at homescreen before reboot.
(In reply to Alison Shiue from comment #37)
> [devices 075a1a2000d22892 (left, red)] reboot at capture-2.avi 1:16:39
It was doing text input.
> [devices 0ae0abef0337dcbf (right, white)] reboot at capture-4.avi 4:50:10,
> and this device executed camera multiple shots before reboot.
And from last_kmsg, pet_watchdog_work() did enqueue the delayed work, but it was not triggered.
Assignee | ||
Comment 40•10 years ago
|
||
Are there any suggestions to figure out what cause the watchdog bite? Thanks.
Flags: needinfo?(ikumar)
Assignee | ||
Comment 41•10 years ago
|
||
(In reply to Ting-Yu Chou [:ting] from comment #40)
> Are there any suggestions to figure out what cause the watchdog bite? Thanks.
NI mwu, dhyland, gsvelto as well.
Flags: needinfo?(mwu)
Flags: needinfo?(gsvelto)
Flags: needinfo?(dhylands)
Assignee | ||
Comment 42•10 years ago
|
||
I have talked to mwu yesterday, he suggests to make sure the issue can reproduce on Flame so we can have QC to help. QA will try to get last_kmsg from a Flame with the same symptom to see if it is also a watchdog bite.
Flags: needinfo?(mwu)
Flags: needinfo?(gsvelto)
Flags: needinfo?(dhylands)
Comment 43•10 years ago
|
||
Flame tested, they were "unable to connect to device" but didn't reboot. Possibly test harness or adb busy. I'll add retry and see if I could reproduce. Can't tell any relation by flame's latest build.
Assignee | ||
Comment 44•10 years ago
|
||
I added some logs to the kernel, it shows the timer function (delayed_work_timer_fn()) for queuing pet work does happen, but pet_watchdog_work() is not called.
Assignee | ||
Comment 45•10 years ago
|
||
From last night's logs, there was a work being processed and queued repeatedly around every 30us before watchdog bite. I'll see if I can figure out what it is.
Assignee | ||
Comment 46•10 years ago
|
||
(In reply to Ting-Yu Chou [:ting] from comment #45)
> From last night's logs, there was a work being processed and queued
> repeatedly around every 30us before watchdog bite. I'll see if I can figure
> out what it is.
Hmm... I thoguht it could be some busy loop cause the watchdog bite, but there were 2~3 seconds before watchdog bite which run_timer_softirq() comes in and pet_watchdog_work() is not called.
Assignee | ||
Updated•10 years ago
|
Summary: [MTBF]MTBF testing aborted and the device random reboot with reason wdog_bark → [MTBF] Device random reboot with reason wdog_bark
Assignee | ||
Comment 47•10 years ago
|
||
I've seen more than one patterns triggering watchdog bite:
1. workqueue seems block in kgsl_process_events
2. a lot of flush_to_ldisc()
3. only see tick_nohz_stop_sched_tick() logs
Comment 48•10 years ago
|
||
I have told to Ting-Yu for this bug. I suggest to disable watchdog or reduce intervals of the timer.
By definition, this kernel watchdog is watching if kernel is running correctly. Once watchdog barks, it means some part of kernel is wrong. It means kernel do something wrong, it is a bug of kernel. Any good kernel should not make good watchdog barking.
Since we are meaning to running MTBF on nexus 5, we don't ship nexus 5 and its kernel as a product, it is insane to spend a lot of resource to fix it. If disabling or reducing intervals could fix MTBF, why not?
Assignee | ||
Comment 49•10 years ago
|
||
(In reply to Thinker Li [:sinker] from comment #48)
> I have told to Ting-Yu for this bug. I suggest to disable watchdog or
> reduce intervals of the timer.
>
> By definition, this kernel watchdog is watching if kernel is running
> correctly. Once watchdog barks, it means some part of kernel is wrong. It
> means kernel do something wrong, it is a bug of kernel. Any good kernel
> should not make good watchdog barking.
>
> Since we are meaning to running MTBF on nexus 5, we don't ship nexus 5 and
> its kernel as a product, it is insane to spend a lot of resource to fix it.
> If disabling or reducing intervals could fix MTBF, why not?
Viral, how do you think?
Flags: needinfo?(vwang)
Assignee | ||
Comment 50•10 years ago
|
||
I have talked to Alison yesterday as I noticed Nexus-5-L MTBF reaches 25 hours in a recent update. She told me the number was based on 5.1, which is different from this bug (5.0). She will try more runs and upadte here if it can no longer reproduce.
Flags: needinfo?(vwang)
Flags: needinfo?(ikumar)
Comment 51•10 years ago
|
||
(In reply to Ting-Yu Chou [:ting] from comment #49)
> (In reply to Thinker Li [:sinker] from comment #48)
> > I have told to Ting-Yu for this bug. I suggest to disable watchdog or
> > reduce intervals of the timer.
> >
> > By definition, this kernel watchdog is watching if kernel is running
> > correctly. Once watchdog barks, it means some part of kernel is wrong. It
> > means kernel do something wrong, it is a bug of kernel. Any good kernel
> > should not make good watchdog barking.
> >
> > Since we are meaning to running MTBF on nexus 5, we don't ship nexus 5 and
> > its kernel as a product, it is insane to spend a lot of resource to fix it.
> > If disabling or reducing intervals could fix MTBF, why not?
>
> Viral, how do you think?
I agree the watchdog problem comes from kernel side should not relative to gecko.
Yes, we can provide a kernel build disable watchdog, but if we still met the case that kernel didn't kick watchdog in time, the MTBF may stop at the same time (since kernel has no bandwidth to kick watchdog, it can't service gecko/gaia)
I will try it in my local device to see if MTBF can run longer.
Reporter | ||
Comment 52•10 years ago
|
||
Until now, we have executed MTBF with Nexus-5-L-5.1 9 times, did not hit this reboot issue.
Comment 53•10 years ago
|
||
Hi! Alison,
Just want to check that do you see this issue again in past three days?
Thanks
--
Keven
Flags: needinfo?(ashiue)
Reporter | ||
Comment 54•10 years ago
|
||
(In reply to Keven Kuo [:kkuo] from comment #53)
> Hi! Alison,
>
> Just want to check that do you see this issue again in past three days?
> Thanks
>
> --
> Keven
No, I did not see this issue again in past three days.
Flags: needinfo?(ashiue)
Reporter | ||
Comment 55•10 years ago
|
||
(In reply to Keven Kuo [:kkuo] from comment #53)
> Hi! Alison,
>
> Just want to check that do you see this issue again in past three days?
> Thanks
>
> --
> Keven
No, I did not see this issue again in past three days.
Reporter | ||
Comment 56•10 years ago
|
||
mark unconfirmed by comment 52 and comment 55.
Status: ASSIGNED → UNCONFIRMED
Ever confirmed: false
Comment 57•9 years ago
|
||
Hi Alison,
Please renominate if you still observe the issue again. Thanks!
blocking-b2g: 2.2+ → ---
Flags: needinfo?(ashiue)
Reporter | ||
Comment 58•9 years ago
|
||
(In reply to Josh Cheng [:josh] from comment #57)
> Hi Alison,
> Please renominate if you still observe the issue again. Thanks!
OK.
Flags: needinfo?(ashiue)
Comment 59•9 years ago
|
||
Triage: mark as wontfix. Not to keep track on nexus 5-L. If z3c-L has this issue, please create new one.
Status: UNCONFIRMED → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
You need to log in
before you can comment on or make changes to this bug.
Description
•