Closed
Bug 1121374
Opened 10 years ago
Closed 10 years ago
Devices crash during test runs and can no longer be found via ADB
Categories
(Firefox OS Graveyard :: Infrastructure, defect)
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 1122119
People
(Reporter: RobertC, Assigned: gmealer)
References
Details
We have seen a lot of device not found errors lately and some of those errors seem to be cause by the device crashing in the previous test run with the following stacktrace: http://pastebin.mozilla.org/8214824
I looked over the nodes for a pattern, but could not find one. Below you will find the information from some of the nodes and the job in which the device seems to have crashed.
node: 28.2
job: flame-kk-319.mozilla-central.ui.graphics.smoke #58
job link: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.mozilla-central.ui.graphics.smoke/58/console
job run time: 02:46:13 -> 03:10:01
last test running: TestCalendar.test_calendar_flick
node: 15.1
job: flame-kk-319.b2g-inbound.ui.functional.smoke #2014
job link: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.ui.functional.smoke/2014/console
job run time: 14:33:28 -> 15:07:23
last test running: TestSetupAndSendIMAPEmail.test_setup_and_send_imap_email
node:23.2
job: flame-kk-319.b2g-inbound.ui.functional.sanity #6027
job link: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.ui.functional.sanity/6027/console
job run time: 16:53:24 -> 17:06:11
last test runnning: TestSms.test_sms_send
The time after the start of the job when the device crashes varies, so does test that ran last and the job that fails.
Comment 1•10 years ago
|
||
I think we need more data before we can identify a pattern. Just three examples is probably not enough.
Comment 2•10 years ago
|
||
Bebe noticed an increase in jobs being aborted by the build timeout plugin, which was set to abort after 5 minutes of inactivity. I could imagine an abort could leave the device in an unknown state (although I wouldn't really expect it to go offline, and it should recover). A recently released version of the build timeout plugin may have made this detection more effective. I understand Bebe is going to disable these timeouts to see if it helps with the stability.
Comment 3•10 years ago
|
||
Looks like more and more devices are crashing because of this issue
See Bug 1121501:
b2g-20.1
http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.ui.functional.smoke/2070/console
mozdevice.devicemanager.DMError: bad status for device e472d8d7: offline
b2g-23.1
http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.ui.functional.smoke/2042/console
b2g-09.1
http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.mozilla-central.ui.functional.smoke/259/console
are in the same issue.
Geo we need some investigation on these devices and see why and how this is reproducing?
Flags: needinfo?(gmealer)
Comment 5•10 years ago
|
||
(In reply to Dave Hunt (:davehunt) from comment https://bugzilla.mozilla.org/show_bug.cgi?id=1121501#c7)
> It looks like this often happens in the e-mail or SMS tests? Perhaps this is
> a genuine crasher. Can we replicate it locally, or try with an older build?
> Is it possible that we need the v18D-1 base build?
I ran the email tests and can't replicate the crash.
Gaia-Rev e018c9de2f28927e5ab0d12baeb9d463ad7ff883
Gecko-Rev https://hg.mozilla.org/integration/b2g-inbound/rev/8741b8bde94f
Build-ID 20150114044721
Version 38.0a1
Device-Name flame
FW-Release 4.4.2
FW-Incremental eng.cltbld.20150114.083952
FW-Date Wed Jan 14 08:40:02 EST 2015
Bootloader L1TC10011880
Assignee | ||
Comment 6•10 years ago
|
||
(In reply to Florin Strugariu [:Bebe] from comment #3)
> Geo we need some investigation on these devices and see why and how this is
> reproducing?
Ugh, I'm not even sure how to get started with trying to reproduce that. Any tips from historical fixes?
Flags: needinfo?(gmealer)
Assignee | ||
Updated•10 years ago
|
Flags: needinfo?(florin.strugariu)
Comment 7•10 years ago
|
||
Geo let's start by logging the current state of the phones and maybe try to get a logcat from them
I will run some more tests and see if I can reproduce the issue.
Flags: needinfo?(florin.strugariu)
Assignee | ||
Comment 8•10 years ago
|
||
OK, I'll have a better report forthcoming, but I'm finding a lot of devices off and at battery prompt. More concerning is after booting them, usb debugging is disabled.
Are we putting a poison build on these? What are the various things that could cause usb debugging to be flipped to disabled? I know I personally flipped many these to adb on Monday.
Comment 9•10 years ago
|
||
as far as I know we are running the same setup as always... from the console log the devices are running the usual test until for some reason during the email/sms tests something goes bad and we can't find the device anymore...
Assignee | ||
Comment 10•10 years ago
|
||
OK, I've reset a number of machines again and verified ADB access.
Before I list them, the "disabled" under USB Debugging is a red herring. Something about our test suite is making that appear that way, but ADB isn't really disabled.
I was able to verify this running the sanity suite locally on 25-2's phone, which was sufficient to make "Disabled" occur. I used the same build and gaia revision as the last set of sanity tests that ran on that phone. This happened even though the bluetooth test didn't run for lack of installed module (we should add it to setup.py, maybe?).
I did go through the same tests manually, and couldn't make it happen. Either there's a subtle difference in what I did and the test did, or it's the "reset to known state" intermission. I suspect the latter, and probably not user-facing. That said, I did try killing b2g to restart it and the problem didn't occur.
These are the phones I touched:
Were off when I found them:
9
13
16.2
17.1
17.2
18.1
23.1
24.2
26.2
28.2
20.1 was so far off that it wouldn't respond to power button. Removing/replacing battery reset it and it booted from there.
All phones were charged to 100%. I did check version, and it was a 3.0.0 prerelease, with a range of different build dates. USB Storage was disabled (I sanity-checked it just in case we were shutting down ADB that way).
Of the other phones mentioned above:
28.2 was idle but on. It is, however, showing a "must be online to do that" type error, and the sim has a !.
23.2 was idle but on. It was sitting on the contacts screen.
15.1 was actually running tests by the time I looked at it, so couldn't have been *that* non-viable.
Of these, these devices overlapped with Monday's bug 1120307, so I know they were reset recently:
16.2
18.1
24.2
28.2
I'm keeping this bug open for more investigation. In particular, I think we should look at test runs and configuration for the four devices above, focusing on things between Monday and today as a range, and see if we can figure out a commonality.
Assignee | ||
Updated•10 years ago
|
Assignee: nobody → gmealer
Comment 11•10 years ago
|
||
I just bricked my device, and it came back up after 10 minutes or so with the crash report. (usually there is no crash report, but this time for some reason it showed up) the link is:
https://crash-stats.mozilla.com/report/index/d23813bd-fb9a-41be-a919-30c522150115
Just in case it helps.
Comment 12•10 years ago
|
||
(In reply to No-Jun Park [:njpark] from comment #11)
> I just bricked my device, and it came back up after 10 minutes or so with
> the crash report. (usually there is no crash report, but this time for some
> reason it showed up) the link is:
> https://crash-stats.mozilla.com/report/index/d23813bd-fb9a-41be-a919-
> 30c522150115
>
> Just in case it helps.
In my case the crash happens (intermittently) when python marionette restarts the device.
Comment 13•10 years ago
|
||
I took a look over the logcat's from Jenkins.
I could not find anything that would point out our issue.
http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.mozilla-central.ui.functional.smoke/259/artifact/tests/python/gaia-ui-tests/logcat.txt
http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.ui.functional.smoke/2042/artifact/tests/python/gaia-ui-tests/logcat.txt
http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.ui.functional.smoke/2070/artifact/tests/python/gaia-ui-tests/logcat.txt
Comment 14•10 years ago
|
||
Also ran tests all morning and could not reproduce the fail.
Comment 15•10 years ago
|
||
Looks like b2g-28.1 is down
http://jenkins1.qa.scl3.mozilla.com/computer/b2g-28.1/
I will mark the node as offline :(
Comment 16•10 years ago
|
||
b2g-21.1 is in the same state.
http://jenkins1.qa.scl3.mozilla.com/computer/b2g-21.1
Failed in build:
http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.ui.functional.smoke/2090/
Updated•10 years ago
|
Severity: normal → critical
Assignee | ||
Comment 17•10 years ago
|
||
I'll follow up briefly since I'm going to have to turn on more devices today.
But you need to look further than just show_version to check the device. If you'd run adb on the machine, you'd have seen two devices there as expected.
Instead, the job failed because 28.1 has reverted to generic serial "0123456789ABCDEF" rather than what we have in ANDROID_SERIAL.
I'm going to temporarily change the job spec for 28.1 to the generic serial, since the other device has a real number. But I'll also file a bug, since that shouldn't happen in the first place.
NI'ing so that you see this. I really recommend you use the listb2g() shell function I sent in email to check rather than running show_version. At the very least, a show_version fail should be doublechecked (a pass is likely legit).
Flags: needinfo?(florin.strugariu)
Assignee | ||
Comment 18•10 years ago
|
||
Filed bug on b2g-28.1, since I couldn't get the config change to take, and more work might be necessary there anyway.
Today's list of restarts are:
16.1
18.1
21.1
26.1
26.2
27.2
18.1 and 26.2 are repeats from yesterday, making them the current best candidates for investigation.
Right now, my best guess is that the following scenario is happening:
1) Something crashes or freezes the phone to the point that USB charging is disabled.
2) Phone runs out of power, turning itself off.
3) Having turned off, and without whatever 1) was in the way, the powered-off-charge kicks in.
Alternately, something is simply shutting down the phone and not turning it back on, but I can't imagine what would do that spontaneously.
Am going to continue to investigate. I've brought 21.1 back online as well.
Assignee | ||
Comment 19•10 years ago
|
||
More info:
First, thought I'd found a pattern:
Killed 18.1: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.perf.gaia/3700/console
Was good at 9:22:44 after test data push
Bad as of 10:03:57 after making node modules
Why 40 minute gulf between making node modules and next line?
Killed 26.2: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.perf.gaia/3670/console
Good at 20:15:03 after test data push
Bad as of 20:47:01 after making node modules
Same 40 minute gulf.
Killed 16.1: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.perf.gaia/3697/console 08:42:08 -> 08:48:16, same bracketing
No 40 minute gulf.
Unfortunately, the next one I checked wasn't an inbound perf job:
Killed 21.1: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.ui.functional.smoke/2090/console
Good at 3:45:06 (testing 3gp video)
Bad at 3:46:16
With the crash theory in mind, I investigated whether crash reports were left on the device. No such luck. This was curious, though, on 21.1:
root@flame:/data/b2g/mozilla/Crash Reports # ls -l
-rw------- root root 10 2015-01-15 03:45 InstallTime20150115003933
drwx------ root root 2015-01-15 03:45 events
The contents of InstallTime was an epoch timestamp for 03:45:10. That's when the crash happened.
So, I went back to 16.1:
root@flame:/data/b2g/mozilla/Crash Reports # ls -l
-rw------- root root 10 2015-01-15 15:09 InstallTime20150115054735
drwx------ root root 2015-01-15 15:09 events
Contents: 1421334571 == Thu Jan 15 07:09:31 PST 2015
It is a little curious that the InstallTime contents are a clean 12 hours off from the filestamp, but either way that doesn't correspond with the 08:42:08 -> 08:48:16 window the log shows.
Still, I think the next step might be to see what makes InstallTime* update itself, since at least on 16.1 that's one heck of a coincidence.
Assignee | ||
Comment 20•10 years ago
|
||
Just realized I misread the file timestamp on 16.1, which is 3:09PM. So that's 8 hours off, which makes more sense for a phone set to GMT.
Assignee | ||
Comment 21•10 years ago
|
||
Killed 23.1: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk.ui.adhoc/593/console
Good at 10:58:59 (possibly 10:57:42, depends on whether locator failed due to crash), bad at 11:00:17, after testing a11y callscreen visibility.
Killed 24.2: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.ui.functional.sanity/6152/console
Good at 10:01:43, bad at 10:02:57, after testing sms send or bluetooth settings (probably former).
Those are both radio tests, so there's that.
That said, I just watched 24.2's InstallTime* timestamp change during the test, during an otherwise clean run. I think it's the kill/restart of b2g that does it, so no luck there.
b2g-16's non-corresponding time must be because the perf tests aren't restart as part of setup, so that really was the install time.
Assignee | ||
Comment 22•10 years ago
|
||
(In reply to No-Jun Park [:njpark] from comment #12)
> In my case the crash happens (intermittently) when python marionette
> restarts the device.
That aligns with the InstallTime* timestamps I'm finding, when they haven't been clobbered by a later test run like 24.2's was.
Comment 23•10 years ago
|
||
With v188-1, the crash happens as well, although the recovery time seem to be shorter in v188-1 (3-4 minutes or so)
When rebooted, it asked me to go through the FTE again.
Crash log in v188-1:
https://crash-stats.mozilla.com/report/index/f5ea907e-4277-4dcf-b6c5-a70f02150116
Comment 24•10 years ago
|
||
Sometimes it does not recover from crash, and marionette gives out this message:
Most recent errors/exceptions are:
01-16 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -101-16 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -101-16 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -101-16 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -101-16 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -1
The B2G process has restarted after crashing during the tests so Marionette can't respond due to either a Gecko, Gaia or Marionette error. Above, the 5 most recent errors are listed. Check logcat for all errors if these errors are not the cause of the failure.
Assignee | ||
Comment 25•10 years ago
|
||
(In reply to No-Jun Park [:njpark] from comment #24)
> Sometimes it does not recover from crash, and marionette gives out this
> message:
>
> Most recent errors/exceptions are:
> 01-16 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -101-16
> 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -101-16
> 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -101-16
> 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -101-16
> 01:44:36.673 E/ACDB-LOADER( 9029): Error: ACDB ANC returned = -1
> The B2G process has restarted after crashing during the tests so Marionette
> can't respond due to either a Gecko, Gaia or Marionette error. Above, the 5
> most recent errors are listed. Check logcat for all errors if these errors
> are not the cause of the failure.
What I'd also really like to know is whether the phone powers off or stops charging if you leave it at the crash. If you can make one happen before you go to bed or something, that'd be really useful.
Assignee | ||
Comment 26•10 years ago
|
||
Just checked logcat for 23.1's. It cuts off cold at 11:00:09, no particular errors showing above it.
The system was not in the middle of a restart, afaict--that happened somewhat prior to 10:59, but the homescreen had been found by Marionette and it looks like it was in the middle of a11y test code.
Last few entries:
01-15 11:00:09.295 7773 7773 W Communications: at reportMissingEntity (app://communications.gaiamobile.org/shared/js/l10n.js:1513:8)
01-15 11:00:09.425 7289 7289 I Gecko : 1421348409439 Marionette DEBUG Got request: findElement, data: {"to":"0","sessionId":"{a1eb7cbf-c070-4705-b3ae-61d2effc8d8c}","name":"findElement","parameters":{"using":"css selector","value":"div.keypad-key[data-value=\"9\"]"}}, id: null
01-15 11:00:09.435 7289 7289 I Gecko : 1421348409446 Marionette INFO sendToClient: {"from":"0","sessionId":"{3f28d6f7-52cb-440e-8a4c-a6cb6d1e1bca}","value":{"ELEMENT":"{ac878fb4-74b0-442d-a523-a63cc52cfee1}"}}, {4b3bf113-13cf-44bc-a967-6f517f9e72af}, {4b3bf113-13cf-44bc-a967-6f517f9e72af}
01-15 11:00:09.445 7289 7289 I Gecko : 1421348409456 Marionette DEBUG Got request: executeAsyncScript, data: {"to":"0","sessionId":"{a1eb7cbf-c070-4705-b3ae-61d2effc8d8c}","name":"executeAsyncScript","parameters":{"scriptTimeout":null,"specialPowers":true,"script":"return Accessibility.click.apply(Accessibility, arguments)","newSandbox":true,"args":[{"ELEMENT":"{ac878fb4-74b0-442d-a523-a63cc52cfee1}"}],"filename":"gaia_test.py","line":491}}, id: null
01-15 11:00:09.535 7289 7289 I Gecko : 1421348409545 Marionette INFO sendToClient: {"from":"0","sessionId":"{3f28d6f7-52cb-440e-8a4c-a6cb6d1e1bca}","value":null}, {af36e5ee-5d3a-40bc-a10d-5b6c11403574}, {af36e5ee-5d3a-40bc-a10d-5b6c11403574}
01-15 11:00:09.795 7289 7289 I Gecko : 1421348409801 Marionette DEBUG Got request: findElement, data: {"to":"0","sessionId":"{a1eb7cbf-c070-4705-b3ae-61d2effc8d8c}","name":"findElement","parameters":{"using":"css selector","value":"div.keypad-key[data-value=\"7\"]"}}, id: null
I'll look in other ones tomorrow--I'm out of time for today.
Comment 27•10 years ago
|
||
(In reply to Geo Mealer [:geo] from comment #19)
> Was good at 9:22:44 after test data push
> Bad as of 10:03:57 after making node modules
>
> Why 40 minute gulf between making node modules and next line?
The perf builds don't flush anything to console during this time. A successful build could be quiet for up to two hours.
Comment 28•10 years ago
|
||
http://jenkins1.qa.scl3.mozilla.com/computer/b2g-07.1/ is down again
Flags: needinfo?(florin.strugariu)
Comment 29•10 years ago
|
||
I had to put some more devices down today.
Also Investigated No-Jun's fail and it looks like a different Critical issue see: Bug 1122503
Tested locally with multiple build and with both flasing scrips (Jenkins and Tw) and could not reproduce the fail.
Comment 31•10 years ago
|
||
I had to put down some nodes this morning :(
Take a look over http://jenkins1.qa.scl3.mozilla.com/computer/ and see the disabled nodes
Reporter | ||
Comment 32•10 years ago
|
||
I disabled 2 more nodes:
20.1 - adb devices return 1 device
24.1 - adb devices return 0 devices
Can someone try and restart them? We are running out of nodes to run the tests on.
Flags: needinfo?(pmathur)
Flags: needinfo?(gmealer)
Assignee | ||
Comment 33•10 years ago
|
||
Restarted:
4
13
15
17-1
18-1
18-2
19-1
20-1
22-2
23-1
24-1
24-2
25-2
26-2
28-2
22-2 came up in a recovery mode, so I need to manually reflash it.
28-1 was taken offline, but looked viable to me. I'm checking it further.
I brought all nodes but those two (and 25-1, which is a perf node Dave disabled) back online in Jenkins. By the time you read this, I'll probably have fixed those and brought them online too. I'll update only if I do not do this.
Flags: needinfo?(gmealer)
Assignee | ||
Comment 34•10 years ago
|
||
I brought 22-2 back up. 28-1 was the one with the android serial number issue which I have not solved yet.
I also took 26-1 down for post-mortem on bug 1122119, and am keeping it for a day so people can respond without us having blown away the logs.
Assignee | ||
Updated•10 years ago
|
Flags: needinfo?(pmathur)
Reporter | ||
Comment 35•10 years ago
|
||
23.1 crashed inhttp://jenkins1.qa.scl3.mozilla.com/view/UI/job/flame-kk-319.mozilla-central.ui.functional.smoke/275/console
I took the node offline, but it need to be restarted.
Flags: needinfo?(gmealer)
Assignee | ||
Comment 36•10 years ago
|
||
Have continued to bring nodes up on a daily basis. I'm going to quit posting the list, as I think it's not useful for isolation like I'd hoped. I'm also going to start bringing up nodes twice a day (when I come in and before I go home).
Flags: needinfo?(gmealer)
Assignee | ||
Comment 37•10 years ago
|
||
Upstream bug is fixed via backout at this point. A bunch of phones were still down this morning, but that's not surprising given the backout didn't happen until today. Once I see that no phones are crashing, I'll resolve-fix this.
Assignee | ||
Comment 38•10 years ago
|
||
I'm closing this out. We're not white-screening anymore, which is what this bug really applied to.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Marking this bug as a duplicate of bug 1122119 as per comment 38.
Resolution: FIXED → DUPLICATE
You need to log in
before you can comment on or make changes to this bug.
Description
•