Closed Bug 1064800 Opened 10 years ago Closed 10 years ago

[MTBF] Marionette keeps script timeout exception after running 20 minutes

Categories

(Core :: DOM: Core & HTML, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

()

VERIFIED FIXED
mozilla35
blocking-b2g 2.1+
Tracking Status
firefox33 --- wontfix
firefox34 --- fixed
firefox35 --- fixed
b2g-v2.1 --- fixed
b2g-v2.2 --- fixed

People

(Reporter: pyang, Assigned: ting)

References

Details

(Whiteboard: [mtbf])

Attachments

(10 files, 7 obsolete files)

(deleted), text/x-vhdl
Details
(deleted), text/x-vhdl
Details
(deleted), text/x-vhdl
Details
(deleted), application/zip
Details
(deleted), text/x-log
Details
(deleted), text/plain
Details
(deleted), application/x-gzip
Details
(deleted), application/x-gzip
Details
(deleted), application/zip
Details
(deleted), patch
Details | Diff | Splinter Review
STR: Starting to test mtbf run on v2.1 branch, after more than 20 minutes execution test cases began to fail when disabling ftu. Device: flame Revision info Gaia 95e9b099aa89ded133e44014dd40b19dc0193c01 Gecko https://hg.mozilla.org/releases/mozilla-aurora/rev/92a6bbdfd945 BuildID 20140905000202 Version 34.0a2 ro.build.date Fri Jun 27 15:57:58 CST 2014 ro.bootloader L1TC00011230 ro.build.version.incremental 110 EXPECT: Test run will not stop in 100 hours ACTUAL: Test ended in 30~60 minutes
Blocks: MTBF-B2G
Blocks: MTBF-2014Q3
Paul, any stacktraces/logs/memdumps we can get here to get started here ? We need to bisect if this is atest bug or a gecko/gaia issue. This test works on 2.0 as expected?
Flags: needinfo?(pyang)
re-run on 2.0 didn't see this issue. keep ni?
Attached file logcat1 (deleted) —
Attached file logcat2 (deleted) —
Add last 2 logcat before test run stopped.
Look like bug 1061742
Depends on: 1061742
Depends on: 997909
Flags: needinfo?(pyang)
Attached file logcat3 (deleted) —
Remove disabling keyboard ftu, log repeated E/GeckoConsole( 314): Content JS LOG at dummy file:645 in GaiaDataLayer.disableWiFi/<: wifi enabled status: true Detail in logcat3 Malini, can you give comments from marionette? It looks pretty similar to bug 1061742, a waitfor condition can't be matched, when the client raises timeout exception, previously executing script kept running in server side.
No longer depends on: 997909
Flags: needinfo?(mdas)
Depends on: 997909
Whiteboard: [mtbf]
(In reply to Paul Yang [: pyang] from comment #7) > Remove disabling keyboard ftu, log repeated > E/GeckoConsole( 314): Content JS LOG at dummy file:645 in > GaiaDataLayer.disableWiFi/<: wifi enabled status: true > > Detail in logcat3 > > Malini, can you give comments from marionette? Sure, only logcat12 has marionette output, and it looks healthy. > It looks pretty similar to bug 1061742, a waitfor condition can't be > matched, when the client raises timeout exception, previously executing > script kept running in server side. In Bug 1061742's logcat, it looks like E/GeckoConsole( 4168): [JavaScript Error: "ReadOnlyError: A mutation operation was attempted in a READ_ONLY transaction." {file: "resource://gre/modules/SettingsRequestManager.jsm" line: 509}] is the error that's causing the problems. Logcat3, logcat12 and logcat13 don't contain this error, nor any other suspicious error
Flags: needinfo?(mdas)
Attached file logcat4 (deleted) —
Thanks Malini. I just re-run the test and it gave the same log as you mentioned (in logcat4, line 32027)
Summary: [MTBF] Script timeout execption in disabling ftu → [MTBF] Marionette keeps script timeout execption after running 20 minutes
Attached file output_debug_83.zip (obsolete) (deleted) —
Re-run on today's pvt build and it failed. Add attachment of periodically gathering logcat/b2g-ps/top Cervantes, can you help to see this case?
Flags: needinfo?(cyu)
From the top output it's not clear whether there is a CPU spin like the one you showed me yesterday, but one recurring log entry concerns me: 09-16 14:45:01.979 310 310 E GeckoConsole: Content JS LOG at dummy file:352 in GaiaApps.getDisplayedApp: app with origin 'app://verticalhome.gaiamobile.org' is displayed 09-16 14:45:01.979 310 310 E GeckoConsole: Content JS LOG at dummy file:644 in GaiaDataLayer.disableWiFi/<: wifi enabled status: true Does it keep popping up even after you take the device offline? If yes I think we have a performance problem because it will consume CPU clocks and suck up the battery. We need to figure out whether this is caused by the test harness, or will it potentially happen on end user's device.
Flags: needinfo?(cyu) → needinfo?(pyang)
Yes, it kept popping on devices even test ended. What else do we need to go deeper? You can setup mtbf and run locally, should be 100% percent to reproduce.
Flags: needinfo?(pyang)
We checked one device that finished the test and shows CPU spin (42% on flame). This keeps popping up in adb logcat: E/GeckoConsole( 321): Content JS LOG at dummy file:644 in GaiaDataLayer.disableWiFi/<: wifi enabled status: true There is definitely one infinite loop.
Attached file marionette_stdout.log (deleted) —
The dummy files in log are gaia/tests/atoms/gaia_app.js and gaia/tests/atoms/gaia_data_layer.js Attachment is stdout from one of the test run. Malini, can you help to do more investigation for current logs?
Flags: needinfo?(mdas)
You should also consider this being an actual bug on the phone. During an event this weekend we had several phones that started to be super slow after using for an hour or so. Scrolling in the setting app looked like performing with 2fps.
I think this is bug on the phone but still need advice from framework to confirm. After tracing marionette server, waitFor function in contact simply loop until timeout. The phones didn't run slowly, so might not be the same issue. Keep trying to narrow down.
(In reply to Paul Yang [: pyang] from comment #14) > Created attachment 8490615 [details] > marionette_stdout.log > > The dummy files in log are gaia/tests/atoms/gaia_app.js and > gaia/tests/atoms/gaia_data_layer.js > Attachment is stdout from one of the test run. Malini, can you help to do > more investigation for current logs? The framework looks like it's doing the right thing here. If there was a marionette bug, you'd see some IOError/"connection lost" errors. Seems more like a wifi problem, especially since you get those wifi messages in the logcat after the tests end from comment #12. Maybe a memory dump would be useful here? I've seen an issue where we'd have tons of wifi workers being spun up and it was causing failures, and that was detected using the memory dump.
Flags: needinfo?(mdas)
Given the above comments - should we consider putting this on the blocking 2.1? queue & have someone from the wifi team take a look at this bug?
blocking-b2g: --- → 2.1+
It might not be a wifi issue, because sometimes it loops in getDisplayedApp() so I will think the queue is full or busy waiting for something.
We saw a phone that never run automatic test and could reproduce script timeout issue. However, it didn't print any log except E/QCALOG ( 415): [MessageQ] ProcessNewMessage: [XTWWAN-PE] unknown deliver target [OS-Agent] E/QCALOG ( 415): [MessageQ] ProcessNewMessage: [XTWiFi-PE] unknown deliver target [OS-Agent] and look so much like suspended.
What is the phone showing when the test gets hung (e.g. a screenshot)?
Actually it didn't get hung. I can still operate apps like homescreen or settings. One thing strange is developer menu in settings disabled and can't be enabled anymore. I'll try to bisect for a regression window next week.
Can't reproduce by today's build Trigger more test runs and see if it becomes WFM Build info: Gaia-Rev b3f9b97d16a1ab55f80239d63c1a85c3da3d39ad Gecko-Rev https://hg.mozilla.org/releases/mozilla-aurora/rev/2c6e3261c47b Build ID 20140921160204 Version 34.0a2 Device Name flame FW-Release 4.3 FW-Incremental 110 FW-Date Fri Jun 27 15:57:58 CST 2014 Bootloader L1TC00011230
Attached file mem-log.tgz (obsolete) (deleted) —
memory report by latest pvt build
Attached file cc-log.tgz (obsolete) (deleted) —
Attached file gc-log.tgz (obsolete) (deleted) —
(In reply to Malini Das [:mdas] from comment #17) > (In reply to Paul Yang [: pyang] from comment #14) > > Created attachment 8490615 [details] > > marionette_stdout.log > > > > The dummy files in log are gaia/tests/atoms/gaia_app.js and > > gaia/tests/atoms/gaia_data_layer.js > > Attachment is stdout from one of the test run. Malini, can you help to do > > more investigation for current logs? > > The framework looks like it's doing the right thing here. If there was a > marionette bug, you'd see some IOError/"connection lost" errors. Seems more > like a wifi problem, especially since you get those wifi messages in the > logcat after the tests end from comment #12. > > Maybe a memory dump would be useful here? I've seen an issue where we'd have > tons of wifi workers being spun up and it was causing failures, and that was > detected using the memory dump. malini, can you please help investigate the memory reports/logs paul attached recently to see if its the wifi workers causing failures that you suspected?
Flags: needinfo?(mdas)
I don't know how to read memory log/crash log data. CC'ing khuey since he's knowledgeable about memory reports. Does anything here looks suspicious?
Flags: needinfo?(mdas) → needinfo?(khuey)
I see ~5400 SettingsLocks in that log. That's the only thing that jumps out at me.
Flags: needinfo?(khuey)
Attached file output_debug_87.zip (obsolete) (deleted) —
Update zip for debug info every 5 tests
Attachment #8489855 - Attachment is obsolete: true
Attached file memory-reports (deleted) —
Memory report from get_about memory
Attachment #8493013 - Attachment is obsolete: true
Attached file cc-log.tgz (deleted) —
Update latest cc log
Attachment #8493014 - Attachment is obsolete: true
Attached file gc-log.tgz (deleted) —
Update gc log
Attachment #8493015 - Attachment is obsolete: true
Kyle, Just update memory report and cc & gc log by our latest build, logcat and b2g-ps, top are included in output_debug_87.zip. The device has memory pressure even suspended. Please help to investigate and comment, thanks.
Flags: needinfo?(khuey)
That log has about 8000 SettingsLocks in it. It also has a ton of iframes leaked via bug 1051995.
Flags: needinfo?(khuey)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #35) > That log has about 8000 SettingsLocks in it. > > It also has a ton of iframes leaked via bug 1051995. Nice, that would match what I saw the other morning.
qdot, We triggered test yesterday with revision Gaia 93a99bea0b40d81bd063f7d8b1964dc1ba35ba7b Gecko https://hg.mozilla.org/releases/mozilla-aurora/rev/d7e31ecd48af BuildID 20140924000243 Version 34.0a2 ro.build.date Thu Sep 4 14:59:02 CST 2014 ro.bootloader L1TC10011800 ro.build.version.incremental 27 From greping cc-log and gc-log, SettingsLock reduce to 388 lines. However it didn't solve the issue. I'll try it today and see if it's improved.
Attached file output_debug_47.zip (deleted) —
Attachment #8493596 - Attachment is obsolete: true
Update latest test run status: based on flame-kk v2.1 12 devices brief metrics https://docs.google.com/a/mozilla.com/spreadsheets/d/1RT1yMbXFnMiJSNRzaCddFfW55NX9vNdjaL3ChouCH54/edit#gid=1020998015 MTBF: 8.23 hours cpu usage is up to 50% even suspended. Update output(b2g-ps, top, dmesg, logcat) as well
On a device which b2g process consumes cpu ~50%, perf shows ~70% in the for loop matching regexp of mozilla::hal_impl::OomVictimLogger::Observe. I'd like to run gdb on the device directly to check, but found there're no symbols in PVT build. Just made a local build, will ask Paul to take another run and use gdb to check when test finish.
BTW, I am not sure whether cpu usage high relates to marionete timeout, will also check that.
From logcat there're a lot of: E/GeckoConsole( 210): Content JS LOG at dummy file:352 in GaiaApps.getDisplayedApp: app with origin 'app://verticalhome.gaiamobile.org' is displayed and there's a bug 1036631 which has been resolved (?).
There's also a repeated died/reborn process from logcat: D/QSEECOMAPI: (22157): QSEECom_get_handle sb_length = 0x2000 D/QSEECOMAPI: (22157): App is not loaded in QSEE E/QSEECOMAPI: (22157): Error::Cannot open the file /vendor/firmware/keymaster/keymaster.mdt E/QSEECOMAPI: (22157): Error::Loading image failed with ret = -1 D/QSEECOMAPI: (22157): QSEECom_get_handle sb_length = 0x2000 D/QSEECOMAPI: (22157): App is not loaded in QSEE E/QSEECOMAPI: (22157): Error::Cannot open the file /firmware/image/keymaste.mdt E/QSEECOMAPI: (22157): Error::Loading image failed with ret = -1 I/cat ( 303): <4>[57586.589981] QSEECOM: qseecom_release: data: released = false, type = 1, data = 0xd030e300 I/cat ( 303): <4>[57586.598860] QSEECOM: qseecom_release: data: released = false, type = 1, data = 0xd030e300 E/QCOMKeyMaster(22157): Loading keymaster app failied E/keystore(22157): could not open keymaster device in keystore (Operation not permitted) E/keystore(22157): keystore keymaster could not be initialized; exiting
(In reply to Ting-Yu Chou [:ting] from comment #42) > From logcat there're a lot of: > > E/GeckoConsole( 210): Content JS LOG at dummy file:352 in > GaiaApps.getDisplayedApp: app with origin > 'app://verticalhome.gaiamobile.org' is displayed > > and there's a bug 1036631 which has been resolved (?). I'll dig into this, the patch of bug 1036631 does not fix it.
I'm not sure if they are the same. Gary, do you think this bug and bug 1036631 are the same?
Flags: needinfo?(gary)
So GaiaApps.getDisplayedApp is called from here http://mxr.mozilla.org/gaia/source/tests/python/gaia-ui-tests/gaiatest/gaia_test.py#80, but the breakpoint in xpc::EvalInSandbox is not hit nor JS::Evaluate when the script keeps executing repeatedly, any ideas?
(In reply to Paul Yang [: pyang] from comment #45) > Gary, do you think this bug and bug 1036631 are the same? Probably, mine was just a continuous burst of that Content JS LOG message, but I never really got around to verifying that it went away properly after the fix landed. So there is a chance that patch didn't fix the issue.
Flags: needinfo?(gary)
Ting and I reviewed the patch - it turn off 2 line of log which seems not related. The symptom of this bug is, when it started to burst of message, you need to check following 1. execute javascript from marionette raise timeout all the time 2. cpu high load even suspended 3. unusual memory usage Can you try to reproduce it by latest build?
Flags: needinfo?(gary)
MTBF runs still encountered this bug multiple times as last run of 10 devices. Some of them even turn themselves off, and there is no log in jenkins about their shutdown of themselves.
(In reply to Ting-Yu Chou [:ting] from comment #46) > So GaiaApps.getDisplayedApp is called from here > http://mxr.mozilla.org/gaia/source/tests/python/gaia-ui-tests/gaiatest/ > gaia_test.py#80, but the breakpoint in xpc::EvalInSandbox is not hit nor > JS::Evaluate > when the script keeps executing repeatedly, any ideas? Not sure if it is related, but from attachment 8493599 [details] there're Sandbox with 1 unknown edge.
I found also GaiaApps.launch() uses waitFor() to wait for the application get launched but does not have a timeout, which will check the condition every 100ms. The checking http://dxr.mozilla.org/mozilla-central/source/testing/marionette/marionette-simpletest.js#178 is not true when timeout is undefined.
Scratch comment 51, correction: When waitFor is called without timeout, and the corresponding Marionette instance has timeout undefined, the waitFor will check the condition every 100ms indefinitely until the test function returns true.
What I found in comment 52 seems duplicate bug 997909, but I haven't figured out is it the reason marionette keeps script timeout. Added more logs for next run as I couldn't repro it locally with the codebase 20140905 in 20 minutes like the subject described.
Paul just reminded me that why waitFor does not timeout, after double checking, its logic is a bit odd: waitFor: function test_waitFor(callback, test, timeout) { ... var now = Date.now(); var deadline = now + (typeof(timeout) == "undefined" ? this.timeout : timeout); if (deadline <= now) { // timeout... } ... this.window.setTimeout(this.waitFor.bind(this), 100, callback, test, deadline - now); }, The 3rd argument to next waitFor call |deadline - now| is exactly the same as timeout or this.timeout, and it does not decrease in following invocations. So if the test function always returns false, the waitFor will never end. Did I misunderstand anything?
Flags: needinfo?(jgriffin)
You're right, there's a bug in this code. Thanks for pointing it out.
Flags: needinfo?(jgriffin)
Gary spoke with me about this. I'll take a stab at this tomorrow. ni?ing myself as a reminder
Flags: needinfo?(gary) → needinfo?(mdas)
Assignee: nobody → jgriffin
Good job, Ting. We'll trigger another run as soon as patch landed.
When MTBF keeps script timeout, the traceback shows: Traceback (most recent call last): File "/home/ting/w/fx/os/mtbf/venv/local/lib/python2.7/site-packages/marionette_client-0.8.4-py2.7.egg/marionette/marionette_test.py", line 246, in run self.setUp() File "/home/ting/w/fx/os/mtbf/venv/local/lib/python2.7/site-packages/gaiatest-0.28-py2.7.egg/gaiatest/gaia_test.py", line 788, in setUp self.cleanup_gaia(full_reset=True) File "/home/ting/w/fx/os/mtbf/MTBF-Driver/mtbf_driver/MtbfTestCase.py", line 76, in cleanup_gaia self.data_layer.set_setting('lockscreen.passcode-lock.code', '1111') File "/home/ting/w/fx/os/mtbf/venv/local/lib/python2.7/site-packages/gaiatest-0.28-py2.7.egg/gaiatest/gaia_test.py", line 221, in set_setting result = self.marionette.execute_async_script('return GaiaDataLayer.setSetting("%s", %s)' % (name, value), special_powers=True) File "/home/ting/w/fx/os/mtbf/venv/local/lib/python2.7/site-packages/marionette_client-0.8.4-py2.7.egg/marionette/marionette.py", line 1252, in execute_async_script filename=os.path.basename(frame[0])) File "/home/ting/w/fx/os/mtbf/venv/local/lib/python2.7/site-packages/marionette_client-0.8.4-py2.7.egg/marionette/decorators.py", line 35, in _ return func(*args, **kwargs) File "/home/ting/w/fx/os/mtbf/venv/local/lib/python2.7/site-packages/marionette_client-0.8.4-py2.7.egg/marionette/marionette.py", line 638, in _send_message self._handle_error(response) File "/home/ting/w/fx/os/mtbf/venv/local/lib/python2.7/site-packages/marionette_client-0.8.4-py2.7.egg/marionette/marionette.py", line 700, in _handle_error raise errors.ScriptTimeoutException(message=message, status=status, stacktrace=stacktrace) This seems not related to waitFor.
Hi Jonathan, According to #60, The symptom of this bug was gecko rejected marionette to execute script. We triggered several run yesterday, and in some of them, we saw loop log but the test was not stopped; that means blocking execution issue and infinite waitFor should be independent. Therefore the patch is for bug 997909 , not sure whether we should leave the patch there otherwise even if it is landed, we can't mark this bug as resolved fixed.(In reply to Ting-Yu Chou [:ting] from comment #60)
Flags: needinfo?(jgriffin)
SettingsLock.set is called, but the logs in onsettingstransactionsuccess and onsettingstransactionfailure both not seen from adb logcat: http://mxr.mozilla.org/gaia/source/tests/atoms/gaia_data_layer.js#216
The onsettingstransaction* callbacks are not happened since the set task is queued for waiting another lock. The pending lock actually runs to broadcastMessage() for sending settings change, but is interrupted by a child-process-shutdown because of OOM.
Attachment #8498099 - Flags: review?(bent.mozilla)
(In reply to Jonathan Griffin (:jgriffin) from comment #58) > try: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=88ca30f0d803 A gip run on Try with this patch could be valuable because it uses Settings a lot.
(In reply to Ting-Yu Chou [:ting] from comment #63) > Created attachment 8498099 [details] [diff] [review] > patch v1: Prevent interrupting broadcastMessage() Try: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=25e0a6eb30f4
Comment on attachment 8497733 [details] [diff] [review] Make waitFor actually terminate if condition is never true, Review of attachment 8497733 [details] [diff] [review]: ----------------------------------------------------------------- Thanks for the quick fix!
Attachment #8497733 - Flags: review?(mdas) → review+
Flags: needinfo?(mdas)
Comment on attachment 8497733 [details] [diff] [review] Make waitFor actually terminate if condition is never true, Moving patch to bug 997909 per pyang.
Attachment #8497733 - Attachment is obsolete: true
Flags: needinfo?(jgriffin)
Assignee: jgriffin → tchou
Status: NEW → ASSIGNED
Component: MTBF → DOM
Product: Firefox OS → Core
Version: unspecified → Trunk
Comment on attachment 8498099 [details] [diff] [review] patch v1: Prevent interrupting broadcastMessage() Review of attachment 8498099 [details] [diff] [review]: ----------------------------------------------------------------- bent is on vacation.
Attachment #8498099 - Flags: review?(bent.mozilla) → review+
Attached patch patch v2 (deleted) — Splinter Review
Added r=khuey to commit message. Kyle, thanks for your help.
Attachment #8498099 - Attachment is obsolete: true
Keywords: checkin-needed
Comment on attachment 8498595 [details] [diff] [review] patch v2 Approval Request Comment [Feature/regressing bug #]: n/a [User impact if declined]: settings change will be queued but not executed once the issue occur [Describe test coverage new/current, TBPL]: n/a [Risks and why]: low, simply add a catch [String/UUID change made/needed]: n/a
Attachment #8498595 - Flags: approval-mozilla-aurora?
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
Attachment #8498595 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Paul, Walter, can you verify this is fixed on trunk and 2.1?
Flags: needinfo?(wachen)
Flags: needinfo?(pyang)
Summary: [MTBF] Marionette keeps script timeout execption after running 20 minutes → [MTBF] Marionette keeps script timeout exception after running 20 minutes
Unfortunately, as I saw in last two runs. It still happens for almost 100%...
Flags: needinfo?(wachen)
Correction: It's bug 997909 that didn't get fixed. However, I haven't seem this bug reproduced yet. I think it should be safe.
Verified by latest pvt build.
Status: RESOLVED → VERIFIED
Flags: needinfo?(pyang)
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: