Open Bug 1629824 Opened 5 years ago Updated 1 year ago

Intermittent asan TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output

Categories

(Core :: Sanitizers, defect, P5)


Tracking

Tracking Status
firefox78 --- disabled

People

(Reporter: intermittent-bug-filer, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure, test-disabled)

Attachments

(1 file)

Filed by: hskupin [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=297446937&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/IplYzIsaQbyN0RK9SxutFA/runs/0/artifacts/public/logs/live_backing.log


This bug covers the ASan-specific subset of bug 1414495, which is an early hang of Firefox during startup.

From the above log:

[task 2020-04-14T01:10:57.721Z] 01:10:57     INFO - Application command: Z:\task_1586825398\build\application\firefox\firefox.exe -marionette --wait-for-browser -foreground -profile c:\users\task_1586825398\appdata\local\temp\tmpigabhm.mozrunner
[task 2020-04-14T01:10:57.726Z] 01:10:57     INFO - runtests.py | Application pid: 5452
[task 2020-04-14T01:10:57.726Z] 01:10:57     INFO - TEST-INFO | started process GECKO(5452)
[task 2020-04-14T01:13:58.043Z] 01:13:58     INFO - runtests.py | Waiting for browser...
[task 2020-04-14T01:17:07.743Z] 01:17:07     INFO - Buffered messages finished
[task 2020-04-14T01:17:07.743Z] 01:17:07    ERROR - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output
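
For context, this timeout is the harness-side no-output watchdog: if the browser process stays silent for the configured interval, the run is aborted with the TEST-UNEXPECTED-TIMEOUT message seen above. A minimal sketch of that kind of watchdog, in Python (this is not the actual automation.py code; the function name and structure are assumptions, only the 370-second default and the message text are taken from the log):

import subprocess
import threading

def run_with_output_watchdog(cmd, no_output_timeout=370):
    """Abort the process if it produces no output for no_output_timeout seconds."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    timer = None

    def on_timeout():
        # Mirrors the TEST-UNEXPECTED-TIMEOUT message from the log above.
        print("TEST-UNEXPECTED-TIMEOUT | application timed out after "
              "%d seconds with no output" % no_output_timeout)
        proc.kill()

    def reset_timer():
        nonlocal timer
        if timer is not None:
            timer.cancel()
        timer = threading.Timer(no_output_timeout, on_timeout)
        timer.start()

    reset_timer()
    for line in proc.stdout:
        print(line, end="")
        reset_timer()  # any output from the browser resets the watchdog
    if timer is not None:
        timer.cancel()
    return proc.wait()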

Usually it looks like this:

[task 2020-04-14T01:10:35.490Z] 01:10:35     INFO - Application command: Z:\task_1586825398\build\application\firefox\firefox.exe -marionette --wait-for-browser -foreground -profile c:\users\task_1586825398\appdata\local\temp\tmpv2ndif.mozrunner
[task 2020-04-14T01:10:35.492Z] 01:10:35     INFO - runtests.py | Application pid: 6272
[task 2020-04-14T01:10:35.492Z] 01:10:35     INFO - TEST-INFO | started process GECKO(6272)
[task 2020-04-14T01:10:37.699Z] 01:10:37     INFO - GECKO(6272) | 1586826637693	Marionette	TRACE	Marionette enabled
[task 2020-04-14T01:10:37.942Z] 01:10:37     INFO - GECKO(6272) | 1586826637941	Marionette	TRACE	Received observer notification toplevel-window-ready

"Marionette enabled" gets logged when the command-line-startup notification is received, or when the marionette.enabled preference or environment variable is set. As you can see above, in this case Marionette is enabled via the command-line argument, so the missing log line means the command-line-startup notification never gets sent.
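
One way to triage these runs is to check which of the startup markers quoted above ever made it into the captured output. A rough, hypothetical helper (the marker strings come from the logs above; the function itself is not part of any existing harness):

def classify_startup_hang(log_lines):
    """Rough triage of a timed-out run based on which startup markers appeared."""
    saw_marionette = any("Marionette enabled" in line for line in log_lines)
    saw_window = any("toplevel-window-ready" in line for line in log_lines)
    if not saw_marionette:
        return "hung before command-line-startup; Marionette never initialized"
    if not saw_window:
        return "Marionette enabled, but no top-level window became ready"
    return "startup markers present; the hang happened later"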

The first failure of this type is actually on March 16th:

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=293425763&repo=autoland&lineNumber=1783

And the number of failures increased on March 19th.

This seems to be a perma fail on the TV jobs for Windows 10 x64 asan on this try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=787cafdb4679d41cda6f8ac1cef2af6125949b2b
This might help to debug the intermittent failures where the last message in the log is "Waiting until startup recorder finished recording startup scripts..."

Florian, for reference it would be good to know which changes you actually pushed to get the test jobs into that state, and also which revision on central they were based on.

Flags: needinfo?(florian)
Assignee: nobody → florian
Status: NEW → ASSIGNED

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #8)

Florian, for reference it would be good to know which changes you actually pushed to get the test jobs into that state, and also which revision on central they were based on.

I don't think any of what I pushed is responsible for making this fail more often. It's just flaky enough that TV's chaos mode makes it almost perma-fail. I actually had one green run yesterday, after several retriggers. For the runs that weren't green, sometimes we hang after the "Waiting until startup recorder finished recording startup scripts..." message and I get a profile uploaded, sometimes we fail with no output at all (similar to your first quote in comment 1). I also did a try push with the mainthreadio feature disabled, and the TV job still failed.

I suspect the underlying bug is that startup profiling doesn't work well on Windows asan, and we either hang during early startup (when enabling the profiler?) or never resolve the promise when trying to capture the profile. I don't see fixing this being a top priority soon, so I attached a patch to skip this entire test manifest for Windows asan.

I haven't looked at the logs of all the intermittent failures reported in the last few weeks to get a sense of how many of them happened while running this folder of tests, but I suspect a lot did.

Flags: needinfo?(florian)
Pushed by fqueze@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/7a0696cf87a3 On Windows asan disable browser startup tests depending on startup profiling, due to intermittent startup hangs, r=emalysz.
Pushed by fqueze@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/91484d28b0cb On Windows asan disable browser startup tests depending on startup profiling, due to intermittent startup hangs, r=emalysz.
Flags: needinfo?(florian)
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla78

This is not fixed; the test was only disabled. As such, reopening this bug.

Status: RESOLVED → REOPENED
Keywords: test-disabled
Resolution: FIXED → ---
Target Milestone: mozilla78 → ---

Note that I can also see the hang on the linux1804 machines in CI. It's perma failing there for the wdspec tests:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=504cf49688f54117ee871e0575768ba70a0107ab

Christian, have you seen such a bad interaction between asan builds and the Gecko profiler before? Any idea what could be done to investigate this further?

Assignee: florian → nobody
Status: REOPENED → NEW
Flags: needinfo?(choller)

ASan builds are slightly slower and consume more memory than regular builds. This can cause different timings (and make certain hangs more likely), and it can run into resource limits and timeouts that regular builds hit less frequently (or only barely reach).

In any case, it is highly unlikely that this is an ASan specific problem. If you think there is a hang, then it probably makes sense to investigate it further as there is a chance that the hang can also occur in regular builds. Otherwise, I'd stick with disabling the test as a fix.

Flags: needinfo?(choller)
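
To illustrate the timing point from the previous comment: harnesses commonly scale their timeouts for instrumented builds. A hypothetical sketch in Python (the multiplier values and the helper itself are assumptions, not what automation.py actually does):

def effective_timeout(base_timeout, build_flags):
    """Scale a harness timeout for slower, instrumented builds (illustrative values only)."""
    multiplier = 1.0
    if build_flags.get("asan"):
        multiplier *= 2.0  # ASan builds run noticeably slower
    if build_flags.get("debug"):
        multiplier *= 1.5
    return base_timeout * multiplier

# e.g. effective_timeout(370, {"asan": True}) -> 740.0 seconds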

Florian, is the test which was causing the problem still disabled? If yes, do we have other tests that can cause the same problem?

Flags: needinfo?(florian)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #26)

Florian, is the test which was causing the problem still disabled? If yes, do we have other tests that can cause the same problem?

The tests we disabled were the browser chrome tests that used startup profiling. When looking at the failures from comment 25, none of them is for a browser-chrome test job.

A few of the asan failure logs I opened seem to have an actionable error message:

 ==1241==ERROR: AddressSanitizer: heap-use-after-free on address 0x604000093732 at pc 0x7fe8eaa4728b bp 0x7ffc3b8dbcf0 sp 0x7ffc3b8dbce8
READ of size 1 at 0x604000093732 thread T0
    #0 0x7fe8eaa4728a in SplitSingleCharHelper<unsigned char> /builds/worker/checkouts/gecko/js/src/builtin/String.cpp:3508:31
    #1 0x7fe8eaa4728a in SplitSingleCharHelper /builds/worker/checkouts/gecko/js/src/builtin/String.cpp:3543:12
    #2 0x7fe8eaa4728a in js::StringSplitString(JSContext*, JS::Handle<js::ObjectGroup*>, JS::Handle<JSString*>, JS::Handle<JSString*>, unsigned int) /builds/worker/checkouts/gecko/js/src/builtin/String.cpp:3573:12
    #3 0x7fe8eacc2cd5 in js::intrinsic_StringSplitString(JSContext*, unsigned int, JS::Value*) /builds/worker/checkouts/gecko/js/src/vm/SelfHosting.cpp:1608:20
    #4 0x1f77c0dbaf7f  (<unknown module>)
0x604000093732 is located 18 bytes inside of 31-byte region [0x604000093720,0x60400009373f)
freed by thread T7 (JS Helper) here:
    #0 0x558792deb0ad in free /builds/worker/fetches/llvm-project/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:123:3
    #1 0x7fe8eb356584 in js_free /builds/worker/workspace/obj-build/dist/include/js/Utility.h:432:3
    #2 0x7fe8eb356584 in freeUntracked /builds/worker/checkouts/gecko/js/src/gc/FreeOp.h:75:33
    #3 0x7fe8eb356584 in js::gc::GCRuntime::freeFromBackgroundThread(js::AutoLockHelperThreadState&) /builds/worker/checkouts/gecko/js/src/gc/GC.cpp:3492:12
    #4 0x7fe8eb3560f9 in js::gc::BackgroundFreeTask::run() /builds/worker/checkouts/gecko/js/src/gc/GC.cpp:3468:7
    #5 0x7fe8eb384286 in js::GCParallelTask::runTask() /builds/worker/checkouts/gecko/js/src/gc/GCParallelTask.cpp:146:3
    #6 0x7fe8eb384065 in js::GCParallelTask::runFromHelperThread(js::AutoLockHelperThreadState&) /builds/worker/checkouts/gecko/js/src/gc/GCParallelTask.cpp:131:5

Jon, is this a GC bug?

Flags: needinfo?(florian) → needinfo?(jcoppeard)

(In reply to Florian Quèze [:florian] from comment #27)
This may be related to nursery string deduplication which landed recently. Steve, can you take a look?

Flags: needinfo?(jcoppeard) → needinfo?(sphink)
Flags: needinfo?(sphink)

Lots of these recent failures are wrongly classified and actually belong to bug 1784591. Maybe a fix there could also help here.

Depends on: 1784591

(In reply to Intermittent Failures Robot from comment #94)

15 failures in 2916 pushes (0.005 failures/push) were associated with this bug in the last 7 days.

  • windows10-64-2004-asan-qr: 2
    • opt: 2

Note that only 2 of these failures are actually ASAN build related. All others have been wrongly classified.

No longer depends on: 1784591

Tests for ASan builds on Linux also seem to have issues with Firefox not opening an initial browser window within the 6-minute timeout:

[task 2022-09-26T15:04:13.780Z] 15:04:13     INFO - GECKO(17874) | 1664204653779	Marionette	DEBUG	Waiting for initial application window
[task 2022-09-26T15:04:17.633Z] 15:04:17     INFO - GECKO(17874) | console.warn: SearchSettings: "get: No settings file exists, new profile?" (new NotFoundError("Could not open the file at /tmp/tmpzrxqzmzx.mozrunner/search.json.mozlz4", (void 0)))
[task 2022-09-26T15:10:13.705Z] 15:10:13     INFO - runtests.py | Waiting for browser...
[task 2022-09-26T15:10:27.636Z] 15:10:27     INFO - Buffered messages finished
[task 2022-09-26T15:10:27.636Z] 15:10:27    ERROR - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output
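
The pattern in this log is a condition wait with a deadline: Marionette waits for the first top-level application window while the harness separately aborts after 370 seconds with no output. A hypothetical sketch of such a deadline wait (the 360-second default matches the 6-minute window mentioned above; the helper and the condition are made up for illustration):

import time

def wait_for(condition, timeout=360, poll_interval=0.1):
    """Poll a condition until it returns a truthy value or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll_interval)
    raise TimeoutError("condition not met within %s seconds" % timeout)

# e.g. wait_for(lambda: browser.has_initial_window())  # 'browser' is hypothetical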

Florian, should we extend your already-landed patch to also disable these tests on Linux, or finally get someone to investigate?

Flags: needinfo?(florian)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #99)

Florian, should we extend your already landed patch and also disable Linux

The patch I landed was specifically for the startup tests that rely on startup profiling and were near-permafail on Windows asan. I see no indication that the recent failures involve startup profiling.

Flags: needinfo?(florian)
Severity: normal → S3