Open Bug 1629824 Opened 5 years ago Updated 1 year ago

Intermittent asan TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output

Categories

(Core :: Sanitizers, defect, P5)


Tracking

Tracking Status
firefox78 --- disabled

People

(Reporter: intermittent-bug-filer, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure, test-disabled)

Attachments

(1 file)

Filed by: hskupin [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=297446937&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/IplYzIsaQbyN0RK9SxutFA/runs/0/artifacts/public/logs/live_backing.log


This bug covers the ASan-specific subset of bug 1414495, which is an early hang of Firefox during startup.

From the above log:

[task 2020-04-14T01:10:57.721Z] 01:10:57     INFO - Application command: Z:\task_1586825398\build\application\firefox\firefox.exe -marionette --wait-for-browser -foreground -profile c:\users\task_1586825398\appdata\local\temp\tmpigabhm.mozrunner
[task 2020-04-14T01:10:57.726Z] 01:10:57     INFO - runtests.py | Application pid: 5452
[task 2020-04-14T01:10:57.726Z] 01:10:57     INFO - TEST-INFO | started process GECKO(5452)
[task 2020-04-14T01:13:58.043Z] 01:13:58     INFO - runtests.py | Waiting for browser...
[task 2020-04-14T01:17:07.743Z] 01:17:07     INFO - Buffered messages finished
[task 2020-04-14T01:17:07.743Z] 01:17:07    ERROR - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output
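
For context, this timeout is the harness-side no-output watchdog: if the browser process stays silent for the configured interval, the run is aborted with the TEST-UNEXPECTED-TIMEOUT message seen above. A minimal sketch of that kind of watchdog, in Python (this is not the actual automation.py code; the function name and structure are assumptions, only the 370-second default and the message text are taken from the log):

import subprocess
import threading

def run_with_output_watchdog(cmd, no_output_timeout=370):
    """Abort the process if it produces no output for no_output_timeout seconds."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    timer = None

    def on_timeout():
        # Mirrors the TEST-UNEXPECTED-TIMEOUT message from the log above.
        print("TEST-UNEXPECTED-TIMEOUT | application timed out after "
              "%d seconds with no output" % no_output_timeout)
        proc.kill()

    def reset_timer():
        nonlocal timer
        if timer is not None:
            timer.cancel()
        timer = threading.Timer(no_output_timeout, on_timeout)
        timer.start()

    reset_timer()
    for line in proc.stdout:
        print(line, end="")
        reset_timer()  # any output from the browser resets the watchdog
    if timer is not None:
        timer.cancel()
    return proc.wait()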

Usually it looks like this:

[task 2020-04-14T01:10:35.490Z] 01:10:35     INFO - Application command: Z:\task_1586825398\build\application\firefox\firefox.exe -marionette --wait-for-browser -foreground -profile c:\users\task_1586825398\appdata\local\temp\tmpv2ndif.mozrunner
[task 2020-04-14T01:10:35.492Z] 01:10:35     INFO - runtests.py | Application pid: 6272
[task 2020-04-14T01:10:35.492Z] 01:10:35     INFO - TEST-INFO | started process GECKO(6272)
[task 2020-04-14T01:10:37.699Z] 01:10:37     INFO - GECKO(6272) | 1586826637693	Marionette	TRACE	Marionette enabled
[task 2020-04-14T01:10:37.942Z] 01:10:37     INFO - GECKO(6272) | 1586826637941	Marionette	TRACE	Received observer notification toplevel-window-ready

"Marionette enabled" gets logged when the command-line-startup notification is received, or when the marionette.enabled preference or environment variable is set. As you can see above, in this case Marionette is enabled via the command-line argument, so the missing log line means the command-line-startup notification never gets sent.
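
One way to triage these runs is to check which of the startup markers quoted above ever made it into the captured output. A rough, hypothetical helper (the marker strings come from the logs above; the function itself is not part of any existing harness):

def classify_startup_hang(log_lines):
    """Rough triage of a timed-out run based on which startup markers appeared."""
    saw_marionette = any("Marionette enabled" in line for line in log_lines)
    saw_window = any("toplevel-window-ready" in line for line in log_lines)
    if not saw_marionette:
        return "hung before command-line-startup; Marionette never initialized"
    if not saw_window:
        return "Marionette enabled, but no top-level window became ready"
    return "startup markers present; the hang happened later"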

The first failure of this type is actually on March 16th:

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=293425763&repo=autoland&lineNumber=1783

And the number of failures increased on March 19th.

This seems to be a perma fail on the TV jobs for Windows 10 x64 asan on this try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=787cafdb4679d41cda6f8ac1cef2af6125949b2b
This might help to debug the intermittent failures where the last message in the log is "Waiting until startup recorder finished recording startup scripts..."

Florian, for reference it would be good to know which changes you actually pushed to get the test jobs into that state, and also which revision on central they were based on.

Flags: needinfo?(florian)
Assignee: nobody → florian
Status: NEW → ASSIGNED

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #8)

Florian, for reference it would be good to know which changes you actually pushed to get the test jobs into that state, and also which revision on central they were based on.

I don't think any of what I pushed is responsible for making this fail more often. It's just flaky enough that TV's chaos mode makes it almost perma-fail. I actually had one green run yesterday, after several retriggers. For the runs that weren't green, sometimes we hang after the "Waiting until startup recorder finished recording startup scripts..." message and I get a profile uploaded, sometimes we fail with no output at all (similar to your first quote in comment 1). I also did a try push with the mainthreadio feature disabled, and the TV job still failed.

I suspect the underlying bug is that startup profiling doesn't work well on Windows asan, and we either hang during early startup (when enabling the profiler?) or never resolve the promise when trying to capture the profile. I don't see fixing this being a top priority soon, so I attached a patch to skip this entire test manifest for Windows asan.

I haven't looked at the logs of all the intermittent failures reported in the last few weeks to get a sense of how many of them happened while running this folder of tests, but I suspect a lot did.

Flags: needinfo?(florian)
Pushed by fqueze@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/7a0696cf87a3 On Windows asan disable browser startup tests depending on startup profiling, due to intermittent startup hangs, r=emalysz.
Pushed by fqueze@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/91484d28b0cb On Windows asan disable browser startup tests depending on startup profiling, due to intermittent startup hangs, r=emalysz.
Flags: needinfo?(florian)
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla78

This is not fixed; the test was only disabled. As such, reopening this bug.

Status: RESOLVED → REOPENED
Keywords: test-disabled
Resolution: FIXED → ---
Target Milestone: mozilla78 → ---

Note that I can also see the hang on the linux1804 machines in CI. It's perma failing there for the wdspec tests:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=504cf49688f54117ee871e0575768ba70a0107ab

Christian, have you seen such a bad interaction between asan builds and the Gecko profiler before? Any idea what could be done to investigate this further?

Assignee: florian → nobody
Status: REOPENED → NEW
Flags: needinfo?(choller)

ASan builds are slightly slower and consume more memory than regular builds. This can cause different timings (and make certain hangs more likely), and it can run into resource limits and timeouts that regular builds hit less frequently (or only barely reach).

In any case, it is highly unlikely that this is an ASan specific problem. If you think there is a hang, then it probably makes sense to investigate it further as there is a chance that the hang can also occur in regular builds. Otherwise, I'd stick with disabling the test as a fix.

Flags: needinfo?(choller)
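
To illustrate the timing point from the previous comment: harnesses commonly scale their timeouts for instrumented builds. A hypothetical sketch in Python (the multiplier values and the helper itself are assumptions, not what automation.py actually does):

def effective_timeout(base_timeout, build_flags):
    """Scale a harness timeout for slower, instrumented builds (illustrative values only)."""
    multiplier = 1.0
    if build_flags.get("asan"):
        multiplier *= 2.0  # ASan builds run noticeably slower
    if build_flags.get("debug"):
        multiplier *= 1.5
    return base_timeout * multiplier

# e.g. effective_timeout(370, {"asan": True}) -> 740.0 seconds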

Florian, is the test which was causing the problem still disabled? If yes, do we have other tests that can cause the same problem?

Flags: needinfo?(florian)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #26)

Florian, is the test which was causing the problem still disabled? If yes, do we have other tests that can cause the same problem?

The tests we disabled were the browser chrome tests that used startup profiling. When looking at the failures from comment 25, none of them is for a browser-chrome test job.

A few of the asan failure logs I opened seem to have an actionable error message:

 ==1241==ERROR: AddressSanitizer: heap-use-after-free on address 0x604000093732 at pc 0x7fe8eaa4728b bp 0x7ffc3b8dbcf0 sp 0x7ffc3b8dbce8
READ of size 1 at 0x604000093732 thread T0
    #0 0x7fe8eaa4728a in SplitSingleCharHelper<unsigned char> /builds/worker/checkouts/gecko/js/src/builtin/String.cpp:3508:31
    #1 0x7fe8eaa4728a in SplitSingleCharHelper /builds/worker/checkouts/gecko/js/src/builtin/String.cpp:3543:12
    #2 0x7fe8eaa4728a in js::StringSplitString(JSContext*, JS::Handle<js::ObjectGroup*>, JS::Handle<JSString*>, JS::Handle<JSString*>, unsigned int) /builds/worker/checkouts/gecko/js/src/builtin/String.cpp:3573:12
    #3 0x7fe8eacc2cd5 in js::intrinsic_StringSplitString(JSContext*, unsigned int, JS::Value*) /builds/worker/checkouts/gecko/js/src/vm/SelfHosting.cpp:1608:20
    #4 0x1f77c0dbaf7f  (<unknown module>)
0x604000093732 is located 18 bytes inside of 31-byte region [0x604000093720,0x60400009373f)
freed by thread T7 (JS Helper) here:
    #0 0x558792deb0ad in free /builds/worker/fetches/llvm-project/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:123:3
    #1 0x7fe8eb356584 in js_free /builds/worker/workspace/obj-build/dist/include/js/Utility.h:432:3
    #2 0x7fe8eb356584 in freeUntracked /builds/worker/checkouts/gecko/js/src/gc/FreeOp.h:75:33
    #3 0x7fe8eb356584 in js::gc::GCRuntime::freeFromBackgroundThread(js::AutoLockHelperThreadState&) /builds/worker/checkouts/gecko/js/src/gc/GC.cpp:3492:12
    #4 0x7fe8eb3560f9 in js::gc::BackgroundFreeTask::run() /builds/worker/checkouts/gecko/js/src/gc/GC.cpp:3468:7
    #5 0x7fe8eb384286 in js::GCParallelTask::runTask() /builds/worker/checkouts/gecko/js/src/gc/GCParallelTask.cpp:146:3
    #6 0x7fe8eb384065 in js::GCParallelTask::runFromHelperThread(js::AutoLockHelperThreadState&) /builds/worker/checkouts/gecko/js/src/gc/GCParallelTask.cpp:131:5

Jon, is this a GC bug?

Flags: needinfo?(florian) → needinfo?(jcoppeard)

(In reply to Florian Quèze [:florian] from comment #27)
This may be related to nursery string deduplication which landed recently. Steve, can you take a look?

Flags: needinfo?(jcoppeard) → needinfo?(sphink)
Flags: needinfo?(sphink)

Lots of these recent failures are wrongly classified and actually belong to bug 1784591. Maybe a fix there could also help here.

Depends on: 1784591

(In reply to Intermittent Failures Robot from comment #94)

15 failures in 2916 pushes (0.005 failures/push) were associated with this bug in the last 7 days.

  • windows10-64-2004-asan-qr: 2
    • opt: 2

Note that only 2 of these failures are actually ASAN build related. All others have been wrongly classified.

No longer depends on: 1784591

Tests for ASan builds on Linux also seem to have issues with Firefox not opening an initial browser window within the 6-minute timeout:

[task 2022-09-26T15:04:13.780Z] 15:04:13     INFO - GECKO(17874) | 1664204653779	Marionette	DEBUG	Waiting for initial application window
[task 2022-09-26T15:04:17.633Z] 15:04:17     INFO - GECKO(17874) | console.warn: SearchSettings: "get: No settings file exists, new profile?" (new NotFoundError("Could not open the file at /tmp/tmpzrxqzmzx.mozrunner/search.json.mozlz4", (void 0)))
[task 2022-09-26T15:10:13.705Z] 15:10:13     INFO - runtests.py | Waiting for browser...
[task 2022-09-26T15:10:27.636Z] 15:10:27     INFO - Buffered messages finished
[task 2022-09-26T15:10:27.636Z] 15:10:27    ERROR - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output
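
The pattern in this log is a condition wait with a deadline: Marionette waits for the first top-level application window while the harness separately aborts after 370 seconds with no output. A hypothetical sketch of such a deadline wait (the 360-second default matches the 6-minute window mentioned above; the helper and the condition are made up for illustration):

import time

def wait_for(condition, timeout=360, poll_interval=0.1):
    """Poll a condition until it returns a truthy value or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll_interval)
    raise TimeoutError("condition not met within %s seconds" % timeout)

# e.g. wait_for(lambda: browser.has_initial_window())  # 'browser' is hypothetical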

Florian, should we extend your already-landed patch to also disable these tests on Linux, or finally get someone to investigate?

Flags: needinfo?(florian)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #99)

Florian, should we extend your already landed patch and also disable Linux

The patch I landed was specifically for the startup tests that rely on startup profiling and were near-permafail on Windows asan. I see no indication that the recent failures involve startup profiling.

Flags: needinfo?(florian)
Severity: normal → S3