Closed Bug 1427396 Opened 7 years ago Closed 4 years ago

Intermittent TEST-UNEXPECTED-TIMEOUT | dom/tests/mochitest/fetch/test_fetch_basic_http_sw_reroute.html | application timed out after 370 seconds with no output

Categories

(Core :: DOM: Service Workers, defect, P5)

defect

Tracking

()

RESOLVED INCOMPLETE

People

(Reporter: intermittent-bug-filer, Assigned: edenchuang)

References

Details

(Keywords: intermittent-failure)

Attachments

(3 files)

Filed by: apavel [at] mozilla.com https://treeherder.mozilla.org/logviewer.html#?job_id=153561027&repo=autoland https://queue.taskcluster.net/v1/task/TFg3jD1rRrKTzFjDNnjdAg/runs/0/artifacts/public/logs/live_backing.log [task 2017-12-29T21:05:46.505Z] 21:05:46 INFO - TEST-START | dom/tests/mochitest/fetch/test_fetch_basic_http_sw_reroute.html [task 2017-12-29T21:11:56.524Z] 21:11:56 INFO - Buffered messages finished [task 2017-12-29T21:11:56.528Z] 21:11:56 ERROR - TEST-UNEXPECTED-TIMEOUT | dom/tests/mochitest/fetch/test_fetch_basic_http_sw_reroute.html | application timed out after 370 seconds with no output [task 2017-12-29T21:11:56.530Z] 21:11:56 ERROR - Force-terminating active process(es). [task 2017-12-29T21:11:56.531Z] 21:11:56 INFO - Determining child pids from psutil... [task 2017-12-29T21:11:56.540Z] 21:11:56 INFO - [4945, 4992] [task 2017-12-29T21:11:56.545Z] 21:11:56 INFO - ==> process 4884 launched child process 4909 [task 2017-12-29T21:11:56.545Z] 21:11:56 INFO - ==> process 4884 launched child process 4945 [task 2017-12-29T21:11:56.545Z] 21:11:56 INFO - ==> process 4884 launched child process 4992 [task 2017-12-29T21:11:56.546Z] 21:11:56 INFO - Found child pids: set([4992, 4945, 4909]) [task 2017-12-29T21:11:56.546Z] 21:11:56 INFO - Failed to get child procs [task 2017-12-29T21:11:56.546Z] 21:11:56 INFO - Killing process: 4992 [task 2017-12-29T21:11:56.546Z] 21:11:56 INFO - TEST-INFO | started process screentopng [task 2017-12-29T21:11:56.826Z] 21:11:56 INFO - TEST-INFO | screentopng: exit 0 [task 2017-12-29T21:11:56.826Z] 21:11:56 INFO - Killing process: 4945 [task 2017-12-29T21:11:56.827Z] 21:11:56 INFO - Not taking screenshot here: see the one that was previously logged [task 2017-12-29T21:11:56.827Z] 21:11:56 INFO - Killing process: 4909 [task 2017-12-29T21:11:56.828Z] 21:11:56 INFO - Not taking screenshot here: see the one that was previously logged [task 2017-12-29T21:11:56.828Z] 21:11:56 INFO - Can't trigger Breakpad, process no longer exists [task 2017-12-29T21:11:56.828Z] 21:11:56 INFO - Killing process: 4884 [task 2017-12-29T21:11:56.835Z] 21:11:56 INFO - Not taking screenshot here: see the one that was previously logged [task 2017-12-29T21:11:56.956Z] 21:11:56 INFO - psutil found pid 4884 dead [task 2017-12-29T21:11:57.148Z] 21:11:57 WARNING - runtests.py | Failed to get app exit code - running/crashed? [task 2017-12-29T21:11:57.150Z] 21:11:57 INFO - TEST-INFO | Main app process: exit 0 [task 2017-12-29T21:11:57.150Z] 21:11:57 INFO - runtests.py | Application ran for: 0:06:29.615614 [task 2017-12-29T21:11:57.151Z] 21:11:57 INFO - zombiecheck | Reading PID log: /tmp/tmpf8q42Apidlog [task 2017-12-29T21:11:57.152Z] 21:11:57 INFO - ==> process 4884 launched child process 4909 [task 2017-12-29T21:11:57.153Z] 21:11:57 INFO - ==> process 4884 launched child process 4945 [task 2017-12-29T21:11:57.154Z] 21:11:57 INFO - ==> process 4884 launched child process 4992 [task 2017-12-29T21:11:57.155Z] 21:11:57 INFO - zombiecheck | Checking for orphan process with PID: 4992 [task 2017-12-29T21:11:57.155Z] 21:11:57 INFO - zombiecheck | Checking for orphan process with PID: 4945 [task 2017-12-29T21:11:57.155Z] 21:11:57 INFO - zombiecheck | Checking for orphan process with PID: 4909 [task 2017-12-29T21:11:57.155Z] 21:11:57 INFO - mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/SF_YeJbNRxu2Ccf3RZEByg/artifacts/public/build/target.crashreporter-symbols.zip [task 2017-12-29T21:12:03.197Z] 21:12:03 INFO - mozcrash Copy/paste: /usr/local/bin/linux64-minidump_stackwalk /tmp/tmpeZ2h5A.mozrunner/minidumps/3c9b2d6a-b3e7-27cf-6330-3053cc36b10c.dmp /tmp/tmpuieA0b [task 2017-12-29T21:12:12.157Z] 21:12:12 INFO - mozcrash Saved minidump as /builds/worker/workspace/build/blobber_upload_dir/3c9b2d6a-b3e7-27cf-6330-3053cc36b10c.dmp [task 2017-12-29T21:12:12.158Z] 21:12:12 INFO - mozcrash Saved app info as /builds/worker/workspace/build/blobber_upload_dir/3c9b2d6a-b3e7-27cf-6330-3053cc36b10c.extra [task 2017-12-29T21:12:12.218Z] 21:12:12 INFO - PROCESS-CRASH | Main app process exited normally | application crashed [@ libc-2.23.so + 0xfb70d] [task 2017-12-29T21:12:12.219Z] 21:12:12 INFO - Crash dump filename: /tmp/tmpeZ2h5A.mozrunner/minidumps/3c9b2d6a-b3e7-27cf-6330-3053cc36b10c.dmp [task 2017-12-29T21:12:12.220Z] 21:12:12 INFO - Operating system: Linux [task 2017-12-29T21:12:12.221Z] 21:12:12 INFO - 0.0.0 Linux 4.4.0-98-generic #121~14.04.1-Ubuntu SMP Wed Oct 11 11:54:55 UTC 2017 x86_64 [task 2017-12-29T21:12:12.222Z] 21:12:12 INFO - CPU: amd64 [task 2017-12-29T21:12:12.222Z] 21:12:12 INFO - family 6 model 62 stepping 4 [task 2017-12-29T21:12:12.223Z] 21:12:12 INFO - 2 CPUs [task 2017-12-29T21:12:12.224Z] 21:12:12 INFO - [task 2017-12-29T21:12:12.224Z] 21:12:12 INFO - GPU: UNKNOWN [task 2017-12-29T21:12:12.225Z] 21:12:12 INFO - [task 2017-12-29T21:12:12.226Z] 21:12:12 INFO - Crash reason: SIGABRT [task 2017-12-29T21:12:12.227Z] 21:12:12 INFO - Crash address: 0x3e8000003d3 [task 2017-12-29T21:12:12.228Z] 21:12:12 INFO - Process uptime: not available
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE
Status: REOPENED → RESOLVED
Closed: 7 years ago6 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → INCOMPLETE

(In reply to Andreea Pavel [:apavel] from comment #29)

Recent occurrences for bug 1492340, bug 1531158 and this one occurred on the 27th of February.

Could the recent failures be from bug 1525959? James, can you please take a look?

We did start running this test in that bug, but it's not the root cause of the problem. Andrew, maybe some SW folks could try to make this more reliable? Otherwise I guess we can skip it, but I'd rather not.

Flags: needinfo?(snorp) → needinfo?(overholt)

Sure. I think Eden may have a bit of time to look.

I briefly glanced and am confused about where the relevant stack trace is. At https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=231507730&repo=autoland&lineNumber=3748 I see lots of frames in AOSP code. Is there a better place to be looking?

Component: DOM → DOM: Service Workers
Flags: needinfo?(snorp)
Flags: needinfo?(overholt)
Flags: needinfo?(echuang)
Priority: P5 → P2

Oh, and I'll ask the silly question: can one reproduce this locally? :)

You can probably ignore the stack. The stack is just whatever we were running when the harness kills the process due to the timeout. You can probably repro locally with mach mochitest --app org.mozilla.geckoview.test <path to test>.

Flags: needinfo?(snorp)
Assignee: nobody → echuang
Flags: needinfo?(echuang)

In the last 7 days there have been 34 occurrences on Android 7 and Linux 64, build type opt.

Over the last 7 days there are 44 crashes on this bug. These happen on linux32, linux64-pgo-qr.

Here is the latest failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=235663378&repo=autoland&lineNumber=9420

Flags: needinfo?(echuang)

Sorry for so late response. This bug can not be reproduced in my local. :(

All timeout bugs under dom/tests/mochitest/fetch/ are the same reason, I set them as "duplicate" of this bug.

After analyzing the log from the try runs, the root cause of timeout is failing to load the script of <script> element in dom/tests/mochitest/fetch/fetch_test_framework.js::windowTest. Since loading failed, no SimpleTest.finish() is called, then the test goes into timeout status.
Here is the warning message

Web Content: [JavaScript Warning: "Loading failed for the <script> with source “http://mochi.test:8888/tests/dom/tests/mochitest/fetch/test_formdataparsing.jsâ€." {file: "http://mochi.test:8888/tests/dom/tests/mochitest/fetch/reroute.html?test_formdataparsing" line: 1}]

Compare to the test of non-sw_reroute version, the special characters, “ and â€, are prepended and appended on loading URI, it probably is the reason why loading failed.

I am figuring out the actual reason why script loading fail.

Flags: needinfo?(echuang)

(In reply to Eden Chuang[:edenchuang] from comment #45)

Web Content: [JavaScript Warning: "Loading failed for the <script> with source “http://mochi.test:8888/tests/dom/tests/mochitest/fetch/test_formdataparsing.jsâ€." {file: "http://mochi.test:8888/tests/dom/tests/mochitest/fetch/reroute.html?test_formdataparsing" line: 1}]

That reminds me very much of https://bugzilla.mozilla.org/show_bug.cgi?id=1534732#c7 ... although that is apparently not related to particular tests, while this is particular to the sw_reroute tests.

Compare to the test of non-sw_reroute version, the special characters, “ and â€, are prepended and appended on loading URI, it probably is the reason why loading failed.

I think those special characters are "normal" for android, but I'm not 100% sure.

Let's skip all these sw_reroute tests temporarily, while we continue investigating,
so that fewer failures are reported and fewer alerts/needinfos are generated!

Pushed by gbrown@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/3f325e3b7598 Skip *_sw_reroute* mochitests on Android 7.0; r=edenchuang
Whiteboard: [stockwell disable-recommended] → [stockwell disabled]

There are 150 total failures in the last 7 days, however, since yesterday, failure rate has significantly decreased.
Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=237054254&repo=autoland&lineNumber=9583

[task 2019-03-30T00:46:40.875Z] 00:46:40 INFO - TEST-START | dom/tests/mochitest/fetch/test_fetch_basic_http_sw_reroute.html
[task 2019-03-30T00:52:50.884Z] 00:52:50 INFO - Buffered messages finished
[task 2019-03-30T00:52:50.885Z] 00:52:50 ERROR - TEST-UNEXPECTED-TIMEOUT | dom/tests/mochitest/fetch/test_fetch_basic_http_sw_reroute.html | application timed out after 370 seconds with no output
[task 2019-03-30T00:52:50.887Z] 00:52:50 ERROR - Force-terminating active process(es).
[task 2019-03-30T00:52:50.888Z] 00:52:50 INFO - Determining child pids from psutil...
[task 2019-03-30T00:52:50.894Z] 00:52:50 INFO - [5902, 5925, 5974, 5998]
[task 2019-03-30T00:52:50.894Z] 00:52:50 INFO - ==> process 5837 launched child process 5862
[task 2019-03-30T00:52:50.895Z] 00:52:50 INFO - ==> process 5837 launched child process 5902
[task 2019-03-30T00:52:50.897Z] 00:52:50 INFO - ==> process 5837 launched child process 5925
[task 2019-03-30T00:52:50.898Z] 00:52:50 INFO - ==> process 5837 launched child process 5974
[task 2019-03-30T00:52:50.899Z] 00:52:50 INFO - ==> process 5837 launched child process 5998
[task 2019-03-30T00:52:50.900Z] 00:52:50 INFO - Found child pids: set([5862, 5974, 5925, 5902, 5998])
[task 2019-03-30T00:52:50.901Z] 00:52:50 INFO - Failed to get child procs
[task 2019-03-30T00:52:50.901Z] 00:52:50 INFO - Killing process: 5862
[task 2019-03-30T00:52:50.902Z] 00:52:50 INFO - TEST-INFO | started process screentopng

After more analyzing on the log, I found the HttpChannelChild::Cancel(NS_BINDING_ABORTED) is called unexpected, although the data from the parent process are the same with the normal case. However, the Cancel makes the script loading fail and times out the test.

ServiceWorker Termination cancels the HttpChannel and causes the script loading failed.

Comparing to the situation in other platforms, the ServiceWorker termination is expected since no jobs for ServiceWorker. However, the difference is the termination job will be executed after HttpChannel::OnStopRequest, therefore, there is no loading fail in other platforms.

Over the last 7 days there are 38 failures present on this bug. These happen on linux32-shippable, linux64, linux64-qr, linux64-shippable, linux64-shippable-qr, windows7-32-shippable

Here is the most recent log example: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=240860942&repo=autoland&lineNumber=9481

[task 2019-04-17T06:42:28.311Z] 06:42:28 INFO - SimpleTest START
[task 2019-04-17T06:42:28.319Z] 06:42:28 INFO - TEST-START | dom/tests/mochitest/fetch/test_fetch_basic.html
[task 2019-04-17T06:42:30.124Z] 06:42:30 INFO - GECKO(5875) | MEMORY STAT vsizeMaxContiguous not supported in this build configuration.
[task 2019-04-17T06:42:30.126Z] 06:42:30 INFO - GECKO(5875) | MEMORY STAT | vsize 2497MB | residentFast 117MB | heapAllocated 15MB
[task 2019-04-17T06:42:30.147Z] 06:42:30 INFO - TEST-OK | dom/tests/mochitest/fetch/test_fetch_basic.html | took 1828ms
[task 2019-04-17T06:42:30.196Z] 06:42:30 INFO - TEST-START | dom/tests/mochitest/fetch/test_fetch_basic_http.html
[task 2019-04-17T06:42:31.109Z] 06:42:31 INFO - GECKO(5875) | MEMORY STAT | vsize 2500MB | residentFast 119MB | heapAllocated 15MB
[task 2019-04-17T06:42:31.126Z] 06:42:31 INFO - TEST-OK | dom/tests/mochitest/fetch/test_fetch_basic_http.html | took 929ms
[task 2019-04-17T06:42:31.166Z] 06:42:31 INFO - TEST-START | dom/tests/mochitest/fetch/test_fetch_basic_http_sw_empty_reroute.html
[task 2019-04-17T06:42:32.158Z] 06:42:32 INFO - GECKO(5875) | MEMORY STAT | vsize 2501MB | residentFast 120MB | heapAllocated 14MB
[task 2019-04-17T06:42:32.178Z] 06:42:32 INFO - TEST-OK | dom/tests/mochitest/fetch/test_fetch_basic_http_sw_empty_reroute.html | took 1009ms
[task 2019-04-17T06:42:32.199Z] 06:42:32 INFO - TEST-START | dom/tests/mochitest/fetch/test_fetch_basic_http_sw_reroute.html
[task 2019-04-17T06:48:42.216Z] 06:48:42 INFO - Buffered messages finished
[task 2019-04-17T06:48:42.217Z] 06:48:42 ERROR - TEST-UNEXPECTED-TIMEOUT | dom/tests/mochitest/fetch/test_fetch_basic_http_sw_reroute.html | application timed out after 370 seconds with no output
[task 2019-04-17T06:48:42.217Z] 06:48:42 ERROR - Force-terminating active process(es).
[task 2019-04-17T06:48:42.218Z] 06:48:42 INFO - Determining child pids from psutil...
[task 2019-04-17T06:48:42.222Z] 06:48:42 INFO - [6040, 5939, 5962, 5997]
[task 2019-04-17T06:48:42.224Z] 06:48:42 INFO - ==> process 5875 launched child process 5900
[task 2019-04-17T06:48:42.227Z] 06:48:42 INFO - ==> process 5875 launched child process 5939
[task 2019-04-17T06:48:42.229Z] 06:48:42 INFO - ==> process 5875 launched child process 5962
[task 2019-04-17T06:48:42.235Z] 06:48:42 INFO - ==> process 5875 launched child process 5997
[task 2019-04-17T06:48:42.236Z] 06:48:42 INFO - ==> process 5875 launched child process 6040
[task 2019-04-17T06:48:42.237Z] 06:48:42 INFO - Found child pids: set([6040, 5962, 5939, 5900, 5997])
[task 2019-04-17T06:48:42.238Z] 06:48:42 INFO - Failed to get child procs
[task 2019-04-17T06:48:42.240Z] 06:48:42 INFO - Killing process: 6040
[task 2019-04-17T06:48:42.241Z] 06:48:42 INFO - TEST-INFO | started process screentopng
[task 2019-04-17T06:48:42.532Z] 06:48:42 INFO - TEST-INFO | screentopng: exit 0
[task 2019-04-17T06:48:42.532Z] 06:48:42 INFO - Killing process: 5962
[task 2019-04-17T06:48:42.535Z] 06:48:42 INFO - Not taking screenshot here: see the one that was previously logged
[task 2019-04-17T06:48:42.537Z] 06:48:42 INFO - Killing process: 5939
[task 2019-04-17T06:48:42.539Z] 06:48:42 INFO - Not taking screenshot here: see the one that was previously logged
[task 2019-04-17T06:48:42.541Z] 06:48:42 INFO - Killing process: 5900
[task 2019-04-17T06:48:42.543Z] 06:48:42 INFO - Not taking screenshot here: see the one that was previously logged
[task 2019-04-17T06:48:42.545Z] 06:48:42 INFO - Can't trigger Breakpad, process no longer exists
[task 2019-04-17T06:48:42.547Z] 06:48:42 INFO - Killing process: 5997
[task 2019-04-17T06:48:42.548Z] 06:48:42 INFO - Not taking screenshot here: see the one that was previously logged
[task 2019-04-17T06:48:42.550Z] 06:48:42 INFO - Killing process: 5875
[task 2019-04-17T06:48:42.552Z] 06:48:42 INFO - Not taking screenshot here: see the one that was previously logged

Flags: needinfo?(overholt)

Eden's looking at this.

Flags: needinfo?(overholt)

This bug has failed 60 times in the last 7 days. Occurs on linux64-shippable-qr, linux64-qr on opt build type.

Recent log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=242800465&repo=mozilla-inbound&lineNumber=9388

Joel, can you take a look at this too?

Flags: needinfo?(jmaher)

Note that all of the recent Android and some of the recent Linux failures are for a different test:

dom/tests/mochitest/fetch/test_formdataparsing_sw_reroute.html

I see half and half for:
dom/tests/mochitest/fetch/test_fetch_basic_http_sw_reroute.html

and:
dom/tests/mochitest/fetch/test_formdataparsing_sw_reroute.html

most of these are on linux && !debug and split between 64/32 and regular/qr.

I would honestly disable both for linux&&!debug so we could see what remains- 2+ weeks of no mention of a pending fix it is safe to disable.

Flags: needinfo?(jmaher)

Pushed by ncsoregi@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/418adda2c4ce
disable test_fetch_basic_http_sw_reroute.html and test_formdataparsing_sw_reroute.html on linux and not debug r=jmaher

Keywords: checkin-needed

The ServiceWorker is terminated by idleWorkerTimer timeout, and in slow platforms/machines, it would cancel the uncompleted fetching and make unexpected behavior during running tests.

The SW idle termination implementation has no defects, so the fix would just double the dom.serviceWorkers.idle_timeout to let fetchings have more chance to finish before idleWorkerTimer timeout.

Keywords: checkin-needed

Pushed by csabou@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/725ea89ab2d2
Twice the dom.serviceWorkers.idle_timeout value for tests under dom/tests/mochitest/fetch/ to reduce the impact SW termination by timeout. r=perry

Keywords: checkin-needed
Priority: P2 → P5
Severity: normal → S3

The latest failures are all on ESR (since 7 months). We should observe this after ESR 78 is out.

Hi Geoff, it seems that the tests are enabled again with Eden's patch and are only failing (rarely) on ESR 68. Can we remove the [stockwell disabled] whiteboard then?

Flags: needinfo?(gbrown)

Yes, that's right - done!

Flags: needinfo?(gbrown)
Whiteboard: [stockwell disabled]
Status: REOPENED → RESOLVED
Closed: 6 years ago4 years ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: