Open Bug 1411358 Opened 7 years ago Updated 9 months ago

Intermittent [taskcluster:error] Task timeout after 3600 seconds. Force killing container. / [taskcluster:error] Task timeout after 5400 seconds. Force killing container. / [taskcluster:error] Task timeout after 7200 seconds. Force killing container.

Categories

(Testing :: General, defect, P3)

defect

Tracking

(Not tracked)

People

(Reporter: gbrown, Unassigned, NeedInfo)

References

(Depends on 5 open bugs)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell infra] [see summary at comment 278] [relops-android])

+++ This bug was initially created as a clone of Bug #1204281 +++

There is considerable history in bug 1204281; I have cloned and closed that bug because it was simply too big -- too many comments, too much ancient history.

I expect these failures to continue. Task timeouts are the last resort whenever a task is hung or otherwise runs for too long. The immediate goal is to address timeouts which recur in the same task frequently.
Whiteboard: [stockwell disable-recommended] → [stockwell infra]
No longer depends on: 1407687
Bug 1339568 is the single biggest cause of these failures.
Depends on: 1412953
Waiting for updates on bug 1339568, which depends on bug 1357466, which had no progress in the last months.
Still waiting for updates on bug 1339568.
Flags: needinfo?(gbrown)
Bug 1339568 remains the biggest contributer of failures reported in this bug.
Flags: needinfo?(gbrown)
Update: there are 65 failures for the last 7 days
Some tests have been disabled in bug 1339568, reducing mochitest-media-e10s-2 shutdown hangs considerably.

I expect to still see some intermittent failures reported here from other causes.
Depends on: 1422863
In the last week there have been 31 total failures.
The failures occur on debug and opt build types and on the following platforms:
- android-4-3-armv7-api16: 14
- android-api-16-gradle: 5
- Android 4.2 x86: 2
- Linux x64: 2
- linux64-stylo-disabled: 2
- android-4-0-armv7-api16: 1
- Linux: 1
- linux32-stylo-disabled: 1
- linux64-ccov: 1
- linux64-qr: 1
- windows-mingw32-32: 1

Here is a recent log file: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=150857059&lineNumber=9563
And a relevant snippet from it:
[task 2017-12-08T21:22:31.523Z] 21:22:31     INFO -  make[4]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox/xpcom/build'
9560
[task 2017-12-08T21:22:31.524Z] 21:22:31     INFO -  make[4]: Entering directory '/builds/worker/workspace/build/src/obj-firefox/xpcom/build'
9561
[task 2017-12-08T21:22:31.557Z] 21:22:31     INFO -  make[4]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox/xpcom/build'
9562

9563
[taskcluster:error] Task timeout after 7200 seconds. Force killing container.
9564
[taskcluster 2017-12-08 21:22:56.375Z] === Task Finished ===
9565
[taskcluster 2017-12-08 21:22:56.383Z] Unsuccessful task run with exit code: -1 completed in 7596.094 seconds


:gbrown, any updates on this?
Flags: needinfo?(gbrown)
The latest increase in frequency is due to failing Android builds. I cannot tell if that is an on-going problem, or a temporary glitch but will continue to monitor.
Flags: needinfo?(gbrown)
Most recent failures have been Android tests, for a variety of reasons. Some show intermittent download failures/retries -- hopefully a temporary condition. I don't see enough consistency to warrant further investigation right now.
Depends on: 1430668
Last week there were a variety of tasks failing here. The biggest group is linux32/debug jsreftests -- bug 1430668.
Depends on: 1433560
Some recent android opt mochitest failures here - bug 1433560 should help.

Also some test-verify failures - bug 1431125.
Depends on: 1431125
Priority: -- → P3
Depends on: 1435079
Recent failures seem to be mostly bug 1321605 -- probably nothing more I can do there. There may also be some fallout from android mochitest run-by-manifest -- keeping an eye on that.
There are 83 failures in the past 7 days.
Platforms: most of them on  android-4-3-armv7-api16 debug/ opt and linux64-jsdcov opt; and a few occurrences on Linux/Linux x64 debug, linux32-stylo-disabled debug, linux64-jsdcov opt, linux64-qr debug and windows10-64-ccov debug.
Recent failure log example:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=165586428&lineNumber=5170877
Depends on: 1442823
(In reply to OrangeFactor Robot from comment #37)
> Platform breakdown:
> * linux64-jsdcov: 91
> * android-4-3-armv7-api16: 11

The linux64-jsdcov problem is apparent here -- working on that in bug 1442823.
Keywords: leave-open
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c291e7dfe010
Increase taskcluster max-run-time for Android mochitest-chrome and Android/opt mochitest; r=me,a=test-only
There are 105 failures in the past week.
Platforms: 
- Android 4.2 x86 opt
- android-4-0-armv7-api16 opt and debug
- android-4-3-armv7-api16 opt and debug
- android-5-0-aarch64 opt
- Linux opt and debug (many occurrences)
- Linux x64 opt, debug and asan (many occurrences)
- linux32-stylo-disabled debug
- linux64-ccov opt
- linux64-jsdcov opt (many occurrences)
- linux64-noopt debug
- linux64-qr opt and debug
- linux64-stylo-disabled opt and debug
- windows10-64-ccov debug

Recent log failure: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=168753923&lineNumber=1537

Relevant part of the log: 
[task 2018-03-17T23:01:53.204Z] 23:01:53     INFO -  Successfully installed browsermob-proxy-0.6.0 firefox-puppeteer-52.1.0 marionette-driver-2.5.0 marionette-harness-4.3.0 wptserve-1.4.0
[task 2018-03-17T23:01:53.253Z] 23:01:53     INFO - Return code: 0
[task 2018-03-17T23:01:53.255Z] 23:01:53     INFO - Installing None into virtualenv /builds/worker/workspace/build/venv
[task 2018-03-17T23:01:53.256Z] 23:01:53     INFO - error resolving pypi.pvt.build.mozilla.org (ignoring): 

[taskcluster:error] Task timeout after 5400 seconds. Force killing container.
[task 2018-03-17T23:01:53.260Z] 23:01:53     INFO - retry: Calling run_command with args: [['/builds/worker/workspace/build/venv/bin/pip', 'install', '--timeout', '120', '-r', '/builds/worker/workspace/build/tests/config/marionette_requirements.txt', '--no-index', '--find-links', 'http://pypi.pub.build.mozilla.org/pub', '--trusted-host', 'pypi.pub.build.mozilla.org']], kwargs: {'error_level': 'warning', 'error_list': [{'substr': 'not found or a compiler error:', 'level': 'warning'}, {'regex': <_sre.SRE_Pattern object at 0x7f7cc33d4490>, 'level': 'error'}, {'regex': <_sre.SRE_Pattern object at 0x7f7cc3490300>, 'level': 'warning'}, {'regex': <_sre.SRE_Pattern object at 0x158aa80>, 'level': 'debug'}, {'substr': 'command not found', 'level': 'error'}, {'regex': <_sre.SRE_Pattern object at 0x7f7cc488fbb8>, 'level': 'warning'}, {'substr': 'Traceback (most recent call last)', 'level': 'error'}, {'substr': 'SyntaxError: ', 'level': 'error'}, {'substr': 'TypeError: ', 'level': 'error'}, {'substr': 'NameError: ', 'level': 'error'}, {'substr': 'ZeroDivisionError: ', 'level': 'error'}, {'regex': <_sre.SRE_Pattern object at 0x7f7cc3493188>, 'level': 'critical'}, {'regex': <_sre.SRE_Pattern object at 0x7f7cc48a7b28>, 'level': 'critical'}], 'cwd': '/builds/worker/workspace/build/tests/config', 'env': {'TASKCLUSTER_PORT_80_TCP_ADDR': '172.17.0.2', 'TASKCLUSTER_INSTANCE_TYPE': 'm3.large', 'TASKCLUSTER_WORKER_TYPE': 'gecko-t-linux-large', 'MOZ_AUTOMATION': '1', 'MOZ_SOURCE_CHANGESET': 'efce78e62b6dac195e7cb4898684da54155d1661', 'MOCHITEST_FLAVOR': 'plain', 'LOGNAME': 'worker', 'USER': 'worker', 'HOME': '/builds/worker', 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/builds/worker/bin', 'DISPLAY': ':0', 'LANG': 'en_US.UTF-8', 'TERM': 'xterm', 'SHELL': '/bin/bash', 'MOZ_NODE_PATH': '/usr/local/bin/node', 'RUN_ID': '0', 'HG_STORE_PATH': '/builds/worker/checkouts/hg-store', 'MOZILLA_BUILD_URL': 'https://queue.taskcluster.net/v1/task/cjvNZxdBSKKfmKHtz6MnGQ/artifacts/public/build/target.tar.bz2', 'TASKCLUSTER_PORT': 'tcp://172.17.0.2:80', 'MOZHARNESS_SCRIPT': 'desktop_unittest.py', 'TASKCLUSTER_NAME': '/laughing_khorana/taskcluster', 'GECKO_HEAD_REPOSITORY': 'https://hg.mozilla.org/mozilla-central', 'SCCACHE_DISABLE': '1', 'TASKCLUSTER_PORT_80_TCP_PORT': '80', 'MOZ_SOURCE_REPO': 'https://hg.mozilla.org/mozilla-central', 'GECKO_HEAD_REV': 'efce78e62b6dac195e7cb4898684da54155d1661', 'MOZHARNESS_URL': 'https://queue.taskcluster.net/v1/task/cjvNZxdBSKKfmKHtz6MnGQ/artifacts/public/build/mozharness.zip', 'GECKO_BASE_REPOSITORY': 'https://hg.mozilla.org/mozilla-unified', 'LC_ALL': 'en_US.UTF-8', 'TASKCLUSTER_PORT_80_TCP_PROTO': 'tcp', 'TASK_ID': 'OgTQKS-PS-udVgG1SWL7lA', 'TASKCLUSTER_PORT_80_TCP': 'tcp://172.17.0.2:80', 'OLDPWD': '/builds/worker/workspace', 'HOSTNAME': 'taskcluster-worker', 'SHLVL': '1', 'PWD': '/builds/worker/workspace', 'NEED_PULSEAUDIO': 'true', 'ENABLE_E10S': 'false', 'NEED_WINDOW_MANAGER': 'true', 'MOZHARNESS_CONFIG': 'unittests/linux_unittest.py remove_executables.py', 'TASKCLUSTER_WORKER_GROUP': 'us-west-1', 'TASKCLUSTER_PUBLIC_IP': '54.183.87.60'}}, attempt #1
[task 2018-03-17T23:01:53.261Z] 23:01:53     INFO - Running command: ['/builds/worker/workspace/build/venv/bin/pip', 'install', '--timeout', '120', '-r', '/builds/worker/workspace/build/tests/config/marionette_requirements.txt', '--no-index', '--find-links', 'http://pypi.pub.build.mozilla.org/pub', '--trusted-host', 'pypi.pub.build.mozilla.org'] in /builds/worker/workspace/build/tests/config
[task 2018-03-17T23:01:53.261Z] 23:01:53     INFO - Copy/paste: /builds/worker/workspace/build/venv/bin/pip install --timeout 120 -r /builds/worker/workspace/build/tests/config/marionette_requirements.txt --no-index --find-links http://pypi.pub.build.mozilla.org/pub --trusted-host pypi.pub.build.mozilla.org
[task 2018-03-17T23:01:53.261Z] 23:01:53     INFO - Using env: {'DISPLAY': ':0',
Some failures are from 'try -p all -u all' runs which have triggered jsdcov/dev-tools failures -- carry over from bug 1442823.
Many of the recent logs on linux and android show pypi failures, similar to bug 1445580.
Depends on: 1445580
Depends on: 1448049
Over the last 7 days there are 81 failures on this bug. These happen on Android 4.2 x86, android-4-0-armv7-api16, android-4-3-armv7-api16, android-5-0-aarch64, Linux, Linux x64, linux64-qr, 

Here is a recent log example: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=170655917&lineNumber=12538

[taskcluster:error] Task timeout after 7200 seconds. Force killing container.
Flags: needinfo?(gbrown)
See comment 0 and bug dependencies. Investigation and mitigation is on-going.
Flags: needinfo?(gbrown)
Recent failures on Android and Linux are frequent and irregular: Not associated with particular tests or test tasks. Some are related to network failures or delays during 'pip install':

[task 2018-04-03T18:13:55.354Z] 18:13:55     INFO -  Collecting jsonschema==2.5.1
[task 2018-04-03T18:15:55.610Z] 18:15:55     INFO -    Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x7fba9e73e3d0>, 'Connection to pypi.pub.build.mozilla.org timed out. (connect timeout=120.0)')': /pub
[task 2018-04-03T18:17:56.643Z] 18:17:56     INFO -    Retrying (Retry(total=3, connect=None, read=None, redirect=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x7fba9e73e550>, 'Connection to pypi.pub.build.mozilla.org timed out. (connect timeout=120.0)')': /pub
[task 2018-04-03T18:19:58.106Z] 18:19:58     INFO -    Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x7fba9e73e6d0>, 'Connection to pypi.pub.build.mozilla.org timed out. (connect timeout=120.0)')': /pub
[task 2018-04-03T18:22:00.205Z] 18:22:00     INFO -    Retrying (Retry(total=1, connect=None, read=None, redirect=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x7fba9e73e850>, 'Connection to pypi.pub.build.mozilla.org timed out. (connect timeout=120.0)')': /pub
[task 2018-04-03T18:24:04.445Z] 18:24:04     INFO -    Retrying (Retry(total=0, connect=None, read=None, redirect=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x7fba9e73e9d0>, 'Connection to pypi.pub.build.mozilla.org timed out. (connect timeout=120.0)')': /pub
[task 2018-04-03T18:26:04.821Z] 18:26:04     INFO -    Could not find a version that satisfies the requirement jsonschema==2.5.1 (from versions: )
[task 2018-04-03T18:26:04.822Z] 18:26:04     INFO -  No matching distribution found for jsonschema==2.5.1
Depends on: 1452956
Depends on: 1451432
Updates:
There have been 141 failures in the last 7 days.

Summary: Intermittent [taskcluster:error] Task timeout after 3600 seconds. Force killing container. / [taskcluster:error] Task timeout after 5400 seconds. Force killing container. / [taskcluster:error] Task timeout after 7200 seconds. Force killing container.

The most affected build types are debug and opt, but there are also over 10 failures on asan & pgo.

Failures per platforms:
Linux: 32	
Linux x64: 32
android-4-3-armv7-api16: 31
linux64-qr: 19
linux64-ccov: 18
android-4-0-armv7-api16: 5
android-5-0-aarch64: 2
linux32-nightly: 1
Android 4.2 x86: 1
Bug 1451432 remains my biggest current concern here.
Over the last 7 days there are 99 failures present on this bug. These happen on Android 4.2 x86, android-4-3-armv7-api16, android-5-0-aarch64, Linux, Linux x64, linux64-ccov, linux64-qr, Windows 7.

Here is the most recent log example: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=176126037&lineNumber=25270


[task 2018-04-28T15:11:00.833Z] 15:11:00     INFO - TEST-START | /referrer-policy/unsafe-url/meta-referrer/same-origin/http-https/iframe-tag/generic.swap-origin-redirect.http.html
[task 2018-04-28T15:11:00.844Z] 15:11:00     INFO - PID 15175 | --DOMWINDOW == 28 (0x7fa601cc9800) [pid = 15232] [serial = 17] [outer = (nil)] [url = http://web-platform.test:8000/referrer-policy/generic/subresource/document.py?redirection=no-redirect&cache_destroyer=1524928253055]
[task 2018-04-28T15:11:00.849Z] 15:11:00     INFO - PID 15175 | --DOMWINDOW == 27 (0x7fa601508000) [pid = 15232] [serial = 14] [outer = (nil)] [url = about:blank]
[task 2018-04-28T15:11:00.853Z] 15:11:00     INFO - PID 15175 | --DOMWINDOW == 26 (0x7fa602492800) [pid = 15232] [serial = 3] [outer = (nil)] [url = about:blank]
[task 2018-04-28T15:11:00.855Z] 15:11:00     INFO - PID 15175 | --DOMWINDOW == 25 (0x7fa609723000) [pid = 15232] [serial = 2] [outer = (nil)] [url = about:blank]
[task 2018-04-28T15:11:00.858Z] 15:11:00     INFO - PID 15175 | --DOCSHELL 0x7fa601552800 == 8 [pid = 15232] [id = {62097795-cefa-4626-92b1-e2b01134ebcf}]
[task 2018-04-28T15:11:00.862Z] 15:11:00     INFO - PID 15175 | --DOCSHELL 0x7fa5ff39c800 == 7 [pid = 15232] [id = {907aed2f-82ba-484c-a1a7-209937e276ec}]
[task 2018-04-28T15:11:00.864Z] 15:11:00     INFO - PID 15175 | --DOMWINDOW == 24 (0x7fa5ff365000) [pid = 15232] [serial = 12] [outer = (nil)] [url = http://web-platform.test:8000/referrer-policy/generic/subresource/document.py]

[taskcluster:error] Task timeout after 7200 seconds. Force killing container.
[task 2018-04-28T15:11:00.874Z] 15:11:00     INFO - PID 15175 | --DOMWINDOW == 23 (0x7fa609720c00) [pid = 15232] [serial = 7] [outer = (nil)] [url = about:blank]
[task 2018-04-28T15:11:00.875Z] 15:11:00     INFO - PID 15175 | --DOMWINDOW == 22 (0x7fa609721000) [pid = 15232] [serial = 6] [outer = (nil)] [url = about:blank]
[task 2018-04-28T15:11:00.876Z] 15:11:00     INFO - PID 15175 | --DOMWINDOW == 21 (0x7fa6098bf000) [pid = 15232] [serial = 9] [outer = (nil)] [url = about:blank]
[task 2018-04-28T15:11:00.878Z] 15:11:00     INFO - PID 15175 | --DOCSHELL 0x7fa601555000 == 6 [pid = 15232] [id = {f6b879e5-be92-4156-adc9-5489f3b17629}]
[taskcluster 2018-04-28 15:11:01.679Z] === Task Finished ===
[taskcluster 2018-04-28 15:11:01.685Z] Unsuccessful task run with exit code: -1 completed in 7740.784 seconds
Flags: needinfo?(gbrown)
Update:
There have been 62 failures in the last 7 days, 230 total failures in the last 21 days.

Depends on:	1243080, 1246165, 1321605, 1404190, 1431125, 1448049, 1451432
Blocks:	1306635

Summary: Intermittent [taskcluster:error] Task timeout after 3600 seconds. Force killing container. / [taskcluster:error] Task timeout after 5400 seconds. Force killing container. / [taskcluster:error] Task timeout after 7200 seconds. Force killing container.

Failure per platform and build type:

- Android 4.2 x86 / opt: 2
- android-4-0-armv7-api16 / debug & opt :3
- android-4-3-armv7-api16 / debug & opt: 17
- android-5-0-aarch64 / debug & op: 1
- Linux x64 / pgo, debug & asan: 8
- linux64-ccov / opt: 8
- linux64-qr / opt & debug: 9
- linux64-asan-reporter / opt: 1
- Linux / debug: 13

Recent log with the failure: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=177354933&lineNumber=1772
Depends on: 1346126
No longer depends on: 1246165
Flags: needinfo?(gbrown)
Depends on: 1461393
Depends on: 1457694
I expect these failures to continue and this bug to remain open indefinitely. Task timeouts are the last resort whenever a task is hung or otherwise runs for too long. My immediate goal is to address timeouts which recur in the same task frequently and identify other related issues in dependent bugs.

I expect infrequent failures on Android especially due to bug 1321605.

There are some infrequent failures on Android that can be addressed with more chunks - bug 1461393.

Recent failures on Android and Linux are frequent and irregular: Not associated with particular tests or test tasks. I'm hoping those might be addressed in bug 1457694.

(Recent network failures and delays during 'pip install' appear to have been resolved.)
Keywords: leave-open
Whiteboard: [stockwell infra] → [stockwell infra] [see summary at comment 92]
Over the last 7 days there are 124 failures present on this bug. These happen on android-4-0-armv7-api16, android-4-2-x86, android-4-3-armv7-api16, linux32, linux64, linux64-ccov, linux64-nightly, linux64-noopt, linux64-qr, windows7-32.
Here is the most recent log example: https://treeherder.mozilla.org/logviewer.html#?job_id=180438377&repo=mozilla-inbound&lineNumber=1806

[task 2018-05-27T02:38:21.245Z] 02:38:21     INFO -  TEST-OK | testSessionOOMRestore | took 335841ms
[task 2018-05-27T02:38:21.246Z] 02:38:21     INFO -  TEST-START | Shutdown
[task 2018-05-27T02:38:21.247Z] 02:38:21     INFO -  Passed: 97
[task 2018-05-27T02:38:21.249Z] 02:38:21     INFO -  Failed: 0
[task 2018-05-27T02:38:21.250Z] 02:38:21     INFO -  Todo: 0
[task 2018-05-27T02:38:21.250Z] 02:38:21     INFO -  SimpleTest FINISHED
[task 2018-05-27T02:38:42.819Z] 02:38:42     INFO -  INFO | automation.py | Application ran for: 0:06:10.045962
[task 2018-05-27T02:38:42.821Z] 02:38:42     INFO -  INFO | zombiecheck | Reading PID log: /tmp/tmp_o252qpidlog
[task 2018-05-27T02:38:43.458Z] 02:38:43     INFO -  /data/tombstones does not exist; tombstone check skipped
[task 2018-05-27T02:38:50.701Z] 02:38:50     INFO -  INFO | automation.py | Application pid: 5599
[task 2018-05-27T02:39:03.345Z] 02:39:03     INFO -  SimpleTest START
[task 2018-05-27T02:39:03.347Z] 02:39:03     INFO -  TEST-START | testSessionPrivateBrowsing

[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
[taskcluster 2018-05-27 02:39:11.369Z] === Task Finished ===
[taskcluster 2018-05-27 02:39:11.372Z] Unsuccessful task run with exit code: -1 completed in 3601.941 seconds
Flags: needinfo?(gbrown)
linux32-debug/jittest-4 and android-debug/xpcshell-11 are frequent and might be actionable. Otherwise, see comment 92.
Flags: needinfo?(gbrown)
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5989424834d2
Increase Android/debug xpcshell max-run-time; r=me,a=test-only
Keywords: leave-open
As an update, there are 76 failures in the last 7 days as it follows:
- 36 on linux64 (asan-reporter, ccov and qr)
- 29 on android. Mostly on 4-3-armv7-api16 but there are also 3 on 4-0-armv7-api16 and 2 on 4-2-x86
- 11 on linux32
Depends on: 1469288
Depends on: 1469720
In the last 7 days, there are 111 failures on this bug.
They occur mostly on linux 32, 64, 64-ccov, 64-qr, all builds type.
Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=183726361&repo=mozilla-inbound&lineNumber=24638
Depends on: 1470177
In the last 7 days there are 127 failures:

- 51 failures on linux64 / asan, debug, pgo & lto
- 47 failures on android-em-4-3-armv7-api16 / opt & debug
- 11 failures on linux 32 / debug, pgo & opt
- 7 failures on linux64-ccov / debug
- 6 failures on osx-cross / opt & debug
- 2 failures on android-4-0-armv7-api16 / debug
- 2 failures on linux64-qr / debug
- 1 failure on linux64-noopt / debug 

As per comment 92, these will keep occuring. 

:gbrown is there no way to decrease their rate on the most common platforms they occur: linux64 and android?
Flags: needinfo?(gbrown)
(In reply to Andreea Pavel [:apavel] from comment #134)
> In the last 7 days there are 127 failures:
> 
> - 51 failures on linux64 / asan, debug, pgo & lto
> - 47 failures on android-em-4-3-armv7-api16 / opt & debug
> - 11 failures on linux 32 / debug, pgo & opt
> - 7 failures on linux64-ccov / debug
> - 6 failures on osx-cross / opt & debug
> - 2 failures on android-4-0-armv7-api16 / debug
> - 2 failures on linux64-qr / debug
> - 1 failure on linux64-noopt / debug 
> 
> As per comment 92, these will keep occuring. 
> 
> :gbrown is there no way to decrease their rate on the most common platforms
> they occur: linux64 and android?

These are certainly too frequent, but I don't know what to do about it. There are multiple causes, even on one platform. Bug 1470177 eliminated a common cause of android timeouts, but the weekly stats hardly changed.

Bug 1457694 was an attempt to address a common scenario I (may) see on linux and android, but that hasn't made any progress and I'm not sure how to make progress on it. :(

I will keep monitoring failures and acting on those cases that I can.
Flags: needinfo?(gbrown)
:gbrown thanks for looking into this!
Depends on: 1115253
(In reply to Geoff Brown [:gbrown] from comment #137)
> Bug 1457694 was an attempt to address a common scenario I (may) see on linux
> and android, but that hasn't made any progress and I'm not sure how to make
> progress on it. :(

Bug 1457694 has been confirmed as a real issue and is progressing, I think!
Update:
 There has been a total of 119 failures in the last 7 days( and 339 failures in the last 21 days).
- 2 failures on android-4-0-armv7-api16 debug
- 1 failure on android-4-2-x86 opt
- 1 failure on android-em-4-2-x86 opt
- 19 failures on android-em-4-3-armv7-api16 debug/opt
- 17 failures on linux32 debug/opt
- 51 failures on linux64 asan/debug/lto/opt/pgo
- 10 failures on linux64-ccov debug
- 10 failures on linux64-qr debug/opt
- 1 failure on linux64-noopt debug
- 1 failure on osx-10-10 opt
- 3 failures on osx-cross opt
- 1 failure on windows10-64 asan
- 2 failures on windows7-32 debug
 There are different type of tests running, but all of them have this error in common:
[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
[taskcluster 2018-07-18 12:00:30.241Z] Unsuccessful task run with exit code: -1 completed in 3924.674 seconds
(In reply to OrangeFactor Robot from comment #163)
> 135 failures in 138 pushes (0.978 failures/push) were associated with this
> bug yesterday.    
> 
> Platform breakdown:
> * android-em-4-3-armv7-api16: 105

robocop is perma-failing - https://bugzilla.mozilla.org/show_bug.cgi?id=1451513#c19.
(In reply to Geoff Brown [:gbrown] from comment #164)
> robocop is perma-failing -
> https://bugzilla.mozilla.org/show_bug.cgi?id=1451513#c19.

That was corrected with a backout and subsequent re-landing - great!
(In reply to Geoff Brown [:gbrown] from comment #150)
> Bug 1457694 has been confirmed as a real issue and is progressing, I think!

That bug hasn't been marked as resolved, but some changes have landed in related bugs and I am seeing far fewer of those issues. 

Failure rates in this bug have recently been significantly lower than we've seen in months.
Depends on: 1480348
Thanks :aryx! Yes, it looks like bug 1480494 is the biggest contributor to current failures here.
Update:
There have been 98 failures in the last week.
Failures per platform and build type:

linux64 / debug, asan, opt, lto: 39
osx-10-10 / opt & debug: 12
android-em-4-3-armv7-api16 / debug & opt: 11
linux64-ccov / debug: 11
linux32 / debug & opt: 7
osx-cross / opt & asan: 7
windows2012-32 / debug & pgo: 2
windows7-32 / opt & pgo: 3
android-em-4-2-x86 / debug: 2
linux64-noopt / debug: 1
linux64-qr / debug: 1
android-4-0-armv7-api16 / opt: 1
android-5-0-aarch64 / opt: 1

Summary: Intermittent [taskcluster:error] Task timeout after 3600 seconds. Force killing container. / [taskcluster:error] Task timeout after 5400 seconds. Force killing container. / [taskcluster:error] Task timeout after 7200 seconds. Force killing container.
Depends on: 1487877
Depends on: 1487886
There have been 118 failures in the last week.

The most affected platform / build: linux64-ccov / debug.

Recent relevat log file: https://treeherder.mozilla.org/logviewer.html#?job_id=197122988&repo=mozilla-central&lineNumber=180429

Summary: Intermittent [taskcluster:error] Task timeout after 3600 seconds. Force killing container. / [taskcluster:error] Task timeout after 5400 seconds. Force killing container. / [taskcluster:error] Task timeout after 7200 seconds. Force killing container.

Blocks:	1306635
Depends on: 1115253, 1243080, 1321605, 1404190, 1431125, 1457694, 1480348, 1487877, 1487886
Depends on: 1489351
Depends on: 1489645
linux64/ccov devtool and android/debug mochitest failure are being addressed with more test chunks.

linux64 mochitest-media failures are reportedly 1480348.

There are also some build task timeouts, which seem quite random -- I don't know what to do about those.
Update:
There have been 37 failures in the last week.

Failures per platform and build type:
- linux64 / debug, opt, asan, pgo, lto: 16
- linux32 / debug & opt: 10
- android-em-4-3-armv7-api16 / debug & opt: 5
- linux64-qr / debug: 2
- linux64-ccov / debug: 1
- osx-10-10-dmd / opt: 1
- osx-cross / opt: 1
- osx-cross-noopt / debug: 1

Recent relevant log file:
https://treeherder.mozilla.org/logviewer.html#?job_id=199174505&repo=mozilla-inbound&lineNumber=40467

Summary: Intermittent [taskcluster:error] Task timeout after 3600 seconds. Force killing container. / [taskcluster:error] Task timeout after 5400 seconds. Force killing container. / [taskcluster:error] Task timeout after 7200 seconds. Force killing container.
(In reply to Geoff Brown [:gbrown] from comment #199)
> linux64/ccov devtool and android/debug mochitest failure are being addressed
> with more test chunks.
> 
> linux64 mochitest-media failures are reportedly 1480348.
> 
> There are also some build task timeouts, which seem quite random -- I don't
> know what to do about those.

Many of those (apparent) build task timeouts are in rusttests; I haven't investigated, but note bug 1489277 might be tangentially related.
Depends on: 1485638
There was a recent spike in linux64-ccov/asan build timeouts after 60 minutes; these are "build-linux64-asan-fuzzing-ccov/opt (bocf)" builds, on mozilla-central only. The most recent builds have run close to 50 minutes, so I won't take action until/unless this issue returns.
No longer depends on: 1115253
Some failures last week happened during robustcheckout - fallout from bug 1490703 possibly - should be resolved now.
Depends on: 1500177
Depends on: 1505625
Depends on: 1505174
Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=212394686&repo=autoland&lineNumber=45090

[task 2018-11-17T15:05:43.073Z] 15:05:43     INFO - TEST-START | /referrer-policy/same-origin/attr-referrer/cross-origin/http-https/script-tag/cross-origin.keep-origin-redirect.http.html
[task 2018-11-17T15:05:43.296Z] 15:05:43     INFO - PID 22748 | ++DOCSHELL 0x7fdd2decb800 == 2 [pid = 22854] [id = {f599e7eb-3bd6-4194-91f2-ee9bbeb55f76}]
[task 2018-11-17T15:05:43.298Z] 15:05:43     INFO - PID 22748 | ++DOMWINDOW == 5 (0x7fdd2ca0e800) [pid = 22854] [serial = 5] [outer = (nil)]
[task 2018-11-17T15:05:43.397Z] 15:05:43     INFO - PID 22748 | ++DOMWINDOW == 6 (0x7fdd2dd62400) [pid = 22854] [serial = 6] [outer = 0x7fdd2ca0e800]
[task 2018-11-17T15:05:43.679Z] 15:05:43     INFO - PID 22748 | ++DOMWINDOW == 7 (0x7fdd2dd63400) [pid = 22854] [serial = 7] [outer = 0x7fdd2ca0e800]
[task 2018-11-17T15:05:44.143Z] 15:05:44     INFO - PID 22748 | Couldn't convert chrome URL: chrome://branding/locale/brand.properties

[taskcluster:error] Task timeout after 7200 seconds. Force killing container.
[taskcluster 2018-11-17 15:05:45.094Z] === Task Finished ===
[taskcluster 2018-11-17 15:05:45.096Z] Unsuccessful task run with exit code: -1 completed in 7692.491 seconds

Details in Comment 92.
Tests are in good shape -- almost no test failures here recently.

However, there are a bunch of build failures here. I don't see clear correlations to specific build types, but...needs more investigation.
Recent Android 7.0 failures are on machine-16 -- a ticket has been opened on packet.net to have this machine specific performance problem addressed.
Depends on: 1511084
This bug failed 45 times in the last 7 days. These occur on linux32, linux64 on opt, asan and pgo. For the following platforms there's only 1 failures for each: linux64-ccov, linux64-qr, android-em-4-3-armv7-api16.

Recent log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=217017974&repo=mozilla-inbound&lineNumber=87534

15:52:26     INFO - ============================= test session starts ==============================
[task 2018-12-14T15:52:26.563Z] 15:52:26     INFO - platform linux2 -- Python 2.7.9, pytest-3.6.2, py-1.5.4, pluggy-0.6.0 -- /builds/worker/workspace/build/src/obj-firefox/_virtualenvs/obj-firefox-8yIyzR8r-2.7/bin/python
[task 2018-12-14T15:52:26.563Z] 15:52:26     INFO - rootdir: /builds/worker/workspace/build/src, inifile: /builds/worker/workspace/build/src/config/mozunit/mozunit/pytest.ini
[task 2018-12-14T15:52:26.563Z] 15:52:26     INFO - collecting ... collected 1 item
[task 2018-12-14T15:52:26.563Z] 15:52:26     INFO - ../config/tests/unit-mozunit.py::TestMozUnit::test_mocked_open PASSED
[task 2018-12-14T15:52:26.564Z] 15:52:26     INFO - =========================== 1 passed in 0.01 seconds ===========================
[task 2018-12-14T15:52:26.564Z] 15:52:26     INFO - /builds/worker/workspace/build/src/config/tests/test_mozbuild_reading.py
[task 2018-12-14T15:52:26.564Z] 15:52:26     INFO - ============================= test session starts ==============================
[task 2018-12-14T15:52:26.564Z] 15:52:26     INFO - platform linux2 -- Python 2.7.9, pytest-3.6.2, py-1.5.4, pluggy-0.6.0 -- /builds/worker/workspace/build/src/obj-firefox/_virtualenvs/obj-firefox-8yIyzR8r-2.7/bin/python
[task 2018-12-14T15:52:26.565Z] 15:52:26     INFO - rootdir: /builds/worker/workspace/build/src, inifile: /builds/worker/workspace/build/src/config/mozunit/mozunit/pytest.ini
[task 2018-12-14T15:52:26.565Z] 15:52:26     INFO - collecting ... collected 3 items
[task 2018-12-14T15:52:26.565Z] 15:52:26     INFO - ../config/tests/test_mozbuild_reading.py::TestMozbuildReading::test_filesystem_traversal_no_config PASSED
[task 2018-12-14T15:52:26.565Z] 15:52:26     INFO - ../config/tests/test_mozbuild_reading.py::TestMozbuildReading::test_filesystem_traversal_reading <- ../../../../../usr/lib/python2.7/unittest/case.py SKIPPED
[task 2018-12-14T15:52:26.566Z] 15:52:26     INFO - ../config/tests/test_mozbuild_reading.py::TestMozbuildReading::test_orphan_file_patterns PASSED
[task 2018-12-14T15:52:26.566Z] 15:52:26     INFO - ===================== 2 passed, 1 skipped in 79.39 seconds =====================

[taskcluster:error] Task timeout after 7200 seconds. Force killing container.
[taskcluster 2018-12-14 15:53:03.117Z] === Task Finished ===
[taskcluster 2018-12-14 15:53:03.118Z] Unsuccessful task run with exit code: -1 completed in 7252.08 seconds
Depends on: 1514388
There are 29 total failures in the last 7 days and 162 total failures in the last 30 days (some of them might be misclassifications), majority on linux. 
Recent failure log:  https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=219384257&repo=autoland&lineNumber=194352
 
[task 2018-12-31T13:12:56.848Z] 13:12:56     INFO - TEST-START | dom/base/test/test_bug715041.xul

[task 2018-12-31T13:13:08.337Z] 13:13:08     INFO - GECKO(5728) | JS CALLED OBSERVE FUNCTION!!!
[task 2018-12-31T13:13:08.338Z] 13:13:08     INFO - GECKO(5728) | msg 1 Count: 1
[task 2018-12-31T13:13:09.378Z] 13:13:09     INFO - GECKO(5728) | --DOMWINDOW == 28 (0xdb22d000) [pid = 5728] [serial = 132] [outer = (nil)] [url = about:blank]
[task 2018-12-31T13:13:09.382Z] 13:13:09     INFO - GECKO(5728) | --DOMWINDOW == 27 (0xdb238800) [pid = 5728] [serial = 129] [outer = (nil)] [url = chrome://mochikit/content/tests/SimpleTest/iframe-between-tests.html]
[task 2018-12-31T13:13:09.385Z] 13:13:09     INFO - GECKO(5728) | --DOMWINDOW == 26 (0xdb22b800) [pid = 5728] [serial = 124] [outer = (nil)] [url = chrome://mochitests/content/chrome/dom/base/test/test_blocking_image.html]
[task 2018-12-31T13:13:09.387Z] 13:13:09     INFO - GECKO(5728) | --DOMWINDOW == 25 (0xdb231000) [pid = 5728] [serial = 127] [outer = (nil)] [url = chrome://mochikit/content/tests/SimpleTest/iframe-between-tests.html]
[task 2018-12-31T13:13:09.389Z] 13:13:09     INFO - GECKO(5728) | --DOMWINDOW == 24 (0xdb223000) [pid = 5728] [serial = 123] [outer = (nil)] [url = chrome://mochikit/content/tests/SimpleTest/iframe-between-tests.html]
[task 2018-12-31T13:13:09.391Z] 13:13:09     INFO - GECKO(5728) | --DOMWINDOW == 23 (0xdc944c00) [pid = 5728] [serial = 110] [outer = (nil)] [url = chrome://mochitests/content/chrome/dom/base/test/test_blockParsing.html]
[task 2018-12-31T13:13:09.396Z] 13:13:09     INFO - GECKO(5728) | --DOMWINDOW == 22 (0xdb237c00) [pid = 5728] [serial = 128] [outer = (nil)] [url = chrome://mochitests/content/chrome/dom/base/test/test_bug1008126.html]
[task 2018-12-31T13:13:09.398Z] 13:13:09     INFO - GECKO(5728) | --DOMWINDOW == 21 (0xdb225000) [pid = 5728] [serial = 130] [outer = (nil)] [url = chrome://mochitests/content/chrome/dom/base/test/test_bug1016960.html]
[task 2018-12-31T13:13:09.400Z] 13:13:09     INFO - GECKO(5728) | --DOMWINDOW == 20 (0xdb2b9000) [pid = 5728] [serial = 133] [outer = (nil)] [url = chrome://mochikit/content/tests/SimpleTest/iframe-between-tests.html]
[task 2018-12-31T13:13:10.312Z] 13:13:10     INFO - GECKO(5728) | msg 3 Count: 1
[task 2018-12-31T13:13:13.318Z] 13:13:13     INFO - GECKO(5728) | msg 4 Count: 1
[task 2018-12-31T13:13:13.319Z] 13:13:13     INFO - GECKO(5728) | msg 2 Count: 1
[task 2018-12-31T13:13:13.322Z] 13:13:13     INFO - GECKO(5728) | TESTING CASE AddShiftLocalCleanUp
[task 2018-12-31T13:13:13.325Z] 13:13:13     INFO - GECKO(5728) | ==============
[task 2018-12-31T13:13:13.327Z] 13:13:13     INFO - GECKO(5728) | JS REMOVE IDLE OBSERVER () time to be removed: 1
[task 2018-12-31T13:13:13.332Z] 13:13:13     INFO - GECKO(5728) | JS removeIdleObserver() observer time: 0
[task 2018-12-31T13:13:13.334Z] 13:13:13     INFO - GECKO(5728) | JS removeIdleObserver() observer time: 1
[task 2018-12-31T13:13:13.339Z] 13:13:13     INFO - GECKO(5728) | MOCK IDLE SERVICE REMOVING idle observer with time 1
[task 2018-12-31T13:13:13.341Z] 13:13:13     INFO - GECKO(5728) | MOCK IDLE SERVICE numIdleObserversRemoved: 1 numIdleObserversAdded: 0
[task 2018-12-31T13:13:13.342Z] 13:13:13     INFO - GECKO(5728) | JS FAKE IDLE SERVICE end of remove idle observer
[task 2018-12-31T13:13:13.350Z] 13:13:13     INFO - GECKO(5728) | JS NUM OBSERVERS: 1
[task 2018-12-31T13:13:13.351Z] 13:13:13     INFO - GECKO(5728) | AddShiftLocalCleanUp() done clean up
[task 2018-12-31T13:13:13.353Z] 13:13:13     INFO - GECKO(5728) | TESTING CASE AddNewLocalWhileAllIdle
[task 2018-12-31T13:13:13.354Z] 13:13:13     INFO - GECKO(5728) | ==============
[task 2018-12-31T13:13:13.356Z] 13:13:13     INFO - GECKO(5728) | JS FAKE IDLE SERVICE add idle observer before
[task 2018-12-31T13:13:13.358Z] 13:13:13     INFO - GECKO(5728) | JS NUM OBSERVERS: 1
[task 2018-12-31T13:13:13.360Z] 13:13:13     INFO - GECKO(5728) | window is: [object Window]
[task 2018-12-31T13:13:13.361Z] 13:13:13     INFO - GECKO(5728) | MOCK IDLE SERVICE ADDING idle observer with time: 1
[task 2018-12-31T13:13:13.363Z] 13:13:13     INFO - GECKO(5728) | MOCK IDLE SERVICE: num idle observers added: 1
[task 2018-12-31T13:13:13.364Z] 13:13:13     INFO - GECKO(5728) | JS FAKE IDLE SERVICE end of add idle observer
[task 2018-12-31T13:13:13.366Z] 13:13:13     INFO - GECKO(5728) | JS NUM OBSERVERS: 2
[task 2018-12-31T13:13:13.368Z] 13:13:13     INFO - GECKO(5728) | JS FAKE IDLE SERVICE
[task 2018-12-31T13:13:13.369Z] 13:13:13     INFO - GECKO(5728) | JS NUM OBSERVERS: 2
[task 2018-12-31T13:13:13.373Z] 13:13:13     INFO - GECKO(5728) | JS CALLED OBSERVE FUNCTION!!!
[task 2018-12-31T13:13:13.375Z] 13:13:13     INFO - GECKO(5728) | msg 1 Count: 1
[task 2018-12-31T13:13:14.322Z] 13:13:14     INFO - GECKO(5728) | msg 2 Count: 1
[task 2018-12-31T13:13:15.321Z] 13:13:15     INFO - GECKO(5728) | msg 2 Count: 2
[task 2018-12-31T13:13:16.821Z] 13:13:16     INFO - GECKO(5728) | msg 5 Count: 1
[task 2018-12-31T13:13:16.825Z] 13:13:16     INFO - GECKO(5728) | function performNextTest()
[task 2018-12-31T13:13:16.828Z] 13:13:16     INFO - GECKO(5728) | currTestCaseNum: 4
[task 2018-12-31T13:13:16.831Z] 13:13:16     INFO - GECKO(5728) | cleanUp: false
[task 2018-12-31T13:13:16.834Z] 13:13:16     INFO - GECKO(5728) | passed: true
[task 2018-12-31T13:13:16.838Z] 13:13:16     INFO - GECKO(5728) | numIdleObserversRemoved: 0
[task 2018-12-31T13:13:16.840Z] 13:13:16     INFO - GECKO(5728) | numIdleObservesAdded: 1
[task 2018-12-31T13:13:16.844Z] 13:13:16     INFO - GECKO(5728) | TESTING CASE AddNewLocalWhileAllIdleCleanUp
[task 2018-12-31T13:13:16.846Z] 13:13:16     INFO - GECKO(5728) | ==============
[task 2018-12-31T13:13:16.850Z] 13:13:16     INFO - GECKO(5728) | JS REMOVE IDLE OBSERVER () time to be removed: 1
[task 2018-12-31T13:13:16.853Z] 13:13:16     INFO - GECKO(5728) | JS removeIdleObserver() observer time: 0
[task 2018-12-31T13:13:16.854Z] 13:13:16     INFO - GECKO(5728) | JS removeIdleObserver() observer time: 1
[task 2018-12-31T13:13:16.856Z] 13:13:16     INFO - GECKO(5728) | MOCK IDLE SERVICE REMOVING idle observer with time 1
[task 2018-12-31T13:13:16.857Z] 13:13:16     INFO - GECKO(5728) | MOCK IDLE SERVICE numIdleObserversRemoved: 1 numIdleObserversAdded: 0
[task 2018-12-31T13:13:16.858Z] 13:13:16     INFO - GECKO(5728) | JS FAKE IDLE SERVICE end of remove idle observer
[task 2018-12-31T13:13:16.859Z] 13:13:16     INFO - GECKO(5728) | JS NUM OBSERVERS: 1
[task 2018-12-31T13:13:16.861Z] 13:13:16     INFO - GECKO(5728) | TESTING CASE ShiftLocalTimerBack()
[task 2018-12-31T13:13:16.863Z] 13:13:16     INFO - GECKO(5728) | ==============
[task 2018-12-31T13:13:16.866Z] 13:13:16     INFO - GECKO(5728) | JS FAKE IDLE SERVICE add idle observer before
[task 2018-12-31T13:13:16.867Z] 13:13:16     INFO - GECKO(5728) | JS NUM OBSERVERS: 1
[task 2018-12-31T13:13:16.869Z] 13:13:16     INFO - GECKO(5728) | window is: [object Window]
[task 2018-12-31T13:13:16.871Z] 13:13:16     INFO - GECKO(5728) | MOCK IDLE SERVICE ADDING idle observer with time: 1
[task 2018-12-31T13:13:16.873Z] 13:13:16     INFO - GECKO(5728) | MOCK IDLE SERVICE: num idle observers added: 1
[task 2018-12-31T13:13:16.875Z] 13:13:16     INFO - GECKO(5728) | JS FAKE IDLE SERVICE end of add idle observer
[task 2018-12-31T13:13:16.877Z] 13:13:16     INFO - GECKO(5728) | JS NUM OBSERVERS: 2
[task 2018-12-31T13:13:16.879Z] 13:13:16     INFO - GECKO(5728) | JS FAKE IDLE SERVICE
[task 2018-12-31T13:13:16.881Z] 13:13:16     INFO - GECKO(5728) | JS NUM OBSERVERS: 2
[task 2018-12-31T13:13:16.883Z] 13:13:16     INFO - GECKO(5728) | JS CALLED OBSERVE FUNCTION!!!
[task 2018-12-31T13:13:16.885Z] 13:13:16     INFO - GECKO(5728) | msg 2 Count: 1
[task 2018-12-31T13:13:18.751Z] 13:13:18     INFO - GECKO(5728) | msg 4 Count: 1

[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
[taskcluster 2018-12-31 13:13:22.095Z] === Task Finished ===
[taskcluster 2018-12-31 13:13:22.096Z] Unsuccessful task run with exit code: -1 completed in 3603.519 seconds
(In reply to Intermittent Failures Robot from comment #232)
> 29 failures in 1008 pushes (0.029 failures/push) were associated with this
> bug in the last 7 days.
> 
> Platform breakdown:
> * linux64: 11
> * linux32: 8
> * linux64-qr: 10

I reviewed these linux timeouts but couldn't find a common cause. Existing max-run-times seem appropriate.

aws slow-down? Seems better recently -- monitoring.

Most recent failures are on linux/debug, but vary by test suite. Average task durations are << max-run-time. I wonder if these are related to logspam, like bug 1437991, bug 1515833, and bug 1515827.

(In reply to Geoff Brown [:gbrown] from comment #236)

Most recent failures are on linux/debug, but vary by test suite. Average task durations are << max-run-time. I wonder if these are related to logspam, like bug 1437991, bug 1515833, and bug 1515827.

Failure frequency did seem to decrease with resolution of bug 1437991, but linux/debug continues to dominate ongoing failures.

(In reply to Geoff Brown [:gbrown] from comment #236)

Most recent failures are on linux/debug, but vary by test suite. Average task durations are << max-run-time. I wonder if these are related to logspam, like bug 1437991, bug 1515833, and bug 1515827.

Those logspam bugs are all resolved now - great! - and did seem to help the random linux/debug failures here.

Recently we are seeing more linux/debug jsreftest failures specifically -- investigating...

(In reply to Geoff Brown [:gbrown] from comment #241)

Recently we are seeing more linux/debug jsreftest failures specifically -- investigating...

And linux/debug reftest also.

linux64/debug jsreftest might benefit from more chunks.

Other platforms are more puzzling: linux64/debug reftests run in about 30 minutes generally...then suddenly a chunk runs in 60+ minutes. Affected chunks appear "random"...could that be because tests are being shuffled?

Some failures on Feb 4/5 appear to be related to certificate expiry (temporary infrastructure condition caused mass failures, hangs).

Depends on: 1528405

Over the last 7 days this bug has 30 failures. These happen on windows7-32, osx-cross-noopt, osx-cross, linux64-qr, linux64, linux32

Here is the latest log failure: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=228854900&repo=mozilla-inbound&lineNumber=30960

[task 2019-02-16T23:10:38.226Z] 23:10:38 INFO - REFTEST TEST-START | file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-004.html == file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-001-ref.html
[task 2019-02-16T23:10:38.231Z] 23:10:38 INFO - REFTEST TEST-LOAD | file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-004.html | 1167 / 1195 (97%)
[task 2019-02-16T23:10:38.256Z] 23:10:38 INFO - ++DOMWINDOW == 334 (0x7f0529d71c00) [pid = 8577] [serial = 2751] [outer = 0x7f052d531800]
[task 2019-02-16T23:10:38.337Z] 23:10:38 INFO - [Child 8577, Main Thread] WARNING: HTMLEditRules::BeforeEdit() failed to handle something: 'NS_SUCCEEDED(rv)', file /builds/worker/workspace/build/src/editor/libeditor/HTMLEditor.cpp, line 3576
[task 2019-02-16T23:10:38.344Z] 23:10:38 INFO - [Child 8577, Main Thread] WARNING: '!aSelection.RangeCount()', file /builds/worker/workspace/build/src/editor/libeditor/EditorBase.cpp, line 3569
[task 2019-02-16T23:10:38.351Z] 23:10:38 INFO - [Child 8577, Main Thread] WARNING: '!selectionStartPoint.IsSet()', file /builds/worker/workspace/build/src/editor/libeditor/HTMLEditRules.cpp, line 9806
[task 2019-02-16T23:10:38.355Z] 23:10:38 INFO - [Child 8577, Main Thread] WARNING: Failed to normalize Selection: 'NS_SUCCEEDED(rv)', file /builds/worker/workspace/build/src/editor/libeditor/HTMLEditRules.cpp, line 453
[task 2019-02-16T23:10:41.103Z] 23:10:41 INFO - REFTEST TEST-PASS | file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-004.html == file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-001-ref.html | image comparison, max difference: 0, number of differing pixels: 0
[task 2019-02-16T23:10:41.106Z] 23:10:41 INFO - REFTEST TEST-END | file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-004.html == file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-001-ref.html
[task 2019-02-16T23:10:41.136Z] 23:10:41 INFO - ++DOMWINDOW == 335 (0x7f0528b65000) [pid = 8577] [serial = 2752] [outer = 0x7f052d531800]
[task 2019-02-16T23:10:41.196Z] 23:10:41 INFO - REFTEST TEST-START | file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-005.html == file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-001-ref.html
[task 2019-02-16T23:10:41.203Z] 23:10:41 INFO - REFTEST TEST-LOAD | file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-005.html | 1168 / 1195 (97%)
[task 2019-02-16T23:10:41.243Z] 23:10:41 INFO - ++DOMWINDOW == 336 (0x7f0528d1ac00) [pid = 8577] [serial = 2753] [outer = 0x7f052d531800]
[task 2019-02-16T23:10:41.628Z] 23:10:41 INFO - REFTEST TEST-PASS | file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-005.html == file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-001-ref.html | image comparison, max difference: 0, number of differing pixels: 0
[task 2019-02-16T23:10:41.634Z] 23:10:41 INFO - REFTEST TEST-END | file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-005.html == file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-001-ref.html
[task 2019-02-16T23:10:41.659Z] 23:10:41 INFO - ++DOMWINDOW == 337 (0x7f0529084c00) [pid = 8577] [serial = 2754] [outer = 0x7f052d531800]
[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
[task 2019-02-16T23:10:41.723Z] 23:10:41 INFO - REFTEST TEST-START | file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-006.html == file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-006-ref.html
[task 2019-02-16T23:10:41.746Z] 23:10:41 INFO - REFTEST TEST-LOAD | file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-006.html | 1169 / 1195 (97%)
[taskcluster 2019-02-16 23:10:42.612Z] === Task Finished ===
[taskcluster 2019-02-16 23:10:42.613Z] Unsuccessful task run with exit code: -1 completed in 4378.344 seconds

Flags: needinfo?(gbrown)
Flags: needinfo?(gbrown)
Depends on: 1529781

Over the last 7 days there are 39 failures on this bug. These happen on osx-cross, linux64-qr, linux64, linux32, android-em-4-3-armv7-api16, android-5-0-x86_64.

Here is the latest failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=231041560&repo=autoland&lineNumber=35054

Flags: needinfo?(gbrown)
Flags: needinfo?(gbrown)
Depends on: 1533565

Thanks Arthur. I noticed and hope my patch in bug 1533565 will resolve this.

Flags: needinfo?(gbrown)

There were 4 timeouts on android builds here:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&searchStr=nadroid%2Cnightly&revision=c89f024c023fa816d700b151e8e0cbb9a1907cb8

(On retry, they were all fine.)

It looks like hg robustcheckout was taking a very long time.

[task 2019-03-07T10:38:21.362Z] 10:38:21 INFO - Running main action method: multi_l10n
...
[task 2019-03-07T12:05:06.758Z] 12:05:06 INFO - 12:05:06 INFO - Running command: ['hg', '--config', 'ui.merge=internal:merge', '--config', 'extensions.robustcheckout=/builds/worker/workspace/build/src/testing/mozharness/external_tools/robustcheckout.py', 'robustcheckout', u'https://hg.mozilla.org/l10n-central/mr', u'mr', '--sharebase', u'/builds/hg-shared', '--branch', u'default']
[task 2019-03-07T12:05:06.758Z] 12:05:06 INFO - 12:05:06 INFO - Copy/paste: hg --config ui.merge=internal:merge --config extensions.robustcheckout=/builds/worker/workspace/build/src/testing/mozharness/external_tools/robustcheckout.py robustcheckout https://hg.mozilla.org/l10n-central/mr mr --sharebase /builds/hg-shared --branch default
[task 2019-03-07T12:05:06.809Z] 12:05:06 INFO - 12:05:06 INFO - (using Mercurial 4.8.1)
[task 2019-03-07T12:05:06.810Z] 12:05:06 INFO - 12:05:06 INFO - ensuring https://hg.mozilla.org/l10n-central/mr@default is available at mr
[task 2019-03-07T12:05:07.025Z] 12:05:07 INFO - 12:05:07 INFO - (sharing from new pooled repository cf74f930fde306218d72c790d69ad9ba7c8b8b51)
[task 2019-03-07T12:05:07.216Z] 12:05:07 INFO - 12:05:07 INFO - requesting all changes

[taskcluster:error] Task timeout after 7200 seconds. Force killing container.

vs a successful retry:

[task 2019-03-07T12:43:25.400Z] 12:43:25 INFO - Running main action method: multi_l10n
...
[task 2019-03-07T12:51:28.535Z] 12:51:28 INFO - 12:51:28 INFO - [mozharness: 2019-03-07 12:51:28.535244Z] Finished pull-locale-source step (success)
...
[task 2019-03-07T12:56:32.376Z] 12:56:32 INFO - [mozharness: 2019-03-07 12:56:32.376373Z] Finished multi-l10n step (success)
[task 2019-03-07T12:56:32.376Z] 12:56:32 INFO - [mozharness: 2019-03-07 12:56:32.376448Z] Skipping package-source step.
[task 2019-03-07T12:56:32.376Z] 12:56:32 INFO - Running post-run listener: _parse_build_tests_ccov
[task 2019-03-07T12:56:32.376Z] 12:56:32 INFO - Running post-run listener: _shutdown_sccache
[task 2019-03-07T12:56:32.376Z] 12:56:32 INFO - Running post-run listener: _summarize
[task 2019-03-07T12:56:32.376Z] 12:56:32 INFO - # TBPL SUCCESS #
[task 2019-03-07T12:56:32.376Z] 12:56:32 INFO - [mozharness: 2019-03-07 12:56:32.376769Z] FxDesktopBuild summary:
[task 2019-03-07T12:56:32.376Z] 12:56:32 INFO - # TBPL SUCCESS #
[taskcluster 2019-03-07 12:56:32.926Z] === Task Finished ===
[taskcluster 2019-03-07 12:58:24.583Z] Successful task run with exit code: 0 completed in 3083.283 seconds

I won't follow-up unless there are more problems like this.

Depends on: 1534822

There are still some linux64/debug devtools timeouts after bug 1534822. Chunks are very unbalanced. Filed bug 1536253 as a first step to improvement.

There are some infrequent linux64-qr/debug reftest 3600 s timeouts, but chunks normally run in 20 to 40 minutes so I am reluctant to act; watching...

Depends on: 1539520

source-test-clang-tidy is perma-fail, causing a spike here. There is some existing activity on that test, like bug 1539779 -- waiting to see if that helps. (Otherwise, it might be worth noting that there was a sharp spike in run time around here: https://treeherder.mozilla.org/#/jobs?repo=autoland&searchStr=tidy&tochange=98452610cfcc81cba0d4478797fe1e83a51172e8&fromchange=cf6cfe33476622c202fcbc34ef12f8ca1c039b8a. )

Geoff, I've filed bug 1540325 for the clang tidy timeouts.

Depends on: 1540280

Over the last 7 days there are 31 failures present on this bug. These happen on: android-em-7-0-x86_64, linux32-shippable, linux64, linux64-qr, osx-cross, osx-cross-ccov

Here is the most recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=239147679&repo=autoland&lineNumber=78131

Flags: needinfo?(gbrown)
Flags: needinfo?(gbrown)
Depends on: 1543558

Over the last 7 days there are 38 failures present on this bug. These happen on android-4-0-armv7-api16, android-5-0-aarch64, android-em-7-0-x86, android-em-7-0-x86_64, linux32-shippable, linux64, linux64-shippable, osx-cross, osx-cross-noopt, osx-shippable, windows-mingw32, windows2012-32-shippable.

Here is the most recent log example: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=240827727&repo=mozilla-central&lineNumber=5178

[task 2019-04-17T00:24:53.121Z] 00:24:53 INFO - TEST-START | /css/CSS2/selectors/first-letter-punctuation-283.xht
[task 2019-04-17T00:24:55.017Z] 00:24:55 INFO - TEST-PASS | /css/CSS2/selectors/first-letter-punctuation-283.xht | took 1905ms
[task 2019-04-17T00:24:55.019Z] 00:24:55 INFO - TEST-START | /css/CSS2/selectors/first-letter-punctuation-284.xht
[task 2019-04-17T00:24:57.803Z] 00:24:57 INFO - TEST-PASS | /css/CSS2/selectors/first-letter-punctuation-284.xht | took 2782ms
[task 2019-04-17T00:24:57.805Z] 00:24:57 INFO - TEST-START | /css/CSS2/selectors/first-letter-punctuation-285.xht
[task 2019-04-17T00:24:59.685Z] 00:24:59 INFO - TEST-PASS | /css/CSS2/selectors/first-letter-punctuation-285.xht | took 1879ms
[task 2019-04-17T00:24:59.687Z] 00:24:59 INFO - TEST-START | /css/CSS2/selectors/first-letter-punctuation-286.xht
[task 2019-04-17T00:25:01.657Z] 00:25:01 INFO - TEST-PASS | /css/CSS2/selectors/first-letter-punctuation-286.xht | took 1971ms
[task 2019-04-17T00:25:01.657Z] 00:25:01 INFO - TEST-START | /css/CSS2/selectors/first-letter-punctuation-287.xht
[task 2019-04-17T00:25:03.591Z] 00:25:03 INFO - TEST-PASS | /css/CSS2/selectors/first-letter-punctuation-287.xht | took 1935ms
[task 2019-04-17T00:25:03.593Z] 00:25:03 INFO - TEST-START | /css/CSS2/selectors/first-letter-punctuation-288.xht
[task 2019-04-17T00:25:05.387Z] 00:25:05 INFO - TEST-PASS | /css/CSS2/selectors/first-letter-punctuation-288.xht | took 1796ms
[task 2019-04-17T00:25:05.388Z] 00:25:05 INFO - TEST-START | /css/CSS2/selectors/first-letter-punctuation-289.xht
[task 2019-04-17T00:25:07.057Z] 00:25:07 INFO - TEST-PASS | /css/CSS2/selectors/first-letter-punctuation-289.xht | took 1667ms
[task 2019-04-17T00:25:07.058Z] 00:25:07 INFO - TEST-START | /css/CSS2/selectors/first-letter-punctuation-290.xht
[task 2019-04-17T00:25:08.942Z] 00:25:08 INFO - TEST-PASS | /css/CSS2/selectors/first-letter-punctuation-290.xht | took 1880ms
[task 2019-04-17T00:25:08.944Z] 00:25:08 INFO - TEST-START | /css/CSS2/selectors/first-letter-punctuation-291.xht
[task 2019-04-17T00:25:10.795Z] 00:25:10 INFO - TEST-PASS | /css/CSS2/selectors/first-letter-punctuation-291.xht | took 1857ms
[task 2019-04-17T00:25:10.796Z] 00:25:10 INFO - TEST-START | /css/CSS2/selectors/first-letter-punctuation-292.xht
[task 2019-04-17T00:25:12.671Z] 00:25:12 INFO - TEST-PASS | /css/CSS2/selectors/first-letter-punctuation-292.xht | took 1873ms
[task 2019-04-17T00:25:12.671Z] 00:25:12 INFO - TEST-START | /css/CSS2/selectors/first-letter-punctuation-293.xht
[taskcluster:error] Task timeout after 5400 seconds. Force killing container.
[taskcluster 2019-04-17 00:25:15.346Z] === Task Finished ===
[taskcluster 2019-04-17 00:25:15.347Z] Unsuccessful task run with exit code: -1 completed in 5402.945 seconds

Flags: needinfo?(gbrown)

Looking at a couple of the linux builds it seems sometimes it takes 1/2 to 1 hour to clone and update the repo. I do see '(warning: large working directory being used without fsmonitor enabled; enable fsmonitor to improve performance; see "hg help -e fsmonitor")' in the log as well.

Perhaps we a) either adjust the max run time to account for the slow hg performance or b) improve the hg performance perhaps with fsmonitor or both or some other approach. Too bad we have to clone and update > 3G of data each time as well.

Perhaps we can just try to bump the max run time to see if we prevent or reduce these errors for the problematic jobs, not just builds, then for the ones which complete when given enough time, try to improve their run times. For the ones that won't complete regardless of the run time we can reduce it again and investigate that as a distinct issue.

Flags: needinfo?(gbrown)
Depends on: 1545308

(In reply to Bob Clary [:bc:] from comment #272)

Thanks :bc.

Looking at a couple of the linux builds it seems sometimes it takes 1/2 to 1 hour to clone and update the repo. I do see '(warning: large working directory being used without fsmonitor enabled; enable fsmonitor to improve performance; see "hg help -e fsmonitor")' in the log as well.

I also see the fsmonitor warning in other logs, where the cloning completes quickly -- maybe it is normal / expected?

There are 35 total failures in the last 7 days on multiple platforms: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-04-19&endday=2019-04-26&tree=trunk&bug=1411358

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=242811869&repo=mozilla-inbound&lineNumber=17137

[task 2019-04-26T09:12:42.250Z] 09:12:42 INFO - REFTEST TEST-START | http://10.0.2.2:8854/jsreftest/tests/jsreftest.html?test=test262/language/expressions/equals/S11.9.1_A3.3.js
[task 2019-04-26T09:12:42.251Z] 09:12:42 INFO - REFTEST TEST-LOAD | http://10.0.2.2:8854/jsreftest/tests/jsreftest.html?test=test262/language/expressions/equals/S11.9.1_A3.3.js | 4550 / 6609 (68%)
[task 2019-04-26T09:12:42.255Z] 09:12:42 INFO - wait for org.mozilla.geckoview.test complete; top activity=org.mozilla.geckoview.test
[task 2019-04-26T09:12:42.466Z] 09:12:42 INFO - org.mozilla.geckoview.test unexpectedly found running. Killing...
[task 2019-04-26T09:12:42.467Z] 09:12:42 INFO - REFTEST TEST-INFO | started process screentopng
[task 2019-04-26T09:12:44.529Z] 09:12:44 INFO - REFTEST TEST-INFO | screentopng: exit 0
[task 2019-04-26T09:13:00.704Z] 09:13:00 WARNING - TEST-UNEXPECTED-FAIL | http://10.0.2.2:8854/jsreftest/tests/jsreftest.html?test=test262/language/expressions/equals/S11.9.1_A3.3.js | application ran for longer than allowed maximum time
[task 2019-04-26T09:13:00.705Z] 09:13:00 INFO - remoteautomation.py | Application ran for: 1:50:49.448078
[task 2019-04-26T09:13:02.391Z] 09:13:02 INFO - REFTEST INFO | Copy/paste: /builds/worker/workspace/build/linux64-minidump_stackwalk /tmp/tmpPTQpi_/4c2ef0dc-18bd-c459-f2b2-5b20c3643e3e.dmp /builds/worker/workspace/build/symbols
[taskcluster:error] Task timeout after 7200 seconds. Force killing container.
[taskcluster 2019-04-26 09:13:20.899Z] === Task Finished ===
[taskcluster 2019-04-26 09:13:20.899Z] Unsuccessful task run with exit code: -1 completed in 7369.418 seconds

Whiteboard: [stockwell infra] [see summary at comment 92] → [stockwell infra] [see summary at comment 92][stockwell needswork:owner]
Whiteboard: [stockwell infra] [see summary at comment 92][stockwell needswork:owner] → [stockwell infra] [see summary at comment 92]

Most failures are android-em-7-0*. I have checked that these are not hung (recent entries in log at timeout) and that the same tasks normally runs in much less time on other pushes. Although android-performance.log artifacts are not available in this bug, I strongly suspect bug 1545308 and hope to see that bug re-open soon.

I expect these failures to continue and this bug to remain open indefinitely. Task timeouts are the last resort whenever a task is hung or otherwise runs for too long. My goal is to address timeouts which recur in the same task frequently and identify other related issues in dependent bugs.

I try to review failures here at least once each week. There is no point in pasting recent failure logs or failure counts in this bug. If a spike in frequency is noted, or you have insight into a failure, or I haven't commented on an on-going issue for more than a week, need-info is appreciated.

Whiteboard: [stockwell infra] [see summary at comment 92] → [stockwell infra] [see summary at comment 278]

(In reply to Intermittent Failures Robot from comment #279)

19 failures in 933 pushes (0.02 failures/push) were associated with this bug yesterday.

Repository breakdown:

  • try: 11

Mass mis-classification.

Most recent failures are on android-em-7-0-x86. These are mostly wpt-reftests which normally complete in 15 to 30 minutes but sometimes exceed 90 minutes. Evidence is not conclusive, but I hope these will be resolved by bug 1545308.

Depends on: 1552580
Depends on: 1552334

Excellent improvement in Android 7.0 since workaround landed in bug 1552334.

Depends on: 1556767
Depends on: 1562078

There are 41 total failures in the last 7 days on android-em-7-0-x86_64 debug, linux64 asan, linux64-qr debug, 1 on osx-cross and 1 one osx-shippable.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=254527369&repo=autoland&lineNumber=7420

[task 2019-07-03T03:16:10.735Z] 03:16:10 INFO - PID 18139 | -----------------------------------------------------
[task 2019-07-03T03:16:10.964Z] 03:16:10 INFO - Browser exited with return code 0
[task 2019-07-03T03:16:10.964Z] 03:16:10 INFO - PROCESS LEAKS None
[task 2019-07-03T03:16:10.965Z] 03:16:10 INFO - Closing logging queue
[task 2019-07-03T03:16:10.966Z] 03:16:10 INFO - queue closed
[task 2019-07-03T03:16:10.974Z] 03:16:10 INFO - INFO | runtests.py | ASan using symbolizer at /builds/worker/workspace/build/application/firefox/llvm-symbolizer
[task 2019-07-03T03:16:10.989Z] 03:16:10 INFO - LSan enabled.
[task 2019-07-03T03:16:10.990Z] 03:16:10 INFO - LSan using suppression file /builds/worker/workspace/build/tests/web-platform/prefs/lsan_suppressions.txt
[task 2019-07-03T03:16:10.991Z] 03:16:10 INFO - INFO | runtests.py | ASan running in default memory configuration
[task 2019-07-03T03:16:11.006Z] 03:16:11 INFO - Setting up ssl
[task 2019-07-03T03:16:11.063Z] 03:16:11 INFO - certutil |
[task 2019-07-03T03:16:11.123Z] 03:16:11 INFO - certutil |
[task 2019-07-03T03:16:11.164Z] 03:16:11 INFO - certutil |
[task 2019-07-03T03:16:11.164Z] 03:16:11 INFO - Certificate Nickname Trust Attributes
[task 2019-07-03T03:16:11.164Z] 03:16:11 INFO - SSL,S/MIME,JAR/XPI
[task 2019-07-03T03:16:11.165Z] 03:16:11 INFO -
[task 2019-07-03T03:16:11.165Z] 03:16:11 INFO - web-platform-tests CT,,
[task 2019-07-03T03:16:11.165Z] 03:16:11 INFO -
[task 2019-07-03T03:16:11.180Z] 03:16:11 INFO - Application command: /builds/worker/workspace/build/application/firefox/firefox --marionette about:blank -profile /tmp/tmpENo3lD.mozrunner
[task 2019-07-03T03:16:11.196Z] 03:16:11 INFO - Starting runner
[task 2019-07-03T03:16:13.153Z] 03:16:13 INFO - PID 18469 | 1562123773138 addons.webextension.screenshots@mozilla.org WARN Loading extension 'screenshots@mozilla.org': Reading manifest: Invalid extension permission: mozillaAddons
[task 2019-07-03T03:16:13.154Z] 03:16:13 INFO - PID 18469 | 1562123773139 addons.webextension.screenshots@mozilla.org WARN Loading extension 'screenshots@mozilla.org': Reading manifest: Invalid extension permission: telemetry
[task 2019-07-03T03:16:13.155Z] 03:16:13 INFO - PID 18469 | 1562123773140 addons.webextension.screenshots@mozilla.org WARN Loading extension 'screenshots@mozilla.org': Reading manifest: Invalid extension permission: resource://pdf.js/
[task 2019-07-03T03:16:13.156Z] 03:16:13 INFO - PID 18469 | 1562123773141 addons.webextension.screenshots@mozilla.org WARN Loading extension 'screenshots@mozilla.org': Reading manifest: Invalid extension permission: about:reader*
[task 2019-07-03T03:16:21.769Z] 03:16:21 INFO - PID 18469 | 1562123781762 Marionette INFO Listening on port 42034
[task 2019-07-03T03:16:22.525Z] 03:16:22 INFO - TEST-START | /websockets/keeping-connection-open/001.html
[taskcluster:error] Task timeout after 7200 seconds. Force killing container.
[taskcluster 2019-07-03 03:17:10.911Z] === Task Finished ===
[taskcluster 2019-07-03 03:17:10.911Z] Unsuccessful task run with exit code: -1 completed in 7202.153 seconds

Depends on: 1563811

Most recent failures are bug 1562078, which should be resolved soon.

(In reply to Geoff Brown [:gbrown] from comment #299)

Most recent failures are bug 1562078, which should be resolved soon.

Unfortunately, those continue, even after the wpt emulator upgrade.

Depends on: 1568309
Depends on: 1569061

There have been several static-analysis-auto-st-autotest 3600 second timeouts lately, but the most recent tasks are completing in just 13 minutes.

Depends on: 1574969

Bug 1574254 temporarily introduced some android xpcshell task timeouts.

Depends on: 1576711
Depends on: 1579504

In the last 7 days there have been 27 occurrences on Linux 64, Linux 64 ccov, OS X cross noopt, build types debug and opt.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=266479367&repo=mozilla-central&lineNumber=39024

More than half the current failures here are timeouts in build tasks on various osx and linux platforms. Recently build peers have resisted efforts to increase max-run-time to allow for performance variation; I don't know what else can be done.

Depends on: 1582241
Keywords: leave-open
Depends on: 1593536
Depends on: 1596892

There are 26 total failures in the last 7 days on android-em-7-0-x86_64 debug and some failures on linux, windows and osx.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=278373433&repo=autoland&lineNumber=7194

[vcs 2019-11-27T10:30:04.128Z] clone [==========================================> ] 3203918168/3346631639 43s
[vcs 2019-11-27T10:30:05.130Z] clone [==========================================> ] 3206277464/3346631639 42s
[vcs 2019-11-27T10:30:06.133Z] clone [==========================================> ] 3209816408/3346631639 41s
[taskcluster:error] Task timeout after 7200 seconds. Force killing container.
[vcs 2019-11-27T10:30:07.150Z] clone [==========================================> ] 3213093208/3346631639 40s
[taskcluster 2019-11-27 10:30:16.425Z] === Task Finished ===
[taskcluster 2019-11-27 10:30:16.460Z] Unsuccessful task run with exit code: -1 completed in 9144.754 seconds

In the last 7 days there have been 21 occurrences on android (1 on andoid 4, 3 on andoid 5), linux 64 and osx-cross, mostly on build types debug and opt.

Recent failure: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=279969270&repo=autoland&lineNumber=3218

In the last 7 days there have been 24 occurrences on linux 64 and osx-cross-noopt, build types asan and debug.

Recent failure: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=280333995&repo=autoland&lineNumber=48848

[task 2019-12-09T19:34:01.024Z] 19:34:01 INFO - make[4]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox/layout/tools/recording'
[task 2019-12-09T19:34:01.024Z] 19:34:01 INFO - make[4]: Entering directory '/builds/worker/workspace/build/src/obj-firefox/layout/tools/reftest'
[task 2019-12-09T19:34:01.024Z] 19:34:01 INFO - mkdir -p '../../../dist/xpi-stage/reftest/chrome/'
[task 2019-12-09T19:34:01.024Z] 19:34:01 INFO - make[4]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox/layout/tools/reftest'
[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
[taskcluster 2019-12-09 19:34:42.469Z] === Task Finished ===
[taskcluster 2019-12-09 19:34:42.469Z] Unsuccessful task run with exit code: -1 completed in 3652.916 seconds

Depends on: 1605433
Depends on: 1605458
Depends on: 1607380
Depends on: 1608219

In the last 7 days there have been 34 occurrences on linux 64 and osx-cross-noopt, build types asan and debug.

recent log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=284270373&repo=autoland&lineNumber=21893

Depends on: 1610998
Depends on: 1611018

Most recent failures are from a variety of build tasks. I am hoping the discussion in bug 1610998 may help to resolve these.

Depends on: 1611477

The recent spike here is caused by Bug 1614852.

Depends on: 1617026
Depends on: 1617031

There are 33 failures in the last 7 days, most of the occurrences can be seen on linux1804-64 (debug), linux1804-64-asan (opt), linux64-tsan (opt): https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-02-14&endday=2020-02-21&tree=trunk&bug=1411358

A recent log example: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=289769127&repo=mozilla-central&lineNumber=22442

I will with Geoff's help work on getting Ubuntu 18.04 in better shape.

Flags: needinfo?(bob)

In the last week we saw for linux1804

count label project platform buildtype
1 mochitest-browser-chrome-e10s-13 autoland linux1804-64 debug
1 mochitest-browser-chrome-fis-e10s-9 autoland linux1804-64 debug
1 mochitest-webgl2-ext-fis-e10s-4 autoland linux1804-64-qr debug
1 reftest-no-accel-e10s-3 autoland linux1804-64 debug
1 reftest-no-accel-e10s-7 autoland linux1804-64 debug
6 web-platform-tests-e10s-2 autoland linux1804-64-asan opt
2 web-platform-tests-e10s-2 mozilla-central linux1804-64-asan opt
3 web-platform-tests-e10s-5 autoland linux1804-64-asan opt
1 web-platform-tests-sw-e10s-11 mozilla-central linux1804-64 debug

Edwin has already changed linux64 to linux.*64 and Geoff just today changed the chunking for web-platform-tests asan from 24 to 28 in Bug 1617026 so most of this is resolved.

Flags: needinfo?(bob)

There are a total of 35 failures on this bug over the last 7 days. These failures take place on android-em-7-0-x86_64, linux1804-64, linux1804-64-asan, linux1804-64-qr, linux32-shippable, linux64, linux64-asan-reporter, linux64-shippable, linux64-tsan, osx-cross, osx-shippable

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=291938632&repo=autoland&lineNumber=5082

Bugbug thinks this bug is a regression, but please revert this change in case of error.

Keywords: regression
Keywords: regression

There may be a problem with android-em machine-49 - temporary? Watching...

Various linux/asan tasks timed out - temporary? Watching...

This bug is in fairly good shape overall.

Depends on: 1634230

In the last 7 days there have been 71 occurrences on android-em-7-0-x86_64 and linux1804-64 build types debug and opt.

Recent failure: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=300435464&repo=mozilla-central&lineNumber=10564

yes, I am working on this, I will do it as part of Bug 1634230

Flags: needinfo?(jmaher)
Depends on: 1637000

We are suddenly seeing a spike of failures on the nss branch -- I'm not sure what to do about those.

why are the nss branch failures getting annotated? maybe for bugzilla and failure robot we should only focus on specific branches

Still dominated by failures on nss.

I don't know why ./mach try auto runs test-android-em-7.0-x86_64/debug-geckoview-xpcshell-e10s but m-c runs test-android-em-7.0-x86_64/debug-geckoview-xpcshell-e10s-1/2/3/4, but perhaps that could lead to more failures on try.

:ahal - interested in comment 402? It looks like 'try auto' tried to run way too many android xpcshell test manifests in a single chunk.

Flags: needinfo?(ahal)

Thanks, I filed bug 1654867

Flags: needinfo?(ahal)

In the last 7 days there have been 118 occurrences on android-em-7-0-x86_64, linux1804-64-qr and linux1804-64-tsan, build types debug and opt.

Recent failure: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=311694659&repo=autoland&lineNumber=12480

Spike in android-em xpcshell test failures was caused by manifest scheduling running all tests in one chunk: now corrected.

Assignee: whole.grains → nobody
Depends on: 1722428
Depends on: 1724244

Update:
There have been 75 failures within the last 7 days:
• 5 failures on OS X Cross Compiled debug
• 1 failures on Linux 18.04 x64 CCov WebRender opt
• 29 failures on Android 7.0 x86-64 WebRender debug
• 26 failures on android-em-7-0-x86_64-lite-qr debug
• 3 failures on Android 5.0 x86-64 debug
• 3 failures on Android 5.0 x86-64 opt
• 3 failures on android-5-0-x86 opt
• 3 failures on android-5-0-armv7 opt
• 2 failures on Android 5.0 AArch64 opt
Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=347728355&repo=autoland&lineNumber=40865

[task 2021-08-07T11:01:57.171Z] 11:01:57     INFO -  Running: /builds/worker/workspace/obj-build/_virtualenvs/common/bin/python /builds/worker/checkouts/gecko/toolkit/crashreporter/tools/symbolstore.py -c --vcs-info --install-manifest=/builds/worker/workspace/obj-build/_build_manifests/install/dist_include,/builds/worker/workspace/obj-build/dist/include -s /builds/worker/checkouts/gecko /builds/worker/fetches/dump_syms/dump_syms /builds/worker/workspace/obj-build/dist/crashreporter-symbols /builds/worker/workspace/obj-build/mfbt/tests/TestXorShift128PlusRNG
[task 2021-08-07T11:01:57.171Z] 11:01:57     INFO -  Beginning work for file: /builds/worker/workspace/obj-build/mfbt/tests/TestXorShift128PlusRNG
[task 2021-08-07T11:01:57.171Z] 11:01:57     INFO -  Processing file: /builds/worker/workspace/obj-build/mfbt/tests/TestXorShift128PlusRNG
[task 2021-08-07T11:01:57.171Z] 11:01:57     INFO -  /builds/worker/fetches/dump_syms/dump_syms /builds/worker/workspace/obj-build/mfbt/tests/TestXorShift128PlusRNG
[task 2021-08-07T11:01:57.171Z] 11:01:57     INFO -  Finished processing /builds/worker/workspace/obj-build/mfbt/tests/TestXorShift128PlusRNG in 0.01s
[task 2021-08-07T11:01:57.172Z] 11:01:57     INFO -  make[4]: Leaving directory '/builds/worker/workspace/obj-build/mfbt/tests'
[task 2021-08-07T11:01:58.034Z] 11:01:58     INFO -  make[4]: Entering directory '/builds/worker/workspace/obj-build/mfbt/tests'
[task 2021-08-07T11:01:58.034Z] 11:01:58     INFO -  /builds/worker/workspace/obj-build/_virtualenvs/common/bin/python -m mozbuild.action.dumpsymbols /builds/worker/workspace/obj-build/mfbt/tests/TestSPSCQueue /builds/worker/workspace/obj-build/mfbt/tests/TestSPSCQueue_syms.track
[task 2021-08-07T11:01:58.034Z] 11:01:58     INFO -  Running: /builds/worker/workspace/obj-build/_virtualenvs/common/bin/python /builds/worker/checkouts/gecko/toolkit/crashreporter/tools/symbolstore.py -c --vcs-info --install-manifest=/builds/worker/workspace/obj-build/_build_manifests/install/dist_include,/builds/worker/workspace/obj-build/dist/include -s /builds/worker/checkouts/gecko /builds/worker/fetches/dump_syms/dump_syms /builds/worker/workspace/obj-build/dist/crashreporter-symbols /builds/worker/workspace/obj-build/mfbt/tests/TestSPSCQueue
[task 2021-08-07T11:01:58.034Z] 11:01:58     INFO -  Beginning work for file: /builds/worker/workspace/obj-build/mfbt/tests/TestSPSCQueue
[task 2021-08-07T11:01:58.035Z] 11:01:58     INFO -  Processing file: /builds/worker/workspace/obj-build/mfbt/tests/TestSPSCQueue
[task 2021-08-07T11:01:58.035Z] 11:01:58     INFO -  /builds/worker/fetches/dump_syms/dump_syms /builds/worker/workspace/obj-build/mfbt/tests/TestSPSCQueue
[task 2021-08-07T11:01:58.035Z] 11:01:58     INFO -  Finished processing /builds/worker/workspace/obj-build/mfbt/tests/TestSPSCQueue in 0.02s
[task 2021-08-07T11:01:58.035Z] 11:01:58     INFO -  make[4]: Leaving directory '/builds/worker/workspace/obj-build/mfbt/tests'
[task 2021-08-07T11:01:58.882Z] 11:01:58     INFO -  make[4]: Entering directory '/builds/worker/workspace/obj-build/mfbt/tests'
[task 2021-08-07T11:01:58.883Z] 11:01:58     INFO -  /builds/worker/workspace/obj-build/_virtualenvs/common/bin/python -m mozbuild.action.dumpsymbols /builds/worker/workspace/obj-build/mfbt/tests/TestThreadSafeWeakPtr /builds/worker/workspace/obj-build/mfbt/tests/TestThreadSafeWeakPtr_syms.track
[task 2021-08-07T11:01:58.883Z] 11:01:58     INFO -  Running: /builds/worker/workspace/obj-build/_virtualenvs/common/bin/python /builds/worker/checkouts/gecko/toolkit/crashreporter/tools/symbolstore.py -c --vcs-info --install-manifest=/builds/worker/workspace/obj-build/_build_manifests/install/dist_include,/builds/worker/workspace/obj-build/dist/include -s /builds/worker/checkouts/gecko /builds/worker/fetches/dump_syms/dump_syms /builds/worker/workspace/obj-build/dist/crashreporter-symbols /builds/worker/workspace/obj-build/mfbt/tests/TestThreadSafeWeakPtr
[task 2021-08-07T11:01:58.883Z] 11:01:58     INFO -  Beginning work for file: /builds/worker/workspace/obj-build/mfbt/tests/TestThreadSafeWeakPtr
[task 2021-08-07T11:01:58.883Z] 11:01:58     INFO -  Processing file: /builds/worker/workspace/obj-build/mfbt/tests/TestThreadSafeWeakPtr
[task 2021-08-07T11:01:58.883Z] 11:01:58     INFO -  /builds/worker/fetches/dump_syms/dump_syms /builds/worker/workspace/obj-build/mfbt/tests/TestThreadSafeWeakPtr
[task 2021-08-07T11:01:58.884Z] 11:01:58     INFO -  Finished processing /builds/worker/workspace/obj-build/mfbt/tests/TestThreadSafeWeakPtr in 0.01s
[task 2021-08-07T11:01:58.884Z] 11:01:58     INFO -  make[4]: Leaving directory '/builds/worker/workspace/obj-build/mfbt/tests'
[task 2021-08-07T11:01:59.731Z] 11:01:59     INFO -  make[4]: Entering directory '/builds/worker/workspace/obj-build/mfbt/tests'
[task 2021-08-07T11:01:59.732Z] 11:01:59     INFO -  /builds/worker/workspace/obj-build/_virtualenvs/common/bin/python -m mozbuild.action.dumpsymbols /builds/worker/workspace/obj-build/mfbt/tests/TestUtf8 /builds/worker/workspace/obj-build/mfbt/tests/TestUtf8_syms.track
[task 2021-08-07T11:01:59.732Z] 11:01:59     INFO -  Running: /builds/worker/workspace/obj-build/_virtualenvs/common/bin/python /builds/worker/checkouts/gecko/toolkit/crashreporter/tools/symbolstore.py -c --vcs-info --install-manifest=/builds/worker/workspace/obj-build/_build_manifests/install/dist_include,/builds/worker/workspace/obj-build/dist/include -s /builds/worker/checkouts/gecko /builds/worker/fetches/dump_syms/dump_syms /builds/worker/workspace/obj-build/dist/crashreporter-symbols /builds/worker/workspace/obj-build/mfbt/tests/TestUtf8
[task 2021-08-07T11:01:59.732Z] 11:01:59     INFO -  Beginning work for file: /builds/worker/workspace/obj-build/mfbt/tests/TestUtf8
[task 2021-08-07T11:01:59.732Z] 11:01:59     INFO -  Processing file: /builds/worker/workspace/obj-build/mfbt/tests/TestUtf8
[task 2021-08-07T11:01:59.732Z] 11:01:59     INFO -  /builds/worker/fetches/dump_syms/dump_syms /builds/worker/workspace/obj-build/mfbt/tests/TestUtf8
[task 2021-08-07T11:01:59.732Z] 11:01:59     INFO -  Finished processing /builds/worker/workspace/obj-build/mfbt/tests/TestUtf8 in 0.01s
[task 2021-08-07T11:01:59.732Z] 11:01:59     INFO -  make[4]: Leaving directory '/builds/worker/workspace/obj-build/mfbt/tests'
[task 2021-08-07T11:02:00.590Z] 11:02:00     INFO -  make[4]: Entering directory '/builds/worker/workspace/obj-build/mfbt/tests'
[task 2021-08-07T11:02:00.590Z] 11:02:00     INFO -  /builds/worker/workspace/obj-build/_virtualenvs/common/bin/python -m mozbuild.action.dumpsymbols /builds/worker/workspace/obj-build/mfbt/tests/TestPoisonArea /builds/worker/workspace/obj-build/mfbt/tests/TestPoisonArea_syms.track
[task 2021-08-07T11:02:00.590Z] 11:02:00     INFO -  Running: /builds/worker/workspace/obj-build/_virtualenvs/common/bin/python /builds/worker/checkouts/gecko/toolkit/crashreporter/tools/symbolstore.py -c --vcs-info --install-manifest=/builds/worker/workspace/obj-build/_build_manifests/install/dist_include,/builds/worker/workspace/obj-build/dist/include -s /builds/worker/checkouts/gecko /builds/worker/fetches/dump_syms/dump_syms /builds/worker/workspace/obj-build/dist/crashreporter-symbols /builds/worker/workspace/obj-build/mfbt/tests/TestPoisonArea
[task 2021-08-07T11:02:00.591Z] 11:02:00     INFO -  Beginning work for file: /builds/worker/workspace/obj-build/mfbt/tests/TestPoisonArea
[task 2021-08-07T11:02:00.591Z] 11:02:00     INFO -  Processing file: /builds/worker/workspace/obj-build/mfbt/tests/TestPoisonArea
[task 2021-08-07T11:02:00.591Z] 11:02:00     INFO -  /builds/worker/fetches/dump_syms/dump_syms /builds/worker/workspace/obj-build/mfbt/tests/TestPoisonArea
[task 2021-08-07T11:02:00.591Z] 11:02:00     INFO -  Finished processing /builds/worker/workspace/obj-build/mfbt/tests/TestPoisonArea in 0.01s
[task 2021-08-07T11:02:00.591Z] 11:02:00     INFO -  make[4]: Leaving directory '/builds/worker/workspace/obj-build/mfbt/tests'
[taskcluster:error] Task timeout after 7200 seconds. Force killing container.
[taskcluster 2021-08-07 12:50:50.155Z] === Task Finished ===
[taskcluster 2021-08-07 12:50:50.155Z] Unsuccessful task run with exit code: -1 completed in 7220.117 seconds

Geoff, there have been two instances of webdriver jobs failing with a large failure log and lots of failure lines like this one:
Connection refused (os error 111), should they be tracked in a separate bug?

Flags: needinfo?(gbrown)

If it's just two instances, I wouldn't bother. In general, I usually ignore any failure in this bug that happens less than 10 times per week.

Flags: needinfo?(gbrown)
Severity: normal → S3

There have been 53 total failures in the last 7 days, recent failure log.
Affected platforms are:

  • linux1804-64-asan-qr
  • linux1804-64-qr
  • symbols
[task 2022-12-11T00:23:33.303Z] 00:23:33     INFO -  TEST-START | services/fxaccounts/tests/xpcshell/test_commands.js
[taskcluster:error] Task timeout after 7200 seconds. Force killing container.
[taskcluster 2022-12-11 00:26:54.857Z] === Task Finished ===
[taskcluster 2022-12-11 00:26:54.858Z] Unsuccessful task run with exit code: -1 completed in 7201.887 seconds

Latest spike on Android xpcshell timeout runs is from https://bugzilla.mozilla.org/show_bug.cgi?id=1842167#c5.

175 failures from 225 in the last 30 days are on xpcshell android-em-7-0-x86_64 both debug and opt runs.
https://treeherder.mozilla.org/intermittent-failures/bugdetails?startday=2023-08-01&endday=2023-08-31&tree=trunk&failurehash=all&bug=1411358
Joel, got any time to look over these ones? Thank you.

Flags: needinfo?(jmaher)

thanks for needinfo me, I see that we have a common pattern for android emulator:

[task 2023-08-31T12:02:35.112Z] 12:02:35     INFO -  SUITE-END | took 2349s
[task 2023-08-31T12:02:35.112Z] 12:02:35     INFO -  Node moz-http2 server shutting down ...
[task 2023-08-31T12:02:35.112Z] 12:02:35     INFO -  Process stdout
[task 2023-08-31T12:02:35.112Z] 12:02:35     INFO -  forked process without handler sent: {"error":"","errorStack":""}
[task 2023-08-31T12:02:35.112Z] 12:02:35     INFO -  error: Command failed: /builds/worker/fetches/android-sdk-linux/platform-tools/adb reverse tcp:39137 tcp:39137
[task 2023-08-31T12:02:35.112Z] 12:02:35     INFO -  adb: error: cannot bind listener: Address already in use
[taskcluster:error] Task timeout after 5400 seconds. Force killing container.
[taskcluster 2023-08-31 12:51:08.587Z] === Task Finished ===
[taskcluster 2023-08-31 12:51:08.588Z] Unsuccessful task run with exit code: -1 completed in 5471.759 seconds

the line [task 2023-08-31T12:02:35.112Z] 12:02:35 INFO - forked process without handler sent: {"error":"","errorStack":""} is the last one I can find, I assume that we fork a process which then tries to do adb reverse and we get Address already in use. A few thoughts:

  • we run many adb sessions at a time for a single host, we could really be out of addresses
  • do we free the ports, or do they "timeout" over time?
  • I am not sure where or how we are calling adb reverse, I not a lot of references for reverse in-tree.

the error I see comes from here:

      console.log(
        `forked process without handler sent: ${JSON.stringify(msg)}`
      );

:aerickson, can you check the host machines to see if we are running out of addresses regularly? maybe a lot of stale connections via netstat or something like that?

:valentin, I see that you had added the code for the forking process, do you have thoughts on how this might be calling adb reverse?

Flags: needinfo?(valentin.gosu)
Flags: needinfo?(jmaher)
Flags: needinfo?(aerickson)

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #617)

  • I am not sure where or how we are calling adb reverse, I not a lot of references for reverse in-tree.

We are calling adb reverse here

let command = `${adb_path} reverse tcp:${port} tcp:${port}`;
if (remove) {
  command = `${adb_path} reverse --remove tcp:${port}`;
  return true;
}

I see Kershaw tried to fix a similar issue a few months ago by implementing a retry mechanism in bug 1842651, but I'm not sure it was entirely successful.

I assume that we fork a process which then tries to do adb reverse and we get Address already in use. A few thoughts:

  • we run many adb sessions at a time for a single host, we could really be out of addresses

There is a small number of tests where we actually exercise this code. I suppose test-verify could cause it to run more times than expected. We do cleanup after the server, but to be honest I'm not sure if that actually closes the socket. (Worth trying to find out).

  • do we free the ports, or do they "timeout" over time?

We do clean up in the code I linked above, but while looking into this I found 1, 2 places that don't clean up after themselves.

I filed bug 1851038 to fix them.

Flags: needinfo?(valentin.gosu)

Only 1 failure after bug 1851038 landed, but the counts are way down.

Whiteboard: [stockwell infra] [see summary at comment 278] → [stockwell infra] [see summary at comment 278] [relops-android]
You need to log in before you can comment on or make changes to this bug.