Intermittent TEST-UNEXPECTED-ERROR Browser specs "before all" hook in "Browser specs" (browser.spec.ts)
Categories
(Remote Protocol :: Agent, defect, P2)
Tracking
Release | Tracking | Status
---|---|---
firefox-esr68 | --- | unaffected |
firefox-esr78 | --- | unaffected |
firefox79 | --- | unaffected |
firefox80 | --- | wontfix |
firefox81 | --- | wontfix |
firefox82 | --- | wontfix |
firefox83 | --- | wontfix |
firefox84 | --- | wontfix |
firefox85 | --- | wontfix |
firefox86 | --- | wontfix |
firefox87 | --- | wontfix |
People
(Reporter: intermittent-bug-filer, Unassigned)
References
(Depends on 1 open bug, Regression)
Details
(4 keywords)
Attachments
(1 file, 1 obsolete file)
(deleted), text/x-phabricator-request
Filed by: malexandru [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=309058845&repo=mozilla-central
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/QhGR9GgrRDCo87YenaZXtg/runs/0/artifacts/public/logs/live_backing.log
[task 2020-07-08T23:27:58.116Z] TEST-PASS | waittask specs Frame.waitForXPath should allow you to select an element with single slash (waittask.spec.ts) | took 291ms
[task 2020-07-08T23:27:58.117Z] PID 194 | ["pass",{"title":"should allow you to select an element with single slash","fullTitle":"waittask specs Frame.waitForXPath should allow you to select an element with single slash","file":"/builds/worker/checkouts/gecko/remote/test/puppeteer/test/waittask.spec.ts","duration":47,"currentRetry":0}]
[task 2020-07-08T23:27:58.117Z] PID 194 | 1594250878117 RemoteAgent TRACE (connection {238bdb0f-0d2a-42b4-b596-1e0dacac9ff4})-> {"method":"Target.disposeBrowserContext","params":{"browserContextId":59},"id":1896}
[task 2020-07-08T23:27:58.122Z] PID 194 | 1594250878122 RemoteAgent TRACE <-(connection {238bdb0f-0d2a-42b4-b596-1e0dacac9ff4}) {"method":"Target.targetDestroyed","params":{"targetId":"6ccb32ef-2b08-4c7b-abf0-eed9f7e51aaa"}}
[task 2020-07-08T23:27:58.124Z] PID 194 | 1594250878123 RemoteAgent TRACE <-(connection {238bdb0f-0d2a-42b4-b596-1e0dacac9ff4}) {"id":1896}
[task 2020-07-08T23:27:58.188Z] PID 194 | Exiting due to channel error.
[task 2020-07-08T23:27:58.198Z] PID 194 | Exiting due to channel error.
[task 2020-07-08T23:27:58.198Z] PID 194 | Exiting due to channel error.
[task 2020-07-08T23:27:58.198Z] PID 194 | Exiting due to channel error.
[task 2020-07-08T23:27:58.199Z] TEST-START | Workers Page.workers (worker.spec.ts)
[task 2020-07-08T23:27:58.200Z] TEST-SKIP | Workers Page.workers (worker.spec.ts) | took 1ms
[task 2020-07-08T23:27:58.200Z] PID 194 | ["pending",{"title":"Page.workers","fullTitle":"Workers Page.workers","file":"/builds/worker/checkouts/gecko/remote/test/puppeteer/test/worker.spec.ts","currentRetry":0}]
[task 2020-07-08T23:27:58.200Z] TEST-START | Workers should emit created and destroyed events (worker.spec.ts)
[task 2020-07-08T23:27:58.200Z] TEST-SKIP | Workers should emit created and destroyed events (worker.spec.ts) | took 0ms
[task 2020-07-08T23:27:58.200Z] PID 194 | ["pending",{"title":"should emit created and destroyed events","fullTitle":"Workers should emit created and destroyed events","file":"/builds/worker/checkouts/gecko/remote/test/puppeteer/test/worker.spec.ts","currentRetry":0}]
[task 2020-07-08T23:27:58.200Z] TEST-START | Workers should report console logs (worker.spec.ts)
[task 2020-07-08T23:27:58.200Z] TEST-SKIP | Workers should report console logs (worker.spec.ts) | took 0ms
[task 2020-07-08T23:27:58.200Z] PID 194 | ["pending",{"title":"should report console logs","fullTitle":"Workers should report console logs","file":"/builds/worker/checkouts/gecko/remote/test/puppeteer/test/worker.spec.ts","currentRetry":0}]
[task 2020-07-08T23:27:58.200Z] TEST-START | Workers should have JSHandles for console logs (worker.spec.ts)
[task 2020-07-08T23:27:58.200Z] TEST-SKIP | Workers should have JSHandles for console logs (worker.spec.ts) | took 0ms
[task 2020-07-08T23:27:58.200Z] PID 194 | ["pending",{"title":"should have JSHandles for console logs","fullTitle":"Workers should have JSHandles for console logs","file":"/builds/worker/checkouts/gecko/remote/test/puppeteer/test/worker.spec.ts","currentRetry":0}]
[task 2020-07-08T23:27:58.200Z] TEST-START | Workers should have an execution context (worker.spec.ts)
[task 2020-07-08T23:27:58.200Z] TEST-SKIP | Workers should have an execution context (worker.spec.ts) | took 0ms
[task 2020-07-08T23:27:58.200Z] PID 194 | ["pending",{"title":"should have an execution context","fullTitle":"Workers should have an execution context","file":"/builds/worker/checkouts/gecko/remote/test/puppeteer/test/worker.spec.ts","currentRetry":0}]
[task 2020-07-08T23:27:58.200Z] TEST-START | Workers should report errors (worker.spec.ts)
[task 2020-07-08T23:27:58.200Z] TEST-SKIP | Workers should report errors (worker.spec.ts) | took 0ms
[task 2020-07-08T23:27:58.200Z] PID 194 | ["pending",{"title":"should report errors","fullTitle":"Workers should report errors","file":"/builds/worker/checkouts/gecko/remote/test/puppeteer/test/worker.spec.ts","currentRetry":0}]
[task 2020-07-08T23:27:58.200Z] PID 194 | ["end",{"suites":155,"tests":669,"passes":388,"pending":53,"failures":230,"start":"2020-07-08T23:11:42.642Z","end":"2020-07-08T23:27:58.200Z","duration":975558}]
[task 2020-07-08T23:27:58.257Z] PID 194 | npm ERR! code ELIFECYCLE
[task 2020-07-08T23:27:58.257Z] PID 194 | npm ERR! errno 230
[task 2020-07-08T23:27:58.258Z] PID 194 | npm ERR! puppeteer@5.0.0 unit: `tsc --version && mocha --config mocha-config/puppeteer-unit-tests.js "--reporter" "./json-mocha-reporter.js" "--retries" "0" "--fullTrace" "--timeout" "15000" "--no-parallel"`
[task 2020-07-08T23:27:58.259Z] PID 194 | npm ERR! Exit status 230
[task 2020-07-08T23:27:58.259Z] PID 194 | npm ERR!
[task 2020-07-08T23:27:58.259Z] PID 194 | npm ERR! Failed at the puppeteer@5.0.0 unit script.
[task 2020-07-08T23:27:58.259Z] PID 194 | npm ERR! This is probably not a problem with npm. There is likely additional logging output above.
[task 2020-07-08T23:27:58.263Z] PID 194 |
[task 2020-07-08T23:27:58.263Z] PID 194 | npm ERR! A complete log of this run can be found in:
[task 2020-07-08T23:27:58.263Z] PID 194 | npm ERR! /builds/worker/.npm/_logs/2020-07-08T23_27_58_259Z-debug.log
[task 2020-07-08T23:27:58.284Z] TEST-UNEXPECTED-MISSING Browser specs Browser.target should return browser target (browser.spec.ts)
[task 2020-07-08T23:27:58.284Z] TEST-UNEXPECTED-MISSING Browser specs Browser.isConnected should set the browser connected state (browser.spec.ts)
[task 2020-07-08T23:27:58.284Z] TEST-UNEXPECTED-MISSING Browser specs Browser.process should not return child_process for remote browser (browser.spec.ts)
[task 2020-07-08T23:27:58.284Z] TEST-UNEXPECTED-MISSING Browser specs Browser.userAgent should include WebKit (browser.spec.ts)
[task 2020-07-08T23:27:58.284Z] TEST-UNEXPECTED-MISSING Browser specs Browser.version should return whether we are in headless (browser.spec.ts)
[task 2020-07-08T23:27:58.284Z] TEST-UNEXPECTED-MISSING Browser specs Browser.process should return child_process instance (browser.spec.ts)
[task 2020-07-08T23:27:58.284Z] TEST-UNEXPECTED-MISSING Unknown new test Browser specs "after all" hook in "Browser specs" (browser.spec.ts)
[task 2020-07-08T23:27:58.284Z] TEST-UNEXPECTED-MISSING Unknown new test Browser specs "before all" hook in "Browser specs" (browser.spec.ts)
[task 2020-07-08T23:27:58.284Z] SUITE-END | took 985s
[task 2020-07-08T23:27:58.284Z] npm exited with code 230
[task 2020-07-08T23:27:58.287Z] ./mach: Got unexpected results
[fetches 2020-07-08T23:27:58.324Z] removing /builds/worker/fetches
[fetches 2020-07-08T23:27:58.485Z] finished
[taskcluster 2020-07-08 23:27:58.983Z] === Task Finished ===
[taskcluster 2020-07-08 23:28:00.817Z] Unsuccessful task run with exit code: 1 completed in 1171.9 seconds
As Henrik noticed earlier today, something is slowing down Firefox start-up. We could always increase the mocha timeout, but it would be nice to find out which commit regressed browser start-up time. I think this job is only scheduled on mozilla-central, so I guess it's impossible to backfill it on autoland?
Comment 2•4 years ago
Yes, I triggered some backfills on mozilla-central to figure out which merge actually regressed that. I will ask sheriffs to continue on autoland in case it is necessary.
Comment 3•4 years ago
Regression range:
The appropriate merge is:
https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=34fb169ef962bb4c06b4436f78b91d94e0e90f4b
Sheriffs can you please do a backfill on autoland to find the regressor? Thanks.
Comment 4•4 years ago
We've started investigating and we'll post updates when jobs are done.
Comment 5•4 years ago
https://hg.mozilla.org/integration/autoland/rev/d72c64af2c2372d7e69230f76650dabb3141287e seems to be the culprit.
Mitchell, could you take a look?
Comment 6•4 years ago
Very strange that this is caused by some git improvements. But interestingly this is a source test, so that we get a clone of the repository - but with Mercurial. Might that be an unexpected side-effect?
https://searchfox.org/mozilla-central/source/taskcluster/ci/source-test/remote.yml#35-44
Comment 7•4 years ago
Hmm, I'm not sure how this could be caused by the git improvements.
I'll try to reproduce locally :)
If anything, we can back out my changes to see if the issue gets resolved, eh?
Comment 8•4 years ago
I'm having additional unexpected failures when running locally, so I'm unable to test with a clean environment.
:razvanm, do you want to back out my change to see if it resolves the problem?
If the issue is fixed, I'll collaborate with the Puppeteer team to figure out how to diagnose it.
If the issue isn't fixed, then perhaps it was caused by an earlier tree change but had a delayed failure due to CI specifics?
Comment 9•4 years ago
FWIW, latest pup run was successful.
Updating the summary as a result of the reporting fix in bug 1651469.
Comment 11•4 years ago
(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #9)
FWIW, latest pup run was successful.
I triggered two more Puppeteer jobs on central to see where it has been fixed.
Comment 12•4 years ago
Sebastian, can we try the backout as suggested by Mitchell? Would be good to know if that fixes the problem.
Comment 13•4 years ago
Bug 1645196 has been backed out from autoland.
Comment 14•4 years ago
I triggered the puppeteer tests and they are actually still broken:
So it hasn't been caused by that check-in, but there might have been an infrastructure change for the Linux docker images on July 7th? Callek or Tom, do you know?
Comment 15•4 years ago
We indeed had a docker image job running for ub18-test on the causing changeset:
Comment 16•4 years ago
So the changeset right before was updating the way TaskCluster builds docker images. See bug 1626058.
Tom, did we maybe regress bug 1633101 with your change?
Comment 17•4 years ago
It's plausible that rebuilding the image updated some package that triggers the error... What happens if you take the changeset before bug 1633101, add an empty line to taskcluster/docker/ubuntu1804-test/Dockerfile, and push to try to run that puppeteer test?
Comment hidden (Intermittent Failures Robot) |
Comment 19•4 years ago
(In reply to Mike Hommey [:glandium] from comment #17)
It's plausible that rebuilding the image updated some package that triggers the error... What happens if you take the changeset before bug 1633101, add an empty line to taskcluster/docker/ubuntu1804-test/Dockerfile, and push to try to run that puppeteer test?
I assume you mean bug 1626058 here. So I pushed a try build for the parent commit:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=aeb54d91310c8d75162f4cda3e55bfc3dc806b30
Comment 20•4 years ago
So the tests are passing with the parent changeset of bug 1626058. That means it's not related to some updated packages inside the docker container, but to how the docker image is getting built. Any further ideas to check? Otherwise I hope that Tom could have a look at it.
Comment 22•4 years ago
So this is clearly caused by a slow startup of Firefox which is aborted after 15s:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=309851380&repo=try&lineNumber=158-381
Nearly all the time is spent between the following lines:
[task 2020-07-15T14:25:29.864Z] PID 198 | DEBUG: Adding blocker TelemetryController: shutting down for phase profile-before-change-telemetry
[task 2020-07-15T14:25:44.334Z] PID 198 | DEBUG: Adding blocker EnvironmentAddonBuilder for phase AddonManager: Waiting to start provider shutdown.
The EnvironmentAddonBuilder seems to get added by Telemetry via:
https://searchfox.org/mozilla-central/rev/82c04b9cad5b98bdf682bd477f2b1e3071b004ad/toolkit/components/telemetry/app/TelemetryEnvironment.jsm#544-548
I actually don't expect Telemetry to run for Puppeteer. And I see that it gets disabled via the preferences:
https://searchfox.org/mozilla-central/rev/82c04b9cad5b98bdf682bd477f2b1e3071b004ad/remote/test/puppeteer/src/node/Launcher.ts#570
So why do we still see Telemetry code running?
Also, are there any MOZ_LOG log modules that could be helpful to nail this down?
Comment 23•4 years ago
That pref disables upload. Telemetry still needs to exist in order that the instrumentation code in Firefox runs without error.
The Environment specifically is a very heavyweight object that is built at startup on each launch from a variety of sources and is used by not just Telemetry but by things like the Crash Reporter as well. There's no clear signal that we could decide to instead not build the Environment because of how many consumers there are.
There are lots of Telemetry logs. You can enable them at a variety of levels using toolkit.telemetry.log.level ("Trace" will get you everything). If you like your logs to be in stdout as well as in the Browser Console, set toolkit.telemetry.log.dump to true (docs for these are over here).
Not sure what could cause it to take 15s. Getting the addon information is bounded in time by the provider, most of the rest is bound by the speed of reading a variety of system files and services (to get things like hdd size, os versions, gpu count and vendors, etc)... there's been some changes in the past year to improve the asynchronicity of some of the worst offenders, but script can always block on waiting for the full environment to be present (especially if there's a shutdown during startup).
Hopefully something in this helps
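As a minimal sketch (assuming a local run with a profile at a placeholder path, not taken from this comment), those two prefs could be turned on via the test profile's user.js before launching Firefox:
# Hypothetical profile path; point it at whatever profile the harness launches.
cat >> /path/to/test-profile/user.js <<'EOF'
user_pref("toolkit.telemetry.log.level", "Trace"); // log everything
user_pref("toolkit.telemetry.log.dump", true);     // mirror log output to stdout
EOF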
Comment 24•4 years ago
Oh, I see. In that case I might want to modify the taskcluster configuration for this source-test job and start Firefox via Marionette to actually record a Gecko profile. This should be more helpful for investigating the slow startup.
I will report back when I have more details. Triggered the try build for now:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a333c789c0e5ba96495432cb1605f862eaeadcef
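A minimal sketch of another way such a startup profile could be captured, using the Gecko Profiler's environment variables instead of Marionette (the output path and the use of --screenshot to make Firefox exit on its own are assumptions, not what the try push actually did):
# Arm the profiler at startup and write a profile when Firefox shuts down.
MOZ_PROFILER_STARTUP=1 \
MOZ_PROFILER_SHUTDOWN=/builds/worker/artifacts/first-startup-profile.json \
/builds/worker/fetches/firefox/firefox --screenshot about:blank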
Comment 25•4 years ago
Here is the profile of the first start of Firefox: https://share.firefox.dev/3h5OAXB
As can be seen, we again spend all the time in gfxPlatformGtk::CreatePlatformFontList. So it looks identical to the font cache problem that I recently fixed in bug 1633101. The question now is why the font cache for the x86_64 platform is not created.
Comment 26•4 years ago
Something actually fixed it on mozilla-central for the first merge on July 12th:
Which is equivalent to this range on autoland:
During that time I cannot see a rebuild of the ubuntu1804-test docker image on either autoland or mozilla-central. So maybe some other landing fixed that? I cannot find any obvious landing in that range. So I will do some backfills, which hopefully help.
Also try is working fine as this push demonstrates:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f172ba3f2ce420c0eebb48fdafe6212d534f7e34
Comment 27•4 years ago
So the fix that made it actually work again was a rustfmt landing:
But with that changeset we also rebuilt the ubuntu1804-test docker image:
We should really pin the versions of the packages we install via apt-get (similar to the Debian image, as Tom mentioned to me via Matrix) to make sure we don't install untested releases of packages, which can cause tests to fail. Especially because we seem to rebuild the docker images once in a while, so failures like this can start all of a sudden.
Joel and Edwin, is that something we can do for the Ubuntu 1804 image?
Marking this bug as WFM, and moving it into the startup component, which is where it most likely belongs.
Comment 28•4 years ago
It seems logical to me to pin the versions of the packages used in the 1804 docker image creation. Is there a list of changes to make?
Comment 29•4 years ago
That's the entire problem with using Ubuntu: we can't pin packages.
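For illustration of the pinning idea from the previous comments (the version string below is made up), apt can install an exact version with the package=version form; the catch, as this comment notes, is that Ubuntu's archive drops superseded package versions, so such a pin quickly stops being installable:
apt-get install -y fontconfig=2.12.6-0ubuntu2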
Comment 30•4 years ago
:glandium is correct. Bug 1607713 is a similar issue where we have to wait for upstream to fix it.
Comment 31•4 years ago
Should we reopen the bug? Or is it ok to classify these failures against it as it is?
Comment hidden (Intermittent Failures Robot) |
Comment 33•4 years ago
What a bummer. Yes, let's reopen, given that it doesn't make sense to file new bugs over and over again whenever a docker image update happens that might have broken it again. :/
The docker image update which caused the failure again is from Aug 6th:
The former docker image update from Aug 3rd didn't cause it:
Comparing the log files for the fontconfig packages doesn't show any difference in when the font cache is getting regenerated. Also, all related packages are installed with exactly the same versions.
It might be worthwhile to check if explicitly adding fc-cache -fv after the package installation steps would help here.
Comment hidden (Intermittent Failures Robot) |
Comment 35•4 years ago
So I pushed another try build with the startup profiler enabled just to check how the startups look like. Here the results:
https://share.firefox.dev/3iYdMjM
https://share.firefox.dev/2YfkOZ8
I'm not sure about the first one. Here Firefox started quickly enough. But the second profile is especially interesting because it again shows the font cache generation timing.
Comment 36•4 years ago
Just to check if it's really related to the font cache generation, I pushed another try build where fc-cache -rv is explicitly called before the actual mach command. And as you can see, it all works fine:
We no longer run into a timeout that causes all tests to fail.
As such I will check if explicitly running this command at the end of ubuntu1804-test-system-setup.sh will make it work.
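A sketch of what that try experiment amounts to (the mach subcommand name is the usual entry point for these tests and is an assumption; the exact CI invocation may differ):
fc-cache -rv            # erase and rebuild the font cache, verbosely
./mach puppeteer-test   # then run the Puppeteer unit tests as before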
Comment 37•4 years ago
Btw, here is the link showing which of the fonts actually had to be cached:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=313475996&repo=try&lineNumber=61-140
Comment 38•4 years ago
I think I now know what the problem is. The system setup script is run as root, and as such so is the installation of the various apt packages. When fontconfig gets installed, the script runs fc-cache as the root user. Thereby the cache gets stored at /root/.cache/fontcache, which is finally all cleaned up at the end of the script.
Now when we run the various test jobs in CI, the user that runs all the commands is worker. For that user there is clearly no font cache available, and the first start of Firefox actually triggers its creation.
Given that we don't want to create the cache for each test job, I would say the best place to create it would be at the end of the setup script.
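A minimal sketch of that idea (not the actual patch; the su/HOME details are assumptions based on the /builds/worker home directory mentioned later in comment 52):
# At the end of ubuntu1804-test-system-setup.sh, which runs as root:
su worker -c "HOME=/builds/worker fc-cache -fv"   # pre-populate the worker user's font cache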
Comment 39•4 years ago
Comment 40•4 years ago
Comment 41•4 years ago
bugherder |
Comment hidden (Intermittent Failures Robot) |
Comment 43•4 years ago
This is still happening: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=313932047&repo=mozilla-central&lineNumber=160
Henrik, can you please take a look? Thank you
Comment 44•4 years ago
That's sad. Can we back out the patch please? I will have to investigate why it doesn't work when we run this step in the docker setup script.
Comment 46•4 years ago
Comment 47•4 years ago
Can you please also back it out from beta (81)? You can do it as part of the next uplifts. There is no hurry.
Comment 48•4 years ago
Added the checkin-needed tag so whoever does the next round of uplifts sees it.
Comment 49•4 years ago
backout uplift |
Comment hidden (Intermittent Failures Robot) |
Comment 51•4 years ago
The problem here is that the /home folder is not part of the docker image. As such the generated font cache will not be present, and needs to be regenerated for each and every test job on Ubuntu 18.04. Maybe that folder is not present because, since bug 1626058 landed, we no longer create the worker user:
https://hg.mozilla.org/ci/taskgraph/rev/004bada4f170#l7.65
Or do I have to check something else beside ubuntu1804-test-system-setup.sh?
Mike, given that you reviewed Tom's patch, do you know why we removed the creation of the worker user?
Comment 52•4 years ago
Oh, it looks like we still create the worker user:
https://searchfox.org/mozilla-central/source/taskcluster/docker/ubuntu1804-test/Dockerfile#6
But the home folder is actually located at /builds/worker. I will have to check the docker image again.
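For context, a hedged reconstruction of what that Dockerfile line amounts to (the exact useradd flags are assumptions, not a verbatim copy of the file):
useradd -d /builds/worker -s /bin/bash -m worker   # home is /builds/worker, not /home/worker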
Comment 53•4 years ago
So the generated font cache is actually stored under /builds/worker/.fontcache instead of /builds/worker/.cache/fontcache. Could this cause problems with some applications, so that it is not being detected? Or should we also run the font cache generation as root?
Comment 54•4 years ago
So since a couple of days ago, this problem doesn't exist anymore. And it's working fine for recent Nightly and CI builds on autoland.
For checking the differences, here are two profiles:
Profile for long startup: https://share.firefox.dev/2YfkOZ8
Profile for normal startup: https://share.firefox.dev/2GfwWTW
Getting the font list takes 1s, compared to 16s when it presumably had to be created first due to the missing cache.
I have no idea what's going on and what could have fixed it; maybe it's only temporary and it will start failing again in a couple of days.
Comment 55•4 years ago
Please note that the user worker has no font cache available in test jobs. I saw that via the following job:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=21dd183c015ccb6a8b8e2fdfa613178c1a1fac9c&selectedTaskRun=Ycr0W4iUQ8qBcsQ7dZf8bA.0
Interesting that even without the font cache the start-up only takes 5s and not 15s.
Comment 56•4 years ago
Puppeteer jobs are still green. Joel, do you think we can enable the jobs for each commit on central and autoland now? See bug 1628589. I cannot say whether this error will come back or not.
Comment 57•4 years ago
Let's do it; we can clean up if problems persist and are frequent enough.
Comment 58•4 years ago
There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:whimboo, could you have a look please?
For more information, please visit auto_nag documentation.
Comment 59•4 years ago
The tests were pretty stable over the last weeks, and the attached patch doesn't seem to be necessary after all. As such, let's close this bug as INCOMPLETE for now and reopen it if the font cache generation problem appears again.
Comment 60•4 years ago
Seeing a similar failure line perma failing:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=318041910&repo=autoland&lineNumber=161
Comment hidden (Intermittent Failures Robot) |
Comment 62•4 years ago
Henrik, looks like I somehow triggered this with a nodejs update that rebuilt the images. Could you take another look please?
Comment 63•4 years ago
This is kinda sad to see, but also kinda scary in terms of when it starts failing and how often. As you can see, recently more pup jobs on autoland are passing. It gives the feeling that something is wrong with the test machines or the environment, and maybe not with the docker image?
Joel, is there anyone you know who could help with that?
For the time being I would propose a workaround for the pup job. Instead of having Firefox launched for the first time by Puppeteer, I will check if starting Firefox directly beforehand with --screenshot will correctly warm up the font cache, and thereby avoid this long first startup time for Puppeteer.
Comment 64•4 years ago
I pushed a try build to verify the possible workaround:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=283c780227fd7cc39cf9ec9d90fae8db796364d9
Comment 65•4 years ago
As it looks, the workaround with --screenshot doesn't work because Firefox acts up and hangs during shutdown:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=318151647&repo=try&lineNumber=63-71
[task 2020-10-09T09:16:02.923Z] + /builds/worker/fetches/firefox/firefox --screenshot
[task 2020-10-09T09:16:02.963Z] *** You are running in headless mode.
[task 2020-10-09T09:17:05.926Z] [Parent 36, Main Thread] ###!!! ABORT: file /builds/worker/checkouts/gecko/dom/serviceworkers/ServiceWorkerShutdownBlocker.cpp:98
[task 2020-10-09T09:17:05.926Z] [Parent 36, Main Thread] ###!!! ABORT: file /builds/worker/checkouts/gecko/dom/serviceworkers/ServiceWorkerShutdownBlocker.cpp:98
[task 2020-10-09T09:17:05.927Z] ExceptionHandler::GenerateDump cloned child 162
[task 2020-10-09T09:17:05.927Z] ExceptionHandler::SendContinueSignalToChild sent continue signal to child
[task 2020-10-09T09:17:05.927Z] ExceptionHandler::WaitForContinueSignal waiting for continue signal...
[task 2020-10-09T09:17:05.994Z] Exiting due to channel error.
Jens, do you have any idea what this could be? I'm happy to trigger some more try builds if there are specific log types available that I could use here to get further details.
Comment 66•4 years ago
The ServiceWorkerManager's shutdown blocker registered here triggered for some reason. This probably indicates a hanging service worker on shutdown.
Is this reproducible? Then a Pernosco session would be very appreciated in order to understand what is going on.
Comment 67•4 years ago
As long as the tests are perma-failing it will be good to get a Pernosco session. I never did that before and had to look for some documentation first. But here is the try build with the --pernosco flag set. I hope that this is all I have to do. Let's see.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=eb0853db2aeee0b8f2bc3f4323e594dc4d1299ec
Comment 68•4 years ago
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #67)
As long as the tests perma-failing it will be good to get a Pernosco session. Never did that before and had to look for some documentation first. But here is the try build with the
--pernosco
flag set. I hope that this is all what I have to do. Lets see.https://treeherder.mozilla.org/#/jobs?repo=try&revision=eb0853db2aeee0b8f2bc3f4323e594dc4d1299ec
It still doesn't work. Andrew, maybe the source tree jobs don't support Pernosco? Or is there anything else I have to set or make available as an artifact?
Comment 69•4 years ago
Yes, the pernosco service scans pushes for jobs it supports. I suspect it doesn't support this task. It also only inherently works with Linux debug platforms.
Maybe you could ask khuey to get puppeteer added.
Comment 70•4 years ago
Jens, could this help?
The thing is that Firefox quit immediately because I missed specifying a URL to take a screenshot of. So I assume it's a race between initializing the service workers and running shutdown right away.
Comment 71•4 years ago
When taking screenshots I get the following for debug builds:
https://treeherder.mozilla.org/#/jobs?repo=try&selectedTaskRun=ZUVDerrbSgC9RkbxaBgRlw.0&revision=54d788a8cfa39ca2105e9ed46e4f7b022af164cc
There is no slow startup anymore. The visible failures are side effects of the debug build slowness.
So here a new try build with a Firefox opt build:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=dd159af0490e2c1f7fbccbfa1486ab5cf3b60763
Comment 72•4 years ago
Due to font cache issues with the ub18-test docker image in
TaskCluster the first startup of Firefox sometimes takes more
than 15s.
As workaround for now just start Firefox once before running
Puppeteer unit tests.
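A sketch of what that warm-up boils down to (the binary path matches the log in comment 65; passing an explicit URL avoids the immediate-shutdown race from comment 70; how the patch actually hooks this into the job setup is not shown here):
/builds/worker/fetches/firefox/firefox --screenshot about:blank   # implies headless and exits by itself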
Comment 73•4 years ago
The patch as attached is only a workaround. As such the bug should stay open.
The real fix for that should be done in the mocha config, but I would like to wait until we have upgraded to Puppeteer 5.3.x and have the keyboard interrupt handling working.
Comment 74•4 years ago
Why do you want to wait for the Puppeteer upgrade to change the mocha config?
Comment 76•4 years ago
Backed out as requested: https://hg.mozilla.org/integration/autoland/rev/c67eaf22376dbd200721282dbfeae3d09177ee1f
Comment 77•4 years ago
The extra comment caused breakage when running the mozharness script. Landing the patch again with it removed.
Comment 78•4 years ago
Comment hidden (Intermittent Failures Robot) |
Comment 80•4 years ago
bugherder |
Comment hidden (Intermittent Failures Robot) |
Comment 82•4 years ago
Dependencies need to be fixed first. So unassigning myself for now.
Comment 83•4 years ago
(In reply to Maja Frydrychowicz :maja_zf (UTC-4) (maja@mozilla.com) from comment #75)
Why do you want to wait for the Puppeteer upgrade to change the mocha config?
I haven't checked which changes shipped with Puppeteer 5.3.1 compared to our vendored version. If there are changes it might be worth syncing first. I also filed https://github.com/puppeteer/puppeteer/issues/6500 to maybe get this fixed by the Puppeteer folks directly, given that this is a more generic issue and not Firefox-specific.
Comment 85•4 years ago
(In reply to Jens Stutte [:jstutte] from comment #84)
Hi, it looks I am not needed here any more.
No, the needinfo is still valid, and I think we should really get the shutdown hang investigated. So please have a look again at comment 70. Thanks.
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] (away 10/19 - 10/20) from comment #73)
The patch as attached is only a workaround. As such the bug should stay open.
The real fix for that should be done in Mocha config, but I would like to wait until we upgraded to Puppeteer 5.3.x, and have the keyboard interrupt handling working.
Note that our mach command overrides the default mocha timeout of 25s down to 15s because otherwise the overall pup job is too long.
Specific tests that involve browser restarts set a longer timeout e.g. https://searchfox.org/mozilla-central/source/remote/test/puppeteer/test/launcher.spec.ts#38
Comment 88•4 years ago
(In reply to Maja Frydrychowicz :maja_zf (UTC-4) (maja@mozilla.com) from comment #87)
Specific tests that involve browser restarts set a longer timeout e.g. https://searchfox.org/mozilla-central/source/remote/test/puppeteer/test/launcher.spec.ts#38
Oh! Could this be the reason why the mocha default settings for the timeout aren't in effect here?
Shouldn't we just remove the default from mach, and leave it all to mocha-config?
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #88)
(In reply to Maja Frydrychowicz :maja_zf (UTC-4) (maja@mozilla.com) from comment #87)
Specific tests that involve browser restarts set a longer timeout e.g. https://searchfox.org/mozilla-central/source/remote/test/puppeteer/test/launcher.spec.ts#38
Oh! Could this be the reason why the mocha default settings for the timeout aren't in effect here?
Yep.
Shouldn't we just remove the default from mach, and leave it all to mocha-config?
We should only change the puppeteer mocha config if we think the change should/could be upstreamed.
Our choice of timeout is pretty specific to our own CI, so it makes more sense to keep it in mach or in a distinct mocha config file that we don't sync with upstream. It's best to keep the sync simple, of course.
Note that you can see how mocha is called at the top of the job logs.
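For reference, the mocha invocation from the log at the top of this bug, with only --timeout raised back to the 25000 ms mentioned in comment 85 (whether that value should live in mach or in the mocha config is exactly the open question here):
tsc --version && mocha --config mocha-config/puppeteer-unit-tests.js \
  --reporter ./json-mocha-reporter.js \
  --retries 0 --fullTrace --timeout 25000 --no-parallel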
Comment 91•4 years ago
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #85)
(In reply to Jens Stutte [:jstutte] from comment #84)
Hi, it looks I am not needed here any more.
No, the needinfo is still valid, and I think we should really get the shutdown hang investigated. So please have a look again on comment 70. Thanks.
Finally here is the Pernosco session for the shutdown hang/crash:
https://pernos.co/debug/Uq1o5n_s5wLvmwUfY60jHg/index.html
Andrew, I would appreciate it if you could have a look at it, given that it seems to be worker-related.
Comment 92•4 years ago
Thank you VERY much :whimboo.
So the problem in the trace is that multiple ServiceWorker subsystems register on "profile-change-teardown", which seems fine at first glance, but it turns out that one of them uses an nsIAsyncShutdownClient while the other still uses an observer for its first shutdown phase (although it does use an nsIAsyncShutdownClient for its final shutdown step). This means that if the async shutdown service gets notified by the observer service before the ServiceWorkerManager is notified, we deadlock.
I will find or create an appropriate bug to block this one. (I think we have other bugs about this scenario but struggled to figure out what was going on, which you have now provided on a silver platter. Thank you again!)
Comment hidden (Intermittent Failures Robot) |
Comment 94•4 years ago
Henrik, Andrew, is there something that still needs to be done for this issue?
Comment 95•4 years ago
(In reply to Pascal Chevrel:pascalc from comment #94)
Henrik, Andrew, is there something that still needs to be done for this issue?
This bug is blocked on bug 1670290. Without that fix there is a chance that we get perma failures for Puppeteer jobs after a docker image rebuild.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |