Open Bug 1651542 Opened 4 years ago Updated 2 years ago

Intermittent TEST-UNEXPECTED-ERROR Browser specs "before all" hook in "Browser specs" (browser.spec.ts)

Categories

(Remote Protocol :: Agent, defect, P2)

Tracking

(firefox-esr68 unaffected, firefox-esr78 unaffected, firefox79 unaffected, firefox80 wontfix, firefox81 wontfix, firefox82 wontfix, firefox83 wontfix, firefox84 wontfix, firefox85 wontfix, firefox86 wontfix, firefox87 wontfix)

People

(Reporter: intermittent-bug-filer, Unassigned)

References

(Depends on 1 open bug, Regression)

Details

(4 keywords)

Attachments

(1 file, 1 obsolete file)

Filed by: malexandru [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=309058845&repo=mozilla-central
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/QhGR9GgrRDCo87YenaZXtg/runs/0/artifacts/public/logs/live_backing.log


[task 2020-07-08T23:27:58.116Z] TEST-PASS | waittask specs Frame.waitForXPath should allow you to select an element with single slash (waittask.spec.ts) | took 291ms
[task 2020-07-08T23:27:58.117Z] PID 194 | ["pass",{"title":"should allow you to select an element with single slash","fullTitle":"waittask specs Frame.waitForXPath should allow you to select an element with single slash","file":"/builds/worker/checkouts/gecko/remote/test/puppeteer/test/waittask.spec.ts","duration":47,"currentRetry":0}]
[task 2020-07-08T23:27:58.117Z] PID 194 | 1594250878117	RemoteAgent	TRACE	(connection {238bdb0f-0d2a-42b4-b596-1e0dacac9ff4})-> {"method":"Target.disposeBrowserContext","params":{"browserContextId":59},"id":1896}
[task 2020-07-08T23:27:58.122Z] PID 194 | 1594250878122	RemoteAgent	TRACE	<-(connection {238bdb0f-0d2a-42b4-b596-1e0dacac9ff4}) {"method":"Target.targetDestroyed","params":{"targetId":"6ccb32ef-2b08-4c7b-abf0-eed9f7e51aaa"}}
[task 2020-07-08T23:27:58.124Z] PID 194 | 1594250878123	RemoteAgent	TRACE	<-(connection {238bdb0f-0d2a-42b4-b596-1e0dacac9ff4}) {"id":1896}
[task 2020-07-08T23:27:58.188Z] PID 194 | Exiting due to channel error.
[task 2020-07-08T23:27:58.198Z] PID 194 | Exiting due to channel error.
[task 2020-07-08T23:27:58.198Z] PID 194 | Exiting due to channel error.
[task 2020-07-08T23:27:58.198Z] PID 194 | Exiting due to channel error.
[task 2020-07-08T23:27:58.199Z] TEST-START | Workers Page.workers (worker.spec.ts)
[task 2020-07-08T23:27:58.200Z] TEST-SKIP | Workers Page.workers (worker.spec.ts) | took 1ms
[task 2020-07-08T23:27:58.200Z] PID 194 | ["pending",{"title":"Page.workers","fullTitle":"Workers Page.workers","file":"/builds/worker/checkouts/gecko/remote/test/puppeteer/test/worker.spec.ts","currentRetry":0}]
[task 2020-07-08T23:27:58.200Z] TEST-START | Workers should emit created and destroyed events (worker.spec.ts)
[task 2020-07-08T23:27:58.200Z] TEST-SKIP | Workers should emit created and destroyed events (worker.spec.ts) | took 0ms
[task 2020-07-08T23:27:58.200Z] PID 194 | ["pending",{"title":"should emit created and destroyed events","fullTitle":"Workers should emit created and destroyed events","file":"/builds/worker/checkouts/gecko/remote/test/puppeteer/test/worker.spec.ts","currentRetry":0}]
[task 2020-07-08T23:27:58.200Z] TEST-START | Workers should report console logs (worker.spec.ts)
[task 2020-07-08T23:27:58.200Z] TEST-SKIP | Workers should report console logs (worker.spec.ts) | took 0ms
[task 2020-07-08T23:27:58.200Z] PID 194 | ["pending",{"title":"should report console logs","fullTitle":"Workers should report console logs","file":"/builds/worker/checkouts/gecko/remote/test/puppeteer/test/worker.spec.ts","currentRetry":0}]
[task 2020-07-08T23:27:58.200Z] TEST-START | Workers should have JSHandles for console logs (worker.spec.ts)
[task 2020-07-08T23:27:58.200Z] TEST-SKIP | Workers should have JSHandles for console logs (worker.spec.ts) | took 0ms
[task 2020-07-08T23:27:58.200Z] PID 194 | ["pending",{"title":"should have JSHandles for console logs","fullTitle":"Workers should have JSHandles for console logs","file":"/builds/worker/checkouts/gecko/remote/test/puppeteer/test/worker.spec.ts","currentRetry":0}]
[task 2020-07-08T23:27:58.200Z] TEST-START | Workers should have an execution context (worker.spec.ts)
[task 2020-07-08T23:27:58.200Z] TEST-SKIP | Workers should have an execution context (worker.spec.ts) | took 0ms
[task 2020-07-08T23:27:58.200Z] PID 194 | ["pending",{"title":"should have an execution context","fullTitle":"Workers should have an execution context","file":"/builds/worker/checkouts/gecko/remote/test/puppeteer/test/worker.spec.ts","currentRetry":0}]
[task 2020-07-08T23:27:58.200Z] TEST-START | Workers should report errors (worker.spec.ts)
[task 2020-07-08T23:27:58.200Z] TEST-SKIP | Workers should report errors (worker.spec.ts) | took 0ms
[task 2020-07-08T23:27:58.200Z] PID 194 | ["pending",{"title":"should report errors","fullTitle":"Workers should report errors","file":"/builds/worker/checkouts/gecko/remote/test/puppeteer/test/worker.spec.ts","currentRetry":0}]
[task 2020-07-08T23:27:58.200Z] PID 194 | ["end",{"suites":155,"tests":669,"passes":388,"pending":53,"failures":230,"start":"2020-07-08T23:11:42.642Z","end":"2020-07-08T23:27:58.200Z","duration":975558}]
[task 2020-07-08T23:27:58.257Z] PID 194 | npm ERR! code ELIFECYCLE
[task 2020-07-08T23:27:58.257Z] PID 194 | npm ERR! errno 230
[task 2020-07-08T23:27:58.258Z] PID 194 | npm ERR! puppeteer@5.0.0 unit: `tsc --version && mocha --config mocha-config/puppeteer-unit-tests.js "--reporter" "./json-mocha-reporter.js" "--retries" "0" "--fullTrace" "--timeout" "15000" "--no-parallel"`
[task 2020-07-08T23:27:58.259Z] PID 194 | npm ERR! Exit status 230
[task 2020-07-08T23:27:58.259Z] PID 194 | npm ERR!
[task 2020-07-08T23:27:58.259Z] PID 194 | npm ERR! Failed at the puppeteer@5.0.0 unit script.
[task 2020-07-08T23:27:58.259Z] PID 194 | npm ERR! This is probably not a problem with npm. There is likely additional logging output above.
[task 2020-07-08T23:27:58.263Z] PID 194 | 
[task 2020-07-08T23:27:58.263Z] PID 194 | npm ERR! A complete log of this run can be found in:
[task 2020-07-08T23:27:58.263Z] PID 194 | npm ERR!     /builds/worker/.npm/_logs/2020-07-08T23_27_58_259Z-debug.log
[task 2020-07-08T23:27:58.284Z] TEST-UNEXPECTED-MISSING Browser specs Browser.target should return browser target (browser.spec.ts)
[task 2020-07-08T23:27:58.284Z] TEST-UNEXPECTED-MISSING Browser specs Browser.isConnected should set the browser connected state (browser.spec.ts)
[task 2020-07-08T23:27:58.284Z] TEST-UNEXPECTED-MISSING Browser specs Browser.process should not return child_process for remote browser (browser.spec.ts)
[task 2020-07-08T23:27:58.284Z] TEST-UNEXPECTED-MISSING Browser specs Browser.userAgent should include WebKit (browser.spec.ts)
[task 2020-07-08T23:27:58.284Z] TEST-UNEXPECTED-MISSING Browser specs Browser.version should return whether we are in headless (browser.spec.ts)
[task 2020-07-08T23:27:58.284Z] TEST-UNEXPECTED-MISSING Browser specs Browser.process should return child_process instance (browser.spec.ts)
[task 2020-07-08T23:27:58.284Z] TEST-UNEXPECTED-MISSING Unknown new test Browser specs "after all" hook in "Browser specs" (browser.spec.ts)
[task 2020-07-08T23:27:58.284Z] TEST-UNEXPECTED-MISSING Unknown new test Browser specs "before all" hook in "Browser specs" (browser.spec.ts)
[task 2020-07-08T23:27:58.284Z] SUITE-END | took 985s
[task 2020-07-08T23:27:58.284Z] npm exited with code 230
[task 2020-07-08T23:27:58.287Z] ./mach: Got unexpected results
[fetches 2020-07-08T23:27:58.324Z] removing /builds/worker/fetches
[fetches 2020-07-08T23:27:58.485Z] finished
[taskcluster 2020-07-08 23:27:58.983Z] === Task Finished ===
[taskcluster 2020-07-08 23:28:00.817Z] Unsuccessful task run with exit code: 1 completed in 1171.9 seconds
Summary: Perma [tier 2] TEST-UNEXPECTED-MISSING Browser specs Browser.target should return browser target (browser.spec.ts) → Perma [tier 2] TEST-UNEXPECTED-MISSING Browser specs Browser<random_text> (browser.spec.ts)

As Henrik noticed earlier today, something is slowing down Firefox start-up. We could always increase the mocha timeout, but it would be nice to find out which commit regressed browser start-up time. I think this job is only scheduled on m-c, so I guess it's impossible to backfill it on autoland?

Yes, I triggered some backfills on mozilla-central to figure out which merge actually regressed that. I will ask sheriffs to continue on autoland in case it is necessary.

We've started investigating and we'll post updates when jobs are done.

Flags: needinfo?(sheriffs)

Very strange that this is caused by some git improvements. But interestingly this is a source test, so we get a clone of the repository - but with Mercurial. Might that be an unexpected side-effect?

https://searchfox.org/mozilla-central/source/taskcluster/ci/source-test/remote.yml#35-44

Regressed by: 1645196
Assignee: nobody → mhentges
Flags: needinfo?(mhentges)

Hmm, I'm not sure how this could be caused by the git improvements.
I'll try to reproduce locally :)
If anything, we can back out my changes to see if the issue gets resolved, eh?

I'm having additional unexpected failures when running locally, so I'm unable to test with a clean environment.
:razvanm do you want to back out my change to see if it resolves the problem?
If the issue is fixed, then I'll collaborate with the puppeteer team to see how I can diagnose the issue.
If the issue isn't fixed, then perhaps the issue was caused by an earlier tree change, but had a delayed failure due to CI specifics?

Flags: needinfo?(rmaries)

Updating the summary as a result of the reporting fix in Bug 1651469.

Summary: Perma [tier 2] TEST-UNEXPECTED-MISSING Browser specs Browser<random_text> (browser.spec.ts) → Perma [tier 2] TEST-UNEXPECTED-ERROR Browser specs "before all" hook in "Browser specs" (browser.spec.ts)

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #9)

FWIW, latest pup run was successful.

I triggered two more Puppeteer jobs on central to see where it has been fixed.

Flags: needinfo?(rmaries) → needinfo?(aryx.bugmail)

Bug 1645196 has been backed out from autoland.

Flags: needinfo?(aryx.bugmail)

I triggered the puppeteer tests and they are actually still broken:

https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=739056dc4d11ec7953df7559309d5c958e136acd&searchStr=pup&selectedTaskRun=DW402OX9S7GqtGOPVAPDYg.0

So it hasn't been caused by that check-in, but maybe there was an infrastructure change for the Linux docker images on July 7th? Callek or Tom, do you know?

Flags: needinfo?(mozilla)
Flags: needinfo?(bugspam.Callek)
Assignee: mhentges → nobody

So the changeset right before was updating the way TaskCluster builds docker images. See bug 1626058.

Tom, did we maybe regress bug 1633101 with your change?

Flags: needinfo?(bugspam.Callek)

It's plausible that rebuilding the image updated some package that triggers the error... What happens if you take the changeset before bug 1633101, add an empty line to taskcluster/docker/ubuntu1804-test/Dockerfile, and push to try to run that puppeteer test?
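
For reference, a rough shell sketch of that experiment (the revision and try query are placeholders, not from this bug):

  # Hypothetical: rebuild the docker image from the changeset before the
  # suspect landing, then run the Puppeteer job on try.
  hg update -r PARENT_REV                                   # placeholder revision
  echo "" >> taskcluster/docker/ubuntu1804-test/Dockerfile  # no-op change forces an image rebuild
  hg commit -m "Force ubuntu1804-test docker image rebuild"
  ./mach try fuzzy --query "'source-test 'puppeteer"        # illustrative query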

(In reply to Mike Hommey [:glandium] from comment #17)

It's plausible that rebuilding the image updated some package that triggers the error... What happens if you take the changeset before bug 1633101, add an empty line to taskcluster/docker/ubuntu1804-test/Dockerfile, and push to try to run that puppeteer test?

I assume you mean bug 1626058 here. So I pushed a try build for the parent commit:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=aeb54d91310c8d75162f4cda3e55bfc3dc806b30

So the tests are passing with the parent changeset of bug 1626058. That means it's not related to some updated packages inside the docker container, but to how the docker image is getting built. Any further ideas for what to check? Otherwise I hope that Tom can have a look at it.

Flags: needinfo?(mh+mozilla)

Download and compare the docker images?

Flags: needinfo?(mh+mozilla)

So this is clearly caused by a slow startup of Firefox which is aborted after 15s:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=309851380&repo=try&lineNumber=158-381

Nearly all the time is spent between the following lines:

[task 2020-07-15T14:25:29.864Z] PID 198 | DEBUG: Adding blocker TelemetryController: shutting down for phase profile-before-change-telemetry
[task 2020-07-15T14:25:44.334Z] PID 198 | DEBUG: Adding blocker EnvironmentAddonBuilder for phase AddonManager: Waiting to start provider shutdown.

The EnvironmentAddonBuilder seems to get added by Telemetry via:
https://searchfox.org/mozilla-central/rev/82c04b9cad5b98bdf682bd477f2b1e3071b004ad/toolkit/components/telemetry/app/TelemetryEnvironment.jsm#544-548

I actually don't expect Telemetry to run for Puppeteer. And I see that it gets disabled via the preferences:
https://searchfox.org/mozilla-central/rev/82c04b9cad5b98bdf682bd477f2b1e3071b004ad/remote/test/puppeteer/src/node/Launcher.ts#570

So why do we still see Telemetry code running?

Also, are there any MOZ_LOG modules that could be helpful to nail this down?

Flags: needinfo?(chutten)

That pref disables upload. Telemetry still needs to exist so that the instrumentation code in Firefox runs without error.

The Environment specifically is a very heavyweight object that is built at startup on each launch from a variety of sources, and it is used not just by Telemetry but by things like the Crash Reporter as well. There's no clear signal on which we could decide to not build the Environment, because of how many consumers there are.

There are lots of Telemetry logs. You can enable them at a variety of levels using toolkit.telemetry.log.level ("Trace" will get you everything). If you like your logs to be in stdout as well as in the Browser Console, set toolkit.telemetry.log.dump to true (docs for these are over here).
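
For example, a hypothetical way to set both prefs for a test profile via user.js ($PROFILE stands in for the profile directory the job uses):

  # Sketch: enable verbose Telemetry logging, mirrored to stdout.
  cat >> "$PROFILE/user.js" <<'EOF'
  user_pref("toolkit.telemetry.log.level", "Trace");
  user_pref("toolkit.telemetry.log.dump", true);
  EOF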

Not sure what could cause it to take 15s. Getting the addon information is bounded in time by the provider; most of the rest is bounded by the speed of reading a variety of system files and services (to get things like hdd size, OS versions, GPU count and vendors, etc.). There have been some changes in the past year to improve the asynchronicity of some of the worst offenders, but script can always block on waiting for the full environment to be present (especially if there's a shutdown during startup).

Hopefully something in this helps

Flags: needinfo?(chutten)

Oh, I see. In that case I might actually want to modify the taskcluster configuration for this source-test job and start Firefox via Marionette to record a Gecko profile. That should be more helpful for investigating this slow startup.

I will report back when I have more details. Triggered the try build for now:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a333c789c0e5ba96495432cb1605f862eaeadcef

Here is the profile of the first start of Firefox: https://share.firefox.dev/3h5OAXB

As can be seen, we again spend all the time in gfxPlatformGtk::CreatePlatformFontList. So it looks identical to the font cache problem that I recently fixed in bug 1633101. The question now is why the font cache for the x86_64 platform is not created.

Regressed by: 1626058
No longer regressed by: 1645196

Something actually fixed it on mozilla-central for the first merge on July 12th:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&tier=2&searchStr=pup&fromchange=22f5f7e914449bde3bee21a322d67248cd5e0254&tochange=2c8bc998c107bd6a6182b5fc4740fe77fc07cfca

Which is equivalent to this range on autoland:

https://treeherder.mozilla.org/#/jobs?repo=autoland&tier=2&searchStr=ub18&tochange=2c8bc998c107bd6a6182b5fc4740fe77fc07cfca&fromchange=f629caea26ce083c0f525e5113f501e3bcd00a1e

During that time I cannot see a rebuild of the ubuntu1804-test docker image on either autoland or mozilla-central. So maybe some other landing fixed that? I cannot find any obvious landing in that range, so I will do some backfills, which hopefully help.

Also, try is working fine, as this push demonstrates:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f172ba3f2ce420c0eebb48fdafe6212d534f7e34

So the fix that made it actually work again was a rustfmt landing:

https://treeherder.mozilla.org/#/jobs?repo=autoland&fromchange=f629caea26ce083c0f525e5113f501e3bcd00a1e&searchStr=pup&tochange=b15e9df8111e9db9c90079d6f1809caa4ca11ec4

But with that changeset we also rebuilt the ubuntu1804-test docker image:

https://treeherder.mozilla.org/#/jobs?repo=autoland&fromchange=f629caea26ce083c0f525e5113f501e3bcd00a1e&searchStr=ub18&tochange=b15e9df8111e9db9c90079d6f1809caa4ca11ec4

We should really pin the versions of the packages we install via apt-get (similar to the Debian image, as Tom mentioned to me via Matrix) to make sure we don't install untested releases of packages, which can cause tests to fail. Especially because we seem to rebuild the docker images once in a while, failures like this can start all of a sudden.

Joel and Edwin, is that something we can do for the Ubuntu 1804 image?
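
To illustrate what pinning would look like in the image setup (the versions below are invented for the example; the follow-up comments explain why this turns out not to be workable on Ubuntu):

  # Sketch only: exact-version pinning, as done for the Debian image.
  apt-get update
  apt-get install -y \
      fontconfig=2.12.6-0ubuntu2 \
      libfontconfig1=2.12.6-0ubuntu2
  # An alternative is apt_preferences(5): pin entries under /etc/apt/preferences.d/.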

Marking this bug as WFM and moving it into the startup component, which it is most likely related to.

Status: NEW → RESOLVED
Closed: 4 years ago
Component: Debugger → Startup and Profile System
Flags: needinfo?(jmaher)
Flags: needinfo?(egao)
Product: Remote Protocol → Toolkit
Resolution: --- → WORKSFORME

It seems logical to me to pin versions of packages in the 1804 docker image creation. Is there a list of changes to make?

Flags: needinfo?(jmaher)

That's the entire problem with using Ubuntu: we can't pin packages.

:glandium is correct. Bug 1607713 is a similar issue where we have to wait for upstream to fix the issue.

Flags: needinfo?(egao)

What a bummer. Yes, let's reopen, given that it doesn't make sense to file new bugs over and over again whenever a docker image update happens that might have broken it again. :/

The docker image update which caused the failure again is from Aug 6th:

https://treeherder.mozilla.org/#/jobs?repo=autoland&searchStr=docker%2Cub18&fromchange=6e35e01646d7&tochange=af63ceb25422&selectedTaskRun=TwlLnFw7Rx20jn3FItJTBw.0

The previous docker image update from Aug 3rd didn't cause it:

https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=359f9a3acc75b37125b90e69e93c8318c33b4ee3&searchStr=ub18&selectedTaskRun=J17VQzhHT2O1pRNJAvsuBw.0

Comparing the log files for the fontconfig packages doesn't show any difference in when the font cache is getting re-generated. Also, all related packages are installed with exactly the same versions.

It might be worthwhile to check if explicitly adding fc-cache -fv after the package installation steps would help here.
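
In shell terms the check would be a one-liner after the apt-get steps (a sketch; whether it helps is exactly what would need testing):

  # Sketch: force a font cache rebuild right after package installation.
  # -f re-scans even if caches look up to date, -v prints what gets cached.
  fc-cache -fv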

Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Blocks: 1628589
Flags: needinfo?(hskupin)

So I pushed another try build with the startup profiler enabled, just to check what the startups look like. Here are the results:

https://share.firefox.dev/3iYdMjM
https://share.firefox.dev/2YfkOZ8

I'm not sure about the first one; there Firefox started quickly enough. But the second profile is especially interesting because it again shows the time spent in font cache generation.

Flags: needinfo?(mozilla)

Just to check if it's really related to the font cache generation, I pushed another try build where fc-cache -rv is explicitly called before the actual mach command. And as you can see, it all works fine:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=f9ed2c61944c88bb00a09a6fda6663bb98eabd78&searchStr=pup

We no longer run into a timeout that causes all tests to fail.

As such I will check if explicitly running this command at the end of ubuntu1804-test-system-setup.sh will make it work.

Status: REOPENED → NEW
Component: Startup and Profile System → Task Configuration
Product: Toolkit → Firefox Build System

I think I now know what the problem is. The system setup script runs as root, and so does the installation of the various apt packages. When fontconfig gets installed, the script runs fc-cache as the root user. Thereby the cache gets stored under /root/.cache/fontconfig, which is all cleaned up at the end of the script anyway.

Now, when we run the various test jobs in CI, all the commands run as the worker user. For that user there is clearly no font cache available, so the first start of Firefox triggers its creation.

Given that we don't want to create the cache for each test job, I would say the best place to create it would be at the end of the setup script.
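
Based on that diagnosis, a sketch of what the end of the setup script could do, assuming it runs as root and the worker user's home is /builds/worker:

  # Sketch: generate the font cache as the worker user so it lands in that
  # user's cache directory instead of /root's (which is cleaned up anyway).
  su worker -c 'HOME=/builds/worker fc-cache -fv'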

Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Pushed by hskupin@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/8b1447160ba4 Force font cache generation for ubuntu1804-test docker image. r=jmaher
Status: ASSIGNED → RESOLVED
Closed: 4 years ago4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 81 Branch

That's sad. Can we back out the patch please? I will have to investigate why it doesn't work when we run this step in the docker setup script.

Flags: needinfo?(hskupin) → needinfo?(sheriffs)

Backout done as requested.

Flags: needinfo?(sheriffs)
Backout by malexandru@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/199a351dbc20 Backed out changeset 8b1447160ba4 as requested by whimboo.

Can you please also back this out from beta (81)? You can do it as part of the next uplifts. There is no hurry.

Status: RESOLVED → REOPENED
Flags: needinfo?(malexandru)
Resolution: FIXED → ---
Target Milestone: 81 Branch → ---

Added the checkin-needed tag so whoever does the next round of uplifts sees it.

Flags: needinfo?(malexandru)
Whiteboard: [checkin-needed-beta]

The problem here is that the /home folder is not part of the docker image. As such the generated font cache will not be present and needs to be re-generated for each and every test job on Ubuntu 18.04. Maybe that folder is not present because, since bug 1626058 landed, we no longer create the worker user:

https://hg.mozilla.org/ci/taskgraph/rev/004bada4f170#l7.65

Or do I have to check something else besides ubuntu1804-test-system-setup.sh?

Mike, given that you reviewed Tom's patch, do you know why we removed the creation of the worker user?

Flags: needinfo?(mh+mozilla)

Oh, it looks like we still create the worker user:
https://searchfox.org/mozilla-central/source/taskcluster/docker/ubuntu1804-test/Dockerfile#6

But the home folder is actually located at /builds/worker. I will have to check the docker image again.

So the generated font cache is actually stored under /builds/worker/.fontcache instead of /builds/worker/.cache/fontconfig. Could this cause problems for some applications, so that it is not detected? Or should we also run the font cache generation as root?
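
A generic way to check which cache directories fontconfig is configured to use (standard fontconfig tooling, not from this bug):

  # <cachedir> entries in the fontconfig configuration decide where caches go;
  # the XDG default is ~/.cache/fontconfig.
  grep -h '<cachedir' /etc/fonts/fonts.conf /etc/fonts/conf.d/*.conf 2>/dev/null
  fc-cache -v 2>&1 | grep -i 'cache'   # a verbose run reports the cache dirs it touches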

Status: REOPENED → ASSIGNED
Priority: P5 → P1
Flags: needinfo?(mh+mozilla)

So since a couple of days ago this problem is gone again, and it's working fine for recent Nightly and CI builds on autoland.

For checking the differences here two profiles:

Profile for long startup: https://share.firefox.dev/2YfkOZ8
Profile for normal startup: https://share.firefox.dev/2GfwWTW

Getting the font list takes 1s, compared to 16s when the cache presumably had to be created first because it was missing.

I have no idea what's going on or what could have fixed it; maybe it's only temporary and it will start failing again in a couple of days.

Depends on: 1662549
No longer depends on: 1662549

Please note that the user worker has no font cache available in test jobs. I saw that via the following job:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=21dd183c015ccb6a8b8e2fdfa613178c1a1fac9c&selectedTaskRun=Ycr0W4iUQ8qBcsQ7dZf8bA.0

Interesting that even without the font cache the start-up only takes 5s and not 15s.

Puppeteer jobs are still green. Joel, do you think we can enable the jobs for each commit on central and autoland now? See bug 1628589. I cannot say whether this error will come back or not.

Flags: needinfo?(jmaher)

Let's do it; we can clean up if problems persist and are frequent enough.

Flags: needinfo?(jmaher)

There's an r+ patch which didn't land and there has been no activity in this bug for 2 weeks.
:whimboo, could you have a look please?
For more information, please visit auto_nag documentation.

Flags: needinfo?(hskupin)
Attachment #9171194 - Attachment is obsolete: true

The tests were pretty stable over the last weeks and the attached patch doesn't seem to actually be necessary. As such, let's close this bug as INCOMPLETE for now and reopen it if the font cache generation problem appears again.

Assignee: hskupin → nobody
Status: ASSIGNED → RESOLVED
Closed: 4 years ago4 years ago
Flags: needinfo?(hskupin)
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---

Henrik, looks like I somehow triggered this with a nodejs update that rebuilt the images. Could you take another look please?

Flags: needinfo?(hskupin)

This is kinda sad to see, but also kinda scary in terms of when it starts failing and how often. As you can see, recently more pup jobs on autoland are passing. It gives the feeling that something is wrong with the test machines or the environment, and maybe not with the docker image?

https://treeherder.mozilla.org/#/jobs?repo=autoland&searchStr=pup&fromchange=d368b4f167322c1b8e809f13f94e807af6009b64

Joel, is there anyone you know who could help with that?

For the time being I would propose a workaround for the pup job. Instead of launching Firefox for the first time with Puppeteer, I will check if starting Firefox beforehand with --screenshot correctly warms up the font cache, so that Puppeteer no longer hits this long first startup time.

Flags: needinfo?(hskupin)

It looks like the workaround with --screenshot doesn't work, because Firefox acts up and hangs during shutdown:

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=318151647&repo=try&lineNumber=63-71

[task 2020-10-09T09:16:02.923Z] + /builds/worker/fetches/firefox/firefox --screenshot
[task 2020-10-09T09:16:02.963Z] *** You are running in headless mode.
[task 2020-10-09T09:17:05.926Z] [Parent 36, Main Thread] ###!!! ABORT: file /builds/worker/checkouts/gecko/dom/serviceworkers/ServiceWorkerShutdownBlocker.cpp:98
[task 2020-10-09T09:17:05.926Z] [Parent 36, Main Thread] ###!!! ABORT: file /builds/worker/checkouts/gecko/dom/serviceworkers/ServiceWorkerShutdownBlocker.cpp:98
[task 2020-10-09T09:17:05.927Z] ExceptionHandler::GenerateDump cloned child 162
[task 2020-10-09T09:17:05.927Z] ExceptionHandler::SendContinueSignalToChild sent continue signal to child
[task 2020-10-09T09:17:05.927Z] ExceptionHandler::WaitForContinueSignal waiting for continue signal...
[task 2020-10-09T09:17:05.994Z] Exiting due to channel error.

Jens, do you have any idea what this could be? I'm happy to trigger some more try builds if there are specific log types available that I could use here to get further details.

Flags: needinfo?(jstutte)

The ServiceWorkerManager's shutdown blocker registered here triggered for some reason. This probably indicates a hanging service worker on shutdown.

Is this reproducible? Then a pernosco session would be very appreciated in order to understand what is going on.

Flags: needinfo?(jstutte)

As long as the tests are perma-failing it should be possible to get a Pernosco session. I never did that before and had to look for some documentation first. But here is the try build with the --pernosco flag set. I hope that this is all I have to do. Let's see.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=eb0853db2aeee0b8f2bc3f4323e594dc4d1299ec

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #67)

As long as the tests are perma-failing it should be possible to get a Pernosco session. I never did that before and had to look for some documentation first. But here is the try build with the --pernosco flag set. I hope that this is all I have to do. Let's see.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=eb0853db2aeee0b8f2bc3f4323e594dc4d1299ec

It still doesn't work. Andrew, maybe the source-test jobs don't support Pernosco? Or is there anything else I have to set or make available as an artifact?

Flags: needinfo?(ahal)

Yes, the pernosco service scans pushes for jobs it supports. I suspect it doesn't support this task. It also only inherently works with Linux debug platforms.

Maybe you could ask khuey to get puppeteer added.

Flags: needinfo?(ahal)

Jens, could this help?

https://treeherder.mozilla.org/#/jobs?repo=try&revision=478c65cc24ea49cc412682f43bc9721c09560d96&selectedTaskRun=Gl3FEwvMSYu_IX5gxQ-hSQ.0

The thing is that Firefox quit immediately because I forgot to specify a URL to take a screenshot of. So I assume it's a race between initializing the service workers and running shutdown right away.

Flags: needinfo?(jstutte)

When doing screenshots I get the following for debug builds:
https://treeherder.mozilla.org/#/jobs?repo=try&selectedTaskRun=ZUVDerrbSgC9RkbxaBgRlw.0&revision=54d788a8cfa39ca2105e9ed46e4f7b022af164cc

There is no slow startup anymore. The visible failures are side-effects of the debug build slowness.

So here a new try build with a Firefox opt build:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=dd159af0490e2c1f7fbccbfa1486ab5cf3b60763

Assignee: nobody → hskupin
Status: REOPENED → ASSIGNED

Due to font cache issues with the ub18-test docker image in
TaskCluster the first startup of Firefox sometimes takes more
than 15s.

As a workaround for now just start Firefox once before
running the Puppeteer unit tests.

The patch as attached is only a workaround. As such the bug should stay open.

The real fix for that should be done in the Mocha config, but I would like to wait until we have upgraded to Puppeteer 5.3.x and have the keyboard interrupt handling working.
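
For clarity, the shape of the workaround as a sketch (binary path and npm script taken from the job log above; the exact landed invocation may differ):

  # Sketch: warm up the font cache by launching Firefox once in headless
  # screenshot mode, before the Puppeteer unit tests run under mocha's
  # 15s timeout.
  /builds/worker/fetches/firefox/firefox --screenshot 'data:text/html,warm-up' || true
  npm run unit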

Component: Task Configuration → Agent
Depends on: 1644826, 1670286
Keywords: leave-open
Product: Firefox Build System → Remote Protocol
Pushed by hskupin@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/694d23b789b1 [puppeteer] Warm-up font cache for Firefox when running Puppeteer unit tests. r=maja_zf

Why do you want to wait for the Puppeteer upgrade to change the mocha config?

Flags: needinfo?(hskupin)

The extra comment caused breakage when running the mozharness script. Landing the patch again with it removed.

Flags: needinfo?(hskupin)
Pushed by hskupin@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/2eee3f2f9456 [puppeteer] Warm-up font cache for Firefox when running Puppeteer unit tests. r=maja_zf

Dependencies need to be fixed first. So unassigning myself for now.

Assignee: hskupin → nobody
Status: ASSIGNED → NEW
Priority: P1 → P2

(In reply to Maja Frydrychowicz :maja_zf (UTC-4) (maja@mozilla.com) from comment #75)

Why do you want to wait for the Puppeteer upgrade to change the mocha config?

I haven't checked which changes shipped with Puppeteer 5.3.1 compared to our vendored version. If there are changes it might be worth syncing first. I also filed https://github.com/puppeteer/puppeteer/issues/6500 to maybe get this fixed by the Puppeteer folks directly, given that this is a more generic issue and not Firefox specific.

Hi, it looks like I am not needed here anymore.

Flags: needinfo?(jstutte)

(In reply to Jens Stutte [:jstutte] from comment #84)

Hi, it looks like I am not needed here anymore.

No, the needinfo is still valid, and I think we should really get the shutdown hang investigated. So please have another look at comment 70. Thanks.

Flags: needinfo?(jstutte)

Andrew, can you help here?

Flags: needinfo?(jstutte) → needinfo?(bugmail)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] (away 10/19 - 10/20) from comment #73)

The patch as attached is only a workaround. As such the bug should stay open.

The real fix for that should be done in the Mocha config, but I would like to wait until we have upgraded to Puppeteer 5.3.x and have the keyboard interrupt handling working.

Note that our mach command overrides the default mocha timeout of 25s down to 15s because otherwise the overall pup job is too long.

Specific tests that involve browser restarts set a longer timeout e.g. https://searchfox.org/mozilla-central/source/remote/test/puppeteer/test/launcher.spec.ts#38

(In reply to Maja Frydrychowicz :maja_zf (UTC-4) (maja@mozilla.com) from comment #87)

Specific tests that involve browser restarts set a longer timeout e.g. https://searchfox.org/mozilla-central/source/remote/test/puppeteer/test/launcher.spec.ts#38

Oh! Could this be the reason why the mocha default settings for the timeout aren't in effect here?

https://searchfox.org/mozilla-central/source/remote/test/puppeteer/mocha-config/puppeteer-unit-tests.js

Shouldn't we just remove the default from mach, and leave it all to mocha-config?
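
For context, this is how the timeout currently reaches mocha (invocation as shown in the job log above); mocha CLI flags override the config file, and per-test this.timeout() calls override both:

  # From the job log: mach appends --timeout 15000, which takes precedence
  # over mocha-config/puppeteer-unit-tests.js; per-test this.timeout() calls
  # (as in launcher.spec.ts) in turn override the CLI value.
  mocha --config mocha-config/puppeteer-unit-tests.js \
        --reporter ./json-mocha-reporter.js \
        --retries 0 --fullTrace --timeout 15000 --no-parallel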

Flags: needinfo?(mjzffr)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #88)

(In reply to Maja Frydrychowicz :maja_zf (UTC-4) (maja@mozilla.com) from comment #87)

Specific tests that involve browser restarts set a longer timeout e.g. https://searchfox.org/mozilla-central/source/remote/test/puppeteer/test/launcher.spec.ts#38

Oh! Could this be the reason why the mocha default settings for the timeout aren't in effect here?

Yep.

Shouldn't we just remove the default from mach, and leave it all to mocha-config?

We should only change the puppeteer mocha config if we think the change should/could be upstreamed.

Our choice of timeout is pretty specific to our own CI, so it makes more sense to keep it in mach or in a distinct mocha config file that we don't sync with upstream. It's best to keep the sync simple, of course.

Flags: needinfo?(mjzffr)

Note that you can see how mocha is called at the top of the job logs.

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #85)

(In reply to Jens Stutte [:jstutte] from comment #84)

Hi, it looks like I am not needed here anymore.

No, the needinfo is still valid, and I think we should really get the shutdown hang investigated. So please have another look at comment 70. Thanks.

Finally here is the Pernosco session for the shutdown hang/crash:
https://pernos.co/debug/Uq1o5n_s5wLvmwUfY60jHg/index.html

Andrew, I would appreciate it if you could have a look at it, given that it seems to be worker related.

Thank you VERY much :whimboo.

So the problem in the trace is that multiple ServiceWorker subsystems register on "profile-change-teardown", which seems fine at first glance, but it turns out that one of them uses an nsIAsyncShutdownClient while the other still uses an observer for its first shutdown phase (it does use an nsIAsyncShutdownClient for its finish-shutdown use case). This means that if the async shutdown service gets notified by the observer service before the ServiceWorkerManager is notified, we deadlock.

I will find or create an appropriate bug to block this one. (I think we have other bugs about this scenario where we struggled to figure out what was going on, and you have now provided the answer on a silver platter. Thank you again!)

Flags: needinfo?(bugmail)

Henrik, Andrew, is there something that still needs to be done for this issue?

Flags: needinfo?(hskupin)
Flags: needinfo?(bugmail)

(In reply to Pascal Chevrel:pascalc from comment #94)

Henrik, Andrew, is there something that still needs to be done for this issue?

This bug is blocked on bug 1670290. Without that fix there is a chance that we get perma failures for Puppeteer jobs after a docker image rebuild.

Flags: needinfo?(hskupin)
Flags: needinfo?(bugmail)
Summary: Perma [tier 2] TEST-UNEXPECTED-ERROR Browser specs "before all" hook in "Browser specs" (browser.spec.ts) → Intermittent TEST-UNEXPECTED-ERROR Browser specs "before all" hook in "Browser specs" (browser.spec.ts)
Severity: normal → S3
