Closed
Bug 1443437
Opened 7 years ago
Closed 7 years ago
Slow first startup of 32bit Firefox builds on gecko-t-linux-xlarge (linux 64bit) workers due to missing i386 font cache
Categories
(Firefox Build System :: Task Configuration, task, P1)
Firefox Build System
Task Configuration
Tracking
(firefox-esr52 unaffected, firefox59 wontfix, firefox60 fixed)
RESOLVED
FIXED
mozilla60
Tracking | Status | |
---|---|---|
firefox-esr52 | --- | unaffected |
firefox59 | --- | wontfix |
firefox60 | --- | fixed |
People
(Reporter: whimboo, Assigned: whimboo)
References
(Blocks 1 open bug)
Details
Attachments
(3 files)
On the linux32 platform I can see a slow first startup of Firefox for basically every web-platform related test job. Checking different logs it shows that we talk about roughly 30s here. The webdriver tests seem to be more prone to fail due to the 30s test duration setting, and given that this is the duration Firefox needs for a cold start, we always fail. See:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b2a3f6a13b2230d7ddc5083cc6b82844268d03f1
It means that opt builds need exactly the same time as debug builds to startup! This is not something I would ever expect.
Assignee | ||
Comment 1•7 years ago
|
||
Here some stats:
Debug build:
https://treeherder.mozilla.org/logviewer.html#?job_id=166082214&repo=try&lineNumber=1224-1263
> [task 2018-03-05T22:20:29.934Z] 22:20:29 INFO - PID 858 | 1520288429930 mozrunner::runner INFO Running command: "/builds/worker/workspace/build/application/firefox/firefox" "-marionette" "-profile" "/tmp/rust_mozprofile.oW8s8qxIYUEb"
> [task 2018-03-05T22:20:31.890Z] 22:20:31 INFO - PID 858 | [871, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80520012: file /builds/worker/workspace/build/src/extensions/cookie/nsPermissionManager.cpp, line 2909
> [task 2018-03-05T22:20:59.088Z] 22:20:59 INFO - PID 858 | ++DOCSHELL 0xe4384000 == 1 [pid = 871] [id = {a2b6bd56-b0d0-41e1-952c-2eed44999634}]
> [..]
> [task 2018-03-05T22:21:04.299Z] 22:21:04 INFO - PID 858 | 1520288464294 Marionette INFO Listening on port 2828
Here we get a log line for the permission manager 2s after we started Firefox. Then it takes 28s for the next log output.
Opt build:
https://treeherder.mozilla.org/logviewer.html#?job_id=166082499&repo=try&lineNumber=1211-1214
> [task 2018-03-05T22:20:45.650Z] 22:20:45 INFO - STDOUT: tests/web-platform/tests/webdriver/tests/actions/key.py::test_lone_keyup_sends_no_events
> [task 2018-03-05T22:20:45.650Z] 22:20:45 INFO - PID 853 | 1520288445644 mozrunner::runner INFO Running command: "/builds/worker/workspace/build/application/firefox/firefox" "-marionette" "-profile" "/tmp/rust_mozprofile.P3FNtNslzjsF"
> [task 2018-03-05T22:21:14.666Z] 22:21:14 INFO - PID 853 | GLib-GIO-Message: Using the 'memory' GSettings backend. Your settings will not be saved or shared with other applications.
> [task 2018-03-05T22:21:15.354Z] 22:21:15 INFO - TEST-UNEXPECTED-TIMEOUT | /webdriver/tests/actions/key.py | expected OK
> [task 2018-03-05T22:21:15.355Z] 22:21:15 INFO - TEST-INFO took 30040ms
With starting the test Firefox gets started immediately, but then it takes 30s for a first log line by Firefox.
I think that submitting a patched wdspec job which uses `--webdriver-args=-vv` for trace output might give us more details.
Assignee | ||
Comment 2•7 years ago
|
||
Here the log output from a job of an opt build and trace logs enabled:
https://treeherder.mozilla.org/logviewer.html#?job_id=166176579&repo=try&lineNumber=1208-1484
Timings of important steps:
> [task 2018-03-06T10:50:42.275Z] 10:50:42 INFO - PID 850 | 1520333442265 mozrunner::runner INFO Running command: "/builds/worker/workspace/build/application/firefox/firefox" "-marionette" "-profile" "/tmp/rust_mozprofile.I2QMtuUyKgTC"
> [..]
> [task 2018-03-06T10:50:42.752Z] 10:50:42 INFO - PID 850 | 1520333442742 Marionette DEBUG Received observer notification profile-after-change
> [..]
> [task 2018-03-06T10:51:07.448Z] 10:51:07 INFO - PID 850 | 1520333467442 Marionette DEBUG Received observer notification command-line-startup
> [..]
> [task 2018-03-06T10:51:08.611Z] 10:51:08 INFO - PID 850 | 1520333468600 Marionette DEBUG Received observer notification sessionstore-windows-restored
> [..]
> [task 2018-03-06T10:51:08.898Z] 10:51:08 INFO - PID 850 | 1520333468872 Marionette INFO Listening on port 2828
So the `profile-after-change` observer notification is received pretty quickly, but then there is a huge delay of 25 seconds until `command-line-startup` is received.
Karl, do you have an idea what could delay the observer notification that drastically on our gecko-t-linux-xlarge workers in automation? Are there any specific MOZ_LOG settings which could be used to get more internal details of the delay?
Flags: needinfo?(karlt)
Assignee | ||
Updated•7 years ago
|
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Priority: -- → P1
Assignee | ||
Comment 3•7 years ago
|
||
This problem is actually not only affecting web-platform tests but every test job as run in automation on Linux 32. For each of those I can see a massive delay until the `command-line-startup` observer notification is sent. Here some examples:
Mochitest: https://treeherder.mozilla.org/logviewer.html#?job_id=166172237&repo=autoland&lineNumber=2309-2317
[task 2018-03-06T10:10:29.646Z] 10:10:29 INFO - Application command: /builds/worker/workspace/build/application/firefox/firefox -marionette -foreground -profile /tmp/tmppoVstB.mozrunner
[task 2018-03-06T10:10:29.662Z] 10:10:29 INFO - runtests.py | Application pid: 1036
[task 2018-03-06T10:10:29.662Z] 10:10:29 INFO - TEST-INFO | started process GECKO(1036)
[task 2018-03-06T10:10:30.551Z] 10:10:30 INFO - GECKO(1036) | 1520331030544 Marionette DEBUG Received observer notification profile-after-change
[task 2018-03-06T10:10:58.119Z] 10:10:58 INFO - GECKO(1036) | 1520331058114 Marionette DEBUG Received observer notification command-line-startup
[task 2018-03-06T10:10:58.120Z] 10:10:58 INFO - GECKO(1036) | 1520331058114 Marionette DEBUG Received observer notification nsPref:changed
[task 2018-03-06T10:10:59.411Z] 10:10:59 INFO - GECKO(1036) | 1520331059400 Marionette DEBUG Received observer notification sessionstore-windows-restored
[task 2018-03-06T10:10:59.734Z] 10:10:59 INFO - GECKO(1036) | 1520331059720 Marionette DEBUG New connections are accepted
[task 2018-03-06T10:10:59.735Z] 10:10:59 INFO - GECKO(1036) | 1520331059720 Marionette INFO Listening on port 2828
Browser Chrome: https://treeherder.mozilla.org/logviewer.html#?job_id=166172222&repo=autoland&lineNumber=2012-2020
Ref test: https://treeherder.mozilla.org/logviewer.html#?job_id=166172264&repo=autoland&lineNumber=1772-1779
Component: web-platform-tests → General
Summary: Investigate first slow startup of Firefox opt builds for web-platform-tests on linux32 → Very slow cold startup time of Firefox opt builds on gecko-t-linux-xlarge (linux32) workers
Assignee | ||
Comment 4•7 years ago
|
||
Maybe it is somewhat related to bug 1414495 comment 24, which has only seen / investigated on Windows yet. To hopefully see where the hang actually occurs I pushed a try build with a couple of debug print statements:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d6759a7767cbe072da39dd115fe07fe3739417ef
Comment 5•7 years ago
|
||
Can we just run the profiler as part of this setup and look at the profiler output via http://perf-html.io/ ?
Flags: needinfo?(hskupin)
Comment 6•7 years ago
|
||
Most of the time between profile-after-change and command-line-startup is usually in the add-on manager startup. I think our test jobs sideload some test add-ons; that requires main thread I/O, and that triggers loading the blocklist. The machine would need to have some pretty slow (or busy) disks to have this take 30s, but it's possible.
Comment 7•7 years ago
|
||
wpt jobs don't use any addons; they just use marionette. Of course marionette could be doing something slow, or there could be some other test-specific problem.
Assignee | ||
Comment 8•7 years ago
|
||
I got a profile for startup: https://perfht.ml/2Hcn0G5
As it looks like we spent 99% of the time in TelemetryStartup for annotating the environment, and specifically building up the font list?
Flags: needinfo?(karlt)
Flags: needinfo?(hskupin)
Flags: needinfo?(gfritzsche)
Assignee | ||
Comment 9•7 years ago
|
||
Whereby the affected code seems to be located in gfxFcPlatformFontList::gfxFcPlatformFontList. So it looks more like a GFX as Telemetry issue.
Component: General → Graphics
Flags: needinfo?(gfritzsche) → needinfo?(milan)
Product: Testing → Core
Version: Version 3 → unspecified
Assignee | ||
Updated•7 years ago
|
Assignee: hskupin → nobody
Status: ASSIGNED → NEW
Assignee | ||
Updated•7 years ago
|
Priority: P1 → --
Assignee | ||
Comment 10•7 years ago
|
||
An observation I just made is that the hang is actually only existent for non Nightly/Release builds, so any CI build we run tests for.
Here some steps to reproduce it:
1) Request a one-click loaner based on https://treeherder.mozilla.org/#/jobs?repo=try&revision=1544ec8719409ce0e0f2ad2f58e4e4898f15154b&selectedJob=166224867
2) Download any CI build of Firefox like https://queue.taskcluster.net/v1/task/PEWCZZrfQN2bFlZYFrvJnA/runs/0/artifacts/public/build/target.tar.bz2 and extract it
3) Run Firefox
Please note that using a Nightly or Release doesn't show this problem! Also if you hit Ctrl-C during the hang, the next time Firefox is started it hangs again. But once Firefox has been completed the startup, the hang will not appear again for any other startup for that one-click loaner and a new one has to be requested.
Assignee | ||
Comment 11•7 years ago
|
||
I tried to use strace to log the file system access but sadly after a couple of seconds a dialog pops-up which tells that Firefox has unexpectedly closed and offers safe mode or to refresh nightly. Then it seems to hang in those SIGVTALRM lines.
open("/usr/lib/i386-linux-gnu/libgconf-2.so.4", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/lib/libgconf-2.so.4", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib/libgconf-2.so.4", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/dev/shm/org.chromium.6taIq6", O_RDWR|O_CREAT|O_EXCL, 0600) = 50
open("/dev/shm/org.chromium.0sTRM0", O_RDWR|O_CREAT|O_EXCL, 0600) = 52
open("/usr/share/themes/Ambiance/gtk-3.0/assets/button-focused.png", O_RDONLY|O_LARGEFILE) = 52
--- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TKILL, si_pid=3781, si_uid=0} ---
--- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TKILL, si_pid=3781, si_uid=0} ---
[..]
Looks like this is caused by a broken profile and maybe due to hitting ctrl-c during startup.
When I tried with another fresh profile it worked. So a new loaner is needed which currently takes forever. I will continue tomorrow.
Assignee | ||
Comment 12•7 years ago
|
||
Assignee | ||
Comment 13•7 years ago
|
||
Assignee | ||
Comment 14•7 years ago
|
||
So when I compare the two strace logs between a working Nightly build and the broken CI build already the second line shows a difference.
CI build:
> execve("firefox/firefox", ["firefox/firefox", "-profile", "/builds/worker/_profile"], [/* 47 vars */]) = 0
> strace: [ Process PID=1115 runs in 32 bit mode. ]
> brk(NULL) = 0x9d86000
Nightly build:
> execve("firefox/firefox", ["firefox/firefox", "-profile", "/builds/worker/_profile"], [/* 17 vars */]) = 0
> brk(NULL) = 0x979000
Both builds should be 32bit Linux binaries and are run on a 32bit system. Why does strace list "runs in 32 bit mode" for CI builds? What are we building different for CI builds compared to Nightly builds? Could this difference cause such a slow behavior of gfxFcPlatformFontList::gfxFcPlatformFontList as seen in the profile in comment 8?
https://perf-html.io/public/851a746970e990b5e2de74958ab7fef3703c4672/calltree/?hiddenThreads=&range=0.0747_30.0274&thread=0&threadOrder=0-2-3-4-5-6-7-8-9-1&v=2
The part of the strace log which is showing that slowness is between the 4th instance of "ZZZ NS_CreateServicesFromCategory starts loop" and the next "ZZZ NS_CreateServicesFromCategory done".
Jed or Mike does one of you know why we build differently and if this could cause the problem?
Component: Graphics → General
Flags: needinfo?(milan)
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(jld)
Product: Core → Firefox
Assignee | ||
Comment 15•7 years ago
|
||
Here some updated steps in how to reproduce the 30s hang during startup:
1) Grab a one-click loaner based on https://treeherder.mozilla.org/#/jobs?repo=try&revision=1544ec8719409ce0e0f2ad2f58e4e4898f15154b&selectedJob=166224867
2) Exit the wizard
3) Run the following commands:
wget https://queue.taskcluster.net/v1/task/PEWCZZrfQN2bFlZYFrvJnA/runs/0/artifacts/public/build/target.tar.bz2
tar xfv target.tar.bz2
apt install strace
mkdir _profile
4) Run Firefox:
firefox/firefox -profile /builds/worker/_profile
or
strace firefox/firefox -profile /builds/worker/_profile &>strace.log
Assignee | ||
Comment 16•7 years ago
|
||
I just noticed that when requesting a Linux opt one-click loaner I get the following:
# uname -a
Linux 47486cdcf0c8 4.4.0-1014-aws #14taskcluster-Ubuntu SMP Mon Feb 19 11:31:26 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Means we actually run a 64bit system but not a 32bit system. The above CI build is a 32bit binary, whereby the Nightly/Release builds I downloaded via mozdownload were 64bit binaries because I didn't force the 32bit binary.
Actually grabbing a 32bit nightly I can reproduce the same hang, and that indicates that we have those delays when running a 32bit Firefox binary on a 64bit Linux system.
Assignee | ||
Updated•7 years ago
|
Summary: Very slow cold startup time of Firefox opt builds on gecko-t-linux-xlarge (linux32) workers → Very slow cold startup time of Firefox opt 32bit builds on gecko-t-linux-xlarge (linux 64bit) workers
Assignee | ||
Comment 17•7 years ago
|
||
It actually doesn't seem to be a regression. I went back until Firefox 17.0 which also has this slow first start-up time. I wonder if it's a problem with Firefox, or if there is just some package/setting missing in the docker image for desktop1604-test.
Assignee | ||
Comment 18•7 years ago
|
||
I have the feeling that the problem is related to a missing font cache for the 32bit binaries. In the docker image we add the i386 architecture but newer seem to run fc-cache to build up the cache. As such it will be created the first time a 32bit application gets started.
If that is the case it will be a taskcluster related issue, and I will have a fix soon.
Assignee | ||
Comment 19•7 years ago
|
||
Ok, this is clearly a docker image problem. The font cache for i386 binaries is missing and during the first start of Firefox this cache will be created under `/var/cache/fontconfig/`. Removing this folder, I can always trigger the slowness.
I had a look into the fc-cache man page and as it looks like there is no way to force a rebuild of the font cache which will also include the cache for 32bit binaries because fc-cache itself is a 64bit binary.
I wonder how we can pre-build the cache so that we do not suffer from this 30s delay in each and every test job, which will cause a fair amount of intermittent failures at least for wdspec tests for the first test.
Component: General → Docker Images
Product: Firefox → Taskcluster
Summary: Very slow cold startup time of Firefox opt 32bit builds on gecko-t-linux-xlarge (linux 64bit) workers → Slow first startup of 32bit Firefox builds on gecko-t-linux-xlarge (linux 64bit) workers due to missing i386 font cache
Comment 20•7 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #19)
> I had a look into the fc-cache man page and as it looks like there is no way
> to force a rebuild of the font cache which will also include the cache for
> 32bit binaries because fc-cache itself is a 64bit binary.
>
> I wonder how we can pre-build the cache so that we do not suffer from this
> 30s delay in each and every test job, which will cause a fair amount of
> intermittent failures at least for wdspec tests for the first test.
Write a dummy bit of code that forces the same initialization to happen, compile as 32-bit and run prior to running Firefox? (Maybe this is a dumb idea for some reason...)
Assignee | ||
Comment 21•7 years ago
|
||
An option could be to install `firefox:i386` via apt, which will remove the x86_64 package. But running the system firefox binary afterward will populate the font cache for 32bit binaries, and keeps the cache for 64bit binaries in-tact.
With `firefox --screenshot` we could run Firefox and let it close immediately once the cache has been built. A start of the 32bit binary under test by test jobs later would be fast because the cache exists. But I'm not that happy by using a specific feature of Firefox here.
Maybe we can check when startup is done and kill the process?
Assignee | ||
Comment 22•7 years ago
|
||
(In reply to :Gijs (under the weather; responses will be slow) from comment #20)
> Write a dummy bit of code that forces the same initialization to happen,
> compile as 32-bit and run prior to running Firefox? (Maybe this is a dumb
> idea for some reason...)
That doesn't work. Each application builds its own cache. At least we would have to figure out how to spoof a Firefox binary.
Comment 23•7 years ago
|
||
Why do you think each application has it's own cache? It doesn't look like that on my machine, and if it were true I'm not sure what the point of fc-cache would be.
AIUI fc-cache basically is a wrapper program that calls the same cache-generating endpoint as a real application and then exits. So the question is how to run the 32bit variant of that at some appropriate time (e.g. in the dockerfile so that the initial cache is baked into the image).
Assignee | ||
Comment 24•7 years ago
|
||
Because when I start a 32-bit `nautilus` binary it also takes 30s to start while building its font cache. A second start is fast again. And this always happens whether I build the font cache for Firefox before or not.
We could install `fontconfig:i386` which would generate the 32bit cache. But that removes the x86_64 version, and I don't know which implications that has.
Assignee | ||
Comment 25•7 years ago
|
||
I triggered two try builds:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8b2dc9ea59effd69564beeadd55eb3abe21907aa (firefox:i386)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d0aea53f7e798164a11331ec815e40692a671e99 (fontconfig:i386)
Assignee | ||
Comment 26•7 years ago
|
||
From a conversation with Ted on IRC:
ted> my alternate proposal would be to just fetch the i386 deb and manually extract it somewhere else and run the binary from there
<whimboo> ted: correct. both exists next to each other
<whimboo> if we don't need fc-cache anymore it should be all fine
<whimboo> the deb way is more complicated
<ted> whimboo: it wouldn't be terrible, apparently you can `apt-get download fontcache:i386` which downloads the .deb to the current directory
<ted> and `dpkg-deb --extract whatever.deb path` extracts its contents
<whimboo> oh
<ted> but if installing both packages in series works and doesn't break anything else i say go for it
<whimboo> k, let the reviewer decide :)
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(jld)
Priority: -- → P1
Assignee | ||
Comment 27•7 years ago
|
||
Both patches work, so I would tend to the replacement of fontconfig with fontconfig:i386. If there is a concern with this we could still run what Ted suggested. I checked that manually on a loaner and it works too.
Linux 32:
https://treeherder.mozilla.org/logviewer.html#?job_id=166464003&repo=try&lineNumber=1235-1237
> [task 2018-03-07T12:09:42.767Z] 12:09:42 INFO - PID 852 | 1520424582764 mozrunner::runner INFO Running command: "/builds/worker/workspace/build/application/firefox/firefox" "-marionette" "-profile" "/tmp/rust_mozprofile.NU6UpL7fxtoR"
[..]
> [task 2018-03-07T12:09:44.868Z] 12:09:44 INFO - PID 852 | 1520424584858 Marionette INFO Listening on port 2828
Linux 64:
https://treeherder.mozilla.org/logviewer.html#?job_id=166464004&repo=try&lineNumber=1248-1273
> [task 2018-03-07T12:08:12.562Z] 12:08:12 INFO - PID 856 | 1520424492556 mozrunner::runner INFO Running command: "/builds/worker/workspace/build/application/firefox/firefox" "-marionette" "-profile" "/tmp/rust_mozprofile.gHEClkRqg9tU"
[..]
> [task 2018-03-07T12:08:14.722Z] 12:08:14 INFO - PID 856 | 1520424494718 Marionette INFO Listening on port 2828
For both architectures we decreased the first startup time from ~30s to 2s.
Comment hidden (mozreview-request) |
Comment 29•7 years ago
|
||
mozreview-review |
Comment on attachment 8956807 [details]
Bug 1443437 - Pre-build font cache for Firefox i386 binaries on x86_64 systems.
https://reviewboard.mozilla.org/r/225782/#review231608
thanks for figuring this out :whimboo! In reading over this bug, it seems like this is the right solution and your try pushes look promising :)
Attachment #8956807 -
Flags: review?(jmaher) → review+
Comment 30•7 years ago
|
||
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b8df7526c0f6
Pre-build font cache for Firefox i386 binaries on x86_64 systems. r=jmaher
Assignee | ||
Comment 31•7 years ago
|
||
We also see a decrease of the first startup time for debug builds from ˜35s to 8s across all test suites:
> [task 2018-03-07T15:53:27.952Z] 15:53:27 INFO - PID 856 | 1520438007948 mozrunner::runner INFO Running command: "/builds/worker/workspace/build/application/firefox/firefox" "-marionette" "-profile" "/tmp/rust_mozprofile.f9DqKSQQjHjS"
> [..]
> [task 2018-03-07T15:53:35.420Z] 15:53:35 INFO - PID 856 | 1520438015414 Marionette INFO Listening on port 2828
Updated•7 years ago
|
Component: Docker Images → Task Configuration
Product: Taskcluster → Firefox Build System
Comment 32•7 years ago
|
||
bugherder |
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
status-firefox60:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
Assignee | ||
Updated•7 years ago
|
status-firefox59:
--- → fix-optional
status-firefox-esr52:
--- → unaffected
Comment 33•7 years ago
|
||
Nice detective work on this, whimboo!
Updated•6 years ago
|
Assignee | ||
Updated•5 years ago
|
You need to log in
before you can comment on or make changes to this bug.
Description
•