Closed Bug 1271035 Opened 9 years ago Closed 9 years ago

Reduce I/O during reftests by disabling Places

Categories

(Testing :: Reftest, defect)

Version: 3
Type: defect
Priority: Not set
Severity: normal

Tracking

(firefox49 fixed)

Status: RESOLVED FIXED
Target Milestone: mozilla49
Tracking Status:
firefox49 --- fixed

People

(Reporter: gps, Assigned: gps)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Patch forthcoming.
It was a cold Friday night in San Francisco. Earlier in the day, I informed Chris AtLee that I was going to start focusing on improving the efficiency of Firefox automation and asked him where the biggest capacity issues were. He said "we're hurting most on Windows tests." As I was casually drinking a barleywine (note to reader: barleywines are serious beers - there's nothing casual about them), I found myself tediously clicking through Treeherder looking at logs for Windows jobs, looking for patterns and other oddities. As I was clicking through, something stood out to me: the sheer number of reftest jobs.

I recalled a random project I started a few years ago. Its aim was to analyze buildbot job metadata so we could better understand where time was spent in automation. I had mostly written off the side project as a failure and a near complete waste of my time. Not even a stray random thought of this project had entered my mind in the past year. But clicking through Treeherder after a few glasses of barleywine somehow reminded me of one of the few useful findings of that project: reftest jobs consumed a seemingly disproportionate amount of machine time - something like 35 or 40% IIRC of the time spent on all jobs.

Now, this finding was from several years ago and almost certainly no longer relevant. But, again, I had a few glasses of barleywine in me and was bothered by the number of reftest jobs and their duration, so I thought "hey, why don't I run reftests and see why they take so long."

So I built Firefox on Windows - the platform Chris AtLee said we're "hurting most on." I decided to start my very casual profiling session by recording a `mach reftest` run using Sysinternals Process Monitor. To my surprise, it yielded a very obvious and unexpected result: the Places SQLite database was incurring a lot of I/O. On my i7-6700K Windows 10 desktop with a high-performance SSD, `mach reftest` yielded the following:

File Time  #Events  #Reads   #Writes  RBytes          WBytes          Path
198s       980,872  243,270  669,231  7,971,471,360   20,667,084,080  places.sqlite-wal
165s       645,853  222,407  367,775  7,287,701,820   14,071,529,472  places.sqlite
2s         377,121  1        0        32,768          0               places.sqlite-shm

The Places SQLite database accounts for 2,003,846 of the total 3,547,527 system calls (56.49%) recorded by procmon during `mach reftest` execution. It also represents a staggering 49,997,786,732 of the 50,307,660,589 (99.38%) bytes of I/O recorded! Yes, that's 50 GB.

I reckon the Places database accumulates so much I/O load during reftests because the reftest suite essentially loads thousands of pages as quickly as possible. This effectively performs a stress test against the Places history recording service. As effective as reftests are at stress-testing Places, it adds no value to reftests, because reftests test layout features, not the performance of history recording. So these 2M system calls and 50 GB of I/O are pure overhead.

This commit disables Places when executing reftests, eliminating that overhead. After this commit, `mach reftest` has significantly reduced interaction with the Places SQLite database:

File Time  #Events  #Reads  #Writes  RBytes     WBytes     Path
0.09s      502      138     302      4,521,984  8,961,528  places.sqlite-wal
0.07s      254      20      140      524,604    8,126,464  places.sqlite
0.01s      3,289    1       0        32,768     0          places.sqlite-shm

Of the 948,033 system calls recorded with this change (26.7% of the original), 691,322 were related to I/O. The Places SQLite database only consumed ~22 MB of I/O, <0.01% of the original.
It's worth noting that ~half of the remaining I/O system calls are related to reftest.log, which now accounts for the largest share of write I/O (only ~53 MB, however). reftest.log appears to be using unbuffered writes and requires an excessive number of system calls for writing. But that's for another bug and commit.

In terms of wall time, the drastic I/O reduction during `mach reftest` appears to have minimal impact on my machine: maybe 30s shaved from a ~900s execution, or ~3%. But my machine with its modern SSD doesn't struggle with I/O. In automation, it is a different story.

I pushed this change to Try along with the base revision and triggered 4 runs of most reftest jobs. The runtime improvements in automation are impressive. Here are the fastest reported times for various jobs:

Job                        Before  After  Delta
Linux Opt R1               31m     34m    +3m
Linux Opt R2               43m     35m    -8m
Linux Opt Ru1              40m     34m    -6m
Linux Opt Ru2              43m     37m    -6m
Linux Opt R E10s           89m     72m    -17m
Linux Debug R1             52m     40m    -12m
Linux Debug R2             49m     42m    -7m
Linux Debug R3             60m     51m    -9m
Linux Debug R4             42m     37m    -5m
Linux Debug R1 E10s        84m     72m    -12m
Linux Debug R2 E10s        97m     85m    -12m
Linux64 Opt R1             35m     24m    -11m
Linux64 Opt R2             37m     26m    -11m
Linux64 Opt Ru1            32m     29m    -3m
Linux64 Opt Ru2            37m     26m    -12m
Linux64 Opt TC R1          12m     10m    -2m
Linux64 Opt TC R2          10m     7m     -3m
Linux64 Opt TC R3          11m     9m     -2m
Linux64 Opt TC R4          11m     9m     -2m
Linux64 Opt TC R5          13m     11m    -2m
Linux64 Opt TC R6          11m     9m     -2m
Linux64 Opt TC R7          9m      8m     -1m
Linux64 Opt TC R8          11m     10m    -1m
Linux64 Opt TC Ru1         30m     25m    -5m
Linux64 Opt TC Ru2         36m     27m    -11m
OS X 10.10 Opt             31m     27m    -4m
OS X 10.10 Opt E10s        26m     25m    -1m
OS X 10.10 Debug           68m     55m    -13m
Win7 Opt R                 30m     28m    -2m
Win7 Opt Ru                28m     26m    -2m
Win7 Opt R E10S            29m     27m    -2m
Win7 Debug R               85m     76m    -9m
Win7 Debug R E10S          75m     65m    -10m
Win8 x64 Opt R             29m     26m    -3m
Win8 x64 Opt Ru            27m     25m    -2m
Win8 x64 Debug R           90m     71m    -19m
Android 4.3 API15 Opt R1   89m     71m    -18m
Android 4.3 API15 Opt R2   78m     64m    -14m
Android 4.3 API15 Opt R3   75m     64m    -11m
Android 4.3 API15 Opt R4   74m     68m    -6m
Android 4.3 API15 Opt R5   75m     69m    -6m
Android 4.3 API15 Opt R6   91m     86m    -5m
Android 4.3 API15 Opt R7   87m     66m    -21m
Android 4.3 API15 Opt R8   87m     82m    -5m
Android 4.3 API15 Opt R9   80m     66m    -14m
Android 4.3 API15 Opt R10  80m     67m    -13m
Android 4.3 API15 Opt R11  73m     66m    -7m
Android 4.3 API15 Opt R12  105m    91m    -14m
Android 4.3 API15 Opt R13  72m     59m    -13m
Android 4.3 API15 Opt R14  82m     61m    -21m
Android 4.3 API15 Opt R15  73m     62m    -11m
Android 4.3 API15 Opt R16  79m     78m    -1m

The savings total 6+ *hours*, or ~15%, when running all reftests. I'd say this isn't bad for a one-line code change!

Review commit: https://reviewboard.mozilla.org/r/51267/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/51267/
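The pref involved here is places.history.enabled, the same pref discussed for Fennec later in this bug. As a minimal sketch of the shape of the change - assuming a mozprofile-based harness like the reftest runner's; the scaffolding below is illustrative, not the landed patch:

```python
# Minimal sketch, assuming a mozprofile-based harness. The real change
# lives in the reftest harness's profile setup (see the review link
# above); treat this scaffolding as illustrative.
from mozprofile import Profile

prefs = {
    # With history recording disabled, Places stops journaling every
    # page load to places.sqlite during the run.
    "places.history.enabled": False,
}

profile = Profile(preferences=prefs)
print(profile.profile)  # filesystem path of the generated test profile
```

On the reftest.log observation: a buffered writer coalesces many log lines into a single write(2) syscall, which is why unbuffered logging is so syscall-heavy. A toy illustration, not the harness's actual logging code:

```python
# Toy illustration of buffered vs. unbuffered writes; not the reftest
# harness's actual logging code.
lines = [("REFTEST TEST-PASS | %d.html\n" % i).encode() for i in range(10000)]

# buffering=0: every write() call is its own syscall (~10,000 here).
with open("unbuffered.log", "wb", buffering=0) as f:
    for line in lines:
        f.write(line)

# A 1 MiB userspace buffer flushes to the kernel only when full or on
# close, collapsing those writes into a handful of syscalls.
with open("buffered.log", "wb", buffering=1024 * 1024) as f:
    for line in lines:
        f.write(line)
```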
Attachment #8749983 - Flags: review?(dholbert)
(In reply to Gregory Szorc [:gps] from comment #1)
> As I was casually drinking a barleywine (note to reader: barleywines
> are serious beers - there's nothing casual about them)

What barleywine were you drinking, and how did you like it?
Flags: needinfo?(gps)
Attachment #8749983 - Flags: review?(dholbert) → review+
Comment on attachment 8749983 [details]
MozReview Request: Bug 1271035 - Disable Places during reftests, preventing 50 GB of I/O; r?dholbert

https://reviewboard.mozilla.org/r/51267/#review47931

Very nice. r=me
(In reply to Justin Dolske [:Dolske] from comment #4)
> What barleywine were you drinking, and how did you like it?

It was a Sierra Nevada Barrel Aged Bigfoot. Solid beer and definitely a memorable experience. Would do again.
Flags: needinfo?(gps)
margaret: mfinkle: I observe from gps's work that Android reftest times dropped significantly. This surprised me, but I think I now understand why: Fennec uses the same pref to disable maintaining visited links and writes to its Android SQLite DBs as Desktop. Could one of you confirm?

Future thought: if we separated the Android history flag from the Desktop one, would we opt out of additional toolkit/ Places cruft? Or is that already all removed from Fennec, and the flag name is just ... ancient history?
Flags: needinfo?(mark.finkle)
Flags: needinfo?(margaret.leibovic)
Reading the commit message the day after, it's obviously a by-product of drinking a high-alcohol beer and staying up late to finish :) But some things deserve to be kept for posterity, so I didn't change it before landing.

For the record, I'm still somewhat shocked that a random side project I did a few years ago to explore and analyze buildbot data produced a takeaway that I randomly recalled last night, which contributed to me looking at reftests first, and that by some miracle there was low-hanging fruit that resulted in double-digit percentage wins mostly across the board. It feels like I won the lottery. I would be shocked if I manage to find such a low-effort, high-impact improvement again.
(In reply to Nick Alexander :nalexander from comment #8)
> Future thought: if we separated the Android history flag from the Desktop
> one, would we opt out of additional toolkit/ Places cruft?

I think Android doesn't define MOZ_PLACES and thus it should have no Places code at all. It's likely the pref is there to simplify syncing prefs across profiles with Sync.
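For reference, MOZ_PLACES is the kind of flag that moz.build files gate on. A hypothetical sketch of the pattern - moz.build files are Python, but CONFIG and DIRS are injected by the mozbuild sandbox, and the directory name here is illustrative:

```python
# Hypothetical moz.build-style sketch; CONFIG and DIRS are provided by
# the mozbuild sandbox, so this is not standalone Python. A product
# that never sets MOZ_PLACES compiles none of the gated code.
if CONFIG["MOZ_PLACES"]:
    DIRS += ["places"]
```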
(In reply to Nick Alexander :nalexander from comment #8)
> Fennec uses the same pref to disable maintaining visited links and writes to
> its Android SQLite DBs as Desktop. Could one of you confirm?

Yes. I did this in bug 1093886, to allow us to turn off history in web apps for bug 851854. This is a nice side effect.

> Future thought: if we separated the Android history flag from the Desktop
> one, would we opt out of additional toolkit/ Places cruft? Or is that
> already all removed from Fennec, and the flag name is just ... ancient
> history?

We don't ship Places, but we obey the same pref.
Flags: needinfo?(mark.finkle)
Flags: needinfo?(margaret.leibovic)
Blocks: 1271068
(In reply to Marco Bonardo [::mak] from comment #10)
> It's likely the pref is there to simplify syncing prefs across
> profiles with Sync.

Thread-drift, but: we don't sync prefs across applications (Thunderbird <-> Firefox, for example), and we don't implement prefs sync at all on either iOS or Android.
(In reply to Nick Alexander :nalexander from comment #8)
> margaret: mfinkle: I observe from gps's work that Android reftest times
> dropped significantly. This surprised me, but I think I now understand why:
> Fennec uses the same pref to disable maintaining visited links and writes to
> its Android SQLite DBs as Desktop. Could one of you confirm?

Yes, we added support for the pref a little while ago:
http://mxr.mozilla.org/mozilla-central/source/mobile/android/components/build/nsAndroidHistory.cpp#129

> Future thought: if we separated the Android history flag from the Desktop
> one, would we opt out of additional toolkit/ Places cruft? Or is that
> already all removed from Fennec, and the flag name is just ... ancient
> history?

I don't think we are opting into Places cruft as it is. Android uses an entirely different history storage system. It just so happens that we used the same preference to control it.

I think the moral of this long, long, long commit message story is: I/O while loading web pages slows everything down.
Any chance we'd get a similar win from setting this pref during mochitests?
Flags: needinfo?(gps)
(In reply to Nathan Froyd [:froydnj] from comment #14)
> Any chance we'd get a similar win from setting this pref during mochitests?

We'd have to be more careful about that, because some mochitests explicitly test history-related things (like visited coloring of links).
(In reply to Nathan Froyd [:froydnj] from comment #14)
> Any chance we'd get a similar win from setting this pref during mochitests?

Procmon says yes. Also, the Necko cache seems to be generating a fair bit of I/O. I'm not sure what is safe to disable where in mochitest land. Obviously some mochitests need to test features backed by I/O, and disabling too many browser features may undermine the value of some tests.

We really need the system resource metrics obtained by mozharness surfaced better. e.g. from http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-win32/1462612660/mozilla-inbound_win7_ix_test-mochitest-1-bm109-tests1-windows-build32.txt.gz:

04:16:08 INFO - Total resource usage - Wall time: 981s; CPU: 10.0%; Read bytes: 96571392; Write bytes: 1094087168; Read time: 2205010; Write time: 46124710
04:16:08 INFO - install - Wall time: 3s; CPU: 11.0%; Read bytes: 0; Write bytes: 33091072; Read time: 0; Write time: 326250
04:16:08 INFO - run-tests - Wall time: 979s; CPU: 10.0%; Read bytes: 96571392; Write bytes: 1035534336; Read time: 2205010; Write time: 45589980

That's 1 GB of write I/O during a mochitest-1 job. Seems fishy to me. And I /think/ this is actual disk I/O, not what the processes are generating (during many workloads, large percentages of process I/O are serviced by the page cache and wouldn't be reported here).
Bug 859573 added the resource monitoring to mozharness. It is using https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozsystemmonitor/mozsystemmonitor/resourcemonitor.py, which uses psutil for gathering *system*-level metrics. Those are described at http://pythonhosted.org/psutil/#system-related-functions.

What immediately stands out to me is that "write time" value of 45,589,980. That's supposed to be in milliseconds. Sounds like a bug in data collection or documentation to me. We're also likely running an older psutil in automation.

I think conditions are ripe for a lot of follow-up bugs :)
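For context, a hedged sketch of the system-wide psutil counters resourcemonitor.py builds on; disk_io_counters() is a real psutil call, and per the psutil docs its read_time/write_time fields are in milliseconds:

```python
# Hedged sketch of the psutil system-wide counters underlying
# resourcemonitor.py; per the psutil docs, write_time is in
# milliseconds.
import psutil

io = psutil.disk_io_counters()
print("write bytes:", io.write_bytes)
print("write time (ms):", io.write_time)

# A write_time of 45,589,980 ms would be ~12.7 hours inside a ~980 s
# job, which is why a collection or documentation bug is suspected
# above.
```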
Flags: needinfo?(gps)
Perhaps we should file a followup bug to investigate whether there are things we could do to improve the performance characteristics for our end-users?
(In reply to David Baron [:dbaron] ⌚️UTC-7 (review requests must explain patch) (busy May 9-13) from comment #18)
> Perhaps we should file a followup bug to investigate whether there are
> things we could do to improve the performance characteristics for our
> end-users?

Is bug 1209027 sufficient?
In bug 1271077 I coded up a quick patch to enable system resource numbers to be printed in Treeherder. It is appropriate that Mother's Day is this weekend, because we appear to have hit the mother lode: there is a treasure trove of interesting data pointing at potential bottlenecks and areas for optimization.

We know the patch in this bug significantly improves reftest execution times by reducing I/O. Procmon recorded ~50 GB of saved I/O in system calls. However, not all of that I/O hits disk. What was the actual disk I/O impact of this change?

Before: https://treeherder.mozilla.org/#/jobs?repo=try&revision=d09130fbe0545f1b6e29a97045949e1d385a819b
After: https://treeherder.mozilla.org/#/jobs?repo=try&revision=65b2a1870e1819e1d54fca904a404c4646843939

And the breakdown of disk write bytes before and after disabling Places:

Job                     Before          After
Linux64 Opt R1          9,235,111,936   274,952,192
Linux64 Opt R2          9,109,041,152   344,014,848
Linux64 Opt R E10s      19,124,137,984  412,995,584
Win7 Opt R              11,082,519,552  550,678,016
Win7 Opt Ru             9,697,164,800   518,457,856
OS X 10.10 Opt R E10s   18,420,071,424  609,679,872

I suspect most of the write bytes after this patch are due to extracting the build and test files. That will almost certainly be a few hundred megabytes given the way we're doing things today, although we've been talking about changing how that works. Stay tuned...

As for non-reftest jobs, there appears to be significant I/O usage across the board. Mochitest jobs are frequently writing ~1 GB. xpcshell jobs are writing ~11 GB (!!!). We're doing a *lot* of I/O in automation. Keep in mind that a number of machines in automation have spinning disks or are virtualized (e.g. in EC2), so I/O isn't great.

We also report CPU metrics in those Try pushes. A number of jobs are reporting <50% CPU utilization. Win7 Opt X reported ~24% CPU usage - and xpcshell tests execute concurrently. With 10.5 GB of write I/O in that job, I suspect I/O wait is slowing us down. OS X mochitest jobs are hovering around 50% CPU utilization. We know mochitests execute one test at a time. I suspect the OS X machines have 2 cores/threads.

Before you look at CPU times too much, pay attention to whether the machine is a VM, especially in EC2. We're currently reporting physical CPU utilization, so e.g. on EC2 CPU utilization will often be near 100% because some other VM you don't have insight into is using the CPU. Bug 893391 tracks improving this reporting.
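Working a few rows of the disk-write table above through (byte counts copied from this comment; nothing new is measured here):

```python
# Quick arithmetic on the disk-write table above; the byte counts are
# copied verbatim from this comment.
jobs = {
    "Linux64 Opt R1": (9_235_111_936, 274_952_192),
    "Linux64 Opt R E10s": (19_124_137_984, 412_995_584),
    "Win7 Opt R": (11_082_519_552, 550_678_016),
}
for name, (before, after) in jobs.items():
    pct = 100 * (1 - after / before)
    print("%s: %.1f%% less write I/O" % (name, pct))
# -> ~97.0%, ~97.8%, and ~95.0% reductions respectively
```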
Blocks: 1271141
Blocks: fastci
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49