1292218 - Do collector work more aggressively in reftests

Assignee

Description

•

8 years ago

Reftests rapidly open and close windows, which our GC/CC heuristics are not set up to deal with. We also don't want to just force a sync GC and CC because this reduces the amount of testing we get for incremental GC and CC. To address this, I added a window utils method runNextCollectorTimer(), which does a bit of incremental collector work (or sometimes starts one if none is in progress). Mochitests already call this once after every test, but each reftest is so much faster that I think it makes sense to call it more times in a row. To measure the effectiveness of this, I ran the following, and measured for various numbers of times in a row running runNextCollectorTimer() after each test: time ./mach reftest layout/reftests/reftest-sanity/ 2>&1 | grep "\+\+DOMWINDOW" | sort For each test run, the format is (# of peak windows, time to run test). no runNextCollectorTimer(): (290, 18.1), (290, 18.1) 3 calls in a row: (133, 18.5), (133, 18.5), (141, 18.4) 4 calls in a row: (93, 18.8), (109, 18.4) 5 calls in a row: (103 windows, 18.9), (99, 18.4), (99, 18.3) Running it 3 times in a row drops the peak from 290 windows to 133. 4 drops it further to 100. It doesn't seem to substantially increase the running time. I'm going to do some try runs and see how it affects a full reftest run, both in terms of peak windows and time. Having many windows open at once will make memory usage worse, which could cause OOM crashes on 32-bit tests. Additionally, if you end up collecting hundreds of windows at once (which I have seen when running locally), this can take a while and cause a test to spuriously time out. Bug 1110928 seems to increase the number of windows a bit, by collecting less aggressively, so I would like to improve the current situation so I can land that patch.

Andrew McCreight [:mccr8]

Assignee

Comment 1

•

8 years ago

With my patches in bug Bug 1110928 and 4 in a row: (120, 18.5), (120, 18.2), (120, 18.2), (119, 20.6), (112, 18) So, a little worse but still better than it currently is.

Andrew McCreight [:mccr8]

Assignee

Comment 2

•

8 years ago

I looked at a full run for Linux 64 debug R1. Without my patch, there was a peak of about 471 windows. With it, it was 295. So, still high, but not as bad. I calculated that by using a little script that creates some ASCII art representing how many windows are opened: https://github.com/amccreight/mochitest-logs/blob/master/bloatfinder.py Terrence wanted to discuss this with me, so I'll needinfo him for that, so neither of us forget.

Flags: needinfo?(terrence)

Terrence Cole [:terrence]

Comment 3

•

8 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=cf04f25f2afa

Terrence Cole [:terrence]

Comment 4

•

8 years ago

Attached patch implement_MOZ_CCTIMER-v0.diff (obsolete) (deleted) — Details — Splinter Review

We have MOZ_GCTIMER for the GC, this adds the same for the CC.

Attachment #8778507 - Flags: review?(continuation)

Andrew McCreight [:mccr8]

Assignee

Comment 5

•

8 years ago

Comment on attachment 8778507 [details] [diff] [review] implement_MOZ_CCTIMER-v0.diff I'll assume the new bug replaces this.

Attachment #8778507 - Attachment is obsolete: true

Attachment #8778507 - Flags: review?(continuation)

Terrence Cole [:terrence]

Comment 6

•

8 years ago

(In reply to Andrew McCreight [:mccr8] from comment #5) > Comment on attachment 8778507 [details] [diff] [review] > implement_MOZ_CCTIMER-v0.diff > > I'll assume the new bug replaces this. Whoops! I forgot that I had put the patch here Friday. It warrants a new bug regardless.

Mason Chang [Inactive] [:mchang]

Updated

•

8 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1296358

Comment hidden (mozreview-request)

Andrew McCreight [:mccr8]

Assignee

Updated

•

8 years ago

See Also: https://bugzilla.mozilla.org/show_bug.cgi?id=1296358 →

Andrew McCreight [:mccr8]

Assignee

Comment 9

•

8 years ago

I talked to Terrence about this. His main issues was just that this does not fix the issue he is seeing. But I think we should just go ahead and land this, particularly since mchang is also seeing issues with the reftest bloat. Mason, does this patch fix the problem you are seeing with your other patches?

Flags: needinfo?(terrence) → needinfo?(mchang)

Joel Maher ( :jmaher ) (UTC -8)

Comment 10

•

8 years ago

mozreview-review

Comment on attachment 8782554 [details] Bug 1292218 - Run collectors more in reftests. https://reviewboard.mozilla.org/r/72678/#review70346 some pending questions, possibly I just need more context. ::: layout/tools/reftest/reftest-content.js:694 (Diff revision 1) > > + // Do some collector work so that windows don't build up too much. > + windowUtils().runNextCollectorTimer(); > + windowUtils().runNextCollectorTimer(); > + windowUtils().runNextCollectorTimer(); > + windowUtils().runNextCollectorTimer(); why do we need to call this 4 times? And this looks to be called for every test. Does this do work in both content and chrome processes?

Attachment #8782554 - Flags: review?(jmaher) → review-

Andrew McCreight [:mccr8]

Assignee

Comment 11

•

8 years ago

Sorry, I should have explained more what this patch is trying to do. (In reply to Joel Maher ( :jmaher) from comment #10) > why do we need to call this 4 times? And this looks to be called for every > test. Well, because when I only call it once it doesn't do that much. ;) I just tried out different numbers of times we call it and saw how it helped. Each call should not take very long, something like 5ms to 40ms. What bug 1296358 was doing was running a full GC every 10 tests (which is a few hundred ms). What this patch does is instead take that full GC and break it up into little pieces, and we run a few pieces at the end of every test. > Does this do work in both content and chrome processes? Yes, this script is run in both, which I agree is surprising. :P

Andrew McCreight [:mccr8]

Assignee

Comment 12

•

8 years ago

The reason that incremental GCs are better than full GCs is that for actual Firefox users they are mostly getting incremental GCs. (And incremental GCs are much trickier to get right.) My patch also runs the cycle collector in addition to the garbage collector, so it hopefully collects more.

Mason Chang [Inactive] [:mchang]

Comment 13

•

8 years ago

(In reply to Andrew McCreight [:mccr8] from comment #11) > Sorry, I should have explained more what this patch is trying to do. > > (In reply to Joel Maher ( :jmaher) from comment #10) > > why do we need to call this 4 times? And this looks to be called for every > > test. > > Well, because when I only call it once it doesn't do that much. ;) I just > tried out different numbers of times we call it and saw how it helped. > > Each call should not take very long, something like 5ms to 40ms. What bug > 1296358 was doing was running a full GC every 10 tests (which is a few > hundred ms). What this patch does is instead take that full GC and break it > up into little pieces, and we run a few pieces at the end of every test. > Please put this in a comment above. > > Does this do work in both content and chrome processes? > > Yes, this script is run in both, which I agree is surprising. :P Please put this in a comment as well. (In reply to Andrew McCreight [:mccr8] from comment #9) > I talked to Terrence about this. His main issues was just that this does not > fix the issue he is seeing. But I think we should just go ahead and land > this, particularly since mchang is also seeing issues with the reftest bloat. > > Mason, does this patch fix the problem you are seeing with your other > patches? I'll find out in three days, but we can land this first and if not keep trying other things. Sigh, try times are forever right now :(.

Flags: needinfo?(mchang)

Andrew McCreight [:mccr8]

Assignee

Comment 14

•

8 years ago

(In reply to Mason Chang [:mchang] from comment #13) > Please put this in a comment above. That's what "Do some collector work so that windows don't build up too much." was supposed to mean. I can spell things out a little more. > Please put this in a comment as well. Sure, I can add a comment to the top of the file.

Joel Maher ( :jmaher ) (UTC -8)

Comment 15

•

8 years ago

how will we know to tune the number of incremental gc calls as the project evolves? Right now 4 calls does a great job in reducing the number of oustanding domwindows, but what if we need 6 calls in the future, would that seem silly? is there some indication in the job results that would help us determine if we are getting out of balance.

Andrew McCreight [:mccr8]

Assignee

Comment 16

•

8 years ago

(In reply to Joel Maher ( :jmaher) from comment #15) > how will we know to tune the number of incremental gc calls as the project > evolves? Right now 4 calls does a great job in reducing the number of > oustanding domwindows, but what if we need 6 calls in the future, would that > seem silly? No, 4 is just something that seemed to be a reasonable tradeoff between reducing memory usage and not increasing test time too much. It is possible some other value would be better now or in the future. > is there some indication in the job results that would help us > determine if we are getting out of balance. As described in bug 1296358, the main indication is that we hit OOM crashes. A more minor symptom would be getting timeouts, because if we end up running a huge GC that frees a lot of things during a test, this can take a very long time, and the test will time out.

Andrew McCreight [:mccr8]

Assignee

Comment 17

•

8 years ago

I updated my patch with more comments. Let me know if that improves things or not.

Comment hidden (mozreview-request)

Joel Maher ( :jmaher ) (UTC -8)

Comment 19

•

8 years ago

mozreview-review

Comment on attachment 8782554 [details] Bug 1292218 - Run collectors more in reftests. https://reviewboard.mozilla.org/r/72678/#review70368 thanks for the update.

Attachment #8782554 - Flags: review?(jmaher) → review+

Pulsebot

Comment 20

•

8 years ago

Pushed by amccreight@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/0c135a70a81a Run collectors more in reftests. r=jmaher

Ryan VanderMeulen [:RyanVM]

Comment 21

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/0c135a70a81a

Status: NEW → RESOLVED

Closed: 8 years ago

status-firefox51: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla51

Phil Ringnalda (:philor)

Updated

•

8 years ago

Depends on: 1297919

Andrew McCreight [:mccr8]

Assignee

Comment 22

•

8 years ago

Feel free to back this out. I thought I'd checked to make sure it didn't affect the time, but I guess it does. It also didn't fix the problem I was trying to fix.

Phil Ringnalda (:philor)

Comment 23

•

8 years ago

Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/0b638e24de7e - dunno if that makes this reopened, or wontfix.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Andrew McCreight [:mccr8]

Assignee

Comment 24

•

8 years ago

Thanks, and sorry for the trouble. Terrence landed bug 1296484 which should address the same issue in a different way. Also if something else goes horribly awry with timeouts, maybe that will be to blame. ;)

Status: REOPENED → RESOLVED

Closed: 8 years ago → 8 years ago

Resolution: --- → WONTFIX

Ryan VanderMeulen [:RyanVM]

Comment 25

•

8 years ago

backout bugherder

Merge of backout: https://hg.mozilla.org/mozilla-central/rev/0b638e24de7e

Mason Chang [Inactive] [:mchang]

Comment 26

•

8 years ago

Just an FYI, we were seeing lots of graphics reftest failures due to OOM with Skia on Windows w/o this patch. If those come back, please let me know. Thanks!

Ryan VanderMeulen [:RyanVM]

Comment 27

•

8 years ago

Yep, they're back.

Mason Chang [Inactive] [:mchang]

Comment 28

•

8 years ago

The original skia problem is back due to the collector not being run as aggressively in the reftest suite and bug 1296484 wasn't enough. In bug 1297919, it looks like tests are taking too long to run. Is the proper solution here to take Joel's suggestion and chunk more or is there more GC work to be done? Thanks!

Andrew McCreight [:mccr8]

Assignee

Comment 29

•

8 years ago

Ah, right, I forgot that my patch actually did do something useful for somebody.

Phil Ringnalda (:philor)

Comment 30

•

8 years ago

What do those OOM failures look like? The only thing that I see happening on Windows reftests which isn't an already filed bug is failures in writing-mode/abspos/*, which is supposed to be disabled on Win7 by https://hg.mozilla.org/integration/mozilla-inbound/file/tip/layout/reftests/writing-mode/reftest.list#l176 but which your Skia landing somehow caused to start running on Win7 (and to then start failing after this was backed out). If that's what we're talking about, figuring out why they started running and stopping them again seems like a cheaper thing to do than to add 77 minutes per push on WinXP plus the time that this adds, when we have 22 hours of WinXP backlog right now.

Phil Ringnalda (:philor)

Comment 31

•

8 years ago

And indeed, every Windows reftest failure on mozilla-inbound since I backed this out has been either the longstanding e10s startup hang bug 1267106 or that return of bug 1167155 in tests which are supposed to have been skipped on Win7 since last December.

Mason Chang [Inactive] [:mchang]

Comment 32

•

8 years ago

(In reply to Phil Ringnalda (:philor) from comment #30) > What do those OOM failures look like? The only thing that I see happening on > Windows reftests which isn't an already filed bug is failures in > writing-mode/abspos/*, which is supposed to be disabled on Win7 by > https://hg.mozilla.org/integration/mozilla-inbound/file/tip/layout/reftests/ > writing-mode/reftest.list#l176 but which your Skia landing somehow caused to > start running on Win7 (and to then start failing after this was backed out). > If that's what we're talking about, figuring out why they started running > and stopping them again seems like a cheaper thing to do than to add 77 > minutes per push on WinXP plus the time that this adds, when we have 22 > hours of WinXP backlog right now. In the logs, we usually see this - 04:04:02 INFO - REFTEST TEST-LOAD | file:///c:/slave/test/build/tests/reftest/tests/layout/reftests/writing-mode/abspos/s71-abs-pos-non-replaced-vrl-024-ref.xht | 13286 / 13783 (96%) 04:04:02 INFO - [GFX1]: Failed 2 buffer db=00000000 dw=00000000 for 0, 0, 800, 1000 04:04:02 INFO - [GFX1]: Failed 2 buffer db=00000000 dw=00000000 for 0, 0, 800, 1000 04:04:02 INFO - REFTEST TEST-PASS | file:///c:/slave/test/build/tests/reftest/tests/layout/reftests/writing-mode/abspos/s71-abs-pos-non-replaced-vrl-024.xht == file:///c:/slave/test/build/tests/reftest/tests/layout/reftests/writing-mode/abspos/s71-abs-pos-non-replaced-vrl-024-ref.xht | image comparison 04:04:02 INFO - REFTEST TEST-END | file:///c:/slave/test/build/tests/reftest/tests/layout/reftests/writing-mode/abspos/s71-abs-pos-non-replaced-vrl-024.xht == file:///c:/slave/test/build/tests/reftest/tests/layout/reftests/writing-mode/abspos/s71-abs-pos-non-replaced-vrl-024-ref.xht 04:04:02 INFO - [GFX1]: Failed 2 buffer db=00000000 dw=00000000 for 0, 0, 800, 1000 The GFX failed 2 buffer means we failed to allocate memory for the texture here. I'll try taking a look at seeing why those tests aren't being skipped.

Mason Chang [Inactive] [:mchang]

Comment 33

•

8 years ago

Attached patch Skip abs-pos-non-replaced tests on Windows 7 (deleted) — Details — Splinter Review

The tests that were actually being skipped were here [1]. The other tests were matching because any other fuzzy-if or condition would override the skip-if. Hoorah reftest syntax! This explicitly disables those tests in the reftest.list itself. http://searchfox.org/mozilla-central/source/layout/reftests/writing-mode/abspos/reftest.list#104

Attachment #8786501 - Flags: review?(philringnalda)

Phil Ringnalda (:philor)

Comment 34

•

8 years ago

Comment on attachment 8786501 [details] [diff] [review] Skip abs-pos-non-replaced tests on Windows 7 Ah. Apparently that's actually a bug (when fuzzy-if was added it both should have been documented in https://dxr.mozilla.org/mozilla-central/source/layout/tools/reftest/README.txt#54 and shouldn't have been implemented to beat out skip), but in the meantime this should work around it. r=me with a commit message saying it's for bug 1167155 rather than this one.

Attachment #8786501 - Flags: review?(philringnalda) → review+

Mason Chang [Inactive] [:mchang]

Updated

•

8 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1300355

implement_MOZ_CCTIMER-v0.diff 8 years ago Terrence Cole [:terrence] (deleted), patch		Details \| Diff \| Splinter Review
Bug 1292218 - Run collectors more in reftests. 8 years ago Andrew McCreight [:mccr8] (deleted), text/x-review-board-request	jmaher : review+	Details
Skip abs-pos-non-replaced tests on Windows 7 8 years ago Mason Chang [Inactive] [:mchang] (deleted), patch	philor : review+	Details \| Diff \| Splinter Review