Closed Bug 864085 Opened 12 years ago Closed 12 years ago
Investigate and fix the enormous increase in time to run Windows debug browser-chrome
Categories: Testing :: Mochitest, defect
Tracking: Not tracked
Status: RESOLVED FIXED
Target Milestone: mozilla23
People: Reporter: philor, Assigned: terrence
Whiteboard: [buildfaster:?][capacity]
Attachments (4 files)
(deleted), patch | philor: review+
(deleted), text/plain
(deleted), text/plain
(deleted), patch | billm: review+
https://tbpl.mozilla.org/php/getParsedLog.php?id=22025306&tree=Mozilla-Central was a Windows 7 debug browser-chrome run which was killed by buildbot because it exceeded the 120 minute time limit it places on any buildstep.
Typical Win7 debug b-c times:
esr17: 60 minutes
b2g18: 75 minutes
m-r19: 85 minutes
m-r20: 95 minutes
m-b21: 103 minutes
m-a22: 101 minutes
m-c23: 113 minutes
m-c24: gonna be killed by buildbot at 120 minutes
Neither the time increases on other platforms (10.6 debug went from 51 minutes on esr17 to 78 minutes on trunk, Win7 opt went from 29 minutes to 30 minutes) nor the increase in number of tests (around 1.2 times as many on trunk as on esr17) fully explain that jump.
My first thought while staring at that log was a massive increase in assertions, but removing the "Shouldn't be trying to restyle non-elements directly" assertion we hit a billion times on trunk and zero on esr17 and pushing to try gave me a 113 minute Win7 debug run, not even a minute below typical.
It's hard to extrapolate given that odd decrease for 22, but the smart bet is that we have less than 6 weeks to fix this (or do something gross and unworthy like splitting browser-chrome into two chunks, even though that would mean two wasteful 15-minute chunks for most opt platforms, or increasing the timeout so that we can take *more than two hours* to run them).
Reporter
Comment 1•12 years ago
(Those times are the complete run time, since that's easier to get from tbpl without opening logs, while the timeout is just for the runtests step, but setup/teardown time is nothing - the "114 minute" run I'm looking at took 113 minutes 39 seconds in the runtests step, since mozharness pushes almost everything into that step.)
Updated•12 years ago
Whiteboard: [buildfaster:?][capacity]
Reporter
Comment 2•12 years ago
Reporter
Comment 3•12 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=22145910&tree=Mozilla-Inbound
Looks like the apocalypse will be about... now.
Comment 4•12 years ago
It's not an actual fix, but we have the capability of setting a different max time limit for this test, if that helps.
Reporter
Comment 5•12 years ago
Yeah, and I even thought of an excuse that makes doing so feel slightly less bad: part of the 17 -> 18 increase is because mozharness moved some setup (maybe 5 minutes' worth) into the same buildstep as running the tests.
Comment 6•12 years ago
I think a prerequisite to figuring anything out here would be to scrape buildbot data for the test run times and try to figure out when it regressed.
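(A minimal sketch of that kind of scrape, assuming the builder's per-build start/finish timestamps have already been exported to a CSV with revision,start_epoch,end_epoch columns - the file name and column names here are made up:)

import csv
from datetime import timedelta

def load_durations(path):
    # Each row is one run of the builder; start_epoch/end_epoch are Unix times.
    with open(path) as f:
        for row in csv.DictReader(f):
            start, end = float(row["start_epoch"]), float(row["end_epoch"])
            yield start, row["revision"], timedelta(seconds=end - start)

def find_jumps(builds, threshold=timedelta(minutes=10)):
    # Walk the runs in chronological order and flag any large jump in runtime.
    prev = None
    for _, rev, duration in sorted(builds):
        if prev is not None and duration - prev > threshold:
            print("runtime jumped at %s: %s -> %s" % (rev, prev, duration))
        prev = duration

find_jumps(load_durations("xp-debug-bc-times.csv"))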
Reporter
Comment 7•12 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=22209163&tree=Mozilla-Inbound (so close, printed the first line of the summary so it probably had under 30 seconds to go)
Comment 8•12 years ago
This will apply to both opt and debug; is that OK?
A debug-only patch would be considerably larger.
Attachment #741972 - Flags: review?(philringnalda)
Reporter
Comment 9•12 years ago
Comment on attachment 741972 [details] [diff] [review]
increase browser-chrome script_maxtime to 9000 seconds
Over 9000?! :)
Not being debug-only worries me less than not being Windows-only - that'll make the currently not-too-frequent bug 857127 cost us another half hour of slave time when it hits. Personally, I'd just edge it up by 5 minutes (it's possible the first instance of hitting this was more like 6 minutes from finishing, but all the rest have been within seconds of being done), but if you'd rather not have to keep tweaking it as we get worse, until we either fix some of the slowdown or switch to the new slaves, that's fine too.
Attachment #741972 - Flags: review?(philringnalda) → review+
Comment 10•12 years ago
Comment on attachment 741972 [details] [diff] [review]
increase browser-chrome script_maxtime to 9000 seconds
http://hg.mozilla.org/build/buildbot-configs/rev/7af192410922
Yeah, it's purely a stopgap workaround. If the hung script problem gets worse we can lower this again.
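(For anyone following along, the knob lives in the suite definitions in buildbot-configs; the snippet below is only a rough sketch - the real config's layout may differ - but the change is just the per-step script_maxtime, in seconds:)

# Hypothetical excerpt of the mochitest-browser-chrome suite definition;
# the surrounding dict layout is illustrative, not copied from the real config.
MOCHITEST_BROWSER_CHROME = {
    "suite": "mochitest-browser-chrome",
    "script_maxtime": 9000,  # was 7200 (120 min); buildbot kills the step past this
}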
Comment 11•12 years ago
(In reply to Phil Ringnalda (:philor) from comment #3)
> Looks like the apocalypse will be about... now.
Yep.
https://tbpl.mozilla.org/php/getParsedLog.php?id=22286121&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=22286102&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=22285817&tree=Mozilla-Inbound
Comment 12•12 years ago
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #6)
> I think a prerequisite to figuring anything out here would be to scrape
> buildbot data for the test run times and try to figure out when it regressed.
Here is some data for the historical build time of the mozilla-central_xp-debug_test-mochitest-browser-chrome builder. Make of it what you will.
Comment 13•12 years ago
I was curious about philor's comment, so I looked at which tests were the most expensive on Win debug. This attachment is that list.
TL;DR: we should try figuring out why the devtools/debugger, devtools/webconsole, and mozapps/extensions tests are so extraordinarily expensive. Those test directories account for about 40% of the test time with about 20% of the test scripts.
I don't necessarily think it's something inherent to the tests (e.g. the most expensive test takes about 1% of the time in an opt build vs. the time it takes in a debug build); there might be some really expensive debug checks the tests are triggering.
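(The per-directory rollup is easy to reproduce; roughly like this, assuming the log still prints per-test durations as "TEST-END | <path> | finished in <N>ms" - adjust the regex and grouping depth to the actual log format and the chrome:// path prefix:)

import re
import sys
from collections import defaultdict

# Assumed log line shape: "TEST-END | .../browser_foo.js | finished in 1234ms"
TEST_END = re.compile(r"TEST-END \| (\S+) \| finished in (\d+)ms")

def time_by_directory(log_path, depth=6):
    totals = defaultdict(lambda: [0, 0])  # directory -> [total_ms, test_count]
    with open(log_path) as log:
        for line in log:
            m = TEST_END.search(line)
            if m:
                directory = "/".join(m.group(1).split("/")[:depth])
                totals[directory][0] += int(m.group(2))
                totals[directory][1] += 1
    return totals

for directory, (ms, count) in sorted(time_by_directory(sys.argv[1]).items(),
                                     key=lambda item: -item[1][0]):
    print("%8.1fs  %4d tests  %s" % (ms / 1000.0, count, directory))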
Comment 14•12 years ago
I checked to see if this is the same thing as bug 846890, but it would appear not to be: I disabled the file checking stuff, and it was just as awfully slow, and that code is only hit about half a dozen times.
Comment 15•12 years ago
Running the tests in browser/devtools/debugger/ under perf on a Linux box turns up the following profile:
+ 2.98% firefox-bin libxul.so [.] js::gc::Cell::isTenured() const
+ 2.47% firefox-bin libxul.so [.] js::gc::Cell::tenuredZone() const
+ 2.10% firefox-bin libplds4.so [.] PL_HashString
+ 1.98% firefox-bin libxul.so [.] js::GCMarker::processMarkStackTop(js::SliceBudget&)
+ 1.93% firefox-bin libpthread-2.11.3.so [.] __pthread_mutex_unlock_usercnt
+ 1.92% firefox-bin libpthread-2.11.3.so [.] pthread_mutex_lock
+ 1.71% firefox-bin libplds4.so [.] PL_HashTableRawLookup
+ 1.63% firefox-bin libpthread-2.11.3.so [.] pthread_mutex_trylock
+ 1.47% firefox-bin libc-2.11.3.so [.] __strcmp_sse42
+ 1.41% firefox-bin libxul.so [.] js::Interpret(JSContext*, js::StackFrame*, js::InterpMode, bool)
+ 1.26% firefox-bin libxul.so [.] js::gc::Cell::markIfUnmarked(unsigned int) const
+ 1.20% firefox-bin libxul.so [.] void MarkInternal<JSAtom>(JSTracer*, JSAtom**)
+ 1.19% firefox-bin libxul.so [.] JSCompartment::sweepBreakpoints(js::FreeOp*)
+ 1.04% firefox-bin libpthread-2.11.3.so [.] pthread_getspecific
+ 0.98% firefox-bin libxul.so [.] GetBloatEntry(char const*, unsigned int)
+ 0.95% firefox-bin libnspr4.so [.] PR_Unlock
+ 0.95% firefox-bin firefox-bin [.] arena_malloc
+ 0.92% firefox-bin libxul.so [.] js::ReadBarrieredValue::get() const
+ 0.91% firefox-bin firefox-bin [.] arena_dalloc
+ 0.87% firefox-bin libxul.so [.] js::ObjectImpl::lastProperty() const
+ 0.82% firefox-bin libxul.so [.] js::gc::Cell::arenaHeader() const
+ 0.80% firefox-bin libxul.so [.] js::UncheckedUnwrap(JSObject*, bool, unsigned int*)
+ 0.68% firefox-bin libxul.so [.] SearchTable(PLDHashTable*, void const*, unsigned int, PLDHashOperator)
+ 0.66% firefox-bin libnspr4.so [.] PR_GetCurrentThread
+ 0.63% firefox-bin libxul.so [.] void MarkInternal<JSObject>(JSTracer*, JSObject**)
+ 0.63% firefox-bin libxul.so [.] js::gc::Cell::isMarked(unsigned int) const
+ 0.62% firefox-bin libxul.so [.] js::gc::FreeSpan::checkSpan() const
Around 25-30% of the runtime for all these tests is being consumed in GC; the #ifdef DEBUG checks are apparently killing us here. Bill, is that expected?
Flags: needinfo?(wmccloskey)
Comment 16•12 years ago
Terrence might also have some idea.
Comment 17•12 years ago
The isTenured stuff is fairly recent, so that can't be the whole story.
At one point, the debugger would trigger various nasty GCs, but I don't recall if that's still the case.
Comment 18•12 years ago
(In reply to Andrew McCreight [:mccr8] from comment #17)
> The isTenured stuff is fairly recent, so that can't be the whole story.
>
> At one point, the debugger would trigger various nasty GCs, but I don't
> recall if that's still the case.
I hate perf's interface for viewing things, but if I'm understanding it correctly, it looks like the GCs are getting triggered under js::Debugger::addDebuggee. The callstacks for isTenured look something like:
js::Debugger::addDebuggee
IncrementalCollectSlice
js::GCMarker::drainMarkStack
js::GCMarker::processMarkStack
and then various bits under that.
Assignee
Comment 19•12 years ago
(In reply to Andrew McCreight [:mccr8] from comment #17)
> The isTenured stuff is fairly recent, so that can't be the whole story.
IsTenured is particularly bad because the call is not inlinable in a path that is otherwise fully inlined. It would be nice to know which exact usage is the costly one here. Regarding inlining, we recently changed the semantics of JS_ALWAYS_INLINE; this may also have had an impact.
> At one point, the debugger would trigger various nasty GCs, but I don't
> recall if that's still the case.
I'm not aware of anything changing there recently.
(In reply to Nathan Froyd (:froydnj) from comment #18)
> I hate perf's interface for viewing things,
Likewise.
> but if I'm understanding it
> correctly, it looks like the GCs are getting triggered under
> js::Debugger::addDebuggee. The callstacks for isTenured look something like:
>
> js::Debugger::addDebuggee
> IncrementalCollectSlice
> js::GCMarker::drainMarkStack
> js::GCMarker::processMarkStack
>
> and then various bits under that.
Well, the isTenured assertion should fire any time that we have to access the arena header: only tenured objects have one, so this is pretty important. This access typically happens at the leaves, however: can we get a full stack to see which particular callsite(s) are making the assertion?
Comment 20•12 years ago
I doubt this has anything to do with isTenured. It just happens to show up in the profile because it's not inlined.
I'll look into this later today. Do I read comment 13 right: the debugger tests run 100x slower in debug builds than in opt builds?
Flags: needinfo?(wmccloskey)
Comment 21•12 years ago
(In reply to Bill McCloskey (:billm) from comment #20)
> I'll look into this later today. Do I read comment 13 right: the debugger
> tests run 100x slower in debug builds than in opt builds?
They do run slower. But I see now I scrambled my numbers when I was doing the calculations for how much slower they ran. 100x would be a pretty big deal!
It looks like individual tests are around 8x slower in debug builds than in opt builds, just from glancing at the top several offenders in the list I attached.
Comment 22•12 years ago
I gathered some data to look at the GC issue in more detail. During a normal Win7 debug build (that had a few weird errors that hopefully don't affect these results), we spent 2041 seconds in GC, out of a total of 7205 seconds. That's 28%. I also tried to turn off two forms of debug checks. These are the numbers:
                                  In GC (s)  Total (s)  Pct in GC
normal build                           2041       7205        28%
no incremental validation              1644       6630        25%
no DEBUG checks during marking          981       5751        17%
(Removing the DEBUG checks also speeds up allocation, if you're wondering.)
The try builds are here if anyone is interested:
normal (with printfs): https://tbpl.mozilla.org/?tree=Try&rev=0e8e57c87159
no validation: https://tbpl.mozilla.org/?tree=Try&rev=cce0664b5015
no DEBUG checks: https://tbpl.mozilla.org/?tree=Try&rev=0568f161e182
The cost of the incremental validation doesn't seem too extreme to me. The DEBUG checks are more expensive than I would have expected. I'm going to do a few more builds to try to isolate the cost of each kind of check.
I also looked at why we're doing so much garbage collection. Here's a breakdown of the GC reasons for non-incremental GCs, and how often a GC is invoked for that reason:
{'LAST_CONTEXT': 52, 'TOO_MUCH_MALLOC': 2, 'MEM_PRESSURE': 1, 'DOM_WORKER': 539, 'SHUTDOWN_CC': 2, 'DESTROY_CONTEXT': 52, 'DOM_UTILS': 51, 'COMPONENT_UTILS': 1707, 'API': 1, 'SET_NEW_DOCUMENT': 1, 'ALLOC_TRIGGER': 1, 'DEBUG_MODE_GC': 415}
I also looked at the reason why we initiate incremental GCs:
{'CC_WAITING': 61, 'FULL_GC_TIMER': 15, 'PAGE_HIDE': 251, 'MAYBEGC': 1, 'SET_NEW_DOCUMENT': 180}
The incremental ones are pretty typical, and there aren't very many of them. The outliers are these:
COMPONENT_UTILS: 1707
DOM_WORKER: 539
DEBUG_MODE_GC: 415
The DOM_WORKER ones happen on worker threads, so they may not affect the total running time very much. The DEBUG_MODE_GCs happen whenever we turn the debugger on or off, so I guess we should expect a lot of them. The COMPONENT_UTILS GCs seem pretty extreme. I'm going to try to find out if we can eliminate some or all of them. They account for 51% of all GCs.
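(The reason tallies above come straight out of the log; something like the following reproduces them, assuming the instrumented build prints one line per collection containing "reason: <REASON>" - the marker text is an assumption, so match whatever the printfs actually emit:)

import re
import sys
from collections import Counter

# Assumes each collection logs a line containing "reason: <REASON>";
# the marker text is an assumption about the added printfs.
GC_REASON = re.compile(r"reason:\s*([A-Z_]+)")

def tally_reasons(log_path):
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            match = GC_REASON.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts

print(dict(tally_reasons(sys.argv[1])))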
Comment 23•12 years ago
That's pretty weird. I guess people just sprinkle forceGC calls all over their tests? I know there have been a few intermittent oranges that people try to force into a particular test by forcing GC.
Comment 24•12 years ago
Ah, it looks like checkForLeakedGlobalWindows schedules a precise GC, so maybe this is a regression from the work in bug 658738 to detect leaking windows?
Comment 25•12 years ago
There aren't many forceGC in browser-chrome tests, at first glance...
Comment 26•12 years ago
On the other hand, our HTTP server does a forceGC every time a connection closes. :(
Comment 27•12 years ago
It looks like bz is right. The COMPONENT_UTILS GCs are from the HTTP server, which is in another process. They just happen to get printed to the log with everything else. It looks like they get run after the connection is closed, so they probably don't affect the total running time of the tests much (assuming the test slaves have multiple cores). Strangely, the relevant bug (bug 508125) is inaccessible to me. I hope I'm not about to get fired or something.
Comment 28•12 years ago
The real bug is bug 508128, the commit message just had a typo.
Comment 29•12 years ago
(In reply to Andrew McCreight [:mccr8] from comment #24)
> Ah, it looks like checkForLeakedGlobalWindows schedules a precise GC, so
> maybe this is a regression from the work in bug 658738 to detect leaking
> windows?
It would certainly be interesting to compare test run times without the leak detection.
Comment 30•12 years ago
How often does the leak detection run? I thought it was only at the end of the test run.
From bug 508128 (thanks Gavin), it sounds like we didn't used to have a GC trigger based on the number of bytes allocated. We do now. It's possible we could just remove that GC. I'll test. I'm still not sure if that would help, though.
Comment 31•12 years ago
It runs after each individual test, because it wants to blame individual tests for leaking things. So on the order of 1000+ times per "bc" run.
Comment 32•12 years ago
(In reply to :Gavin Sharp (use gavin@gavinsharp.com for email) from comment #31)
> It runs after each individual test, because it wants to blame individual
> tests for leaking things. So on the order of 1000+ times per "bc" run.
It looks like it only does the leak checking when the done flag is set:
http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/browser-test.js#310
I think that only happens at the end?
Comment 33•12 years ago
(In reply to Bill McCloskey (:billm) from comment #32)
> I think that only happens at the end?
Yeah, looks like my recollection of how this works was wrong.
Comment 34•12 years ago
OK, I was wrong. It looks like the isTenured thing is a big problem. If I eliminate that check, we go from ~120 mins to ~105 mins. None of the other changes I tried had any effect. Terrence, can you try inlining that path?
Also, I tried removing the forceGC call from httpd.js. Doing so didn't cause the test to go orange, but it also didn't improve the running time at all. We should probably leave it alone.
Comment 36•12 years ago
Do we have any theories as to why this:
a) is worse on Windows
b) regressed significantly in recent history?
Assignee
Comment 37•12 years ago
Well, (b) at least is my fault. I'll work on fixing that.
Assignee: nobody → terrence
Assignee
Comment 38•12 years ago
Attachment #744852 - Flags: review?(wmccloskey)
Comment 39•12 years ago
I just noticed that the running time for this test dropped a lot recently, from ~120 mins to ~100 mins. It seems to have happened between these checkins:
120 mins: https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=ba3ce7dde51f
100 mins: https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=e13bb42d2872
Ted, is there anything in your checkin that we would expect to make builds go faster? Only Windows debug builds seem to be affected; other test suites seemed to get faster as well.
Flags: needinfo?(ted)
Attachment #744852 - Flags: review?(wmccloskey) → review+
Comment 40•12 years ago
This is completely unexpected. I talked to billm on IRC; the only thing I can imagine is that the patch that adds -RELEASE to LDFLAGS does something. He's going to try backing it out on Try and see what happens. My only theory is that it somehow makes loading symbols faster when we get assertion stacks. I know WinDBG complains about the missing checksum when it loads symbols, so maybe that causes a slowdown somehow?
Flags: needinfo?(ted)
Assignee
Comment 41•12 years ago
Green try at:
https://tbpl.mozilla.org/?tree=Try&rev=2090e7142df4
Pushed at:
https://hg.mozilla.org/integration/mozilla-inbound/rev/75699d465ba0
Comment 42•12 years ago
Oh, I forgot to say, Ted's patch in bug 431048 is what caused the running time to go down so much. It will be interesting to see if Terrence's patch gets us down even further.
Comment 43•12 years ago
(In reply to Terrence Cole [:terrence] from comment #41)
> Green try at:
> https://tbpl.mozilla.org/?tree=Try&rev=2090e7142df4
>
> Pushed at:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/75699d465ba0
From spot-checking TBPL, it looks like this helped us save about 10 minutes on XP and Windows 7, and 5 minutes on Win8. Nice job. :-)
Comment 44•12 years ago
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla23
Updated•7 years ago
Component: BrowserTest → Mochitest