Bug 1092171 (Closed): Opened 10 years ago, Closed 9 years ago

4.53% Win7+8 V8 Regression on Inbound (v.36) Oct 30 from push 13ecff800114

Categories

(Core :: XBL, defect)

Platform: x86_64 Linux
Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED WONTFIX

People

(Reporter: jmaher, Unassigned)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

Here is a graph showing the regression:
http://graphs.mozilla.org/graph.html#tests=%5B%5B230,131,25%5D%5D&sel=none&displayrange=7&datatype=running

I did some retriggers on tbpl:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&startdate=2014-10-30&enddate=2014-10-31&jobname=Windows%207%2032-bit%20mozilla-inbound%20talos%20dromaeojs

The retriggers point to this as the push causing the regression:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=13ecff800114

Here is some information about v8:
https://wiki.mozilla.org/Buildbot/Talos/Tests#V8.2C_version_7

I looked at other platforms and none of them shows a regression, so this is limited to Win7 (our most popular desktop platform).
This is quite odd, since V8 shouldn't be using DOM things at all, except maybe Window.
That said, the numbers are very clear, I agree.

The subtest results look like this on the 4 runs before the change (dropping everything after the decimal point):

Crypto: 17997, 17777, 18339, 17733
DeltaBlue: 25038, 25177, 25138, 25131
EarleyBoyer: 23450, 23591, 23526, 23593
NavierStokes: 22341, 22274, 22408, 22237
RayTrace: 40329, 40289, 40033, 39697
RegExp: 3479, 3444, 3469, 3461
Richards: 21869, 21491, 21714, 21774
Splay: 14815, 14595, 14285, 14912

And like this on the 4 runs after the change:

Crypto: 17797, 17557, 17496, 17759
DeltaBlue: 25111, 24569, 24913, 24820
EarleyBoyer: 22928, 22788, 21573, 22531
NavierStokes: 22215, 22341, 22363, 22363
RayTrace: 40215, 38958, 40067, 39919
RegExp: 3454, 3437, 3466, 3458
Richards: 21491, 21922, 21689, 21710
Splay: 10443, 9932, 10683, 10440

So most obviously, something happened with Splay here.
And note that http://arewefastyet.com/#machine=17&view=single&suite=octane&subtest=Splay does not show a regression either, as far as I can tell.  Though maybe octane-splay and v8-v7-splay are not the same thing...
Should we be updating our tests?
So I tried http://v8.googlecode.com/svn/data/benchmarks/v7/run.html on Mac with and without the changeset in question, and unsurprisingly I see no performance difference there.

I'm going to do some try pushes to see if I can reproduce there, and if so, I guess I get to do some debugging via try...
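For reference, quick arithmetic on the Splay numbers quoted above: the mean score drops from roughly 14652 to roughly 10375, which is about a 29% lower score (or about a 41% slowdown if expressed in time), consistent with the 30-40% figure discussed later in the thread. A minimal JavaScript check:

```js
// Quick arithmetic on the Splay subtest scores quoted above (higher is better).
const before = [14815, 14595, 14285, 14912];
const after = [10443, 9932, 10683, 10440];
const mean = xs => xs.reduce((a, b) => a + b, 0) / xs.length;

const scoreDrop = 1 - mean(after) / mean(before);  // ~0.29, i.e. ~29% lower score
const slowdown = mean(before) / mean(after) - 1;   // ~0.41, i.e. ~41% more time
console.log(`score drop ~${(scoreDrop * 100).toFixed(1)}%, slowdown ~${(slowdown * 100).toFixed(1)}%`);
```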
We also have a 2.74% Win8 regression here:
http://graphs.mozilla.org/graph.html#tests=%5B%5B230,131,31%5D%5D&sel=1414604226202,1414777777506.348&displayrange=7&datatype=running
Summary: 4.53% Win7 V8 Regression on Inbound (v.36) Oct 30 from push 13ecff800114 → 4.53% Win7+8 V8 Regression on Inbound (v.36) Oct 30 from push 13ecff800114
> should be we updating our tests?

I have no idea.  What is "our tests" in this case?  ;)

Just to make sure, the regression here is on non-PGO boxes, so I don't have to worry about weird PGO effects, right?

In any case, I've done a bunch of try pushes to try to narrow this down and here's what I see: 

1)  The regression is definitely in Splay only.
2)  The regression is due to changing what the parent (in the JS engine sense) is for
    nsXULElement instances from their parent element to the global (!).  Specifically,
    compare https://tbpl.mozilla.org/?tree=Try&rev=6d9ab47bdf0d and
    https://tbpl.mozilla.org/?tree=Try&rev=0a1a122280de

There shouldn't even be any nsXULElement instances in splay, except _possibly_ for viewport scrollbars.  So my best guess is that changing the parenting here changes the GC timing for the browser window itself somehow and this affects the performance on splay (which iirc is largely an object allocation benchmark).

Terrence, is there something GC-related I could try logging on my try pushes to verify this hypothesis?
Flags: needinfo?(terrence)
Yes, small changes to the GC timing can dramatically change the splay score. In particular, we've had problems in the past with a GC that normally happens in the harness moving forward a bit and happening inside splay's timed region. Steve has the most recent experience here, looking at splay numbers on Flame. Those devices are slow enough that we only do a handful of GCs, so one extra is an enormous relative slowdown. Desktop, on the other hand, should be fast enough that the difference would be relatively small -- like about 5%. Steve might also have some ideas for how to debug this on try -- I certainly don't.
Flags: needinfo?(terrence) → needinfo?(sphink)
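To make the mechanism described above concrete, here is a minimal sketch (not the actual V8 v7 harness code; the pause and work durations are made-up numbers) of why a major GC sliding from the harness into Splay's timed region shows up as a large score drop: V8-style scores are inversely proportional to measured time, so the relative pause length maps directly onto the relative score loss.

```js
// Rough sketch only: illustrates how one extra major GC pause inside the timed
// region depresses a Splay-style score. All durations below are made-up numbers.

function timedRegion(workMs, extraGcPauseMs) {
  // Each iteration does `workMs` of allocation-heavy work; if the engine
  // schedules a major GC inside the region, the measured time grows by the
  // pause even though the "real" work is unchanged.
  return workMs + extraGcPauseMs;
}

const baseline = timedRegion(100, 0);   // GC happened in the harness, outside the timer
const regressed = timedRegion(100, 35); // GC moved forward into the timed region

// Scores are inversely proportional to measured time, so the score drop
// mirrors the relative pause length.
const scoreDrop = 1 - baseline / regressed;
console.log(`score drop ~${(scoreDrop * 100).toFixed(1)}%`); // ~25.9% with these made-up numbers
```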
Terrence, am I right to understand that if GC timing changes and as a result the splay numbers change, then it's not a real regression, but rather incidental?

If that's the case, and you feel that the modified GC timing is not a bad thing, then I don't think we should go out of our way to get the Splay numbers back to their previous values. At least for this change, the change in numbers doesn't represent a user-visible change; it just happens that Splay's use case triggers more GCs than it used to, while the overall behavior (i.e. for more cases than just Splay's) is supposedly better, right?
Perhaps we could change the Talos test harness a bit to ensure that all GCs resulting from the warm-up run have happened before the timed run starts, e.g. by inserting a gcslice() call before starting the timed run (it would probably make sense to move Talos over to Octane v2 at the same time).
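A rough sketch of that harness change, in JavaScript. The hooks and the benchmark object shown here (gcslice(), SpecialPowers.forceGC(), benchmark.warmUp()/run()) are illustrative assumptions; which of them, if any, a Talos page can actually call depends on its privileges.

```js
// Sketch of the idea above: drain warm-up GC debt before the timed run starts.
// gcslice()/gc() exist as shell/testing functions, and a privileged harness
// could use a forceGC-style hook instead; treat the names as illustrative.

function flushPendingGC() {
  if (typeof gcslice === "function") {
    // Finish any in-progress incremental GC rather than letting its slices
    // land inside the timed region.
    gcslice();
  } else if (typeof SpecialPowers !== "undefined" && SpecialPowers.forceGC) {
    SpecialPowers.forceGC();
  }
  // If neither hook is available, we can only hope the engine schedules its
  // GC during the untimed warm-up.
}

function runBenchmark(benchmark) {      // `benchmark` is a hypothetical object
  benchmark.warmUp();                   // allocations here build up GC debt
  flushPendingGC();                     // pay the debt outside the timed region
  const start = performance.now();
  benchmark.run();                      // timed run, ideally with no leftover GC work
  return performance.now() - start;
}
```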
> should be fast enough that the difference would be relatively small -- like about 5%.

We have a 30-40% Splay regression, so whatever our hypothesis is should be able to account for that.

The discussion in comments 10 and 11 seems premature; let's figure out whether GC is in fact the problem before we start worrying about how to solve it.  ;)
(In reply to Boris Zbarsky [:bz] from comment #12)
> > should be fast enough that the difference would be relatively small -- like about 5%.
> 
> We have a 30-40% Splay regression, so whatever our hypothesis is should be
> able to account for that.

Oh, I thought the 4.53% was for splay alone. If it's 30-40% on Windows, GC timing is not going to be an adequate explanation. I guess the first question is if GC is even culpable. Ideally we could get MOZ_GCTIMER logs off talos, but I don't know how feasible that is.
> I guess the first question is if GC is even culpable. 

That was precisely my question in comment 8, yes.  What can I log on try to test this hypothesis?
I talked to sfink.

His approach involves adding shell functions and changing tests to call them; neither is really viable here, since this regression is browser-only and is in tests I can't edit.

I'll see if I can come up with some logging based on his patch, though.
I gave bz a patch at http://people.mozilla.org/~sfink/data/upload-stats that hopefully will make talos upload the MOZ_GCTIMER log. But this strikes me as something generally useful -- seems like talos ought to be logging major GCs to stderr by default, for cases like this. Possibly in a way compatible with the fancy new structured logging stuff, since we can already output as JSON.

The one thing that's going to be difficult is figuring out where splay is in the output. I guess if talos already logs the individual test starts/ends, perhaps this would be possible with interleaved logs. So bz, an alternative to the above patch is to instead hardcode char *env = "stderr". That will spew very basic summary timings to stderr, mixed in with the regular output.
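A possible post-processing sketch for the interleaved-log approach, in JavaScript (Node). The TEST-START/TEST-END markers and the GC summary-line format used below are hypothetical stand-ins; they would need to be adjusted to whatever the Talos log and the MOZ_GCTIMER/stderr output actually emit.

```js
// Sketch only: attribute GC log lines in an interleaved Talos log to the test
// that was running when they were emitted. The marker and GC line formats
// matched below are hypothetical.
const fs = require("fs");

const lines = fs.readFileSync(process.argv[2], "utf8").split("\n");
let currentTest = null;
const gcByTest = new Map();

for (const line of lines) {
  const start = line.match(/TEST-START \| (\S+)/);  // hypothetical test-start marker
  const end = line.match(/TEST-END \| (\S+)/);      // hypothetical test-end marker
  if (start) currentTest = start[1];
  else if (end) currentTest = null;
  else if (/^GC\(/.test(line) && currentTest) {     // hypothetical GC summary line
    if (!gcByTest.has(currentTest)) gcByTest.set(currentTest, []);
    gcByTest.get(currentTest).push(line);
  }
}

for (const [test, gcs] of gcByTest) {
  console.log(`${test}: ${gcs.length} major GC(s)`);
}
```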
Unfortunately, with the logging patch the Talos run fails.  :(  See https://tbpl.mozilla.org/?tree=Try&rev=b41ea8fcaceb
Looking at that Talos run, the browser terminates after the second time it loads http://localhost/page_load_test/dromaeo/cssquery-dojo.html. For reference, this is the first page it loads in the test, so maybe this patch is causing the browser to terminate?
What's interesting to me is that in a run without that patch, I only see cssquery-dojo.html loaded once, unless I'm missing something...
Ah, good point, that is true: what we see here is the stdout from the browser, and then when it fails we dump everything we have again (probably a bug in Talos, although on some platforms like Android this is a benefit). Is there a chance that the browser could be crashing or hanging somehow?
So here's a question.  Is this reproducible on win7/8 outside our test infra?
I am not sure; that is a great question. I don't have access to a hardware Win7 machine, but I could do this on a VM.

Avi, do you have a win7/8 machine?
Flags: needinfo?(avihpit)
I do, though my own builds crash randomly for some reason (unrelated to this bug); Nightly builds don't crash.

So if there is some procedure you want me to run/test on Windows 8.1 with Try builds, let me know what it is.
Flags: needinfo?(avihpit)
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(sphink)
Resolution: --- → WONTFIX