1005253 - 4.14% tart win7 regression on april 30 from rev e4be5203a3c9

Reporter

Description

•

11 years ago

I found this in a talos alert: we have a noticeable tart regression that seems to be specific to Windows 7. Here is a graph: http://graphs.mozilla.org/graph.html#tests=[[293,131,25]]&sel=none&displayrange=7&datatype=running I did some retriggers and this is the real deal: https://tbpl.mozilla.org/?tree=Mozilla-Inbound&fromchange=2cecc0699b45&tochange=44e189f0fd67&jobname=Windows%207%2032-bit%20mozilla-inbound%20talos%20svgr Tart is documented here: https://wiki.mozilla.org/Buildbot/Talos/Tests#TART It is pretty easy to push any fix to try and see whether it works. As always, if you need help or more information, please ask in the bug.

Andrew McCreight [:mccr8]

Assignee

Comment 1

•

11 years ago

That's pretty odd, especially that it is on a single platform. I'm surprised the cycle collector runs at all during this test. I guess I'll have to run it locally and see what it is even doing. I mean, in theory we could just suppress CC during tab animation, but hopefully we can avoid that.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 2

•

11 years ago

In the tart test we do a garbage collect between *tests*, so there is a chance that this is called for this test. You can get talos locally here: https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code or you can make tart a .xpi and use it as an addon.

Andrew McCreight [:mccr8]

Assignee

Comment 3

•

11 years ago

Where in the source code does that garbage collection happen?

Andrew McCreight [:mccr8]

Assignee

Comment 4

•

11 years ago

My theory is that the test harness is forcing a GC, but not a CC. Without ICC, this will trigger a CC 5 seconds after the GC, but if we're done with the test in that time, there will be no CC during the test. With ICC, we start doing CC work sooner, after a second or two, so it may start to overlap with the test run. The solution would be to force a synchronous CC after the GC, which will reset the timer, so we will do a GC in 5 seconds, thus hopefully keeping collector activity from interfering with the test.
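The timing argument in this comment can be sketched as a simple interval-overlap check. This is only an illustrative model, not the actual scheduler: the function name, the test window (0.5 s to 4.0 s after the forced GC) and the CC duration are hypothetical values chosen for the example; only the 5 second and "a second or two" delays come from the comment.

```python
def cc_overlaps_test(gc_time, cc_delay, cc_duration, test_start, test_end):
    """Return True if a CC scheduled cc_delay seconds after a GC
    overlaps the half-open test window [test_start, test_end)."""
    cc_start = gc_time + cc_delay
    cc_end = cc_start + cc_duration
    # Two intervals overlap when each one starts before the other ends.
    return cc_start < test_end and cc_end > test_start


# Hypothetical numbers: GC forced at t=0, test runs from 0.5s to 4.0s,
# and the CC itself takes 0.1s.
without_icc = cc_overlaps_test(0.0, 5.0, 0.1, 0.5, 4.0)  # CC 5s after GC
with_icc = cc_overlaps_test(0.0, 2.0, 0.1, 0.5, 4.0)     # CC ~2s after GC

print("overlap without ICC:", without_icc)
print("overlap with ICC:", with_icc)
```

Under these assumed numbers, moving the CC from 5 seconds to 2 seconds after the GC is enough to move it from after the test window into it.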

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 5

•

11 years ago

Here is where we do the gc: http://hg.mozilla.org/build/talos/file/tip/talos/pageloader/chrome/pageloader.js#l412 A test is a page as defined here: https://hg.mozilla.org/build/talos/file/b4907f0b27d3/talos/page_load_test/tart/tart.manifest Unfortunately for tart, that is a single page. I would be happy to cc when we gc, but I would want to be careful that we don't affect other tests or move our testing further away from the real world. That said, a little thought before doing something across the board isn't too hard.

Andrew McCreight [:mccr8]

Assignee

Comment 6

•

11 years ago

Thanks! Unfortunately for my theory, that already does a cycle collection.

Andrew McCreight [:mccr8]

Assignee

Updated

•

11 years ago

Assignee: nobody → continuation

Andrew McCreight [:mccr8]

Assignee

Comment 7

•

11 years ago

I looked at this locally on OSX. It does open and close a bunch of tabs, running for five minutes. We spend only a second or two of the 5 minute run in the CC. Maybe something terrible is happening on Win7 in particular.

Andrew McCreight [:mccr8]

Assignee

Comment 8

•

11 years ago

On Win7 we're still spending less than 10 seconds in cycle collection across all of the Talos tests in this group, and there are no dramatic differences I can see between Win7, Win8 and WinXP on some basic CC time metrics.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 9

•

11 years ago

The code change (https://hg.mozilla.org/integration/mozilla-inbound/rev/ac6c395d9364) looks to be more related to network code cleanup than to cycle collection, at least from my naive point of view.

Andrew McCreight [:mccr8]

Assignee

Comment 10

•

11 years ago

From the retriggers you listed in comment 0, it looks like the previous push ( https://hg.mozilla.org/integration/mozilla-inbound/rev/e4be5203a3c9 ) is the one where TART regresses from around 7.0 to around 7.4. Am I misinterpreting that?

Andrew McCreight [:mccr8]

Assignee

Comment 11

•

11 years ago

I'm talking about the pushes in https://tbpl.mozilla.org/?tree=Mozilla-Inbound&fromchange=2cecc0699b45&tochange=44e189f0fd67&jobname=Windows%207%2032-bit%20mozilla-inbound%20talos%20svgr I have no idea how to interpret that graphs page.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 12

•

11 years ago

Oh, you are right: https://hg.mozilla.org/integration/mozilla-inbound/rev/e4be5203a3c9 Title updated. Should we push to try with this backed out to see if that helps?

Andrew McCreight [:mccr8]

Assignee

Updated

•

11 years ago

Summary: 4.14% tart win7 regression on april 30 from rev ac6c395d9364 → 4.14% tart win7 regression on april 30 from rev e4be5203a3c9

Andrew McCreight [:mccr8]

Assignee

Comment 13

•

11 years ago

(In reply to Joel Maher (:jmaher) from comment #12)
> Should we push to try with this backed out to see if that helps?

I can do that once try is working again.

Andrew McCreight [:mccr8]

Assignee

Comment 14

•

11 years ago

try run with ICC disabled: https://tbpl.mozilla.org/?tree=Try&rev=fbae39857d7b

Andrew McCreight [:mccr8]

Assignee

Comment 15

•

11 years ago

How do I see what the TART score is in my push in comment 14? The datazilla link in the try push just hangs as far as I can tell.

Avi Halachmi (:avih)

Comment 16

•

11 years ago

(In reply to Andrew McCreight [:mccr8] from comment #15)
> How do I see what the TART score is in my push in comment 14? The datazilla link in the try push just hangs as far as I can tell.

Datazilla has been very slow for the past few days, so apart from waiting (up to a few minutes) and hoping, there is not much you can do. FWIW, I got the link for your win7 tart run showing the data after about a minute: https://datazilla.mozilla.org/?start=1399422132&stop=1400026932&product=Firefox&repository=Try-Non-PGO&os=win&os_version=6.1.7601&test=tart&graph_search=fbae39857d7b&x86_64=false&project=talos

You can also see the graph server graphs. The summary section at the bottom of tbpl (once you click the test code/char at the top) shows items of these types:
- datazilla: <test-name> <-- datazilla link
- <test-name> : "score" <-- graph server link

The graph server score is the one which generates the regression reports, and is the average of all the test's sub results (TART has about 30 sub results). Graph server doesn't allow you to view individual sub results, while datazilla does.
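As a rough illustration of the "average of all the test's sub results" described in this comment, here is a minimal sketch. The function name, the sub result names and the values are all hypothetical, and it assumes a plain arithmetic mean; the real graph server summarization may differ in details.

```python
from statistics import mean


def graph_server_score(sub_results):
    """Collapse per-sub-test values into one score (plain arithmetic mean)."""
    return mean(sub_results.values())


# Hypothetical sub results; a real TART run reports about 30 of them.
example_sub_results = {
    "sub_result_a": 6.0,
    "sub_result_b": 8.0,
    "sub_result_c": 10.0,
}

score = graph_server_score(example_sub_results)
print("score:", score)
```

A single averaged score like this hides which sub result moved, which is why the per-sub-result view in datazilla is useful when chasing a regression.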

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 17

•

11 years ago

Just click on the 's' on tbpl and the details panel below has the score. For Win7 it is 7.57. What is your base tree for the push? We need to compare against the revision that you pulled from for the best comparison.

Andrew McCreight [:mccr8]

Assignee

Comment 18

•

11 years ago

The base revision is this: https://hg.mozilla.org/try/rev/7f5a8526b55a I did add some printing, too, so I guess that could cause a problem. I'll try a cleaner push and see what happens there. 7.57 still sounds like it is a regressed value.

> just click on the 's' on tbpl and the details panel below has the score, For Win7 it is 7.57.

Hmm. I don't see that for some reason. When I click on the 's', I see "TinderboxPrint: mozharness_revlink: https://hg.mozilla.org/build/mozharness/rev/8b81f2f286de" and some datazilla links, but not that. It looks like I can search in the log for "process_Request line: tart" to get the results, though.

Andrew McCreight [:mccr8]

Assignee

Comment 19

•

11 years ago

I did a new clean push: https://tbpl.mozilla.org/?tree=Try&rev=d3b580d8dacd This is against revision 93e03b8c127e.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 20

•

11 years ago

With the clean push, this is posting a TART score of 7.48->7.92 (really 7.51 if you ignore the 7.92 outlier). What branch is 93e03b8c127e on? m-c? We have had some tart regressions lately: http://graphs.mozilla.org/graph.html#tests=[[293,94,25]]&sel=none&displayrange=30&datatype=running I believe you are based off m-c from May 12th, which has a value of 8.0 for tart, so if you are posting 7.5 then we are seeing a win!

Andrew McCreight [:mccr8]

Assignee

Comment 21

•

11 years ago

Yeah, this is the revision I was comparing to: https://tbpl.mozilla.org/?rev=93e03b8c127e ICC disabled: 7.92, 7.51, 7.54, 7.48 ICC enabled: 8.28, 8.36, 8.04 So this confirms that we're still seeing a regression with ICC enabled. One change ICC makes is that we schedule CCs sooner after GCs, even when the CCs are short, so this can potentially cause us to do slightly more GCs and CCs. I'll see if changing that helps.
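The size of the gap between the two sets of runs can be worked out with a few lines of arithmetic. The sketch below uses only the seven scores quoted in this comment; the variable names and the choice of mean and median as summary statistics are assumptions made for illustration, and in this thread a higher TART score is treated as a regression.

```python
from statistics import mean, median

# TART scores quoted in this comment.
icc_disabled = [7.92, 7.51, 7.54, 7.48]
icc_enabled = [8.28, 8.36, 8.04]


def percent_increase(baseline, value):
    """Relative increase of value over baseline, in percent."""
    return (value - baseline) / baseline * 100.0


mean_change = percent_increase(mean(icc_disabled), mean(icc_enabled))
# The median is less sensitive to the 7.92 outlier in the disabled runs.
median_change = percent_increase(median(icc_disabled), median(icc_enabled))

print(f"mean:   {mean(icc_disabled):.3f} -> {mean(icc_enabled):.3f} ({mean_change:+.1f}%)")
print(f"median: {median(icc_disabled):.3f} -> {median(icc_enabled):.3f} ({median_change:+.1f}%)")
```

With so few runs this is only a sanity check, but every ICC-enabled score is above every ICC-disabled score, including the outlier.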

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 22

•

11 years ago

At least we are getting to the bottom of this! Whether or not this is straightforward to fix is another story :)

Andrew McCreight [:mccr8]

Assignee

Comment 23

•

11 years ago

I still have no idea what is happening.

Andrew McCreight [:mccr8]

Assignee

Updated

•

11 years ago

Depends on: 1011137

Avi Halachmi (:avih)

Comment 24

•

11 years ago

(In reply to Andrew McCreight [:mccr8] from comment #1)
> I mean, in theory we could just suppress CC during tab animation, but hopefully we can avoid that.

BTW, why do you hope to be able to avoid it? There is probably an API for it already, the start and end points are well defined, the duration is short and known in advance, and UI animations (tabs or otherwise) are quite important for us to keep smooth. Wouldn't it make sense to explicitly suppress CC when we know we're entering such a phase, rather than having possibly complex heuristics which hopefully end up with the same result?

I'm teasing, but only partially. I'd also prefer automatic behavior which covers all our cases, and I do know that explicit triggers could be a slippery slope, but still, I'd consider it more than a theoretical option. Your call though (going the explicit way would have several more implications very soon IMO, e.g. for the customize animation and menu animations).

Andrew McCreight [:mccr8]

Assignee

Comment 25

•

11 years ago

It would certainly be nice to suppress CC during animations, but that won't help during TART, which runs for 5 minutes, and we really don't want to suppress CC for 5 minutes at a time, in general. I have an early patch in progress that would make us be less janky during painting, which should help, but again, I think that won't help TART, which I have the impression is really more of a throughput test than a jankiness test.

Avi Halachmi (:avih)

Comment 26

•

11 years ago

The overall talos test run of tart, which repeats the same test sequence 25 times (where each sequence includes dozens of animations), takes a few minutes. But during each test run, TART does give Firefox time to recover. Any animation which TART measures is preceded by a wait of at least 500ms, intentionally to allow CC and other cleanups to complete before the next animation starts. The explicit suppression I suggested in comment 24 was for UI animations in general (and specifically tabs as well), not for the TART run. And if, following such a change, it turns out that TART's (unrealistic, after all) method of invoking tab animations clashes with this explicit suppression, then we could always change TART or refine the suppression, whichever we think would be better.

Avi Halachmi (:avih)

Comment 27

•

11 years ago

(In reply to Avi Halachmi (:avih) from comment #26)
> But during each test run, TART does give Firefox time to recover. Any animation which TART measures is preceded with a wait of at least 500ms - intentionally to allow CC and other cleanups to complete before the next animation starts.

Or at the very least, I hope it's enough time to recover. If it's not, then we should change TART to wait longer. After all, TART is invoking tab animations at a rate which is not expected from users in practice, and the heuristics in place surely expect some kind of behavior pattern in order to perform as expected. If TART happens to trigger an unoptimized code path, then we should fix TART. I'm open to any suggestion or info which can improve TART.

Andrew McCreight [:mccr8]

Assignee

Comment 28

•

11 years ago

> Any animation which TART measures is preceded with a wait of at least 500ms - intentionally to allow CC and other cleanups to complete before the next animation starts.

Ah, thanks for the explanation. Yeah, that will be a little sensitive to changes in collector scheduling. The collectors will run, say, every 10 seconds, so I could imagine that tweaking how they are run could make them end up running during a run, whereas they were not before. I'll think about how to see if that is happening.

Andrew McCreight [:mccr8]

Assignee

Comment 29

•

11 years ago

If my reading of the results of a TART run on try is right, bug 1011137 should fix this. Essentially, ICC was running a CC two seconds earlier. I suspect this caused us to CC during the TART test, when we weren't before. My patch makes the behavior more like non-ICC: instead of running a CC sooner, we run a GC sooner, but only if the ICC is long, basically. I think that's an improvement, even ignoring the TART regression issue. If I make some future changes to CC scheduling that mess with TART again, I'll try coming up with a more robust fix, but I think it is okay for now.

Avi Halachmi (:avih)

Comment 30

•

11 years ago

(In reply to Andrew McCreight [:mccr8] from comment #29)
> ... but I think it is okay for now.

Many thanks. And what about the idea of CC/GC hints during UI animations? We currently don't have tons of them (probably not more than 10), and it would remove the dependency on heuristics refinements. Not just for TART but, more importantly, for the users. The heuristics would still be useful; they would just have better knowledge of when we'd really prefer not to trigger GC/CC.

Andrew McCreight [:mccr8]

Assignee

Comment 31

•

11 years ago

It looks like there was a pretty clear improvement when bug 1011137 landed, but there's no message on tree-management about it: http://graphs.mozilla.org/graph.html#tests=[[293,131,25]]&sel=none&displayrange=7&datatype=running Maybe the regression from OMTC confused things.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 32

•

11 years ago

Yeah, there are enough data points to show the improvement as sustained. I would expect an improvement email; maybe there is an issue with the regression emailer script :) We are almost back to 100% now, probably just a slight bit off. I am inclined to mark this as fixed.

Andrew McCreight [:mccr8]

Assignee

Comment 33

•

10 years ago

Fixed by bug 1011137.

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → WORKSFORME

Avi Halachmi (:avih)

Updated

•

10 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1017055