Closed
Bug 1371649
Opened 7 years ago
Closed 7 years ago
2.48 - 3.73% tp5n nonmain_normal_fileio (windows7-32) regression on push 2b11565d26fc3e84db455ecc46cab3fdf4cde59b (Thu Jun 8 2017)
Categories
(Core :: Networking: HTTP, defect, P2)
Tracking
RESOLVED
WORKSFORME
Release | Tracking | Status
---|---|---
firefox53 | --- | unaffected
firefox54 | --- | unaffected
firefox55 | --- | fix-optional
firefox56 | --- | affected
People
(Reporter: igoldan, Assigned: mayhemer)
References
Details
(Keywords: perf, regression, talos-regression, Whiteboard: [necko-triaged])
Talos has detected a Firefox performance regression from push:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=2b11565d26fc3e84db455ecc46cab3fdf4cde59b
Since you are the author of one of the patches included in that push, we need your help to address this regression.
Regressions:
4% tp5n nonmain_normal_fileio windows7-32 pgo e10s 355,510,048.50 -> 368,766,201.08
2% tp5n nonmain_normal_fileio windows7-32 opt e10s 356,332,714.04 -> 365,168,193.50
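As a quick sanity check, the headline percentages in the bug title follow directly from these before/after averages (plain arithmetic on the numbers above, not additional Talos output):

```python
# Recompute the regression percentages from the before -> after averages above.
pgo_before, pgo_after = 355_510_048.50, 368_766_201.08
opt_before, opt_after = 356_332_714.04, 365_168_193.50

pgo_pct = (pgo_after - pgo_before) / pgo_before * 100  # ~3.73%
opt_pct = (opt_after - opt_before) / opt_before * 100  # ~2.48%

print(f"pgo: +{pgo_pct:.2f}%  opt: +{opt_pct:.2f}%")
```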
You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=7154
On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.
To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests
For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running
*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***
Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Reporter
Updated•7 years ago
Reporter
Updated•7 years ago
Component: Untriaged → Networking: HTTP
Product: Firefox → Core
Reporter
Comment 1•7 years ago
:mayhemer Could you look over this bug and confirm the regression?
Flags: needinfo?(honzab.moz)
Assignee
Comment 2•7 years ago
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #1)
> :mayhemer Could you look over this bug and confirm the regression?
I highly doubt that bug 1369496 would increase IO operations that much (if my understanding of the regression is correct), since it only very slightly changes some internal HTTP networking scheduling.
Unfortunately I'm getting busier and busier right now, so I may not be able to look at it right away.
Comment 3•7 years ago
Looking at this, there is a clear regression that is consistent with the push for bug 1369496:
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,c0e52d7e6d0f31d629654c20aad3fec29640939f,1,1%5D&zoom=1496941416057.9004,1496945524394.4573,346067416.59720427,370325717.31681037
I also looked at a larger time scale; this is a test that has a lot of noise or many changes.
:mayhemer it is ok to be busy; could you find someone else to look at this? Maybe :mcmanus, as he reviewed it? Or maybe you have a better idea of when you can take a few hours to a day to look into this?
Assignee
Comment 4•7 years ago
Oh, according to comment 0, it looks like we are pretty certain it's the change from bug 1369496. I wrote comment 2 before I looked at it more closely...
Notes for me:
[16:13] jmaher: that is measured with the xperf toolchain (that is why it is win7 only- those are the only machines we have that toolchain installed on)
[16:14] jmaher: we run tp5 with a single page load (not repeated)
[16:14] jmaher: nonmain = non main thread
[16:14] jmaher: normal = normal operations, not startup
[16:14] jmaher: file IO = any file access (even to cached files)- that is read/write
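To make those notes concrete, the counter can be thought of as a filtered sum over xperf file-I/O events. The sketch below is illustrative only: it is not the actual talos etlparser, and the CSV layout and column names (event, tid, timestamp, bytes) are assumptions.

```python
import csv

def nonmain_normal_fileio(csv_path: str, main_tid: int, startup_end_ts: float) -> int:
    """Sum file-I/O bytes from a hypothetical xperf CSV export, keeping only
    non-main-thread events that happen after startup has finished."""
    total = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["event"] not in ("FileRead", "FileWrite"):
                continue                          # file IO = read/write only
            if int(row["tid"]) == main_tid:
                continue                          # nonmain = not the main thread
            if float(row["timestamp"]) < startup_end_ts:
                continue                          # normal = after startup
            total += int(row["bytes"])            # cached file access counts too
    return total
```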
Assignee: nobody → honzab.moz
Flags: needinfo?(honzab.moz)
Assignee
Comment 5•7 years ago
[16:18] mayhemer: jmaher: I presume there is no breakdown on the IO?
[16:20] jmaher: mayhemer: it is on my p3 list to make that easier to get that data; I can push to try and get a bunch of data by hacking up talos a bit; then analyzing the data with some scripts
Thanks!
Flags: needinfo?(jmaher)
Updated•7 years ago
status-firefox53: --- → unaffected
status-firefox54: --- → unaffected
status-firefox55: --- → affected
Updated•7 years ago
Whiteboard: [necko-active]
Comment 6•7 years ago
a few days late, but here is what I have:
https://docs.google.com/a/mozilla.com/spreadsheets/d/1QBnZ6lbOR3q_PDtwqY9oZq3ClUFMrNwZ5qgaPncJ7PM/edit?usp=sharing
a lot of font files are referenced after this patch.
Flags: needinfo?(jmaher)
Comment 7•7 years ago
Honza, wanna have a look at the info in Comment 6?
Flags: needinfo?(honzab.moz)
Assignee
Comment 8•7 years ago
(In reply to Mike Taylor [:miketaylr] (55 Regression Engineering Owner) from comment #7)
> Honza, wanna have a look at the info in Comment 6?
I'm aware, just busy.
Flags: needinfo?(honzab.moz)
Assignee
Comment 9•7 years ago
(In reply to Joel Maher ( :jmaher) from comment #6)
> a few days late, but here is what I have:
> https://docs.google.com/a/mozilla.com/spreadsheets/d/
> 1QBnZ6lbOR3q_PDtwqY9oZq3ClUFMrNwZ5qgaPncJ7PM/edit?usp=sharing
>
> a lot of font files are referenced after this patch.
Thanks for this! But to be honest, I'm more lost than before...
I can see that the HTTP cache IO (in cache2/entries) doesn't overlap AT ALL between the two runs. Are those runs made on exactly the same set of pages? Are the runs in different containers? Are they really the same changeset except for bug 1369496? Because otherwise I don't follow why there would be no overlap of HTTP cache entries at all - the entry file name is based on a |context+URL| hash. The context may contain the container id (the reason I'm asking); the URLs should remain the same?
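As an aside, the |context+URL| → file-name relation can be pictured roughly as below. This is an illustrative sketch only; the exact key format and hash used by the cache2 backend are not spelled out in this bug, and the origin-attribute suffix shown is just an example.

```python
import hashlib

def cache_entry_file_name(context: str, url: str) -> str:
    """Illustrative: derive a cache2-style entry file name from a hash of
    context + URL. If the context (e.g. a container suffix) differs between
    two runs, no entry file names will match even for identical URLs."""
    key = context + ":" + url
    return hashlib.sha1(key.encode("utf-8")).hexdigest().upper()

# Same URL, different context -> completely different entry file names:
# cache_entry_file_name("", "https://example.com/")
# cache_entry_file_name("^userContextId=1", "https://example.com/")
```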
The HTTP cache IO adds "only" around 4M IOs. It's about 17% of the overall difference. This is the only thing that my patch could have an influence on.
Then I can see +9M IOs on {cltbld}\AppData\Local\\{profile}\startupCache\startupCache.4.little
And 12.5M IOs on ttf files - no idea who works with these.
Sorry, but I'm less and less convinced this is caused by bug 1369496 at all.
Let's take bug 1369496 out from mozilla-central and see if something changes, what do you think?
Flags: needinfo?(jmaher)
Comment 10•7 years ago
I pushed a baseline from the tip of inbound to try, then again the same baseline + backout; I see this on the graph:
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Btry,c0e52d7e6d0f31d629654c20aad3fec29640939f,1,1%5D&highlightedRevisions=cbdffb52c6a2&highlightedRevisions=1e0399b52cc7&zoom=1498238586800.0247,1498239255000,346138119.21698123,383507138.2802308
for the baseline the data falls in the 368-372M range for 6 data points
for the baseline+backout, I see bi-modal data, 350-355M and 368-373M, so this seems to make a difference.
looking at the data for inbound, there is a large range of normal values:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=2592000&series=%5Bmozilla-inbound,c0e52d7e6d0f31d629654c20aad3fec29640939f,1,1%5D&selected=%5Bmozilla-inbound,c0e52d7e6d0f31d629654c20aad3fec29640939f,NaN,NaN,1%5D
Not fully conclusive - I will retrigger many more times on my try pushes. Possibly there is not much we can do here and we will have to accept it, but it would be nice to know what is going on either way.
Flags: needinfo?(jmaher)
Assignee
Comment 11•7 years ago
Does xperf provide call stacks showing where a file is being opened from? I think the font thing is the most interesting to figure out here.
Just a note: we also have a kill-switch pref to disable the throttling - network.http.throttle.enable. It may be interesting to check what effect that has.
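One way to try that locally is to drop the pref into the profile used for the run. A minimal sketch, assuming a plain Firefox/Talos profile directory; only the pref name comes from this comment, the helper itself is hypothetical:

```python
from pathlib import Path

def disable_http_throttling(profile_dir: str) -> None:
    """Append the throttling kill-switch pref to a profile's user.js
    (hypothetical helper; only the pref name is taken from this bug)."""
    user_js = Path(profile_dir) / "user.js"
    with user_js.open("a", encoding="utf-8") as f:
        f.write('user_pref("network.http.throttle.enable", false);\n')

# e.g. disable_http_throttling(r"C:\talos\profile")
```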
Comment 12•7 years ago
I am not sure what is meant by callstacks - I suspect the answer is no, as we are profiling inside the Windows OS, not inside Firefox.
Are you suggesting pushing to try with network.http.throttle.enable=False and comparing that to baseline?
Assignee
Comment 13•7 years ago
(In reply to Joel Maher ( :jmaher) from comment #12)
> I am not sure what is mentioned by callstacks-
A Windows term for a back trace or stack trace.
> I suspect the answer is no as
> we are profiling inside the windows OS, not related to firefox.
According to some xperf docs [1][2] it should be possible to get stack traces for IO events as well; you only need to set up symbols correctly.
>
> Are you suggesting pushing to try with network.http.throttle.enable=False
> and comparing that to baseline?
Maybe it can show some difference too, or at least it would be interesting to know. That pref flip should effectively disable the whole feature (landed/enabled in bug 1365307) that has only been slightly changed by the patch for the offending bug 1369496.
thanks!
[1] https://blogs.msdn.microsoft.com/ntdebugging/2008/04/03/windows-performance-toolkit-xperf/
[2] https://blogs.msdn.microsoft.com/ntdebugging/2009/08/17/using-xperf-to-investigate-slow-io-issues/
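For what it's worth, the kind of invocation those posts describe looks roughly like the sketch below (driven from Python here). Treat it as a hedged example: the kernel and stack-walk flag names should be verified with `xperf -help providers` / `xperf -help stackwalk` on the test machine, and the paths are placeholders.

```python
import os
import subprocess

# Symbols matter when decoding the stacks later in WPA/xperfview.
os.environ["_NT_SYMBOL_PATH"] = r"srv*C:\symbols*https://msdl.microsoft.com/download/symbols"

# Start kernel logging with file-I/O events plus stacks on read/write/create.
subprocess.run(
    ["xperf", "-on", "PROC_THREAD+LOADER+FILE_IO+FILE_IO_INIT",
     "-stackwalk", "FileRead+FileWrite+FileCreate"],
    check=True,
)

# ... run the tp5n scenario here ...

# Stop logging and merge the trace for analysis.
subprocess.run(["xperf", "-d", r"C:\traces\fileio.etl"], check=True)
```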
Comment 14•7 years ago
We looked at this in regression triage today... did anyone do the try push here? (comment 12)
Comment 16•7 years ago
I just did this:
https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&fromchange=e01e4c8941984b0c2b6884ea5eba35b8c5717bf5&tochange=174b70f6c4c723174989d24f670cec7842355ef6
I will follow up when results are available in a couple hours.
Assignee
Comment 17•7 years ago
Thanks Joel.
Comment 18•7 years ago
and I have data:
baseline: 359904066
pref: 349405002
this shows a difference of about 10.5M (359,904,066 − 349,405,002), which is similar in size to the regression reported.
Flags: needinfo?(jmaher)
Assignee
Comment 19•7 years ago
(In reply to Joel Maher ( :jmaher) (UTC-8) from comment #18)
> and I have data:
> baseline: 359904066
> pref: 349405002
>
> this shows about a 10MB difference which is similar to the regression
> reported.
So, there is a *lower* number of IO with the patch for bug 1369496 and the pref at "false"?
Comment 20•7 years ago
Yes, with the tip of trunk (assuming that includes the patch) and the pref = False, we see a roughly 10MB lower number of IO.
Assignee
Comment 21•7 years ago
(In reply to Joel Maher ( :jmaher) (UTC-8) from comment #20)
> yes, with the tip of trunk (assume that includes the patch) and the pref =
> False, we see a 10MB lower number in IO.
Thanks. Any chance of getting the stack traces for at least the ttf file opening? Do you have any idea where in the code we could be opening those files? Or who would know?
Comment 22•7 years ago
I am not sure how to do that - ideally it could be done by running on try with custom code? |try -b o -p win32 -u none -t xperf-e10s|
Comment 24•7 years ago
Should we consider backing this out at this point?
Assignee
Comment 25•7 years ago
(In reply to Mike Taylor [:miketaylr] (55 Regression Engineering Owner) from comment #24)
> Should we consider backing this out at this point?
There will be another change landing soon in that area of the code. I'd like to wait until then (or check before landing via a try push) to see if that's going to affect the regression here.
If it doesn't help, I'll investigate the reason.
Note that given the relation between the code changes and the character of this regression, it's not likely that the regression is directly caused by those changes; more likely they only uncover a bug somewhere else.
My patch that started this regression is part of the CPD project, which is an important Quantum bug. Hence, I am against backing it out.
Comment 26•7 years ago
We are about to have the 55 RC tomorrow. It looks like there are no actions at this moment. Marking 55 fix-optional.
Updated•7 years ago
status-firefox56: --- → affected
Comment 27•7 years ago
Bulk priority update: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P1
Comment 28•7 years ago
Keeping as P1. We've made significant changes since this was filed, so we should look again at how the perf is doing, and try to fix it if it's still present.
Updated•7 years ago
Whiteboard: [necko-active]
Updated•7 years ago
Priority: P1 → P2
Updated•7 years ago
Whiteboard: [necko-triaged]
Assignee
Comment 29•7 years ago
I have recently changed the code that caused this regression in bug 1386746. The new code is currently disabled via a pref, but I would like to check whether this regression is still present when the pref is switched.
To turn it on, change:
network.http.throttle.version = 2
Joel, can you please repush with that pref modified? Or just tell me (maybe again, I know) how and I will do it.
Thanks!
Flags: needinfo?(honzab.moz) → needinfo?(jmaher)
Comment 30•7 years ago
nonmain_normal_fileio appears to be fixed:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=76231f59448d&newProject=try&newRevision=6d9fa8b952303035c078c1c9bbb4fa049ded2f7e&framework=1
The _netio counters seem to have changed, but for the purposes of this regression there is no issue.
Flags: needinfo?(jmaher)
Assignee
Comment 31•7 years ago
Thanks a lot, Joel! I am about to run a pref-flip study on the new version of the code. If it is successful, I will switch to the new version and likely remove the old code.
ni? on myself so I don't lose this.
Flags: needinfo?(honzab.moz)
Assignee
Comment 32•7 years ago
fixed with bug 1444975
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(honzab.moz)
Resolution: --- → WORKSFORME