Closed
Bug 936617
Opened 11 years ago
Closed 10 years ago
Add talos regression test for "cold" startups
Categories
(Core :: General, defect, P1)
Core
General
Tracking
()
RESOLVED
FIXED
People
(Reporter: vladan, Assigned: jmaher)
References
(Blocks 1 open bug)
Details
Attachments
(1 file, 3 obsolete files)
(deleted),
patch
|
dminor
:
review+
|
Details | Diff | Splinter Review |
We need a cold startup benchmark that takes into account I/O read costs.
Talos ts_paint only measures hot startups, i.e. the time to re-open the browser after most of its code and data have already been cached in memory.
Variations in I/O op times and the effects of OS pre-fetching will have to be accounted for.
Reporter | ||
Updated•11 years ago
|
Blocks: start-faster
Assignee | ||
Comment 1•11 years ago
|
||
this will be tricky with talos due to the restarting and repeated runs of this. we might have to trick buildbot/mozharness somehow if we are rebooting- not a trivial task.
Reporter | ||
Updated•11 years ago
|
Assignee: nobody → rvitillo
Comment 2•11 years ago
|
||
Instead of rebooting, we could use the NtSetSystemInformation API on Windows to clear the Standby List (see http://processhacker.sourceforge.net/doc/memlists_8c_source.html), i.e. flushing the page cache. We could do something similar on OSX and Linux.
Comment 3•11 years ago
|
||
Here are some numbers for firstPaint under various scenarios. To empty the file cache I used NtSetSystemInformation on Windows, purge on OSX and “sync && echo 3 > /proc/sys/vm/drop_caches” on Linux.
Windows (reboot):
mean: 1152
sd: 48
Windows (cache cleared):
mean: 1228
sd: 22
Windows (hot):
mean: 765
sd: 8
Linux (reboot):
mean: 859
sd: 22
Linux (cache cleared):
mean: 961
sd: 27
Linux (hot):
mean: 678
sd: 12
OSX (reboot):
mean: 1835
sd: 18
OSX (cache cleared):
mean: 2570
sd: 23
OSX (hot)
mean: 1737
sd: 13
The following observation can be made (measurements come from different machines):
- dropping the cache increases the average startup time on all OSs since also libraries that are loaded at startup by the system are purged.
- Even though the standard deviation is slightly different under the various scenarios, the difference in the mean is statistically significant.
If you think that the numbers are acceptable, we can proceed to encapsulate the logic in a python module so that it can be used by talos.
Flags: needinfo?(vdjeric)
Flags: needinfo?(jmaher)
Assignee | ||
Comment 4•11 years ago
|
||
for windows and linux the reboot vs cache cleared are similar, but osx is way off. Also the stddev (I assume that is sd) seems much higher, I assume that translates into more noise?
It would be interesting to see this run on our automation machines.
Flags: needinfo?(jmaher)
Comment 5•11 years ago
|
||
Joel, looking at reboot vs hot on OSX, it would seem that OSX preloads at startup quite many libraries. In any case the purge utility, which comes with the OS, seems to be the correct way to flush the file cache. Do you have something better in mind?
The difference in standard deviations for OSX are not particularly "big" imho, at least in statistical terms. An F-test fails to reject the hypothesis that the standard deviations are equal with alpha=0.05. On the other hand, a small increase in noise might be a reasonable tradeoff if we can account also for IO timings.
Flags: needinfo?(jmaher)
Assignee | ||
Comment 6•11 years ago
|
||
So my concern is that we are showing a real cold startup and measuring it, but will users ever experience this? I understand we won't get 100% accurate, this seems a bit distorted- although it is better than hot startup only.
Flags: needinfo?(jmaher)
Updated•11 years ago
|
Whiteboard: [good first bug][mentor=rvitillo][lang=python]
Assignee | ||
Comment 7•11 years ago
|
||
This is a great good_first_bug.
To get started with talos here is a brief primer:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code
To get started with Mercurial at Mozilla, here are some tips:
https://developer.mozilla.org/en-US/docs/Mercurial_FAQ
If you are considering this bug, please go ahead and start working on it. You can ask questions in the bug or on irc in #perf.
Reporter | ||
Comment 8•11 years ago
|
||
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #3)
> If you think that the numbers are acceptable, we can proceed to encapsulate
> the logic in a python module so that it can be used by talos.
- I would have expected shorter startup times after rebooting Windows vs after flushing the cache. How long did you wait after a Windows reboot before starting Firefox? Windows Prefetch will pre-emptively read program code+data after each startup if it notices the user frequently launches the program after a reboot.
- Does the NtSetSystemInformation call clear both standby and active pages in the disk cache?
(In reply to Joel Maher (:jmaher) from comment #6)
> So my concern is that we are showing a real cold startup and measuring it,
> but will users ever experience this? I understand we won't get 100%
> accurate, this seems a bit distorted- although it is better than hot startup
> only.
I don't think users would ever see Firefox startups where their disk cache is completely empty, but they would experience similarly slow Firefox startups when Firefox code+data isn't cached, when the disk I/O bandwidth is saturated, or the system is under memory pressure and swapping. Note also that our existing hot startup test doesn't occur often in the wild either.
This cold test is useful because it assigns a cost to all the I/O needed to start Firefox. The absolute numbers won't necessarily be meaningful, but it should be able to catch real regressions/improvements. Noise will be an issue because there will be variations in disk drive performance between different test machines & test runs.
Flags: needinfo?(vdjeric) → needinfo?(rvitillo)
Reporter | ||
Comment 9•11 years ago
|
||
(In reply to Vladan Djeric (:vladan) from comment #8)
> - Does the NtSetSystemInformation call clear both standby and active pages
> in the disk cache?
You can verify this by looking at the "Metafile" row in the RamMap utility: http://technet.microsoft.com/en-ca/sysinternals/ff700229.aspx
Comment 10•11 years ago
|
||
(In reply to Vladan Djeric (:vladan) from comment #8)
> (In reply to Roberto Agostino Vitillo (:rvitillo) from comment #3)
> > If you think that the numbers are acceptable, we can proceed to encapsulate
> > the logic in a python module so that it can be used by talos.
>
> - I would have expected shorter startup times after rebooting Windows vs
> after flushing the cache. How long did you wait after a Windows reboot
> before starting Firefox?
I waited 1 minute.
> Windows Prefetch will pre-emptively read program
> code+data after each startup if it notices the user frequently launches the
> program after a reboot.
Do you know at what point Prefetch will actually load the the program into memory? In a few trials I noticed that if I wait long enough, e.g. more than 10 minutes, then the startup time can be lower but that's not always the case.
> - Does the NtSetSystemInformation call clear both standby and active pages
> in the disk cache?
It clears only the standby pages.
Flags: needinfo?(rvitillo)
Reporter | ||
Comment 11•11 years ago
|
||
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #10)
> > - Does the NtSetSystemInformation call clear both standby and active pages
> > in the disk cache?
> It clears only the standby pages.
I think you can use NtSetSystemInformation to clear active pages as well.
The CacheSet utility clears the active cache http://technet.microsoft.com/en-ca/sysinternals/bb897561.aspx
And apparently it uses NtSetSystemInformation as well http://www.autohotkey.com/board/topic/49114-replicating-cachesets-sysinterals-functionality-in-ahk/page-2
Comment 12•11 years ago
|
||
Joel, I uploaded a python module to pypi that allows to flush the disk cache:
To install it:
pip install cache_flusher
To use it:
import cache_flusher
cache_flusher.flush()
The flush() method required administrative privileges. Also, the python module comes with a Windows executable (see https://github.com/vitillo/Flush).
How do you suggest to proceed to add the module as dependency to talos?
Flags: needinfo?(jmaher)
Assignee | ||
Comment 13•11 years ago
|
||
we have a setup.py in talos that has dependencies: http://hg.mozilla.org/build/talos/file/b4907f0b27d3/setup.py#l13.
If we want to depend on this, we need to upload it to the internal pypi server (similar to bug 976797). Depending on the code, etc.
I recommend getting the module uploaded to pypi and adding a dependency to setup.py (with a specific version). Then we can test on try and figure out what else is required (i.e. root access, windows binary compatibility, etc.).
We would also need to integrate this into talos, there are setup and cleanup functions that the test can specify:
http://hg.mozilla.org/build/talos/file/tip/talos/xperf.config#l20
Flags: needinfo?(jmaher)
Comment 14•11 years ago
|
||
Attachment #8423199 -
Flags: review?(jmaher)
Assignee | ||
Comment 15•11 years ago
|
||
Comment on attachment 8423199 [details] [diff] [review]
Add talos regression test for "cold" startups, v1
Review of attachment 8423199 [details] [diff] [review]:
-----------------------------------------------------------------
I assume this worked for you locally. Please advise on which platform we should run this on, and I can get started playing with this in staging.
Attachment #8423199 -
Flags: review?(jmaher) → review+
Comment 16•11 years ago
|
||
(In reply to Joel Maher (:jmaher) from comment #15)
> Comment on attachment 8423199 [details] [diff] [review]
> Add talos regression test for "cold" startups, v1
>
> Review of attachment 8423199 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> I assume this worked for you locally. Please advise on which platform we
> should run this on, and I can get started playing with this in staging.
It would be great if you could test it on Windows since that is the platform I do expect more trouble.
Assignee | ||
Comment 17•11 years ago
|
||
did a few retriggers once I got the database entries:
https://tbpl.mozilla.org/?tree=Try&rev=dc4ad268fdb2
this looks to be reliable and the numbers are stable (all seem to have 1 outlier and the rest in a +- 10ms range)
Assignee | ||
Comment 18•11 years ago
|
||
a full push is up here: https://tbpl.mozilla.org/?tree=Try&rev=60bbe32332e5 (please ignore the first 4 'o' pushes on linux32)
linux32:
ts_paint: 916.42, 927.11, 899.53, 903.53
ts_cold: 881.79, 882.47, 917.42, 899.58
OSX 10.6:
ts_paint: 1075.53, 1092.05, 1118.37, 1071.37
ts_cold: 1077.68, 1086.00, 1080.74, 1075.79
linux64:
ts_paint: 826.16, 833.05, 823.11, 820.63
ts_cold: 836.74, 846.95, 847.84, 829.74
Win7:
ts_paint: 836.16, 833.68, 883,11, 834.74
ts_cold: 833.95, 834.11, 856.68, 830.47
Win8:
ts_paint: 824.58, 839.74, 839.42, 837.37
ts_cold: 862.26, 853.89, 847.00, 825.11
WinXP:
ts_paint: 694.32, 673.58, 684.00, 683.26
ts_cold: 686.63, 684.37, 687.37, 688.05
OSX 10.8:
ts_paint: 832.32, 861.58, 1447.53, 844.32
ts_cold: 11880.79, 11416.26, 14249.42, 12352.68
with the exception of osx 10.8, all the numbers are almost identical. I am wondering if we are really running anything here, or if Firefox is just that good :)
Comment 19•11 years ago
|
||
We get all kinds of errors:
sh: 1: cannot create /proc/sys/vm/drop_caches: Permission denied (on Linux)
The system cannot find the path specified. (on Windows)
sh: purge: command not found (on OSX)
I wonder why the measurements are so different on OSX between ts_paint and ts_cold if the purge command is not execute anyway. First things first, we need to have administrative privileges to run python module, how can we do that Joel?
Flags: needinfo?(jmaher)
Assignee | ||
Comment 20•11 years ago
|
||
chatting in irc, lets focus on windows 7 first and get that going.
Roberto can you outline the specific binary that needs to run? We can figure out where the specific location is?
We can then work with the ops team to make this work. We solved this for xperf on win7 to have admin privileges.
Flags: needinfo?(jmaher)
Comment 21•10 years ago
|
||
Attachment #8423199 -
Attachment is obsolete: true
Comment 22•10 years ago
|
||
The binary is located in site-packages/cache_flusher-1.0.3-py2.7.egg/cache_flusher/data/Flush.exe where the location of site-packages is platform dependent. With the last changes I made to the cache_flusher module the location shouldn't matter, we just need admin privileges.
Assignee | ||
Comment 23•10 years ago
|
||
Q: how could we get Flush.exe to run in admin mode? I know we did some magic for xperf (which is win7 only), I would like to figure out a way to do that again (ideally all platforms, but Win7 for now)
Flags: needinfo?(q)
Assignee | ||
Comment 24•10 years ago
|
||
Q, any next steps we can take here?
Comment 25•10 years ago
|
||
Joel, My apologies this got lost in a big shuffle I think we can use the same elevated privileges fix as xperf. Is this only called inside the harness?
Flags: needinfo?(q)
Assignee | ||
Comment 26•10 years ago
|
||
Q- this is a binary that is installed in the virtualenv as a dependent package (python package with a binary) but we only call it from the talos harness (just like we only call xperf from the talos harness).
I do believe the same xperf stuff will work in this case.
Assignee | ||
Comment 27•10 years ago
|
||
Q- I just realized that maybe this might be done with the xperf solution? I can churn on this quickly, just let me know when you want to test and I can retrigger jobs quickly or do it on a loaner.
Flags: needinfo?(q)
Updated•10 years ago
|
Mentor: rvitillo
Whiteboard: [good first bug][mentor=rvitillo][lang=python] → [good first bug][lang=python]
Updated•10 years ago
|
Priority: -- → P1
Comment 28•10 years ago
|
||
With the escalation of this to P1 I will jump on working this fix in today for testing.
Flags: needinfo?(q)
Reporter | ||
Comment 29•10 years ago
|
||
So I discovered today that there used to be a "ts_cold" cold startup test. Roberto, have you seen these bugs? How many of these objections are applicable to your approach?
- bug 510587: Implement Mac/Linux cold startup Talos test
- bug 522807: Implement Windows cold startup test
- bug 532965: Mac cold startup is not cold enough
- Bug 619536: Turn off ts_cold tests for all platforms
Flags: needinfo?(rvitillo)
Comment 30•10 years ago
|
||
Even though it seems that none of the previous approaches used NtSetSystemInformation, the effects might be comparable. The main concern back then was noise but we never really got to the point where we tried my python module out. On my machine the noise didn't look too bad but the only way to know for sure is to actually run ts_cold on our talos machines.
Also, in Bug 532965, valid concerns are made about the OSX purge utilty not flushing the libraries which can be circumvented by either mounting and unmounting a volume, or by requesting 2x physical memory from the virtual memory pager and then touching every page. The latter approach though being considerably slow while the former instead might not be feasible easily on talos boxes.
My suggestion here is that before we move to more complicated solutions we first check if noise is a problem with what we have right now. If libraries are not flushed and we have already too much noise, it might not be worth to continue working on this. We might as well just be happy with the number of bytes read/written at startup that xperf can give us.
Flags: needinfo?(rvitillo)
Comment 31•10 years ago
|
||
This should now be set to inherit "Highest permissions" from the initial harness launch in win 7. Is someone available to run this through on a test system and aim me a the result to make sure we are including all the sub processes in the privilege escalation ?
Assignee | ||
Comment 32•10 years ago
|
||
I have retriggered a job that will run ts_paint_cold on windows 7, lets see how this works!
Assignee | ||
Comment 33•10 years ago
|
||
I have triggered this and have a log:
https://tbpl-dev.allizom.org/php/getParsedLog.php?id=43274581&tree=Try&full=1
I am not sure how to determine if this is working or not. Roberto, can you help me out to see if you can determine what is working/not working based on the above log?
Flags: needinfo?(rvitillo)
Comment 34•10 years ago
|
||
Joel, I can't spot the output generated by Flush.exe which should be either: "Failure to set privileges" or "Flush FileCache WorkingSet : ok". Does the log from talos include also the output generated by executables launched in a subshell through os.system?
Flags: needinfo?(rvitillo) → needinfo?(jmaher)
Assignee | ||
Comment 35•10 years ago
|
||
it doesn't, but I can try to get it printed out!
Flags: needinfo?(jmaher)
Assignee | ||
Comment 36•10 years ago
|
||
here is what I get locally while trying to print it out:
https://pastebin.mozilla.org/5533148
I am not sure we have any logging printed.
Comment 37•10 years ago
|
||
Joel, the logging varies by OS since I am using different tools; on Linux nothing is reported.
Assignee | ||
Comment 38•10 years ago
|
||
I haven't been able to get any data from it :( I have tried printing stdout,stderr- hard to validate if it is working- tomorrow i will test it with talos on my win7 vm.
Assignee | ||
Comment 39•10 years ago
|
||
I have a windows 7 loaner machine and I have some more information on this.
While running this on the loaner, I see a popup dialog for MSVCR110.dll. I found an easy solution to this:
http://www.microsoft.com/en-us/download/details.aspx?id=30679#
I downloaded the redistributable, installed it and flush worked as one would expect.
options:
1) get flush.exe (https://github.com/vitillo/python_cache_flusher/tree/master/cache_flusher/data) to be build without needing a redistributable
2) install the msvcr110.dll on the system.
We should look at both options. I think 1 would be most logical, but what if there is yet another file missing in the future?
Flags: needinfo?(rvitillo)
Flags: needinfo?(q)
Assignee | ||
Comment 40•10 years ago
|
||
for reference, here are some results:
https://pastebin.mozilla.org/5545400
Comment 41•10 years ago
|
||
Joel,
Thanks for testing it. I don't think we are going to require any other file in the future. I compiled that dll statically, could you please double check that it works "out of the box" for you now? You can upgrade cache_flusher through pip, once I have your OK I will ask to upload that package to our repository.
The numbers look stable enough to me but without a non-cold comparison on the same machine it's hard to say.
Flags: needinfo?(rvitillo) → needinfo?(jmaher)
Assignee | ||
Comment 42•10 years ago
|
||
I tested the new cache-flusher bits from pypi and this worked on my loaner. Knowning this, we will get these on the internal pypi server, then we can test on try- stay tuned :)
Flags: needinfo?(q)
Flags: needinfo?(jmaher)
Assignee | ||
Comment 43•10 years ago
|
||
cache flushing is working on try:
https://tbpl.mozilla.org/php/getParsedLog.php?id=43832469&tree=Try&full=1
I have requested some retriggers, lets see how stable this is and get a better idea of time it takes to run.
right now it is taking ~10 minutes to run this test.
Once we get more results we can discuss next steps.
Thanks :rvitillo and :Q for your help on this so far.
Updated•10 years ago
|
Mentor: rvitillo
Comment 45•10 years ago
|
||
The current state is that the noise on the Talos box is just too high to get anything meaningful out of it. Running ts_paint_cold on my local machine (or Joel's for that matter) seems to exhibit though a much stabler behavior. That might be due to the fact that I am using an SSD.
Assignee | ||
Comment 46•10 years ago
|
||
my local computer is SSD only as well. Maybe this is not something to embark on as part of talos?
Comment 47•10 years ago
|
||
Unfortunately the noise can't be reduced with defragmentation or by mounting an ISO image so this seems to be as good as it gets. Also note that talos machines are defragmented weekly.
Assignee | ||
Comment 48•10 years ago
|
||
The question becomes, do we move forward with this, or do we forget about this?
Comment 49•10 years ago
|
||
I suggest to run this test for a longer period of time, say a month, and see if we can learn anything from it.
Updated•10 years ago
|
Assignee: rvitillo → nobody
Whiteboard: [good first bug][lang=python]
Assignee | ||
Comment 50•10 years ago
|
||
Assignee | ||
Comment 51•10 years ago
|
||
we did some up front work on this, I envision this running in the xperf job since this needs to run on windows 7 only. I have tested this on try for about 50 retriggers and it is green.
Assignee: nobody → jmaher
Attachment #8430839 -
Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #8502022 -
Flags: review?(wlachance)
Assignee | ||
Comment 52•10 years ago
|
||
Comment on attachment 8502022 [details] [diff] [review]
finish enabling ts_paint_cold for talos (1.0)
wlach is in a work week and will be travelling tomorrow, lets get a review from dminor.wlachance
Attachment #8502022 -
Flags: review?(wlachance) → review?(dminor)
Assignee | ||
Comment 53•10 years ago
|
||
after some feedback from :avih, I have reshuffled the parameters around so we can actually take advantage of inheritance and define the right things.
Attachment #8502022 -
Attachment is obsolete: true
Attachment #8502022 -
Flags: review?(dminor)
Attachment #8503273 -
Flags: review?(dminor)
Comment 54•10 years ago
|
||
Comment on attachment 8503273 [details] [diff] [review]
finish enabling ts_paint_cold for talos (1.1)
Review of attachment 8503273 [details] [diff] [review]:
-----------------------------------------------------------------
lgtm
Attachment #8503273 -
Flags: review?(dminor) → review+
Assignee | ||
Comment 55•10 years ago
|
||
Assignee | ||
Updated•10 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 56•9 years ago
|
||
Hey Joel, is ts_paint_cold currently running in the tree anywhere?
Flags: needinfo?(jmaher)
Assignee | ||
Comment 57•9 years ago
|
||
no, we removed that quite some time ago. I recall the reasons being that it wasn't accurately measuring cold. I believe we did some experiments to flush the cache and (as you can see in the comments here). As to why those didn't work, I couldn't find it right away. I recall something about the results not being useful.
Flags: needinfo?(jmaher)
Comment 58•9 years ago
|
||
(In reply to Joel Maher (:jmaher) from comment #57)
> no, we removed that quite some time ago. I recall the reasons being that it
> wasn't accurately measuring cold. I believe we did some experiments to
> flush the cache and (as you can see in the comments here). As to why those
> didn't work, I couldn't find it right away. I recall something about the
> results not being useful.
The variance of the data was just too high to detect regressions.
Comment 59•9 years ago
|
||
If we want to use this test for something now, would it be appropriate to just reopen this bug? Or should I file a follow up? Is anything in tree?
Assignee | ||
Comment 60•9 years ago
|
||
we have removed everything from the tree as this has been inactive for 6+ months. We can consider a new test- it would be a low priority for me to work on. Just mark the bug as a dependency to bug 1040081
Reporter | ||
Comment 61•9 years ago
|
||
(In reply to Jim Mathies [:jimm] from comment #59)
> If we want to use this test for something now, would it be appropriate to
> just reopen this bug? Or should I file a follow up? Is anything in tree?
My suggestion in the e10s perf criteria document was to use this harness (or manual testing if need be) to compare e10s cold-startup time *on an SSD*. The variance was excessive on Talos HDDs, but it was reasonable on laptop SSDs.
You need to log in
before you can comment on or make changes to this bug.
Description
•