Closed Bug 828153 Opened 12 years ago Closed 9 years ago

Better Firefox Health Report integration with Talos

Categories

(Firefox Health Report Graveyard :: Client: Desktop, defect, P4)

defect

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: gps, Unassigned)

References

Details

Attachments

(1 file)

Firefox Health Report doesn't currently integrate well with Talos. For starters, it appears that the database is initialized during actual tests, not during the special first application run Talos uses to initialize the profile. This is because FHR initialization is on a timer and this timer may not fire during the initial profile set-up application run. For this first app run, Talos has a special page, getInfo.html. We should initialize FHR from this page (assuming we want FHR running during Talos tests). During many Talos tests, FHR appears to be performing its initialization routine in the background. This could skew Talos results and add noise. While we likely want FHR enabled for some Talos tests (especially those that measure startup and shutdown time), we could possibly do without it for others (e.g. scrolling perf). Or, we could at least delay running these Talos tests until FHR is fully initialized. I'm not an expert on Talos and the tests therein. But, I'm pretty sure the current situation is far from ideal. I defer to someone else for a recommended course of action.
how do we turn it on and off?
(In reply to Joel Maher (:jmaher) from comment #1) > how do we turn it on and off? The healthreport.service.enabled pref can be used to short-circuit the service initialization on startup. It's my understanding we'd need to update PerfConfigurator.py and push out a new Talos. This could take a few days.
would this be: on: healthreport.service.enabled=1 off: healthreport.service.enabled=0 is this on by default? has this landed yet? We can easily land a patch and deploy a talos, the process is: 1) write/review patch 2) run on try for select platforms/tests 3) upload new talos.zip to build network 4) deploy to talos.json in the tree This usually takes 1-2 days. If we were to do different things with the health report in getInfo.html or turn it on/off per test, it could take a couple more days.
healthreport.service.enabled=false would be the way to turn it off. This hasn't landed in m-c yet. It is running on services-central. I think we should preemptively land Talos support for easily disabling FHR globally via this pref. We could follow that up with new bug(s) for getInfo.html or per-test/suite options.
Assignee: nobody → jmaher
Status: NEW → ASSIGNED
Attachment #699972 - Flags: review?(jhammel)
Vladan: Would you mind posting your Talos + FHR findings in this bug?
Flags: needinfo?(vdjeric)
Comment on attachment 699972 [details] [diff] [review] disable health reporting in talos (1.0) I wish we kept these preferences in order. Also, it might be nice to have a comment as to why we set this preference.
Attachment #699972 - Flags: review?(jhammel) → review+
Comment on attachment 699972 [details] [diff] [review] disable health reporting in talos (1.0) Testing without FHR is dangerous. It's a path that ever user will be hitting. I'm not comfortable disabling that until unless there is an equivalent test.
Attachment #699972 - Flags: feedback-
Good point Taras- I will hold off on this until we have a better story for testing with this preference on.
(In reply to Gregory Szorc [:gps] from comment #6) > Vladan: Would you mind posting your Talos + FHR findings in this bug? I re-ran FHR Talos numbers today: FHR disabled: https://tbpl.mozilla.org/?tree=Try&rev=a20fa3fbd9ac FHR enabled: https://tbpl.mozilla.org/?tree=Try&rev=a92dcbbd1400 The top observation is that shutdown times have regressed by ~200% on Win7, ~100% on WinXP + Linux, and ~20% on Mac OS X. This might indeed be because the FHR DB is getting created at shutdown during Talos tests. It needs to be investigated & fixed ASAP because it's also making the other Talos measures meaningless, i.e. no FHR work gets done until shutdown. Also, we probably shouldn't be creating the DB at shutdown at all and instead we should wait for an idle time during a future session to do it.
Flags: needinfo?(vdjeric)
For more comparison purposes: Console logging: https://tbpl.mozilla.org/?tree=Try&rev=0a8d26e4aea8 60s Startup Delay (instead of 10s): https://tbpl.mozilla.org/?tree=Try&rev=b4d9af5cb6f3
OK. The console logging tells me a lot. http://gps.pastebin.mozilla.org/2064339 is the relevant FHR shutdown logs from a Windows run. We see it takes 2.229s wall time for FHR to shut down. This is similar to a Linux64 (http://gps.pastebin.mozilla.org/2064351) which takes 1.986s. Compare this with an OS X run (http://gps.pastebin.mozilla.org/2064343) where it took 0.113s! FWIW, the OS X times are very similar to my own MBP, where I have an SSD. I wonder if the OS X Talos machines are running SSDs? Anyway, if we look at the Windows logs, every SQLite query that modifies the DB takes 125-175ms to complete. Because of a sub-optimal implementation of session recording, we are performing 13 such queries during shutdown. 13 * 150ms = 1.950s. This reinforces the point that DB writes on shutdown are bad. We need to refactor the session recording code ASAP (bug 827157) and eliminate as many of (hopefully all) these writes on shutdown. As a quick fix, I /think/ if we move the writes to a transaction, we'll only incur 1 disk flush at transaction commit time. These operations are all related, so they should be incurring within a transaction anyway. As an aside, 100+ ms to perform a DB write seems like a lot, even for magnetic (I'm assuming) disks. I guess there is a lot more disk head movement than I thought there would be! Who knows.
(In reply to comment #12) > FWIW, the OS X times are very similar to my own MBP, where I have an SSD. I > wonder if the OS X Talos machines are running SSDs? They aren't. > This reinforces the point that DB writes on shutdown are bad. We need to > refactor the session recording code ASAP (bug 827157) and eliminate as many of > (hopefully all) these writes on shutdown. Is that something that we can do within the next couple of days? We usually do not let regressions of this magnitude stay in the tree...
https://tbpl.mozilla.org/?tree=Try&rev=67dffc73b9eb is a trivial change to perform most of the shutdown writes in a transaction. If my theory about I/O and transactions is correct, this should recover a majority of the regression. A more in depth solution depends on what Product and Metrics have to say in bug 830495 and its effect on bug 827157.
Assignee: jmaher → gps
I just posted a patch in bug 827157 that removes all shutdown-specific DB writes. I expect this will recover all lost shutdown performance losses.
My theory about transactions appears to be correct! tp5n_shutdown_paint times for Windows: Before FHR: 1000-1250 Current FHR: 3200-4000 Transaction: 1626.0 Transactions FTW!
https://tbpl.mozilla.org/?tree=Try&rev=bf784d3cf643 combined the better session recording in bug 827157 with the transactions off main thread patch in bug 830209. I'm optimistic the latter will have some benefit to Talos since there is no longer main thread blocking ~10s after startup.
the question for this bug is "do we want to disable FHR for talos?" From what Taras says we shouldn't. I believe his reasoning is valid. Are we looking for <10% perf differences before going live? Do we want to have this on for some tests and not others?
(In reply to Joel Maher (:jmaher) from comment #18) > the question for this bug is "do we want to disable FHR for talos?" From > what Taras says we shouldn't. I believe his reasoning is valid. Are we > looking for <10% perf differences before going live? Do we want to have > this on for some tests and not others? There are multiple questions for this bug that were posed in the original comment. We still have a number of paths we can proceed down. I don't believe we've made any concrete decisions yet.
One of the reports we received was a regression for Kraken. Looking at the logs at https://tbpl.mozilla.org/php/getParsedLog.php?id=18845314&tree=Try&full=1, it appears that the FHR database is being created while Kraken, Dromaeo, etc run during the dromaeojs Talos test. This is certainly throwing off numbers. From what limited I know about these benchmarks, I /think/ we should disable FHR completely or at least ensure the database schema is created *before* the benchmarks start.
(In reply to comment #20) > From what limited I know about these benchmarks, I /think/ we should disable > FHR completely or at least ensure the database schema is created *before* the > benchmarks start. Do you mean disable FHR on Talos, or in Firefox?
I assume disabling it on Talos. If we do that we should have a test which show the effect of FHR. Is there any way to wait until the database schema is created?
Marking as P4 because we'll be addressing this fully after FHR on Android. We don't want to land a partial change in this area, because we don't want to hide the impact of FHR on Talos.
Priority: -- → P4
Component: Metrics and Firefox Health Report → Client: Desktop
Product: Mozilla Services → Firefox Health Report
I'm unassigning bugs I haven't worked on in a while.
Assignee: gps → nobody
FHR is going away per bug 1209088, wontfix.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Product: Firefox Health Report → Firefox Health Report Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: