828153 - Better Firefox Health Report integration with Talos

Reporter

Description

•

12 years ago

Firefox Health Report doesn't currently integrate well with Talos. For starters, it appears that the database is initialized during actual tests, not during the special first application run Talos uses to initialize the profile. This is because FHR initialization is on a timer and this timer may not fire during the initial profile set-up application run. For this first app run, Talos has a special page, getInfo.html. We should initialize FHR from this page (assuming we want FHR running during Talos tests). During many Talos tests, FHR appears to be performing its initialization routine in the background. This could skew Talos results and add noise. While we likely want FHR enabled for some Talos tests (especially those that measure startup and shutdown time), we could possibly do without it for others (e.g. scrolling perf). Or, we could at least delay running these Talos tests until FHR is fully initialized. I'm not an expert on Talos and the tests therein. But, I'm pretty sure the current situation is far from ideal. I defer to someone else for a recommended course of action.

Joel Maher ( :jmaher ) (UTC -8)

Comment 1

•

12 years ago

how do we turn it on and off?

Gregory Szorc [:gps]

Reporter

Comment 2

•

12 years ago

(In reply to Joel Maher (:jmaher) from comment #1) > how do we turn it on and off? The healthreport.service.enabled pref can be used to short-circuit the service initialization on startup. It's my understanding we'd need to update PerfConfigurator.py and push out a new Talos. This could take a few days.

Joel Maher ( :jmaher ) (UTC -8)

Comment 3

•

12 years ago

would this be: on: healthreport.service.enabled=1 off: healthreport.service.enabled=0 is this on by default? has this landed yet? We can easily land a patch and deploy a talos, the process is: 1) write/review patch 2) run on try for select platforms/tests 3) upload new talos.zip to build network 4) deploy to talos.json in the tree This usually takes 1-2 days. If we were to do different things with the health report in getInfo.html or turn it on/off per test, it could take a couple more days.

Gregory Szorc [:gps]

Reporter

Comment 4

•

12 years ago

healthreport.service.enabled=false would be the way to turn it off. This hasn't landed in m-c yet. It is running on services-central. I think we should preemptively land Talos support for easily disabling FHR globally via this pref. We could follow that up with new bug(s) for getInfo.html or per-test/suite options.

Joel Maher ( :jmaher ) (UTC -8)

Comment 5

•

12 years ago

Attached patch disable health reporting in talos (1.0) (deleted) — Details — Splinter Review

Assignee: nobody → jmaher

Status: NEW → ASSIGNED

Attachment #699972 - Flags: review?(jhammel)

Gregory Szorc [:gps]

Reporter

Comment 6

•

12 years ago

Vladan: Would you mind posting your Talos + FHR findings in this bug?

Flags: needinfo?(vdjeric)

Jeff Hammel

Comment 7

•

12 years ago

Comment on attachment 699972 [details] [diff] [review] disable health reporting in talos (1.0) I wish we kept these preferences in order. Also, it might be nice to have a comment as to why we set this preference.

Attachment #699972 - Flags: review?(jhammel) → review+

(dormant account)

Comment 8

•

12 years ago

Comment on attachment 699972 [details] [diff] [review] disable health reporting in talos (1.0) Testing without FHR is dangerous. It's a path that ever user will be hitting. I'm not comfortable disabling that until unless there is an equivalent test.

Attachment #699972 - Flags: feedback-

Joel Maher ( :jmaher ) (UTC -8)

Comment 9

•

12 years ago

Good point Taras- I will hold off on this until we have a better story for testing with this preference on.

Vladan Djeric (:vladan)

Comment 10

•

12 years ago

(In reply to Gregory Szorc [:gps] from comment #6) > Vladan: Would you mind posting your Talos + FHR findings in this bug? I re-ran FHR Talos numbers today: FHR disabled: https://tbpl.mozilla.org/?tree=Try&rev=a20fa3fbd9ac FHR enabled: https://tbpl.mozilla.org/?tree=Try&rev=a92dcbbd1400 The top observation is that shutdown times have regressed by ~200% on Win7, ~100% on WinXP + Linux, and ~20% on Mac OS X. This might indeed be because the FHR DB is getting created at shutdown during Talos tests. It needs to be investigated & fixed ASAP because it's also making the other Talos measures meaningless, i.e. no FHR work gets done until shutdown. Also, we probably shouldn't be creating the DB at shutdown at all and instead we should wait for an idle time during a future session to do it.

Flags: needinfo?(vdjeric)

Gregory Szorc [:gps]

Reporter

Comment 11

•

12 years ago

For more comparison purposes: Console logging: https://tbpl.mozilla.org/?tree=Try&rev=0a8d26e4aea8 60s Startup Delay (instead of 10s): https://tbpl.mozilla.org/?tree=Try&rev=b4d9af5cb6f3

Gregory Szorc [:gps]

Reporter

Comment 12

•

12 years ago

OK. The console logging tells me a lot. http://gps.pastebin.mozilla.org/2064339 is the relevant FHR shutdown logs from a Windows run. We see it takes 2.229s wall time for FHR to shut down. This is similar to a Linux64 (http://gps.pastebin.mozilla.org/2064351) which takes 1.986s. Compare this with an OS X run (http://gps.pastebin.mozilla.org/2064343) where it took 0.113s! FWIW, the OS X times are very similar to my own MBP, where I have an SSD. I wonder if the OS X Talos machines are running SSDs? Anyway, if we look at the Windows logs, every SQLite query that modifies the DB takes 125-175ms to complete. Because of a sub-optimal implementation of session recording, we are performing 13 such queries during shutdown. 13 * 150ms = 1.950s. This reinforces the point that DB writes on shutdown are bad. We need to refactor the session recording code ASAP (bug 827157) and eliminate as many of (hopefully all) these writes on shutdown. As a quick fix, I /think/ if we move the writes to a transaction, we'll only incur 1 disk flush at transaction commit time. These operations are all related, so they should be incurring within a transaction anyway. As an aside, 100+ ms to perform a DB write seems like a lot, even for magnetic (I'm assuming) disks. I guess there is a lot more disk head movement than I thought there would be! Who knows.

(no longer active)

Comment 13

•

12 years ago

(In reply to comment #12) > FWIW, the OS X times are very similar to my own MBP, where I have an SSD. I > wonder if the OS X Talos machines are running SSDs? They aren't. > This reinforces the point that DB writes on shutdown are bad. We need to > refactor the session recording code ASAP (bug 827157) and eliminate as many of > (hopefully all) these writes on shutdown. Is that something that we can do within the next couple of days? We usually do not let regressions of this magnitude stay in the tree...

Gregory Szorc [:gps]

Reporter

Comment 14

•

12 years ago

https://tbpl.mozilla.org/?tree=Try&rev=67dffc73b9eb is a trivial change to perform most of the shutdown writes in a transaction. If my theory about I/O and transactions is correct, this should recover a majority of the regression. A more in depth solution depends on what Product and Metrics have to say in bug 830495 and its effect on bug 827157.

Assignee: jmaher → gps

Gregory Szorc [:gps]

Reporter

Comment 15

•

12 years ago

I just posted a patch in bug 827157 that removes all shutdown-specific DB writes. I expect this will recover all lost shutdown performance losses.

Gregory Szorc [:gps]

Reporter

Comment 16

•

12 years ago

My theory about transactions appears to be correct! tp5n_shutdown_paint times for Windows: Before FHR: 1000-1250 Current FHR: 3200-4000 Transaction: 1626.0 Transactions FTW!

Gregory Szorc [:gps]

Reporter

Comment 17

•

12 years ago

https://tbpl.mozilla.org/?tree=Try&rev=bf784d3cf643 combined the better session recording in bug 827157 with the transactions off main thread patch in bug 830209. I'm optimistic the latter will have some benefit to Talos since there is no longer main thread blocking ~10s after startup.

Joel Maher ( :jmaher ) (UTC -8)

Comment 18

•

12 years ago

the question for this bug is "do we want to disable FHR for talos?" From what Taras says we shouldn't. I believe his reasoning is valid. Are we looking for <10% perf differences before going live? Do we want to have this on for some tests and not others?

Gregory Szorc [:gps]

Reporter

Comment 19

•

12 years ago

(In reply to Joel Maher (:jmaher) from comment #18) > the question for this bug is "do we want to disable FHR for talos?" From > what Taras says we shouldn't. I believe his reasoning is valid. Are we > looking for <10% perf differences before going live? Do we want to have > this on for some tests and not others? There are multiple questions for this bug that were posed in the original comment. We still have a number of paths we can proceed down. I don't believe we've made any concrete decisions yet.

Gregory Szorc [:gps]

Reporter

Comment 20

•

12 years ago

One of the reports we received was a regression for Kraken. Looking at the logs at https://tbpl.mozilla.org/php/getParsedLog.php?id=18845314&tree=Try&full=1, it appears that the FHR database is being created while Kraken, Dromaeo, etc run during the dromaeojs Talos test. This is certainly throwing off numbers. From what limited I know about these benchmarks, I /think/ we should disable FHR completely or at least ensure the database schema is created *before* the benchmarks start.

(no longer active)

Comment 21

•

12 years ago

(In reply to comment #20) > From what limited I know about these benchmarks, I /think/ we should disable > FHR completely or at least ensure the database schema is created *before* the > benchmarks start. Do you mean disable FHR on Talos, or in Firefox?

Joel Maher ( :jmaher ) (UTC -8)

Comment 22

•

12 years ago

I assume disabling it on Talos. If we do that we should have a test which show the effect of FHR. Is there any way to wait until the database schema is created?

Richard Newman [:rnewman]

Comment 23

•

12 years ago

Marking as P4 because we'll be addressing this fully after FHR on Android. We don't want to land a partial change in this area, because we don't want to hide the impact of FHR on Talos.

Priority: -- → P4

Gregory Szorc [:gps]

Reporter

Updated

•

12 years ago

Component: Metrics and Firefox Health Report → Client: Desktop

Product: Mozilla Services → Firefox Health Report

Gregory Szorc [:gps]

Reporter

Comment 24

•

11 years ago

I'm unassigning bugs I haven't worked on in a while.

Assignee: gps → nobody

Alessio Placitelli [:Dexter]

Comment 25

•

9 years ago

FHR is going away per bug 1209088, wontfix.

Status: ASSIGNED → RESOLVED

Closed: 9 years ago

Resolution: --- → WONTFIX

BMO Automation

Updated

•

6 years ago

Product: Firefox Health Report → Firefox Health Report Graveyard