Remove detectably impossible CPU usage spikes
Categories
(Core :: Gecko Profiler, enhancement, P2)
People
(Reporter: mozbugz, Assigned: mozbugz)
Bug 1685938 helped with reducing apparent CPU spikes that were mostly due to random gaps between CPU measurements and their associated timestamps.
It worked very well on Windows and Linux.
However, on Mac, based on minimal testing, it only reduced spikes from around 15x down to less than 3x.
That is much better: graphs are at least visible now, but they are still visibly squashed.
The cause is not yet known. Assuming the previous work is not doing anything incorrect (to be checked), this may be a problem with the OS function thread_info.
Looking at spikes, they are usually preceded by a low-usage interval; maybe there is some kind of caching of thread statistics, so we are getting old values. That would explain why some time intervals seem to receive excess CPU time that may in fact have been spent during a prior interval.
In any case, since the CPU usage is given in microseconds, it is possible to detect when values are physically impossible, i.e., when more microseconds were apparently spent on the CPU than there are microseconds of wall-clock time since the previous reading.
This bug will look at post-processing CPU usage values to redistribute spikes to neighboring sample intervals.
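For illustration, here is a minimal sketch of such a detection check; the sample shape and field names are hypothetical, not the profiler's actual data structures:

```typescript
// Hypothetical sample shape; the real profiler stores CPU measurements differently.
interface Sample {
  timeMs: number;     // wall-clock timestamp of this sample, in milliseconds
  cpuDeltaUs: number; // CPU time consumed since the previous sample, in microseconds
}

// A reading is physically impossible when a thread reports more CPU time
// than the wall-clock time that elapsed since the previous sample.
function isImpossibleSpike(prev: Sample, curr: Sample): boolean {
  const wallClockUs = (curr.timeMs - prev.timeMs) * 1000; // ms -> µs
  return curr.cpuDeltaUs > wallClockUs;
}
```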
Assignee
Comment 1 • 4 years ago
Having worked on it for a bit now, I'm wondering if this post-processing should really be done in Firefox. 🤔
First, as an example of how the post-processing can help (wherever it's eventually done), here's a prototype of a flattening algorithm:
https://deploy-preview-3098--perf-html.netlify.app/public/e8mhgb671v9wwsbg1t1bmjbwmy3j27kba69pg4r
Collected spikes went up to 243%; post-processing flattened them down to at most 120%. This means "normal" graphs are only shrunk by at most 1.2x instead of 2.43x or worse.
This shows the usefulness of such post-processing.
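For reference, here is a minimal sketch of one possible flattening approach, assuming per-interval CPU ratios (CPU time divided by wall-clock time); this is only an illustration, not the algorithm behind the linked prototype:

```typescript
// Compress anything above 100% toward a ceiling (here 120%), preserving the
// relative ordering of spikes, so a rare 243% spike no longer forces the
// whole graph to be drawn at 1/2.43 scale.
// Hypothetical helper, not the code behind the prototype linked above.
function flattenSpikes(ratios: number[], ceiling = 1.2): number[] {
  return ratios.map(r =>
    r <= 1 ? r : 1 + (ceiling - 1) * (1 - 1 / r) // maps (1, ∞) into (1, ceiling)
  );
}
```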
Thoughts on where it should be performed:
- Though these spikes may look physically impossible at first, maybe they reflect some real work, e.g., CPU frequency boost, internal parallelism (int+float+vectorization), etc. So I'm reluctant to lose this data as it was collected.
- Distributing excess usage to nearby samples feels kludgy: we're only guessing where this apparent excess should go, we could be wrong, and we would lose the original data.
- Flattening spikes (e.g., anything above 100% could be flattened to at most 120%) would help with the display, but would also lose some information. The flattening ratios could be stored somewhere in the profile JSON to allow reconstructing the original data, but this seems like a lot of extra work on both sides.
So I'm now thinking that handling spikes may be more appropriately done on the front-end side, during display. Hat tip: Florian suggested this before I was convinced (I had argued that doing the work in C++ would be a quicker and better experience for the user).
This way, we won't artificially modify the data that was collected in Firefox.
The front-end can best decide how to display it (e.g., flatten or distribute excesses, highlight spikes, show unmodified numbers in tooltips, etc.), and future improvements will be able to retroactively handle past profiles with the same data.
On the Firefox side, we can continue exploring the CPU usage collection code, to improve it if possible.
Nazım, what do you think? Happy to discuss. If you agree, please file a GitHub issue to follow up, or add this to your WIP.
(CC: mstange, if you have thoughts about this.)
Comment 2 • 4 years ago
Yes, I think we can move the processing of "impossible spikes" to the front-end.
That will require special handling for macOS, or rather for the platforms that provide CPU timing values. I was against adding special handling at first, but it looks like we don't have much choice anymore :)
I will first try to limit these impossible spikes to 100% and redistribute the excess CPU usage to neighboring samples, as in the sketch below. We can then discuss whether that works. If redistributing does not produce the values we want, I can start by just limiting spikes to 100% without redistributing.
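A minimal sketch of that clamp-and-redistribute idea, carrying excess forward only; the array shapes and names are hypothetical, not the eventual front-end code:

```typescript
// Cap each sample's CPU time at its wall-clock budget (100%) and carry any
// excess forward to the next sample; the loop re-checks each sample, so
// carried excess cascades further if needed. The last sample may still
// exceed 100% when there is nowhere left to push the excess.
function redistributeExcess(cpuDeltasUs: number[], wallClockUs: number[]): number[] {
  const out = cpuDeltasUs.slice();
  for (let i = 0; i < out.length; i++) {
    const excess = out[i] - wallClockUs[i];
    if (excess > 0 && i + 1 < out.length) {
      out[i] = wallClockUs[i]; // clamp this sample to 100%
      out[i + 1] += excess;    // push the excess onto the next sample
    }
  }
  return out;
}
```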
I will add a GitHub issue and cross-reference the issues/bugs.
Assignee
Comment 3 • 4 years ago
Thank you, much appreciated.
Moved to GitHub. (We can reopen this bug if something is needed on Firefox's side.)