Closed Bug 1685938 Opened 4 years ago Closed 4 years ago

Reduce gap between CPU measurements and sample timestamp

Categories

(Core :: Gecko Profiler, enhancement, P2)

enhancement

Tracking

()

RESOLVED FIXED
86 Branch
Tracking Status
firefox86 --- fixed

People

(Reporter: mozbugz, Assigned: mozbugz)

References

(Blocks 1 open bug)

Details

Attachments

(4 files)

A profile taken on MacOS after bug 1329600 seems to have 0% CPU usage everywhere.

Looking at the raw data (accessible through profile in the JS console), this is due to apparent spikes in CPU usage, which themselves are most probably due to gaps between the time CPU measurements are taken and the sample timestamp. These gaps can be caused by the OS scheduling another thread.

Consider the following sequence:

  • sampler loop 1: CPU, (long gap), timestamp, stack, no sleep (to "catch up")
  • sampler loop 2: CPU, (NO gap), timestamp, stack...

Because of the gap, we have CPU[2]-CPU[1] covering a much longer time than timestamp[2]-timestamp[1]. So when we compute the CPU usage over time, we get a spike. The above profile shows spike going for an apparent 15x the theoretical maximum!

One issue is the tiny sleep between sampler loops, which makes things worse. This will be dealt in separate bug 1685937.

The main issue we have is due to the gap between CPU measurements and timestamp.
On all supported systems, these are collected with different APIs, so we cannot guarantee that there couldn't be such a gap.
I doubt OSes provide ways to prevent scheduling interruptions. Maybe a higher thread priority could help though? In any case, I'll start with a generic solution, we can try to tweak it on different OSes later on.

The proposed solution here is to record timestamps immediately before and after the CPU measurements, and in cases where the time difference is "too long", we'd repeat the operation.
"Too long" can be determined with a quick test loop at the start of the profiling session.
With this, long gaps should be eliminated, so we can trust that (CPU[2]-CPU[1])/(timestamp[2]-timestamp[1]) is reliable.

The cost of repeating measurements should be negligible: The API calls themselves are short (in the order of a microsecond), so when we detect a "too long" gap (probably in tens of microseconds or more), it would indicate that the system was probably busy with something else (so there's little CPU waste), and the stack sample would have been delayed anyway, so it shouldn't hurt to delay it a little more, in order to get better CPU usage values.

This handles the conversion (from TimeStamp to number of milliseconds since process start) once and gives it to subroutines.
It will also help in a following patch where this value will be more closely tied with the CPU usage value.

Disambiguated nested delta variables in the Sampler:

  • sampleStartDeltaMs is at the start of each sampling loop,
  • threadSampleDeltaMs is associated to one thread being sampled during that loop.

Depends on D101543

mPostMeasurementTimeStamp records the time right after CPU measurements ended.
It is then used as the main sample timestamp, to both avoid another TimeStamp::Now() call, and to keep them as close together as possible.

Depends on D101544

Pushed by gsquelart@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/cefa452f7131 Remove unused parameters in DoPeriodicSample - r=canaltinova https://hg.mozilla.org/integration/autoland/rev/a42b21d1612a Precompute sample timestamps as milliseconds in sampler loop - r=canaltinova https://hg.mozilla.org/integration/autoland/rev/5eb3bdbbdcd5 Add RunningTimes::mPostMeasurementTimeStamp and use as sample timestamp - r=canaltinova https://hg.mozilla.org/integration/autoland/rev/090fc4f02d83 Keep CPU measurements and their timestamp within 8*median duration - r=canaltinova
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: