Closed Bug 882543 Opened 11 years ago Closed 11 years ago

Create Web Audio benchmarks

Categories

(Core :: Web Audio, defect)

x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla26
Tracking Status
firefox25 --- fixed
firefox26 --- fixed

People

(Reporter: roc, Assigned: padenot)

References

Details

(Whiteboard: [qa-])

Attachments

(9 files, 2 obsolete files)

(deleted), patch
roc
: review+
justin.lebar+bug
: review+
Details | Diff | Splinter Review
(deleted), patch
roc
: review+
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
ehsan.akhgari
: review+
Details | Diff | Splinter Review
(deleted), patch
BenWa
: review+
Details | Diff | Splinter Review
(deleted), application/x-gzip
Details
(deleted), patch
ehsan.akhgari
: review+
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), image/png
Details
Before we do much optimization work we should create some Web Audio benchmarks so we can measure our progress. I think a pretty good way to do this would be to use OfflineAudioContext. We can create a node graph and then measure how long it takes to run the graph to produce some fixed number of samples. That won't handle latency but I think most of our WebAudio-specific optimizations will be targeting throughput. It would be nice to have some tests in the tree, but tests we can run manually would be a great start. Bonus points if we can run them on other browsers.
Taking, I already have some code.
Assignee: nobody → paul
Paul, I suggest you touch base with Chris Rogers about this. He might already have something like this... If not, I'm sure he's going to be interested in the result of our work here. Thanks!
Ha, did not see your comment, on time, so I started to write a toy webpage that dose some trivial benchmark [1], and I have bad news: when using the OfflineAudioContext in current Nightly, we are actually doing the work slower than in real time (14 seconds for 10 seconds of audio). I haven't had a chance to look into the cause yet. [1]: https://github.com/padenot/webaudio-benchmark
After a quick profile, we're spending about 60% of our time in UpdateStreamOrder, which is almost all either realloc() or free(). :( Also, the benchmark doesn't work in Chrome, unless they really finish their processing in 1ms. ;-)
Also, we spend another 20-ish% in PrepareUpdatesToMainThreadState, most of which is, wait for it, realloc(). This is all completely unnecessary overhead which we should be able to get rid of completely. The actual procesing time in my profile is 12ms, out of 12996ms. Thia makes me very sad.
(In reply to comment #5) > Also, we spend another 20-ish% in PrepareUpdatesToMainThreadState, most of > which is, wait for it, realloc(). > > This is all completely unnecessary overhead which we should be able to get rid > of completely. The actual procesing time in my profile is 12ms, out of > 12996ms. For stream ordering at least, we should be able to set a dirty flag when something happens that actually affects stream ordering, and make UpdateStreamOrder a no-op otherwise.
PrepareUpdatesToMainThread state is going to be trickier. We can at least not allocate. We should also be able to flag the MediaStreams whose currentTime and other main-thread state can be queried; for AudioNodes (other than Destination), they can't. Then those can be skipped.
Depends on: 883010
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #7) > PrepareUpdatesToMainThread state is going to be trickier. We can at least > not allocate. We should also be able to flag the MediaStreams whose > currentTime and other main-thread state can be queried; for AudioNodes > (other than Destination), they can't. Then those can be skipped. I believe not allocating is enough. At least, we should first do that and then re-measure.
Depends on: 883011
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #4) > After a quick profile, we're spending about 60% of our time in > UpdateStreamOrder, which is almost all either realloc() or free(). :( > > Also, the benchmark doesn't work in Chrome, unless they really finish their > processing in 1ms. ;-) Yes, they output a silent buffer for some reason.
(In reply to comment #9) > (In reply to :Ehsan Akhgari (needinfo? me!) from comment #4) > > After a quick profile, we're spending about 60% of our time in > > UpdateStreamOrder, which is almost all either realloc() or free(). :( > > > > Also, the benchmark doesn't work in Chrome, unless they really finish their > > processing in 1ms. ;-) > > Yes, they output a silent buffer for some reason. Have you tried with the sampling rate of 44.1KHz?
Yes, they throw an exception when you do that.
(In reply to comment #11) > Yes, they throw an exception when you do that. WTF?! Please file a bug on that.
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #12) > (In reply to comment #11) > > Yes, they throw an exception when you do that. > > WTF?! Please file a bug on that. I looked into it a bit more after having written this comment, and it appears to be known, and a fix is coming [1]. [1]: http://code.google.com/p/chromium/issues/detail?id=168216#c5
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #7) > PrepareUpdatesToMainThread state is going to be trickier. We can at least > not allocate. We should also be able to flag the MediaStreams whose > currentTime and other main-thread state can be queried; for AudioNodes > (other than Destination), they can't. Then those can be skipped. Hmm, then maybe we can differentiate those based on their AudioNodeStreamKind. Should be very easy in fact!
Depends on: 884632
Blocks: 885496
I could not find a method that did what I wanted on nsTArray, maybe it is hidden somewhere else? Anyway, this obviously makes alloc/free disappear from the profiles. We probably want to shrink the array after some time, but I haven't done so in the patch, I need to write some instrumentation to characterize what is happening here for some common use cases.
Attachment #770302 - Flags: review?(roc)
This is obviously the biggest and most embarrassing win. This brings us in the same ballpark as blink and webkit (we are two or three times slower when running my testcase at that point, but we resample, and they don't, because the file I'm using in the testcase has a samplerate of 44100Hz, and we run our context at 48000Hz, and they run their context at 44100Hz).
Attachment #770307 - Flags: review?(roc)
This should give us better cache locality. I haven't profiled with and without, nor have tried to tweak the factor.
I'm not sure about this one, but it seems to work fine on my testcases, and it seriously reduces the main thread CPU usage.
Anyway, when all those are landed, we can start talking about SSE and friends, I believe :-).
Comment on attachment 770302 [details] [diff] [review] Retain storage for media streams accross iterations instead of reallocating. r= Review of attachment 770302 [details] [diff] [review]: ----------------------------------------------------------------- Justin, can you please look over this new nsTArray API?
Attachment #770302 - Flags: review?(justin.lebar+bug)
Comment on attachment 770316 [details] [diff] [review] Don't spam the main thread when rendering an offline graph. r= Review of attachment 770316 [details] [diff] [review]: ----------------------------------------------------------------- ::: content/media/MediaStreamGraph.cpp @@ +910,5 @@ > { > mMonitor.AssertCurrentThreadOwns(); > > + // We don't want to update the main thread about timing update when we are not > + // running in realtime. This prevents OfflineAudioContext.currentTime from making progress, right? That violates the spec (and I believe we have a test somewhere which should turn red with this patch.)
Comment on attachment 770302 [details] [diff] [review] Retain storage for media streams accross iterations instead of reallocating. r= Okay, but please be careful to compact these arrays eventually if they're not tiny. :) Could you add a comment on the method saying what it does, and in particular call out the need to Compact() eventually, unless you know the array is small or short-lived? Also, please add one of these two checks if (mHdr == EmptyHdr()) return; or if (mHdr->mLength == 0) return; Otherwise if ClearRetainStorage() is called on the empty header, we'll write 0 into its length. Which is fine, but it makes helgrind complain.
Attachment #770302 - Flags: review?(justin.lebar+bug) → review+
Comment on attachment 770302 [details] [diff] [review] Retain storage for media streams accross iterations instead of reallocating. r= Review of attachment 770302 [details] [diff] [review]: ----------------------------------------------------------------- ::: content/media/MediaStreamGraphImpl.h @@ +535,5 @@ > * True when a non-realtime MediaStreamGraph has started to process input. This > * value is only accessed on the main thread. > */ > bool mNonRealtimeProcessing; > + nsTArray<nsRefPtr<MediaStream> > mOldStreams; Move this up next to some other pointer field. Also, give it a comment to explain what it's for. ::: xpcom/glue/nsTArray.h @@ +1042,5 @@ > > // > // Mutation methods > // > + void ClearRetainStorage() { ClearAndRetainStorage()
Attachment #770302 - Flags: review?(roc) → review+
> ClearAndRetainStorage() Ah, yes. Good call.
Comment on attachment 770307 [details] [diff] [review] Actually run offline MSG offline. r= Review of attachment 770307 [details] [diff] [review]: ----------------------------------------------------------------- Ahahahaha. r+ with the following comments fixed. ::: content/media/MediaStreamGraph.cpp @@ +323,5 @@ > + mCurrentTimeStamp = now; > + LOG(PR_LOG_DEBUG+1, ("Updating current time to %f (real %f, mStateComputedTime %f)", > + MediaTimeToSeconds(nextCurrentTime), > + (now - mInitialTimeStamp).ToSeconds(), > + MediaTimeToSeconds(mStateComputedTime))); I think this LOG should happen in both the realtime and non-realtime cases. @@ +326,5 @@ > + (now - mInitialTimeStamp).ToSeconds(), > + MediaTimeToSeconds(mStateComputedTime))); > + } else { > + prevCurrentTime = mCurrentTime; > + nextCurrentTime = mCurrentTime + AUDIO_TARGET_MS; I think you should use MEDIA_GRAPH_TARGET_PERIOD_MS here.
Attachment #770307 - Flags: review?(roc) → review+
Comment on attachment 770315 [details] [diff] [review] Use bigger chunks when using an offline graph. r= Review of attachment 770315 [details] [diff] [review]: ----------------------------------------------------------------- Switching to MEDIA_GRAPH_TARGET_PERIOD_MS in the previous patch means that this patch should turn MEDIA_GRAPH_TARGET_PERIOD_MS into a function instead. Call it MediaGraphTargetPeriod().
Comment on attachment 770316 [details] [diff] [review] Don't spam the main thread when rendering an offline graph. r= Review of attachment 770316 [details] [diff] [review]: ----------------------------------------------------------------- ::: content/media/MediaStreamGraph.cpp @@ +910,5 @@ > { > mMonitor.AssertCurrentThreadOwns(); > > + // We don't want to update the main thread about timing update when we are not > + // running in realtime. Yeah, I don't think we should do this. An alternative is for OfflineAudioContext to send an update to the main thread every time a certain amount of *real* time has passed.
(In reply to Paul Adenot (:padenot) from comment #16) > This is obviously the biggest and most embarrassing win. This brings us in > the > same ballpark as blink and webkit (we are two or three times slower when > running > my testcase at that point, but we resample, and they don't, because the file > I'm > using in the testcase has a samplerate of 44100Hz, and we run our context at > 48000Hz, and they run their context at 44100Hz). I think it would be a good idea to use an odd sample rate for your test files, so that everyone has to resample no matter what platform the test runs on. Would suck to have a change in our sample rate make a big difference to benchmark results. Of course, Blink/Webkit will still have an edge since they're still using linear resampling (right?).
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #25) > @@ +326,5 @@ > > + (now - mInitialTimeStamp).ToSeconds(), > > + MediaTimeToSeconds(mStateComputedTime))); > > + } else { > > + prevCurrentTime = mCurrentTime; > > + nextCurrentTime = mCurrentTime + AUDIO_TARGET_MS; > > I think you should use MEDIA_GRAPH_TARGET_PERIOD_MS here. This means that AUDIO_TARGET_MS and VIDEO_TARGET_MS will also need to become functions. Might as well make all those constants into functions.
Comment on attachment 770307 [details] [diff] [review] Actually run offline MSG offline. r= > } >+ if (mStateComputedTime < nextCurrentTime) { >+ LOG(PR_LOG_WARNING, ("Media graph global underrun detected")); >+ nextCurrentTime = mStateComputedTime; >+ } > > if (prevCurrentTime >= nextCurrentTime) { A few too many U+0020 wasted :).
> I think it would be a good idea to use an odd sample rate for your test > files, so that everyone has to resample no matter what platform the test > runs on. Would suck to have a change in our sample rate make a big > difference to benchmark results. Good idea, I'll do that. > Of course, Blink/Webkit will still have an edge since they're still using > linear resampling (right?). Their code did not look like they optimized it much, so we'll see. We will certainly have better sound quality, though. iirc, there is code to use SIMD in the speex resampler. Considering how much we use this resampler, these days (with more usage coming, especially in WebRTC code), I think we should consider building it.
(In reply to Paul Adenot (:padenot) from comment #31) > iirc, there is code to use SIMD in the speex resampler. Considering how much > we use this resampler, these days (with more usage coming, especially in > WebRTC code), I think we should consider building it. Definitely!
This brings a 60x improvement, and I still get the expected result. I need to investigate more to check what is going on.
Attachment #770315 - Attachment is obsolete: true
This gives a 2x speedup on clock wall time, and frees up a core (because it is not spinning to process the MSG messages).
Attachment #770888 - Flags: review?(roc)
Attachment #770316 - Attachment is obsolete: true
This registers the MSG thread to the SPS profiler so I can profile on other platforms than MacOS/Instruments.
Attachment #770890 - Flags: review?(bgirard)
Comment on attachment 770890 [details] [diff] [review] Register the MSG thread for in the profiler. r= Review of attachment 770890 [details] [diff] [review]: ----------------------------------------------------------------- ::: content/media/MediaStreamGraph.cpp @@ +1274,5 @@ > + } > + NS_IMETHOD Run() > + { > + char aLocal; > + profiler_register_thread("MSG", &aLocal); Ideally we can get a more descriptive name then MSG. This is shown in the UI and can be used to select which threads you want to sample. @@ +1276,5 @@ > + { > + char aLocal; > + profiler_register_thread("MSG", &aLocal); > + mGraph->RunThread(); > + return NS_OK; We need to unregister the thread before returning to make sure we don't sample a thread that is dead.
Attachment #770890 - Flags: review?(bgirard) → review+
(In reply to comment #27) > Comment on attachment 770316 [details] [diff] [review] > --> https://bugzilla.mozilla.org/attachment.cgi?id=770316 > Don't spam the main thread when rendering an offline graph. r= > > Review of attachment 770316 [details] [diff] [review]: > ----------------------------------------------------------------- > > ::: content/media/MediaStreamGraph.cpp > @@ +910,5 @@ > > { > > mMonitor.AssertCurrentThreadOwns(); > > > > + // We don't want to update the main thread about timing update when we are not > > + // running in realtime. > > Yeah, I don't think we should do this. > > An alternative is for OfflineAudioContext to send an update to the main thread > every time a certain amount of *real* time has passed. Is this even an issue these days? I thought that I had fixed most of this. I'd love to see a profile with all of the patches here except for this one applied.
(In reply to comment #28) > Of course, Blink/Webkit will still have an edge since they're still using > linear resampling (right?). Yes.
Comment on attachment 770888 [details] [diff] [review] Don't spam the main thread when rendering an offline graph. r= Why are you using the wall clock time? That seems wrong.
Attachment #770888 - Flags: review?(roc) → review-
Attached file Profiles (open in Instruments) (deleted) —
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #37) > (In reply to comment #27) > Is this even an issue these days? I thought that I had fixed most of this. > > I'd love to see a profile with all of the patches here except for this one > applied. Attached are two profiles, gzipped: with-patch.trace and without-patch.trace (with-patch being a profile of a build with the "don't spam the main thread" patch, and without-patch being a profile of a build without this patch. Both build have all the other patches applied, opt-debug build). Those are made using Instruments on MacOs. I see a 1.5x to 2.5x speedup _with_ the patch applied. Anyways, we should not even try to update the main thread each iteration when we are running offline.
Hmm, in the without-patch profile we're spending ~5% of the audio thread's time in PrepareUpdatesToMainThreadState. But I guess this profile can be tricky since it's not easy to correlate this 5% time to the amount of time that we lock the main thread. But still I'm quite surprised that this patch results in that much speedup... But if that's the case, then ok, let's take this. (I still believe that we should not make it look at the wall clock time.) BTW, what's this OffTheBooksMutex stuff in the profile? I can't see where we use this in the MSG code...
http://hg.mozilla.org/mozilla-central/rev/56f89cd46261, used by the monitors, apparently, so we still do too much locking. I'm trying to remove some allocations in AudioNodeStream::ProduceOutput at the moment. We have way too much MSG overhead at the moment to be able to feel the difference when using the SSE2 and NEON implementations we are adding.
(In reply to comment #42) > http://hg.mozilla.org/mozilla-central/rev/56f89cd46261, used by the monitors, > apparently, so we still do too much locking. Huh, do you have --enable-debug in your mozconfig? That would be misleading, you want --enable-debug-symbols if all you need is symbols for your build. > I'm trying to remove some allocations in AudioNodeStream::ProduceOutput at the > moment. We have way too much MSG overhead at the moment to be able to feel the > difference when using the SSE2 and NEON implementations we are adding. Agreed!
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #39) > Comment on attachment 770888 [details] [diff] [review] > Don't spam the main thread when rendering an offline graph. r= > > Why are you using the wall clock time? That seems wrong. Well, this is what roc suggests in comment 27. The two solutions here are to send the event once we have rendered _n_ chunks, or time based (every x milliseconds). Time based has the advantage of sending the messages at a rate that does not depend on the machine speed. The other way as the advantage of not querying the system clock but this should not be expensive. Thoughts ?
Flags: needinfo?(ehsan)
(In reply to Paul Adenot (:padenot) from comment #44) > (In reply to :Ehsan Akhgari (needinfo? me!) from comment #39) > > Comment on attachment 770888 [details] [diff] [review] > > Don't spam the main thread when rendering an offline graph. r= > > > > Why are you using the wall clock time? That seems wrong. > > Well, this is what roc suggests in comment 27. The two solutions here are to > send the event once we have rendered _n_ chunks, or time based (every x > milliseconds). Time based has the advantage of sending the messages at a > rate that does not depend on the machine speed. The other way as the > advantage of not querying the system clock but this should not be expensive. > Thoughts ? Hmm, I see. That's a good point. If querying the system time with this patch doesn't show up in your profile, then I'm fine with this.
Flags: needinfo?(ehsan)
Using the system clock means that the overhead of sending the events is roughly constant, which is good.
Attachment #770888 - Flags: review- → review?(ehsan)
Attachment #770888 - Flags: review?(ehsan) → review+
This code was making a lot of malloc/free in my benchmarks. Using a linked list is way better in terms of perf (benchmarks are faster). I'll attach a profile tomorrow.
Attachment #777939 - Flags: review?(ehsan)
Comment on attachment 777939 [details] [diff] [review] Use a linked list for ordering stream instead of an array. r= Review of attachment 777939 [details] [diff] [review]: ----------------------------------------------------------------- Fine by me!
Attachment #777939 - Flags: review?(ehsan) → review+
If we're hitting UpdateStreamOrderForStream a lot, we should probably investigate why. For example, if we're creating and destroying a lot of streams with no inputs (like AudioBufferSourceNodes), we could avoid walking the entire stream graph, since we know these nodes can't be part of a cycle. In particular when we add one of those streams, we can always add it at the beginning of the list of streams.
I've benchmarked this, and tried to find a good value, 600 is nice.
Attachment #778384 - Flags: review?(roc)
Comment on attachment 778384 [details] [diff] [review] Use bigger chunks when using an offline graph. r= Review of attachment 778384 [details] [diff] [review]: ----------------------------------------------------------------- Hang on. I thought we agreed to dispatch messages every N ms of real time. This patch doesn't do that.
Attached image Without (deleted) —
This is a profile without attachment 778384 [details] [diff] [review] (Use bigger chunks when using an offline graph) applied. The MSG overhead is basically way too high. Applying this patch gives us a 64x speedup. I guess the patch as it is is actually wrong, since we don't process main thread events frequently enough. I think MSG iterations should run as long as there are no main thread messages or something. That should give us the same perf improvements.
Then we'd better fix the MSG overhead. Making each RunThread iteration produce more audio samples will reduce the overhead of OfflineAudioContext but it will become less representative of normal AudioContext performance.
Whiteboard: [leave-open]
Yay! Should we move further patches to their own bugs?
Attachment #778384 - Flags: review?(roc)
Yes, I've got a list of things we can optimize per benchmark listed down in my notebook at the office, I'll open more specific bugs for them.
I'm gonna close this.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla26
Whiteboard: [leave-open]
Paul, do you need any QA on this bug?
Flags: needinfo?(paul)
Flags: needinfo?(paul)
(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #61) > Paul, do you need any QA on this bug? Taking your silence as a "no". Please remove the [qa-] whiteboard tag and add the verifyme keyword if there's something we can do to help here.
Whiteboard: [qa-]
Depends on: 914016
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: