Closed Bug 125025 Opened 23 years ago Closed 23 years ago

Animated gifs use large amounts of CPU when not visible

Categories

(Core :: Graphics: ImageLib, defect)

defect
Not set
major

Tracking

()

VERIFIED FIXED
mozilla0.9.9

People

(Reporter: jesup, Assigned: jesup)

References

Details

(Keywords: perf, testcase)

Attachments

(7 files, 4 obsolete files)

(deleted), text/html
Details
(deleted), text/html
Details
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
brendan
: review+
brendan
: superreview+
Details | Diff | Splinter Review
(deleted), patch
saari
: review+
Details | Diff | Splinter Review
(deleted), text/html
Details
(deleted), text/html
Details
When an animated gif is not visible (obscured or scrolled out of the visible page) they still can cause major CPU use. The example I'm using (from bug 86319) uses 60-70% cpu on a 233MMX even when it's not visible. IE uses close to 0% when it's not visible. (Both use 100% when it is visible on that system: 233MMX, 64MB, older cheap S3? video card.) I'll upload my testcase and get a jprof.
Attached file Testcase (deleted) —
Attached file jprof of the page (anim not visible) (deleted) —
It also includes loading the page, but that's down in the noise. 84%+ in nsImageGTK::DrawToImage() and imgContainer::BuildCompositeMask(). Are we really rebuilding the mask on every frame?
tor, aren't you in the middle of whacking that code?
It isn't our actual problem, but since I was looking at the CompositeMask code, I tweaked it for a speed improvement. Rough calculation using jprof before and after is that it's about 16% faster. And yes, we are building the masks every time.
Attached patch small perf patch to CompositeMask (obsolete) (deleted) — Splinter Review
This should speed up the mask generation some. Nothing fancy, just hoisting things out of loops, and avoiding calculations of indexes and masks on the fly - use pointers and shifts of mask vars instead. Easier to read after applying; I'll post a -w -u as well. This does NOT solve the basic problem of this bug; it may help some. Rough (very) calulation of improvement is 16%; note that in this case CPU use remains at around 100% (it just spins faster).
Comment on attachment 69091 [details] [diff] [review] small perf patch to CompositeMask r=pavlov
Attachment #69091 - Flags: review+
The basic problem, as per expectations, is that on wakeup, we're calling DrawTo to draw the next frame, and then calling BuildCompositeMask - all without any idea if the frame is visible. Both DrawToImage() and BuildCompositeMask are expensive due to transparency issues. Sounds like we either need to clip (may be hard) or do a visibility check (easier I'd think), or find a way to cache the alpha data, or suspend the actual animation when not visible except for keeping the frame count (and correctly rebuild the on-screen data when it scrolls back on screen! - not trivial). If we totally suspend anims that are off-screen, there will probably be some objections, though I don't think it violates any specs (it might, I suppose).
Yes, I'm reworking a fair bit of nsImageGTK code. I was hoping to move all the work of nsImageGTK to the server side, but since imgContainer is fiddling with the bits (image and alpha) inside a Lock/Unlock it can't be done. Attached is an (untested) subset of my 121015 patch which removes some useless work from DrawToImage and probaly makes it more correct (we'll always pick up the locked bits modifications, which I think we could miss with the current setup). What some people have been talking about is caching the GIF frames post-composite so that this expensive work only needs to be done once.
Attached patch rewritten patch (obsolete) (deleted) — Splinter Review
Updated patch. Now uses almost no time for CompositeMask. Removed all bit-twiddling from the common cases (no offset between alpha and overlay). Do multiple bits at a time (8 at a time for no alpha/overlay offset). Don't write to alpha if overlay bits are 0x00. Much, much much faster; probably close to optimal. jprof with the patch shows 99% of CPU use is now in DrawToImage, 0 to 1% in CompositeMask. Tried with anim with colored background to make sure it was working correctly. Also corrects a bug in previous patch when overlayXoffset != 0 (missing shift).
Attachment #69091 - Attachment is obsolete: true
Attachment #69092 - Attachment is obsolete: true
--> Randell :-) who is doing much appreciated work.
Assignee: pavlov → rjesup
Target Milestone: --- → mozilla0.9.9
r=/sr= for the patch? Again, this still won't solve the problem with DrawImageTo(), but it will speed up transparent anims in all cases including reducing the overhead when they're offscreen.
Keywords: patch
Comment on attachment 69370 [details] [diff] [review] rewritten patch r=pavlov
Attachment #69370 - Flags: review+
Blocks: 119597
Comment on attachment 69370 [details] [diff] [review] rewritten patch >+ for(PRUint32 i=0; >+ i < height; >+ i++) { Nit: put this on one line, it's a canonical for loop that fits. The for loop header it replaces didn't fit, but it's history and cvs diff shows it far before this replacement. >+ >+ PRUint8 pixels; >+ PRUint32 j; >+ // use locals to avoid keeping track of how much we need to add >+ // at the end of a line. we don't really need this since we may >+ // be able to calculate the ending offsets, but it's simpler and >+ // cheap. >+ PRUint8 *localOverlay = overlayLine-1; >+ PRUint8 *localAlpha = alphaLine; >+ >+ for (j = 0; j < width; j += 8) { >+ // don't do in for(...) to avoid reference past end of buffer >+ pixels = *++localOverlay; So you're subtracting overlayLine - 1 before the loop, then preincrementing, for an extra sub and as many adds as would obtain if you just post-incremented. There's no inefficiency with post-incrementing vs. pre- here -- addi;load or load;addi on RISCy architectures. Please correct me if I'm wrong in Intel. >+ >+ if (pixels == 0) { >+ // no bits to set - iterate and bump output pointer >+ ++localAlpha; >+ } >+ else { >+ // for the last few bits, we need to special-case it "last few" or "first few"? /me is confused. >+ if (j >= 8) { >+ if (mask_offset == 0) { >+ // simple case, no offset >+ *localAlpha++ |= pixels; >+ } >+ else { >+ *localAlpha++ |= (pixels >> mask_offset); >+ *localAlpha |= (pixels << (8U-mask_offset)); >+ } >+ } The above case (inside the (j >= 8) condition) could be the body of this loop, if you hoist this code: >+ else { >+ // last few bits have to be handled more carefully if >+ // width is not a multiple of 8. >+ >+ // set bits we don't want to change to 0 >+ pixels = (pixels >> (8U-j)) << (8U-j); >+ *localAlpha++ |= (pixels >> mask_offset); >+ // don't touch this byte unless we have bits for it >+ if (j > (8U - mask_offset)) >+ *localAlpha |= (pixels << (8U-mask_offset)); >+ } Then the loop's smaller, has only one (predicted taken, I hope) branch around the (pixels == 0) ++localAlpha code, and runs a bit faster. >+ } > } >+ >+// Windows and OS/2 have the funky bottom up data storage we need to account for >+#if defined(XP_WIN) || defined(XP_OS2) >+ alphaLine -= abprComposite; >+ overlayLine -= abprOverlay; >+#else >+ alphaLine += abprComposite; >+ overlayLine += abprOverlay; >+#endif > } > } >- > break; > default: > break; If you could unroll the j < 8 case and re-test, I'll sr= quickly. Thanks, /be
Nice patch, thanks!
>So you're subtracting overlayLine - 1 before the loop, then preincrementing, >for an extra sub and as many adds as would obtain if you just post-incremented. >There's no inefficiency with post-incrementing vs. pre- here -- addi;load or >load;addi on RISCy architectures. Please correct me if I'm wrong in Intel. It was a left-over from an intermediate form of the patch. Fixed. >+ // for the last few bits, we need to special-case it >+ if (j >= 8) { > >"last few" or "first few"? /me is confused. Last few bits of a line. This is a bug in the logic of the code; at one point I reversed the value of j to be decrementing but didn't reverse the for() statement. It should be: for (j = width; j > 0; j -= 8) { (otherwise the handling of partials isn't correct, due to the (8U-j) term). Revised, and retesting. I'll upload after running some tests.
Perf tests show 0% in BuildCompositeMask for not-visible (all CPU time is now in DrawToImage, used to be 50% BuildCompositeMask), 1.2% when it's visible (used to be >20%). Looks pretty good to me. Patch to be attached shortly.
Attached patch Updated patch, tested heavily (deleted) — Splinter Review
Updated patch as per comments. Also fixes incorrect mask_offset calculation. For a good testcase, see http://www.agag.com/gallery/toon/finnyluv.shtml (which has offsets that vary from frame to frame). r=/sr= please, and lets get it in.
Attachment #69370 - Attachment is obsolete: true
Comment on attachment 70147 [details] [diff] [review] Updated patch, tested heavily Carrying r= forward. sr=brendan@mozilla.org if you use the prevailing if-{ style (the { on the same line as the if, not on its own line). /be
Attachment #70147 - Flags: superreview+
Attachment #70147 - Flags: review+
Patch for imgContainer.cpp checked in. I have a patch that does the same sort of hackery to DrawToImage to optimize drawing transparent anims. This should (mostly) solve the other half of the equation, especially when combined with bug 125157 (limit animation rates - no infinitely fast anims). The patch is working but still has 1 bug I need to ferret out for tomorrow. That should solve the other bad actor in the heavy CPU use for transparent anims.
Keywords: mozilla1.0mozilla0.9.9
Sorry, I meant bug 125137 for the animation-refresh-limiting bug. Marking this as dependent on that (since speeding up these functions merely wastes cpu faster unless we limit the rate we try to animate). Also, we still need to try to shortcut all of this masking/drawing if the anim isn't visible.
Depends on: 125137
Similar rewrite to DrawToImage to avoid bit-twiddling. We should also apply the patch #69176 - it's not integrated here. I can either provide an updated copy of that after we get this patch in, or I can merge them if people think it won't cause any danger of not getting this fix in. Should greatly reduce CPU use in DrawToImage. Tested heavily using similar testcases as per above. r/sr ASAP so this can make 0.9.9 please
I'm wondering if lookup tables might further improve speed. For example, where you have alphaPixels &= 0xff << (8 - (aDWidth-x)); the right hand side presumable has only 8 possible values so it could be alphaPixels &= maskvals1[aDWidth-x]; There are other similar places. Just an idea.
Maybe an array would be cheaper - but probably not, since you'd introduce a memory read into the loop. Right now mask can live entirely in a register, and mask >>= 1 is cheap. Memory references are bad, tight register loops are good (generally). I'll remove the NULL and substitute nsnull (I was just silencing one of many warnings already in the code) before I check in.
Memory is red-shifting away from the super-scalar pipeline on modern CPUs. Even on-CPU cache memory may take multiple cycles to access. So i agree with rjesup in general. The Intel P4 seems, however, to have penalized non-constant shifts, and we can't turn them into multiplies easily here. But don't prematurely optimize for P4, I say. Make the compiler sweat it, measure first, etc. Pav, can you r= (test if not read the code :-)? /be
Realtime jprof of the testcase, with the anim not visible. The anim has a repeat rate of 10ms (100Hz), which in this test was not limited (bug 125137 not in place). Note, dual-450 P3 Linux, remote X display (no bits are actually being drawn). Before latest patch, 25% hits in DrawToImage (75% poll) After latest patch, 8% hits in DrawToImage, 90+% poll. Before either this patch or the other one already committed, non-poll hits were >50%. Looks like overhead is down 66% for DrawToImage, 80-90% down overall for both patches. Yum. I hope my code didn't make anyone's head hurt (pav). ;-) I've thought of going through and making it even faster, since it still shows up as an appreciable %. It would be pretty easy to extend to 32-bit chunks, and perhaps more importantly I could gather runs of 0xff's and do larger memcpy's. I'm not sure these would produce a measurable improvement, however. I'm ready to checkin once r/sr are in place.
Note: 1st patch applied, second not applied.
The "reduce work in DrawToImage" patch posted earlier (http://bugzilla.mozilla.org/attachment.cgi?id=69176&action=view) appears to do the opposite; even with the other 2 patches applied it raises CPU use from 11-14% to 45-50% (measured via 'top' using the testcase). So I think we should leave that one out for now, and just put in my patch. Perhaps we can revisit things later. Tor might well fix some of this in his GTK rewrite. As per the obvious, the biggest win would be to not do any (or much) of this work for non-visible areas.
Interesting - do you have a way of measuring frame rate before/after the change?
Comment on attachment 70330 [details] [diff] [review] Patch to majorly speed up DrawToImage sr=tor
Attachment #70330 - Flags: superreview+
rate: I haven't put anything in. I could, but honestly I don't care how close to 0ms we can animate. I do care how much CPU time animating uses - my measurements of the testcase (big 10ms transparent anim, which has a lot of area with mixed (not 0 or ff) mask bytes, showed a 85-90% total reduction in CPU use. I don't think we can do a lot more without making it not try to mask/draw non-visible anims. At a 10ms anim rate, nothing visible on-screen, we used 8% of one CPU of a dual-450 PIII. We used to use more than 60%. We also need to approve 125137 to stop trying to animate at 0ms - we can argue if it should max out at 10, 16, 20ms. I can't see any reason to animate faster than 10ms in our system.
Comment on attachment 70330 [details] [diff] [review] Patch to majorly speed up DrawToImage r=saari Do the other platforms need similar unrolling? Seems like there should be a way to structure this based on the processor's preferred "native" register size. For example, on Mac I'm thinking this could be even more intersting when using the AltiVec ney Velocity Engine vector unit registers in full 128 bit mode.
Attachment #70330 - Flags: review+
a=roc+moz for 0.9.9 This isn't actually unrolling here. This is just smarter code. Per-processor vectorized code for this and similar loops (e.g., in nsBlender) would be very interesting but should definitely be spun off into a separate bug. We'll always want a reasonably easy to understand, reasonably fast C path anyway.
Fix checked in. If someone want to try to rehabilitate the other DrawToImage patch (which _seemed_ to slow things down for me), feel free to reopen or add another bug.
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Does this fix the bug I filed a bit ago? If not, could the same fix be applied to 120154's case? Bug 120154 - animated GIF eating cpu when mozilla in another tab or minimized
Hmm. resource:///res/samples/gear1.gif simply stopped animating in Mozilla very recently. Side-effect of this fix?
That resource url animates fine for me with the patch, just tried it.
Sorry for the false alarm -- it actually looks like it's tor's server-side scaling code that I'm testing in my tree that caused gears1.gif to stop animating. I'll comment over on that bug.
Re-ran jprof with build from midday today. 99.5% of hits are in poll(), 0.4% in DrawToImage() when the testcase is offscreen. Verified. Bigtime. :-) (tor's changes may be involved in this too)
Status: RESOLVED → VERIFIED
Hm.... I still see images in _minimized_ mozilla windows taking up lots of cpu. Was this bug supposed to address that, or should I open a new one?
This was not directly targeted at "minimize", though it would have an effect on that as well. Note that this patch didn't stop us from updating invisible images; it merely reduces the overhead of masking/drawing transparent anims (which was way higher than it should have been). It has no effect on non-transparent anims, for example. Bug 125137 would help with reducing anim overhead in general.
Ah, OK. I was dealing with a non-transparent gif. :)
So, really, this was a good optimization but it didn't technically address the bug in hand? Ie. it helps with 'animated gifs use large amounts of CPU' but not 'animated gifs use large amounts of CPU when not visible' except as a side-effect of making (transparent) animated GIFs faster overall. How does moz avoid draws to updated images which are (scrolled) off-screen? Does it bother, at the client side? If it just sets a clip-region and lets X11 clip the image at the server side then each frame still has to be dithered/remapped and go 'over the wire', IIRC.
The fact that we still animate when not visible is bug 120154. I'll post a jprof there.
Attachment #8914375 - Attachment is obsolete: true
Attachment #8914375 - Flags: review?(gasolin)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: