Closed Bug 125025 Opened 23 years ago Closed 23 years ago

Animated gifs use large amounts of CPU when not visible

Tracking

()

Status:

VERIFIED FIXED

Milestone:

mozilla0.9.9

People

(Reporter: jesup, Assigned: jesup)

References

Details

(Keywords: perf, testcase)

Attachments

(7 files, 4 obsolete files)

Testcase 23 years ago Randell Jesup [:jesup] (needinfo me) (deleted), text/html		Details
jprof of the page (anim not visible) 23 years ago Randell Jesup [:jesup] (needinfo me) (deleted), text/html		Details
small perf patch to CompositeMask 23 years ago Randell Jesup [:jesup] (needinfo me) (deleted), patch	pavlov : review+	Details \| Diff \| Splinter Review
-w -u patch - not actually much more readable 23 years ago Randell Jesup [:jesup] (needinfo me) (deleted), patch		Details \| Diff \| Splinter Review
reduce work in nsImageGTK::DrawToImage 23 years ago tor (deleted), patch		Details \| Diff \| Splinter Review
rewritten patch 23 years ago Randell Jesup [:jesup] (needinfo me) (deleted), patch	pavlov : review+	Details \| Diff \| Splinter Review
Updated patch, tested heavily 23 years ago Randell Jesup [:jesup] (needinfo me) (deleted), patch	brendan : review+ brendan : superreview+	Details \| Diff \| Splinter Review
Patch to majorly speed up DrawToImage 23 years ago Randell Jesup [:jesup] (needinfo me) (deleted), patch	saari : review+ tor : superreview+	Details \| Diff \| Splinter Review
jprof before latest patch (realtime, dual-450 P3 linux) of Testcase 23 years ago Randell Jesup [:jesup] (needinfo me) (deleted), text/html		Details
real-time jprof, after DrawToImage patch 23 years ago Randell Jesup [:jesup] (needinfo me) (deleted), text/html		Details
Bug 125025 - Properly decode response body; 7 years ago Jan Honza Odvarko [:Honza] (always need-info? me) (deleted), text/x-review-board-request		Details

Randell Jesup [:jesup] (needinfo me)

Assignee

Description

•

23 years ago

When an animated gif is not visible (obscured or scrolled out of the visible page) they still can cause major CPU use. The example I'm using (from bug 86319) uses 60-70% cpu on a 233MMX even when it's not visible. IE uses close to 0% when it's not visible. (Both use 100% when it is visible on that system: 233MMX, 64MB, older cheap S3? video card.) I'll upload my testcase and get a jprof.

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 1

•

23 years ago

Attached file Testcase (deleted) — Details

Randell Jesup [:jesup] (needinfo me)

Assignee

Updated

•

23 years ago

Keywords: mozilla1.0, perf, testcase

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 2

•

23 years ago

Attached file jprof of the page (anim not visible) (deleted) — Details

It also includes loading the page, but that's down in the noise. 84%+ in nsImageGTK::DrawToImage() and imgContainer::BuildCompositeMask(). Are we really rebuilding the mask on every frame?

Boris Zbarsky [:bzbarsky]

Comment 3

•

23 years ago

tor, aren't you in the middle of whacking that code?

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 4

•

23 years ago

It isn't our actual problem, but since I was looking at the CompositeMask code, I tweaked it for a speed improvement. Rough calculation using jprof before and after is that it's about 16% faster. And yes, we are building the masks every time.

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 5

•

23 years ago

Attached patch small perf patch to CompositeMask (obsolete) (deleted) — Details — Splinter Review

This should speed up the mask generation some. Nothing fancy, just hoisting things out of loops, and avoiding calculations of indexes and masks on the fly - use pointers and shifts of mask vars instead. Easier to read after applying; I'll post a -w -u as well. This does NOT solve the basic problem of this bug; it may help some. Rough (very) calulation of improvement is 16%; note that in this case CPU use remains at around 100% (it just spins faster).

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 6

•

23 years ago

Attached patch -w -u patch - not actually much more readable (obsolete) (deleted) — Details — Splinter Review

Stuart Parmenter

Comment 7

•

23 years ago

Comment on attachment 69091 [details] [diff] [review] small perf patch to CompositeMask r=pavlov

Attachment #69091 - Flags: review+

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 8

•

23 years ago

The basic problem, as per expectations, is that on wakeup, we're calling DrawTo to draw the next frame, and then calling BuildCompositeMask - all without any idea if the frame is visible. Both DrawToImage() and BuildCompositeMask are expensive due to transparency issues. Sounds like we either need to clip (may be hard) or do a visibility check (easier I'd think), or find a way to cache the alpha data, or suspend the actual animation when not visible except for keeping the frame count (and correctly rebuild the on-screen data when it scrolls back on screen! - not trivial). If we totally suspend anims that are off-screen, there will probably be some objections, though I don't think it violates any specs (it might, I suppose).

tor

Comment 9

•

23 years ago

Attached patch reduce work in nsImageGTK::DrawToImage (deleted) — Details — Splinter Review

tor

Comment 10

•

23 years ago

Yes, I'm reworking a fair bit of nsImageGTK code. I was hoping to move all the work of nsImageGTK to the server side, but since imgContainer is fiddling with the bits (image and alpha) inside a Lock/Unlock it can't be done. Attached is an (untested) subset of my 121015 patch which removes some useless work from DrawToImage and probaly makes it more correct (we'll always pick up the locked bits modifications, which I think we could miss with the current setup). What some people have been talking about is caching the GIF frames post-composite so that this expensive work only needs to be done once.

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 11

•

23 years ago

Attached patch rewritten patch (obsolete) (deleted) — Details — Splinter Review

Updated patch. Now uses almost no time for CompositeMask. Removed all bit-twiddling from the common cases (no offset between alpha and overlay). Do multiple bits at a time (8 at a time for no alpha/overlay offset). Don't write to alpha if overlay bits are 0x00. Much, much much faster; probably close to optimal. jprof with the patch shows 99% of CPU use is now in DrawToImage, 0 to 1% in CompositeMask. Tried with anim with colored background to make sure it was working correctly. Also corrects a bug in previous patch when overlayXoffset != 0 (missing shift).

Attachment #69091 - Attachment is obsolete: true

Attachment #69092 - Attachment is obsolete: true

Cathleen

Comment 12

•

23 years ago

--> Randell :-) who is doing much appreciated work.

Assignee: pavlov → rjesup

Target Milestone: --- → mozilla0.9.9

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 13

•

23 years ago

r=/sr= for the patch? Again, this still won't solve the problem with DrawImageTo(), but it will speed up transparent anims in all cases including reducing the overhead when they're offscreen.

Keywords: patch

Stuart Parmenter

Comment 14

•

23 years ago

Comment on attachment 69370 [details] [diff] [review] rewritten patch r=pavlov

Attachment #69370 - Flags: review+

Randell Jesup [:jesup] (needinfo me)

Assignee

Updated

•

23 years ago

Blocks: 119597

Brendan Eich [:brendan]

Comment 15

•

23 years ago

Comment on attachment 69370 [details] [diff] [review] rewritten patch >+ for(PRUint32 i=0; >+ i < height; >+ i++) { Nit: put this on one line, it's a canonical for loop that fits. The for loop header it replaces didn't fit, but it's history and cvs diff shows it far before this replacement. >+ >+ PRUint8 pixels; >+ PRUint32 j; >+ // use locals to avoid keeping track of how much we need to add >+ // at the end of a line. we don't really need this since we may >+ // be able to calculate the ending offsets, but it's simpler and >+ // cheap. >+ PRUint8 *localOverlay = overlayLine-1; >+ PRUint8 *localAlpha = alphaLine; >+ >+ for (j = 0; j < width; j += 8) { >+ // don't do in for(...) to avoid reference past end of buffer >+ pixels = *++localOverlay; So you're subtracting overlayLine - 1 before the loop, then preincrementing, for an extra sub and as many adds as would obtain if you just post-incremented. There's no inefficiency with post-incrementing vs. pre- here -- addi;load or load;addi on RISCy architectures. Please correct me if I'm wrong in Intel. >+ >+ if (pixels == 0) { >+ // no bits to set - iterate and bump output pointer >+ ++localAlpha; >+ } >+ else { >+ // for the last few bits, we need to special-case it "last few" or "first few"? /me is confused. >+ if (j >= 8) { >+ if (mask_offset == 0) { >+ // simple case, no offset >+ *localAlpha++ |= pixels; >+ } >+ else { >+ *localAlpha++ |= (pixels >> mask_offset); >+ *localAlpha |= (pixels << (8U-mask_offset)); >+ } >+ } The above case (inside the (j >= 8) condition) could be the body of this loop, if you hoist this code: >+ else { >+ // last few bits have to be handled more carefully if >+ // width is not a multiple of 8. >+ >+ // set bits we don't want to change to 0 >+ pixels = (pixels >> (8U-j)) << (8U-j); >+ *localAlpha++ |= (pixels >> mask_offset); >+ // don't touch this byte unless we have bits for it >+ if (j > (8U - mask_offset)) >+ *localAlpha |= (pixels << (8U-mask_offset)); >+ } Then the loop's smaller, has only one (predicted taken, I hope) branch around the (pixels == 0) ++localAlpha code, and runs a bit faster. >+ } > } >+ >+// Windows and OS/2 have the funky bottom up data storage we need to account for >+#if defined(XP_WIN) || defined(XP_OS2) >+ alphaLine -= abprComposite; >+ overlayLine -= abprOverlay; >+#else >+ alphaLine += abprComposite; >+ overlayLine += abprOverlay; >+#endif > } > } >- > break; > default: > break; If you could unroll the j < 8 case and re-test, I'll sr= quickly. Thanks, /be

saari (gone)

Comment 16

•

23 years ago

Nice patch, thanks!

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 17

•

23 years ago

>So you're subtracting overlayLine - 1 before the loop, then preincrementing, >for an extra sub and as many adds as would obtain if you just post-incremented. >There's no inefficiency with post-incrementing vs. pre- here -- addi;load or >load;addi on RISCy architectures. Please correct me if I'm wrong in Intel. It was a left-over from an intermediate form of the patch. Fixed. >+ // for the last few bits, we need to special-case it >+ if (j >= 8) { > >"last few" or "first few"? /me is confused. Last few bits of a line. This is a bug in the logic of the code; at one point I reversed the value of j to be decrementing but didn't reverse the for() statement. It should be: for (j = width; j > 0; j -= 8) { (otherwise the handling of partials isn't correct, due to the (8U-j) term). Revised, and retesting. I'll upload after running some tests.

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 18

•

23 years ago

Perf tests show 0% in BuildCompositeMask for not-visible (all CPU time is now in DrawToImage, used to be 50% BuildCompositeMask), 1.2% when it's visible (used to be >20%). Looks pretty good to me. Patch to be attached shortly.

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 19

•

23 years ago

Attached patch Updated patch, tested heavily (deleted) — Details — Splinter Review

Updated patch as per comments. Also fixes incorrect mask_offset calculation. For a good testcase, see http://www.agag.com/gallery/toon/finnyluv.shtml (which has offsets that vary from frame to frame). r=/sr= please, and lets get it in.

Attachment #69370 - Attachment is obsolete: true

Brendan Eich [:brendan]

Comment 20

•

23 years ago

Comment on attachment 70147 [details] [diff] [review] Updated patch, tested heavily Carrying r= forward. sr=brendan@mozilla.org if you use the prevailing if-{ style (the { on the same line as the if, not on its own line). /be

Attachment #70147 - Flags: superreview+

Attachment #70147 - Flags: review+

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 21

•

23 years ago

Patch for imgContainer.cpp checked in. I have a patch that does the same sort of hackery to DrawToImage to optimize drawing transparent anims. This should (mostly) solve the other half of the equation, especially when combined with bug 125157 (limit animation rates - no infinitely fast anims). The patch is working but still has 1 bug I need to ferret out for tomorrow. That should solve the other bad actor in the heavy CPU use for transparent anims.

Keywords: mozilla1.0 → mozilla0.9.9

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 22

•

23 years ago

Sorry, I meant bug 125137 for the animation-refresh-limiting bug. Marking this as dependent on that (since speeding up these functions merely wastes cpu faster unless we limit the rate we try to animate). Also, we still need to try to shortcut all of this masking/drawing if the anim isn't visible.

Depends on: 125137

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 23

•

23 years ago

Attached patch Patch to majorly speed up DrawToImage (deleted) — Details — Splinter Review

Similar rewrite to DrawToImage to avoid bit-twiddling. We should also apply the patch #69176 - it's not integrated here. I can either provide an updated copy of that after we get this patch in, or I can merge them if people think it won't cause any danger of not getting this fix in. Should greatly reduce CPU use in DrawToImage. Tested heavily using similar testcases as per above. r/sr ASAP so this can make 0.9.9 please

tenthumbs

Comment 24

•

23 years ago

I'm wondering if lookup tables might further improve speed. For example, where you have alphaPixels &= 0xff << (8 - (aDWidth-x)); the right hand side presumable has only 8 possible values so it could be alphaPixels &= maskvals1[aDWidth-x]; There are other similar places. Just an idea.

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 25

•

23 years ago

Maybe an array would be cheaper - but probably not, since you'd introduce a memory read into the loop. Right now mask can live entirely in a register, and mask >>= 1 is cheap. Memory references are bad, tight register loops are good (generally). I'll remove the NULL and substitute nsnull (I was just silencing one of many warnings already in the code) before I check in.

Brendan Eich [:brendan]

Comment 26

•

23 years ago

Memory is red-shifting away from the super-scalar pipeline on modern CPUs. Even on-CPU cache memory may take multiple cycles to access. So i agree with rjesup in general. The Intel P4 seems, however, to have penalized non-constant shifts, and we can't turn them into multiplies easily here. But don't prematurely optimize for P4, I say. Make the compiler sweat it, measure first, etc. Pav, can you r= (test if not read the code :-)? /be

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 27

•

23 years ago

Realtime jprof of the testcase, with the anim not visible. The anim has a repeat rate of 10ms (100Hz), which in this test was not limited (bug 125137 not in place). Note, dual-450 P3 Linux, remote X display (no bits are actually being drawn). Before latest patch, 25% hits in DrawToImage (75% poll) After latest patch, 8% hits in DrawToImage, 90+% poll. Before either this patch or the other one already committed, non-poll hits were >50%. Looks like overhead is down 66% for DrawToImage, 80-90% down overall for both patches. Yum. I hope my code didn't make anyone's head hurt (pav). ;-) I've thought of going through and making it even faster, since it still shows up as an appreciable %. It would be pretty easy to extend to 32-bit chunks, and perhaps more importantly I could gather runs of 0xff's and do larger memcpy's. I'm not sure these would produce a measurable improvement, however. I'm ready to checkin once r/sr are in place.

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 28

•

23 years ago

Attached file jprof before latest patch (realtime, dual-450 P3 linux) of Testcase (deleted) — Details

Note: 1st patch applied, second not applied.

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 29

•

23 years ago

Attached file real-time jprof, after DrawToImage patch (deleted) — Details

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 30

•

23 years ago

The "reduce work in DrawToImage" patch posted earlier (http://bugzilla.mozilla.org/attachment.cgi?id=69176&action=view) appears to do the opposite; even with the other 2 patches applied it raises CPU use from 11-14% to 45-50% (measured via 'top' using the testcase). So I think we should leave that one out for now, and just put in my patch. Perhaps we can revisit things later. Tor might well fix some of this in his GTK rewrite. As per the obvious, the biggest win would be to not do any (or much) of this work for non-visible areas.

tor

Comment 31

•

23 years ago

Interesting - do you have a way of measuring frame rate before/after the change?

tor

Comment 32

•

23 years ago

Comment on attachment 70330 [details] [diff] [review] Patch to majorly speed up DrawToImage sr=tor

Attachment #70330 - Flags: superreview+

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 33

•

23 years ago

rate: I haven't put anything in. I could, but honestly I don't care how close to 0ms we can animate. I do care how much CPU time animating uses - my measurements of the testcase (big 10ms transparent anim, which has a lot of area with mixed (not 0 or ff) mask bytes, showed a 85-90% total reduction in CPU use. I don't think we can do a lot more without making it not try to mask/draw non-visible anims. At a 10ms anim rate, nothing visible on-screen, we used 8% of one CPU of a dual-450 PIII. We used to use more than 60%. We also need to approve 125137 to stop trying to animate at 0ms - we can argue if it should max out at 10, 16, 20ms. I can't see any reason to animate faster than 10ms in our system.

saari (gone)

Comment 34

•

23 years ago

Comment on attachment 70330 [details] [diff] [review] Patch to majorly speed up DrawToImage r=saari Do the other platforms need similar unrolling? Seems like there should be a way to structure this based on the processor's preferred "native" register size. For example, on Mac I'm thinking this could be even more intersting when using the AltiVec ney Velocity Engine vector unit registers in full 128 bit mode.

Attachment #70330 - Flags: review+

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 35

•

23 years ago

a=roc+moz for 0.9.9 This isn't actually unrolling here. This is just smarter code. Per-processor vectorized code for this and similar loops (e.g., in nsBlender) would be very interesting but should definitely be spun off into a separate bug. We'll always want a reasonably easy to understand, reasonably fast C path anyway.

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 36

•

23 years ago

Fix checked in. If someone want to try to rehabilitate the other DrawToImage patch (which _seemed_ to slow things down for me), feel free to reopen or add another bug.

Status: NEW → RESOLVED

Closed: 23 years ago

Resolution: --- → FIXED

Jeremy M. Dolan

Comment 37

•

23 years ago

Does this fix the bug I filed a bit ago? If not, could the same fix be applied to 120154's case? Bug 120154 - animated GIF eating cpu when mozilla in another tab or minimized

Adam D. Moss

Comment 38

•

23 years ago

Hmm. resource:///res/samples/gear1.gif simply stopped animating in Mozilla very recently. Side-effect of this fix?

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 39

•

23 years ago

That resource url animates fine for me with the patch, just tried it.

Adam D. Moss

Comment 40

•

23 years ago

Sorry for the false alarm -- it actually looks like it's tor's server-side scaling code that I'm testing in my tree that caused gears1.gif to stop animating. I'll comment over on that bug.

Christopher Blizzard (:blizzard)

Updated

•

23 years ago

Keywords: mozilla0.9.9 → mozilla0.9.9+

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 41

•

23 years ago

Re-ran jprof with build from midday today. 99.5% of hits are in poll(), 0.4% in DrawToImage() when the testcase is offscreen. Verified. Bigtime. :-) (tor's changes may be involved in this too)

Status: RESOLVED → VERIFIED

Boris Zbarsky [:bzbarsky]

Comment 42

•

23 years ago

Hm.... I still see images in _minimized_ mozilla windows taking up lots of cpu. Was this bug supposed to address that, or should I open a new one?

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 43

•

23 years ago

This was not directly targeted at "minimize", though it would have an effect on that as well. Note that this patch didn't stop us from updating invisible images; it merely reduces the overhead of masking/drawing transparent anims (which was way higher than it should have been). It has no effect on non-transparent anims, for example. Bug 125137 would help with reducing anim overhead in general.

Boris Zbarsky [:bzbarsky]

Comment 44

•

23 years ago

Ah, OK. I was dealing with a non-transparent gif. :)

Adam D. Moss

Comment 45

•

23 years ago

So, really, this was a good optimization but it didn't technically address the bug in hand? Ie. it helps with 'animated gifs use large amounts of CPU' but not 'animated gifs use large amounts of CPU when not visible' except as a side-effect of making (transparent) animated GIFs faster overall. How does moz avoid draws to updated images which are (scrolled) off-screen? Does it bother, at the client side? If it just sets a clip-region and lets X11 clip the image at the server side then each frame still has to be dithered/remapped and go 'over the wire', IIRC.

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 46

•

23 years ago

The fact that we still animate when not visible is bug 120154. I'll post a jprof there.

Comment hidden (mozreview-request)

Jan Honza Odvarko [:Honza] (always need-info? me)

Updated

•

7 years ago

Attachment #8914375 - Attachment is obsolete: true

Attachment #8914375 - Flags: review?(gasolin)

You need to log in before you can comment on or make changes to this bug.