<a class="header-button" href="https://bugzilla-dev.allizom.org/home" title="Go to home page"> Bugzilla

Reporter

Updated

•

13 years ago

Keywords: regression

Jeff Muizelaar [:jrmuizel]

Comment 7

•

•

13 years ago

(In reply to comment #14) > Discarding of decoded images has been in Firefox since 3.0. In 4.0, it only > applies to background tabs. Is the decision to change it to only background tabs causing regressive behaviour? Is the decision to increase the time-out value causing regressive behaviour? Should either or both of these decisions be re-evaluated? I'm primarily interested in fixing the regression that has happened since 3.6, and I've renamed this bug accordingly. AFAICT the decode-on-draw stuff is orthogonal, is that right? If so, we should leave discussion of it in bug 573583.

Summary: Image-heavy sites are slow and often abort due to OOM → Image-heavy sites are slow and often abort due to OOM; regression from 3.6

Assignee

Updated

•

13 years ago

Assignee: nobody → jmuizelaar

Comment 18

•

•

13 years ago

Keywords: common-issue+

Johnathan Nightingale [:johnath]

Reporter

Comment 28

•

13 years ago

d.a., thanks for the clear steps to reproduce and detailed measurements, that's very helpful. > Firefox 3.6.17: > Idle: 80 MB > Peak: 170 MB > Idle after load: 115 MB How long after peak was this measurement taken? I ask because it's relevant to the image.mem.min_discard_timeout_ms value mentioned above. (But even if that "idle after load" number were to go down after 2 or 3 minutes, the peak measurements are still much higher on 4.0.1 than 3.6.17 > ├──129.13 MB (51.15%) -- images > │ ├──128.67 MB (50.97%) -- content > │ │ ├──128.66 MB (50.96%) -- used > │ │ │ ├──113.43 MB (44.93%) -- uncompressed > [...] > ├──113.96 MB (45.14%) -- gfx > │ └──113.96 MB (45.14%) -- surface > │ └──113.96 MB (45.14%) -- image > [...] > └──-47.62 MB (-18.86%) -- (1 omitted) Negative values -- bug 658814 strikes again!

d.a.

Comment 29

•

13 years ago

(In reply to comment #28) > How long after peak was this measurement taken? I ask because it's relevant > to the image.mem.min_discard_timeout_ms value mentioned above. (But even if > that "idle after load" number were to go down after 2 or 3 minutes, the peak > measurements are still much higher on 4.0.1 than 3.6.17 Only a few seconds after the page was loaded, going down a few megabytes every other second or so. Scrolling down the page returned the memory usage to the peak. Using Firefox 4, the memory usage will only go down once the page is no longer in view. After that it will follow image.mem.min_discard_timeout_ms, which I've currently set at 10 seconds. I'm going to do some testing to see which of my extensions causes the jump in memory usage to become so large (420 MB vs 260 MB for 32-bit mode).

Nathaniel Simpson

Comment 30

•

13 years ago

It's hard to give details to reproduce as it's on our internal task system, but a large coldfusion based data table also causes the issue with memory usage just climbing continually and not dropping when the tab with the table in is closed. There are only colour scale backgrounds and button images so it's not those, thought the extra info might help.

d.a.

•

13 years ago

Depends on: 661304

douglas godfrey

Comment 34

•

13 years ago

in bug 660515 Firefox memory usage peaks over 4GB but the bulk of the memory usage is not recovered for more than 4 hours and about 300MB of memory is never recovered. image.mem.min_discard_timeout_ms at 120 seconds should not cause Firefox to retain memory for images that are not displayed in ANY tab or window. Such memory should be released immediately after the window or tab is closed.

Comment 35

•

13 years ago

This isn't specific to firefox 5 or 6, it's just something we'd like to see fixed as soon as possible (and potentially worth asking for approval on aurora/beta depending on the safety of the fix). Minusing the tracking noms.

tracking-firefox5: ? → -

tracking-firefox6: ? → -

Reporter

Comment 36

•

13 years ago

FWIW, the user "E B" who reported bug 658604 has said that setting image.mem.min_discard_timeout_ms to 10000 fixed the particular problems he/she was having, and resolved that bug as WORKSFORME. Since that's a trivial change and fixes at least some of the problems we're seeing with image-heavy sites, does anyone object if we at least make that change immediately to give us some breathing room to work on the longer-term changes?

Comment 37

•

13 years ago

(In reply to comment #36) As I understand it, changing the timeout back would basically improve performance for those with low RAM and regress performance for those with higher RAM. Is there a middle-ground setting that could give enough of an effect to help the low RAM situation without causing people with newer systems to not be able to take advantage of their RAM? Ten seconds versus two minutes is a big jump. What about 40 seconds or so instead? Also, why is bug 583426 currently restricted? It'd be easier to make a good determination about this setting if information about its change was available for discussion. If that bug has a good reason to stay locked could someone with access please state why and post a quick summary of the relevant bits here? (namely how much increasing the setting actually helps on what kind of systems in what ways) That being said, this is a bad issue with a fair bit of hurt at least in part fixated on a simple setting. If lowering it will definitely fix some drastic problems, even with some perf regressions for others, then I think it should probably be lowered and pushed out in a chemspil 4.0.x update ASAP with more and better fixes in future major releases.

Comment 38

•

13 years ago

> Also, why is bug 583426 currently restricted? That bug is a new hire notification. njn must have mistyped the bug number.

Comment 39

•

13 years ago

(In reply to comment #38) > > Also, why is bug 583426 currently restricted? > > That bug is a new hire notification. njn must have mistyped the bug number. Well that makes more sense then. Bug 593426 is the real number, apparently.

Reporter

Comment 40

•

13 years ago

(In reply to comment #37) > > That being said, this is a bad issue with a fair bit of hurt at least in > part fixated on a simple setting. If lowering it will definitely fix some > drastic problems, even with some perf regressions for others Can someone quantify the perf regression for high-RAM users? I think it would make flicking between tabs slightly slower for those people, while avoiding huge slowdowns for low-RAM users (which includes mobile users). If that's right, I think we should definitely err on the side of the low-RAM users. Also, would reducing it to 40 seconds help that much? I'm thinking about the use case where the user middle-clicks on a heap of images from some index page and then gradually browses through the opened tabs. In that case, discarding the background tab images quickly is the best thing to do. > then I think > it should probably be lowered and pushed out in a chemspil 4.0.x update ASAP > with more and better fixes in future major releases. Chemspill releases are for urgent security fixes only. Firefox 5 is only a few weeks away, if we do make a change

Boris Zbarsky [:bzbarsky]

Comment 41

•

13 years ago

Reading the patch in bug 593426, the comment for the pref says something interesting here: // Minimum timeout for image discarding (in milliseconds). The actual time in // which an image must inactive for it to be discarded will vary between this // value and twice this value. Thus the 2 minute timeout is 2-4 minutes and the 10 second timeout is 10-20 seconds. If this is apparently the case, then I guess the best route would be to bump it down low ASAP and come back later with the better solution which would involve removing it and discarding smartly based on need. Large values end up with larger ranges and holding onto things more than really intended. (In reply to comment #40) > Can someone quantify the perf regression for high-RAM users? (Justin Lebar in bug 593426 comment #11) > This is pretty noticeable on image-heavy sites, such as [1]. Open it, wait > 30s or so, and then switch back. There's a period of a second or two as the > images are re-decoded where they're all blank and FF is less responsive. > > [1] > http://www.boston.com/bigpicture/2011/01/protest_spreads_in_the_middle.html Apparently with enough RAM it prevents a fairly noticeable problem. (In reply to comment #40) > Also, would reducing it to 40 seconds help that much? That was a wild guess on my part and now that I've learned the above bit of information with respect to the imprecision of this preference I'd lean towards 20 seconds instead which would translate to 20-40 seconds. But, none of this looks ideal so just going back to what it was at 10s looks like the safest route. (In reply to comment #40) > (In reply to comment #37) > > it should probably be lowered and pushed out in a chemspil 4.0.x update ASAP > > with more and better fixes in future major releases. > > Chemspill releases are for urgent security fixes only. They're for urgent security and stability fixes, and this for some of those affected is a stability issue. > Firefox 5 is only a few weeks away, if we do make a change I have to disagree here rather strongly. If this issue is common enough and crippling Firefox 4 on lower-RAM systems, then these people are highly likely to not volunteer for a major update unless they know the problem is fixed, which they won't. Mozilla already has a deeply fractured install base with many people on a wide variety of different versions and there are already statements of people reverting to Firefox 3.6 because of this. These are the users that may never upgrade again. If at far least a fix comes out in a 4.0.x update automatically then those who haven't downgraded or those at least willing to try Firefox 4 again will see the fix. Otherwise they won't, and won't update to Firefox 5. Until Mozilla gets its act together and *forces* all updates with no easily accessible way to override, you can't fix a big problem with a new major update because people won't opt to install it. (and now that we're going through major versions like candy, a fully automatic update is now a *requirement* if you want people to continue to update and use new Firefox versions, but that's another discussion)

Comment 42

•

13 years ago

> then these people are highly likely to not volunteer for a major update Firefox 5 is a minor update to Firefox 4. If you have updates enabled at all, it will happen. No volunteering involved.

Reporter

Comment 43

•

13 years ago

(In reply to comment #41) > > Chemspill releases are for urgent security fixes only. > > They're for urgent security and stability fixes, and this for some of those > affected is a stability issue. Trust me, save your breath arguing this. There won't be a chemspill release for anything less than a major zero-day exploit or similar.

Comment 44

•

13 years ago

(In reply to comment #42) > > then these people are highly likely to not volunteer for a major update > > Firefox 5 is a minor update to Firefox 4. If you have updates enabled at > all, it will happen. No volunteering involved. Heh? I'm surprised I didn't read that anywhere. So it's basically Firefox 4.1 with a "5" slapped on for no particular reason, which is profoundly dumb and confusing. In two years if this is kept up we'll hit Firefox 10 or so and have no meaning left in these versions. major.minor.update was more or less standardized... :sigh: Whatever. If Firefox 5 is really Firefox 4.1 pushed out like 4.0.2 in disguise then putting a fix in there doesn't have the problem I was worried about in the end of comment 41.

Comment 45

•

13 years ago

I think we should be really, really careful about flipping this switch down to 10 or 20s from 2m for ff5. I understand that it's frustrating some number of users, but other users on beefier machines might be frustrated when switching tabs is slower (due to sync decode) or images don't appear right away (due to async decode). The right solution might be more complex than finding the perfect default setting for this dial; at 10-20s, we might as well discard all images as soon as a tab loses focus. I think the right channel for this change is nightly, or maybe aurora. Making this change with a few weeks to go on beta scares the heck out of me.

d.a.

Comment 46

•

Comment 53

•

13 years ago

(In reply to comment #50) > We really have no idea > how many people would be helped by this change, and we really have no idea > how many people would be hurt by this change. So I don't think we have > evidence sufficient to conclude that this change would be an overall win. > > We flipped the dial up *to* 2m late in the FF4 cycle thinking that would be > an unmitigated win as well. If that was a mistake, why do we want to repeat > it by doing the same thing for FF5? So you had the dial set at 10s for at least part of the FF4 beta cycle, right? How long was it on that setting? How many complaints did you get?

Comment 54

•

13 years ago

Here's an idea. The full fix for this will be some form of heuristic to discard more smartly based on available RAM. What about a quick hack that gets it in the ballpark first? Just detect the total physical RAM and pick a timeout that scales roughly with system: 10s w/ 1GB of RAM or less 30s w/ 1-2GB of RAM 60s w/ 2-3GB of RAM 120s w/ 3GB of RAM or more Then replace the pref with an optional (i.e. no default) "image.mem.min_discard_timeout" pref (no "_ms") to still allow setting an override if really wanted (this time in seconds because millisecond resolution is pointless here anyway). This helps the low end, doesn't hurt the higher end, and gives something in the middle for the middle. The numbers may not be perfect, but it would be a better balance than we have now without having to rewrite the existing timeout based system yet. Only big question is this: is there a quick and reliable method to get the system's total physical RAM size on all platforms? The main worry is that this may not get tested fully, however because it's just varying an existing setting as long as it gets a little testing I don't think the risk would be too high, at least in comparison to the current setup that has known problems. Again, an ideal fix would be a decent heuristics system to know when to discard more based on available RAM rather than just a rough guess based on total RAM, but that can come later.

cplarosa

Comment 55

•

13 years ago

I may be naive in bringing up this point, since I haven't worked on the actual mozilla code, but isn't the fundamental operation of a garbage collector supposed to work as follows: 1. Attempt to allocate memory 2. If out of memory, run garbage collector and attempt to allocate memory again 3. If out of memory, fail The garbage collector can always be run more frequently for better performance, but step 2 seems essential. It looks like step 2 is not being executed in the current Firefox code (at lease for image cache), based on my test case. Adding that step would solve everything, would it not? I know it's probably a lot more complicated than I have made it sound, but isn't step 2 essentially what is needed? Also, just to add a bit more feedback, on my 2MB Windows XP system, the Firefox 4 betas seemed more stable than the final release (they didn't crash as often). Since you increased the image retention time late in the development cycle, I think that's probably when my Firefox became less stable, but I was unable to reproduce the problem at the time. So maybe Dave Garrett's idea would be a good temporary solution until step 2 can be added.

Reporter

Comment 56

•

13 years ago

(In reply to comment #55) > > 1. Attempt to allocate memory > 2. If out of memory, run garbage collector and attempt to allocate memory > again > 3. If out of memory, fail I think you're unclear about the meaning of "garbage". Usually it refers to JavaScript objects that will definitely never be used again. But in this bug we're mostly talking about uncompressed images that are stored in a cache; they may or may not be used again, and they can be thrown away and regenerated as necessary. That's not "garbage" per se. In other words, this bug is about the policy of the uncompressed image cache -- when should images be uncompressed into it, and when should they be discarded? Getting that policy right isn't easy.

Boris Zbarsky [:bzbarsky]

Comment 57

•

13 years ago

(In reply to comment #53) > So you had the dial set at 10s for at least part of the FF4 beta cycle, > right? How long was it on that setting? How many complaints did you get? This is a fair point. I still don't think it completely mitigates the risk of setting the dial back, however. The known unknown is that users with powerful machines might see it as a regression from current behavior; they didn't complain during the FF4 beta because what they had was better than 3.6. But there are unknown unknowns too. We've changed plenty since we flipped the switch from 10s to 2m. How will that affect users? If we're seriously considering changing this for FF5, I think we should land the change on nightly and aurora immediately, so we can get some handle on what it means.

Comment 58

•

13 years ago

> 1. Attempt to allocate memory With modern OSes, this will typically succeed, then kill the process when you try to actually use the memory, for what it's worth.

VanillaMozilla

Comment 59

•

Updated

•

•

13 years ago

P.S. Now that the facts are known, us bystanders need to butt out of the discussion and let the Mozilla guys handle it. :-)

Jeff Muizelaar [:jrmuizel]

Assignee

Comment 80

•

13 years ago

I've filed bug 664290 to lower the timeout.

Depends on: 664290

Jesse Ruderman

Comment 81

•

•

Comment 108

•

13 years ago

(In reply to comment #102) > > jrmuizel, you're the current assignee of this bug -- is this a fair summary? > Have I missed anything? Am I mistaken about the likelihood of the > decode-on-draw or prediction work happening? It is a decent summary. However, I do plan on getting decode-on-draw working and I also plan to get to place where we're not decoding/keeping around images on a page that aren't visible.

Emanuel Hoogeveen [:ehoogeveen]

Comment 109

•

13 years ago

It seems that right now, "Don't decode images too far outside the current view" means "Don't spawn workers to decode images too far outside the current view". After bug 674547, however, it will mean "Don't queue up decoding images that are too far outside the current view". As such, I proposed in that bug to queue up images for decoding in order of their distance from the center of the current view. This would still end up decoding all images on a page, but in a reasonable fashion. To partially fix this bug, perhaps the worker could keep track of how many bytes have been decoded and bail when the limit is reached? That way pages with only a few images or a lot of small ones would still get all their images decoded, whereas exceptionally image heavy pages would get only a reasonable amount. That wouldn't solve the issue of scrolling, though. Once the page is scrolled, the worker could start decoding again in the scrolling direction, or queue up a new set of undecoded images to decode if the scrolling happens too fast (say if an as yet undecoded image is scrolled into view). To solve the issue of excessive memory usage, images scrolled too out of the current view could be discarded iff the 'decoded bytes' for the page exceeds the limit used in the first worker.

Reporter

Comment 110

•

13 years ago

Based on comment 102, comment 105 and comment 108 it appears we have a good handle on the pieces needed here. So I'm going to morph this into a tracking bug. Additional comments are probably best put into the blocking bugs.

Keywords: meta

Summary: Image-heavy sites are slow and often abort due to OOM; regression from 3.6 → [meta] Image-heavy sites are slow and often abort due to OOM; regression from 3.6

Reporter

Comment 111

•

•

13 years ago

Thanks Boris. On my machine (with Firebug turned off) I see Firefox nightly times that are much slower than Chrome. I will open a different bug.

Randell Jesup [:jesup] (needinfo me)

Reporter

Comment 130

•

13 years ago

There's no need to have two tracking bugs for sub-optimal handling of images. I'm closing this in favour of bug 683284.

Status: NEW → RESOLVED

Closed: 13 years ago

Resolution: --- → DUPLICATE

Comment 131

•

13 years ago

Nicholas, I disagree somewhat - this is focused on the memory-use aspect (especially paths that lead to OOM); bug 683284 is focused on performance per the title (though the text in comment 1 there says (approx) "only things we're doing *wrong*"). Memory-use (barring degenerate cases) isn't "wrong", it's better or worse with lots of subjective tradeoffs, so I don't think this bug falls into that one on either definition. Or we need to change the summary on that metabug (and make it "any bug about image performance and memory use"), and move any bugs dependent here (I think there's only one non-duplicated bug open), and maybe open a new bug on OOM behavior (or make this non-meta and make it dependent on that one). Part of the issue is that this bug wasn't *really* a meta-bug - look at all the discussion here. A lot of it was about the image-discard behavior, after that was papered over (with the 10s change - though recently some people have advocated reverting the discard timeout!) that makes a lot of the early discussion here moot. There is other info here, but the one big things discussed but not resolved here is the more advanced discard/predictive decode algorithms such as comment 90 and comment 91. We should spin those off into another bug (or put them in the relevant existing bug).