Open Bug 1546671 Opened 6 years ago Updated 2 years ago

Investigate high OOM rate for WebRender

Categories

(Core :: Graphics: WebRender, defect, P2)
Version: Other Branch


People

(Reporter: kats, Unassigned)


https://metrics.mozilla.com/webrender/dashboard_nvidia.html#nightly shows WR at almost 400% of non-WR for out-of-memory crashes. That's not good. The beta graph just below shows a more "reasonable" ~140%.

This needs some investigation to figure out what's going on.

I used Databricks to get the Windows GPU process "OOM | small" crashes with WR enabled on beta since 20190414 and collated the stack frames to get a better idea of which code is OOMing. The top few most frequent stacks are below (I aggressively pruned stack frames to make them more readable); a rough sketch of the collation step follows the stack list.

13, static void webrender::scene_builder::SceneBuilder::run() | static void std::sys_common::backtrace::__rust_begin_short_backtrace<closure,()>(struct closure) | static void alloc::boxed::{{impl}}::call_box<(),closure>(struct closure *, <NoType>)
5, moz_xmalloc | mozilla::BufferList<InfallibleAllocPolicy>::AllocateSegment(unsigned __int64,unsigned __int64) | mozilla::BufferList<InfallibleAllocPolicy>::WriteBytes(char const *,unsigned __int64)
4, static union core::result::Result<(), alloc::collections::CollectionAllocErr> std::collections::hash::map::HashMap<webrender_api::display_item::ClipId, webrender::display_list_flattener::ClipNode, core::hash::BuildHasherDefault<fxhash::FxHasher>>::try_resize<webrender_api::display_item::ClipId,webrender::display_list_flattener::ClipNode,core::hash::BuildHasherDefault<fxhash::FxHasher>>(unsigned __int64, std::collections::hash::table::Fallibility) | static void webrender::display_list_flattener::NodeIdToIndexMapper::add_clip_chain(union webrender_api::display_item::ClipId, struct webrender::clip::ClipChainId, unsigned __int64) | static union core::option::Option<webrender_api::display_list::BuiltDisplayListIter> webrender::display_list_flattener::DisplayListFlattener::flatten_item(struct webrender_api::display_list::DisplayItemRef, struct webrender_api::api::PipelineId, bool)
3, static union core::result::Result<(), alloc::collections::CollectionAllocErr> std::collections::hash::map::HashMap<(i32, i32), webrender::picture::Tile, core::hash::BuildHasherDefault<fxhash::FxHasher>>::try_resize<(i32, i32),webrender::picture::Tile,core::hash::BuildHasherDefault<fxhash::FxHasher>>(unsigned __int64, std::collections::hash::table::Fallibility) | static void webrender::picture::TileCache::pre_update(struct euclid::rect::TypedRect<f32, webrender_api::units::LayoutPixel>, struct webrender::frame_builder::FrameVisibilityContext *, struct webrender::frame_builder::FrameVisibilityState *, struct webrender::picture::SurfaceIndex)
3, static struct webrender::util::Allocation<webrender::picture::PicturePrimitive> webrender::util::{{impl}}::alloc<webrender::picture::PicturePrimitive>(struct alloc::vec::Vec<webrender::picture::PicturePrimitive> *) | static void webrender::display_list_flattener::DisplayListFlattener::pop_stacking_context() | static union core::option::Option<webrender_api::display_list::BuiltDisplayListIter> webrender::display_list_flattener::DisplayListFlattener::flatten_item(struct webrender_api::display_list::DisplayItemRef, struct webrender_api::api::PipelineId, bool)
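
For reference, the collation was essentially a group-by-and-count over pruned stack signatures. Here's a minimal sketch of that step (the input shape and the pruning rules are hypothetical, just to illustrate the idea; the real query ran in Databricks over crash ping data):

```python
from collections import Counter

# Hypothetical input: each crash report is a list of frame signatures,
# top of stack first, already filtered to Windows "OOM | small" with WR enabled.
PLUMBING_PREFIXES = ("mozalloc_handle_oom", "moz_abort", "core::panicking")

def prune(frames, keep=3):
    # Drop OOM/panic plumbing frames and keep only the first few interesting ones,
    # so that near-identical stacks collapse into one signature.
    interesting = [f for f in frames if not f.startswith(PLUMBING_PREFIXES)]
    return " | ".join(interesting[:keep])

def top_stacks(crash_frames, n=5):
    # Count the most frequent pruned signatures across all crash reports.
    return Counter(prune(frames) for frames in crash_frames).most_common(n)

if __name__ == "__main__":
    fake = [
        ["mozalloc_handle_oom", "moz_xmalloc", "mozilla::BufferList<InfallibleAllocPolicy>::AllocateSegment"],
        ["moz_xmalloc", "mozilla::BufferList<InfallibleAllocPolicy>::AllocateSegment"],
    ]
    for signature, count in top_stacks(fake):
        print(count, signature)
```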

The bulk of the OOMs seem to be in the content process rather than the GPU process, which is somewhat good news in that they won't take down the browser (at least not right away). But aggregating the crashes by build ID doesn't show any clear regression window where the count went up.

Miko pointed me to bug 1541092, which might be related in that it is also about an increase in Windows content-process OOMs, in that case after an arena size was bumped from 8k to 32k. So in general it seems that the larger the allocation size, the higher the OOM rate, which points to some sort of memory fragmentation problem, probably in the allocator. Since WR makes larger allocations than non-WR, we hit this more often. At least that's my best theory right now.
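
To illustrate why larger allocations would be more sensitive to fragmentation (this is just a toy model of a fragmented address space, not how mozjemalloc actually behaves): once the address space is peppered with small live allocations, plenty of memory can be free in total while no single contiguous run is big enough for a larger request.

```python
# Toy model: an address space carved into 1 MiB slots, where an allocation of
# N MiB needs N *contiguous* free slots. All numbers here are made up.

def largest_free_run(slots):
    best = run = 0
    for used in slots:
        run = 0 if used else run + 1
        best = max(best, run)
    return best

# 2048 MiB of address space with every 4th MiB pinned by a small live allocation.
slots = [i % 4 == 0 for i in range(2048)]

print("total free:", slots.count(False), "MiB")                   # 1536 MiB free overall
print("largest contiguous run:", largest_free_run(slots), "MiB")  # only 3 MiB

# An 8 KiB or 32 KiB arena chunk still fits almost anywhere; a multi-MiB buffer
# (the kind of larger allocation WR tends to make) finds no home, and we get an
# OOM report even though "free memory" looks plentiful.
```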

I looked at the dashboard again, and now the nvidia beta summary shows WR doing better than non-WR for OOM crashes. Nightly got worse though; it's now around 600%.

But... on AMD and Intel WR is better than non-WR. So I suspect we just don't have enough crashes/data to compare meaningfully, and the numbers are fluctuating a lot as a result. Downgrading to P2 and flagging this as something we should look at more closely for 68, but I'm less concerned now than I was a few days ago.
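
On the noise point: with only a handful of OOM crashes in some of these buckets, the WR/non-WR ratio is expected to swing wildly. A back-of-the-envelope sketch (with made-up counts, assuming Poisson crash counts and equal usage hours; not the real telemetry numbers):

```python
import math

def rate_ratio_interval(wr_crashes, base_crashes):
    # Ratio of two Poisson rates over equal exposure, with a crude ~95%
    # interval from the usual log-normal approximation.
    ratio = wr_crashes / base_crashes
    se = math.sqrt(1 / wr_crashes + 1 / base_crashes)
    return ratio, ratio * math.exp(-1.96 * se), ratio * math.exp(1.96 * se)

# e.g. 12 WR OOMs vs 3 non-WR OOMs in the same period:
ratio, lo, hi = rate_ratio_interval(12, 3)
print(f"{ratio:.1f}x (95% CI roughly {lo:.1f}x to {hi:.1f}x)")
# -> 4.0x ("400%"), but the interval spans roughly 1.1x to 14x, so big swings
#    between snapshots don't necessarily mean anything actually changed.
```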

Blocks: wr-68
Priority: P1 → P2
Assignee: nobody → a.beingessner

Unassigning myself. I agree with kats' conclusion that this data is too noisy to motivate a deeper investigation (e.g. nvidia nightly now has us with fewer OOMs than non-WR, but beta has more).

Assignee: a.beingessner → nobody
No longer blocks: wr-68
Severity: normal → S3