Closed Bug 1221540 Opened 9 years ago Closed 9 years ago

Compiling Firefox while running e10s nightly makes nightly stop painting content

Categories

(Core :: Graphics, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED
Tracking Status
e10s: m8+
firefox45: affected

People

(Reporter: bzbarsky, Assigned: milan)

References

(Blocks 1 open bug)

Details

(Whiteboard: [gfx-noted])

Attachments

(1 file, 5 obsolete files)

I tried turning on e10s again in nightly this morning, and ran into this problem.  The symptoms are just like bug 1198765 comment 12, except instead of crashing we simply stop painting stuff from the content process.  I can open new windows fine, can navigate to web sites, can see a "hand" cursor over links, can follow the links and see the url bar update.  Just no painting.  Once I get into this state, the only way to get out of it seems to be to restart the browser.

This has happened to me twice so far this morning, over the course of two Firefox compiles.

Tabs running in the parent process paint fine (e.g. I was about to load preferences to turn off e10s again).

One of those times, Firefox crashed under PLayerTransactionParent::DeallocShmem while I was shutting down.  See https://crash-stats.mozilla.com/report/index/fbc91f0f-f9d2-48e5-b2e7-27ee52151104
Boris, what platform, what's your about:support graphics look like?  Also, when this happens, any error messages in the graphics section of about:support? (that should also be running in the parent process, so you should be able to see it)
Flags: needinfo?(bzbarsky)
Whiteboard: [gfx-noted]
> Boris, what platform

Mac 10.10.5

> what's your about:support graphics look like?

Graphics
--------

Asynchronous Pan/Zoom: wheel input enabled
Device ID: 0x0d26
GPU Accelerated Windows: 31/31 OpenGL (OMTC)
Supports Hardware H264 Decoding: No;
Vendor ID: 0x8086
WebGL Renderer: ATI Technologies Inc. -- AMD Radeon R9 M370X OpenGL Engine
windowLayerManagerRemote: true
AzureCanvasBackend: skia
AzureContentBackend: quartz
AzureFallbackCanvasBackend: none
AzureSkiaAccelerated: 1

> any error messages in the graphics section of about:support

Not that I can see.  The entire graphics section is pasted above.
Flags: needinfo?(bzbarsky)
I understand this was happening on your old machine as well - did it start with 10.10 by any chance?  A number of us are doing the same workflow as you, with 10.9 (or even 10.11), and not seeing the problem.  It doesn't make sense that it is a one-off, because it survived multiple machines.

From the comments, I'm assuming resizing the window doesn't help, and the profile is otherwise default.
Note that on the old machine I was getting crashes.  But that was also with older builds...

> did it start with 10.10 by any chance? 

I don't know.  I updated to 10.10 long before I tried turning e10s back on in nightlies, because for a while the browser wouldn't even start for me with e10s enabled.

> I'm assuming resizing the window doesn't help

Hmm. I had not actually tried.  But opening a new window and trying to load a web page in it didn't paint anything either, so I doubt resizing would have helped.  I can try if you really want me to.

> and the profile is otherwise default.

This is my default browsing profile.  It's not default in all sorts of ways (e.g. has history, bookmarks, a few extensions, etc), though none of them should obviously affect gfx.

A more relevant question is this: do the people not seeing this have 31 windows open with about 130 tabs between them?
(In reply to Boris Zbarsky [:bz] from comment #4)
> ...
> 
> A more relevant question is this: do the people not seeing this have 31
> windows open with about 130 tabs between them?

I forgot about your tab workflow.  You're definitely the right person to (eventually) stress test the process-per-tab setting.

Anyway, we'll start by dealing with the crashes that were first an issue for you, and see if this gets any better.  After that, we may need to send you an instrumented build to see what's up.

Debug build is a pain to run, but I wonder if it would give us some extra information...
> Debug build is a pain to run, but I wonder if it would give us some extra information...

I guess I could give this a shot when I'm not trying to do anything with my browser...
(In reply to Boris Zbarsky [:bz] from comment #4)
> A more relevant question is this: do the people not seeing this have 31
> windows open with about 130 tabs between them?

I suspect this is basically some kind of OOM. Can you try turning on logging in SharedMemoryBasic_mach.mm? That may reveal the source of the problem.
Flags: needinfo?(bzbarsky)
> Can you try turning on logging in SharedMemoryBasic_mach.mm?

Just to make sure, this means either just running a debug build or changing the LOG_ERROR stuff to work in an opt build, right?  This isn't NSPR logging that I can just enable via env var?
Flags: needinfo?(bzbarsky)
(In reply to Boris Zbarsky [:bz] from comment #8)
> > Can you try turning on logging in SharedMemoryBasic_mach.mm?
> 
> Just to make sure, this means either just running a debug build or changing
> the LOG_ERROR stuff to work in an opt build, right?  This isn't NSPR logging
> that I can just enable via env var?

A debug build is sufficient. I'm not sure what would be needed to get it in an opt build, but it looks possible.
Flags: needinfo?(gwright)
So in a debug build, I don't get the no-painting behavior.  I get a content process crash instead, due to a fatal assertion from the gfx logger.  From the log:

Crash Annotation GraphicsCriticalError: |[0][GFX1]: [Tiling:Client] Failed to allocate a TextureClient[GFX1]: [Tiling:Client] Failed to allocate a TextureClient
Assertion failure: false (An assert from the graphics logger), at ../../../mozilla/gfx/2d/Logging.h:504
#01: mozilla::gfx::Log<1, mozilla::gfx::CriticalLogger>::WriteLog(std::string const&)[/Users/bzbarsky/mozilla/vanilla/obj-firefox/dist/NightlyDebug.app/Contents/MacOS/XUL +0x165efca]
#02: mozilla::gfx::Log<1, mozilla::gfx::CriticalLogger>::Flush()[/Users/bzbarsky/mozilla/vanilla/obj-firefox/dist/NightlyDebug.app/Contents/MacOS/XUL +0x165eedc]
#03: mozilla::gfx::Log<1, mozilla::gfx::CriticalLogger>::~Log()[/Users/bzbarsky/mozilla/vanilla/obj-firefox/dist/NightlyDebug.app/Contents/MacOS/XUL +0x165ee69]
#04: mozilla::gfx::Log<1, mozilla::gfx::CriticalLogger>::~Log()[/Users/bzbarsky/mozilla/vanilla/obj-firefox/dist/NightlyDebug.app/Contents/MacOS/XUL +0x1640685]
#05: mozilla::layers::TileClient::GetBackBuffer(nsIntRegion const&, gfxContentType, mozilla::layers::SurfaceMode, nsIntRegion&, RefPtr<mozilla::layers::TextureClient>*)[/Users/bzbarsky/mozilla/vanilla/obj-firefox/dist/NightlyDebug.app/Contents/MacOS/XUL +0x181da7d]
#06: mozilla::layers::ClientSingleTiledLayerBuffer::PaintThebes(nsIntRegion const&, nsIntRegion const&, nsIntRegion const&, void (*)(mozilla::layers::PaintedLayer*, gfxContext*, nsIntRegion const&, nsIntRegion const&, mozilla::layers::DrawRegionClip, nsIntRegi[/Users/bzbarsky/mozilla/vanilla/obj-firefox/dist/NightlyDebug.app/Contents/MacOS/XUL +0x181ce28]

and then the stack goes up through painting code until it gets to a refresh driver tick.  I assume in an opt build we would press on past this point and that may be my "no painting" state....  In any case, I did "Restore all crashed tabs", kept compiling, and it happened again, with the same exact message/stack.

Both times this bit was preceded by this warning:

[Child 81952] WARNING: didn't get an id (ipc/rcv) timed out 10004003
: file ../../../mozilla/ipc/glue/SharedMemoryBasic_mach.mm, line 619

(same exact warning both times, though the child PID was obviously different).

[Child 11670] WARNING: didn't get an id (ipc/rcv) timed out 10004003
: file ../../../mozilla/ipc/glue/SharedMemoryBasic_mach.mm, line 619

I have the full log if that would be useful, but I'd rather not attach it to the bug because it's obviously got all the URIs from my session in it.  Happy to mail it if people want to see it, of course.  It's about 1MB; e10s debug builds sure do warn a lot...
So looking at line 619 in SharedMemoryBasic_mach.mm, we have:

617   err = ports->mReceiver->WaitForMessage(&msg, kTimeout);
618   if (err != KERN_SUCCESS) {
619     LOG_ERROR("didn't get an id %s %x\n", mach_error_string(err), err);

kTimeout is 1000, so it sounds like we will wait 1s and then give up here.  That makes it likely that what's breaking things is just the machine being under load (a build with high -j, so all cores are busy) and us actually missing the timeout...
Can you increase the timeout to something like 10s and see if we do ever actually get the message?
Flags: needinfo?(gwright)
I tried raising it to 50s and have now completed 2 compiles without running into any problems: My tabs are still painting, no content process crashes, and no mentions of "SharedMemoryBasic_mach.mm" in the log.
Attached patch increasetimeout.patch (obsolete) (deleted) — Splinter Review
As discussed with Jeff in the office, we should probably increase this to something much longer like 60s and MOZ_CRASH if we don't get it back.
Attachment #8684299 - Flags: review?(jmuizelaar)
What would be ideal from my point of view as a user is if we recovered from this state (as in, if it fails, skip a paint or whatever but then later try again instead of just stopping painting completely or crashing).
I'll be the first to admit that I don't understand this stuff at all, but in general I agree with Boris. Just increasing this timeout doesn't seem like it'll solve the underlying problem here. Tell me I'm wrong if so.
(In reply to Johnny Stenback  (:jst, jst@mozilla.com) from comment #16)
> I'll be the first to admit that I don't understand this stuff at all, but in
> general I agree with Boris. Just increasing this timeout doesn't seem like
> it'll solve the underlying problem here. Tell me I'm wrong if so.

Increasing the timeout does help if the underlying problem is that it just takes a long time for us to get a response. Crashing should give us an idea of how serious this problem is in practice and whether we need to investigate what's causing it to take so long.
No matter what this timeout is, people will run into it, and if the end result is an irrecoverable state then why do we even have this timeout? Do gather data on this, but crashing or no more painting seems worse than waiting indefinitely.
Note that afaict the crashing would be a content process crash, so the user can restore tabs after that.  This _may_ be better than waiting indefinitely.
That depends, it seems. If the reason we hit this timeout is that the system is simply so busy, as in bz's case, that it just doesn't get to respond, it seems unlikely that a user will be able to interact with the browser in that time period anyways. In that case the alternative is to kill a child process and eventually get to a usable browser with a dead child process, or get to a usable browser with that child process still alive and well.

If we're protecting against other cases here, then the situation is quite possibly different. Again, I have no idea what this code is actually doing, so I'm just sharing my high level thoughts. Not saying we shouldn't increase this timeout to get to a place where Firefox survives compiling itself on a modern laptop, just seems there's more to this than that.
It would be useful to get more information here.  We don't know how many people are hitting the 1s timeout, nor how many would be hitting the 60s (or whatever else we come up with.)  Before we go into potentially expensive scenario of really fixing this, I'd like to get that data.  We can do it with forced crashes, or telemetry.  Perhaps to start, we put in the telemetry if the 1s times out, then also do a 60s timeout on top of that and watch what happens.  So:

err = ports->mReceiver->WaitForMessage(&msg, "1 second");
if (err != KERN_SUCCESS) {
  err = ports->mReceiver->WaitForMessage(&msg, "1 minute");
  if (err != KERN_SUCCESS) {
    telemetry counter for "60 second IPC timeout missed"
  } else {
    telemetry counter for "1 second IPC timeout missed, but 60 second wait succeeded"
  }
}
...

So, yes, it will still stop painting as it does today, if both time out, but at least we can collect some data and see how big a problem we have, and some people will get their problems mitigated in the meantime...
All good with me to gather more data and learn more about this, and also get people unblocked short term.
Comment on attachment 8684299 [details] [diff] [review]
increasetimeout.patch

Review of attachment 8684299 [details] [diff] [review]:
-----------------------------------------------------------------

This seems like it should be an improvement over what we have. It might be worth adding more MOZ_CRASHES to the other places we can fail, so that we can see if they are happening instead of just failing to paint.
Attachment #8684299 - Flags: review?(jmuizelaar) → review+
Note that I can easily reproduce this on my local build of Nightly without building Firefox just by visiting http://codepen.io/. I get the same "[GFX1-]: Failed to allocate a surface due to invalid size Size(0,0)" in my console.
s/without building Firefox/without a build going in the background
So are we landing this patch, or adding telemetry markers? Or both (eventually)?
I was thinking of something like this, but there is the part where we use "graphics" code inside of IPC. Something that leaves a trace if we fail the 1 second timeout, and is very noisy if we fail the 1 minute timeout.
Now, we'd want to use gfxDevCrash (once bug 1222033 lands) and introduce a proper enum for this problem, rather than using the D3D11 one, but this is the type of thing I was going for.
Comment on attachment 8685021 [details] [diff] [review]
More information about timeouts - not an appropriate patch for landing, concept

Is this patch going too far?

In the meantime, I'd rather not add another MOZ_CRASH, but if we were to take George's patch and make it ifdef NIGHTLY while we're figuring out what we want to do in the long run, I can get behind that.
Attachment #8685021 - Flags: feedback?(jmuizelaar)
Attachment #8685021 - Flags: feedback?(gwright)
Comment on attachment 8685021 [details] [diff] [review]
More information about timeouts - not an appropriate patch for landing, concept

Review of attachment 8685021 [details] [diff] [review]:
-----------------------------------------------------------------

This patch is better.
Attachment #8685021 - Flags: feedback?(jmuizelaar) → feedback+
Comment on attachment 8685021 [details] [diff] [review]
More information about timeouts - not an appropriate patch for landing, concept

That looks good to me as well.
Attachment #8685021 - Flags: feedback?(gwright) → feedback+
I'll make a clean patch.
Assignee: nobody → milan
Attached patch Try with a longer timeout if IPC fails. r=gw280 (obsolete) (deleted) — Splinter Review
Are we OK using the gfx* methods?  Yes, it builds and runs, but is there a high level objection to doing this?

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0811af26d081
Attachment #8684299 - Attachment is obsolete: true
Attachment #8685021 - Attachment is obsolete: true
Attachment #8685493 - Flags: review?(gwright)
Comment on attachment 8685493 [details] [diff] [review]
Try with a longer timeout if IPC fails.  r=gw280

I can't really review this as I'm not a peer of the right stuff. It looks fine to me though.
Attachment #8685493 - Flags: review?(gwright) → feedback+
Comment on attachment 8685493 [details] [diff] [review]
Try with a longer timeout if IPC fails.  r=gw280

Bill, how bad is it to be using reporting methods from the gfx/ directory in ipc?
Attachment #8685493 - Flags: review?(wmccloskey)
Comment on attachment 8685493 [details] [diff] [review]
Try with a longer timeout if IPC fails.  r=gw280

Review of attachment 8685493 [details] [diff] [review]:
-----------------------------------------------------------------

I'd be in favor of just making this be an infinite timeout. I don't really see how this telemetry will be actionable. If a lot of people are timing out after 1s, would we realistically do anything besides increasing the timeout? It seems very unlikely to me that we'll get to a state where we can tolerate shmem allocation failing.

Also, I don't think that using these graphics routines is the right way to do it. If you want to use them that way, they should be factored out so other people can use them.
Attachment #8685493 - Flags: review?(wmccloskey)
I agree about the graphics routines location.  As far as telemetry - it would help us figure out how many bz's are out there, and if this is something we should quickly patch so that those few can keep running, or if it's something that shows up a lot and we need a full recovery solution.
I think if we wanted to get a sense of impact, crashing is better than telemetry. People pay more attention to crash stats.
(In reply to Milan Sreckovic [:milan] from comment #36)
> I agree about the graphics routines location.  As far as telemetry - it
> would help us figure out how many bz's are out there, and if this is
> something we should quickly patch so that those few can keep running, or if
> it's something that shows up a lot and we need a full recovery solution.

All right. How about we make the timer be something pretty long (say 10 seconds) and then crash after that? I guess it could be useful to know that a lot of people have to wait that long. Although I'm still not sure there's much we would do about it besides increasing the timer.
The difference between crashing and telemetry is that people who crash stop using the crashing builds or (if they're more knowledgeable) disable e10s.  So crashing will undercount compared to telemetry.
I guess we could stick a telemetry counter here, although I'm skeptical anyone will look at it.
I understand the skepticism.  The graphics team (I :) look at the GFX_CRASH telemetry exactly because we're replacing MOZ_CRASH with it.  In particular, the "gfxDevCrash" I was using above is MOZ_CRASH in Nightly and Aurora, but switches to telemetry in Beta and Release.

Still, let's do this right - I'll see if I can extract some of this out of graphics code.
Attachment #8685493 - Attachment is obsolete: true
Blocks: 1182377
Milan, it's been a month.  People keep asking me to test various e10s stuff, but it's just not workable while we have this bug.  Any chance of getting _something_ landed here just so e10s nightlies are somewhat usable?
Flags: needinfo?(milan)
Looking at it today.
This is in spirit the same as the last obsolete patch, but with some of the non-graphics specific code pulled out from gfx/ to a more common place.
Flags: needinfo?(milan)
Comment on attachment 8699510 [details] [diff] [review]
WIP - Try with a longer timeout if IPC fails - don't use the graphics utilities (as much)

Just to be clear, this doesn't actually build yet, just wanted to put it out there to see if the direction is acceptable.
Attachment #8699510 - Attachment description: Try with a longer timeout if IPC fails - don't use the graphics utilities (as much) → WIP - Try with a longer timeout if IPC fails - don't use the graphics utilities (as much)
If the short (1 second) timeout fails, we assert in debug; otherwise, we try a longer timeout (1 minute).  If the longer timeout fails, we crash in nightly and aurora.  In beta and release, we will send telemetry instead of crashing.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=ee1909dd5994
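
A rough sketch of that behavior, following the WaitForMessage call shape from the earlier excerpt of SharedMemoryBasic_mach.mm; kShortTimeout, kLongTimeout and ReportLongWaitFailure() are illustrative names only, not the identifiers in the actual patch:

const int kShortTimeout = 1000;       // 1 second, in ms
const int kLongTimeout  = 60 * 1000;  // 1 minute, in ms

err = ports->mReceiver->WaitForMessage(&msg, kShortTimeout);
if (err != KERN_SUCCESS) {
  // Debug builds: assert so the missed short timeout shows up immediately.
  MOZ_ASSERT(false, "timed out (1s) waiting for the shmem id");
  err = ports->mReceiver->WaitForMessage(&msg, kLongTimeout);
  if (err != KERN_SUCCESS) {
    // Nightly/Aurora: crash so this shows up in crash-stats;
    // Beta/Release: send telemetry instead (gfxDevCrash-style behavior).
    ReportLongWaitFailure(err);
  }
}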
Attachment #8699510 - Attachment is obsolete: true
Attachment #8699582 - Flags: review?(wmccloskey)
Comment on attachment 8699582 [details] [diff] [review]
Try with a longer timeout if IPC fails.  r=gw280, billm

Review of attachment 8699582 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry, I'm still having trouble with this patch. The notion that people will pay more attention to the GFX_CRASH seems like a temporary thing, especially as we add more stuff to the already pretty big LogReason enum. Finally, I don't like the fact that there's a generic sounding logging function that always uses the GFX_CRASH histogram. But if we use a different histogram, then people will probably be even more likely to ignore it.

One way around all of this disagreement is to have the graphics code handle the error. That way you don't need me to review it and you can do whatever you want :-).

It looks like we can make a pinch point at ISurfaceAllocator::AllocUnsafeShmem. Perhaps we could do something like this:

class ISurfaceAllocator {
  AllocUnsafeShmem(...args) {
    auto result = AllocUnsafeShmemInternal(...args);
    if (!result) { ... error reporting ... }
    return result;
  }
  AllocUnsafeShmemInternal(...args) = 0;
};

You would have to rename the existing implementations of AllocUnsafeShmem in the subclasses of ISurfaceAllocator. It looks like there are maybe six of those, so not too many.

Milan, does this sound feasible/reasonable to you?
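
A slightly fuller version of that sketch, to make the shape concrete - it assumes AllocUnsafeShmem keeps its current (size, shmem type, out-Shmem) parameter list, and ReportShmemAllocFailure() is a placeholder for whatever reporting the gfx side chooses:

class ISurfaceAllocator {
public:
  // Non-virtual wrapper: all callers go through here, so the error
  // reporting lives in one place on the gfx side rather than in IPC code.
  bool AllocUnsafeShmem(size_t aSize,
                        mozilla::ipc::SharedMemory::SharedMemoryType aShmType,
                        mozilla::ipc::Shmem* aShmem)
  {
    bool ok = AllocUnsafeShmemInternal(aSize, aShmType, aShmem);
    if (!ok) {
      ReportShmemAllocFailure(aSize);  // placeholder: crash/telemetry/log
    }
    return ok;
  }

protected:
  // The existing subclass implementations would be renamed to this.
  virtual bool AllocUnsafeShmemInternal(size_t aSize,
                                        mozilla::ipc::SharedMemory::SharedMemoryType aShmType,
                                        mozilla::ipc::Shmem* aShmem) = 0;
};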
If the proposed solution takes care of Boris' original problem, I have no objections.
(In reply to Bill McCloskey (:billm) from comment #47)
> Comment on attachment 8699582 [details] [diff] [review]
> One way around all of this disagreement is to have the graphics code handle
> the error. That way you don't need me to review it and you can do whatever
> you want :-).
> 
> It looks like we can make a pinch point at
> ISurfaceAllocator::AllocUnsafeShmem. Perhaps we could do something like this:
> 
> class ISurfaceAllocator {
>   AllocUnsafeShmem(...args) {
>     auto result = AllocUnsafeShmemInternal(...args);
>     if (!result) { ... error reporting ... }
>     return result;
>   }
>   AllocUnsafeShmemInternal(...args) = 0;
> };

Maybe I misunderstood on the first read, because I keep understanding it differently on each subsequent read.

As far as the IPC code goes, you're fine with us doing the short, followed by the long timeout, you just don't want the reporting to happen in the IPC code, and what you're suggesting in comment 47 is to deal with the error reporting part?

Or are you suggesting we have the timeout be a parameter to the IPC code, so that I can call with a particular timeout, and if it fails, call again with a longer one, or whatever is appropriate, and all the error reporting is handled by the caller?
Flags: needinfo?(milan) → needinfo?(wmccloskey)
(In reply to Milan Sreckovic [:milan] from comment #49)
> As far as the IPC code goes, you're fine with us doing the short, followed
> by the long timeout, you just don't want the reporting to happen in the IPC
> code, and what you're suggesting in comment 47 is to deal with the error
> reporting part?

Yes. Let's keep the changes to SharedMemoryBasic_mach.mm, except remove the CrashOrSendTelemetry call. Then AllocUnsafeShmem can call CrashOrSendTelemetry if it gets an error. I think that should work.
Flags: needinfo?(wmccloskey)
Hey milan, is this potentially fixed and can we test or are we still waiting on more work here?
Flags: needinfo?(milan)
This is as fixed as it's going to get.  The reason I left the bug open - I'd like us to make some "noise" when the longer timeout fails, and right now we're not doing that outside of the debug builds.
Flags: needinfo?(milan)
(In reply to Milan Sreckovic [:milan] from comment #56)
> This is as fixed as it's going to get.  The reason I left the bug open - I'd
> like us to make some "noise" when the longer timeout fails, and right now
> we're not doing that outside of the debug builds.

Can you kick that work out to another bug and close this one for tracking purposes?
Flags: needinfo?(milan)
I believe this is resolved as far as blocking the workflow goes.  I opened bug 1239781 to track the debugging logging.
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(milan)
Keywords: leave-open
Resolution: --- → FIXED
(In reply to Milan Sreckovic [:milan] from comment #21)
> It would be useful to get more information here.  We don't know how many
> people are hitting the 1s timeout, nor how many would be hitting the 60s (or
> whatever else we come up with.)  Before we go into potentially expensive
> scenario of really fixing this, I'd like to get that data.  We can do it
> with forced crashes, or telemetry.  Perhaps to start, we put in the
> telemetry if the 1s times out, then also do a 60s timeout on top of that and
> watch what happens.  So:

> So, yes, it will still stop painting as it does today, if both time out, but
> at least we can collect some data and see how big a problem we have, and
> some people will get their problems mitigated in the meantime...

So we didn't add telemetry in the field, which would have been useful.  However, we started hitting the 1s timeouts in automation back in June/July, and they've spiked recently.  See bug 1264073 (23 failures in the last week) and bug 1282598 (10 in the last week).

a) we should add telemetry - perhaps not on every allocation, though we could, but maybe on every one that goes past a (say) 100ms or 200ms timeout
b) we should consider changing the 1s timeout to NS_ASSERTION (don't stop waiting and blow up the test, see if it ever recovers - would still orange, but get more real data)
c) we should crash either at the 60s timeout with a MOZ_CRASH in the field (under assumption 1 minute waiting will never return and is just annoying the user), or timeout faster and do <something> to recover to a usable state. Also would tell us if the automation-induced timeouts ever recover.
d) if it fails the 60s timeout, is it in the "no paint" state, or is it in a workable state?  I'm presuming the former
Blocks: 1264073, 1282598
Flags: needinfo?(wmccloskey)
Flags: needinfo?(milan)
(In reply to Randell Jesup [:jesup] from comment #59)
> c) we should crash either at the 60s timeout with a MOZ_CRASH in the field
> (under assumption 1 minute waiting will never return and is just annoying
> the user), or timeout faster and do <something> to recover to a usable
> state. Also would tell us if the automation-induced timeouts ever recover.

I like this, but let's not crash debug/nightly builds, otherwise debugging the parent process in gdb will often cause the child process to crash which is annoying. Or add a pref or environment variable or something for people using gdb on optimized builds.
The telemetry wasn't added due to a code review (see comment 50.)
From Boris' reports, this can happen because the machine is busy.  Do we really want Firefox to crash when the user loads their machine with some other work?

Recovering from a "there was no shared memory now, so hold tight, maybe there'll be more later" seems like a non-trivial amount of work.
Flags: needinfo?(milan)
(In reply to Milan Sreckovic [:milan] from comment #61)
> The telemetry wasn't added due to a code review (see comment 50.)

Reading that and previous comments, I think we could generate some telemetry on how long these operations run, or how often they exceed an initial timeout (or start with 100ms, then 200, 400, etc. up to a total sum-of-timeouts of ~60*1000 ms).  If that's all in this block of code, and it can be, billm isn't needed, and no APIs need changing.
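
If we went that route, here's a sketch of the escalating-timeout idea, reusing the WaitForMessage call shape from earlier in this bug; the telemetry hook at the end is hypothetical:

int timeoutMs = 100;
int waitedMs = 0;
kern_return_t err;
do {
  err = ports->mReceiver->WaitForMessage(&msg, timeoutMs);
  if (err == KERN_SUCCESS) {
    break;
  }
  waitedMs += timeoutMs;
  timeoutMs *= 2;  // 100, 200, 400, ... ms
} while (waitedMs < 60 * 1000);  // total budget of roughly one minute
// Accumulate waitedMs into a telemetry probe here, so we see the
// distribution of stall times in the field without changing any APIs.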

> From Boris' reports, this can happen because the machine is busy.  Do we
> really want Firefox to crash when the user loads their machine with some
> other work?

Well, the 1 sec timeout can happen.  I don't believe that a 1min timeout can happen - and if it did, I'm thinking that crashing might really be preferable to waiting more than a minute for a *paint*....  Something is horribly, seriously wrong if that happens.  The problem is that now we're crashing (in Debug builds) after a 1s wait, which can happen, especially on Try runs, or bz when doing a compile.  (Still pretty icky that that can happen, but ok.)

> Recovering from a "there was no shared memory now, so hold tight, maybe
> they'll be more later" seems like a non-trivial amount of work.

It wasn't clear to me.  And if that's not a reasonable option, ok.  Didn't know if there was reset code such that we could put the rendering code back into "initial" state and let it try to reinit or not.
I just want to be clear that I'm okay adding telemetry here. It's the GFX_CRASH annotation that I primarily object to. Also, my review is still needed. This is IPC code.

What mainly frustrates me about this bug is that no one seems to have any idea what to do with the data we collect. Even if 100% of Mac users encounter this problem, what will we do besides waiting longer or crashing? If we can't get a shmem, then we can't paint. Even crashing doesn't seem guaranteed to fix the situation. We could try to figure out why it's taking so long, but that seems really hard, and I doubt it's worth the time given how few Mac users we have.

Also, if this is really a problem, it should show up in BHR data. I've never seen it there, but since it's Mac-only it would probably rank quite low.
Flags: needinfo?(wmccloskey)
(In reply to Bill McCloskey (:billm) from comment #63)
> I just want to be clear that I'm okay adding telemetry here. It's the
> GFX_CRASH annotation that I primarily object to. Also, my review is still
> needed. This is IPC code.
> 
> What mainly frustrates me about this bug is that no one seems to have any
> idea what to do with the data we collect. Even if 100% of Mac users
> encounter this problem, what will we do besides waiting longer or crashing?

We'll know it's a critical problem to address ASAP.  If 0.01% ever see it, then we know we can ignore it unless/until the telemetry spikes.  (We've gotten a bit of a spike in automation recently; telemetry would help us know if there are real-world implications of this, especially if it provides some level of info on how long it takes - directly via ::Now() or indirectly via increasing timeouts.)

> If we can't get a shmem, then we can't paint. Even crashing doesn't seem
> guaranteed to fix the situation. We could try to figure out why it's taking
> so long, but that seems really hard, and I doubt it's worth the time given
> how few Mac users we have.

If the % is low, likely we'll defer dealing with it.  Failure to get a shmem doesn't mean we'll always fail, I suppose.  The fact that bz hit this in real-world usage when doing compiles (the 1s hit I presume) means it's not just a test artifact.  If we find that all recover within 2-3-5-whatever seconds, then we assess the impact of that.  Right now we're just assuming from lack of angry reports that bz's case is rare... but we're hitting it in Try.
OK, but what if 100% of Mac users are seeing 1 second or more pauses here? What would we do then? I'm just saying that we don't have a lot of options here, and gathering more data doesn't change that.