Closed
Bug 1492580
Opened 6 years ago
Closed 6 years ago
Infinite loop in GLContext::RawGetErrorAndClear from endless GL_CONTEXT_LOST with Nvidia Linux drivers and suspend/resume
Categories
(Core :: Graphics: CanvasWebGL, defect, P1)
Tracking
()
VERIFIED
FIXED
mozilla65
Tracking | Status | |
---|---|---|
firefox-esr60 | --- | unaffected |
firefox62 | --- | unaffected |
firefox63 | --- | wontfix |
firefox64 | + | verified |
firefox65 | + | verified |
People
(Reporter: jld, Assigned: jgilbert)
References
Details
(Keywords: hang, regression, Whiteboard: [gfx-noted])
Attachments
(3 files, 1 obsolete file)
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
patch
|
jcristau
:
approval-mozilla-release+
|
Details | Diff | Splinter Review |
STR #1:
1. Load a WebGL page like Google Maps
2. Suspend/resume ("echo mem > /sys/power/state" as root, then hit a key)
STR #2:
1. Load a WebGL page
2. Navigate to a non-WebGL page (e.g., rfc-editor.org)
3. Suspend/resume
4. Navigate to a WebGL page
This causes the content process to lock up in the loop in GLContext::RawGetErrorAndClear, and I can reproduce it on official builds with a clean profile, using driver versions 390.77 and 390.87.
Debugging, I can see that GetError is endlessly returning GL_CONTEXT_LOST (0x0507); there's also a function inside the closed-source library returning 0x92bb which happens to be GL_PURGED_CONTEXT_RESET_NV. I immediately suspected bug 1484782, and verified that that's when this regressed. (Also, if it matters, the backtrace goes into WebGLRenderingContext_Binding::getError and then JS.)
I don't know OpenGL very well, but from reading documentation on glGetError, I don't think it's supposed to behave like that.
Reporter | ||
Updated•6 years ago
|
Component: Graphics → Canvas: WebGL
Updated•6 years ago
|
Comment 1•6 years ago
|
||
See also:
https://phabricator.kde.org/R108:aefb5f4dd9d4
https://gitlab.gnome.org/GNOME/mutter/commit/d4d2bf0f
I tried checking for GL_CONTEXT_LOST in RawGetErrorAndClear, but I hit a lot of assertions afterwards and I'm not familiar enough with the code to know the proper way of fixing those.
Reporter | ||
Updated•6 years ago
|
status-firefox62:
--- → unaffected
status-firefox63:
--- → affected
status-firefox64:
--- → affected
Reporter | ||
Comment 2•6 years ago
|
||
If there isn't a simple fix for this, should we back out bug 1484782 from m-c and beta? That one affected only WebRender, which isn't enabled by default on any Linux systems yet, but this seems to affect any use of WebGL.
Flags: needinfo?(jgilbert)
Comment 4•6 years ago
|
||
(In reply to Jeff Gilbert [:jgilbert] from comment #3)
> I'll take a look tomorrow.
Jeff, can we get a status update on this one? Thanks
Updated•6 years ago
|
Assignee | ||
Comment 5•6 years ago
|
||
Here's the extension:
https://www.khronos.org/registry/OpenGL/extensions/NV/NV_robustness_video_memory_purge.txt
Flags: needinfo?(jgilbert)
Priority: P3 → P1
Assignee | ||
Comment 6•6 years ago
|
||
I have a theory. Will test tomorrow.
Updated•6 years ago
|
Assignee | ||
Comment 7•6 years ago
|
||
So the docs are weird here. From some parts of the docs, it looks like it is allowed to keep returning CONTEXT_LOST from glGetError until you check GetGraphicsResetStatus(), and that's what it sounds like it's doing.
I'm setting up a test machine now. I usually just use Mesa's drivers, even for NV these days. (I would encourage you to try them!)
Sorry to barge in. I have probably same issue in https://bugzilla.mozilla.org/show_bug.cgi?id=1489354 - hangs on linux on switching users, not surviving suspend\resume, etc.
From 64.0a1 2018-10-17 build its got much severe - now hangs occurs randomly during normal browsing session, without any special action mentioned above. Is something changed? My best guess is https://bugzilla.mozilla.org/show_bug.cgi?id=1498070 but only because its only WebGL related fix.
Assignee | ||
Comment 9•6 years ago
|
||
I can't seem to get Linux to resume from suspend on NV binary drivers.
Reporter | ||
Comment 10•6 years ago
|
||
(In reply to Jeff Gilbert [:jgilbert] from comment #9)
> I can't seem to get Linux to resume from suspend on NV binary drivers.
If you have patches I can test them, but I understand that that's not a great way to develop.
Incidentally, I tried amdgpu a few months ago and it reliably crashed the X server on resume; I'd be surprised if nouveau did any better, but I haven't actually tried it.
Comment 11•6 years ago
|
||
This problem is started to get noticed by users in wild
https://askubuntu.com/questions/1088324/firefox-becoming-unresponsive-after-suspending-on-kubuntu-18-10/1088716#1088716
https://askubuntu.com/questions/1089295/segfaults-related-to-cameras-ipc/1090307#1090307
Comment 12•6 years ago
|
||
Jeff, given that
1) This is breaking users on release.
2) WebRender won't be turned on for Linux for the foreseeable future.
Would you reconsider comment 2, i.e. backing out that fix so WebRender (which isn't used) breaks but WebGL (which *is* used) works again?
Flags: needinfo?(jgilbert)
Comment 13•6 years ago
|
||
(In reply to Jeff Gilbert [:jgilbert] from comment #9)
> I can't seem to get Linux to resume from suspend on NV binary drivers.
As Ubuntu user with Nvidia binary drivers i can say, from my experience, that not being able to resume from suspend is abnormal situation (i dont have such problem). Yet, its not uncommon, especially on laptops.
Reporter | ||
Comment 14•6 years ago
|
||
I tried poking around in GLContext, a while ago, to see what I could do. This patch seems to avoid crashing or locking up the process, and instead winds up throwing the error into content. Google Maps puts up error UI saying to reload the page; a simpler WebGL demo I tried just stopped animating.
Note that, without any of these patches, Google Maps will continue trying to work but its textures will have been replaced with garbage, so this may not be much of a regression. However, I have no idea what side effects this patch might cause.
It seems to me that a better solution would be to turn on the Nvidia magic just for WebRender's GL context and not contexts used for WebGL, which should be possible given that they're in different processes at the moment, but I don't know what the situation will be after WebGL remoting lands.
Assignee | ||
Comment 15•6 years ago
|
||
I have a potential fix, but I can't repro, so I'll be needinfo-ing people with builds to try tomorrow.
Flags: needinfo?(jgilbert)
Comment 16•6 years ago
|
||
[Tracking Requested - why for this release]:
Linux webgl regression shipped in current release. Fix should be considered for a uplift.
tracking-firefox64:
--- → ?
Updated•6 years ago
|
status-firefox65:
--- → affected
Assignee | ||
Comment 17•6 years ago
|
||
Ok, I split the fix into 'hack in a loop limit', and 'improve how we handle context loss'.
Assignee | ||
Comment 18•6 years ago
|
||
Assignee | ||
Updated•6 years ago
|
Attachment #9023112 -
Attachment is obsolete: true
Assignee | ||
Comment 19•6 years ago
|
||
Landing is queued. Once it's in Nightly, I'll ask repro people to try to verify.
Comment 20•6 years ago
|
||
Pushed by jgilbert@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f314c7e261c6
Limit glGetError flush loop and handle CONTEXT_LOST. r=lsalzman
Reporter | ||
Comment 21•6 years ago
|
||
I tried D11270 with a debug build:
[gl:0x7faff66e7000] void mozilla::gl::GLContext::raw_fBindFramebuffer(GLenum, GLuint): Generated unexpected error. (0x0507)
Hit MOZ_CRASH(Unexpected error with MOZ_GL_DEBUG_ABORT_ON_ERROR. (Run with MOZ_GL_DEBUG_ABORT_ON_ERROR=0 to disable)) at /home/jld/src/gecko-dev/gfx/gl/GLContext.cpp:3043
#01: mozilla::gl::GLScreenBuffer::BindFB(unsigned int) (/home/jld/src/gecko-dev/gfx/gl/GLScreenBuffer.cpp:214)
#02: mozilla::WebGLContext::BindFramebuffer(unsigned int, mozilla::WebGLFramebuffer*) (/home/jld/src/gecko-dev/dom/canvas/WebGLContextGL.cpp:153)
#03: mozilla::dom::WebGLRenderingContext_Binding::bindFramebuffer(JSContext*, JS::Handle<JSObject*>, mozilla::WebGLContext*, JSJitMethodCallArgs const&) (/home/jld/src/obj.gecko-dev/obj-x86_64-pc-linux-gnu/dom/bindings/WebGLRenderingContextBinding.cpp:14045)
Okay, so let's set that env var to not abort on error:
Assertion failure: false (Unreachable.), at /home/jld/src/gecko-dev/dom/canvas/WebGLContext.cpp:1573
#01: mozilla::WebGLContextLossHandler::TimerCallback() (/home/jld/src/gecko-dev/dom/canvas/WebGLContextLossHandler.cpp:100)
#02: mozilla::WatchdogTimerEvent::Notify(nsITimer*) (/home/jld/src/gecko-dev/dom/canvas/WebGLContextLossHandler.cpp:42)
#03: nsTimerImpl::Fire(int) (/home/jld/src/gecko-dev/xpcom/threads/nsTimerImpl.cpp:687)
Finally, a non-debug build:
WebGL(0x7face8e56000)::ForceLoseContext
…followed by a black page and in-content error message, but no crash or hang. So that's promising.
But, trying the other repro — navigate to a non-WebGL page, suspend+resume, navigate back to a WebGL page — gives me an infinite loop:
625 while (mGL.fGetError()) {}
(gdb) fin
Run till exit from #0 mozilla::gl::GLContext::LocalErrorScope::GetError (this=0x7ffdccc16c20) at /home/jld/src/obj.gecko-dev/obj-x86_64-pc-linux-gnu/dist/include/GLContext.h:625
That's the while loop that I deleted in my patch, because I ran into this when I was experimenting earlier, and the comment on fGetError says “you never have to loop on this”. A little more context from the stack (I can post the whole thing if it would help):
#10 0x00007faceea3850a in mozilla::gl::GLContext::LocalErrorScope::GetError() (this=0x7ffdccc16c20) at /home/jld/src/obj.gecko-dev/obj-x86_64-pc-linux-gnu/dist/include/GLContext.h:625
#11 0x00007faceea21e95 in mozilla::gl::GLContext::GetPotentialInteger(unsigned int, int*) (this=0x7faccf7f4000, pname=33309, param=<optimized out>) at /home/jld/src/obj.gecko-dev/obj-x86_64-pc-linux-gnu/dist/include/GLContext.h:647
#12 0x00007faceea21e95 in mozilla::gl::GLContext::InitExtensions()::$_4::operator()() const (this=<optimized out>) at /home/jld/src/gecko-dev/gfx/gl/GLContext.cpp:1633
#13 0x00007faceea21e95 in mozilla::gl::GLContext::InitExtensions() (this=0x7faccf7f4000) at /home/jld/src/gecko-dev/gfx/gl/GLContext.cpp:1630
...
#23 0x00007facef877094 in mozilla::WebGLContext::SetDimensions(int, int) (this=0x7facd0150800, signedWidth=<optimized out>, signedHeight=<optimized out>) at /home/jld/src/gecko-dev/dom/canvas/WebGLContext.cpp:874
Assignee | ||
Comment 22•6 years ago
|
||
I hate it when I shoot from the hip and miss.
Please try these builds, once ready:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=49f8757f861433ae2920614424063d155fc1238b
Flags: needinfo?(jld)
Assignee | ||
Comment 23•6 years ago
|
||
Just kidding, please try these builds instead:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=9bde8e92d35906fbace0f07a92a701cb7117bf8d
Comment 24•6 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Reporter | ||
Comment 25•6 years ago
|
||
(In reply to Jeff Gilbert [:jgilbert] from comment #23)
> Just kidding, please try these builds instead:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=9bde8e92d35906fbace0f07a92a701cb7117bf8d
The debug builds there are broken because MOZ_ASSERT(!err && err == LOCAL_GL_CONTEXT_LOST) should be an ||, which interestingly is the same mistake I made the first time I tried to fix the loop condition in RawGetErrorAndClear.
Using the PGO build[*], the second STR from comment #0 crashes thusly (and I saved the minidump in case it's useful):
Thread 0 (crashed)
0 libxul.so!mozilla::gl::GLContext::InitWithPrefixImpl(char const*, bool) [GLContext.cpp:9bde8e92d35906fbace0f07a92a701cb7117bf8d : 853 + 0x0]
...
11 libxul.so!mozilla::dom::HTMLCanvasElement_Binding::getContext(JSContext*, JS::Handle<JSObject*>, mozilla::dom::HTMLCanvasElement*, JSJitMethodCallArgs const&) [HTMLCanvasElementBinding.cpp: : 277 + 0xb]
Which is https://hg.mozilla.org/try/file/d2238accb71d4ca6093fa7562350109bc36d7bc1/gfx/gl/GLContext.cpp#l853
> 852 const auto err = mSymbols.fGetError();
> 853 MOZ_RELEASE_ASSERT(!err);
Maybe we should back out bug 1484782 from mozilla-beta while this gets ironed out? These patches look unlikely to be accepted for uplift at this point, but a backout probably would, since the only thing it would break is WebRender on Linux, which has to be manually turned on in about:config.
[*] https://tools.taskcluster.net/groups/Pt_TxpssSn6unB93_bbTjQ/tasks/czbwBhoIRbqA60y0Sx7TjQ/runs/0/artifacts
Status: RESOLVED → REOPENED
Flags: needinfo?(jld)
Resolution: FIXED → ---
Assignee | ||
Comment 26•6 years ago
|
||
Man, we check !err all over.
Assignee | ||
Comment 27•6 years ago
|
||
Try this please:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ee50acd41a56a8d56f4b4275a4416195f6d8616c
Flags: needinfo?(jld)
Reporter | ||
Comment 28•6 years ago
|
||
No difference with STR #1. With STR #2, it doesn't crash, but WebGL no longer works in the affected content process, and there's nothing logged to stderr. (http://vill.ee/eye shows an error in this case; Google Maps appears to quietly fall back to a 2D backend which uses Mercator projection instead of a 3D sphere.)
I pulled the patches from Try and made a debug build with the assertion from comment #21 removed, but it's not much more helpful:
[Child 8419, Main Thread] WARNING: GLContext::InitWithPrefix failed!: file /home/jld/src/gecko-dev/gfx/gl/GLContext.cpp, line 330
[Child 8419, Main Thread] WARNING: Failed to create GLXContext!: file /home/jld/src/gecko-dev/gfx/gl/GLContextProviderGLX.cpp, line 572
Flags: needinfo?(jld)
Updated•6 years ago
|
Assignee | ||
Comment 29•6 years ago
|
||
I have a repro machine now. There's a bunch of little issues here.
Assignee | ||
Comment 30•6 years ago
|
||
Yeah, context lost handling needs a big refresh.
I'll try to make a minimal patch for Beta tomorrow.
Assignee | ||
Comment 31•6 years ago
|
||
I added context loss/restore handling to one of my simple demos:
https://jdashg.github.io/misc/webgl/ccw-point.html
I have local patches that fix this. A minimal patch for beta is TBD.
Assignee | ||
Comment 32•6 years ago
|
||
Simplify error handling in GLContext.
Modernize context loss handling in GLContext.
Remove various unused parts.
Fix WebGLContext's context loss/restoration.
MozReview-Commit-ID: Lu2hi5HnP8x
Assignee | ||
Comment 33•6 years ago
|
||
I'm split, because it's hard to feel safe uplifting anything less than this whole patch.
We can't remove the purge-handling patch, because that might expose undefined data to WebGL.
I'll play with some hacks.
Comment 34•6 years ago
|
||
Pushed by jgilbert@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ff40e8ca1e42
Repair CONTEXT_LOST handling. r=lsalzman
Comment 35•6 years ago
|
||
bugherder |
Status: REOPENED → RESOLVED
Closed: 6 years ago → 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
status-firefox-esr60:
--- → unaffected
Assignee | ||
Comment 36•6 years ago
|
||
[Beta/Release Uplift Approval Request]
Feature/Bug causing the regression: None
User impact if declined: Linux proprietary nvidia gpu drivers will freeze/crash firefox on system suspend/resume from sleep.
Is this code covered by automated tests?: No
Has the fix been verified in Nightly?: Yes
Needs manual test from QE?: Yes
If yes, steps to reproduce:
List of other uplifts needed: None
Risk to taking this patch: Medium
Why is the change risky/not risky? (and alternatives if risky): It's risky in that it's a bunch of relatively delicate code. This patch makes it more robust, but, well, it's just tricky, and I'm recommending that we take only half of the patch we have on Nightly to reduce the risk of touching more of the codebase than we have to.
I have tested this myself though, and we really don't want this issue on Release, *and* we really really shouldn't revert the bug that caused this, so I think we should go for it.
String changes made/needed: none
Attachment #9028425 -
Flags: approval-mozilla-beta?
Comment 37•6 years ago
|
||
Hello, i previously reported bug 1489354 which seems to be directly related (or dupe of) to this one. After this bug patches was landed behavior i see in firefox with user switching is changed but its still far from being "As good as it was". For example, Firefox no longer hangs completely, instead it just paint black all tabs content untill firefox restart. Good thing is that UI is no longer blocked so now i can at least close firefox gracefully without using xkill or similar tools.
I running latest Nightly 65.0a1 (2018-11-30) build on Ubuntu 18.10 (Xorg) with "stock" Nvidia drivers version 390.87 with layers.acceleration.force-enabled=true and layers.offmainthreadcomposition.enabled=true, Webrender=off
Comment 38•6 years ago
|
||
(In reply to V. Korn from comment #37)
> Hello, i previously reported bug 1489354 which seems to be directly related
> (or dupe of) to this one. After this bug patches was landed behavior i see
> in firefox with user switching is changed but its still far from being "As
> good as it was". For example, Firefox no longer hangs completely, instead it
> just paint black all tabs content untill firefox restart.
I'm seeing the same problem. I filed a new bug 1511508.
Comment 40•6 years ago
|
||
Jeff, should we uplift despite 1511508? We're building 64.0 rc1 today.
Flags: needinfo?(jgilbert)
Assignee | ||
Comment 41•6 years ago
|
||
(In reply to Julien Cristau [:jcristau] from comment #40)
> Jeff, should we uplift despite 1511508? We're building 64.0 rc1 today.
We want this. 1511508 is P3, since accelerated layers on Linux is not officially supported.
Flags: needinfo?(jgilbert)
Assignee | ||
Comment 42•6 years ago
|
||
Also for time-sensitive things, please email me directly.
Comment 43•6 years ago
|
||
Comment on attachment 9028425 [details] [diff] [review]
0001-Bug-1492580-Restore-CONTEXT_LOST-handling.-minimal-r.patch
linux webgl fix, approved for 64.0 rc2.
Attachment #9028425 -
Flags: approval-mozilla-beta? → approval-mozilla-release+
Updated•6 years ago
|
Flags: qe-verify+
Comment 44•6 years ago
|
||
bugherder uplift |
Comment 45•6 years ago
|
||
I tested this issue using the STR from the description on Ubuntu 16.04 x32 with FF 65.0a1 (2018-11-20) and after resuming the OS and focus the WebGL tab (Google maps) I see a black screen and the tab is blocked. I also tested with FF 65.0a1(2018-12-06) and after resuming the OS and visit the WebGL page I see a black screen for ~1 sec and then everything is back to normal, the tab works, I also have this output in the terminal:
WebGL(0xa7acac00)::ForceLoseContext
WebGL(0xa7acac00)::ForceRestoreContext
WebGL(0xa7acac00)::ForceLoseContext
WebGL(0xa7acac00)::ForceRestoreContext
Please let me know if this is the expected result. Thanks
Flags: needinfo?(jgilbert)
Assignee | ||
Updated•6 years ago
|
Status: RESOLVED → VERIFIED
Comment 47•6 years ago
|
||
I also tested on RC build 3 and I can confirm the fix.
Updated•6 years ago
|
Flags: qe-verify+
Updated•6 years ago
|
Updated•6 years ago
|
You need to log in
before you can comment on or make changes to this bug.
Description
•