Closed Bug 769949 Opened 12 years ago Closed 12 years ago

With WebGL antialiasing enabled, leaving a WebGL FBO bound causes broken compositing

Categories

(Core :: Graphics: Layers, defect)

16 Branch
defect
Not set
normal

Tracking

()

VERIFIED FIXED
mozilla16
Tracking Status
firefox15 + verified
firefox16 --- fixed
firefox32 --- verified

People

(Reporter: epinal99-bugzilla2, Assigned: jgilbert)

References

(Blocks 1 open bug, )

Details

(Keywords: regression, Whiteboard: [games:p1])

Attachments

(3 files, 1 obsolete file)

STR

1) Be sure HWA is enabled
2) Open Tony Lee WebGL demo (textures and meshes on 3D model)
http://sandbox.ayzenberg.com/creative_prototypes/html5_/webgl/prototypeEightMe.html
3) You need to wait 5-10 sec to see the 3D model rotating
Once you are on the demo page, you can toggle through different textures by clicking anywhere on the tab.
More details: http://labs.ayzenberg.com/2012/03/html5-webgl-demo-tony-lee/

Result: 
With HWA on, the 3D model stays frozen and doesn't rotate with the mouse and clicking anywhere on the page doesn't change the texture.

With HWA off, the demo plays fine.

In FF13, the demo plays fine with HWA on/off.

Mozregression range:

m-c
good=2012-05-04
bad=2012-05-05
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=2db9df42823d&tochange=0a48e6561534

m-i
good=2012-05-02
bad=2012-05-03
http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=c3813fbb1c9a&tochange=de5745bce8bc

Suspected bug:
Nicholas Cameron - Bug 716439 - (GPU-clipping-rounded) Implement clipping to rectangles with rounded corners on the GPU
Assignee: nobody → ncameron
I seem to get different results than reported on comment 0:
 - with default settings (layers accel enabled), the demo plays "fine" except that it's extremely slow (have to wait far more than 10 seconds to see the results of mouse interaction).
 - with layers.acceleration.disabled=true, the canvas seems to remain blank black.

This with NVIDIA 296.88 driver on Windows 7 64bit, 32bit Nightly from 06-29.
Ah, wait!

With layers.acceleration.disabled, switching to another tab and coming back to the Tony Lee tab seems to immediately "fix" the rendering.
Yes but the 3D model move stays "frozen" and is'nt smooth like with FF13.
Does anyone else get an ASSERT thrown by Texture.cpp:241 in the Angle library when running this e.g. in a debug build? I get it fairly often, but not every run. I see similar behaviour to bjacob with HWA disabled, and as reported for HWA enabled. Looking into this now...
Actually, I get the same behaviour with and without accelerated layers, it is frozen and generally borked, but switching tabs causes an update. I can switch textures with accelerated layers on or off, but only with a tab change. Turning Direct2D off fixes everything.
This webgl demo, http://schumann.elis.ugent.be/#, seems to work fine with HWA, but not without, failing in the the same way as the reported one, again it is fine with or without HWA if D2D is turned off, so I assume it is the same bug. Not sure what that is though.
On the HWA side, not using a shareHandle to make mTexture in CanvasLayerD3D10 and leaving mUsingSharedTexture false (around line 85 in CanvasLayerD3D10.cpp) makes the problem go away. I couldn't find an equivalent fix for basic layers. I don't understand why, and I couldn't 'fix' the problem by doing anything different when updating or rendering the canvas (e.g., getting a fresh shareHandle and mTexture or reattaching the shader resource view).

I tried a few more prefs (in particular Azure content), but the only thing that seems to make a difference is Direct2D, so I'm assuming this problem is some interaction between Direct2D and WebGL/ANGLE. I can't see anything in the mask layers patches which would cause this, and preventing most of the mask layers code from executing does not fix this. But, I agree that it still looks like the most likely thing to affect this in the regression range.

I think the precise problem is that the webgl canvas is not repainting unless it is re-initialised.

The failed ANGLE assertion (Texture.cpp:241) is extremely intermittent and only presents in debug builds (obviously it won't assert in an optimised build but it could bork, I can't recover after the assertion in a debug build). Possibly due to time? WebGL in a debug build really crawls. I can't work out if it is related to the bug here or not, but the bug occurs always, and the assertion not very often, so possibly not. I don't know what causes the assertion.

I'm pretty stuck with this. I'm going to leave it for today and have another look tomorrow, I'd really appreciate it if anyone has any ideas for things to look at/try.
[Triage comment]
Tracking, since it's an unshipped regression so getting this fixed before release is ideal.
Nick, you can try to run another WebGL demo of Tony Lee if it's more "simple" to find the bug like this one about LOTR Ring:
http://sandbox.ayzenberg.com/creative_prototypes/html5_/webgl/prototypeEightLOD.html
On FF16, the ring doesn't rotate slowly and stays frozen.
I did some inbound builds this morning, and I can confirm that it is something in the mask layers (bug 716439) patches that is causing this problem, trying to work out exactly what now...
narrowed it down to changeset f7b8deeb0cc4. 

I didn't mean to change the tracking flags, but can't change them back, sorry :-(
The problem appears to be that the webgl canvas is being rendered as an active layer (Canvas layer), whereas before the mask layers patch it was treated as inactive and rendered on a Thebes layer. No idea why that causes the breakage yet though.
Turning bug 618892 off fixes this, and doesn't break other WebGL examples, but I guess we lose speed. No idea why D2D + pbuffers + Tony Lee breaks.
Maybe this path breaks the DidTransactionCallback mechanism? The WebGL implementation needs to know when compositing occurred, and knows that when the DidTransactionCallback function is called. If this callback isn't properly called by this path, that could explain it.
(In reply to Benoit Jacob [:bjacob] from comment #14)
> Maybe this path breaks the DidTransactionCallback mechanism? The WebGL
> implementation needs to know when compositing occurred, and knows that when
> the DidTransactionCallback function is called. If this callback isn't
> properly called by this path, that could explain it.

Thanks for the tip. I checked it though, and it seems to be getting called just fine.
I think bug 725747 may have broken this, commenting out http://mxr.mozilla.org/mozilla-central/source/gfx/gl/GLContext.h#1115 seems to fix this problem. Perhaps the internal FBO needs to be reset before calling this method or something?
That's very bad if that fixes it -- it means that some of the FBO tracking is busted.  I don't -fully- understand the internal/external FBO stuff, though I think I understand the intent. But gets called before any read operation, and I guess the check is to see if there's a user bound fbo or not.  If there isn't, then it does a blit from the offscreen draw fbo to the offscreen read fbo if they're different, which, I admit, I don't understand when that would be the case.

Or wait, yes I do.  It happens if we want to have a multisample front buffer, where we can't ReadPixels from directly -- we'd need to do a framebuffer resolve first, but we want to hide that fact from content (especially WebGL).  Ok; so I'd do some dumps to see what framebuffer stuff that WebGL thing is doing, and then see where our bindings are going wrong.

My guess -- it's drawing to framebuffer 0 (the front/display buffer), but is leaving some other FBO bound at the end of the JS draw operation.  If some code isn't being careful, that could cause this... I'd see what code is being read when fReadPixels (or raw_ReadPixels) is being called, and who's calling it, if we're going down the readpixels path.  Otherwise there might be something similar going on in another path.
Aha --- setting webgl.msaa=0  (which disables WebGL multisampling) fixes it.
Attached file testcase (deleted) —
minimal test case
Summary: Tony Lee WebGL demo (textures and meshes on 3D model) is broken with hardware acceleration enabled → With WebGL antialiasing enabled, leaving a WebGL FBO bound causes broken compositing
Jeff: See Nick's Comment 16 and the testcase.
Assignee: ncameron → jgilbert
I think I have a fix for this, it seems to me that BeforeGLReadCall should blit the offscreen read to the offscreen draw buffer if we are about to do a glFinish, even if the offscreen buffer is not currently bound. This certainly fixes the original Tony Lee example and bjacob's test case, not 100% sure it is right though.

I've pushed to Try: https://tbpl.mozilla.org/?tree=Try&rev=fdf1104da540
Attached patch possible patch (obsolete) (deleted) — Splinter Review
could do with a comment, but just to get the idea of what I think the fix might be
Attachment #639221 - Flags: review?(jgilbert)
(In reply to Benoit Jacob [:bjacob] from comment #19)
> Created attachment 639217 [details]
> testcase
> 
> minimal test case

There is something I don't understand. Why does your testcase not work with FF13 while Tony Lee's demo seems to work fine with FF13?

Indeed, mozregression found a different range for your testcase:
m-i
good=2012-03-12
bad=2012-03-13
http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=eceb16d4057b&tochange=d4496bf4bb64

Is this bug in 2 regression steps?
Pre-mask layers, any object with rounded corners was forced to be rendered to an inactive layer, this uses a different code path for canvases, so the Tony Lee demo rendered fine. With the mask layers patches (in the regression range), the Tony Lee demo got it's own layer (with a mask, but the mask has nothing to do with the bug), and so you get the same effects (bug) as in the minimal test case. So, the rounded corners hid the underlying bug which was exposed when mask layers where introduced, the mask layers didn't introduce the bug. You can test this by removing the rounded corners from the demo, and it breaks in the same way as for more recent versions of FF.

I haven't checked the regression range for the test case, but I suspect it will be for one of the patches which improved the FBO handling.
(In reply to Nick Cameron [:nrc] from comment #24)
> I haven't checked the regression range for the test case, but I suspect it
> will be for one of the patches which improved the FBO handling.
Thanks for the explanation.

Indeed, the suspected bug seems to be:
Jeff Gilbert — Bug 726396 - Repair ANGLE d3d share handle fetching an PBuffer creation behavior - r=bjacob 
The patch touched code about FBO bindings.
(In reply to Nick Cameron [:nrc] from comment #21)
> I think I have a fix for this, it seems to me that BeforeGLReadCall should
> blit the offscreen read to the offscreen draw buffer if we are about to do a
> glFinish, even if the offscreen buffer is not currently bound. This
> certainly fixes the original Tony Lee example and bjacob's test case, not
> 100% sure it is right though.
> 
> I've pushed to Try: https://tbpl.mozilla.org/?tree=Try&rev=fdf1104da540

Nope, not a good fix -- you're accidentally fixing the symptom, but without fixing the underlying bug.  glFinish should have nothing to do with blitting nor should it care about what FBO is currently bound.  Putting the BeforeGLReadCall wrapper around there probably causes the right front buffer to get updated at that time when the FBO bindings just happen to be "right".
Attachment #639221 - Flags: review?(jgilbert) → review-
Comment on attachment 639221 [details] [diff] [review]
possible patch

Review of attachment 639221 [details] [diff] [review]:
-----------------------------------------------------------------

Vlad's right, that's not it. I'll take a look. My guess is something touched the somewhat-delicate FBO binding code.
Yeah, looks GuaranteeResolve() will not trigger a AA-resolve blit if we're bound to something other than 0 (from the user's point of view). Patch forthcoming.
Status: NEW → ASSIGNED
Assignee: jgilbert → nobody
Component: Graphics → Graphics: Layers
Assignee: nobody → jgilbert
Blocks: 726396
Attached patch patch (deleted) — Splinter Review
GuaranteeResolved() must call BlitDirtyFBOs(). Previously, we were only calling fFinish(), which only blitted FBs if we were bound to FB '0'.

On my linux machine, bug confirmed, and fix verified.
Attachment #639221 - Attachment is obsolete: true
Attachment #639525 - Flags: review?(bjacob)
This affects all desktop platforms, but not mobile.
OS: Windows 7 → All
Hardware: x86_64 → All
Comment on attachment 639525 [details] [diff] [review]
patch

Review of attachment 639525 [details] [diff] [review]:
-----------------------------------------------------------------

I'm happy to see the "flush guarantees resolve" thing go away since it didn't seem to be used anyway; and happy too that fFinish() is no longer wrapped as a read-call.
Attachment #639525 - Flags: review?(bjacob) → review+
https://hg.mozilla.org/mozilla-central/rev/921d27b8a76e
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
I'll put together a backport of this, since it's really simple.
(In reply to Jeff Gilbert [:jgilbert] from comment #34)
> I'll put together a backport of this, since it's really simple.

Please nominate the backport for mozilla-beta approval now that 15 is on Beta. Since we have STR it would be great to get a fix landed in an earlier beta so we can get time to verify.
(In reply to Jeff Gilbert [:jgilbert] from comment #34)
> I'll put together a backport of this, since it's really simple.
Please do ASAP.
Attached patch backport (deleted) — Splinter Review
[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 726396
User impact if declined: Failure to render webgl sometimes under normal operation.
Testing completed (on m-c, etc.): m-c, on aurora.
Risk to taking this patch (and alternatives if risky): Low: This is well understood.
String or UUID changes made by this patch: None

Carry r+ forward from original patch.
Attachment #645564 - Flags: review+
Attachment #645564 - Flags: approval-mozilla-beta?
Comment on attachment 645564 [details] [diff] [review]
backport

Still early enough in beta cycle that we can take this fix for a regression.
Attachment #645564 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Verified on Firefox 15 beta 3 that WebGL loads fine having HWA enabled - the demo from the Description plays fine and that the canvas from the test case attached in Comment 19 repaints constantly. 
Verified on Windows 7, Ubuntu 12.04 and Mac OS X 10.7:

Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20100101 Firefox/15.0
Mozilla/5.0 (X11; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:15.0) Gecko/20100101 Firefox/15.0
Blocks: gecko-games
Verified fixed 32.0a1 (2014-05-18), Win 7 x64
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: