Closed Bug 777574 Opened 12 years ago Closed 10 years ago

Intermittent test_webgl_conformance_test_suite.html | [conformance/more/conformance/quickCheckAPI-B2.html,quickCheckAPI-B3.html,quickCheckAPI-B4.html] Timeout in this test page

Categories

(Core :: Graphics: CanvasWebGL, defect)

17 Branch
x86
Windows 7
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla34
Tracking Status
firefox32 --- fixed
firefox33 --- fixed
firefox34 --- fixed
firefox-esr24 --- fixed
firefox-esr31 --- fixed

People

(Reporter: mbrubeck, Assigned: u480271)

References

Details

(Keywords: intermittent-failure, Whiteboard: webgl-internal)

Attachments

(2 files)

https://tbpl.mozilla.org/php/getParsedLog.php?id=13834738&tree=Mozilla-Inbound
Rev3 WINNT 6.1 mozilla-inbound debug test mochitests-1/5 on 2012-07-25 03:49:54 PDT for push 607a4fe97df4
slave: talos-r3-w7-022
JavaScript warning: http://mochi.test:8888/tests/content/canvas/test/webgl/conformance/more/conformance/quickCheckAPI-B1.html, line 40: WebGL: bindAttribLocation: string contains the illegal character '36145'
JavaScript warning: http://mochi.test:8888/tests/content/canvas/test/webgl/conformance/more/conformance/quickCheckAPI-B1.html, line 40: WebGL: No further warnings will be reported for this WebGL context (already reported 32 warnings)
59815 INFO TEST-PASS | /tests/content/canvas/test/webgl/test_webgl_conformance_test_suite.html | [conformance/more/conformance/quickCheckAPI-B1.html] Test passed - testValidArgs
59816 INFO TEST-PASS | /tests/content/canvas/test/webgl/test_webgl_conformance_test_suite.html | [conformance/more/conformance/quickCheckAPI-B1.html] All 1 test(s) passed
59817 INFO TEST-INFO | /tests/content/canvas/test/webgl/test_webgl_conformance_test_suite.html | [conformance/more/conformance/quickCheckAPI-B2.html] (WebGL mochitest) Starting test page
Destroying context 0F6F6608 surface 1861AF50 on display 0E4AEE88
Destroying context 10D13CF8 surface 16F2CAE0 on display 0E4AEE88
--DOMWINDOW == 19 (0C158990) [serial = 1460] [outer = 00000000] [url = http://mochi.test:8888/tests/content/canvas/test/webgl/conformance/more/conformance/methods.html]
WARNING: Break suggested inside cluster!: file e:/builds/moz2_slave/m-in-w32-dbg/build/gfx/thebes/gfxFont.cpp, line 4380
WARNING: Break suggested inside cluster!: file e:/builds/moz2_slave/m-in-w32-dbg/build/gfx/thebes/gfxFont.cpp, line 4380
WARNING: Break suggested inside cluster!: file e:/builds/moz2_slave/m-in-w32-dbg/build/gfx/thebes/gfxFont.cpp, line 4380
WARNING: Break suggested inside cluster!: file e:/builds/moz2_slave/m-in-w32-dbg/build/gfx/thebes/gfxFont.cpp, line 4380
WARNING: Break suggested inside cluster!: file e:/builds/moz2_slave/m-in-w32-dbg/build/gfx/thebes/gfxFont.cpp, line 4380
WARNING: Break suggested inside cluster!: file e:/builds/moz2_slave/m-in-w32-dbg/build/gfx/thebes/gfxFont.cpp, line 4380
[...continues for many more lines...]
WARNING: Break suggested inside cluster!: file e:/builds/moz2_slave/m-in-w32-dbg/build/gfx/thebes/gfxFont.cpp, line 4380
++DOMWINDOW == 20 (109ED6C8) [serial = 1463] [outer = 0FECF658]
59818 ERROR TEST-UNEXPECTED-FAIL | /tests/content/canvas/test/webgl/test_webgl_conformance_test_suite.html | [conformance/more/conformance/quickCheckAPI-B2.html] Timeout in this test page
59819 INFO TEST-INFO | /tests/content/canvas/test/webgl/test_webgl_conformance_test_suite.html | [conformance/more/conformance/quickCheckAPI-B3.html] (WebGL mochitest) Starting test page
--DOMWINDOW == 19 (0C157A78) [serial = 1461] [outer = 00000000] [url = http://mochi.test:8888/tests/content/canvas/test/webgl/conformance/more/conformance/quickCheckAPI-A.html]
Destroying context 0FF78FA8 surface 16F16800 on display 0E4AEE88
++DOMWINDOW == 20 (0C1598A8) [serial = 1464] [outer = 0FECF658]
EGL Config: 7 [00000007]
  BUFFER_SIZE: 32 (0x0020)
  ALPHA_SIZE: 8 (0x0008)
  BLUE_SIZE: 8 (0x0008)
  GREEN_SIZE: 8 (0x0008)
  RED_SIZE: 8 (0x0008)
  DEPTH_SIZE: 24 (0x0018)
  STENCIL_SIZE: 0 (0x0000)
  CONFIG_CAVEAT: 12368 (0x3050)
  CONFIG_ID: 7 (0x0007)
  LEVEL: 0 (0x0000)
  MAX_PBUFFER_HEIGHT: 8192 (0x2000)
  MAX_PBUFFER_PIXELS: 67108864 (0x4000000)
  MAX_PBUFFER_WIDTH: 8192 (0x2000)
  NATIVE_RENDERABLE: 0 (0x0000)
  NATIVE_VISUAL_ID: ERROR (0x3004)
  NATIVE_VISUAL_TYPE: 0 (0x0000)
  PRESERVED_RESOURCES: ERROR (0x3004)
  SAMPLES: 0 (0x0000)
  SAMPLE_BUFFERS: 0 (0x0000)
  SURFACE_TYPE: 1029 (0x0405)
  TRANSPARENT_TYPE: 12344 (0x3038)
  TRANSPARENT_RED_VALUE: 0 (0x0000)
  TRANSPARENT_GREEN_VALUE: 0 (0x0000)
  TRANSPARENT_BLUE_VALUE: 0 (0x0000)
  BIND_TO_TEXTURE_RGB: 0 (0x0000)
  BIND_TO_TEXTURE_RGBA: 1 (0x0001)
  MIN_SWAP_INTERVAL: 0 (0x0000)
  MAX_SWAP_INTERVAL: 4 (0x0004)
  LUMINANCE_SIZE: 0 (0x0000)
  ALPHA_MASK_SIZE: 0 (0x0000)
  COLOR_BUFFER_TYPE: 12430 (0x308e)
  RENDERABLE_TYPE: 4 (0x0004)
  CONFORMANT: 4 (0x0004)
Initializing context 0F6F6608 surface 16F1CF98 on display 0E4AEE88
EGL Config: 7 [00000007]
  BUFFER_SIZE: 32 (0x0020)
  ALPHA_SIZE: 8 (0x0008)
  BLUE_SIZE: 8 (0x0008)
  GREEN_SIZE: 8 (0x0008)
  RED_SIZE: 8 (0x0008)
  DEPTH_SIZE: 24 (0x0018)
  STENCIL_SIZE: 0 (0x0000)
  CONFIG_CAVEAT: 12368 (0x3050)
  CONFIG_ID: 7 (0x0007)
  LEVEL: 0 (0x0000)
  MAX_PBUFFER_HEIGHT: 8192 (0x2000)
  MAX_PBUFFER_PIXELS: 67108864 (0x4000000)
  MAX_PBUFFER_WIDTH: 8192 (0x2000)
  NATIVE_RENDERABLE: 0 (0x0000)
  NATIVE_VISUAL_ID: ERROR (0x3004)
  NATIVE_VISUAL_TYPE: 0 (0x0000)
  PRESERVED_RESOURCES: ERROR (0x3004)
  SAMPLES: 0 (0x0000)
  SAMPLE_BUFFERS: 0 (0x0000)
  SURFACE_TYPE: 1029 (0x0405)
  TRANSPARENT_TYPE: 12344 (0x3038)
  TRANSPARENT_RED_VALUE: 0 (0x0000)
  TRANSPARENT_GREEN_VALUE: 0 (0x0000)
  TRANSPARENT_BLUE_VALUE: 0 (0x0000)
  BIND_TO_TEXTURE_RGB: 0 (0x0000)
  BIND_TO_TEXTURE_RGBA: 1 (0x0001)
  MIN_SWAP_INTERVAL: 0 (0x0000)
  MAX_SWAP_INTERVAL: 4 (0x0004)
  LUMINANCE_SIZE: 0 (0x0000)
  ALPHA_MASK_SIZE: 0 (0x0000)
  COLOR_BUFFER_TYPE: 12430 (0x308e)
  RENDERABLE_TYPE: 4 (0x0004)
  CONFORMANT: 4 (0x0004)
Initializing context 10D13CF8 surface 16F1B2E8 on display 0E4AEE88
59820 INFO TEST-PASS | /tests/content/canvas/test/webgl/test_webgl_conformance_test_suite.html | [conformance/more/conformance/quickCheckAPI-B3.html] Test passed - testValidArgs
59821 INFO TEST-PASS | /tests/content/canvas/test/webgl/test_webgl_conformance_test_suite.html | [conformance/more/conformance/quickCheckAPI-B3.html] All 1 test(s) passed
Whiteboard: [orange]
Jeff, any objections to just skipping this test on Linux?
Flags: needinfo?(jgilbert)
Yes, but I think we should just skip it for now. I need to make this harness more robust.
Flags: needinfo?(jgilbert)
Flags: needinfo?(ryanvm)
Attached patch: Skip quickCheckAPI-B2.html on Linux (deleted)
Attachment #8345281 - Flags: review?(jgilbert)
Flags: needinfo?(ryanvm)
Jeff, review ping?
Flags: needinfo?(jgilbert)
Comment on attachment 8345281: Skip quickCheckAPI-B2.html on Linux
I don't know the code well enough, so reassigning to bjacob.
Attachment #8345281 - Flags: review?(jgilbert) → review?(bjacob)
Attachment #8345281 - Flags: review?(bjacob) → review+
Flags: needinfo?(jgilbert)
Whiteboard: [test disabled on Linux][leave open] → [test disabled on Linux][leave open] webgl-internal
Added quickCheckAPI-B2.html to skipped_tests_linux_mesa.txt in https://hg.mozilla.org/integration/mozilla-inbound/rev/371e42c987d6 since that's the skiplist that ASan is apparently using.
Hello, B3!
Summary: Intermittent test_webgl_conformance_test_suite.html | [conformance/more/conformance/quickCheckAPI-B2.html] Timeout in this test page → Intermittent test_webgl_conformance_test_suite.html | [conformance/more/conformance/quickCheckAPI-B2.html,quickCheckAPI-B3.html,quickCheckAPI-B4.html] Timeout in this test page
I'm pretty sure you'll find this will keep happening until you skip all the remaining tests. Have you tested this?
Nope, because no acceptable test harness should have interdependent tests. What am I missing, that caused B3 to run perfectly as long as B2 timed out, but not if B2 didn't run, and caused B4 to run perfectly as long as B3 timed out, and caused C to run perfectly as long as one of the B tests timed out?
The interdependence between tests boils down strictly to this: the test slave has finite resources, and recently executed tests can leave fewer resources immediately available for the next test (depending on random things such as GC and the behavior of various system resource allocations). The only thing that is specific to these tests, and that makes them prone to failing and to causing other tests to fail intermittently, is that they can consume lots of resources. That is because they are "fuzzing" tests that pass various arguments to WebGL functions, some of which result in large resource allocations. The way to fix that would be to edit these tests to avoid generating overly large values for the arguments that determine the amount of resources allocated. We've done that in the past already, though a long time ago.
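To illustrate the kind of edit described above, here is a minimal sketch of capping a fuzzed size argument before it reaches a call that allocates. It is not taken from the quickCheckAPI tests; the names `MAX_FUZZ_BYTES`, `capSize`, and `randomBufferSize`, and the 1 MiB limit, are assumptions made up for this example. Only values above the cap are changed, so negative or NaN arguments still reach the API and keep exercising its validation.

```typescript
// Hypothetical cap; a real test would pick a value suited to the test slaves.
const MAX_FUZZ_BYTES = 1048576; // 1 MiB

// Reduce oversized values to the cap and leave everything else untouched,
// so invalid inputs (negative numbers, NaN) are still passed through.
function capSize(value: number, maxBytes: number = MAX_FUZZ_BYTES): number {
  return value > maxBytes ? maxBytes : value;
}

// Hypothetical generator for a fuzzed buffer size. The raw value can be as
// large as 2^32, which is the sort of request that can exhaust a slave.
function randomBufferSize(rng: () => number = Math.random): number {
  const raw = Math.floor(rng() * 4294967296);
  return capSize(raw);
}
```

A fuzzing test would then pass `randomBufferSize()` wherever it previously passed an unbounded random size.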
I've got a really strong sense of deja vu telling me we had the same conversation about these tests on Android (which didn't end especially well, since I just wound up throwing the Tegras under the bus). So timing out in one test does... we know not what, but whatever it does it lets us catch our breath again. Can we implement an intentional version of timing out, sleeping for 15 seconds before starting the first of these tests? quickCheckAPI-A_Nap.html? Is it the cumulative effect of every test which runs before these tests which requires a timeout during one of them, or is it the result of the last test before the quickcheck set, or some particular test shortly before the quickcheck set, which leaves us winded, and we should be disabling backward from B2 rather than forward?
Milan: while this bug has been around for a long time, it seems to have increased in frequency around September 2013. The current rate of intermittent failure is unreasonable (incurs lots of work for tree sheriffs). I believe that it's worth asking someone to fix this. See comment 460 about how to think of these failures and try fixing them.
Flags: needinfo?(milan)
While frequency increased a first time around September 2013, it seems to have increased a second time on top of that, very recently: around January 30, 2014, i.e. last week. Indeed, see comment 420 and onwards. At that time, we started having a very large number of occurrences on a single day.
Actually, if you look closer, you'll see the increase started on the afternoon of January 28th, when we changed from one sort of Amazon instance for Linux64 slaves to another sort.
Depends on: 969590
Blocks: 966070
No longer blocks: 966070
Blocks: 945981
Dan, let's see if we can clean this up - take a look at comment 460, 525, 526, 531.
Assignee: nobody → dglastonbury
Flags: needinfo?(milan)
Some notes from bjacob:
I would:
1) Figure out how to print to the mochitest log. Last I checked, this meant writing to the gDumpFile (search for that).
2) Log WebGLContext::DestroyResourcesAndContext, SetDimensions, etc.
You might also want to figure out how to dump about:memory dumps to the mochitest log.
To force a "lost context" on other WebGLContexts, see this code: WebGLContext::LoseOldestWebGLContextIfLimitExceeded() (you might also want to change the limits there. If you set the limits to 1, what happens?)
Here is the other dimension to explore: these specific tests are fuzzing all WebGL entry points. They are sorted alphabetically. B2 means things starting with B, second half... my guess is that this includes bufferData. You can bet what happens when you fuzz bufferData, which takes a size parameter. Huge size --> OOM. So, do add logging to bufferData.
We have edited such fuzzing tests before to make them not try too large sizes; it would probably be acceptable to do it again (and then, maybe figure out a way to upstream that, maybe as a special mode).
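As a rough illustration of the bufferData logging idea, the sketch below wraps `bufferData` on the test (JavaScript) side so that every requested size is recorded before the call is forwarded. This is a different technique from the C++ logging inside WebGLContext that the notes describe. The `BufferDataLike` interface, `byteSizeOf`, `logBufferData`, and the `log` sink are all hypothetical names for this example; how a line actually reaches the mochitest log is left to the caller.

```typescript
// Hypothetical minimal shape of the one method being wrapped.
interface BufferDataLike {
  bufferData(
    target: number,
    sizeOrData: number | ArrayBufferView | ArrayBuffer | null,
    usage: number
  ): void;
}

// Number of bytes a bufferData call asks for, for either call form.
function byteSizeOf(arg: number | ArrayBufferView | ArrayBuffer | null): number {
  if (arg === null) return 0;
  if (typeof arg === "number") return arg;
  return arg.byteLength;
}

// Replace gl.bufferData with a wrapper that logs first and then forwards.
// Logging before forwarding means the line exists even if the call never returns.
function logBufferData(gl: BufferDataLike, log: (line: string) => void): void {
  const original = gl.bufferData.bind(gl);
  gl.bufferData = (target, sizeOrData, usage) => {
    log("bufferData target=0x" + target.toString(16) + " bytes=" + byteSizeOf(sizeOrData));
    original(target, sizeOrData, usage);
  };
}
```

With this in place, the last `bufferData` line before a timeout would show whether a huge size was requested just before the test stalled.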
Is it always a Linux ASAN build that fails? Are we running out of RAM because of ASAN keeping the old allocation around to check access after free? What happens in the OOM case? Can we log the stack/cause/etc. to the mochitest log?
WebGL tests have been unowned for so long that we also have the dumping-ground bug 920904, so knowing whether the failures starred as this which were not on ASan builds were in fact "this" or were that, or whether things in quickCheckAPI tests starred as that, ASan or not, are in fact this, will require that you first determine what "this" is.
OK. I give in! I'm going to request a loaner VM.
Currently, this is causing mochitest-gl on Android 2.3 to fail ~1/3 of the time, clearly nowhere near our allowable failure rate for default visibility. Mochitest-2 on Linux debug/ASAN isn't much better. We need to do something here, or it's time to start hiding the test suite or disabling large chunks of it.
Flags: needinfo?(jgilbert)
Since this seems to have generated suggestions to hide webgl mochitests, and above conversations about how to investigate this properly seem to never have been carried out fully, how about we take the intermediate step of just skipping the quickCheckAPI* tests on affected OSes? These are the tests specifically causing trouble here, and for a well-understood reason (they are fuzzing, typically causing large allocations). That would be a trivial patch and would be 1 million times better than hiding WebGL tests.
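For readers unfamiliar with per-platform skipping, the idea amounts to filtering the list of test pages against a platform-specific skip list before running them. The sketch below is illustrative only and does not reflect how the WebGL mochitest harness or the attached patch implements it; the `Platform` type, the `SKIP_PREFIXES` table, and the choice of which platforms skip are assumptions for the example.

```typescript
type Platform = "linux" | "android" | "windows" | "mac";

// Hypothetical skip table: path prefixes to skip on each platform.
const SKIP_PREFIXES: Record<Platform, string[]> = {
  linux: ["conformance/more/conformance/quickCheckAPI"],
  android: ["conformance/more/conformance/quickCheckAPI"],
  windows: [],
  mac: []
};

// True when the test path starts with any prefix listed for the platform.
function shouldSkip(testPath: string, platform: Platform): boolean {
  return SKIP_PREFIXES[platform].some((prefix) => testPath.indexOf(prefix) === 0);
}

// Keep only the tests that should run on the given platform.
function filterTests(tests: string[], platform: Platform): string[] {
  return tests.filter((t) => !shouldSkip(t, platform));
}
```

Matching on a prefix rather than on individual file names covers quickCheckAPI-A through the later pages at once, which avoids the pattern seen earlier in this bug where skipping B2 only moved the timeout to B3.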
Attachment #8471023 - Flags: review?(jgilbert)
Attachment #8471023 - Flags: review?(dglastonbury) → review+
Attachment #8471023 - Flags: review?(jgilbert)
https://hg.mozilla.org/integration/mozilla-inbound/rev/d78a39f01102 The above try push was green with about 10 retriggers, but I don't know what the frequency of this orange was. Now that it's on inbound, we'll get more data.
Ryan, shall we close this as fixed? I seem to recall that this also occurred at low frequency on OSX, which I didn't try to fix.
Flags: needinfo?(jgilbert) → needinfo?(ryanvm)
I've backported this patch to all active branches, so I'm OK resolving this if you want. OSX tends to be bug 920904 more than anything else, IIRC.
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(ryanvm)
Resolution: --- → FIXED
Whiteboard: [test disabled on Linux][leave open] webgl-internal → webgl-internal
Target Milestone: --- → mozilla34