Closed Bug 692605 Opened 13 years ago Closed 13 years ago

Perma-orange Aurora (and Beta) Linux opt (non-PGO) browser_webconsole_bug_595934_message_categories.js | Test timed out and test_errorPropagation.html | Test timed out followed by shutdown timeout

Categories

(Core :: XPCOM, defect)

x86
Linux
defect
Not set
blocker

Tracking

()

RESOLVED FIXED
mozilla10
Tracking Status
firefox7 --- unaffected
firefox8 --- fixed
firefox9 --- fixed
firefox10 --- fixed

People

(Reporter: mbrubeck, Assigned: bent.mozilla)

References

Details

(Keywords: intermittent-failure, Whiteboard: [qa-] [inbound][see comment #111])

Attachments

(1 file)

Ever since PGO was turned off for opt builds (bug 658313), these failures are occurring in every non-PGO Linux 32-bit opt build on Aurora (Firefox 9). We're pretty sure that non-PGO is to blame. The failures do not occur in PGO builds. They do occur in non-PGO builds even on pushes that were green before the config change, and the failures kept occurring after backing out the changeset where they first occurred. https://tbpl.mozilla.org/php/getParsedLog.php?id=6710381&tree=Mozilla-Aurora Rev3 Fedora 12 mozilla-aurora opt test mochitests-3/5 on 2011-10-06 13:40:03 PDT for push 4c15202ca083 5937 INFO TEST-PASS | /tests/dom/workers/test/test_errorPropagation.html | Correct message event.filename - http://mochi.test:8888/tests/dom/workers/test/errorPropagation_worker.js should equal http://mochi.test:8888/tests/dom/workers/test/errorPropagation_worker.js 5938 INFO TEST-PASS | /tests/dom/workers/test/test_errorPropagation.html | Correct message event.lineno - 48 should equal 48 5939 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/workers/test/test_errorPropagation.html | Test timed out. XScreenSaver state: Disabled User input has been idle for 743 seconds args: ['/home/cltbld/talos-slave/test/build/bin/screentopng'] ... 16466 INFO SimpleTest FINISHED TEST-UNEXPECTED-FAIL | Shutdown | application timed out after 330 seconds with no output INFO | automation.py | Application ran for: 0:14:38.951252 INFO | automation.py | Reading PID log: /tmp/tmp6qt-3tpidlog PROCESS-CRASH | Shutdown | application crashed (minidump found) Crash dump filename: /tmp/tmpCIjJRu/minidumps/68928f9b-b21b-2ce0-74dd07a2-79f2fec2.dmp Operating system: Linux 0.0.0 Linux 2.6.31.5-127.fc12.i686.PAE #1 SMP Sat Nov 7 21:25:57 EST 2009 i686 CPU: x86 GenuineIntel family 6 model 23 stepping 10 2 CPUs Crash reason: SIGABRT Crash address: 0x847 Thread 0 (crashed) 0 libpthread-2.11.so + 0x8edd eip = 0x00c6eedd esp = 0xbf9a0074 ebp = 0xbf9a0078 ebx = 0x08051a80 esi = 0xb7673040 edi = 0x93dd1000 eax = 0x00000003 ecx = 0x00000008 edx = 0x00000000 efl = 0x00200246 Found by: given as instruction pointer in context 1 firefox-bin!malloc_mutex_unlock [jemalloc.c : 1476 + 0x5] eip = 0x0804b1ae esp = 0xbf9a0080 ebp = 0xb7673040 Found by: previous frame's frame pointer 2 firefox-bin!arena_malloc [jemalloc.c : 3818 + 0xa] eip = 0x0804d78d esp = 0xbf9a00a0 ebp = 0xb7673040 ebx = 0x08051a80 Found by: call frame info 3 firefox-bin!imalloc [jemalloc.c : 3892 + 0x10] eip = 0x0804da5c esp = 0xbf9a00d0 ebp = 0xbf9a018c ebx = 0x08051a80 esi = 0x0000000a edi = 0x00000000 Found by: call frame info 4 firefox-bin!malloc [jemalloc.c : 5986 + 0x6] eip = 0x0804da9a esp = 0xbf9a00f0 ebp = 0xbf9a018c ebx = 0x08051a80 esi = 0x0000000a edi = 0x00000000 Found by: call frame info 5 libmozalloc.so!moz_malloc [mozalloc.cpp : 113 + 0x8] eip = 0x00143b41 esp = 0xbf9a0100 ebp = 0xbf9a018c ebx = 0x00144118 esi = 0x00000002 edi = 0x00000000 Found by: call frame info 6 libxul.so!nsStringBuffer::Alloc [nsSubstring.cpp : 209 + 0x8] eip = 0x01c29a4d esp = 0xbf9a0120 ebp = 0xbf9a018c ebx = 0x02323174 esi = 0x00000002 edi = 0x00000000 Found by: call frame info 7 libxul.so!nsAString_internal::MutatePrep [nsTSubstring.cpp : 162 + 0x8] eip = 0x01c29c1d esp = 0xbf9a0140 ebp = 0xbf9a018c ebx = 0x02323174 esi = 0xa5e02e38 edi = 0x00000000 Found by: call frame info 8 libxul.so!nsAString_internal::ReplacePrepInternal [nsTSubstring.cpp : 198 + 0x13] eip = 0x01c29c48 esp = 0xbf9a0170 ebp = 0x00000000 ebx = 0x02323174 esi = 0xa5e02e38 edi = 0x00000000 Found by: call frame info 9 libxul.so!nsAString_internal::ReplacePrep [nsTSubstring.h : 685 + 0x10] eip = 0x01c29d56 esp = 0xbf9a01a0 ebp = 0x00000000 ebx = 0x02323174 esi = 0xa5e02e38 edi = 0x00000000 Found by: call frame info https://tbpl.mozilla.org/php/getParsedLog.php?id=6710557&tree=Mozilla-Aurora Rev3 Fedora 12 mozilla-aurora opt test mochitest-other on 2011-10-06 13:40:06 PDT for push 4c15202ca083 TEST-PASS | chrome://mochitests/content/browser/browser/devtools/webconsole/test/browser/browser_webconsole_bug_595934_message_categories.js | test #4: message found 'multipart/form-data' TEST-PASS | chrome://mochitests/content/browser/browser/devtools/webconsole/test/browser/browser_webconsole_bug_595934_message_categories.js | test #4: error category 'HTML' TEST-INFO | chrome://mochitests/content/browser/browser/devtools/webconsole/test/browser/browser_webconsole_bug_595934_message_categories.js | Console message: [JavaScript Warning: "Form contains enctype=multipart/form-data, but does not contain method=post. Submitting normally with method=GET and no enctype instead." {file: "http://example.com/browser/browser/devtools/webconsole/test//browser/test-bug-595934-html.html" line: 0}] TEST-PASS | chrome://mochitests/content/browser/browser/devtools/webconsole/test/browser/browser_webconsole_bug_595934_message_categories.js | test #5: message found 'no element found' TEST-PASS | chrome://mochitests/content/browser/browser/devtools/webconsole/test/browser/browser_webconsole_bug_595934_message_categories.js | test #5: error category 'malformed-xml' TEST-INFO | chrome://mochitests/content/browser/browser/devtools/webconsole/test/browser/browser_webconsole_bug_595934_message_categories.js | Console message: [JavaScript Error: "no element found" {file: "http://example.com/browser/browser/devtools/webconsole/test//browser/test-bug-595934-malformedxml.xhtml" line: 11}] TEST-PASS | chrome://mochitests/content/browser/browser/devtools/webconsole/test/browser/browser_webconsole_bug_595934_message_categories.js | test #6: message found 'fooBarSVG' TEST-PASS | chrome://mochitests/content/browser/browser/devtools/webconsole/test/browser/browser_webconsole_bug_595934_message_categories.js | test #6: error category 'SVG' TEST-INFO | chrome://mochitests/content/browser/browser/devtools/webconsole/test/browser/browser_webconsole_bug_595934_message_categories.js | Console message: [JavaScript Warning: "Unexpected value fooBarSVG parsing height attribute." {file: "http://example.com/browser/browser/devtools/webconsole/test//browser/test-bug-595934-svg.xhtml" line: 0}] TEST-INFO | unknown test url | [SimpleTest/SimpleTest.js, window.onerror] An error occurred: fooBarWorker is not defined at http://example.com/browser/browser/devtools/webconsole/test//browser/test-bug-595934-workers.js:7 TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/browser/devtools/webconsole/test/browser/browser_webconsole_bug_595934_message_categories.js | Test timed out XScreenSaver state: Disabled User input has been idle for 1338 seconds args: ['/home/cltbld/talos-slave/test/build/bin/screentopng'] INFO TEST-END | chrome://mochitests/content/browser/browser/devtools/webconsole/test/browser/browser_webconsole_bug_595934_message_categories.js | finished in 30043ms
Note that we've also had other issues on Linux on the -O3 non-pgo builds... glandium might recall details.
We've had float rounding issues, but nothing crashy that i remember.
BTW, non-PGO should build -Os on aurora, by now.
(In reply to Mike Hommey [:glandium] from comment #3) > BTW, non-PGO should build -Os on aurora, by now. Confirmed, that was bug 655003.
Summary: Perma-orange Aurora Linux opt (non-PGO) browser_webconsole_bug_595934_message_categories.js | Test timed out and test_errorPropagation.html | Test timed out followed by shutdown timeout → Perma-orange Aurora (and Beta) Linux opt (non-PGO) browser_webconsole_bug_595934_message_categories.js | Test timed out and test_errorPropagation.html | Test timed out followed by shutdown timeout
Not quite literally permaorange on m-b: there have been three builds, two of which say in the scrape that they were clobbers, which hit this, and one which does not say it was, which did not hit this.
So, Luke pushed https://hg.mozilla.org/integration/mozilla-inbound/rev/4c0e6b3f7791, removing a totally empty header from jseng, and browser_webconsole_bug_595934_message_categories.js went permaorange on 10.6 opt. The only thing I've seen doing that same sort of senseless and impossible in a reasonable world thing is when we've got code accessing memory that it totally shouldn't be touching, and a nothing change to jseng suddenly affects a test that couldn't possibly be affected. A previous case with a11y only happened with some PGO'd builds, when PGO happened to move things around to make the badness happen; apparently the Aurora Linux thing is the opposite, and PGO will cover our tracks but non-PGO leaves us busted. Given that the Aurora case is both devtools and workers, does that mean that both of them are being bad, or is the devtools test being bitten by bad workers? Will running these tests under Valgrind lead to it shrieking in horror at the things we're doing?
Blocks: 689362
Severity: normal → blocker
The devtools hang is the only devtools test that uses a worker, so I would not be too suspicious of the devtools code. The way these two failing tests work is that they register an error console listener and then use a worker to generate an error. The tests wait until they see the error appear in the console before continuing. Sometimes they randomly time out without anything else seeming to fail (at least, I don't see anything else in the logs). I'm not ready to blame the worker code yet, though. I've looked over the worker error generation code a bunch now and can't see anything amiss. On a whim I decided to take a look at the code in the error console listener implementation and I ran across this (nsConsoleService.cpp#161): * Iterate through any registered listeners and tell them about * the message. We use the mListening flag to guard against * recursive message logs. This could sometimes result in * listeners being skipped because of activity on other threads, * when we only care about the recursive case. That seems pretty bad.. If we're unlucky enough to have an error reported on another thread immediately before the worker code generates the error that the tests are waiting on we'll miss the report and the test will never finish. I'll see what we can do to fix this behavior (or the tests as a bandaid - maybe by looking at the console message list every second or so?). Basically this problem boils down to a bad test mechanism, not a bug in workers. However, I don't think we can simply wave this off due to the crashes, and I'm pretty sure that the crashes are entirely due to the fact that the console manager is using proxies of JS objects to make its listeners get called back on the correct thread. Proxies are no longer safe to be used by JS on other threads, so this is a big red flag. In short, I blame the console service for all my woes ;) CC'ing luke and bsmedberg since they have a plan to nuke proxies and may have a patch to fix that part. Anyone think my hunches are bonkers?
Component: General → XPCOM
QA Contact: general → xpcom
Inconveniently, backing out luke did not fix inbound's permaorange. His 10.6 opt build was a clobber, the two before that were not (and the four before that were burning), so either of them or Waldo's backout not really backing everything out are possibilities if they didn't manage to take effect until after a clobber, as is some releng change. Not sure what, if any, part of that fits with blaming the console service.
No longer blocks: 689362
the inbound permaorange has been fixed by backing out Bug 689101, Bug 692987 and Bug 692722
(In reply to ben turner [:bent] from comment #85)> > Anyone think my hunches are bonkers? Well, I don't get it, but that's not an unusual thing for me. How does the theory that maybe sometimes an error gets reported on another thread right before your error work with the evidence, that in Linux32 PGO builds on mozilla-aurora, it never ever happens, but in Linux32 non-PGO builds on aurora it happens every single time for both tests, and in Linux32 non-PGO builds on mozilla-beta it happens every single time only for the devtools test, not for the workers test? And what part of sfink's pushes caused an error to be reported on another thread right before yours, every single time, only on 10.6 opt?
(In reply to Phil Ringnalda (:philor) from comment #90) Dunno, race conditions like this are always pretty strange, subject to weird tiny variations in timing and whatever OS scheduler you're dealing with. I'm reasonably certain this will go away once bug 675221 removes the proxies and makes the listener logic more sane. In the meantime I believe that we're just going to disable this test.
Depends on: 675221
This is rs=sicking.
Assignee: nobody → bent.mozilla
Status: NEW → ASSIGNED
Attachment #566979 - Flags: review+
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla10
Comment on attachment 566979 [details] [diff] [review] Patch to disable console listener tests This just disables small parts of two tests that are problematic. The real fix will land on trunk soon-ish, but the patch isn't really suitable for branches. Disabling these test pieces in the interim seems the smartest path forward.
Attachment #566979 - Flags: approval-mozilla-beta?
Attachment #566979 - Flags: approval-mozilla-aurora?
can we turn this off on aurora or is this a real bug?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [orange] [inbound] → [orange] [inbound][see comment #111]
Ok, as per c#111, we can create a patch to turn off the test on aurora. We'll do the work in this bug until it lands.
and, since I'm talking to myself in this bug now, can we get approval to land BenT's patch? :)
I think we leave this FIXED since it has landed on trunk then wait for them to approve the patch for aurora.
Status: REOPENED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → FIXED
Maybe so, I just wanted the aurora-approver-gods to see that there was still something to be done here. Hopefully my incessant prattling will convince them of that! :)
Comment on attachment 566979 [details] [diff] [review] Patch to disable console listener tests Make it so!
Attachment #566979 - Flags: approval-mozilla-beta?
Attachment #566979 - Flags: approval-mozilla-beta+
Attachment #566979 - Flags: approval-mozilla-aurora?
Attachment #566979 - Flags: approval-mozilla-aurora+
Whiteboard: [orange] [inbound][see comment #111] → [qa-][orange] [inbound][see comment #111]
Heads up: having revised the infrastructure for console-monitoring tests, I turned these tests back on in bug 663291 (specifically, https://hg.mozilla.org/integration/mozilla-inbound/rev/b36eaac9ecf8 ) I may not have gotten all the race conditions.
... and they were backed out again because yep, still race conditions. Filed new bug 812714 to get them fixed properly.
Whiteboard: [qa-][orange] [inbound][see comment #111] → [qa-] [inbound][see comment #111]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: