Closed Bug 867183 Opened 12 years ago Closed 11 years ago

Intermittent image\test\unit\test_async_notification.js | Test timed out

Categories

(Core :: Graphics: ImageLib, defect)

x86_64
Windows 8
defect
Not set
normal

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox24 --- affected
firefox25 --- affected
firefox26 --- affected

People

(Reporter: RyanVM, Assigned: seth)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 file, 8 obsolete files)

https://tbpl.mozilla.org/php/getParsedLog.php?id=22413278&tree=Mozilla-Central WINNT 6.2 mozilla-central pgo test xpcshell on 2013-04-30 07:12:48 PDT for push dd0c611a0a27 slave: t-w864-ix-020 07:15:58 INFO - TEST-INFO | C:\slave\test\build\tests\xpcshell\tests\image\test\unit\test_async_notification.js | running test ... command timed out: 1200 seconds without output, attempting to kill program finished with exit code 1
Is there really an issue with the test itself? I see random issues on Windows that AFAICT are due to the test harness being unreliable. If this doesn't reoccur within a month, I'm inclined to WORKSFORME the bug.
(In reply to Seth Fowler [:seth] from comment #1) > Is there really an issue with the test itself? I see random issues on > Windows that AFAICT are due to the test harness being unreliable. > > If this doesn't reoccur within a month, I'm inclined to WORKSFORME the bug. No such luck :( Does the new logging help at least?
Flags: needinfo?(seth)
(In reply to Ryan VanderMeulen [:RyanVM][Out May 23-27] from comment #7) > No such luck :( > > Does the new logging help at least? It seems like the frequency has suddenly jumped up as of the past week or so. Odd. The new logging looks like it shows how far we get before things grind to a halt. That is probably going to be very useful. The next step is to correlate this with the test code and figure out which notification is never getting delivered (which I'm guessing is the problem here).
Flags: needinfo?(seth)
Summary: Intermittent image/test/unit/test_async_notification.js command timed out: 1200 seconds without output, attempting to kill → Intermittent image\test\unit\test_async_notification.js | Test timed out
Seth, any chance you could take a fresh look at this?
Flags: needinfo?(seth)
Off the top of my head, I wonder if this is because it's now possible for us to finish decoding synchronously from within callbacks from decode notifications (eg if someone calls RequestDecode() from a notification, we will see if a thread has finished decoding an image, sending out notifications), and that might screw up the logic of these tests. This is another thing that might be fixed by not doing reentrant notifications.
(In reply to Joe Drew (:JOEDREW! \o/) from comment #88) > This is another thing that might be fixed by not doing reentrant > notifications. Is there a bug tracking that work that we can mark this as depending on?
Presuming that this is a regression from multithreaded image decoding.
Assignee: nobody → joe
Blocks: 716140
Flags: needinfo?(seth) → needinfo?
Flags: needinfo?
Attached patch don't reentrantly notify (obsolete) (deleted) — Splinter Review
I earlier tried not calling FinishedSomeDecoding reentrantly, and that failed. But really all we care about is not notifying reentrantly. This might be safer. Try: https://tbpl.mozilla.org/?tree=Try&rev=01a8f5391835
Attachment #780598 - Flags: review?(seth)
Attached patch whitespace-insensitive (obsolete) (deleted) — Splinter Review
A whitespace-insensitive diff of attachment 780598 [details] [diff] [review]
Unfortunately this patch broke some browser-chrome tests, among others. Ultrasigh.
Comment on attachment 780598 [details] [diff] [review] don't reentrantly notify Review of attachment 780598 [details] [diff] [review]: ----------------------------------------------------------------- I'm clearing the review flag; set it again once the tests pass. ::: image/src/RasterImage.cpp @@ +2989,5 @@ > > + mNotifying = false; > + > + } else { > + DecodeDoneWorker::NotifyFinishedSomeDecoding(this, aRequest); So if someone tries to reentrantly notify we perform the notification asynchronously? Not surprised this broke some tests. Hopefully it's the tests that are bad and there's not a good reason to actually require these notifications to be synchronous. ::: image/src/RasterImage.h @@ +645,5 @@ > // END LOCKED MEMBER VARIABLES > > bool mInDecoder; > > + bool mNotifying; Nit: Any reason this (and mInDecoder, for that matter) aren't bitfields?
Attachment #780598 - Flags: review?(seth)
(In reply to Seth Fowler [:seth] from comment #183) > ::: image/src/RasterImage.cpp > @@ +2989,5 @@ > > > > + mNotifying = false; > > + > > + } else { > > + DecodeDoneWorker::NotifyFinishedSomeDecoding(this, aRequest); > > So if someone tries to reentrantly notify we perform the notification > asynchronously? Not surprised this broke some tests. Hopefully it's the > tests that are bad and there's not a good reason to actually require these > notifications to be synchronous. Yeah, that's the intent. Unfortunately I think there's a code requirement for this; hopefully only one that I can fix and be done with it. > Nit: Any reason this (and mInDecoder, for that matter) aren't bitfields? Yep: you can't locklessly update bitfields on multiple threads. (mNotifying could probably be a bitfield since it should only be modified on the main thread.)
Joe, any more luck with this toporange since comment 188? :-)
Flags: needinfo?(joe)
Assignee: joe → nobody
Sorry, I'm not working on Mozilla for the time being.
Flags: needinfo?(joe)
(In reply to Joe Drew (not getting mail) from comment #370) > Sorry, I'm not working on Mozilla for the time being. Sorry Seth, I guess that means you own this now?
Flags: needinfo?(seth)
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #380) > (In reply to Joe Drew (not getting mail) from comment #370) > > Sorry, I'm not working on Mozilla for the time being. > > Sorry Seth, I guess that means you own this now? <throws up hands> No getting around it, I suppose. =) OK, I'll take this on. I expect to have time to look into it within the next couple of days. Even though Joe probably won't read this, I appreciate him putting together a patch for this; it'll probably be a lot easier to solve the problem starting from the work he's done.
Flags: needinfo?(seth)
Assignee: nobody → seth
Depends on: 905468
Here's a new patch based upon the changes in bug 905468. Like Joe's patch, this patch disallows recursive notification in FinishedSomeDecoding. The difference is that Joe's patch sends async notifications if a sync notification would otherwise be recursive. This causes notifications to arrive out-of-order, which callers don't expect, causing bustage. Instead, this patch accumulates the changes (in the form of ImageStatusDiffs) that recursive invocations would have notified, and sends those notifications synchronously and in the correct order in the top-level call to FinishedSomeDecoding. It doesn't seem to suffer from the problems of Joe's patch, but it may well have problems of its own. It needs a run through the try server before it's ready for review.
Attachment #780598 - Attachment is obsolete: true
Attachment #780599 - Attachment is obsolete: true
This version is rebased against recent changes in bug 905468, which hopefully have resolved the oranges. It also contains some tweaks. New try job here: https://tbpl.mozilla.org/?tree=Try&rev=6e3197390d82
Attachment #791081 - Attachment is obsolete: true
Tests are still orange with this new patch, mostly because of continuing issues in bug 905468. I'll revisit this patch once bug 905468 is green.
OK, things are looking solid green in bug 905468. Let's try this rebased version of the patch. https://tbpl.mozilla.org/?tree=Try&rev=f36b979c3f13
Attachment #791593 - Attachment is obsolete: true
Looks like changes I made ended up breaking browser_bug822367.js again. This version of the patch should fix the problem. https://tbpl.mozilla.org/?tree=Try&rev=123c8f674fe0
Attachment #792348 - Attachment is obsolete: true
Rebase because the name of a function changed in bug 905468. Looks green on try, so I think this is read for review.
Attachment #792587 - Flags: review?(khuey)
Attachment #792436 - Attachment is obsolete: true
Comment on attachment 792587 [details] [diff] [review] Don't recursively notify in FinishedSomeDecoding. Justin will review this.
Attachment #792587 - Flags: review?(khuey) → review?(justin.lebar+bug)
Wow, imagelib really needs to stop being reentrant. This is nuts.
>@@ -3143,16 +3145,43 @@ RasterImage::GetFramesNotified(uint32_t >+RasterImage::RequestDecodeIfNeeded(nsresult aStatus, >+ eShutdownIntent aIntent, >+ bool aDone, >+ bool aWasSize) >+{ >+ MOZ_ASSERT(NS_IsMainThread()); >+ >+ // If we were a size decode and a full decode was requested, now's the time. >+ if ( NS_SUCCEEDED(aStatus) >+ && aIntent != eShutdownIntent_Error >+ && aDone >+ && aWasSize >+ && mWantFullDecode) { '&&' at the end of the line, no whiteapce between '(' and 'NS'. >@@ -3210,49 +3239,50 @@ RasterImage::FinishedSomeDecoding(eShutd >- ImageStatusDiff diff; >- if (request) { >- diff = image->mStatusTracker->Difference(request->mStatusTracker); >- image->mStatusTracker->ApplyDifference(diff); >+ ImageStatusDiff diff = >+ request ? image->mStatusTracker->Difference(request->mStatusTracker) >+ : image->mStatusTracker->DecodeStateAsDifference(); >+ image->mStatusTracker->ApplyDifference(diff); >+ >+ // Notifications can't go out with the decoding lock held, nor can we call >+ // RequestDecodeIfNeeded, so unlock for the rest of the function. >+ MutexAutoUnlock unlock(mDecodingMutex); >+ >+ if (mNotifying) { >+ // Accumulate the status changes. We don't permit recursive notifications >+ // because they cause subtle concurrency bugs, so we'll delay sending out >+ // the notifications until we pop back to the lowest invocation of >+ // FinishedSomeDecoding on the stack. >+ mStatusDiff.Combine(diff); Should you have an early return here? That is, right now we request a decode even if mNotifying, but that seems unnecessary at best. >- return rv; >+ // If we were a size decode and a full decode was requested, now's the time. >+ return RequestDecodeIfNeeded(rv, aIntent, done, wasSize); Repeating this comment doesn't make much sense to me. I'm not totally comfortable with this code, but I'm reasonably convinced this is a net change in the right direction, so r=me with these changes.
Attachment #792587 - Flags: review?(justin.lebar+bug) → review+
Seth, is this good to land?
Flags: needinfo?(seth)
No, unfortunately. We're waiting on reviews in bug 905468. Once that lands, this patch can land. Trust me, I'm champing at the bit here.
Flags: needinfo?(seth)
Ah thank you :-)
Thanks for reviewing this, Justin! I meant to reply to this earlier, so this is a bit later, but... (In reply to Justin Lebar [:jlebar] from comment #478) > Should you have an early return here? That is, right now we request a decode > even if mNotifying, but that seems unnecessary at best. I don't like that either, but I am concerned about making this change as part of this bug. If the inner FinishedSomeDecoding could ever be reporting the completion of a size decode, this would cause the image to fail to request a full decode for itself correctly, and right now it looks to me like that can happen. In the long term I'd like to change how size decodes work in a way that will avoid this concern more cleanly.
Addressed review comments.
Attachment #792587 - Attachment is obsolete: true
Comment on attachment 796927 [details] [diff] [review] Don't recursively notify in FinishedSomeDecoding. [Approval Request Comment] Bug caused by (feature/regressing bug #): 716140 (Off-main-thread image decode.) User impact if declined: Nondeterministic image load failures. The same root cause may also be responsible for other regressions caused by 716140. Testing completed (on m-c, etc.): Landing on m-c today. Plenty of testing on try beforehand. Risk to taking this patch (and alternatives if risky): Minor. The alternative is to not take the patch and accept that there may be occasional issues with image loading. String or IDL/UUID changes made by this patch: None
Attachment #796927 - Flags: approval-mozilla-beta?
Attachment #796927 - Flags: approval-mozilla-aurora?
(In reply to TBPL Robot from comment #570) > RyanVM > https://tbpl.mozilla.org/php/getParsedLog.php?id=27151828&tree=Mozilla- > Inbound > WINNT 6.2 mozilla-inbound pgo test xpcshell on 2013-08-28 20:27:49 > revision: b2e8b8887816 > slave: t-w864-ix-113 > > TEST-UNEXPECTED-FAIL | > C: > \slave\test\build\tests\xpcshell\tests\image\test\unit\test_async_notificatio > n.js | Test timed out > TEST-UNEXPECTED-FAIL | > C: > \slave\test\build\tests\xpcshell\tests\image\test\unit\test_async_notificatio > n.js | test failed (with xpcshell return code: 1), see following log: > Return code: 1 This is from after comment 564 :(
Whiteboard: [leave open]
Comment on attachment 796927 [details] [diff] [review] Don't recursively notify in FinishedSomeDecoding. https://bugzilla.mozilla.org/show_bug.cgi?id=905468#c22, check this comment for approval request.
Attachment #796927 - Flags: approval-mozilla-beta?
Attachment #796927 - Flags: approval-mozilla-beta-
Attachment #796927 - Flags: approval-mozilla-aurora?
Attachment #796927 - Flags: approval-mozilla-aurora+
Argh. Backing out to try to narrow down the cause of bug 910533. https://hg.mozilla.org/integration/mozilla-inbound/rev/47180a284986
Depends on: 910533
(In reply to Seth Fowler [:seth] from comment #585) > Argh. Backing out to try to narrow down the cause of bug 910533. > > https://hg.mozilla.org/integration/mozilla-inbound/rev/47180a284986 Merge of backout: https://hg.mozilla.org/mozilla-central/rev/47180a284986
Pushed a new try job now that (hopefully) bug 910533 is fixed. https://tbpl.mozilla.org/?tree=Try&rev=4f64dbe9786b
When this gets relanded, it should be with an additional NS_WARNING for the recursive notification case, unless that ends up causing too much spew.
In preparation for relanding this, here's a version with the NS_WARNING mentioned in comment 599 added.
Attachment #796927 - Attachment is obsolete: true
Sigh. Given that the frequency of this one appears to have dropped to zero anyway, I'm not sure what to do here other than resolve this. There don't seem to be any issues this time and the patch seems to have stuck. If we hit another instance of this we can reopen the bug.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
Whiteboard: [leave open]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: