Closed Bug 770243 Opened 12 years ago Closed 12 years ago

MOZ_ASSERT(mCacheInputStream) failing @ nsHttpChannel::ReadFromCache

Categories

(Core :: Networking: Cache, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla18
Tracking Status
firefox16 + fixed
firefox17 --- fixed
firefox18 --- fixed

People

(Reporter: mayhemer, Assigned: briansmith)

References

Details

(Keywords: regression, Whiteboard: [qa-])

Attachments

(4 files, 2 obsolete files)

Attached file log (obsolete) (deleted) β€”
This should be a higher priority to check on, since when this fails in a release build we don't display any content (cannot read the cache entry and not having a content response from the server).

STR, if I remember correctly:
1. configure proxy for authentication (basic), e.g. squid, for all protocols
2. load http://www.websocket.org/echo.html
3. [optional] try to connect (doesn't matter what the result is)
4. change the proxy settings to just set proxy for http, leave all other protocols blank
5. restart Fx
6. load http://www.websocket.org/echo.html

=> assertion failure

>	xul.dll!mozilla::net::nsHttpChannel::ReadFromCache(bool alreadyMarkedValid=false)  Line 3491 + 0x2c bytes	C++
 	xul.dll!mozilla::net::nsHttpChannel::ProcessNotModified()  Line 2210 + 0xd bytes	C++
 	xul.dll!mozilla::net::nsHttpChannel::ProcessResponse()  Line 1265 + 0x8 bytes	C++
 	xul.dll!mozilla::net::nsHttpChannel::OnStartRequest(nsIRequest * request=0x086e01e0, nsISupports * ctxt=0x00000000)  Line 4784 + 0xe bytes	C++
 	xul.dll!nsInputStreamPump::OnStateStart()  Line 416 + 0x2c bytes	C++
Just confirming that I am able to reproduce using the STR from comment 0.  Hint: you may want to delete the cache before step 1, but I think it is not strictly necessary.  Also, the assertion fails only after the proxy auth dialog is confirmed.  (I have remembered the password.)
BTW: looks like we have to fix the logs in HttpCacheQuery.  Also a log that channel X opens a query Y would be useful...
Simpler STR:
1. configure proxy for authentication (basic), e.g. squid, for all protocols
2. load http://www.websocket.org/echo.html
3. restart Fx
4. load http://www.websocket.org/echo.html

Culprit is the following code at [1]:

    if (httpStatus != 304 && httpStatus != 206) {
        mCacheInputStream.CloseAndRelease();
    }

httpStatus is 407.

I had to catch this at the review time.

[1] http://hg.mozilla.org/mozilla-central/annotate/9c6ad99de47a/netwerk/protocol/http/nsHttpChannel.cpp#l1183
Attached patch v1 (obsolete) (deleted) β€” β€” Splinter Review
Assignee: nobody → honzab.moz
Attachment #638401 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #638410 - Flags: review?(bsmith)
Honza, your change looks reasonable but in order to understand why it works it would be good to have test cases:

1. 401 followed by 304
2. 401 followed by cacheable 200 OK
3. 407 followed by 304
4. 407 followed by cacheable 200 OK

I guess the cases that are failing now are #1 and #3. We need to test #2 and #4 to test that we're closing the input stream before we try to open an output stream for the cache entry.
I'll add the proposed test.
Attached patch v1 + test (deleted) β€” β€” Splinter Review
Added a test.  Quit self explaining.
Attachment #638410 - Attachment is obsolete: true
Attachment #638410 - Flags: review?(bsmith)
Attachment #642096 - Flags: review?(bsmith)
We either need to take this fix in Aurora 16, or backout bug 722034. Tracking for that release.
No longer blocks: 774527
Brian, review ping?
Honza, here's an alternate fix that I think is better because:

(a) It does not special-case any status codes
(b) It avoids acquiring the cache service lock at that point in the code (this may have be responsible for the CACHE_SERVICE_LOCK_WAIT_MAINTHREAD regression.)

If you don't think this alternate fix is correct, then we can use yours, which I also think is correct.

Note that this patch applies on top of yours. You can just qfold it onto yours.
Attachment #643594 - Flags: review?(honzab.moz)
Attached patch Address nits in test code (deleted) β€” β€” Splinter Review
Instead of listing the nits, I just created a patch to fix them. Up to you whether you want to fold this onto your patch.
Attachment #643595 - Flags: review?(honzab.moz)
Comment on attachment 642096 [details] [diff] [review]
v1 + test

Review of attachment 642096 [details] [diff] [review]:
-----------------------------------------------------------------

Honza, thanks for fixing the bug. Your patch seems reasonable to me. I would prefer us to use the alternate fix that I attached, if possible, but if that's problematic then your patch is also good as-is.
Attachment #642096 - Flags: review?(bsmith) → review+
Comment on attachment 643595 [details] [diff] [review]
Address nits in test code

Review of attachment 643595 [details] [diff] [review]:
-----------------------------------------------------------------

:D funny...
Attachment #643595 - Flags: review?(honzab.moz) → review+
Comment on attachment 643594 [details] [diff] [review]
Close cache input stream only when we're sure we don't need it

Review of attachment 643594 [details] [diff] [review]:
-----------------------------------------------------------------

Yep, this could work.  But aren't you actually removing the optimization of throwing away the stream ASAP we know we won't need it?

Test passes for me, but w/o the "funny" patch since I cannot apply it.
Attachment #643594 - Flags: review?(honzab.moz) → review+
(In reply to Honza Bambas (:mayhemer) from comment #14)
> Yep, this could work.  But aren't you actually removing the optimization of
> throwing away the stream ASAP we know we won't need it?

Throwing away the input stream could be expensive because that process takes the cache service lock on the main thread, so it is not necessarily an optimization. 

Plus, the optimization would only work if we released our descriptor (not just the input stream) so another channel could open a descriptor.
https://hg.mozilla.org/integration/mozilla-inbound/rev/dcc9891978e1
Target Milestone: --- → mozilla17
Sorry, but I had to back this out due to xpcshell test failures.
https://hg.mozilla.org/integration/mozilla-inbound/rev/a36544756614

https://tbpl.mozilla.org/php/getParsedLog.php?id=13690848&tree=Mozilla-Inbound

TEST-UNEXPECTED-FAIL | /home/cltbld/talos-slave/test/build/xpcshell/tests/netwerk/test/unit/test_bug770243.js | test failed (with xpcshell return code: 0), see following log:
TEST-UNEXPECTED-FAIL | /home/cltbld/talos-slave/test/build/xpcshell/tests/netwerk/test/unit/test_bug770243.js | Response body 2 == Response body 3 - See following stack:
Target Milestone: mozilla17 → ---
Out of time to investigate tonight, but I believe the root cause is actually bug 767277; i.e. I think we should have backed out bug 767277 instead (which I will do shortly). See the comment I am about to add to that bug.
Re-pushed to inbound after backing out bug 767277:
https://hg.mozilla.org/integration/mozilla-inbound/rev/64b30d7e3932
Backed out again

https://hg.mozilla.org/integration/mozilla-inbound/rev/ecb953611462

seems to have been failing a test here quite consistently.
What about the v1 (attachment 642096 [details] [diff] [review])?  Does it also cause these problems?
Could we please move here forward?  This makes diagnoses of some auth bugs much more complicated.  Thanks.
Comment on attachment 642096 [details] [diff] [review]
v1 + test

https://tbpl.mozilla.org/?tree=Try&rev=b9f5c4565ae2
Honza, let me know if you want me to take this back.

The test is failing in the same way as it did when these patches got backed out of mozilla-inbound.

Bug 767277 wasn't the problem after all. Here's the comment I added to bug 767277 previously about it:

(In reply to Brian Smith (:bsmith) from bug 767277 comment #12)
> Backed out because I suspect it is causing the test for bug 770243 to fail:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/71854c060f90
> 
> The problem case is this:
> 
> We async dispatch a runnable to close the cache entry "later," thus leaving
> the cache entry open for a while after we've marked it valid. During that
> time between when we marked the cache entry valid and the time we close it,
> another channel decides to open the cache entry. That channel gets a
> ACCESS_READ descriptor instead of an ACCESS_READ_WRITE descriptor, and so it
> skips the validation of the cached response.
> 
> Bug 770243's tests passed previously when this patch was not applied (we
> should verify this last claim). But, with this patch, case 4 of that bug's
> tests fails. If I modify that case to use do_execute_soon, then the test
> succeeds. Debugging the failure, I saw that the cache is returning an
> ACCESS_READ descriptor. This causes HttpCacheQuery::CheckCache() to skip the
> validation of the entry on the assumption that it is valid. But, the test
> only works when each channel opened does revalidation.
Ah!  My new test is failing!  As usual, the behavior is from some reason different on windows (the test passes).

Why didn't I read that comment sooner more carefully...  We are blocked all the time here just because of bad timing.

I will add execute-soon or sync-on-cache-thread to the test.  

If there is anything we *really* need to fix, then let's do it in a different bug (bug 775830?) with reference to this one to have a tool for reproduce/check.
I've asked in bug 722034 if we should consider performing the backout in https://bugzilla.mozilla.org/show_bug.cgi?id=722034#c79 for FF16 as well. If so, we'll untrack this bug.
Any news on this now that 16 beta has been released?
Assignee: honzab.moz → bsmith
We have a patch for this already that landed once but got backed out because it was causing a test to fail. We believe that the problem is with the test and that the patches already attached to this bug are safe enough to land in Nightly, Aurora, and Beta as soon as we tweak that test, which I am planning to do today.
(In reply to Brian Smith (:bsmith) from comment #28)
> We have a patch for this already that landed once but got backed out because
> it was causing a test to fail. We believe that the problem is with the test
> and that the patches already attached to this bug are safe enough to land in
> Nightly, Aurora, and Beta as soon as we tweak that test, which I am planning
> to do today.

Thanks for the update. Sounds like this will make it into beta 5, going to build on Tuesday.
Attached patch Part 4: Fix test (deleted) β€” β€” Splinter Review
All four patches in this bug need to be qfolded together before landing. 

Part 4 does three things:

1. Rebases the test (from Parts 1-3) to the current mozilla-central.
2. For each case, adds a check that each response is or is not from the cache, as appropriate.
3. Syncs with the cache between each case to avoid race conditions.
Attachment #664232 - Flags: review?(honzab.moz)
Thanks Brian.  Any try run reference?
https://tbpl.mozilla.org/?tree=Try&rev=6336b7c8acb4

It looks shockingly green, except for the Linux "B" failures and the normal Android rainbow. I just retriggered the Linux builds to see if they will go green.
Attachment #664232 - Flags: review?(honzab.moz) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/cbfe6a468a12
Keywords: regression
Target Milestone: --- → mozilla18
Comment on attachment 642096 [details] [diff] [review]
v1 + test

Try runs for -aurora and -beta are here:
https://tbpl.mozilla.org/?tree=Try&rev=5c787fc0db0d
https://tbpl.mozilla.org/?tree=Try&rev=be55a62862e4

Note that this approval request is for all four patches in this bug, folded together.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 746018

User impact if declined: Some pages and sub-resources of pages will not load correctly, especially when HTTP proxy authentication is used.

Testing completed (on m-c, etc.): Automated test case, landed today on m-i.

Risk to taking this patch (and alternatives if risky): Moderate. There is some possibility that closing the cache entry later could cause unforseen loading problems similar to the problems we encountered and fixed with the automated test.
 
String or UUID changes made by this patch: none
Attachment #642096 - Flags: approval-mozilla-beta?
Attachment #642096 - Flags: approval-mozilla-aurora?
(In reply to Brian Smith (:bsmith) from comment #34)
> Risk to taking this patch (and alternatives if risky): Moderate. There is
> some possibility that closing the cache entry later could cause unforseen
> loading problems similar to the problems we encountered and fixed with the
> automated test.

We're leaning towards backing out bug 722034 and bug 746018 for FF16 at this point, given the moderate risk evaluation here, the other regression tracked in bug 788365, and the fact that a fix would go into our final beta for the first time (this didn't make beta 5).

Can you make the case for taking a forward fix at this point?
Comment on attachment 642096 [details] [diff] [review]
v1 + test

[Triage Comment]
Approving for Aurora in preparation for Beta 16 consideration.
Attachment #642096 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
https://hg.mozilla.org/mozilla-central/rev/cbfe6a468a12
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
https://hg.mozilla.org/releases/mozilla-aurora/rev/f59da8ed70ea
Bug 766274, especially bug 766274 comment 2 and bug 766274 comment16, show that this happens with regular (not proxy) HTTP auth and that this patch fixes the problem when applied to beta 16.
Comment on attachment 642096 [details] [diff] [review]
v1 + test

[Triage Comment]
This bug has a medium risk evaluation, so I want to be very clear with why we've decided to approve for Beta.

1) We've seen reports here and in bug 766274, with very standard browser configurations
2) Our alternatives are to back out bug 722034 (deemed risky) or wontfix
3) Brian believes we'll see regressions on Nightly/Aurora over the next week, if any

Given this, we're willing to take the fix on mozilla-beta with an escape valve of backing out and wontfixing before next Friday's final release build.

In the future, bugs like this need to get attention much sooner in the release.
Attachment #642096 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
https://hg.mozilla.org/releases/mozilla-beta/rev/e34ef88d3b1e

(In reply to Alex Keybl [:akeybl] from comment #41)
> In the future, bugs like this need to get attention much sooner in the
> release.

I agree. And, also, Alex did everything he could to help push this along. The delay was mostly my fault.
https://hg.mozilla.org/releases/mozilla-beta/rev/bf691ea80fa4

Add syncWithCacheIOThread to head_cache.js from bug 737615 (written by michal and r=honzab) because the tests require it, a=bustage
Alex, even you were not very happy to approve this patch, I have to say again it is quit important, I was hitting this issue very often.  

Also, for Beta I personally more trust my patch (the first patch "v1 + test" added to this bug) since it only and only adds 401 and 407 to an already well tested code path that simply doesn't throw away cache input stream that early.

The patch has been reviewed, however it hasn't spent any time on m-c or m-a for real testing...  Hence it may not be a very good option either...
No QA verification since this has in-testsuite coverage. Please add verifyme keyword to request manual verification.
Whiteboard: [qa-]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: