Closed
Bug 717761
Opened 13 years ago
Closed 9 years ago
Main thread can be blocked by IO on the cache thread
Categories
(Core :: Networking: Cache, defect)
Core
Networking: Cache
Tracking
()
RESOLVED
DUPLICATE
of bug 913806
People
(Reporter: u408661, Assigned: michal)
References
Details
(Keywords: main-thread-io, perf, Whiteboard: [Snappy:P1][necko-backlog])
Attachments
(4 files, 2 obsolete files)
(deleted),
patch
|
Details | Diff | Splinter Review | |
(deleted),
application/x-gzip
|
Details | |
(deleted),
patch
|
Details | Diff | Splinter Review | |
(deleted),
text/plain
|
Details |
In bug 715774 comment 3 dmandelin identifies a call stack that caused a disturbingly long jank via the cache:
NtWaitForSingleObject
RtlIntegerToUnicodeString
PR_Lock
nsCacheService::Lock
nsCacheEntryDescriptor::GetMetaDataElement
nsHttpChannel::CheckCache
nsHttpChannel::Connect
nsHttpChannel::OnNormalCacheEntryAvailable
nsHttpChannel::OnCacheEntryAvailable
nsCacheListenerEvent::Run
nsThread::ProcessNextEvent
It seems that getting a metadata element from a particular cache entry can block if anything else is being done with the cache service, as we have one global lock for the entire cache. In this case, it runs on the main thread, causing the jank. We should probably have finer-grained locking to prevent things like this from happening. Assigning to myself as part of the cache work to be done this quarter.
Comment 1•13 years ago
|
||
There are many problems in this code.
One of them is that the HSTS (and SPDY Alternate-Protocol?) checks are happening in the wrong spot--they should be happening BEFORE we looked up the entry in the cache, but currently they are happening AFTER we look up the entry in the cache.
But, a more major problem is that it doesn't make sense for the cache to call a callback on the main thread so that the callback can decide whether to read an entry from the cache (involving the cache thread/lock) or from the network (involving the socket transport thread). AFAICT, instead of calling this callback on the main thread, the cache itself should decide whether validation is necessary. If validation isn't necessary, then the cache should return the response directly to the main thread. If validation isn't necessary, it should jump directly to the socket transport thread to do the validation (skipping the main thread). If that validation response comes back 304, then the socket transport thread should transfer control back to the cache thread to read the response, and then the response should be sent to the main thread. If the validation response comes back otherwise, then that response data should be sent to the main thread and then later sent to the cache.
More generally, the main thread is responsible for issuing a cache lookup (to the cache thread) or issuing a network request (to the socket transport thread), and it is responsible for receiving the response data or error result. But, between those two points, the main thread should never be involved. And, even more generally than that, there should never be any case where any thread (except the cache thread) obtains any cache lock, except for the lock built into the internals of the cache thread's nsIEventTarget implementation. (All of this AFAICT.)
Basically, this is very much like what bz suggested in bug 612632 comment 8. AFAICT, a good solution for bug 612632 will solve this problem too.
Comment 2•13 years ago
|
||
Although the I/O doesn't happen directly on the main thread, the main thread is blocked (because of the lock) on I/O happening on the cache thread, so this is effectively main thread I/O.
(In reply to Brian Smith (:bsmith) from comment #1)
> Basically, this is very much like what bz suggested in bug 612632 comment 8.
> AFAICT, a good solution for bug 612632 will solve this problem too.
While I agree in general that we should be better for validation to be non-blocking in some way or another (which is why I filed this bug), that's different (in my view) from what bz is talking about. He's talking, effectively, about reducing the number of "round trips" in the cache, while this bug is talking about not being dumb about our locking. Perhaps they can/will both be resolved in the same patch, but I'm not convinced of that, since we could easily solve one without solving the other. (I would even argue that we SHOULD do this piecemeal, to make it easier to identify errors we may have made in the implementation of one versus the other.)
Summary: nsHttpChannel::CheckCache can be blocked on unrelated cache operations → Main thread can be blocked by IO on the cache thread
Comment 5•13 years ago
|
||
This may be causing massive slowdowns when filling the cache. See http://groups.google.com/group/mozilla.dev.tech.network/browse_thread/thread/6dfa3e6ebe80c336#
Initially I was hoping this would be pretty simple, but the locking & interdependencies are much hairier than I initially expected. There might still be a solution for this particular issue that doesn't involve redoing *all* the cache locking in one big go, but that requires a bit more investigation.
Updated•13 years ago
|
Here's the start of a patch for this. Doesn't fix all the places yet (particularly calls to nsCacheEntryDescriptor::{Doom,Close} and one synchronous call to OpenCacheEntry), but it's a start. Seems to be working without crashes so far, but I have no quantification of how well this fixes (or doesn't) the slowdowns.
This is the output generated by part of the patch that adds an assertion when we take the lock on the main thread. It's from opening www.cnn.com and then quitting the browser. There are lots of stack traces in here, but a lot of them are the same, just for different channels. You can write a script to slog through it fairly easily to detect the unique stacks that matter.
Re-upload with the appropriate mime type
Attachment #593534 -
Attachment is obsolete: true
Comment 10•13 years ago
|
||
FYI: you can change the MIME type of an existing attachment, via "details | edit details"
Comment 11•13 years ago
|
||
Here is the cleaned-up version of my lock classifications:
* evil (will be taken on the main thread in common operation)
* bad (lock will be taken on the main thread in less performance-critical situations like private browsing transition or only by tests or only when using about:cache)
* benign (lock is always taken on the cache thread).
* unclasified. There are a few instances that I still left unclassified, which are marked UNCLASSIFIED_CACHE_LOCK in the patch. I would like to know what the classification for those instances should be.
Attachment #594338 -
Flags: feedback?(michal.novotny)
Attachment #594338 -
Flags: feedback?(hurley)
Comment 12•13 years ago
|
||
This method is dead code in libxul (AFAICT) and it takes the cache lock on the main thread, so we should just remove it.
Attachment #594346 -
Flags: review?(michal.novotny)
Reporter | ||
Comment 13•13 years ago
|
||
Comment on attachment 594346 [details] [diff] [review]
Remove isStorageEnabledForPolicy
I'm working on this as well a fixing the locking with the lower layers in bug 723582, let's leave it all together
Attachment #594346 -
Flags: review-
Assignee | ||
Comment 14•13 years ago
|
||
Comment on attachment 594346 [details] [diff] [review]
Remove isStorageEnabledForPolicy
Clearing the flag since Nick will solve this in a different bug.
Attachment #594346 -
Flags: review?(michal.novotny)
Updated•13 years ago
|
Whiteboard: [Snappy] → [Snappy:P1]
Updated•13 years ago
|
Attachment #594346 -
Attachment is obsolete: true
Attachment #594338 -
Flags: feedback?(hurley)
Comment 15•12 years ago
|
||
I recently looked at Telemetry reports of transient main thread hangs and I noticed that cache operations show up fairly often:
http://blog.mozilla.org/vdjeric/2012/06/08/cache-plugin-font-operations-most-common-in-chrome-hang-reports/
I attached the call stacks in question. All of the stacks were reported by a single Windows XP machine, however the reports are over the span of a week and this machine is responsible for almost half of the hang stacks in the entire data set.
Updated•12 years ago
|
Attachment #594338 -
Flags: feedback?(michal.novotny)
Comment 16•9 years ago
|
||
honza, what do you think the current status of this is?
Flags: needinfo?(honzab.moz)
Whiteboard: [Snappy:P1] → [Snappy:P1][necko-backlog]
Updated•9 years ago
|
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(honzab.moz)
Resolution: --- → DUPLICATE
Comment 18•9 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #17)
>
> *** This bug has been marked as a duplicate of bug 913806 ***
Either that's the wrong bug... or you should provide some context for duplicating this bug against that bug :)
Updated•9 years ago
|
Flags: needinfo?(honzab.moz)
Comment 19•9 years ago
|
||
This was a cache1 code bug. It's no longer used after we switched to cache2.
Flags: needinfo?(honzab.moz)
Comment 20•9 years ago
|
||
Ok, thanks for explaining!
You need to log in
before you can comment on or make changes to this bug.
Description
•