Closed
Bug 1077084
Opened 10 years ago
Closed 10 years ago
Intermittent test_bug497875.xul | application crashed [@ nsHostResolver::DeQueue(PRCListStr &,nsHostRecord * *)]
Categories
(Core :: Networking, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox34 | --- | unaffected |
firefox35 | --- | fixed |
firefox36 | --- | fixed |
firefox-esr31 | --- | unaffected |
b2g-v2.1 | --- | unaffected |
b2g-v2.1S | --- | unaffected |
b2g-v2.2 | --- | fixed |
People
(Reporter: RyanVM, Assigned: bagder)
References
Details
(Keywords: crash, intermittent-failure)
Attachments
(1 file, 4 obsolete files)
(deleted),
patch
|
bagder
:
review+
Sylvestre
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
https://treeherder.mozilla.org/ui/logviewer.html#?job_id=808071&repo=fx-team
builder WINNT 6.2 fx-team debug test mochitest-other
buildid 20141002073921
builduid 7eb8008831114e3dad9a5aeb84d0d683
results warnings (1)
revision 69bb4035eba1
slave t-w864-ix-094
starttime Thu Oct 02 2014 11:17:35 GMT-0400 (Eastern Standard Time)
08:22:01 INFO - 435 INFO TEST-START | chrome://mochitests/content/chrome/content/xul/document/test/test_bug497875.xul
08:22:01 INFO - ++DOMWINDOW == 61 (0A354E00) [pid = 3848] [serial = 748] [outer = 158FC600]
08:22:01 INFO - ++DOCSHELL 09E2D000 == 9 [pid = 3848] [id = 216]
08:22:01 INFO - ++DOMWINDOW == 62 (0D383600) [pid = 3848] [serial = 749] [outer = 00000000]
08:22:01 INFO - [Parent 3848] WARNING: NS_ENSURE_TRUE(nsContentUtils::GetCurrentJSContext()) failed: file c:/builds/moz2_slave/fx-team-w32-d-0000000000000000/build/docshell/base/nsDocShell.cpp, line 9036
08:22:01 INFO - ++DOMWINDOW == 63 (0D389E00) [pid = 3848] [serial = 750] [outer = 0D383600]
08:22:01 INFO - Failed to load ftp://some.website.that.will.cause.an.error/
08:22:03 INFO - TEST-INFO | Main app process: exit status c0000005
08:22:03 INFO - 436 ERROR TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/content/xul/document/test/test_bug497875.xul | application terminated with exit code 3221225477
08:22:03 INFO - runtests.py | Application ran for: 0:02:23.742000
08:22:03 INFO - zombiecheck | Reading PID log: c:\users\cltbld~1.t-w\appdata\local\temp\tmpv5z04hpidlog
08:22:03 INFO - ==> process 3848 launched child process 1208 ("C:\slave\test\build\application\firefox\plugin-container.exe" --channel=3848.de9cd20.1131237995 -greomni "C:\slave\test\build\application\firefox\omni.ja" -appomni "C:\slave\test\build\application\firefox\browser\omni.ja" -appdir "C:\slave\test\build\application\firefox\browser" 3848 "\\.\pipe\gecko-crash-server-pipe.3848" tab)
08:22:11 INFO - mozcrash Saved minidump as C:\slave\test\build\blobber_upload_dir\5f9f6ade-07aa-479b-9263-031226d1b5ee.dmp
08:22:11 INFO - mozcrash Saved app info as C:\slave\test\build\blobber_upload_dir\5f9f6ade-07aa-479b-9263-031226d1b5ee.extra
08:22:11 WARNING - PROCESS-CRASH | chrome://mochitests/content/chrome/content/xul/document/test/test_bug497875.xul | application crashed [@ nsHostResolver::DeQueue(PRCListStr &,nsHostRecord * *)]
08:22:11 INFO - Crash dump filename: c:\users\cltbld~1.t-w\appdata\local\temp\tmpjet6l9.mozrunner\minidumps\5f9f6ade-07aa-479b-9263-031226d1b5ee.dmp
08:22:11 INFO - Operating system: Windows NT
08:22:11 INFO - 6.2.9200
08:22:11 INFO - CPU: x86
08:22:11 INFO - GenuineIntel family 6 model 30 stepping 5
08:22:11 INFO - 8 CPUs
08:22:11 INFO - Crash reason: EXCEPTION_ACCESS_VIOLATION_WRITE
08:22:11 INFO - Crash address: 0x5a5a5a5a
08:22:11 INFO - Thread 73 (crashed)
08:22:11 INFO - 0 xul.dll!nsHostResolver::DeQueue(PRCListStr &,nsHostRecord * *) [nsHostResolver.cpp:69bb4035eba1 : 1271 + 0x5]
08:22:11 INFO - eip = 0x7177d0b6 esp = 0x19cffda4 ebp = 0x19cffda8 ebx = 0x000493e0
08:22:11 INFO - esi = 0x5a5a5a5a edi = 0x19cffe34 eax = 0x19cffe34 ecx = 0x092b85e0
08:22:11 INFO - edx = 0x5a5a5a5a efl = 0x00010216
08:22:11 INFO - Found by: given as instruction pointer in context
08:22:11 INFO - 1 xul.dll!nsHostResolver::GetHostToLookup(nsHostRecord * *) [nsHostResolver.cpp:69bb4035eba1 : 1313 + 0xe]
08:22:11 INFO - eip = 0x7177d3dd esp = 0x19cffdb0 ebp = 0x19cffdd4
08:22:11 INFO - Found by: call frame info
08:22:11 INFO - 2 xul.dll!nsHostResolver::ThreadFunc(void *) [nsHostResolver.cpp:69bb4035eba1 : 1586 + 0xf]
08:22:11 INFO - eip = 0x7177eadb esp = 0x19cffddc ebp = 0x19cffe3c
08:22:11 INFO - Found by: call frame info
08:22:11 INFO - 3 nss3.dll!_PR_NativeRunThread [pruthr.c:69bb4035eba1 : 397 + 0x8]
08:22:11 INFO - eip = 0x6f74f8bf esp = 0x19cffe44 ebp = 0x19cffe5c
08:22:11 INFO - Found by: call frame info
08:22:11 INFO - 4 nss3.dll!pr_root [w95thred.c:69bb4035eba1 : 90 + 0xb]
08:22:11 INFO - eip = 0x6f74136f esp = 0x19cffe64 ebp = 0x19cffe68
08:22:11 INFO - Found by: call frame info
08:22:11 INFO - 5 msvcr100.dll + 0x5c6dd
Comment 1•10 years ago
|
||
probly steve or daniel touched this last.. who will take it?
Flags: needinfo?(sworkman)
Flags: needinfo?(daniel)
Assignee | ||
Comment 2•10 years ago
|
||
I've stared at this code now for a while.
The DeQueue() that crashes seems to attempt to dequeue a node that is already freed (5a5a5a5a), and the fact that it dequeues from mLowQ _could_ mean that this is a node added on nsHostResolver.cpp:1490, which is code that came in with the TTL experiments.
Flags: needinfo?(daniel)
Comment 3•10 years ago
|
||
Tried 100 runs locally on a Win7 VM can't reproduce it. But I may have an explanation and a speculative solution. Please cast your eyes over this and tell me what you think:
On the DNS thread, a request is dequeued from the HIGH priority queue and a lookup completes.
nsHostResolver::OnLookupComplete is called.
LOCK
process the lookup result
UNLOCK
Meanwhile, on the main thread, FlushCache is waiting for the lock. It get's it.
LOCK
prune entries --> release record --> => refcnt == 1
get eviction queue
UNLOCK
release all entries on eviction q --> release record =? refcnt == 0 --> DELETE
Back on the DNS thread, still in OnLookupComplete:
call callback->OnLookupComplete for all the callbacks associated with the record.
LOCK (again)
Issue a 2nd lookup, this time to get TTL
Request now enqueued on LOW priority queue.
UNLOCK
Then, when it comes time to DeQueue from the LOW priority queue, it gets the record, but the data has been erased. Sounds plausible, right?
So, move the IssueLookup for TTL that's in OnLookupComplete into the first mutex block. We could do a holding AddRef for OnLookupComplete, but if the cache has been flushed, then this entry has been removed from the hash table already and a lookup is pointless. Better to have the second lookup issued BEFORE FlushCache has a chance to run and clear it out.
Flags: needinfo?(sworkman)
Assignee | ||
Comment 4•10 years ago
|
||
(In reply to Steve Workman [:sworkman] from comment #3)
> Then, when it comes time to DeQueue from the LOW priority queue, it gets the
> record, but the data has been erased. Sounds plausible, right?
>
> So, move the IssueLookup for TTL that's in OnLookupComplete into the first
> mutex block. We could do a holding AddRef for OnLookupComplete, but if the
> cache has been flushed, then this entry has been removed from the hash table
> already and a lookup is pointless. Better to have the second lookup issued
> BEFORE FlushCache has a chance to run and clear it out.
I completely agree with this analysis and yeah, moving the code to within the previous block seems simpler than doing tricks with adding another ref.
The only possible downside with this, is that in this particular case the events have deemed that the existing entry should be flushed, so this adds a lookup for an entry that otherwise would be (and will be?) discarded.
Assignee | ||
Comment 5•10 years ago
|
||
Here's a patch for what we've discussed.
Attachment #8502425 -
Flags: review?(sworkman)
Comment 6•10 years ago
|
||
Comment on attachment 8502425 [details] [diff] [review]
0001-bug-1077084-move-TTL-lookup-to-within-initial-lock-b.patch
Review of attachment 8502425 [details] [diff] [review]:
-----------------------------------------------------------------
(In reply to Daniel Stenberg [:bagder] from comment #4)
> The only possible downside with this, is that in this particular case the
> events have deemed that the existing entry should be flushed, so this adds a
> lookup for an entry that otherwise would be (and will be?) discarded.
Yup - I think that a PR_REMOVE_LINK should be added to PruneEntry in the PL_DHASH_REMOVE case. But .....
There's one other thing that I'm wondering about here. In PruneEntry we return true if the record is both resolving and onQueue, right? But this means that requests that are on the queue pending resolution will be removed from the cache. Am I reading that correctly? If so, that's something I missed in the earlier review sorry.
So, in RemoveOrRefresh we should
if (resolving && onQueue) {
// Don't remove; resolution pending.
return false;
} else if (resolving && !onQueue) {
// Request already with OS resolver; require a refresh.
mResolveAgain = true;
return false;
} else {
// Already resolved; not in a pending state; remove from cache.
return true;
}
Assuming this is right, the lookup for TTL should still happen, but that's ok. It should be a valid result since it should happen after the NETWORK CHANGE event. We could do a check to see if the pending request is for TTL, but that's extra complexity that I don't think we need.
Anyway, tell me what you think. I'm concerned about this, but I'm aware that I could be missing something in the existing code.
Attachment #8502425 -
Flags: review?(sworkman) → feedback+
Assignee | ||
Comment 7•10 years ago
|
||
(In reply to Steve Workman [:sworkman] from comment #6)
Gah. I feel stupid for not noticing that the entry was added already when the entry is added to a queue but of course it makes sense and yes RemoveOrRefresh needs to be adapted to this.
I left the TTL block moved like in the first version of the patch since it seems to still make sense, and I believe addressed all the other concerns in this v2 patch.
Attachment #8502425 -
Attachment is obsolete: true
Attachment #8502794 -
Flags: review?(sworkman)
Assignee | ||
Comment 8•10 years ago
|
||
Clearly this patch introduces memory leaks:
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=8dea0317bcc2
Assignee | ||
Comment 9•10 years ago
|
||
v3. Fixed the leak.
Since we now remove the entries from the list in the HostDB_PruneEntry() function, it also removes entries from the evictionq. To avoid that causing problem, I changed the logic to first remove all entries that are in the evictionq and then clear the cache accordingly. This seems to fix the memory leaks. I re-run a try on windows here; https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=8794bdf39851
I'll run a full try too, just want to see that you're with me on this approach Steve.
Attachment #8502794 -
Attachment is obsolete: true
Attachment #8502794 -
Flags: review?(sworkman)
Attachment #8503293 -
Flags: review?(sworkman)
Comment 10•10 years ago
|
||
Comment on attachment 8503293 [details] [diff] [review]
v3-0001-bug-1077084-improved-DNS-cache-pruning-r-sworkman.patch
Review of attachment 8503293 [details] [diff] [review]:
-----------------------------------------------------------------
::: netwerk/dns/nsHostResolver.cpp
@@ +479,5 @@
> + // to get resolved. Leave it.
> + fprintf(stderr, "!!!!! RemoveOrRefresh() line %d\n", __LINE__);
> + return false;
> + }
> + else {
} else {
And remove the fprintf ;)
@@ +485,5 @@
> + // record should be considered stale and not trusted; set a flag to
> + // ensure it is called again.
> + mResolveAgain = true;
> + return false;
> + }
You could put the return false outside the else and just have an "if (!onQueue) { ... }".
@@ +816,5 @@
> + }
> + }
> +
> + // prune the hash from all hosts already resolved
> + PL_DHashTableEnumerate(&mDB, HostDB_PruneEntry, nullptr);
So, by the time we arrive at PruneEntry now, all the resolved (i.e. !resolving && !onQueue) should all be removed. When an entry is resolved, it's added to the evictionQ, so I think PruneEntry is effectively just marking the entries for refresh. Can you check this, please?
Attachment #8503293 -
Flags: review?(sworkman) → feedback+
Comment 11•10 years ago
|
||
(In reply to Daniel Stenberg [:bagder] from comment #7)
> Gah. I feel stupid for not noticing that the entry was added already when
> the entry is added to a queue but of course it makes sense and yes
> RemoveOrRefresh needs to be adapted to this.
It's ok. I feel stupid too ;) I should have spotted it in the earlier review, especially since I'm the one that requested dealing with requests already with the OS resolver!
Assignee | ||
Updated•10 years ago
|
Assignee: nobody → daniel
Assignee | ||
Comment 12•10 years ago
|
||
(In reply to Steve Workman [:sworkman] from comment #10)
> So, by the time we arrive at PruneEntry now, all the resolved (i.e.
> !resolving && !onQueue) should all be removed. When an entry is resolved,
> it's added to the evictionQ, so I think PruneEntry is effectively just
> marking the entries for refresh. Can you check this, please?
Ah, yes exactly what I think. The entries still in the cache at the point after the evictionq is "drained" should all either get resolved soon or be in the process of being resolved right now.
However...
I did a test-run[1] and asserted that 'resolving' was always true in RemoveOrRefresh() but that triggered in some test cases and caused failures. So, the safer approach is clearly to keep the logic that iterates over all entries, and removes those that can be removed and refreshes those in need of a second resolve.
[1] = that test run can be seen here: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=50fe060f7a05
Assignee | ||
Comment 13•10 years ago
|
||
v4, updated according to previous comments. Only minor changes since I couldn't do the simplification as mention in a previous comment above here.
Test run on win32 debug-only (since that's where we saw the problems before): https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=50fe060f7a05
Attachment #8503293 -
Attachment is obsolete: true
Attachment #8504048 -
Flags: review?(sworkman)
Comment 14•10 years ago
|
||
Comment on attachment 8504048 [details] [diff] [review]
v4-0001-bug-1077084-improved-DNS-cache-pruning-r-sworkman.patch
Review of attachment 8504048 [details] [diff] [review]:
-----------------------------------------------------------------
Thanks for fixing this Daniel! r=me.
::: netwerk/dns/nsHostResolver.cpp
@@ +600,5 @@
> static PLDHashOperator
> HostDB_PruneEntry(PLDHashTable *table,
> + PLDHashEntryHdr *hdr,
> + uint32_t number,
> + void *arg)
nit: Indentation changed here; looks like some weirdness.
@@ +670,5 @@
>
> // We can't set a pref in the context of a pref change callback, so
> // dispatch DnsExperimentChangedInternal for async getting/setting.
> DebugOnly<nsresult> rv = NS_DispatchToMainThread(
> + NS_NewRunnableMethod(self, &DnsExperimentChangedInternal));
Ah, this was changed in a recent patch. Please don't change it back. The class name is needed to avoid a clang warning.
@@ +807,5 @@
> + }
> + }
> +
> + // Refresh the cache entries that are resolving RIGHT now, remove the rest.
> + PL_DHashTableEnumerate(&mDB, HostDB_PruneEntry, nullptr);
Cool. Safer route is better.
Attachment #8504048 -
Flags: review?(sworkman) → review+
Assignee | ||
Comment 15•10 years ago
|
||
(In reply to Steve Workman [:sworkman] from comment #14)
> > DebugOnly<nsresult> rv = NS_DispatchToMainThread(
> > + NS_NewRunnableMethod(self, &DnsExperimentChangedInternal));
>
> Ah, this was changed in a recent patch. Please don't change it back. The
> class name is needed to avoid a clang warning.
Good catch. I don't know how it got to be like that, I blame a crazy rebase or something. It was not done on purpose and I'll get rid of it from my patch!
Assignee | ||
Comment 16•10 years ago
|
||
v5. cleaned up but otherwise the same.
Try build: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=237b4187386c
Attachment #8504048 -
Attachment is obsolete: true
Attachment #8504516 -
Flags: review+
Assignee | ||
Updated•10 years ago
|
Keywords: checkin-needed
Comment 17•10 years ago
|
||
Keywords: checkin-needed
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
Reporter | ||
Comment 19•10 years ago
|
||
AFAICT, we need this on Aurora at least. Do we need this on Beta34 as well? Please nominate accordingly :)
status-firefox34:
--- → ?
status-firefox35:
--- → affected
status-firefox36:
--- → fixed
status-firefox-esr31:
--- → unaffected
Flags: needinfo?(daniel)
Assignee | ||
Comment 20•10 years ago
|
||
This fix is primarily a fix for flaws in the bug 939318 series that are targeted for 35 as I understand it.
It also changes code brought in via bug 1067679 that probably has a completely different target (if any) so some splitting up work is probably required at some point.
Flags: needinfo?(daniel)
Reporter | ||
Comment 21•10 years ago
|
||
Please nominate this for aurora approval when you get a chance :-)
Assignee | ||
Comment 22•10 years ago
|
||
Comment on attachment 8504516 [details] [diff] [review]
v5-0001-bug-1077084-improved-DNS-cache-pruning-r-sworkman.patch
Approval Request Comment
[Feature/regressing bug #]: 939318
[User impact if declined]: occasional crashes occur as outlined in this bug report
[Describe test coverage new/current, TBPL]: not terribly good, which is why 939318 got in with this problem in the first place, but also difficult to do good tests for due to the timing dependent/race nature of the problem
[Risks and why]:
[String/UUID change made/needed]:
Flags: needinfo?(daniel)
Attachment #8504516 -
Flags: approval-mozilla-aurora?
Updated•10 years ago
|
Attachment #8504516 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Reporter | ||
Comment 23•10 years ago
|
||
Comment 24•10 years ago
|
||
blocking-b2g: --- → 2.1S?
Comment 26•10 years ago
|
||
(In reply to Steven Yang [:styang] from comment #26)
> stability issue.
Shouldn't we merge into v2.1 if this is a stability issue?
Flags: needinfo?(vliu) → needinfo?(styang)
Reporter | ||
Comment 30•10 years ago
|
||
Please nominate this patch for b2g34 approval when you get a chance.
status-b2g-v2.1:
--- → affected
status-b2g-v2.1S:
--- → affected
status-b2g-v2.2:
--- → fixed
Flags: needinfo?(daniel)
Assignee | ||
Comment 31•10 years ago
|
||
Comment on attachment 8504516 [details] [diff] [review]
v5-0001-bug-1077084-improved-DNS-cache-pruning-r-sworkman.patch
NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.
[Approval Request Comment]
Bug caused by (feature/regressing bug #): 939318
User impact if declined: occasional crashes occur as outlined in this bug report
Testing completed: It has been landed and used for quite some time now in other Firefox versions/trees
Risk to taking this patch (and alternatives if risky):
String or UUID changes made by this patch:
Flags: needinfo?(daniel)
Attachment #8504516 -
Flags: approval-mozilla-b2g34?
Comment 32•10 years ago
|
||
Comment on attachment 8504516 [details] [diff] [review]
v5-0001-bug-1077084-improved-DNS-cache-pruning-r-sworkman.patch
Keeping the risk in mind and backing out if causing any side effect.
Attachment #8504516 -
Flags: approval-mozilla-b2g34? → approval-mozilla-b2g34+
Reporter | ||
Comment 33•10 years ago
|
||
(In reply to Daniel Stenberg [:bagder] from comment #23)
> Approval Request Comment
> [Feature/regressing bug #]: 939318
Bug 939318 never landed on b2g34. Consequently, the patch in this bug doesn't apply at all. Why was this nominated for v2.1?
Flags: needinfo?(vliu)
Comment 34•10 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #34)
> (In reply to Daniel Stenberg [:bagder] from comment #23)
> > Approval Request Comment
> > [Feature/regressing bug #]: 939318
>
> Bug 939318 never landed on b2g34. Consequently, the patch in this bug
> doesn't apply at all. Why was this nominated for v2.1?
I think the reason why it was nominated for v2,1 is based on Comment 25 - Comment 30.
Flags: needinfo?(vliu)
Reporter | ||
Comment 35•10 years ago
|
||
Bug 1008091 never got uplifted to v2.1/v2.1S. There's nothing to do here AFAICT.
Reporter | ||
Comment 36•10 years ago
|
||
Comment on attachment 8504516 [details] [diff] [review]
v5-0001-bug-1077084-improved-DNS-cache-pruning-r-sworkman.patch
Clearing the approval for now. Please re-nominate if I'm missing something here.
Attachment #8504516 -
Flags: approval-mozilla-b2g34+
You need to log in
before you can comment on or make changes to this bug.
Description
•