Closed
Bug 738914
Opened 13 years ago
Closed 13 years ago
Intermittent test_websocket.html | Exited with code -1073741819 from a crash [@ nsHttpPipeline::IsDone()]
Categories
(Core :: Networking: WebSockets, defect)
Tracking
()
RESOLVED
FIXED
mozilla15
People
(Reporter: mak, Assigned: mcmanus)
References
Details
(Keywords: intermittent-failure)
Attachments
(1 file)
(deleted),
patch
|
jduell.mcbugs
:
review+
akeybl
:
approval-mozilla-aurora-
|
Details | Diff | Splinter Review |
could be related to recent changes to http pipelining
https://tbpl.mozilla.org/php/getParsedLog.php?id=10344326&tree=Mozilla-Inbound
TEST-UNEXPECTED-FAIL | /tests/content/base/test/test_websocket.html | Exited with code -1073741819 during test run
INFO | automation.py | Application ran for: 0:06:54.891000
INFO | automation.py | Reading PID log: c:\users\cltbld\appdata\local\temp\tmphuue0opidlog
==> process 2936 launched child process 3804
INFO | automation.py | Checking for orphan process with PID: 3804
Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32-debug/1332579810/firefox-14.0a1.en-US.win32.crashreporter-symbols.zip
PROCESS-CRASH | /tests/content/base/test/test_websocket.html | application crashed (minidump found)
Crash dump filename: c:\users\cltbld\appdata\local\temp\tmpizwgjc\minidumps\cbc89f78-dbcc-4ea2-895c-2e6e7569b1be.dmp
Operating system: Windows NT
6.1.7600
CPU: x86
GenuineIntel family 6 model 23 stepping 10
2 CPUs
Crash reason: EXCEPTION_ACCESS_VIOLATION_READ
Crash address: 0xffffffffddddddf9
Thread 0 (crashed)
0 xul.dll!nsHttpPipeline::IsDone() [nsHttpPipeline.cpp:9094735bcb46 : 634 + 0xd]
eip = 0x6941f9c7 esp = 0x002bd6a0 ebp = 0x002bd6c0 ebx = 0x00000001
esi = 0x14828804 edi = 0x00000000 eax = 0x1916d6d0 ecx = 0x1916d6d0
edx = 0xdddddddd efl = 0x00210297
Found by: given as instruction pointer in context
1 xul.dll!nsHttpConnection::TakeTransport(nsISocketTransport * *,nsIAsyncInputStream * *,nsIAsyncOutputStream * *) [nsHttpConnection.cpp:9094735bcb46 : 929 + 0x12]
eip = 0x693fb0b9 esp = 0x002bd6b0 ebp = 0x002bd6c0
Found by: stack scanning
2 xul.dll!nsHttpChannel::OnStopRequest(nsIRequest *,nsISupports *,unsigned int) [nsHttpChannel.cpp:9094735bcb46 : 4488 + 0x52]
eip = 0x6941cfc0 esp = 0x002bd6c8 ebp = 0x002bd704
Found by: call frame info
3 xul.dll!nsInputStreamPump::OnStateStop() [nsInputStreamPump.cpp:9094735bcb46 : 583 + 0x18]
eip = 0x6938b36a esp = 0x002bd70c ebp = 0x002bd72c
Found by: call frame info
4 xul.dll!nsInputStreamPump::OnInputStreamReady(nsIAsyncInputStream *) [nsInputStreamPump.cpp:9094735bcb46 : 405 + 0x7]
eip = 0x6938b414 esp = 0x002bd734 ebp = 0x002bd740
Found by: call frame info
5 xul.dll!nsInputStreamReadyEvent::Run() [nsStreamUtils.cpp:9094735bcb46 : 114 + 0x10]
eip = 0x6a03a08d esp = 0x002bd748 ebp = 0x002bd758
Found by: call frame info
6 xul.dll!nsThread::ProcessNextEvent(bool,bool *) [nsThread.cpp:9094735bcb46 : 656 + 0xd]
eip = 0x6a04eb78 esp = 0x002bd760 ebp = 0x002bd788
Found by: call frame info
7 xul.dll!NS_ProcessNextEvent_P(nsIThread *,bool) [nsThreadUtils.cpp:9094735bcb46 : 245 + 0xc]
eip = 0x6a0116a7 esp = 0x002bd790 ebp = 0x002bd79c
Found by: call frame info
8 xul.dll!mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate *) [MessagePump.cpp:9094735bcb46 : 134 + 0x9]
eip = 0x69f305b8 esp = 0x002bd7a4 ebp = 0x002bd7c8
Assignee | ||
Comment 1•13 years ago
|
||
I'm out of the office this week so don't panic if you don't see an immediate patch - but this is firmly on my radar.
Assignee: nobody → mcmanus
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 3•13 years ago
|
||
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Assignee | ||
Comment 12•13 years ago
|
||
Thread 0 (crashed)
0 xul.dll!nsHttpPipeline::IsDone() [nsHttpPipeline.cpp:9094735bcb46 : 634 + 0xd]
eip = 0x6941f9c7 esp = 0x002bd6a0 ebp = 0x002bd6c0 ebx = 0x00000001
esi = 0x14828804 edi = 0x00000000 eax = 0x1916d6d0 ecx = 0x1916d6d0
edx = 0xdddddddd efl = 0x00210297
Found by: given as instruction pointer in context
1 xul.dll!nsHttpConnection::TakeTransport(nsISocketTransport * *,nsIAsyncInputStream * *,nsIAsyncOutputStream * *) [nsHttpConnection.cpp:9094735bcb46 : 929 + 0x12]
eip = 0x693fb0b9 esp = 0x002bd6b0 ebp = 0x002bd6c0
Found by: stack scanning
2 xul.dll!nsHttpChannel::OnStopRequest(nsIRequest *,nsISupports *,unsigned int) [nsHttpChannel.cpp:9094735bcb46 : 4488 + 0x52]
eip = 0x6941cfc0 esp = 0x002bd6c8 ebp = 0x002bd704
Found by: call frame info
Thread 0 is the main thread. That obviously is going to have some problems with the nshttppipeline implementation of nsAHttpTransaction. Its not clear to me that this is safe even with the normal mTransaction && mTransaciton->IsDone() check nsHttpConnection does.. that could be racy if an EOF or something arrived on the network thread.
To get this stable this code should ping pong onto the network thread for the taketransport() check and then back to the main thread for OnTransportAvailable().. that's sub-optimal but this is in the high-latency handshake portion of the program anyhow - it won't impact actual message latency in the normal states. We can get it stable and then later figure out how to streamline the handshale.
using the pipeline datastructure makes this more obvious, but imo this is a long standing bug that may need porting.
Assignee | ||
Comment 13•13 years ago
|
||
move nsHttpConnection::TakeTransport() to the socket thread, while leaving OnTransportAvailable on the main thread.
Also, don't call content/nsWebSocket.cpp::OnStart() until the IO handlers are fully installed (we need both OnStartRequest and OnTransportAvailable to have been executed, not just OSR).. not doing this means that when firefox writes a message before the IO handlers are installed it might never go out - this was really obviously triggerable with the threading changes mentioned above, but it was a pre-existing problem.
Attachment #613839 -
Flags: review?(jduell.mcbugs)
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 27•13 years ago
|
||
Comment on attachment 613839 [details] [diff] [review]
patch 0
Review of attachment 613839 [details] [diff] [review]:
-----------------------------------------------------------------
So the reason IsDone was segfaulting was re-entrant usage of mResponseQ/mRequestQ from both main/socket threads, correct?
Is there a reason we also need to move the call to mListener->OnStart() from WebSocketChannel::OnStartRequest to WebSocketChannel::StartWebsocketData? The move seems ok, and it's intuitively sensible to wait for both OnTransportAvailable and OnStartRequest to be called before calling OnStart. The only issue I could think of is error handling: if for some reason TakeTransport fails (is it possible in practice in this control path?) you don't wind up calling OnTransportAvailable, so you'd now never call OnStart. Not that this appears to make much difference: we don't seem to have error handling for failed TakeTransport either before or after this patch. I'm ok with fixing that in another bug if it's indeed an issue.
All the code to move TakeTransport to the socket thread looks solid.
::: netwerk/protocol/http/nsHttpConnectionMgr.cpp
@@ +387,5 @@
> +public:
> +nsCompleteUpgradeData(nsAHttpConnection *aConn,
> + nsIHttpUpgradeListener *aListener)
> + : mConn(aConn)
> + , mUpgradeListener(aListener) {}
weird indenting. Just put both inits on same line? Or at least same indent level.
Attachment #613839 -
Flags: review?(jduell.mcbugs) → review+
Comment hidden (Legacy TBPL/Treeherder Robot) |
Assignee | ||
Comment 29•13 years ago
|
||
(In reply to Jason Duell (:jduell) from comment #27)
> Comment on attachment 613839 [details] [diff] [review]
> patch 0
>
> Review of attachment 613839 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> So the reason IsDone was segfaulting was re-entrant usage of
> mResponseQ/mRequestQ from both main/socket threads, correct?
>
yes.
but it really exposed a fundamental problem that we aren't seeing reports of - the transaction could just as easily be closed on the socket thread due to an EOF while IsDone() is running on the main thread.. so this isn't so much a regression in 14 as much as 14 bringing that issue to the surface.
> Is there a reason we also need to move the call to mListener->OnStart() from
> WebSocketChannel::OnStartRequest to WebSocketChannel::StartWebsocketData?
without that change the content code can submit a message before the http upgrade is complete (i.e. before the transport object can do real websocket I/O) and OnOutputStreamReady() doesn't send it because mSocketOut is still not setup.. the change I made seemed as you say to be the intuitive way to fix that - an alternative would have been to manually call OnOutputStreamReady out of StartWebsocketData to clear the queue, but in general pushing the buffering back towards the source-generator is good policy.
> The move seems ok, and it's intuitively sensible to wait for both
> OnTransportAvailable and OnStartRequest to be called before calling OnStart.
> The only issue I could think of is error handling: if for some reason
> TakeTransport fails (is it possible in practice in this control path?) you
> don't wind up calling OnTransportAvailable, so you'd now never call OnStart.
> Not that this appears to make much difference: we don't seem to have error
> handling for failed TakeTransport either before or after this patch. I'm ok
> with fixing that in another bug if it's indeed an issue.
>
yes, that's orthogonal if its a problem at all. I don't think you have to call OTA (though I'd have to think about it), but I'm not sure the right start/stop gets sent to the content class.
Assignee | ||
Comment 30•13 years ago
|
||
Comment 31•13 years ago
|
||
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Assignee | ||
Updated•13 years ago
|
Assignee | ||
Comment 32•13 years ago
|
||
Comment on attachment 613839 [details] [diff] [review]
patch 0
[Approval Request Comment]
Regression caused by (bug #): 447866
User impact if declined: crash risk
Testing completed (on m-c, etc.): baked on m-c, has cured an orange there
Risk to taking this patch (and alternatives if risky):
String changes made by this patch: none
Attachment #613839 -
Flags: approval-mozilla-aurora?
Comment 33•13 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #32)
> Risk to taking this patch (and alternatives if risky):
What risk does this patch to nsHttpChannel.cpp carry? I'd only fix an intermittent orange on Aurora if it had very low risk.
Assignee | ||
Comment 34•13 years ago
|
||
(In reply to Alex Keybl [:akeybl] from comment #33)
> (In reply to Patrick McManus [:mcmanus] from comment #32)
> > Risk to taking this patch (and alternatives if risky):
>
> What risk does this patch to nsHttpChannel.cpp carry? I'd only fix an
> intermittent orange on Aurora if it had very low risk.
I never know how to answer this in a quantifiable way. So the question is always easier to ask than answer.
It's a moderate amount of platfrom C++ code. Mostly a rearrangement onto other execution threads of existing code though. It has test coverage and performs better with that same coverage on m-c than on aurora. The code path is pretty central to websockets, so we can be sure it is being exercised outside of tests too.
The aurora code can crash due to a race condition. This is a long standing issue, but a non websockets checkin on 14 made it worse by adding to the list of things that could race dangerously, so the fix for 13 and 14 would be the same - but the risk of triggering the bug is larger on 14. That's why I am not proposing beta?
Comment 35•13 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #34)
> The aurora code can crash due to a race condition. This is a long standing
> issue, but a non websockets checkin on 14 made it worse by adding to the
> list of things that could race dangerously, so the fix for 13 and 14 would
> be the same - but the risk of triggering the bug is larger on 14. That's why
> I am not proposing beta?
I'm trying to avoid taking moderate amounts of code change on any branch without data that suggests this will be a common/significant problem for our users. Without this fix, would we expect the crash signature [@ nsHttpPipeline::IsDone()] to be apparent in our crash data, or is that just one way the bug is apparent?
Assignee | ||
Comment 36•13 years ago
|
||
(In reply to Alex Keybl [:akeybl] from comment #35)
>
> I'm trying to avoid taking moderate amounts of code change on any branch
> without data that suggests this will be a common/significant problem for our
> users. Without this fix, would we expect the crash signature [@
> nsHttpPipeline::IsDone()] to be apparent in our crash data, or is that just
> one way the bug is apparent?
nsHttpPipeline::IsDone() at the top will likely be the most common - they should all contain nsHttpConnection::TakeTransport() somewhere in the stack.
Comment 37•13 years ago
|
||
From looking at crash-stats, I don't believe this is prevalent enough to take any code change to resolve for FF14. If related crashes do rise, we'll reconsider taking the change.
status-firefox13:
--- → affected
Updated•13 years ago
|
Attachment #613839 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora-
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 40•13 years ago
|
||
Patrick--any idea of recent changes that might be causing this to fail again/still?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 41•13 years ago
|
||
(In reply to Jason Duell (:jduell) from comment #40)
> Patrick--any idea of recent changes that might be causing this to fail
> again/still?
what are you referring to?
This is wontfix for FF13 and FF14 (see alex comment 37). The crash in 38 is FF14 and comment 39 is either ff 13 or 14 depending on the build time.
Comment 42•13 years ago
|
||
Ah, right. Resume your normal programming...
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Updated•12 years ago
|
Keywords: intermittent-failure
Updated•12 years ago
|
Whiteboard: [orange]
You need to log in
before you can comment on or make changes to this bug.
Description
•