738914 - Intermittent test_websocket.html | Exited with code -1073741819 from a crash [@ nsHttpPipeline::IsDone()]

Reporter

Description

•

13 years ago

could be related to recent changes to http pipelining https://tbpl.mozilla.org/php/getParsedLog.php?id=10344326&tree=Mozilla-Inbound TEST-UNEXPECTED-FAIL | /tests/content/base/test/test_websocket.html | Exited with code -1073741819 during test run INFO | automation.py | Application ran for: 0:06:54.891000 INFO | automation.py | Reading PID log: c:\users\cltbld\appdata\local\temp\tmphuue0opidlog ==> process 2936 launched child process 3804 INFO | automation.py | Checking for orphan process with PID: 3804 Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32-debug/1332579810/firefox-14.0a1.en-US.win32.crashreporter-symbols.zip PROCESS-CRASH | /tests/content/base/test/test_websocket.html | application crashed (minidump found) Crash dump filename: c:\users\cltbld\appdata\local\temp\tmpizwgjc\minidumps\cbc89f78-dbcc-4ea2-895c-2e6e7569b1be.dmp Operating system: Windows NT 6.1.7600 CPU: x86 GenuineIntel family 6 model 23 stepping 10 2 CPUs Crash reason: EXCEPTION_ACCESS_VIOLATION_READ Crash address: 0xffffffffddddddf9 Thread 0 (crashed) 0 xul.dll!nsHttpPipeline::IsDone() [nsHttpPipeline.cpp:9094735bcb46 : 634 + 0xd] eip = 0x6941f9c7 esp = 0x002bd6a0 ebp = 0x002bd6c0 ebx = 0x00000001 esi = 0x14828804 edi = 0x00000000 eax = 0x1916d6d0 ecx = 0x1916d6d0 edx = 0xdddddddd efl = 0x00210297 Found by: given as instruction pointer in context 1 xul.dll!nsHttpConnection::TakeTransport(nsISocketTransport * *,nsIAsyncInputStream * *,nsIAsyncOutputStream * *) [nsHttpConnection.cpp:9094735bcb46 : 929 + 0x12] eip = 0x693fb0b9 esp = 0x002bd6b0 ebp = 0x002bd6c0 Found by: stack scanning 2 xul.dll!nsHttpChannel::OnStopRequest(nsIRequest *,nsISupports *,unsigned int) [nsHttpChannel.cpp:9094735bcb46 : 4488 + 0x52] eip = 0x6941cfc0 esp = 0x002bd6c8 ebp = 0x002bd704 Found by: call frame info 3 xul.dll!nsInputStreamPump::OnStateStop() [nsInputStreamPump.cpp:9094735bcb46 : 583 + 0x18] eip = 0x6938b36a esp = 0x002bd70c ebp = 0x002bd72c Found by: call frame info 4 xul.dll!nsInputStreamPump::OnInputStreamReady(nsIAsyncInputStream *) [nsInputStreamPump.cpp:9094735bcb46 : 405 + 0x7] eip = 0x6938b414 esp = 0x002bd734 ebp = 0x002bd740 Found by: call frame info 5 xul.dll!nsInputStreamReadyEvent::Run() [nsStreamUtils.cpp:9094735bcb46 : 114 + 0x10] eip = 0x6a03a08d esp = 0x002bd748 ebp = 0x002bd758 Found by: call frame info 6 xul.dll!nsThread::ProcessNextEvent(bool,bool *) [nsThread.cpp:9094735bcb46 : 656 + 0xd] eip = 0x6a04eb78 esp = 0x002bd760 ebp = 0x002bd788 Found by: call frame info 7 xul.dll!NS_ProcessNextEvent_P(nsIThread *,bool) [nsThreadUtils.cpp:9094735bcb46 : 245 + 0xc] eip = 0x6a0116a7 esp = 0x002bd790 ebp = 0x002bd79c Found by: call frame info 8 xul.dll!mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate *) [MessagePump.cpp:9094735bcb46 : 134 + 0x9] eip = 0x69f305b8 esp = 0x002bd7a4 ebp = 0x002bd7c8

Patrick McManus [:mcmanus]

Assignee

Comment 1

•

13 years ago

I'm out of the office this week so don't panic if you don't see an immediate patch - but this is firmly on my radar.

Assignee: nobody → mcmanus

Honza Bambas (:mayhemer)

Updated

•

13 years ago

Blocks: 739762

Comment hidden (Legacy TBPL/Treeherder Robot)

Ryan VanderMeulen [:RyanVM]

Comment 3

•

13 years ago

https://tbpl.mozilla.org/php/getParsedLog.php?id=10455529&tree=Mozilla-Inbound

Comment hidden (Legacy TBPL/Treeherder Robot)

Patrick McManus [:mcmanus]

Assignee

Comment 12

•

13 years ago

Thread 0 (crashed) 0 xul.dll!nsHttpPipeline::IsDone() [nsHttpPipeline.cpp:9094735bcb46 : 634 + 0xd] eip = 0x6941f9c7 esp = 0x002bd6a0 ebp = 0x002bd6c0 ebx = 0x00000001 esi = 0x14828804 edi = 0x00000000 eax = 0x1916d6d0 ecx = 0x1916d6d0 edx = 0xdddddddd efl = 0x00210297 Found by: given as instruction pointer in context 1 xul.dll!nsHttpConnection::TakeTransport(nsISocketTransport * *,nsIAsyncInputStream * *,nsIAsyncOutputStream * *) [nsHttpConnection.cpp:9094735bcb46 : 929 + 0x12] eip = 0x693fb0b9 esp = 0x002bd6b0 ebp = 0x002bd6c0 Found by: stack scanning 2 xul.dll!nsHttpChannel::OnStopRequest(nsIRequest *,nsISupports *,unsigned int) [nsHttpChannel.cpp:9094735bcb46 : 4488 + 0x52] eip = 0x6941cfc0 esp = 0x002bd6c8 ebp = 0x002bd704 Found by: call frame info Thread 0 is the main thread. That obviously is going to have some problems with the nshttppipeline implementation of nsAHttpTransaction. Its not clear to me that this is safe even with the normal mTransaction && mTransaciton->IsDone() check nsHttpConnection does.. that could be racy if an EOF or something arrived on the network thread. To get this stable this code should ping pong onto the network thread for the taketransport() check and then back to the main thread for OnTransportAvailable().. that's sub-optimal but this is in the high-latency handshake portion of the program anyhow - it won't impact actual message latency in the normal states. We can get it stable and then later figure out how to streamline the handshale. using the pipeline datastructure makes this more obvious, but imo this is a long standing bug that may need porting.

Patrick McManus [:mcmanus]

Assignee

Comment 13

•

13 years ago

Attached patch patch 0 (deleted) — Details — Splinter Review

move nsHttpConnection::TakeTransport() to the socket thread, while leaving OnTransportAvailable on the main thread. Also, don't call content/nsWebSocket.cpp::OnStart() until the IO handlers are fully installed (we need both OnStartRequest and OnTransportAvailable to have been executed, not just OSR).. not doing this means that when firefox writes a message before the IO handlers are installed it might never go out - this was really obviously triggerable with the threading changes mentioned above, but it was a pre-existing problem.

Attachment #613839 - Flags: review?(jduell.mcbugs)

Comment hidden (Legacy TBPL/Treeherder Robot)

Jason Duell

Comment 27

•

13 years ago

Comment on attachment 613839 [details] [diff] [review] patch 0 Review of attachment 613839 [details] [diff] [review]: ----------------------------------------------------------------- So the reason IsDone was segfaulting was re-entrant usage of mResponseQ/mRequestQ from both main/socket threads, correct? Is there a reason we also need to move the call to mListener->OnStart() from WebSocketChannel::OnStartRequest to WebSocketChannel::StartWebsocketData? The move seems ok, and it's intuitively sensible to wait for both OnTransportAvailable and OnStartRequest to be called before calling OnStart. The only issue I could think of is error handling: if for some reason TakeTransport fails (is it possible in practice in this control path?) you don't wind up calling OnTransportAvailable, so you'd now never call OnStart. Not that this appears to make much difference: we don't seem to have error handling for failed TakeTransport either before or after this patch. I'm ok with fixing that in another bug if it's indeed an issue. All the code to move TakeTransport to the socket thread looks solid. ::: netwerk/protocol/http/nsHttpConnectionMgr.cpp @@ +387,5 @@ > +public: > +nsCompleteUpgradeData(nsAHttpConnection *aConn, > + nsIHttpUpgradeListener *aListener) > + : mConn(aConn) > + , mUpgradeListener(aListener) {} weird indenting. Just put both inits on same line? Or at least same indent level.

Attachment #613839 - Flags: review?(jduell.mcbugs) → review+

Comment hidden (Legacy TBPL/Treeherder Robot)

Patrick McManus [:mcmanus]

Assignee

Comment 29

•

13 years ago

(In reply to Jason Duell (:jduell) from comment #27) > Comment on attachment 613839 [details] [diff] [review] > patch 0 > > Review of attachment 613839 [details] [diff] [review]: > ----------------------------------------------------------------- > > So the reason IsDone was segfaulting was re-entrant usage of > mResponseQ/mRequestQ from both main/socket threads, correct? > yes. but it really exposed a fundamental problem that we aren't seeing reports of - the transaction could just as easily be closed on the socket thread due to an EOF while IsDone() is running on the main thread.. so this isn't so much a regression in 14 as much as 14 bringing that issue to the surface. > Is there a reason we also need to move the call to mListener->OnStart() from > WebSocketChannel::OnStartRequest to WebSocketChannel::StartWebsocketData? without that change the content code can submit a message before the http upgrade is complete (i.e. before the transport object can do real websocket I/O) and OnOutputStreamReady() doesn't send it because mSocketOut is still not setup.. the change I made seemed as you say to be the intuitive way to fix that - an alternative would have been to manually call OnOutputStreamReady out of StartWebsocketData to clear the queue, but in general pushing the buffering back towards the source-generator is good policy. > The move seems ok, and it's intuitively sensible to wait for both > OnTransportAvailable and OnStartRequest to be called before calling OnStart. > The only issue I could think of is error handling: if for some reason > TakeTransport fails (is it possible in practice in this control path?) you > don't wind up calling OnTransportAvailable, so you'd now never call OnStart. > Not that this appears to make much difference: we don't seem to have error > handling for failed TakeTransport either before or after this patch. I'm ok > with fixing that in another bug if it's indeed an issue. > yes, that's orthogonal if its a problem at all. I don't think you have to call OTA (though I'd have to think about it), but I'm not sure the right start/stop gets sent to the content class.

status-firefox14: --- → affected

status-firefox15: --- → affected

Target Milestone: --- → mozilla15

Patrick McManus [:mcmanus]

Assignee

Comment 30

•

13 years ago

ff15 https://hg.mozilla.org/integration/mozilla-inbound/rev/92d22954cbd1

Ed Morley [:emorley]

Comment 31

•

13 years ago

https://hg.mozilla.org/mozilla-central/rev/92d22954cbd1

Status: NEW → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Patrick McManus [:mcmanus]

Assignee

Updated

•

13 years ago

status-firefox15: affected → fixed

Patrick McManus [:mcmanus]

Assignee

Comment 32

•

13 years ago

Comment on attachment 613839 [details] [diff] [review] patch 0 [Approval Request Comment] Regression caused by (bug #): 447866 User impact if declined: crash risk Testing completed (on m-c, etc.): baked on m-c, has cured an orange there Risk to taking this patch (and alternatives if risky): String changes made by this patch: none

Attachment #613839 - Flags: approval-mozilla-aurora?

Alex Keybl [:akeybl]

Comment 33

•

13 years ago

(In reply to Patrick McManus [:mcmanus] from comment #32) > Risk to taking this patch (and alternatives if risky): What risk does this patch to nsHttpChannel.cpp carry? I'd only fix an intermittent orange on Aurora if it had very low risk.

Patrick McManus [:mcmanus]

Assignee

Comment 34

•

13 years ago

(In reply to Alex Keybl [:akeybl] from comment #33) > (In reply to Patrick McManus [:mcmanus] from comment #32) > > Risk to taking this patch (and alternatives if risky): > > What risk does this patch to nsHttpChannel.cpp carry? I'd only fix an > intermittent orange on Aurora if it had very low risk. I never know how to answer this in a quantifiable way. So the question is always easier to ask than answer. It's a moderate amount of platfrom C++ code. Mostly a rearrangement onto other execution threads of existing code though. It has test coverage and performs better with that same coverage on m-c than on aurora. The code path is pretty central to websockets, so we can be sure it is being exercised outside of tests too. The aurora code can crash due to a race condition. This is a long standing issue, but a non websockets checkin on 14 made it worse by adding to the list of things that could race dangerously, so the fix for 13 and 14 would be the same - but the risk of triggering the bug is larger on 14. That's why I am not proposing beta?

Alex Keybl [:akeybl]

Comment 35

•

13 years ago

(In reply to Patrick McManus [:mcmanus] from comment #34) > The aurora code can crash due to a race condition. This is a long standing > issue, but a non websockets checkin on 14 made it worse by adding to the > list of things that could race dangerously, so the fix for 13 and 14 would > be the same - but the risk of triggering the bug is larger on 14. That's why > I am not proposing beta? I'm trying to avoid taking moderate amounts of code change on any branch without data that suggests this will be a common/significant problem for our users. Without this fix, would we expect the crash signature [@ nsHttpPipeline::IsDone()] to be apparent in our crash data, or is that just one way the bug is apparent?

Patrick McManus [:mcmanus]

Assignee

Comment 36

•

13 years ago

(In reply to Alex Keybl [:akeybl] from comment #35) > > I'm trying to avoid taking moderate amounts of code change on any branch > without data that suggests this will be a common/significant problem for our > users. Without this fix, would we expect the crash signature [@ > nsHttpPipeline::IsDone()] to be apparent in our crash data, or is that just > one way the bug is apparent? nsHttpPipeline::IsDone() at the top will likely be the most common - they should all contain nsHttpConnection::TakeTransport() somewhere in the stack.

Alex Keybl [:akeybl]

Comment 37

•

13 years ago

From looking at crash-stats, I don't believe this is prevalent enough to take any code change to resolve for FF14. If related crashes do rise, we'll reconsider taking the change.

status-firefox13: --- → affected

Alex Keybl [:akeybl]

Updated

•

13 years ago

Attachment #613839 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora-

Comment hidden (Legacy TBPL/Treeherder Robot)

Jason Duell

Comment 40

•

13 years ago

Patrick--any idea of recent changes that might be causing this to fail again/still?

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Patrick McManus [:mcmanus]

Assignee

Comment 41

•

13 years ago

(In reply to Jason Duell (:jduell) from comment #40) > Patrick--any idea of recent changes that might be causing this to fail > again/still? what are you referring to? This is wontfix for FF13 and FF14 (see alex comment 37). The crash in 38 is FF14 and comment 39 is either ff 13 or 14 depending on the build time.

Jason Duell

Comment 42

•

13 years ago

Ah, right. Resume your normal programming...

Status: REOPENED → RESOLVED

Closed: 13 years ago → 13 years ago

Resolution: --- → FIXED

Comment hidden (Legacy TBPL/Treeherder Robot)

Nobody; OK to take it and work on it

Updated

•

12 years ago

Keywords: intermittent-failure

Nobody; OK to take it and work on it

Updated

•

12 years ago

Whiteboard: [orange]