<a class="header-button" href="https://bugzilla-dev.allizom.org/home" title="Go to home page"> Bugzilla

https://tbpl.mozilla.org/php/getParsedLog.php?id=15057840&tree=Mozilla-Inbound

Comment 2

•

12 years ago

Ed Morley [:emorley]

Comment 3

•

12 years ago

https://tbpl.mozilla.org/php/getParsedLog.php?id=15100837&tree=Mozilla-Inbound

Comment 4

•

12 years ago

So there's no output at *all* from the test? That seems very strange, it seems like the tests should at least start running and then hang, not just hang without any output at all...

Comment 5

•

12 years ago

I think Honza may have been wrong about having the xpcshell test self-kill itself after all. The python script that runs the xpcshell tests buffers the output, only printing the output if there is a failure. (Although the output should end up on disk too.) I think buildbot is killing that python process (and thereby its children) and so the buffered data does not get dumped. Barring some other obvious solutoin, we should but the self-kill back in the test so you can get useful log data to figure out what's going on. We might also try triggering a number of try-server runs on just OS X 10.8 where this seems to be happening. Donovan, if you haven't gotten your level 1 approval in yet, I'd do that so you can do the try-server stuff.

Comment 6

•

12 years ago

Good call, Andrew. I think you are right.

Comment 7

•

12 years ago

It is a known problem (bug 597064) that xpcshell tests that time out show no output.

Comment 8

•

12 years ago

Attached patch timeout code to show error to help debugging (deleted) — Details — Splinter Review

Here is the manual timing out code that was in older versions of the tcpsocket patch. This would at least give us some context about how far the process is getting before it hangs.

Updated

•

12 years ago

Attachment #659911 - Flags: review?(fabrice)

[:fabrice] Fabrice Desré

Updated

•

12 years ago

Attachment #659911 - Flags: review?(fabrice) → review+

Comment 9

•

12 years ago

I've landed the patch on mozilla-inbound so we get better errors: https://hg.mozilla.org/integration/mozilla-inbound/rev/6e78c3efd115

Whiteboard: [orange] → [orange][leave open]

https://hg.mozilla.org/mozilla-central/rev/6e78c3efd115

Comment 10

•

12 years ago

Updated

•

12 years ago

Summary: Intermittent test_tcpsocket.js | command timed out: 1200 seconds without output, attempting to kill → Intermittent test_tcpsocket.js | command timed out: 1200 seconds without output, attempting to kill | The test should never take this long unless the system is hosed.

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment 12

•

12 years ago

The interesting bit: TEST-INFO | /builds/slave/talos-slave/test/build/xpcshell/tests/dom/network/tests/unit/test_tcpsocket.js | Starting badConnect TCPSocket: window init: undefined TCPSocket: startup called TCPSocket: Host info: 127.0.0.1:2 TCPSocket: SSL: false The moral seems to be that either we are: a) experiencing some type of race related to a failure to connect to a ridiculous local port that only happens on OS X 10.8. For example, if the necko callback fires synchronously? In theory, this race could also be on the (bad) success case if something is somehow on port 2. b) There's some kind of weird firewall on localhost for OS X 10.8. Previously we were dying after 20 minutes; now we kill the test after 10 seconds, so normal TCP socket timeouts could operate differently and the test could actually be failing for different reasons.

https://tbpl.mozilla.org/php/getParsedLog.php?id=15286270&tree=Mozilla-Aurora

Comment 13

•

12 years ago

I'm guessing the failure is happening so fast that it is happening synchronously, which prevents the "connection refused" event from firing properly, but that's just a guess. Figuring out when Connection Refused was happening was one of the hardest parts of writing the original tcpsocket patch.

Comment hidden (Legacy TBPL/Treeherder Robot)

Matt Brubeck (:mbrubeck)

Comment 15

•

12 years ago

https://tbpl.mozilla.org/php/getParsedLog.php?id=15380314&tree=Mozilla-Aurora

Reporter

Comment 16

•

12 years ago

Comment hidden (Legacy TBPL/Treeherder Robot)

https://tbpl.mozilla.org/php/getParsedLog.php?id=15494588&tree=Mozilla-Aurora

Reporter

Comment 18

•

12 years ago

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment 20

•

12 years ago

It seems strange that this is only happening on Mac OS X 10.8 Mountain Lion -- are we running bots with 10.7 or 10.6 and it is not happening there? Strange indeed.

Comment hidden (Legacy TBPL/Treeherder Robot)

https://tbpl.mozilla.org/php/getParsedLog.php?id=16285694&tree=Mozilla-Beta

Reporter

Comment 41

•

12 years ago

Comment hidden (Legacy TBPL/Treeherder Robot)

https://tbpl.mozilla.org/php/getParsedLog.php?id=16369013&tree=Mozilla-Beta

Reporter

Comment 44

•

12 years ago

Comment hidden (Legacy TBPL/Treeherder Robot)

Daniel Holbert [:dholbert]

Updated

•

12 years ago

Summary: Intermittent test_tcpsocket.js | command timed out: 1200 seconds without output, attempting to kill | The test should never take this long unless the system is hosed. → Intermittent "test_tcpsocket.js | command timed out: 1200 seconds without output, attempting to kill | The test should never take this long unless the system is hosed", followed by "head.js | exception thrown from do_timeout callback: 2147500036"

Comment hidden (Legacy TBPL/Treeherder Robot)

Nobody; OK to take it and work on it

Assignee

Updated

•

12 years ago

Keywords: intermittent-failure

Comment hidden (Legacy TBPL/Treeherder Robot)

Nobody; OK to take it and work on it

Assignee

Updated

•

12 years ago

Whiteboard: [orange][leave open] → [leave open]

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment 217

•

12 years ago

Donovan, any other ideas on this? As you can see, we still hit this pretty frequently.

Flags: needinfo?(dpreston)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment 223

•

12 years ago

Yes, this is obviously happening quite frequently. If I could reproduce it locally then I would actually have a shot at fixing it. I will look into it. Perhaps I can write a script which will run just the tcpsocket tests in a tight loop until it fails, and then keep the process around so I can examine it.

Flags: needinfo?(dpreston)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment 280

•

12 years ago

I have managed to repro this on my machine by running the tests locally over and over again until they fail. It's not perfect, because it usually takes about 1000 iterations for them to fail, and if I manage to fix it I still won't be 100% sure that the tests running for a long time means that it is fixed, but at least I can have a little more certainty. I have a few ideas for things to try to fix it, and now that I can repro locally I won't be making blind guesses in the dark.

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment 316

•

12 years ago

I was kind of hoping that my fix for bug 861196 might improve whatever is happening here, but comments 313 - 315 show it was unrelated. Can we just modify the xpcshell.ini to just not run the test on OS X machines? This failure rate seems out of control and mozTCPSocket on OS X is really not an important thing. badConnect is a useful test case of error mapping, so I would prefer not to just remove that sub-test in general.

Comment 317

•

12 years ago

I agree that we should probably just disable the test for now. I believe I did see some errors from other OSes than OS X in this huge thread, though. I think it just happens on OS X much more frequently. I do have a simple fix -- call setTimeout(5) before opening the input or output channel -- but it is more of a bandaid than a fix. Instead of getting a connection refused error immediately when the race to connect beat the registering of the callback to catch the error, the connection refused happens five seconds later. Also when this fix it seems to cause a different crash bug to occur much more frequently on my machine (although this particular test never fails any more)

Comment 318

•

12 years ago

5 seconds is potentially quite short when on a device like the unagi running on just 2g. roundtrips can be super sucky.

Comment hidden (Legacy TBPL/Treeherder Robot)

edmorley https://tbpl.mozilla.org/php/getParsedLog.php?id=22546469&tree=Mozilla-Inbound Rev5 MacOSX Mountain Lion 10.8 mozilla-inbound debug test xpcshell on 2013-05-03 03:25:22 slave: talos-mtnlion-r5-082 03:26:38 ERROR - abort: connection ended unexpectedly 03:26:38 ERROR - Automation Error: hg not responding 03:26:38 ERROR - Return code: 255 03:26:38 ERROR - Can't run command ['hg', '--config', 'ui.merge=internal:merge', 'branch'] in non-existent directory /builds/slave/talos-slave/test/build/tools! 03:26:38 ERROR - Can't run command ['hg', '--config', 'ui.merge=internal:merge', 'update', '-C'] in non-existent directory '/builds/slave/talos-slave/test/build/tools'! 03:27:38 ERROR - abort: HTTP Error 500: Internal Server Error 03:27:38 ERROR - Automation Error: hg not responding 03:27:38 ERROR - Return code: 255 03:27:38 ERROR - Can't run command ['hg', '--config', 'ui.merge=internal:merge', 'branch'] in non-existent directory /builds/slave/talos-slave/test/build/tools! 03:27:38 ERROR - Can't run command ['hg', '--config', 'ui.merge=internal:merge', 'update', '-C'] in non-existent directory '/builds/slave/talos-slave/test/build/tools'! 03:29:54 ERROR - abort: connection ended unexpectedly 03:29:54 ERROR - Automation Error: hg not responding 03:29:54 ERROR - Return code: 255 03:29:54 ERROR - Can't run command ['hg', '--config', 'ui.merge=internal:merge', 'branch'] in non-existent directory /builds/slave/talos-slave/test/build/tools! 03:29:54 ERROR - Can't run command ['hg', '--config', 'ui.merge=internal:merge', 'update', '-C'] in non-existent directory '/builds/slave/talos-slave/test/build/tools'! 03:35:00 WARNING - TEST-UNEXPECTED-FAIL | /builds/slave/talos-slave/test/build/tests/xpcshell/tests/dom/network/tests/unit_ipc/test_tcpsocket_ipc.js | test failed (with xpcshell return code: 1), see following log: 03:35:01 INFO - child: TEST-UNEXPECTED-FAIL | /builds/slave/talos-slave/test/build/tests/xpcshell/tests/dom/network/tests/unit/test_tcpsocket.js | The test should never take this long unless the system is hosed. - See following stack: 03:35:01 INFO - child: TEST-UNEXPECTED-FAIL | /builds/slave/talos-slave/test/build/tests/xpcshell/head.js | exception thrown from do_timeout callback: 2147500036 - See following stack: 03:35:12 WARNING - PROCESS-CRASH | /builds/slave/talos-slave/test/build/tests/xpcshell/tests/dom/network/tests/unit_ipc/test_tcpsocket_ipc.js | application crashed [@ nsCxPusher::DoPush(JSContext*)] 03:56:19 ERROR - Return code: 1

Comment 321

•

12 years ago

Yeah probably something like 25 seconds is a better timeout value. The timeout that causes this bug's failure would have to be extended to longer than that.

Comment 322

•

12 years ago

Maybe there's just an attribute we can check on the connection to check its state immediately after adding the listener? Presumably the state of the connection isn't a secret.

Comment 323

•

12 years ago

Yes, this is a good idea. I believe I tried it already and it was not easy to tell the state of the socket, but I may not have tried hard enough. I think the correct thing would be to check input.available() to see if it raises or not (like in onInputStreamReady), but the problem is it may also raise if the connection simply has not yet been made in addition to raising after the connection has failed.

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment 387

•

11 years ago

At this point, I'm probably going to just disable this test on OSX since we aren't making any headway on a fix.

Comment 388

•

11 years ago

Donovan, remind me what the latest status of your investigation was?

Comment 389

•

11 years ago

The patch that I had improved the situation significantly, but I was still able to reproduce the crash after running the tests for many many hours. It might be a good idea to just apply the patch anyway if we can get someone to review it. I'll attach the patch so I don't lose it.

Comment 390

•

11 years ago

Attached patch Don't actually start opening the socket until the next tick, to give callers of this function a chance to set callbacks before the open operation is initiated. (deleted) — Details — Splinter Review

Comment 391

•

11 years ago

Comment on attachment 771477 [details] [diff] [review] Don't actually start opening the socket until the next tick, to give callers of this function a chance to set callbacks before the open operation is initiated. Theoretically bug 885982 should give us the same benefit as this patch, so it might make more sense to just disable the test on OS X for now.

Comment 392

•

11 years ago

I'm ok with disabling the test for now. Will there be a time when it can be turned on again, though? When bug 885982 lands maybe?

Comment 393

•

11 years ago

Yes, that would be a good point at which to try re-enabling it on OS X again.

Comment hidden (Legacy TBPL/Treeherder Robot)

Ed Morley [:emorley]

Updated

•

11 years ago

Summary: Intermittent "test_tcpsocket.js | command timed out: 1200 seconds without output, attempting to kill | The test should never take this long unless the system is hosed", followed by "head.js | exception thrown from do_timeout callback: 2147500036" → Intermittent dom/network/tests/unit/test_tcpsocket.js | Test timed out

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment 413

•

11 years ago

Attached patch Patch to disable the intermittently failing test on OS X. (deleted) — Details — Splinter Review

In case attachment 771477 [details] [diff] [review] is not the right fix, here is a patch which disables this test on OS X.

Updated

•

11 years ago

Attachment #771477 - Flags: review?(honzab.moz)

Comment 414

•

11 years ago

Honza, I believe attachment 771477 [details] [diff] [review] is the correct fix for this, but am not 100% sure. Can you take a look at it and see if it makes sense, or suggest someone else who can review it if you are too busy? Thanks.

Comment hidden (Legacy TBPL/Treeherder Robot)

Honza Bambas (:mayhemer)

Comment 421

•

11 years ago

Comment on attachment 771477 [details] [diff] [review] Don't actually start opening the socket until the next tick, to give callers of this function a chance to set callbacks before the open operation is initiated. Review of attachment 771477 [details] [diff] [review]: ----------------------------------------------------------------- I don't believe this is going to help. What you are adding a delay for is just asyncWait that will for sure trigger not sooner then until the next thread loop (cannot happen sooner the content script is done, so listener will for sure add them self before the socket starts doing anything) and asynccopier for the mux stream that has nothing to do with callbacks. r+ just to let you try whether this helps or not. ::: dom/network/src/TCPSocket.js @@ +54,5 @@ > +function next_turn(func) { > + Timer.createInstance( > + nsITimer > + ).initWithCallback({notify: func}, 0, TYPE_ONE_SHOT); > +} Use dispatch to current thread. Get inspired with: http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/testing/xpcshell/head.js#l435 but since this patch is not going to fix the issue probably, just land it as it is to confirm.

Attachment #771477 - Flags: review?(honzab.moz) → review+

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment 455

•

11 years ago

Donovan, any updates here? Should we just land the disabling patch?

status-firefox24: --- → affected

status-firefox25: --- → affected

status-firefox26: --- → affected

Flags: needinfo?(dpreston)

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment 457

•

11 years ago

Yeah, I think Honza is right in comment 421 that my fix won't work. Landing the patch to just disable the test is the right way forward for now I think.

Flags: needinfo?(dpreston)

Updated

•

11 years ago

Keywords: checkin-needed

Comment hidden (Legacy TBPL/Treeherder Robot)

https://hg.mozilla.org/integration/mozilla-inbound/rev/80c3c49c015e

Comment 459

•

11 years ago

Keywords: checkin-needed

Whiteboard: [leave open] → [test disabled on OSX][leave open]

https://hg.mozilla.org/mozilla-central/rev/80c3c49c015e

Comment 460

•

11 years ago

Comment hidden (Legacy TBPL/Treeherder Robot)

https://hg.mozilla.org/releases/mozilla-beta/rev/a80633da527f https://hg.mozilla.org/releases/mozilla-esr24/rev/33b86ec2907b

Comment 465

•

11 years ago

Patrick McManus [:mcmanus]

Updated

•

11 years ago

Assignee: josh → nobody

Updated

•

9 years ago

Whiteboard: [test disabled on OSX][leave open] → [test disabled on OSX][leave open][necko-backlog]

BMO Automation

Comment 466

•

8 years ago

Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.

Priority: -- → P3