Closed Bug 1713300 Opened 3 years ago Closed 3 years ago

Permafailing tier 2 dom/media/webrtc/tests/mochitests/test_peerConnection_gatherWithStun300.html | Gathering did not complete within 2000 ms with redirect rule: => iceServers: [{"urls":["turn:10.7.204.44:3479?transport=tcp"],"username":"foo","credential"

Categories

(Core :: WebRTC: Audio/Video, defect, P5)

defect

Tracking

()

RESOLVED FIXED
91 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox89 --- unaffected
firefox90 --- fixed
firefox91 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: bwc)

References

(Regression)

Details

(Keywords: intermittent-failure, regression)

Attachments

(2 files)

Filed by: ncsoregi [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=341046795&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/d1SecatzS0uUD_Kchb4ihg/runs/0/artifacts/public/logs/live_backing.log


[task 2021-05-28T09:51:15.028Z] 09:51:15     INFO - ICE Server: 
[task 2021-05-28T09:51:15.034Z] 09:51:15     INFO - ICE Server: Unhandled Error
[task 2021-05-28T09:51:15.040Z] 09:51:15     INFO - ICE Server: Traceback (most recent call last):
[task 2021-05-28T09:51:15.046Z] 09:51:15     INFO - ICE Server:   File "iceserver\iceserver.py", line 999, in <module>
[task 2021-05-28T09:51:15.052Z] 09:51:15     INFO - ICE Server:     reactor.run()
[task 2021-05-28T09:51:15.060Z] 09:51:15     INFO - ICE Server:   File "C:\tasks\task_1622189839\build\venv\lib\site-packages\twisted\internet\base.py", line 1283, in run
[task 2021-05-28T09:51:15.066Z] 09:51:15     INFO - ICE Server:     self.mainLoop()
[task 2021-05-28T09:51:15.075Z] 09:51:15     INFO - ICE Server:   File "C:\tasks\task_1622189839\build\venv\lib\site-packages\twisted\internet\base.py", line 1295, in mainLoop
[task 2021-05-28T09:51:15.082Z] 09:51:15     INFO - ICE Server:     self.doIteration(t)
[task 2021-05-28T09:51:15.090Z] 09:51:15     INFO - ICE Server:   File "C:\tasks\task_1622189839\build\venv\lib\site-packages\twisted\internet\selectreactor.py", line 143, in doSelect
[task 2021-05-28T09:51:15.097Z] 09:51:15     INFO - ICE Server:     _logrun(selectable, _drdw, selectable, method)
[task 2021-05-28T09:51:15.103Z] 09:51:15     INFO - ICE Server: --- <exception caught here> ---
[task 2021-05-28T09:51:15.111Z] 09:51:15     INFO - ICE Server:   File "C:\tasks\task_1622189839\build\venv\lib\site-packages\twisted\python\log.py", line 103, in callWithLogger
[task 2021-05-28T09:51:15.117Z] 09:51:15     INFO - ICE Server:     return callWithContext({"system": lp}, func, *args, **kw)
[task 2021-05-28T09:51:15.125Z] 09:51:15     INFO - ICE Server:   File "C:\tasks\task_1622189839\build\venv\lib\site-packages\twisted\python\log.py", line 86, in callWithContext
[task 2021-05-28T09:51:15.132Z] 09:51:15     INFO - ICE Server:     return context.call({ILogContext: newCtx}, func, *args, **kw)
[task 2021-05-28T09:51:15.140Z] 09:51:15     INFO - ICE Server:   File "C:\tasks\task_1622189839\build\venv\lib\site-packages\twisted\python\context.py", line 122, in callWithContext
[task 2021-05-28T09:51:15.147Z] 09:51:15     INFO - ICE Server:     return self.currentContext().callWithContext(ctx, func, *args, **kw)
[task 2021-05-28T09:51:15.155Z] 09:51:15     INFO - ICE Server:   File "C:\tasks\task_1622189839\build\venv\lib\site-packages\twisted\python\context.py", line 85, in callWithContext
[task 2021-05-28T09:51:15.161Z] 09:51:15     INFO - ICE Server:     return func(*args,**kw)
[task 2021-05-28T09:51:15.169Z] 09:51:15     INFO - ICE Server:   File "C:\tasks\task_1622189839\build\venv\lib\site-packages\twisted\internet\selectreactor.py", line 154, in _doReadOrWrite
[task 2021-05-28T09:51:15.176Z] 09:51:15     INFO - ICE Server:     self._disconnectSelectable(selectable, why, method=="doRead")
[task 2021-05-28T09:51:15.185Z] 09:51:15     INFO - ICE Server:   File "C:\tasks\task_1622189839\build\venv\lib\site-packages\twisted\internet\posixbase.py", line 252, in _disconnectSelectable
[task 2021-05-28T09:51:15.190Z] 09:51:15     INFO - Buffered messages logged at 09:51:14
[task 2021-05-28T09:51:15.196Z] 09:51:15     INFO - ICE Server:     selectable.readConnectionLost(f)
[task 2021-05-28T09:51:15.204Z] 09:51:15     INFO - ICE Server:   File "C:\tasks\task_1622189839\build\venv\lib\site-packages\twisted\internet\tcp.py", line 307, in readConnectionLost
[task 2021-05-28T09:51:15.210Z] 09:51:15     INFO - ICE Server:     self.connectionLost(reason)
[task 2021-05-28T09:51:15.218Z] 09:51:15     INFO - ICE Server:   File "C:\tasks\task_1622189839\build\venv\lib\site-packages\twisted\internet\tcp.py", line 327, in connectionLost
[task 2021-05-28T09:51:15.224Z] 09:51:15     INFO - ICE Server:     protocol.connectionLost(reason)
[task 2021-05-28T09:51:15.231Z] 09:51:15     INFO - ICE Server: builtins.AttributeError: 'TcpStunRedirectHandler' object has no attribute 'connectionLost'
[task 2021-05-28T09:51:15.237Z] 09:51:15     INFO - ICE Server: 
[task 2021-05-28T09:51:15.242Z] 09:51:15     INFO - Buffered messages finished
[task 2021-05-28T09:51:15.255Z] 09:51:15     INFO - TEST-UNEXPECTED-FAIL | dom/media/webrtc/tests/mochitests/test_peerConnection_gatherWithStun300.html | Gathering did not complete within 2000 ms with redirect rule: => iceServers: [{"urls":["turn:10.7.204.44:3479?transport=tcp"],"username":"foo","credential":"bar"}] 
[task 2021-05-28T09:51:15.256Z] 09:51:15     INFO -     SimpleTest.ok@SimpleTest/SimpleTest.js:417:16
[task 2021-05-28T09:51:15.257Z] 09:51:15     INFO -     gatherWithTimeout@dom/media/webrtc/tests/mochitests/iceTestUtils.js:51:7
[task 2021-05-28T09:51:15.262Z] 09:51:15     INFO - candidates: []
[task 2021-05-28T09:51:15.268Z] 09:51:15     INFO - Not taking screenshot here: see the one that was previously logged
[task 2021-05-28T09:51:15.282Z] 09:51:15     INFO - TEST-UNEXPECTED-FAIL | dom/media/webrtc/tests/mochitests/test_peerConnection_gatherWithStun300.html | Should have two relay candidates with redirect rule: => iceServers: [{"urls":["turn:10.7.204.44:3479?transport=tcp"],"username":"foo","credential":"bar"}] - got +0, expected 2
[task 2021-05-28T09:51:15.283Z] 09:51:15     INFO -     SimpleTest.is@SimpleTest/SimpleTest.js:500:14
[task 2021-05-28T09:51:15.284Z] 09:51:15     INFO -     checkRelayTcp@dom/media/webrtc/tests/mochitests/iceTestUtils.js:164:5
[task 2021-05-28T09:51:15.292Z] 09:51:15     INFO - GECKO(7004) | [Child 14792: Main Thread]: E/signaling [main|PeerConnectionImpl] PeerConnectionImpl.cpp:2778: GetStats: Found no pipelines matching selector.
[task 2021-05-28T09:51:15.302Z] 09:51:15     INFO - TEST-PASS | dom/media/webrtc/tests/mochitests/test_peerConnection_gatherWithStun300.html | No UDP relay candidates should be present with redirect rule: => iceServers: [{"urls":["turn:10.7.204.44:3479?transport=tcp"],"username":"foo","credential":"bar"}] 

Caused by Bug 857668.
Byron, could you help us out with this one?
Thank you.

Flags: needinfo?(docfaraday)

At first blush, it looks like the machine this is running on is simply too slow to run the test case in the time allotted. I'll look into it today.

Assignee: nobody → docfaraday
Flags: needinfo?(docfaraday)

I feared 2000 ms was too aggressive.

Set release status flags based on info from the regressing bug 857668

Every time I try to run test-windows10-aarch64/opt-mochitest-media-gli-e10s on try, it gets an exception. Trying the base revision to see if somehow the fixes I'm trying are the cause.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=5add3afa72214cef193a9f690df6732e76c7cb6a

Edit: mochitest job has been queued for 1413 minutes. What's going on here?

Flags: needinfo?(ahal)

I can't fix this if I can't run our test suite at all on try.

Keywords: stalled

I think that Windows aarch64 platform is very limited and it looks like in this case there weren't any workers available to claim the task before the task deadline was exceeded. Actually looks like this pool is labeled "experimental" and consists of only a single laptop (one of the two workers listed there is quarantined)?

Is this pool try only or something? I'm not really sure how a tier-2 task could be running on a pool with a single machine... But yeah, basically there's no capacity in this pool.

Flags: needinfo?(ahal)

So, I guess the thing to do is disable this test entirely on Windows aarch64 until we have adequate resources for try. Is there a tracking bug I can hook things off of?

Flags: needinfo?(ahal)

I guess I could also extend the timeout and see if that helps on m-c, before taking the step of disabling entirely.

Keywords: stalledleave-open

Ok. Let's land the patch in comment 15, and wait to see how that effects the failure rate. We might consider extending the timeout some more if it continues to occur, but too much and our ability to detect regressions will be compromised. We'll leave this bug open until we determine that the problem is fixed, or we determine that we aren't willing to relax timeouts any further and disable the test.

I'm not aware of any tracking bug for this pool. With so few machines in this pool, it's likely we need to be very intentional about what we run on them, so if you don't see any particular value running these tests with Windows aarch64, then removing these tasks might be the best way forward.

Flags: needinfo?(ahal)

(In reply to Andrew Halberstadt [:ahal] from comment #17)

I'm not aware of any tracking bug for this pool. With so few machines in this pool, it's likely we need to be very intentional about what we run on them, so if you don't see any particular value running these tests with Windows aarch64, then removing these tasks might be the best way forward.

I'm just looking for a bug to make a dependency of "re-enable this test on aarch64". Is there a "promote aarch64 to tier 1" bug anywhere?

Pushed by bcampen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/23f53e272191 Extend this timeout to give Windows aarch64 more time. r=jib

Relaxing the timeout does not seem to have stopped the failures, but they seem to be happening in later, more complicated subtests (the ones that do both UDP and TCP). Maybe the thing to do is selectively relax the timeouts on aarch64?

Pushed by archaeopteryx@coole-files.de: https://hg.mozilla.org/integration/autoland/rev/72de1a61b749 Extend this timeout even more to give Windows aarch64 more time. r=jib

Comment on attachment 9225197 [details]
Bug 1713300: Extend this timeout to give Windows aarch64 more time. r?jib

Beta/Release Uplift Approval Request

  • User impact if declined: Frequent oranges.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Just a test change, and a simple one at that.
  • String changes made/needed: None
Attachment #9225197 - Flags: approval-mozilla-beta?
Attachment #9226916 - Flags: approval-mozilla-beta?

Seems to only be happening on m-b now. This platform is just really slow it seems.

Does this bug need to stay open at this point, or should we close it as fixed?

Flags: needinfo?(docfaraday)
Whiteboard: [checkin-needed-beta]

Comment on attachment 9225197 [details]
Bug 1713300: Extend this timeout to give Windows aarch64 more time. r?jib

This can land a=test-only

Attachment #9225197 - Flags: approval-mozilla-beta?
Attachment #9226916 - Flags: approval-mozilla-beta?

Let's leave this open another week to see if the oranges are fully gone. Leaving ni on self.

Yep, this looks fixed to me.

Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(docfaraday)
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → 91 Branch
Has Regression Range: --- → yes
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: