Closed Bug 1225248 Opened 9 years ago Closed 9 years ago

Failed to establish Cisco Spark call with Nightly if e10s is enabled

Categories

(Core :: WebRTC, defect, P1)

RESOLVED WORKSFORME

People

(Reporter: hankpeng, Unassigned)

References

Details

(Keywords: regressionwindow-wanted)

Attachments

(4 files)

No such issue for Nightly if e10s is disabled. Dev Edition with e10s on works well.
Depends on: e10s
Attached file spark-call-failed-with-e10s-on.zip (deleted)
When did this start? I did a call with Nightly and E10S on Saturday and it worked fine.
(In reply to Eric Rescorla (:ekr) from comment #2)
> When did this start? I did a call with Nightly and E10S on Saturday and it
> worked fine.

The first time I hit this issue was last Thursday, 11/12. It is 100% reproducible on both my Windows 10 and Mac OS X 10.10.2 machines.
I tried it with Win7 Nightly (11/16) Firefox 45 with e10s on, and had no problems making spark calls (it happened to be using the 1.5.1 OpenH264 build of course).
I also can't repro (Win7 and OSX 10.9).  I'm going to make this a P1 until we know more.

Hank -- can you send us the about:webrtc logs from a failed call?  Do you know if anyone else at Cisco is having problems making Spark calls with Nightly?
backlog: --- → webrtc/webaudio+
Rank: 10
Priority: -- → P1
Flags: needinfo?(hankpeng)
Correction: I was using OSX 10.10.5 and had no problems calling or connecting.

Hi Hank -- Thus far, you're the only person who can repro this problem that I know of.  Could you help me identify a regression window?  You said that you first noticed this on Nov 12th.  Can you try Nightly from Nov 9th and again from Nov 4th and see if both of those also fail under e10s?

You can find Nightly builds from early November in https://ftp.mozilla.org/pub/firefox/nightly/2015/11/

Also, can you capture the info from "about:webrtc" when a call fails and upload it to this bug?

Thanks!
Hank - also check your about:config to see if you have any non-default config settings.
Ethan, I should be using the default config settings. I just cleared the existing profiles, downloaded and installed the latest Nightly. The issue is still there. 
Maire, I'm not sure whether anyone else at Cisco has hit this problem so far. I will dig deeper to see what is going wrong in WebRTC or the Spark client. I'd also like to find the regression window, if it is a regression.
Flags: needinfo?(hankpeng)
Please check out the attached about:webrtc info. At the end it says: All pairs are failed, and grace period has elapsed. Marking component as failed.
Here is another about:webrtc capture, this time with e10s off. The ICE connections succeed, so the call can be established.
Hey Hank,

Are you familiar with mozregression[1]? That might help you drill down to the first build (and maybe even the push to inbound) that caused this regression for you.

[1]: http://mozilla.github.io/mozregression/
Flags: needinfo?(hankpeng)
This does not seem to be a recent regression on my Windows 10 machine. I tried the Nov 9th and Nov 4th builds, and even some October builds, with the same result.
Flags: needinfo?(hankpeng)
This problem looks more like a configuration or environment issue than a regression. I removed all the profiles on my Mac (OS X) and reinstalled Nightly. Now the problem is gone, even on the Nov. 12 build. But it still happens on Windows, even after I did the same operations. Is there any other settings storage besides the profile on Windows?
I asked Sijia (sijchen) to help me do more testing. On her Mac (OS X), ICE at first always failed whether e10s was on or off. She changed the WiFi connection to another AP, and then it worked with e10s both on and off. After that she switched back to the original AP, and now the call could be established.
Now I am pretty sure that the problem is caused by my desktop or network environment. After I switch to another network connection, the issue is totally gone. Meanwhile if I switch back to the problematic network, the problem is 100% reproducible. 
I'd like the bug status to be set to RESOLVED/INVALID. Is that OK, Maire?
Flags: needinfo?(mreavy)
@Hank any chance you were connecting IPV6 on the failed network connection, and IPV4 on the successful one?
(In reply to Ethan Hugg [:ehugg] from comment #15)
> @Hank any chance you were connecting IPV6 on the failed network connection,
> and IPV4 on the successful one?

Hi Ethan, nope, it was the IPv4 connectivity check that failed. You can check the file attached in comment #9. At present, the Spark media server (Linus) only provides an IPv4 address for the connection in the SDP answer.
Thanks for digging into this, Hank.  Before we change the status (e.g. mark this resolved), I'd like to see if we can understand why it's failing in that environment -- because until we know, we don't know if other users can hit this.

Nils -- Can you look at the logs and see if there's any clue why this is failing?  I know you're busy, but I think it's worth spending part of a day to see if there's something we're doing wrong.  Thanks.
Flags: needinfo?(mreavy) → needinfo?(drno)
So I looked at all the logs Hank provided. From what we have here, it looks like Firefox did start sending ICE check requests. But that's about it. It looks like ICE failed because it timed out.

I'm not sure if I should mention it, but it looks a little bit like the call problem we encounter here from the Mozilla Mountain View office, although that comparison is probably far-fetched. Too bad it is no longer reproducible.
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(drno)
Resolution: --- → WORKSFORME
Attached file BenhamLog.txt (deleted)
David Benham from Cisco couldn't join a Spark call this afternoon. He tried 3 times but ICE always failed. He is using FF42.0 so e10s should be off. Please check the log.
Flags: needinfo?(drno)
From looking at the log, this seems to have failed because of a STUN timeout at the ICE layer.

(stun/INFO) STUN-CLIENT(nYc+|IP4:171.68.20.41:59660/UDP|IP4:173.37.38.78:33434/UDP(host(IP4:171.68.20.41:59660/UDP)|candidate:0 1 UDP 2130706431 173.37.38.78 33434 typ host)): Timed out

Because Linus doesn't support rtcp-mux, each call actually requires two ICE streams: one for RTP and a second for RTCP. In this case it looks like the first ICE stream, for RTP, actually connected successfully, but on the second ICE stream, for RTCP, Firefox never got a response from Linus to its STUN requests. At least that would match what I have occasionally seen happen with Linus.
I would highly recommend checking the server-side log file for the above IP address and port combination.
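
To make that server-side search easier, the local and remote address/port can be pulled out of the log line mechanically. The following is only a rough sketch: the function name and the regular expression are assumptions based on the single log line quoted above, not anything defined by Firefox or its ICE stack, and other log lines may be shaped differently.

```python
import re

# Matches the "IP4:<local>:<port>/UDP|IP4:<remote>:<port>/UDP" pair as it
# appears in the STUN-CLIENT line quoted above. The pattern is an assumption
# derived from that one line (IPv4 over UDP only).
PAIR_RE = re.compile(
    r"IP4:(?P<local_ip>[\d.]+):(?P<local_port>\d+)/UDP\|"
    r"IP4:(?P<remote_ip>[\d.]+):(?P<remote_port>\d+)/UDP"
)


def extract_pair(log_line):
    """Return (local_ip, local_port, remote_ip, remote_port), or None."""
    match = PAIR_RE.search(log_line)
    if match is None:
        return None
    return (
        match["local_ip"],
        int(match["local_port"]),
        match["remote_ip"],
        int(match["remote_port"]),
    )


line = (
    "(stun/INFO) STUN-CLIENT(nYc+|IP4:171.68.20.41:59660/UDP|"
    "IP4:173.37.38.78:33434/UDP(host(IP4:171.68.20.41:59660/UDP)|"
    "candidate:0 1 UDP 2130706431 173.37.38.78 33434 typ host)): Timed out"
)
print(extract_pair(line))
```

For the timed-out check, this gives local port 59660 and remote port 33434, which are the values to look for on the server.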

Just in case it helps with finding something in the server-side logs, the ICE pair that succeeded was:

(ice/INFO) ICE-PEER(PC:1449005420019000 (id=3950 url=https://web.ciscospark.com/#/rooms/ece430b0-4469-11e4-bfad-c6100ebf04af):default)/CAND-PAIR(PbU0): setting pair to state SUCCEEDED: PbU0|IP4:171.68.20.41:59656/UDP|IP4:173.37.38.78:33434/UDP(host(IP4:171.68.20.41:59656/UDP)|candidate:0 1 UDP 2130706431 173.37.38.78 33434 typ host)
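
As a side note for anyone reading these candidate lines: the priority value 2130706431 can be unpacked using the priority formula from RFC 5245 (type preference in the top bits, then a 16-bit local preference, then 256 minus the component ID). This is a small sketch with a hypothetical helper name, just to show that both quoted candidates are host candidates for component 1.

```python
def decode_priority(priority):
    """Split an ICE candidate priority into its RFC 5245 fields.

    priority = 2^24 * type_pref + 2^8 * local_pref + (256 - component_id)
    """
    type_pref = priority >> 24
    local_pref = (priority >> 8) & 0xFFFF
    component_id = 256 - (priority & 0xFF)
    return type_pref, local_pref, component_id


# Priority taken from the candidate lines quoted above.
print(decode_priority(2130706431))  # (126, 65535, 1)
```

A type preference of 126 is the value RFC 5245 recommends for host candidates, which matches the `typ host` in the log.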
Flags: needinfo?(drno)
Linus does support rtcp-mux. The rtcp-mux feature should be enabled by default in a Spark call. Anyway, I will ask the linus server team to double check.
(In reply to Hank Peng from comment #22)
> Linus does support rtcp-mux. The rtcp-mux feature should be enabled by
> default in a Spark call. Anyway, I will ask the linus server team to double
> check.

Yes, sorry, you are right. Linus does rtcp-mux, but it does not do bundle for RTP. So the two ICE streams I was referring to earlier are actually one for audio and the second for video. The result from looking at the log is still the same: one stream succeeded and the second ICE stream timed out (probably audio succeeded and video failed, but that is only a guess).
Got the reply from linus:
a few things of note...

1) firefox has changed their SDP a bit and it is causing linus to not advertise bundle. we used to do bundle, but no longer do with FF42. linus wants to see the ICE caps on the audio and video m-lines be the same, but with FF42 they are different (the port numbers are different).

2) the ICE candidates for firefox are a bit strange in that the RTCP port numbers are not RTP+1. the order is IPv6 RTP, IPv4 RTP, IPv6 RTCP, IPv4 RTCP. doesn't make any difference in our use case as rtcp-mux is enabled, but a bit strange.

3) ICE negotiation on the video m-line fails for this call. ICE on the audio m-line seems ok.

note that there is no NAT in this scenario so there are no peer reflexive candidates that must be learned. linus sends a binding request to firefox on the video port but does not receive a binding response. similarly linus does not receive any binding requests from firefox on the video session.

note that linus is operating in single port mode so the remote port (from firefox's point of view) is the same for both audio and video. perhaps that is triggering some strange behavior?
--
Flags: needinfo?(drno)
More from Nathan:
Thach took a deeper look and pointed out a couple places that my analysis was incorrect.

linus is receiving STUN binding requests on the video port but is not responding to them b/c they are aggressive nomination and linus has not yet received a binding response. linus has sent a binding request on the video port but never receives a response.

so the root cause appears to be the same, linus does not receive a binding response from firefox, but hopefully the extra information will be useful.

--
Let's keep this original bug closed and move the discussion of the STUN problem over to bug 1231039.
Flags: needinfo?(drno)