Closed Bug 1225248 Opened 9 years ago Closed 9 years ago

Failed to establish Cisco Spark call with Nightly if e10s is enabled

Status:

RESOLVED WORKSFORME

Blocking Flags:

backlog

webrtc/webaudio+

People

(Reporter: hankpeng, Unassigned)

References

Details

(Keywords: regressionwindow-wanted)

Attachments

(4 files)

spark-call-failed-with-e10s-on.zip 9 years ago Hank Peng (deleted), application/x-zip-compressed		Details
e10s-spark-call-failure-aboutWebrtc.html 9 years ago Hank Peng (deleted), text/html		Details
e10s-off-spark-call-success-aboutWebrtc.html 9 years ago Hank Peng (deleted), text/html		Details
BenhamLog.txt 9 years ago Hank Peng (deleted), text/plain		Details

Hank Peng

Reporter

Description

•

9 years ago

No such issue for Nightly if e10s is disabled. Dev Edition with e10s on works well.

Hank Peng

Reporter

Updated

•

9 years ago

Depends on: e10s

Hank Peng

Reporter

Comment 1

•

9 years ago

Attached file spark-call-failed-with-e10s-on.zip (deleted) — Details

Eric Rescorla (:ekr)

Comment 2

•

9 years ago

When did this start? I did a call with Nightly and E10S on Saturday and it worked fine.

Hank Peng

Reporter

Comment 3

•

9 years ago

(In reply to Eric Rescorla (:ekr) from comment #2)
> When did this start? I did a call with Nightly and E10S on Saturday and it
> worked fine.

I first ran into this issue last Thursday, 11/12. It is 100% reproducible on my Windows 10 and Mac OS X 10.10.2 machines.

Randell Jesup [:jesup] (needinfo me)

Comment 4

•

9 years ago

I tried it with Win7 Nightly (11/16) Firefox 45 with e10s on, and had no problems making spark calls (it happened to be using the 1.5.1 OpenH264 build of course).

Maire Reavy [:mreavy]

Comment 5

•

9 years ago

I also can't repro (Win7 and OSX 10.9).  I'm going to make this a P1 until we know more.

Hank -- can you send us the about:webrtc logs from a failed call?  Do you know if anyone else at Cisco is having problems making Spark calls with Nightly?

backlog: --- → webrtc/webaudio+

Rank: 10

Priority: -- → P1

Maire Reavy [:mreavy]

Updated

•

9 years ago

Flags: needinfo?(hankpeng)

Maire Reavy [:mreavy]

Comment 6

•

9 years ago

Correction: I was using OSX 10.10.5 and had no problems calling or connecting.

Hi Hank -- Thus far, you're the only person who can repro this problem that I know of.  Could you help me identify a regression window?  You said that you first noticed this on Nov 12th.  Can you try Nightly from Nov 9th and again from Nov 4th and see if both of those also fail under e10s?

You can find Nightly builds from early November in https://ftp.mozilla.org/pub/firefox/nightly/2015/11/

Also, can you capture the info from "about:webrtc" when a call fails and upload it to this bug?

Thanks!

Ethan Hugg [:ehugg]

Comment 7

•

9 years ago

Hank - also check your about:config to see if you have any non-default config settings.

Hank Peng

Reporter

Comment 8

•

9 years ago

Ethan, I should be using the default config settings. I just cleared the existing profiles, downloaded and installed the latest Nightly. The issue is still there. 
Maire, I'm not sure whether others at Cisco have hit this problem so far. I will dig deeper to see what is going wrong in WebRTC or the Spark client. I'd also like to find the regression window, if it is a regression.

Flags: needinfo?(hankpeng)

Hank Peng

Reporter

Comment 9

•

9 years ago

Attached file e10s-spark-call-failure-aboutWebrtc.html (deleted) — Details

Please check the attached about:webrtc info. At the end it says: All pairs are failed, and grace period has elapsed. Marking component as failed.

Hank Peng

Reporter

Comment 10

•

9 years ago

Attached file e10s-off-spark-call-success-aboutWebrtc.html (deleted) — Details

Here is another about:webrtc capture, this time with e10s off. The ICE connections succeed, so the call can be established.

Mike Conley (:mconley) (:⚙️)

Comment 11

•

9 years ago

Hey Hank,

Are you familiar with mozregression[1]? That might help you drill down to the first build (and maybe even the push to inbound) that caused this regression for you.

[1]: http://mozilla.github.io/mozregression/
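For readers unfamiliar with the idea, the search that a regression-hunting tool automates is essentially a bisection over build dates. The following is only a minimal sketch of that idea, not mozregression's actual implementation. The function name `first_bad_build` and the `is_bad` callback (which would stand in for "install that day's Nightly and try a call") are hypothetical, and it assumes the known-good date really is good, the known-bad date really is bad, and every build after the first bad one is also bad.

```python
from datetime import date, timedelta


def first_bad_build(good, bad, is_bad):
    """Bisect between a known-good and a known-bad build date.

    `is_bad` is a hypothetical callback that reports whether the build
    from a given date shows the problem. Returns the earliest bad date.
    """
    while (bad - good).days > 1:
        # Test the build halfway between the current bounds.
        mid = good + timedelta(days=(bad - good).days // 2)
        if is_bad(mid):
            bad = mid   # the regression is at or before mid
        else:
            good = mid  # the regression is after mid
    return bad


if __name__ == "__main__":
    # Pretend (purely for illustration) the problem first appeared on Nov 7.
    pretend_regression = date(2015, 11, 7)
    found = first_bad_build(
        date(2015, 11, 4), date(2015, 11, 12),
        lambda d: d >= pretend_regression,
    )
    print(found)
```

Each round halves the window, so a range of about a week needs only three test builds.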

Flags: needinfo?(hankpeng)

Keywords: regressionwindow-wanted

Hank Peng

Reporter

Comment 12

•

9 years ago

This does not seem to be a recent regression on my Windows 10 machine. I tried the Nov 9th and Nov 4th builds and even some October builds, and got the same result.

Flags: needinfo?(hankpeng)

Hank Peng

Reporter

Comment 13

•

9 years ago

This looks more like a configuration or environment issue than a regression. I removed all the profiles on my Mac (OS X) and reinstalled Nightly. Now the problem is gone, even on the Nov. 12 build. But it still happens on Windows, even after I performed the same steps. Is there any other settings storage besides the profile on Windows?
I asked Sijia (sijchen) to help with more testing. On her Mac, ICE at first always failed whether e10s was on or off. She switched her WiFi connection to another AP, and then it worked with e10s both on and off. After that she switched back to the original AP, and the call could then be established.

Hank Peng

Reporter

Comment 14

•

9 years ago

Now I am pretty sure that the problem is caused by my desktop or network environment. After I switch to another network connection, the issue is completely gone. If I switch back to the problematic network, the problem is 100% reproducible.
I'd like to set the bug status to RESOLVED/INVALID. Is that OK, Maire?

Flags: needinfo?(mreavy)

Ethan Hugg [:ehugg]

Comment 15

•

9 years ago

@Hank any chance you were connecting IPV6 on the failed network connection, and IPV4 on the successful one?

Hank Peng

Reporter

Comment 16

•

9 years ago

(In reply to Ethan Hugg [:ehugg] from comment #15)
> @Hank any chance you were connecting IPV6 on the failed network connection,
> and IPV4 on the successful one?

Hi Ethan, no, it was the IPv4 connectivity check that failed. You can check the file attached in comment #9. At present, the Spark media server (Linus) only provides an IPv4 address for the connection in the SDP answer.

Maire Reavy [:mreavy]

Comment 17

•

9 years ago

Thanks for digging into this, Hank.  Before we change the status (e.g. mark this resolved), I'd like to see if we can understand why it's failing in that environment -- because until we know, we don't know if other users can hit this.

Nils -- Can you look at the logs and see if there's any clue why this is failing?  I know you're busy, but I think it's worth spending part of a day to see if there's something we're doing wrong.  Thanks.

Flags: needinfo?(mreavy) → needinfo?(drno)

Nils Ohlmeier [:drno]

Comment 18

•

9 years ago

So I looked at all the logs Hank provided. From what we have here, it looks like Firefox did start sending ICE check requests, but that's about it. It looks like ICE failed because it timed out.

I'm not sure if I should mention it, but it looks a little bit like the call problem we encounter here from the Mozilla Mountain View office, although that comparison is probably far-fetched. Too bad it is no longer reproducible.

Status: NEW → RESOLVED

Closed: 9 years ago

Flags: needinfo?(drno)

Resolution: --- → WORKSFORME

Hank Peng

Reporter

Comment 19

•

9 years ago

Attached file BenhamLog.txt (deleted) — Details

David Benham from Cisco couldn't join a Spark call this afternoon. He tried 3 times but ICE always failed. He is using FF42.0 so e10s should be off. Please check the log.

Flags: needinfo?(drno)

Nils Ohlmeier [:drno]

Comment 20

•

9 years ago

From looking at the log, this seems to have failed because of a STUN timeout at the ICE layer.

(stun/INFO) STUN-CLIENT(nYc+|IP4:171.68.20.41:59660/UDP|IP4:173.37.38.78:33434/UDP(host(IP4:171.68.20.41:59660/UDP)|candidate:0 1 UDP 2130706431 173.37.38.78 33434 typ host)): Timed out

Because Linus doesn't support rtcp-mux, each call actually requires two ICE streams, one for RTP and the second for RTCP. In this case it looks like the first ICE stream for RTP actually connected successfully, but for the second ICE stream for RTCP, Firefox never got a response from Linus to its STUN requests. At least that would match what I have occasionally seen happening with Linus.
I would highly recommend checking the server-side log file for the above IP address and port combination.

Just in case it helps with finding something in the server-side logs, the ICE pair that succeeded was:

(ice/INFO) ICE-PEER(PC:1449005420019000 (id=3950 url=https://web.ciscospark.com/#/rooms/ece430b0-4469-11e4-bfad-c6100ebf04af):default)/CAND-PAIR(PbU0): setting pair to state SUCCEEDED: PbU0|IP4:171.68.20.41:59656/UDP|IP4:173.37.38.78:33434/UDP(host(IP4:171.68.20.41:59656/UDP)|candidate:0 1 UDP 2130706431 173.37.38.78 33434 typ host)
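When matching these entries against server-side logs, it can help to pull the local and remote address:port out of each line mechanically. Below is a rough sketch that does this for the two log lines quoted above. The regular expressions are assumptions derived only from those two lines (for example, that the pair label is four characters and that addresses are IPv4 over UDP), not a documented log format, and the helper name `summarize` is hypothetical.

```python
import re

# Assumed shape: <4-char label>|IP4:<local>/UDP|IP4:<remote>/UDP
PAIR_RE = re.compile(
    r"(?P<label>[A-Za-z0-9+/]{4})\|"
    r"IP4:(?P<local>[\d.]+:\d+)/UDP\|"
    r"IP4:(?P<remote>[\d.]+:\d+)/UDP"
)
STATE_RE = re.compile(r"setting pair to state (?P<state>[A-Z_]+)")

TIMED_OUT_LINE = (
    "(stun/INFO) STUN-CLIENT(nYc+|IP4:171.68.20.41:59660/UDP|"
    "IP4:173.37.38.78:33434/UDP(host(IP4:171.68.20.41:59660/UDP)|"
    "candidate:0 1 UDP 2130706431 173.37.38.78 33434 typ host)): Timed out"
)
SUCCEEDED_LINE = (
    "(ice/INFO) ICE-PEER(PC:1449005420019000 (id=3950 "
    "url=https://web.ciscospark.com/#/rooms/"
    "ece430b0-4469-11e4-bfad-c6100ebf04af):default)/CAND-PAIR(PbU0): "
    "setting pair to state SUCCEEDED: PbU0|IP4:171.68.20.41:59656/UDP|"
    "IP4:173.37.38.78:33434/UDP(host(IP4:171.68.20.41:59656/UDP)|"
    "candidate:0 1 UDP 2130706431 173.37.38.78 33434 typ host)"
)


def summarize(line):
    """Return label, local, remote and outcome for one log line, or None."""
    pair = PAIR_RE.search(line)
    if pair is None:
        return None
    state = STATE_RE.search(line)
    if state:
        outcome = state.group("state")
    elif line.rstrip().endswith("Timed out"):
        outcome = "TIMED_OUT"
    else:
        outcome = "UNKNOWN"
    return {
        "label": pair.group("label"),
        "local": pair.group("local"),
        "remote": pair.group("remote"),
        "outcome": outcome,
    }


if __name__ == "__main__":
    for log_line in (TIMED_OUT_LINE, SUCCEEDED_LINE):
        print(summarize(log_line))
```

Both entries share the remote address 173.37.38.78:33434 and differ only in the local port (59660 timed out, 59656 succeeded), which is the combination to search for on the server side.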

Flags: needinfo?(drno)

Hank Peng

Reporter

Comment 21

•

9 years ago

Linus does support rtcp-mux. The rtcp-mux feature should be enabled by default in a Spark call. Anyway, I will ask the linus server team to double check.

Hank Peng

Reporter

Comment 22

•

9 years ago

Linus does support rtcp-mux. The rtcp-mux feature should be enabled by default in a Spark call. Anyway, I will ask the linus server team to double check.

Nils Ohlmeier [:drno]

Comment 23

•

9 years ago

(In reply to Hank Peng from comment #22)
> Linus does support rtcp-mux. The rtcp-mux feature should be enabled by
> default in a Spark call. Anyway, I will ask the linus server team to double
> check.

Yes, sorry, you are right. Linus does rtcp-mux, but it does not do bundle for RTP. So the two ICE streams I was referring to earlier are actually one for audio and one for video. The conclusion from the log is still the same: one stream succeeded and the second ICE stream timed out (probably audio succeeded and video failed, but that is only a guess).

Hank Peng

Reporter

Comment 24

•

9 years ago

Got the reply from linus:
a few things of note...

1) firefox has changed their SDP a bit and it is causing linus to not advertise bundle. we used to do bundle, but no longer do with FF42. linus wants to see the ICE caps on the audio and video m-lines be the same, but with FF42 they are different (the port numbers are different).

2) the ICE candidates for firefox are a bit strange in that the RTCP port numbers are not RTP+1. the order is IPv6 RTP, IPv4 RTP, IPv6 RTCP, IPv4 RTCP. doesn't make any difference in our use case as rtcp-mux is enabled, but a bit strange.

3) ICE negotiation on the video m-line fails for this call. ICE on the audio m-line seems ok.

note that there is no NAT in this scenario so there are no peer reflexive candidates that must be learned. linus sends a binding request to firefox on the video port but does not receive a binding response. similarly linus does not receive any binding requests from firefox on the video session.

note that linus is operating in single port mode so the remote port (from firefox's point of view) is the same for both audio and video. perhaps that is triggering some strange behavior?
--
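The first two observations above can be checked mechanically against an offer. The sketch below is only an illustration under stated assumptions: it reads "ICE caps" as the `a=candidate` lines of each m-section, and the SDP fragment (addresses from the 192.0.2.0/24 documentation range, made-up ports) and the helper names are hypothetical rather than taken from the failing call.

```python
SAMPLE_SDP = """\
v=0
m=audio 50000 RTP/SAVPF 109
a=rtcp-mux
a=candidate:0 1 UDP 2130706431 192.0.2.10 50000 typ host
a=candidate:0 2 UDP 2130706430 192.0.2.10 50002 typ host
m=video 50004 RTP/SAVPF 126
a=rtcp-mux
a=candidate:0 1 UDP 2130706431 192.0.2.10 50004 typ host
a=candidate:0 2 UDP 2130706430 192.0.2.10 50006 typ host
"""


def media_sections(sdp):
    """Collect rtcp-mux and (component, address, port) per m-section."""
    sections = []
    current = None
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            current = {"kind": line[2:].split()[0],
                       "rtcp_mux": False, "candidates": []}
            sections.append(current)
        elif current is None:
            continue  # session-level line before the first m-section
        elif line == "a=rtcp-mux":
            current["rtcp_mux"] = True
        elif line.startswith("a=candidate:"):
            # foundation component transport priority address port typ type
            fields = line[len("a=candidate:"):].split()
            current["candidates"].append(
                (int(fields[1]), fields[4], int(fields[5])))
    return sections


def candidates_match(sections):
    """True if every m-section lists the same set of candidates."""
    sets = [set(s["candidates"]) for s in sections]
    return all(s == sets[0] for s in sets[1:])


def rtcp_is_rtp_plus_one(section):
    """True if each RTCP (component 2) port is the RTP port plus one."""
    rtp = {addr: port for comp, addr, port in section["candidates"] if comp == 1}
    rtcp = {addr: port for comp, addr, port in section["candidates"] if comp == 2}
    return all(rtcp.get(addr) == port + 1 for addr, port in rtp.items())


if __name__ == "__main__":
    parsed = media_sections(SAMPLE_SDP)
    print([s["kind"] for s in parsed], candidates_match(parsed))
    print([rtcp_is_rtp_plus_one(s) for s in parsed])
```

With this made-up offer, the audio and video sections advertise different ports, so `candidates_match` returns False, and the RTCP ports are not RTP+1.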

Flags: needinfo?(drno)

Hank Peng

Reporter

Comment 25

•

9 years ago

More from Nathan:
Thach took a deeper look and pointed out a couple places that my analysis was incorrect.

linus is receiving STUN binding requests on the video port but is not responding to them b/c they are aggressive nomination and linus has not yet received a binding response. linus has sent a binding request on the video port but never receives a response.

so the root cause appears to be the same, linus does not receive a binding response from firefox, but hopefully the extra information will be useful.

--

Nils Ohlmeier [:drno]

Comment 26

•

9 years ago

Let's keep this original bug closed and move the discussion of the STUN problem over to bug 1231039.

Flags: needinfo?(drno)

Nils Ohlmeier [:drno]

Comment 27

•

9 years ago

(In reply to Hank Peng from comment #24)
> Got the reply from linus:
> a few things of note...
> 
> 1) firefox has changed their SDP a bit and it is causing linus to not
> advertise bundle. we used to do bundle, but no longer do with FF42. linus
> wants to see the ICE caps on the audio and video m-lines be the same, but
> with FF42 they are different (the port numbers are different).
> 
> 2) the ICE candidates for firefox are a bit strange in that the RTCP port
> numbers are not RTP+1. the order is IPv6 RTP, IPv4 RTP, IPv6 RTCP, IPv4
> RTCP. doesn't make any difference in our use case as rtcp-mux is enabled,
> but a bit strange.
> 
> 3) ICE negotiation on the video m-line fails for this call. ICE on the audio
> m-line seems ok.
> 
> note that there is no NAT in this scenario so there are no peer reflexive
> candidates that must be learned. linus sends a binding request to firefox on
> the video port but does not receive a binding response. similarly linus does
> not receive any binding requests from firefox on the video session.
> 
> note that linus is operating in single port mode so the remote port (from
> firefox's point of view) is the same for both audio and video. perhaps that
> is triggering some strange behavior?
> --

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1231039
