Closed Bug 998989 Opened 10 years ago Closed 10 years ago

Loop Client: Upload ICE failure logs

Categories

(Hello (Loop) :: Client, defect, P1)

x86
macOS
defect

Tracking

(Not tracked)

RESOLVED FIXED
mozilla33

People

(Reporter: abr, Assigned: jib)

References

Details

(Whiteboard: [est:3d, p=1.5, 1.5:p2, ft:webrtc])

Attachments

(7 files, 38 obsolete files)

(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
bwc
: review+
Details | Diff | Splinter Review
(deleted), patch
jib
: review+
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
vladan
: review+
Details | Diff | Splinter Review
(deleted), patch
dveditz
: review+
Details | Diff | Splinter Review
(deleted), application/json
Details
The Loop client needs to send ICE failure information to the telemetry servers to assist with in-the-field troubleshooting. The format and means for submission are described here: https://wiki.mozilla.org/Loop/Telemetry#ICE_Failures
Since this is a small amount of work for lots of benefit, I'd like to get this into MLP.
backlog: --- → mlp+
Niko -- Once you're done with Bug 994152 ("Loop needs a 'do not disturb' control"), can you work on this?

I would prioritize Bug 994146 ("Loop client should reset panel when activated") after this bug.  Thanks!
Flags: needinfo?(nperriault)
Priority: -- → P2
Sure, will do.
Flags: needinfo?(nperriault)
Pushed the WIP stuff we poked at this morning (a simple watcher) to <https://github.com/adamroach/gecko-dev/compare/loop-ui-initial...ice-upload-bug-998989?expand=1>.  It compiles; the code hasn't yet been executed.
backlog: mlp+ → ---
Whiteboard: [est:1d] → [est:3d, p=3, 1.5:p2, ft:webrtc]
Target Milestone: --- → mozilla32
Marking this bug as depending on bug 976109 in order to be capable of retrieving a chat window instance once created.
Depends on: 976109
Blocks: 1005175
Depends on: 1008122
Is there any way that this could live in loop-specific code (WebrtcGlobalInformation is chrome-only), and not PeerConnection.js? Because if we do the latter, it becomes messy to scope this just to loop.
Flags: needinfo?(standard8)
Well loop-specific code needs at least something to listen to, and right now no specific event is triggered notifying listeners that a new PC has been created for a given window :)

If you can think of any alternate strategy, please let me know!
I believe Adam & Dan came up with this suggestion, though I believe that what Nicolas says is true (the sdk doesn't give us access to the PCs atm).
Flags: needinfo?(standard8) → needinfo?(adam)
No longer depends on: 976109
Well, if we have the window-id, and are able to access WebrtcGlobalInformation, that should be enough to filter the results that come out, albeit in a kinda hacky substring-matching way. It would not be terribly hard to teach WebrtcGlobalInformation to put the window id in the stats report by itself, either. So how hard is it going to be to get access to the WebrtcGlobalInformation API?
(In reply to Mark Banner (:standard8) from comment #9)
> I believe Adam & Dan came up with this suggestion, though I believe that
> what Nicolas says is true (the sdk doesn't give us access to the PCs atm).

This is correct, except "at the moment" is a bit misleading: the SDK will never surface this. We may get them to add an expando property on the PCs they create, but it's always going to be accessed "from the bottom" (i.e., via PeerConnection.js) rather than "from the top" (using the API).

(In reply to Byron Campen [:bwc] from comment #10)
> Well, if we have the window-id, and are able to access
> WebrtcGlobalInformation, that should be enough to filter the results that
> come out, albeit in a kinda hacky substring-matching way. It would not be
> terribly hard to teach WebrtcGlobalInformation to put the window id in the
> stats report by itself, either. So how hard is it going to be to get access
> to the WebrtcGlobalInformation API?

That's not difficult at all, but it also doesn't solve the problem: what we *need* is some way to find out when a new PC is created instead of having to periodically query. The overall idea here was to make the API as generic as possible ("I want to know whenever a PC is created in window X") rather than making it inherently Loopy. I do see that the current implementation only allows one listener and one Window ID, but that could be fixed with no disruption if and when we find a second user for this feature.
Flags: needinfo?(adam)
Yeah, we could add some chrome-only API that gives you the list of PCs given the window ID to PeerConnection.js, since that's inherently generic and useful. You'll still need WebrtcGlobalInformation for the logging, but it sounds like that's easy.
(In reply to Byron Campen [:bwc] from comment #12)
> Yeah, we could add some chrome-only API that gives you the list of PCs given
> the window ID to PeerConnection.js, since that's inherently generic and
> useful.

You keep talking about this as if it's just getting access to information.

It's not.

It's about getting *notified* when a PC is created.

And *then* getting access to data related to that PC.
Sorry, my wording was unclear; we could definitely do this in PC.js without too much trouble.
After a little more thought, I'm a bit worried about this plan; I'm not sure PC.js lets you register an event handler anywhere other than the winid attached to the peerconnection, so I don't think that there's any viable way to listen for ice connection state changes even if we are given a PeerConnection from another window.

jib, what's your take on this? Am I misunderstanding something here?
Flags: needinfo?(jib)
Maybe what we actually want here is an API that gives you callbacks when a PC is created, _and_ when a PC has closed, with maybe a boolean within PC that indicates whether we think that the call "failed" (for now, this would be set if ICE failed, but in the future we could set in other situations, like media stream errors and the like)? From here, we'd be able to do things like grab the final stats report for that PC, and upload it.
When you say PC.js do you mean in the RTCPeerConnection api, or literally in that file - as in JS vs c++? (we can implement any interface and have the code live in PC.js)

I guess I'm not following, but then I'm not up on what is where in Loop. This sounds like a back-end thing, so a separate API seems cleaner, just from the surface of things I'm hearing here so far, but would love to hear more. I don't know enough about what happens when things are passed between windows, though I imagine you may have compartment. I would ask the DOM guys if you need more on that.
Flags: needinfo?(jib)
s/compartment/compartment-issues/
The plan I'm worried about is to, from a separate window, get a reference to an RTCPeerConnection object, and listen for ice connection state changes on it. I'm pretty sure there's no way to do this even if we have the object, since we're hard-coded to firing this event at the window the RTCPeerConnection lives in right now.

I like the idea of just having a Chrome-Only PC lifecycle observation API, and maybe a boolean in RTCPeerConnection indicating whether we think there was some badness in the call. That would be pretty clean, generically useful, and would give us more flexibility going forward when determining whether we need to upload a report.
(In reply to Byron Campen [:bwc] from comment #19)
> The plan I'm worried about is to, from a separate window, get a reference to
> an RTCPeerConnection object, and listen for ice connection state changes on
> it. I'm pretty sure there's no way to do this even if we have the object,
> since we're hard-coded to firing this event at the window the
> RTCPeerConnection lives in right now.
> 
> I like the idea of just having a Chrome-Only PC lifecycle observation API,
> and maybe a boolean in RTCPeerConnection indicating whether we think there
> was some badness in the call. That would be pretty clean, generically
> useful, and would give us more flexibility going forward when determining
> whether we need to upload a report.

I see the issue you're talking about with event dispatching being bound to the window that the PC is associated with. Interestingly, for the case we're working with, we'll have code running inside that window, although having to work back and forth between the content and chrome halves of our code will be pretty inelegant.

A more general way to find out about call failure events would also solve the issue of having to wrestle with the toolkit for ownership of the "oniceconnectionstatechange" event (we were probably going to have to do something pretty hacky here, like wait for them to install theirs, and then wedge our code in under it), so I like it from that perspective.

Byron: I think you understand this pretty well; would you be in a position to put up a patch that monitors the lifetime in the way you discuss above? Note that we would need literal access to the PC, not just information about its success/failure.
Yeah I could do this, and was planning on giving actual access to the PC, although my mental model did not imply that this access would be absolutely necessary. My plan was to get the stats reports via WebrtcGlobalInformation (since they cannot be gotten from the PC once it is closed, and besides, the PC-based stats callbacks have the same limitation as the ice connection state event ones do). Am I missing something still?
(In reply to Byron Campen [:bwc] from comment #21)
> Am I missing something still?

We'll want the local and remote SDP as well (cf. https://wiki.mozilla.org/Loop/Telemetry). In the future, the toolkit we are using will probably be adding a script-defined property to the PC that identifies the connection for further correlation purposes; e.g.:

> this.connectionId[++i] = createUuid();
> this.peerConnection.ourSecretIdentifierExpandoProperty = this.connectionId[i];

They'll be coordinating with us to agree on the name of this property so that we can go scrape it out and put it in our failure reports.
FWIW, the SDP is in the stats report now.
Ok, there's a (potentially) fatal flaw with waiting until the PC is closed; odds are good that this won't happen until the window itself is closed, which means we'll never be able to fire off an async stats query if we're in the same window (it sounds like we are). I guess we have little choice but to fire callbacks on state changes.
Alright, so as a simple first cut, I'm putting together a single ChromeOnly callback that fires whenever a PC is created, the ICE connection state changes, or the ICE gethering state changes. Basically, a very simple shoulder tap. Patch in a few minutes.
Strawman API. Just a single shoulder-tap callback that fires when an \'important\' change occurs on a PC. We could break this down into a more granular set of callbacks, but we'd need at a minimum a creation callback, an ICE connection state change callback, and an ICE gathering state change callback. Not sure if this is worth breaking down.
Assignee: nobody → docfaraday
Status: NEW → ASSIGNED
Quick and dirty patch to exercise the API. Should be a useful feature too.
Attachment #8425028 - Attachment is obsolete: true
Attachment #8425061 - Attachment is obsolete: true
Comment on attachment 8425063 [details] [diff] [review]
Part 1: ChromeOnly API for getting notifications when PCs are initted, or change ICE connection/gathering state. Also, expose the PC id, and allow getAllStats to be filtered by the same.

Review of attachment 8425063 [details] [diff] [review]:
-----------------------------------------------------------------

Mainly, I want to hear your take on whether to break up the lifecycle callback into more granular callbacks. I could go either way.

::: dom/media/PeerConnection.js
@@ +353,5 @@
>    _initialize: function(rtcConfig) {
>      this._impl.initialize(this._observer, this._win, rtcConfig,
>                            Services.tm.currentThread);
>      this._initIdp();
> +    _globalPCList.notifyLifecycleObservers(this);

This needs to happen after init, since we don't want to pass an uninitted PC to the callback.

@@ +919,5 @@
>    },
>  
>    changeIceGatheringState: function(state) {
>      this._iceGatheringState = state;
> +    _globalPCList.notifyLifecycleObservers(this);

Maybe break this into its own function?

@@ +924,5 @@
>    },
>  
>    changeIceConnectionState: function(state) {
>      this._iceConnectionState = state;
> +    _globalPCList.notifyLifecycleObservers(this);

Same here?

::: dom/webidl/RTCPeerConnection.webidl
@@ +117,5 @@
>    readonly attribute RTCIceConnectionState iceConnectionState;
>    [Pref="media.peerconnection.identity.enabled"]
>    readonly attribute RTCIdentityAssertion? peerIdentity;
>  
> +  readonly attribute DOMString id;

Nice to have for the API in WebrtcGlobalInformation, but could be handy for other uses (is intended to be human-readable).
Attachment #8425063 - Flags: feedback?(adam)
Attachment #8425063 - Flags: feedback?(standard8)
Comment on attachment 8425063 [details] [diff] [review]
Part 1: ChromeOnly API for getting notifications when PCs are initted, or change ICE connection/gathering state. Also, expose the PC id, and allow getAllStats to be filtered by the same.

Feedback provided via IRC
Attachment #8425063 - Flags: feedback?(adam)
Attachment #8425063 - Attachment is obsolete: true
Attachment #8425063 - Flags: feedback?(standard8)
Attachment #8425029 - Attachment is obsolete: true
Comment on attachment 8427304 [details] [diff] [review]
Part 1: ChromeOnly API for getting notifications when PCs are initted, or change ICE connection/gathering state. Also, expose the PC id, and allow getAllStats to be filtered by the same.

Review of attachment 8427304 [details] [diff] [review]:
-----------------------------------------------------------------

Will the API exposed in RTCPeerConnectionStatic suffice (for a usage example, see part 2)?
Attachment #8427304 - Flags: feedback?(standard8)
Comment on attachment 8427304 [details] [diff] [review]
Part 1: ChromeOnly API for getting notifications when PCs are initted, or change ICE connection/gathering state. Also, expose the PC id, and allow getAllStats to be filtered by the same.

Review of attachment 8427304 [details] [diff] [review]:
-----------------------------------------------------------------

I gave this a quick try with the latest Loop code and it seemed to work fine, even with the chat window holding the conversation. So this seems to be sufficient as far as I can tell at the moment.
Attachment #8427304 - Flags: feedback?(standard8) → feedback+
Comment on attachment 8427304 [details] [diff] [review]
Part 1: ChromeOnly API for getting notifications when PCs are initted, or change ICE connection/gathering state. Also, expose the PC id, and allow getAllStats to be filtered by the same.

Review of attachment 8427304 [details] [diff] [review]:
-----------------------------------------------------------------

Whoever can get to it first, abr may be traveling.
Attachment #8427304 - Flags: review?(jib)
Attachment #8427304 - Flags: review?(adam)
Comment on attachment 8427304 [details] [diff] [review]
Part 1: ChromeOnly API for getting notifications when PCs are initted, or change ICE connection/gathering state. Also, expose the PC id, and allow getAllStats to be filtered by the same.

Review of attachment 8427304 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with nits and a DOM reviewer.

::: dom/media/PeerConnection.js
@@ +115,2 @@
>      } else if (topic == "profile-change-net-teardown" ||
>                 topic == "network:offline-about-to-go-offline") {

Don't you need to delete lifecycleobservers here too?

@@ +1318,5 @@
> +function RTCPeerConnectionStatic() {
> +}
> +RTCPeerConnectionStatic.prototype = {
> +  classDescription: "mozRTCPeerConnectionStatic",
> +//  QueryInterface: XPCOMUtils.generateQI([Ci.nsISupports]),

remove commented-out code.

::: dom/webidl/RTCPeerConnectionStatic.webidl
@@ +9,5 @@
> +  interface (see bug 863952), so we need to create a simple interface with a
> +  trivial constructor and no data to hold these functions that really ought to
> +  be static in mozRTCPeerConnection.
> +  TODO(bcampen@mozilla.com) Merge this code into RTCPeerConnection once this
> +  limitation is gone.

Please file a bug blocked on 863952 and include number in todo.

::: media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.h
@@ +407,5 @@
>  #endif
>  
> +  nsresult GetId(nsAString& id)
> +  {
> +    id = NS_ConvertASCIItoUTF16(mName.c_str());

mName never changes, worth storing mName16? Might make that filter thing scale better. Bonus points if you can make them const.

::: media/webrtc/signaling/src/peerconnection/WebrtcGlobalInformation.cpp
@@ +166,5 @@
>        MOZ_ASSERT(p->second);
>  
> +      if (!pcIdFilter.WasPassed() ||
> +          pcIdFilter.Value() ==
> +            NS_ConvertUTF8toUTF16(p->second->GetName().c_str())) {

GetId() is feeling left out.
Attachment #8427304 - Flags: review?(jib) → review+
(In reply to Jan-Ivar Bruaroey [:jib] from comment #38)
> Comment on attachment 8427304 [details] [diff] [review]
> Part 1: ChromeOnly API for getting notifications when PCs are initted, or
> change ICE connection/gathering state. Also, expose the PC id, and allow
> getAllStats to be filtered by the same.
> 
> Review of attachment 8427304 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> r=me with nits and a DOM reviewer.
> 
> ::: dom/media/PeerConnection.js
> @@ +115,2 @@
> >      } else if (topic == "profile-change-net-teardown" ||
> >                 topic == "network:offline-about-to-go-offline") {
> 
> Don't you need to delete lifecycleobservers here too?
> 

   I don't know about this; we seem to be running this code if we're just starting a "Work offline", which doesn't seem like it should unhook callbacks like these. And in the shutdown case, everything is getting torn down; the only real reason we're unhooking these callbacks at all is to not have them hanging around after the window is closed, which will not happen if all of the content is unloaded.

> ::: dom/webidl/RTCPeerConnectionStatic.webidl
> ::: media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.h
> @@ +407,5 @@
> >  #endif
> >  
> > +  nsresult GetId(nsAString& id)
> > +  {
> > +    id = NS_ConvertASCIItoUTF16(mName.c_str());
> 
> mName never changes, worth storing mName16? Might make that filter thing
> scale better. Bonus points if you can make them const.

   This optimization probably isn't worth the code it takes to implement it, and it would need to be conditioned on MOZILLA_INTERNAL_API too (ugh).

> ::: media/webrtc/signaling/src/peerconnection/WebrtcGlobalInformation.cpp
> @@ +166,5 @@
> >        MOZ_ASSERT(p->second);
> >  
> > +      if (!pcIdFilter.WasPassed() ||
> > +          pcIdFilter.Value() ==
> > +            NS_ConvertUTF8toUTF16(p->second->GetName().c_str())) {
> 
> GetId() is feeling left out.

   Yeah, I guess I wrote this before writing GetId().
Attachment #8427304 - Attachment is obsolete: true
Attachment #8427304 - Flags: review?(adam)
Attachment #8430193 - Flags: review?(bzbarsky)
Comment on attachment 8430193 [details] [diff] [review]
Part 1: ChromeOnly API for getting notifications when PCs are initted, or change ICE connection/gathering state. Also, expose the PC id, and allow getAllStats to be filtered by the same.

Carry forward r=jib
Attachment #8430193 - Flags: review+
(In reply to Byron Campen [:bwc] from comment #39)
>    I don't know about this; we seem to be running this code if we're just
> starting a "Work offline", which doesn't seem like it should unhook
> callbacks like these. And in the shutdown case, everything is getting torn
> down; the only real reason we're unhooking these callbacks at all is to not
> have them hanging around after the window is closed, which will not happen
> if all of the content is unloaded.

OK I buy that.
Comment on attachment 8430193 [details] [diff] [review]
Part 1: ChromeOnly API for getting notifications when PCs are initted, or change ICE connection/gathering state. Also, expose the PC id, and allow getAllStats to be filtered by the same.

>+    for (var key in this._lifecycleobservers) {

  for (var key of Object.keys(this._lifecycleobservers)) {

seems saner to me, fwiw.

But even more to the point, if you want a Set, why not use a Set?

>+  QueryInterface: XPCOMUtils.generateQI([Ci.nsISupports,
>+                                       Ci.nsIDOMGlobalPropertyInitializer]),

Indentation issues there.

>+      if (!pcIdFilter.WasPassed() || pcIdFilter.Value() == p->second->GetId()) {

This is a bit allocation-happy.  It might make more sense to have a IdEquals() function that takes a const nsAString& and does an EqualsAscii(mName.c_str()) instead.

r=me modulo that.
Attachment #8430193 - Flags: review?(bzbarsky) → review+
(In reply to Boris Zbarsky [:bz] from comment #43)
> Comment on attachment 8430193 [details] [diff] [review]
> Part 1: ChromeOnly API for getting notifications when PCs are initted, or
> change ICE connection/gathering state. Also, expose the PC id, and allow
> getAllStats to be filtered by the same.
> 
> >+    for (var key in this._lifecycleobservers) {
> 
>   for (var key of Object.keys(this._lifecycleobservers)) {
> 
> seems saner to me, fwiw.
> 
> But even more to the point, if you want a Set, why not use a Set?

   Because I'm still learning stock JS, basically. I'll clean up the loop though.

> >+      if (!pcIdFilter.WasPassed() || pcIdFilter.Value() == p->second->GetId()) {
> 
> This is a bit allocation-happy.  It might make more sense to have a
> IdEquals() function that takes a const nsAString& and does an
> EqualsAscii(mName.c_str()) instead.
> 

   I can probably just add an accessor for a const& to mName and use EqualsAscii on that.
Attachment #8430193 - Attachment is obsolete: true
Comment on attachment 8430972 [details] [diff] [review]
Part 1: ChromeOnly API for getting notifications when PCs are initted, or change ICE connection/gathering state. Also, expose the PC id, and allow getAllStats to be filtered by the same.

Review of attachment 8430972 [details] [diff] [review]:
-----------------------------------------------------------------

Carry forward r=jib and r=bz
Attachment #8430972 - Flags: review+
For some reason, I can't mark Part 1 checkin? Can't do checkin-needed either, since Part 2 hasn't been reviewed yet...
Flags: needinfo?(adam)
Hey Byron, What remains to finish this bug off?  If Part 1 is ready for checkin, I can get it checked in.  Thanks.
Doing a needinfo for Comment 49.
Flags: needinfo?(docfaraday)
So, part 1 needs to be checked in, then the code that uses the API needs to be written (registers an observer, and when a failure is observed, pulls a stats report, formats it, and uploads it).
Flags: needinfo?(docfaraday)
Beyond part 1 for MLP, what we need is something that calls the part 1 API, and then when callbacks happen, checks them for state=failed or state=disconnected. If it finds that, it does an XHR POST as described here: https://wiki.mozilla.org/Loop/Telemetry
Attachment #8430972 - Attachment is obsolete: true
Comment on attachment 8432799 [details] [diff] [review]
Part 1: ChromeOnly API for getting notifications when PCs are initted, or change ICE connection/gathering state. Also, expose the PC id, and allow getAllStats to be filtered by the same.

Review of attachment 8432799 [details] [diff] [review]:
-----------------------------------------------------------------

Carry forward r=jib and r=bz
Attachment #8432799 - Flags: review+
Comment on attachment 8432799 [details] [diff] [review]
Part 1: ChromeOnly API for getting notifications when PCs are initted, or change ICE connection/gathering state. Also, expose the PC id, and allow getAllStats to be filtered by the same.

https://hg.mozilla.org/integration/mozilla-inbound/rev/a7cb61185992
Flags: needinfo?(adam)
Whiteboard: [est:3d, p=3, 1.5:p2, ft:webrtc] → [leave open][est:3d, p=3, 1.5:p2, ft:webrtc]
I've asked Jib to finish off the remaining work for this bug (see Comment 52 above).
Assignee: docfaraday → jib
The remaining piece will land in Fx 33 so I'm updating the milestone and remaining effort points.  (And thanks, Jib, for taking this!)
Whiteboard: [leave open][est:3d, p=3, 1.5:p2, ft:webrtc] → [leave open][est:3d, p=1.5, 1.5:p2, ft:webrtc]
Target Milestone: mozilla32 → mozilla33
The example given in https://wiki.mozilla.org/Loop/Telemetry#Example_URL_and_payload :

  POST https://loop.telemetry.mozilla.org/submit/telemetry/
  75342750-c96c-11e3-9c1a-0800200c9a66/loop/Firefox/32.0/default/20140421104955

doesn't match the template:

  Send an HTTP POST request to LOOP_TELMETRY_URL/
  REPORT_ID/loop/OPERATING_SYSTEM/VERSION/UPDATE_CHANNEL/BUILD_ID

which is it?
Flags: needinfo?(adam)
Whoops; I botched the OPERATING_SYSTEM parameter when coming up with the example URL.

The template is correct. I've updated the example to say "Darwin" rather than "Firefox".
Flags: needinfo?(adam)
Attachment #8425029 - Attachment description: Part 2: Using the API exposed in Part 1, teach about:webrtc to update displayed statistics on individual PCs when the lifecycle callback fires. → Example: Using the API exposed in Part 1, teach about:webrtc to update displayed statistics on individual PCs when the lifecycle callback fires.
Attachment #8427305 - Attachment description: Part 2: Using the API exposed in Part 1, teach about:webrtc to update displayed statistics on individual PCs when the lifecycle callback fires. → Example: Using the API exposed in Part 1, teach about:webrtc to update displayed statistics on individual PCs when the lifecycle callback fires.
Doesn't work at the moment. Looking for big picture feedback.
Attachment #8437462 - Flags: feedback?(adam)
Works, at least as much as I can verify locally (I haven't verified the POST goes out on the wire).
Attachment #8437462 - Attachment is obsolete: true
Attachment #8437462 - Flags: feedback?(adam)
Attachment #8437984 - Flags: review?(adam)
Comment on attachment 8437984 [details] [diff] [review]
Part 2: upload telemetry logs on Loop ICE failure (2)

Review of attachment 8437984 [details] [diff] [review]:
-----------------------------------------------------------------

In general, the deep nesting of long inline functions makes this very hard to follow. If you could untangle this by refactoring some of these into their own function blocks, I think it would be much easier to follow. Minimally, I've identified two inline functions that I think really should be moved up to top-level methods.

I also have some questions about the stats object and why it's undergoing all kinds of vigorous massaging.

::: browser/components/loop/MozLoopService.jsm
@@ +462,5 @@
> +
> +        let window = chatbox.contentWindow;
> +        injectLoopAPI(window);
> +
> +        let uploadTelemetry = pc => {

Please break this out into its own method, rather than defining it inline.

@@ +485,5 @@
> +                appendStats(dict.transportStats, report);
> +                appendStats(dict.iceComponentStats, report);
> +                appendStats(dict.iceCandidatePairStats, report);
> +                appendStats(dict.iceCandidateStats, report);
> +                appendStats(dict.codecStats, report);

This is going to require updating this file every time we add a new stat. Wouldn't it make more sense to do something like:

> for (stat in dict) {
>   appendStats(dict[stat], report);
> }

Also, it's not clear to me that there will be any way, looking at the report, to tell which stat you're looking at.

@@ +492,5 @@
> +
> +              // We have stats and logs.
> +              // Prepare payload from https://wiki.mozilla.org/Loop/Telemetry
> +
> +              let report = convertToRTCStatsReport(internalFormat);

In fact, this all seems a bit of a Rube Golbergesque way to masssage the stats report. Why isn't this simply:

> let report = JSON.stringify(internalFormat)

?

I mean, my head is aching from being four levels deep in nested inline functions...

@@ +497,5 @@
> +              let runtime = Cc["@mozilla.org/xre/app-info;1"].getService(Ci.nsIXULRuntime);
> +              let channel = runtime.defaultUpdateChannel || "default";
> +              let nav = window.navigator;
> +              let firefox = nav.userAgent.split('/')[2].replace(nav.productSub,"").trim();
> +              let version = nav.userAgent.split('/')[3];

I would double-check with someone more knowledgeable than I am about whether this is the cleanest way to get at these values. Parsing them out of the UA string seems a bit fragile.

@@ +514,5 @@
> +                report: "ice failure",
> +                connectionstate: pc.iceConnectionState,
> +                stats: report,
> +                localSdp: JSON.stringify(internalFormat.localSdp),
> +                remoteSdp: JSON.stringify(internalFormat.remoteSdp),

Okay, if the remote and local SDP are part of the stats, then these fields aren't necessary. I'm happy to update the format to remove these under the assumption that they're in the stats field.

@@ +550,5 @@
> +        let ourID = window.QueryInterface(Ci.nsIInterfaceRequestor)
> +            .getInterface(Ci.nsIDOMWindowUtils).currentInnerWindowID;
> +
> +        let pc_static = new window.mozRTCPeerConnectionStatic();
> +        pc_static.registerPeerConnectionLifecycleCallback(function(pc, winID, type) {

Make this function its own callback also, rather than defining it inline.
I forgot to mention: Yes, functionally, this is exactly what I expected. I think the next (final?) review should probably be performed by someone with more javascript experience than I have.
(In reply to Adam Roach [:abr] from comment #64)
> > +                appendStats(dict.transportStats, report);
> > +                appendStats(dict.iceComponentStats, report);
> > +                appendStats(dict.iceCandidatePairStats, report);
> > +                appendStats(dict.iceCandidateStats, report);
> > +                appendStats(dict.codecStats, report);
> 
> This is going to require updating this file every time we add a new stat.

Agree, and it is also duplicate code from the RTCStatsReport interface: http://mxr.mozilla.org/mozilla-central/source/dom/media/PeerConnection.js?rev=a7cb61185992&mark=182-199#180 which I couldn't use directly for reasons I wont go into here.

I'll try to refactor this into using a shared function from PeerConnection.js.

> Also, it's not clear to me that there will be any way, looking at the
> report, to tell which stat you're looking at.

Each stat has a type field. This incidentally is how stats are defined in http://dev.w3.org/2011/webrtc/editor/webrtc.html#rtcstatsreport-object so I think we're on good ground here.

> > +              let report = convertToRTCStatsReport(internalFormat);
> 
> In fact, this all seems a bit of a Rube Golbergesque way to masssage the
> stats report. Why isn't this simply:
> 
> > let report = JSON.stringify(internalFormat)
> 
> ?

We don't expose the internal format elsewhere. I figured by sticking the spec we avoid having to document our internal format, which is slightly different (and perhaps even prone to change).

> > +              let runtime = Cc["@mozilla.org/xre/app-info;1"].getService(Ci.nsIXULRuntime);
> > +              let channel = runtime.defaultUpdateChannel || "default";
> > +              let nav = window.navigator;
> > +              let firefox = nav.userAgent.split('/')[2].replace(nav.productSub,"").trim();
> > +              let version = nav.userAgent.split('/')[3];
> 
> I would double-check with someone more knowledgeable than I am about whether
> this is the cleanest way to get at these values. Parsing them out of the UA
> string seems a bit fragile.

Will do. I figure whoever does the DOM review might have input here.

Btw, quick question on:
>             version: nav.oscpu

I didn't know what to put here, and what I put in doesn't result in anything looking like in your example, it was just the closest thing I could find. Can you elaborate on what this version number is? OS version? and where/how to get it?

> > +                stats: report,
> > +                localSdp: JSON.stringify(internalFormat.localSdp),
> > +                remoteSdp: JSON.stringify(internalFormat.remoteSdp),
> 
> Okay, if the remote and local SDP are part of the stats, then these fields
> aren't necessary. I'm happy to update the format to remove these under the
> assumption that they're in the stats field.

They're part of our INTERNAL stats dictionary, which is a grab-bag of stuff, they're not defined anywhere in the spec for stats, so I like the way you have it FWIW.

Agree with all other comments.
Incorporated feedback. Plan to put gzip in follow-on patch.
Attachment #8437984 - Attachment is obsolete: true
Attachment #8437984 - Flags: review?(adam)
Attachment #8438946 - Flags: review?(bugs)
Attachment #8438946 - Flags: review?(adam)
Attached file Sample output (beautified JSON) (obsolete) (deleted) —
Here is a sample telemetry log (run through codebeautify.org):

> {
>     "ver": 1,
>     "info": {
>         "appUpdateChannel": "default",
>         "appBuildID": "20140612015826",
>         "appName": "Firefox",
>         "appVersion": "33.0",
>         "reason": "loop",
>         "OS": "Darwin",
>         "version": "Intel Mac OS X 10.8"

Version here is probably not what we want. Clarification needed.

> "localSdp": "\"v=0\\r\\no=Mozilla-SIPUA-33.0a1 15115 0 IN IP4...

Issue: There's no standard way to represent newlines in JSON. Thoughts?

> "log": "insert 'ice' (registry) succeeded: ice\ninsert 'ice.pref'...

We should at least settle on one way ('\\r\\n' seems excessive IMHO).
Also, I didn't know SDP started with an in-band quote?

One idea might be to represent sdp and logs using arrays of strings:

> "localSdp": {
>     "v=0",
>     "o=Mozilla-SIPUA-33.0a1 15115 0 IN IP4 0.0.0.0",
>     "s=SIP Call",
>     "nt=0 0",
>     "a=ice-ufrag:71ff7fa6",
>     etc.

Thoughts?
Attachment #8438960 - Attachment mime type: application/json → text/plain
(In reply to Jan-Ivar Bruaroey [:jib] from comment #69)
> > "localSdp": {

 "localSdp": [
Attachment #8438946 - Flags: review?(bugs) → review+
(In reply to Jan-Ivar Bruaroey [:jib] from comment #69)
> We should at least settle on one way ('\\r\\n' seems excessive IMHO).

I think that's just fine, actually -- it's consistent with how we treat SDP everywhere.

Inside the log, I would suggest keeping the "\n" that you show above. I'm not terribly fond of the idea of putting each line into an array, as this may slow processing in our analysis jobs -- I suspect a single string will make this processing faster.
Incorporated more feedback from abr in comment 64. Carrying forward r+ from smaug.
- Now uses Services.appinfo to fill in telemetry metadata.
Attachment #8438946 - Attachment is obsolete: true
Attachment #8438946 - Flags: review?(adam)
Attachment #8439704 - Flags: review+
Attachment #8439704 - Flags: review?(adam)
Attached patch Part 3: gzip ICE telemetry logs (obsolete) (deleted) — Splinter Review
Attachment #8439705 - Flags: review?(bugs)
Attachment #8439705 - Flags: review?(adam)
Attached file Sample output (beautified JSON) (2) (obsolete) (deleted) —
Attachment #8439709 - Flags: review?(adam)
Any ideas about how to test this (simulate ICE failure) on the builders?
Flags: needinfo?(drno)
We could probably strip candidates out of the SDP before calling setRemoteDescription(), but there may be a less nasty way.
Comment on attachment 8439705 [details] [diff] [review]
Part 3: gzip ICE telemetry logs

I should not review browser/ code.
Attachment #8439705 - Flags: review?(bugs)
Comment on attachment 8439705 [details] [diff] [review]
Part 3: gzip ICE telemetry logs

Need JS reviewer. Significant parts cribbed from TelemetryPing.jsm
Attachment #8439705 - Flags: review?(jst)
Technically that needs a browser peer review, not a DOM peer (which is why I cleared the review request).
Comment on attachment 8439705 [details] [diff] [review]
Part 3: gzip ICE telemetry logs

Mark or someone should be able to look at this.

I'd really love it if we can get unit tests on this, especially for the MozLoopService parts, so that we can check we're not breaking it if we change it in future.
Attachment #8439705 - Flags: review?(jst) → review?(mhammond)
Hey guys, this seems to be failing our privacy principles http://www.mozilla.org/en-US/privacy/principles/ -- have you OKd this with legal? These logs are collecting PII (id addresses, possibly user-correlatable conversation IDs and time-stamps) without informing or asking the user. What is the plan for data lifetime and minimization on the server?

Also our privacy policy will need to be expanded to cover this data. You're calling it "telemetry" (small 't') which it is, but when you talk to people in the project make sure they understand you're collecting more than the limited data available to Telemetry (capital 'T').
Flags: needinfo?(jmenon)
Sorry I don't know enough about the tests here.
Are there any tests yet?
Are we talking about simulating this in the WebRTC mochitests?

The only other thing which comes to my mind besides modifying the SDP would be to block packets, which is not an option on builders I assume. So if this is a test which needs to run on tbpl I would suggest to proceed with mangling the SPD offer/answer.
Flags: needinfo?(drno)
(In reply to Nils Ohlmeier [:drno] from comment #83)
> Are there any tests yet?

No.

> Are we talking about simulating this in the WebRTC mochitests?

That's what I had in mind yes. This is the kind of code that when it breaks, nobody notices, so having tests would be advantageous.

> The only other thing which comes to my mind besides modifying the SDP would
> be to block packets, which is not an option on builders I assume.

Could we temporarily block (or misconfigure) access to a stun or turn server or something? Any form of sabotage should do I would think. How do we test ice failures today?
(In reply to Daniel Veditz [:dveditz] from comment #82)
> Hey guys, this seems to be failing our privacy principles
> http://www.mozilla.org/en-US/privacy/principles/ -- have you OKd this with
> legal?

We have had an initial discussion with legal, but I don't believe this has
been explicitly cleared.


> These logs are collecting PII (id addresses, possibly
> user-correlatable conversation IDs and time-stamps) without informing or
> asking the user. What is the plan for data lifetime and minimization on the
> server?

I would like to observe that most of this information will be gathered
incidentally just in the process of operating the system, since we are
involved in call setup for every call. The failure information is more
detailed, but honestly doesn't tell you much you don't already know.
That isn't of course to say that we can't do stuff to improve privacy
but most of that will need to be done by log sanitization, cleansing,
etc.

It's also worth drawing a distinction between Loop, where the data
I just mentioned is inherent to operating the service, and WebRTC
in general, where we don't operate the service and hence don't
get it. We still want to gather failure diagnostics, but we should
sanitize those on the client side much more aggressively and be
careful to get the appropriate level of consent.



> Also our privacy policy will need to be expanded to cover this data. You're
> calling it "telemetry" (small 't') which it is, but when you talk to people
> in the project make sure they understand you're collecting more than the
> limited data available to Telemetry (capital 'T').

No disagreement here.
(In reply to Eric Rescorla (:ekr) from comment #85)
> I would like to observe that most of this information will be gathered
> incidentally just in the process of operating the system, since we are
> involved in call setup for every call.

Sure. Do we keep the information after the call is done?

> The failure information is more detailed, but honestly
> doesn't tell you much you don't already know. That isn't of course
> to say that we can't do stuff to improve privacy but most of that
> will need to be done by log sanitization, cleansing, etc.

It's not that we can't ever have such data but that we cannot be careless with it and hope it sorts itself out. Log sanitization after a defined period sounds great -- is there a design? Someone writing code to do it?

The main thing is that what we do and what our privacy policy says we do have to match; it doesn't matter which side we adjust as long as they stay in sync.
(In reply to Daniel Veditz [:dveditz] from comment #86)
> (In reply to Eric Rescorla (:ekr) from comment #85)
> > I would like to observe that most of this information will be gathered
> > incidentally just in the process of operating the system, since we are
> > involved in call setup for every call.
> 
> Sure. Do we keep the information after the call is done?

We don't have to, but obviously it's pretty convenient for diagnosis.


> > The failure information is more detailed, but honestly
> > doesn't tell you much you don't already know. That isn't of course
> > to say that we can't do stuff to improve privacy but most of that
> > will need to be done by log sanitization, cleansing, etc.
> 
> It's not that we can't ever have such data but that we cannot be careless
> with it and hope it sorts itself out. Log sanitization after a defined
> period sounds great -- is there a design? Someone writing code to do it?

Actually, I suggest we sanitize immediately and then purge after a given
time.


> The main thing is that what we do and what our privacy policy says we do
> have to match; it doesn't matter which side we adjust as long as they stay
> in sync.

Sounds good. Jishnu, can you help us adjust the privacy policy appropriatelY?
Comment on attachment 8439705 [details] [diff] [review]
Part 3: gzip ICE telemetry logs

Review of attachment 8439705 [details] [diff] [review]:
-----------------------------------------------------------------

I think this needs tests - TelemetryPing.jsm, from which this code seems to have been lifted, has a test you could probably also re-purpose.

::: browser/components/loop/MozLoopService.jsm
@@ +485,2 @@
>  
> +        // If upload fails, we loose the report.

lose
Attachment #8439705 - Flags: review?(mhammond)
Removed accidental double-JSON.stringify of sdp (produces \r\n instead of \\r\\n). Carrying forward r=smaug.
Attachment #8439704 - Attachment is obsolete: true
Attachment #8439704 - Flags: review?(adam)
Attachment #8441437 - Flags: review+
Attachment #8441437 - Flags: review?(adam)
Attached patch Part 3: gzip ICE telemetry logs (2) (obsolete) (deleted) — Splinter Review
Fixed typo. No tests yet. Unbitrot to make patches run.
Attachment #8439705 - Attachment is obsolete: true
Attachment #8439705 - Flags: review?(adam)
Attachment #8441439 - Flags: review?(adam)
- Anonymizes IP addresses throughout as 11.11.11.xx, 11.11.12.yy, etc.
- Redacts a=fingerprint and a=identity to EOL.
Attachment #8441440 - Flags: review?(ekr)
Attachment #8441440 - Flags: review?(dveditz)
Attached file Sample output (beautified JSON) (3) (obsolete) (deleted) —
Attachment #8438960 - Attachment is obsolete: true
Attachment #8439709 - Attachment is obsolete: true
Attachment #8439709 - Flags: review?(adam)
Attachment #8441455 - Flags: review?(adam)
Comment on attachment 8441440 [details] [diff] [review]
Part 4: anonymize sensitive data in telemetry log

Review of attachment 8441440 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/components/loop/MozLoopService.jsm
@@ +324,5 @@
> +   * @param {DOMString} 24-bit "ip"
> +   */
> +  mask24: function(ip) {
> +    if (this.ipMap[ip] === undefined) {
> +      let n = this.ipCount++;

This seems like it's going to make it hard to read the traces because they are 11.11.11 looks a lot like 11.11.12.

You could randomize, but an easier way would just to be to add a largish (between 2^{16} and 2^{24}) prime number [actually, I believe any odd number would work] with each round and then take mod 2^{24}. This should have the same period as adding 1, so doesn't require keeping any more state, but it will make the values look quite a bit more different.

Also, please use a random starting point.

Finally, please do not mask the RFC 1918 addresses.

@@ +327,5 @@
> +    if (this.ipMap[ip] === undefined) {
> +      let n = this.ipCount++;
> +      let p1 = Math.floor(n / 0x10000) % 256;
> +      let p2 = Math.floor(n / 0x100) % 256;
> +      let p3 = n % 256;

I don't really care, but my preference would be to use bit-shift operations rather than math operations. They seem clearer.

@@ +339,5 @@
> +   *
> +   * @param {DOMString} src Input text containing IP numbers as words.
> +   */
> +  anonymizeIPv4: function(s) {
> +    let n = "(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)";

This is a very complicated regex. I would suggest instead using \d+\.\d+\.\d+\.\d+ and then checking that the values in the integers are in-bound.

@@ +353,5 @@
> +   * Redacts lines from match to EOL. Assumes \r\n\ linebreaks.
> +   *
> +   * @param {DOMString} sdp The sdp text.
> +   */
> +  redactSdp: function(sdp) {

I haven't studied this function to verify that it's removing everything important.

I notice that it's not redacting the Loop connection IDs and the tokbox session ids.

The names of the TURN servers may leak the user's location. I notice one is like.
Does this say that the user is near NYC?

turn502-nyc.tokbox.com

@@ +354,5 @@
> +   *
> +   * @param {DOMString} sdp The sdp text.
> +   */
> +  redactSdp: function(sdp) {
> +    return sdp.replace(/\r\na=(fingerprint|identity):.*?\r\n/g, "\r\n");

Why are you requiring that it have \r\n on both sides?
Attachment #8441440 - Flags: review?(ekr) → review-
Priority: P2 → P1
(In reply to Eric Rescorla (:ekr) from comment #95)
> > +  anonymizeIPv4: function(s) {
> > +    let n = "(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)";
> 
> This is a very complicated regex. I would suggest instead using
> \d+\.\d+\.\d+\.\d+ and then checking that the values in the integers are
> in-bound.

This is a popular pattern (google "regexp ip"), and the alternative made the code more complicated (handling false positives), so if it's the same to you I'd like to keep it. Might as well have regexp carry weight.

> > +  redactSdp: function(sdp) {
> 
> I haven't studied this function to verify that it's removing everything
> important.

Me neither, I followed what bbrittain suggested. I'm open to things to add.

> I notice that it's not redacting the Loop connection IDs and the tokbox
> session ids.

I'll try to do that.

> The names of the TURN servers may leak the user's location. I notice one is
> like. Does this say that the user is near NYC?
> 
> turn502-nyc.tokbox.com

I'm near nyc this week so good point. :-) Not sure what the most robust SDP way of finding them is.

> > +  redactSdp: function(sdp) {
> > +    return sdp.replace(/\r\na=(fingerprint|identity):.*?\r\n/g, "\r\n");
> 
> Why are you requiring that it have \r\n on both sides?

I wanted to erase to eol, so that's the reason for the tail one. The head one is to only look for a= at the beginning of a line, but I suppose nuking all occurrences is just as well.

Incorporating all remaining comments. Thanks.
(In reply to Jan-Ivar Bruaroey [:jib] from comment #96)
> (In reply to Eric Rescorla (:ekr) from comment #95)
> > > +  anonymizeIPv4: function(s) {
> > > +    let n = "(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)";
> > 
> > This is a very complicated regex. I would suggest instead using
> > \d+\.\d+\.\d+\.\d+ and then checking that the values in the integers are
> > in-bound.
> 
> This is a popular pattern (google "regexp ip"), and the alternative made the
> code more complicated (handling false positives), so if it's the same to you
> I'd like to keep it. Might as well have regexp carry weight.

My concern is that I am having to work to convince myself it's right
and that makes the code hard to read and maintain. 



> > The names of the TURN servers may leak the user's location. I notice one is
> > like. Does this say that the user is near NYC?
> > 
> > turn502-nyc.tokbox.com
> 
> I'm near nyc this week so good point. :-) Not sure what the most robust SDP
> way of finding them is.

Me neither, but I do think it has to be done. Maybe scrub out all the TURN
server names? Note that we could modify the code that does the logging
to make that easier.


> > > +  redactSdp: function(sdp) {
> > > +    return sdp.replace(/\r\na=(fingerprint|identity):.*?\r\n/g, "\r\n");
> > 
> > Why are you requiring that it have \r\n on both sides?
> 
> I wanted to erase to eol, so that's the reason for the tail one. The head
> one is to only look for a= at the beginning of a line, but I suppose nuking
> all occurrences is just as well.

What is .*? intended to do here?
Blocks: 1027053
(In reply to Eric Rescorla (:ekr) from comment #97)
> (In reply to Jan-Ivar Bruaroey [:jib] from comment #96)
> > (In reply to Eric Rescorla (:ekr) from comment #95)
> > > > +  anonymizeIPv4: function(s) {
> > > > +    let n = "(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)";
> > > 
> > > This is a very complicated regex. I would suggest instead using
> > > \d+\.\d+\.\d+\.\d+ and then checking that the values in the integers are
> > > in-bound.
> > 
> > This is a popular pattern (google "regexp ip"), and the alternative made the
> > code more complicated (handling false positives), so if it's the same to you
> > I'd like to keep it. Might as well have regexp carry weight.
> 
> My concern is that I am having to work to convince myself it's right
> and that makes the code hard to read and maintain. 

I think demanding that regular expressions be readable is overly limiting. ;-)

FWIW I find it mildly readable as follows: 250-255 OR 200-249 OR 0-199

It's from http://www.regular-expressions.info/examples.html (top hit for "regexp ip")

> Me neither, but I do think it has to be done. Maybe scrub out all the TURN
> server names? Note that we could modify the code that does the logging
> to make that easier.

I think I have it working. Patch coming soon.

> > > > +  redactSdp: function(sdp) {
> > > > +    return sdp.replace(/\r\na=(fingerprint|identity):.*?\r\n/g, "\r\n");
> > > 
> > > Why are you requiring that it have \r\n on both sides?
> > 
> > I wanted to erase to eol, so that's the reason for the tail one. The head
> > one is to only look for a= at the beginning of a line, but I suppose nuking
> > all occurrences is just as well.
> 
> What is .*? intended to do here?

.*\r\n  includes any character upto the last \r\n it finds ("greedy").
.*?\r\n includes any character upto the first \r\n it finds ("lazy").

Ref "greedy" @ http://www.regular-expressions.info/examples.html
(In reply to Jan-Ivar Bruaroey [:jib] from comment #98)
> FWIW I find it mildly readable as follows: 250-255 OR 200-249 OR 0-199

I can add that as a comment.
(In reply to Jan-Ivar Bruaroey [:jib] from comment #99)
> (In reply to Jan-Ivar Bruaroey [:jib] from comment #98)
> > FWIW I find it mildly readable as follows: 250-255 OR 200-249 OR 0-199
> 
> I can add that as a comment.

I'd prefer you just use the regex followed by the false positive
check. I appreciate that regexes are often unreadable, but that's
an argument against using them unless you have to, not an argument
that it's OK to have them be unreadable
(In reply to Eric Rescorla (:ekr) from comment #100)
> (In reply to Jan-Ivar Bruaroey [:jib] from comment #99)
> > (In reply to Jan-Ivar Bruaroey [:jib] from comment #98)
> > > FWIW I find it mildly readable as follows: 250-255 OR 200-249 OR 0-199
> > 
> > I can add that as a comment.
> 
> I'd prefer you just use the regex followed by the false positive
> check. I appreciate that regexes are often unreadable, but that's
> an argument against using them unless you have to, not an argument
> that it's OK to have them be unreadable

Note that since you need to check the values anyway to not-rewrite RFC 1918
addresses, it seems like the logic to make sure each value is in the
range 0-255 is pretty trivial (it's a JS one-liner).

You also want to avoid rewriting RFC 3927 addresses.
Incorporated feedback.
Attachment #8441440 - Attachment is obsolete: true
Attachment #8441440 - Flags: review?(dveditz)
Attachment #8442594 - Flags: review?(ekr)
Attachment #8442594 - Flags: review?(dveditz)
Unbitrot. Carrying forward r=smaug.
Attachment #8441437 - Attachment is obsolete: true
Attachment #8441437 - Flags: review?(adam)
Attachment #8443518 - Flags: review+
Attachment #8443518 - Flags: review?(adam)
Comment on attachment 8442594 [details] [diff] [review]
Part 4: anonymize sensitive data in telemetry log (2)

Review of attachment 8442594 [details] [diff] [review]:
-----------------------------------------------------------------

This still needs some cleanup.

::: browser/components/loop/MozLoopService.jsm
@@ +327,5 @@
> +    let isInvalidOrRfc1918or3927 = function(p1, p2, p3, p4) {
> +      let invalid = octet => octet < 0 || octet > 255;
> +      return invalid(p1) || invalid(p2) || invalid(p3) || invalid(p4) ||
> +             (p1 == 10) ||
> +             (p1 == 172 && p2 >= 16 && p2 < 32) ||

suggest <= 31 for consistency

@@ +343,5 @@
> +      let p1 = (this.ipCount >> 16) % 256;
> +      let p2 = (this.ipCount >> 8) % 256;
> +      let p3 = this.ipCount % 256;
> +      this.ipMap[key] = (p1 || 1) + "." + (p2 || 1) + "." + (p3 || 1);
> +      this.ipCount = (this.ipCount + 1049039) % 16777216; // + prime % 2^24

Is there a reason you are trying to avoid 0s in all of these fields. 10.0.0.1 is a totally valid IP, after all.

In any case, instead of doing this, I would simply screen out next ipcount
values which are unacceptable, including those which are isInvalidOrRfc1918or3927 (because they cause confusion).

@@ +349,5 @@
> +    return this.ipMap[key] + "." + p4;
> +  },
> +
> +  /**
> +   * Partially masks ip numbers in input text. This is the one to call.

What does the last half of this comment mean?

@@ +354,5 @@
> +   *
> +   * @param {DOMString} text Input text containing IP numbers as words.
> +   */
> +  anonymizeIPv4: function(text) {
> +    return text.replace(/\b\d+\.\d+\.\d+\.\d+\b/g, this.maskIp.bind(this));

Suggest limiting this to strings with 1-3 digits in each block,

@@ +374,5 @@
> +                                     (match, type) => type + "/xxxx");
> +
> +    return text.replace(/\(id\=(\d+) url\=([^\)]+)\)/g,
> +                        (match, id, url) =>
> +                        "(idzz=" + id + " url\=" + trimUrl(url) + ")");

This seems extremely  brittle. why not remove everything with url=...

@@ +401,5 @@
> +  sanitizeLogs: function(log) {
> +    let rex = /srflx\(IP4:(\d+)\.(\d+)\.(\d+)\.(\d+):(\d+)\/(UDP|TCP)\|([^\)]+\))/g;
> +
> +    return log.replace(rex,
> +                       (match, p1, p2, p3, p4, port, protocol, url) =>

Wouldn't this regex be simpler if you didn't separately match p1, p2, etc.
and just did srflx\(IP4:\d+\.\d...) and then used the first match? You
could even use nested matches to keep the UDP/TCP.

@@ +415,5 @@
>    uploadTelemetry: function(window, pc) {
>      if (!this.telemetryUri) {
>        return;
>      }
> +    this.resetIpMask();

How is this going to behave with two calls in different tabs? It seems like they will be separately randomized? Is that what we want?
Attachment #8442594 - Flags: review?(ekr) → review-
(In reply to Eric Rescorla (:ekr) from comment #104)
> > +    return text.replace(/\(id\=(\d+) url\=([^\)]+)\)/g,
> > +                        (match, id, url) =>
> > +                        "(idzz=" + id + " url\=" + trimUrl(url) + ")");
> 
> This seems extremely  brittle. why not remove everything with url=...

Because I use ')' to find the end of the url. Using that technique generally on 'url=foo' matches might cut out other information.

> > +    this.resetIpMask();
>
> How is this going to behave with two calls in different tabs? It seems like
> they will be separately randomized? Is that what we want?

Yes, they will be separately randomized. Is that not what we want?
Incorporated feedback. Thanks!
Attachment #8442594 - Attachment is obsolete: true
Attachment #8442594 - Flags: review?(dveditz)
Attachment #8443992 - Flags: review?(ekr)
Attachment #8443992 - Flags: review?(dveditz)
Attached file Sample output (beautified JSON) (4) (obsolete) (deleted) —
Attachment #8441455 - Attachment is obsolete: true
Attachment #8441455 - Flags: review?(adam)
Attachment #8443993 - Flags: review?(adam)
Forgot to fold in changes.
Attachment #8443992 - Attachment is obsolete: true
Attachment #8443992 - Flags: review?(ekr)
Attachment #8443992 - Flags: review?(dveditz)
Attachment #8443994 - Flags: review?(ekr)
Attachment #8443994 - Flags: review?(dveditz)
(In reply to Jan-Ivar Bruaroey [:jib] from comment #105)
> (In reply to Eric Rescorla (:ekr) from comment #104)
> > > +    return text.replace(/\(id\=(\d+) url\=([^\)]+)\)/g,
> > > +                        (match, id, url) =>
> > > +                        "(idzz=" + id + " url\=" + trimUrl(url) + ")");
> > 
> > This seems extremely  brittle. why not remove everything with url=...
> 
> Because I use ')' to find the end of the url. Using that technique generally
> on 'url=foo' matches might cut out other information.
> 
> > > +    this.resetIpMask();
> >
> > How is this going to behave with two calls in different tabs? It seems like
> > they will be separately randomized? Is that what we want?
> 
> Yes, they will be separately randomized. Is that not what we want?

I am thinking maybe not, since it means that we will not be able to
correlate across two-tab calls, which is a common test case.

Maybe we should randomize when the number of PCs goes to zero instead?
(In reply to Eric Rescorla (:ekr) from comment #104)
> How is this going to behave with two calls in different tabs? It seems like
> they will be separately randomized? Is that what we want?

I'm not sure this is an interesting enough case that we need extra code for it.

Keep in mind that we're talking about self-calls using Loop. At this point, since we're really trying to track down some known issues with peer-to-peer calls between remote clients, I think it's okay if we don't optimize for the "ICE failure during a Loop self-call" situation.

I'm happy to defer if you think this is important (code like "reset after an hour of inactivity" would get the job done), but I'd prefer not to add special handling for what appears to be a rare corner case.
Comment on attachment 8441439 [details] [diff] [review]
Part 3: gzip ICE telemetry logs (2)

This is a little outside my area of core expertise, so I'm going to lateral it to Vladan.

Vladan: this should be a pretty quick review. If you don't have time, please let me know so I can find an alternate reviewer. Thanks!
Attachment #8441439 - Flags: review?(adam) → review?(vdjeric)
Comment on attachment 8443518 [details] [diff] [review]
Part 2: upload telemetry logs on Loop ICE failure (6) r=smaug

Review of attachment 8443518 [details] [diff] [review]:
-----------------------------------------------------------------

This generally looks good, but it needs some mochi tests. Given that we really want this functionality quickly, I'd be okay with handling those in a follow-up bug.

r+, assuming the changes below.

::: browser/components/loop/MozLoopService.jsm
@@ +333,5 @@
> +            appName: ai.name,
> +            appVersion: ai.version,
> +            reason: "loop",
> +            OS: ai.OS,
> +            version: ai.platformVersion

version: Services.sysinfo.getProperty("version")

@@ +348,5 @@
> +        let uuid = uuidgen.generateUUID().toString();
> +        uuid = uuid.substr(1,uuid.length-2); // remove uuid curly braces
> +
> +        let url = this.telemetryUri;
> +        url += ((url.substr(-1) == "/")? "":"/") + uuid + "/loop/" + ai.OS + "/" +

Please wrap to 80 characters.

@@ +381,4 @@
>      // So I guess the origin is the loop server!?
>      let origin = this.loopServerUri;
>      url = url.spec || url;
> +    let that = this;

With the advent of bind() and fat-arrow functions, this idiom is antiquated. Please remove "that" and change the function to:

> let onPCLifecycleChange = (pc, winID, type) => {
Attachment #8443518 - Flags: review?(adam) → review+
Comment on attachment 8443994 [details] [diff] [review]
Part 4: anonymize sensitive data in telemetry log (4)

Review of attachment 8443994 [details] [diff] [review]:
-----------------------------------------------------------------

This lgtm, but please add a unit test. Because this is urgent, I can live with it happening shortly after landing, but dveditz may feel differently.

::: browser/components/loop/MozLoopService.jsm
@@ +331,5 @@
> +    if (isInvalidOrRfc1918or3927(p1, p2, p3, p4)) {
> +      return ip;
> +    }
> +    let key = [p1, p2, p3].join();
> +    if (this.ipMap[key] === undefined) {

Is there a reason to not just check falsity here?

@@ +367,5 @@
> +  sanitizeUrls: function(text) {
> +    let trimUrl = url => url.replace(/(#call|#incoming).*/g,
> +                                     (match, type) => type + "/xxxx");
> +
> +    return text.replace(/\(id\=(\d+) url\=([^\)]+)\)/g,

Do you actually need to escape the = here?
Attachment #8443994 - Flags: review?(ekr) → review+
Blocks: 1029127
(In reply to Adam Roach [:abr] from comment #112)
> This generally looks good, but it needs some mochi tests. Given that we
> really want this functionality quickly, I'd be okay with handling those in a
> follow-up bug.

Filed follow-up Bug 1029127 for this.
Comment on attachment 8443993 [details]
Sample output (beautified JSON) (4)

The "version" field should be the OS version, not the Firefox version (see my comment on the code patch). Aside from that, this looks like what I was expecting, yes.
Attachment #8443993 - Flags: review?(adam) → review+
Comment on attachment 8441439 [details] [diff] [review]
Part 3: gzip ICE telemetry logs (2)

Review of attachment 8441439 [details] [diff] [review]:
-----------------------------------------------------------------

- We shouldn't gzip & stringify large payloads on the main thread. Telemetry does something similar but Telemetry only does this during idle-daily when the browser is not in use
- We're reinventing the Telemetry submission mechanism in this patch and I think that's going to be very error prone

Alternatives:
1) We can attach this data to regular Telemetry payloads and have the server-side parse it out and store it separately for easier analysis & retrieval,
or 2) you can just drop the payload inside a file in the saved-telemetry-pings folder and regular Telemetry will take care of sending it off to the Telemetry server.

I prefer the second choice. The existing telemetry mechanism has protections against spamming the server, interrupting the user, re-trying, etc. We already use this approach for Android "Application Not Responding" reports, you can see an example here: http://mxr.mozilla.org/mozilla-central/source/mobile/android/base/ANRReporter.java#190
Attachment #8441439 - Flags: review?(vdjeric) → review-
For the second approach outlined above, please use OS.File to write the Telemetry data to the saved-telemetry-pings folder off the main thread
Incorporated feedback from abr. Carrying forward r=smaug, abr. Will handle feedback from Vladan in part 3.
Attachment #8443518 - Attachment is obsolete: true
Attachment #8444991 - Flags: review+
Comment on attachment 8444991 [details] [diff] [review]
Part 2: upload telemetry logs on Loop ICE failure (7) r=smaug, abr

Review of attachment 8444991 [details] [diff] [review]:
-----------------------------------------------------------------

Just a couple of small fly-by comments.

::: browser/components/loop/MozLoopService.jsm
@@ +360,5 @@
> +        xhr.open("POST", url, true);
> +        xhr.setRequestHeader("Content-Type", 'application/json');
> +        xhr.onreadystatechange = function() {
> +          if (xhr.readyState == 4 && xhr.status == 200) {
> +            console.log("Failed to upload telemetry logs: " + xhr.responseText);

Shouldn't this be console.error?

@@ +364,5 @@
> +            console.log("Failed to upload telemetry logs: " + xhr.responseText);
> +          }
> +        };
> +        xhr.send(JSON.stringify(payload));
> +        console.log("Uploading telemetry logs to " + url);

Do we really need to log this at the moment?

@@ +420,5 @@
> +        };
> +
> +        let pc_static = new window.mozRTCPeerConnectionStatic();
> +        pc_static.registerPeerConnectionLifecycleCallback(onPCLifecycleChange);
> +      }.bind(this), true);

Please can you change this to the (event) => { } style and avoid the .bind(this). Mark Hammond's explicitly requested it in the past from me for the loop code.
Comment on attachment 8444991 [details] [diff] [review]
Part 2: upload telemetry logs on Loop ICE failure (7) r=smaug, abr

Review of attachment 8444991 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/components/loop/MozLoopService.jsm
@@ +396,5 @@
>        chatbox.addEventListener("DOMContentLoaded", function loaded(event) {
>          if (event.target != chatbox.contentDocument) {
>            return;
>          }
>          chatbox.removeEventListener("DOMContentLoaded", loaded, true);

Arrow functions are nameless, and the 'loaded' function needs to unregister itself by name here, so I'm stuck with bind I think.
(In reply to Jan-Ivar Bruaroey [:jib] from comment #120)
> Arrow functions are nameless, and the 'loaded' function needs to unregister
> itself by name here, so I'm stuck with bind I think.

Oh good point, sorry I missed that.
Incorporated feedback.
- Now writes payload as file (off-mainthread using OS.File) in
  'saved-telemetry-pings' inbox.
- Removed no longer needed pref("loop.telemetryURL").
Attachment #8441439 - Attachment is obsolete: true
Attachment #8446864 - Flags: review?(vdjeric)
Incorporated feedback & rebased. Carrying forward r=ekr.
Attachment #8443994 - Attachment is obsolete: true
Attachment #8443994 - Flags: review?(dveditz)
Attachment #8446873 - Flags: review+
Attachment #8446873 - Flags: review?(dveditz)
Attached file Sample output (beautified JSON) (5) r=abr (obsolete) (deleted) —
- Saved telemetry ping file header added.
Attachment #8443993 - Attachment is obsolete: true
Attachment #8446879 - Flags: feedback?(vdjeric)
Comment on attachment 8446864 [details] [diff] [review]
Part 3: save as telemetry ping file for upload by telemetry (3)

Review of attachment 8446864 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/components/loop/MozLoopService.jsm
@@ +358,5 @@
> +
> +        let directory = OS.Path.join(OS.Constants.Path.profileDir,
> +                                     "saved-telemetry-pings");
> +        let path = OS.Path.join(directory, uuid + ".json");
> +        let data = gEncoder.encode(JSON.stringify(ping));

So we're going to stringify and transcode objects of 1MB+ here, right? This should happen off the main thread or be broken up into smaller events. 
Have you tried measuring the duration of this operation on an older or lower-end computer?
Attachment #8446864 - Flags: review?(vdjeric)
Comment on attachment 8446879 [details]
Sample output (beautified JSON) (5) r=abr

I'm a bit concerned about the size of these files, since we'll be writing them to disk uncompressed, and Telemetry will be reading them uncompressed on startup.

a) What is the absolute maximum log size?
b) What is the maximum number of distinct logs that are likely to be produced during a Firefox session?
c) Will we also collect these logs on Fennec and B2G?

We should have Telemetry probes to record the distribution of log sizes + time spent stringifying & transcoding each log + time spent at startup loading these logs.
Attachment #8446879 - Flags: feedback?(vdjeric)
(In reply to Vladan Djeric (:vladan) from comment #127)
> Comment on attachment 8446879 [details]
> Sample output (beautified JSON) (5) r=abr
> 
> I'm a bit concerned about the size of these files, since we'll be writing
> them to disk uncompressed, and Telemetry will be reading them uncompressed
> on startup.

The sample I've uploaded here was 196K before I ran it through code-beautify.

> a) What is the absolute maximum log size?

Good question. I believe the log is a ring-buffer of a certain size. Byron might know.

> b) What is the maximum number of distinct logs that are likely to be
> produced during a Firefox session?

One per failed call. There is no maximum.

> c) Will we also collect these logs on Fennec and B2G?

I assume so.

> We should have Telemetry probes to record the distribution of log sizes +
> time spent stringifying & transcoding each log + time spent at startup
> loading these logs.

Good idea. I can add some of that. As for time spent at startup I believe that would need to go in the telemetry code if it isn't there already.
Flags: needinfo?(docfaraday)
(In reply to Jan-Ivar Bruaroey [:jib] from comment #128)
> I believe the log is a ring-buffer of a certain size. Byron
> might know.

Followup question: Do the logs go back further than the start of the failed call? If they do, maybe we could look at avoiding this somehow to avoid ballooning logs with repetition after a number of subsequent failed calls.
(In reply to Vladan Djeric (:vladan) from comment #127)
> Comment on attachment 8446879 [details]
> Sample output (beautified JSON) (5) r=abr
> 
> I'm a bit concerned about the size of these files, since we'll be writing
> them to disk uncompressed, and Telemetry will be reading them uncompressed
> on startup.
> 
> a) What is the absolute maximum log size?

4096 lines, with each line being on the order of 240 characters long -- so, just a hair under 1 MB.

> b) What is the maximum number of distinct logs that are likely to be
> produced during a Firefox session?

As Jan-Ivar says, one per failed call. Human nature being what it is, it the calls keep failing, the user is unlikely to keep trying.

> c) Will we also collect these logs on Fennec and B2G?

Under B2G, Loop is being implemented as a downloadable privileged app, so it could, in theory, take advantage of this API. I'll touch base with the implementation team for that client to find out whether they can fit this work into their schedule.

Loop does not run on Fennec (although there is a roadmap task to port the B2G client to Android; the timeframe for doing so is uncertain).
(In reply to Adam Roach [:abr] from comment #130)
> 4096 lines, with each line being on the order of 240 characters long -- so,
> just a hair under 1 MB.

Since my local-loop tests amount to about a 200K report, where the logs include both sides btw, wouldn't it be reasonable to expect typical reports to be in the 100K - 150K range?

It seems problematic that the logs will grow to 1MB in size on repeat calls. I guess clearing the logs after upload is no good as users may still want to view them on about:webrtc. Any other ideas? Can we add some form of GetLogsSince(timestamp)?

I would file a follow-up bug for this, except I was wondering if the 100K estimate - as oposed to 1MB - changes anything about what we feel is needed for this patch.

Said another way: if we fix the log-bloat first, do we still feel the string manipulation needs to be off-main?
Flags: needinfo?(vdjeric)
Right now, the ring buffer is limited to 4096 lines, and each line is limited to 4095 characters. The theoretical maximum is about 16MB, although the average length of a log line is probably not much more than 200 bytes, so we'll be under (but pushing) a meg.
Flags: needinfo?(docfaraday)
Ok, we should have some kind of counter to limit the # of logs per session or per hour. And we should also have a limit on the maximum size of the log.

I don't know how often these errors are likely to happen realistically, but I'm concerned about some sort of repeated error condition potentially writing megabytes of data to device storage. The Telemetry upload code will never read in more than 17 files on startup but the IO, network bandwidth, startup delays and storage footprint could be undesirable especially once ported to mobile devices.

I filed Telemetry bug 1033860 to track the amount of time spent reading in saved pings, so we'll be able to do a before & after comparison.
Flags: needinfo?(vdjeric)
(In reply to Vladan Djeric (:vladan) from comment #133)
> The Telemetry upload code will never read in more than 17 files on startup 

Is startup the only time that these are read? E.g. nothing is sent until I quit and restart firefox?

> Ok, we should have some kind of counter to limit the # of logs per session
> or per hour.

How about 17?

> And we should also have a limit on the maximum size of the log.

As Byron explained we do have a limit. Were you thinking of a lower limit? How low?

> I don't know how often these errors are likely to happen realistically, but
> I'm concerned about some sort of repeated error condition potentially
> writing megabytes of data to device storage.

These are "failed-call" reports, where user-initiated webrtc calls either drop or fail to connect to begin with. There's no auto-repeat, so they're always in response to the user trying to initiate or receive a call. We're talking human rate of repetition. A "repeated error condition" might be a location-based, such as a user failing repeatedly to call from a hotel wifi that has a symmetrical NAT not letting calls through, with no good TURN server.
Incorporated feedback. Thanks!
- Now uses a worker thread for encoding and saving to disk (and for log
  sanitation in upcoming part 4).
- Max 17 reports.
Attachment #8446864 - Attachment is obsolete: true
Attachment #8450018 - Flags: review?(vdjeric)
Refactored to worker thread. Carrying forward r=ekr.
Attachment #8446873 - Attachment is obsolete: true
Attachment #8446873 - Flags: review?(dveditz)
Attachment #8450022 - Flags: review+
Attachment #8450022 - Flags: review?(vdjeric)
Attachment #8450022 - Flags: review?(dveditz)
I'll add another patch with telemetry probes.
> Is startup the only time that these are read? E.g. nothing is sent until I
> quit and restart firefox?

That's right.

> > And we should also have a limit on the maximum size of the log.
> 
> As Byron explained we do have a limit. Were you thinking of a lower limit?
> How low?

Yes, sorry, I meant a lower limit. How about 200KB per log?

> > Ok, we should have some kind of counter to limit the # of logs per session
> > or per hour.
> 
> How about 17?

It could still be a lot of data to write & load, and in those cases when we have more than 17 unsent payloads on startup (of any kind.. Telemetry or ANR or Loop log), we load the 17 most recent files and delete the older payloads.

I'm guessing that sessions with more than a few failed webrtc calls likely won't have unique info in each of the logs. So how about a max of 5 Loop logs written per session?

> These are "failed-call" reports, where user-initiated webrtc calls either
> drop or fail to connect to begin with. There's no auto-repeat, so they're
> always in response to the user trying to initiate or receive a call. We're
> talking human rate of repetition. A "repeated error condition" might be a
> location-based, such as a user failing repeatedly to call from a hotel wifi
> that has a symmetrical NAT not letting calls through, with no good TURN
> server.

Ah ok, thank you for the clarification.
Comment on attachment 8450018 [details] [diff] [review]
Part 3: save as telemetry ping file for upload by telemetry (4)

Review of attachment 8450018 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/components/loop/MozLoopService.jsm
@@ +310,5 @@
>     * Uploads telemetry logs to telemetryServer
>     *
>     * @param {Object} pc The peerConnection in question.
>     */
>    uploadTelemetry: function(window, pc) {

Do you check that Telemetry is enabled somewhere before calling uploadTelemetry?

@@ +364,5 @@
> +        let worker = new ChromeWorker("MozLoopWorker.js");
> +        worker.onmessage = function(e) {
> +          console.log(e.data.length ?
> +          ("Failed to stage loop report. Error: " + e.data.length) :
> +          "Successfully staged loop report for telemetry upload.");

Won't e.data.length evaluate to undefined for both types of replies?

::: browser/components/loop/MozLoopWorker.js
@@ +25,5 @@
> + * { fail: serialized_form_of_OS.File.Error }
> + */
> +
> +onmessage = function(e) {
> +  if (++Counter > 17) {

Let's not hardcode magic numbers without explanation. Also, see my suggestion in the previous comment about the threshold.

@@ +26,5 @@
> + */
> +
> +onmessage = function(e) {
> +  if (++Counter > 17) {
> +    postMessage({fail: "Maximum loop reports reached for this session"});

Oops, we should "return;" after this postMessage

@@ +38,5 @@
> +
> +  // Save to disk
> +  let array = Encoder.encode(pingStr);
> +  try {
> +    ensureDirExists(directory);

Just use OS.File.exists() here. Also see my comment on ensureDirExists below

@@ +40,5 @@
> +  let array = Encoder.encode(pingStr);
> +  try {
> +    ensureDirExists(directory);
> +    let path = OS.Path.join(directory, filename);
> +    File.writeAtomic(path, array, { tmpPath: path + '.tmp' });

I don't think writeAtomic gets us anything in this situation. If we have a partial write of the .tmp file, Telemetry will still try to upload it since it's in the saved-pings directory. So either provide a different directory in the tmp path or just let Telemetry deal with the corruption (it will just delete any files that are not proper JSON).

@@ +50,5 @@
> +};
> +
> +function ensureDirExists(directory) {
> +  try {
> +    File.makeDir(directory, { unixMode: OS.Constants.S_IRWXU });

We should just let Telemetry create this directory, it's going to be more maintainable to have the Telemetry directory initialization all done in one place. Telemetry will create this directory when it first tries to save a Telemetry ping on Firefox exit.

The only time the directory will not have already been initialized by Telemetry is if this is the first session with Telemetry enabled for this profile. This should be rare, but if you're concerned, we can expose a Telemetry method for requesting that Telemetry initialize its state.
Attachment #8450018 - Flags: review?(vdjeric)
Comment on attachment 8450022 [details] [diff] [review]
Part 4: anonymize sensitive data in telemetry log (6) r=ekr

Review of attachment 8450022 [details] [diff] [review]:
-----------------------------------------------------------------

Are you asking me to review the sanitization routines for completeness? If so, I'm not a good reviewer for this as I don't know the contents of these Loop logs. If you'd like, you can list the privacy-sensitive data remaining after sanitization and I can comment on those. The Telemetry privacy policy is described here https://wiki.mozilla.org/Telemetry/FAQ
Attachment #8450022 - Flags: review?(vdjeric)
(In reply to Vladan Djeric (:vladan) from comment #139)
> > Is startup the only time that these are read? E.g. nothing is sent until I
> > quit and restart firefox?
> 
> That's right.

That seems unfortunate. Is there something we can do to make it happen
directly?



> It could still be a lot of data to write & load, and in those cases when we
> have more than 17 unsent payloads on startup (of any kind.. Telemetry or ANR
> or Loop log), we load the 17 most recent files and delete the older payloads.
> 
> I'm guessing that sessions with more than a few failed webrtc calls likely
> won't have unique info in each of the logs. So how about a max of 5 Loop
> logs written per session?
> 
> > These are "failed-call" reports, where user-initiated webrtc calls either
> > drop or fail to connect to begin with. There's no auto-repeat, so they're
> > always in response to the user trying to initiate or receive a call. We're
> > talking human rate of repetition. A "repeated error condition" might be a
> > location-based, such as a user failing repeatedly to call from a hotel wifi
> > that has a symmetrical NAT not letting calls through, with no good TURN
> > server.
> 
> Ah ok, thank you for the clarification.
(In reply to Vladan Djeric (:vladan) from comment #139)
> How about 200KB per log?

Adam explained to me that the logs can vary in size depending on how many candidates and connection points are involved. Others will need to speak to whether we can do with a smaller ring-buffer. If we miss an important part of the log then we may not get the info we need to detect or fix a particular failure.

> I'm guessing that sessions with more than a few failed webrtc calls likely
> won't have unique info in each of the logs. So how about a max of 5 Loop
> logs written per session?

This seems reasonable to me, except that the dynamics of us never uploading anything until the browser is restarted becomes more important. Given that laptops hibernate, I can easily see myself dragging my laptop to >5 different wifi locations without restarting Firefox (it's so darn stable! And I say that with no sarcasm :-) )

That said, resolving more than 5 failed calls from the same person+session may be something we can improve later.
Flags: needinfo?(adam)
(In reply to Vladan Djeric (:vladan) from comment #140)

> I don't think writeAtomic gets us anything in this situation. If we have a
> partial write of the .tmp file, Telemetry will still try to upload it since
> it's in the saved-pings directory. So either provide a different directory
> in the tmp path or just let Telemetry deal with the corruption (it will just
> delete any files that are not proper JSON).

writeAtomic is a global function while write is an instance method, so why not call the simpler writeAtomic? the tmpFile arg is required according to https://developer.mozilla.org/en-US/docs/JavaScript_OS.File/OS.File_for_workers#OS.File.writeAtomic%28%29

> > +function ensureDirExists(directory) {
> > +  try {
> > +    File.makeDir(directory, { unixMode: OS.Constants.S_IRWXU });
> 
> We should just let Telemetry create this directory, it's going to be more
> maintainable to have the Telemetry directory initialization all done in one
> place. Telemetry will create this directory when it first tries to save a
> Telemetry ping on Firefox exit.

How is creating the dir if it doesn't exist less maintainable? The lone dependency is still the same.

> The only time the directory will not have already been initialized by
> Telemetry is if this is the first session with Telemetry enabled for this
> profile. This should be rare, but if you're concerned, we can expose a
> Telemetry method for requesting that Telemetry initialize its state.

It might matter for diagnosing problems over the phone or email for people who don't have telemetry enabled already. It's much easier to say "turn on telemetry and then try your call again (and then quit)", than "turn on telemetry, then quit, then go back to where you were, then try your call again (and then quit)".

If you want to expose a method that wraps makedir that's fine, even though I don't see what encapsulation it buys.

Will incorporate all other comments. Thanks!
(In reply to Jan-Ivar Bruaroey [:jib] from comment #144)
> The lone dependency is still the same.

The path I mean.
(In reply to Vladan Djeric (:vladan) from comment #141)
> Are you asking me to review the sanitization routines for completeness?

No that's fine. I just thought you'd want a look since it adds more processing overhead. All of that is in the worker now so it shouldn't matter.
(In reply to Eric Rescorla (:ekr) from comment #142)
> (In reply to Vladan Djeric (:vladan) from comment #139)
> > > Is startup the only time that these are read? E.g. nothing is sent until I
> > > quit and restart firefox?
> > 
> > That's right.
> 
> That seems unfortunate. Is there something we can do to make it happen
> directly?

I agree. I would have guessed other idle opportunities exist while firefox sits there.
Incorporated feedback.
Attachment #8450018 - Attachment is obsolete: true
Attachment #8452303 - Flags: review?(vdjeric)
Unbitrot.
Attachment #8450022 - Attachment is obsolete: true
Attachment #8450022 - Flags: review?(dveditz)
Attachment #8452310 - Flags: review+
Comment on attachment 8452310 [details] [diff] [review]
Part 4: anonymize sensitive data in telemetry log (7) r=ekr

How's your review queue? Let me know if I should hit up someone else.
Attachment #8452310 - Flags: review?(dveditz)
(In reply to Jan-Ivar Bruaroey [:jib] from comment #143)
> (In reply to Vladan Djeric (:vladan) from comment #139)
> > How about 200KB per log?
> 
> Adam explained to me that the logs can vary in size depending on how many
> candidates and connection points are involved. Others will need to speak to
> whether we can do with a smaller ring-buffer. If we miss an important part
> of the log then we may not get the info we need to detect or fix a
> particular failure.

Yes; I would be hesitant to clamp down to something as small as 200kB here, since machines with novel network interface configurations are both (a) more likely to trigger failure corner cases, and (b) more likely to produce significantly larger logs than more mainstream configurations. We would effectively be hobbling our ability to diagnose the trickier scenarios that arise in the field.

> > I'm guessing that sessions with more than a few failed webrtc calls likely
> > won't have unique info in each of the logs. So how about a max of 5 Loop
> > logs written per session?
> 
> This seems reasonable to me

I'm okay with this. The goal isn't to have 100% reporting; what we want is statistically significant sampling of the kinds of failures we're seeing in the field.
Flags: needinfo?(adam)
> > > Is startup the only time that these are read? E.g. nothing is sent until I
> > > quit and restart firefox?
> > 
> > That's right.
> 
> That seems unfortunate. Is there something we can do to make it happen
> directly?

I think it's better to do the uploads during idle-daily instead of on demand, but we could have the TelemetryPing code read in any new submissions from saved-telemetry-pings during the sending process instead of startup (bug 833545) -- I can try to get to this landed during Nightly 33
> > How about 200KB per log?
> 
> Adam explained to me that the logs can vary in size depending on how many
> candidates and connection points are involved. Others will need to speak to
> whether we can do with a smaller ring-buffer. If we miss an important part
> of the log then we may not get the info we need to detect or fix a
> particular failure.

Ok but I still think we should have a limit that is smaller than 16MB.

> > I'm guessing that sessions with more than a few failed webrtc calls likely
> > won't have unique info in each of the logs. So how about a max of 5 Loop
> > logs written per session?
> 
> This seems reasonable to me, except that the dynamics of us never uploading
> anything until the browser is restarted becomes more important. Given that
> laptops hibernate, I can easily see myself dragging my laptop to >5
> different wifi locations without restarting Firefox (it's so darn stable!
> And I say that with no sarcasm :-) )

Yes, I'll make time to work on bug 833545 in the next week or so
Maybe I'm misreading Bug 833545. Are you saying that the goal of that bug is to upload in response to new ping files, e.g. without requiring a restart after a failed call? I think the concern expressed here is that uploads may get delayed by days or weeks from people never restarting their firefox.
(In reply to Vladan Djeric (:vladan) from comment #154)
> Ok but I still think we should have a limit that is smaller than 16MB.

This seems to be the remaining outstanding issue. Can we proceed with the reviews while you and Adam agree on a number? Then I'll plug it in.
Flags: needinfo?(adam)
(In reply to Jan-Ivar Bruaroey [:jib] from comment #156)
> (In reply to Vladan Djeric (:vladan) from comment #154)
> > Ok but I still think we should have a limit that is smaller than 16MB.
> 
> This seems to be the remaining outstanding issue. Can we proceed with the
> reviews while you and Adam agree on a number? Then I'll plug it in.

The 16 MB is a theoretical limit that we will never hit in practice. The log lines emitted are of a rather predictable length, and there's no way we're ever going over 1 MB. If you insist on having code that explicitly truncates at 1 MB, I have no objection other than the fact that it's busywork and an unnecessary delay in landing this patch. We would never hit it.
Flags: needinfo?(adam)
(In reply to Jan-Ivar Bruaroey [:jib] from comment #155)
> Maybe I'm misreading Bug 833545. Are you saying that the goal of that bug is
> to upload in response to new ping files, e.g. without requiring a restart
> after a failed call? I think the concern expressed here is that uploads may
> get delayed by days or weeks from people never restarting their firefox.

Telemetry currently reads in saved pings on startup and uploads them on idle-daily. That means that any ping files added to the saved-telemetry-pings directory during a Firefox session will not be uploaded until at least the next session.

With bug 833545, Telemetry will read in any ping files present in the saved-telemetry-pings directory during the idle-daily event instead of at session start. So for a multi-day session, Telemetry will upload any files present in the saved-telemetry-pings directory roughly every 24 hours without needing a restart.
> writeAtomic is a global function while write is an instance method, so why
> not call the simpler writeAtomic? the tmpFile arg is required according to

writeAtomic is used when we don't want to corrupt an existing file with a partial write because of power loss etc. It first writes the data to a temporary file, then renames it to the destination filename (and optionally fsyncs it or backs up the destination file). In your use case, you're not overwriting an existing file, and the way you're using it, it doesn't provide any corruption protections. I don't strongly object to you using it instead of write(), it was just surprising to see it used for something other than its data integrity properties.

Also, the tmpFile arg is actually not required, the documentation is just out of date.

> How is creating the dir if it doesn't exist less maintainable? The lone
> dependency is still the same.

You're hardcoding the directory name and permissions outside Telemetry code. If we change this directory in the future, we'll have to remember to change it in your code as well.
I don't want to delay you guys, so let's land it as-is and I've filed bug 1036715 to fix it up later.

> It might matter for diagnosing problems over the phone or email for people
> who don't have telemetry enabled already. It's much easier to say "turn on
> telemetry and then try your call again (and then quit)", than "turn on
> telemetry, then quit, then go back to where you were, then try your call
> again (and then quit)".

You actually can't turn on Telemetry mid-session (it doesn't take effect until restart); regular Telemetry pings from partial sessions aren't particularly desirable.
It seems like you're trying to use Telemetry payload files for debugging/diagnostics. I think this is a bad idea because it's not designed for that. For example, Telemetry automatically deletes payload files after it uploads them or if it finds that they're too old. That's not a behavior you want in a logging or diagnostic system. Maybe you could have another pref that causes Loop to store the same files in a diagnostic directory, regardless of Telemetry opt-in status.
> The 16 MB is a theoretical limit that we will never hit in practice. The log
> lines emitted are of a rather predictable length, and there's no way we're
> ever going over 1 MB. If you insist on having code that explicitly truncates
> at 1 MB, I have no objection other than the fact that it's busywork and an
> unnecessary delay in landing this patch. We would never hit it.

I don't really know anything about this logging buffer, so if you say we'll never go over 1MB, that's enough for me. I figured a one-liner to truncate against unforseen scnearios would be prudent, but if it doesn't make sense, don't put it in.
(In reply to Vladan Djeric (:vladan) from comment #159)
> I don't strongly object to you using it instead of write(), it was just
> surprising to see it used for something other than its data integrity properties.

Fair enough. Just so you know, all the OS.File examples use writeAtomic here https://developer.mozilla.org/en-US/docs/JavaScript_OS.File/OS.File_for_workers#Example.3A_Write_a_string_to_a_file - If we don't want to steer people that way we should probably put in a write() example. I actually couldn't find an example of write(). 

> Also, the tmpFile arg is actually not required, the documentation is just
> out of date.

Thanks, didn't know that, thogh I might as well provide a reasonable tmp filename, since I gather it creates the temp file regardless.

> It seems like you're trying to use Telemetry payload files for
> debugging/diagnostics.

You're right, I'm mixing things up. Thanks for the explanation.
Comment on attachment 8452303 [details] [diff] [review]
Part 3: save as telemetry ping file for upload by telemetry (5)

Review of attachment 8452303 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/components/loop/MozLoopService.jsm
@@ +317,5 @@
>     * Uploads telemetry logs to telemetryServer
>     *
>     * @param {Object} pc The peerConnection in question.
>     */
>    uploadTelemetry: function(window, pc) {

Nit: We might want to rename this method now that it's not doing a direct upload. Your call

@@ +371,5 @@
> +        let worker = new ChromeWorker("MozLoopWorker.js");
> +        worker.onmessage = function(e) {
> +          console.log(e.data.ok ?
> +            ("Failed to stage loop report. Error: " + e.data.fail) :
> +            "Successfully staged loop report for telemetry upload.");

Isn't this backwards? if e.data.ok == true, you don't want to report "Failed to stage loop report"

@@ +421,5 @@
>            if (type == "iceconnectionstatechange") {
>              switch(pc.iceConnectionState) {
>                case "failed":
>                case "disconnected":
> +                if (Services.prefs.getBoolPref("toolkit.telemetry.enabled")) {

You can just call Services.telemetry.canSend here, although there is a slight difference in semantics between the two when a user has enabled Telemetry during the current session but hasn't restarted Firefox yet. Your call

::: browser/components/loop/MozLoopWorker.js
@@ +44,5 @@
> +  let array = Encoder.encode(pingStr);
> +  try {
> +    ensureDirExists(directory);
> +    let path = OS.Path.join(directory, filename);
> +    File.writeAtomic(path, array, { tmpPath: path + '.tmp' });

You don't need the tmpPath, the docs are out of date

@@ +55,5 @@
> +
> +function ensureDirExists(directory) {
> +  try {
> +    File.makeDir(directory, { unixMode: OS.Constants.S_IRWXU });
> +  } catch (ex if ex instanceof OS.File.Error && ex.becauseExists()) {}

you can eliminate this try-catch by passing another option "ignoreExisting: true" to makeDir()
Attachment #8452303 - Flags: review?(vdjeric) → review+
> > Also, the tmpFile arg is actually not required, the documentation is just
> > out of date.
> 
> Thanks, didn't know that, thogh I might as well provide a reasonable tmp
> filename, since I gather it creates the temp file regardless.

Actually, if you don't provide a tmpPath, i think it essentially behaves the same as write()  :/
We should add a probe to report the Loop log sizes and counts per session, but that can wait for later
(In reply to Vladan Djeric (:vladan) from comment #162)
> You can just call Services.telemetry.canSend here, although there is a
> slight difference in semantics between the two when a user has enabled
> Telemetry during the current session but hasn't restarted Firefox yet.

I tried this and this is going to make it a little hard to test given that it seems not enabled on (local?) nightly builds: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/Telemetry.cpp?mark=2287-2287#2285

But I agree we probably shouldn't be writing these files if they never go anywhere.

So I've put in a new test pref:
> if (Services.telemetry.canSend || Services.prefs.getBoolPref("toolkit.telemetry.test"))

Please let me know if this is OK or if there's another way you guys already use to test this stuff.
Flags: needinfo?(vdjeric)
> I tried this and this is going to make it a little hard to test given that
> it seems not enabled on (local?) nightly builds:
> http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/
> Telemetry.cpp?mark=2287-2287#2285

It's true, sending Telemetry is not enabled on local builds. This is intentional as we don't want to collect performance/usage data from dev test builds.

You can always add "export MOZILLA_OFFICIAL=1" to your mozconfig but then make sure to change the toolkit.telemetry.server pref to a different server URL so we don't end up receiving reports from dev builds.
When debugging Telemetry, I usually point that toolkit.telemetry.server pref to a python script listening on localhost. The script un-gzips the Telemetry ping and writes it out to my drive for inspection.

So I've put in a new test pref:
> > if (Services.telemetry.canSend || Services.prefs.getBoolPref("toolkit.telemetry.test"))
> 
> Please let me know if this is OK or if there's another way you guys already
> use to test this stuff.

Yeah I'd suggest the approach I outlined above with the localhost gzip server. I can send you my script if you like.
Flags: needinfo?(vdjeric)
Incorporated feedback.
- Also added test on new unlisted toolkit.telemetry.test pref to allow testing
  of loop report generation code, without spamming all devs with report files.
Attachment #8452303 - Attachment is obsolete: true
Attachment #8453993 - Flags: review?(vdjeric)
Attachment #8453993 - Attachment description: Part 3: save as telemetry ping file for upload by telemetry (5) → Part 3: save as telemetry ping file for upload by telemetry (6)
Attachment #8453993 - Flags: review?(vdjeric) → review+
Unbitrot.
Attachment #8452310 - Attachment is obsolete: true
Attachment #8452310 - Flags: review?(dveditz)
Attachment #8454014 - Flags: review+
Attachment #8454014 - Flags: review?(dveditz)
Comment on attachment 8452310 [details] [diff] [review]
Part 4: anonymize sensitive data in telemetry log (7) r=ekr

Review of attachment 8452310 [details] [diff] [review]:
-----------------------------------------------------------------

I'm going to continue reviewing this v7 of part 4 because the newer ("unbitrot") v8 no longer calls the sanitization code. But the sanitization code itself looks good.

r=dveditz if you
 * call the sanitization
 * sanitize "relay(IP4:" server names the same as "srflx(IP4:" names

::: browser/components/loop/MozLoopWorker.js
@@ +160,5 @@
> + *
> + * @param {DOMString} log The log text.
> + */
> +function sanitizeLogs(log) {
> +  let rex = /srflx\(IP4:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}\/(UDP|TCP)\|[^\)]+\)/g;

There are also server names in the same format, only "relay" in place of "srflx". These should both be the same, either sanitized or not. The server names do indicate a gross geographical area but I'm not sure it's fine-grained enough to call PII. As an error log it might be informative to know if certain problems only occur with a particular server.
Attachment #8452310 - Flags: review+
Comment on attachment 8454014 [details] [diff] [review]
Part 4: anonymize sensitive data in telemetry log (8) r=ekr

Review of attachment 8454014 [details] [diff] [review]:
-----------------------------------------------------------------

r- on this version because it looks like the new code isn't called. (merge fail?)
Attachment #8454014 - Flags: review?(dveditz) → review-
Attachment #8452310 - Attachment is obsolete: false
Attachment #8454014 - Attachment is obsolete: true
Incorporated feedback + fixed bug.
- Fixed merge fail.
- Filters relay as well.
- Fixed embarrassing bug where maskIp() was cranking our pseudo-random IPs
  WITHIN the Rfc1918 range rather than outside it (and fairly slowly) :*/
Attachment #8452310 - Attachment is obsolete: true
Attachment #8454132 - Flags: review?(dveditz)
Attachment #8446879 - Attachment is obsolete: true
Comment on attachment 8454132 [details] [diff] [review]
Part 4: anonymize sensitive data in telemetry log (9) r=ekr

Review of attachment 8454132 [details] [diff] [review]:
-----------------------------------------------------------------

r=dveditz
Attachment #8454132 - Flags: review?(dveditz) → review+
Clearing the needinfo request for Jishnu -- Geoff, Urmika and I had a discussion around the related issues back on June 30th, and they didn't see any red flags from the legal side of things (even after I sent them an annotated example log).
Flags: needinfo?(jmenon)
ice_fail_upload.patch wouldn't apply cleanly (and had the old telemetryURL code in it), so I grabbed the patch from your try push instead. The other two patches apply just fine.

https://hg.mozilla.org/integration/mozilla-inbound/rev/b336c6598bf1
https://hg.mozilla.org/integration/mozilla-inbound/rev/f0e6af66a4cf
https://hg.mozilla.org/integration/mozilla-inbound/rev/ff281663a545
Keywords: checkin-needed
Whiteboard: [leave open][est:3d, p=1.5, 1.5:p2, ft:webrtc] → [est:3d, p=1.5, 1.5:p2, ft:webrtc]
Does this need QA testing? If so, please advise.
Whiteboard: [est:3d, p=1.5, 1.5:p2, ft:webrtc] → [est:3d, p=1.5, 1.5:p2, ft:webrtc][qa?]
QA Contact: anthony.s.hughes
(In reply to Anthony Hughes, QA Mentor (:ashughes) from comment #178)
> Does this need QA testing? If so, please advise.

-> needinfo abr.
Flags: needinfo?(adam)
(In reply to Anthony Hughes, QA Mentor (:ashughes) from comment #178)
> Does this need QA testing? If so, please advise.

Yes please. We are short on tests here. See comment 165 and Bug 1029127 for some of the challenges. My primary concern is testing that:

 1. real ice failures actually trigger these reports
 2. saved report files get staged and look valid 
 3. report files actually get uploaded to servers (and cleaned up locally)
 4. Servers actually do something with these files.

I hope Adam has more on the server end.
Jan-Ivar pretty much hit the high points here. The server-side testing is probably best paired with the telemetry analysis bug (Bug 999028), so I wouldn't worry about it too much.

So, testing-wise, you'd want to either cause or simulate an ICE failure (causing one would be as easy as blocking all UDP traffic and then trying to set up a call without a TURN server); verify that an ICE log report is staged in the telemetry directory; and then verify that the file eventually gets uploaded to telemetry.
Flags: needinfo?(adam)
Nils, could you help me with designing a test for this? Comment 180 and comment 181 help but I think it's a bit over my head technically.
Flags: needinfo?(drno)
Whiteboard: [est:3d, p=1.5, 1.5:p2, ft:webrtc][qa?] → [est:3d, p=1.5, 1.5:p2, ft:webrtc][qa+]
Flags: qe-verify+
Whiteboard: [est:3d, p=1.5, 1.5:p2, ft:webrtc][qa+] → [est:3d, p=1.5, 1.5:p2, ft:webrtc]
A couple of question/points here:
- Is this only working if I use the Loop client (the bug title seems to suggest that) or with any WebRTC call?
- Has anyone an idea where such Telemetry data would get stored?
- I was able to make a Loop call without any media and watched the Browser console as I assumed that I would see some of the log messages from this patch here in the console, but nothing showed up on the console.
- I have no idea how I would be able to verify the server parts. I don't think any of us QAs has access to the servers.
Flags: needinfo?(drno)
(In reply to Nils Ohlmeier [:drno] from comment #183)
> A couple of question/points here:
> - Is this only working if I use the Loop client (the bug title seems to
> suggest that) or with any WebRTC call?

Only the Loop client. We don't really know it's a "call" otherwise.

> - Has anyone an idea where such Telemetry data would get stored?

Client-side the logs are staged on disk in a profileDir sub-folder named "saved-telemetry-pings" for upload to the telemetry servers on next startup (or if you never restart then after a long while of idle, next day I think).

> - I was able to make a Loop call without any media and watched the Browser
> console as I assumed that I would see some of the log messages from this
> patch here in the console, but nothing showed up on the console.

For general diagnostics I would use about:webrtc instead. This logging feature is part of telemetry and is specifically aimed at catching ice failures.

Also, if you're using Nightly then it is probably disabled. See comment 166.

That said, please let me know when you've confirmed this, as in my tests I've relied on triggering this feature manually, and right now I'm having difficulty creating a real situation that triggers the feature. Specifically, it activates on iceConnectionStates "failed" and/or "disconnected". I assume we have tests for those?

> - I have no idea how I would be able to verify the server parts. I don't
> think any of us QAs has access to the servers.

Me neither unfortunately.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: