Closed Bug 840186 Opened 11 years ago Closed 11 years ago

Automation infrastructure should not submit FHR data to production

Categories

(Firefox Health Report Graveyard :: Client: Desktop, defect, P2)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Assigned: aphadke)

References

Details

Attachments

(1 file)

In bug 840134 we confirmed that RelEng machines are sending traffic to FHR production systems.

Daniel says we are receiving both uploads and deletes.

This surprises me. The profile should need to live for 24 hours before submission can occur. Perhaps some other test is monkeypatching something and causing submission to occur earlier.
Yet another reason why we need bug 617414, so things like this get caught (or at least prevented) :-)
Blocks: 617414
Are there prefs we can use to disable submission? Is that something you want to do?
Yes, there are prefs to disable submission. I /think/ we could disable submission without significant issues. But I'd prefer the prefs used when running tests to match what we ship as much as possible, because not doing so may devalue our tests.

Normally I'd be like "yeah, disable the prefs." It's only the "WTF 24h old profile in automation?!" that prevented me from doing that without blinking. It leads me to believe something very weird is happening.
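
For reference, a minimal sketch of what the "disable via prefs" route could look like, written the way the harness appends user_pref lines to a test profile's prefs.js. The pref names (datareporting.healthreport.uploadEnabled, datareporting.policy.dataSubmissionEnabled) are my assumption of the relevant switches and should be verified against the tree before use.

import os

# Assumed prefs that would stop FHR from uploading; verify the names
# against services/datareporting/ before relying on them.
DISABLE_FHR_PREFS = """
user_pref("datareporting.healthreport.uploadEnabled", false);
user_pref("datareporting.policy.dataSubmissionEnabled", false);
"""

def disable_fhr_submission(profile_dir):
    # Append the prefs above to the test profile's prefs.js.
    with open(os.path.join(profile_dir, "prefs.js"), "a") as f:
        f.write(DISABLE_FHR_PREFS)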
Among the very many possibilities is the way that test slaves sometimes set their clocks wildly wrongly. Fortunately, bug 840134 is hidden from me, so I don't have to think beyond that about what might have caused it :)
That bug is nothing more than the original request to IT to do some network sniffing and find out which hosts were responsible for the requests we were seeing. All of the hosts found in the sampling were talos-related hosts.
Were (well, are, I guess) there interesting time-related patterns? That smells like too many slaves to be bad clocks, not enough to be all, and just right to be something like a timezone screwup in calculating whether a profile is 24 hours old, so that something like midnight to 7 or 8am looks like it's a day older than it is.
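
A minimal illustration of the kind of bug being hypothesized here; this is not FHR's actual code. A check that compares calendar dates instead of elapsed time will call a profile "a day old" shortly after midnight, even though far less than 24 hours has passed.

from datetime import datetime, timedelta

def is_day_old_elapsed(profile_created, now):
    # Requires 24 actual hours to have elapsed.
    return now - profile_created >= timedelta(hours=24)

def is_day_old_calendar(profile_created, now):
    # Buggy pattern: compares dates, so 11pm -> 1am counts as "a day older".
    return now.date() > profile_created.date()

created = datetime(2013, 2, 11, 23, 0)   # profile created at 11pm
now = datetime(2013, 2, 12, 1, 0)        # two hours later

print(is_day_old_elapsed(created, now))   # False
print(is_day_old_calendar(created, now))  # True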
While all of the machines have "talos" in their hostname, I believe these machines do things besides Talos runs. This is obvious from clicking around TBPL and looking at the hostnames of the test runners.
Yeah, just a historical artifact: all slaves which run tests are named talos-*, and all talos-* slaves run every flavor of test.
Attached patch Change server URI, v1 (deleted) — Splinter Review
Instead of disabling submission completely (which would change the character of FHR too much for my liking), I'm pointing it at the local HTTP server. The endpoint doesn't exist. FHR will happily eat the presumed 404 and try again later. This shouldn't cause any tests to fail (we start our own httpd.js server for tests that require it).

I did not update the reftests.py file because meh. I suppose I may need to update Talos configs as well.

What I'd really like is to find out which test suite is causing submissions so we don't play whack-a-mole. Perhaps someone from Ops can sniff network traffic and we can identify the test job running at that instant...
Attachment #713739 - Flags: review?(emorley)
Comment on attachment 713739 [details] [diff] [review]
Change server URI, v1

We should also change reftest, Talos, and xpcshell(?), and notify other consumers (e.g. mozmill) IMO, but I'm deferring to Joel for a second opinion.
Attachment #713739 - Flags: review?(emorley) → review?(jmaher)
Comment on attachment 713739 [details] [diff] [review]
Change server URI, v1

Review of attachment 713739 [details] [diff] [review]:
-----------------------------------------------------------------

::: build/automation.py.in
@@ +534,5 @@
>  user_pref("datareporting.policy.dataSubmissionPolicyBypassAcceptance", true);
>  
> +// Point Firefox Health Report at a local server. We don't care if it actually
> +// works. It just can't hit the default production endpoint.
> +user_pref("datareporting.healthreport.documentServerURI", "http://%(server)/healthreport/");

will Firefox crash or run into problems with this?  I would like to run this on try server before landing.  If this is deemed acceptable, we can do this for the other test harnesses.  Right now I believe that would just be Talos because automation.py.in works for mochitest/reftest and maybe xpcshell.
Attachment #713739 - Flags: review?(jmaher) → review+
automation.py does not affect xpcshell. xpcshell tests don't have a default profile.
(In reply to Joel Maher (:jmaher) from comment #11)
> Comment on attachment 713739 [details] [diff] [review]
> Change server URI, v1
> 
> Review of attachment 713739 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: build/automation.py.in
> @@ +534,5 @@
> >  user_pref("datareporting.policy.dataSubmissionPolicyBypassAcceptance", true);
> >  
> > +// Point Firefox Health Report at a local server. We don't care if it actually
> > +// works. It just can't hit the default production endpoint.
> > +user_pref("datareporting.healthreport.documentServerURI", "http://%(server)/healthreport/");
> 
> will Firefox crash or run into problems with this?  I would like to run this
> on try server before landing.

Why would Firefox crash when it fails to submit data to a URI? FHR is designed to handle server failures gracefully, like any well-behaved service client.

I can't speak for the built-in HTTP server it is hitting, but if it crashes on seeing what should be a 404, that would be bad.

I respectfully decline to run a Try push first: I'm confident pushing this directly to inbound. And that's part of what inbound is for, no?
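
A generic sketch of the "well-behaved client" behavior described above, not FHR's actual implementation: any failed upload, 404 included, is swallowed and simply rescheduled for a later attempt. The retry interval is a made-up value for illustration.

import urllib.error
import urllib.request

RETRY_INTERVAL_SECONDS = 24 * 60 * 60  # hypothetical: try again in a day

def try_submit(document_server_uri, payload):
    """Return seconds until the next attempt, or None on success."""
    try:
        req = urllib.request.Request(document_server_uri, data=payload)
        with urllib.request.urlopen(req, timeout=30) as resp:
            if 200 <= resp.status < 300:
                return None
    except (urllib.error.URLError, OSError):
        pass  # unreachable server, 404, 5xx: all treated as soft failures
    return RETRY_INTERVAL_SECONDS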
I only ask questions because I see weird stuff in automation all the time.  If you want to push to inbound that is fine, but I personally wouldn't do that unless I was watching the tree and willing to back it out right away.  Also I would like to make sure we keep an eye on whichever test suite/chunk this is so we can look for random failures or increased crashes.
Well, I trust my code to do the right thing. And, I should be watching the tree for the next few hours, so...

https://hg.mozilla.org/integration/mozilla-inbound/rev/d29f06f3ff67
Whiteboard: [leave open]
Target Milestone: --- → mozilla21
I should test my patches before pushing them.

Reverted due to orange.

https://hg.mozilla.org/integration/mozilla-inbound/rev/7c28a61eb677
I forgot the 's' after %(server) in the formatting string. Doh. I tested the corrected patch locally and it didn't blow up.

https://hg.mozilla.org/integration/mozilla-inbound/rev/19abb6fef9dd
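
For anyone puzzled by the failure mode: with Python's old-style % formatting, a conversion character is required after the parenthesized key, so the broken line doesn't just build a wrong URL, substitution itself raises a ValueError when the template is filled in (assuming the prefs template is substituted with % formatting, as the %(server)s placeholder suggests). The server address below is hypothetical.

# Broken vs. fixed pref line, filled in with old-style % formatting.
template_bad = 'user_pref("datareporting.healthreport.documentServerURI", "http://%(server)/healthreport/");'
template_ok = 'user_pref("datareporting.healthreport.documentServerURI", "http://%(server)s/healthreport/");'

values = {"server": "127.0.0.1:8888"}  # hypothetical local httpd.js address

print(template_ok % values)   # substitutes cleanly

try:
    template_bad % values
except ValueError as e:
    # The missing 's' makes '/' look like the conversion type:
    # "unsupported format character '/' ..."
    print("formatting failed:", e)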
Daniel: Has the number of requests from Mozilla IPs changed at all since this merged?
Flags: needinfo?(deinspanjer)
FYI we will do the same change for mozbase and mozmill on bug 843312.
Status: NEW → ASSIGNED
Mark has been pulling numbers lately. Any change in this?
Flags: needinfo?(deinspanjer) → needinfo?(mreid)
I've been looking at FHR payload info, which doesn't contain the IPs.  Daniel, you mentioned you can find this info in Hive?
Flags: needinfo?(mreid) → needinfo?(deinspanjer)
Daniel: any update on this? If we are still submitting from automation I would love to get it fixed.
Any update on this, Daniel?
Priority: -- → P2
Component: Metrics and Firefox Health Report → Client: Desktop
Product: Mozilla Services → Firefox Health Report
Target Milestone: mozilla21 → ---
Ping - this is blocking bug 617414.
Ugh. Sorry for the lack of updates here. I'm assigning this to Anurag, who can run queries against the metrics.mozilla.com weblogs looking for IPs that come from internal IP ranges.
Assignee: gps → aphadke
Flags: needinfo?(deinspanjer)
Started a backfill of FHR log data for the month of June. Will update the bug as soon as the collection + analysis is complete (ETA: 6/14).
Note that Android also has an uploader, but the uploader uses Android SharedPreferences instead of Gecko prefs, so this patch will have no effect.

It shouldn't upload within the first 24 hours, just like desktop, and it will use the system clock via Java's System.currentTimeMillis() to make that determination.

If you start seeing documents uploaded from the Tegras, then there is a deeper problem here -- perhaps a tiny man with a subatomic screwdriver maliciously attacking our NTP hosts, tests taking more than 24 hours to run, or profiles being reused between test runs.

(You'll be able to tell the latter because we report the profile creation time in the payload itself.)
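
A minimal sketch of the kind of age gate being described, shown in Python for consistency with the rest of this bug; the Android client does the equivalent in Java using System.currentTimeMillis(). Not the actual FHR code.

import time

MS_PER_DAY = 24 * 60 * 60 * 1000

def old_enough_to_upload(profile_creation_ms, now_ms=None):
    # Epoch-milliseconds age check. A badly skewed clock (now_ms far in
    # the future) would let a brand-new test profile pass, which is one
    # theory for how automation machines could be uploading at all.
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms - profile_creation_ms >= MS_PER_DAY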
None of the top 1k IPs match the blacklisted IP list at
https://intranet.mozilla.org/Metrics/BlacklistedIPs

I did a reverse org/ISP lookup and that didn't resolve to Mozilla either. It's safe to say that the machines aren't submitting data to FHR.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Thank you for confirming :-)
Whiteboard: [leave open]
Product: Firefox Health Report → Firefox Health Report Graveyard