Closed Bug 840186 Opened 11 years ago Closed 11 years ago

Automation infrastructure should not submit FHR data to production

Categories

(Firefox Health Report Graveyard :: Client: Desktop, defect, P2)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Assigned: aphadke)

References

Details

Attachments

(1 file)

In bug 840134 we confirmed that RelEng machines are sending traffic to FHR production systems.

Daniel says we are receiving both uploads and deletes.

This surprises me. The profile should need to live for 24 hours before submission can occur. Perhaps some other test is monkeypatching something and causing submission to occur earlier.
Yet another reason why we need bug 617414, so things like this get caught (or at least prevented) :-)
Blocks: 617414
Are there prefs we can use to disable submission? Is that something you want to do?
Yes, there are prefs to disable submission. I /think/ we could disable submission without significant issues. But I'd prefer the prefs used when running tests to match what we ship as much as possible, because not doing so may devalue our tests.

Normally I'd be like "yeah, disable the prefs." It's only the "WTF 24h old profile in automation?!" that prevented me from doing that without blinking. It leads me to believe something very weird is happening.
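
For reference, a minimal sketch of what the "disable via prefs" route could look like, written the way the harness appends user_pref lines to a test profile's prefs.js. The pref names (datareporting.healthreport.uploadEnabled, datareporting.policy.dataSubmissionEnabled) are my assumption of the relevant switches and should be verified against the tree before use.

import os

# Assumed prefs that would stop FHR from uploading; verify the names
# against services/datareporting/ before relying on them.
DISABLE_FHR_PREFS = """
user_pref("datareporting.healthreport.uploadEnabled", false);
user_pref("datareporting.policy.dataSubmissionEnabled", false);
"""

def disable_fhr_submission(profile_dir):
    # Append the prefs above to the test profile's prefs.js.
    with open(os.path.join(profile_dir, "prefs.js"), "a") as f:
        f.write(DISABLE_FHR_PREFS)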
Among the very many possibilities is the way that test slaves sometimes set their clocks wildly wrongly. Fortunately, bug 840134 is hidden from me, so I don't have to think beyond that about what might have caused it :)
That bug is nothing more than the original request to IT to do some network sniffing and find out which hosts were responsible for the requests we were seeing. All of the hosts found in the sampling were talos-related hosts.
Were (well, are, I guess) there interesting time-related patterns? That smells like too many slaves to be bad clocks, not enough to be all, and just right to be something like a timezone screwup in calculating whether a profile is 24 hours old, so that something like midnight to 7 or 8am looks like it's a day older than it is.
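
A minimal illustration of the kind of bug being hypothesized here; this is not FHR's actual code. A check that compares calendar dates instead of elapsed time will call a profile "a day old" shortly after midnight, even though far less than 24 hours has passed.

from datetime import datetime, timedelta

def is_day_old_elapsed(profile_created, now):
    # Requires 24 actual hours to have elapsed.
    return now - profile_created >= timedelta(hours=24)

def is_day_old_calendar(profile_created, now):
    # Buggy pattern: compares dates, so 11pm -> 1am counts as "a day older".
    return now.date() > profile_created.date()

created = datetime(2013, 2, 11, 23, 0)   # profile created at 11pm
now = datetime(2013, 2, 12, 1, 0)        # two hours later

print(is_day_old_elapsed(created, now))   # False
print(is_day_old_calendar(created, now))  # True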
While all of the machines have "talos" in their hostname, I believe these machines do things besides Talos runs. This is obvious from clicking around TBPL and looking at the hostnames of the test runners.
Yeah, just a historical artifact: all slaves which run tests are named talos-*, and all talos-* slaves run every flavor of test.
Attached patch Change server URI, v1 (deleted) — Splinter Review
Instead of disabling submission completely (which would change the character of FHR too much for my liking), I'm pointing it at the local HTTP server. The endpoint doesn't exist. FHR will happily eat the presumed 404 and try again later. This shouldn't cause any tests to fail (we start our own httpd.js server for tests that require it).

I did not update the reftests.py file because meh. I suppose I may need to update Talos configs as well.

What I'd really like is to find out which test suite is causing submissions so we don't play whack-a-mole. Perhaps someone from Ops can sniff network traffic and we can identify the test job running at that instant...
Attachment #713739 - Flags: review?(emorley)
Comment on attachment 713739 [details] [diff] [review]
Change server URI, v1

We should also change reftest, Talos, and xpcshell(?), and notify other consumers (e.g. mozmill) IMO, but I'm deferring to Joel for a second opinion.
Attachment #713739 - Flags: review?(emorley) → review?(jmaher)
Comment on attachment 713739 [details] [diff] [review]
Change server URI, v1

Review of attachment 713739 [details] [diff] [review]:
-----------------------------------------------------------------

::: build/automation.py.in
@@ +534,5 @@
>  user_pref("datareporting.policy.dataSubmissionPolicyBypassAcceptance", true);
>  
> +// Point Firefox Health Report at a local server. We don't care if it actually
> +// works. It just can't hit the default production endpoint.
> +user_pref("datareporting.healthreport.documentServerURI", "http://%(server)/healthreport/");

will Firefox crash or run into problems with this?  I would like to run this on try server before landing.  If this is deemed acceptable, we can do this for the other test harnesses.  Right now I believe that would just be Talos because automation.py.in works for mochitest/reftest and maybe xpcshell.
Attachment #713739 - Flags: review?(jmaher) → review+
automation.py does not affect xpcshell. xpcshell tests don't have a default profile.
(In reply to Joel Maher (:jmaher) from comment #11)
> Comment on attachment 713739 [details] [diff] [review]
> Change server URI, v1
> 
> Review of attachment 713739 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: build/automation.py.in
> @@ +534,5 @@
> >  user_pref("datareporting.policy.dataSubmissionPolicyBypassAcceptance", true);
> >  
> > +// Point Firefox Health Report at a local server. We don't care if it actually
> > +// works. It just can't hit the default production endpoint.
> > +user_pref("datareporting.healthreport.documentServerURI", "http://%(server)/healthreport/");
> 
> will Firefox crash or run into problems with this?  I would like to run this
> on try server before landing.

Why would Firefox crash when it fails to submit data to a URI? FHR is designed to handle server failures gracefully, like any well-behaved service client.

I can't speak for the built-in HTTP server it is hitting, but if it crashes on seeing what should be a 404, that would be bad.

I respectfully decline to run a Try push first: I'm confident pushing this directly to inbound. And that's part of what inbound is for, no?
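
A generic sketch of the "well-behaved client" behavior described above, not FHR's actual implementation: any failed upload, 404 included, is swallowed and simply rescheduled for a later attempt. The retry interval is a made-up value for illustration.

import urllib.error
import urllib.request

RETRY_INTERVAL_SECONDS = 24 * 60 * 60  # hypothetical: try again in a day

def try_submit(document_server_uri, payload):
    """Return seconds until the next attempt, or None on success."""
    try:
        req = urllib.request.Request(document_server_uri, data=payload)
        with urllib.request.urlopen(req, timeout=30) as resp:
            if 200 <= resp.status < 300:
                return None
    except (urllib.error.URLError, OSError):
        pass  # unreachable server, 404, 5xx: all treated as soft failures
    return RETRY_INTERVAL_SECONDS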
I only ask questions because I see weird stuff in automation all the time.  If you want to push to inbound that is fine, but I personally wouldn't do that unless I was watching the tree and willing to back it out right away.  Also I would like to make sure we keep an eye on whichever test suite/chunk this is so we can look for random failures or increased crashes.
Well, I trust my code to do the right thing. And, I should be watching the tree for the next few hours, so...

https://hg.mozilla.org/integration/mozilla-inbound/rev/d29f06f3ff67
Whiteboard: [leave open]
Target Milestone: --- → mozilla21
I should test my patches before pushing them.

Reverted due to orange.

https://hg.mozilla.org/integration/mozilla-inbound/rev/7c28a61eb677
I forgot the 's' after %(server) in the formatting string. Doh. I tested the corrected patch locally and it didn't blow up.

https://hg.mozilla.org/integration/mozilla-inbound/rev/19abb6fef9dd
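
For anyone puzzled by the failure mode: with Python's old-style % formatting, a conversion character is required after the parenthesized key, so the broken line doesn't just build a wrong URL, substitution itself raises a ValueError when the template is filled in (assuming the prefs template is substituted with % formatting, as the %(server)s placeholder suggests). The server address below is hypothetical.

# Broken vs. fixed pref line, filled in with old-style % formatting.
template_bad = 'user_pref("datareporting.healthreport.documentServerURI", "http://%(server)/healthreport/");'
template_ok = 'user_pref("datareporting.healthreport.documentServerURI", "http://%(server)s/healthreport/");'

values = {"server": "127.0.0.1:8888"}  # hypothetical local httpd.js address

print(template_ok % values)   # substitutes cleanly

try:
    template_bad % values
except ValueError as e:
    # The missing 's' makes '/' look like the conversion type:
    # "unsupported format character '/' ..."
    print("formatting failed:", e)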
Daniel: Has the number of requests from Mozilla IPs changed at all since this merged?
Flags: needinfo?(deinspanjer)
FYI we will do the same change for mozbase and mozmill on bug 843312.
Status: NEW → ASSIGNED
Mark has been pulling numbers lately. Any change in this?
Flags: needinfo?(deinspanjer) → needinfo?(mreid)
I've been looking at FHR payload info, which doesn't contain the IPs.  Daniel, you mentioned you can find this info in Hive?
Flags: needinfo?(mreid) → needinfo?(deinspanjer)
Daniel: any update on this? If we are still submitting from automation I would love to get it fixed.
Any update on this, Daniel?
Priority: -- → P2
Component: Metrics and Firefox Health Report → Client: Desktop
Product: Mozilla Services → Firefox Health Report
Target Milestone: mozilla21 → ---
Ping - this is blocking bug 617414.
Ugh. Sorry for the lack of updates here. I'm assigning this to Anurag, who can run queries against the metrics.mozilla.com weblogs looking for IPs that come from internal IP ranges.
Assignee: gps → aphadke
Flags: needinfo?(deinspanjer)
Started a backfill of FHR log data for the month of June. Will update the bug as soon as the collection + analysis is complete (ETA: 6/14).
Note that Android also has an uploader, but the uploader uses Android SharedPreferences instead of Gecko prefs, so this patch will have no effect.

It shouldn't upload within the first 24 hours, just like desktop, and it will use the system clock via Java's System.currentTimeMillis() to make that determination.

If you start seeing documents uploaded from the Tegras, then there is a deeper problem here -- perhaps a tiny man with a subatomic screwdriver maliciously attacking our NTP hosts, tests taking more than 24 hours to run, or profiles being reused between test runs.

(You'll be able to tell the latter because we report the profile creation time in the payload itself.)
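
A minimal sketch of the kind of age gate being described, shown in Python for consistency with the rest of this bug; the Android client does the equivalent in Java using System.currentTimeMillis(). Not the actual FHR code.

import time

MS_PER_DAY = 24 * 60 * 60 * 1000

def old_enough_to_upload(profile_creation_ms, now_ms=None):
    # Epoch-milliseconds age check. A badly skewed clock (now_ms far in
    # the future) would let a brand-new test profile pass, which is one
    # theory for how automation machines could be uploading at all.
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms - profile_creation_ms >= MS_PER_DAY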
None of the top 1k IPs match the blacklisted IP list at
https://intranet.mozilla.org/Metrics/BlacklistedIPs

I did a reverse org/ISP lookup and that didn't resolve to Mozilla either. It's safe to say that the machines aren't submitting data to FHR.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Thank you for confirming :-)
Whiteboard: [leave open]
Product: Firefox Health Report → Firefox Health Report Graveyard