Closed
Bug 840186
Opened 11 years ago
Closed 11 years ago
Automation infrastructure should not submit FHR data to production
Categories
(Firefox Health Report Graveyard :: Client: Desktop, defect, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: gps, Assigned: aphadke)
References
Details
Attachments
(1 file)
(deleted), patch | jmaher: review+
In bug 840134 we confirmed that RelEng machines are sending traffic to FHR production systems. Daniel says we are receiving both uploads and deletes. This surprises me. The profile should need to live for 24h before submission can occur. Perhaps some other test that monkeypatches timers or prefs is causing submission to occur earlier.
Comment 1•11 years ago
Yet another reason why we need bug 617414, so things like this get caught (or at least prevented) :-)
Blocks: 617414
Comment 2•11 years ago
Are there prefs we can use to disable submission? Is that something you want to do?
Reporter
Comment 3•11 years ago
Yes, there are prefs to disable submission. I /think/ we could disable submission without significant issues. But, I'd prefer to have the prefs when running tests match what we ship as much as possible because not doing so may devalue our tests. Normally I'd be like "yeah, disable the prefs." It's only the "WTF 24h old profile in automation?!" that prevented me from doing that without blinking. It leads me to believe something very weird is happening.
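[Editorial note: for reference, a sketch of the kind of prefs a test profile could set to silence submission entirely. The pref names below are assumptions based on the datareporting.* namespace used in the attached patch, not copied from the tree:]

```js
// Hypothetical test-profile prefs sketch; names assumed, verify against the tree.
user_pref("datareporting.healthreport.uploadEnabled", false);    // stop FHR uploads
user_pref("datareporting.policy.dataSubmissionEnabled", false);  // stop all data submission
```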
Comment 4•11 years ago
Among the very many possibilities is the way that test slaves sometimes set their clocks wildly wrongly. Fortunately, bug 840134 is hidden from me, so I don't have to think beyond that about what might have caused it :)
Comment 5•11 years ago
That bug is nothing more than the original request to IT to do some network sniffing and find out what hosts were responsible for the requests we were seeing. All of the hosts found in the sampling were talos related hosts.
Comment 6•11 years ago
Were (well, are, I guess) there interesting time-related patterns? That smells like too many slaves to be bad clocks, not enough to be all, and just right to be something like a timezone screwup in calculating whether a profile is 24 hours old, so that something like midnight to 7 or 8am looks like it's a day older than it is.
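[Editorial note: to illustrate the kind of timezone/date bug being hypothesized here (this is a sketch, not FHR's actual code): a check that compares calendar dates counts midnight boundaries crossed, while a check that compares elapsed time does not, so the two disagree between midnight and roughly 24h after profile creation.]

```python
from datetime import datetime, timedelta

SECONDS_PER_DAY = 24 * 60 * 60

def age_in_days_buggy(profile_created, now):
    # Date-difference check: counts calendar-day boundaries crossed, so a
    # profile created at 23:50 looks "a day old" twenty minutes later.
    return (now.date() - profile_created.date()).days

def age_in_days_correct(profile_created, now):
    # Elapsed-time check: full 24-hour periods actually elapsed.
    return int((now - profile_created).total_seconds() // SECONDS_PER_DAY)

created = datetime(2013, 2, 11, 23, 50)  # profile made just before midnight
now = datetime(2013, 2, 12, 0, 10)       # twenty minutes later

print(age_in_days_buggy(created, now))    # 1 -> "profile is a day old"
print(age_in_days_correct(created, now))  # 0 -> not yet 24h old
```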
Reporter
Comment 7•11 years ago
While all of the machines have "talos" in their hostname, I believe these machines do things besides Talos runs. This is obvious from clicking around TBPL and looking at the hostnames of the test runners.
Comment 8•11 years ago
Yeah, just a historical artifact, all slaves which run tests are named talos-*, all talos-* slaves run every flavor of test.
Reporter
Comment 9•11 years ago
Instead of disabling submission completely (which would change the character of FHR too much for my liking), I'm instead pointing it at the local HTTP server. The endpoint doesn't exist. FHR will happily eat the presumed 404 and try again later. This shouldn't cause any tests to fail (we start our own httpd.js server for tests that require it). I did not update the reftests.py file because meh. I suppose I may need to update Talos configs as well. What I'd really like is to find out which test suite is causing submissions so we don't play whack-a-mole. Perhaps someone from Ops can sniff network traffic and we can identify the test job running at that instant...
Attachment #713739 -
Flags: review?(emorley)
Comment 10•11 years ago
Comment on attachment 713739 [details] [diff] [review]
Change server URI, v1

We should also be changing reftest, talos, xpcshell(?) and also notify other consumers (e.g. mozmill) IMO, but deferring to Joel for a second opinion.
Attachment #713739 -
Flags: review?(emorley) → review?(jmaher)
Comment 11•11 years ago
Comment on attachment 713739 [details] [diff] [review]
Change server URI, v1

Review of attachment 713739 [details] [diff] [review]:
-----------------------------------------------------------------

::: build/automation.py.in
@@ +534,5 @@
>  user_pref("datareporting.policy.dataSubmissionPolicyBypassAcceptance", true);
>
> +// Point Firefox Health Report at a local server. We don't care if it actually
> +// works. It just can't hit the default production endpoint.
> +user_pref("datareporting.healthreport.documentServerURI", "http://%(server)/healthreport/");

Will Firefox crash or run into problems with this? I would like to run this on try server before landing. If this is deemed acceptable, we can do this for the other test harnesses. Right now I believe that would just be Talos, because automation.py.in works for mochitest/reftest and maybe xpcshell.
Attachment #713739 -
Flags: review?(jmaher) → review+
Comment 12•11 years ago
automation.py does not affect xpcshell. xpcshell tests don't have a default profile.
Reporter
Comment 13•11 years ago
(In reply to Joel Maher (:jmaher) from comment #11)
> will Firefox crash or run into problems with this? I would like to run this
> on try server before landing.

Why would Firefox crash failing to submit data to a URI? FHR is designed to handle server failures gracefully, like any well-behaved service client. I can't speak for the built-in HTTP server it is hitting, but if it crashes on seeing what should be a 404, that would be bad.

I respectfully decline to run a try push first: I'm confident pushing this directly to inbound. And that's part of what inbound is for, no?
Comment 14•11 years ago
I only ask questions because I see weird stuff in automation all the time. If you want to push to inbound that is fine, but I personally wouldn't do that unless I was watching the tree and willing to back it out right away. Also I would like to make sure we keep an eye on whichever test suite/chunk this is so we can look for random failures or increased crashes.
Reporter
Comment 15•11 years ago
Well, I trust my code to do the right thing. And, I should be watching the tree for the next few hours, so... https://hg.mozilla.org/integration/mozilla-inbound/rev/d29f06f3ff67
Whiteboard: [leave open]
Target Milestone: --- → mozilla21
Reporter
Comment 16•11 years ago
I should test my patches before pushing them. Reverted due to orange. https://hg.mozilla.org/integration/mozilla-inbound/rev/7c28a61eb677
Reporter
Comment 17•11 years ago
I forgot the 's' after the %(server) formatting string. Doh. I tested locally and this didn't blow up. https://hg.mozilla.org/integration/mozilla-inbound/rev/19abb6fef9dd
Reporter
Comment 19•11 years ago
Daniel: Have the number of requests from Mozilla IPs changed at all since this merged?
Flags: needinfo?(deinspanjer)
Comment 20•11 years ago
FYI we will do the same change for mozbase and mozmill on bug 843312.
Status: NEW → ASSIGNED
Reporter
Comment 21•11 years ago
Mark has been pulling numbers lately. Any change in this?
Flags: needinfo?(deinspanjer) → needinfo?(mreid)
Comment 22•11 years ago
I've been looking at FHR payload info, which doesn't contain the IPs. Daniel, you mentioned you can find this info in Hive?
Flags: needinfo?(mreid) → needinfo?(deinspanjer)
Reporter
Comment 23•11 years ago
Daniel: any update on this? If we are still submitting from automation I would love to get it fixed.
Reporter
Updated•11 years ago
Component: Metrics and Firefox Health Report → Client: Desktop
Product: Mozilla Services → Firefox Health Report
Target Milestone: mozilla21 → ---
Comment 25•11 years ago
Ping - this is blocking bug 617414.
Comment 26•11 years ago
Ugh. Sorry for the lack of update here. I'm assigning this to Anurag who can perform the queries against the metrics.mozilla.com weblogs looking for IPs that come from internal IP ranges.
Assignee: gps → aphadke
Flags: needinfo?(deinspanjer)
Assignee
Comment 27•11 years ago
Started backfill of FHR log data for the month of June. Will update the bug as soon as the collection + analysis is complete (ETA: 6/14).
Comment 28•11 years ago
Note that Android also has an uploader, but the uploader uses Android SharedPreferences instead of Gecko prefs, so this patch will have no effect. It shouldn't upload within the first 24 hours, just like desktop, and it will use the system clock via Java's System.currentTimeMillis() to make that determination. If you start seeing documents uploaded from the Tegras, then there is a deeper problem here -- perhaps a tiny man with a subatomic screwdriver, maliciously attacking our NTP hosts, tests taking more than 24 hours to run, or profiles being reused between test runs. (You'll be able to tell the latter because we report the profile creation time in the payload itself.)
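[Editorial note: an illustrative sketch of the elapsed-time age check described above, and of how a skewed system clock defeats it. This is not the Android FHR code, just the shape of the check; names and timestamps are made up:]

```python
DAY_MS = 24 * 60 * 60 * 1000  # milliseconds in 24 hours

def should_upload(profile_created_ms, now_ms):
    # Upload only once the profile is at least 24h old by the system clock,
    # analogous to checking against System.currentTimeMillis() on Android.
    return now_ms - profile_created_ms >= DAY_MS

created = 1_360_000_000_000                # hypothetical creation time (ms epoch)
good_clock = created + 2 * 60 * 60 * 1000  # 2h later: should not upload
skewed_clock = created + 3 * DAY_MS        # clock jumped 3 days ahead: uploads

print(should_upload(created, good_clock))    # False
print(should_upload(created, skewed_clock))  # True
```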
Assignee
Comment 29•11 years ago
None of the top 1k IPs match the blacklist IP list @ https://intranet.mozilla.org/Metrics/BlacklistedIPs. I did a reverse org/ISP lookup and that didn't resolve to Mozilla either. It's safe to say that the machines aren't submitting data to FHR.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Firefox Health Report → Firefox Health Report Graveyard