Closed
Bug 1069454
Opened 10 years ago
Closed 10 years ago
deal with bad data better
Categories
(Release Engineering Graveyard :: Applications: Balrog (backend), defect)
Release Engineering Graveyard
Applications: Balrog (backend)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Assigned: bhearsum)
References
Details
Attachments
(1 file, 1 obsolete file)
(deleted),
patch
|
bhearsum
:
review+
bhearsum
:
checked-in+
|
Details | Diff | Splinter Review |
No description provided.
Assignee | ||
Comment 1•10 years ago
|
||
These can be seen on https://errormill.mozilla.org/Releng/balrog-prod/group/170698/
For example:
A request to https://aus3.mozilla.org/update/3/Firefox/27.0/20131216183647/Darwin_x86_64-gcc3/en-US/beta/Darwin%2011.4.2/default/default/update.xml
Ends up with this traceback:
KeyError: u'Darwin_x86_64-gcc3'
Stacktrace (most recent call last):
File "flask/app.py", line 1060, in full_dispatch_request
rv = self.dispatch_request()
File "flask/app.py", line 1047, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "flask/views.py", line 56, in view
return self.dispatch_request(*args, **kwargs)
File "flask/views.py", line 112, in dispatch_request
return meth(*args, **kwargs)
File "auslib/web/views/client.py", line 38, in get
xml = release.createXML(query, update_type, app.config["WHITELISTED_DOMAINS"], app.config["SPECIAL_FORCE_HOSTS"])
File "auslib/blob.py", line 222, in createXML
localeData = self.getPlatformData(buildTarget)["locales"][locale]
File "auslib/blob.py", line 118, in getPlatformData
platform = self.getResolvedPlatform(platform)
File "auslib/blob.py", line 115, in getResolvedPlatform
return self['platforms'][platform].get('alias', platform)
This doesn't surprise me - Darwin_x86_64-gcc3 is not a build target we use for releases that we ship anymore - and the buildid isn't one that's known to us as a release build either. However, we've already had thousands of tracebacks for this, so there's clearly something in the wild that's using this. I'm guessing that there's some 3rd party build that still points at our update server.
For Balrog, we should probably catch these exceptions rather than having them bubble up.
Summary: lots of tracebacks from Darwin → lots of tracebacks from Darwin_x86_64-gcc3 after pushing balrog to beta
Assignee | ||
Comment 2•10 years ago
|
||
Whatever we do for this might fix https://errormill.mozilla.org/Releng/balrog-prod/group/172691/too...
Assignee | ||
Updated•10 years ago
|
Blocks: balrog-release
Assignee | ||
Comment 3•10 years ago
|
||
On the theory that maybe debug dep builds somehow had "beta" as an update channel, I tried to find a build that matched https://aus3.mozilla.org/update/3/Firefox/32.0/20140822024446/Darwin_x86_64-gcc3/en-US/beta/Darwin 13.3.0/default/default/update.xml
I couldn't find anything in http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-beta-macosx64-debug/ with that buildid.
I also inspected one of the builds and the channel was set to "default". So, I don't know where it's coming from, but we're now up to 8k exceptions of this type - so there's some custom build or builds out there that are built exclusively for 64-bit Mac, trying to update from us.
Assignee | ||
Comment 4•10 years ago
|
||
At more or less random, I noticed that one URL here was https://aus3.mozilla.org/update/3/Firefox/31.0/20140610163407/Darwin_x86_64-gcc3/en-US/beta/Darwin%2013.4.0/default/default/update.xml. I googled the buildid and it turned out to be 31.0b1. The one in the previous comment is a valid beta too, as it turns out.
This makes me _strongly_ suspect there's some sort of automated process pinging these URLs. AFAICT, update verify and final verify are innocent, and we had some in the past 7min (when no releases or running). I'm wondering if there's some sort of CI test or QA test that does this...
Whimboo, do you know of any automation that would use the buildid of a proper release build, but a debug-style build target (like Darwin_x86_64-gcc3)?
Flags: needinfo?(hskupin)
Assignee | ||
Comment 5•10 years ago
|
||
Then again, maybe not...I just noticed that we get the original IPs on Sentry, and they seem to come from all over the world.
Assignee | ||
Comment 6•10 years ago
|
||
Found http://superuser.com/questions/581334/automatic-updates-of-firefox-stable-and-beta-dont-work-on-os-x randomly, which appears to have a user in the wild running a beta with Darwin_x86_64-gcc3...
Comment 7•10 years ago
|
||
I think needinfo is not needed anymore on me. I'm also not familiar with such builds. All what we test in our CI are builds we get notifications via Pulse. So if those builds would appear from qa.scl3.mozilla.com something would be wrong with the build machines.
Flags: needinfo?(hskupin)
Assignee | ||
Comment 8•10 years ago
|
||
Reading more on that stackoverflow page, I see that the user has stripped away the x86 parts of the binary, which results in BUILD_TARGET changing to Darwin_x86_64-gcc3. I now highly suspect that this is why we're getting so many of these requests. Sadly, these also means that tons of users in the wild are _not_ getting updates.
This bug is scoped to having Balrog deal with these requests more gracefully, but I've filed others to try and help our userbase:
https://bugzilla.mozilla.org/show_bug.cgi?id=1071576
https://bugzilla.mozilla.org/show_bug.cgi?id=1071580
Assignee | ||
Comment 9•10 years ago
|
||
We talked about this a bit today. Nick suggested raising specific tracebacks when we find "bad data" like this. We can then add a blacklist of tracebacks to NOT send to Sentry. That way we won't clog it up with tracebacks for things that aren't Balrog's fault. This will probably fix bug 1069508, too.
Assignee: nobody → bhearsum
Summary: lots of tracebacks from Darwin_x86_64-gcc3 after pushing balrog to beta → deal with unknown build targets being sent to balrog
Assignee | ||
Comment 10•10 years ago
|
||
(In reply to Ben Hearsum [:bhearsum] from comment #9)
> We talked about this a bit today. Nick suggested raising specific tracebacks
> when we find "bad data" like this. We can then add a blacklist of tracebacks
> to NOT send to Sentry. That way we won't clog it up with tracebacks for
> things that aren't Balrog's fault. This will probably fix bug 1069508, too.
This idea seems to work well, based on my quick tests:
diff --git a/auslib/web/base.py b/auslib/web/base.py
index a86f709..8a3a363 100644
--- a/auslib/web/base.py
+++ b/auslib/web/base.py
@@ -8,28 +8,39 @@ from raven.contrib.flask import Sentry
from auslib.AUS import AUS
app = Flask(__name__)
AUS = AUS()
sentry = Sentry()
from auslib.web.views.client import ClientRequestView
+
+class BadDataError(Exception):
+ """Raised when a client sends data that appears to be invalid."""
+ pass
+
+
@app.errorhandler(404)
def fourohfour(error):
"""We don't return 404s in AUS. Instead, we return empty XML files"""
response = make_response('<?xml version="1.0"?>\n<updates>\n</updates>')
response.mimetype = 'text/xml'
return response
@app.errorhandler(Exception)
def generic(error):
- # Log the error with Sentry before eating it (see bug 885173 for background)
- if sentry.client:
- sentry.captureException()
+ """Deals with any unhandled exceptions. If the exception is not a
+ BadDataError, it will be sent to Sentry. Regardless of the exception,
+ a 200 response with no updates is returned, because that's what the client
+ # expects. See bugs 885173 and 1069454 for additional background."""
+
+ if not isinstance(error, BadDataError):
+ if sentry.client:
+ sentry.captureException()
log.debug('Hit exception, sending an empty response')
response = make_response('<?xml version="1.0"?>\n<updates>\n</updates>')
response.mimetype = 'text/xml'
return response
@app.route('/robots.txt')
def robots():
return send_from_directory(app.static_folder, "robots.txt")
So it's just a matter of raising BadDataError in the right places now...
Assignee | ||
Comment 11•10 years ago
|
||
I think this covers all the cases identified in this bug and bug 1069508. I'm wondering if we should use it in bug 1069512 (for truly bad version numbers - not just things like w.x.y.z that we don't support yet) as well, but maybe that can be addressed there?
I think the patch is pretty straightforward, but I'm always a bit unsure of whether to do exception handling for things like getResolvedPlatform/getPlatformData/getLocaleData that depend on each other. It seems safest to do it at every level, so that's what I've done.
Attachment #8503328 -
Flags: review?(nthomas)
Assignee | ||
Updated•10 years ago
|
Summary: deal with unknown build targets being sent to balrog → deal with bad data better
Comment 13•10 years ago
|
||
Comment on attachment 8503328 [details] [diff] [review]
deal with bad data better
Review of attachment 8503328 [details] [diff] [review]:
-----------------------------------------------------------------
r+ with suggestions for improvements.
::: auslib/blobs/apprelease.py
@@ +370,1 @@
> localeData = self.getPlatformData(buildTarget)["locales"][locale]
Another chance to use getLocaleData() ?
::: auslib/blobs/gmp.py
@@ +43,4 @@
>
> def getPlatformData(self, vendor, platform):
> platform = self.getResolvedPlatform(vendor, platform)
> return self['vendors'][vendor]['platforms'][platform]
You have a try/except in this situation for apprelease, which catches a bogus alias. Worth having here too.
::: auslib/test/blobs/test_apprelease.py
@@ +25,5 @@
> self.assertEquals(blob.getPlatformData('a'), dict(foo=1))
>
> + def testGetPlatformDataRaisesBadDataError(self):
> + blob = SimpleBlob(platforms=dict(a=dict(foo=1)))
> + self.assertRaises(BadDataError, blob.getPlatformData, "c")
Could also test that BadDataError is raised when an alias points to a non-existent platform.
::: auslib/web/base.py
@@ +26,5 @@
> def generic(error):
> + """Deals with any unhandled exceptions. If the exception is not a
> + BadDataError, it will be sent to Sentry. Regardless of the exception,
> + a 200 response with no updates is returned, because that's what the client
> + # expects. See bugs 885173 and 1069454 for additional background."""
Nit, stray #.
Attachment #8503328 -
Flags: review?(nthomas) → review+
Assignee | ||
Comment 14•10 years ago
|
||
I'm assuming it's okay to carry forward the r+, but correct me if I'm wrong!
Attachment #8506347 -
Flags: review+
Comment 15•10 years ago
|
||
Looks good to me.
Assignee | ||
Updated•10 years ago
|
Attachment #8503328 -
Attachment is obsolete: true
Comment 16•10 years ago
|
||
Commit pushed to master at https://github.com/mozilla/balrog
https://github.com/mozilla/balrog/commit/8d1f8341d878c09c6367e2284c1f2a50cacffaed
bug 1069454: deal with bad data better. r=nthomas
Assignee | ||
Comment 17•10 years ago
|
||
Comment on attachment 8506347 [details] [diff] [review]
add more tests; catch more errors; use getLocaleData more
Pushed this patch, will have it deployed to production later today.
Attachment #8506347 -
Flags: checked-in+
Assignee | ||
Comment 18•10 years ago
|
||
We stopped getting https://errormill.mozilla.org/Releng/balrog-prod/group/172899/ and some other Sentry events after this made it into production. I've marked them all as resolved, so we're all done here now!
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•5 years ago
|
Product: Release Engineering → Release Engineering Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•