Closed Bug 1400141 Opened 7 years ago Closed 7 years ago

Publish to balrog failures with "requests.exceptions.HTTPError: 400 Client Error: BAD REQUEST"

Categories

(Release Engineering :: Release Automation: Other, defect, P1)


Tracking

(firefox57 fixed)

RESOLVED FIXED
Tracking Status
firefox57 --- fixed

People

(Reporter: nthomas, Assigned: bhearsum)

References

Details

Attachments

(4 files)

Recently we've had releases which hit several errors like this:

2017-09-15 05:10:32,060 - DEBUG - Data sent: {'data_version': 823, 'product': 'Devedition', 'hashFunction': 'sha512', 'data': '{"displayVersion": "57.0 Beta 1", "partials": [{"hashValue": "714016cf61ff8b46bd15d666fa91cdd801eed6e0935480d963a7dfa43503279ba0dbebc4ccb2b04eac27db52dc1b8a4cab74bf97a1e98677f5e94d274b1851a9", "from": "Devedition-56.0b11-build1", "filesize": 12054152}, {"hashValue": "e0bf6cabaeb0a45fb2b5246da8d92ce7638eabf48e93d250b88843f8a242a6b17a8a725ea452840ae430a53ac8ee81c8ecdfad0ff6874ac2866973d34f2249a7", "from": "Devedition-56.0b9-build1", "filesize": 12130272}], "buildID": "20170915003610", "platformVersion": "57.0", "appVersion": "57.0", "completes": [{"hashValue": "0e6259fe3f2760c86bf34c338f41ca2b9d0062f45beb52921fcaaee48905382c5195a634ff320d3969f0ba495c3869489c2368aea3e548316db0ee2de2279c18", "from": "*", "filesize": 38051763}]}', 'schema_version': 4}
2017-09-15 05:10:32,515 - DEBUG - "PUT /api/releases/Devedition-57.0b1-build1/builds/WINNT_x86-msvc/ja HTTP/1.1" 400 261
2017-09-15 05:10:32,516 - ERROR - Caught HTTPError: {"status": 400, "data": ["Failed to update row, old_data_version doesn't match current data_version"], "type": "about:blank", "detail": "Couldn't update release: Failed to update row, old_data_version doesn't match current data_version", "title": "Bad Request"}
2017-09-15 05:10:32,516 - DEBUG - REQUEST STATS: {"url": "http://balrog/api/releases/Devedition-57.0b1-build1/builds/WINNT_x86-msvc/ja", "timestamp": 1505452232.51623, "method": "PUT", "elapsed_secs": 0.4555790424346924, "status_code": 400}
2017-09-15 05:10:32,516 - DEBUG - retry: Caught exception:
Traceback (most recent call last):
  File "/home/worker/tools/lib/python/vendor/redo-1.4.1/redo/__init__.py", line 152, in retry
    return action(*args, **kwargs)
  File "/home/worker/bin/funsize-balrog-submitter.py", line 181, in <lambda>
    partialInfo=partial_info, completeInfo=complete_info,
  File "/home/worker/tools/lib/python/balrog/submitter/cli.py", line 480, in run
    return ReleaseSubmitterBase.run(self, *args, schemaVersion=4, **kwargs)
  File "/home/worker/tools/lib/python/balrog/submitter/cli.py", line 437, in run
    schemaVersion=schemaVersion)
  File "/home/worker/tools/lib/python/vendor/balrogclient-0.0.4/balrogclient/api.py", line 224, in update_build
    return self.request(method='PUT', data=data)
  File "/home/worker/tools/lib/python/vendor/balrogclient-0.0.4/balrogclient/api.py", line 111, in request
    return self.do_request(url, data, method)
  File "/home/worker/tools/lib/python/vendor/balrogclient-0.0.4/balrogclient/api.py", line 130, in do_request
    req.raise_for_status()
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 825, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
HTTPError: 400 Client Error: BAD REQUEST

We don't retry these errors, so a human ends up rerunning them with a tc client.
Full log for that was https://queue.taskcluster.net/v1/task/HL7sn3OpR7SpDmwFsuqjEg/runs/0/artifacts/public/logs/live.log.

Something changed in the last few days around this, but before the last Balrog deploy. Questions:
* are we attempting to merge changes, since there's no locale overlap? Not sure if we ever brought this nightly behaviour to release
* we don't have more partials (yet!) but did the number of funsize-balrog workers increase so there's more contention on the API?
* are the retry parameters optimal? By the fifth retry we've backed off more than two minutes. If we're not hammering the API it might be worth retrying more, with less backoff but more jitter
Data point - there are 16 funsize-balrog workers running as fast as they can as of *now*, with a small backlog of release jobs and several hundred other jobs pending.
Flags: needinfo?(bhearsum)
(In reply to Nick Thomas [:nthomas] from comment #1)
> Full log for that was
> https://queue.taskcluster.net/v1/task/HL7sn3OpR7SpDmwFsuqjEg/runs/0/artifacts/public/logs/live.log.
>
> Something changed in the last few days around this, but before the last
> Balrog deploy. Questions:
> * are we attempting to merge changes, since there's no locale overlap? Not
> sure if we ever brought this nightly behaviour to release

We _should_ be - this logic is implemented pretty generically - we can try to turn on some more verbose logging if we'd like to confirm that.

> * we don't have more partials (yet!) but did the number of funsize-balrog
> workers increase so there's more contention on the API?

I don't think I can answer the first part of this, but it wouldn't shock me if contention caused more 400s if the number of workers went up.

> * are the retry parameters optimal? By the fifth retry we've backed off
> more than two minutes. If we're not hammering the API it might be worth
> retrying more, with less backoff but more jitter

Good idea. More jitter seems like it should help... and perhaps we should increase the number of retries, too?
Flags: needinfo?(bhearsum)
I had a quick look around in datadog at the balrog-web-prod and balrog-api-prod dashboards. Not super confident I'm looking at the right data, but anyway...

On balrog-web-prod there's an "nginx req/sec admin" graph. There's no jump to higher peaks over the last month; they go up to about 5 requests/second. You can certainly spot the change to 2 nightlies per day (bug 1349227, ~August 30), when the load becomes more consistent. We probably are running nightlies and releases at the same time now, when previously they were separated.

There is a jump in 'ELB Backend 4xx and 5xx' from August 23, but that might only be client-facing traffic. It's quite a lot higher than 2xx though, which is odd. Was that when we stopped hiding errors from the clients?

re retry params, we're making the call at https://dxr.mozilla.org/mozilla-central/source/taskcluster/docker/funsize-balrog-submitter/scripts/funsize-balrog-submitter.py#174 so we get the defaults of 5 attempts, a 60s sleep, a 1.5 sleepscale, and a -1 to +1 second jitter. So typical retries are 60s, 90+/-1, 135+/-2, etc. You'd kinda hope that jobs would start out staggered but perhaps not. I'll write a PR for jitter and retries, and maybe pull down the initial sleep a bit too.
(In reply to Nick Thomas [:nthomas] from comment #4)
> I had a quick look around in datadog at the balrog-web-prod and
> balrog-api-prod dashboards. Not super confident I'm looking at the right
> data, but anyway...
>
> On balrog-web-prod there's an "nginx req/sec admin" graph. There's no jump to
> higher peaks over the last month; they go up to about 5 requests/second.
> You can certainly spot the change to 2 nightlies per day (bug 1349227, ~
> August 30), when the load becomes more consistent. We probably are running
> nightlies and releases at the same time now, when previously they were
> separated.

That's an interesting note. While nightlies and releases wouldn't impact each other directly (in the form of more contention updating the same release), if the increased load slows down the requests, I could see that increasing the number of OutdatedDataErrors. I've not been able to coax that data out of Datadog - relud, is that something you can help with?

> There is a jump in 'ELB Backend 4xx and 5xx' from August 23, but that might
> only be client-facing traffic. It's quite a lot higher than 2xx though,
> which is odd. Was that when we stopped hiding errors from the clients?

Yeah, those would be the public side, so unlikely to be relevant here. This is worth some investigating though - I filed bug 1400860 for this.

> re retry params, we're making the call at
> https://dxr.mozilla.org/mozilla-central/source/taskcluster/docker/funsize-
> balrog-submitter/scripts/funsize-balrog-submitter.py#174 so we get the defaults
> of 5 attempts, a 60s sleep, a 1.5 sleepscale, and a -1 to +1 second jitter. So
> typical retries are 60s, 90+/-1, 135+/-2, etc. You'd kinda hope that jobs
> would start out staggered but perhaps not. I'll write a PR for jitter and
> retries, and maybe pull down the initial sleep a bit too.

Thanks - maybe we'll get lucky with this...
Flags: needinfo?(dthorn)
> if the increased load slows down the requests

I did some analysis on response time. I only included the admin response time, not the agent. Looks like around the 30th we started trending up in response time and we're taking longer more consistently. An average increase of 200ms makes a lot of sense for the increased load, but that's barely above a single standard deviation, and in general it seems like a small increase. I don't know much about what kind of latency would cause this issue, but that seems like it's /really/ sensitive if it's the cause.
Flags: needinfo?(dthorn)
(In reply to Daniel Thorn [:relud] from comment #6)
> Created attachment 8909456 [details]
> Screen Shot 2017-09-18 at 12.52.12 PM.png
>
> > if the increased load slows down the requests
>
> I did some analysis on response time. I only included the admin response
> time, not the agent. Looks like around the 30th we started trending up in
> response time and we're taking longer more consistently. An average increase
> of 200ms makes a lot of sense for the increased load, but that's barely
> above a single standard deviation, and in general it seems like a small
> increase. I don't know much about what kind of latency would cause this
> issue, but that seems like it's /really/ sensitive if it's the cause.

This lines up precisely with when we started generating a second set of nightlies (https://hg.mozilla.org/mozilla-central/annotate/2f39c0312466/.cron.yml#l20). If the average time is up by ~200ms, it could be that it's up by much, much more at times when we're submitting releases and nightlies at the same time. Can we see what the average times are for a more specific period? For example, on September 14th from 6:30am to 8am UTC.
Flags: needinfo?(dthorn)
hmm. that's a bit harder to say. I haven't written a function to parse our nginx logs to json yet, and that's really needed for doing finer date splitting. I think tomorrow I could probably write that function and get august/sept backfilled for admin nginx access logs, and then connect up redash so we could evaluate some of this stuff in a proper graphing solution. gotta have a conversation about PII with :bhearsum before I set that up. I'll update with more details tomorrow.
Turns out we retry more times for Nightly and start with a 10s wait time, so mostly I'm syncing the release case to that. This is a stab in the dark because the load patterns for nightly and release are likely different. The patch also adds a -3 to +3s jitter, since that's significantly more than the usual time to retrieve data_version and PUT the new data.

More generally, we may hit this problem because we have separate PUTs for each partial for a locale. If we chunked by locale and submitted all partials in one go, we may avoid it entirely. We'd have to check what caching we have in partial generation; it could be a big time hit if we don't have a cache available between separate jobs. Alternatively, wait for all partials before submitting into Balrog.
Attachment #8909599 - Flags: review?(bhearsum)
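For reference, here is a minimal sketch of a retry call using nightly-style parameters. The submit_build() callable and the exact attempts value are placeholders rather than the contents of the attached patch; the keyword arguments (attempts, sleeptime, sleepscale, jitter) are the ones exposed by the vendored redo library that funsize-balrog-submitter.py already uses:

from redo import retry

def submit_build():
    # Placeholder for the balrogclient call that PUTs one locale's build data.
    pass

retry(
    submit_build,
    attempts=10,     # assumed value; "more times", matching the nightly config
    sleeptime=10,    # start with a 10s wait instead of the 60s default
    sleepscale=1.5,  # keep the default exponential backoff factor
    jitter=3,        # +/- a few seconds of jitter, per the patch description
)

With sleeptime=10 and sleepscale=1.5, the backoff sequence is roughly 10s, 15s, 22s and so on, each randomized by a few seconds of jitter, which spreads concurrent submitters out instead of having them all retry at the same 60s/90s/135s marks.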
We hit this 16 times in 56.0rc1. This is a major impediment to having smooth releases.
Attachment #8909599 - Flags: review?(bhearsum) → review+
I realized this morning that the other thing that has contributed to load in recent times is moving DevEdition to release-style submission. Although it's reduced middle-of-the-night load, it doubled the load for releases, which is obviously the thing we're primarily concerned with.

If things are bad enough that we're taking stabs in the dark, I wonder if we should move up the already planned database instance upgrade, and upgrade the admin instance as well. One would think these should reduce request time, and that *should* reduce the number of 400s/retries.

(In reply to Nick Thomas [:nthomas] from comment #9)
> More generally, we may hit this problem because we have separate PUTs for
> each partial for a locale. If we chunked by locale and submitted all
> partials in one go, we may avoid it entirely. We'd have to check what
> caching we have in partial generation; it could be a big time hit if we
> don't have a cache available between separate jobs. Alternatively, wait for
> all partials before submitting into Balrog.

Short of rearchitecting Balrog (in some TBD way), I think this is the proper fix here. IIRC, we've talked a bit about using balrogworker to queue a bunch of updates before submitting them.
I wonder if compressing the payload on submission would help at all?
Pushed by jlorenzo@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/762071fc48dd Use redo params from nightly with a jitter r=bhearsum
Per bhearsum's request, I landed Nick's patch on inbound[1] and release[2]. The patch still needs to be uplifted to beta; however, as :jcristau reminded us, this will be done by sheriffs once the patch has reached central.

[1] https://hg.mozilla.org/integration/mozilla-inbound/rev/762071fc48dda899efa95501e531cc5a39e09aff
[2] https://hg.mozilla.org/releases/mozilla-release/rev/16d9778c018a3c3aae3c6c62bc29258627ffed97
We discovered this morning that the server side logic that merges non-conflicting updates together doesn't do what we thought when two requests update the same list in a Release. Instead of merging the lists together, it causes a conflict (which causes a client side retry). I _think_ the reason for this is that lists aren't safely mergeable in some circumstances - eg: if one update tries to update or remove an item, we don't know what to do.

I'll look into improving this logic. Probably the simplest thing to do is to add everything when multiple add-only changes are made. In theory this could result in duplicate items in the list, but I doubt that will happen in practice (and it shouldn't cause any problems when it does). Maybe we can check the "from" key of the dictionary to ensure we have no duplicates, too.

As far as we can tell, we have no standard use cases for nightly or release where we modify or remove items from the list of partials. The closest we come is overwriting the entire thing with the new funsize submissions (eg: https://pastebin.mozilla.org/9032727). If we do happen to see an update that modifies or removes an item from the list, and we have a conflict, we should still raise an OutdatedDataError.
Assignee: nobody → bhearsum
Priority: -- → P1
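To make the idea in the previous comment concrete, here is a minimal sketch of add-only list merging with de-duplication on the "from" key. It is illustrative only, assuming plain Python dicts for the partials and a stand-in exception class; it is not Balrog's actual implementation:

class OutdatedDataError(Exception):
    # Stand-in for Balrog's real exception.
    pass

def merge_partial_lists(base, left, right):
    # "base" is the list both requests started from; "left" and "right" are
    # the two competing updated lists.
    # Anything other than a pure addition isn't safely mergeable, so keep the
    # existing conflict behaviour in that case.
    if any(p not in left for p in base) or any(p not in right for p in base):
        raise OutdatedDataError("an update modified or removed an existing partial")

    merged = list(base) + [p for p in left if p not in base]
    # De-duplicate on the "from" key so the same partial isn't added twice.
    seen = {p["from"] for p in merged}
    for p in right:
        if p not in base and p["from"] not in seen:
            merged.append(p)
            seen.add(p["from"])
    return merged

With this approach, two funsize tasks that each add a different partial for the same locale would both land, instead of the second PUT failing with a 400.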
I managed to get some form of "smarter" merging working in https://github.com/mozilla/balrog/compare/master...mozbhearsum:smarter-partial-merging?expand=1. It's a horrible implementation because it ends up iterating over platforms+locales up to 12 extra times (two patch types * 3 blobs * 2 iterations per blob). CPU usage is very low on admin (<10%), even at peak times, so maybe it's okay? I'm not sure how to verify how well it will perform in production. I'm still looking at hooking into the dictdiffer merge/patch functions instead, which should avoid the need to add any extra iterations over the blob.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
(In reply to Ben Hearsum (:bhearsum) from comment #16)
> I managed to get some form of "smarter" merging working in
> https://github.com/mozilla/balrog/compare/master...mozbhearsum:smarter-partial-merging?expand=1.
> It's a horrible implementation because it ends up iterating over
> platforms+locales up to 12 extra times (two patch types * 3 blobs * 2
> iterations per blob). CPU usage is very low on admin (<10%), even at peak
> times, so maybe it's okay? I'm not sure how to verify how well it will
> perform in production.
>
> I'm still looking at hooking into the dictdiffer merge/patch functions
> instead, which should avoid the need to add any extra iterations over the
> blob.

I made some headway here by ripping out dictdiffer completely, and replacing it with a super basic replacement that I wrote myself. The replacement will walk through nested dictionaries and combine changes from both sides - including combining lists (on the assumption that order doesn't matter). It's not ready for review yet, but if anyone is curious feel free to have a look at and/or comment on https://github.com/mozilla/balrog/pull/409.
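This is not the code in PR #409, but a compact sketch of the merge strategy described above: walk both versions against their common ancestor, merge nested dicts key by key, union lists on the assumption that order doesn't matter, and only treat genuinely conflicting scalar changes as unmergeable:

def three_way_merge(ancestor, ours, theirs):
    # "ancestor" is the blob both requests started from; "ours"/"theirs" are
    # the two competing updated versions.
    if ours == theirs:
        return ours
    if isinstance(ancestor, dict) and isinstance(ours, dict) and isinstance(theirs, dict):
        merged = {}
        for key in set(ancestor) | set(ours) | set(theirs):
            merged[key] = three_way_merge(
                ancestor.get(key), ours.get(key), theirs.get(key)
            )
        return merged
    if isinstance(ancestor, list) and isinstance(ours, list) and isinstance(theirs, list):
        # Combine additions from both sides; order is assumed not to matter.
        merged = list(ancestor)
        for side in (ours, theirs):
            for item in side:
                if item not in merged:
                    merged.append(item)
        return merged
    # Scalar (or type-changing) value: prefer whichever side actually changed.
    if ours == ancestor:
        return theirs
    if theirs == ancestor:
        return ours
    raise ValueError("unmergeable conflict")

The caller would translate the unmergeable case back into the existing OutdatedDataError/400 behaviour.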
Relud did some infrastructure changes today that have greatly reduced the load on the primary RDS database. This *should* result in quicker admin requests, which *should* result in fewer 400s. We should know whether or not that's true with the next few releases. I've also made some good progress on my merging improvements. The PR has some more robust tests, and also passes all of the existing ones. It still needs cleanup, and probably additional tests, but it's looking promising that it will be ready by the end of the week.
FTR, the infra changes were rolled back after some bustage.
I think we should probably move the infra work to a different bug at this point. Here are some details on the failure/rollback:

(In reply to Nick Thomas [:nthomas] from comment #20)
> FTR, the infra changes were rolled back after some bustage.

We just had a postmortem on this. Here's the summary:
* At 21:00 UTC a cronjob ran that deleted many rows from the releases_history table.
* This caused a spike in replication lag and sluggishness in the replica db.
* Starting at 21:01 UTC, we started serving large numbers of 500s due to the replica db sluggishness.
* At 21:23, Relud switched web traffic back to the primary database.
* At 21:25, we stopped serving 500s.
* Around 23:45, the replica finally caught up with the master (without human intervention).

It appears that the replica was bottlenecked on the disk in some way, though we're not exactly sure how. It could be that we ran through the IOPS quota when spinning up the database initially, or that the disk was simply too slow to handle replicating the deletes plus web traffic. We're going to see what happens with the replica when the cleanup cronjob runs again today, which should help us narrow things down a bit further.

As things stand now, we have no plans to proceed on this until we have more data and critical releases are out of the way. I'll continue pushing forward on https://github.com/mozilla/balrog/pull/409 as an alternative way to reduce the number of retries when submitting releases.
Attached patch smarter blob merging (deleted) — Splinter Review
Attachment #8910913 - Flags: review?(nthomas)
Attachment #8910913 - Flags: review?(catlee)
Attachment #8910913 - Flags: review?(nthomas) → review+
Commit pushed to master at https://github.com/mozilla/balrog https://github.com/mozilla/balrog/commit/bd9c0ee5b2617d7c82987089e2c03a0782aea425 bug 1400141: Smarter partial merging (#409). r=nthomas,catlee,allan-silva
Attachment #8910913 - Attachment is patch: true
Attachment #8910913 - Attachment mime type: text/x-github-pull-request → text/plain
Attachment #8910913 - Flags: review?(catlee) → review+
Flags: needinfo?(dthorn)
Depends on: 1405714
We've made improvements on the client and server side directly in this bug -- the server side improvements to blob merging landed in production today. There's some additional work being done in bug 1246993 that should fully eliminate the 400s, which I expect to land next week - leaving this open to track that.
Depends on: 1246993
It looks like https://github.com/mozilla/balrog/pull/409 caused some perf issues, as described in https://bugzilla.mozilla.org/show_bug.cgi?id=1405924:

> (In reply to Nick Thomas [:nthomas] from comment #0)
> Since bug 1405714, and possibly the first nightly builds since then, the
> Balrog admin host has been intermittently unavailable. In datadog there is a
> lot of RDS read ops, and 1-2 pending admin connections since ~2300 UTC.

All of those extra read ops are likely the SELECTs from releases & releases_history when we attempt a merge. Those queries look like:

SELECT releases_history.change_id, releases_history.changed_by, releases_history.timestamp, releases_history.name, releases_history.product, releases_history.`read_only`, releases_history.data, releases_history.data_version
FROM releases_history
WHERE releases_history.data_version = %s AND releases_history.name = %s

and

SELECT releases.name, releases.product, releases.data_version FROM releases WHERE releases.name = %s

SELECT releases.data_version FROM releases WHERE releases.name = %s LIMIT 1

SELECT releases.data FROM releases WHERE releases.name = %s LIMIT 1

The latter three are unlikely to be the source of any extra load, as they all use the primary key in the WHERE clause. My suspicion here is that the releases_history query is very slow when querying by a name that has a huge number of rows, like Firefox-mozilla-central-nightly-latest. It appears that we have about ~15,000 rows like that at the moment. According to https://github.com/mozilla/balrog/blob/master/auslib/migrate/versions/006_releases_name_index.py, we should have an index on releases_history.name, but I'm not sure that's terribly helpful when name is not remotely unique.

A few random thoughts on things that might help here:
- Get even more aggressive about removing nightly history (only keep it for 1 day). This would give us a one-time perf hit while we remove most of the current ~2 weeks that we keep. This should greatly reduce the releases_history query time.
- Add a multi-column index on name, data_version (a sketch of such a migration follows at the end of this comment). I don't know how viable this is, both disk-space-wise and perf-wise.
- Give the db more CPU/memory/IOPS to simply make it faster. If it even helps, this would be a brute force short term fix - not something we can rely on in even the medium term.
- Some combination of the above

--

We also saw a few things in Sentry:
* ~120 IOErrors over a ~4h span
* 8 errors that are not getting reported properly (it looks like an Exception is being raised after the original one, masking the original problem)
* 2 TransactionErrors like: TransactionError: ("(OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')",)

The IOErrors have absolutely no context in them, just "write error" without even the last frame of a stack. This makes me think they're coming from uwsgi itself and not Balrog - perhaps this is what happens if the app doesn't respond in time? The set of 8 errors doesn't have a useful stack either, but it does have a bunch of log messages attached that are almost exactly the same as the 2 TransactionErrors - except with some extra messages after the second Exception is raised. I suspect that the root exception is the same for both of these. I don't think these are related to the root cause, because the log messages don't show them even _attempting_ to merge - they're probably secondary errors after the app got bogged down enough.
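Purely as an illustration of the multi-column index option above, here is a sketch in the style of Balrog's existing sqlalchemy-migrate scripts (such as 006_releases_name_index.py). The migration file name and index name are made up, and this is not a change that landed as part of this bug:

from sqlalchemy import Index, MetaData, Table

def upgrade(migrate_engine):
    metadata = MetaData(bind=migrate_engine)
    history = Table("releases_history", metadata, autoload=True)
    # Composite index so lookups by (name, data_version) don't have to scan
    # every history row for high-churn names like
    # Firefox-mozilla-central-nightly-latest.
    Index("releases_history_name_data_version_idx",
          history.c.name, history.c.data_version).create()

def downgrade(migrate_engine):
    metadata = MetaData(bind=migrate_engine)
    history = Table("releases_history", metadata, autoload=True)
    Index("releases_history_name_data_version_idx",
          history.c.name, history.c.data_version).drop()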
Relud and I spoke today about the outage and how we could prevent such problems in the future. This particular issue only revealed itself under true production conditions with full releases_history and full production traffic. To catch these types of bugs in the future, he suggested that we could set-up an additional Balrog stage environment that admin traffic would be tee'd too. I've filed this as https://bugzilla.mozilla.org/show_bug.cgi?id=1406128 with some additional details.
Attached patch improve getChange performance (deleted) — Splinter Review
Attachment #8916115 - Flags: review?(catlee)
Comment on attachment 8916115 [details] [diff] [review]
improve getChange performance

Review of attachment 8916115 [details] [diff] [review]:
-----------------------------------------------------------------

is there a way to test performance of this change?

::: auslib/db.py
@@ +754,5 @@
> + # Additional details in https://github.com/mozilla/balrog/pull/419#issuecomment-334851038
> + change_ids = self.select(columns=[self.change_id], where=where,
> +                          transaction=transaction)
> + if len(change_ids) != 1:
> +     self.log.debug("Found %s changes, should have been 1", len(change_ids))

how important would it be when debugging to distinguish this log line from the one below?
Attachment #8916115 - Attachment is patch: true
Attachment #8916115 - Attachment mime type: text/x-github-pull-request → text/plain
Attachment #8916115 - Flags: review?(catlee) → review+
(In reply to Chris AtLee [:catlee] from comment #28)
> Comment on attachment 8916115 [details] [diff] [review]
> improve getChange performance
>
> Review of attachment 8916115 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> is there a way to test performance of this change?

Thanks for the push on this. Relud is going to clone the prod db to stage, and I'm going to get some nightlies running with this patch against stage afterwards. If those don't show the same perf issues as last week, that's as sure as we can get before we head to prod.

> ::: auslib/db.py
> @@ +754,5 @@
> > + # Additional details in https://github.com/mozilla/balrog/pull/419#issuecomment-334851038
> > + change_ids = self.select(columns=[self.change_id], where=where,
> > +                          transaction=transaction)
> > + if len(change_ids) != 1:
> > +     self.log.debug("Found %s changes, should have been 1", len(change_ids))
>
> how important would it be when debugging to distinguish this log line from
> the one below?

Hard to be certain, but there's no harm in making sure we can distinguish between them.
Commit pushed to master at https://github.com/mozilla/balrog https://github.com/mozilla/balrog/commit/ecb6110ec6b51ce43047af6c36241397392b21b6 bug 1400141: Improve History.getChange efficiency. (#419). r=catlee
Depends on: 1407670
(In reply to [github robot] from comment #30)
> Commit pushed to master at https://github.com/mozilla/balrog
>
> https://github.com/mozilla/balrog/commit/ecb6110ec6b51ce43047af6c36241397392b21b6
> bug 1400141: Improve History.getChange efficiency. (#419). r=catlee

This patch hit production today, and it appears to have solved the performance issues. Around the time of the push there were retriggered nightlies that managed to submit without any issues. The true test will be the nightlies that start around ~6pm eastern today. I'll be watching them closely for any issues.

bug 1407670 has one final fix for server side merging as well, but it won't land until next week at the earliest.
The number of 400s has been greatly reduced, although not fully eliminated. We still see a smattering of them with every set of nightlies (around 100 per day), but not enough to cause submissions to ultimately fail. My suspicion is that these only happen when server side merging is attempted, but a new change is committed between the time the merging starts and the time we try to commit the merged result. This case still causes expected 400s, because we only make one attempt at server side merging. I haven't dug deeply into the data, so it's possible it's not this, or that there are other reasons we're still getting some 400s, too.

Because I haven't seen nor heard of any issues with 400s requiring manual intervention for Releases since these fixes landed, I'm going to close this.
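As an aside on the residual 400s described above: one way the remaining window could in principle be shrunk is to re-read and re-merge a few times when committing the merged blob itself loses the race. This is only a sketch of the idea, not a planned Balrog change; every helper below (get_release, commit_release, three_way_merge) is a hypothetical stand-in rather than a real Balrog function:

class OutdatedDataError(Exception):
    # Stand-in for Balrog's exception for a data_version mismatch.
    pass

def update_with_merge_retries(name, ancestor_blob, incoming_blob, attempts=3):
    for _ in range(attempts):
        current_blob, current_version = get_release(name)  # hypothetical read
        merged = three_way_merge(ancestor_blob, current_blob, incoming_blob)
        try:
            # The data_version check on the UPDATE is what would raise
            # OutdatedDataError if yet another change lands in between.
            commit_release(name, merged, old_data_version=current_version)
            return
        except OutdatedDataError:
            continue  # someone else won the race; re-read and re-merge
    raise OutdatedDataError("gave up after %d merge attempts" % attempts)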
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open