Closed
Bug 851585
Opened 12 years ago
Closed 12 years ago
server side retries in balrog are causing huge amounts of OutdatedDataErrors
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Assigned: bhearsum)
References
()
Details
(Whiteboard: [balrog])
Attachments
(1 file)
(deleted),
patch
|
nthomas
:
review+
bhearsum
:
checked-in+
|
Details | Diff | Splinter Review |
OutdatedDataError: Failed to update row, old_data_version doesn't match current data_version
Stacktrace (most recent call last):
File "flask/app.py", line 1292, in wsgi_app
response = self.full_dispatch_request()
File "flask/app.py", line 1062, in full_dispatch_request
rv = self.handle_user_exception(e)
File "flask/app.py", line 1060, in full_dispatch_request
rv = self.dispatch_request()
File "flask/app.py", line 1047, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "flask/views.py", line 56, in view
return self.dispatch_request(*args, **kwargs)
File "flask/views.py", line 112, in dispatch_request
return meth(*args, **kwargs)
File "auslib/admin/views/base.py", line 49, in put
return self._put(*args, transaction=trans, **kwargs)
File "auslib/admin/views/base.py", line 49, in put
return self._put(*args, transaction=trans, **kwargs)
File "auslib/admin/views/base.py", line 16, in decorated
return f(*args, changed_by=username, **kwargs)
File "auslib/admin/views/base.py", line 32, in decorated
return f(*args, **kwargs)
File "auslib/admin/views/releases.py", line 150, in _put
return changeRelease(release, changed_by, transaction, exists, commit, self.log)
File "auslib/admin/views/releases.py", line 110, in changeRelease
commitCallback(rel, product, version, incomingData, releaseInfo['data'], old_data_version)
File "auslib/admin/views/releases.py", line 148, in commit
return retry(db.releases.addLocaleToRelease, kwargs=dict(name=rel, platform=platform, locale=locale, data=localeData, old_data_version=old_data_version, changed_by=changed_by, transaction=transaction))
File "auslib/util/retry.py", line 12, in retry
return upstream_retry(*a, **kw)
File "mozilla_buildtools/retry.py", line 27, in retry
return action(*args, **kwargs)
File "auslib/db.py", line 821, in addLocaleToRelease
transaction=transaction)
File "auslib/db.py", line 390, in update
return self._prepareUpdate(transaction, where, what, changed_by, old_data_version)
File "auslib/db.py", line 364, in _prepareUpdate
raise OutdatedDataError("Failed to update row, old_data_version doesn't match current data_version")
Maybe this is because we have so many repacks trying to update in parallel? We might need to consider some sort of queuing system rather than retries if that's the case. Maybe the client show make the request and the server can complete it asynchronously?
Assignee | ||
Updated•12 years ago
|
Component: Release Engineering: Automation (Release Automation) → Release Engineering: Automation (General)
QA Contact: bhearsum → catlee
Whiteboard: [balrog]
Assignee | ||
Comment 1•12 years ago
|
||
We should probably try to fix this before we go to production for Nightly...or at the very least, get to the bottom of it and decide we can live with it.
Blocks: balrog-nightly
Assignee | ||
Comment 2•12 years ago
|
||
I started looking into this a bit more but it's very very hard to make sense of the application logs at the moment -- all of the request data is interleaved between threads/processes...I filed bug 859861 on this.
Depends on: 859861
Assignee | ||
Comment 3•12 years ago
|
||
I looked into this more today and found interesting results. It turns out that we retry 5 times on the server but never update the old_data_version. Eg:
2013-04-11 05:35:00,211 - PID: 24251 - Request: 3028172396 - mozilla_buildtools.retry.retry#26: retry: Calling <bound method Releases.addLocaleToRelease ... 'old_data_version': 15}, attempt #1
2013-04-11 05:35:05,304 - PID: 24251 - Request: 3028172396 - mozilla_buildtools.retry.retry#26: retry: Calling <bound method Releases.addLocaleToRelease ... 'old_data_version': 15}, attempt #2
...
2013-04-11 05:35:20,456 - PID: 24251 - Request: 3028172396 - mozilla_buildtools.retry.retry#26: retry: Calling <bound method Releases.addLocaleToRelease ... 'old_data_version': 15}, attempt #5
After all this useless retrying, the server returns a 500 error and the client retries and succeeds (generally - in some cases it may end up losing another race and retrying *again*):
10.8.33.241 - stage-ffxbld [11/Apr/2013:05:35:00 -0700] "PUT /releases/Firefox-mozilla-central-nightly-20130411030925/builds/Linux_x86-gcc3/lv HTTP/1.1" 500 541 "-" "python-requests/0.10.8"
10.8.33.241 - stage-ffxbld [11/Apr/2013:05:35:23 -0700] "PUT /releases/Firefox-mozilla-central-nightly-20130411030925/builds/Linux_x86-gcc3/lv HTTP/1.1" 201 24 "-" "python-requests/0.10.8"
Assignee | ||
Comment 4•12 years ago
|
||
So, there's a couple of things we could do here....
If we want to keep the server side retrying we could get more picky about which exceptions we catch in the retry to make sure we don't catch OutdatedDataError.
The other obvious option is to rip out retries from the server and let the client deal with it, as it's doing already. I'm leaning towards this because I think this is what pretty much every other website in the world does. It seems quite wrong to make a client's request take a long time because we're sleeping before retrying.
Assignee | ||
Comment 5•12 years ago
|
||
I tried out idea #2 and it still passes all tests...
Assignee | ||
Comment 6•12 years ago
|
||
I think this makes the most sense...I'm happy to take a different approach if you disagree though.
Comment 7•12 years ago
|
||
Comment on attachment 736952 [details] [diff] [review]
no more server side retry
Sorry for the delay in getting this, fell through the cracks. Looks fine to me but will try to watch out for errors using the UI etc.
Attachment #736952 -
Flags: review?(nthomas) → review+
Assignee | ||
Updated•12 years ago
|
Summary: lots of OutdatedDataError: Failed to update row, old_data_version doesn't match current data_version on balrog dev → server side retries in balrog are causing huge amounts of OutdatedDataErrors
Assignee | ||
Comment 8•12 years ago
|
||
Comment on attachment 736952 [details] [diff] [review]
no more server side retry
(In reply to Nick Thomas [:nthomas] from comment #7)
> Comment on attachment 736952 [details] [diff] [review]
> no more server side retry
>
> Sorry for the delay in getting this, fell through the cracks. Looks fine to
> me but will try to watch out for errors using the UI etc.
Good point. This might make us more aware of UI enhancements that we may need to deal with failures.
Attachment #736952 -
Flags: checked-in+
Comment 9•12 years ago
|
||
Commit pushed to master at https://github.com/mozilla/balrog
https://github.com/mozilla/balrog/commit/e64d847f42bd8cb188f08f64df2cede16b041386
bug 851585: server side retries in balrog are causing huge amounts of OutdatedDataErrors. r=nthomas
Assignee | ||
Comment 10•12 years ago
|
||
Tested fine on Jenkins.
http://10.134.48.37:8080/job/Balrog/24/ is green and http://10.134.48.37:8080/job/Balrog%20Snippet%20Comparison/30/ has the number of failures as the previous run.
It's been automatically deployed to both of the dev nodes. I've marked the error in Sentry as resolved, so we should notice if it continues to happen even with this patch.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated•11 years ago
|
Product: mozilla.org → Release Engineering
Updated•7 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•