Closed Bug 851585 Opened 12 years ago Closed 12 years ago

server side retries in balrog are causing huge amounts of OutdatedDataErrors

Categories

(Release Engineering :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

References

()

Details

(Whiteboard: [balrog])

Attachments

(1 file)

OutdatedDataError: Failed to update row, old_data_version doesn't match current data_version Stacktrace (most recent call last): File "flask/app.py", line 1292, in wsgi_app response = self.full_dispatch_request() File "flask/app.py", line 1062, in full_dispatch_request rv = self.handle_user_exception(e) File "flask/app.py", line 1060, in full_dispatch_request rv = self.dispatch_request() File "flask/app.py", line 1047, in dispatch_request return self.view_functions[rule.endpoint](**req.view_args) File "flask/views.py", line 56, in view return self.dispatch_request(*args, **kwargs) File "flask/views.py", line 112, in dispatch_request return meth(*args, **kwargs) File "auslib/admin/views/base.py", line 49, in put return self._put(*args, transaction=trans, **kwargs) File "auslib/admin/views/base.py", line 49, in put return self._put(*args, transaction=trans, **kwargs) File "auslib/admin/views/base.py", line 16, in decorated return f(*args, changed_by=username, **kwargs) File "auslib/admin/views/base.py", line 32, in decorated return f(*args, **kwargs) File "auslib/admin/views/releases.py", line 150, in _put return changeRelease(release, changed_by, transaction, exists, commit, self.log) File "auslib/admin/views/releases.py", line 110, in changeRelease commitCallback(rel, product, version, incomingData, releaseInfo['data'], old_data_version) File "auslib/admin/views/releases.py", line 148, in commit return retry(db.releases.addLocaleToRelease, kwargs=dict(name=rel, platform=platform, locale=locale, data=localeData, old_data_version=old_data_version, changed_by=changed_by, transaction=transaction)) File "auslib/util/retry.py", line 12, in retry return upstream_retry(*a, **kw) File "mozilla_buildtools/retry.py", line 27, in retry return action(*args, **kwargs) File "auslib/db.py", line 821, in addLocaleToRelease transaction=transaction) File "auslib/db.py", line 390, in update return self._prepareUpdate(transaction, where, what, changed_by, old_data_version) File "auslib/db.py", line 364, in _prepareUpdate raise OutdatedDataError("Failed to update row, old_data_version doesn't match current data_version") Maybe this is because we have so many repacks trying to update in parallel? We might need to consider some sort of queuing system rather than retries if that's the case. Maybe the client show make the request and the server can complete it asynchronously?
Component: Release Engineering: Automation (Release Automation) → Release Engineering: Automation (General)
QA Contact: bhearsum → catlee
Whiteboard: [balrog]
We should probably try to fix this before we go to production for Nightly...or at the very least, get to the bottom of it and decide we can live with it.
I started looking into this a bit more but it's very very hard to make sense of the application logs at the moment -- all of the request data is interleaved between threads/processes...I filed bug 859861 on this.
Depends on: 859861
I looked into this more today and found interesting results. It turns out that we retry 5 times on the server but never update the old_data_version. Eg: 2013-04-11 05:35:00,211 - PID: 24251 - Request: 3028172396 - mozilla_buildtools.retry.retry#26: retry: Calling <bound method Releases.addLocaleToRelease ... 'old_data_version': 15}, attempt #1 2013-04-11 05:35:05,304 - PID: 24251 - Request: 3028172396 - mozilla_buildtools.retry.retry#26: retry: Calling <bound method Releases.addLocaleToRelease ... 'old_data_version': 15}, attempt #2 ... 2013-04-11 05:35:20,456 - PID: 24251 - Request: 3028172396 - mozilla_buildtools.retry.retry#26: retry: Calling <bound method Releases.addLocaleToRelease ... 'old_data_version': 15}, attempt #5 After all this useless retrying, the server returns a 500 error and the client retries and succeeds (generally - in some cases it may end up losing another race and retrying *again*): 10.8.33.241 - stage-ffxbld [11/Apr/2013:05:35:00 -0700] "PUT /releases/Firefox-mozilla-central-nightly-20130411030925/builds/Linux_x86-gcc3/lv HTTP/1.1" 500 541 "-" "python-requests/0.10.8" 10.8.33.241 - stage-ffxbld [11/Apr/2013:05:35:23 -0700] "PUT /releases/Firefox-mozilla-central-nightly-20130411030925/builds/Linux_x86-gcc3/lv HTTP/1.1" 201 24 "-" "python-requests/0.10.8"
So, there's a couple of things we could do here.... If we want to keep the server side retrying we could get more picky about which exceptions we catch in the retry to make sure we don't catch OutdatedDataError. The other obvious option is to rip out retries from the server and let the client deal with it, as it's doing already. I'm leaning towards this because I think this is what pretty much every other website in the world does. It seems quite wrong to make a client's request take a long time because we're sleeping before retrying.
I tried out idea #2 and it still passes all tests...
Attached patch no more server side retry (deleted) — Splinter Review
I think this makes the most sense...I'm happy to take a different approach if you disagree though.
Assignee: nobody → bhearsum
Status: NEW → ASSIGNED
Attachment #736952 - Flags: review?(nthomas)
Comment on attachment 736952 [details] [diff] [review] no more server side retry Sorry for the delay in getting this, fell through the cracks. Looks fine to me but will try to watch out for errors using the UI etc.
Attachment #736952 - Flags: review?(nthomas) → review+
Summary: lots of OutdatedDataError: Failed to update row, old_data_version doesn't match current data_version on balrog dev → server side retries in balrog are causing huge amounts of OutdatedDataErrors
Comment on attachment 736952 [details] [diff] [review] no more server side retry (In reply to Nick Thomas [:nthomas] from comment #7) > Comment on attachment 736952 [details] [diff] [review] > no more server side retry > > Sorry for the delay in getting this, fell through the cracks. Looks fine to > me but will try to watch out for errors using the UI etc. Good point. This might make us more aware of UI enhancements that we may need to deal with failures.
Attachment #736952 - Flags: checked-in+
Commit pushed to master at https://github.com/mozilla/balrog https://github.com/mozilla/balrog/commit/e64d847f42bd8cb188f08f64df2cede16b041386 bug 851585: server side retries in balrog are causing huge amounts of OutdatedDataErrors. r=nthomas
Tested fine on Jenkins. http://10.134.48.37:8080/job/Balrog/24/ is green and http://10.134.48.37:8080/job/Balrog%20Snippet%20Comparison/30/ has the number of failures as the previous run. It's been automatically deployed to both of the dev nodes. I've marked the error in Sentry as resolved, so we should notice if it continues to happen even with this patch.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: