Closed Bug 851585 Opened 12 years ago Closed 12 years ago

server side retries in balrog are causing huge amounts of OutdatedDataErrors

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

References

(
URL
)

Details

(Whiteboard: [balrog])

Attachments

(1 file)

no more server side retry 12 years ago bhearsum@mozilla.com (:bhearsum) (deleted), patch	nthomas : review+ bhearsum : checked-in+	Details \| Diff \| Splinter Review

bhearsum@mozilla.com (:bhearsum)

Assignee

Description

•

12 years ago

OutdatedDataError: Failed to update row, old_data_version doesn't match current data_version Stacktrace (most recent call last): File "flask/app.py", line 1292, in wsgi_app response = self.full_dispatch_request() File "flask/app.py", line 1062, in full_dispatch_request rv = self.handle_user_exception(e) File "flask/app.py", line 1060, in full_dispatch_request rv = self.dispatch_request() File "flask/app.py", line 1047, in dispatch_request return self.view_functions[rule.endpoint](**req.view_args) File "flask/views.py", line 56, in view return self.dispatch_request(*args, **kwargs) File "flask/views.py", line 112, in dispatch_request return meth(*args, **kwargs) File "auslib/admin/views/base.py", line 49, in put return self._put(*args, transaction=trans, **kwargs) File "auslib/admin/views/base.py", line 49, in put return self._put(*args, transaction=trans, **kwargs) File "auslib/admin/views/base.py", line 16, in decorated return f(*args, changed_by=username, **kwargs) File "auslib/admin/views/base.py", line 32, in decorated return f(*args, **kwargs) File "auslib/admin/views/releases.py", line 150, in _put return changeRelease(release, changed_by, transaction, exists, commit, self.log) File "auslib/admin/views/releases.py", line 110, in changeRelease commitCallback(rel, product, version, incomingData, releaseInfo['data'], old_data_version) File "auslib/admin/views/releases.py", line 148, in commit return retry(db.releases.addLocaleToRelease, kwargs=dict(name=rel, platform=platform, locale=locale, data=localeData, old_data_version=old_data_version, changed_by=changed_by, transaction=transaction)) File "auslib/util/retry.py", line 12, in retry return upstream_retry(*a, **kw) File "mozilla_buildtools/retry.py", line 27, in retry return action(*args, **kwargs) File "auslib/db.py", line 821, in addLocaleToRelease transaction=transaction) File "auslib/db.py", line 390, in update return self._prepareUpdate(transaction, where, what, changed_by, old_data_version) File "auslib/db.py", line 364, in _prepareUpdate raise OutdatedDataError("Failed to update row, old_data_version doesn't match current data_version") Maybe this is because we have so many repacks trying to update in parallel? We might need to consider some sort of queuing system rather than retries if that's the case. Maybe the client show make the request and the server can complete it asynchronously?

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

12 years ago

Component: Release Engineering: Automation (Release Automation) → Release Engineering: Automation (General)

QA Contact: bhearsum → catlee

Whiteboard: [balrog]

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 1

•

12 years ago

We should probably try to fix this before we go to production for Nightly...or at the very least, get to the bottom of it and decide we can live with it.

Blocks: balrog-nightly

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 2

•

12 years ago

I started looking into this a bit more but it's very very hard to make sense of the application logs at the moment -- all of the request data is interleaved between threads/processes...I filed bug 859861 on this.

Depends on: 859861

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 3

•

12 years ago

I looked into this more today and found interesting results. It turns out that we retry 5 times on the server but never update the old_data_version. Eg: 2013-04-11 05:35:00,211 - PID: 24251 - Request: 3028172396 - mozilla_buildtools.retry.retry#26: retry: Calling <bound method Releases.addLocaleToRelease ... 'old_data_version': 15}, attempt #1 2013-04-11 05:35:05,304 - PID: 24251 - Request: 3028172396 - mozilla_buildtools.retry.retry#26: retry: Calling <bound method Releases.addLocaleToRelease ... 'old_data_version': 15}, attempt #2 ... 2013-04-11 05:35:20,456 - PID: 24251 - Request: 3028172396 - mozilla_buildtools.retry.retry#26: retry: Calling <bound method Releases.addLocaleToRelease ... 'old_data_version': 15}, attempt #5 After all this useless retrying, the server returns a 500 error and the client retries and succeeds (generally - in some cases it may end up losing another race and retrying *again*): 10.8.33.241 - stage-ffxbld [11/Apr/2013:05:35:00 -0700] "PUT /releases/Firefox-mozilla-central-nightly-20130411030925/builds/Linux_x86-gcc3/lv HTTP/1.1" 500 541 "-" "python-requests/0.10.8" 10.8.33.241 - stage-ffxbld [11/Apr/2013:05:35:23 -0700] "PUT /releases/Firefox-mozilla-central-nightly-20130411030925/builds/Linux_x86-gcc3/lv HTTP/1.1" 201 24 "-" "python-requests/0.10.8"

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 4

•

12 years ago

So, there's a couple of things we could do here.... If we want to keep the server side retrying we could get more picky about which exceptions we catch in the retry to make sure we don't catch OutdatedDataError. The other obvious option is to rip out retries from the server and let the client deal with it, as it's doing already. I'm leaning towards this because I think this is what pretty much every other website in the world does. It seems quite wrong to make a client's request take a long time because we're sleeping before retrying.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 5

•

12 years ago

I tried out idea #2 and it still passes all tests...

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 6

•

12 years ago

Attached patch no more server side retry (deleted) — Details — Splinter Review

I think this makes the most sense...I'm happy to take a different approach if you disagree though.

Assignee: nobody → bhearsum

Status: NEW → ASSIGNED

Attachment #736952 - Flags: review?(nthomas)

Nick Thomas [:nthomas] (UTC+12)

Comment 7

•

12 years ago

Comment on attachment 736952 [details] [diff] [review] no more server side retry Sorry for the delay in getting this, fell through the cracks. Looks fine to me but will try to watch out for errors using the UI etc.

Attachment #736952 - Flags: review?(nthomas) → review+

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

12 years ago

Summary: lots of OutdatedDataError: Failed to update row, old_data_version doesn't match current data_version on balrog dev → server side retries in balrog are causing huge amounts of OutdatedDataErrors

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 8

•

12 years ago

Comment on attachment 736952 [details] [diff] [review] no more server side retry (In reply to Nick Thomas [:nthomas] from comment #7) > Comment on attachment 736952 [details] [diff] [review] > no more server side retry > > Sorry for the delay in getting this, fell through the cracks. Looks fine to > me but will try to watch out for errors using the UI etc. Good point. This might make us more aware of UI enhancements that we may need to deal with failures.

Attachment #736952 - Flags: checked-in+

[github robot]

Comment 9

•

12 years ago

Commit pushed to master at https://github.com/mozilla/balrog https://github.com/mozilla/balrog/commit/e64d847f42bd8cb188f08f64df2cede16b041386 bug 851585: server side retries in balrog are causing huge amounts of OutdatedDataErrors. r=nthomas

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 10

•

12 years ago

Tested fine on Jenkins. http://10.134.48.37:8080/job/Balrog/24/ is green and http://10.134.48.37:8080/job/Balrog%20Snippet%20Comparison/30/ has the number of failures as the previous run. It's been automatically deployed to both of the dev nodes. I've marked the error in Sentry as resolved, so we should notice if it continues to happen even with this patch.

Status: ASSIGNED → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

11 years ago

Product: mozilla.org → Release Engineering

Nobody; OK to take it and work on it

Updated

•

7 years ago

Component: General Automation → General

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Quick Search

server side retries in balrog are causing huge amounts of OutdatedDataErrors

Categories

(Release Engineering :: General, defect)

Tracking

(Not tracked)

People

(Reporter: bhearsum, Assigned: bhearsum)

References

(
URL
)

Details

(Whiteboard: [balrog])

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Updated

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Updated

Comment 8

Comment 9

Comment 10

Updated

Updated

Attachment

General

Description

File Name

Content Type