Bug 671488: Add blob caching to balrog
Status: CLOSED, RESOLVED FIXED (opened 13 years ago, closed 10 years ago)
Component: Release Engineering Graveyard :: Applications: Balrog (backend), defect, P3
Tracking: (Not tracked)
People: Reporter: nthomas; Assigned: bhearsum
Whiteboard: [updates][balrog]
Attachments (2 files, 3 obsolete):
(deleted), patch, bhearsum: checked-in+
(deleted), patch, nthomas: review+
There is lots of caching we might implement, but the obvious one is the XML returned for a request. Possibly keyed off updateQuery, probably for at least minutes. We should also build in a mechanism to empty the cache when the rules or JSON have changed.
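As a minimal illustration of the idea (hypothetical names; the work in this bug ultimately ended up caching blobs rather than rendered XML):

    import time

    class ResponseCache:
        """Toy TTL cache for update XML, keyed off the updateQuery URL."""

        def __init__(self, ttl=60):
            self.ttl = ttl
            self._store = {}  # key -> (expiry timestamp, xml)

        def get(self, key):
            entry = self._store.get(key)
            if entry and entry[0] > time.time():
                return entry[1]
            return None

        def put(self, key, xml):
            self._store[key] = (time.time() + self.ttl, xml)

        def clear(self):
            # Call this whenever the rules or release JSON change.
            self._store.clear()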
Comment 1 • 13 years ago
Add caching to what?
Updated (Reporter) • 13 years ago
Summary: Add caching → Add caching to AUS3
Updated • 13 years ago
Component: Release Engineering → Release Engineering: Automation
Priority: P3 → --
QA Contact: release → catlee
Hardware: x86 → All
Whiteboard: [updates]
Updated • 13 years ago
Priority: -- → P3
Whiteboard: [updates] → [updates][balrog]
Updated (Assignee) • 12 years ago
Comment 2 (Assignee) • 12 years ago
Catlee and I were talking the other day and the idea of caching JSON decoding came up. It's (probably) going to be one of the more expensive things we do.
Comment 3 (Assignee) • 12 years ago
It'll be easier to do this once we have perf tests for Balrog, because we'll have a better idea of where our bottlenecks are.
Depends on: 832464
Comment 4 (Reporter) • 11 years ago
Will also be informed by testing against production nodes once those are up.
Depends on: 832461
Updated • 11 years ago
Product: mozilla.org → Release Engineering
Updated (Assignee) • 11 years ago
No longer blocks: balrog-nightly
Comment 5 (Assignee) • 11 years ago
mass component change
Component: General Automation → Balrog: Backend
Updated (Assignee) • 11 years ago
Summary: Add caching to AUS3 → Add caching to balrog
Updated (Assignee) • 11 years ago
Blocks: balrog-beta
Comment 6 (Assignee) • 11 years ago
We decided to add Zeus caching in bug 982618, which negates the need for this (for now, at least). We may come back to it at some point in the future.
No longer blocks: balrog-beta
Comment 7 (Assignee) • 10 years ago
Catlee and I talked about some ideas today:
10:45 <@catlee> caching!
10:45 < bhearsum> yeah, a blob cache would help with that too
10:45 < bhearsum> zeus helps a lot already
10:45 <@catlee> yeah, I bet
10:45 < bhearsum> but it doesn't help in cases where you have slightly differently urls that end up with the same blob
10:45 <@catlee> how long can you keep something in the blob cache?
10:46 <@catlee> is there a quick way to check if it's up to date?
10:46 < bhearsum> i was thinking it could be defined with a maximum size and use it as an LRU
10:46 < bhearsum> and then some cap on cache length for items too
10:46 < bhearsum> 30 or 60s or something
10:46 <@catlee> you could cache data version
10:47 < bhearsum> ah, that's a good idea!
10:47 <@catlee> that should be fast to check
10:47 < bhearsum> look up data version each time, and then only get the blob if data version is changed?
10:47 <@catlee> well, you can probably even cache that for short periods
10:47 <@catlee> so you check the data version at most once a minute
10:47 <@catlee> or maybe 30s
10:47 < bhearsum> yeah, possibly...
10:47 < bhearsum> we need to figure out what our maximum livable TTL is though
10:48 < bhearsum> eg, if we cache 1min in app and 30s at zeus, that's 1m30s
10:48 <@catlee> yeah
10:48 <@catlee> I wouldn't do it per request
10:48 < bhearsum> check data version?
10:48 <@catlee> yeah
10:49 <@catlee> even 10s would be fine
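Sketched as code, the design converging here (a hypothetical, standalone sketch, not the eventual patch): a size-bounded LRU whose entries also carry a short TTL.

    import time
    from collections import OrderedDict

    class LRUTTLCache:
        """Bounded LRU cache whose entries also expire after `ttl` seconds."""

        def __init__(self, maxsize=100, ttl=30):
            self.maxsize = maxsize
            self.ttl = ttl
            self._data = OrderedDict()  # key -> (expiry timestamp, value)

        def get(self, key):
            entry = self._data.get(key)
            if not entry or entry[0] < time.time():
                self._data.pop(key, None)  # drop expired entries on lookup
                return None
            self._data.move_to_end(key)  # mark as most recently used
            return entry[1]

        def put(self, key, value):
            self._data[key] = (time.time() + self.ttl, value)
            self._data.move_to_end(key)
            while len(self._data) > self.maxsize:
                self._data.popitem(last=False)  # evict least recently used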
Comment 8 (Reporter) • 10 years ago
We decided today this didn't block using Balrog for release, unless the gradual turnup of Balrog revealed load issues on the apache or db hosts.
Comment 9 (Assignee) • 10 years ago
After doing a live test as part of bug 1075542, this has become necessary for shipping to the release channel. https://bugzilla.mozilla.org/show_bug.cgi?id=1075542#c1 has details.
Blocks: balrog-release
Comment 10 (Assignee) • 10 years ago
This needs more extensive manual testing (and maybe more unit tests), but it's gone through a couple of iterations and passes all tests. It ended up being more complicated than anticipated for a couple of reasons:
1) We need to support caching and not caching with the same object. This is because both apps share a single global namespace. Unsurprisingly, a non-caching cache library isn't a thing, so I had to write the MaybeCacher wrapper.
2) I couldn't figure out a way to magically deal with data_version, so it's being dealt with by hand in getReleaseBlob for now. At some point in the future I think it might be good to look at moving caching all the way down to AUSTable.select(), or maybe to have a better defined interface that expects data_version to be returned from any function that we could decorate.
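A simplified sketch of that MaybeCacher idea (reusing the LRUTTLCache sketch from comment 7; not the exact class in the patch):

    class MaybeCacher:
        """Caches only for namespaces that have been configured; every other
        lookup is a miss. This lets the admin app (which must never serve
        stale data) and the public app share one global cache object."""

        def __init__(self):
            self.caches = {}

        def make_cache(self, name, maxsize, ttl):
            self.caches[name] = LRUTTLCache(maxsize, ttl)

        def get(self, name, key):
            if name not in self.caches:
                return None  # caching disabled for this namespace
            return self.caches[name].get(key)

        def put(self, name, key, value):
            if name in self.caches:
                self.caches[name].put(key, value)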
Comment 11 (Reporter) • 10 years ago
Comment on attachment 8533957 [details] [diff] [review]
first stab at lru for blobs
Review of attachment 8533957 [details] [diff] [review]:
-----------------------------------------------------------------
A great start. Is repoze.lru abandoned now, but still the best fit for us?
::: auslib/db.py
@@ +875,5 @@
> + # We need to get the current data version of the blob no matter what,
> + # because we need to invalidate the cache if the cached version is
> + # of a different version.
> + try:
> + data_version = self.select(where=[self.name==name], columns=[self.data_version], limit=1, transaction=transaction)[0]
We could end up doing ~3500 selects/second here, in a world with geo-ip rules that mean we can't cache in Zeus (assumed 150M ADI, always running Firefox, checking twice a day, likely on the high side).
What if we build the cache like this:
* data_version is cached with a TTL of 60 seconds or so
* if we're over that: check data_version in the db
* if data_version has changed: invalidate blob cache
Moving the select inside the 'if cached:' block would be a start on that, but the cache timeout may be different.
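In code, this proposal amounts to something like the following sketch, where cache is the shared cache object and _fetchDataVersion/_fetchBlob are hypothetical helpers wrapping the selects:

    def getReleaseBlob(self, name, transaction=None):
        # data_version is cheap to fetch and cached with a short TTL, so
        # the db sees at most one version check per release per TTL window.
        data_version = cache.get("blob_version", name)
        if data_version is None:
            data_version = self._fetchDataVersion(name, transaction)
            cache.put("blob_version", name, data_version)

        cached = cache.get("blob", name)
        if cached and cached["data_version"] == data_version:
            return cached["blob"]

        # Cache miss, or the cached blob is stale: fetch and refresh.
        blob = self._fetchBlob(name, transaction)
        cache.put("blob", name, {"data_version": data_version, "blob": blob})
        return blob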
Attachment #8533957 - Flags: feedback?(nthomas) → feedback+
Comment 12 (Assignee) • 10 years ago
(In reply to Nick Thomas [:nthomas] from comment #11)
> Comment on attachment 8533957 [details] [diff] [review]
> first stab at lru for blobs
>
> Review of attachment 8533957 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> A great start. Is repoze.lru abandoned now, but still the best fit for us?
>
> ::: auslib/db.py
> @@ +875,5 @@
> > + # We need to get the current data version of the blob no matter what,
> > + # because we need to invalidate the cache if the cached version is
> > + # of a different version.
> > + try:
> > + data_version = self.select(where=[self.name==name], columns=[self.data_version], limit=1, transaction=transaction)[0]
>
> We could end up doing ~3500 selects/second here, in a world with geo-ip
> rules that mean we can't cache in Zeus (assumed 150M ADI, always running
> Firefox, checking twice a day, likely on the high side).
>
> What if we build the cache like this:
> * data_version is cached with a TTL of 60 seconds or so
> * if we're over that: check data_version in the db
> * if data_version has changed: invalidate blob cache
>
> Moving the select inside the 'if cached:' block would be a start on that, but
> the cache timeout may be different.
This sounds good to me. After fixing my original patch to use the blob cache in getReleases() the top hotspots were:
* evaluateRules
* getReleaseBlob
* getRulesMatchingQuery
(plus things underneath these, such as select() and things underneath _it_, such as convertRows).
I'll look at getting both of these cached a bit - I don't see any reason not to do that now.
Comment 13 (Assignee) • 10 years ago
(In reply to Nick Thomas [:nthomas] from comment #11)
> Comment on attachment 8533957 [details] [diff] [review]
> first stab at lru for blobs
>
> Review of attachment 8533957 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> A great start. Is repoze.lru abandoned now, but still the best fit for us?
>
> ::: auslib/db.py
> @@ +875,5 @@
> > + # We need to get the current data version of the blob no matter what,
> > + # because we need to invalidate the cache if the cached version is
> > + # of a different version.
> > + try:
> > + data_version = self.select(where=[self.name==name], columns=[self.data_version], limit=1, transaction=transaction)[0]
>
> We could end up doing ~3500 selects/second here, in a world with geo-ip
> rules that mean we can't cache in Zeus (assumed 150M ADI, always running
> Firefox, checking twice a day, likely on the high side).
>
> What if we build the cache like this:
> * data_version is cached with a TTL of 60 seconds or so
> * if we're over that: check data_version in the db
> * if data_version has changed: invalidate blob cache
Actually, one follow-up question... does this mean that we would never reload the blob unless data_version has changed? I recall talking about that, and I have it in my head that we weren't comfortable having no TTL on the blob itself (yet) - but I could be misremembering.
Comment 14 (Assignee) • 10 years ago
Blob caching is going to be the biggest win, and we decided to focus on that first. We'll want to do some caching for the rules table too, but that will likely happen in another bug.
Summary: Add caching to balrog → Add blob caching to balrog
Comment 15 (Assignee) • 10 years ago
Functionality-wise, I think this is ready to go. My limited tests with cProfile show very little database usage. I couldn't figure out a way to profile across a large set of requests, but other metrics showed great improvement with caching across 40,000 requests. Wall clock time without caching was 20min, and mysql constantly used 20% CPU. With caching, wall clock time went to 10min and mysql sat at 5% CPU (with rule caching this would probably sit pretty close to 0%). Wall clock time may be even lower in production; I suspect my local VM has some weird overhead and I/O inefficiencies that we wouldn't see there. The fact that I'm blasting it non-stop with requests (instead of spreading them over a longer time) probably reduces efficiency too.
I couldn't figure out a great way to handle the caching and non-caching cases, so I just added a big ol' toggle to MaybeCacher. It works, but it's a bit ugly. getReleaseBlob could probably be improved in some way too, but I haven't been able to figure out how. That might become a bit more obvious when we start caching rules -- maybe a common pattern will emerge.
One fun thing I found while profiling is that we call getReleaseBlob with name="*" for almost every request. That's very silly, so I got rid of it. I think the rest of the patch is pretty well documented, but let me know if anything needs more explanation.
I spoke with cturra and he's up for doing another release channel traffic test next week, so I'm planning to do that, probably the same day that this goes to production.
Attachment #8533957 - Attachment is obsolete: true
Attachment #8535679 - Flags: review?(nthomas)
Attachment #8535679 - Flags: review?(catlee)
Comment 16 • 10 years ago
Comment on attachment 8535679 [details] [diff] [review]
blob + blob version caching
Review of attachment 8535679 [details] [diff] [review]:
-----------------------------------------------------------------
::: admin.py
@@ +47,4 @@
> from auslib.admin.base import app
> from migrate.exceptions import DatabaseAlreadyControlledError
>
> + cache.enabled = False
this doesn't seem to do anything in MaybeCacher
::: auslib/blobs/apprelease.py
@@ +73,5 @@
> + # even attempt to look it up.
> + if patch["from"] != "*":
> + return dbo.releases.getReleaseBlob(name=patch["from"])
> + else:
> + return None
good catch!
::: auslib/db.py
@@ +901,5 @@
> + blob = createBlob(row['data'])
> + cache.put("blob", name, {"data_version": data_version, "blob": blob})
> + except IndexError:
> + raise KeyError("Couldn't find release with name '%s'" % name)
> +
we spoke in IRC about possibly using a pattern of
foo = cache.get(cache_name, key, calc_foo)
instead of
foo = cache.get(cache_name, key)
if not foo:
foo = calc_foo()
cache.put(cache_name, key, foo)
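For example, MaybeCacher.get could grow an optional value_getter (a sketch; value_getter is any zero-argument callable):

    def get(self, name, key, value_getter=None):
        """Return the cached value, computing and storing it on a miss."""
        if name not in self.caches:
            # Caching disabled: still honour the getter so callers work.
            return value_getter() if value_getter else None
        value = self.caches[name].get(key)
        if value is None and value_getter:
            value = value_getter()
            self.caches[name].put(key, value)
        return value

Callers then collapse to a single line, eg: blob_info = cache.get("blob", name, getBlob).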
::: auslib/test/test_db.py
@@ +937,5 @@
> + self.releases.getReleaseBlob(name="a")
> + t.return_value += 1
> +
> + self._checkCacheStats(cache.caches["blob"], 5, 4, 1)
> + self._checkCacheStats(cache.caches["blob_version"], 5, 3, 2)
some comments here as to why we expect these numbers of hits/misses would be great
(and below too)
::: auslib/util/cache.py
@@ +11,5 @@
> +
> + If maxsize is less than 1, this class' methods will always return nothing.
> + If maxsize is 1 or more, caching will be performed. It is up to callers to
> + cope with both cases. In a world where bug 1109295 is fixed, we may not
> + need this anymore."""
I think this docstring needs updating re: how to disable the cache.
Comment 17 (Assignee) • 10 years ago
I think this addresses all the review comments. Thanks for the help factoring the getters.
Attachment #8535679 - Attachment is obsolete: true
Attachment #8535679 - Flags: review?(nthomas)
Attachment #8535679 - Flags: review?(catlee)
Attachment #8535831 - Flags: review?(nthomas)
Attachment #8535831 - Flags: review?(catlee)
Comment 18 • 10 years ago
Comment on attachment 8535831 [details] [diff] [review]
fix up comments/docstrings, factor out getters in getReleaseBlob
Review of attachment 8535831 [details] [diff] [review]:
-----------------------------------------------------------------
::: auslib/web/base.py
@@ +48,5 @@
> +Hit Percentage: %s%%<br/>
> +Requests: %s<br/>
> +Hits: %s<br/>
> +Misses: %s<br/>
> +""" % (float(cache.caches["blob"].hits) / float(cache.caches["blob"].lookups) *100, cache.caches["blob"].lookups, cache.caches["blob"].hits, cache.caches["blob"].misses)
oooh, I missed this before!
extra bonus points if you add json formatted output too.
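Hypothetically, a JSON variant could look like this (assumes Flask's jsonify and the same cache attributes the HTML version above uses):

    from flask import jsonify

    @app.route("/cache_stats.json")
    def cache_stats_json():
        blob = cache.caches["blob"]
        hit_pct = 100.0 * blob.hits / blob.lookups if blob.lookups else 0.0
        return jsonify({
            "lookups": blob.lookups,
            "hits": blob.hits,
            "misses": blob.misses,
            "hit_percentage": hit_pct,
        })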
Attachment #8535831 - Flags: review?(catlee) → review+
Comment 19 (Assignee) • 10 years ago
I filed bug 1111032 for rule caching.
Comment 20 (Assignee) • 10 years ago
(In reply to Chris AtLee [:catlee] from comment #18)
> Comment on attachment 8535831 [details] [diff] [review]
> fix up comments/docstrings, factor out getters in getReleaseBlob
>
> Review of attachment 8535831 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> ::: auslib/web/base.py
> @@ +48,5 @@
> > +Hit Percentage: %s%%<br/>
> > +Requests: %s<br/>
> > +Hits: %s<br/>
> > +Misses: %s<br/>
> > +""" % (float(cache.caches["blob"].hits) / float(cache.caches["blob"].lookups) *100, cache.caches["blob"].lookups, cache.caches["blob"].hits, cache.caches["blob"].misses)
>
> oooh, I missed this before!
>
> extra bonus points if you add json formatted output too.
Oops, I actually meant to remove that and file a bug on getting cache metrics sent somewhere. This was my very hacky way to know if the cache was working. It's not great for production because you get different stats per wsgi process, and they reset after we hit max requests (5000 per process in production IIRC).
Comment 21 (Reporter) • 10 years ago
Comment on attachment 8535831 [details] [diff] [review]
fix up comments/docstrings, factor out getters in getReleaseBlob
Review of attachment 8535831 [details] [diff] [review]:
-----------------------------------------------------------------
Great stuff!
::: auslib/blobs/apprelease.py
@@ +73,5 @@
> + # even attempt to look it up.
> + if patch["from"] != "*":
> + return dbo.releases.getReleaseBlob(name=patch["from"])
> + else:
> + return None
A very nice catch. Let's watch out for new exceptions in Sentry, in case the previous try/except was catching cases we don't handle now.
::: auslib/db.py
@@ +903,5 @@
> + # of the cached blob and the latest data version don't match, we need
> + # to update the cache with the latest blob.
> + if cached_blob["data_version"] != data_version:
> + blob_info = getBlob()
> + cache.put("blob", name, blob_info)
Don't you need to update the blob_version cache too ? Otherwise they're out of sync for a while, and the blob cache will keep being updated. You could test for this by checking how many times getBlob() is called.
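That is, keep the two caches in step when refreshing (sketch):

    if cached_blob["data_version"] != data_version:
        blob_info = getBlob()
        cache.put("blob", name, blob_info)
        # Refresh the version cache too; a stale blob_version entry would
        # force a getBlob() on every request until its TTL expires.
        cache.put("blob_version", name, blob_info["data_version"])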
::: auslib/test/test_db.py
@@ +937,5 @@
> + self.releases.getReleaseBlob(name="a")
> + t.return_value += 1
> +
> + # We've retrieved the blob and blob version 5 times.
> + # The blob cache has a ttl of 5, so we're expecting the first one
Nit, s/5/10/
::: auslib/util/cache.py
@@ +23,5 @@
> + def reset(self):
> + self.caches.clear()
> +
> + def get(self, name, key, value_getter=None):
> + """Returns the value of the specified key from the named cached.
Nit s/cached/cache/
@@ +55,5 @@
> + if not name:
> + for c in self.caches.values():
> + c.clear()
> +
> + self.caches[name].clear()
Maybe an else ahead of the last line.
::: balrog-server.py
@@ +64,5 @@
> pass
>
> + if options.profile_dir:
> + from werkzeug.contrib.profiler import ProfilerMiddleware
> + if not path.exists(options.profile_dir):
Great!
::: balrog.ini-dist
@@ +17,5 @@
> ; domains that are allowed in update data, comma separated
> domain_whitelist=download.mozilla.org,stage.mozilla.org,ftp.mozilla.org
> +
> +[caches]
> +; Caches to set-up. Keys are cache names, values are size and ttl
Nit, values are number of cache entries and ttl (seconds)
::: puppet/templates/balrog.ini.erb
@@ +9,5 @@
> ; domains that are allowed in update data, comma separated
> domain_whitelist=download.mozilla.org,stage.mozilla.org,ftp.mozilla.org
> +
> +[caches]
> +; Caches to set-up. Keys are cache names, values are size and ttl
See other nit.
Attachment #8535831 - Flags: review?(nthomas) → review+
Comment 22 (Assignee) • 10 years ago
(In reply to Nick Thomas [:nthomas] from comment #21)
> ::: auslib/db.py
> @@ +903,5 @@
> > + # of the cached blob and the latest data version don't match, we need
> > + # to update the cache with the latest blob.
> > + if cached_blob["data_version"] != data_version:
> > + blob_info = getBlob()
> > + cache.put("blob", name, blob_info)
>
> Don't you need to update the blob_version cache too ? Otherwise they're out
> of sync for a while, and the blob cache will keep being updated. You could
> test for this by checking how many times getBlob() is called.
Oops, good catch! I'll fix this and get some test coverage for it.
Comment 23 (Assignee) • 10 years ago
(In reply to Ben Hearsum [:bhearsum] from comment #22)
> (In reply to Nick Thomas [:nthomas] from comment #21)
> > ::: auslib/db.py
> > @@ +903,5 @@
> > > + # of the cached blob and the latest data version don't match, we need
> > > + # to update the cache with the latest blob.
> > > + if cached_blob["data_version"] != data_version:
> > > + blob_info = getBlob()
> > > + cache.put("blob", name, blob_info)
> >
> > Don't you need to update the blob_version cache too ? Otherwise they're out
> > of sync for a while, and the blob cache will keep being updated. You could
> > test for this by checking how many times getBlob() is called.
>
> Oops, good catch! I'll fix this and get some test coverage for it.
Hmm, diving into this a bit more, I'm not sure I see how this could happen. "data_version" will always be newer than the one in cached_blob because the blob_version TTL is lower than the blob's. It could be a problem if the TTLs of blob and blob_version get out of sync (eg, blob_version refreshes right before blob expires), but I don't think that can happen -- any time blob_version gets invalidated we end up updating the blob cache.
I suppose there might be a very small window for this to happen if the blob TTL expires after we retrieve data_version (I don't think it could happen before that, otherwise data_version would've expired already) - I'm not sure if this is what you were getting at, though. Nevertheless, it seems like a good idea to check whether cached_blob["data_version"] or data_version is newer, and update both caches appropriately.
Comment 24 (Assignee) • 10 years ago
I pushed https://github.com/bhearsum/balrog/commit/96c3c9d7b9d1d1c6e935a103df2d3b9b87b8b884 which I _think_ gets rid of any possibility of extra blob retrievals through more specific checking of the data versions. Nick, does this address what you were talking about? I haven't found a way to test this case, because inner methods like getBlob are unmockable.
Comment 25 (Reporter) • 10 years ago
I think I just got myself confused. If the blob changes we won't notice until the blob_version TTL expires, at which point both caches are updated very close together, with just the time to do getBlob() between them. Let's not worry about the blob changing again in that small gap. If we ever see load spikes that last as long as the blob_version TTL, we can revisit.
Comment 26 • 10 years ago
Commit pushed to master at https://github.com/mozilla/balrog
https://github.com/mozilla/balrog/commit/d51724def8e71ec9b9c9ec97efe1d89a011bcf08
bug 671488: Add blob caching to balrog. r=nthomas,catlee
Comment 27 (Assignee) • 10 years ago
It looks like I'm OK to carry forward the r+ here -- the only code changes in this patch are the fix to MaybeCacher.clear() and the better handling of data_version differences. I've gone ahead and landed this so I can do some testing on dev.
Before landing I ran the same set of requests that I was testing with locally, which took 15min14s to complete. Once dev picks up this change I'll run it again and see how it performs.
Attachment #8535831 - Attachment is obsolete: true
Attachment #8537212 - Flags: checked-in+
Comment 28 (Assignee) • 10 years ago
Oops, I forgot to remove the /cache_stats endpoint in my original push. I just pushed that separately.
Comment 29 (Assignee) • 10 years ago
This popped up when I tried to test in dev:
KeyError: "Couldn't find release with name 'Firefox-34.0b9-build1'"
Stacktrace (most recent call last):
File "flask/app.py", line 1475, in full_dispatch_request
rv = self.dispatch_request()
File "flask/app.py", line 1461, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "flask/views.py", line 84, in view
return self.dispatch_request(*args, **kwargs)
File "flask/views.py", line 149, in dispatch_request
return meth(*args, **kwargs)
File "auslib/web/views/client.py", line 38, in get
xml = release.createXML(query, update_type, app.config["WHITELISTED_DOMAINS"], app.config["SPECIAL_FORCE_HOSTS"])
File "auslib/blobs/apprelease.py", line 131, in createXML
patches = self._getPatchesXML(localeData, updateQuery, whitelistedDomains, specialForceHosts)
File "auslib/blobs/apprelease.py", line 528, in _getPatchesXML
xml = self._getSpecificPatchXML(patchKey, patchType, patch, updateQuery, whitelistedDomains, specialForceHosts)
File "auslib/blobs/apprelease.py", line 80, in _getSpecificPatchXML
fromRelease = self._getFromRelease(patch)
File "auslib/blobs/apprelease.py", line 75, in _getFromRelease
return dbo.releases.getReleaseBlob(name=patch["from"])
File "auslib/db.py", line 889, in getReleaseBlob
data_version = cache.get("blob_version", name, getDataVersion)
File "auslib/util/cache.py", line 40, in get
value = value_getter()
File "auslib/db.py", line 887, in getDataVersion
raise KeyError("Couldn't find release with name '%s'" % name)
Which is probably because dev doesn't have a good data set (I uploaded the Firefox-35.0b3-build1 blob and some other ones, but not all of the blobs for the partials they have). Nonetheless, the code should probably handle this better.
There's also some other tracebacks that may be rooted in the same issue. I'll look into them as well.
Comment 30 (Assignee) • 10 years ago
Looks like this is partly (and maybe fully) a silly mistake in apprelease.py. When I factored out the fromRelease logic I didn't include the KeyError that it catches: https://github.com/mozilla/balrog/commit/d51724def8e71ec9b9c9ec97efe1d89a011bcf08#diff-1a7ea9e25a2f1a8c15c06fd95243142aL290
I pushed https://github.com/mozilla/balrog/commit/e34f70bb3c16a3b1c94033bd55f74327fb157078 as a bustage fix for this.
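The restored handling amounts to something like this sketch (not the literal diff; dbo is Balrog's global db object):

    def _getFromRelease(self, patch):
        # "*" means any previous release; there's no specific blob to fetch.
        if patch["from"] == "*":
            return None
        try:
            return dbo.releases.getReleaseBlob(name=patch["from"])
        except KeyError:
            # The "from" release may not exist in this database (eg, on
            # dev); treat that the same as having no matching release.
            return None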
Comment 31 (Assignee) • 10 years ago
Another one came up, but it's a bit less serious. It happens when we have a rule that points at a non-existent release. I think this is happening a lot on dev because the database clean-up script is deleting old releases without checking whether a rule points at them...
IndexError: list index out of range
Stacktrace (most recent call last):
File "flask/app.py", line 1475, in full_dispatch_request
rv = self.dispatch_request()
File "flask/app.py", line 1461, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "flask/views.py", line 84, in view
return self.dispatch_request(*args, **kwargs)
File "flask/views.py", line 149, in dispatch_request
return meth(*args, **kwargs)
File "auslib/web/views/client.py", line 35, in get
release, update_type = AUS.evaluateRules(query)
File "auslib/AUS.py", line 84, in evaluateRules
release = dbo.releases.getReleases(name=rule['mapping'], limit=1)[0]
Comment 32 (Assignee) • 10 years ago
Also found:
ValueError: Couldn't find fileUrl
Stacktrace (most recent call last):
File "flask/app.py", line 1475, in full_dispatch_request
rv = self.dispatch_request()
File "flask/app.py", line 1461, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "flask/views.py", line 84, in view
return self.dispatch_request(*args, **kwargs)
File "flask/views.py", line 149, in dispatch_request
return meth(*args, **kwargs)
File "auslib/web/views/client.py", line 38, in get
xml = release.createXML(query, update_type, app.config["WHITELISTED_DOMAINS"], app.config["SPECIAL_FORCE_HOSTS"])
File "auslib/blobs/apprelease.py", line 135, in createXML
patches = self._getPatchesXML(localeData, updateQuery, whitelistedDomains, specialForceHosts)
File "auslib/blobs/apprelease.py", line 532, in _getPatchesXML
xml = self._getSpecificPatchXML(patchKey, patchType, patch, updateQuery, whitelistedDomains, specialForceHosts)
File "auslib/blobs/apprelease.py", line 88, in _getSpecificPatchXML
url = self._getUrl(updateQuery, patchKey, patch, specialForceHosts)
File "auslib/blobs/apprelease.py", line 677, in _getUrl
raise ValueError("Couldn't find fileUrl")
Comment 33 (Assignee) • 10 years ago
My cached run still took 14min. I'm not really sure what to make of this. It's a bit faster than the non-caching run, but not as fast as I expected. The fact that I'm replaying requests with 0 delay between them may be a big factor here - we could be tickling a different bottleneck because of that.
The other curious thing is that I accidentally started a second non-caching test earlier (I had thought that the patch had landed already, but it hadn't), and it took significantly longer than 15min and never completed.
Given the above, I'm inclined to go ahead with this in production and try a _real_ test - I'm really not confident in being able to replicate actual production load. I still need to set up the latest release data for that, and I'm going to try to fix one or both of the IndexError and ValueError from the previous comments - I don't think either of them is a blocker though; they're both related to bad data.
Comment 34 (Assignee) • 10 years ago
Here's a follow-up to fix the other two exceptions. The IndexError is kind of a silly one - we're indexing an array that could potentially be empty. The ValueError is one that the blob raises itself - and I think it's better classified as a BadDataError. The times we hit it right now are because the Thunderbird-34.0b1-build1 blob is just not useful (https://aus4-admin.mozilla.org/releases/Thunderbird-34.0b1-build1/data).
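Sketched out, the two fixes look roughly like these fragments (simplified; BadDataError is the existing exception class mentioned above):

    # IndexError fix: a rule's mapping can point at a deleted release,
    # so guard the indexing instead of assuming a row exists.
    releases = dbo.releases.getReleases(name=rule['mapping'], limit=1)
    if not releases:
        return None, None
    release = releases[0]

    # ValueError -> BadDataError: a blob with no usable fileUrls is a
    # data problem rather than a code bug.
    raise BadDataError("Couldn't find fileUrl")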
Attachment #8537435 - Flags: review?(nthomas)
Comment 35 (Reporter) • 10 years ago
(In reply to Ben Hearsum [:bhearsum] from comment #27)
> Created attachment 8537212 [details] [diff] [review]
> fix review nits & better handling of data_version inequality
Looks like all of 96c3c9d7b9d1d1c6e935a103df2d3b9b87b8b884 is included, did you mean to revert some of it ? eg the commented out test, the cache.put on blob_version we decided we didn't need
Comment 36 (Reporter) • 10 years ago
Comment on attachment 8537435 [details] [diff] [review]
fix IndexError and ValueError
Review of attachment 8537435 [details] [diff] [review]:
-----------------------------------------------------------------
The patch looks fine, but just a couple of queries ...
(In reply to Ben Hearsum [:bhearsum] from comment #34)
> The ValueError is one that the blob raises itself - and I think it's better
> classified as a BadDataError. The times we hit it right now are because the
> Thunderbird-34.0b1-build1 blob is just not useful
> (https://aus4-admin.mozilla.org/releases/Thunderbird-34.0b1-build1/data).
Did you mean to point to dev here ? We're using Thunderbird-34.0b1-build3 in production, and that seems fine.
More importantly, BadDataError has been for odd queries up to now, and we'll lose the sentry reports by using it here. So we rely on QE to catch 'no update offered' errors ?
Attachment #8537435 - Flags: review?(nthomas) → review+
Comment 37 (Assignee) • 10 years ago
(In reply to Nick Thomas [:nthomas] from comment #35)
> (In reply to Ben Hearsum [:bhearsum] from comment #27)
> > Created attachment 8537212 [details] [diff] [review]
> > fix review nits & better handling of data_version inequality
>
> Looks like all of 96c3c9d7b9d1d1c6e935a103df2d3b9b87b8b884 is included, did
> you mean to revert some of it ? eg the commented out test, the cache.put on
> blob_version we decided we didn't need
I wasn't sure whether or not to remove the test - it's a framework for one that would probably be useful, if I could figure out a way to verify it. I'm happy to remove it if you think it's just noise.
I left the cache.put to handle the off chance that the blob data version is higher than the blob_version one (I think there's still a remote possibility of this). I see that it's broken anyways though, as it's re-caching the already-cached blob version. I can fix this up or just plain remove it.
Comment 38 (Assignee) • 10 years ago
(In reply to Nick Thomas [:nthomas] from comment #36)
> Comment on attachment 8537435 [details] [diff] [review]
> fix IndexError and ValueError
>
> Review of attachment 8537435 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> The patch looks fine, but just a couple of queries ...
>
> (In reply to Ben Hearsum [:bhearsum] from comment #34)
> > The ValueError is one that the blob raises itself - and I think it's better
> > classified as a BadDataError. The times we hit it right now are because the
> > Thunderbird-34.0b1-build1 blob is just not useful
> > (https://aus4-admin.mozilla.org/releases/Thunderbird-34.0b1-build1/data).
>
> Did you mean to point to dev here ? We're using Thunderbird-34.0b1-build3 in
> production, and that seems fine.
I did mean to point to this one (dev is just a copy of it). This blob is actually not OK as far as I can tell -- it's missing all of the top-level stuff like fileUrls, so you can't construct a useful response from it.
> More importantly, BadDataError has been for odd queries up to now, and we'll
> lose the sentry reports by using it here. So we rely on QE to catch 'no
> update offered' errors ?
Good point, let's not overload this. Do you think it's useful for exceptions caused by bad blobs to bubble up to Sentry? If so, it could be argued that this patch isn't wanted at all (or that we should raise an exception if mapping doesn't resolve to a release, at least).
Comment 39 (Assignee) • 10 years ago
When this went live in production around 7:30am pacific this morning we saw a huge dip in network traffic between the database and webheads. They were all pulling ~30MB/sec before the patch went live, and are now only pulling ~1MB/sec. Average CPU utilization was around 15-20% (varies by web head) and dropped to 10-15% -- so about 5% less on each webhead. That's less than I expected but it's something.
We'll proceed with the release channel test in bug 1112662 in a bit and go from there.
Comment 40 (Assignee) • 10 years ago
So the main patch here is in production now. There's still the IndexError and ValueError to fix, so I'll leave this bug open until we deal with them.
Nick, any thoughts on:
> > More importantly, BadDataError has been for odd queries up to now, and we'll
> > lose the sentry reports by using it here. So we rely on QE to catch 'no
> > update offered' errors ?
>
> Good point, let's not overload this. Do you think it's useful for exceptions
> caused by bad blobs to bubble up to Sentry? If so, it could be argued that
> this patch isn't wanted at all (or that we should raise an exception if
> mapping doesn't resolve to a release, at least).
Flags: needinfo?(nthomas)
Comment 41 (Reporter) • 10 years ago
I don't have a good answer here. We need some way to know if we screwed up, and errormill may be the best we have right now.
Flags: needinfo?(nthomas)
Comment 42 (Assignee) • 10 years ago
(In reply to Nick Thomas [:nthomas] (Away until Jan 12th) from comment #41)
> I don't have a good answer here. We need some way to know if we screwed up,
> and errormill may be the best we have right now.
OK, I'm not going to do anything about these for now then. We can re-evaluate this later.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated • 5 years ago
Product: Release Engineering → Release Engineering Graveyard