Closed Bug 1384816 Opened 7 years ago Closed 7 years ago

Docker 'instance is unhealthy' errors in local dev due to slow prod db download

Categories

(Release Engineering Graveyard :: Applications: Balrog (backend), defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: bhearsum)

References

Details

Attachments

(3 files)

The prod db dump is over 130M now and is stored uncompressed at rest in S3 (each of which might be worth a bug of its own), so it can sometimes take quite a long time to download and import - specifically, longer than the healthcheck for balrogadmin, which is 10 retries 5s apart (see docker-compose.yml). When that happens docker prints a couple of 'instance <id> is unhealthy' messages for balrogagent and balrogui.

It turns out you can just leave 'docker-compose up' running until scripts/prod_db_dump.sql is non-zero in size, then ^C docker (and maybe do a 'down'), and then do 'up' again. The download is cached, so the setup works the second time. Simply bumping the number of retries for balrogadmin to 100 WFM, but obviously the right number depends on how long your local piece of string is (i.e. your connection speed). This could be a barrier to contributors.
It's probably a good idea to bump the timeouts immediately, and I think it would be good to start producing a compressed dump instead of plain text. That takes it down to 32M for me with --best. This would fix things in the short/medium term.

I'm not quite sure what to do for the long term. We never had a great experience trying to maintain a checked-in sample database, so I'm not sure it's a good idea to go back to that. Other random ideas:
- Somehow cap the size of the production dump. We already cut out a lot of history, but maybe we should be more aggressive about that.
- Check in a current dump, always import that to start with, and then upgrade the DB to latest. This would increase the size of the balrog repo clone, but you wouldn't be grabbing a huge file when starting the containers. The downside is that you wouldn't have current data by default.
Attached file bump retries (deleted) —
Quick and dirty "fix"
Attachment #8890896 - Flags: review?(nthomas)
Attachment #8890896 - Flags: review?(nthomas) → review+
Priority: -- → P3
(In reply to Ben Hearsum (:bhearsum) from comment #1)
> It's probably a good idea to bump the timeouts immediately, and I think it
> would be good to start producing a compressed dump instead of plain text.
> That takes it down to 32M for me with --best. This would fix things in the
> short/medium term.

Agree that compression is a good idea. xz seems to do even better, only 17M at the default setting of -6 (no improvement at -9).

> I'm not quite sure what to do for the long term. We never had a great
> experience trying to maintain a checked-in sample database, so I'm not sure
> it's a good idea to go back to that. Other random ideas:
> - Somehow cap the size of the production dump. We already cut out a lot of
>   history, but maybe we should be more aggressive about that.

The majority of the space seems to be the 260 or so releases, which is larger than I expected because we need all the releases users might get a partial from. I think we only need enough to look up the buildID in that case, and could strip out partials and completes at the locale level, but I'm not 100% convinced that's a maintainable solution.

> - Check in a current dump, always import that to start with, and then
>   upgrade the DB to latest. This would increase the size of the balrog repo
>   clone, but you wouldn't be grabbing a huge file when starting the
>   containers. The downside is that you wouldn't have current data by default.

Given the size wins from compression we could avoid doing this. However, it does suggest a way to handle the issue with the master code not matching the prod dump: look up the migrate_version in the prod dump, create a fresh db at that version, and then upgrade to the latest version afterwards.
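For illustration only, here is a rough sketch (in Python, like the repo's other helper scripts) of the "look up migrate_version in the dump, build the db at that version, then upgrade" idea. The INSERT format shown in the comments and the helper name are assumptions for the sake of the example, not taken from the real dump or scripts.

```python
import re

# Assumed dump line, for illustration:
#   INSERT INTO `migrate_version` VALUES ('Balrog','...',57);
MIGRATE_VERSION_RE = re.compile(
    r"INSERT INTO `?migrate_version`?\s+VALUES\s*\([^)]*?,\s*(\d+)\s*\)", re.I)


def dump_schema_version(dump_path):
    """Return the schema version recorded in a SQL dump, or None if absent."""
    with open(dump_path, "r", errors="replace") as dump:
        for line in dump:
            match = MIGRATE_VERSION_RE.search(line)
            if match:
                return int(match.group(1))
    return None

# With the version in hand, the dev setup could create a fresh db at that
# version, import the dump data, and then run the normal upgrade to the
# latest schema (the exact manage-db.py arguments are omitted here).
```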
Ah, I see you've covered the last suggestion in more detail in bug 1376331.
(In reply to Nick Thomas [:nthomas] from comment #4)
> (In reply to Ben Hearsum (:bhearsum) from comment #1)
> > It's probably a good idea to bump the timeouts immediately, and I think it
> > would be good to start producing a compressed dump instead of plain text.
> > That takes it down to 32M for me with --best. This would fix things in the
> > short/medium term.
>
> Agree that compression is a good idea. xz seems to do even better, only 17M
> at the default setting of -6 (no improvement at -9).

Interesting! The latest db dump for me ended up at 22M with xz. zstd is another option, which compresses just as well as xz but is faster.

Whatever we choose, I think it's best to do the compression either as part of https://github.com/mozilla/balrog/blob/8da1568e25bd5050da29a717c62a2a6defc5ff4f/scripts/manage-db.py#L96 or https://github.com/mozilla/balrog/blob/8da1568e25bd5050da29a717c62a2a6defc5ff4f/scripts/run.sh#L54. Once that's done, we should ask CloudOps to update the environment variable they set to ensure the file is named correctly on S3, and update https://github.com/mozilla/balrog/blob/8da1568e25bd5050da29a717c62a2a6defc5ff4f/scripts/import-db.py#L13 to pull it.
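As a minimal sketch of what the compression step could look like if it lives in the Python dump script: the file names below are placeholders, and preset 6 simply mirrors the "xz -6" sizes discussed above.

```python
import lzma
import shutil


def compress_dump(plain_path="prod_db_dump.sql",
                  compressed_path="prod_db_dump.sql.xz"):
    """Write an xz-compressed copy of the dump next to the plain one."""
    # The standard-library lzma module produces output compatible with the
    # xz command-line tool; preset=6 is xz's default compression level.
    with open(plain_path, "rb") as src, \
            lzma.open(compressed_path, "wb", preset=6) as dst:
        shutil.copyfileobj(src, dst)
    return compressed_path
```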
Depends on: 1390197
Bumping this again - compression of the prod db dump would be a really nice win for everyone who contributes. Co-ordinating the code changes and the prod deployment seems non-trivial, but perhaps we can add some 'try for compressed, fall back to not' logic around the place.
(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #7)
> Bumping this again - compression of the prod db dump would be a really nice
> win for everyone who contributes. Co-ordinating the code changes and the
> prod deployment seems non-trivial, but perhaps we can add some 'try for
> compressed, fall back to not' logic around the place.

Yeah...we should do this. Maybe we can start publishing a compressed version to a new location (instead of publishing the uncompressed one), and then update the init script to use it?
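A minimal sketch of that fallback on the import side, assuming plain Python; the dump URL and local file name are placeholders, not the real S3 location used by import-db.py.

```python
import lzma
import urllib.error
import urllib.request

DUMP_URL = "https://example.com/balrog/dump.sql"  # placeholder, not the real bucket


def fetch_dump(dest="prod_db_dump.sql"):
    """Prefer the xz-compressed dump, fall back to the uncompressed one."""
    try:
        with urllib.request.urlopen(DUMP_URL + ".xz") as resp:
            data = lzma.decompress(resp.read())
    except urllib.error.HTTPError:
        # Compressed dump not published (yet); grab the plain one instead.
        with urllib.request.urlopen(DUMP_URL) as resp:
            data = resp.read()
    with open(dest, "wb") as out:
        out.write(data)
    return dest
```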
Priority: P3 → P1
Sounds good to me.
Attached file use compressed dumps (deleted) —
Assignee: nobody → bhearsum
Status: NEW → ASSIGNED
Attachment #8933356 - Flags: review?(nthomas)
Attachment #8933356 - Flags: review?(nthomas) → review+
Commit pushed to master at https://github.com/mozilla/balrog

https://github.com/mozilla/balrog/commit/c27c5829ac4737f2784f3933d5504db52a512323
bug 1384816: Compress production dumps with xz (#452). r=nthomas,relud
(In reply to [github robot] from comment #12)
> Commit pushed to master at https://github.com/mozilla/balrog
>
> https://github.com/mozilla/balrog/commit/c27c5829ac4737f2784f3933d5504db52a512323
> bug 1384816: Compress production dumps with xz (#452). r=nthomas,relud

This is expected to hit prod next Wednesday, which means we should be able to start using the new, compressed dumps on Thursday, December 7th.
Depends on: 1429414
This is in prod, but we just missed yesterday's cronjob. We should see a new, compressed dump around 2300 UTC today.
We've got xz dumps now. I pushed https://github.com/mozilla/balrog/commit/24aeb30119142beff820725582ed83e30a3798a5 to get us using them. I'm also going to file a bug to get the dumps renamed - it's confusing to have a .sql file that isn't plain text.
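For what it's worth, until the rename lands, an importer could sniff the xz magic bytes rather than trusting the .sql extension. This is a hypothetical illustration, not the code from the commit above.

```python
import lzma

XZ_MAGIC = b"\xfd7zXZ\x00"  # the 6-byte xz stream header


def read_dump(path):
    """Return the dump as text, decompressing it if it's actually xz data."""
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(XZ_MAGIC):
        data = lzma.decompress(data)
    return data.decode("utf-8", errors="replace")
```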
Depends on: 1430140
Depends on: 1434280
With bug 1430140 now in production I updated the import script to point at the new location: https://github.com/mozilla/balrog/commit/f2cae685ad5dc92d488147eb652ee015c293d3d5
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Working nicely here, thanks Ben and all.
Product: Release Engineering → Release Engineering Graveyard