Closed
Bug 1384816
Opened 7 years ago
Closed 7 years ago
Docker 'Instance is unhealthy' errors in local dev due to slow prod db download
Categories
(Release Engineering Graveyard :: Applications: Balrog (backend), defect, P1)
Release Engineering Graveyard
Applications: Balrog (backend)
Tracking
(Not tracked)
RESOLVED FIXED
People
(Reporter: nthomas, Assigned: bhearsum)
References
Details
Attachments
(3 files)
The prod db dump is over 130M now, and uncompressed at rest in S3 (which might be a couple of bugs), and sometimes that means it can take quite a long time to download and import. Specifically, longer than the healthcheck for balrogadmin, which is 10 retries 5s apart (see docker-compose.yml). Then you get a couple of messages from docker that 'instance <id> is unhealthy' related to balrogagent and balrogui. It turns out you can just leave 'docker-compose up' running until scripts/prod_db_dump.sql is non-zero in size, then ^C docker, maybe do a down, and then do up again. The download is cached, so the setup works the second time.
Simply making the number of retries for balrogadmin 100 works for me, but obviously it depends on the capacity of your local piece of string. This could be a barrier to contributors.
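For anyone hitting this before a proper fix lands, here is a minimal sketch of the manual workaround as a script. It is not part of the repo; the path, timeout, and poll interval are just illustrative.

```python
#!/usr/bin/env python3
# Hypothetical helper, not part of the repo: wait until the prod db dump has
# actually landed before running 'docker-compose up' again, mirroring the
# manual "leave it running, then ^C and up again" workaround above.
import os
import sys
import time

DUMP_PATH = "scripts/prod_db_dump.sql"  # where the local dev setup puts the dump
TIMEOUT = 15 * 60                       # give the download up to 15 minutes
POLL_INTERVAL = 5                       # seconds between checks

deadline = time.time() + TIMEOUT
while time.time() < deadline:
    # Non-zero size is the same signal the manual workaround waits for.
    if os.path.exists(DUMP_PATH) and os.path.getsize(DUMP_PATH) > 0:
        print("dump is present, safe to 'docker-compose up' again")
        sys.exit(0)
    time.sleep(POLL_INTERVAL)

print("timed out waiting for %s" % DUMP_PATH, file=sys.stderr)
sys.exit(1)
```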
Assignee
Comment 1•7 years ago
It's probably a good idea to bump the timeouts immediately, and I think it would be good to start producing a compressed dump instead of plain text. That takes it down to 32M for me with --best. This would fix things in the short/medium term.
I'm not quite sure what to do for the long term. We never had a great experience trying to maintain a checked-in sample database, so I'm not sure it's a good idea to go back to that. Other random ideas:
- Somehow cap the size of the production dump. We already cut out a lot of history, but maybe we should be more aggressive about that.
- Check in a current dump, always import that to start with, and then upgrade the DB to latest. This would increase the size of the balrog repo clone, but you wouldn't be grabbing a huge file when starting the containers. The downside is that you wouldn't have current data by default.
Assignee
Comment 2•7 years ago
Quick and dirty "fix"
Attachment #8890896 - Flags: review?(nthomas)
Reporter
Updated•7 years ago
Attachment #8890896 - Flags: review?(nthomas) → review+
Comment 3•7 years ago
Commit pushed to master at https://github.com/mozilla/balrog
https://github.com/mozilla/balrog/commit/982e3deb387f12dbe14dc1ad13ca174ec61c7b36
bug 1384816: Bump retries for balrogadmin (#359). r=nthomas
Assignee
Updated•7 years ago
Priority: -- → P3
Reporter
Comment 4•7 years ago
(In reply to Ben Hearsum (:bhearsum) from comment #1)
> It's probably a good idea to bump the timeouts immediately, and I think it
> would be good to start producing a compressed dump instead of plain text.
> That takes it down to 32M for me with --best. This would fix things in the
> short/medium term.
Agree that compression is a good idea. xz seems to do even better, only 17M at default setting of -6 (no improvement at -9).
> I'm not quite sure what to do for the long term. We never had a great
> experience trying to maintained a checked in sample database, so I'm not
> sure it's a good idea to go back to that. Other random ideas:
> - Somehow cap the size of the production dump. We already cut out a lot of
> history, but maybe we should be more aggressive about that.
The majority of the space seems to be the 260 or so releases, which is larger than I expected because we need all the releases users might get a partial from. I think we only need enough to look up the buildID in that case, and could strip out partials and completes at the locale level, but I'm not 100% convinced that's a maintainable solution.
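To make that concrete, here is an illustrative sketch of what stripping at the locale level could look like, assuming the usual release-blob layout of platforms -> locales -> {buildID, partials, completes}; the filename and helper name are hypothetical.

```python
# Illustrative only: shrink a release blob by dropping locale-level partials
# and completes while keeping buildID, which is what partial matching needs.
# Assumes the common blob layout of platforms -> locales -> {buildID, ...}.
import json

def strip_release_blob(blob):
    for platform in blob.get("platforms", {}).values():
        for locale in platform.get("locales", {}).values():
            locale.pop("partials", None)
            locale.pop("completes", None)
            # buildID and everything else is left untouched.
    return blob

with open("Firefox-55.0-build3.json") as f:  # hypothetical dump of one release row
    slimmed = strip_release_blob(json.load(f))
print(len(json.dumps(slimmed)))
```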
> - Check in a current dump, always import that to start with, and then
> upgrade the DB to latest. This would increase the size of the balrog repo
> clone, but you wouldn't be grabbing a huge file when starting the
> containers. The downside is that you wouldn't have current data by default.
Given the size wins from compression we could avoid doing this. However, it does suggest a way to handle the issue of the master code not matching the prod dump: we could look up the migrate_version in the prod dump, create a fresh db at that version, and then upgrade to the latest version afterwards.
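Very roughly, and with the manage-db.py invocations being assumptions rather than the real CLI, that idea could look something like this (the actual import of the dump is omitted from the sketch):

```python
# Rough illustration of the idea; the manage-db.py arguments are assumptions,
# not the real CLI, and the DB URI is only an example.
import re
import subprocess

DB_URI = "mysql://balrogadmin:balrogadmin@balrogdb/balrog"  # example local URI

def dump_schema_version(dump_path):
    """Pull the schema version out of the migrate_version row in a SQL dump."""
    with open(dump_path) as f:
        for line in f:
            m = re.search(r"INSERT INTO `?migrate_version`?.*,\s*(\d+)\)", line)
            if m:
                return int(m.group(1))
    raise ValueError("no migrate_version row found in %s" % dump_path)

version = dump_schema_version("scripts/prod_db_dump.sql")
# Create a fresh db at the dump's schema version, then walk it forward to
# whatever master expects (both commands are hypothetical).
subprocess.check_call(["python", "scripts/manage-db.py", "-d", DB_URI, "create", "--version", str(version)])
subprocess.check_call(["python", "scripts/manage-db.py", "-d", DB_URI, "upgrade"])
```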
Reporter
Comment 5•7 years ago
Ah, I see you've covered the last suggestion in more detail in bug 1376331.
Assignee
Comment 6•7 years ago
(In reply to Nick Thomas [:nthomas] from comment #4)
> (In reply to Ben Hearsum (:bhearsum) from comment #1)
> > It's probably a good idea to bump the timeouts immediately, and I think it
> > would be good to start producing a compressed dump instead of plain text.
> > That takes it down to 32M for me with --best. This would fix things in the
> > short/medium term.
>
> Agree that compression is a good idea. xz seems to do even better, only 17M
> at default setting of -6 (no improvement at -9).
Interesting! The latest db dump for me ended up at 22M with xz. zstd is another option, which compresses just as well as xz, but is faster.
Whatever we choose, I think it's best to do the compression either as part of https://github.com/mozilla/balrog/blob/8da1568e25bd5050da29a717c62a2a6defc5ff4f/scripts/manage-db.py#L96 or https://github.com/mozilla/balrog/blob/8da1568e25bd5050da29a717c62a2a6defc5ff4f/scripts/run.sh#L54. Once that's done, we should ask CloudOps to update the environment variable they set to ensure the file is named correctly on S3, and update https://github.com/mozilla/balrog/blob/8da1568e25bd5050da29a717c62a2a6defc5ff4f/scripts/import-db.py#L13 to pull it.
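Wherever it ends up hooking in, a minimal sketch of the compression step itself could look like this, assuming the dump has already been written to a plain-text file by the existing dump code; lzma is in the stdlib, so no new dependency, and preset 6 matches xz's default (-6) discussed above. The filename is just an example.

```python
# Sketch of the compression step, assuming the plain-text dump already exists.
import lzma
import os

def compress_dump(path, preset=6):
    """Compress path to path + '.xz' and remove the uncompressed original."""
    xz_path = path + ".xz"
    with open(path, "rb") as src, lzma.open(xz_path, "wb", preset=preset) as dst:
        # Copy in chunks so a 130M dump doesn't need to fit in memory at once.
        for chunk in iter(lambda: src.read(1024 * 1024), b""):
            dst.write(chunk)
    os.remove(path)
    return xz_path

compress_dump("dump.sql")  # example filename
```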
Reporter
Comment 7•7 years ago
Bumping this again - compression of the prod db dump would be a really nice win for everyone who contributes. Co-ordination of code changes & deploying to prod seem non-trivial, but perhaps we can add some 'try for compressed, fallback to not' logic around the place.
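One possible shape for that fallback logic in the import script is sketched below; the URL is a placeholder, not the real S3 location, and the function name is just illustrative.

```python
# Try the compressed dump first, fall back to the existing uncompressed one.
import lzma
import urllib.request
from urllib.error import HTTPError

DUMP_URL = "https://example-bucket.s3.amazonaws.com/dump/dump.sql"  # placeholder

def fetch_dump(dest="scripts/prod_db_dump.sql"):
    try:
        # Prefer the compressed dump once it starts being published...
        with urllib.request.urlopen(DUMP_URL + ".xz") as resp:
            data = lzma.decompress(resp.read())
    except HTTPError:
        # ...and fall back to the current uncompressed one if it isn't there yet.
        with urllib.request.urlopen(DUMP_URL) as resp:
            data = resp.read()
    with open(dest, "wb") as f:
        f.write(data)

fetch_dump()
```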
Assignee
Comment 8•7 years ago
(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #7)
> Bumping this again - compression of the prod db dump would be a really nice
> win for everyone who contributes. Co-ordination of code changes & deploying
> to prod seem non-trivial, but perhaps we can add some 'try for compressed,
> fallback to not' logic around the place.
Yeah...we should do this. Maybe we can start publishing a compressed version to a new location (instead of publishing the uncompressed one), and then update the init script to use it?
Priority: P3 → P1
Reporter
Comment 9•7 years ago
Sounds good to me.
Comment 10•7 years ago
Assignee
Comment 11•7 years ago
Reporter
Updated•7 years ago
Attachment #8933356 - Flags: review?(nthomas) → review+
Comment 12•7 years ago
Commit pushed to master at https://github.com/mozilla/balrog
https://github.com/mozilla/balrog/commit/c27c5829ac4737f2784f3933d5504db52a512323
bug 1384816: Compress production dumps with xz (#452). r=nthomas,relud
Assignee
Comment 13•7 years ago
(In reply to [github robot] from comment #12)
> Commit pushed to master at https://github.com/mozilla/balrog
>
> https://github.com/mozilla/balrog/commit/c27c5829ac4737f2784f3933d5504db52a512323
> bug 1384816: Compress production dumps with xz (#452). r=nthomas,relud
This is expected to hit prod next Wednesday, which means we should be able to start using the new, compressed dumps on Thursday, December 7th.
Assignee
Comment 14•7 years ago
This is in prod, but we just missed yesterday's cronjob. We should see a new, compressed dump around 2300 UTC today.
Assignee
Comment 15•7 years ago
We've got xz dumps now. I pushed https://github.com/mozilla/balrog/commit/24aeb30119142beff820725582ed83e30a3798a5 to get us using them. I'm also going to file a bug to get the dumps renamed - it's confusing to have a .sql file that isn't plain text.
Assignee
Comment 16•7 years ago
With bug 1430140 now in production, I updated the import script to point at the new location: https://github.com/mozilla/balrog/commit/f2cae685ad5dc92d488147eb652ee015c293d3d5
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Reporter
Comment 17•7 years ago
Working nicely here, thanks Ben and all.
Updated•5 years ago
Product: Release Engineering → Release Engineering Graveyard