Closed Bug 1384816 Opened 7 years ago Closed 7 years ago

Docker 'instance is unhealthy' errors in local dev due to slow prod db download

Categories

(Release Engineering Graveyard :: Applications: Balrog (backend), defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: bhearsum)

References

Details

Attachments

(3 files)

The prod db dump is over 130M now and is stored uncompressed at rest in S3 (each of which might be worth a bug of its own), so it can sometimes take quite a long time to download and import - specifically, longer than the healthcheck for balrogadmin, which is 10 retries 5s apart (see docker-compose.yml). When that happens docker prints a couple of 'instance <id> is unhealthy' messages for balrogagent and balrogui.

It turns out you can just leave 'docker-compose up' running until scripts/prod_db_dump.sql is non-zero in size, then ^C docker (and maybe do a 'down'), and then do 'up' again. The download is cached, so the setup works the second time. Simply bumping the number of retries for balrogadmin to 100 WFM, but obviously the right number depends on how long your local piece of string is (i.e. your connection speed). This could be a barrier to contributors.
It's probably a good idea to bump the timeouts immediately, and I think it would be good to start producing a compressed dump instead of plain text. That takes it down to 32M for me with --best. This would fix things in the short/medium term.

I'm not quite sure what to do for the long term. We never had a great experience trying to maintain a checked-in sample database, so I'm not sure it's a good idea to go back to that. Other random ideas:
- Somehow cap the size of the production dump. We already cut out a lot of history, but maybe we should be more aggressive about that.
- Check in a current dump, always import that to start with, and then upgrade the DB to latest. This would increase the size of the balrog repo clone, but you wouldn't be grabbing a huge file when starting the containers. The downside is that you wouldn't have current data by default.
Attached file bump retries (deleted) —
Quick and dirty "fix"
Attachment #8890896 - Flags: review?(nthomas)
Attachment #8890896 - Flags: review?(nthomas) → review+
Priority: -- → P3
(In reply to Ben Hearsum (:bhearsum) from comment #1)
> It's probably a good idea to bump the timeouts immediately, and I think it
> would be good to start producing a compressed dump instead of plain text.
> That takes it down to 32M for me with --best. This would fix things in the
> short/medium term.

Agree that compression is a good idea. xz seems to do even better, only 17M at the default setting of -6 (no improvement at -9).

> I'm not quite sure what to do for the long term. We never had a great
> experience trying to maintain a checked-in sample database, so I'm not sure
> it's a good idea to go back to that. Other random ideas:
> - Somehow cap the size of the production dump. We already cut out a lot of
>   history, but maybe we should be more aggressive about that.

The majority of the space seems to be the 260 or so releases, which is larger than I expected because we need all the releases users might get a partial from. I think we only need enough to look up the buildID in that case, and could strip out partials and completes at the locale level, but I'm not 100% convinced that's a maintainable solution.

> - Check in a current dump, always import that to start with, and then
>   upgrade the DB to latest. This would increase the size of the balrog repo
>   clone, but you wouldn't be grabbing a huge file when starting the
>   containers. The downside is that you wouldn't have current data by default.

Given the size wins from compression we could avoid doing this. However, it does suggest a way to handle the issue with the master code not matching the prod dump: look up the migrate_version in the prod dump, create a fresh db at that version, and then upgrade to the latest version afterwards.
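For illustration only, here is a rough sketch (in Python, like the repo's other helper scripts) of the "look up migrate_version in the dump, build the db at that version, then upgrade" idea. The INSERT format shown in the comments and the helper name are assumptions for the sake of the example, not taken from the real dump or scripts.

```python
import re

# Assumed dump line, for illustration:
#   INSERT INTO `migrate_version` VALUES ('Balrog','...',57);
MIGRATE_VERSION_RE = re.compile(
    r"INSERT INTO `?migrate_version`?\s+VALUES\s*\([^)]*?,\s*(\d+)\s*\)", re.I)


def dump_schema_version(dump_path):
    """Return the schema version recorded in a SQL dump, or None if absent."""
    with open(dump_path, "r", errors="replace") as dump:
        for line in dump:
            match = MIGRATE_VERSION_RE.search(line)
            if match:
                return int(match.group(1))
    return None

# With the version in hand, the dev setup could create a fresh db at that
# version, import the dump data, and then run the normal upgrade to the
# latest schema (the exact manage-db.py arguments are omitted here).
```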
Ah, I see you've covered the last suggestion in more detail in bug 1376331.
(In reply to Nick Thomas [:nthomas] from comment #4)
> (In reply to Ben Hearsum (:bhearsum) from comment #1)
> > It's probably a good idea to bump the timeouts immediately, and I think it
> > would be good to start producing a compressed dump instead of plain text.
> > That takes it down to 32M for me with --best. This would fix things in the
> > short/medium term.
>
> Agree that compression is a good idea. xz seems to do even better, only 17M
> at the default setting of -6 (no improvement at -9).

Interesting! The latest db dump for me ended up at 22M with xz. zstd is another option, which compresses just as well as xz but is faster.

Whatever we choose, I think it's best to do the compression either as part of https://github.com/mozilla/balrog/blob/8da1568e25bd5050da29a717c62a2a6defc5ff4f/scripts/manage-db.py#L96 or https://github.com/mozilla/balrog/blob/8da1568e25bd5050da29a717c62a2a6defc5ff4f/scripts/run.sh#L54. Once that's done, we should ask CloudOps to update the environment variable they set to ensure the file is named correctly on S3, and update https://github.com/mozilla/balrog/blob/8da1568e25bd5050da29a717c62a2a6defc5ff4f/scripts/import-db.py#L13 to pull it.
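As a minimal sketch of what the compression step could look like if it lives in the Python dump script: the file names below are placeholders, and preset 6 simply mirrors the "xz -6" sizes discussed above.

```python
import lzma
import shutil


def compress_dump(plain_path="prod_db_dump.sql",
                  compressed_path="prod_db_dump.sql.xz"):
    """Write an xz-compressed copy of the dump next to the plain one."""
    # The standard-library lzma module produces output compatible with the
    # xz command-line tool; preset=6 is xz's default compression level.
    with open(plain_path, "rb") as src, \
            lzma.open(compressed_path, "wb", preset=6) as dst:
        shutil.copyfileobj(src, dst)
    return compressed_path
```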
Depends on: 1390197
Bumping this again - compression of the prod db dump would be a really nice win for everyone who contributes. Co-ordinating the code changes and the prod deployment seems non-trivial, but perhaps we can add some 'try for compressed, fall back to not' logic around the place.
(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #7)
> Bumping this again - compression of the prod db dump would be a really nice
> win for everyone who contributes. Co-ordinating the code changes and the
> prod deployment seems non-trivial, but perhaps we can add some 'try for
> compressed, fall back to not' logic around the place.

Yeah...we should do this. Maybe we can start publishing a compressed version to a new location (instead of publishing the uncompressed one), and then update the init script to use it?
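A minimal sketch of that fallback on the import side, assuming plain Python; the dump URL and local file name are placeholders, not the real S3 location used by import-db.py.

```python
import lzma
import urllib.error
import urllib.request

DUMP_URL = "https://example.com/balrog/dump.sql"  # placeholder, not the real bucket


def fetch_dump(dest="prod_db_dump.sql"):
    """Prefer the xz-compressed dump, fall back to the uncompressed one."""
    try:
        with urllib.request.urlopen(DUMP_URL + ".xz") as resp:
            data = lzma.decompress(resp.read())
    except urllib.error.HTTPError:
        # Compressed dump not published (yet); grab the plain one instead.
        with urllib.request.urlopen(DUMP_URL) as resp:
            data = resp.read()
    with open(dest, "wb") as out:
        out.write(data)
    return dest
```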
Priority: P3 → P1
Sounds good to me.
Attached file use compressed dumps (deleted) —
Assignee: nobody → bhearsum
Status: NEW → ASSIGNED
Attachment #8933356 - Flags: review?(nthomas)
Attachment #8933356 - Flags: review?(nthomas) → review+
Commit pushed to master at https://github.com/mozilla/balrog

https://github.com/mozilla/balrog/commit/c27c5829ac4737f2784f3933d5504db52a512323
bug 1384816: Compress production dumps with xz (#452). r=nthomas,relud
(In reply to [github robot] from comment #12)
> Commit pushed to master at https://github.com/mozilla/balrog
>
> https://github.com/mozilla/balrog/commit/c27c5829ac4737f2784f3933d5504db52a512323
> bug 1384816: Compress production dumps with xz (#452). r=nthomas,relud

This is expected to hit prod next Wednesday, which means we should be able to start using the new, compressed dumps on Thursday, December 7th.
Depends on: 1429414
This is in prod, but we just missed yesterday's cronjob. We should see a new, compressed dump around 2300 UTC today.
We've got xz dumps now. I pushed https://github.com/mozilla/balrog/commit/24aeb30119142beff820725582ed83e30a3798a5 to get us using them. I'm also going to file a bug to get the dumps renamed - it's confusing to have a .sql file that isn't plain text.
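For what it's worth, until the rename lands, an importer could sniff the xz magic bytes rather than trusting the .sql extension. This is a hypothetical illustration, not the code from the commit above.

```python
import lzma

XZ_MAGIC = b"\xfd7zXZ\x00"  # the 6-byte xz stream header


def read_dump(path):
    """Return the dump as text, decompressing it if it's actually xz data."""
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(XZ_MAGIC):
        data = lzma.decompress(data)
    return data.decode("utf-8", errors="replace")
```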
Depends on: 1430140
Depends on: 1434280
With bug 1430140 now in production I updated the import script to point at the new location: https://github.com/mozilla/balrog/commit/f2cae685ad5dc92d488147eb652ee015c293d3d5
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Working nicely here, thanks Ben and all.
Product: Release Engineering → Release Engineering Graveyard