Closed Bug 1022368 Opened 10 years ago Closed 10 years ago

Kill some puppet masters in AWS

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: rail, Assigned: rail)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/710] )

Attachments

(1 file, 1 obsolete file)

no-puppet2.diff 10 years ago Rail Aliiev [:rail] (deleted), patch	Callek : review+ rail : checked-in-	Details \| Diff \| Splinter Review
killem-again.diff 10 years ago Rail Aliiev [:rail] (deleted), patch	dustin : review+ rail : checked-in+	Details \| Diff \| Splinter Review

Rail Aliiev [:rail]

Assignee

Description

•

10 years ago

Since we reduced our load on puppet, I think we can kill 2 out of 4 running AWS puppet masters. Any reason why we shoulnd't do this?

Nick Thomas [:nthomas] (UTC+12)

Comment 1

•

10 years ago

Redundancy ? If we go down to one per region, and that machine goes down, do we try other regions ?

Rail Aliiev [:rail]

Assignee

Comment 2

•

10 years ago

If one of the region masters goes down the slaves fall back to one of the next available puppet masters in all regions (including in-house). We have this in our config: http://hg.mozilla.org/build/puppet/file/27b30fedfb63/manifests/moco-config.pp#l20 The implementation is here: http://hg.mozilla.org/build/puppet/file/27b30fedfb63/modules/config/lib/puppet/parser/functions/sort_servers_by_group.rb Example output of /etc/puppet/puppetmasters.txt (the file we use to iterate over puppet masters by puppet agents) from one of the slaves: $ cat /etc/puppet/puppetmasters.txt releng-puppet2.srv.releng.use1.mozilla.com releng-puppet1.srv.releng.use1.mozilla.com releng-puppet2.srv.releng.usw2.mozilla.com releng-puppet2.build.scl1.mozilla.com releng-puppet1.srv.releng.scl3.mozilla.com releng-puppet1.srv.releng.usw2.mozilla.com releng-puppet2.srv.releng.scl3.mozilla.com

Dustin J. Mitchell [:dustin] (he/him)

Comment 3

•

10 years ago

At this point, we have 8, so if we lose one then the load on the others increases by about 14%. We'll lose scl1 soon, too, so if we turn off two of the AWS masters we'll only have four, which means that a failure will increase load by 33%. So we should watch load carefull after turning these off, and if it's getting anywhere near capacity, add more capacity, either in more powerful instances or more instances.

Rail Aliiev [:rail]

Assignee

Comment 4

•

10 years ago

(In reply to Dustin J. Mitchell [:dustin] from comment #3) > We'll lose scl1 soon, too, so if we turn off two of the AWS masters we'll > only have four, which means that a failure will increase load by 33%. ... which was decreased last week by "a lot" :) I don't have any puppet master load stats, but the fact that we moved 2-3K instances off of puppet, leaving rarely started on-demand instances the only users of puppet, makes me think that the 14% and 33% are way behind the decrease.

Dustin J. Mitchell [:dustin] (he/him)

Comment 5

•

10 years ago

Yep, I think it's a good idea to turn them off -- we just need to be aware of the risk. Load average seems to be the best metric for load.

Rail Aliiev [:rail]

Assignee

Comment 6

•

10 years ago

Attached patch no-puppet2.diff (obsolete) (deleted) — Details — Splinter Review

Not shutting down the masters yet.

Attachment #8437061 - Flags: review?(dustin)

Justin Wood (:Callek)

Updated

•

10 years ago

Attachment #8437061 - Flags: review?(dustin) → review+

Rail Aliiev [:rail]

Assignee

Comment 7

•

10 years ago

Comment on attachment 8437061 [details] [diff] [review] no-puppet2.diff https://hg.mozilla.org/build/puppet/rev/9f2fb7db6535

Attachment #8437061 - Flags: checked-in+

Amy Rich [:arr] [:arich]

Comment 8

•

10 years ago

It looks like the cron jobs are still running to try to rsync stuff from releng-puppet2.srv.releng.scl3.mozilla.com and it's filling up the releng-shared puppet-errors folder with hundreds of authentication failed messages.

Nick Thomas [:nthomas] (UTC+12)

Comment 9

•

10 years ago

I've disabled the jobs in releng-puppet2.srv.releng.{use1,usw2}:/etc/cron.d/{rsync-secrets,ssl_git_sync} to quench the spamming.

Rail Aliiev [:rail]

Assignee

Comment 10

•

10 years ago

I completely forgot that we also use these masters for mock packages... Probably it's worth to leave them alive but just change the instance type from m3.xlarge back to m3.large.

Rail Aliiev [:rail]

Assignee

Comment 11

•

10 years ago

Comment on attachment 8437061 [details] [diff] [review] no-puppet2.diff remote: https://hg.mozilla.org/build/puppet/rev/da69a61b039e remote: https://hg.mozilla.org/build/puppet/rev/7d69f5e64c78

Attachment #8437061 - Flags: checked-in+ → checked-in-

Rail Aliiev [:rail]

Assignee

Comment 12

•

10 years ago

The masters are back, all use m3.large. No puppet errors so far.

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → WONTFIX

Rail Aliiev [:rail]

Assignee

Comment 13

•

10 years ago

It's time to revisit this. We barely use puppet masters in AWS. The main consumers are buildbot-masters, slaves switched to AMI-based solution.

Status: RESOLVED → REOPENED

Resolution: WONTFIX → ---

:kanban-engops

Updated

•

10 years ago

Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/655]

:kanban-engops

Updated

•

10 years ago

Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/655] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/710] [kanban:engops:https://kanbanize.com/ctrl_board/6/655]

Rail Aliiev [:rail]

Assignee

Comment 14

•

10 years ago

Switched to the scl3 masters for golden images. https://hg.mozilla.org/build/cloud-tools/rev/31b64f52d7a6

Rail Aliiev [:rail]

Assignee

Comment 15

•

10 years ago

Attached patch killem-again.diff (deleted) — Details — Splinter Review

Take 2. Once this landed I'm going to watch the logs to see what accesses them before I shut them down.

Attachment #8437061 - Attachment is obsolete: true

Attachment #8516869 - Flags: review?(dustin)

Dustin J. Mitchell [:dustin] (he/him)

Comment 16

•

10 years ago

Comment on attachment 8516869 [details] [diff] [review] killem-again.diff Review of attachment 8516869 [details] [diff] [review]: ----------------------------------------------------------------- Hm, I thought you were going to remove one from each region. Can you do a quick grep to figure out the rate of catalog compilation on these hosts? If it's a tiny fraction of that on the scl3 hosts, this is probably OK.

Attachment #8516869 - Flags: review?(dustin) → review+

Dustin J. Mitchell [:dustin] (he/him)

Comment 17

•

10 years ago

I see about 350 catalog compilations per hour on an scl3 master. I see about 20/hr on an AWS master. So the total is 350*2 + 20*4 = 780/hr, and divided by two scl3 masters that's 390/hr. Which seems unlikely to cause excessive pain.

Dustin J. Mitchell [:dustin] (he/him)

Comment 18

•

10 years ago

When this does land, please file a relops bug for me to remove flows for those puppetmasters and update firewall-tests accordingly.

Dustin J. Mitchell [:dustin] (he/him)

Updated

•

10 years ago

Blocks: 1090508

Nobody; OK to take it and work on it

Updated

•

10 years ago

Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/710] [kanban:engops:https://kanbanize.com/ctrl_board/6/655] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/710]

Rail Aliiev [:rail]

Assignee

Comment 19

•

10 years ago

Comment on attachment 8516869 [details] [diff] [review] killem-again.diff remote: https://hg.mozilla.org/build/puppet/rev/3e7b3aedb9aa remote: https://hg.mozilla.org/build/puppet/rev/841dc1c7197a

Attachment #8516869 - Flags: checked-in+

Rail Aliiev [:rail]

Assignee

Updated

•

10 years ago

Depends on: 1100560

Dustin J. Mitchell [:dustin] (he/him)

Updated

•

10 years ago

Depends on: 1100573

Pete Moore [:pmoore][:pete]

Comment 20

•

10 years ago

Thanks for ack'ing alerts Rail. BTW do we need to do anything to permanently remove nagios checks, or is the ack sufficient?

Rail Aliiev [:rail]

Assignee

Comment 21

•

10 years ago

(In reply to Pete Moore [:pete][:pmoore] from comment #20) > Thanks for ack'ing alerts Rail. BTW do we need to do anything to permanently > remove nagios checks, or is the ack sufficient? I filed bug 1100560 to deal with that.

Amy Rich [:arr] [:arich]

Updated

•

10 years ago

Depends on: 1101051

Rail Aliiev [:rail]

Assignee

Comment 22

•

10 years ago

I shut the masters down this morning. I'll let them sit around for a bit before I terminate them.

Dustin J. Mitchell [:dustin] (he/him)

Comment 23

•

10 years ago

A little too late, but for future reference, https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/HowTo/Remove_a_Puppetmaster is the guide for doing this. bug 1101051 had some other missed work for shutting down these hosts. I just changed all of the CNAMEs pointing to the now-gone hosts to point to scl3 hosts instead.

Rail Aliiev [:rail]

Assignee

Comment 24

•

10 years ago

I also remove the IPs from the puppetagain-apt.pvt.build.mozilla.org A entry list.

Rail Aliiev [:rail]

Assignee

Comment 25

•

10 years ago

Dustin, do you think that I can go ahead and kill the AWS masters now? Are the in-house masters OK?

Dustin J. Mitchell [:dustin] (he/him)

Comment 26

•

10 years ago

They're looking fine. However, given the fluid nature of near-term plans, I'd suggest leaving them in place. Given hindsight, I think it might have been smarter not to shut these down, since eventually we'll be moving things *out* of scl3, not in.

Rail Aliiev [:rail]

Assignee

Comment 27

•

10 years ago

I prefer to add them when we need them. The process is quite straight forward.

Rail Aliiev [:rail]

Assignee

Comment 28

•

10 years ago

Done.

Status: REOPENED → RESOLVED

Closed: 10 years ago → 10 years ago

Resolution: --- → FIXED

You need to log in before you can comment on or make changes to this bug.