Closed
Bug 614786
Opened 14 years ago
Closed 14 years ago
Rotate ftp staging site to new disk array
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: justdave, Assigned: justdave)
References
Details
(Whiteboard: [downtime 3 of 3 on Thu 2/3 6am pdt])
Here's the current situation:
Available disk arrays:
    Filesystem     Size  Used  Avail  Use%
[A] Netapp #1      3.1T  2.7T  408G   87%
[B] EQL via NFS    2.0T  1.8T  247G   88%
[C] Netapp #2      4.8T  2.8T  2.0T   59%
Currently mounted as:
[A] /pub/mozilla.org
[B] /pub/mozilla.org/firefox
[C] not yet in use
With 5.1 TB of total space available
The plan:
[A] /pub/mozilla.org/firefox
[B] going away
[C] /pub/mozilla.org
With 7.9 TB of total space available
The original plan was to recombine everything onto [C], but since we're already using 4.5 TB (out of an available 4.8 TB on the new drive), I think it makes more sense to keep the old netapp array in the mix and eliminate the iscsi-over-nfs hack. Doing it this way, in addition to eliminating the performance issues with the iscsi-over-nfs setup we're currently using, will add 1.8 TB of disk capacity instead of removing 0.3 TB.
Doing this move is going to require TWO downtimes.
1) Move [A] to [C]
2) Move [B] to [A]
We obviously can't do #2 until #1 is done, and there will be additional prep required between the two steps.
An initial sync of [A] to [C] has already been completed (as evidenced by the disk usage in the table at the top). Incremental syncs have been tested at approximately 70 minutes per run. To ensure no data loss, we'll need to make sure nobody can write to the disk during the final sync before remounting the drives in their swapped positions. I would recommend advertising a 2-hour outage for this, and we'll probably have it up and running again well before that. Our technology has improved: the last time we did this (with only 1.5 TB of data at the time) it took over 6 hours. :)
The amount of time required for the [B]-to-[A] move is unknown, and there's no way to test it until [A] is freed up by the first move. I suspect it'll take longer, despite the smaller dataset, because of the NFS-via-Linux step in the middle getting the data off the old drive; but that's only a theory until it's actually tested.
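For illustration, here's a minimal sketch of the sync approach described above. The mount points are placeholders (the real paths aren't spelled out here), and the rsync options are assumptions rather than the flags actually used.

SRC=/mnt/array-a/pub/mozilla.org/   # placeholder for [A]
DST=/mnt/array-c/pub/mozilla.org/   # placeholder for [C]

# Incremental passes while uploads are still allowed (~70 minutes each in testing):
rsync -aH --delete "$SRC" "$DST"

# During the downtime: block writes on the upload host, run one last pass so nothing
# changes underneath us, then remount the drives in their swapped positions:
rsync -aH --delete "$SRC" "$DST"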
Flags: needs-treeclosure?
Flags: needs-downtime+
Assignee
Comment 1•14 years ago
A graphical diagram of the current setup is available at http://people.mozilla.org/~justdave/MirrorNetwork.pdf
Assignee
Updated•14 years ago
Whiteboard: [pending scheduling of downtime]
Comment 2•14 years ago
Step 1 sounds like it can be done in whatever the next downtime is. I'm on buildduty most of next week, we can probably figure something out. Could Step 2 wait until the holidays, when it's easier to get longer downtime windows?
Assignee
Comment 3•14 years ago
Sure. We'll probably want to wait until we get a trial run on the incremental syncs for step 2 before deciding how long to wait. It may surprise us and go faster for all we know. Then again, it may not. :)
Comment 4•14 years ago
Sounds like a good plan to me, with the added advantage that the netapp partitions can be resized (storage permitting).
(In reply to comment #0)
> The amount of time required for the [B]-to-[A] move is unknown, and there's no
> way to test it until [A] is freed up by the first move. I suspect it'll take
> longer, despite the smaller dataset, because of the NFS-via-Linux step in the
> middle getting the data off the old drive; but that's only a theory until it's
> actually tested.
No doubt you already thought of doing the syncs on dm-ftp01.m.o to avoid the extra NFS hop.
Assignee
Comment 5•14 years ago
(In reply to comment #4)
> No doubt you already thought of doing the syncs on dm-ftp01.m.o to avoid the
> extra NFS hop.
Actually, the thought had slipped my mind, but that's a good idea. We'll have to fix the ACLs to allow us to mount it read/write over there (it's read-only currently), but that's certainly doable.
Comment 6•14 years ago
justdave: could this be done on the 17th? zandr will be doing a tree-closing downtime in bug#616658 that day, so it would be great to do this at the same time.
Assignee
Comment 7•14 years ago
Depends on the time of day. I'll be doing my RHEL6 recertification exam for my RHCE that day, 9am to 4:30pm Central time, and given that it's downtown Chicago, I'd allow at least 90 minutes travel time to get back to my sister's place and get online afterwards. So I guess if we're talking after 5pm pacific it'd probably work.
Comment 8•14 years ago
(In reply to comment #6)
> justdave: could this be done on 17th? zandr will be doing a tree-closing
> downtime in bug#616658 that day, so it would be great to do this at the same
> time.
(In reply to comment #7)
> Depends on the time of day. I'll be doing my RHEL6 recertification exam for my
> RHCE that day, 9am to 4:30pm Central time, and given that it's downtown
> Chicago, I'd allow at least 90 minutes travel time to get back to my sister's
> place and get online afterwards. So I guess if we're talking after 5pm pacific
> it'd probably work.
Per zandr, the downtime will be from 8am to 5pm (Pacific), but that includes time for spinning back up systems after the recabling work is finished.
On the 17th, your window would be from 8am to 2pm (Pacific). Does that work for you? If not, is there someone else in IT who can do this on your behalf?
Assignee
Comment 9•14 years ago
10am pacific might work. I've actually got two separate exams, one is 9:00 to 11:30 central, the other 2:00 to 4:30 central, so other than grabbing lunch, I'll basically be sitting around doing nothing for 2.5 hours between the two exams. With the estimated runtime for the switch being 90 minutes that'll probably be enough time to do it.
Updated•14 years ago
Whiteboard: [pending scheduling of downtime]
Assignee
Comment 12•14 years ago
Part 1 happened yesterday.
Part 2's timing will depend on figuring out how long it'll take to sync the filesystems. I expect the initial sync to take a day or two, and the followup syncs will determine how long of an outage we need.
Is RelEng happy with the state of stage right now? (data integrity I mean). The next step is to wipe out the contents of the array we just vacated in prep for copying the firefox stuff into it, and I want to make sure we don't need it for a data reversion or something first.
Comment 13•14 years ago
(In reply to comment #12)
> Part one happened yesterday.
>
> Part 2 will depend on timing figuring out how long it'll take to sync the
> filesystems. I expect the initial sync to take a day or two, and the followup
> syncs will determine how long of an outage we need.
>
> Is RelEng happy with the state of stage right now? (data integrity I mean).
> The next step is to wipe out the contents of the array we just vacated in prep
> for copying the firefox stuff into it, and I want to make sure we don't need it
> for a data reversion or something first.
Per IRC, we're happy with things and haven't seen any issues. Go ahead.
Assignee
Comment 14•14 years ago
ok, so to cleanly copy this stuff over to the new partition, I need to remove a couple of the bind mounts on dm-ftp01. This *shouldn't* affect anything visible to production, but it depends on the order the mounts were initially set up, and there's a really slim chance that the tryserver and tinderbox directories might briefly disappear.
> * 10.253.0.139:/data/try-builds on /mnt/cm-ixstore01/try-builds type nfs (rw,noatime,rsize=32768,wsize=32768,nfsvers=3,proto=tcp,addr=10.253.0.139)
> * 10.253.0.139:/data/tinderbox-builds on /mnt/cm-ixstore01/tinderbox-builds type nfs (rw,noatime,rsize=32768,wsize=32768,nfsvers=3,proto=tcp,addr=10.253.0.139)
> * /mnt/eql/builds/firefox on /mnt/netapp/stage/archive.mozilla.org/pub/firefox type bind (ro,bind,_netdev)
> X /mnt/cm-ixstore01/try-builds/trybuilds on /mnt/eql/builds/firefox/tryserver-builds/old type none (rw,bind)
> * /mnt/cm-ixstore01/try-builds/trybuilds on /mnt/netapp/stage/archive.mozilla.org/pub/firefox/tryserver-builds/old type bind (ro,bind,_netdev)
> X /mnt/cm-ixstore01/tinderbox-builds/tinderbox-builds on /mnt/eql/builds/firefox/tinderbox-builds type bind (ro,bind,_netdev)
> * /mnt/cm-ixstore01/tinderbox-builds/tinderbox-builds on /mnt/netapp/stage/archive.mozilla.org/pub/firefox/tinderbox-builds type bind (ro,bind,_netdev)
The two with the X in front are the two I need to get rid of. The ones under /mnt/netapp/stage are the ones that are visible on stage.m.o and ftp.m.o. *IF* the ixstore mounts were mounted into eql before eql was mounted into netapp, *THEN* there's a chance that those directories will disappear from netapp when I unmount them from eql, which will require the netapp versions of those bind mounts to be unmounted and remounted. If they were mounted afterwards, then they won't disappear and the production directories won't be affected.
Just to be safe, we're scheduling a downtime to do the unmounts, tentatively the morning (EST) of Wed Jan 12.
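For reference, here's a rough command-level sketch of that check-and-unmount step as it might run on dm-ftp01. The paths come from the mount list above, but the commands themselves are an illustration rather than a record of what was run, and the ro flag from the original bind setup would need a separate remount to restore.

# Mount order (oldest first) determines whether the netapp-side binds survive:
grep -nE 'cm-ixstore01|/mnt/eql|/mnt/netapp' /proc/self/mounts

# Drop the two X-marked binds that hang off the old eql path:
umount /mnt/eql/builds/firefox/tryserver-builds/old
umount /mnt/eql/builds/firefox/tinderbox-builds

# Worst case (bad ordering): the production copies vanish too, so re-bind them:
mount --bind /mnt/cm-ixstore01/try-builds/trybuilds \
      /mnt/netapp/stage/archive.mozilla.org/pub/firefox/tryserver-builds/old
mount --bind /mnt/cm-ixstore01/tinderbox-builds/tinderbox-builds \
      /mnt/netapp/stage/archive.mozilla.org/pub/firefox/tinderbox-builds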
Whiteboard: [downtime 2 of 3 on Jan 12]
Comment 15•14 years ago
We may not be able to hit this downtime: even though we have all our ducks in a row, we still have to run this completely up the chain-of-command flagpole.
So I started that process just now and have tossed the ball to zandr, since he can better coordinate with IT. You guys let me know when this gets scheduled.
Assignee
Comment 16•14 years ago
OK, the process described in comment 14 has been completed. It turns out the mounts were in the correct order, so we did *not* wind up having any downtime on the production paths, and we could have gotten away with not shutting everything down after all. Better safe than sorry, though, since there wasn't any guarantee in advance.
The next step is the final cutover; timing on that will depend on how long an incremental rsync between the two partitions takes, which will probably take me a couple of days to determine.
Whiteboard: [downtime 2 of 3 on Jan 12] → [downtime 3 of 3 on ???] [waiting for duration to be figured out]
Comment 17•14 years ago
There's some fallout from this morning: http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/ is empty.
Assignee
Comment 18•14 years ago
surf proxies it to dm-ftp01, which for some reason had the httpd docroot pointed at the mount points we removed instead of the supposed-to-be-public-facing ones. Changed httpd to point at the correct ones; it works now (as of 08:47).
Updated•14 years ago
Assignee: justdave → zandr
Comment 19•14 years ago
(In reply to comment #16)
> OK, the process in step 14 has been completed. Turns out they were mounted in
> the correct order, so we did *not* wind up having any downtime on the
> production paths, and we could have gotten away with not shutting everything
> down after all. Better safe than sorry though, since there wasn't any
> guarantee in advance.
>
> Next step is the final cutover, timing on that will depend on how long an
> incremental rsync between the two partitions takes, which will probably take me
> a couple days to determine.
Any ETA?
Comment 21•14 years ago
(In reply to comment #19)
> (In reply to comment #16)
> > OK, the process in step 14 has been completed. Turns out they were mounted in
> > the correct order, so we did *not* wind up having any downtime on the
> > production paths, and we could have gotten away with not shutting everything
> > down after all. Better safe than sorry though, since there wasn't any
> > guarantee in advance.
> >
> > Next step is the final cutover, timing on that will depend on how long an
> > incremental rsync between the two partitions takes, which will probably take me
> > a couple days to determine.
>
> Any ETA?
justdave/zandr: Any ETA?
Bumping priority based on comment in bug#629129:
"We've got a few alerts about this partition the past couple of weeks. Right now
we're sitting at about 95G (~5%) free. We're not going to last much longer with
this though, we increase use by many GB per day, for nightlies.
I know some people, Joduinn and justdave in particular, chatted about stage
disk space in the past 6 months, but other than some new mounts for older
try/dep builds, I don't know what came out of it.
In any case, this will require action in the near future."
Severity: normal → major
Comment 23•14 years ago
Even after getting us back to > 100G yesterday, Nagios went off again:
11:56 <nagios> [47] surf:disk - /mnt/netapp/stage/archive.mozilla.org/pub/firefox is WARNING: DISK WARNING - free space: /mnt/netapp/stage/archive.mozilla.org/pub/firefox 73232 MB (4%):
Elevated try load is part of this, and we'll probably gain some space on Monday when many of this week's builds are archived to a different partition, but we'll certainly spike again next Thursday/Friday.
Assignee
Comment 24•14 years ago
This is being hampered by the large number of tryserver builds getting submitted in the last week or so (around 100 per day!), since those are stored on the partition we're trying to move. An rsync of 3 days' worth of changes just finished and took over 22 hours. I've got another rsync running now picking up those 22 hours' worth of changes. Making this happen is going to require finding a time of day when the least amount of change is happening, and keeping a continuous rsync going to find the shortest possible time for an incremental sync. If the continuous sync doesn't turn up a good time of day, we may have to do something like asking people not to submit try builds for several hours in advance of the planned move time.
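For context, here's a sketch of the kind of continuous-sync loop behind the timings in the following comments. The source path is the eql mount from comment 14; the destination mount point for the old netapp array isn't named in this bug, so it's a placeholder, and the rsync options are assumptions.

SRC=/mnt/eql/builds/firefox/
DST=/mnt/netapp-old/firefox/   # placeholder for the [A] array's new mount point
while true; do
    start=$(date +%s)
    rsync -aH --delete "$SRC" "$DST"
    end=$(date +%s)
    # log "time spent" and "completion time", roughly matching the tables below:
    printf '%dm%02ds  %s\n' $(( (end - start) / 60 )) $(( (end - start) % 60 )) "$(date)"
done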
Assignee
Comment 25•14 years ago
The most recent incremental sync took 7 hours to sync 22 hours' worth of data (coming straight off the one that took 22 hours to transfer 3 days' worth).
Assignee
Comment 26•14 years ago
timing on the "continuous run" passes over the last day or so:
time spent completion time
------------ ----------------------------
141m05.367s Fri Jan 28 02:12:15 PST 2011
70m44.413s Fri Jan 28 03:23:00 PST 2011
73m02.988s Fri Jan 28 04:36:03 PST 2011
168m53.443s Fri Jan 28 07:24:56 PST 2011
201m52.250s Fri Jan 28 10:46:49 PST 2011
52m55.436s Fri Jan 28 11:39:44 PST 2011
Assignee
Comment 27•14 years ago
time spent completion time
------------ ----------------------------
51m19.049s Fri Jan 28 12:31:04 PST 2011
55m07.203s Fri Jan 28 13:26:11 PST 2011
50m41.688s Fri Jan 28 14:16:53 PST 2011
54m07.810s Fri Jan 28 15:11:01 PST 2011
49m55.261s Fri Jan 28 16:00:56 PST 2011
45m57.048s Fri Jan 28 16:46:53 PST 2011
44m50.918s Fri Jan 28 17:31:44 PST 2011
43m10.745s Fri Jan 28 18:14:55 PST 2011
48m11.283s Fri Jan 28 19:03:07 PST 2011
49m37.833s Fri Jan 28 19:52:45 PST 2011
46m47.324s Fri Jan 28 20:39:32 PST 2011
40m41.071s Fri Jan 28 21:20:13 PST 2011
43m31.812s Fri Jan 28 22:03:45 PST 2011
Assignee
Comment 28•14 years ago
If today was a representative day, then it looks like the best time to do this is sometime between 4pm and 9pm Pacific, and the midnight-to-11am block should be avoided at all costs.
Assignee
Comment 29•14 years ago
And our downtime is going to be about an hour.
Comment 30•14 years ago
zandr, when's good to get this scheduled?
Comment 31•14 years ago
(In reply to comment #30)
> zandr, when's good to get this scheduled?
Based on comment 28, this looks like a good fit for the usual Tuesday 7pm PST window. Will socialize that today so we can announce by EOD.
Comment 32•14 years ago
If you're looking for a time when you can close the tryserver tree to get this done, note bug 630065 - you can't currently actually close it (though I guess maybe you could shut off the try buildmaster, so builds wouldn't happen even though pushes would continue).
Comment 33•14 years ago
I'm still learning my way around the RelEng infra, so apologies if this is a dumb question:
Should bug 630065 block this downtime? Or is announcing the downtime and saying "I told you so" sufficient?
Comment 34•14 years ago
We've had tons of downtimes without being able to truly close Try, I don't think we should block on that.
Comment 35•14 years ago
(In reply to comment #23)
> Even after getting us back to > 100G yesterday, Nagios went off again:
> 11:56 <nagios> [47] surf:disk -
> /mnt/netapp/stage/archive.mozilla.org/pub/firefox is WARNING: DISK WARNING -
> free space: /mnt/netapp/stage/archive.mozilla.org/pub/firefox 73232 MB (4%):
>
> Elevated try load is part of this, and we'll probably gain some space on Monday
> when many of this weeks builds are archived to a different partition, but we'll
> certainly spike again next Thursday/Friday.
(In reply to comment #24)
> This is being hampered by the large number of tryserver builds getting
> submitted in the last week or so (around 100 per day!) since those are stored
> on the partition we're trying to move. An rsync of 3 days' worth just
> completed and took over 22 hours to complete. I've got another rsync running
> now picking up that 22 hours' worth of changes. Making this happen is going to
> require finding a time of day when the least amount of change is happening and
> getting a continuous rsync going trying to get the shortest time possible for
> an incremental sync. If the continuous sync doesn't manage to find a good time
> of day for it we may have to do something like asking people not to submit try
> builds for several hours in advance of the planned move time or somesuch.
justdave, I agree there is heavy load on tryserver, but there is also heavy load on tm and m-c... all of which are posting builds under /pub/firefox. Unless I'm missing something, you will actually need to close *all* trees, not just TryServer.
Am I missing something?
Assignee
Comment 36•14 years ago
(In reply to comment #35)
> justdave, I agree there is heavy load on tryserver, but there are also heavy
> load on tm and m-c... all of which are posting builds under /pub/firefox.
> Unless I'm missing something, you will actually need to close *all* trees, not
> just TryServer.
>
> Am I missing something?
I don't remember implying anywhere that we wouldn't have to close all trees, or that only try server would need to be.
I'll have another couple of days' worth of rsync timings (it's been running continuously all weekend) in a few minutes.
Assignee
Comment 37•14 years ago
Here are the timings, picking up from where I left off in comment 27.
time spent completion time
------------ ----------------------------
55m07.203s Fri Jan 28 13:26:11 PST 2011
50m41.688s Fri Jan 28 14:16:53 PST 2011
54m07.810s Fri Jan 28 15:11:01 PST 2011
49m55.261s Fri Jan 28 16:00:56 PST 2011
45m57.048s Fri Jan 28 16:46:53 PST 2011
44m50.918s Fri Jan 28 17:31:44 PST 2011
43m10.745s Fri Jan 28 18:14:55 PST 2011
48m11.283s Fri Jan 28 19:03:07 PST 2011
49m37.833s Fri Jan 28 19:52:45 PST 2011
46m47.324s Fri Jan 28 20:39:32 PST 2011
40m41.071s Fri Jan 28 21:20:13 PST 2011
43m31.812s Fri Jan 28 22:03:45 PST 2011
41m12.234s Fri Jan 28 22:44:58 PST 2011
42m10.788s Fri Jan 28 23:29:58 PST 2011
45m28.571s Sat Jan 29 00:15:27 PST 2011
47m43.769s Sat Jan 29 01:03:11 PST 2011
51m44.002s Sat Jan 29 01:54:55 PST 2011
48m30.270s Sat Jan 29 02:43:25 PST 2011
47m19.974s Sat Jan 29 03:30:45 PST 2011
78m19.010s Sat Jan 29 04:49:04 PST 2011
147m25.606s Sat Jan 29 07:16:29 PST 2011
174m11.984s Sat Jan 29 10:10:41 PST 2011
49m43.413s Sat Jan 29 11:00:25 PST 2011
44m13.218s Sat Jan 29 11:44:38 PST 2011
44m31.266s Sat Jan 29 12:29:09 PST 2011
46m37.220s Sat Jan 29 13:15:47 PST 2011
45m43.052s Sat Jan 29 14:01:30 PST 2011
44m22.308s Sat Jan 29 14:45:52 PST 2011
46m33.268s Sat Jan 29 15:32:25 PST 2011
43m45.723s Sat Jan 29 16:16:11 PST 2011
44m19.970s Sat Jan 29 17:00:31 PST 2011
43m49.773s Sat Jan 29 17:44:21 PST 2011
42m59.756s Sat Jan 29 18:27:21 PST 2011
43m22.887s Sat Jan 29 19:10:44 PST 2011
42m27.148s Sat Jan 29 19:53:11 PST 2011
40m55.606s Sat Jan 29 20:34:06 PST 2011
42m25.584s Sat Jan 29 21:16:32 PST 2011
39m36.874s Sat Jan 29 21:56:09 PST 2011
37m14.723s Sat Jan 29 22:33:24 PST 2011
37m37.328s Sat Jan 29 23:11:01 PST 2011
38m45.227s Sat Jan 29 23:49:46 PST 2011
43m01.181s Sun Jan 30 00:32:47 PST 2011
43m52.116s Sun Jan 30 01:16:39 PST 2011
48m28.173s Sun Jan 30 02:05:08 PST 2011
44m39.384s Sun Jan 30 02:49:47 PST 2011
48m05.095s Sun Jan 30 03:37:52 PST 2011
72m21.407s Sun Jan 30 04:50:14 PST 2011
144m43.013s Sun Jan 30 07:14:57 PST 2011
149m49.792s Sun Jan 30 09:44:46 PST 2011
45m24.427s Sun Jan 30 10:30:11 PST 2011
44m39.833s Sun Jan 30 11:14:51 PST 2011
42m42.739s Sun Jan 30 11:57:34 PST 2011
43m49.922s Sun Jan 30 12:41:23 PST 2011
43m28.116s Sun Jan 30 13:24:52 PST 2011
40m53.771s Sun Jan 30 14:05:45 PST 2011
39m38.953s Sun Jan 30 14:45:24 PST 2011
39m30.793s Sun Jan 30 15:24:55 PST 2011
39m24.620s Sun Jan 30 16:04:20 PST 2011
41m37.206s Sun Jan 30 16:45:57 PST 2011
44m25.262s Sun Jan 30 17:30:22 PST 2011
40m42.905s Sun Jan 30 18:11:05 PST 2011
40m49.902s Sun Jan 30 18:51:55 PST 2011
40m38.700s Sun Jan 30 19:32:34 PST 2011
40m24.559s Sun Jan 30 20:12:58 PST 2011
41m15.890s Sun Jan 30 20:54:14 PST 2011
38m23.992s Sun Jan 30 21:32:38 PST 2011
40m51.554s Sun Jan 30 22:13:30 PST 2011
39m13.943s Sun Jan 30 22:52:44 PST 2011
40m31.389s Sun Jan 30 23:33:15 PST 2011
42m41.015s Mon Jan 31 00:15:56 PST 2011
43m56.014s Mon Jan 31 00:59:52 PST 2011
44m22.505s Mon Jan 31 01:44:15 PST 2011
39m27.340s Mon Jan 31 02:23:42 PST 2011
40m51.176s Mon Jan 31 03:04:33 PST 2011
46m21.102s Mon Jan 31 03:50:55 PST 2011
114m4.194s Mon Jan 31 05:44:59 PST 2011
243m16.968s Mon Jan 31 09:48:16 PST 2011
194m47.532s Mon Jan 31 13:03:03 PST 2011
Comment 38•14 years ago
This is consistent with our current theory, which is that the nightlies create a huge amount of stuff to move, and it takes hours to push through.
Comment 39•14 years ago
1) justdave, thanks - this is great data.
2) from the releng+zandr meeting: this means we can do this downtime anytime *except* the few hours after the nightlies are created.
3) We're proposing doing the downtime from 6-9am PST, as this is the lowest checkin load and therefore least disruptive to developers. The open question is:
3a) should we trigger the nightlies earlier (say 1am PST?), so that the rsync would be handled well in advance of the downtime? OR
3b) delay triggering the nightlies until after the downtime is over (say 9am PST)? This would mean handling nightly build+l10n load at the same time as developers start usual checkin load, so (3b) feels less optimal to me. Therefore I propose we do (3a).
Any comments, thoughts before we cast this in stone?
Assignee
Comment 40•14 years ago
From the data we have so far, it looks like we need to wait until at least 11am if you want the downtime to be less than an hour; that, or trigger the nightlies that much earlier, or after we're done.
Assignee
Comment 41•14 years ago
And that's on a weekend. The weekday data we have so far (Monday) seems to imply that 4am to 1pm is off limits. Note that the times listed on that chart are when the rsync completed. It's running in a loop, so the end time of the previous pass is the start time of the one whose duration is listed there (within a few seconds).
Comment 42•14 years ago
(In reply to comment #40)
> From the data we have so far it looks like we need to wait until at least 11am
> if you want the downtime to be less than an hour, that or trigger the nightlies
> that much earlier or after we're done.
Justdave: We're totally fine with triggering nightlies earlier/later for that one day (Thursday), just to make this ftp-sync and final switchover happen. Given that, could we do this during the Thursday morning downtime?
Assignee
Comment 43•14 years ago
Whiteboard: [downtime 3 of 3 on ???) [waiting for duration to be figured out] → [downtime 3 of 3 on Thu 2/3 6am pdt]
Assignee
Comment 44•14 years ago
here's the procedure I'm planning on (a rough command-level sketch follows the list):
1) At 6:00am PDT, I put in an /etc/nologin file on surf to prevent ssh/scp/rsync-over-ssh connections from coming in.
2) The continuous rsync loop already in progress will be allowed to complete.
3) An additional loop will be allowed to complete with the entire loop happening while no one can upload. This will ensure that every last bit of the data has been copied over.
4) httpd, vsftpd, and xinetd(rsync) will be shut down on all of surf, dm-ftp01, and dm-download02, to remove all readers and allow me to unmount the partitions.
5) The NFS mounts to dm-ftp01 will be dropped from surf and dm-download02 (FREAKING HURRAY!!!!)
6) Both partitions will be unmounted from dm-ftp01
7) the new partition will be mounted in place of the dm-ftp01 NFS mount on all three servers (in the case of dm-ftp01 this is in place of the iscsi mount)
8) the old partition will be mounted on surf in a separate out-of-tree mount point to allow any last minute cleanup or retrieval of missing items we didn't catch in step 3 (really unlikely, but better safe than sorry).
9) httpd, vsftpd, and xinetd(rsync) will be re-enabled on all servers where applicable
10) /etc/nologin will be removed from surf to allow uploads again.
11) profit!
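Condensed into commands as they might look on surf (service names come from step 4; export names and several mount points below are placeholders, not the real configuration):

touch /etc/nologin                                           # step 1: block ssh/scp/rsync-over-ssh uploads
# steps 2-3: let the in-flight rsync pass finish, then run one full pass with uploads blocked
for svc in httpd vsftpd xinetd; do service $svc stop; done   # step 4 (also on dm-ftp01 and dm-download02)
umount /mnt/dm-ftp01                                         # step 5: drop the NFS mount from dm-ftp01 (placeholder path)
# step 6 happens on dm-ftp01: both partitions get unmounted there first
mount 10.253.0.11:/vol/stage /mnt/netapp/stage               # step 7: new partition in its place (mount point/export illustrative)
mount dm-ftp01:/mnt/eql/builds /mnt/eql/builds               # step 8: old partition, out of tree, for last-minute retrieval (placeholder export)
for svc in httpd vsftpd xinetd; do service $svc start; done  # step 9
rm /etc/nologin                                              # step 10: allow uploads again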
Assignee
Comment 45•14 years ago
11) permanently disable nfsd on dm-ftp01 :)
Comment 46•14 years ago
(In reply to comment #42)
> (In reply to comment #40)
> > From the data we have so far it looks like we need to wait until at least 11am
> > if you want the downtime to be less than an hour, that or trigger the nightlies
> > that much earlier or after we're done.
>
> Justdave: We're totally fine with triggering nightlies earlier/later for that
> one day (thursday), just to make this ftp-sync and final switchover happen.
> Given that, could we do this during the Thursday morning downtime?
Nightly scheduler now tweaked to fire after tonight's downtime is over. This should help reduce the amount of rsync-ing needed.
Assignee
Comment 47•14 years ago
ok, completed through step 10. Disabling nfsd on dm-ftp01 will need to wait until we decide we're done with the old mount (which is at /mnt/eql/builds on surf and dm-ftp01)
Assignee
Comment 48•14 years ago
the reverse-proxies to dm-ftp01 from dm-download02 and surf for the /pub/mozilla.org/firefox directory have been removed (serving locally off the netapp nfs mount now)
Assignee
Comment 49•14 years ago
ok, going to call this done.
Assignee: zandr → justdave
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment 50•14 years ago
This appears wrong:
[ffxbld@surf firefox]$ df -h . tinderbox-builds tryserver-builds
Filesystem Size Used Avail Use% Mounted on
10.253.0.11:/vol/stage
3.1T 1.8T 1.3T 58% /mnt/netapp/stage/archive.mozilla.org/pub/firefox
/mnt/cm-ixstore01/tinderbox-builds/tinderbox-builds
17T 807G 16T 5% /mnt/netapp/stage/archive.mozilla.org/pub/firefox/tinderbox-builds
10.253.0.11:/vol/stage
3.1T 1.8T 1.3T 58% /mnt/netapp/stage/archive.mozilla.org/pub/firefox
John O'Duinn wanted to have tinderbox-builds mounted on HA disk, with builds moved to tinderbox-builds/old (on the 16T non-HA disk) after 14-20 days. To me that means tinderbox-builds should be mounting /vol/stage, and cm-ixstore01 should be mounted on tinderbox-builds/old.
The reason for this is that cm-ixstore01 is, as I understand it, on a single head, and if we lose that, we're looking at a day of downtime and a burning tree.
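To make that concrete, a sketch of the arrangement being requested, using the paths from the df output above; the commands are an illustration only (not what was actually run), and any ro/bind options from the original setup are omitted:

# tinderbox-builds itself sits directly on the HA netapp volume (10.253.0.11:/vol/stage),
# i.e. remove the cm-ixstore01 mount currently covering it:
umount /mnt/netapp/stage/archive.mozilla.org/pub/firefox/tinderbox-builds
# ...and use the 16T non-HA cm-ixstore01 disk only for the /old archive:
mount --bind /mnt/cm-ixstore01/tinderbox-builds/tinderbox-builds \
      /mnt/netapp/stage/archive.mozilla.org/pub/firefox/tinderbox-builds/old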
Assignee
Comment 51•14 years ago
That's the way it's been since those got set up; I didn't touch those mount points (other than unmounting them around the swap of the other two). The setup of those was done in a different bug (I don't know which one; I didn't do it). I'd suggest reopening that one if you can find it, or filing a new bug.
Updated•10 years ago
Product: mozilla.org → mozilla.org Graveyard