Closed
Bug 710233
Opened 13 years ago
Closed 12 years ago
Get & deploy dongles for Windows 7 slaves
Categories
(Infrastructure & Operations :: RelOps: General, task, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: armenzg, Assigned: dividehex)
References
Details
(Whiteboard: all w7 slaves have a dongle attached)
In bug 702504 we discovered that we need dongles for the Windows 7 slaves.
We currently have 80 Windows 7 slaves. 3 of them already have a dongle (ref, 036 & 053).
I am not expecting this to be deployed before the 2nd week of January as everyone's hands are full and I will be out for 2 weeks but I would like us to purchase the dongles ahead of time.
We will need a downtime (or a low-capacity session) for this, as I expect a performance hit and it needs to be done on all of them in one shot.
Meanwhile I am verifying which unit tests will fail with the dongle.
Thanks in advance.
Comment 1•13 years ago
We have 52 dongles from the old snow machines, so we'll need about 30 more. I'll take care of the purchasing. Jake, where's the best place to ship these for your soldering pleasure?
Assignee: server-ops-releng → dustin
Comment 2•13 years ago
I placed an order with digi-key for 40 solder cups, shipped to mtv1 via FedEx ground: 31759872
Comment 3•13 years ago
The adapter PID on monoprice is 4850. I'm waiting to hear back about our net-30 terms with them before I order.
Comment 4•13 years ago
Monoprice order is placed. It will be here Thursday. Couriers rock :)
Comment 5•13 years ago
I believe all of the required equipment is in place now. Jake, can you verify, and get the soldering taken care of?
Assignee: dustin → jwatkins
Assignee
Comment 6•13 years ago
I have installed the 100ohm dongles on talos-r3-w7-001 thru 050. Later today I'll pick up the solder cups (I have the adapters already) so I can start building them tonight at home.
Comment 7•13 years ago
So in light of comment 0, this shouldn't have happened -- my fault for not catching that before Jake knocked this out.
Per Bear, it seems not to have affected screen resolutions, so we'll leave them as they are until Monday, when Armen is back.
Erica has the 100Ω resistors from when she was working with Dao. She is, AFAIK, in Phoenix right now, and will be back by Monday, so hopefully we can lay hands on those shortly. If that seems unlikely, we'll just buy some more.
Comment 8•13 years ago
This is impacting tests; bug 662154 is permaorange in debug builds on Win7 slaves across all trees.
Comment 9•13 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #7)
> So in light of comment 0, this shouldn't have happened -- my fault for not
> catching that before Jake knocked this out.
>
> Per Bear, it seems not to have affected screen resolutions, so we'll leave
> them as they are until Monday, when Armen is back.
>
> Erica has the 100Ω resistors from when she was working with Dao. She is,
> AFAIK, in Phoenix right now, and will be back by Monday, so hopefully we can
> lay hands on those shortly. If that seems unlikely, we'll just buy some
> more.
(In reply to Matt Brubeck (:mbrubeck) from comment #8)
> This is impacting tests; bug 662154 is permaorange in debug builds on Win7
> slaves across all trees.
mbrubeck: The dongles were installed today by accident. I don't have the exact time, but the comment above makes me suspect sometime this morning. I'm trying to gauge how certain you are that this orange is caused by the new dongles. I know it's hard to judge, but do you have any idea whether the permaorange test you see could in any way be related to code landings that happened today, coincidentally around the same time as the dongle work?
dustin: From comment #8, it looks like waiting until Monday may not be an option. Worst case, can we just revert: pull the dongles and get those tests back to green until we are ready and know that staging machines with new dongles pass with green tests?
Comment 10•13 years ago
See, e.g., https://tbpl.mozilla.org/?tree=Mozilla-Beta&rev=01ef9195f79b for certainty - it landed on Tuesday, well before this started, but my retriggers that should have proved it was still green instead passed on 052, 058 and 062, and failed on 014, 018 and 043, the difference certainly fitting rather nicely with "above or below 50."
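The "above or below 50" split in the comment above can be sanity-checked mechanically. This is a hypothetical sketch, not part of any real tooling: the hostnames and green/orange outcomes are taken from the comment, while the parsing helper is illustrative.

```python
# Dongles went onto talos-r3-w7-001 through -050, so if they are the
# cause, failures should land only on slaves numbered <= 50.
import re

def slave_number(name):
    """Pull the trailing numeric suffix out of a slave hostname."""
    return int(re.search(r"(\d+)$", name).group(1))

# Retrigger outcomes quoted in comment 10.
retriggers = {
    "talos-r3-w7-052": "green", "talos-r3-w7-058": "green",
    "talos-r3-w7-062": "green", "talos-r3-w7-014": "orange",
    "talos-r3-w7-018": "orange", "talos-r3-w7-043": "orange",
}

# Every orange run is on a dongled slave (<= 50), every green one above.
consistent = all((slave_number(host) <= 50) == (result == "orange")
                 for host, result in retriggers.items())
# consistent is True for this data set
```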
Comment 11•13 years ago
Because Jake rocks way way above and beyond the call of duty, the exact time when they were uninstalled was 9:47 PM.
Comment 12•13 years ago
According to the meeting I just had with coop et al., this is definitely causing permaorange on all talos boxes below 50, so the dongles should be removed ASAP.
Severity: normal → critical
Comment 13•13 years ago
Obviously I've taken too many cold meds to be able to read clearly. As philor states in comment 11, the dongles were removed last night.
Severity: critical → normal
Reporter
Comment 14•13 years ago
I'm very sorry this happened.
I made comments, before I left, on the blocking bug (rather than this one) and added a dependency to it for bug 710233, which asks the gfx team to fix such perma-oranges.
I've asked again for people from gfx to take on the bug.
Assignee
Comment 15•13 years ago
12 new dongles have been soldered. Plus the ones I got from Erica. This brings us to 80 dongles w/adapters that are in scl1 and ready to be deployed. I will of course *wait* until this is unblocked and I get the green light to deploy. :-P
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #14)
> I'm very sorry this happened.
But no one is sorrier than me.
Reporter
Comment 16•13 years ago
gfx seems to have landed a fix.
I will be testing in the next day (as the release permits) that we're good to go.
Whiteboard: waiting on a test run on preproduction
Reporter
Comment 17•13 years ago
I have put talos-r3-w7-036 and talos-r3-w7-053 on my development masters taking jobs.
We should have results in the morning and can then determine whether there are any more perma-oranges.
After that we can schedule a downtime whenever it is possible.
Reporter
Comment 18•13 years ago
I re-opened bug 712630 to track down 2 newly found failures.
Re-triggering the jobs again to confirm that they are permanent failures.
Whiteboard: waiting on a test run on preproduction → waiting on 2 new perma oranges
Updated•13 years ago
Priority: -- → P1
Reporter
Comment 20•13 years ago
Poked developers in bug 712630.
Reporter
Comment 21•13 years ago
I am doing a last dry run.
We should be ready as soon as I verify the results.
What date and time could we schedule this?
We will need to ask for a downtime.
Reporter
Comment 22•13 years ago
This might be done next week. To be discussed and finalized in today's relops meeting.
It will also need to be discussed with jhford who would be buildduty next week.
I can help during EDT hours by gracefully shutting down the Windows test masters 45-60 mins ahead of the work.
dividehex says that it should not take longer than an hour to add all the dongles.
We should ask for a 2-hour downtime window and open earlier if we need to.
We will just have to re-trigger a lot of jobs as soon as the dongles are attached.
On the day prior to the downtime, I will trigger jobs in staging to verify that no new perma-oranges got introduced.
Does that make sense?
Whiteboard: waiting on 2 new perma oranges → probably: downtime to be scheduled next week with other IT work at that colo
Reporter
Comment 23•13 years ago
I have to verify it once more, as one of the two slaves with a dongle did not actually have a higher screen resolution.
Sorry for false news :(
Whiteboard: probably: downtime to be scheduled next week with other IT work at that colo → waiting on dependent bug
Comment 24•13 years ago
Armen: any update here? Before the MozCamp, I recall you saying you might have thought of a way to fix or even avoid this.
Reporter
Comment 25•13 years ago
The idea was to enforce the screen resolution. That would mean that even after attaching the dongles we would still be running in the same state, and we could then decide, at our own pace, whether to change the screen resolution and even choose which branches run at which resolution.
I need more time to test this. I am spread very thin right now. We can talk tomorrow about helping me focus on the right items.
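The enforcement idea in the comment above can be sketched roughly as follows. Everything here is hypothetical: `EXPECTED_RESOLUTION` and the helper names are illustrative, not from the actual patches, which lived in the dependent bug.

```python
# Sketch of "enforce the screen resolution": regardless of whether a
# dongle is attached, pin each slave to a known resolution before tests.
EXPECTED_RESOLUTION = (1280, 1024)  # assumed target, not stated in this bug

def needs_change(reported, expected=EXPECTED_RESOLUTION):
    """True if a slave's reported resolution differs from the target."""
    return tuple(reported) != tuple(expected)

def slaves_to_fix(reported_by_slave):
    """Given {hostname: (width, height)}, list slaves needing enforcement."""
    return sorted(host for host, res in reported_by_slave.items()
                  if needs_change(res))

reported = {
    "talos-r3-w7-036": (1600, 1200),  # dongle attached, higher resolution
    "talos-r3-w7-014": (1280, 1024),  # already at the assumed target
}
fix_list = slaves_to_fix(reported)
# fix_list == ["talos-r3-w7-036"]
```

On a real Windows slave the actual resolution change would go through the display-settings API rather than a pure comparison like this; the point of the sketch is only the decision logic.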
Reporter
Comment 26•13 years ago
I started working on this again on Friday (see dependent bug). It seems that my idea works (preliminary testing) and today I will be working on reviewing results and polishing patches.
At this moment no downtime seems necessary and we could soon be deploying the dongles.
I would prefer to deploy 10 dongles first to prove that my changes are really a no-op, and then deploy the remaining ones either that day or at another time.
Reporter
Comment 27•13 years ago
dividehex and I spoke on IRC, and I noted the plan on the dependent bug.
We have planned to deploy 5 dongles on Tuesday.
Whiteboard: waiting on dependent bug → [5 dongles to be deployed on Tuesday 6/12/12]
Reporter
Comment 28•13 years ago
The code change was deployed successfully and we can now deploy 5 dongles in production.
I am going to change the dependencies:
1) deploy 5 dongles in production
2) check that there are no regressions
3) deploy remaining dongles in production
4) check that there are no regressions
5) (bug 712630) ask developers to take care of using the try server to fix the orange
Reporter
Comment 29•13 years ago
dividehex: I have disabled these slaves so a dongle can be attached:
* talos-r3-w7-001 (staging)
* talos-r3-w7-002 (staging)
* talos-r3-w7-003 (staging)
* talos-r3-w7-004
* talos-r3-w7-005
* talos-r3-w7-006
* talos-r3-w7-007
* talos-r3-w7-008
* talos-r3-w7-010 (staging)
Would you also want to do #9 so we have a complete range? (I ask since we had only agreed yesterday to staging slaves and 5 production slaves).
Thanks!
Assignee
Comment 30•13 years ago
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #29)
> dividehex I have disabled these slaves to get a dongle unto:
> * talos-r3-w7-001 (staging)
> * talos-r3-w7-002 (staging)
> * talos-r3-w7-003 (staging)
> * talos-r3-w7-004
> * talos-r3-w7-005
> * talos-r3-w7-006
> * talos-r3-w7-007
> * talos-r3-w7-008
> * talos-r3-w7-010 (staging)
>
> Would you also want to do #9 so we have a complete range? (I ask since we
> had only agreed yesterday to staging slaves and 5 production slaves).
>
> Thanks!
That will not be a problem. I will be at SCL1 in the afternoon.
Assignee
Comment 31•13 years ago
Dongles have been deployed to:
> * talos-r3-w7-001 (staging)
> * talos-r3-w7-002 (staging)
> * talos-r3-w7-004
> * talos-r3-w7-005
> * talos-r3-w7-006
> * talos-r3-w7-007
> * talos-r3-w7-008
> * talos-r3-w7-009
> * talos-r3-w7-010 (staging)
A dongle will be deployed to talos-r3-w7-003 (staging) when it is finished being imaged.
Assignee
Comment 32•13 years ago
talos-r3-w7-003 has been reimaged and the dongle is attached.
Reporter
Comment 33•13 years ago
talos-r3-w7-0[01-10] & 36 & 56 have dongles
I have not been able to spot any perma-oranges on the production slaves with dongles.
dividehex, who can deploy all the remaining dongles, and when? Next week is fine (if possible). How long do you estimate it will take?
I believe we are talking about 67 Windows 7 32-bit production slaves.
We can schedule a small downtime if you think that will help even though it is not necessary.
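The range shorthand used throughout this bug (e.g. `talos-r3-w7-0[01-10]`) expands to individual hostnames. This illustrative helper (not part of any real tooling) shows the expansion:

```python
# Expand the slave-range shorthand used in this bug, e.g.
# "talos-r3-w7-0[01-10]" -> talos-r3-w7-001 ... talos-r3-w7-010.
import re

def expand_range(pattern):
    """Expand 'prefix[AA-BB]' shorthand into a list of hostnames."""
    m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\]", pattern)
    if not m:
        return [pattern]  # no range shorthand; return the name as-is
    prefix, lo, hi = m.group(1), m.group(2), m.group(3)
    width = len(lo)  # preserve zero-padding of the range bounds
    return [f"{prefix}{n:0{width}d}" for n in range(int(lo), int(hi) + 1)]

hosts = expand_range("talos-r3-w7-0[01-10]")
# hosts runs from "talos-r3-w7-001" to "talos-r3-w7-010", 10 names total
```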
Whiteboard: [5 dongles to be deployed on Tuesday 6/12/12] → talos-r3-w7-0[01-10] & 36 & 56 have dongles
Assignee
Comment 34•13 years ago
Armen: I can do this today or on 06/26. All of Relops and I will be in Berlin next week. If I do it today and something goes wrong, you will need to file a bug with DCops to have them remove the dongles.
It should only take about 5 mins to deploy to all talos-r3-w7's in SCL1. We should NOT schedule downtime for this if it's not needed.
Let me know if you want me to move forward today or if you want to wait until relops is back from Berlin.
Reporter
Comment 35•13 years ago
We decided to wait until Jake is back.
Updated•13 years ago
Blocks: t-r3-w764-003
Reporter
Comment 36•12 years ago
Can we deploy the dongles this week?
Which day/time can we do this?
I am on duty so I can prepare the slaves in advance.
Assignee
Comment 37•12 years ago
Armenzg: does Wed 6/27 in the afternoon work for you?
Reporter
Comment 38•12 years ago
Sounds good! (updating white board)
Whiteboard: talos-r3-w7-0[01-10] & 36 & 56 have dongles → deploying on Wed 6/27 PDT afternoon - talos-r3-w7-0[01-10] & 36 & 56 have dongles
Reporter
Comment 39•12 years ago
I have set aside a first batch:
talos-r3-w7-0[11-50]
They should be ready in 50 minutes from this comment.
These minis will have a dongle:
talos-r3-w7-0{36,56}
These minis *might* have a dongle:
talos-r3-w7-024
talos-r3-w7-044
We will do the remaining batch after this first one.
Reporter
Comment 40•12 years ago
dividehex did those machines.
I rebooted them and will check on them before we go for the 2nd set.
We're now disabling talos-r3-w7-051 to talos-r3-w7-079.
Those machines will be ready to get a dongle attached in 50 minutes from now.
Reporter
Comment 41•12 years ago
dividehex: deployed except 55,67,70-79
Reporter
Comment 42•12 years ago
talos-r3-w7-0{55,70,71,72,73,75,76,77,78} got done as well.
Just waiting on 67, 74 & 79.
Reporter
Comment 43•12 years ago
All done.
I have been checking all day and there's not been anything obvious broken.
Tomorrow we should have more data points and be sure that nothing went wrong.
Thanks Jake.
Whiteboard: deploying on Wed 6/27 PDT afternoon - talos-r3-w7-0[01-10] & 36 & 56 have dongles → all w7 slaves have a dongle attached
Reporter
Updated•12 years ago
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated•11 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations