Closed
Bug 404013
Opened 17 years ago
Closed 17 years ago
bl-bldlnx{01,03} have stopped performance testing (dhcp lease problems?)
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: reed, Assigned: justin)
Details
bl-bldlnx03.office.mozilla.org has been testing for 5+ hours. justdave says that dhcp stuff in the office got broken, so the machine probably lost its lease. It needs to be fixed so that tinderbox can actually run perf tests on our Linux builds.
I'm closing the tree until this is fixed, but as justdave says that somebody needs to physically be on the console to actually fix it, I'm only filing this as critical (instead of blocker, as it is tier 1), so that it pages oncall when somebody is actually at the office to fix it (as per bug 384966, comment #17).
Comment 1•17 years ago
I did attempt to go through the Raritan to kick it, but I can't find it on the server list (I did go through and check everything that showed it was connected, and they were all something else).
Reporter
Comment 2•17 years ago
Raising severity now that business hours have started.
Severity: critical → blocker
Assignee
Comment 3•17 years ago
I've said this before, but machines at the office *can not* be considered tier one and *can not* shut the tree. We just don't have the resources to handle outages there. If this is on a tier one list somewhere, it needs to be changed.
BTW - thanks for not paging after hours reed - appreciate it.
I'll take a look at it asap...prob in the next hour.
Assignee: server-ops → justin
Comment 4•17 years ago
bl-bldlnx01.office.mozilla.org too please
Summary: bl-bldlnx03 has been testing for 5+ hours (dhcp lease problems?) → bl-bldlnx{01,03} have stopped performance testing (dhcp lease problems?)
Assignee
Comment 5•17 years ago
rebooted bl-bldlnx03. There is no machine labeled bl-bldlnx01 in the server room, so someone from build will have to show us what machine that is.
Reporter
Comment 6•17 years ago
(In reply to comment #3)
> I've said this before, but machines at the office *can not* be considered tier
> one and *can not* shut the tree. We just don't have the resources to handle
> outages there. If this is on a tier one list somewhere, it needs to be
> changed.
So, what you are saying conflicts with what bug 384966, comment #17 says. We need to know who is right, as these should definitely be considered tier 1 machines that close the tree when broken, as we don't have any replacements for these.
Comment 7•17 years ago
(In reply to comment #5)
> rebooted bl-bldlnx03. There is no machine labeled bl-bldlnx01 in the server
> room, so someone from build will have to show us what machine that is.
I don't think anyone from build is going to be in the office anytime soon. It used to have a label on it; I believe it was set up by aravind a year or so ago.
It should be one of the IBM xservers. reed says it's right about bl-bldlnx02, if that helps :)
Comment 8•17 years ago
(In reply to comment #7)
> It should be one of the IBM xservers. reed says it's right about bl-bldlnx02,
> if that helps :)
s/about/above/
Comment 9•17 years ago
I have restarted tinderbox on bl-bldlnx03
Assignee
Comment 10•17 years ago
there is no label for 02 either - just 03 & 04. I'll dig around, but please have someone label asap. Is John not around?
Comment 11•17 years ago
(In reply to comment #10)
> there is no label for 02 either - just 03 & 04. I'll dig around, but please
> have someone label asap. Is John not around?
No, he's out for the next week.
We should go through these and relabel. I guess the labels fell off or something :P I'm positive they used to be labeled.
Do you need me to come down and do this? I'd just plug them into the KVM and identify them.
Assignee
Comment 12•17 years ago
There is no such machine named bl-bldlnx01 that I can find after looking through all the machines. I just ran through all of them and here are the host names:
bl-bldlnx02 (which was rebooted as console was dead)
bl-bldlnx03
bl-bldlnx04
bl-amotest01
bl-amotest02
bl-bldxp01
bl-bldxp02
Not sure what you guys want done here. Another key example of why these *can not* be tier one machines.
Assignee
Comment 13•17 years ago
It was nowhere near bldlnx02, but found it, rebooted and it's back up.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Comment 14•17 years ago
Confirmed that it's reporting again. Re-opening the tree.
Comment 15•17 years ago
Tinderbox restarted on bl-bldlnx01; it's reporting to the Mozilla1.8 tree.
Comment 16•17 years ago
It appears that at least some of the machines that were rebooted for this bug do not have their time of day set correctly. This is resulting in confusing tinderbox pages.
Comment 17•17 years ago
Filed bug 404275, on bl-bldxp01. Here's a handy hint about "should I file a new bug?" - if what you want to talk about is *anything* other than exactly "what this bug was originally reported about not only isn't fixed now, but was not ever fixed, despite the bug being closed" then you want a new bug, not a comment on a closed bug.
Comment 18•17 years ago
(In reply to comment #17)
> Filed bug 404275, on bl-bldxp01. Here's a handy hint about "should I file a new
> bug?" - if what you want to talk about is *anything* other than exactly "what
> this bug was originally reported about not only isn't fixed now, but was not
> ever fixed, despite the bug being closed" then you want a new bug, not a
> comment on a closed bug.
>
Well, the problem I was trying to point out, which was that these machines were all between 40 minutes and over an hour ahead, has already been fixed. Seems to me they are now as close to synced as they have ever been.
Comment 19•17 years ago
Weird, I just got the mail for that comment, and didn't notice that it was two days old.
Comment 20•17 years ago
(In reply to comment #11)
> (In reply to comment #10)
> > there is no label for 02 either - just 03 & 04. I'll dig around, but please
> > have someone label asap. Is John not around?
> No, he's out for the next week.
Correct, I'm not in the office. Traveling on vacation with intermittent connectivity. Back for Thanksgiving.
(In reply to comment #11)
> We should go through these and relabel, I guess the labels fell off or
> something :P I'm positive they used to be labeled.
(In reply to comment #13)
> It was nowhere near bldlnx02, but found it, rebooted and it's back up.
Are these machines now all labeled, or should I file a bug to do this when I'm back in the office?
(In reply to comment #3)
(In reply to comment #6)
(In reply to comment #12)
The discussion of whether these machines had Tier1 support or not came up during the summer. My recollection was that these machines were *not* Tier1 support, because they were in the office. Whatever could be done remotely, fine, but IT would not be driving into MV to reboot. Full Tier1 support would require moving these machines to the colo. Please correct me if I'm mistaken.
If people had different understandings, that's fine; we should file a separate bug to track the support discussion and, if needed, the colo move. We could also point to that bug during any tree closures.
$0.02.
Updated•11 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations