Closed Bug 404013 Opened 17 years ago Closed 17 years ago

bl-bldlnx{01,03} have stopped performance testing (dhcp lease problems?)

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: reed, Assigned: justin)

Details

bl-bldlnx03.office.mozilla.org has been testing for 5+ hours. justdave says that dhcp stuff in the office got broken, so the machine probably lost its lease. It needs to be fixed so that tinderbox can actually run perf tests on our Linux builds. I'm closing the tree until this is fixed, but as justdave says that somebody needs to physically be on the console to actually fix it, I'm only filing this as critical (instead of blocker, as it is tier 1), so that it pages oncall when somebody is actually at the office to fix it (as per bug 384966, comment #17).
I did attempt to go through the Raritan to kick it, but I can't find it on the server list (I went through and checked everything that showed as connected, and they were all something else).
Raising severity now that business hours have started.
Severity: critical → blocker
I've said this before, but machines at the office *can not* be considered tier one and *can not* shut the tree. We just don't have the resources to handle outages there. If this is on a tier one list somewhere, it needs to be changed. BTW - thanks for not paging after hours reed - appreciate it. I'll take a look at it asap...prob in the next hour.
Assignee: server-ops → justin
bl-bldlnx01.office.mozilla.org too please
Summary: bl-bldlnx03 has been testing for 5+ hours (dhcp lease problems?) → bl-bldlnx{01,03} have stopped performance testing (dhcp lease problems?)
rebooted bl-bldlnx03. There is no machine labeled bl-bldlnx01 in the server room, so someone from build will have to show us what machine that is.
(In reply to comment #3)
> I've said this before, but machines at the office *can not* be considered tier one and *can not* shut the tree. We just don't have the resources to handle outages there. If this is on a tier one list somewhere, it needs to be changed.
So, what you are saying conflicts with what bug 384966, comment #17 says. We need to know who is right, as these should definitely be considered tier 1 machines that close the tree when broken, as we don't have any replacements for these.
(In reply to comment #5)
> rebooted bl-bldlnx03. There is no machine labeled bl-bldlnx01 in the server room, so someone from build will have to show us what machine that is.
I don't think anyone from build is going to be in the office anytime soon. It used to have a label on it, I believe it was set up by aravind a year ago or so. It should be one of the IBM xservers. reed says it's right about bl-bldlnx02, if that helps :)
(In reply to comment #7)
> It should be one of the IBM xservers. reed says it's right about bl-bldlnx02, if that helps :)
s/about/above/
I have restarted tinderbox on bl-bldlnx03
there is no label for 02 either - just 03 & 04. I'll dig around, but please have someone label asap. Is John not around?
(In reply to comment #10)
> there is no label for 02 either - just 03 & 04. I'll dig around, but please have someone label asap. Is John not around?
No he's out for the next week. We should go through these and relabel, I guess the labels fell off or something :P I'm positive they used to be labeled. Do you need me to come down and do this? I'd just plug them into the KVM and identify them.
There is no such machine named bl-bldlnx01 that I can find after looking through all the machines. I just ran through all of them and here are the host names:
bl-bldlnx02 (which was rebooted as console was dead)
bl-bldlnx03
bl-bldlnx04
bl-amotest01
bl-amotest02
bl-bldxp01
bl-bldxp02
Not sure what you guys want done here. Another key example of why these *can not* be tier one machines.
it was nowhere near bldlnx02, but found it, rebooted and it's back up.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Confirmed that it's reporting again. Re-opening the tree.
Tinderbox restarted on bl-bldlnx01, it's reporting to Mozilla1.8 tree.
It appears that at least some of the machines that were rebooted for this bug do not have their time of day set correctly. This is resulting in confusing tinderbox pages.
Filed bug 404275, on bl-bldxp01. Here's a handy hint about "should I file a new bug?" - if what you want to talk about is *anything* other than exactly "what this bug was originally reported about not only isn't fixed now, but was not ever fixed, despite the bug being closed" then you want a new bug, not a comment on a closed bug.
(In reply to comment #17)
> Filed bug 404275, on bl-bldxp01. Here's a handy hint about "should I file a new bug?" - if what you want to talk about is *anything* other than exactly "what this bug was originally reported about not only isn't fixed now, but was not ever fixed, despite the bug being closed" then you want a new bug, not a comment on a closed bug.
Well, the problem I was trying to point out, which was that these machines were all between 40 minutes and over an hour ahead, has already been fixed. Seems to me they are now as close to synced as they have ever been.
Weird, I just got the mail for that comment, and didn't notice that it was two days old.
(In reply to comment #11)
> (In reply to comment #10)
> > there is no label for 02 either - just 03 & 04. I'll dig around, but please have someone label asap. Is John not around?
> No he's out for the next week.
Correct, I'm not in the office. Traveling on vacation with intermittent connectivity. Back for Thanksgiving.
(In reply to comment #11)
> We should go through these and relabel, I guess the labels fell off or something :P I'm positive they used to be labeled.
(In reply to comment #13)
> it was nowhere near bldlnx02, but found it, rebooted and it's back up.
Are these machines now all labeled, or should I file a bug to do this when I'm back in the office?
(In reply to comment #3)
(In reply to comment #6)
(In reply to comment #12)
The discussion of whether these machines had Tier1 support or not came up during the summer. My recollection was that these machines were *not* Tier1 support, because they were in the office. Whatever could be done remotely, fine, but IT would not be driving into MV to reboot. Full Tier1 support would require moving these machines to the colo. Please correct me if I'm mistaken. If people had different understandings, that's fine, we should file a separate bug to track the support discussion, and if needed, the colo move. We could also point to the bug during any tree closures, if needed. $0.02.
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations