Closed Bug 752185 (talos-r4-snow-041) Opened 12 years ago Closed 11 years ago

talos-r4-snow-041 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Unassigned)

Details

(Whiteboard: [buildduty][badslave?][decommission?])

Needs a reboot.
This error was being repeated on the screen:

"disk0s2: media is not present."

The system came back online after a reboot.
Back in production.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
I think this slave has RAM issues or something:
Bug 784328
Bug 784323
Bug 781820
Bug 781816
Bug 781150
Bug 781148
Bug 780889

I've been merrily filing bugs for these new crashes - but I'm starting to suspect the slave now, since 7 out of the 21 new crashes in the last two weeks have come from that slave alone - and those failures (above) have not been seen on any other machine.

Please can we take this machine out of production and run a memory diag or something? :-)
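
The reasoning above (7 of the 21 new crashes coming from one slave) can be sketched as a quick per-slave share check. This is only an illustrative sketch: the function names, the 25% threshold, and the `hypothetical-slave-NN` names for the other 14 crashes are assumptions, not data from this bug or part of any buildduty tooling.

```python
from collections import Counter


def crash_share_by_slave(crash_slaves):
    """Return each slave's fraction of the total crash count."""
    counts = Counter(crash_slaves)
    total = sum(counts.values())
    return {slave: n / total for slave, n in counts.items()}


def suspects(crash_slaves, threshold=0.25):
    """List slaves whose share of crashes meets the (assumed) threshold."""
    shares = crash_share_by_slave(crash_slaves)
    return sorted(s for s, share in shares.items() if share >= threshold)


# 7 of the 21 crashes are attributed to talos-r4-snow-041 (from the comment
# above). The other 14 are spread over hypothetical slave names, one each,
# purely for illustration.
crashes = ["talos-r4-snow-041"] * 7 + [
    f"hypothetical-slave-{i:02d}" for i in range(14)
]

print(suspects(crashes))
```

With one third of all crashes on a single machine, that machine is the only one flagged under these assumptions; a real check would also need to account for how many jobs each slave ran.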
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Disabled in slavealloc.
Depends on: 786006
Hardware diagnostics were run twice with no errors.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Depends on: 794926
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 787281
Whiteboard: [buildduty][buildslaves][capacity] → [buildduty][badslave?][decommission?]
Looks like we didn't try to re-image this machine yet. Let's do that, and then kill it with fire if it doesn't work.
No longer depends on: 787281
Depends on: 807010
Running tests fine again in production.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
The problem is, other than the now-burned edmorley, the rest of us aren't likely to file on a random-looking crash like https://tbpl.mozilla.org/php/getParsedLog.php?id=16767443&tree=Mozilla-Inbound, we'll just blow it off.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
https://tbpl.mozilla.org/php/getParsedLog.php?id=16792815&tree=Fx-Team (a talos run with a strange and sudden "process killed by signal 11")
https://tbpl.mozilla.org/php/getParsedLog.php?id=16800234&tree=Mozilla-Inbound with exactly the sort of "this is the only slave that ever has or ever will hit it" memory corruption GC crash that started the whole "pull busted slaves and run diagnostics on them" thing.
Disabled in slavealloc.
Setup needed.
Well, after this machine sat idle for nearly two months I've done post-imaging setup and it's back in production.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
https://tbpl.mozilla.org/php/getParsedLog.php?id=18649267&tree=Mozilla-Inbound shows it crashing with a minidump so malformed that minidump_stackwalk just stared at it saying "wtf is that? is that an AMD64 crash? wtf?"
https://tbpl.mozilla.org/php/getParsedLog.php?id=18654576&tree=Mozilla-Aurora and
https://tbpl.mozilla.org/php/getParsedLog.php?id=18652897&tree=Mozilla-Inbound are exactly the sort of... oh, I already typed that in comment 11.

Please apply the comment 7 solution and decomm it - it is clearly and unquestionably busted, and we apparently lack sufficient diagnostics to even guess what parts to start replacing.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 829014
(In reply to Phil Ringnalda (:philor) from comment #16)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=18654576&tree=Mozilla-Aurora and
> https://tbpl.mozilla.org/php/getParsedLog.php?id=18652897&tree=Mozilla-Inbound
> are exactly the sort of... oh, I already typed that in comment 11.
> 
> Please apply the comment 7 solution and decomm it - it is clearly and
> unquestionably busted, and we apparently lack sufficient diagnostics to even
> guess what parts to start replacing.

We've been looking at repairing the r4 machines lately, so we'll give that a try first.
Diagnostics found corrupt files; the machine was reimaged and brought back to life just now.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
And we're seeing the same failures on it that we were seeing before in bug 781816.
https://tbpl.mozilla.org/?tree=Mozilla-Aurora
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Disabled in slavealloc.
Diagnostics do not help.

Change memory, re-image (bug 864979) and try again.

If that does not work, decommission.
Depends on: 864979
Already puppetized.
Updated the old password + autologin.
Back in the pool, for good or ill.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Yeah, ill. It did one try reftest job, which finished with an exception, and has since "done" 393 jobs, all saying "device not configured" the first time they try to write anything to disk, burning talos and setting retry on everything else.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Rebooted and disabled in slavealloc to stop the rot. Decommission time?
Depends on: 885875
Sent for decomm in bug 885875.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard