Closed Bug 824754 (t-snow-r4-0044) Opened 12 years ago Closed 11 years ago

t-snow-r4-0044 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: hwine, Unassigned)

References

Details

(Whiteboard: [buildduty][buildslaves][capacity][badslave])

Attachments

(2 files, 1 obsolete file)

talos-r4-snow-046 is reported as "seeing a rash of mysterious crashes in debug tests" - see bug 824498 for details

Please run diagnostics to see if there are any hardware issues, and resolve as needed.

Regardless of hardware issues, please reimage host before returning to releng.
Summary: talos-r4-snow-046 → talos-r4-snow-046 showing inexplicable crashes, hardware suspected
Depends on: 824755
colo-trip: --- → scl1
I did a regular reboot on this host and it came up with no problems. I didn't read the diagnostics part; I'll get to that on Monday.
Depends on: 825350
Depends on: 825648
Assignee: server-ops-dcops → nobody
Component: Server Operations: DCOps → Release Engineering: Machine Management
QA Contact: dmoore → armenzg
Summary: talos-r4-snow-046 showing inexplicable crashes, hardware suspected → talos-r4-snow-046 problem tracking
Depends on: 829293
This is now re-enabled in prod
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
https://tbpl.mozilla.org/php/getParsedLog.php?id=19457159&tree=Mozilla-Inbound is exactly the same sort of mysterious crash as bug 824498
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [buildduty][buildslaves][capacity] → [buildduty][buildslaves][capacity][badslave]
disabled in slavealloc again.

coop ideas on what our next step is?
Flags: needinfo?(coop)
(In reply to Justin Wood (:Callek) from comment #4)
> disabled in slavealloc again.
> 
> coop ideas on what our next step is?

This is the point where we usually need to replace the logic board. Please open an IT bug with dcops to start that process. Bonus points if you can batch it with other slaves that need the same attention.
Flags: needinfo?(coop)
Depends on: 838893
Slave has been repaired, reimaged, and is back in service.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
See also bug 824498 - it looks like we're still getting intermittent crashes on this machine.
Also, these are probably dupes of this bug: bug 845294, bug 847108, bug 847196.
I'm moving this slave to staging permanently and will mark it as such in slavealloc.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Swapping 1-for-1 slaves between staging and production. See previous comments in this bug for reasons why snow-046 is unsuitable for production.
Assignee: nobody → coop
Status: RESOLVED → REOPENED
Attachment #720729 - Flags: review?(armenzg)
Resolution: FIXED → ---
Attachment #720729 - Flags: review?(armenzg) → review+
Can you please check whether it exists in the graphs production DB?
Comment on attachment 720729 [details] [diff] [review]
Move snow-010 to production and snow-046 to staging

Review of attachment 720729 [details] [diff] [review]:
-----------------------------------------------------------------

https://hg.mozilla.org/build/buildbot-configs/rev/8f4dfb408cd0
Attachment #720729 - Flags: checked-in+
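
(For context, a minimal, hypothetical sketch of the kind of change such a pool-swap patch makes, assuming the configs keep per-pool lists of slave hostnames. The variable names below are illustrative, not the actual buildbot-configs layout.)

# Hypothetical illustration only: a 1-for-1 swap is just moving hostnames
# between two lists so overall production capacity stays the same.

PRODUCTION_SLAVES = [
    "talos-r4-snow-010",    # promoted from staging in this swap
    # "talos-r4-snow-046",  # removed: unsuitable for production (suspected bad RAM)
]

STAGING_SLAVES = [
    "talos-r4-snow-046",    # kept in staging for further diagnosis
    # "talos-r4-snow-010",  # removed: moved to production
]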
I've rebooted both slaves into their own respective pools.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Merged and reconfiguration completed.
Product: mozilla.org → Release Engineering
Not responding to PDU reboots.
Status: RESOLVED → REOPENED
Depends on: 912545
Resolution: FIXED → ---
Depends on: 912664
Assignee: coop → nobody
Back in production after an HD replacement.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Attached patch add snow-046 back to the production pool (obsolete) (deleted) — Splinter Review
Disabled in slavealloc in the meantime.
Attachment #814148 - Flags: review?(bhearsum)
Attachment #814148 - Flags: review?(bhearsum) → review+
    test_stag_not_in_prod ...                                            [FAIL]

===============================================================================
[FAIL]: test_slave_allocation.SlaveCheck.test_stag_not_in_prod

Traceback (most recent call last):
  File "test/test_slave_allocation.py", line 33, in test_stag_not_in_prod
    'declared as staging-only:\n%s' % '\n'.join(sorted(common_slaves))
twisted.trial.unittest.FailTest: Staging-only slaves should not be declared as production and vice versa. However, the following production slaves declared as staging-only:
talos-r4-snow-046
not equal:
a = set()
b = set(['talos-r4-snow-046'])
Backed it out; here's the corrected patch, after testing it with test-masters.sh.
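
(The check that failed above boils down to verifying that the production and staging pools are disjoint. A minimal sketch of that idea, with made-up hostname lists rather than the actual test code, which builds its pool lists from the buildbot master configs:)

# Sketch of the idea behind test_stag_not_in_prod; the hostname lists
# here are made up for illustration.

def pool_overlap(production, staging):
    """Return slaves declared in both pools (should be an empty set)."""
    return set(production) & set(staging)

overlap = pool_overlap(
    production=["talos-r4-snow-010", "talos-r4-snow-046"],
    staging=["talos-r4-snow-046"],
)
print(sorted(overlap))  # ['talos-r4-snow-046'] -- the failure seen above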
Attachment #814148 - Attachment is obsolete: true
Attachment #814166 - Flags: review?(bhearsum)
Attachment #814166 - Flags: review?(bhearsum) → review+
Comment on attachment 814166 [details] [diff] [review]
add snow-046 back to the production pool and remove it from staging

https://hg.mozilla.org/build/buildbot-configs/rev/1ec4a515adcf
Attachment #814166 - Flags: checked-in+
Assignee: nobody → armenzg
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
in production
Rebooted the machine into production and enabled it on slavealloc.
Assignee: armenzg → nobody
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
I'll want to be able to find the GC crash in https://tbpl.mozilla.org/php/getParsedLog.php?id=28895921&tree=Try again.
What do you mean? Is the slave doing something unexpected?
I don't know much about "finding the GC crash".
Garbage collection and cycle collection apparently do a good job of exercising RAM, and so typically a machine with bad RAM (or a CPU that's bad about talking to its RAM, or a single trace with a hairline crack in it between the CPU and the RAM, or whatever it may really be) will, along with hitting PPoD failures in reftests, hit a lot of GC crashes.

That's what this slave did, and the reason we had multiple bugs filed about crashes in tests that only happened on this slave, and that's the reason it was in staging rather than production with a slavealloc note saying not to put it in production.

Disabled in slavealloc, please do not bring it back to production without diagnosing what's actually wrong with the memory, fixing it, and running at least a hundred test runs in staging without a single unexplained unexpected not-seen-on-other-slaves crash.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
philor: any ideas on how we could word this for developers so they could try to find an easy test case?

(In reply to Phil Ringnalda (:philor) from comment #32)
> Garbage collection and cycle collection apparently do a good job of
> exercising RAM, and so typically a machine with bad RAM (or a CPU that's bad
> about talking to its RAM, or a single trace with a hairline crack in it
> between the CPU and the RAM, or whatever it may really be) will, along with
> hitting PPoD failures in reftests, hit a lot of GC crashes.
> 
> That's what this slave did, and the reason we had multiple bugs filed about
> crashes in tests that only happened on this slave, and that's the reason it
> was in staging rather than production with a slavealloc note saying not to
> put it in production.
> 
> Disabled in slavealloc, please do not bring it back to production without
> diagnosing what's actually wrong with the memory, fixing it, and running at
> least a hundred test runs in staging without a single unexplained unexpected
> not-seen-on-other-slaves crash.
Sorry, that was casual phrasing on my part. I have no reason to believe that GC/CC are *better* at detecting bad RAM than memtest86 (which has been under development for almost 20 years, focusing on just that one task), they are simply the most likely part of our tests to wind up crashing. Running tests for 24 hours with me looking at the results will indeed show intermittent memory failures, along with lots of noise which is not from intermittent memory failures, but it's not a better way.

I think it would be far better to focus first on running memtest86 on this slave once, since any earlier diagnostics probably predate our switch from Apple's memory diagnostics to memtest86, and second on having a plan to run it long enough to detect intermittent failures that won't show up in a single quick run.
Thanks philor :)
Depends on: 933886
Memory analysis requested in bug 933886.
2013-01-03 - "Ran hardware diagnostic three times but did not find any issues.  All hardware passed."
2013-01-15 - "Host has been reimaged."
2013-02-05 - "mysterious crashes"
2013-02-15 - logic board replaced
2013-02-17 - same issues
2013-10-02 - back from HD replacement
2013-10-09 - reported more issues
2013-11-11 - "After running memtest86+ multiple times", memtest did not find any issues

I can only see us replacing the memory and giving it one more shot. After that we should decommission it.
Depends on: 937656
RAM has been replaced.
Putting back into production.
I will check tomorrow.
Assignee: nobody → armenzg
It looks good.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Assignee: armenzg → nobody
QA Contact: armenzg → bugspam.Callek
Alias: talos-r4-snow-046 → t-snow-r4-0044
Summary: talos-r4-snow-046 problem tracking → t-snow-r4-0044 problem tracking
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard