Closed Bug 824754 (t-snow-r4-0044) Opened 12 years ago Closed 11 years ago

t-snow-r4-0044 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: hwine, Unassigned)

References

Details

(Whiteboard: [buildduty][buildslaves][capacity][badslave])

Attachments

(2 files, 1 obsolete file)

talos-r4-snow-046 is reported as "seeing a rash of mysterious crashes in debug tests" - see bug 824498 for details

Please run diagnostics to see if there are any hardware issues, and resolve as needed.

Regardless of hardware issues, please reimage host before returning to releng.
Summary: talos-r4-snow-046 → talos-r4-snow-046 showing inexplicable crashes, hardware suspected
Depends on: 824755
colo-trip: --- → scl1
I did a regular reboot on this host and it came up with no problems. I didn't read the diagnostics part; I'll get to that on Monday.
Depends on: 825350
Depends on: 825648
Assignee: server-ops-dcops → nobody
Component: Server Operations: DCOps → Release Engineering: Machine Management
QA Contact: dmoore → armenzg
Summary: talos-r4-snow-046 showing inexplicable crashes, hardware suspected → talos-r4-snow-046 problem tracking
Depends on: 829293
This is now re-enabled in prod
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
https://tbpl.mozilla.org/php/getParsedLog.php?id=19457159&tree=Mozilla-Inbound is exactly the same sort of mysterious crash as bug 824498
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [buildduty][buildslaves][capacity] → [buildduty][buildslaves][capacity][badslave]
disabled in slavealloc again.

coop ideas on what our next step is?
Flags: needinfo?(coop)
(In reply to Justin Wood (:Callek) from comment #4)
> disabled in slavealloc again.
> 
> coop ideas on what our next step is?

This is the point where we usually need to replace the logic board. Please open an IT bug with dcops to start that process. Bonus points if you can batch it with other slaves that need the same attention.
Flags: needinfo?(coop)
Depends on: 838893
Slave has been repaired, reimaged, and is back in service.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
See also bug 824498 - it looks like we're still getting intermittent crashes on this machine.
Also, these are probably dupes of this bug: bug 845294, bug 847108, bug 847196.
I'm moving this slave to staging permanently and will mark it as such in slavealloc.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Swapping 1-for-1 slaves between staging and production. See previous comments in this bug for reasons why snow-046 is unsuitable for production.
Assignee: nobody → coop
Status: RESOLVED → REOPENED
Attachment #720729 - Flags: review?(armenzg)
Resolution: FIXED → ---
Attachment #720729 - Flags: review?(armenzg) → review+
Can you please check whether it exists in the graphs production DB?
Comment on attachment 720729 [details] [diff] [review]
Move snow-010 to production and snow-046 to staging

Review of attachment 720729 [details] [diff] [review]:
-----------------------------------------------------------------

https://hg.mozilla.org/build/buildbot-configs/rev/8f4dfb408cd0
Attachment #720729 - Flags: checked-in+
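
(For context, a minimal, hypothetical sketch of the kind of change such a pool-swap patch makes, assuming the configs keep per-pool lists of slave hostnames. The variable names below are illustrative, not the actual buildbot-configs layout.)

# Hypothetical illustration only: a 1-for-1 swap is just moving hostnames
# between two lists so overall production capacity stays the same.

PRODUCTION_SLAVES = [
    "talos-r4-snow-010",    # promoted from staging in this swap
    # "talos-r4-snow-046",  # removed: unsuitable for production (suspected bad RAM)
]

STAGING_SLAVES = [
    "talos-r4-snow-046",    # kept in staging for further diagnosis
    # "talos-r4-snow-010",  # removed: moved to production
]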
I've rebooted both slaves into their own respective pools.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Merged and reconfiguration completed.
Product: mozilla.org → Release Engineering
Not responding to PDU reboots.
Status: RESOLVED → REOPENED
Depends on: 912545
Resolution: FIXED → ---
Depends on: 912664
Assignee: coop → nobody
Back in production after an HD replacement.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Attached patch add snow-046 back to the production pool (obsolete) (deleted) — Splinter Review
Disabled in slavealloc in the meantime.
Attachment #814148 - Flags: review?(bhearsum)
Attachment #814148 - Flags: review?(bhearsum) → review+
    test_stag_not_in_prod ...                                            [FAIL]

===============================================================================
[FAIL]: test_slave_allocation.SlaveCheck.test_stag_not_in_prod

Traceback (most recent call last):
  File "test/test_slave_allocation.py", line 33, in test_stag_not_in_prod
    'declared as staging-only:\n%s' % '\n'.join(sorted(common_slaves))
twisted.trial.unittest.FailTest: Staging-only slaves should not be declared as production and vice versa. However, the following production slaves declared as staging-only:
talos-r4-snow-046
not equal:
a = set()
b = set(['talos-r4-snow-046'])
Backed it out; here's the corrected patch, after testing it with test-masters.sh.
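
(The check that failed above boils down to verifying that the production and staging pools are disjoint. A minimal sketch of that idea, with made-up hostname lists rather than the actual test code, which builds its pool lists from the buildbot master configs:)

# Sketch of the idea behind test_stag_not_in_prod; the hostname lists
# here are made up for illustration.

def pool_overlap(production, staging):
    """Return slaves declared in both pools (should be an empty set)."""
    return set(production) & set(staging)

overlap = pool_overlap(
    production=["talos-r4-snow-010", "talos-r4-snow-046"],
    staging=["talos-r4-snow-046"],
)
print(sorted(overlap))  # ['talos-r4-snow-046'] -- the failure seen above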
Attachment #814148 - Attachment is obsolete: true
Attachment #814166 - Flags: review?(bhearsum)
Attachment #814166 - Flags: review?(bhearsum) → review+
Comment on attachment 814166 [details] [diff] [review]
add snow-046 back to the production pool and remove it from staging

https://hg.mozilla.org/build/buildbot-configs/rev/1ec4a515adcf
Attachment #814166 - Flags: checked-in+
Assignee: nobody → armenzg
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
in production
Rebooted the machine into production and enabled it on slavealloc.
Assignee: armenzg → nobody
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
I'll want to be able to find the GC crash in https://tbpl.mozilla.org/php/getParsedLog.php?id=28895921&tree=Try again.
What do you mean? Is the slave doing something unexpected?
I don't know much about "finding the GC crash".
Garbage collection and cycle collection apparently do a good job of exercising RAM, and so typically a machine with bad RAM (or a CPU that's bad about talking to its RAM, or a single trace with a hairline crack in it between the CPU and the RAM, or whatever it may really be) will, along with hitting PPoD failures in reftests, hit a lot of GC crashes.

That's what this slave did, and the reason we had multiple bugs filed about crashes in tests that only happened on this slave, and that's the reason it was in staging rather than production with a slavealloc note saying not to put it in production.

Disabled in slavealloc, please do not bring it back to production without diagnosing what's actually wrong with the memory, fixing it, and running at least a hundred test runs in staging without a single unexplained unexpected not-seen-on-other-slaves crash.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
philor: any ideas on how we could word this for developers so they could try to find an easy test case?

(In reply to Phil Ringnalda (:philor) from comment #32)
> Garbage collection and cycle collection apparently do a good job of
> exercising RAM, and so typically a machine with bad RAM (or a CPU that's bad
> about talking to its RAM, or a single trace with a hairline crack in it
> between the CPU and the RAM, or whatever it may really be) will, along with
> hitting PPoD failures in reftests, hit a lot of GC crashes.
> 
> That's what this slave did, and the reason we had multiple bugs filed about
> crashes in tests that only happened on this slave, and that's the reason it
> was in staging rather than production with a slavealloc note saying not to
> put it in production.
> 
> Disabled in slavealloc, please do not bring it back to production without
> diagnosing what's actually wrong with the memory, fixing it, and running at
> least a hundred test runs in staging without a single unexplained unexpected
> not-seen-on-other-slaves crash.
Sorry, that was casual phrasing on my part. I have no reason to believe that GC/CC are *better* at detecting bad RAM than memtest86 (which has been under development for almost 20 years, focusing on just that one task), they are simply the most likely part of our tests to wind up crashing. Running tests for 24 hours with me looking at the results will indeed show intermittent memory failures, along with lots of noise which is not from intermittent memory failures, but it's not a better way.

I think it would be far better to focus first on running memtest86 on this slave once, since any earlier diagnostics probably predate our switch from Apple's memory diagnostics to memtest86, and second on having a plan to run it long enough to detect intermittent failures that won't show up in a single quick run.
Thanks philor :)
Depends on: 933886
Memory analysis requested in bug 933886.
2013-01-03 - "Ran hardware diagnostic three times but did not find any issues.  All hardware passed."
2013-01-15 - "Host has been reimaged."
2013-02-05 - "mysterious crashes"
2013-02-15 - logic board replaced
2013-02-17 - same issues
2013-10-02 - back from HD replacement
2013-10-09 - reported more issues
2013-11-11 - "After running memtest86+ multiple times", memtest did not find any issues

I can only see us replacing the memory and giving it one more shot. After that we should decommission it.
Depends on: 937656
RAM has been replaced.
Putting back into production.
I will check tomorrow.
Assignee: nobody → armenzg
It looks good.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Assignee: armenzg → nobody
QA Contact: armenzg → bugspam.Callek
Alias: talos-r4-snow-046 → t-snow-r4-0044
Summary: talos-r4-snow-046 problem tracking → t-snow-r4-0044 problem tracking
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard