Closed Bug 807646 (tegra-093) Opened 12 years ago Closed 11 years ago

tegra-093 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Hardware: x86_64
OS: Linux
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Unassigned)

References


Details

(Whiteboard: [buildduty])

Ping is failing. Trying a pdu reboot on it.
WTF. The dashboard claimed the ping check was failing and that it was down, but it was actually running a job when I PDU rebooted it. This tegra seems fine, modulo the purple build as a result of me interrupting the job.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
(In reply to Ben Hearsum [:bhearsum] from comment #1)
> WTF. The dashboard claimed the ping check was failing and that it was down,
> but it was actually running a job when I PDU rebooted it. This tegra seems
> fine, modulo the purple build as a result of me interrupting the job.

Just to comment somewhere for now, I'll use here: the dashboard at present is a *snapshot in time* with a relatively lengthy delay. It takes a tegra's status at a given moment (up/down, whether it has an error.flg, etc.) and reports that. The dashboard then generates its table from those snapshots, which takes a few minutes, and the table gets rsync'd to the server that serves the website, so there is a further delay on top of that. Meanwhile, tegras running real jobs frequently go up/down while installing/uninstalling things, so it's possible to see a DOWN at any given moment. To judge whether a DOWN in the dashboard actually means trouble, look at the "bar" at the right of the row, which shows whether the device has been offline for a handful of checks in a row.
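For anyone reading along, the "handful of checks in a row" idea is basically a consecutive-failure debounce. A minimal sketch of that logic in Python, purely illustrative (the threshold, helper names, and ping flags here are assumptions, not the actual dashboard code):

# Illustrative sketch only -- not the real dashboard code.
import subprocess

CONSECUTIVE_FAILURES_TO_ALERT = 3   # made-up threshold for "a handful of checks"
failure_counts = {}

def ping_ok(device):
    # One ping with a 2 second timeout; True if the device answered (Linux ping flags).
    return subprocess.call(["ping", "-c", "1", "-W", "2", device]) == 0

def record_check(device):
    """Update the consecutive-failure count and classify this snapshot."""
    if ping_ok(device):
        failure_counts[device] = 0
        return "up"
    failure_counts[device] = failure_counts.get(device, 0) + 1
    if failure_counts[device] >= CONSECUTIVE_FAILURES_TO_ALERT:
        return "down"    # offline for several checks in a row -- worth acting on
    return "blip"        # a single DOWN snapshot, e.g. a device rebooting mid-job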
SUTAgent not present; time for recovery
Status: RESOLVED → REOPENED
Depends on: 807965
Resolution: FIXED → ---
The thing that confuses me about all that is that there's no way to reconcile any of that with https://secure.pub.build.mozilla.org/buildapi/recent/tegra-093?numbuilds=100 - it did not do a purple build on the morning of the 1st; it set retry on one at 04:58, but at 07:12 it was happily doing a run it had started at 07:09 and finished, green, at 07:42. At 06:26 on the 2nd, when it was "time for recovery", it had finished a green job 24 minutes before, and was 6 minutes away from taking another and doing it green.

Then it is rumored to have been recovered sometime before 16:47 on the 2nd, which could conceivably have happened between 11:49 and 12:25, because that's the only time when it went more than 5 minutes between jobs. That's also the dividing line between when it went from being a pretty darn good tegra to being a pretty darn broken one: 25 builds since then, 52% green; the 25 builds before then were 84% green.

My best wild and ridiculous guess is that there are two slaves that both think they are named tegra-093.
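The 84%-vs-52% comparison and the "more than 5 minutes between jobs" check are easy to reproduce from a build list. A rough Python sketch, using a made-up (start, end, result) tuple shape rather than buildapi's actual JSON schema:

# Rough sketch for eyeballing green rates and idle gaps on a device.
# The (start, end, result) shape is assumed here, not buildapi's real schema.
from datetime import timedelta

def green_rate(builds):
    """builds: list of (start, end, result) with datetime values and a result string."""
    if not builds:
        return 0.0
    green = sum(1 for _, _, result in builds if result == "success")
    return 100.0 * green / len(builds)

def long_gaps(builds, threshold=timedelta(minutes=5)):
    """Return (end-of-job, start-of-next-job) pairs with more than `threshold` idle time."""
    ordered = sorted(builds, key=lambda b: b[0])
    gaps = []
    for (_, prev_end, _), (next_start, _, _) in zip(ordered, ordered[1:]):
        if next_start - prev_end > threshold:
            gaps.append((prev_end, next_start))
    return gaps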
Depends on: 808437
Ran ./stop_cp.sh
Blocks: 808468
Blocks: 813012
No longer blocks: 813012
Brought back to life
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Actually needed cpr, sending over to recovery for a reimage.
Status: RESOLVED → REOPENED
Depends on: 817995
Resolution: FIXED → ---
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
No jobs taken on this device for >= 7 weeks
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(mass change: filter on tegraCallek02reboot2013)

I just rebooted this device, hoping that many of the ones I'm doing tonight come back automatically. I'll check back in tomorrow to see if it did; if it does not, I'll triage the next step manually on a per-device basis.

---

Command I used (with a manual patch to the fabric script to allow this command):

(fabric)[jwood@dev-master01 fabric]$ python manage_foopies.py -j15 -f devices.json `for i in 021 032 036 039 046 048 061 064 066 067 071 074 079 081 082 083 084 088 093 104 106 108 115 116 118 129 152 154 164 168 169 174 179 182 184 187 189 200 207 217 223 228 234 248 255 264 270 277 285 290 294 295 297 298 300 302 304 305 306 307 308 309 310 311 312 314 315 316 319 320 321 322 323 324 325 326 328 329 330 331 332 333 335 336 337 338 339 340 341 342 343 345 346 347 348 349 350 354 355 356 358 359 360 361 362 363 364 365 367 368 369; do echo '-D' tegra-$i; done` reboot_tegra

The command does the reboots one at a time from the foopy each device is connected to, with one ssh connection per foopy.
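For anyone decoding that one-liner: the backtick loop only builds a list of -D tegra-NNN arguments, and manage_foopies.py then fans the reboots out with one ssh connection per foopy. A rough Python equivalent of those two steps (the devices.json shape assumed here, device name mapping to a dict with a "foopy" key, may not match the real file):

# Rough equivalent of the shell loop plus the per-foopy grouping, illustrative only.
# The devices.json schema (name -> {"foopy": ...}) is an assumption.
import json
from collections import defaultdict

numbers = ["021", "032", "036", "093"]   # abbreviated; the real run listed ~110 devices

# The `for i in ...; do echo '-D' tegra-$i; done` loop expands to this argument list:
args = []
for n in numbers:
    args += ["-D", "tegra-%s" % n]
print(" ".join(args))                    # -D tegra-021 -D tegra-032 -D tegra-036 -D tegra-093

# Group devices by the foopy they hang off, so each foopy gets a single ssh
# connection and reboots its tegras one at a time:
with open("devices.json") as f:
    devices = json.load(f)

by_foopy = defaultdict(list)
for n in numbers:
    name = "tegra-%s" % n
    foopy = devices.get(name, {}).get("foopy", "unknown")
    by_foopy[foopy].append(name)

for foopy, tegras in sorted(by_foopy.items()):
    print("%s: reboot %s one at a time" % (foopy, ", ".join(tegras)))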
Depends on: 838687
had to cycle clientproxy to bring this back
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
No longer blocks: 808468
Depends on: 808468
Hasn't run a job in 15 days, 16:05:21
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 858134
Recovery didn't help, dunno what to do.

2013-04-05 06:20:43 tegra-093 p online active OFFLINE :: error.flg [Automation Error: Unable to connect to device after 5 attempts]
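That error.flg line is the automation giving up after a fixed number of connection attempts to the device's agent and leaving a flag file behind for the tooling to notice. A minimal sketch of that pattern, purely illustrative (the port number, backoff, and flag path are assumptions, not the actual automation code):

# Illustrative retry-then-flag pattern, not the actual automation code.
# The SUTAgent port and the error.flg location here are assumptions.
import socket
import time

MAX_ATTEMPTS = 5
SUT_PORT = 20701          # assumed agent command port

def connect_or_flag(device, flag_path="error.flg"):
    """Try to reach the device's agent; write error.flg if every attempt fails."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            sock = socket.create_connection((device, SUT_PORT), timeout=30)
            sock.close()
            return True
        except (socket.error, socket.timeout):
            time.sleep(5 * attempt)    # back off a little between attempts
    with open(flag_path, "w") as f:
        f.write("Automation Error: Unable to connect to device after %d attempts\n"
                % MAX_ATTEMPTS)
    return False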
Depends on: 865749
Sending this slave to recovery --> Automated message.
Depends on: 889567
Depends on: 892096
Sending this slave to recovery --> Automated message.
Depends on: 896572
flashed and reimaged
Product: mozilla.org → Release Engineering
Replace sdcard please
Depends on: 918162
back in production
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
QA Contact: other → armenzg
2014-01-16 10:24:20 tegra-093 p online active OFFLINE :: error.flg [Automation Error: Unable to connect to device after 5 attempts]

PDU reboot didn't help.
Status: RESOLVED → REOPENED
Depends on: 960642
Resolution: FIXED → ---
SD card has been wiped and tegra has been re-imaged.
(In reply to John Pech [:jpech-lbc] from comment #19)
> SD card has been wiped and tegra has been re-imaged.

Can we try again?
Depends on: 974917
SD card formatted, tegra reimaged and flashed.

[vle@admin1a.private.scl3 ~]$ fping tegra-093.tegra.releng.scl3.mozilla.com
tegra-093.tegra.releng.scl3.mozilla.com is alive
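The same post-reimage check can be scripted so it isn't typed by hand each time; a minimal sketch, assuming fping is installed and using the hostname pattern from this bug (everything else is illustrative):

# Minimal post-reimage liveness check, illustrative only.
import subprocess
import sys

def is_alive(device_number):
    host = "tegra-%s.tegra.releng.scl3.mozilla.com" % device_number
    # fping exits 0 when the target answers
    return subprocess.call(["fping", host]) == 0

if __name__ == "__main__":
    sys.exit(0 if is_alive("093") else 1)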
Taking jobs again
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard