Closed
Bug 784278
Opened 12 years ago
Closed 12 years ago
New tegras (and some old ones) failing in reftest intermittently
Categories
(Release Engineering :: General, defect, P3)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: Callek, Unassigned)
References
Details
(Keywords: intermittent-failure)
Attachments
(1 file)
(deleted), patch | Callek: review+
So far we have many instances of tegras from our new batch failing while being run on one of our existing mac foopies.
Some of the same tegras that are failing have also passed some test runs, across trees (try, m-i, m-c, etc.).
The screenshot data: URLs seem to show a completely blank [white] screen.
I welcome *any* and *all* ideas on what to look for, or if someone wants hands-on time with a tegra or two, or even the foopy itself.
See any of:
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-306
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-305
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-304 (no orange reftests yet)
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-302
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-300
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-299
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-298
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-297
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-296
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-295
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-294
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-293
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-290
I am hoping we can find out what is different/problematic here rather than needing to junk/return these tegras.
Comment 1•12 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=14548463&tree=Mozilla-Inbound
294
https://tbpl.mozilla.org/php/getParsedLog.php?id=14548402&tree=Mozilla-Inbound
302
https://tbpl.mozilla.org/php/getParsedLog.php?id=14548286&tree=Mozilla-Inbound
300
https://tbpl.mozilla.org/php/getParsedLog.php?id=14551356&tree=Mozilla-Inbound
305
Comment 4•12 years ago
|
||
Some random observations:
- there are both UNEXPECTED-FAIL and UNEXPECTED-PASS failures in all of these logs; they seem to be in nearly equal proportion
- the set of failing tests is consistent; the 2 R1 logs are virtually identical; same for the 2 R2 logs and the 2 R3 logs
- the environment, SUT version, OS version and everything else I could think to check looks the same as on "old" tegras; the only difference I have spotted is:
HOME=/Users/cltbld
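A rough way to check the two observations above (the FAIL/PASS split, and how consistent the failing set is between two logs) might look like the sketch below. The log line shape in the comment and the names `summarize` and `overlap` are assumptions for illustration, not something taken from the reftest harness.

```python
import re
from collections import Counter

# Assumed line shape: "REFTEST TEST-UNEXPECTED-FAIL | <test url> | <message>".
LINE_RE = re.compile(r"TEST-(UNEXPECTED-(?:FAIL|PASS))\s*\|\s*([^|]+?)\s*(?:\||$)")


def summarize(log_text):
    """Return (status counts, set of (status, test) pairs) for one log."""
    counts = Counter()
    tests = set()
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            status, test = match.groups()
            counts[status] += 1
            tests.add((status, test))
    return counts, tests


def overlap(tests_a, tests_b):
    """Jaccard similarity between two sets of unexpected results."""
    union = tests_a | tests_b
    return len(tests_a & tests_b) / len(union) if union else 1.0
```

Running `summarize` on two R1 logs and passing both sets to `overlap` would give a number near 1.0 if the logs really are virtually identical.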
I see the same as Geoff here. I don't know why setting the home directory would matter for any of our code, so that doesn't feel like it should cause this. It really looks like the tegras are simply not rendering graphics during these test runs, as unusual as that would be.
Running tegra 305 on my desk with exactly the same set of builds and tests as it ran in the automation (different host-utils because I can't download the one the tegras use) but it runs perfectly every single time.
It has completed 5 runs now and never failed.
I'm running it in a loop 50 times, rebooting between runs. I'll check back on it in a few hours and see what happens.
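For anyone wanting to repeat this kind of soak test, a minimal sketch of the loop is below. It is not the script used here; `run_tests` and `reboot` are hypothetical callables you would supply yourself (for example, wrappers around whatever commands start the reftest run and reboot the device).

```python
def run_repeatedly(run_tests, reboot, iterations=50):
    """Call run_tests() `iterations` times, calling reboot() between runs.

    run_tests is expected to return True when the run was green.
    Returns the 1-based numbers of the runs that failed.
    """
    failures = []
    for i in range(1, iterations + 1):
        if not run_tests():
            failures.append(i)
        # Reboot only between runs, not after the last one.
        if i < iterations:
            reboot()
    return failures
```

An empty list back means every run was green.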
Comment 8•12 years ago
|
||
So, overnight, tegra-305 ran the same code that caused the intermittent orange in comment 1 fifty-five times in succession and never failed a single test.
When I went into 2.IDF, the tegras were sitting on a metal shelf, with styrofoam at the bottom insulating them from the shelf. In my cursory review of the set of tegras (while I was looking for tegra 305), I found that most or all of the tegras having this issue are the ones on the ends of the shelves, resting against the metal supports on either side of the shelf. I recommend that we insulate these better, because the metal may be creating a conductive path across the board where there shouldn't be one, causing these intermittent issues.
I will definitely agree that the theory sounds kind of wacko but given that the software on the mac foopy running these didn't change and that removing the tegra from that environment and running the same software on it in a new environment changed the behavior, then I'm thinking our one variable left is the physical environment itself. I'll file a DC-ops bug to get more insulation around these devices, and have it block this one.
Comment 10•12 years ago
|
||
(In reply to Clint Talbert ( :ctalbert ) from comment #9)
> I will definitely agree that the theory sounds kind of wacko but given that
> the software on the mac foopy running these didn't change and that removing
> the tegra from that environment and running the same software on it in a new
> environment changed the behavior, then I'm thinking our one variable left is
> the physical environment itself. I'll file a DC-ops bug to get more
> insulation around these devices, and have it block this one.
FWIW, I don't think this sounds wacko at all. We had a similar issue a few years ago with the iX machines where a certain combination of flooring materials, racks, and fan/drive harmonics in different colos caused degraded performance on *some* of the iX machines. These are often the craziest situations to debug, so kudos if this works. Fingers crossed here.
Comment 12•12 years ago
|
||
Reporter | ||
Comment 13•12 years ago
|
||
Bug 784767 should be done now; I am 90% sure that c#12 here happened before that work was done. So starting now we should be on the lookout for more cases of this. (I'm hoping it does not repeat.)
Comment 21•12 years ago
|
||
Things that fit by number:
https://tbpl.mozilla.org/php/getParsedLog.php?id=14642103&tree=Mozilla-Inbound
tegra-356
https://tbpl.mozilla.org/php/getParsedLog.php?id=14644642&tree=Mozilla-Inbound
tegra-314
https://tbpl.mozilla.org/php/getParsedLog.php?id=14642362&tree=Mozilla-Inbound
tegra-361
but throughout a bunch of bustage for the now-backed-out https://hg.mozilla.org/integration/mozilla-inbound/rev/c0bf8f743419, we also had
https://tbpl.mozilla.org/php/getParsedLog.php?id=14643408&tree=Mozilla-Inbound
tegra-191
https://tbpl.mozilla.org/php/getParsedLog.php?id=14644076&tree=Mozilla-Inbound
tegra-152
https://tbpl.mozilla.org/php/getParsedLog.php?id=14642108&tree=Mozilla-Inbound
tegra-119
https://tbpl.mozilla.org/php/getParsedLog.php?id=14641662&tree=Mozilla-Inbound
tegra-146
https://tbpl.mozilla.org/php/getParsedLog.php?id=14640011&tree=Mozilla-Inbound
tegra-101
https://tbpl.mozilla.org/php/getParsedLog.php?id=14640107&tree=Mozilla-Inbound
tegra-104
which may just mean that this is a symptom, one which can also be caused by delaying the creation of the form history and password dbs, or may mean that it has spread (or was never about the 300s), or practically anything you can think of for it to mean.
Comment 38•12 years ago
|
||
tegra-367
Comment 58•12 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=14755449&tree=Firefox
tegra-298
(We'll be missing a fair number of instances of this that'll wind up in bug 663657 since it sometimes times out, like this one did, after the several hundred failures.)
Comment 65•12 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=14784346&tree=Mozilla-Inbound
s: tegra-367
https://tbpl.mozilla.org/php/getParsedLog.php?id=14784544&tree=Mozilla-Inbound
s: tegra-311
https://tbpl.mozilla.org/php/getParsedLog.php?id=14784385&tree=Mozilla-Inbound
s: tegra-343
All on one push. Niiiice.
Comment 79•12 years ago
|
||
This doesn't seem limited to Mac foopy builds. Many of the above failures have foopy_type 'Linux'.
Summary: New tegras failing in reftest (on a *mac* foopy) intermittently → New tegras failing in reftest (on a *mac* (or linux?) foopy) intermittently
Reporter | ||
Comment 80•12 years ago
|
||
(In reply to Matt Brubeck (:mbrubeck) from comment #79)
> This doesn't seem limited to Mac foopy builds. Many of the above failures
> have foopy_type 'Linux'.
Good point. The initial reasoning for calling it out was that it was not an issue with the Linux foopy alone [and that was before I brought the Linux foopy to production for any new tegras].
Summary: New tegras failing in reftest (on a *mac* (or linux?) foopy) intermittently → New tegras failing in reftest intermittently
Comment 97•12 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=14856658&tree=Mozilla-Aurora
tegra-307
(another triple)
Comment 101•12 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=14863796&tree=Mozilla-Inbound
tegra-338
(another triple, but on my push, which is less funny)
Reporter | ||
Updated•12 years ago
|
Whiteboard: [orange]
Comment 130•12 years ago
|
||
Is this just the new normal, and from now on 10 or 20 reftest runs a day will fail this way to go along with the 10 or 20 reftest runs a day that will time out?
Comment 136•12 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=14988710&tree=Firefox
s: tegra-299
https://tbpl.mozilla.org/php/getParsedLog.php?id=14993687&tree=Mozilla-Inbound
s: tegra-290
https://tbpl.mozilla.org/php/getParsedLog.php?id=15000102&tree=Mozilla-Inbound
s: tegra-367
https://tbpl.mozilla.org/php/getParsedLog.php?id=15000045&tree=Mozilla-Inbound
s: tegra-318
Comment 147•12 years ago
|
||
We can't seem to repro this anywhere. I'd like to see if the fixes from bug 737961, which will eliminate the need to run at massive resolution, will fix this. When we eliminated the need for the 800 x 1000 resolution from the jsreftest and crashtests, those frameworks became far more stable and green.
Depends on: 737961
Comment 149•12 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=15029191&tree=Mozilla-Inbound
tegra-339
https://tbpl.mozilla.org/php/getParsedLog.php?id=15029310&tree=Mozilla-Inbound
tegra-343
https://tbpl.mozilla.org/php/getParsedLog.php?id=15029593&tree=Mozilla-Inbound
tegra-355
(a triple, but they're hardly worth noticing anymore, a push which has one green reftest hunk is more of a surprise)
Comment 153•12 years ago
|
||
That's two inbound pushes in a row: one hit this on all three reftest hunks, and the next hit this on two of the three, plus bug 660480. Callek asked me early on whether this was bad enough that we would be better off not running the new tegras at all instead of enduring it. At the time, the answer was no, we didn't need to shut them off. Now the answer is yes: they are in some way broken, and need to go away until they get better.
Comment 172•12 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=15075660&tree=Firefox
tegra-073 (which has only existed since Thursday, so yeah, "new")
Comment 188•12 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=15119800&tree=Mozilla-Inbound
tegra-302
That's the retriggered run, on a push which probably broke R1, which we thought probably didn't despite having tried to show how it was breaking it on try, because we've lost pretty much all faith in the ability of tegras to run reftests anymore.
Comment 194•12 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=15118983&tree=Try
tegra-310
https://tbpl.mozilla.org/php/getParsedLog.php?id=15114547&tree=Try
tegra-361
https://tbpl.mozilla.org/php/getParsedLog.php?id=15112724&tree=Try
tegra-334
https://tbpl.mozilla.org/php/getParsedLog.php?id=15116834&tree=Try
tegra-333
https://tbpl.mozilla.org/php/getParsedLog.php?id=15101027&tree=Try
tegra-308
Comment 208•12 years ago
|
||
(In reply to Clint Talbert ( :ctalbert ) from comment #6)
> Running tegra 305 on my desk with exactly the same set of builds and tests
> as it ran in the automation (different host-utils because I can't download
> the one the tegras use) but it runs perfectly every single time.
>
We can get you the same host-utils.
Comment 219•12 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=15170841&tree=Mozilla-Inbound
s: tegra-313
https://tbpl.mozilla.org/php/getParsedLog.php?id=15171041&tree=Mozilla-Inbound
s: tegra-304
https://tbpl.mozilla.org/php/getParsedLog.php?id=15171079&tree=Mozilla-Inbound
s: tegra-302
https://tbpl.mozilla.org/php/getParsedLog.php?id=15171798&tree=Mozilla-Inbound
s: tegra-349
Comment 222•12 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=15181920&tree=Mozilla-Inbound
tegra-361
https://tbpl.mozilla.org/php/getParsedLog.php?id=15182023&tree=Mozilla-Inbound
tegra-338
https://tbpl.mozilla.org/php/getParsedLog.php?id=15182176&tree=Mozilla-Inbound
tegra-314
https://tbpl.mozilla.org/php/getParsedLog.php?id=15181115&tree=Mozilla-Inbound
tegra-318
https://tbpl.mozilla.org/php/getParsedLog.php?id=15180111&tree=Mozilla-Inbound
tegra-107
https://tbpl.mozilla.org/php/getParsedLog.php?id=15179605&tree=Mozilla-Inbound
tegra-310
Comment 227•12 years ago
|
||
This should soon slow down and then be resolved, since we won't run reftests on the new batches of tegras (bug 790698).
Comment 228•12 years ago
|
||
Or it'll be morphing into something even weirder, since there are a very few older tegras getting infected with the all-white-reftest disease.
https://tbpl.mozilla.org/php/getParsedLog.php?id=15197834&tree=Mozilla-Inbound
tegra-280
Comment 229•12 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=15222705&tree=Firefox
tegra-267
Which isn't a full set, and there probably are two separate things getting mixed together here.
Updated•12 years ago
|
Summary: New tegras failing in reftest intermittently → New tegras (and some old ones) failing in reftest intermittently
Updated•12 years ago
|
OS: Windows 7 → Other
Priority: -- → P3
Comment 245•12 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=15748376&tree=Mozilla-Inbound
tegra-094 (which makes me weepy, since that's the first one since the patch for bug 792212 landed)
Comment 248•12 years ago
|
||
The new tegras seem to work the same as the old now, with the patch for bug 797942. Can we close this out and remove the special configuration preventing new tegras from running reftests?
Comment 249•12 years ago
|
||
Attachment #674711 -
Flags: review?(bugspam.Callek)
Reporter | ||
Comment 250•12 years ago
|
||
Comment on attachment 674711 [details] [diff] [review]
time to revert
staging first though! (and double check we have some of these new ones up in staging)
Attachment #674711 -
Flags: review?(bugspam.Callek) → review+
Comment 253•12 years ago
|
||
Since bug 797942 is on Gecko 19, despite the 18 milestone, wouldn't the "time to revert" be when 19 hits mozilla-release in February?
Assignee | ||
Updated•12 years ago
|
Keywords: intermittent-failure
Assignee | ||
Updated•12 years ago
|
Whiteboard: [orange]
Comment 257•12 years ago
|
||
Resolving WFM keyword:intermittent-failure bugs last modified >3 months ago, whose whiteboard contains none of:
{random,disabled,marked,fuzzy,todo,fails,failing,annotated,time-bomb,leave open}
There will inevitably be some false positives; for that (and the bugspam) I apologise. Filter on orangewfm.
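For reference, the rule described above could be expressed roughly as the sketch below. This is not the query that was actually run; treating "3 months" as 90 days, matching whiteboard words by case-insensitive substring, and the function name `should_resolve_wfm` are all assumptions.

```python
from datetime import datetime, timedelta

# Whiteboard words that exempt a bug from being auto-resolved (from the list above).
EXEMPT_WORDS = {
    "random", "disabled", "marked", "fuzzy", "todo",
    "fails", "failing", "annotated", "time-bomb", "leave open",
}


def should_resolve_wfm(keywords, whiteboard, last_modified, now,
                       max_age=timedelta(days=90)):
    """Return True if a bug matches the WFM sweep rule (assumed semantics)."""
    if "intermittent-failure" not in keywords:
        return False
    # "last modified >3 months ago", approximated here as more than 90 days.
    if now - last_modified <= max_age:
        return False
    board = whiteboard.lower()
    return not any(word in board for word in EXEMPT_WORDS)
```

Substring matching is the loosest reading of "contains", so a check like this errs toward leaving bugs open rather than closing them.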
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
Assignee | ||
Updated•11 years ago
|
Product: mozilla.org → Release Engineering
Assignee | ||
Updated•7 years ago
|
Component: General Automation → General