Closed Bug 938872 (t-xp32-ix-085) Opened 11 years ago Closed 11 years ago

t-xp32-ix-085 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P3)

x86
Windows XP

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

References

Details

(Whiteboard: [buildduty][buildslaves][capacity])

Attachments

(1 file)

https://tbpl.mozilla.org/php/getParsedLog.php?id=30576711&tree=Mozilla-Inbound is a block of 64 Pink Pixels of Death, a rather impressive feat of memory failure which really ought to show up when this slave has memtest run on it. Disabled in slavealloc.
Let's run memtest just out of curiosity. We will swap the memory even if memtest does not show any sign of failure.
Depends on: 939104
Depends on: 959454
Depends on: 960557
Attempting SSH reboot...Failed. Filed IT bug for reboot (bug 963858)
Back in production.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Either broken, or the reimage went badly: 13:31:42 INFO - 1919 ERROR TEST-UNEXPECTED-FAIL | /tests/gfx/tests/mochitest/test_acceleration.html | Acceleration enabled on Windows XP or newer - didn't expect 0, but got it
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attempting SSH reboot...Failed. Filed IT bug for reboot (bug 974117)
Depends on: 975006
Gonna try it again after another reimage+diagnostics.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
First job was orange, but not for acceleration reasons...so I think it's legit orange. Still watching the next job.
Machine is still broken: 11:42:23 INFO - 7502 ERROR TEST-UNEXPECTED-FAIL | /tests/content/canvas/test/webgl/non-conf-tests/test_webgl_available.html | Expected WebGL creation to succeed.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
back in production, leaving dep bug 975006 open until first job completes.
moved to t-w732-ix-002 slot and brought back into production for testing. will post findings in bug 975006
Still broken, disabled again. I'd really very much rather that you not put known broken slaves in production on a Friday and then just wander off.
(In reply to Phil Ringnalda (:philor) from comment #12)
> Still broken, disabled again. I'd really very much rather that you not put
> known broken slaves in production on a Friday and then just wander off.

to clarify:
- these machines were enabled Fri morning PT
- I reported to sheriffs at that time that I was enabling these hosts and warned they might fail.
- as said in comment 11, I'd post findings in bug 975006 and I did after first job which was green
- checking back on things now, it's obvious that our attempt to fix did not work.

I apologize this caused failures and time on your end. Thank you for disabling. If you would like to question certain practices that I am doing as buildduty, I'm open to a constructive talk. Maybe I can then take your expertise and change the way we do things that would be better for sheriffs and our build system. ie: possibly not enabling machines on a Friday.

van: it seems like switching slots did not do the trick. looks like we will have to try something else :)
Van has been working hard on this slave, trying to find the root problem. Pete, can you please enable this on Thursday or Friday? Note that, per the comments above, I failed to catch it burning jobs. Could you please keep a very close eye on it and catch it burning a job before the sheriffs do? Thanks!
Flags: needinfo?(pmoore)
But note that it's way way harder to spot than a simple "it burns, you'll see that every job is red," since what it actually does is run at far too low a resolution and without graphics acceleration, so when it has test failures, you have to look at the actual failures, and realize that in this case a context-menu test is failing because the resolution is so low that most of the menu is off-screen, not because of the usual intermittent failure bug. Since we have tens of thousands of intermittent failure bugs filed, and since tbpl suggests them based on the test filename, not on the failure message, even if you see that this slave took a job and had a test failure, and when you go to look a sheriff has starred it as being an intermittent failure, that *still* doesn't mean that it was that known failure rather than the busted resolution and lack of acceleration; only knowing to ask yourself "given that this particular slave is probably busted, running with tiny resolution, does this failure still look like a known intermittent rather than it having tiny resolution?" will say whether it's still busted or not. Well, or just starting it up not in production, and looking at whether it's running at a tiny resolution, it sure seems to me like that ought to be possible somehow.
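One rough way to apply the "look at the actual failures" advice above is to scan a slave's log for the specific failures this bug has already tied to the busted display. The following is only a sketch: the symptom list is an assumption built from the two failure lines quoted earlier in this bug, the function names are hypothetical, and it would not catch the context-menu style failures, which still need a human to look at them.

```python
import re

# Matches the "TEST-UNEXPECTED-FAIL | <test path> | <message>" part of a log line,
# in the shape of the failure lines quoted earlier in this bug.
FAIL_RE = re.compile(r"TEST-UNEXPECTED-FAIL \| (?P<test>\S+) \| (?P<message>.*)$")

# Assumption: these two tests are the only known symptom tests, taken from this bug.
DISPLAY_SYMPTOM_TESTS = ("test_acceleration.html", "test_webgl_available.html")


def parse_failures(log_lines):
    """Return a list of (test_path, message) for every unexpected failure."""
    failures = []
    for line in log_lines:
        match = FAIL_RE.search(line)
        if match:
            failures.append((match.group("test"), match.group("message").strip()))
    return failures


def display_symptoms(failures):
    """Keep only failures whose test file is a known display/acceleration symptom."""
    return [f for f in failures if f[0].endswith(DISPLAY_SYMPTOM_TESTS)]
```

An empty result from `display_symptoms` does not prove the slave is healthy; it only means none of the already-known symptom tests failed in that log.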
Thanks folks. So it sounds like my best bet is to check the resolution before putting it in production. It looks like a Windows XP 32-bit slave; I'll see if I can VNC onto it.

Phil: what resolution *should* it have? If this is just a case of the resolution being incorrectly set, is something wrong with our imaging process, and can this be fixed by GPO config?

Also, is it necessary for hardware acceleration to be enabled? If so, how would we do this, and can we also make that part of the imaging process? If it is not possible on this hardware, and is a requirement of the tests, should we disable the tests that require hardware acceleration on machines that do not (and cannot) have it?

Thanks,
Pete
Flags: needinfo?(pmoore) → needinfo?(philringnalda)
According to a screenshot from a test failure on a healthy winxp slave, resolution should be 1600x1200. And according to my vague memory of things armenzg has said in bugs and screenshots he has posted to bugs while we've had post-reimaging resolution and graphics problems before, you can compare... maybe it's the Graphics Properties, from right-clicking the desktop?... between a busted slave and a healthy one to find that the busted one is using the wrong graphics card, or only has one when it should have two, or has the right one, but it only thinks it's capable of running smaller resolutions. Not sure, I've never had access to any of our slaves and I no longer own anything running WinXP.
Flags: needinfo?(philringnalda)
Thanks Philor. Armen, see comments 16 and 17 above - what are your thoughts? Pete
Flags: needinfo?(armenzg)
I found out RelOps are working on adding a test to the start talos bat to check the resolution before launching run slave. Hopefully this could help. Not sure whether graphics acceleration is needed. Also, RelOps mentioned there could be a physical monitor connected to the slave. RelOps will update this bug to associate it with the start talos check bug they are working on.
I think it may be worthwhile seeing if that change fixes it, rather than doing a one-time fix. However, a one-time fix is probably as simple as vnc'ing onto the slave and changing desktop size (not sure about hardware acceleration though).
It seems it is still using the wrong display. I've asked IT to look into it in the dep bug. I created this page (I will be adding more info): https://wiki.mozilla.org/ReleaseEngineering/Buildduty/Slave_Management#Xp If that fails, we might want to ask for the graphics card to be replaced.

A while ago I wrote a script that adjusts the screen resolution on Win7 machines: http://hg.mozilla.org/build/tools/file/default/scripts/support/mouse_and_screen_resolution.py There is code to query screen resolutions. We should find a way to prevent machines from starting up with too small a screen resolution. We could use runslave.py or start-buildbot.bat to prevent that (since we don't have pre-flight tasks yet).
Flags: needinfo?(armenzg)
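For illustration, a pre-start check along the lines suggested above might look roughly like this. It is a minimal sketch, not the code from mouse_and_screen_resolution.py or runslave.py: the function names are hypothetical, the 1600x1200 minimum is the value reported earlier in this bug for a healthy winxp slave, and the query uses the Win32 `GetSystemMetrics` call through `ctypes`, which reports the primary display only.

```python
import ctypes
import sys

# Assumption: the minimum comes from the healthy-slave screenshot mentioned in this bug.
REQUIRED_RESOLUTION = (1600, 1200)

SM_CXSCREEN = 0  # width of the primary display, in pixels
SM_CYSCREEN = 1  # height of the primary display, in pixels


def query_resolution():
    """Return (width, height) of the primary display, or None when not on Windows."""
    if not sys.platform.startswith("win"):
        return None
    user32 = ctypes.windll.user32
    return user32.GetSystemMetrics(SM_CXSCREEN), user32.GetSystemMetrics(SM_CYSCREEN)


def resolution_ok(current, required=REQUIRED_RESOLUTION):
    """True only if a resolution was found and it is at least the required size."""
    if current is None:
        return False
    return current[0] >= required[0] and current[1] >= required[1]


def main():
    """Return 0 if the slave may start, 1 if the resolution is unknown or too small."""
    current = query_resolution()
    if not resolution_ok(current):
        print("Refusing to start: resolution is %s, need at least %dx%d"
              % (current, REQUIRED_RESOLUTION[0], REQUIRED_RESOLUTION[1]))
        return 1
    return 0
```

A wrapper could call `sys.exit(main())` and skip launching the slave on a non-zero exit code. This says nothing about graphics acceleration, which would need a separate check.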
On all of the machines there is a script, c:\monitor_config\fakemon.vbs, that detects whether the second screen is missing, adds it if necessary, and then adjusts the resolution.
The screen resolution is now correct. Rebooting into the pool.
The screen resolution *was* correct, for some apparently brief period. The screenshot in https://tbpl.mozilla.org/php/getParsedLog.php?id=39859470&tree=Mozilla-Inbound shows it failing a test at 1024x768, and https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=t-xp32-ix-085 shows it failing most of the things it tried to do today. Disabled in slavealloc.
Depends on: 1013280
QA Contact: armenzg → bugspam.Callek
Bug 1013280 is fixed, rebooted into production
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard