Closed
Bug 977615
Opened 11 years ago
Closed 11 years ago
Upgrade Windows 8 machines' Nvidia drivers to 335.23
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: armenzg, Assigned: q)
References
Details
Attachments
(5 files, 1 obsolete file)
We're going to need to upgrade the drivers to help with bug 938395.
I will run a machine through staging first.
Reporter | Updated • 11 years ago
OS: Linux → Windows 8
Summary: Upgrade one Windows 8 machine's Nvidia drivers to 334.89 → Upgrade Windows 8 machines' Nvidia drivers to 334.89
Updated • 11 years ago
Assignee: relops → q
Nvidia has released the GeForce 335.23 WHQL drivers; can this driver be installed instead?
http://www.nvidia.com/download/driverResults.aspx/73780/en-us
Reporter | Comment 2 • 11 years ago
I'm now testing the machine on staging and will have the results soon.
When could we have time to test the installation on a machine and check that we have the graphics update service still disabled?
http://us.download.nvidia.com/Windows/335.23/335.23-desktop-win8-win7-winvista-64bit-english-whql.exe
Reporter | Comment 3 • 11 years ago
We should have all results by tomorrow morning.
Reporter | Comment 4 • 11 years ago
All jobs passed.
Q: When can this fit within your schedule?
Thanks!
Flags: needinfo?(q)
Comment 5 • 11 years ago
I would recommend that we wait to make any major changes to production until after Pwn2Own.
Reporter | Comment 6 • 11 years ago
For sure, we'll time the deployment after it.
Reporter | Comment 7 • 11 years ago
After the release that fixes it goes out.
Reporter | Comment 8 • 11 years ago
Q: any news on when this could be done?
Let's aim for Monday. Does that work for you?
Flags: needinfo?(q)
Reporter | Comment 10 • 11 years ago
Monday wfm!
Thank you Q!
Is there an easy way to roll back in case anything goes wrong?
I don't think we will need it but we should try to test it once.
jrmuizel: we might also need your experience if we need to disable a patch temporarily. We got a green run on staging but the greenness could change.
Flags: needinfo?(armenzg)
Reporter | Comment 11 • 11 years ago
I'm ready whenever you are. Thanks!
Assignee | Comment 12 • 11 years ago
Sorry this didn't go out yesterday; I was testing rollback options, and we have them. I will work on getting this out today.
Q
Reporter | Comment 13 • 11 years ago
For CCed people: No action required. Adding you as FYI. We're upgrading Nvidia drivers on Win8 machines. The change has been tested on staging.
Blocks: 988012
Assignee | Comment 14 • 11 years ago
This was rolled back to figure out what caused the failures that closed the trees.
Reporter | Comment 15 • 11 years ago
Q, could you please deploy the change to t-w864-ix-009?
I would like to test it on my staging master.
On another note, I'm putting t-w864-ix-042 again on staging to see why it didn't catch the issues from bug 988012.
Blocks: t-w864-ix-009
Assignee | Comment 16 • 11 years ago
Will do.
Reporter | Comment 17 • 11 years ago
Q, can we please use 335.23 instead of 334.89?
Summary: Upgrade Windows 8 machines' Nvidia drivers to 334.89 → Upgrade Windows 8 machines' Nvidia drivers to 335.23
Assignee | Comment 18 • 11 years ago
Targeting only t-w864-ix-009 with the update. Rebooting now to pick it up with driver version 335.25.
Updated • 11 years ago
Blocks: t-w864-ix-038
Reporter | Comment 19 • 11 years ago
I think there's something funky with 009's graphical setup.
Q: can you have a look at it when you have time?
Could you also install the newer graphics driver on 042?
Thanks for your help!
On another note, my testing on staging was pretty much invalid.
GPO reverted my manual installation in between reboots.
This means that I was testing the older setup and that is why I got everything to be green. Yay me!
Assignee | Comment 20 • 11 years ago
009 will only get the new driver (it is now deployed via GPO). I ran c:\monitor_config\fakemon.vbs (it gets run on startup) and things look good. You should be good to test that machine. I will address 042 right now.
Reporter | Comment 21 • 11 years ago
Everything is running green so far on 009.
The driver version seems to be the right one (still 335.23).
Could it be that we need a reboot after the driver install?
Maybe the tests fail on the first run after the drivers upgrade?
I should run tests for mozilla-beta since that is what the logs from bug 988012 list.
Reporter | Comment 22 • 11 years ago
Reporter | Comment 23 • 11 years ago
Reporter | Comment 24 • 11 years ago
Matt, Jeff: I've attached the logs of the jobs that failed after running with the newer Nvidia driver.
Could you please have a look at them and find a way to make them green?
Once we have solutions for this we can move forward to the deployment (I will need to do another run across various branches).
Comment 25 • 11 years ago
Are you assuming that your problems with 009 were different than the problems in production? I don't think there's any reason to believe they were different, my memory of the failures would make "tiny resolution and no acceleration" a perfect fit.
Reporter | Comment 26 • 11 years ago
It seems that we might have two issues:
* the installation can get us into a bad state with no acceleration and tiny screen resolution
* after upgrade, we have some perma orange failures on various branches
Matt, Jeff: when do you have time to look at the failures? Does the following plan make sense?
I can think of this path forward:
1) We can run a machine through all the release branches: m-c, m-a, m-b, m-r & esr24
1.1) File bugs for each failing test; ask original authors to help them fix them or disable them
1.2) Loan you a machine and ask
2) Once we deploy this again, we can quickly disable any machines that get a bad installation and fix them up
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(jmuizelaar)
Comment 27 • 11 years ago
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #26)
> It seems that we might have two issues:
> * the installation can get us into a bad state with no acceleration and tiny
> screen resolution
> * after upgrade, we have some perma orange failures on various branches
>
> Matt, Jeff: when do you have time to look at the failures? Does the
> following plan make sense?
Sounds reasonable to me.
Flags: needinfo?(jmuizelaar)
Comment 28 • 11 years ago
Yes, sounds good to me too.
Jeff and I looked at the logs last week; nothing stood out as obviously graphics-related at all. It wasn't really obvious how the driver upgrade could have caused them.
Flags: needinfo?(matt.woodrow)
Reporter | Comment 29 • 11 years ago
Jeff, Matt: would you mind preparing a patch to disable the perma-failure on beta?
Q: do you have any trick to see if the installations failed on some of the machines?
Any ideas on how to prevent machines from taking jobs if they are not graphically ready to take them?
Anything I can check on the machine?
I'm hoping to look into the machines that failed last time (by looking at Windows logs) to see if there's anything that failed during the installation.
I triggered jobs on mozilla-aurora and they all came out clean.
I've triggered jobs on mozilla-release and mozilla-esr24 and we should know by tomorrow. I assume they might share the perma-failure of mozilla-beta.
It seems that "WINNT 6.2 mozilla-central debug test mochitest-browser-chrome" was intermittent. It failed 2/5 times *and* the test failures from each were different.
"WINNT 6.2 mozilla-beta opt test mochitest-browser-chrome" has failed 3/3 times the same way [1].
[1] These errors and similar:
17:40:04 WARNING - TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/browser/components/customizableui/test/browser_876944_customize_mode_create_destroy.js | The number of placeholders should be correct. - Got 2, expected 1
17:40:09 WARNING - TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/browser/components/customizableui/test/browser_880382_drag_wide_widgets_in_panel.js | Area PanelUI-contents should have 13 items. - Got 12, expected 13
17:40:16 WARNING - TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/browser/components/customizableui/test/browser_890140_orphaned_placeholders.js | Should no longer be in default state.
Reporter | Comment 30 • 11 years ago
Any answers wrt the questions on comment 29?
m-r and m-esr24 have come out clean.
As of now, we only have a perma-orange on mozilla-beta.
Q: what do you think if we deploy the change again to five machines and keep a close eye on it?
I would like to see recent failures.
Meanwhile, I will look at one of the machines that failed in the past: t-w864-ix-105
Flags: needinfo?(q)
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(jmuizelaar)
Assignee | Comment 31 • 11 years ago
I think deploying to five machines is a great idea. Do you have candidate machines for me?
Flags: needinfo?(q)
Reporter | Comment 32 • 11 years ago
Let's do 010 to 015.
Thanks Q!
Reporter | Comment 33 • 11 years ago
Q: if you could do it as early as possible either today or tomorrow, it will give me a higher chance of keeping an eye on them.
I will probably disable buildbot at the end of my day and re-enable them in the following morning.
Comment 34 • 11 years ago
Flags: needinfo?(jmuizelaar)
Reporter | Comment 35 • 11 years ago
Q and I will be deploying the change to five machines and keeping a close eye on them.
Flags: needinfo?(matt.woodrow)
Assignee | Comment 36 • 11 years ago
010 to 015 were set to get the new drivers after reboot. I still haven't seen an install get pulled, so I am going to make sure the change is getting picked up.
Reporter | Comment 37 • 11 years ago
I've disabled those slaves in slavealloc and requested a buildbot shutdown until I can look at them tomorrow.
I can see that a couple of them already started failing jobs (010 & 014):
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w864-ix&name=t-w864-ix-010
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w864-ix&name=t-w864-ix-011
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w864-ix&name=t-w864-ix-012
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w864-ix&name=t-w864-ix-013
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w864-ix&name=t-w864-ix-014
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w864-ix&name=t-w864-ix-015
For instance, #10 had Twisted disconnections.
How does the installation work? Do we wait until the drivers update and then reboot?
Is there a way to check for a failed installation?
Assignee | Comment 38 • 11 years ago
The install runs at boot and makes sure that runslave doesn't start. The install is still running on 014 and runslave was killed. Is it possible something rebooted this machine during the last install? The machine has only been up for 6 minutes.
Q
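The boot-time guard described in comment 38 (the install runs at boot and keeps runslave from starting until it finishes) can be sketched roughly as follows. This is an illustrative Python sketch only; the real logic lives in the GPO-deployed batch/VBScript, and the sentinel-file name and `start` hook are hypothetical.

```python
import os
import time

def install_finished(marker_path):
    """Hypothetical check: assume the installer drops a sentinel file on completion."""
    return os.path.exists(marker_path)

def start_slave_when_safe(marker_path, start, poll_seconds=30, timeout=1800):
    """Hold off starting runslave until the driver install finishes.

    If the installer is still running after `timeout` seconds, give up and
    leave the slave out of the pool rather than letting it take jobs
    half-configured (the failure mode seen on 014).
    """
    waited = 0
    while not install_finished(marker_path):
        if waited >= timeout:
            return False  # needs manual attention; stay disabled
        time.sleep(poll_seconds)
        waited += poll_seconds
    start()  # e.g. launch runslave.py
    return True
```

With a guard like this, a mid-install reboot would simply leave the slave idle instead of letting it take and burn a job.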
Assignee | Comment 39 • 11 years ago
The installer completed on 014 and it rebooted without my intervention. Checking on the resolution etc. now.
Reporter | Comment 40 • 11 years ago
The last job that #14 took successfully finished at: 4/23/2014, 3:16:44 PM
After that it should have rebooted (last step in this job [1])
At that point, I assume it came back from a reboot and installed the drivers.
At Wed Apr 23 15:34:43 2014 it started the next job, and after 1 min, 48 secs it failed (Wed Apr 23 15:36:32 2014) *without* rebooting. [3]
From your comment, who killed runslave? Did you mean that it was not running? ("...and runslave was killed.")
FTR I gracefully shut down buildbot sometime before my comment (15:50:02 PDT).
Taking that action should look like killing runslave.py when connected to the machine.
IIUC (correct me if I'm wrong), #14 took jobs even though the installation had not finished.
#######################
[1] http://buildbot-master110.srv.releng.scl3.mozilla.com:8201/builders/WINNT%206.2%20try%20opt%20test%20reftest-no-accel/builds/521
[2] http://buildbot-master110.srv.releng.scl3.mozilla.com:8201/builders/WINNT%206.2%20fx-team%20opt%20test%20mochitest-2/builds/251
[3]
15:35:27 INFO - #####
15:35:27 INFO - ##### Running clobber step.
15:35:27 INFO - #####
15:35:27 INFO - Running pre-action listener: _resource_record_pre_action
15:35:27 INFO - Running main action method: clobber
15:35:27 INFO - rmtree: C:\slave\test\build
15:35:27 INFO - Using _rmtree_windows ...
15:35:27 INFO - retry: Calling <bound method DesktopUnittest._rmtree_windows of <__main__.DesktopUnittest object at 0x025F9B30>> with args: ('C:\\slave\\test\\build',), kwargs: {}, attempt #1
remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
Assignee | Comment 41 • 11 years ago
I think I might have an idea here; let me check the install logs.
Assignee | Comment 42 • 11 years ago
This was my fault. For some reason the hostname filter that told GPO NOT to revert the changes back to the old driver was not taking effect. The new driver would install, then GPO would detect it and reset the node, hence the kill behavior (we had the revert in place because tests would fail with the new driver). I removed the revert statements entirely and am retrying now.
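The failure mode described above boils down to an exemption check that never matched: the revert policy should have skipped hosts targeted for the new driver, but applied to all of them. A hypothetical Python sketch of the intended logic (the real filter was a GPO hostname targeting rule, not code; version strings here are placeholders):

```python
def should_revert_driver(hostname, installed_version, old_version, exempt_hosts):
    """Revert to the old driver unless this host is exempt.

    The bug was equivalent to `exempt_hosts` being empty: every host that
    picked up the new driver was immediately reset, which killed the
    in-progress install and produced the kill behavior on the test slaves.
    """
    if hostname in exempt_hosts:
        return False  # host is being upgraded on purpose; leave it alone
    return installed_version != old_version
```

Removing the revert statements entirely, as Q did, is the degenerate case where `should_revert_driver` always returns False.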
Assignee | Comment 43 • 11 years ago
Okay, 014 looks stable: it came up with the correct resolution and did not revert.
Reporter | Comment 44 • 11 years ago
Yay!
Jeff, Matt: I want to wait until next week so Q and I can coordinate deploying this further (I'm on PTO until Monday). Please reach out to coop if this cannot wait.
Reporter | Comment 45 • 11 years ago
Thanks Q. I'm happy to see this figured out.
Assignee | Comment 46 • 11 years ago
Do we have any metrics on how these machines are doing?
Reporter | Comment 47 • 11 years ago
I'm putting the 5 machines back into production after the changes Q described in comment 43.
Let's see how the jobs go for each of them:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w864-ix&name=t-w864-ix-010
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w864-ix&name=t-w864-ix-011
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w864-ix&name=t-w864-ix-012
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w864-ix&name=t-w864-ix-013
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w864-ix&name=t-w864-ix-014
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-w864-ix&name=t-w864-ix-015
Comment 48 • 11 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=38670508&tree=Mozilla-Central is 015 failing the webgl mochitests, which fail (without hardware acceleration, at such a low resolution, whathaveyou) when webgl doesn't work on a slave.
Comment 49 • 11 years ago
Two of the others have done green mochitest-2 runs (though nobody has yet done either of my favorites, reftest or mochitest-1), so it may be that just 015 is broken.
Comment 50 • 11 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=38674156&tree=Mozilla-Inbound is 015 failing reftest webgl tests. Disabled it in slavealloc.
Reporter | Comment 51 • 11 years ago
Thank you philor.
It seems that only 015 has given trouble so far.
I sometimes feel that not being able to set a WebGL context should turn the job red.
Anyways, let's look tomorrow as to why 015 misbehaved.
Let's hope the other ones stay put.
18:08:41 INFO - JavaScript warning: http://mochi.test:8888/tests/dom/imptests/html/webgl/common.js, line 3: WebGL: Can't get a usable WebGL context
18:43:29 INFO - JavaScript warning: file:///C:/slave/test/build/tests/reftest/tests/content/canvas/test/reftest/webgl-utils.js, line 59: WebGL: Can't get a usable WebGL context
Reporter | Comment 52 • 11 years ago
Is this relevant? Should we re-image 015 and start again?
On another note, could we deploy the change to machines 020 to 029?
If machines 010 to 014 have been working, I'm confident those will work too.
After that we should deploy across the pool and disable individual machines that get into the state of 015.
Attachment #8414866 - Flags: feedback?(q)
Reporter | Comment 53 • 11 years ago
We're going to deploy across the board.
Deploying now rather than in the morning will be less disruptive if we have some machines that don't work.
If this goes sideways please use the escalation wiki.
Reporter | Comment 54 • 11 years ago
Comment on attachment 8414866 [details]
Capture1.PNG
Irrelevant event. Obsoleting screenshot.
Attachment #8414866 - Attachment is obsolete: true
Attachment #8414866 - Flags: feedback?(q)
Reporter | Comment 55 • 11 years ago
Email sent to sheriffs.
This can be used to monitor the change: https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=WINNT%206.2
And maybe this: http://builddata.pub.build.mozilla.org/reports/pending/pending.html (last two diagrams)
Reporter | Comment 56 • 11 years ago
It seems that this can kill the first job. Subsequent jobs work as expected.
Reporter | Comment 57 • 11 years ago
While the driver installation was going on, the script was killing any python process starting up. That caused the "blue" jobs. Once the installation finished, the machine would reboot and come back with the right driver.
Unfortunately, some machines would come back in a messed-up device/screen-resolution state.
The way that fakemon.vbs runs, it can get in the way of buildbot (see [1]).
Some jobs would come out orange.
005 recovered from such a state; 004, on the other hand, got 3 orange jobs.
With the newer starttalos.bat that Q is deploying to the win8 machines, it will prevent machines like 004 from running.
[1] TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser/browser_details.js | Enable button should be visible
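The new starttalos.bat behavior described above amounts to a preflight check: a machine whose display came up broken should refuse to take jobs instead of burning them orange. A hypothetical Python sketch of that idea (the real check lives in starttalos.bat/fakemon.vbs, and the expected resolution value here is a placeholder):

```python
EXPECTED_RESOLUTION = (1600, 1200)  # placeholder; whatever the test pool standardizes on

def display_ok(resolution, accelerated):
    """A sane graphics setup: hardware acceleration on, standard resolution."""
    return accelerated and resolution == EXPECTED_RESOLUTION

def preflight(resolution, accelerated):
    """Decide whether the slave may start taking jobs."""
    if display_ok(resolution, accelerated):
        return "start"
    return "halt"  # stay idle, like the 004 case, instead of running orange jobs
```

A halted machine then shows up as idle in slave health, which is much easier to spot and fix than a string of intermittent test failures.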
Reporter | Comment 58 • 11 years ago
So it seems that the new starttalos.bat is doing what we expected.
Currently, there are no production win8 jobs to hit.
There are a bunch of Try ones:
https://tbpl.mozilla.org/?tree=Try&jobname=WINNT%206.2
Reporter | Comment 59 • 11 years ago
I will be checking before 9pm.
I will let Q know by then if there's anything else we need to do or not.
I hope to get more data points for later in here (I'm waiting on the try backlog to clear to get to my jobs):
https://tbpl.mozilla.org/?tree=Ash&jobname=WINNT 6.2 ash opt test&rev=685ffafc35cf
https://tbpl.mozilla.org/?tree=Cedar&jobname=WINNT%206.2.*opt&rev=d0a03ae4832f
Updated • 11 years ago
Blocks: t-w864-ix-062
Updated • 11 years ago
Depends on: t-w864-ix-030
Updated • 11 years ago
Blocks: t-w864-ix-030
No longer depends on: t-w864-ix-030
Reporter | Comment 60 • 11 years ago
It seems that we're out of the woods.
However, this machine (even though it had your new starttalos.bat) still had the screen-resolution code running in the foreground.
It burned a job. I closed cmd and disabled the machine for when you have time to look at it.
Attachment #8414979 - Flags: feedback?(q)
Reporter | Updated • 11 years ago
Blocks: t-w864-ix-063
Reporter | Updated • 11 years ago
Blocks: t-w864-ix-083
Comment 61 • 11 years ago
t-w864-ix-117 is a bit suspicious, since the screenshot from a mochitest-5 timeout shows the Start screen, but I left it enabled to see what else it could manage to break.
Assignee | Comment 62 • 11 years ago
083 looks like it had an error from before the starttalos change. A reboot brought the box back.
Reporter | Comment 63 • 11 years ago
t-w864-ix-117 seems fine from looking at it with VNC.
t-w864-ix-070 and t-w864-ix-077 are not OK. I will reboot them once more.
Re-enabled and rebooted:
t-w864-ix-015
t-w864-ix-030
t-w864-ix-042
t-w864-ix-062
t-w864-ix-063
t-w864-ix-083
I will watch them.
Comment 64 • 11 years ago
t-w864-ix-063 still has webgl bustage, https://tbpl.mozilla.org/php/getParsedLog.php?id=38815611&tree=Mozilla-Aurora, redisabled.
Updated • 11 years ago
Blocks: t-w864-ix-018
Reporter | Comment 65 • 11 years ago
Ryan and philor have disabled 063 and 018.
Ryan says that we're actually seeing the issue on machines at random.
Comment 66 • 11 years ago
We've seen the same timeouts on at least 6 or 7 different slaves in the last hour or so. Looking at slave health, they appear to recover OK after rebooting.
Comment 67 • 11 years ago
The screenshot thing may be unrelated, since instances of bug 632290 on April 12th and 17th show the same screenshot of the Start screen, but given that we had one instance of that failure last week on Linux and six on Win8 since last night, it seems related.
Reporter | Comment 68 • 11 years ago
Q: can we put the call to fakemon.vbs outside of starttalos.bat the way it was? Or maybe remove the 2nd call?
Reporter | Comment 69 • 11 years ago
FYI, I spoke with Q and we decided not to make the change I asked about in comment 68.
I believe the extra failures came from some machines I should not have put back into the pool (020/021), which reintroduced issues unique to them.
It seems that we're OK now. philor/RyanVM let me know if not.
I'm going to go over any stragglers and file a follow up bug.
Reporter | Comment 70 • 11 years ago
Done with the update.
I filed bug 1004813 for stragglers.
No longer blocks: t-w864-ix-083, t-w864-ix-063, t-w864-ix-018, 974684, 1004813, t-w864-ix-062, t-w864-ix-012, t-w864-ix-030, 988012, t-w864-ix-009, t-w864-ix-038, t-w864-ix-014, t-w864-ix-011, t-w864-ix-015, t-w864-ix-010, t-w864-ix-013
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Reporter | Updated • 11 years ago
Attachment #8414979 - Flags: feedback?(q)