Closed Bug 977615 Opened 11 years ago Closed 11 years ago

Upgrade Windows 8 machines' Nvidia drivers to 335.23

Categories

(Infrastructure & Operations :: RelOps: General, task)

Platform: x86_64
OS: Windows 8
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: q)

References

Details

Attachments

(5 files, 1 obsolete file)

We're going to need to upgrade the drivers to help with bug 938395. I will run a machine through staging first.
OS: Linux → Windows 8
Summary: Upgrade one Windows 8 machine's Nvidia drivers to 334.89 → Upgrade Windows 8 machines' Nvidia drivers to 334.89
Blocks: 974684
Assignee: relops → q
Nvidia has released GeForce 335.23 WHQL drivers; can this driver be installed instead? http://www.nvidia.com/download/driverResults.aspx/73780/en-us
I'm now testing the machine on staging and will have the results soon. When could we find time to test the installation on a machine and check that the graphics update service is still disabled? http://us.download.nvidia.com/Windows/335.23/335.23-desktop-win8-win7-winvista-64bit-english-whql.exe
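(For illustration only, not the actual staging tooling: a minimal Python sketch of the kind of check described above, reading the reported display driver version via wmic and confirming a service is disabled via sc. The service name "nvUpdatusService" is an assumption, not something confirmed in this bug.)

```python
# Hedged sketch: report the installed display driver version and whether the
# NVIDIA update service is disabled. The service name is an assumption.
import subprocess

def video_driver_versions():
    out = subprocess.check_output(
        ["wmic", "path", "win32_VideoController", "get", "DriverVersion"]
    ).decode(errors="replace")
    # First line is the column header; the rest are version strings.
    return [line.strip() for line in out.splitlines()[1:] if line.strip()]

def service_is_disabled(name):
    # "sc qc <service>" prints a START_TYPE line containing DISABLED for
    # services whose start type is 4 (disabled).
    out = subprocess.check_output(["sc", "qc", name]).decode(errors="replace")
    return "DISABLED" in out

if __name__ == "__main__":
    print("Driver version(s):", video_driver_versions())
    print("Update service disabled:", service_is_disabled("nvUpdatusService"))
```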
We should have all results by tomorrow morning.
All jobs passed. Q: when can this fit into your schedule? Thanks!
Flags: needinfo?(q)
I would recommend that we wait to make any major changes to production until after Pwn2Own.
For sure, we'll time the deployment after it.
After the release that fixes it goes out.
Q: any news on when this could be done?
Let's aim for Monday. Does that work for you?
Flags: needinfo?(q)
Flags: needinfo?(armenzg)
Monday wfm! Thank you Q! Is there an easy way to roll back in case anything goes wrong? I don't think we will need it, but we should try to test it once. jrmuizel: we might also need your experience if we need to disable a patch temporarily. We got a green run on staging, but the greenness could change.
Flags: needinfo?(armenzg)
I'm ready whenever you are. Thanks!
Sorry this didn't go out yesterday; I was testing rollback options, and we have them. I will work on getting this out today. Q
For CCed people: no action required; adding you as an FYI. We're upgrading the Nvidia drivers on Win8 machines. The change has been tested on staging.
This was rolled back to figure out what caused the failures that closed the trees.
Q, could you please deploy the change to t-w864-ix-009? I would like to test it on my staging master. On another note, I'm putting t-w864-ix-042 back on staging to see why it didn't catch the issues from bug 988012.
will do
Q, can we please use 335.23 instead of 334.89?
Summary: Upgrade Windows 8 machines' Nvidia drivers to 334.89 → Upgrade Windows 8 machines' Nvidia drivers to 335.23
Targeting only t-w864-ix-009 with the update. Rebooting now to pick it up with driver version 335.23.
Attached image t-w864-ix-009 has graphic issues (deleted) —
I think there's something funky with 009's graphical setup. Q: can you have a look at it when you have time? Could you also install the newer graphics driver on 042? Thanks for your help! On another note, my testing on staging was pretty much invalid: GPO reverted my manual installation between reboots. This means that I was testing the older setup, and that is why I got everything to be green. Yay me!
009 will only get the new driver (it is now deployed via GPO). I ran c:\monitor_config\fakemon.vbs (it gets run on startup) and things look good. You should be good to test that machine. I will address 042 right now.
Everything is running green so far on 009. The driver version seems to be the right one (it is still on 335.23). Could it be that we need a reboot after the driver install? Maybe the tests fail on the first run after the driver upgrade? I should run tests for mozilla-beta since that is what the logs from bug 988012 list.
Matt, Jeff: I've attached the logs of the jobs that failed after running with the newer Nvidia driver. Could you please have a look at them and find a way to make them green? Once we have solutions for this, we can move forward with the deployment (I will need to do another run across various branches).
Are you assuming that your problems with 009 were different from the problems in production? I don't think there's any reason to believe they were different; my memory of the failures makes "tiny resolution and no acceleration" a perfect fit.
It seems that we might have two issues:
* the installation can get us into a bad state with no acceleration and a tiny screen resolution
* after the upgrade, we have some perma-orange failures on various branches

Matt, Jeff: when do you have time to look at the failures? Does the following plan make sense? I can think of this path forward:
1) We can run a machine through all the release branches: m-c, m-a, m-b, m-r & esr24
1.1) File bugs for each failing test; ask the original authors to help fix them or disable them
1.2) Loan you a machine and ask
2) Once we deploy this again, we can quickly disable any machines that get a bad installation and fix them up
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(jmuizelaar)
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #26)
> It seems that we might have two issues:
> * the installation can get us into a bad state with no acceleration and tiny
> screen resolution
> * after upgrade, we have some perma orange failures on various branches
>
> Matt, Jeff: when do you have time to look at the failures? Does the
> following plan make sense?

Sounds reasonable to me.
Flags: needinfo?(jmuizelaar)
Yes, sounds good to me too. Jeff and I looked at the logs last week; nothing stood out as obviously graphics-related. It wasn't clear how the driver upgrade could have caused them.
Flags: needinfo?(matt.woodrow)
Jeff, Matt: would you mind preparing a patch to disable the perma-failure on beta?

Q: do you have any trick to see if the installations failed on some of the machines? Any ideas on how to prevent machines from taking jobs if they are not ready to take jobs (graphically speaking)? Anything I can check on the machine? I'm hoping to look into the machines that failed last time (by looking at Windows logs) to see if anything failed during the installation.

I triggered jobs on mozilla-aurora and they all came out clean. I've triggered jobs on mozilla-release and mozilla-esr24 and we should know by tomorrow. I assume they might share the perma-failure of mozilla-beta.

It seems that "WINNT 6.2 mozilla-central debug test mochitest-browser-chrome" was intermittent. It failed 2/5 times *and* the test failures from each run were different. "WINNT 6.2 mozilla-beta opt test mochitest-browser-chrome" has failed 3/3 times the same way [1].

[1] These errors and similar:
17:40:04 WARNING - TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/browser/components/customizableui/test/browser_876944_customize_mode_create_destroy.js | The number of placeholders should be correct. - Got 2, expected 1
17:40:09 WARNING - TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/browser/components/customizableui/test/browser_880382_drag_wide_widgets_in_panel.js | Area PanelUI-contents should have 13 items. - Got 12, expected 13
17:40:16 WARNING - TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/browser/components/customizableui/test/browser_890140_orphaned_placeholders.js | Should no longer be in default state.
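(Editorial aside, a hedged sketch rather than part of this bug's tooling: given plain-text logs like the excerpt above, the failing test names can be compared across runs to tell a perma-failure from an intermittent. The log file names below are hypothetical.)

```python
# Hedged sketch: pull TEST-UNEXPECTED-FAIL test paths out of mozharness-style
# logs and compare runs; the file names below are hypothetical.
import re

FAIL_RE = re.compile(r"TEST-UNEXPECTED-FAIL \| (\S+) \|")

def failing_tests(log_path):
    tests = set()
    with open(log_path, errors="replace") as f:
        for line in f:
            m = FAIL_RE.search(line)
            if m:
                tests.add(m.group(1))
    return tests

runs = ["run1.log", "run2.log", "run3.log"]  # hypothetical log files
failures = {path: failing_tests(path) for path in runs}

# A perma-failure shows up in every run; an intermittent only in some.
common = set.intersection(*failures.values())
print("Failing in every run:", sorted(common))
for path, tests in failures.items():
    print(path, "only:", sorted(tests - common))
```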
Any answers regarding the questions in comment 29? m-r and m-esr24 have come out clean. As of now, we only have a perma-orange on mozilla-beta. Q: what do you think about deploying the change again to five machines and keeping a close eye on them? I would like to see recent failures. Meanwhile, I will look at one of the machines that failed in the past: t-w864-ix-105
Flags: needinfo?(q)
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(jmuizelaar)
I think deploying to five machines is a great idea. Do you have candidate machines for me?
Flags: needinfo?(q)
Let's do 010 to 015. Thanks Q!
Q: if you could do it as early as possible either today or tomorrow, it would give me a better chance of keeping an eye on them. I will probably disable buildbot at the end of my day and re-enable it the following morning.
Attached patch Disable failing tests on beta (deleted) — Splinter Review
Flags: needinfo?(jmuizelaar)
Q and I will be deploying the change to five machines and keeping a close eye on them.
Flags: needinfo?(matt.woodrow)
10-15 were set to get the new drivers after reboot. I still haven't seen an install get pulled; I am going to make sure the change is getting picked up.
The install runs at boot and makes sure that runslave doesn't start. The install is still running on 14 and runslave was killed. Is it possible something rebooted this machine during the last install? The machine has only been up for 6 minutes. Q
The installer completed on 014, and it rebooted without my intervention. Checking on the resolution etc. now.
The last job that #14 took successfully finished at 4/23/2014, 3:16:44 PM. After that it should have rebooted (the last step in this job [1]). At that point, I assume it came back from the reboot and installed the drivers.

At Wed Apr 23 15:34:43 2014 it started the next job, and after 1 min 48 sec it failed (Wed Apr 23 15:36:32 2014) *without* rebooting. [3]

From your comment, who killed runslave? Did you mean that it was not running? ("...and runslave was killed.") FTR, I gracefully shut down buildbot sometime before my comment (15:50:02 PDT); taking that action should look like killing runslave.py when connected to the machine.

IIUC (correct me if I'm wrong), #14 took jobs even though the installation had not finished.

#######################
[1] http://buildbot-master110.srv.releng.scl3.mozilla.com:8201/builders/WINNT%206.2%20try%20opt%20test%20reftest-no-accel/builds/521
[2] http://buildbot-master110.srv.releng.scl3.mozilla.com:8201/builders/WINNT%206.2%20fx-team%20opt%20test%20mochitest-2/builds/251
[3]
15:35:27 INFO - #####
15:35:27 INFO - ##### Running clobber step.
15:35:27 INFO - #####
15:35:27 INFO - Running pre-action listener: _resource_record_pre_action
15:35:27 INFO - Running main action method: clobber
15:35:27 INFO - rmtree: C:\slave\test\build
15:35:27 INFO - Using _rmtree_windows ...
15:35:27 INFO - retry: Calling <bound method DesktopUnittest._rmtree_windows of <__main__.DesktopUnittest object at 0x025F9B30>> with args: ('C:\\slave\\test\\build',), kwargs: {}, attempt #1
remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
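(A hedged illustration of the race described above, a slave starting a job while the driver install is still running. This is not the actual install script; the marker file path and the uptime threshold are assumptions.)

```python
# Hedged sketch: refuse to let the slave take jobs if the machine rebooted
# very recently or a (hypothetical) driver-install marker file is still present.
import ctypes
import os
import sys

INSTALL_MARKER = r"C:\monitor_config\driver_install_in_progress"  # hypothetical
MIN_UPTIME_SECONDS = 10 * 60  # 014 had only been up ~6 minutes when it took a job

def uptime_seconds():
    # GetTickCount64 returns milliseconds since boot (Windows Vista and later).
    kernel32 = ctypes.windll.kernel32
    kernel32.GetTickCount64.restype = ctypes.c_ulonglong
    return kernel32.GetTickCount64() / 1000.0

if os.path.exists(INSTALL_MARKER) or uptime_seconds() < MIN_UPTIME_SECONDS:
    print("Driver install may still be in progress; not starting runslave.")
    sys.exit(1)

print("Safe to start runslave.")
```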
I think I might have an idea here let me check the install logs.
This was my fault. For some reason, the hostname filter that told GPO NOT to revert the changes back to the old driver was not taking effect. So the new driver was installing, then GPO was detecting the new driver and resetting the node, hence the kill behavior (we had this in place because tests would fail with the new driver). I removed the revert statements entirely and am retrying now.
Okay, 14 looks stable; it came up with the correct resolution and did not revert.
Yay! Jeff, Matt: I want to wait until next week so Q and I can coordinate further deployment (I'm on PTO until Monday). Please reach out to coop if this cannot wait.
Thanks Q. I'm happy to see this figured out.
Do we have any metrics on how these machines are doing?
https://tbpl.mozilla.org/php/getParsedLog.php?id=38670508&tree=Mozilla-Central is 015 failing the webgl mochitests, which fail (without hardware acceleration, at such a low resolution, what have you) when WebGL doesn't work on a slave.
Two of the others have done green mochitest-2 runs (though nobody has yet done either of my favorites, reftest or mochitest-1), so it may be that just 015 is broken.
https://tbpl.mozilla.org/php/getParsedLog.php?id=38674156&tree=Mozilla-Inbound is 015 failing reftest webgl tests. Disabled it in slavealloc.
Thank you philor. It seems that only 015 has given trouble so far. I sometimes feel that not being able to get a WebGL context should turn the job red. Anyway, let's look tomorrow at why 015 misbehaved. Let's hope the other ones stay put.

18:08:41 INFO - JavaScript warning: http://mochi.test:8888/tests/dom/imptests/html/webgl/common.js, line 3: WebGL: Can't get a usable WebGL context
18:43:29 INFO - JavaScript warning: file:///C:/slave/test/build/tests/reftest/tests/content/canvas/test/reftest/webgl-utils.js, line 59: WebGL: Can't get a usable WebGL context
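(A hedged sketch of turning that feeling into a check: scanning a log for the WebGL warning above and flagging the run. Nothing in this bug suggests such a check exists in the harness; the log file name is hypothetical.)

```python
# Hedged sketch: flag a test log whose run never got a usable WebGL context.
import sys

WEBGL_WARNING = "WebGL: Can't get a usable WebGL context"

def has_webgl_failure(log_path):
    with open(log_path, errors="replace") as f:
        return any(WEBGL_WARNING in line for line in f)

if __name__ == "__main__":
    log_file = sys.argv[1] if len(sys.argv) > 1 else "test.log"  # hypothetical
    if has_webgl_failure(log_file):
        print("No usable WebGL context on this slave; treat the run as suspect.")
        sys.exit(1)
    print("WebGL context looked fine in this log.")
```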
Attached image Capture1.PNG (obsolete) (deleted) —
Is this relevant? Should we re-image 015 and start again? On another note, could we deploy the change to machines 020 to 029? If machines 010 to 014 have been working, I'm confident those will work too. After that we should deploy across the pool and disable individual machines that get into the same state as 015.
Attachment #8414866 - Flags: feedback?(q)
We're going to deploy across the board. Deploying now rather than in the morning will be less disruptive if we have some machines that don't work. If this goes sideways, please use the escalation wiki.
Comment on attachment 8414866 [details] Capture1.PNG Irrelevant event. Obsoleting screenshot.
Attachment #8414866 - Attachment is obsolete: true
Attachment #8414866 - Flags: feedback?(q)
Email sent to sheriffs. This can be used to monitor the change: https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=WINNT%206.2 And maybe this: http://builddata.pub.build.mozilla.org/reports/pending/pending.html (last two diagrams)
It seems that this can kill the first job. Following jobs work as expected.
While the driver installation was going on, the script was killing any python process that started up; that caused the "blue" jobs. Once the installation finished, the machine would reboot and come back with the right driver. Unfortunately, some machines would end up in a messed-up device/screen-resolution state. The way fakemon.vbs runs, it can get in the way of buildbot (see [1]), and some jobs would come out orange. We have 005, which recovered from such a state; on the other hand, 004 got 3 orange jobs. The newer starttalos.bat that Q is deploying to the win8 machines will prevent machines like 004 from running.

[1] TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser/browser_details.js | Enable button should be visible
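(To illustrate the kind of guard the newer starttalos.bat adds, keeping a machine like 004 from taking jobs while its graphics state is bad, here is a hedged Python sketch. The expected resolution is an assumption, and the WMI-reported driver string for 335.23 may not literally contain "335.23", so the match below is illustrative only.)

```python
# Hedged sketch: pre-flight check that the screen resolution and GPU driver
# look sane before the slave is allowed to start. Expected values are assumed.
import ctypes
import subprocess
import sys

EXPECTED_RESOLUTION = (1600, 1200)   # assumption, not the production value
EXPECTED_DRIVER_SUBSTRING = "3523"   # assumption: WMI may report e.g. 9.18.13.3523

def screen_resolution():
    user32 = ctypes.windll.user32
    return user32.GetSystemMetrics(0), user32.GetSystemMetrics(1)

def driver_ok():
    out = subprocess.check_output(
        ["wmic", "path", "win32_VideoController", "get", "DriverVersion"]
    ).decode(errors="replace")
    return EXPECTED_DRIVER_SUBSTRING in out

if __name__ == "__main__":
    if screen_resolution() != EXPECTED_RESOLUTION or not driver_ok():
        print("Bad graphics state; refusing to start the slave.")
        sys.exit(1)
    print("Graphics state looks good; starting the slave.")
```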
So it seems that the new starttalos.bat is doing what we expected. Currently, there are no production win8 jobs to hit. There are a bunch of Try ones: https://tbpl.mozilla.org/?tree=Try&jobname=WINNT%206.2
I will be checking before 9pm and will let Q know by then if there's anything else we need to do. I hope to get more data points here later (I'm waiting on the try backlog to clear to get to my jobs):
https://tbpl.mozilla.org/?tree=Ash&jobname=WINNT 6.2 ash opt test&rev=685ffafc35cf
https://tbpl.mozilla.org/?tree=Cedar&jobname=WINNT%206.2.*opt&rev=d0a03ae4832f
No longer depends on: t-w864-ix-030
Attached image t-w864-ix-083 screenshot (deleted) —
It seems that we're out of the woods. However, this machine (even though it had your new starttalos.bat) had the screen-resolution code in the foreground. It burned a job. I closed cmd and disabled the machine for whenever you have time to look at it.
Attachment #8414979 - Flags: feedback?(q)
t-w864-ix-117 is a bit suspicious, since the screenshot from a mochitest-5 timeout shows the Start screen, but I left it enabled to see what else it could manage to break.
083 looks like it had an error from before the starttalos change. A reboot brought the box back.
t-w864-ix-117 seems fine from looking at it with VNC. t-w864-ix-070 and t-w864-ix-077 are not OK; I will reboot them once more.

Re-enabled and rebooted:
t-w864-ix-015
t-w864-ix-030
t-w864-ix-042
t-w864-ix-062
t-w864-ix-063
t-w864-ix-083

I will watch them.
Ryan and philor have disabled 063 and 018. Ryan says that we're actually seeing the issue on machines at random.
We've seen the same timeouts on at least 6 or 7 different slaves in the last hour or so. Looking at slave health, they appear to recover OK after rebooting.
The screenshot thing may be unrelated, since instances of bug 632290 on April 12th and 17th show the same screenshot of the Start screen, but given that we had one instance of that failure last week on Linux and we've had six, all on Win8, since last night, it seems related.
Q: can we put the call to fakemon.vbs outside of starttalos.bat, the way it was? Or maybe remove the 2nd call?
FYI, I spoke with Q and we decided not to make any further changes, including the one I asked about in comment 68. I believe the extra failures came from some machines I should not have put back into the pool (020/021), which brought back issues that only they showed. It seems that we're OK now. philor/RyanVM, let me know if not. I'm going to go over any stragglers and file a follow-up bug.
Blocks: 1004813
Done with the update. I filed bug 1004813 for stragglers.
Blocks: 974684
Depends on: 1003614
Attachment #8414979 - Flags: feedback?(q)
