Closed
Bug 1026870
Opened 10 years ago
Closed 10 years ago
Something wrong with windows build slaves ("LINK : fatal error LNK1123: failure during conversion to COFF: file invalid or corrupt")
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: glandium, Assigned: q)
References
Details
(Keywords: intermittent-failure, Whiteboard: [release-impacting])
See https://tbpl.mozilla.org/?tree=Try&jobname=win&rev=c6fe0466209b for example. It's a try with sccache disabled (to rule it out), and exhibiting the same problem as other older builds on those slaves.
The error is:
LINK : fatal error LNK1123: failure during conversion to COFF: file invalid or corrupt
usually when linking ICU, but it also happens when linking something from the crash reporter on some other builds.
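One way to spot affected jobs is to scan build logs for the error string quoted above. The following is only a minimal Python sketch, not part of the actual automation: the name `find_lnk1123` is hypothetical, the pattern is based solely on the one message shown here, and the sample log lines other than the error itself are made up for illustration.

```python
import re

# Pattern derived from the single error message quoted in this bug;
# it is an assumption, not taken from any linker specification.
LNK1123 = re.compile(r"LINK\s*:\s*fatal error LNK1123\b")


def find_lnk1123(log_lines):
    """Return (line_number, text) pairs for every LNK1123 failure in a log."""
    return [
        (number, line.rstrip())
        for number, line in enumerate(log_lines, start=1)
        if LNK1123.search(line)
    ]


if __name__ == "__main__":
    # Illustrative log: only the LINK line comes from the bug report.
    sample = [
        "some earlier build output",
        "LINK : fatal error LNK1123: failure during conversion to COFF: "
        "file invalid or corrupt",
        "some later build output",
    ]
    for number, text in find_lnk1123(sample):
        print(number, text)
```

Grouping the hits by the slave that ran each job would then show whether failures cluster on particular machines.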
Apparently, only the slaves between 20 and 29 are affected. I haven't observed failures on all of them, because not all of them have built something recently, but all the slaves that do fail to build are in that range.
It started happening today or yesterday.
Related information about this error code:
http://msdn.microsoft.com/en-us/library/7dz62kfh.aspx
http://support.microsoft.com/kb/2757355
Comment 1•10 years ago
b-2008-ix-0020 through b-2008-ix-0029 are disabled in slavealloc.
Comment 2•10 years ago
so these slaves do still seem to be using vs2010, as you can see from lines like /c/PROGRA~2/MICROS~2.0. On the slave, this shortname still points to:
/c/Program\ Files\ \(x86\)/Microsoft\ Visual\ Studio\ 10.0/
so it is not using the junction set up at /c/tools/vs2013/.
I am not sure why this is failing here. On my testing slave I was able to build this builder on m-c(not try) against vs2010 while vs2013 was also installed on the machine: http://people.mozilla.org/~jlund/vs2010-mozilla-central-winxp-opt.log
Comment 3•10 years ago
the fallout of burning these jobs falls on my head.
Regardless of my test slave not hitting this for whatever reason, it looks like this is a common issue. The resolution seems to be to install Visual Studio 2010 SP1, as mentioned here[1] and also here[2].
pinging Q and dmajor for relops and dev perspective - Is it possible to update our machines' VS2010 install to SP1, or should we re-image these hosts and look at coming up with another alternative?
[1] - from comment 0 - http://support.microsoft.com/kb/2757355
[2] - http://stackoverflow.com/questions/10888391/error-link-fatal-error-lnk1123-failure-during-conversion-to-coff-file-inval
Flags: needinfo?(q)
Flags: needinfo?(dmajor)
Comment 4•10 years ago
From stack overflow: "Note that installing VS 2010 SP1 will remove the 64-bit compilers. You need to install the VS 2010 SP1 compiler pack to get them back."
Reporter | ||
Comment 5•10 years ago
You should check directly on one of those problematic slaves, and check the first microsoft link in comment #0 before considering upgrading MSVC.
Assignee | ||
Comment 6•10 years ago
I will need to follow up on this one with my team, as we applied the same GPO here as we did on the test slave. If SP1 does not fix this immediately, we will re-image the machines, and then I think we should pull one and test against it.
Flags: needinfo?(q)
Reporter | ||
Comment 7•10 years ago
I'd rather you do it the other way around. Installing SP1 means changing the compiler. This can have all sorts of effects, like sucking more memory to do PGO linkage, or miscompiling, and that shouldn't be done lightly.
Comment 8•10 years ago
SP1 involves a change to the CRT for Firefox releases, and we should avoid that.
Comment 9•10 years ago
OK I've asked to revert the changes I requested: https://bugzilla.mozilla.org/show_bug.cgi?id=1019165#c12
re-imaging those machines sounds like the best course of action for now.
Flags: needinfo?(dmajor)
Assignee | ||
Comment 10•10 years ago
Postmortem shows this:
VS 2010 SP1 and the compiler pack were installed in conjunction with VS 2012 for initial testing on the test machine. We were planning on rolling it out with VS 2012 per releng instructions. Then we (Releng/Relops) made the decision to skip 2012 and go to 2013, and the 2010 patch step was lost, since the test machine already had it. To move forward we will need 2010 SP1 and the compiler reinstalled before going to 2013.
Just in case, I will get a 2010 SP1/compiler pack installer GPO ready, but I will need confirmation that we can run with 2010 SP1 on the builders.
Assignee | ||
Comment 11•10 years ago
The affected builders are being re-imaged now
Reporter | ||
Comment 12•10 years ago
(In reply to Q from comment #10)
> Just in case I will get a 2010 sp1/compiler pack installer GPO ready but I
> will need confirmation that we can run with 2010 SP1 on the builders
See comment 7 and comment 8.
Assignee | ||
Comment 13•10 years ago
Thanks I missed comment 8
Assignee | ||
Comment 15•10 years ago
All slaves re-imaged and confirmed except 0023 which was slow due to a disk check being forced.
Assignee | ||
Comment 16•10 years ago
all slaves done.
Assignee | ||
Comment 17•10 years ago
Who is in charge of getting these back in?
Assignee: nobody → q
Flags: needinfo?(mh+mozilla)
Comment 18•10 years ago
That would be the lucky buildduty, aka jlund.
Component: General Automation → Buildduty
Flags: needinfo?(mh+mozilla)
QA Contact: catlee → bugspam.Callek
Comment 19•10 years ago
I didn't get around to this today. I will add them first thing in the morning. Win builders are far from our worst wait_times so this should be fine.
Comment 20•10 years ago
Also WRT actually installing vs2013, it looks like there are two options:
1) we have two separate pools of windows build machines. 1 for vs2010 and 1 for vs2013 where we slowly fade out the former.
2) we install VS2010 SP1 and VS2013 on all our build machines.
I would like to weigh out the pros/cons of both options in a discussion with folks more knowledgeable than myself.
Yes, the CRT will change, and we will have to measure performance in all our builders to check for diffs in things like PGO. This could be bad. But I'd like to discuss how bad, and is it worse than dividing up our win pools. Dividing up is not very optimal from Mozilla's release engineering side of things and will come with its own consequences.
Armen - I believe you have done some work on investigating vs2010 SP1 before, maybe with the vs2012 work Q mentioned here: https://bugzilla.mozilla.org/show_bug.cgi?id=1026870#c10 Do you have any input?
bhearsum - I heard rumors that you used to do all the 'windows' stuff before jhopkins and armen. any thoughts?
maybe a quick group meeting with bsmedberg and/or glandium to sort this out would be best?
Flags: needinfo?(bhearsum)
Flags: needinfo?(armenzg)
Comment 21•10 years ago
(In reply to Jordan Lund (:jlund) from comment #20)
> Also WRT actually installing vs2013, it looks like there are two options:
>
> 1) we have two separate pools of windows build machines. 1 for vs2010 and 1
> for vs2013 where we slowly fade out the former.
>
> 2) we install VS2010 SP1 and VS2013 on all our build machines.
>
> I would like to weigh out the pros/cons of both options in a discussion with
> folks more knowledgeable than myself.
>
> Yes, the CRT will change, and we will have to measure performance in all our
> builders to check for diffs in things like PGO. This could be bad. But I'd
> like to discuss how bad, and is it worse than dividing up our win pools.
> Dividing up is not very optimal from Mozilla's release engineering side of
> things and will come with its own consequences.
I'm extremely wary of changing anything in the toolchain for Beta, Release, and ESR. However, it *is* early in a Beta cycle, so we'd have lots of time to prove out the change with Beta users. This option would require RelMan sign off for sure because of the risk involved.
I haven't been following this bug until now - I'm assuming the CRT only changes for option #2. If so, can I ask why it must change? We already set PATH/LIB/etc. for our compiler - can we not pick up the old CRT by setting one of those? As far as I know, we've done all of our compiler upgrades without dividing the pool, so I'd like to understand what's special about this one. E.g.: bug 563318 was the tracker for 2010. bug 563317 shows us installing it on existing machines.
The other downside to dividing up the pool is that we end up with worse machine utilization in both pools, which reduces our throughput for changes overall.
Flags: needinfo?(bhearsum)
Comment 22•10 years ago
I think dividing the pool is going to risk us missing all sorts of weirdness that we don't initially associate with a certain set of machines, and instead think is some sporadic intermittent issue.
Reporter | ||
Comment 23•10 years ago
How about option 3: look into http://msdn.microsoft.com/en-us/library/7dz62kfh.aspx and see if it's not possible to *not* install SP1.
Reporter | ||
Comment 24•10 years ago
(In reply to Ben Hearsum [:bhearsum] from comment #21)
> as far as I know, we've done all of our compiler upgrades
> without dividing the pool, so I'd like to understand what's special about
> this one.
AIUI, VS2012/2013 installs a CVTRES.EXE that is not compatible with VS2010, and, it looks like it puts it in some shared location, overwriting VS2010's, or in a directory that has precedence in $PATH. I hope it's the latter.
As to why that didn't happen when we switched to VS2010: in all likelihood, VS2010 wasn't installing a CVTRES.EXE that is not compatible with VS2005/VS2008.
Comment 25•10 years ago
As Ed and Ben mention, we should not have divided pools unless we have to.
jlund, with regard to previous knowledge, I have only one input: order matters with Windows, especially when talking about Visual Studio. Installers and uninstallers can do things we would not expect them to.
Another tip: do not try to install in custom places unless you have to. I once had a variable that you could set for a specific path; however, not every component of VS would install there, as they had other secret environment variables.
Best of luck.
PS = Don't call Ben a Windows expert or he'll come at you :P
Flags: needinfo?(armenzg)
Comment 26•10 years ago
With only MSVC2013 installed, cvtres.exe is installed to c:\Program Files (x86)/Microsoft Visual Studio 12.0/VC/BIN/cvtres.exe
If cvtres.exe is indeed the problem, we should check and see which version is on the PATH currently, and which version ends up on the PATH in the "bad" case, and hopefully we can just munge the PATH to get the right thing on top in each case.
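The check described above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: `find_on_path` and `prefer_directory` are hypothetical helper names, not part of any existing build tooling, and directories are compared as plain strings.

```python
import os


def find_on_path(tool, path=None):
    """List every copy of `tool` found on PATH, in lookup order."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if not directory or not os.path.isdir(directory):
            continue
        try:
            names = os.listdir(directory)
        except OSError:
            continue  # unreadable directory: skip it
        for name in names:
            # Windows file names are case-insensitive, so compare lowercased.
            if name.lower() == tool.lower():
                hits.append(os.path.join(directory, name))
    return hits


def prefer_directory(preferred, path=None):
    """Return a PATH string with `preferred` moved (or added) to the front."""
    if path is None:
        path = os.environ.get("PATH", "")
    # Assumption: exact string comparison is enough to spot duplicates.
    rest = [d for d in path.split(os.pathsep) if d and d != preferred]
    return os.pathsep.join([preferred] + rest)


if __name__ == "__main__":
    # The first entry printed is the copy a build would pick up.
    for hit in find_on_path("cvtres.exe"):
        print(hit)
```

Running `find_on_path("cvtres.exe")` on a good slave and on a failing one would show whether the two pick up different copies; `prefer_directory` then builds a PATH that puts the wanted directory on top.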
Comment 27•10 years ago
it sounds like bsmedberg and glandium's suggestion (option 3) is worth pursuing.
I've filed to get another test machine but with vs2010 (the non sp1 version) and vs2013 installed on it: bug 1027745
I am closing this bug for now as this was a buildduty infra error bug which has been resolved. tracking getting the 10 re-imaged machines back into production will happen in the original rollout bug: Bug 1019165
overall goal of vs2013 in prod is still tracked here: Bug 1009807 - Figure out the correct path setup and mozconfigs for automation-driven MSVC2013 builds
ty all for your aid thus far.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 28•10 years ago
(In reply to Benjamin Smedberg [:bsmedberg] from comment #26)
> With only MSVC2013 installed, cvtres.exe is installed to c:\Program Files
> (x86)/Microsoft Visual Studio 12.0/VC/BIN/cvtres.exe
>
> If cvtres.exe is indeed the problem, we should check and see which version
> is on the PATH currently, and which version ends up on the PATH in the "bad"
> case, and hopefully we can just munge the PATH to get the right thing on top
> in each case.
And even if cvtres.exe is overwritten by default, we could manually copy the older version somewhere and point PATH at it where needed, I think.
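A rough Python sketch of that idea, assuming a known-good cvtres.exe has been saved somewhere beforehand. The name `pin_tool` and the directory layout are hypothetical, and nothing in this bug establishes that copying the executable alone is sufficient.

```python
import os
import shutil


def pin_tool(known_good, pinned_dir, env=None):
    """Copy `known_good` into `pinned_dir` and return a new environment
    whose PATH starts with that directory.

    The caller's environment mapping is not modified.
    """
    env = dict(os.environ if env is None else env)
    os.makedirs(pinned_dir, exist_ok=True)
    target = os.path.join(pinned_dir, os.path.basename(known_good))
    shutil.copy2(known_good, target)  # keeps timestamps for later auditing
    # Prepend the pinned directory so its copy wins the PATH lookup.
    env["PATH"] = os.pathsep.join(
        part for part in (pinned_dir, env.get("PATH", "")) if part
    )
    return env
```

The returned mapping could be passed as the `env` argument of `subprocess.run` for the builds that need the older tool, leaving the machine-wide PATH untouched.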
Comment 29•10 years ago
Some potential workarounds: http://social.msdn.microsoft.com/Forums/vstudio/en-US/d10adba0-e082-494a-bb16-2bfc039faa80/vs2012-rc-installation-breaks-vs2010-c-projects?forum=vssetup
It seems the root cause is a .NET DLL dependency, which people have worked around by overwriting cvtres, getting a different one onto PATH, or tinkering with the machine's .NET installation.
(I don't see a way to link to specific comments, but you can search for "Proposed as answer")
Comment 30•10 years ago
Tweaking summary to make the issue easier to find, particularly given we're unfortunately seeing it again now that bug 1019165 has started rolling out again.
Summary: Something wrong with b-2008-ix-002x slaves → Something wrong with b-2008-ix-002x slaves ("LINK : fatal error LNK1123: failure during conversion to COFF: file invalid or corrupt")
Comment 34•10 years ago
I filed bug 1049794 for b-2008-ix-0120 and disabled it in slavealloc.
Assignee | ||
Comment 36•10 years ago
This was due to a VS 2013 push to a machine with an active job. Markco, can you make sure these are disabled and jobs are DONE BEFORE deploying 2013?
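The precondition asked for here (slave disabled, no job still running) can be written down as a simple guard. This is a minimal Python sketch only: the `Slave` record and its fields are hypothetical and do not reflect how slavealloc or the deployment tooling actually exposes state, and the states in the demo are made up.

```python
from dataclasses import dataclass


@dataclass
class Slave:
    name: str
    enabled: bool      # hypothetical: True if the slave may take new jobs
    active_jobs: int   # hypothetical: number of jobs currently running


def safe_to_deploy(slave):
    """A slave qualifies only when it is disabled and idle."""
    return not slave.enabled and slave.active_jobs == 0


def split_for_deploy(slaves):
    """Return (ready, blocked) lists of slave names."""
    ready = [s.name for s in slaves if safe_to_deploy(s)]
    blocked = [s.name for s in slaves if not safe_to_deploy(s)]
    return ready, blocked


if __name__ == "__main__":
    # Made-up states for illustration only.
    pool = [
        Slave("b-2008-ix-0120", enabled=False, active_jobs=0),
        Slave("b-2008-ix-0121", enabled=False, active_jobs=1),
        Slave("b-2008-ix-0122", enabled=True, active_jobs=0),
    ]
    print(split_for_deploy(pool))
```

Deploying only to the `ready` list and re-checking the `blocked` list later avoids pushing an install onto a machine in the middle of a build.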
Flags: needinfo?(mcornmesser)
Flags: needinfo?(arich)
Updated•10 years ago
Flags: needinfo?(mcornmesser)
Updated•10 years ago
Flags: needinfo?(arich)
Updated•10 years ago
Depends on: b-2008-sm-0050
Comment 50•10 years ago
reopening, as:
a) 13 reports by sheriffs since closed
b) lots of issues hit in bug 1057549 which point to very slow machines
(b) hit a number of release builds this cycle.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [release-impacting]
Updated•10 years ago
Summary: Something wrong with b-2008-ix-002x slaves ("LINK : fatal error LNK1123: failure during conversion to COFF: file invalid or corrupt") → Something wrong with windows build slaves ("LINK : fatal error LNK1123: failure during conversion to COFF: file invalid or corrupt")
Comment 51•10 years ago
I have a suspicion: both this issue and bug 1057229 comment 5 could be explained by some builders picking up an incomplete VS2013 setup (i.e. without the cvtres fixup, or without Update 3). I don't know how that might happen, though.
Comment 52•10 years ago
Could the build slave become active before the GPO for VS2013 is fully applied? I assume it's a long install.
Assignee | ||
Comment 53•10 years ago
This is being addressed by https://bugzilla.mozilla.org/show_bug.cgi?id=1063372 and https://bugzilla.mozilla.org/show_bug.cgi?id=1063018
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Comment 54•10 years ago
We're still seeing these failures pretty regularly, especially on the release branches.
https://treeherder.mozilla.org/ui/logviewer.html#?job_id=72561&repo=mozilla-b2g32_v2_0
Comment 55•10 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #54)
> We're still seeing these failures pretty regularly, especially on the
> release branches.
> https://treeherder.mozilla.org/ui/logviewer.html#?job_id=72561&repo=mozilla-
> b2g32_v2_0
ni: q: this issue just doesn't want to go away! :) the job Ryan is pointing to was against b-2008-ix-0005 @ Tue Dec 16 15:13:49 2014. is it possible this slave was recently imaged or had an incomplete gpo?
Flags: needinfo?(q)
Assignee | ||
Comment 56•10 years ago
HMM, well, we are no longer applying 2013 from GPO, only in the base image, so in theory it can no longer get an incomplete install. I am doing a full audit to make sure that is 100% true.
Flags: needinfo?(q)
Comment 60•10 years ago
(In reply to Q from comment #56)
> HMM well we are no longer applying 2013 from GPO only in the base image so
> it in theory can no longer get an incomplete install. I am doing a full
> audit to make sure that is 100% true
hmm, 008, 001, and 005 (twice) have recently hit this. aside from this error, sheriffs are also seeing timeouts, essentially https://bugzil.la/1055876 again. Since this bug and 1055876 were related last time, I wonder if there is something similar at play, granted this isn't part of GPO anymore.
sheriffs are now being forced to disable slaves that hit this, to (a) stress the importance of this and (b) show the scope of what slaves are 'bad'.
Q, how did the audit go? would you be able to spend some brain cycles assisting me with this? I'm free most this week. thanks for bearing with me :)
Flags: needinfo?(q)
Assignee | ||
Comment 61•10 years ago
Couldn't find anything obvious; the gpo isn't at play anywhere. I have set time aside tomorrow to slog through this one. Maybe we can do some vidyo time after I do some more digging in the morning?
Flags: needinfo?(q)
Assignee | ||
Comment 62•10 years ago
Do these errors happen to start when we started installing the new HG version?
Assignee | ||
Comment 63•10 years ago
HG started rolling out on December 11th 2014
Comment 64•10 years ago
|
> HG started rolling out on December 11th 2014
TBH - I am not sure. Ryan made a comment on Dec 17th after a 3 month gap saying that we are still seeing this bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1026870#c54
Ryan, is it possible that this only started acting up again around the 11th?
Status: RESOLVED → REOPENED
Flags: needinfo?(ryanvm)
Resolution: FIXED → ---
Updated•10 years ago
Blocks: b-2008-ix-0140
Updated•10 years ago
Blocks: b-2008-ix-0008
Comment 67•10 years ago
(In reply to Jordan Lund (:jlund) PTO till Jan 14th from comment #64)
> Ryan, is it possible that this only started acting up again around the 11th?
Could be, can't say I remember at this point.
Flags: needinfo?(ryanvm)
Updated•10 years ago
Blocks: b-2008-ix-0127
Updated•10 years ago
Blocks: b-2008-ix-0116
Updated•10 years ago
Blocks: b-2008-ix-0078
Comment 71•10 years ago
Unfortunately, looking a whole lot like "anything that has been reimaged in the last month or so."
Updated•10 years ago
Blocks: b-2008-ix-0002
Updated•10 years ago
Blocks: b-2008-ix-0005
Updated•10 years ago
Blocks: b-2008-ix-0004
Updated•10 years ago
Blocks: b-2008-ix-0003
Assignee | ||
Comment 72•10 years ago
Per discussions, it is possible that the HG install, which includes parts of the VS redistributable, may be causing the problem. The fix may be correcting the system PATH or investigating the overlap. There are cases of machines failing that have not been re-imaged.
Comment 75•10 years ago
In bug 1117900 comment 20, Ryan found a builder with an old VS install. Relevant?
Assignee | ||
Comment 76•10 years ago
I was hoping it was; however, it doesn't seem likely, as we have found machines reporting the correct VS version with the issue. The machines with the wrong version have so far been tracked to testing machines that did not get re-imaged.
Assignee | ||
Comment 77•10 years ago
Since it was brought up in IRC today: the three-month gap is concerning in that nothing we can find changed in the build or re-image process, and this seems to happen on machines that have and have not been re-imaged in the failure window. However, some GPO changes were made around the time the failures started, for git and HG installs.
Updated•10 years ago
Blocks: b-2008-ix-0084, b-2008-ix-0167
Updated•10 years ago
Blocks: b-2008-ix-0133
Comment 78•10 years ago
I’ve started reimaging the dependent machines in batches of 5, with an hour wait between batches.
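The batching described here (5 machines at a time, an hour between batches) might look roughly like this in Python. It is a sketch under assumptions: `reimage` stands in for whatever actually triggers a re-image, which this bug does not describe, and the host names in the demo are placeholders.

```python
import time


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def reimage_in_batches(hosts, reimage, size=5, wait_seconds=3600,
                       sleep=time.sleep):
    """Call `reimage(host)` for each host, pausing between batches.

    `reimage` is a caller-supplied callable (hypothetical here).
    No pause is made after the final batch.
    """
    groups = list(batches(hosts, size))
    for index, group in enumerate(groups):
        for host in group:
            reimage(host)
        if index < len(groups) - 1:
            sleep(wait_seconds)


if __name__ == "__main__":
    # Dry run: print instead of re-imaging or sleeping.
    hosts = ["host-%02d" % n for n in range(1, 13)]  # placeholder names
    reimage_in_batches(
        hosts,
        reimage=lambda host: print("reimage", host),
        sleep=lambda seconds: print("wait", seconds, "seconds"),
    )
```

Passing `sleep` as a parameter keeps the pacing testable without actually waiting an hour.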
Comment 79•10 years ago
(In reply to Chris Cooper [:coop] from comment #78)
> I’ve started reimaging the dependent machines in batches of 5, with an hour
> wait between batches.
These are all re-imaged and re-enabled now.
Assignee | ||
Comment 87•10 years ago
I think this might be fixed with GPO updates discussed in IRC (will transcribe here when I get to a non-laptop keyboard). We haven't seen any new reports since 01/09. Is there a way to confirm the fix?
Flags: needinfo?(ryanvm)
Flags: needinfo?(coop)
Comment 88•10 years ago
I think that the lack of new slave disablings or reports in this bug since is a positive sign :) Let's give it a week or so and resolve the bug if things look good?
Flags: needinfo?(ryanvm)
Comment 89•10 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #88)
> I think that the lack of new slave disablings or reports in this bug since
> is a positive sign :) Let's give it a week or so and resolve the bug if
> things look good?
Sounds good to me.
Flags: needinfo?(coop)
Comment 90•10 years ago
apologies, I was on PTO.
Thanks Q for implementing a likely fix.
Updated•10 years ago
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Updated•9 years ago
Blocks: b-2008-ix-0023, b-2008-ix-0022, b-2008-ix-0025, b-2008-ix-0026, b-2008-ix-0020, b-2008-ix-0021, b-2008-ix-0024, b-2008-ix-0027, b-2008-ix-0028, b-2008-ix-0029, b-2008-sm-0050
No longer depends on: b-2008-ix-0023, b-2008-ix-0022, b-2008-ix-0025, b-2008-ix-0026, b-2008-ix-0029, b-2008-ix-0020, b-2008-ix-0021, b-2008-ix-0024, b-2008-ix-0027, b-2008-ix-0028, b-2008-sm-0050
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard