Closed Bug 582335 Opened 14 years ago Closed 14 years ago

Win2k3 builds failing to link xul.dll: Not enough space for gklayout.lib

Categories

(Release Engineering :: General, defect, P1)

x86
Windows Server 2003
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: luke, Assigned: khuey)

References

Details

Attachments

(4 files)

I will bump this.
Assignee: nobody → armenzg
Priority: -- → P2
This is not a space issue. Something else is going on: the full log shows that we have enough free space.
* 114.44 GB of space available
* 116.65 GB of space available
I am clobbering and re-triggering. I will check in the morning to see what is going on.
Status: NEW → ASSIGNED
This could mean that we ran out of space in the temp directory, which I believe is located on c:\.
This is still happening on the TM tree.
Are you sure http://hg.mozilla.org/tracemonkey/rev/80382d88b92c cannot be causing this? See also the mention of a 'disk space problem' on try before the tracemonkey landing, bug 577648 comment #38. The slaves I inspected have at least 10G free on the disk housing TEMP/TMP, and more than 100G free on the builds drive. Admittedly that's between builds, but if we're now creating temp files during the build that are > 10GB in size then that's a regression. Meanwhile I'll monitor a build in progress ...
Well, it could be causing it, I guess. But the error we're getting is "fatal error C1083: Cannot open compiler intermediate file: '../../staticlib/components/gklayout.lib': Not enough space", and none of that code is in gklayout.lib.
Died again when creating xul.dll in the PGO instrumentation cycle. I compared the sizes of all the libs that are listed explicitly with a recent mozilla-central build, and they're all pretty close: gklayout is 14M bigger (1.5%), and the total is 17.5M bigger (1.0%). Could you be tickling a compiler bug, or could (new/recursive) references be causing the compiler to (dumbly) open multiple copies of gklayout.lib?
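The size comparison described above could be scripted roughly as follows. This is only a sketch: the directory layout and the helper names `lib_sizes` and `size_growth` are assumptions for illustration, not part of the build system.

```python
import os


def lib_sizes(directory):
    """Map each .lib file name in `directory` to its size in bytes."""
    return {
        name: os.path.getsize(os.path.join(directory, name))
        for name in os.listdir(directory)
        if name.lower().endswith(".lib")
    }


def size_growth(baseline, candidate):
    """Return {name: (delta_bytes, delta_percent)} for libs present in both
    builds, plus a 'TOTAL' row summed over those shared libs."""
    rows = {}
    for name in sorted(set(baseline) & set(candidate)):
        if baseline[name] == 0:
            continue  # avoid dividing by zero for empty baseline files
        delta = candidate[name] - baseline[name]
        rows[name] = (delta, 100.0 * delta / baseline[name])
    base_total = sum(baseline[name] for name in rows)
    cand_total = sum(candidate[name] for name in rows)
    if base_total:
        rows["TOTAL"] = (
            cand_total - base_total,
            100.0 * (cand_total - base_total) / base_total,
        )
    return rows
```

Calling `size_growth(lib_sizes(mc_dir), lib_sizes(tm_dir))` with the two staticlib directories (hypothetical variable names) would give per-library and total growth in the same form as the figures quoted above.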
Summary: Win2k3 builds failing on TM: Not enough space for gklayout.lib → Win2k3 builds failing to link xull.dll on TM: Not enough space for gklayout.lib
So if this is dying linking gklayout.lib, then I doubt it's a disk space issue. When we go to link libxul, gklayout.lib is an input file. I suspect this is either a physical or virtual memory issue. (Or just a compiler bug.)
Is there a way to confirm that we're running out of memory? If that's the case, we'll need to switch the windows builders to 64-bit ASAP.
Severity: normal → critical
Priority: P2 → P1
Do the tracemonkey builders have the /3GB switch that we needed on mozilla-central earlier this year? I figure they do, just covering the bases.
(In reply to comment #10)
> Do the tracemonkey builders have the /3GB switch that we needed on
> mozilla-central earlier this year?
>
> I figure they do, just covering the bases.

mw32-ix-slave13, which has failed a lot recently, does have the /3GB switch.
Can we try backing out http://hg.mozilla.org/tracemonkey/rev/80382d88b92c to see if it builds?
(In reply to comment #12)
> Can we try backing out http://hg.mozilla.org/tracemonkey/rev/80382d88b92c to
> see if it builds?

Yes. But what are we trying to learn from the experiment?
(In reply to comment #13)
> (In reply to comment #12)
> > Can we try backing out http://hg.mozilla.org/tracemonkey/rev/80382d88b92c to
> > see if it builds?
>
> Yes. But what are we trying to learn from the experiment?

Testing to see if it's something specific to that revision that is causing the linker to go awry. Also, according to http://msdn.microsoft.com/en-us/library/aa366778%28VS.85%29.aspx, we'll get a 4GB address space for the 32-bit compiler (on 64-bit Windows), so that could be worth testing as well.
(In reply to comment #14)
> (In reply to comment #13)
> > (In reply to comment #12)
> > > Can we try backing out http://hg.mozilla.org/tracemonkey/rev/80382d88b92c to
> > > see if it builds?
> >
> > Yes. But what are we trying to learn from the experiment?
>
> Testing to see if it's something specific to that revision that is causing the
> linker to go awry.

It probably is that revision that's triggering the error. Assuming we confirm that's the case, what will we do differently?
(In reply to comment #15)
> (In reply to comment #14)
> > (In reply to comment #13)
> > > (In reply to comment #12)
> > > > Can we try backing out http://hg.mozilla.org/tracemonkey/rev/80382d88b92c to
> > > > see if it builds?
> > >
> > > Yes. But what are we trying to learn from the experiment?
> >
> > Testing to see if it's something specific to that revision that is causing the
> > linker to go awry.
>
> It probably is that revision that's triggering the error. Assuming we confirm
> that's the case, what will we do differently?

Can the patch be split up into smaller pieces, to find out exactly what's causing this issue? Then try doing it differently, or we can try getting in touch with Microsoft to see if they have any suggestions.
(In reply to comment #16)
> Can the patch be split up into smaller pieces, to find out exactly what's
> causing this issue? Then try doing it differently, or we can try getting in
> touch with Microsoft to see if they have any suggestions.

Before we do anything like that, we need to know whether the linker is running out of memory, and how close it is getting to running out of memory on mozilla-central. If it is near the edge without this patch, it is not worth spending a lot of time examining this specific change, since it will happen again with some other changeset.
Assignee: armenzg → nobody
I found chromium hitting this bug: http://codereview.chromium.org/1075013 they say "linking indirectly ... takes more space than having an explicit dependency". Anyone know how that translates to our build system?
I am now building a PGO build on a 32-bit Windows system and a 64-bit Windows system to see if the latter succeeds while the former fails. It would really be better if someone actually looked at what's happening on the tinderbox, though.
First cut at memory usage data: watching the process manager shows that the 'Available' part of 'Physical Memory (K)' never drops below 3300000. link grabs about 500MB of RAM, the page file increases 400MB (to 580MB), then link craps out. This machine has 4G of RAM, and also has the /3GB switch in boot.ini (as do all the win32 'boxes' used for compiling, hardware or not). Swap is configured to be between 2G and 4G in size on C:\, which has 14G free. Doesn't look like a memory shortage in terms of resources. I'll run VMMap from SysInternals to try to get a handle on the virtual address space in use.
Can we trace the build with xperf, so that we can see what system calls are failing near the end?
Attached image VMMap for link (deleted)
This is the high water mark as best as I can tell by polling every second or so. It takes about 16 seconds for link to fail out, whereas it normally takes a few minutes when it succeeds (IIRC).
Webkit working around linking large lib files: https://bugs.webkit.org/show_bug.cgi?id=19743
If I'm reading the chromium report correctly, "linking indirectly" means linking against webcore.lib, and they switched to link in all the object files directly. We have a prototype patch to do the same, to improve build times on mac, and we could perhaps press it into service if necessary, but it would probably require ted-time, which is in short supply now and would delay plugin crash reporting on mac, which is also really important.
I have been able to build a PGO build on 32-bit windows with MSVC 2005 (oops) and on 64-bit windows with MSVC 2008. Are we using MSVC 2008 on the tinderbox? That would correspond to the chromium and webkit reports.
We use Visual Studio 2005 SP1 (service pack 1) for win32 builds.
You can get the msvc info in the "About Visual Studio" menu item under Help. There's a button to copy everything to the clipboard. Here's what I successfully compiled it with:

Microsoft Visual Studio 2005
Version 8.0.50727.762 (SP.050727-7600)
Microsoft .NET Framework
Version 2.0.50727 SP2

Installed Edition: Professional

Microsoft Visual Basic 2005 77626-009-0000007-41814
Microsoft Visual Basic 2005

Microsoft Visual C# 2005 77626-009-0000007-41814
Microsoft Visual C# 2005

Microsoft Visual C++ 2005 77626-009-0000007-41814
Microsoft Visual C++ 2005

Microsoft Visual J# 2005 77626-009-0000007-41814
Microsoft Visual J# 2005

Microsoft Visual Web Developer 2005 77626-009-0000007-41814
Microsoft Visual Web Developer 2005

Microsoft Web Application Projects 2005 77626-009-0000007-41814
Microsoft Web Application Projects 2005
Version 8.0.50727.762

Crystal Reports AAC60-G0CSA4B-V7000AY
Crystal Reports for Visual Studio 2005

Hotfix for Microsoft Visual Studio 2005 Professional Edition - ENU (KB949009)
This Hotfix is for Microsoft Visual Studio 2005 Professional Edition - ENU. If you later install a more recent service pack, this Hotfix will be uninstalled automatically. For more information, visit http://support.microsoft.com/kb/949009

Microsoft Visual Studio 2005 Professional Edition - ENU Service Pack 1 (KB926601)
This service pack is for Microsoft Visual Studio 2005 Professional Edition - ENU. If you later install a more recent service pack, this service pack will be uninstalled automatically. For more information, visit http://support.microsoft.com/kb/926601
The build slave has:

Microsoft Visual Studio 2005
Version 8.0.50727.762 (SP.050727-7600)
Microsoft .NET Framework
Version 2.0.50727 SP1

Installed Edition: Professional

Microsoft Visual C++ 2005 77626-009-0000007-41431
Microsoft Visual C++ 2005

Microsoft Web Application Projects 2005 77626-009-0000007-41431
Microsoft Web Application Projects 2005
Version 8.0.50727.762

Hotfix for Microsoft Visual Studio 2005 Professional Edition - ENU (KB949009)
This Hotfix is for Microsoft Visual Studio 2005 Professional Edition - ENU. If you later install a more recent service pack, this Hotfix will be uninstalled automatically. For more information, visit http://support.microsoft.com/kb/949009

Microsoft Visual Studio 2005 Professional Edition - ENU Service Pack 1 (KB926601)
This service pack is for Microsoft Visual Studio 2005 Professional Edition - ENU. If you later install a more recent service pack, this service pack will be uninstalled automatically. For more information, visit http://support.microsoft.com/kb/926601

-----

The differences are SP1 vs SP2; C++ and Web Application projects versions; no Visual Basic, C#, J#, Web Developer, and Crystal Reports on build slave.
(In reply to comment #28)
> The differences are SP1 vs SP2...

... for the .Net Framework, which I hope is unrelated. No idea where the 41431 vs 41814 comes from for the C++ compiler.
I don't think that's a version number, FWIW. Looks like you both have the exact same thing to me.
OK. So what is your machine spec, sayrer? Are you using http://hg.mozilla.org/build/buildbot-configs/file/default/mozilla2/win32/tracemonkey/nightly/mozconfig? Calling 'make -f client.mk profiledbuild'?
(In reply to comment #31)
> OK. So what is your machine spec sayrer ?

I built on qm-purify01.mozilla.org. It's a Win2k3 box with automatic updates turned on. Windows Update showed that it's not missing any system or msvc update, other than an IE8 service pack.

> Are you using
> http://hg.mozilla.org/build/buildbot-configs/file/default/mozilla2/win32/tracemonkey/nightly/mozconfig
> ? Calling 'make -f client.mk profiledbuild' ?

I didn't use that exact mozconfig, but I did a profiled build in the same way the nightly does (I wrote the profile input code, so I know what it is supposed to do). Even watched it run the browser. I will retry with that precise mozconfig.
(In reply to comment #30) > I don't think that's a version number, FWIW. Looks like you both have the exact > same thing to me. I couldn't find any explanation of what that number is. We should find out.
Ends:

Searching ../../staticlib/components/htmlpars.lib:
Searching ../../staticlib/components/imglib2.lib:
Searching ../../staticlib/components/gklayout.lib:
Found "public: __thiscall mozilla::ipc::DocumentRendererNativeIDParent::DocumentRendererNativeIDParent(void)" (??0DocumentRendererNativeIDParent@ipc@mozilla@@QAE@XZ)
Referenced in domipc_s.lib(TabParent.obj)
fatal error C1083: Cannot open compiler intermediate file: '../../staticlib/components/gklayout.lib': Not enough space
LINK : fatal error LNK1257: code generation failed

It's already searched gklayout.lib once at this point, so are there multiple resolution passes? The log doesn't mention that, apart from 'Starting pass 1' at the very beginning. Will compare to an m-c build.
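One way to check whether a library is being searched more than once would be to count the "Searching <path>:" lines in the verbose linker log. The sketch below assumes the log looks like the excerpt above; the function name `count_library_searches` is made up for illustration.

```python
import re
from collections import Counter

# Matches "Searching <path>:" where the trailing colon is followed by
# whitespace or end of text, so drive letters such as "C:\..." still work.
SEARCH_RE = re.compile(r"Searching\s+(\S+):(?=\s|$)")


def count_library_searches(log_text):
    """Count how many times each library path appears in a 'Searching' line."""
    return Counter(match.group(1) for match in SEARCH_RE.finditer(log_text))
```

Any path with a count above 1 was searched in more than one pass, which would be worth comparing between the failing tracemonkey log and a mozilla-central log.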
Rob Helmer pointed out an interesting fact: we're hitting both a compiler error (C1083) and a linker error (LNK1257). It looks like this comes from using /GL when calling the compiler, so we get 'link-time code generation' enabled (except for some host utils, and /GL- on jpeg, libimg, and cairo). Do we need to do this? It seems to make linking xul a bigger job than if it were done piecemeal.
http://msdn.microsoft.com/en-us/library/aa289168%28VS.71%29.aspx

"Most of the switches are fairly straightforward and have been in the Visual C++ product for many versions, although two are more recent and can produce dramatic speed improvements without any need to rewrite code. These are /GL, Whole Program Optimization [...]"

/GL is a fairly big win in general and not something we want to drop if we can avoid it.
Yeah, we definitely don't want to turn off /GL.
those flags are required to do PGO, iirc. certainly the big wins come from them, anyway.
[This is a truncated log, ending about 500 lines after the point the tracemonkey build dies. Full log at http://people.mozilla.com/~nthomas/verbose-mc.log.bz2]

Comparing to the log in comment #34, there are lots of symbol ordering differences until we get to the failure point from TM:

Searching ../../staticlib/components/htmlpars.lib:
Searching ../../staticlib/components/imglib2.lib:
Searching ../../staticlib/components/gklayout.lib:
Found "public: __thiscall mozilla::ipc::DocumentRendererNativeIDParent::DocumentRendererNativeIDParent(void)" (??0DocumentRendererNativeIDParent@ipc@mozilla@@QAE@XZ)
Referenced in domipc_s.lib(TabParent.obj)
Loaded gklayout.lib(DocumentRendererNativeIDParent.obj)
Found "public: bool __thiscall mozilla::ipc::DocumentRendererChild::RenderDocument(class nsIDOMWindow *,int const &,int const &,int const &,int const &,class nsString const &,unsigned int const &,int const &,unsigned int &,unsigned int &,class nsCString &)" (?RenderDocument@DocumentRendererChild@ipc@mozilla@@QAE_NPAVnsIDOMWindow@@ABH111ABVnsString@@ABI1AAI4AAVnsCString@@@Z)
Referenced in domipc_s.lib(TabChild.obj)
Loaded gklayout.lib(DocumentRendererChild.obj)
Found "public: bool __thiscall mozilla::ipc::DocumentRendererShmemChild::RenderDocument(class nsIDOMWindow *,int const &,int const &,int const &,int const &,class nsString const &,unsigned int const &,int const &,struct gfxMatrix const &,class mozilla::ipc::Shmem &)" (?RenderDocument@DocumentRendererShmemChild@ipc@mozilla@@QAE_NPAVnsIDOMWindow@@ABH111ABVnsString@@ABI1ABUgfxMatrix@@AAVShmem@23@@Z)
Referenced in domipc_s.lib(TabChild.obj)
Loaded gklayout.lib(DocumentRendererShmemChild.obj)
Found "public: bool __thiscall mozilla::ipc::DocumentRendererNativeIDChild::RenderDocument(class nsIDOMWindow *,int const &,int const &,int const &,int const &,class nsString const &,unsigned int const &,int const &,struct gfxMatrix const &,int const &)" (?RenderDocument@DocumentRendererNativeIDChild@ipc@mozilla@@QAE_NPAVnsIDOMWindow@@ABH111ABVnsString@@ABI1ABUgfxMatrix@@1@Z)
Referenced in domipc_s.lib(TabChild.obj)
Loaded gklayout.lib(DocumentRendererNativeIDChild.obj)
Found "private: static int mozilla::PaintTracker::gPaintTracker" (?gPaintTracker@PaintTracker@mozilla@@0HA)
Referenced in domplugins_s.lib(PluginInstanceChild.obj)
Loaded gklayout.lib(PaintTracker.obj)
Searching ../../staticlib/components/docshell.lib:
Searching ../../staticlib/components/embedcomponents.lib:

and on to generating code. Later in the log we have

Discarded "public: __thiscall mozilla::ipc::DocumentRendererNativeIDParent::DocumentRendererNativeIDParent(void)" (??0DocumentRendererNativeIDParent@ipc@mozilla@@QAE@XZ) from gklayout.lib(DocumentRendererNativeIDParent.obj)

but that's the only other reference. Full log at http://people.mozilla.com/~nthomas/verbose-mc.log.bz2

Fair enough on the need for /GL.
(In reply to comment #21)
> Can we trace the build with xperf, so that we can see what system calls are
> failing near the end?

According to http://blogs.msdn.com/b/pigscanfly/archive/2008/02/24/xperf-support-for-xp.aspx most of xperf only works on vista/win2k8 or higher:

"xperf.exe can be used on Windows XP SP2, and Windows Server 2003 for turning tracing on and off, and merge kernel trace data with user mode traces into a single ETL file. These operations are simply called "trace control". Note that the '-stackwalk' switch is not supported on XP because its kernel doesn't support capturing the stack on events, this is a new feature in the Vista kernel. However, all operations that require trace decoding (and that's almost everything else), must be done on Vista or Windows Server 2008. This includes viewing traces in the Windows Performance Analyzer tool (xperfview.exe)."

Any idea if that's useful to us? In the meantime, I'm looking at Detours (http://research.microsoft.com/en-us/projects/detours/#overview) and StraceNT. Not sure if the former is useful here.
That's fine -- the stack won't really help us, and we just need to capture the trace on the machine. We can analyze the trace on another computer, and would probably prefer to anyway.
Thanks for looking into it, BTW! Let me know if I can help with the tooling.
I managed to get a trace of system calls with stracent. This is the log of the last 100,000 calls. It seems to start dying somewhere in here, based on the repetition. I've got the full log as well which I can post somewhere. It is massive, though (33 million lines). I'm still going to try to grab a trace with xperf, too.
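For a log that size, one rough way to find "the repetition" is to keep only the last N lines in memory and count which ones recur most. This is a generic sketch, not tied to StraceNT's actual output format; `tail_repetition` is a hypothetical helper name.

```python
from collections import Counter, deque


def tail_repetition(lines, tail_size=100_000, top=5):
    """Return the `top` most frequent non-blank lines among the last
    `tail_size` lines of an iterable, without loading the whole log."""
    tail = deque(lines, maxlen=tail_size)  # older lines are dropped as we go
    counts = Counter(line.strip() for line in tail if line.strip())
    return counts.most_common(top)
```

Passing an open file object as `lines` streams the log line by line, so even a 33-million-line file only costs memory for the tail.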
I could use a hand with xperf. I've tried capturing system calls with:
  xperf -on SysProf -stackwalk profile -f trace.etl
and
  xperf -on SYSCALL -stackwalk profile -f trace.etl
but neither seems to get one. The former ends up with some information, but no system calls that I can find, when loaded up with xperfview. The latter seems to have no information at all, besides basic system info, when viewed.
1.) I tried using the nightly mozconfig, but it has dependencies that aren't in the tree. What do I need to check out in order to use it?
2.) Are the build slaves up to date w.r.t. service packs and whatnot?
(In reply to comment #45)
> 1.) I tried using the nightly mozconfig, but it has dependencies that aren't in
> the tree. What do I need to check out in order to use it?

It looks like it includes
http://hg.mozilla.org/build/buildbot-configs/file/10dc80ba4481/mozilla2/win32/include/choose-make-flags

which ends up having the same effect as:
mk_add_options MOZ_MAKE_FLAGS="-j4"

> 2.) Are the build slaves up to date w.r.t. service packs and whatnot?

They aren't. They were last updated in August, 2007. If you want to do some debugging on a build machine I'd be glad to give you access to one.
(In reply to comment #46)
> (In reply to comment #45)
> > 1.) I tried using the nightly mozconfig, but it has dependencies that aren't in
> > the tree. What do I need to check out in order to use it?
>
> It looks like it includes
> http://hg.mozilla.org/build/buildbot-configs/file/10dc80ba4481/mozilla2/win32/include/choose-make-flags
>
> which ends up having the same affect as:
> mk_add_options MOZ_MAKE_FLAGS="-j4"

Is that the right thing? I thought it was a bad idea to use parallel make on windows without using pymake.

> > 2.) Are the build slaves up to date w.r.t. service packs and whatnot?
>
> They aren't. They were last updated in August, 2007. If you want to do some
> debugging on a build machine I'd be glad to give you access to one.

I don't know what I would do. My next step is to make sure that I build with the same mozconfig and see if it succeeds.
(In reply to comment #47)
> (In reply to comment #46)
> > (In reply to comment #45)
> > > 1.) I tried using the nightly mozconfig, but it has dependencies that aren't in
> > > the tree. What do I need to check out in order to use it?
> >
> > It looks like it includes
> > http://hg.mozilla.org/build/buildbot-configs/file/10dc80ba4481/mozilla2/win32/include/choose-make-flags
> >
> > which ends up having the same affect as:
> > mk_add_options MOZ_MAKE_FLAGS="-j4"
>
> Is that the right thing? I thought it was a bad idea to use parallel make on
> windows without using pymake.

Whoops - I read it backwards. It's -j1 on these machines. (We use -j4 on the VMs because they're slow enough to the point that we don't hit that particular issue.)

> > > 2.) Are the build slaves up to date w.r.t. service packs and whatnot?
> >
> > They aren't. They were last updated in August, 2007. If you want to do some
> > debugging on a build machine I'd be glad to give you access to one.
>
> I don't know what I would do. My next step is to make sure that I build with
> the same mozconfig and see if it succeeds.

OK, just thought I'd throw it out there.
Status (from the releng side):
* I'm trying out the patch in bug 522770 to see if it would fix this particular issue. I'm waiting on a try push to complete so I can figure out how to invoke the linker properly. Doesn't look like I'll be able to do that until tomorrow as the Windows builds haven't started yet.
* I haven't personally been able to glean anything from the strace output, but perhaps someone who knows Windows system calls better would.
* Filed bug on Microsoft Connect (I highly doubt they will care beyond "this is fixed in VS2010", if that's even the case). https://connect.microsoft.com/VisualStudio/feedback/details/581207/visual-studio-2005-sp1-reproducible-linker-error-lkn1257-caused-by-c1083
(In reply to comment #49)
> Status (from the releng side):
> * I'm trying out the patch in bug 522770 to see if it would fix this particular
> issue. I'm waiting on a try push to complete so I can figure out how to invoke
> the linker properly. Doesn't look like I'll be able to do that until tomorrow
> as the Windows builds haven't started yet.

These builds failed out on all platforms, e.g.
win32 opt http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1280440376.1280442999.27586.gz
mac32 opt http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1280436240.1280441533.21741.gz
linux32 opt http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1280438632.1280444385.480.gz

Did anyone glean anything from comment #39? Did we add an ipc <--> xpconnect dependency on Tracemonkey recently? That seems to be the only part of js/src/ included in gklayout.lib. Or perhaps modify DOM or something else with hooks into JS?
With some guidance, I was able to link xul.lib by replacing gklayout.lib with the objs and libs which are linked into it. Even without unpacking the 2nd tier of libs the link succeeded. Since it's already in progress, I think that bug 522770 would be the quickest way to work around this issue. Going forward, VS2008 or 2010 *may* work around similar issues, but since we want bug 522770 for other reasons I think it makes sense to push on that. I spoke with khuey and Mitch on IRC and they seemed willing to push it along. One more thing of note: Catlee mentioned yesterday that even building this on a 64-bit version of Windows will not help, because the compiler and linker are 32-bit apps, and thus only get a 32-bit address space.
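The workaround described above amounts to rewriting the linker's input list so that one archive is replaced by the objects and libraries that went into it. The following is only an illustrative sketch of that substitution; `expand_library` and the example file names are hypothetical, and the real change lives in bug 522770.

```python
import posixpath


def expand_library(link_args, lib_name, members):
    """Return a copy of `link_args` where every argument whose file name is
    `lib_name` (case-insensitive) is replaced by the entries in `members`."""
    expanded = []
    replaced = False
    for arg in link_args:
        # Normalise backslashes so Windows-style paths compare correctly.
        base = posixpath.basename(arg.replace("\\", "/"))
        if base.lower() == lib_name.lower():
            expanded.extend(members)
            replaced = True
        else:
            expanded.append(arg)
    if not replaced:
        raise ValueError("%s not found in linker arguments" % lib_name)
    return expanded
```

For example, `expand_library(args, "gklayout.lib", ["a.obj", "b.obj", "inner.lib"])` (placeholder names) leaves every other argument, including its order, untouched.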
(In reply to comment #51)
> One more thing of note: Catlee mentioned yesterday that even building this on a
> 64-bit version of Windows will not help, because the compiler are linker are
> 32-bit apps, and thus only get 32-bit address space.

Dumpbin on cl and link on my install of MSVC 2005 says that both programs are LARGEADDRESSAWARE so on a x64 system they should get 4 GB of usable address space instead of 2 or 3 as the case may be. Fixing Bug 522770 will be faster than switching all the x86 builders to run on x64, but that shouldn't be discounted as an option in the future.
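The LARGEADDRESSAWARE flag that dumpbin reports is a bit (0x0020, IMAGE_FILE_LARGE_ADDRESS_AWARE) in the Characteristics field of the PE file header, so it can also be checked without Visual Studio tools. A minimal sketch, assuming the caller passes the raw bytes of the executable; the function name is made up for illustration:

```python
import struct

IMAGE_FILE_LARGE_ADDRESS_AWARE = 0x0020  # bit in COFF header Characteristics


def is_large_address_aware(pe_bytes):
    """Return True if the PE image in `pe_bytes` has the large-address-aware
    bit set in its COFF file header."""
    if pe_bytes[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # e_lfanew at offset 0x3C holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", pe_bytes, 0x3C)
    if pe_bytes[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # Characteristics is the last 2-byte field of the 20-byte COFF header,
    # which starts right after the 4-byte signature.
    (characteristics,) = struct.unpack_from("<H", pe_bytes, pe_offset + 4 + 18)
    return bool(characteristics & IMAGE_FILE_LARGE_ADDRESS_AWARE)
```

Reading link.exe or cl.exe in binary mode and passing the contents to this function should agree with what dumpbin prints for the same file.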
Summary: Win2k3 builds failing to link xull.dll on TM: Not enough space for gklayout.lib → Win2k3 builds failing to link xul.dll on TM: Not enough space for gklayout.lib
This spread to mozilla-central with the TM merge even though TM was green after backing out the patch. I have a patch in hand.
Assignee: nobody → me
Summary: Win2k3 builds failing to link xul.dll on TM: Not enough space for gklayout.lib → Win2k3 builds failing to link xul.dll: Not enough space for gklayout.lib
The patches I've landed for Bug 522770 appear to have fixed this on mozilla-central. Tracemonkey will pick this up on the next merge.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
I filed Bug 583628 to capture the thoughts on building 32 bit builds on 64 bit builders.
(In reply to comment #52)
> (In reply to comment #51)
> > One more thing of note: Catlee mentioned yesterday that even building this on a
> > 64-bit version of Windows will not help, because the compiler are linker are
> > 32-bit apps, and thus only get 32-bit address space.
>
> Dumpbin on cl and link on my install of MSVC 2005 says that both programs are
> LARGEADDRESSAWARE so on a x64 system they should get 4 GB of usable address
> space instead of 2 or 3 as the case may be. Fixing Bug 522770 will be faster
> than switching all the x86 builders to run on x64, but that shouldn't be
> discounted as an option in the future.

Ah, that's good to know. Thanks!
Microsoft responded on the bug I filed. They said the out-of-memory issue is fixed in 2010. If we end up hitting this again before we're ready to upgrade we can try contacting Product Support Services to get a hotfix.
Product: mozilla.org → Release Engineering