Closed
Bug 937279
Opened 11 years ago
Closed 7 years ago
w864 slaves fail cloning and rebooting; get into infinite retry loops
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
INCOMPLETE
People
(Reporter: bhearsum, Unassigned)
References
Details
(Whiteboard: SOLUTIONS: modify reboot steps on ScriptFactory AND add auto retry logic)
Examples:
https://bugzilla.mozilla.org/show_bug.cgi?id=928673
https://bugzilla.mozilla.org/show_bug.cgi?id=937270
https://bugzilla.mozilla.org/show_bug.cgi?id=937273
https://bugzilla.mozilla.org/show_bug.cgi?id=937277
And probably more that I haven't seen yet.
Reporter | ||
Comment 1•11 years ago
|
||
Reporter | ||
Updated•11 years ago
|
Blocks: t-w864-ix-116
Reporter | ||
Updated•11 years ago
|
Blocks: t-w864-ix-075
Reporter | ||
Comment 2•11 years ago
|
||
Reporter | ||
Comment 3•11 years ago
|
||
Reporter | ||
Updated•11 years ago
|
Blocks: t-w864-ix-109
Updated•11 years ago
|
OS: Linux → Windows 8
Comment 4•11 years ago
|
||
It seems that they all versions of Windows share the same HG version. [1]
Same vlans. [2]
Could it be some sort of weird interactions from hgtool.py and hg?
It would be great to have logs to inspect.
[1]
cltbld@T-W732-IX-001 ~
$ /c/mozilla-build/hg/hg.exe --version
Mercurial Distributed SCM (version 2.5.4)
cltbld@T-W864-IX-001 ~
$ /c/mozilla-build/hg/hg.exe --version
Mercurial Distributed SCM (version 2.5.4)
$ /c/mozilla-build/hg/hg.exe --version
Mercurial Distributed SCM (version 2.5.4)
[2]
armenzg@UbuntVM:~/moz/patches$ host t-w732-ix-001
t-w732-ix-001.build.mozilla.org is an alias for t-w732-ix-001.wintest.releng.scl3.mozilla.com.
t-w732-ix-001.wintest.releng.scl3.mozilla.com has address 10.26.40.151
armenzg@UbuntVM:~/moz/patches$ host t-w864-ix-001
t-w864-ix-001.build.mozilla.org is an alias for t-w864-ix-001.wintest.releng.scl3.mozilla.com.
t-w864-ix-001.wintest.releng.scl3.mozilla.com has address 10.26.40.31
armenzg@UbuntVM:~/moz/patches$ host t-xp32-ix-001
t-xp32-ix-001.build.mozilla.org is an alias for t-xp32-ix-001.wintest.releng.scl3.mozilla.com.
t-xp32-ix-001.wintest.releng.scl3.mozilla.com has address 10.26.41.21
armenzg@UbuntVM:~/moz/patches$ host t-xp32-ix-109
t-xp32-ix-109.build.mozilla.org is an alias for t-xp32-ix-109.wintest.releng.scl3.mozilla.com.
t-xp32-ix-109.wintest.releng.scl3.mozilla.com has address 10.26.41.129
armenzg@UbuntVM:~/moz/patches$ host t-xp32-ix-075
t-xp32-ix-075.build.mozilla.org is an alias for t-xp32-ix-075.wintest.releng.scl3.mozilla.com.
t-xp32-ix-075.wintest.releng.scl3.mozilla.com has address 10.26.41.95
armenzg@UbuntVM:~/moz/patches$ host t-xp32-ix-116
t-xp32-ix-116.build.mozilla.org is an alias for t-xp32-ix-116.wintest.releng.scl3.mozilla.com.
t-xp32-ix-116.wintest.releng.scl3.mozilla.com has address 10.26.41.136
Reporter | ||
Comment 5•11 years ago
|
||
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #4)
> It seems that they all versions of Windows share the same HG version. [1]
> Same vlans. [2]
>
> Could it be some sort of weird interactions from hgtool.py and hg?
Probably not. Some of these happen without hgtool.py (from https://bugzilla.mozilla.org/show_bug.cgi?id=900825#c3):
> 'c:\\mozilla-build\\hg\\hg' 'clone' 'http://hg.mozilla.org/build/mozharness' 'scripts'
> program finished with exit code -1073741800
> It would be great to have logs to inspect.
Full example here: https://tbpl.mozilla.org/php/getParsedLog.php?id=30418653&tree=Fx-Team
Reporter | ||
Comment 6•11 years ago
|
||
Something interesting here is that mercurial dies very quickly - it doesn't even print "requesting all changes". The Buildbot output says the elapsed time is just 100ms.
Comment 7•11 years ago
|
||
I wonder why we don't call hgtool.py for mozharness:
'c:\\mozilla-build\\hg\\hg' 'clone' 'http://hg.mozilla.org/build/mozharness' 'scripts'
On another note, we don't reboot after the job fails since count_and_reboot is inside of the mozharness repo.
Maybe we loose DNS resolution and since we don't reboot we can't recover.
Maybe we can look at the Error logs that Windows has.
Host t-w864-ix-033 had its first incident on Sat Nov 9 10:27:28 2013.
Perhaps we should change ScriptFactory to call directly the reboot command without count_and_reboot.py.
Reporter | ||
Comment 8•11 years ago
|
||
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #7)
> I wonder why we don't call hgtool.py for mozharness:
> 'c:\\mozilla-build\\hg\\hg' 'clone' 'http://hg.mozilla.org/build/mozharness'
> 'scripts'
We don't have hgtool until we have mozharness.
> On another note, we don't reboot after the job fails since count_and_reboot
> is inside of the mozharness repo.
> Maybe we loose DNS resolution and since we don't reboot we can't recover.
Interesting thought, this seems rather plausible! It fits in with the fact that hg dies so quickly. Is there another bug for this intermittent loss in DNS?
> Maybe we can look at the Error logs that Windows has.
> Host t-w864-ix-033 had its first incident on Sat Nov 9 10:27:28 2013.
Indeed. We have this at 9:24am on that day:
The system failed to register host (A or AAAA) resource records (RRs) for network adapter
with settings:
Adapter Name : {35E0B54E-FAD3-461C-8363-19AFE39D4762}
Host Name : T-W864-IX-033
Primary Domain Suffix : releng.ad.mozilla.com
DNS server list :
10.26.75.40, 10.26.75.41
Sent update to server : <?>
IP Address(es) :
10.26.40.63
The reason the system could not register these RRs during the update request was because of a system problem. You can manually retry DNS registration of the network adapter and its settings by typing 'ipconfig /registerdns' at the command prompt. If problems still persist, contact your DNS server or network systems administrator. See event details for specific error code information.
> Perhaps we should change ScriptFactory to call directly the reboot command
> without count_and_reboot.py.
Maybe...this is tricky to do cross-platform though. We should file a separate bug for it.
Reporter | ||
Comment 9•11 years ago
|
||
Given that this is confirmed to be a DNS issue and _not_ an issue with the slaves themselves, I'm going to put all of the "affected" slaves back in production.
The current workaround is a manual reboot, but we need to find some way to stop these slaves from getting into infinite RETRY loops.
Severity: normal → major
Reporter | ||
Updated•11 years ago
|
Severity: major → normal
Summary: many w864 slaves failing to clone repositories with weird exit codes → w864 slaves lose dns settings sometimes; get into infinite retry loops
Reporter | ||
Comment 11•11 years ago
|
||
This might not be DNS related after all....
12:49 < dustin> so this was an onsite DNS "blip"?
12:49 < bhearsum|buildduty> i'm not entirely sure
12:49 < bhearsum|buildduty> https://bugzilla.mozilla.org/show_bug.cgi?id=937279#c8 has an event from the Event Viewer that has details
12:50 < dustin> ah, that's about using dynamic DNS to register its forward/reverse with a DNS server
12:50 < dustin> we don't support dynamic DNS
12:50 < dustin> which is why it failed
12:51 < dustin> but those forward/reverse records are (should be!) already in DNS
12:51 < dustin> I thought we had disabled that, but my memory's fuzzy lo those many months ago
12:51 < bhearsum|buildduty> so, this could just be coincidence...
12:51 < dustin> yeah
12:51 < bhearsum|buildduty> we're not 100% certain dns failure is causing these hg issues
12:51 < dustin> I'm fairly certain all windows hosts will show that error message from time to time
Comment 12•11 years ago
|
||
The event viewer message is about failing to modify the host's forward/reverse DNS using Dynamic DNS. Our DNS servers don't support Dynamic DNS, so this isn't surprising, and I suspect that all of our Windows hosts will have this error in their event logs. I have a vague memory of suggesting that we disable these registrations, but I don't know if that ever happened.
In other words, nothing I see here implicates external DNS failures.
I think the more interesting clue will be in the failures. From the dep bugs, I see
--
Has a case of the "any use of hg results in 'program finished with exit code -1073741800' and setting RETRY" blues.
--
where that code is coming from HG.EXE. So figuring out what that means and what might cause it would be a good start. A few bigger log snippets might help, too, especially with timestamps that we could correlate with other events, e.g., DNS server logs.
Reporter | ||
Comment 13•11 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] (I read my bugmail; don't needinfo me) from comment #12)
> --
> Has a case of the "any use of hg results in 'program finished with exit code
> -1073741800' and setting RETRY" blues.
> --
>
> where that code is coming from HG.EXE. So figuring out what that means and
> what might cause it would be a good start. A few bigger log snippets might
> help, too, especially with timestamps that we could correlate with other
> events, e.g., DNS server logs.
Unfortunately, there's no output from hg when we hit this. It runs for about 100ms and then dies. https://bugzilla.mozilla.org/show_bug.cgi?id=937154#c0 has the most complete log snippet we have.
Comment 14•11 years ago
|
||
Thanks! -1073741800 expressed in 32-bit 2's complement is 0xc0000018.
Per http://www.tenox.net/links/ntstatus.html:
C0000018 STATUS_CONFLICTING_ADDRESSES
that leads to
http://stackoverflow.com/questions/2622607/windows-loader-problem-turn-on-verbose-mode
http://ntsecurity.nu/onmymind/2006/2006-06-01.html
suggesting that this occurs when the same region of memory is mapped for two purposes. This is sounding suspiciously like the Cygwin flaw, where the OS doesn't have enough flexibility to put objects like DLLs at arbitrary locations in the process's memory map, and two objects overlap.
But I should probably leave this to the experts..
Comment 15•11 years ago
|
||
Dustin has already pointed out the windows errors being a red herring. However, in response to his response I have a GPO already to disable able client DNS trying to send DDNS updates to the servers. It shouldn't affect anything other than removing some noise from the windows logs.
Flags: needinfo?(q)
Updated•11 years ago
|
Depends on: t-w864-ix-127
Updated•11 years ago
|
Depends on: t-w864-ix-028
Comment 16•11 years ago
|
||
Any reason why this is happening mainly on our Windows 8 machines? Is it because it is a 64-bit arch rather than 32-bit?
Updated•11 years ago
|
Whiteboard: SOLUTIONS: modify reboot steps on ScriptFactory AND add auto retry logic
Updated•11 years ago
|
Summary: w864 slaves lose dns settings sometimes; get into infinite retry loops → w864 slaves fail cloning and rebooting; get into infinite retry loops
Comment 17•11 years ago
|
||
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #16)
> Any reason why this is happening mainly on our Windows 8 machines? Is it
> because it is a 64-bit arch rather than 32-bit?
--^
Regardless, if we know or not the answer. I believe we should modify the reboot step to at least manage to recover if count_and_reboot.py is missing (rather than retrying the hg step - which might ).
Flags: needinfo?(q)
Flags: needinfo?(dustin)
Comment 18•11 years ago
|
||
I'm not sure why you flagged me for needinfo here - was there a question?
Flags: needinfo?(dustin)
Comment 19•11 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] (I read my bugmail; don't needinfo me) from comment #18)
> I'm not sure why you flagged me for needinfo here - was there a question?
comment 16.
Comment 20•11 years ago
|
||
Ah, no idea here.
Comment 21•11 years ago
|
||
NVM. We better fix the reboot step or wait for bug =712205.
Flags: needinfo?(q)
Updated•7 years ago
|
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE
Assignee | ||
Updated•7 years ago
|
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
|
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•