Convert windows l10n repacks to use cross-l10n on linux (to avoid cloning timeouts)
Categories
(Release Engineering :: Release Automation: L10N, defect)
Tracking
(firefox-esr78 unaffected, firefox80 fixed)
Tracking | Status
---|---
firefox-esr78 | unaffected
firefox80 | fixed
People
(Reporter: nthomas, Unassigned)
References
Details
(Keywords: intermittent-failure, leave-open)
Attachments
(3 files, 1 obsolete file)
First instance on 2020-07-01, cloning the first l10n locale after having already fetched gecko OK:
[task 2020-07-01T02:36:48.207Z] 02:36:48 INFO - Running command: ['hg', '--config', 'ui.merge=internal:merge', '--config', 'extensions.robustcheckout=Z:\\task_1593570207\\build\\src\\testing\\mozharness\\external_tools\\robustcheckout.py', 'robustcheckout', 'https://hg.mozilla.org/l10n-central/mr', 'mr', '--sharebase', 'y:\\hg-shared', '--branch', '708ac4b5f864']
[task 2020-07-01T02:36:48.207Z] 02:36:48 INFO - Copy/paste: hg --config ui.merge=internal:merge --config extensions.robustcheckout=Z:\task_1593570207\build\src\testing\mozharness\external_tools\robustcheckout.py robustcheckout https://hg.mozilla.org/l10n-central/mr mr --sharebase y:\hg-shared --branch 708ac4b5f864
[task 2020-07-01T02:36:48.297Z] 02:36:48 INFO - (using Mercurial 4.7.1)
[taskcluster 2020-07-01T02:42:59.412Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/KaYyTu-iQAm-7X-ioyPKlQ/0/on/us-east-1/i-0d1c3a3b99f5ea602/until/1593572579.278
[taskcluster 2020-07-01T02:59:59.426Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/KaYyTu-iQAm-7X-ioyPKlQ/0/on/us-east-1/i-0d1c3a3b99f5ea602/until/1593573599.291
[taskcluster 2020-07-01T03:16:59.390Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/KaYyTu-iQAm-7X-ioyPKlQ/0/on/us-east-1/i-0d1c3a3b99f5ea602/until/1593574619.238
[taskcluster 2020-07-01T03:33:59.363Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/KaYyTu-iQAm-7X-ioyPKlQ/0/on/us-east-1/i-0d1c3a3b99f5ea602/until/1593575639.199
[taskcluster 2020-07-01T03:50:59.273Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/KaYyTu-iQAm-7X-ioyPKlQ/0/on/us-east-1/i-0d1c3a3b99f5ea602/until/1593576659.108
[taskcluster 2020-07-01T04:07:59.248Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/KaYyTu-iQAm-7X-ioyPKlQ/0/on/us-east-1/i-0d1c3a3b99f5ea602/until/1593577679.076
[taskcluster 2020-07-01T04:24:59.157Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/KaYyTu-iQAm-7X-ioyPKlQ/0/on/us-east-1/i-0d1c3a3b99f5ea602/until/1593578698.979
[taskcluster:error] Aborting task...
[taskcluster 2020-07-01T04:26:04.971Z] SUCCESS: The process with PID 3640 (child process of PID 1872) has been terminated.
[taskcluster 2020-07-01T04:26:04.971Z] SUCCESS: The process with PID 1872 (child process of PID 888) has been terminated.
[taskcluster 2020-07-01T04:26:04.971Z] SUCCESS: The process with PID 2384 (child process of PID 684) has been terminated.
[taskcluster 2020-07-01T04:26:04.971Z] SUCCESS: The process with PID 888 (child process of PID 684) has been terminated.
[taskcluster 2020-07-01T04:26:04.971Z] SUCCESS: The process with PID 684 (child process of PID 5064) has been terminated.
[taskcluster 2020-07-01T04:26:04.971Z]
Other instances:
2020-07-01 https://firefoxci.taskcluster-artifacts.net/DinN4QavSx2OVxg1vY6kXw/0/public/logs/live_backing.log
2020-07-01 https://firefoxci.taskcluster-artifacts.net/K0NUlvinTQacLStEvk7xEw/0/public/logs/live_backing.log (first where other locales are cloned fine then stops)
2020-07-03 https://firefoxci.taskcluster-artifacts.net/BRFPR9rrS6OIaICdVsWflQ/0/public/logs/live_backing.log
2020-07-06 https://firefoxci.taskcluster-artifacts.net/Uw9xb9vkRJagtBxZKiV8Ew/0/public/logs/live_backing.log
2020-07-08 https://firefoxci.taskcluster-artifacts.net/VKpXEqXOSlmq8XNpBoPuAQ/0/public/logs/live_backing.log
2020-07-08 https://firefoxci.taskcluster-artifacts.net/eRp0do6VR86q0hbGuPA6OA/0/public/logs/live_backing.log
2020-07-08 https://firefoxci.taskcluster-artifacts.net/Hou64f9RR8y7veC1kfFWdw/0/public/logs/live_backing.log
2020-07-10 https://firefoxci.taskcluster-artifacts.net/KW8pQecXT262mXejHx2pJg/0/public/logs/live_backing.log (in us-west-2b)
2020-07-10 https://firefoxci.taskcluster-artifacts.net/PMmXP1FTQ_6pHF29YqGwpQ/0/public/logs/live_backing.log
2020-07-16 https://firefoxci.taskcluster-artifacts.net/cD8H73cySxCKn7ZlN7ecXA/0/public/logs/live_backing.log
2020-07-16 https://firefoxci.taskcluster-artifacts.net/VsWulsjqRcC0M0kPoaW2OA/0/public/logs/live_backing.log
2020-07-16 https://firefoxci.taskcluster-artifacts.net/dK7ZK4OsRR6M6kHTrHIfyA/0/public/logs/live_backing.log
All of these are instances in us-east-1, except for KW8pQecXT262mXejHx2pJg (us-west-2b).
Comment 1 (Reporter) • 4 years ago
Connor, we are wondering if this might be abuse/DDoS protection on hg.m.o. These requests would be coming from AWS to the public hg.m.o servers.
We are also thinking of swapping to the private servers, and were curious if there is any protection there too.
Updated (Reporter) • 4 years ago
Comment 3 • 4 years ago
We don't have any true DDoS protection on hgmo (i.e. something like fail2ban), apart from bans on IP blocks. Given the varied public IPs in the logs, I doubt that is the problem here, and the IPs currently banned in the load balancer don't seem to match any in the logs. Using the internal mirrors would work, but I would need to add the new repos to them first, since they currently only host a subset of the repos on the public endpoint.
The way we avoid DDoS from cloning today is by using clonebundles. Perhaps we could start producing bundles for the l10n repos. Usually cloning these repos doesn't cause any disruption since they're quite small, and looking at some of the logs where cloning these repos succeeded, the clone + working directory update took ~10s. Given the small load for each individual clone I don't think the issue here is related to a DDoS from our infra, but I don't have a better explanation either.
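The clonebundles suggestion above would work roughly as follows. This is a hedged sketch: the bundle host and filename are made up, but the manifest format (a `.hg/clonebundles.manifest` file on the server, one bundle URL plus attributes per line) and the `hg bundle` command are standard Mercurial. A weekly cron regenerating the bundles would likely be plenty for low-change l10n repos:

```
# Hypothetical .hg/clonebundles.manifest for l10n-central/mr on the server.
# Clients advertising the clonebundles capability fetch the bundle from the
# listed URL first, then pull only the changesets created since it was built.
https://bundles.example.net/l10n-central/mr.gzip-v2.hg BUNDLESPEC=gzip-v2
```

The bundle file itself could be produced with something like `hg bundle --all --type gzip-v2 mr.gzip-v2.hg` run against the repo.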
Comment hidden (Intermittent Failures Robot)
Comment 5 (Reporter) • 4 years ago
OK, thanks for the information. Perhaps it's something to do with the actual workers themselves.
Braindump of ideas:
- we also see this issue occasionally on mozilla-central for l10n repacks, most recently here. That coincided with 79.0 RC1 having a couple of issues, so it might be related to surge load on the servers. There are 3 platforms × 21 chunks per branch that can all start at roughly the same time, each cloning 5 locales sequentially. Sometimes releases align with nightlies to double the load (or a higher multiple if several releases are started simultaneously). Clone bundles could offload a lot of work from the server, so they might be worthwhile even if only updated weekly, since these are low-change repos
- I've tried adding `socket.setdefaulttimeout(10.0)` to `testing/mozharness/external_tools/robustcheckout.py`, and running a bunch of l10n repacks on try (eg), but haven't been able to reproduce the hang and verify that the timeout works. Setting a much lower timeout locally certainly throws an exception and the retry behaviour of robustcheckout kicks in, so I think this should help us mitigate the issue
- we see this for Windows but not other platforms (which all run on Linux). The Mercurial versions are 4.7.1 and 4.8.1 respectively, and I see support was added in 4.8 for setting an http/https timeout. It defaults to None, and I don't think we are setting that in the hgrc for our Linux images, so it's likely not the key difference
- we could set the max job time to something much lower than 7200 seconds in order to fail faster, and have sheriffs retry manually. Windows repacks are typically 23 minutes long, other platforms 5 minutes, so something like 1800s would tightly match that, or 2400s to allow a little more leeway for variation
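The `socket.setdefaulttimeout` idea in the second bullet can be sanity-checked without a try push. The sketch below is not robustcheckout itself, just a minimal stand-alone demonstration that a process-wide default timeout turns a silently stalled read into a `socket.timeout` exception, which is the kind of error robustcheckout's retry logic could then catch:

```python
import socket

# Process-wide default applied to sockets created afterwards, mirroring
# the socket.setdefaulttimeout(10.0) idea for robustcheckout.py.
# A short value is used here so the demo finishes quickly.
socket.setdefaulttimeout(0.5)

# Stand-in for a server that accepts the connection but then stalls,
# like an hg server that stops responding mid-clone.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.create_connection(srv.getsockname())
try:
    cli.recv(1024)  # the "server" never sends anything
    timed_out = False
except socket.timeout:
    # Without a default timeout this recv would block indefinitely,
    # which is the hang seen in the logs above.
    timed_out = True
finally:
    cli.close()
    srv.close()

print(timed_out)
```

This only covers established connections that go quiet; if the hang is in DNS or TLS setup instead, the default timeout still applies, but the failure mode would look different in the logs.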
Comment hidden (Intermittent Failures Robot)
Updated (Reporter) • 4 years ago
Comment 7 (Reporter) • 4 years ago
Windows jobs typically take 23 minutes, so limit to 30 minutes so that cloning hangs terminate earlier than 2 hours.
Mac and Linux jobs are typically 5 minutes long, so bring the limit in to 15 minutes from an hour.
Also refactors so that the max run time is only specified in the kind, rather than partly in the kind and partly in the transform.
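A sketch of what that kind-level setting might look like, assuming the usual taskgraph `max-run-time` key (in seconds) with `by-build-platform` keying; the file path and surrounding structure here are illustrative, not the actual patch:

```yaml
# Hypothetical excerpt from a nightly-l10n kind.yml; only max-run-time
# is the point here, the surrounding keys are assumptions.
job-template:
    worker:
        max-run-time:
            by-build-platform:
                win.*: 1800      # Windows repacks typically take ~23 min
                default: 900     # Mac/Linux repacks typically take ~5 min
```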
Updated • 4 years ago
Comment 9 (Reporter) • 4 years ago
We're unsure what the next step is here, unassigning for now.
Updated • 4 years ago
Comment 10 • 4 years ago
Comment 11 • 4 years ago
bugherder
Comment 12 • 4 years ago
Comment 13 (Reporter) • 4 years ago
Looks like the 30 minute timeout is too short for some jobs. There are 6 failures on that push for windows l10n: 1 has the l10n hang and was failed as we wanted, but 5 are jobs that should have been left alone, as they had finished cloning and were repacking. It seems to be related to AWS instances in us-east-1 which don't already have a clone of gecko, but it doesn't always occur. I'm asking around about why we don't use the internal hg servers for that AWS region.
There's one more that should have been left alone on the previous nightly, while the most recent nightly was OK.
Comment 14 (Reporter) • 4 years ago
Bug 1614852 also has some investigations of hanging hg clones, in the context of GCP workers.
Comment 15 (Reporter) • 4 years ago
Comment 16 • 4 years ago
Comment 17 • 4 years ago
bugherder
Comment 19 • 4 years ago
Still timing out on the latest merge: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&group_state=expanded&resultStatus=testfailed%2Cbusted%2Cexception&revision=c34351a5fd6c6bfb76ffd7b86b8757d6d95b581c&searchStr=l10n&selectedTaskRun=Rg3yOSv2THqmhKSO2i4KUw.0
Failure log, cloning took 5 mins and updating 5 mins: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=311586372&repo=mozilla-central&lineNumber=57
Here is where it hung, so it looks like a different issue: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=311586372&repo=mozilla-central&lineNumber=924-926
Same lines here: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=311588544&repo=mozilla-central&lineNumber=631-633
Comment 20 (Reporter) • 4 years ago
Looks like 6 failures on that push, 5 are the main signature for this bug:
[task 2020-07-31T00:09:22.108Z] 00:09:22 INFO - Running command: ['hg', '--config', 'ui.merge=internal:merge', '--config', 'extensions.robustcheckout=Z:\\task_1596153325\\build\\src\\testing\\mozharness\\external_tools\\robustcheckout.py', 'robustcheckout', 'https://hg.mozilla.org/l10n-central/cy', 'cy', '--sharebase', 'y:\\hg-shared', '--branch', 'default']
[task 2020-07-31T00:09:22.108Z] 00:09:22 INFO - Copy/paste: hg --config ui.merge=internal:merge --config extensions.robustcheckout=Z:\task_1596153325\build\src\testing\mozharness\external_tools\robustcheckout.py robustcheckout https://hg.mozilla.org/l10n-central/cy cy --sharebase y:\hg-shared --branch default
[task 2020-07-31T00:09:22.190Z] 00:09:22 INFO - (using Mercurial 4.7.1)
[taskcluster 2020-07-31T00:15:04.046Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/Rg3yOSv2THqmhKSO2i4KUw/0/on/us-east-1/i-02a30bcf0a197d56d/until/1596155703.938
[taskcluster 2020-07-31T00:32:04.074Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/Rg3yOSv2THqmhKSO2i4KUw/0/on/us-east-1/i-02a30bcf0a197d56d/until/1596156723.974
[taskcluster:error] Aborting task...
The taskcluster-proxy lines aren't believed to be related; the bug is the hang during an hg clone.
The other one is closely related, as it took 10 minutes to clone+update gecko, then 30 minutes to clone https://hg.mozilla.org/l10n-central/br. It was repacking when it hit the 45 minute max run time.
These jobs are using the internal mirror in us-east-1 to clone gecko, as sheehan re-enabled that at my request. It seems to make the clone+update of gecko a little quicker, ~8 minutes vs ~10, but we're obviously still having issues with the public hg server, the network, or Windows.
The windows image hasn't changed since February. I'm not sure what our next steps are here.
Comment 21 • 4 years ago
> The other one is closely related, as has 10 minutes to clone+update gecko, then 30 minutes to clone https://hg.mozilla.org/l10n-central/br. It was repacking when it hit the 45 minute max run time.
Do these clone tasks occur on different filesystems? It doesn't make sense that clone+update for gecko would be so much faster than clone+update for an l10n repo, which are several orders of magnitude different in size.
Even if the docker image itself hasn't changed, if we mount filesystems into the container any change to the filesystem might be affecting the vcs timings there.
Updated • 4 years ago
Comment 22 • 4 years ago
We are only seeing this on windows machines, and looking at the logs, we are using the same shared directory, and an adjacent checkout directory.
Comment hidden (Intermittent Failures Robot)
Comment 24 • 4 years ago
One idea I have on my mind is to restructure the l10n repack process to do the `libs` and `langpack` phases independently of the package repacks, on just one platform. That'd create an artifact that includes the merge dirs and the dist/xpi-stage, which the repacks would then use.
I had looked at the time we spend on cloning and libs, and thought it might not be worth it, but maybe it is?
Comment 25 • 4 years ago
Comment 26 • 4 years ago
I just attached to this bug my WIP for setting us up to use internal hg mirrors for mozharness clones; I abandoned work on it when I saw Connor's reply in comment 3.
I don't have any better ideas right now sadly :(
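For reference, one way a client-side mirror switch like that WIP could be wired up is Mercurial's bundled `schemes` extension, which expands a short URL scheme into a full prefix. Everything below is a hedged sketch: the mirror hostname is entirely hypothetical, and this is not necessarily how the attached WIP did it:

```
# Hypothetical hgrc fragment (e.g. written by mozharness on the workers)
[extensions]
schemes =

[schemes]
# Expand hgmo://<repo> to an in-region internal mirror instead of the
# public hg.mozilla.org endpoint.
hgmo = https://hg-mirror.use1.example.internal/
```

With that in place, a checkout of `hgmo://l10n-central/mr` would resolve to the mirror, so switching back to the public host is just a config change rather than a code change.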
Comment 27 • 4 years ago
Comment 28 • 4 years ago
Comment 29 • 4 years ago
Comment 30 • 4 years ago
bugherder
Updated (Reporter) • 4 years ago
Comment 31 • 4 years ago
bugherder
Comment 32 • 4 years ago
Can we uplift ^ to Beta? Looks like it's making a huge runtime difference on m-c. On a related note: should we adjust the timeout values downwards again?
Updated (Reporter) • 4 years ago
Comment hidden (Intermittent Failures Robot)
Comment 34 • 4 years ago
We wanted to wait a week, to see if there was fallout on nightly, before landing on beta.
Comment 35 • 4 years ago
bugherder uplift
Comment 36 • 4 years ago
Comment hidden (Intermittent Failures Robot)
Updated • 4 years ago
Comment 39 • 4 years ago
If we want to uplift to esr78: try push | grafted try commit on top of esr78
Comment 40 • 4 years ago
That broke on mac. Trying mac on debian8: try push | attempted fix
Comment 41 • 4 years ago
Still broken on mac =\
Updated • 2 years ago
Updated • 1 year ago
Comment 42 • 1 year ago
The bug assignee is inactive on Bugzilla, so the assignee is being reset.