Convert windows l10n repacks to use cross-l10n on linux (to avoid cloning timeouts)
Categories
(Release Engineering :: Release Automation: L10N, defect)
Tracking
(firefox-esr78 unaffected, firefox80 fixed)
Tracking | Status
---|---
firefox-esr78 | unaffected
firefox80 | fixed
People
(Reporter: nthomas, Unassigned)
References
Details
(Keywords: intermittent-failure, leave-open)
Attachments
(3 files, 1 obsolete file)
First instance on 2020-07-01, cloning the first l10n locale after having already fetched gecko OK:
[task 2020-07-01T02:36:48.207Z] 02:36:48 INFO - Running command: ['hg', '--config', 'ui.merge=internal:merge', '--config', 'extensions.robustcheckout=Z:\\task_1593570207\\build\\src\\testing\\mozharness\\external_tools\\robustcheckout.py', 'robustcheckout', 'https://hg.mozilla.org/l10n-central/mr', 'mr', '--sharebase', 'y:\\hg-shared', '--branch', '708ac4b5f864']
[task 2020-07-01T02:36:48.207Z] 02:36:48 INFO - Copy/paste: hg --config ui.merge=internal:merge --config extensions.robustcheckout=Z:\task_1593570207\build\src\testing\mozharness\external_tools\robustcheckout.py robustcheckout https://hg.mozilla.org/l10n-central/mr mr --sharebase y:\hg-shared --branch 708ac4b5f864
[task 2020-07-01T02:36:48.297Z] 02:36:48 INFO - (using Mercurial 4.7.1)
[taskcluster 2020-07-01T02:42:59.412Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/KaYyTu-iQAm-7X-ioyPKlQ/0/on/us-east-1/i-0d1c3a3b99f5ea602/until/1593572579.278
[taskcluster 2020-07-01T02:59:59.426Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/KaYyTu-iQAm-7X-ioyPKlQ/0/on/us-east-1/i-0d1c3a3b99f5ea602/until/1593573599.291
[taskcluster 2020-07-01T03:16:59.390Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/KaYyTu-iQAm-7X-ioyPKlQ/0/on/us-east-1/i-0d1c3a3b99f5ea602/until/1593574619.238
[taskcluster 2020-07-01T03:33:59.363Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/KaYyTu-iQAm-7X-ioyPKlQ/0/on/us-east-1/i-0d1c3a3b99f5ea602/until/1593575639.199
[taskcluster 2020-07-01T03:50:59.273Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/KaYyTu-iQAm-7X-ioyPKlQ/0/on/us-east-1/i-0d1c3a3b99f5ea602/until/1593576659.108
[taskcluster 2020-07-01T04:07:59.248Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/KaYyTu-iQAm-7X-ioyPKlQ/0/on/us-east-1/i-0d1c3a3b99f5ea602/until/1593577679.076
[taskcluster 2020-07-01T04:24:59.157Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/KaYyTu-iQAm-7X-ioyPKlQ/0/on/us-east-1/i-0d1c3a3b99f5ea602/until/1593578698.979
[taskcluster:error] Aborting task...
[taskcluster 2020-07-01T04:26:04.971Z] SUCCESS: The process with PID 3640 (child process of PID 1872) has been terminated.
[taskcluster 2020-07-01T04:26:04.971Z] SUCCESS: The process with PID 1872 (child process of PID 888) has been terminated.
[taskcluster 2020-07-01T04:26:04.971Z] SUCCESS: The process with PID 2384 (child process of PID 684) has been terminated.
[taskcluster 2020-07-01T04:26:04.971Z] SUCCESS: The process with PID 888 (child process of PID 684) has been terminated.
[taskcluster 2020-07-01T04:26:04.971Z] SUCCESS: The process with PID 684 (child process of PID 5064) has been terminated.
[taskcluster 2020-07-01T04:26:04.971Z]
Other instances:
2020-07-01 https://firefoxci.taskcluster-artifacts.net/DinN4QavSx2OVxg1vY6kXw/0/public/logs/live_backing.log
2020-07-01 https://firefoxci.taskcluster-artifacts.net/K0NUlvinTQacLStEvk7xEw/0/public/logs/live_backing.log (first where other locales are cloned fine then stops)
2020-07-03 https://firefoxci.taskcluster-artifacts.net/BRFPR9rrS6OIaICdVsWflQ/0/public/logs/live_backing.log
2020-07-06 https://firefoxci.taskcluster-artifacts.net/Uw9xb9vkRJagtBxZKiV8Ew/0/public/logs/live_backing.log
2020-07-08 https://firefoxci.taskcluster-artifacts.net/VKpXEqXOSlmq8XNpBoPuAQ/0/public/logs/live_backing.log
2020-07-08 https://firefoxci.taskcluster-artifacts.net/eRp0do6VR86q0hbGuPA6OA/0/public/logs/live_backing.log
2020-07-08 https://firefoxci.taskcluster-artifacts.net/Hou64f9RR8y7veC1kfFWdw/0/public/logs/live_backing.log
2020-07-10 https://firefoxci.taskcluster-artifacts.net/KW8pQecXT262mXejHx2pJg/0/public/logs/live_backing.log (in us-west-2b)
2020-07-10 https://firefoxci.taskcluster-artifacts.net/PMmXP1FTQ_6pHF29YqGwpQ/0/public/logs/live_backing.log
2020-07-16 https://firefoxci.taskcluster-artifacts.net/cD8H73cySxCKn7ZlN7ecXA/0/public/logs/live_backing.log
2020-07-16 https://firefoxci.taskcluster-artifacts.net/VsWulsjqRcC0M0kPoaW2OA/0/public/logs/live_backing.log
2020-07-16 https://firefoxci.taskcluster-artifacts.net/dK7ZK4OsRR6M6kHTrHIfyA/0/public/logs/live_backing.log
All of these are instances in us-east-1, except for KW8pQecXT262mXejHx2pJg (us-west-2b).
Comment 1 (Reporter) • 4 years ago
Connor, we are wondering if this might be abuse/DDoS protection on hg.m.o. These requests would be coming from AWS to the public hg.m.o servers.
We are also thinking of swapping to the private servers, and were curious if there is any protection there too.
Updated (Reporter) • 4 years ago
Comment 3 • 4 years ago
We don't have any true DDoS protection on hgmo (i.e. something like fail2ban), apart from bans on IP blocks. Given the varied public IPs in the logs, I doubt that is the problem here, and the IPs currently banned in the load balancer don't seem to match any in the logs. Using the internal mirrors would work, but I would need to add the new repos to them first, since they currently only host a subset of the repos on the public endpoint.
The way we avoid DDoS from cloning today is by using clonebundles. Perhaps we could start producing bundles for the l10n repos. Usually cloning these repos doesn't cause any disruption since they're quite small, and looking at some of the logs where cloning these repos succeeded, the clone + working directory update took ~10s. Given the small load for each individual clone I don't think the issue here is related to a DDoS from our infra, but I don't have a better explanation either.
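The clonebundles suggestion above would work roughly as follows. This is a hedged sketch: the bundle host and filename are made up, but the manifest format (a `.hg/clonebundles.manifest` file on the server, one bundle URL plus attributes per line) and the `hg bundle` command are standard Mercurial. A weekly cron regenerating the bundles would likely be plenty for low-change l10n repos:

```
# Hypothetical .hg/clonebundles.manifest for l10n-central/mr on the server.
# Clients advertising the clonebundles capability fetch the bundle from the
# listed URL first, then pull only the changesets created since it was built.
https://bundles.example.net/l10n-central/mr.gzip-v2.hg BUNDLESPEC=gzip-v2
```

The bundle file itself could be produced with something like `hg bundle --all --type gzip-v2 mr.gzip-v2.hg` run against the repo.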
Comment hidden (Intermittent Failures Robot)
Comment 5 (Reporter) • 4 years ago
OK, thanks for the information. Perhaps it's something to do with the actual workers themselves.
Braindump of ideas:
- we also see this issue occasionally on mozilla-central for l10n repacks, most recently here. That coincided with 79.0 RC1 having a couple of issues, so it might be related to surge load on the servers. There are 3 platforms × 21 chunks per branch that can all start at roughly the same time, each cloning 5 locales sequentially. Sometimes releases align with nightlies to double the load (or a higher multiple if several releases are started simultaneously). Clone bundles could offload a lot of work from the server, so they might be worthwhile even if only updated weekly, since these are low-change repos
- I've tried adding `socket.setdefaulttimeout(10.0)` to `testing/mozharness/external_tools/robustcheckout.py`, and running a bunch of l10n repacks on try (eg), but haven't been able to reproduce the hang and verify that the timeout works. Setting a much lower timeout locally certainly throws an exception and the retry behaviour of robustcheckout kicks in, so I think this should help us mitigate the issue
- we see this for Windows but not other platforms (which all run on Linux). The Mercurial versions are 4.7.1 and 4.8.1 respectively, and I see support was added in 4.8 for setting an http/https timeout. It defaults to None, and I don't think we are setting that in the hgrc for our Linux images, so it's likely not the key difference
- we could set the max job time to something much lower than 7200 seconds in order to fail faster, and have sheriffs retry manually. Windows repacks are typically 23 minutes long, other platforms 5 minutes, so something like 1800s would tightly match that, or 2400s to allow a little more leeway for variation
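The `socket.setdefaulttimeout` idea in the second bullet can be sanity-checked without a try push. The sketch below is not robustcheckout itself, just a minimal stand-alone demonstration that a process-wide default timeout turns a silently stalled read into a `socket.timeout` exception, which is the kind of error robustcheckout's retry logic could then catch:

```python
import socket

# Process-wide default applied to sockets created afterwards, mirroring
# the socket.setdefaulttimeout(10.0) idea for robustcheckout.py.
# A short value is used here so the demo finishes quickly.
socket.setdefaulttimeout(0.5)

# Stand-in for a server that accepts the connection but then stalls,
# like an hg server that stops responding mid-clone.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.create_connection(srv.getsockname())
try:
    cli.recv(1024)  # the "server" never sends anything
    timed_out = False
except socket.timeout:
    # Without a default timeout this recv would block indefinitely,
    # which is the hang seen in the logs above.
    timed_out = True
finally:
    cli.close()
    srv.close()

print(timed_out)
```

This only covers established connections that go quiet; if the hang is in DNS or TLS setup instead, the default timeout still applies, but the failure mode would look different in the logs.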
Comment hidden (Intermittent Failures Robot)
Updated (Reporter) • 4 years ago
Comment 7 (Reporter) • 4 years ago
Windows jobs typically take 23 minutes, so limit to 30 minutes so that cloning hangs terminate earlier than 2 hours.
Mac and Linux jobs are typically 5 minutes long, so bring the limit in to 15 minutes from an hour.
Also refactors so that the max run time is only specified in the kind, rather than partly in the kind and partly in the transform.
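A sketch of what that kind-level setting might look like, assuming the usual taskgraph `max-run-time` key (in seconds) with `by-build-platform` keying; the file path and surrounding structure here are illustrative, not the actual patch:

```yaml
# Hypothetical excerpt from a nightly-l10n kind.yml; only max-run-time
# is the point here, the surrounding keys are assumptions.
job-template:
    worker:
        max-run-time:
            by-build-platform:
                win.*: 1800      # Windows repacks typically take ~23 min
                default: 900     # Mac/Linux repacks typically take ~5 min
```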
Updated • 4 years ago
Comment 9 (Reporter) • 4 years ago
We're unsure what the next step is here, unassigning for now.
Updated • 4 years ago
Comment 10 • 4 years ago
Comment 11 • 4 years ago
bugherder
Comment 12 • 4 years ago
Comment 13 (Reporter) • 4 years ago
Looks like the 30 minute timeout is too short for some jobs. There are 6 failures on that push for windows l10n: 1 has the l10n hang and was failed as we wanted, but 5 are jobs that should have been left alone, as they had finished cloning and were repacking. It seems to be related to AWS instances in us-east-1 which don't already have a clone of gecko, but it doesn't always occur. I'm asking around about why we don't use the internal hg servers for that AWS region.
There's one more that should have been left alone on the previous nightly, while the most recent nightly was OK.
Comment 14 (Reporter) • 4 years ago
Bug 1614852 also has some investigations of hanging hg clones, in the context of GCP workers.
Comment 15 (Reporter) • 4 years ago
Comment 16 • 4 years ago
Comment 17 • 4 years ago
bugherder
Comment 19 • 4 years ago
Still timing out on the latest merge: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&group_state=expanded&resultStatus=testfailed%2Cbusted%2Cexception&revision=c34351a5fd6c6bfb76ffd7b86b8757d6d95b581c&searchStr=l10n&selectedTaskRun=Rg3yOSv2THqmhKSO2i4KUw.0
Failure log, cloning took 5 mins and updating 5 mins: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=311586372&repo=mozilla-central&lineNumber=57
Here is where it hung, so it looks like a different issue: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=311586372&repo=mozilla-central&lineNumber=924-926
Same lines here: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=311588544&repo=mozilla-central&lineNumber=631-633
Comment 20 (Reporter) • 4 years ago
Looks like 6 failures on that push, 5 are the main signature for this bug:
[task 2020-07-31T00:09:22.108Z] 00:09:22 INFO - Running command: ['hg', '--config', 'ui.merge=internal:merge', '--config', 'extensions.robustcheckout=Z:\\task_1596153325\\build\\src\\testing\\mozharness\\external_tools\\robustcheckout.py', 'robustcheckout', 'https://hg.mozilla.org/l10n-central/cy', 'cy', '--sharebase', 'y:\\hg-shared', '--branch', 'default']
[task 2020-07-31T00:09:22.108Z] 00:09:22 INFO - Copy/paste: hg --config ui.merge=internal:merge --config extensions.robustcheckout=Z:\task_1596153325\build\src\testing\mozharness\external_tools\robustcheckout.py robustcheckout https://hg.mozilla.org/l10n-central/cy cy --sharebase y:\hg-shared --branch default
[task 2020-07-31T00:09:22.190Z] 00:09:22 INFO - (using Mercurial 4.7.1)
[taskcluster 2020-07-31T00:15:04.046Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/Rg3yOSv2THqmhKSO2i4KUw/0/on/us-east-1/i-02a30bcf0a197d56d/until/1596155703.938
[taskcluster 2020-07-31T00:32:04.074Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/Rg3yOSv2THqmhKSO2i4KUw/0/on/us-east-1/i-02a30bcf0a197d56d/until/1596156723.974
[taskcluster:error] Aborting task...
The taskcluster-proxy lines aren't believed to be related; the bug is the hang during an hg clone.
The other one is closely related, as it took 10 minutes to clone+update gecko, then 30 minutes to clone https://hg.mozilla.org/l10n-central/br. It was repacking when it hit the 45 minute max run time.
These jobs are using the internal mirror in us-east-1 to clone gecko, as sheehan re-enabled that at my request. It seems to make the clone+update of gecko a little quicker, ~8 minutes vs ~10, but we're obviously still having issues with the public hg server, the network, or Windows.
The windows image hasn't changed since February. I'm not sure what our next steps are here.
Comment 21 • 4 years ago
> The other one is closely related, as has 10 minutes to clone+update gecko, then 30 minutes to clone https://hg.mozilla.org/l10n-central/br. It was repacking when it hit the 45 minute max run time.
Do these clone tasks occur on different filesystems? It doesn't make sense that clone+update for gecko would be so much faster than clone+update for an l10n repo, which are several orders of magnitude different in size.
Even if the docker image itself hasn't changed, if we mount filesystems into the container any change to the filesystem might be affecting the vcs timings there.
Updated • 4 years ago
Comment 22 • 4 years ago
We are only seeing this on windows machines, and looking at the logs, we are using the same shared directory, and an adjacent checkout directory.
Comment hidden (Intermittent Failures Robot)
Comment 24 • 4 years ago
One idea I have on my mind is to restructure the l10n repack process to do the `libs` and `langpack` phases independently of the package repacks, on just one platform. That'd create an artifact that includes the merge dirs and the dist/xpi-stage, which the repacks would then use.
I had looked at the time we spend on cloning and libs, and thought it might not be worth it, but maybe it is?
Comment 25 • 4 years ago
Comment 26 • 4 years ago
I just attached to this bug my WIP for setting us up to use internal hg mirrors for mozharness clones; I abandoned work on it when I saw Connor's reply in comment 3.
I don't have any better ideas right now sadly :(
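For reference, one way a client-side mirror switch like that WIP could be wired up is Mercurial's bundled `schemes` extension, which expands a short URL scheme into a full prefix. Everything below is a hedged sketch: the mirror hostname is entirely hypothetical, and this is not necessarily how the attached WIP did it:

```
# Hypothetical hgrc fragment (e.g. written by mozharness on the workers)
[extensions]
schemes =

[schemes]
# Expand hgmo://<repo> to an in-region internal mirror instead of the
# public hg.mozilla.org endpoint.
hgmo = https://hg-mirror.use1.example.internal/
```

With that in place, a checkout of `hgmo://l10n-central/mr` would resolve to the mirror, so switching back to the public host is just a config change rather than a code change.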
Comment 27 • 4 years ago
Comment 28 • 4 years ago
Comment 29 • 4 years ago
Comment 30 • 4 years ago
bugherder
Updated (Reporter) • 4 years ago
Comment 31 • 4 years ago
bugherder
Comment 32 • 4 years ago
Can we uplift ^ to Beta? Looks like it's making a huge runtime difference on m-c. On a related note: should we adjust the timeout values downwards again?
Updated (Reporter) • 4 years ago
Comment hidden (Intermittent Failures Robot)
Comment 34 • 4 years ago
We wanted to wait a week, to see if there was fallout on nightly, before landing on beta.
Comment 35 • 4 years ago
bugherder uplift
Comment 36 • 4 years ago
Comment hidden (Intermittent Failures Robot)
Updated • 4 years ago
Comment 39 • 4 years ago
If we want to uplift to esr78: try push | grafted try commit on top of esr78
Comment 40 • 4 years ago
That broke on mac. Trying mac on debian8: try push | attempted fix
Comment 41 • 4 years ago
Still broken on mac =\
Updated • 2 years ago
Updated • 1 year ago
Comment 42 • 1 year ago
The bug assignee is inactive on Bugzilla, so the assignee is being reset.