<a class="header-button" href="https://bugzilla-dev.allizom.org/home" title="Go to home page"> Bugzilla

Reporter

Updated

•

6 years ago

Blocks: 1338040

Reporter

Comment 2

•

6 years ago

See especially https://bugzilla.mozilla.org/show_bug.cgi?id=1474758#c41: "we may be facing resource starvation in shared tasks. The solution would be to reduce worker capacity and spawn more instances."

While I have some concerns (why didn't we see this effect when we initially experimented with a very small pool? why did this seem to start fairly suddenly a few weeks ago, when there were no big changes to load?), it seems the best explanation. Can we change to 2 workers per instance?

Flags: needinfo?(coop)

Comment 3

•

6 years ago

(In reply to Geoff Brown [:gbrown] from comment #2)

While I have some concerns (why didn't we see this effect when we initially experimented with a very small pool? why did this seem to start fairly suddenly a few weeks ago, when there were no big changes to load?), it seems the best explanation. Can we change to 2 workers per instance?

It will take a while to recreate all the instances. I'll start tomorrow AM. I'll also want to socialize the associated increase in cost with Travis.

Assignee: nobody → coop

Status: NEW → ASSIGNED

Flags: needinfo?(coop)

Comment 4

•

6 years ago

I've bumped the total number of instances up to 40 from 25.

I started recreating machine-0, but it took 5 attempts to recreate that single instance due to the flakiness of bootstrapping from scratch every time. Rather than recreate the existing instances, I've decided to provision the new, higher-number instances first instead. Once we have that new capacity, I'll return and recreate the low-numbered ones.

So far we have 4 instances that are each running only 2 workers instead of 4. These are:

machine-0
machine-25
machine-30
machine-34

That gives you some indication of how often provisioning works on the first pass. :/

I'll continue slogging through this.

Comment 5

•

6 years ago

machine-[24-39] are all running with 2 workers each. I'll now start recreating the existing instances.

Comment 6

•

6 years ago

There are now 40 packet.net instances, each of which are running 2 workers.

If we are still seeing timeouts and slowdowns in this new configuration, we can drop to a single worker per instance, but at that point, we've pretty much invalidated the reason for pursuing packet.net instances in the first place.

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

Resolution: --- → FIXED

Updated

•

6 years ago

Blocks: 1535550

Updated

•

6 years ago

Blocks: 1536583

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=243333700&repo=autoland&lineNumber=14416
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=243310537&repo=autoland&lineNumber=12875
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=243217995&repo=mozilla-inbound&lineNumber=28534
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=243217990&repo=mozilla-inbound&lineNumber=27638
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=243217501&repo=autoland&lineNumber=29276

Reporter

Comment 7

•

6 years ago

:coop -- Despite your efforts here, this condition continues; in fact, I see no difference.

From bug 1474758, all of these recent (April 28, 29) failures' android-performance.log artifacts show /proc/cpuinfo around 800 Mhz:

eg https://taskcluster-artifacts.net/H71E7qBJRbGwhymOkqE4YQ/0/public/test_info//android-performance.log

Host /proc/cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 158
model name : Intel(R) Xeon(R) CPU E3-1240 v6 @ 3.70GHz
stepping : 9
microcode : 0x8e
cpu MHz : 799.980

If we really are running max 2 workers/instance now, then I think there is no correlation between the reduced cpuinfo MHz and # workers/instance.

Flags: needinfo?(coop)

Dustin J. Mitchell [:dustin] (he/him)

Comment 8

•

6 years ago

I would be surprised to see such a correspondance -- nothing in docker apportions MHz between containers.

This sounds a lot more like CPU throttling has somehow been enabled.

Jason Orendorff [:jorendorff]

Comment 9

•

6 years ago

I just opened a new ticket in packet for the CPU throttling problem.

Updated

•

6 years ago

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Joel Maher ( :jmaher ) (UTC -8)

Comment 10

•

6 years ago

for reference a few years ago :garndt and I worked on running talos at packet.net and we saw the same thing- specific workers were running significantly slower than the majority of the other workers- we were doing 1 process/machine. We spent a few weeks working with our contact at packet.net at the time and didn't get anywhere, they were confused and couldn't explain it.

if we were to work around this and had 4 instances/machine, that means that all instances on those machines would be "auto retried/failed".

does this change over time, as in one machine works fine at 9am but at 11am it is running slower? Was the machine rebooted in between, etc.?

Comment 11

•

6 years ago

I got a response from packet.net:

So after reviewing what could be the possible cause for this, I dag deeper on c1.small.x86 capabilities instead.
I highly suggest to tune your CPU and this to verify or If you haven't already here is our best guide below:
https://support.packet.com/kb/articles/cpu-tuning

This can be verified/confirmed by

cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

I am working in another bug and will dig into this afterward.

Reporter

Comment 12

•

6 years ago

Attached file Bug 1545308 - Add cpufreq/scaling_governor info to android-performance.log; r=wcosta (deleted) — Details

Verify setting of scaling_governor by adding it to existing log.

https://hg.mozilla.org/mozilla-central/rev/d6416b899841

Reporter

Updated

•

6 years ago

Keywords: leave-open

AlHopper

Comment 13

•

6 years ago

I've seen this on multiple Intel based machines - desktops and servers.

It's related to the scaling governor - and the default appears to throttle down the processor to 800MHz (usually) if its not modified. You may have to tweak the following scripts - depending on your processor type.

First to see what mode the scaling governor is in:

#!/bin/bash

cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

----- call this mode.sh --------

Next to set all the cores for max performance - try this:

#!/bin/bash
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor >/dev/null

------ call this scripts perf.sh --------

Next - rerun the "mode.sh" script above and you should see:

performance

printed out once for every core you have access to.

Lastly - if you want to save energy, you can set the scaling governor to save power with:

#!/bin/bash

echo powersave | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor >/dev/null

------- call this psave.sh -------

These scripts are copied from from a server with a 6-core 12 thread Xeon processor with an Intel motherboard. I've hunted through all the BIOS settings and set everything up to "go fast" - that is - no throttling. Yet I have to run the perf.sh script everytime the machine reboots!!

Pulsebot

Comment 14

•

6 years ago

Pushed by gbrown@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/d6416b899841 Add cpufreq/scaling_governor info to android-performance.log; r=wcosta

Dorel Luca [:dluca]

Comment 15

•

6 years ago

bugherder

Reporter

Comment 16

•

6 years ago

Thanks Al - sounds like we are on the right track.

And my diagnostic patch confirms that we usually see "powersave".

I can't write to scaling_governor in the same place -- no privileges. Maybe that's better done in the worker? Hoping wcosta or coop can sort that out...

Comment hidden (Intermittent Failures Robot)

Comment 18

•

6 years ago

I just redeployed instances with CPU governor set to "performance". :gbrown, could you please confirm the slowness issue is gone?

Flags: needinfo?(gbrown)

https://treeherder.mozilla.org/logviewer.html#?job_id=244855542&repo=autoland

Reporter

Comment 19

•

6 years ago

I still see "powersave" reported by recent tasks:

https://taskcluster-artifacts.net/aA8f776nSMqt3jS99rDvzQ/0/public/test_info//android-performance.log

/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu2/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu4/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu5/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu6/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu7/cpufreq/scaling_governor: powersave

Flags: needinfo?(gbrown)

Comment 20

•

6 years ago

Two things:

Given that reducing the # of workers per instance didn't fix this problem, are we back to running 4 workers/instance with 25 instances total again? I never landed my change to switch to 2 workers/instance with 40 instances when it became clear that wasn't helping, and AFAICT there's been no other change to the terraform file: https://github.com/taskcluster/taskcluster-infrastructure/blob/master/docker-worker.tf Just want to make sure that the github repo is representative of the current state.
The sooner we can get to image-based deployments in packet.net, the better. This iteration cycle is going to be painful otherwise. Wander: can you pick up bug 1523569 once sccache in GCP is done, please? Maybe a git repo is overkill, but having some sort of local filestore for images hosted in packet.net will be required for bug 1508790 anyway.

Flags: needinfo?(coop) → needinfo?(wcosta)

Updated

•

6 years ago

Blocks: 1533084

Comment 21

•

6 years ago

(In reply to Geoff Brown [:gbrown] from comment #19)

I still see "powersave" reported by recent tasks:

https://treeherder.mozilla.org/logviewer.html#?job_id=244855542&repo=autoland

https://taskcluster-artifacts.net/aA8f776nSMqt3jS99rDvzQ/0/public/test_info//
android-performance.log

/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu2/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu4/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu5/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu6/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu7/cpufreq/scaling_governor: powersave

There was a bustage in the code, I redeployed and it now has "performance"

Flags: needinfo?(wcosta)

https://taskcluster-artifacts.net/Ghl2qrptRyujTc4L3ecniQ/0/public/test_info//android-performance.log

Reporter

Comment 22

•

6 years ago

(In reply to Wander Lairson Costa [:wcosta] from comment #21)

There was a bustage in the code, I redeployed and it now has "performance"

Sorry, but I still see "powersave" reported by all recent tasks:

https://treeherder.mozilla.org/logviewer.html#?job_id=245395212&repo=autoland (Started: Wed, May 8, 13:04:33)

/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu2/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu4/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu5/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu6/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu7/cpufreq/scaling_governor: powersave

Comment 23

•

6 years ago

(In reply to Wander Lairson Costa [:wcosta] from comment #21)

There was a bustage in the code, I redeployed and it now has "performance"

Relevant PR is here: https://github.com/taskcluster/taskcluster-infrastructure/pull/46

I suggested to Wander in IRC that we should try fixing this by hand, i.e. ssh to each machine and manually set scaling_governor to performance. We can then iterate on making the deployment automation do this automatically.

https://treeherder.mozilla.org/logviewer.html#?job_id=245429039&repo=autoland

Reporter

Comment 24

•

6 years ago

The latest tasks have "performance" now:

https://taskcluster-artifacts.net/Cug6JQYTT-2QwWLnjJR6tQ/0/public/test_info//android-performance.log

/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu2/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu4/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu5/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu6/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu7/cpufreq/scaling_governor: performance

Self ni to check on associated intermittent failures tomorrow.

Flags: needinfo?(gbrown)

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=245433579&repo=autoland&lineNumber=13527

Reporter

Comment 25

•

6 years ago

Oh darn.

From bug 1474758,

https://taskcluster-artifacts.net/SOjCQvELT7OoeYGgaVYy-w/0/public/test_info//android-performance.log

Host cpufreq/scaling_governor:
/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu2/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu4/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu5/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu6/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu7/cpufreq/scaling_governor: performance

Host /proc/cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 94
model name : Intel(R) Xeon(R) CPU E3-1240 v5 @ 3.50GHz
stepping : 3
microcode : 0x6a
cpu MHz : 799.941
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs :
bogomips : 7008.65
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

Notice "performance": this happened after yesterday's change.

Notice "cpu MHz : 799.941"!!

Flags: needinfo?(gbrown)

Comment 26

•

6 years ago

(In reply to Geoff Brown [:gbrown] from comment #25)

Oh darn.

Notice "cpu MHz : 799.941"!!

OK, that's concerning.

Wander: can you double-check the pool of instances to see how pervasive this is, and then reach out to packet for next steps here?

I've NI-ed Al who is already on this bug too.

Flags: needinfo?(wcosta)

Flags: needinfo?(al)

https://treeherder.mozilla.org/logviewer.html#?job_id=245544363&repo=autoland
https://treeherder.mozilla.org/logviewer.html#?job_id=245433579&repo=autoland
https://treeherder.mozilla.org/logviewer.html#?job_id=245434394&repo=mozilla-central

Reporter

Comment 27

•

6 years ago

I see 3 new examples so far:

2 of the 3 examples above have worker id "machine-4". Can that worker id be translated into a packet.net instance that someone could ssh into to investigate further?

AlHopper

Comment 28

•

6 years ago

The fix I provided in Comment 13 will not persist after a reboot.
To ensure that the scaling governor is placed into performance mode requires the following steps:

Add the following line:

GOVERNOR="performance"

in

/etc/init.d/cpufrequtils

On Ubuntu 18.04 you need to run:

sudo apt-get install cpufrequtils
sudo systemctl disable ondemand

Flags: needinfo?(al)

Comment 29

•

6 years ago

I spotted frequencies going to 800 MHz even with scaling governor set to performance. What I am now doing is, besides setting scaling governor to performance, I also set the minimum CPU frequency to 3.5 GHz. :gbrown, could you please keep an eye on failing tasks?

Flags: needinfo?(wcosta)

Comment 30

•

6 years ago

Update: even so, I can see machines running with 800 MHz.

Reporter

Comment 31

•

6 years ago

Associated test failures definitely continue and remain a big concern. Logs show "performance" and ~800 MHz.

Do we have more ideas? Any work in progress?

I am tempted to have the task fail and retry when it finds this condition.

Comment 32

•

6 years ago

Do we know the frequency how often this happens? Is it happening on all workers or only for some of those? Also which scaling driver is actually running? Maybe there is a bug we are just hitting here?

Reporter

Comment 33

•

6 years ago

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #32)

Do we know the frequency how often this happens?

Not really. It is not very frequent, but there are about 10 cases found in bug 1474758 each day; that is only jsreftests, which account for maybe 10% of packet.net tasks, so a very gross estimate would be 100 cases of low MHz per day.

Is it happening on all workers or only for some of those?

There is a correlation with certain worker-ids for a period of time, but the affected worker-ids seem to change from day to day. Look at the "Machine name" column of https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-05-09&endday=2019-05-16&tree=trunk&bug=1474758 to see what I mean.

Also which scaling driver is actually running?

When a task cats /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor, it sees "performance" now.

Maybe there is a bug we are just hitting here?

A scaling driver / governor bug?

Comment 34

•

6 years ago

(In reply to Geoff Brown [:gbrown] from comment #33)

Not really. It is not very frequent, but there are about 10 cases found in bug 1474758 each day; that is only jsreftests, which account for maybe 10% of packet.net tasks, so a very gross estimate would be 100 cases of low MHz per day.

Is it happening on all workers or only for some of those?

There is a correlation with certain worker-ids for a period of time, but the affected worker-ids seem to change from day to day. Look at the "Machine name" column of https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-05-09&endday=2019-05-16&tree=trunk&bug=1474758 to see what I mean.

As it looks like it happens for the workers with the name 7, 8, 34, and 36. Others only appear once in that list. Maybe someone could check one of those manually? Can we blacklist (taking out of the pool) for now?

Also which scaling driver is actually running?

When a task cats /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor, it sees "performance" now.

So comment 23 referenced: https://github.com/taskcluster/taskcluster-infrastructure/pull/46/files

Where in this patch do we actually set the governor to performance? We only start the cpufreq service, or? Using cat in a task (?) to set it, doesn't it conflict with the service?

A scaling driver / governor bug?

Yes, so it would be good to know which driver is actually used. Is it intel_pstate?

Reporter

Updated

•

6 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1552334

Reporter

Comment 35

•

6 years ago

Bug 1552334 recognizes the slow instances and retries the affected task. That is effective in avoiding test failures, but sometimes delays test runs and is inefficient in terms of worker use: It would still be great to see this bug resolved properly.

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&searchStr=android%2Copt%2Cx86_64%2Cwpt2&revision=b74e5737da64a7af28ab4f81f996950917aa71c5

Reporter

Comment 36

•

6 years ago

(In reply to Geoff Brown [:gbrown] from comment #35)

sometimes delays test runs and is inefficient in terms of worker use

As an example, in

this task retried 4 times (getting worker 8 or worker 34 each time). The task takes about 5 minutes to detect the poor performance (this could be improved) + time for rescheduling, so the start of the successful task was delayed by about 30 minutes in total.

Joel Maher ( :jmaher ) (UTC -8)

Comment 37

•

6 years ago

so maybe we estimate 7 minutes/retry- and then calculate the total retries or % retries or retries/day, then we could determine how many workers we need.

Actually if we have this retry in place, possibly we could consider running talos on linux @packet.net :)

Updated

•

6 years ago

Assignee: coop → nobody

Updated

•

6 years ago

Blocks: 1554706

Reporter

Comment 38

•

6 years ago

This seems to have stopped!

I see no retries due to reduced MHz since May 25. Wonderful!

Let's keep this bug open, unless we know how it was fixed...

Comment 39

•

6 years ago

Self note: the correct command line to set the cpu governor for intel_pstate driver is:

echo performance | tee /sys/devices/system/cpu/cpufreq/policy*/scaling_governor

https://github.com/taskcluster/docker-worker/commit/0f3cef5b3d10e83fa9e5cf23c841637520f3d7bc

Comment 40

•

6 years ago

This is fixed in the image based workers.

Status: REOPENED → RESOLVED

Closed: 6 years ago → 6 years ago

Resolution: --- → FIXED