Closed Bug 1489264 Opened 6 years ago Closed 6 years ago

Investigate running packet.net-based Android emulator unit tests on GCP

Categories

(Testing :: General, enhancement, P1)

Version 3
enhancement

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: gbrown, Assigned: gbrown)

References

Details

(Whiteboard: [geckoview:p2])

Building on wcosta's foundation, let's see how well Android 4.2/4.3 tests run on GCP.
Oops, cloned that too well.

s/Android 4.2/4.3/Android 7.0/
My initial attempt:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a0b4e7a111a7a879a141ef45b9e7225516690fc0&filter-tier=1&filter-tier=2&filter-tier=3

https://queue.taskcluster.net/v1/task/enqpmpaER9mhXgCfjatrSA/runs/1/artifacts/public/logs/live.log

The android x86 emulator does not start because kvm is not available.

https://taskcluster-artifacts.net/enqpmpaER9mhXgCfjatrSA/1/public/test_info//emulator-qMHGOf.log

emulator: CPU Acceleration: DISABLED
emulator: CPU Acceleration status: KVM requires a CPU that supports vmx or svm
emulator: ERROR: x86_64 emulation currently requires hardware acceleration!
Please ensure KVM is properly installed and usable.
CPU acceleration status: KVM requires a CPU that supports vmx or svm
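
For anyone reproducing this, the "requires a CPU that supports vmx or svm" message can be checked independently of the emulator. The following is a minimal sketch, assuming a Linux worker where /proc/cpuinfo lists CPU flags on lines beginning with "flags"; the function names are hypothetical and are not part of mozharness or the emulator tooling.

```python
def cpu_virtualization_flags(cpuinfo_text):
    """Return the hardware-virtualization flags (vmx, svm) found in cpuinfo text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        # On x86 Linux each logical CPU has a line like "flags\t\t: fpu vme ... vmx ..."
        if line.startswith("flags"):
            _, _, values = line.partition(":")
            found.update(flag for flag in values.split() if flag in ("vmx", "svm"))
    return found


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo pseudo-file (the default path is an assumption about the host OS)."""
    with open(path) as handle:
        return handle.read()
```

Calling cpu_virtualization_flags(read_cpuinfo()) on a worker and getting an empty set would be consistent with the emulator error above: the guest CPU does not expose vmx or svm, so KVM cannot be used.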
(In reply to Geoff Brown [:gbrown] from comment #2)
> The android x86 emulator does not start because kvm is not available.

Paging Wander
Flags: needinfo?(wcosta)
(In reply to Chris Cooper [:coop] pronoun: he from comment #3)
> (In reply to Geoff Brown [:gbrown] from comment #2)
> > The android x86 emulator does not start because kvm is not available.
> 
> Paging Wander

If that's urgent I can look now, if not I will postpone for when I recover.
Flags: needinfo?(wcosta) → needinfo?(coop)
(In reply to Wander Lairson Costa [:wcosta] from comment #5) 
> If that's urgent I can look now, if not I will postpone for when I recover.

Redirecting NI to gbrown to see whether this is a blocker.
Flags: needinfo?(coop) → needinfo?(gbrown)
The geckoview tests currently running on mozilla-central, tier 3:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=android%20x86%207.0&filter-tier=1&filter-tier=2&filter-tier=3

could run on integration branches as tier 1 today, on packet.net, if there was a flexible provisioning solution available (or I suppose, if we just committed to a big pool of packet.net instances). Currently we are paused on committing to packet.net while we investigate gcp; in that sense, this bug is a blocker for getting geckoview tests to tier 1. The geckoview team has been very patient to date, but let's check in with them...

:davidb - Can you comment on how important and urgent it is to get these geckoview tests running in continuous integration?
Flags: needinfo?(gbrown) → needinfo?(dbolter)
Priority: -- → P1
Discussed with Jim and Snorp. This is not super urgent while we have arm coverage - thanks for the ping!
Flags: needinfo?(dbolter)
Depends on: 1490040
The nested-vm feature is enabled.
(In reply to Wander Lairson Costa [:wcosta] from comment #9)
> The nested-vm feature is enabled.

Yes, that helps!

https://treeherder.mozilla.org/#/jobs?repo=try&revision=f7ff463dcfe8063d4c335a6c4f9da378dc7ae320&filter-tier=1&filter-tier=2&filter-tier=3

https://treeherder.mozilla.org/logviewer.html#?job_id=198554836&repo=try&lineNumber=849

[task 2018-09-11T01:43:36.807Z] 01:43:36     INFO - Running command: ['ls', '-l', '/dev/kvm']
[task 2018-09-11T01:43:36.807Z] 01:43:36     INFO - Copy/paste: ls -l /dev/kvm
[task 2018-09-11T01:43:36.812Z] 01:43:36     INFO -  crw-rw-rw- 1 root root 10, 232 Sep 11 01:42 /dev/kvm
[task 2018-09-11T01:43:36.812Z] 01:43:36     INFO - Return code: 0
[task 2018-09-11T01:43:36.812Z] 01:43:36     INFO - Running command: ['kvm-ok']
[task 2018-09-11T01:43:36.813Z] 01:43:36     INFO - Copy/paste: kvm-ok
[task 2018-09-11T01:43:36.820Z] 01:43:36     INFO -  INFO: /dev/kvm exists
[task 2018-09-11T01:43:36.820Z] 01:43:36     INFO -  KVM acceleration can be used
[task 2018-09-11T01:43:36.821Z] 01:43:36     INFO - Return code: 0
[task 2018-09-11T01:43:36.821Z] 01:43:36     INFO - Running command: ['emulator', '-accel-check']
[task 2018-09-11T01:43:36.821Z] 01:43:36     INFO - Copy/paste: emulator -accel-check
[task 2018-09-11T01:43:36.837Z] 01:43:36     INFO -  accel:
[task 2018-09-11T01:43:36.837Z] 01:43:36     INFO -  0
[task 2018-09-11T01:43:36.837Z] 01:43:36     INFO -  KVM (version 12) is installed and usable.
[task 2018-09-11T01:43:36.837Z] 01:43:36     INFO -  accel
[task 2018-09-11T01:43:36.838Z] 01:43:36     INFO - Return code: 0

Now the x86 emulator starts and uses kvm - great!!
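
The `ls -l /dev/kvm` step in the log above can also be expressed as a small pre-flight check. This is only a sketch of that one check, assuming the device lives at /dev/kvm; the function name and the returned dictionary keys are hypothetical, and it does not replace `kvm-ok` or `emulator -accel-check`.

```python
import os
import stat


def kvm_device_status(path="/dev/kvm"):
    """Report whether the KVM device node exists, is a character device, and is read/write."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing node or no permission to stat it: treat as unavailable.
        return {"exists": False, "char_device": False, "read_write": False}
    return {
        "exists": True,
        "char_device": stat.S_ISCHR(mode),
        # The emulator process needs both read and write access to the device.
        "read_write": os.access(path, os.R_OK | os.W_OK),
    }
```

With the `crw-rw-rw-` permissions shown in the log, all three values would be expected to be True for the task user.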
*But*... when I try to run a full set of Android x86 tests, I find that most of them time out and the tasks retry or fail:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=63e04a2b1573409c9c011ae224b24d0b24a80cc6&filter-tier=1&filter-tier=2&filter-tier=3

It seems that each test task will run successfully if run alone (one task at a time), but fails when ~3 or more such test tasks are running at once. The difference may only be performance: each task runs significantly slower, sufficient to trigger timeouts.
(In reply to Geoff Brown [:gbrown] (less available Sept 10-14) from comment #11)
> It seems that each test task will run successfully if run alone (one task at
> a time), but fails when ~3 or more such test tasks are running at once. The
> difference may only be performance: each task runs significantly slower,
> sufficient to trigger timeouts.

Geoff: in the meeting yesterday, we briefly discussed setting up a matrix of instance configs, like we did for packet.net, to home in on the best combination of price/performance. Given your comment, it sounds like you're ready for this, and that we should aim higher spec-wise until we can match the packet.net performance. We can then compare results from GCP vs packet.net directly.

Wander: can you get the matrix set up for Geoff? We can rope in other people (Brian, John, ...) as required.
Flags: needinfo?(wcosta)
I set up a grid with n1-standard-{2,4,8,16} machine types. Each machine type has 4 instances. The worker-types are gce/n1-std-{2,4,8,16}.
Flags: needinfo?(wcosta)
Depends on: 1490962
Update: the worker types were renamed to gecko-t-linux-{2,4,8,16}.
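
To spell out the brace notation, the grid described above expands as follows. This is an illustrative sketch only: the pairing of each n1-standard-N machine type with the gecko-t-linux-N worker type is inferred from the two comments above, and the function and key names are hypothetical.

```python
VCPU_COUNTS = (2, 4, 8, 16)  # sizes listed in the comments above


def worker_type_grid(vcpu_counts=VCPU_COUNTS, instances_per_type=4):
    """Expand n1-standard-{...} / gecko-t-linux-{...} into one entry per machine size."""
    return [
        {
            "machine_type": f"n1-standard-{n}",
            "worker_type": f"gecko-t-linux-{n}",  # assumed one-to-one with the machine size
            "instances": instances_per_type,
        }
        for n in vcpu_counts
    ]


for entry in worker_type_grid():
    print(entry["worker_type"], "->", entry["machine_type"], "x", entry["instances"])
```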
First attempt with gecko-t-linux-2 is not working:

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=f4793fdf946e8c2d47395c449979d1f14579b422
Flags: needinfo?(wcosta)
I suppose that might have been affected by bug 1491948. Will re-test when trees re-open.
Flags: needinfo?(wcosta)
Bug 1492553 is a complication -- some tests are currently perma-fail on mozilla-central. I hadn't realized that before...but I don't think it affects comment 15.
gecko-t-linux-4 is not fast enough and we see frequent task retries when the emulator fails to start:

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&searchStr=android-em-7&revision=0ce5b4e132d2c19c6963919ac65e031399cf2318

gecko-t-linux-8 eliminates most task retries and tests complete, but take more than twice as long to complete as on packet.net:

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&searchStr=android-em-7&revision=dfcaf35b6d2d3b2e4d3125a81fe31daab36f2326

gecko-t-linux-16 shows no significant improvement over gecko-t-linux-8:

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&searchStr=android-em-7&revision=a6fed1047e0eee01b4cb3192c171bba377486833

gecko-t-linux-32 shows no significant improvement over gecko-t-linux-8:

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&searchStr=android-em-7&revision=443111c064cba983cd9e9c1dadbd8154a9456716


For these configurations, it looks like gecko-t-linux-8 is the best we can do; tests pass, but run 2x to 3x as long as they currently do on packet.net -- disappointing.

(I'm only comparing a couple of mochitests and geckoview-junit here, due to bug 1492553.)
Whiteboard: [geckoview:p2]
To follow up here: are we not interested in using GCP for geckoview x86-based tests because of the longer runtime compared to packet.net, or are we OK with that?
The plan is to continue to use packet.net for geckoview x86 tests unless we can improve performance on gcp. :wcosta is investigating to see if performance improvements on gcp are possible.
(In reply to Geoff Brown [:gbrown] from comment #20)
> The plan is to continue to use packet.net for geckoview x86 tests unless we
> can improve performance on gcp. :wcosta is investigating to see if
> performance improvements on gcp are possible.

Wander: can you comment on what GCP permutations you've tried so far? (standard vs highmem vs highcpu vs custom vs ...)

From our discussion this morning, you haven't found a winning combination of instance types/specs yet. If we're going to come back to this in the future, we should keep a record of what we've already tried.
Flags: needinfo?(wcosta)
GCP takes a different approach: these machine types change only the CPU-to-memory ratio, that is, they differ only in how much memory you get, on average, per core. For the purposes of the speed test, therefore, "standard" was all we needed. Given that, n1-standard-8 seems to be the winner; I didn't compare them in terms of cost.
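
As a rough illustration of that point, the sketch below computes the memory that each N1 family provides at a given vCPU count. The per-vCPU figures (3.75 GB for standard, 6.5 GB for highmem, 0.9 GB for highcpu) are my recollection of the published N1 machine shapes and should be checked against the current GCP documentation; the function name is hypothetical.

```python
# Assumed memory per vCPU, in GB, for the N1 predefined families.
MEMORY_GB_PER_VCPU = {"standard": 3.75, "highmem": 6.5, "highcpu": 0.9}


def n1_machine_shape(family, vcpus):
    """Return the name, vCPU count, and total memory for an N1 machine type."""
    return {
        "name": f"n1-{family}-{vcpus}",
        "vcpus": vcpus,
        "memory_gb": round(MEMORY_GB_PER_VCPU[family] * vcpus, 2),
    }


for family in MEMORY_GB_PER_VCPU:
    print(n1_machine_shape(family, 8))
```

At 8 vCPUs the three families have the same core count and differ only in total memory, which is why only the "standard" family was exercised for the speed comparison.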

Also, I noticed we were using slow disks. Switching to SSD improved I/O, but had only a limited impact on overall performance.

The tests I performed focused on the Android x86 emulator.
Flags: needinfo?(wcosta)
Blocks: 1498298
No longer blocks: 1498298
Blocks: 1425322
We will keep using packet.net for Android x86 emulator tests. gcp is an option, but doesn't provide the same performance.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME