Bug 1482344 (Closed) — Opened 6 years ago, Closed 6 years ago

raptor fails to run fetch benchmarks after moving to hardware

Categories

Product: Testing :: Raptor
Type: enhancement
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla63
Tracking Status: firefox63 --- fixed

People

(Reporter: jmaher, Assigned: ahal)

References

Details

Attachments

(2 obsolete files)

The raptor benchmarks unity3d and wasm-misc work fine when they are run on virtual machines, but they fail when run on physical hardware. Looking at logs before/after, we used to fetch the task artifact and place the data in the benchmarks directory; now only benchmarks from third_party/webkit/PerformanceTests/ appear in the benchmarks directory while running tests.
On a virtual machine, I see this in the log:

[taskcluster 2018-08-09 13:48:23.520Z] === Task Starting ===
[setup 2018-08-09T13:48:23.986Z] run-task started in /builds/worker
[cache 2018-08-09T13:48:23.989Z] cache /builds/worker/checkouts exists; requirements: gid=1000 uid=1000 version=1
[cache 2018-08-09T13:48:23.989Z] cache /builds/worker/workspace exists; requirements: gid=1000 uid=1000 version=1
[volume 2018-08-09T13:48:23.990Z] changing ownership of volume /builds/worker/.cache to 1000:1000
[volume 2018-08-09T13:48:23.990Z] volume /builds/worker/checkouts is a cache
[volume 2018-08-09T13:48:23.990Z] changing ownership of volume /builds/worker/tooltool-cache to 1000:1000
[volume 2018-08-09T13:48:23.990Z] volume /builds/worker/workspace is a cache
[setup 2018-08-09T13:48:23.991Z] running as worker:worker
[fetches 2018-08-09T13:48:23.991Z] fetching artifacts
Downloading https://queue.taskcluster.net/v1/task/XGuKvVIKTqi2FDJc_lWG-w/artifacts/public/wasm-misc.zip to /builds/worker/fetches/wasm-misc.zip.tmp
Downloading https://queue.taskcluster.net/v1/task/XGuKvVIKTqi2FDJc_lWG-w/artifacts/public/wasm-misc.zip
https://queue.taskcluster.net/v1/task/XGuKvVIKTqi2FDJc_lWG-w/artifacts/public/wasm-misc.zip resolved to 4433793 bytes with sha256 0ba273b748b872117a4b230c776bbd73550398da164025a735c28a16c0224397 in 0.619s
Renaming to /builds/worker/fetches/wasm-misc.zip
Extracting /builds/worker/fetches/wasm-misc.zip to /builds/worker/fetches using ['unzip', '/builds/worker/fetches/wasm-misc.zip']
Archive: /builds/worker/fetches/wasm-misc.zip
  creating: wasm-misc/
...
/builds/worker/fetches/wasm-misc.zip extracted in 0.136s
Removing /builds/worker/fetches/wasm-misc.zip
[fetches 2018-08-09T13:48:24.867Z] finished fetching artifacts
[task 2018-08-09T13:48:24.867Z] executing ['/builds/worker/bin/test-linux.sh', '--installer-url=https://queue.taskcluster.net/v1/task/cMNgzfDCRJSd6A9blGVoBw/artifacts/public/build/target.tar.bz2', '--test-packages-url=https://queue.taskcluster.net/v1/task/cMNgzfDCRJSd6A9blGVoBw/artifacts/public/build/target.test_packages.json', '--test=raptor-wasm-misc', '--branch-name', 'try', '--download-symbols=ondemand']

On hardware we don't run test-linux.sh; is it possible that we have different features in docker-worker vs the <whatever>-worker that we are using on hardware?
Flags: needinfo?(wcosta)
Flags: needinfo?(ahal)
I see :ahal recently added fetch_artifacts support in run-task: https://searchfox.org/mozilla-central/source/taskcluster/scripts/run-task#742 It looks as if this is supported in both docker-worker and native-engine, but I found that native-engine (i.e. hardware) doesn't have MOZ_FETCHES defined in its environment variables.
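For context, here is a rough sketch of what that fetches step appears to do, based only on the log in comment 1 and the MOZ_FETCHES/MOZ_FETCHES_DIR variable names; the exact JSON schema and directory handling are assumptions, not the actual run-task code:

    # Hypothetical sketch of run-task's fetches step; the MOZ_FETCHES schema is assumed.
    import json
    import os
    import urllib.request

    def download_fetches():
        # Assumed: MOZ_FETCHES is a JSON list of {"task": ..., "artifact": ...} entries
        # and MOZ_FETCHES_DIR is where they should land.
        fetches = json.loads(os.environ.get("MOZ_FETCHES", "[]"))
        fetches_dir = os.environ.get("MOZ_FETCHES_DIR", os.getcwd())
        os.makedirs(fetches_dir, exist_ok=True)

        for fetch in fetches:
            # Artifacts are served from the Taskcluster queue, as seen in the log above.
            url = "https://queue.taskcluster.net/v1/task/{task}/artifacts/{artifact}".format(**fetch)
            dest = os.path.join(fetches_dir, os.path.basename(fetch["artifact"]))
            urllib.request.urlretrieve(url, dest)

If MOZ_FETCHES never makes it into the native-engine task's environment, none of this runs, which would explain the empty benchmarks directory.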
It appears that using edit+retrigger to add the MOZ_FETCH* env vars doesn't solve the problem.
Blocks: 1482151
(In reply to Joel Maher ( :jmaher ) (UTC+2) from comment #1)
> on hardware we don't run test-linux.sh, is it possible that we have
> different features in docker-worker vs <whatever>-worker that we are using
> on hardware?

It doesn't seem to be related to worker setup. Do you have a link to the failing task?
Flags: needinfo?(wcosta)
I noticed that on packet, it searches for the home directory at /home/cltbld; shouldn't it be /builds/worker?
Flags: needinfo?(jmaher)
I think it's the other way around: those native-engine workers run from /home/cltbld. Joel, I think you need to add the 'workdir' key to raptor.yml, similar to what I needed for the jsshell-bench tasks: https://searchfox.org/mozilla-central/source/taskcluster/ci/source-test/jsshell.yml#19

Note those jsshell tasks are currently the only things using both run-task and a native-engine worker, so there are still edge cases that haven't been smoothed over.
Flags: needinfo?(ahal)
Oh, but because raptor.yml is a "test" kind (and not a "source-test" kind like jsshell), you'll need to figure out how to propagate this value from raptor.yml up to here: https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/job/__init__.py#199

There may very well be other problems too; these are the first "test" tasks to use native-engine + fetches.
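For illustration, that propagation step might look roughly like the sketch below; the transform name and the default value are assumptions, not the actual code at the linked line:

    # Hypothetical sketch; key names follow taskgraph conventions but are assumptions.
    def propagate_workdir(config, jobs):
        for job in jobs:
            run = job.setdefault("run", {})
            # Let a task definition (e.g. raptor.yml) opt into the native-engine
            # layout (/home/cltbld); everything else keeps the docker-worker default.
            run.setdefault("workdir", "/builds/worker")
            yield job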
(In reply to Andrew Halberstadt [:ahal] from comment #7)
> I think it's the other way around, those native-engine workers run from
> /home/cltbld.

Oops, my bad, I am so biased toward packet.net that I assumed the task was running there.
Flags: needinfo?(jmaher)
Blocks: 1473365
This is happening because the 'native-engine' implementation in mozharness_test.py is overwriting the worker's env instead of updating it: https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/job/mozharness_test.py#340

Though the workdir also needed to be set, as per comment 7.
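The difference is plain dict semantics; a minimal illustration (the variable and key names below are made up for the example, not the actual ones in mozharness_test.py):

    # Illustrative only; names are not the actual ones in mozharness_test.py.
    env_from_worker = {"MOZ_FETCHES": '[{"task": "<task-id>", "artifact": "public/wasm-misc.zip"}]'}
    env_from_transform = {"MOZHARNESS_SCRIPT": "raptor_script.py"}

    # Buggy pattern: overwriting drops anything the worker already defined (e.g. MOZ_FETCHES).
    worker = {"env": dict(env_from_worker)}
    worker["env"] = env_from_transform

    # Fixed pattern: merging keeps the existing vars and adds the new ones.
    worker = {"env": dict(env_from_worker)}
    worker["env"].update(env_from_transform)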
Assignee: nobody → ahal
Status: NEW → ASSIGNED
Turns out that still wasn't enough, because the native-engine workers don't use `run-task` (mozharness_test.py could use some TLC), which means MOZ_FETCHES aren't downloaded automatically. There are two options:

A) Try to mount the run-task and fetch-content scripts on these workers and modify mozharness_test.py to always use run-task.
B) Download the fetches in mozharness (there is precedent here from the code-coverage tasks).

Option A is more aligned with the future we want to see, so I'll give that a brief shot. If I can't get it to work for any reason, I'll fall back to option B.
We need to grab fetches from several places in mozharness, so this creates a dedicated mixin that can be used from anywhere. If the 'fetch-content' script is detected it will be used; otherwise we download the fetches manually.
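A skeleton of what such a mixin might look like; the class name, method names, MOZ_FETCHES schema, and the fetch-content path are illustrative assumptions rather than the actual patch:

    # Hypothetical FetchesMixin sketch for mozharness scripts; all names are illustrative.
    import json
    import os


    class FetchesMixin(object):
        # Assumed to be mixed into a mozharness script that provides
        # run_command() and download_file() (e.g. via ScriptMixin).
        def fetch_content(self):
            fetches = json.loads(os.environ.get("MOZ_FETCHES", "[]"))
            if not fetches:
                return

            # Prefer the shared fetch-content script when the checkout provides it
            # (path assumed for the sketch).
            fetch_script = os.path.join(
                os.environ.get("GECKO_PATH", ""), "taskcluster", "scripts", "misc", "fetch-content")
            if os.path.isfile(fetch_script):
                self.run_command([fetch_script, "task-artifacts"], halt_on_failure=True)
            else:
                # Fall back to downloading each artifact directly from the queue.
                for fetch in fetches:
                    url = "https://queue.taskcluster.net/v1/task/{task}/artifacts/{artifact}".format(**fetch)
                    self.download_file(url)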
This unbreaks some tier 3 raptor tasks. There are a few fixes rolled together here:

1) Stop overwriting the 'env' in mozharness_test.py's 'native-engine' implementation.
2) Set the workdir to /home/cltbld (which makes sure the fetches are downloaded there).
3) Download the fetches via mozharness in the 'raptor' script (since these tasks don't use run-task anymore).

Depends on D3651
Comment on attachment 9002065 [details] Bug 1482344 - [raptor] Fix fetch tasks for native-engine mozharness_test based tasks, r=jmaher Joel Maher ( :jmaher ) (UTC+2) has approved the revision.
Attachment #9002065 - Flags: review+
Comment on attachment 9002064 [details] Bug 1482344 - [mozharness] Refactor codecoverage fetch downloading into a standalone mixin, r=marco Tudor-Gabriel Vijiala [:tvijiala] has approved the revision.
Attachment #9002064 - Flags: review+
Pushed by ahalberstadt@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/95e338482796 [mozharness] Refactor codecoverage fetch downloading into a standalone mixin, r=tvijiala
Pushed by ahalberstadt@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/aa6f46eaec1b [raptor] Fix fetch tasks for native-engine mozharness_test based tasks, r=jmaher
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla63

Comment on attachment 9002064 [details]
Bug 1482344 - [mozharness] Refactor codecoverage fetch downloading into a standalone mixin, r=marco

Revision D3651 was moved to bug 1607000. Setting attachment 9002064 [details] to obsolete.

Attachment #9002064 - Attachment is obsolete: true

Comment on attachment 9002065 [details]
Bug 1482344 - [raptor] Fix fetch tasks for native-engine mozharness_test based tasks, r=jmaher

Revision D3652 was moved to bug 1607000. Setting attachment 9002065 [details] to obsolete.

Attachment #9002065 - Attachment is obsolete: true
