Closed Bug 1560755 Opened 5 years ago Closed 5 years ago

Perma Android INFO - error: /builds/worker/fetches/default.profraw: Invalid instrumentation profile data (file header is corrupt) before Subprocess.CalledProcessError when Gecko 69 merges to Beta on 2019-07-01

Categories

(Firefox Build System :: Bootstrap Configuration, defect)

Unspecified
Android
defect
Not set
normal

Tracking

(firefox-esr60 unaffected, firefox-esr6869+ fixed, firefox67 unaffected, firefox67.0.1 unaffected, firefox68+ wontfix, firefox69+ fixed)

RESOLVED FIXED
mozilla69
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 69+ fixed
firefox67 --- unaffected
firefox67.0.1 --- unaffected
firefox68 + wontfix
firefox69 + fixed

People

(Reporter: apavel, Assigned: mshal)

References

Details

(Keywords: regression, Whiteboard: [fennec68.1])

Attachments

(1 file)

Central as Beta simulation

Treeherder link: https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=252950716&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel%2Crunnable&classifiedState=classified&revision=e6bfa9d819178c52e6b26964c39ab01c8e13e7ab

Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=252950716&repo=try&lineNumber=1626

[task 2019-06-22T12:32:24.872Z] 12:32:24 INFO - /builds/worker/workspace/build/src/obj-firefox/_virtualenvs/init/bin/python -m mozbuild.action.process_install_manifest --track install_dist_include.track dist/include _build_manifests/install/dist_include
[task 2019-06-22T12:32:24.872Z] 12:32:24 INFO - Elapsed: 0.22s; From dist/include: Kept 1594 existing; Added/updated 4106; Removed 0 files and 0 directories.
[task 2019-06-22T12:32:24.873Z] 12:32:24 INFO - make[3]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox'
[task 2019-06-22T12:32:24.873Z] 12:32:24 INFO - make[2]: Entering directory '/builds/worker/workspace/build/src/obj-firefox'
[task 2019-06-22T12:32:24.873Z] 12:32:24 INFO - make[2]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox'
[task 2019-06-22T12:32:24.948Z] 12:32:24 INFO - make[2]: Entering directory '/builds/worker/workspace/build/src/obj-firefox'
[task 2019-06-22T12:32:24.948Z] 12:32:24 INFO - ./merged.profdata.stub.stub
[task 2019-06-22T12:32:24.948Z] 12:32:24 INFO - /builds/worker/workspace/build/src/obj-firefox/_virtualenvs/init/bin/python -m mozbuild.action.file_generate /builds/worker/workspace/build/src/build/merge_profdata.py main merged.profdata.stub .deps/merged.profdata.stub.pp .deps/merged.profdata.stub.stub /builds/worker/fetches
[task 2019-06-22T12:32:24.949Z] 12:32:24 INFO - error: /builds/worker/fetches/default.profraw: Invalid instrumentation profile data (file header is corrupt)
[task 2019-06-22T12:32:24.949Z] 12:32:24 ERROR - Traceback (most recent call last):
[task 2019-06-22T12:32:24.949Z] 12:32:24 INFO - File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
[task 2019-06-22T12:32:24.949Z] 12:32:24 INFO - "main", fname, loader, pkg_name)
[task 2019-06-22T12:32:24.949Z] 12:32:24 INFO - File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
[task 2019-06-22T12:32:24.949Z] 12:32:24 INFO - exec code in run_globals
[task 2019-06-22T12:32:24.949Z] 12:32:24 INFO - File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/action/file_generate.py", line 121, in <module>
[task 2019-06-22T12:32:24.949Z] 12:32:24 INFO - sys.exit(main(sys.argv[1:]))
[task 2019-06-22T12:32:24.949Z] 12:32:24 INFO - File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/action/file_generate.py", line 71, in main
[task 2019-06-22T12:32:24.949Z] 12:32:24 INFO - ret = module.dict[method](output, *args.additional_arguments, **kwargs)
[task 2019-06-22T12:32:24.949Z] 12:32:24 INFO - File "/builds/worker/workspace/build/src/build/merge_profdata.py", line 18, in main
[task 2019-06-22T12:32:24.949Z] 12:32:24 INFO - '-o', 'merged.profdata'] + profraw_files)
[task 2019-06-22T12:32:24.949Z] 12:32:24 INFO - File "/usr/lib/python2.7/subprocess.py", line 186, in check_call
[task 2019-06-22T12:32:24.950Z] 12:32:24 INFO - raise CalledProcessError(retcode, cmd)
[task 2019-06-22T12:32:24.950Z] 12:32:24 INFO - subprocess.CalledProcessError: Command '[u'/builds/worker/workspace/build/src/clang/bin/llvm-profdata', 'merge', '-o', 'merged.profdata', '/builds/worker/fetches/default.profraw']' returned non-zero exit status 1
[task 2019-06-22T12:32:24.950Z] 12:32:24 INFO - backend.mk:43: recipe for target '.deps/merged.profdata.stub.stub' failed
[task 2019-06-22T12:32:24.950Z] 12:32:24 ERROR - make[2]: *** [.deps/merged.profdata.stub.stub] Error 1
[task 2019-06-22T12:32:24.950Z] 12:32:24 INFO - make[2]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox'
[task 2019-06-22T12:32:24.950Z] 12:32:24 INFO - make[2]: *** Waiting for unfinished jobs....
[task 2019-06-22T12:32:24.952Z] 12:32:24 INFO - make[2]: Entering directory '/builds/worker/workspace/build/src/obj-firefox'
[task 2019-06-22T12:32:24.952Z] 12:32:24 INFO - ./source-repo.h.stub
[task 2019-06-22T12:32:24.953Z] 12:32:24 INFO - /builds/worker/workspace/build/src/obj-firefox/_virtualenvs/init/bin/python -m mozbuild.action.file_generate /builds/worker/workspace/build/src/build/variables.py source_repo_header source-repo.h .deps/source-repo.h.pp .deps/source-repo.h.stub
[task 2019-06-22T12:32:24.953Z] 12:32:24 INFO - make[2]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox'
[task 2019-06-22T12:32:24.993Z] 12:32:24 INFO - make[2]: Entering directory '/builds/worker/workspace/build/src/obj-firefox'
[task 2019-06-22T12:32:24.993Z] 12:32:24 INFO - ./buildid.h.stub
[task 2019-06-22T12:32:24.993Z] 12:32:24 INFO - /builds/worker/workspace/build/src/obj-firefox/_virtualenvs/init/bin/python -m mozbuild.action.file_generate /builds/worker/workspace/build/src/build/variables.py buildid_header buildid.h .deps/buildid.h.pp .deps/buildid.h.stub
[task 2019-06-22T12:32:24.993Z] 12:32:24 INFO - make[2]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox'
[task 2019-06-22T12:32:24.993Z] 12:32:24 INFO - make[2]: Entering directory '/builds/worker/workspace/build/src/obj-firefox'
[task 2019-06-22T12:32:24.993Z] 12:32:24 INFO - ./mozilla-config.h.stub
[task 2019-06-22T12:32:24.993Z] 12:32:24 INFO - /builds/worker/workspace/build/src/obj-firefox/_virtualenvs/init/bin/python -m mozbuild.action.file_generate /builds/worker/workspace/build/src/python/mozbuild/mozbuild/action/process_define_files.py process_define_file mozilla-config.h .deps/mozilla-config.h.pp .deps/mozilla-config.h.stub /builds/worker/workspace/build/src/mozilla-config.h.in
[task 2019-06-22T12:32:24.993Z] 12:32:24 INFO - make[2]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox'
[task 2019-06-22T12:32:24.994Z] 12:32:24 INFO - /builds/worker/workspace/build/src/config/rules.mk:400: recipe for target 'default' failed
[task 2019-06-22T12:32:24.994Z] 12:32:24 ERROR - make[1]: *** [default] Error 2
[task 2019-06-22T12:32:24.994Z] 12:32:24 INFO - client.mk:125: recipe for target 'build' failed
[task 2019-06-22T12:32:24.994Z] 12:32:24 INFO - make: *** [build] Error 2

Jon can this be from Bug 1395509 ?

Summary: Perma Android INFO - error: /builds/worker/fetches/default.profraw: Invalid instrumentation profile data (file header is corrupt) before ubprocess.CalledProcessError when Gecko 69 merges to Beta on 2019-07-01 → Perma Android INFO - error: /builds/worker/fetches/default.profraw: Invalid instrumentation profile data (file header is corrupt) before Subprocess.CalledProcessError when Gecko 69 merges to Beta on 2019-07-01

(In reply to Andreea Pavel [:apavel] from comment #1)

Jon can this be from Bug 1395509 ?

That seems unlikely. ni? Jan in case this is to do with bug 1551499.

Flags: needinfo?(jcoppeard) → needinfo?(jdemooij)

This is indeed caused by Bug 1551499 as far as I can see from the results of the latest Central as Beta simulation from today: https://treeherder.mozilla.org/#/jobs?repo=try&revision=bc0bd4eb24acafce463813a36ce3d5213df0ff91&searchStr=%28n%29%2Candroid%2C

Looking into this now..

Bugbug thinks this bug is a regression, but please revert this change in case of error.

Keywords: regression

Did we see this once, on the Try push in comment 0? It seems to be an intermittent issue that affects certain revisions intermittently, see also bug 1543326.

I just did a beta simulation build on top of inbound tip and it's green: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=093979c5bbf2322a1e7bf41e986ebfc322a5548e

Flags: needinfo?(jdemooij)

Also, I think there are multiple reds in that Try push in comment 0 because they all depend on the same Bpgo(instr) and Bpgo(run) jobs. If these jobs upload corrupt profiling data, then all PGO/Nightly builds will fail when retriggered.

mshal, do you know if we have plans to address the issue in comment 8 somehow? (Ryan forwarded me to you..)

Flags: needinfo?(mshal)

(In reply to Jan de Mooij [:jandem] from comment #9)

mshal, do you know if we have plans to address the issue in comment 8 somehow? (Ryan forwarded me to you..)

I'll try to look into moving the profdata merge into the run task, so that the run task can't upload corrupted data. But yeah, I don't believe this is an issue with your patch, it's just that the profile writing in Android in the "run" task is not robust enough, so if it uploads corrupted data then all downstream tasks will fail.

:tomprince, is there a workaround to be able to retrigger the run task and have downstream build tasks use the profile data from that run task? Retriggering the builds doesn't work in this case since they re-use the bad data from the same run task.

Flags: needinfo?(mshal) → needinfo?(mozilla)
Assignee: nobody → mshal

Android profile runs don't always fully write out the profile data. In
this case, the corrupted profile data is successfully uploaded, but
future profile-use PGO builds try to merge the data and fail. Retrying
the profile-use builds doesn't help, since they all pull from the same
job that published the corrupt data.

We can detect this in the run task by using llvm-profdata merge, and if
the merge fails the task can automatically be retried. Note that the
data gets redundantly merged in the profile-use build, but it may not be
possible to run the merge in the run task on all platforms (eg: OSX), so
we have to keep the merge there as well.

This seems like something which could bite us on 68 as well. Tracking for both releases.

Pushed by mshal@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/5a867f13f12e Merge Android PGO profile data in run task; r=firefox-build-system-reviewers,chmanchester
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla69

(In reply to Michael Shal [:mshal] from comment #10)

:tomprince, is there a workaround to be able to retrigger the run task and have downstream build tasks use the profile data from that run task? Retriggering the builds doesn't work in this case since they re-use the bad data from the same run task.

Nothing straightforward, but if you re-run the generate-profile task, and then rerun the downstream tasks once the first one is finished, the downstream tasks should work.

Flags: needinfo?(mozilla)

Is this something we should consider uplifting to ESR68 since that's where we're shipping Fennec from now? If so, please request approval.

Comment on attachment 9074702 [details]
Bug 1560755 - Merge Android PGO profile data in run task; r?#firefox-build-system-reviewers

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: If an android "run" task fails to generate a proper PGO profile, the downstream Android build will fail to build, along with any attempts to retrigger the build task. It is not immediately obvious that the "run" task needs to be retriggered, resulting in lost sheriff and dev time investigating the task failures. This patch is intended to mitigate the issue by automatically retrying the "run" tasks that fail to generate a valid profile.
  • User impact if declined: No impact to end users, only impact to sheriffs and our developers. Without this patch, a release for Android may be delayed due to the extra round trip involved with sheriff interaction.
  • Fix Landed on Version:
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The change only affects the Android PGO profile generation stage of the build, and is currently running in mozilla-central.
  • String or UUID changes made by this patch:
Flags: needinfo?(mshal)
Attachment #9074702 - Flags: approval-mozilla-esr68?

Comment on attachment 9074702 [details]
Bug 1560755 - Merge Android PGO profile data in run task; r?#firefox-build-system-reviewers

Reliability improvement for Fennec PGO runs. Approved for 68.1esr.

Attachment #9074702 - Flags: approval-mozilla-esr68? → approval-mozilla-esr68+

Looks like this needs a rebased patch for ESR68 (or has another dep needing uplift).

Flags: needinfo?(mshal)

Looks like the spacing got messed up somehow in the yml file when importing the patch - "toolchains" should be at the same level as "run", not inside "run". This patch on top of the commit in esr68 works (or at least, the decision task completes) -

diff --git a/taskcluster/ci/generate-profile/kind.yml b/taskcluster/ci/generate-profile/kind.yml
index 4e9c296bb2f68..dbcd7826ed82d 100644
--- a/taskcluster/ci/generate-profile/kind.yml
+++ b/taskcluster/ci/generate-profile/kind.yml
@@ -91,5 +91,5 @@ jobs:
                 - android/android_common.py
                 - android/androidarm_4_3.py
                 - android/android_pgo.py
-            toolchains:
-                - linux64-clang
+        toolchains:
+            - linux64-clang
Flags: needinfo?(mshal) → needinfo?(ryanvm)
Flags: needinfo?(ryanvm)
Whiteboard: [fennec68.1]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: