Closed Bug 1220658 Opened 9 years ago Closed 9 years ago

Upgrade ec2 test instances mesa versions to mesa-lts-saucy-9.2.1

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jrmuizel, Unassigned)

References

(Depends on 1 open bug)

Details

Attachments

(7 files, 3 obsolete files)

Attached patch Fix llvm pipe for 9.2.1 (from bug 975034) (obsolete) (deleted) — Splinter Review
There's a bug in mesa 8.0.4 that prevents us from updating the WebGL conformance tests. This bug has been fixed in mesa-lts-saucy-9.2.1, which I believe is the current version in 12.04. I've ported the patch from bug 975034 to 9.2.1.
Blocks: 1193526
Rail, what's needed to make this happen?
Flags: needinfo?(rail)
Attachment #8681965 - Attachment is patch: true
Attachment #8681965 - Attachment mime type: text/x-patch → text/plain
Flags: needinfo?(rail)
We use mesa 8.0.4 (see http://hg.mozilla.org/build/puppet/file/tip/modules/packages/manifests/mesa.pp) with a patch from you ;) see http://hg.mozilla.org/build/puppet/file/tip/modules/packages/manifests/mesa-debian/patches/moz-fix-llvmpipe

To upgrade to mesa-lts-saucy-9.2.1 (which can't be found at http://packages.ubuntu.com/search?suite=precise&section=all&arch=any&keywords=mesa-lts-saucy&searchon=names, but is available in some PPAs), I'd clear it with the following first:

1) jmaher may want to know about the upgrade, to watch how it affects other tests
2) ahal has been working on porting our tests to docker/taskcluster. Maybe we are close enough that we can wait and test the changes with a try push using a different docker image

It will also require some releng time to prep the packages.
oh fun stuff! Adding Armen as he will be helping :ahal with the porting to taskcluster. I assume we will need a way to test this on try prior to deployment. We still have about 60 unique failures in taskcluster land, but this is using a custom docker image already to install compiz and some fonts. Is there a way to upgrade this dynamically in the job for buildbot (or current automation)? There is a long, roundabout way for taskcluster, but it takes a lot of patience. I would be curious to know what other tests have problems with this.
(In reply to Joel Maher (:jmaher) from comment #3)
> Is there a way to upgrade this dynamically in the job for buildbot (or
> current automation)?

Not for tests. :(
I thought it would be faster if I built the packages with the patch attached so that someone can test installing them.

* The packages are deployed in a separate repo at https://releng-puppet2.srv.releng.scl3.mozilla.com/repos/apt/custom/mesa-lts-saucy (a hand-run install sketch follows below this list)
* I have an untested puppet patch which should do the trick: https://gist.github.com/rail/8872d93c9bec47bb8ba7
* The packages cannot be installed side by side with the old ones (they are marked as conflicting).
* libglu1-mesa has been removed from the list. The packaging changelog says that "it was split upstream". We may need to tweak the list of packages if it fails to install.
* The best way to test this would be puppetizing a loaner instance against a user repo with the instance pinned to the user environment, see https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/HowTo/Set_up_a_user_environment#Pinning. This way you reduce the chance of fighting conflicting packages from the previous version.
* The changes won't be applied to talos machines; we don't install these packages there.
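For illustration only, a minimal hand-run sketch of pulling these packages onto a loaner from the custom repo above. It assumes a flat apt repo layout, that apt can reach the https URL (precise may need apt-transport-https), and uses -lts-saucy package names taken from the puppet patch discussed later in this bug; the supported route is still the pinned puppet user environment described above.

  # Hand-run sketch only; assumes a flat apt repo layout at the custom repo URL.
  echo 'deb https://releng-puppet2.srv.releng.scl3.mozilla.com/repos/apt/custom/mesa-lts-saucy ./' \
    | sudo tee /etc/apt/sources.list.d/mesa-lts-saucy.list
  sudo apt-get install -y apt-transport-https   # https sources need this on precise
  sudo apt-get update
  # Dry run first to see which of the old, conflicting mesa packages apt wants to remove:
  sudo apt-get install --simulate libgl1-mesa-dri-lts-saucy libgl1-mesa-glx-lts-saucy libglapi-mesa-lts-saucy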
I'm creating a loaner instance (tst-linux64-ec2-coop) to test these packages right now.
Not working out of the box. Here's the list of packages I had to purge to get to a "clean slate": http://people.mozilla.org/~coop/bug1220658/apt-get-purge-mesa.log

Here's the syslog output of the subsequent puppet install attempt with Rail's patch applied: http://people.mozilla.org/~coop/bug1220658/syslog
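For reference, a rough way to enumerate and purge whatever mesa packages are currently installed; the exact list coop purged is in the apt-get-purge-mesa.log linked above, and this sketch only reconstructs the general approach:

  # List installed mesa packages, then purge them to get back to a clean slate:
  dpkg -l | awk '$1 == "ii" && $2 ~ /mesa/ {print $2}'
  dpkg -l | awk '$1 == "ii" && $2 ~ /mesa/ {print $2}' | xargs sudo apt-get purge -y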
I looked at that instance and the logs. It sounds like mesa-lts-saucy-9.2.1 requires some other packages to be backported. It won't be trivial :/
Depends on: 1225596
Rail manually upgraded the packages on the test node in advance of bug 1225596. I'm currently running some unittests to see how many failures pop up.
I've run some tests and so far they're green modulo a failure to pull gaia for luciddream, which is expected due to last night's b2g/vcssync bustage. Still a long way to go here though:

* make sure that tests on release branches (aurora/beta/release/esr) still work
* find the package delta between the current tester image and this trial node so we can mirror only the subset of packages we need, assuming bug 1225596 doesn't pan out (based on https://bugzilla.mozilla.org/show_bug.cgi?id=1225596#c1)
* perform above process with a tst-linux32 slave
* perform above process with a tst-emulator64 slave
(In reply to Rail Aliiev [:rail], on PTO Nov 21 - Mozlandia from comment #4)
> (In reply to Joel Maher (:jmaher) from comment #3)
> > Is there a way to upgrade this dynamically in the job for buildbot (or
> > current automation)?
>
> Not for tests. :(

Yes, this is possible with TaskCluster. This will need to be ported to the ubuntu1204-test docker image, too, as that's what TC is using.

> root@taskcluster-worker:~# dpkg -l | grep mesa
> ii  libgl1-mesa-dri        8.0.4-0ubuntu0.7  free implementation of the OpenGL API -- DRI modules
> ii  libgl1-mesa-dri:i386   8.0.4-0ubuntu0.7  free implementation of the OpenGL API -- DRI modules
> ii  libgl1-mesa-glx        8.0.4-0ubuntu0.7  free implementation of the OpenGL API -- GLX runtime
> ii  libgl1-mesa-glx:i386   8.0.4-0ubuntu0.7  free implementation of the OpenGL API -- GLX runtime
> ii  libglapi-mesa          8.0.4-0ubuntu0.7  free implementation of the GL API -- shared library
> ii  libglapi-mesa:i386     8.0.4-0ubuntu0.7  free implementation of the GL API -- shared library
> ii  libglu1-mesa           8.0.4-0ubuntu0.7  Mesa OpenGL utility library (GLU)
> ii  libglu1-mesa:i386      8.0.4-0ubuntu0.7  Mesa OpenGL utility library (GLU)
> ii  mesa-common-dev        8.0.4-0ubuntu0.7  Developer documentation for Mesa

TC image builds should not refer to the puppet repositories (since they don't use puppet!). The approach we took for xcb was to create an Ubuntu repository, put it in a tarball, and put it on tooltool. Then we download and unpack that and add its path to sources.list temporarily. See https://dxr.mozilla.org/mozilla-central/source/testing/docker/ubuntu1204-test/system-setup.sh#181
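A placeholder sketch of that xcb-style flow, with hypothetical manifest/tarball names (mesa-repo.tt, mesa-repo.tar.gz) standing in for whatever would actually be published; see the linked system-setup.sh for the real xcb example:

  # Hypothetical names throughout; the real recipe lives in system-setup.sh.
  python tooltool.py fetch -m mesa-repo.tt                 # fetch the pre-built apt repo tarball
  tar -xzf mesa-repo.tar.gz -C /tmp                        # unpack it to /tmp/mesa-repo
  echo 'deb file:///tmp/mesa-repo ./' > /etc/apt/sources.list.d/mesa-tmp.list
  apt-get update
  apt-get install -y libgl1-mesa-dri-lts-saucy libgl1-mesa-glx-lts-saucy libglapi-mesa-lts-saucy
  rm /etc/apt/sources.list.d/mesa-tmp.list                 # drop the temporary source again
  apt-get update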
What timeline should we expect for this bug? I want to know so I can determine what to do with mochitest-gl on TaskCluster (since we have that bug fixed there). We have some tests that are now unexpectedly passing. If this gets fixed, we change the state of the tests. If it is going to take long, we will skip the tests until this gets fixed.
:coop- trying to make some decisions on taskcluster- we are using the new mesa library already and either need to wait for this if it is soon, or disable some tests until this is resolved. Can you help us figure out a timeline- I am not sure if Rail is the one who would be doing this work.
Flags: needinfo?(coop)
Here are the failures I saw in staging. Sounds like these correspond to what Armen is seeing:

Ubuntu VM 12.04 x64 mozilla-central opt test web-platform-tests-4
08:50:18 INFO - TEST-UNEXPECTED-PASS | /webgl/bufferSubData.html | bufferSubData - expected FAIL
08:50:18 INFO - TEST-INFO | expected FAIL
08:50:25 INFO - TEST-UNEXPECTED-PASS | /webgl/compressedTexImage2D.html | compressedTexImage2D - expected FAIL
08:50:25 INFO - TEST-INFO | expected FAIL
08:50:32 INFO - TEST-UNEXPECTED-PASS | /webgl/compressedTexSubImage2D.html | compressedTexSubImage2D - expected FAIL
08:50:32 INFO - TEST-INFO | expected FAIL
08:50:39 INFO - TEST-UNEXPECTED-PASS | /webgl/texImage2D.html | texImage2D - expected FAIL
08:50:39 INFO - TEST-INFO | expected FAIL
08:50:46 INFO - TEST-UNEXPECTED-PASS | /webgl/texSubImage2D.html | texSubImage2D - expected FAIL
08:50:46 INFO - TEST-INFO | expected FAIL
08:50:52 INFO - TEST-UNEXPECTED-PASS | /webgl/uniformMatrixNfv.html | Should not throw for 2 - expected FAIL
08:50:52 INFO - TEST-INFO | expected FAIL
08:50:52 INFO - TEST-UNEXPECTED-PASS | /webgl/uniformMatrixNfv.html | Should not throw for 3 - expected FAIL
08:50:52 INFO - TEST-INFO | expected FAIL
08:50:52 INFO - TEST-UNEXPECTED-PASS | /webgl/uniformMatrixNfv.html | Should not throw for 4 - expected FAIL
08:50:52 INFO - TEST-INFO | expected FAIL

Ubuntu VM 12.04 x64 mozilla-central opt test mochitest-e10s-browser-chrome-7
12:12:57 INFO - 514 INFO TEST-UNEXPECTED-FAIL | browser/base/content/test/general/browser_save_video.js | uncaught exception - TypeError: gContextMenu is null at chrome://browser/content/browser.xul:1
12:12:58 INFO - Stack trace:
12:12:58 INFO - chrome://mochikit/content/tests/SimpleTest/SimpleTest.js:simpletestOnerror:1519
12:12:58 INFO - chrome://mochitests/content/browser/browser/base/content/test/general/browser_save_video.js:null:62
12:12:58 INFO - Tester_execTest@chrome://mochikit/content/browser-test.js:757:9
12:12:58 INFO - Tester.prototype.nextTest</<@chrome://mochikit/content/browser-test.js:677:7
12:12:58 INFO - SimpleTest.waitForFocus/waitForFocusInner/focusedOrLoaded/<@chrome://mochikit/content/tests/SimpleTest/SimpleTest.js:735:59
12:12:58 INFO - JavaScript error: chrome://browser/content/browser.xul, line 1: TypeError: gContextMenu is null
12:13:40 INFO - 517 INFO TEST-UNEXPECTED-FAIL | browser/base/content/test/general/browser_save_video.js | Test timed out -
12:13:40 INFO - MEMORY STAT | vsize 1490MB | residentFast 298MB | heapAllocated 126MB

Ubuntu VM 12.04 x64 mozilla-central opt test web-platform-tests-e10s-4
12:34:57 INFO - TEST-UNEXPECTED-PASS | /webgl/bufferSubData.html | bufferSubData - expected FAIL
12:34:57 INFO - TEST-INFO | expected FAIL
12:35:06 INFO - TEST-UNEXPECTED-PASS | /webgl/compressedTexImage2D.html | compressedTexImage2D - expected FAIL
12:35:06 INFO - TEST-INFO | expected FAIL
12:35:15 INFO - TEST-UNEXPECTED-PASS | /webgl/compressedTexSubImage2D.html | compressedTexSubImage2D - expected FAIL
12:35:15 INFO - TEST-INFO | expected FAIL
12:35:24 INFO - TEST-UNEXPECTED-PASS | /webgl/texImage2D.html | texImage2D - expected FAIL
12:35:24 INFO - TEST-INFO | expected FAIL
12:35:33 INFO - TEST-UNEXPECTED-PASS | /webgl/texSubImage2D.html | texSubImage2D - expected FAIL
12:35:33 INFO - TEST-INFO | expected FAIL
12:35:42 INFO - TEST-UNEXPECTED-PASS | /webgl/uniformMatrixNfv.html | Should not throw for 2 - expected FAIL
12:35:42 INFO - TEST-INFO | expected FAIL
12:35:42 INFO - TEST-UNEXPECTED-PASS | /webgl/uniformMatrixNfv.html | Should not throw for 3 - expected FAIL
12:35:42 INFO - TEST-INFO | expected FAIL
12:35:42 INFO - TEST-UNEXPECTED-PASS | /webgl/uniformMatrixNfv.html | Should not throw for 4 - expected FAIL
12:35:42 INFO - TEST-INFO | expected FAIL
I don't see the mochitest-gl unexpected-pass tests though. Do we know if these wpt4 and bc7 failures are intermittent failures, or perma failures?
(In reply to Joel Maher (:jmaher) from comment #14)
> :coop- trying to make some decisions on taskcluster- we are using the new
> mesa library already and either need to wait for this if it is soon, or
> disable some tests until this is resolved. Can you help us figure out a
> timeline- I am not sure if Rail is the one who would be doing this work.

Since the upgrade seems to be moving the ball forward, i.e. making tests pass that were previously failing, we should proceed. I'll ask Rail to roll a similar puppet patch for linux32 and emulator64. I can handle deployment once we have the packages mirrored, etc.

Rail is on PTO until next week, and next week is Mozlando. Realistically, this isn't going to get fixed until late December. We should disable what we need to in the interim.
that makes sense! Thanks for the info, I am glad that it is pretty realistic to get this done by the end of the month barring any unforeseen problems!
(In reply to Joel Maher (:jmaher) from comment #16)
> I don't see the mochitest-gl unexpected-pass tests though. Do we know if
> these wpt4 and bc7 failures are intermittent failures, or perma failures?

I ran the tests against some other branches (e.g. mozilla-release), and the web platform "failures" seem to be legit. I only have one data point on bc7 now, so I'm re-running the test to get more data.
(In reply to Chris Cooper [:coop] from comment #19)
> I only have one data point on bc7 now, so I'm re-running the test to get
> more data.

The bc7 failure seems to be intermittent.
Flags: needinfo?(coop)
awesome to hear that. Higher confidence in fewer issues randomly cropping up.
The unexpected passes are gone. I see some new unexpected *failures*: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b06014e0ba00&filter-searchStr=gl The new failures could be related to changing the EC2 instance.
I assume they could also be related to the chunk-size changes? Perhaps some resource initialization ordering changed or is now split over multiple chunks?
Also:

23:02:41 INFO - 412 INFO TEST-FAIL | dom/canvas/test/webgl-conformance/_wrappers/test_conformance__limits__gl-min-textures.html | The author of the test has indicated that flaky timeouts are expected. Reason: untriaged

so maybe just disable that test?
This patch should make the test give us a bit more information about what's going wrong. Can you try again with it?
(In reply to Dustin J. Mitchell [:dustin] from comment #27)
> Also:
>
> 23:02:41 INFO - 412 INFO TEST-FAIL |
> dom/canvas/test/webgl-conformance/_wrappers/test_conformance__limits__gl-min-
> textures.html | The author of the test has indicated that flaky timeouts are
> expected. Reason: untriaged
>
> so maybe just disable that test?

All dom/canvas/webgl-conformance/* tests have this right now, so this is not grounds for disabling.
Comment on attachment 8695995 [details] [diff] [review]
temporarily skip 3 gl tests so we can pass on task cluster (1.0)

Review of attachment 8695995 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/canvas/test/_webgl-conformance.ini
@@ +777,5 @@
> skip-if = os == 'android'
> [webgl-conformance/_wrappers/test_conformance__textures__texture-size.html]
> skip-if = os == 'android'
> [webgl-conformance/_wrappers/test_conformance__textures__texture-size-cube-maps.html]
> +skip-if = (os == 'android') || (os == 'linux') # remove when bug 1220658 is resolved

Remove these comments. This is a generated file. Rather, just rerun the generator python script.
Attachment #8695995 - Flags: review?(jgilbert) → review-
(In reply to Dustin J. Mitchell [:dustin] from comment #31)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=461bd2e184a9

All of these seem orange...
Ugh, I dislike this new feature of putting try jobs on bugs -- that try push is unrelated to this bug.
Grabbing this to figure out what we should do here.
Assignee: nobody → rail
Attached patch gl_mesa921_temp_disable.patch (obsolete) (deleted) — Splinter Review
Attachment #8695995 - Attachment is obsolete: true
Attachment #8698592 - Flags: review?(jgilbert)
Comment on attachment 8698592 [details] [diff] [review]
gl_mesa921_temp_disable.patch

Review of attachment 8698592 [details] [diff] [review]:
-----------------------------------------------------------------

:jrmuizel: Can you find someone to review reftest.list?

::: dom/canvas/test/webgl-conformance/mochitest-errata.ini
@@ +104,5 @@
> # Failures after enabling color_buffer_[half_]float.
> +# remove when bug 1220658 is resolved as this is unexpected-pass.
> +skip-if = (os == 'linux')
> +[_wrappers/test_conformance__textures__texture-mips.html]
> +# remove when bug 1220658 is resolved as this is unexpected-pass.

Alright. Technically we should unmark these unexpected-pass entries from being predicted to fail. I'm changing a bunch of this stuff right now, so it's fine if we just hack these out if that means you can move faster.
Attachment #8698592 - Flags: review?(jmuizelaar)
Attachment #8698592 - Flags: review?(jgilbert)
Attachment #8698592 - Flags: review+
oh- I accidentally picked up another patch in this one- let me make this just the canvas changes.
Attached patch gl_mesa921_temp_disable.patch (deleted) — Splinter Review
ok, this is just the files under dom/canvas/test changed!
Attachment #8698592 - Attachment is obsolete: true
Attachment #8698592 - Flags: review?(jmuizelaar)
Attachment #8698601 - Flags: review+
This patch was landed to fix the tests which fail while running with the new mesa library AND on taskcluster. I assume this is related specifically to the mesa library, since the 'gl' tests were at parity prior to the mesa upgrade.
Keywords: leave-open
Joel, I'm going to mirror the missing repo, and then we should be ready to go with this update. Do you have any preferences when we should roll this out? Some time next week maybe?
Flags: needinfo?(jmaher)
early next week if possible!
Flags: needinfo?(jmaher)
I decided to mirror this repo separately, so it'd be easier to recover if we want to back it out.

debmirror --config-file=/etc/debmirror.conf --source --no-check-gpg \
  -a i386,amd64 \
  -s main,main/debian-installer,restricted,restricted/debian-installer,universe,universe/debian-installer \
  -d precise-updates -h us.archive.ubuntu.com -r /ubuntu -e rsync --progress --nocleanup \
  /data/repos/apt/precise-updates/

When the files are synced I'll try to land the patch and regenerate the AMIs.
Attached patch mesa-puppet.diff (deleted) — Splinter Review
Attachment #8700610 - Flags: review?(dustin)
Attachment #8700610 - Flags: review?(dustin) → review+
Callek volunteered to deploy this change next Monday. What we need to do:

1) Land https://bug1220658.bmoattachments.org/attachment.cgi?id=8700610
2) merge to production
3) wait until it's deployed on all masters
4) regenerate 3 AMIs:

# from /builds/aws_manager/bin on aws-manager2
./aws_manager-tst-linux64-ec2-golden.sh
./aws_manager-tst-linux32-ec2-golden.sh
./aws_manager-tst-emulator64-ec2-golden.sh

If everything goes well, this is all we need.

In case we need to backout the change:
* repeat the same steps with the patch backed out
* kill the instances based on the new AMIs, see https://wiki.mozilla.org/ReleaseEngineering/How_To/Manage_spot_AMIs
Assignee: rail → bugspam.Callek
emulator64, linux32, and linux64 have all failed so far with:

Mon Dec 28 08:23:00 -0800 2015 Puppet (err): Could not update: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold --force-yes install libc6=2.15-0ubuntu10.10' returned 100:
Reading package lists...
Building dependency tree...
Reading state information...
Some packages could not be installed. This may mean that you have requested an impossible situation or if you are using the unstable distribution that some required packages have not yet been created or been moved out of Incoming.
The following information may help to resolve the situation:
The following packages have unmet dependencies:
 libc6 : Depends: libc-bin (= 2.15-0ubuntu10.10) but 2.15-0ubuntu10.12 is to be installed
E: Unable to correct problems, you have held broken packages.

Wrapped exception:
Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold --force-yes install libc6=2.15-0ubuntu10.10' returned 100:
Reading package lists...
Building dependency tree...
Reading state information...
Some packages could not be installed. This may mean that you have requested an impossible situation or if you are using the unstable distribution that some required packages have not yet been created or been moved out of Incoming.
The following information may help to resolve the situation:
The following packages have unmet dependencies:
 libc6 : Depends: libc-bin (= 2.15-0ubuntu10.10) but 2.15-0ubuntu10.12 is to be installed
E: Unable to correct problems, you have held broken packages.

Mon Dec 28 08:23:00 -0800 2015 /Stage[main]/Packages::Libc/Package[libc6]/ensure (err): change from 2.15-0ubuntu10 to 2.15-0ubuntu10.10 failed: Could not update: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold --force-yes install libc6=2.15-0ubuntu10.10' returned 100:
Reading package lists...
Building dependency tree...
Reading state information...
Some packages could not be installed. This may mean that you have requested an impossible situation or if you are using the unstable distribution that some required packages have not yet been created or been moved out of Incoming.
The following information may help to resolve the situation:
The following packages have unmet dependencies:
 libc6 : Depends: libc-bin (= 2.15-0ubuntu10.10) but 2.15-0ubuntu10.12 is to be installed
E: Unable to correct problems, you have held broken packages.
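For anyone hitting this on a loaner, standard apt tooling (nothing bug-specific) shows where the conflicting versions come from:

  apt-cache policy libc6 libc-bin    # installed/candidate versions and the repos providing them
  apt-cache madison libc-bin         # every libc-bin version apt can see, per repo
  # Roughly: puppet pins libc6 to 2.15-0ubuntu10.10 while apt's candidate libc-bin is
  # 2.15-0ubuntu10.12, and the two must match exactly -- hence the unmet dependency.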
Sorry about that. I haven't tested the initial puppetization, because it's hard to use dev env/pinned slaves without changing the production configs. Probably I should add this support to puppetize.sh...
Comment on attachment 8700610 [details] [diff] [review]
mesa-puppet.diff

Review of attachment 8700610 [details] [diff] [review]:
-----------------------------------------------------------------

For future landing of this patch, assuming package deps are understood....

Backed out this patch (and my followup) due to the package dep issue above.
https://hg.mozilla.org/build/puppet/rev/9b0abc13eebe
https://hg.mozilla.org/build/puppet/rev/577bcd68bc0a

Feedback to :rail for what-to-do-next.

::: modules/packages/manifests/mesa.pp
@@ +11,5 @@
> package {
>   # This package is a recompiled version of
> + # http://packages.ubuntu.com/precise-updates/mesa-common-dev-lts-saucy
> + ["libgl1-mesa-dri-lts-saucy", "libgl1-mesa-glx-lts-saucy",
> +  "libglapi-mesa-lts-saucy", "libxatracker1-lts-saucy":

needs ] at the end.

@@ +22,5 @@
> + # http://packages.ubuntu.com/precise-updates/mesa-common-dev-lts-saucy
> + # libgl1-mesa-dev-lts-saucy:i386 is required by B2G emulators, Bug 1013634
> + ["libgl1-mesa-dri-lts-saucy", "libgl1-mesa-glx-lts-saucy",
> +  "libglapi-mesa-lts-saucy", "libxatracker1-lts-saucy",
> +  "libgl1-mesa-dev-lts-saucy:i386"]:

modules/packages/manifests/mesa.pp - ERROR: two-space soft tabs not used on line 26 - 2sp_soft_tabs

(I fixed the preceding two lines already in my travis patchset, that will be landing today)
Attachment #8700610 - Flags: feedback?(rail)
Attachment #8700610 - Flags: checked-in-
Attachment #8700610 - Flags: checked-in+
(In reply to Rail Aliiev [:rail] from comment #50)
> In case we need to backout the change.
> * repeat the same steps with the patch backed out
> * kill the instances based on the new AMIs, see
> https://wiki.mozilla.org/ReleaseEngineering/How_To/Manage_spot_AMIs

Doesn't look like any of these AMIs got to the created stage during my steps. I stopped the golden instances that were running as well.
Assignee: bugspam.Callek → rail
Relanded https://hg.mozilla.org/build/puppet/rev/040e91335831, waiting for travis.
I had to land the following 3 bustage fixes:

update libc version: http://hg.mozilla.org/build/puppet/rev/d5f27dc939d3
Make sure we use the new repo for libc (to make proxxy happier): http://hg.mozilla.org/build/puppet/rev/0fb4c1e2ea65
update python version: http://hg.mozilla.org/build/puppet/rev/1fde6f0410ba

I felt it safe to take minor version updates to fix the bustage; we are updating the whole system in any case.
pushed to try re-enabling the tests:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8af8fb99e5b0

we will need to get this landed on trunk/aurora/beta/release? probably. In addition there are web-platform-tests which need to get adjusted as well, tracked in bug 1236047
Comment on attachment 8700610 [details] [diff] [review] mesa-puppet.diff I think I sorted this out.
Attachment #8700610 - Flags: feedback?(rail)
I created a new exclusion profile in Treeherder to hide these failing jobs until they can be fixed up. We'll need to remove that exclusion once that happens.
Flags: needinfo?(wkocher)
Asan tests don't seem too happy: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=d464ff85debc&group_state=expanded&filter-searchStr=asan&selectedJob=19165556 As far as I can tell, they're related to the mesa upgrade? I don't think I'm going to be able to get away with hiding all of these Asan tests, even for a day or two...
Flags: needinfo?(rail)
Flags: needinfo?(jmaher)
And the fallout is piling up. Trunk, Aurora, Beta trees closed.
Flags: needinfo?(wkocher)
backed out all changes:

remote: https://hg.mozilla.org/build/puppet/rev/7ef629f87819
remote: https://hg.mozilla.org/build/puppet/rev/87b6ce64a12e

I'm going to revert the AMIs and kill running instances.
Flags: needinfo?(rail)
Attached patch mesa.diff (deleted) — Splinter Review
This is the working combined patch.
Reopening things, but I won't be around to watch for anything else breaking tonight.
Back to the pool. It doesn't look like we are going to upgrade easily. :(

To retry this attempt we would need to apply the following 2 patches (reversed):

http://hg.mozilla.org/build/puppet/rev/7ef629f87819
https://github.com/mozilla/build-cloud-tools/commit/a9ccd55ad8341d847bc3bbdfc7b4ff50a7dd936e
Assignee: rail → nobody
:jrmuizel, this is turning into a much larger effort- this will take a few weeks of time to sort out the failures and make this happen which isn't in our budget. If this is critical we could find a way for you or someone on your team to figure out the android/reftest/asan failures on a loaner and then we can push on upgrading our mesa libraries again. what are your thoughts?
Flags: needinfo?(jmaher) → needinfo?(jmuizelaar)
also on this note, we should remove this new mesa version from taskcluster until we are ready to upgrade on the buildbot side. This will allow us to avoid turning off tests and to see green on the webgl web-platform-tests.
(In reply to Joel Maher (:jmaher) from comment #70)
> :jrmuizel, this is turning into a much larger effort- this will take a few
> weeks of time to sort out the failures and make this happen which isn't in
> our budget. If this is critical we could find a way for you or someone on
> your team to figure out the android/reftest/asan failures on a loaner and
> then we can push on upgrading our mesa libraries again. what are your
> thoughts?

Yeah, I can figure out the failures. How do I get a machine in a similar state so that I can reproduce them?
Flags: needinfo?(jmuizelaar)
(In reply to Rail Aliiev [:rail] from comment #69) This patch and subsequent attempted backout also broke all of the talos-linux64-ix hardware instances. At this point, none of that pool is functional because puppet can't successfully complete. I don't see an easy way to back out (downgrading puppet2.7-minimal basically tries to wipe the entire system and start over). I think the best course of action to get back to a known good state is to reinstall them all, so I've started to do that now.
rail, is it possible to get Jeff a ec2 instance with the new mesa driver? I could help him run the tests on there.
Flags: needinfo?(rail)
Depends on: 1236550
Sure thing. Let me fix bug 1236550 first though
All talos-linux64-ix machines that were not loaned out have been reimaged.
(In reply to Amy Rich [:arr] [:arich] from comment #77) > All talos-linux64-ix machines that were not loaned out have been reimaged. Thank you for cleaning up after me. :(
(In reply to Joel Maher (:jmaher) from comment #60)
> pushed to try re-enabling the tests:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=8af8fb99e5b0
>
> we will need to get this landed on trunk/aurora/beta/release? probably. In
> addition there are web-platform-tests which need to get adjusted as well,
> tracked in bug 1236047

With the exception of the ASAN leaks, these are all basically spurious. (We should just update our expected fail/pass list)
these failures are why we backed out the tests: android failures (bug 1236116), b2g reftests (bug 1236115), asan leaksanitizer (bug 1236113).

For the webgl-conformance and web-platform-tests we can easily fix that in the manifests (I had successfully done that on try), but it is a chicken-and-egg problem. For these tests we either need to:
* disable them, upgrade, re-enable with new expectations
* upgrade 32/64/asan, live with failures, then land the manifest expectations on ALL branches
Depends on: 1236925
Flags: needinfo?(rail)
This should fix the reftest failures. I wasn't able to reproduce the address sanitizer issue and if it persists we can probably just suppress it.
Attachment #8681965 - Attachment is obsolete: true
Can we try again with the new patch?
Flags: needinfo?(rail)
(In reply to Jeff Muizelaar [:jrmuizel] from comment #83)
> Can we try again with the new patch?

Bug 1162375 (emu-kk-opt reftest[1] on taskcluster docker infrastructure) is running into a similar symptom, and perhaps the patch to fix llvmpipe could help to solve those reftest failures as well.

rail, if the patch works, can you also upgrade the mesa library for docker tester image used by B2G emu kk tests[2]?

[1] http://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/c9GhjhbUQompYMp_JCKAxw/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1
[2] taskcluster/tester:0.4.4 - https://tools.taskcluster.net/task-inspector/#E2gXibPRSQeM2StYMHIVIg/
(In reply to Jeff Muizelaar [:jrmuizel] from comment #83)
> Can we try again with the new patch?

I can try to build the package today and publish it to the same repo. Would it help testing it on the existing loaner? You'll just need to upgrade the mesa packages.

(In reply to Astley Chen [:astley] UTC+8 from comment #84)
> rail, if the patch works, can you also upgrade the mesa library for docker
> tester image used by B2G emu kk tests[2]?

Before we upgrade we should test it. AFAIK, you can do this by modifying https://dxr.mozilla.org/mozilla-central/source/testing/docker/base-test/Dockerfile#54 and pushing to try. It'd be better to ask people who worked on that file though.
the dockerfile will benefit the b2g stuff, but not the android or other desktop related failures.
Note for myself. Before the next attempt, we need to figure out what went wrong with talos machines.

Jeff, I uploaded the new packages. You can upgrade them on your loaner with something like:

apt-get update
apt-get install `dpkg -l | grep ^ii | grep 9.2.1-1ubuntu3~precise1mozilla1 | awk '{print $2}'`
Flags: needinfo?(rail)
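After running the upgrade commands above, a hedged sanity check to confirm the loaner actually picked up the 9.2.1 packages (glxinfo comes from mesa-utils, which may need installing, and needs a running X/Xvfb display):

  dpkg -l | grep -E 'mesa|dricore' | grep 9.2.1                  # did the package versions switch over?
  DISPLAY=:0 glxinfo | grep -iE 'opengl (version|renderer)'      # does the reported renderer/version look like 9.2.1?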
Joel, mind trying this again? I still need to fix some stuff in the puppet patch. If I fix it before this weekend, do you think that we should try to land it again over the weekend?
Flags: needinfo?(jmaher)
I am going to be out of town most of the weekend- but open to trying this out when we can. If we can reduce the effects on the hardware boxes, that would be nice to see.
Flags: needinfo?(jmaher)
Let's make sure we get the debug package Jeff wants in bug 1220253 included here (libgl1-mesa-dri-dbg).
Attached patch mesa.diff (deleted) — Splinter Review
To make things work I had to cherry-pick the missing required packages (libdrm-* and libllvm3.3*) and put them into the same apt repo. I explicitly listed all packages to be installed, including one dbg package per coop's request. I ran this change multiple times against existing instances and a fresh one (puppetized from scratch with this change). It should just work really soon now. :)
Attachment #8718111 - Flags: review?(bugspam.Callek)
Assignee: nobody → rail
Comment on attachment 8718111 [details] [diff] [review]
mesa.diff

r+ for the puppet changes, trusting :rail on the packages themselves.
Attachment #8718111 - Flags: review?(bugspam.Callek) → review+
We'll see the change tomorrow. I hope it works this time for all platforms.
(In reply to Rail Aliiev [:rail] from comment #96)
> remote: https://hg.mozilla.org/build/puppet/rev/5fb3e7d8ae6b
> remote: https://hg.mozilla.org/build/puppet/rev/ca664fd30e10
>
> ... to make make puppet-lint happier.

we get the same problems as bug 1236113 again :(
Blocks: 1247575
I'm going to revert this, we have a lot of test failures:

08:05 <Tomcat|sheriffduty> we run into bug 1236113
08:05 <Tomcat|sheriffduty> rail: https://treeherder.mozilla.org/logviewer.html#?job_id=21520008&repo=mozilla-inbound
08:06 <Tomcat|sheriffduty> same problem again
08:07 <Tomcat|sheriffduty> rail: can we revert this ?
08:12 <Tomcat|sheriffduty> rail: filed bug 1247575
08:12 <Tomcat|sheriffduty> there is also a unexpected pass
08:12 <Tomcat|sheriffduty> https://treeherder.mozilla.org/logviewer.html#?job_id=3274298&repo=mozilla-central
08:12 <Tomcat|sheriffduty> i guess this also related
08:13 <Tomcat|sheriffduty> it seems that asan builds react like : TEST-UNEXPECTED-FAIL | LeakSanitizer | leak at /usr/lib/x86_64-linux-gnu/libdricore9.2.1.so.1
08:13 <Tomcat|sheriffduty> and linux opt like https://treeherder.mozilla.org/logviewer.html#?job_id=21525557&repo=mozilla-inbound
08:13 <Tomcat|sheriffduty> 671 INFO TEST-UNEXPECTED-PASS | dom/canvas/test/webgl-mochitest/ensure-exts/test_EXT_disjoint_timer_query.html | fail-if condition in manifest - We expected at least one failure
Flags: needinfo?(rail)
Attachment #8718111 - Flags: checked-in+ → checked-in-
back to the pool again
Assignee: rail → nobody
(In reply to Rail Aliiev [:rail] from comment #99)
> I'm going to revert this, we have a lot of test failures:
>
> 08:05 <Tomcat|sheriffduty> we run into bug 1236113
> 08:05 <Tomcat|sheriffduty> rail: https://treeherder.mozilla.org/logviewer.html#?job_id=21520008&repo=mozilla-inbound
> 08:06 <Tomcat|sheriffduty> same problem again
> 08:07 <Tomcat|sheriffduty> rail: can we revert this ?
> 08:12 <Tomcat|sheriffduty> rail: filed bug 1247575
> 08:12 <Tomcat|sheriffduty> there is also a unexpected pass
> 08:12 <Tomcat|sheriffduty> https://treeherder.mozilla.org/logviewer.html#?job_id=3274298&repo=mozilla-central
> 08:12 <Tomcat|sheriffduty> i guess this also related
> 08:13 <Tomcat|sheriffduty> it seems that asan builds react like : TEST-UNEXPECTED-FAIL | LeakSanitizer | leak at /usr/lib/x86_64-linux-gnu/libdricore9.2.1.so.1
> 08:13 <Tomcat|sheriffduty> and linux opt like https://treeherder.mozilla.org/logviewer.html#?job_id=21525557&repo=mozilla-inbound
> 08:13 <Tomcat|sheriffduty> 671 INFO TEST-UNEXPECTED-PASS | dom/canvas/test/webgl-mochitest/ensure-exts/test_EXT_disjoint_timer_query.html | fail-if condition in manifest - We expected at least one failure

I can give you a patch to change these to expected-pass instead of their present expected-fail. The unexpected passes here are good news. It's just the leaks we need to worry about.
wait...we have test-unexpected-pass failures expected, we need to land the manifest updates. In addition, web-platform-tests will have a failure as well.....and...we need to do this on ALL branches. the asan leak is something new though.
(In reply to Joel Maher (:jmaher) from comment #103)
> wait...we have test-unexpected-pass failures expected, we need to land the
> manifest updates. In addition, web-platform-tests will have a failure as
> well.....and...we need to do this on ALL branches.

Why would web-platform-tests have a failure?
Mesa looks to be leaking its debug logs. We'll work around this in bug 1247762.
Depends on: 1247762
Bug 1247762 didn't work. I've added a suppression for dricore9.2.1.so in bug 1248290. In my tests that resolves the address sanitizer issues. I assume that with that fix, this should be ready to go again.
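For context, LeakSanitizer suppressions are plain "leak:<pattern>" lines matched against frames in the leak's stack; a hedged sketch of the kind of entry involved (the real change lives in bug 1248290, not here):

  echo 'leak:libdricore' >> lsan_suppressions.txt
  export LSAN_OPTIONS="suppressions=$PWD/lsan_suppressions.txt"   # how LSan is pointed at such a file in general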
(In reply to Jeff Muizelaar [:jrmuizel] from comment #106)
> Bug 1247762 didn't work. I've added a suppression for dricore9.2.1.so in bug
> 1248290. In my tests that resolves the address sanitizer issues. I assume
> that with that fix, this should be ready to go again.

Joel, mind if we try again on Tue?
Flags: needinfo?(jmaher)
sounds great!
Flags: needinfo?(jmaher)
Comment on attachment 8718111 [details] [diff] [review]
mesa.diff

Alright, take N+1. It'll show up tomorrow early ET morning.

remote: https://hg.mozilla.org/build/puppet/rev/7d907471288a
remote: https://hg.mozilla.org/build/puppet/rev/95c88502fa3d
Attachment #8718111 - Flags: checked-in- → checked-in+
Depends on: 1247752
Depends on: 1247753
Depends on: 1248290
Depends on: 1201885
jgilbert, I am not sure how to get expected-pass for the tests. Here is a try push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=99a72143fb4d

here is the patch that changes the webgl manifests:
https://hg.mozilla.org/try/rev/0328ff396b1a

I cannot figure out why these are expected-fail.
Flags: needinfo?(jgilbert)
Blocks: 1162375
Comment on attachment 8720966 [details]
MozReview Request: Bug 1220658: remove old mesa-debian directory; r?rail

https://reviewboard.mozilla.org/r/35525/#review32191
Attachment #8720966 - Flags: review?(rail) → review+
(In reply to Joel Maher (:jmaher) from comment #111)
> jgilbert, I am not sure how to get expected-pass for the tests. Here is a
> try push:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=99a72143fb4d
>
> here is the patch that changes the webgl manifests:
> https://hg.mozilla.org/try/rev/0328ff396b1a
>
> I cannot figure out why these are expected-fail.

We fixed these by marking the failures (and passes) in the resulting perma-orange bugs.
Flags: needinfo?(jgilbert)
I think I was supposed to mark this done? If not, just reopen.
Status: NEW → RESOLVED
Closed: 9 years ago
Keywords: leave-open
Resolution: --- → FIXED
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard