Closed Bug 819963 Opened 12 years ago Closed 10 years ago

Split up mochitest-bc on desktop into ~30minute chunks

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ehsan.akhgari, Assigned: massimo)

References

(Depends on 4 open bugs)

Details

(Keywords: sheriffing-P1)

Attachments

(8 files, 4 obsolete files)

These tests on debug builds are taking an hour and a half some of the time. Time to split them in two.
Blocks: 798945
Keywords: sheriffing-P1
browser-chrome is now the long pole in terms of end-to-end times (see bug 798945 comment 4) - the debug runs can take between 55 and 100 mins depending on platform.

Prior to splitting, I would be interested to see whether there are any more clownshoes (of the bug 864085 and bug 865549 ilk) we can fix to lower the runtime.
Depends on: 883314
Component: BrowserTest → Release Engineering: Automation (General)
OS: Mac OS X → All
Product: Testing → mozilla.org
QA Contact: catlee
Hardware: x86 → All
Version: Trunk → other
Attached file win7 debug test times (deleted) —
the slowest tests here are:

20.374 chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser-window/browser_discovery.js
24.814 chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser/browser_discovery.js
26.011 chrome://mochitests/content/browser/browser/base/content/test/newtab/browser_newtab_drag_drop.js
27.768 chrome://mochitests/content/browser/dom/indexedDB/test/browser_quotaPromptDeny.js
31.195 chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser/browser_bug562797.js
35.015 chrome://mochitests/content/browser/toolkit/components/startup/tests/browser/browser_crash_detection.js
41.336 chrome://mochitests/content/browser/browser/components/sessionstore/test/browser_480148.js
46.919 chrome://mochitests/content/browser/browser/devtools/debugger/test/browser_dbg_propertyview-data-big.js
83.641 chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser/browser_bug557956.js

(all times in seconds)
Attached file fedora debug test times (deleted) —
the slowest tests here are:

34.422 chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser-window/browser_discovery.js
35.018 chrome://mochitests/content/browser/toolkit/components/startup/tests/browser/browser_crash_detection.js
40.42 chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser/browser_discovery.js
44.987 chrome://mochitests/content/browser/browser/components/sessionstore/test/browser_480148.js
48.463 chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser/browser_bug562797.js
54.915 chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser/browser_bug557956.js
56.325 chrome://mochitests/content/browser/browser/devtools/debugger/test/browser_dbg_propertyview-data-big.js
Since 10.7 debug is averaging over 110 minutes, and the rewrite that's on the UX tree has added maybe 5 minutes' worth of tests, the runs are intermittently going over 120 minutes and turning red as a result. So this (or a magic clownshoes discovery) probably blocks landing Australis.
Blocks: australis
Whiteboard: [Australis:M8][Australis:P1]
(In reply to Chris AtLee [:catlee] from comment #4)
> Created attachment 763109 [details]
> fedora debug test times
> 
> the slowest tests here are:

Chris, did you check this manually? If not, is the script you're using available somewhere? I'd like to run it on UX log output to check we're not doing anything, err, clownshoes, to quote philor. :-)
Flags: needinfo?(catlee)
I don't think this should block a landing -- we should increase the timeout if we're ready to land but the bc hasn't been split. (And then really, really get it split, since it's obviously been put off in the 6 months since ehsan filed it.)

It's an artificial barrier with no end-user impact, and a big part of the push to get Australis landed ASAP is to get feedback.
Sure, filing and getting someone to fix "kick the timeout back up to 9000 again" would work just as well to get you unblocked, but one or the other does block, since if you turn 10.7 bc pretty much permared you'll bounce, and this bug already existed to serve as a reminder of that.
(In reply to :Gijs Kruitbosch from comment #6)
> (In reply to Chris AtLee [:catlee] from comment #4)
> > Created attachment 763109 [details]
> > fedora debug test times
> > 
> > the slowest tests here are:
> 
> Chris, did you check this manually? If not, is the script you're using
> available somewhere? I'd like to run it on UX log output to check we're not
> doing anything, err, clownshoes, to quote philor. :-)

I did some spot checks, and it seemed to be correct. I'll attach the script here so you can give it a spin yourself.
Flags: needinfo?(catlee)
Attached file test_times.py (deleted) —
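The attachment itself is gone, but based on the "TEST-END | <test> | finished in <N>ms" lines that appear later in this bug and the "<seconds> <test>" lists quoted above, a script along these lines could plausibly reproduce the output. This is a hypothetical reconstruction, not catlee's actual script:

#!/usr/bin/env python
# Hypothetical reconstruction of test_times.py (the real attachment is
# deleted). It pulls per-test durations out of a mochitest log by matching
# "TEST-END | <test> | finished in <N>ms" lines and prints them sorted by
# duration in "<seconds> <test>" form.
import re
import sys

TEST_END = re.compile(r"TEST-END \| (\S+) \| finished in (\d+)ms")

def test_times(log_path):
    times = {}
    with open(log_path) as log:
        for line in log:
            match = TEST_END.search(line)
            if match:
                times[match.group(1)] = int(match.group(2)) / 1000.0
    return times

if __name__ == "__main__":
    for test, secs in sorted(test_times(sys.argv[1]).items(), key=lambda kv: kv[1]):
        print("%.3f %s" % (secs, test))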
(In reply to Chris AtLee [:catlee] from comment #10)
> Created attachment 776386 [details]
> test_times.py

Thanks. I filed bug 894411 about our tests. We seem to be eating about 2 minutes on top of what's on m-c. I suspect we may be able to get that to be under a minute, but if that's not enough and/or I'm wrong, we should just be fixing this, I'm afraid.
Assignee: nobody → mgervasini
See bug 894930. We're hitting the 7200s timeout on OSX 10.7 debug now.
Product: mozilla.org → Release Engineering
Whiteboard: [Australis:M8][Australis:P1] → [Australis:M?][Australis:P1]
Massimo, any update here?
Flags: needinfo?(mgervasini)
Hey Jared,

I have chunking working in the staging environment. I am running tests with 2 chunks: chunk 2 takes about 15 minutes to complete, while chunk 1 takes more than 53 minutes and its output is over 50M, so it gets truncated (tested on Mac).

I am now increasing the number of chunks to 5 to see how it behaves with that setup.
Flags: needinfo?(mgervasini)
Status: NEW → ASSIGNED
Attached image out.png (deleted) —
Here are the results with 5 chunks (tested on Mac):

chunk 1: 11m44s 
chunk 2: 31m40s
chunk 3:  7m30s
chunk 4:  4m11s
chunk 5: 12m23s

There is a test error in the second chunk:

...
03:12:26     INFO -  45096 INFO TEST-START | /tests/dom/browser-element/mochitest/test_browserElement_NoWhitelist.html
03:12:26     INFO -  creating 1!
03:12:26     INFO -  2013-08-29 03:12:26.780 plugin-container[871:903] *** __NSAutoreleaseNoPool(): Object 0x10819bb40 of class NSCFDictionary autoreleased with no pool in place - just leaking
03:12:26     INFO -  2013-08-29 03:12:26.782 plugin-container[871:903] *** __NSAutoreleaseNoPool(): Object 0x11931ae80 of class NSCFArray autoreleased with no pool in place - just leaking
03:12:26     INFO -  [TabChild] SHOW (w,h)= (0, 0)
03:12:26     INFO -  ############################### browserElementPanning.js loaded
03:12:26     INFO -  ######################## BrowserElementChildPreload.js loaded
03:12:26     INFO -  loading about:blank, 1
03:12:26     INFO -  loading http://example.com/tests/dom/browser-element/mochitest/file_empty.html, 1
03:12:26     INFO -  45097 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/browser-element/mochitest/test_browserElement_NoWhitelist.html | Should not send mozbrowserloadstart event.
03:12:26     INFO -  45098 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/browser-element/mochitest/test_browserElement_NoWhitelist.html | Should not send mozbrowserloadstart event.
03:17:49     INFO -  45099 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/browser-element/mochitest/test_browserElement_NoWhitelist.html | Test timed out.
03:17:49     INFO -  args: ['/usr/sbin/screencapture', '-C', '-x', '-t', 'png', '/var/folders/WJ/WJB0YwOpFvuJxQKLn0fqnU+++-k/-Tmp-/mozilla-test-fail_1HPMOr']
03:17:53     INFO -  SCREENSHOT: <...>
03:17:53     INFO -  45100 INFO TEST-END | /tests/dom/browser-element/mochitest/test_browserElement_NoWhitelist.html | finished in 324291ms
...

Any ideas?
(In reply to Massimo Gervasini [:mgerva] from comment #16)

> 03:12:26     INFO -  45097 ERROR TEST-UNEXPECTED-FAIL |
> /tests/dom/browser-element/mochitest/test_browserElement_NoWhitelist.html |
> Should not send mozbrowserloadstart event.
> 03:12:26     INFO -  45098 ERROR TEST-UNEXPECTED-FAIL |
> /tests/dom/browser-element/mochitest/test_browserElement_NoWhitelist.html |
> Should not send mozbrowserloadstart event.
> 03:17:49     INFO -  45099 ERROR TEST-UNEXPECTED-FAIL |
> /tests/dom/browser-element/mochitest/test_browserElement_NoWhitelist.html |
> '/var/folders/WJ/WJB0YwOpFvuJxQKLn0fqnU+++-k/-Tmp-/mozilla-test-fail_1HPMOr']
> 03:17:53     INFO -  SCREENSHOT: <...>


Oh, I forgot to mention: the log says the test failed, but the screenshot says status "Pass".
Olli, you reviewed the patch for bug 710231. Do you know why this test might be failing? Can you help out Massimo here?
Flags: needinfo?(bugs)
I don't know if this helps, but in my staging environment I am also running the same test suite in 7 and 9 chunks, and in both cases there's a failure in the 2nd and 3rd chunks.

Is there some kind of dependency on previous state which was not exposed when we ran the suite as a single unchunked job?
(In reply to Massimo Gervasini [:mgerva] from comment #19)
> Is there some kind of dependency on previous states which was not exposed
> when we had a single test?

That is what I suspect is happening. We can also disable the failing tests and file bugs to get them fixed.
Untracking from Australis since bug 894930 has been fixed.
No longer blocks: australis
Whiteboard: [Australis:M?][Australis:P1]
Attached patch buildbot-configs-819963.patch (obsolete) (deleted) — Splinter Review
Hi Rail,

this patch:
* splits the browser-chrome tests into two chunks, as requested
* removes unittest_suites from mozilla/config.py (it's not used)
* applies pep8 fixes
I am asking for feedback rather than a full review because I have run some experiments in my staging environment (a single OS X 10.6 slave) using different numbers of chunks, and the total time to execute the full suite with 2 or 3 chunks is roughly the same (at least in my environment):

chunks   total time
   2      1:06:52
   3      1:06:43 

more details:

time spent in each chunk:
2 chunks:
  1     52 mins,  2 secs
  2     14 mins, 50 secs

3 chunks:
  1     30 mins, 12 secs
  2     23 mins,  1 secs
  3     13 mins, 30 secs


If we have enough slave capacity, 3 chunks could (in theory) save up to ~20min (best case scenario)

Do you think it's worth splitting this test suite into 3 chunks instead of 2?
(I'll post the mozharness patch as soon as we know the number of chunks.)
Attachment #803572 - Flags: feedback?(rail)
I'd definitely go for 3, if the total time is roughly the same, particularly since the splits seem a lot more even for 3 (and the reduced end-to-end time will be really helpful for sheriffs).
Comment on attachment 803572 [details] [diff] [review]
buildbot-configs-819963.patch

Thanks for poking this!

The patch looks OK to me.

BTW, have you tried running more than 3 chunks? Let's say 10!? :) There may be some bootstrapping overhead when you have too many chunks, so we need to strike a balance between the bootstrap time and the real time spent testing.
Attachment #803572 - Flags: feedback?(rail) → feedback+
Attached patch buildbot-configs-819963.patch (obsolete) (deleted) — Splinter Review
This patch:

* splits the browser-chrome tests into three chunks
* removes unittest_suites from mozilla/config.py (it's not used)
* applies pep8 fixes
Attachment #803572 - Attachment is obsolete: true
Attachment #803784 - Flags: review?(rail)
Attached patch mozharness-819963.patch (obsolete) (deleted) — Splinter Review
... and this is the patch for mozharness
Attachment #803785 - Flags: review?(rail)
Attached file chunking-results.txt (deleted) —
More results - single OS X 10.6 slave

chunks     total time
   2      1:06:52
   3      1:06:43
   5      1:07:28
   9      1:23:36
  27      2:02:10
  51      2:43:05

(more detailed results in attachment)

If we decide to have more than 3 chunks we should rethink the patch. As it is now, a patch for 27 chunks would be ~350 lines long.
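For a rough sense of the bootstrapping overhead rail asked about, derived only from the table above (a back-of-the-envelope estimate, not a measurement):

# Going from 2 chunks (1:06:52) to 51 chunks (2:43:05) adds roughly
# 96 minutes spread over 49 extra chunks on this single slave.
two_chunks = 66 + 52 / 60.0        # 1:06:52 in minutes
fifty_one_chunks = 163 + 5 / 60.0  # 2:43:05 in minutes
per_chunk_overhead = (fifty_one_chunks - two_chunks) / (51 - 2)
print(round(per_chunk_overhead, 1))  # ~2.0 minutes of setup/teardown per extra chunk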
I think 3 sounds like the optimal number for now :-)
Attachment #803784 - Flags: review?(rail) → review+
Comment on attachment 803785 [details] [diff] [review]
mozharness-819963.patch

Review of attachment 803785 [details] [diff] [review]:
-----------------------------------------------------------------

::: configs/unittests/linux_unittest.py
@@ -64,4 @@
>          "plain3": ["--total-chunks=5", "--this-chunk=3", "--chunk-by-dir=4"],
>          "plain4": ["--total-chunks=5", "--this-chunk=4", "--chunk-by-dir=4"],
>          "plain5": ["--total-chunks=5", "--this-chunk=5", "--chunk-by-dir=4"],
> -        "chrome": ["--chrome"],

Remind me, why did you remove "chrome" here?
BTW, maybe it's worth splitting the mozharness patch into 2 parts:

1) the first part adds the new configs without removing the old ones

2) the second part removes the old configs.

That way you wouldn't need to worry about keeping buildbot-configs and mozharness in sync across reconfigs. You could land the first part anytime, merge it to production, land the buildbot-configs patch, reconfig, and only then land the second mozharness patch.
Attached patch mozharness-819963-1.patch (deleted) — Splinter Review
* adds browser-chrome-{1,2,3}
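The patch attachment is deleted; judging from the existing entries quoted in the review above (e.g. "plain3": ["--total-chunks=5", "--this-chunk=3", "--chunk-by-dir=4"]), the new config entries presumably look roughly like this sketch. The exact flags are an assumption:

# Sketch of the likely shape of the browser-chrome-{1,2,3} entries in
# configs/unittests/*_unittest.py -- modelled on the "plainN" entries quoted
# in the review comment above, not copied from the deleted patch.
BROWSER_CHROME_CHUNKS_SKETCH = {
    "browser-chrome-1": ["--browser-chrome", "--total-chunks=3", "--this-chunk=1"],
    "browser-chrome-2": ["--browser-chrome", "--total-chunks=3", "--this-chunk=2"],
    "browser-chrome-3": ["--browser-chrome", "--total-chunks=3", "--this-chunk=3"],
}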
Attachment #803785 - Attachment is obsolete: true
Attachment #803785 - Flags: review?(rail)
Attachment #804473 - Flags: review?(rail)
Attached patch mozharness-819963-2.patch (deleted) — Splinter Review
removes: "browser-chrome": ["--browser-chrome"]
Attachment #804474 - Flags: review?(rail)
Attachment #804473 - Flags: review?(rail) → review+
Attachment #804474 - Flags: review?(rail) → review+
Any idea why this bug stalled?
Hi Jared,

sorry for the delay. I am ready to commit this patch, but there are test failures.

Chunk 2 is fine; the other chunks have some errors. Full logs here:

chunk 1: https://people.mozilla.org/~mgervasini/chunk_1.html
chunk 2: https://people.mozilla.org/~mgervasini/chunk_2.html
chunk 3: https://people.mozilla.org/~mgervasini/chunk_3.html

How do we proceed here?
Flags: needinfo?(jaws)
If you can narrow it down to which tests are causing the errors, then we can disable those tests (assuming it is in the range of 1-10 tests). If it is many more tests that are failing then there may be something more serious going on here.

For chunk 1, the group of tests at /tests/dom/browser-element/mochitest/test_browserElement* appear to be causing issues. Can you try selectively disabling those tests (first disable all, run the chunk, re-enable half, run the chunk, re-enable half of the still disabled, run the chunk, etc, until you can find the culprit or determine that all of the tests are error prone)?

For chunk 3, /tests/toolkit/components/alerts/test/test_alerts.html and /tests/toolkit/components/alerts/test/test_alerts_noobserve.html are the two tests that failed. We could disable these two tests and file a bug to re-enable them, CCing Michael Ventnor (original author of the tests) and :emk (the last person to touch both tests).
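The halving procedure described above is essentially a prefix bisection over the list of suspect tests. A minimal illustrative sketch, assuming the failure reproduces deterministically and a single test is responsible; `chunk_passes` is a hypothetical stand-in for "run the chunk with only this subset of suspects re-enabled and check the result":

def find_culprit(suspects, chunk_passes):
    # Invariant: the chunk passes with suspects[:lo] re-enabled and fails
    # with suspects[:hi] re-enabled, so the culprit index is in [lo, hi).
    lo, hi = 0, len(suspects)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if chunk_passes(suspects[:mid]):
            lo = mid   # culprit is among the later tests
        else:
            hi = mid   # culprit is within the first `mid` tests
    return suspects[lo]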
Flags: needinfo?(bugs)
Flags: needinfo?(jaws)
(In reply to Jared Wein [:jaws] from comment #35)
> If you can narrow it down to which tests are causing the errors, then we can
> disable those tests (assuming it is in the range of 1-10 tests). If it is
> many more tests that are failing then there may be something more serious
> going on here.
> 
> For chunk 1, the group of tests at
> /tests/dom/browser-element/mochitest/test_browserElement* appear to be
> causing issues. Can you try selectively disabling those tests (first disable
> all, run the chunk, re-enable half, run the chunk, re-enable half of the
> still disabled, run the chunk, etc, until you can find the culprit or
> determine that all of the tests are error prone)?
> 
> For chunk 3, /tests/toolkit/components/alerts/test/test_alerts.html and
> /tests/toolkit/components/alerts/test/test_alerts_noobserve.html are the two
> tests that failed. We could disable these two tests and file a bug to
> re-enable them, CCing Michael Ventnor (original author of the tests) and
> :emk (the last person to touch both tests).

Massimo, did you run these on your own machine? Is there a reason you can't do a try push? AFAICT e.g. the test_alerts tests can depend on machine configuration (see e.g. https://bugzilla.mozilla.org/show_bug.cgi?id=838203 ). That will also get us information about the other platforms, ensuring we don't run into surprises there.
Hi Gijs,

The tests ran in the staging environment: http://dev-master01.build.scl1.mozilla.com:8920 (requires VPN access). It's probably better to ask catlee or rail about the try push. What do you think, guys?
Flags: needinfo?(rail)
Flags: needinfo?(catlee)
We can't change the chunking for individual try pushes, which is why we need to run it on dev-masters.
Flags: needinfo?(catlee)
I think catlee answered the question.
Flags: needinfo?(rail)
Whether the slaves running the tests "on dev-masters" are the same as the slaves that run them in production is the relevant question. I assume the answer is yes?
Blocks: 932159
Hi all,

I think the best way to proceed here is to apply the first mozharness patch so the new browser-chrome-{1,2,3} are ready to be executed.
Once the mozharness patch is applied, we can enable the new chunked tests only on cedar (requires a new patch) so both browser-chrome and its chunked version run in parallel. This would make it easier to look at the failed tests and would simplify debugging.

In parallel, I am starting to extend the dev-master tests to linux and windows.

Does that sound right to you?
(In reply to Massimo Gervasini [:mgerva] from comment #41)
> Hi all,
> 
> I think the best way to proceed here is to apply the first mozharness patch
> so the new browser-chrome-{1,2,3} are ready to be executed.
> Once the mozharness patch is applied, we can enable the new chunked tests
> only on cedar (requires a new patch) so both the browser-chrome and its
> chunked version run in parallel. This would make easier to access the failed
> tests and make debugging easier. 
> 
> In parallel, I am starting to extend the dev-master tests to linux and
> windows.
> 
> Does it sound right for you?

This sounds great, what is the ETA for landing the first mozharness patch?
Summary: Split up mochitest-bc on desktop into two chunks → Split up mochitest-bc on desktop into three chunks
Summary of offline emails:

1) mgerva: when you get your changes landed on cedar, can you need-info jgriffin - he's happy to lead fixing/disabling any tests as needed on cedar. Once the chunked-up-mochitest-bc suites are green on cedar, we can rollout this change to other branches.

2) Currently the mochitest-browser-chrome test suite takes ~2hrs to run... by far the longest of any test suite. Details in http://oduinn.com/blog/2013/10/27/better-display-for-compute-hours-per-checkin/. This is bad for RelEng scheduling, and also bad for sheriffs+devs whenever we have to retrigger for intermittent test failures. Ideally, we'd like all test suites to be ~30mins, as that is a good tradeoff for fast turnaround+scheduling and a good balance of setup/teardown overhead vs. real work. For this specific case, it looks like it would mean splitting the 2-hour mochitest-browser-chrome suite into 4? test suites, each ~30mins in duration. Tweaking summary to match.
Flags: needinfo?(mgervasini)
Summary: Split up mochitest-bc on desktop into three chunks → Split up mochitest-bc on desktop into ~30minute chunks
It doesn't take ~2hrs to run, it takes between 45 minutes and an unknowable time over 150 minutes since that's when buildbot times it out. With around 5 minutes of setup, unless we take the unprecedented option of having different numbers of chunks for opt and debug and different numbers of chunks for each OS, you have choices like "25-77 minutes" or "18-53 minutes" or "15-41 minutes" with the amount of wasted machine time (and thus the try backlog time) increasing with each decrease. Four chunks across 17 platforms would be an added 255 minutes of machine time per push, not counting the time it takes between runs to get a slave started on a new job.
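The 255-minute figure is just the per-chunk setup cost multiplied out; as a quick sanity check using the numbers in the comment above:

# 4 chunks instead of 1 means 3 extra jobs per platform, each paying
# ~5 minutes of setup, across 17 platforms.
extra_jobs_per_platform = 4 - 1
setup_minutes_per_job = 5
platforms = 17
print(extra_jobs_per_platform * setup_minutes_per_job * platforms)  # 255 minutes per push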
(In reply to John O'Duinn [:joduinn] from comment #43)
> Summary of offline emails:
> 
> 1) mgerva: when you get your changes landed on cedar, can you need-info
> jgriffin - he's happy to lead fixing/disabling any tests as needed on cedar.
> Once the chunked-up-mochitest-bc suites are green on cedar, we can rollout
> this change to other branches.

Thanks John!
I am creating another patch for buildbot-configs to enable bc chunking only on cedar. It's taking longer than expected, but I am now working on this issue 100% of my time.


> 2) Currently the mochitest-browser-chrome test suite takes ~2hrs to run...
> by far the longest of any test suites. Details in
> http://oduinn.com/blog/2013/10/27/better-display-for-compute-hours-per-
> checkin/. This is bad for RelEng scheduling, and also bad for sheriffs+devs
> whenever we have to retrigger for intermittent test failures. Ideally, we'd
> like all test suites to be ~30mins, as that was good tradeoff on fast
> turnaround+scheduling, and a good balance of setup/teardown overhead vs
> real-work. For this specific case, it looks like it would mean splitting the
> 2hour mochitest-browser-chrome suite into 4? test suites, each ~30mins
> duration. Tweaking summary to match.

In theory that's right, but from the tests I've executed in the staging environment, the time to execute a single chunk does not scale linearly - for example, switching from 2 to 3 chunks mostly affects the first chunk (see comment #22). I think that's because not all tests have the same execution time, so the end-to-end execution time of a chunk heavily depends on how many long-running tests land in it.
My opinion is that 3 chunks is a good number to start with; once we solve the issue of the failing tests, it should be relatively easy to run the suite with a different chunking and find the optimal value (I am trying to make it easy to configure).
Flags: needinfo?(mgervasini)
(In reply to Phil Ringnalda (:philor) from comment #44)
> It doesn't take ~2hrs to run, it takes between 45 minutes and an unknowable
> time over 150 minutes since that's when buildbot times it out. With around 5
> minutes of setup, unless we take the unprecedented option of having
> different numbers of chunks for opt and debug and different numbers of
> chunks for each OS, you have choices like "25-77 minutes" or "18-53 minutes"
> or "15-41 minutes" with the amount of wasted machine time (and thus the try
> backlog time) increasing with each decrease. Four chunks across 17 platforms
> would be an added 255 minutes of machine time per push, not counting the
> time it takes between runs to get a slave started on a new job.

On a current m-c run with no oranges, it takes between 31 and 133 minutes, depending on platform...debug OSX is the slowest.

Chunking definitely will increase machine time per push, although turnaround time on a given push may come down, depending on wait times.

We could consider applying the chunking only to debug runs (the opt runs are generally acceptable), but I don't know if that would confuse people.
Hi Rail,

this patch enables mochitest-browser-chrome-{1,2,3} on cedar (keeping the old mochitest-chrome active for reference). Other branches are not affected by this change.

This patch also removes chunked tests from BUILDBOT_UNITTEST_SUITES (used in prod) and re-enables browser-chrome in get_ubuntu_unittests(), but I am not totally sure that's really needed here.
Attachment #803784 - Attachment is obsolete: true
Attachment #828174 - Flags: review?(rail)
Attachment #804473 - Flags: checked-in+
Comment on attachment 828174 [details] [diff] [review]
enabling chunked mochitest-bc on cedar + pep8 fixes

I'm not sure if you wanted to add "opt test mochitest-browser-chrome" to other branches; see the builder diff: https://gist.github.com/rail/7345190

r- for now, let me know if that was intentional.
Attachment #828174 - Flags: review?(rail) → review-
Hi Rail,

you're totally right! This patch should fix the problem.
Attachment #828174 - Attachment is obsolete: true
Attachment #828647 - Flags: review?(rail)
Attachment #828647 - Flags: review?(rail) → review+
Attachment #828647 - Flags: checked-in+
in production
So, there are really two things going on here simultaneously:

1) mochitest-browser-chrome is being moved from Fedora slaves to EC2 instances on linux32/linux64 debug
2) mochitest-browser-chrome is being split into chunks for all platforms

1) by itself introduces a bunch of oranges on cedar; we'll have to address both 1) and 2) to get things green.
Depends on: 936512
Depends on: 936518
Depends on: 937380
Depends on: 937407
Depends on: 938426
There aren't too many problems that need to be resolved; the primary ones are bug 937380, bug 937407, and bug 938426.  Is there consensus that we should disable these in order to move forward?
Depends on: 933680
(In reply to Jonathan Griffin (:jgriffin) from comment #51)
> So, there are really two things going on here simultaneously:
> 
> 1) mochitest-browser-chrome is being moved from Fedora slaves to EC2
> instances on linu32/linux64 debug
> 2) mochitest-browser-chrome is being split into chunks for all platforms
> 
> 1) by itself introduces a bunch of oranges on cedar; we'll have to address
> both 1) and 2) to get things green.

So there is at least one tricky failure related to 1)...see bug 933680.  We could roll out the chunking faster if the chunking was done on Fedora slaves, and worry about moving to EC2 as a separate project.
(In reply to Jonathan Griffin (:jgriffin) from comment #52)
> There aren't too many problems that need to be resolved; the primary ones
> are bug 937380, bug 937407, and bug 938426.  Is there consensus that we
> should disable these in order to move forward?

We should investigate the failures before resorting to disabling.

Mark/Drew, can you guys look into these (and others in the dependency list) and chase down any easy fixes?
(In reply to Jonathan Griffin (:jgriffin) from comment #53)
> So there is at least one tricky failure related to 1)...see bug 933680.  We
> could roll out the chunking faster if the chunking was done on Fedora
> slaves, and worry about moving to EC2 as a separate project.

That would certainly simplify things. Is it feasible?
I don't think the two things are necessarily tied.  That said, chunking adds some additional setup/teardown time which will put additional load on hardware slaves, which will end up increasing their wait times.  This is probably most applicable to OSX slaves, which seem to be the hardest hit these days.
(In reply to :Gavin Sharp (email gavin@gavinsharp.com) from comment #55)
> (In reply to Jonathan Griffin (:jgriffin) from comment #53)
> > So there is at least one tricky failure related to 1)...see bug 933680.  We
> > could roll out the chunking faster if the chunking was done on Fedora
> > slaves, and worry about moving to EC2 as a separate project.
> 
> That would certainly simplify things. Is it feasible?

Not really, sadly. IT is merging the SCL1 colo with SCL3, and the old minis that are the Fedora slaves are being deprecated as part of that move, hence the push to run these in EC2. The timeline for EOL'ing the Fedora slaves is probably sometime in January as I understand things; NI'ing Amy Rich to get clarification on that timeline.
Flags: needinfo?(arich)
The goal was to decommission all of the fedora machines by end of FY2013, but we can slip that into January of 2014 since we are closing the company for the last two weeks of the year.
Flags: needinfo?(arich)
Actually nothing to do with this bug (none of this is, since we aren't seeing chunked ec2 failures that aren't also unchunked ec2 failures), but...

Has that been translated into bugs on *all* the things that currently run on Fedora slaves, like say every single Linux test on b2g18 since b2g18 is planning on existing until March, with the appropriate "this thing which nobody wanted to do for years now has to be done in six weeks or less" severity?
Depends on: 943092
Depends on: 943095
No longer blocks: 916194
(In reply to Amy Rich [:arich] [:arr] from comment #58)
> The goal was to decommission all of the fedora machines by end of FY2013,
> but we can slip that into January of 2014 since we are closing the company
> for the last two weeks of the year.

Here's me trying to weigh IT's need to decomm these troublesome machines against developers' need to have tests running *somewhere*.

To be clear, if there's a business need to support the fedora machines for longer, we will find a way to make that happen. That shouldn't be taken as a pass for developers to avoid fixing issues like bug 943095. The increased capacity of EC2 should be enough incentive there.

Massimo: how much effort to get chunked tests for cedar running on r3 Fedora slaves in the short-term, with the aim of swapping to EC2 ASAP?
Flags: needinfo?(mgervasini)
It shouldn't be too difficult; I am preparing a patch to enable chunked mochitest-bc for Fedora on cedar.
Flags: needinfo?(mgervasini)
(In reply to Massimo Gervasini [:mgerva] from comment #61)
> It shouldn't be too difficult, I am preparing a patch to enable chunked
> mochitest-bc for fedora on cedar.

Thanks, Massimo.
Hi all,

no need for a patch; tests are already running on r3 Fedora slaves: 

https://tbpl.mozilla.org/php/getParsedLog.php?id=31237630&tree=Cedar
(In reply to Massimo Gervasini [:mgerva] from comment #63)
> Hi all,
> 
> no need for a patch; tests are already running on r3 Fedora slaves: 
> 
> https://tbpl.mozilla.org/php/getParsedLog.php?id=31237630&tree=Cedar

Hey Massimo!

AIUI the request was for us to CHUNK mochitest-bc on the Fedora slaves, not just enable it on Fedora slaves. The reasoning behind this request is that chunking there gives us an easier migration target: "this chunk works on Fedora AND Ubuntu hosts, so we can switch this chunk off on Fedora and on on Ubuntu", etc.

If your reading of this bug is different, of course we should coordinate between releng/ateam to figure out the appropriate course of action.
Flags: needinfo?(mgervasini)
That is a chunk of bc (chunk #2) running on Fedora.

But if the Fedora thing has become "let's move all the tests that don't work on a single-core Ubuntu ec2 slave into one chunk and run that on Fedora" then you're no longer talking about chunking, because chunks are not permanent (running b-c1 on Fedora and b-c2 and b-c3 on Ubuntu would mean that someone adding just the wrong number of tests that fell in b-c1, pushing the busted ones out to b-c2, would get backed out for making totally unrelated tests fail with nobody even realizing that they failed because they got pushed off the Fedora chunk and into an ec2 chunk); you're talking about "let's file a new separate bug about creating a new browser-chrome suite, b-c-morethanonecore, and putting tests which don't work on the ec2 slaves in that."
On Cedar we're running mochitest-bc tests in 7 configurations:
- Opt unchunked on EC2 slaves (*)
- Opt chunked on Rev3 slaves
- Opt chunked on EC2 slaves

- Debug unchunked on Rev3 slaves (*)
- Debug unchunked on EC2 slaves
- Debug chunked on Rev3 slaves
- Debug chunked on EC2 slaves

https://tbpl.mozilla.org/?tree=Cedar&showall=1&jobname=browser-chrome

(*) configuration used on other branches, e.g. mozilla-inbound

We're seeing two classes of bugs on Cedar:
* Bugs from chunking: bug 943092, bug 937407, bug 933680, bug 932159

* Bugs from moving from rev3 to ec2 slaves: bug 943095

This bug is NOT about migrating debug mochitest-browser-chrome to EC2 slaves. Bug 850101 tracks that effort. It is important for RelOps and RelEng to migrate off of these rev3 machines next year, so any help greening up these bugs would be appreciated!

I believe that all the required tests are running on Cedar to green up the chunked tests (and/or the debug ec2 tests if you can!). If you need test slaves loaned out to reproduce, please let us know! There's nothing else actionable for RelEng here.
Flags: needinfo?(mgervasini)
Depends on: 949027
Depends on: 949448
Blocks: 939036
Depends on: 962598
Depends on: 963075
Depends on: 963193
Depends on: 964358
Depends on: 964365
Depends on: 964369
Depends on: 964374
Depends on: 986452
Depends on: 986458
Depends on: 986463
Depends on: 986467
Depends on: 980746
No longer depends on: 986463
Depends on: 992270
Blocks: 992485
Depends on: 992611
They're not all running in 30-minute chunks, but trunk is now running mochitest-bc in 3 chunks on all platforms, with devtools split into its own test suite. As a bonus, chunk-by-dir was even enabled to make it more consistent which tests run in which chunk. I think this bug is as done as it's going to be. I've filed bug 996240 for trying to get this enabled on Aurora 30 as well.

Plans are also underway to further enhance the current chunking scheme (such as spawning a new browser instance for each directory, bug 992911) so we can better isolate tests from each other and get even finer-grained control over chunking.
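For reference, here is an illustrative sketch of what chunking by directory means (not the actual mochitest implementation): tests are grouped by the first few components of their path, and whole directory groups are assigned to chunks, so a test's chunk shifts less when unrelated tests are added elsewhere.

from collections import OrderedDict

def chunk_by_dir(tests, total_chunks, depth):
    # Group tests by their first `depth` path components, keeping input order.
    groups = OrderedDict()
    for test in tests:
        key = "/".join(test.split("/")[:depth])
        groups.setdefault(key, []).append(test)
    # Assign whole directory groups to consecutive chunks, balancing by count.
    target = max(1, len(tests) // total_chunks)
    chunks, current = [], []
    for group in groups.values():
        current.extend(group)
        if len(current) >= target and len(chunks) < total_chunks - 1:
            chunks.append(current)
            current = []
    chunks.append(current)
    return chunks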
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: General Automation → General