Closed Bug 819963 Opened 12 years ago Closed 10 years ago

Split up mochitest-bc on desktop into ~30minute chunks

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ehsan.akhgari, Assigned: massimo)

References

(Depends on 4 open bugs)

Details

(Keywords: sheriffing-P1)

Attachments

(8 files, 4 obsolete files)

These tests on debug builds are taking an hour and a half some of the time. Time to split them in two.
Blocks: 798945
Keywords: sheriffing-P1
browser-chrome is now the long pole in terms of end-to-end times (see bug 798945 comment 4) - the debug runs can take between 55 and 100 mins depending on platform.

Prior to splitting, I would be interested to see whether there are any more clownshoes (of the bug 864085 and bug 865549 ilk) we can fix to lower the runtime.
Depends on: 883314
Component: BrowserTest → Release Engineering: Automation (General)
OS: Mac OS X → All
Product: Testing → mozilla.org
QA Contact: catlee
Hardware: x86 → All
Version: Trunk → other
Attached file win7 debug test times (deleted) —
the slowest tests here are:

20.374 chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser-window/browser_discovery.js
24.814 chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser/browser_discovery.js
26.011 chrome://mochitests/content/browser/browser/base/content/test/newtab/browser_newtab_drag_drop.js
27.768 chrome://mochitests/content/browser/dom/indexedDB/test/browser_quotaPromptDeny.js
31.195 chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser/browser_bug562797.js
35.015 chrome://mochitests/content/browser/toolkit/components/startup/tests/browser/browser_crash_detection.js
41.336 chrome://mochitests/content/browser/browser/components/sessionstore/test/browser_480148.js
46.919 chrome://mochitests/content/browser/browser/devtools/debugger/test/browser_dbg_propertyview-data-big.js
83.641 chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser/browser_bug557956.js

(all times in seconds)
Attached file fedora debug test times (deleted) —
the slowest tests here are:

34.422 chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser-window/browser_discovery.js
35.018 chrome://mochitests/content/browser/toolkit/components/startup/tests/browser/browser_crash_detection.js
40.42 chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser/browser_discovery.js
44.987 chrome://mochitests/content/browser/browser/components/sessionstore/test/browser_480148.js
48.463 chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser/browser_bug562797.js
54.915 chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser/browser_bug557956.js
56.325 chrome://mochitests/content/browser/browser/devtools/debugger/test/browser_dbg_propertyview-data-big.js
Since 10.7 debug is averaging over 110 minutes, and the rewrite that's on the UX tree has added maybe 5 minutes' worth of tests, the runs are intermittently going over 120 minutes and turning red as a result. So this (or a magic clownshoes discovery) probably blocks landing Australis.
Blocks: australis
Whiteboard: [Australis:M8][Australis:P1]
(In reply to Chris AtLee [:catlee] from comment #4)
> Created attachment 763109 [details]
> fedora debug test times
> 
> the slowest tests here are:

Chris, did you check this manually? If not, is the script you're using available somewhere? I'd like to run it on UX log output to check we're not doing anything, err, clownshoes, to quote philor. :-)
Flags: needinfo?(catlee)
I don't think this should block a landing -- we should increase the timeout if we're ready to land but the bc hasn't been split. (And then really, really get it split, since it's obviously been put off in the 6 months since ehsan filed it.)

It's an artificial barrier with no end-user impact, and a big part of the push to get Australis landed ASAP is to get feedback.
Sure, filing and getting someone to fix "kick the timeout back up to 9000 again" would work just as well to get you unblocked, but one or the other does block, since if you turn 10.7 bc pretty much permared you'll bounce, and this bug already existed to serve as a reminder of that.
(In reply to :Gijs Kruitbosch from comment #6)
> (In reply to Chris AtLee [:catlee] from comment #4)
> > Created attachment 763109 [details]
> > fedora debug test times
> > 
> > the slowest tests here are:
> 
> Chris, did you check this manually? If not, is the script you're using
> available somewhere? I'd like to run it on UX log output to check we're not
> doing anything, err, clownshoes, to quote philor. :-)

I did some spot checks, and it seemed to be correct. I'll attach the script here so you can give it a spin yourself.
Flags: needinfo?(catlee)
Attached file test_times.py (deleted) —
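The attachment itself is gone, but based on the "TEST-END | <test> | finished in <N>ms" lines that appear later in this bug and the "<seconds> <test>" lists quoted above, a script along these lines could plausibly reproduce the output. This is a hypothetical reconstruction, not catlee's actual script:

#!/usr/bin/env python
# Hypothetical reconstruction of test_times.py (the real attachment is
# deleted). It pulls per-test durations out of a mochitest log by matching
# "TEST-END | <test> | finished in <N>ms" lines and prints them sorted by
# duration in "<seconds> <test>" form.
import re
import sys

TEST_END = re.compile(r"TEST-END \| (\S+) \| finished in (\d+)ms")

def test_times(log_path):
    times = {}
    with open(log_path) as log:
        for line in log:
            match = TEST_END.search(line)
            if match:
                times[match.group(1)] = int(match.group(2)) / 1000.0
    return times

if __name__ == "__main__":
    for test, secs in sorted(test_times(sys.argv[1]).items(), key=lambda kv: kv[1]):
        print("%.3f %s" % (secs, test))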
(In reply to Chris AtLee [:catlee] from comment #10)
> Created attachment 776386 [details]
> test_times.py

Thanks. I filed bug 894411 about our tests. We seem to be eating about 2 minutes on top of what's on m-c. I suspect we may be able to get that to be under a minute, but if that's not enough and/or I'm wrong, we should just be fixing this, I'm afraid.
Assignee: nobody → mgervasini
See bug 894930. We're hitting the 7200s timeout on OSX 10.7 debug now.
Product: mozilla.org → Release Engineering
Whiteboard: [Australis:M8][Australis:P1] → [Australis:M?][Australis:P1]
Massimo, any update here?
Flags: needinfo?(mgervasini)
Hey Jared,

I have chunking working in the staging environment. I am running tests with 2 chunks: chunk 2 takes about 15 minutes to complete, while chunk 1 takes more than 53 minutes and its output is over 50M, so it gets truncated (tested on Mac).

I am now increasing the number of chunks to 5 to see how it behaves with that setup.
Flags: needinfo?(mgervasini)
Status: NEW → ASSIGNED
Attached image out.png (deleted) —
Here are the results with 5 chunks (tested on Mac):

chunk 1: 11m44s 
chunk 2: 31m40s
chunk 3:  7m30s
chunk 4:  4m11s
chunk 5: 12m23s

There is a test error in the second chunk:

...
03:12:26     INFO -  45096 INFO TEST-START | /tests/dom/browser-element/mochitest/test_browserElement_NoWhitelist.html
03:12:26     INFO -  creating 1!
03:12:26     INFO -  2013-08-29 03:12:26.780 plugin-container[871:903] *** __NSAutoreleaseNoPool(): Object 0x10819bb40 of class NSCFDictionary autoreleased with no pool in place - just leaking
03:12:26     INFO -  2013-08-29 03:12:26.782 plugin-container[871:903] *** __NSAutoreleaseNoPool(): Object 0x11931ae80 of class NSCFArray autoreleased with no pool in place - just leaking
03:12:26     INFO -  [TabChild] SHOW (w,h)= (0, 0)
03:12:26     INFO -  ############################### browserElementPanning.js loaded
03:12:26     INFO -  ######################## BrowserElementChildPreload.js loaded
03:12:26     INFO -  loading about:blank, 1
03:12:26     INFO -  loading http://example.com/tests/dom/browser-element/mochitest/file_empty.html, 1
03:12:26     INFO -  45097 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/browser-element/mochitest/test_browserElement_NoWhitelist.html | Should not send mozbrowserloadstart event.
03:12:26     INFO -  45098 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/browser-element/mochitest/test_browserElement_NoWhitelist.html | Should not send mozbrowserloadstart event.
03:17:49     INFO -  45099 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/browser-element/mochitest/test_browserElement_NoWhitelist.html | Test timed out.
03:17:49     INFO -  args: ['/usr/sbin/screencapture', '-C', '-x', '-t', 'png', '/var/folders/WJ/WJB0YwOpFvuJxQKLn0fqnU+++-k/-Tmp-/mozilla-test-fail_1HPMOr']
03:17:53     INFO -  SCREENSHOT: <...>
03:17:53     INFO -  45100 INFO TEST-END | /tests/dom/browser-element/mochitest/test_browserElement_NoWhitelist.html | finished in 324291ms
...

Any ideas?
(In reply to Massimo Gervasini [:mgerva] from comment #16)

> 03:12:26     INFO -  45097 ERROR TEST-UNEXPECTED-FAIL |
> /tests/dom/browser-element/mochitest/test_browserElement_NoWhitelist.html |
> Should not send mozbrowserloadstart event.
> 03:12:26     INFO -  45098 ERROR TEST-UNEXPECTED-FAIL |
> /tests/dom/browser-element/mochitest/test_browserElement_NoWhitelist.html |
> Should not send mozbrowserloadstart event.
> 03:17:49     INFO -  45099 ERROR TEST-UNEXPECTED-FAIL |
> /tests/dom/browser-element/mochitest/test_browserElement_NoWhitelist.html |
> '/var/folders/WJ/WJB0YwOpFvuJxQKLn0fqnU+++-k/-Tmp-/mozilla-test-fail_1HPMOr']
> 03:17:53     INFO -  SCREENSHOT: <...>


Oh, I forgot to mention: the log says the test failed, but the screenshot says status "Pass".
Olli, you reviewed the patch for bug 710231. Do you know why this test might be failing? Can you help out Massimo here?
Flags: needinfo?(bugs)
I don't know if this helps, but in my staging environment I am also running the same test suite in 7 and 9 chunks, and in both cases there's a failure in the 2nd and 3rd chunks.

Is there some kind of dependency on previous state which was not exposed when we ran the suite as a single unchunked job?
(In reply to Massimo Gervasini [:mgerva] from comment #19)
> Is there some kind of dependency on previous states which was not exposed
> when we had a single test?

That is what I suspect is happening. We can also disable the failing tests and file bugs to get them fixed.
Untracking from Australis since bug 894930 has been fixed.
No longer blocks: australis
Whiteboard: [Australis:M?][Australis:P1]
Attached patch buildbot-configs-819963.patch (obsolete) (deleted) — Splinter Review
Hi Rail,

this patch:
* splits the browser-chrome tests into two chunks, as requested
* removes unittest_suites from mozilla/config.py (it's not used)
* applies pep8 fixes
I am asking for feedback rather than a full review because I have run some experiments in my staging environment (a single OS X 10.6 slave) using different numbers of chunks, and the total time to execute the full suite with 2 or 3 chunks is roughly the same (at least in my environment):

chunks   total time
   2      1:06:52
   3      1:06:43 

more details:

time spent in each chunk:
2 chunks:
  1     52 mins,  2 secs
  2     14 mins, 50 secs

3 chunks:
  1     30 mins, 12 secs
  2     23 mins,  1 secs
  3     13 mins, 30 secs


If we have enough slave capacity, 3 chunks could (in theory) save up to ~20min (best case scenario)

Do you think it's worth splitting this test suite into 3 chunks instead of 2?
(I'll post the mozharness patch as soon as we know the number of chunks.)
Attachment #803572 - Flags: feedback?(rail)
I'd definitely go for 3, if the total time is roughly the same, particularly since the splits seem a lot more even for 3 (and the reduced end-to-end time will be really helpful for sheriffs).
Comment on attachment 803572 [details] [diff] [review]
buildbot-configs-819963.patch

Thanks for poking this!

The patch looks OK to me.

BTW, have you tried running more than 3 chunks? Let's say 10!? :) There may be some bootstrapping overhead when you have too many chunks, so we need to strike a balance between the bootstrap time and the real time spent testing.
Attachment #803572 - Flags: feedback?(rail) → feedback+
Attached patch buildbot-configs-819963.patch (obsolete) (deleted) — Splinter Review
This patch:

* splits the browser-chrome tests into three chunks
* removes unittest_suites from mozilla/config.py (it's not used)
* applies pep8 fixes
Attachment #803572 - Attachment is obsolete: true
Attachment #803784 - Flags: review?(rail)
Attached patch mozharness-819963.patch (obsolete) (deleted) — Splinter Review
... and this is the patch for mozharness
Attachment #803785 - Flags: review?(rail)
Attached file chunking-results.txt (deleted) —
More results - single OS X 10.6 slave

chunks     total time
   2      1:06:52
   3      1:06:43
   5      1:07:28
   9      1:23:36
  27      2:02:10
  51      2:43:05

(more detailed results in attachment)

If we decide to have more than 3 chunks we should rethink the patch. As it is now, a patch for 27 chunks would be ~350 lines long.
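For a rough sense of the bootstrapping overhead rail asked about, derived only from the table above (a back-of-the-envelope estimate, not a measurement):

# Going from 2 chunks (1:06:52) to 51 chunks (2:43:05) adds roughly
# 96 minutes spread over 49 extra chunks on this single slave.
two_chunks = 66 + 52 / 60.0        # 1:06:52 in minutes
fifty_one_chunks = 163 + 5 / 60.0  # 2:43:05 in minutes
per_chunk_overhead = (fifty_one_chunks - two_chunks) / (51 - 2)
print(round(per_chunk_overhead, 1))  # ~2.0 minutes of setup/teardown per extra chunk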
I think 3 sounds like the optimal number for now :-)
Attachment #803784 - Flags: review?(rail) → review+
Comment on attachment 803785 [details] [diff] [review]
mozharness-819963.patch

Review of attachment 803785 [details] [diff] [review]:
-----------------------------------------------------------------

::: configs/unittests/linux_unittest.py
@@ -64,4 @@
>          "plain3": ["--total-chunks=5", "--this-chunk=3", "--chunk-by-dir=4"],
>          "plain4": ["--total-chunks=5", "--this-chunk=4", "--chunk-by-dir=4"],
>          "plain5": ["--total-chunks=5", "--this-chunk=5", "--chunk-by-dir=4"],
> -        "chrome": ["--chrome"],

Remind me, why did you remove "chrome" here?
BTW, maybe it's worth splitting the mozharness patch into 2 parts:

1) the first part adds the new configs without removing the old ones

2) the second part removes the old configs.

That way you wouldn't need to worry about keeping buildbot-configs and mozharness in sync across reconfigs. You could land the first part anytime, merge it to production, land the buildbot-configs patch, reconfig, and only then land the second mozharness patch.
Attached patch mozharness-819963-1.patch (deleted) — Splinter Review
* adds browser-chrome-{1,2,3}
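The patch attachment is deleted; judging from the existing entries quoted in the review above (e.g. "plain3": ["--total-chunks=5", "--this-chunk=3", "--chunk-by-dir=4"]), the new config entries presumably look roughly like this sketch. The exact flags are an assumption:

# Sketch of the likely shape of the browser-chrome-{1,2,3} entries in
# configs/unittests/*_unittest.py -- modelled on the "plainN" entries quoted
# in the review comment above, not copied from the deleted patch.
BROWSER_CHROME_CHUNKS_SKETCH = {
    "browser-chrome-1": ["--browser-chrome", "--total-chunks=3", "--this-chunk=1"],
    "browser-chrome-2": ["--browser-chrome", "--total-chunks=3", "--this-chunk=2"],
    "browser-chrome-3": ["--browser-chrome", "--total-chunks=3", "--this-chunk=3"],
}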
Attachment #803785 - Attachment is obsolete: true
Attachment #803785 - Flags: review?(rail)
Attachment #804473 - Flags: review?(rail)
Attached patch mozharness-819963-2.patch (deleted) — Splinter Review
removes: "browser-chrome": ["--browser-chrome"]
Attachment #804474 - Flags: review?(rail)
Attachment #804473 - Flags: review?(rail) → review+
Attachment #804474 - Flags: review?(rail) → review+
Any idea why this bug stalled?
Hi Jared,

sorry for the delay. I am ready to commit this patch, but there are test failures.

Chunk 2 is fine; the other chunks have some errors. Full logs here:

chunk 1: https://people.mozilla.org/~mgervasini/chunk_1.html
chunk 2: https://people.mozilla.org/~mgervasini/chunk_2.html
chunk 3: https://people.mozilla.org/~mgervasini/chunk_3.html

How do we proceed here?
Flags: needinfo?(jaws)
If you can narrow it down to which tests are causing the errors, then we can disable those tests (assuming it is in the range of 1-10 tests). If it is many more tests that are failing then there may be something more serious going on here.

For chunk 1, the group of tests at /tests/dom/browser-element/mochitest/test_browserElement* appear to be causing issues. Can you try selectively disabling those tests (first disable all, run the chunk, re-enable half, run the chunk, re-enable half of the still disabled, run the chunk, etc, until you can find the culprit or determine that all of the tests are error prone)?

For chunk 3, /tests/toolkit/components/alerts/test/test_alerts.html and /tests/toolkit/components/alerts/test/test_alerts_noobserve.html are the two tests that failed. We could disable these two tests and file a bug to re-enable them, CCing Michael Ventnor (original author of the tests) and :emk (the last person to touch both tests).
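The halving procedure described above is essentially a prefix bisection over the list of suspect tests. A minimal illustrative sketch, assuming the failure reproduces deterministically and a single test is responsible; `chunk_passes` is a hypothetical stand-in for "run the chunk with only this subset of suspects re-enabled and check the result":

def find_culprit(suspects, chunk_passes):
    # Invariant: the chunk passes with suspects[:lo] re-enabled and fails
    # with suspects[:hi] re-enabled, so the culprit index is in [lo, hi).
    lo, hi = 0, len(suspects)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if chunk_passes(suspects[:mid]):
            lo = mid   # culprit is among the later tests
        else:
            hi = mid   # culprit is within the first `mid` tests
    return suspects[lo]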
Flags: needinfo?(bugs)
Flags: needinfo?(jaws)
(In reply to Jared Wein [:jaws] from comment #35)
> If you can narrow it down to which tests are causing the errors, then we can
> disable those tests (assuming it is in the range of 1-10 tests). If it is
> many more tests that are failing then there may be something more serious
> going on here.
> 
> For chunk 1, the group of tests at
> /tests/dom/browser-element/mochitest/test_browserElement* appear to be
> causing issues. Can you try selectively disabling those tests (first disable
> all, run the chunk, re-enable half, run the chunk, re-enable half of the
> still disabled, run the chunk, etc, until you can find the culprit or
> determine that all of the tests are error prone)?
> 
> For chunk 3, /tests/toolkit/components/alerts/test/test_alerts.html and
> /tests/toolkit/components/alerts/test/test_alerts_noobserve.html are the two
> tests that failed. We could disable these two tests and file a bug to
> re-enable them, CCing Michael Ventnor (original author of the tests) and
> :emk (the last person to touch both tests).

Massimo, did you run these on your own machine? Is there a reason you can't do a try push? AFAICT e.g. the test_alerts tests can depend on machine configuration (see e.g. https://bugzilla.mozilla.org/show_bug.cgi?id=838203 ). That will also get us information about the other platforms, ensuring we don't run into surprises there.
Hi Gijs,

The tests ran in the staging environment: http://dev-master01.build.scl1.mozilla.com:8920 (requires VPN access). It's probably better to ask catlee or rail about the try push. What do you think, guys?
Flags: needinfo?(rail)
Flags: needinfo?(catlee)
We can't change the chunking for individual try pushes, which is why we need to run it on dev-masters.
Flags: needinfo?(catlee)
I think catlee answered the question.
Flags: needinfo?(rail)
Whether the slaves running the tests "on dev-masters" are the same as the slaves that run them in production is the relevant question. I assume the answer is yes?
Blocks: 932159
Hi all,

I think the best way to proceed here is to apply the first mozharness patch so the new browser-chrome-{1,2,3} are ready to be executed.
Once the mozharness patch is applied, we can enable the new chunked tests only on cedar (requires a new patch) so both browser-chrome and its chunked version run in parallel. This would make it easier to look at the failed tests and would simplify debugging.

In parallel, I am starting to extend the dev-master tests to linux and windows.

Does that sound right to you?
(In reply to Massimo Gervasini [:mgerva] from comment #41)
> Hi all,
> 
> I think the best way to proceed here is to apply the first mozharness patch
> so the new browser-chrome-{1,2,3} are ready to be executed.
> Once the mozharness patch is applied, we can enable the new chunked tests
> only on cedar (requires a new patch) so both the browser-chrome and its
> chunked version run in parallel. This would make easier to access the failed
> tests and make debugging easier. 
> 
> In parallel, I am starting to extend the dev-master tests to linux and
> windows.
> 
> Does it sound right for you?

This sounds great, what is the ETA for landing the first mozharness patch?
Summary: Split up mochitest-bc on desktop into two chunks → Split up mochitest-bc on desktop into three chunks
Summary of offline emails:

1) mgerva: when you get your changes landed on cedar, can you need-info jgriffin - he's happy to lead fixing/disabling any tests as needed on cedar. Once the chunked-up-mochitest-bc suites are green on cedar, we can rollout this change to other branches.

2) Currently the mochitest-browser-chrome test suite takes ~2hrs to run... by far the longest of any test suite. Details in http://oduinn.com/blog/2013/10/27/better-display-for-compute-hours-per-checkin/. This is bad for RelEng scheduling, and also bad for sheriffs+devs whenever we have to retrigger for intermittent test failures. Ideally, we'd like all test suites to be ~30mins, as that is a good tradeoff for fast turnaround+scheduling and a good balance of setup/teardown overhead vs. real work. For this specific case, it looks like it would mean splitting the 2-hour mochitest-browser-chrome suite into 4? test suites, each ~30mins in duration. Tweaking summary to match.
Flags: needinfo?(mgervasini)
Summary: Split up mochitest-bc on desktop into three chunks → Split up mochitest-bc on desktop into ~30minute chunks
It doesn't take ~2hrs to run, it takes between 45 minutes and an unknowable time over 150 minutes since that's when buildbot times it out. With around 5 minutes of setup, unless we take the unprecedented option of having different numbers of chunks for opt and debug and different numbers of chunks for each OS, you have choices like "25-77 minutes" or "18-53 minutes" or "15-41 minutes" with the amount of wasted machine time (and thus the try backlog time) increasing with each decrease. Four chunks across 17 platforms would be an added 255 minutes of machine time per push, not counting the time it takes between runs to get a slave started on a new job.
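The 255-minute figure is just the per-chunk setup cost multiplied out; as a quick sanity check using the numbers in the comment above:

# 4 chunks instead of 1 means 3 extra jobs per platform, each paying
# ~5 minutes of setup, across 17 platforms.
extra_jobs_per_platform = 4 - 1
setup_minutes_per_job = 5
platforms = 17
print(extra_jobs_per_platform * setup_minutes_per_job * platforms)  # 255 minutes per push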
(In reply to John O'Duinn [:joduinn] from comment #43)
> Summary of offline emails:
> 
> 1) mgerva: when you get your changes landed on cedar, can you need-info
> jgriffin - he's happy to lead fixing/disabling any tests as needed on cedar.
> Once the chunked-up-mochitest-bc suites are green on cedar, we can rollout
> this change to other branches.

Thanks John!
I am creating another patch for buildbot-configs to enable bc chunking only on cedar. It's taking longer than expected, but I am now working on this issue 100% of my time.


> 2) Currently the mochitest-browser-chrome test suite takes ~2hrs to run...
> by far the longest of any test suites. Details in
> http://oduinn.com/blog/2013/10/27/better-display-for-compute-hours-per-
> checkin/. This is bad for RelEng scheduling, and also bad for sheriffs+devs
> whenever we have to retrigger for intermittent test failures. Ideally, we'd
> like all test suites to be ~30mins, as that was good tradeoff on fast
> turnaround+scheduling, and a good balance of setup/teardown overhead vs
> real-work. For this specific case, it looks like it would mean splitting the
> 2hour mochitest-browser-chrome suite into 4? test suites, each ~30mins
> duration. Tweaking summary to match.

In theory that's right, but from the tests I've executed in the staging environment, the time to execute a single chunk does not scale linearly - for example, switching from 2 to 3 chunks mostly affects the first chunk (see comment #22). I think that's because not all tests have the same execution time, so the end-to-end execution time of a chunk heavily depends on how many long-running tests land in it.
My opinion is that 3 chunks is a good number to start with; once we solve the issue of the failing tests, it should be relatively easy to run the suite with a different chunking and find the optimal value (I am trying to make it easy to configure).
Flags: needinfo?(mgervasini)
(In reply to Phil Ringnalda (:philor) from comment #44)
> It doesn't take ~2hrs to run, it takes between 45 minutes and an unknowable
> time over 150 minutes since that's when buildbot times it out. With around 5
> minutes of setup, unless we take the unprecedented option of having
> different numbers of chunks for opt and debug and different numbers of
> chunks for each OS, you have choices like "25-77 minutes" or "18-53 minutes"
> or "15-41 minutes" with the amount of wasted machine time (and thus the try
> backlog time) increasing with each decrease. Four chunks across 17 platforms
> would be an added 255 minutes of machine time per push, not counting the
> time it takes between runs to get a slave started on a new job.

On a current m-c run with no oranges, it takes between 31 and 133 minutes, depending on platform...debug OSX is the slowest.

Chunking definitely will increase machine time per push, although turnaround time on a given push may come down, depending on wait times.

We could consider applying the chunking only to debug runs (the opt runs are generally acceptable), but I don't know if that would confuse people.
Hi Rail,

this patch enables mochitest-browser-chrome-{1,2,3} on cedar (keeping the old mochitest-chrome active for reference). Other branches are not affected by this change.

This patch also removes chunked tests from BUILDBOT_UNITTEST_SUITES (used in prod) and re-enables browser-chrome in get_ubuntu_unittests(), but I am not totally sure that's really needed here.
Attachment #803784 - Attachment is obsolete: true
Attachment #828174 - Flags: review?(rail)
Attachment #804473 - Flags: checked-in+
Comment on attachment 828174 [details] [diff] [review]
enabling chunked mochitest-bc on cedar + pep8 fixes

I'm not sure if you wanted to add "opt test mochitest-browser-chrome" to other branches; see the builder diff: https://gist.github.com/rail/7345190

r- for now, let me know if that was intentional.
Attachment #828174 - Flags: review?(rail) → review-
Hi Rail,

you're totally right! This patch should fix the problem.
Attachment #828174 - Attachment is obsolete: true
Attachment #828647 - Flags: review?(rail)
Attachment #828647 - Flags: review?(rail) → review+
Attachment #828647 - Flags: checked-in+
in production
So, there are really two things going on here simultaneously:

1) mochitest-browser-chrome is being moved from Fedora slaves to EC2 instances on linux32/linux64 debug
2) mochitest-browser-chrome is being split into chunks for all platforms

1) by itself introduces a bunch of oranges on cedar; we'll have to address both 1) and 2) to get things green.
Depends on: 936512
Depends on: 936518
Depends on: 937380
Depends on: 937407
Depends on: 938426
There aren't too many problems that need to be resolved; the primary ones are bug 937380, bug 937407, and bug 938426.  Is there consensus that we should disable these in order to move forward?
Depends on: 933680
(In reply to Jonathan Griffin (:jgriffin) from comment #51)
> So, there are really two things going on here simultaneously:
> 
> 1) mochitest-browser-chrome is being moved from Fedora slaves to EC2
> instances on linu32/linux64 debug
> 2) mochitest-browser-chrome is being split into chunks for all platforms
> 
> 1) by itself introduces a bunch of oranges on cedar; we'll have to address
> both 1) and 2) to get things green.

So there is at least one tricky failure related to 1)...see bug 933680.  We could roll out the chunking faster if the chunking was done on Fedora slaves, and worry about moving to EC2 as a separate project.
(In reply to Jonathan Griffin (:jgriffin) from comment #52)
> There aren't too many problems that need to be resolved; the primary ones
> are bug 937380, bug 937407, and bug 938426.  Is there consensus that we
> should disable these in order to move forward?

We should investigate the failures before resorting to disabling.

Mark/Drew, can you guys look into these (and others in the dependency list) and chase down any easy fixes?
(In reply to Jonathan Griffin (:jgriffin) from comment #53)
> So there is at least one tricky failure related to 1)...see bug 933680.  We
> could roll out the chunking faster if the chunking was done on Fedora
> slaves, and worry about moving to EC2 as a separate project.

That would certainly simplify things. Is it feasible?
I don't think the two things are necessarily tied.  That said, chunking adds some additional setup/teardown time which will put additional load on hardware slaves, which will end up increasing their wait times.  This is probably most applicable to OSX slaves, which seem to be the hardest hit these days.
(In reply to :Gavin Sharp (email gavin@gavinsharp.com) from comment #55)
> (In reply to Jonathan Griffin (:jgriffin) from comment #53)
> > So there is at least one tricky failure related to 1)...see bug 933680.  We
> > could roll out the chunking faster if the chunking was done on Fedora
> > slaves, and worry about moving to EC2 as a separate project.
> 
> That would certainly simplify things. Is it feasible?

Not really, sadly. IT is merging the SCL1 colo with SCL3, and the old minis that are the Fedora slaves are being deprecated as part of that move, hence the push to run these in EC2. The timeline for EOL'ing the Fedora slaves is probably sometime in January as I understand things; NI'ing Amy Rich to get clarification on that timeline.
Flags: needinfo?(arich)
The goal was to decommission all of the fedora machines by end of FY2013, but we can slip that into January of 2014 since we are closing the company for the last two weeks of the year.
Flags: needinfo?(arich)
Actually nothing to do with this bug (none of this is, since we aren't seeing chunked ec2 failures that aren't also unchunked ec2 failures), but...

Has that been translated into bugs on *all* the things that currently run on Fedora slaves, like say every single Linux test on b2g18 since b2g18 is planning on existing until March, with the appropriate "this thing which nobody wanted to do for years now has to be done in six weeks or less" severity?
Depends on: 943092
Depends on: 943095
No longer blocks: 916194
(In reply to Amy Rich [:arich] [:arr] from comment #58)
> The goal was to decommission all of the fedora machines by end of FY2013,
> but we can slip that into January of 2014 since we are closing the company
> for the last two weeks of the year.

Here's me trying to weigh IT's need to decomm these troublesome machines against developers' need to have tests running *somewhere*.

To be clear, if there's a business need to support the fedora machines for longer, we will find a way to make that happen. That shouldn't be taken as a pass for developers to avoid fixing issues like bug 943095. The increased capacity of EC2 should be enough incentive there.

Massimo: how much effort to get chunked tests for cedar running on r3 Fedora slaves in the short-term, with the aim of swapping to EC2 ASAP?
Flags: needinfo?(mgervasini)
It shouldn't be too difficult; I am preparing a patch to enable chunked mochitest-bc for Fedora on cedar.
Flags: needinfo?(mgervasini)
(In reply to Massimo Gervasini [:mgerva] from comment #61)
> It shouldn't be too difficult, I am preparing a patch to enable chunked
> mochitest-bc for fedora on cedar.

Thanks, Massimo.
Hi all,

no need for a patch; tests are already running on r3 Fedora slaves: 

https://tbpl.mozilla.org/php/getParsedLog.php?id=31237630&tree=Cedar
(In reply to Massimo Gervasini [:mgerva] from comment #63)
> Hi all,
> 
> no need for a patch; tests are already running on r3 Fedora slaves: 
> 
> https://tbpl.mozilla.org/php/getParsedLog.php?id=31237630&tree=Cedar

Hey Massimo!

AIUI the request was for us to CHUNK mochitest-bc on the Fedora slaves, not just enable it on Fedora slaves. The reasoning behind this request is that chunking there gives us an easier migration target: "this chunk works on Fedora AND Ubuntu hosts, so we can switch this chunk off on Fedora and on on Ubuntu", etc.

If your reading of this bug is different, of course we should coordinate between releng/ateam to figure out the appropriate course of action.
Flags: needinfo?(mgervasini)
That is a chunk of bc (chunk #2) running on Fedora.

But if the Fedora thing has become "let's move all the tests that don't work on a single-core Ubuntu ec2 slave into one chunk and run that on Fedora" then you're no longer talking about chunking, because chunks are not permanent (running b-c1 on Fedora and b-c2 and b-c3 on Ubuntu would mean that someone adding just the wrong number of tests that fell in b-c1, pushing the busted ones out to b-c2, would get backed out for making totally unrelated tests fail with nobody even realizing that they failed because they got pushed off the Fedora chunk and into an ec2 chunk); you're talking about "let's file a new separate bug about creating a new browser-chrome suite, b-c-morethanonecore, and putting tests which don't work on the ec2 slaves in that."
On Cedar we're running mochitest-bc tests in 7 configurations:
- Opt unchunked on EC2 slaves (*)
- Opt chunked on Rev3 slaves
- Opt chunked on EC2 slaves

- Debug unchunked on Rev3 slaves (*)
- Debug unchunked on EC2 slaves
- Debug chunked on Rev3 slaves
- Debug chunked on EC2 slaves

https://tbpl.mozilla.org/?tree=Cedar&showall=1&jobname=browser-chrome

(*) configuration used on other branches, e.g. mozilla-inbound

We're seeing two classes of bugs on Cedar:
* Bugs from chunking: bug 943092, bug 937407, bug 933680, bug 932159

* Bugs from moving from rev3 to ec2 slaves: bug 943095

This bug is NOT about migrating debug mochitest-browser-chrome to EC2 slaves. Bug 850101 tracks that effort. It is important for RelOps and RelEng to migrate off of these rev3 machines next year, so any help greening up these bugs would be appreciated!

I believe that all the required tests are running on Cedar to green up the chunked tests (and/or the debug ec2 tests if you can!). If you need test slaves loaned out to reproduce, please let us know! There's nothing else actionable for RelEng here.
Flags: needinfo?(mgervasini)
Depends on: 949027
Depends on: 949448
Blocks: 939036
Depends on: 962598
Depends on: 963075
Depends on: 963193
Depends on: 964358
Depends on: 964365
Depends on: 964369
Depends on: 964374
Depends on: 986452
Depends on: 986458
Depends on: 986463
Depends on: 986467
Depends on: 980746
No longer depends on: 986463
Depends on: 992270
Blocks: 992485
Depends on: 992611
They're not all running in 30-minute chunks, but trunk is now running mochitest-bc in 3 chunks on all platforms, with devtools split into its own test suite. As a bonus, chunk-by-dir was even enabled to make it more consistent which tests run in which chunk. I think this bug is as done as it's going to be. I've filed bug 996240 for trying to get this enabled on Aurora 30 as well.

Plans are also underway to further enhance the current chunking scheme (such as spawning a new browser instance for each directory, bug 992911) so we can better isolate tests from each other and get even finer-grained control over chunking.
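For reference, here is an illustrative sketch of what chunking by directory means (not the actual mochitest implementation): tests are grouped by the first few components of their path, and whole directory groups are assigned to chunks, so a test's chunk shifts less when unrelated tests are added elsewhere.

from collections import OrderedDict

def chunk_by_dir(tests, total_chunks, depth):
    # Group tests by their first `depth` path components, keeping input order.
    groups = OrderedDict()
    for test in tests:
        key = "/".join(test.split("/")[:depth])
        groups.setdefault(key, []).append(test)
    # Assign whole directory groups to consecutive chunks, balancing by count.
    target = max(1, len(tests) // total_chunks)
    chunks, current = [], []
    for group in groups.values():
        current.extend(group)
        if len(current) >= target and len(chunks) < total_chunks - 1:
            chunks.append(current)
            current = []
    chunks.append(current)
    return chunks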
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: General Automation → General