Bug 1408389 (Closed)
Opened 7 years ago; Closed 7 years ago
When trying to run tests on m3.large (instead of m1.medium) I get many blue jobs in Treeherder
Categories: Taskcluster :: General, defect
Tracking: Not tracked
Status: RESOLVED FIXED
Target Milestone: mozilla58
People: Reporter: jmaher; Assignee: jmaher
Attachments (1 file): patch (deleted), gbrown: review+

Description
https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&fromchange=f8545b82c78b04af34da0e6d895b48153584a3cc&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=usercancel&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable&filter-resultStatus=retry&selectedJob=136795928
I don't know why we get blue jobs; there is no log or other metadata. In order to switch away from m1.medium we need to have mostly green jobs. I assume this is a system-level error at Amazon where the machine gets yanked. I find it odd that it occurs on specific job types, which suggests it is related to the tests being run, although the lack of logs leaves me confused.
Comment 1 • 7 years ago
Are these the same tests that we couldn't get to run on anything but m1.mediums before? If I recall, those were failing (orange), not blue. I think the rough consensus was that they were concurrency-related tests and failed on a multi-CPU instance type (which just about everything but m1.medium is).
If this is the same, let's find and link to that bug for context.
Either way, we should be able to dig up some logging for those instances.
Comment 2 • 7 years ago (Assignee)
These are the same tests we tried to run on m3.large in the past and identified as too flaky or perma-failing. There were 5 test jobs kept on legacy as exceptions; 3 of them are OK to move, but the last 2 test suites are where I get a lot of the blue jobs.
I am doing a quick pass on many of the other failures to hunt down what I see in the logs. If there are other explanations for the blue jobs, that would be good to know as well. I found bug 1281241 (which this blocks) as a reference for previous work done to get off the m1.mediums.
Comment 4 • 7 years ago
Looking at one of the machines, things start crashing (including the worker) because the machine is out of memory:
Oct 13 06:50:01 docker-worker.aws-provisioner.us-east-1e.ami-98a16ee2.m3-large.i-09d72485b13b76e53 docker-worker: Uncaught Exception! Attempting to report to Sentry and crash.
Oct 13 06:50:01 docker-worker.aws-provisioner.us-east-1e.ami-98a16ee2.m3-large.i-09d72485b13b76e53 docker-worker: Error: spawn ENOMEM
Oct 13 06:50:01 docker-worker.aws-provisioner.us-east-1e.ami-98a16ee2.m3-large.i-09d72485b13b76e53 docker-worker: at exports._errnoException (util.js:1026:11)
Oct 13 06:50:01 docker-worker.aws-provisioner.us-east-1e.ami-98a16ee2.m3-large.i-09d72485b13b76e53 docker-worker: at ChildProcess.spawn (internal/child_process.js:313:11)
Oct 13 06:50:01 docker-worker.aws-provisioner.us-east-1e.ami-98a16ee2.m3-large.i-09d72485b13b76e53 docker-worker: at exports.spawn (child_process.js:380:9)
Oct 13 06:50:01 docker-worker.aws-provisioner.us-east-1e.ami-98a16ee2.m3-large.i-09d72485b13b76e53 docker-worker: at Object.exports.execFile (child_process.js:143:15)
Oct 13 06:50:01 docker-worker.aws-provisioner.us-east-1e.ami-98a16ee2.m3-large.i-09d72485b13b76e53 docker-worker: at exports.exec (child_process.js:103:18)
Oct 13 06:50:01 docker-worker.aws-provisioner.us-east-1e.ami-98a16ee2.m3-large.i-09d72485b13b76e53 docker-worker: at Object.check (/home/ubuntu/docker_worker/node_modules/diskspace/diskspace.js:56:3)
Oct 13 06:50:01 docker-worker.aws-provisioner.us-east-1e.ami-98a16ee2.m3-large.i-09d72485b13b76e53 docker-worker: at exports.default (/home/ubuntu/docker_worker/src/lib/stats/host_metrics.js:43:13)
Oct 13 06:50:01 docker-worker.aws-provisioner.us-east-1e.ami-98a16ee2.m3-large.i-09d72485b13b76e53 docker-worker: at ontimeout (timers.js:365:14)
Oct 13 06:50:01 docker-worker.aws-provisioner.us-east-1e.ami-98a16ee2.m3-large.i-09d72485b13b76e53 docker-worker: at tryOnTimeout (timers.js:237:5)
Oct 13 06:50:01 docker-worker.aws-provisioner.us-east-1e.ami-98a16ee2.m3-large.i-09d72485b13b76e53 docker-worker: at Timer.listOnTimeout (timers.js:207:5)
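(As far as I can tell from that trace: docker-worker's periodic host-metrics check shells out via the diskspace module, the spawn fails with ENOMEM once the machine is under memory pressure, the exception goes uncaught, and the worker crashes mid-task, which would explain why the jobs show up blue with no log uploaded.)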
Comment 5 • 7 years ago (Assignee)
This is great info! I need to look at the passing jobs and see what the memory usage is.
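One way to get that data (a hypothetical helper, not something that already exists in the harness): run a small sidecar script on the worker that samples /proc/meminfo every few seconds, and line its output up with the point where a job dies. Roughly:

#!/usr/bin/env python
# Hypothetical helper: periodically sample available memory on the worker
# so the readings can be correlated with where a job falls over.
import time

def available_mb():
    # Parse /proc/meminfo; values are reported in kB. Prefer MemAvailable
    # (present on reasonably recent Linux kernels), fall back to MemFree.
    values = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, rest = line.split(':', 1)
            values[key] = int(rest.strip().split()[0])
    return values.get('MemAvailable', values.get('MemFree', 0)) // 1024

if __name__ == '__main__':
    while True:
        print('available memory: %d MB' % available_mb())
        time.sleep(5)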
Updated • 7 years ago
Flags: needinfo?(dustin)
Comment 6 • 7 years ago (Assignee)
Created attachment 8920556 [details] [diff] [review]
run damp/asan tests on xlarge instead of legacy

We fixed a damp test; now we need to run damp somewhere other than legacy. With the default instance type (m3.large) we run out of memory! For 7.5GB of memory, that isn't good, but thanks to the data in this bug I moved to xlarge and it works great:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d4f6786669723bccabf73c864cf3e9342792d9c6
Comment 7 • 7 years ago
Comment on attachment 8920556 [details] [diff] [review]
run damp/asan tests on xlarge instead of legacy
Review of attachment 8920556 [details] [diff] [review]:
-----------------------------------------------------------------
I suggest clarifying the comment, maybe, "runs out of memory on default/m3.medium"
Attachment #8920556 - Flags: review?(gbrown) → review+
Comment 8 • 7 years ago
s/m3.medium/m3.large/
Comment 9 • 7 years ago
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #6)
> Created attachment 8920556 [details] [diff] [review]
> run damp/asan tests on xlarge instead of legacy
>
> we fixed a damp test, now we need to run damp not on legacy. Doing the
> default instance type (m3.large), we run out of memory! for 7.5GB of
> memory, that isn't good- but thanks to the data in this bug, I moved to
> xlarge and it works great:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=d4f6786669723bccabf73c864cf3e9342792d9c6
Interesting that we run out of memory on m3.large but not m1.medium; m1.medium has half the memory of an m3.large.
Comment 10 • 7 years ago
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/11d443e7b098
run devtools on asan and xlarge. r=gbrown
Comment 11 • 7 years ago (Assignee)
m1.medium is single-core and m3.large is multi-core; I suspect we are chewing up much more memory per process/thread than we would on m1.medium.
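(Rough arithmetic with the numbers in this bug: m3.large has 7.5GB spread over at least two cores, and m1.medium has half of that for a single core, so memory per core is about the same. If the harness runs more processes/threads in parallel on the multi-core instance, the test load roughly doubles while the fixed OS and docker-worker overhead stays constant, which would explain hitting ENOMEM on m3.large but not on m1.medium, and why xlarge gives enough headroom.)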
Comment 12 • 7 years ago
bugherder
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58