Closed Bug 1164888 Opened 9 years ago Closed 9 years ago

Memory leak in job ingestion

Categories: Tree Management :: Treeherder: Data Ingestion (defect, P2)

Tracking: Not tracked

Status: RESOLVED FIXED

People: Reporter: camd, Assigned: camd

Attachments: 3 files

This was found on the Heroku deployment. We are leaking roughly 250MB at intervals in our buildapi job ingestion: memory grows to a certain point, then the tasks are reloaded and reset. Our suspicion is that the leak is in the ETL layer, in the loading of the pending.js, running.js, or builds-4hr.js files (probably the latter). That said, the leak could be anywhere; profiling is needed, but we have a few leads.
Attached file logs from heroku on growth (deleted) —
Assignee: nobody → cdawson
Status: NEW → ASSIGNED
Component: Treeherder → Treeherder: Data Ingestion
Priority: -- → P2
I'm pretty sure one of the two PR pushes to Heroku caused memory usage on the default worker (queues: default, process_objects, cycle_data, calculate_eta, populate_performance_series, fetch_bugs) to rise (and sawtooth): http://i.snag.gy/XSdXK.jpg

My gut would have said it was the chunking changes [1] (e.g. our needing to switch to a generator so we don't double up when chunking), but the timeline implies this bug [2], so I'm not sure what to think.

[1] https://github.com/mozilla/treeherder/commit/2628850a6f359814b5e417653006242a8e2e5402
[2] https://github.com/mozilla/treeherder/commit/aa9caa5517d17c3d9bf58a6f2072747d45db65ff + https://github.com/mozilla/treeherder/commit/49c8fff98361f09b4805acec19d3c0f89faa8a9a
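(For context, a minimal sketch of the chunking concern mentioned above; this is an illustration, not the code from the linked commits. Building every chunk up front keeps a second full copy of the job list alive, whereas a generator yields one chunk at a time.)

    from itertools import islice

    def chunked_list(items, size):
        # Materialises all chunks at once: the source list plus all chunk
        # lists are alive simultaneously, roughly doubling memory.
        return [items[i:i + size] for i in range(0, len(items), size)]

    def chunked_gen(iterable, size):
        # Yields one chunk at a time; only the current chunk is held in
        # memory beyond the source iterable.
        it = iter(iterable)
        while True:
            chunk = list(islice(it, size))
            if not chunk:
                return
            yield chunk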
Blocks: 1165283
I was playing around with pympler. I don't think I've found anything conclusive so far, though I hadn't run a full pushlog ingestion task (for the initial runs of buildapi ingestion in the log here there were no matching pushes; then I ingested mozilla-central and mozilla-inbound), so perhaps not enough jobs were imported to cause a leak. Either way, if my testing methodology is representative, it seems to rule out a leak in fetching/loading the builds-*.js files. https://emorley.pastebin.mozilla.org/8835142
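(The pastebin output isn't reproduced here; a rough sketch of this kind of pympler check, with a hypothetical stand-in for the real task, looks like the following.)

    from pympler import tracker

    def run_buildapi_ingestion():
        # Hypothetical stand-in for the real ingestion task being exercised.
        pass

    tr = tracker.SummaryTracker()
    for i in range(5):
        run_buildapi_ingestion()
        print("--- pass %d ---" % i)
        tr.print_diff()  # object types/counts allocated since the last call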
Perhaps in the meantime we should (a) switch more things to requests, (b) double-check we've closed all file handles throughout, and (c) update some of our dependencies (e.g. kombu) in case leaks have been fixed upstream.
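(A minimal sketch of (a) and (b): fetch a builds-4hr-style file with requests while making sure the connection is always released. The URL and function name are examples only; this is not the project's actual ETL code.)

    from contextlib import closing
    import requests

    BUILDS_4HR_URL = "https://example.org/builddata/buildjson/builds-4hr.js"  # example URL

    def fetch_builds_4hr(url=BUILDS_4HR_URL):
        # closing() guarantees the response (and its socket) is released
        # even if raise_for_status() or JSON parsing raises.
        with closing(requests.get(url, stream=True, timeout=60)) as resp:
            resp.raise_for_status()
            return resp.json()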
This time I ran a worker in another shell to ingest the pushlogs first. Still nothing conclusive, in my opinion: https://emorley.pastebin.mozilla.org/8835145
Attached file profile_buildapi.py (deleted) —
I'm not sure how correct this is; it was my first attempt to profile the leak. Memory grew a tiny bit over the first two passes, then not at all for the next three passes, so no leak is showing up there.
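(The attached profile_buildapi.py is not reproduced here; a generic sketch of this kind of per-pass memory measurement, with a hypothetical stand-in for the ingestion pass, might look like this.)

    import resource

    def peak_rss_mb():
        # On Linux, ru_maxrss is reported in kilobytes (peak RSS so far).
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

    def run_one_pass():
        # Hypothetical stand-in for one buildapi ingestion pass.
        pass

    baseline = peak_rss_mb()
    for i in range(5):
        run_one_pass()
        current = peak_rss_mb()
        print("pass %d: peak RSS %.1f MB (+%.1f MB)" % (i, current, current - baseline))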
Is enough of the pushlog ingested that the jobs aren't being skipped?
I've done a lot of experimenting and checking and I'm not able to find a smoking gun, but I think I have a way forward with Heroku that should be maintainable. (Thanks mdoglio for working with me on this.)

When I run the buildapi_4hr celery worker locally, it grows up to a point; in Vagrant, for me, that point is ~250MB. After that, it doesn't grow any more. This happens for each concurrent task, so a concurrency of 5 means 5 celery tasks that each grow to that point and hover there. On Heroku, the dyno seems to grow to around 377MB and hover there.

I have not been able to find any "smoking gun" via Pympler or code elimination to explain why it grows that much. However, if we set up our build4hr Heroku dyno as 1X with a concurrency of 1, it will hover below the 512MB limit and we should be good. If we have a backlog, we can just provision extra dynos. I think this is better than spending more time on this, especially since this code may be obsoleted when we move away from Buildbot and toward TaskCluster.
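(The actual change is to the Heroku worker setup; see the attached PR below. Purely as an illustration of the knobs involved, the Celery 3.x-era settings corresponding to this approach would look roughly like the following; the values and the max-tasks-per-child idea are assumptions, not the contents of the PR.)

    # Celery configuration sketch (illustration only)
    CELERYD_CONCURRENCY = 1            # one worker process per dyno, so its ~377MB
                                       # plateau stays under the 512MB dyno limit
    CELERYD_MAX_TASKS_PER_CHILD = 500  # optionally recycle the process periodically
                                       # so any slow growth is reclaimed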
Attached file PR to change concurrency (deleted) —
Attachment #8623801 - Flags: review?(mdoglio)
Attachment #8623801 - Flags: review?(mdoglio) → review+
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/d4c0de62767ab8734955d505374a2696ef5aad4b

Bug 1164888 - Tune Heroku build4hr handling

By taking the concurrency down to 1, we can control how much memory is used and increase throughput by assigning more dynos. This prevents us overflowing the memory allocations on each dyno due to the growth of the build4hr process.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED