Closed Bug 1164888 Opened 9 years ago Closed 9 years ago

Memory leak in job ingestion

Categories: Tree Management :: Treeherder: Data Ingestion (defect, P2)

Tracking: Not tracked

Status: RESOLVED FIXED

People: Reporter: camd, Assigned: camd

Attachments: 3 files

This was found on the Heroku deployment. We are leaking roughly 250MB at intervals in our buildapi job ingestion: memory grows to a certain point, then the tasks are reloaded and reset. Our suspicion is that the leak is in the ETL layer, in the loading of the pending.js, running.js, or builds-4hr.js files (probably the latter). That said, the leak could be anywhere; profiling is needed, but we have a few leads.
Attached file logs from heroku on growth (deleted) —
Assignee: nobody → cdawson
Status: NEW → ASSIGNED
Component: Treeherder → Treeherder: Data Ingestion
Priority: -- → P2
I'm pretty sure one of the two PR pushes to Heroku caused memory usage on the default worker (queues: default, process_objects, cycle_data, calculate_eta, populate_performance_series, fetch_bugs) to rise (and sawtooth): http://i.snag.gy/XSdXK.jpg

My gut would have said it was the chunking changes [1] (e.g. our needing to switch to a generator so we don't double up when chunking), but the timeline implies this bug [2], so I'm not sure what to think.

[1] https://github.com/mozilla/treeherder/commit/2628850a6f359814b5e417653006242a8e2e5402
[2] https://github.com/mozilla/treeherder/commit/aa9caa5517d17c3d9bf58a6f2072747d45db65ff + https://github.com/mozilla/treeherder/commit/49c8fff98361f09b4805acec19d3c0f89faa8a9a
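(For context, a minimal sketch of the chunking concern mentioned above; this is an illustration, not the code from the linked commits. Building every chunk up front keeps a second full copy of the job list alive, whereas a generator yields one chunk at a time.)

    from itertools import islice

    def chunked_list(items, size):
        # Materialises all chunks at once: the source list plus all chunk
        # lists are alive simultaneously, roughly doubling memory.
        return [items[i:i + size] for i in range(0, len(items), size)]

    def chunked_gen(iterable, size):
        # Yields one chunk at a time; only the current chunk is held in
        # memory beyond the source iterable.
        it = iter(iterable)
        while True:
            chunk = list(islice(it, size))
            if not chunk:
                return
            yield chunk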
Blocks: 1165283
I was playing around with pympler. I don't think I've found anything conclusive so far, though I hadn't run a full pushlog ingestion task (for the initial runs of buildapi ingestion in the log here there were no matching pushes; then I ingested mozilla-central and mozilla-inbound), so perhaps not enough jobs were imported to cause a leak. Either way, if my testing methodology is representative, it seems to rule out a leak in fetching/loading the builds-*.js files. https://emorley.pastebin.mozilla.org/8835142
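(The pastebin output isn't reproduced here; a rough sketch of this kind of pympler check, with a hypothetical stand-in for the real task, looks like the following.)

    from pympler import tracker

    def run_buildapi_ingestion():
        # Hypothetical stand-in for the real ingestion task being exercised.
        pass

    tr = tracker.SummaryTracker()
    for i in range(5):
        run_buildapi_ingestion()
        print("--- pass %d ---" % i)
        tr.print_diff()  # object types/counts allocated since the last call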
Perhaps in the meantime we should (a) switch more things to requests, (b) double-check we've closed all file handles throughout, and (c) update some of our dependencies (e.g. kombu) in case leaks have been fixed upstream.
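(A minimal sketch of (a) and (b): fetch a builds-4hr-style file with requests while making sure the connection is always released. The URL and function name are examples only; this is not the project's actual ETL code.)

    from contextlib import closing
    import requests

    BUILDS_4HR_URL = "https://example.org/builddata/buildjson/builds-4hr.js"  # example URL

    def fetch_builds_4hr(url=BUILDS_4HR_URL):
        # closing() guarantees the response (and its socket) is released
        # even if raise_for_status() or JSON parsing raises.
        with closing(requests.get(url, stream=True, timeout=60)) as resp:
            resp.raise_for_status()
            return resp.json()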
This time I ran a worker in another shell to ingest the pushlogs first. Still nothing conclusive, in my opinion: https://emorley.pastebin.mozilla.org/8835145
Attached file profile_buildapi.py (deleted) —
I'm not sure how correct this is; it was my first attempt to profile the leak. Memory grew a tiny bit over the first two passes, then not at all for the next three passes, so no leak is showing up there.
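(The attached profile_buildapi.py is not reproduced here; a generic sketch of this kind of per-pass memory measurement, with a hypothetical stand-in for the ingestion pass, might look like this.)

    import resource

    def peak_rss_mb():
        # On Linux, ru_maxrss is reported in kilobytes (peak RSS so far).
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

    def run_one_pass():
        # Hypothetical stand-in for one buildapi ingestion pass.
        pass

    baseline = peak_rss_mb()
    for i in range(5):
        run_one_pass()
        current = peak_rss_mb()
        print("pass %d: peak RSS %.1f MB (+%.1f MB)" % (i, current, current - baseline))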
Is enough of the pushlog ingested that the jobs aren't being skipped?
I've done a lot of experimenting and checking and I'm not able to find a smoking gun, but I think I have a way forward with Heroku that should be maintainable. (Thanks mdoglio for working with me on this.)

When I run the buildapi_4hr celery worker locally, it grows up to a point; in Vagrant, for me, that point is ~250MB. After that, it doesn't grow any more. This happens for each concurrent task, so a concurrency of 5 means 5 celery tasks that each grow to that point and hover there. On Heroku, the dyno seems to grow to around 377MB and hover there.

I have not been able to find any "smoking gun" via Pympler or code elimination to explain why it grows that much. However, if we set up our build4hr Heroku dyno as 1X with a concurrency of 1, it will hover below the 512MB limit and we should be good. If we have a backlog, we can just provision extra dynos. I think this is better than spending more time on this, especially since this code may be obsoleted when we move away from Buildbot and toward TaskCluster.
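(The actual change is to the Heroku worker setup; see the attached PR below. Purely as an illustration of the knobs involved, the Celery 3.x-era settings corresponding to this approach would look roughly like the following; the values and the max-tasks-per-child idea are assumptions, not the contents of the PR.)

    # Celery configuration sketch (illustration only)
    CELERYD_CONCURRENCY = 1            # one worker process per dyno, so its ~377MB
                                       # plateau stays under the 512MB dyno limit
    CELERYD_MAX_TASKS_PER_CHILD = 500  # optionally recycle the process periodically
                                       # so any slow growth is reclaimed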
Attached file PR to change concurrency (deleted) —
Attachment #8623801 - Flags: review?(mdoglio)
Attachment #8623801 - Flags: review?(mdoglio) → review+
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/d4c0de62767ab8734955d505374a2696ef5aad4b

Bug 1164888 - Tune Heroku build4hr handling

By taking the concurrency down to 1, we can control how much memory is used and increase throughput by assigning more dynos. This prevents us overflowing the memory allocations on each dyno due to the growth of the build4hr process.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED