Closed Bug 1130303 Opened 10 years ago Closed 6 years ago

Investigate more efficient ways of cycling data & reducing the impact on other transactions

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P3)


Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1176492

People

(Reporter: cbook, Unassigned)

References

Details

Bob pushed https://hg.mozilla.org/integration/mozilla-inbound/rev/9c95e28087ca this morning to inbound, but the push only became visible on Treeherder roughly 20 minutes after the push. Currently checking whether this is a general problem, but maybe this needs to be looked at from the Treeherder side.
Component: Treeherder → Treeherder: Data Ingestion
Priority: -- → P1
Hardware: x86 → All
Thank you for filing this - I'm keen that we file bugs for any issues, even if they are just transient, so we can investigate and see any patterns.

Treeherder link for the push in question: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=9c95e28087ca

The push is visible now, so we've at least backfilled it since.

Some questions (for future missing-push bugs: if you could add these details in comment 0, it will mean we don't lose the chance to investigate if things fix themselves later):
a) Was it just the jobs that were missing, or the whole push? I'm pretty sure the summary and comment 0 imply the latter, but I want to be sure.
b) Did you refresh the Treeherder inbound page to check the UI hadn't just failed to poll correctly?
c) Was TBPL displaying the push at the time?
Flags: needinfo?(cbook)
Nothing obvious error-wise on New Relic for the range in question: https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors?tw[end]=1423220769&tw[start]=1423199169

Looking at both the non-web CPU usage graph and the non-web transaction response time, there's a massive dip in the range in question: https://rpm.newrelic.com/accounts/677903/applications/4180461?tw[end]=1423213761&tw[start]=1423208090

Now it turns out data cycling was running at that time - if you drag-select the dip on the non-web transaction response time graph (to zoom in to it), you see cycle-data in the table of transactions below.

That said, there's no change in the graphs for throughput or response time when looking at just the pushlog transactions: https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?tw[end]=1423213761&tw[start]=1423208090#id=570791481

However, we see a massive spike in the number of fetch-missing-pushlog transactions: https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?tw[end]=1423213761&tw[start]=1423208090#id=724545597

There's also a massive drop in process-objects throughput: https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?tw[end]=1423213761&tw[start]=1423208090#id=570791501

Initial thoughts:
* For some reason we did not ingest the push as part of the normal pushlog ingestion.
* Data cycling blocked the process-objects tasks (amongst others) for 25 minutes.
* Since we weren't running process-objects, the "oops, we've found a job whose revision we don't have yet, let's backfill that revision" feature ("fetch-missing-pushlogs") was delayed until we started processing the incoming jobs (process-objects), explaining why it took 20 minutes to see the push (see the sketch after this comment).

Things we should look into:
1) The data cycling should not block other transactions like this.
2) We shouldn't have missed the push in the first place, though the refactor in bug 1076826 might solve some of these.
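To make the last bullet concrete, here's a minimal sketch of how a revision-based backfill fallback like fetch-missing-pushlogs behaves during job ingestion. This is not Treeherder's actual code; the names (known_revisions, backfill_queue, stored_jobs) are hypothetical stand-ins:

```python
def process_job(job, known_revisions, backfill_queue, stored_jobs):
    """Ingest one job; if its push is missing, queue a targeted backfill."""
    revision = job["revision"]
    if revision not in known_revisions:
        # Normal pushlog polling missed this push, so request a backfill
        # (the "fetch-missing-pushlogs" behaviour). Crucially, this only
        # happens while job ingestion (process-objects) is actually running,
        # so if cycle-data blocks process-objects, the backfill is delayed too.
        backfill_queue.append(revision)
        return False  # the job will be retried once the push is backfilled
    stored_jobs.append(job)
    return True
```

That dependency is why a 25-minute stall in process-objects translated directly into a ~20-minute-late push, even though the fallback itself works.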
Also bug 1129731 comment 2 (swap warning on rabbitmq1 + very high queue lengths) occurred at roughly the same time as this.
There's also bug 1130347 (builds-{4hr,running,pending} being stale at 8am UTC each day), which may have hurt us here too: at best it causes a backlog when they finally catch up, and at worst it delays the time until we hit the missing-pushlog fallback (though we shouldn't need this fallback in general, but that's another matter).
Depends on: 1130355
(In reply to Ed Morley [:edmorley] from comment #1)
> Thank you for filing this - I'm keen that we file bugs for any issues, even
> if they are just transient, so we can investigate and see any patterns.
>
> Treeherder link for the push in question:
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=9c95e28087ca
>
> The push is visible now, so we've at least backfilled it since.
>
> Some questions (for future missing-push bugs: if you could add these details
> in comment 0, it will mean we don't lose the chance to investigate if things
> fix themselves later):
> a) Was it just the jobs that were missing, or the whole push? I'm pretty
> sure the summary and comment 0 imply the latter, but I want to be sure.
> b) Did you refresh the Treeherder inbound page to check the UI hadn't just
> failed to poll correctly?
> c) Was TBPL displaying the push at the time?

Hey Ed, seems we're running into this again, this time with one of my c-n pushes.
a) the whole push
b) oh yeah, refreshed to death :) seriously, reloaded the page a couple of times
c) yeah, TBPL is/was showing this but Treeherder is not
Flags: needinfo?(cbook)
Morphing this to be about making cycle-data have less of an impact on everything else, since it seems as though we're not having any data loss here - it's just a case of getting behind with ingestion whilst cycle-data is running.

At the moment, for every project in the datasource table, we:
* Look for all pushes with a push_timestamp < X (where X is 4 months ago on prod, 45 days ago on stage).
* Chunk the results of that search into chunk_size (default 100 pushes).
* For each chunk:
  ** fetch all revisions associated with the pushes in that chunk;
  ** for each revision found, look up all job_ids associated with that revision;
  ** calculate the query to delete entries in that project's objectstore table corresponding to those job_ids;
  ** calculate the list of queries for deleting rows corresponding to those job_ids from all tables in JOBS_CYCLE_TARGETS;
  ** actually run the objectstore + jobs-tables delete queries
     *** whilst running the deletes, disable foreign key checks (I believe just for that session, not globally, thankfully!);
  ** move on to the next chunk.
(A rough sketch of this loop follows this comment.)

Some thoughts:
* Each push can have 300-500 jobs, so a chunk size of 100 pushes might mean 50,000 jobs, each of which has 3-6 job_artefact rows &, if Talos, many many performance_artefact rows. So the deletes on some tables could be for 500,000 rows, even though we're chunking. We definitely need to reduce chunk size IMO. This would presumably also avoid the "OperationalError: (2006, 'MySQL server has gone away')" exceptions I've seen both when running cycle-data manually and as part of the daily run.
* We could also try increasing the time between each delete cycle (sleep_time; it's currently 2s).
* We could run cycle-data more than once a day to space the deletes out (though making sure all runs were still outside of peak hours).
* We should check we have indices on the relevant columns.

Some thoughts:
1) Reduce the page size of the selects/deletes and increase the time between each cycle, to give other transactions time to complete in between. This would presumably also avoid the "OperationalError: (2006, 'MySQL server has gone away')" exceptions I've seen both when running cycle-data manually and as part of the daily run. The performance_artefact delete queries are taking 300s in some cases: https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?type=other&show_browser=false#id=570791494&app_trace_id=5911680555&tab-detail_5911680555=trace_details
2) Run cycle-data more than once a day to space the deletes out.
3) Check we have indexes on the relevant columns.
4) Longer term I think we should look into other ways of deleting that may be more efficient, e.g. re-working the queries, using something other than the push age as the basis for deleting everything else, or using date-based partitioning and dropping old tables (though I'm struggling to see how the latter would work with our current schema). I'm also wondering if the deletes would be quicker if we used "WHERE job_id < N" rather than "WHERE job_id IN <a several-thousand-long list of IDs>" - surely so long as we don't recycle job_ids (we're using bigint, so we're not going to run out) then we can be cheeky and do the former? Either way, we should do some profiling here.
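Here's a rough sketch of the chunked cycle-data loop described above. This is not the real Treeherder implementation: the `db` object, its methods, and the table names are hypothetical stand-ins for the raw SQL the task actually runs.

```python
import time

def cycle_data(db, cutoff_timestamp, chunk_size=100, sleep_time=2,
               jobs_cycle_targets=("job_artifact", "performance_artifact", "job")):
    """Delete data for all pushes older than cutoff_timestamp, in chunks of pushes."""
    push_ids = db.select_push_ids_older_than(cutoff_timestamp)
    for start in range(0, len(push_ids), chunk_size):
        chunk = push_ids[start:start + chunk_size]
        revisions = db.select_revisions_for_pushes(chunk)
        # 100 pushes x 300-500 jobs each can easily fan out to tens of
        # thousands of job ids in a single "chunk".
        job_ids = db.select_job_ids_for_revisions(revisions)
        db.delete_objectstore_rows(job_ids)
        for table in jobs_cycle_targets:
            # Each of these deletes can touch hundreds of thousands of rows,
            # which is what starves other transactions while it runs.
            db.delete_rows_by_job_id(table, job_ids)
        time.sleep(sleep_time)  # pause between chunks; 2s is arguably far too short
```

The key point the sketch illustrates: chunking is by pushes, not by rows, so the actual delete statements can still be enormous even with a "small" chunk size.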
Component: Treeherder: Data Ingestion → Treeherder
Summary: +20 Minute Delay for pushlog updates → Investigate more efficient ways of cycling data & reducing the impact on other transactions
I've failed at copy-paste-editing again, sigh. What I meant to say:

Some thoughts:
* Each push can have 300-500 jobs, so a chunk size of 100 pushes might mean 50,000 jobs, each of which has 3-6 job_artefact rows &, if Talos, many many performance_artefact rows. So the deletes on some tables could be for 500,000 rows, even though we're chunking. We definitely need to reduce chunk size IMO. This would presumably also avoid the "OperationalError: (2006, 'MySQL server has gone away')" exceptions I've seen both when running cycle-data manually and as part of the daily run. The performance_artefact delete queries are taking 300s in some cases: https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?type=other&show_browser=false#id=570791494&app_trace_id=5911680555&tab-detail_5911680555=trace_details
* We could also try increasing the time between each delete cycle (sleep_time; it's currently 2s).
* We could run cycle-data more than once a day to space the deletes out (though making sure all runs were still outside of peak hours).
* We should check we have indices on the relevant columns.

Longer term, I think we should look into other ways of deleting that may be more efficient, e.g. re-working the queries, using something other than the push age as the basis for deleting everything else, or using date-based partitioning and dropping old tables (though I'm struggling to see how the latter would work with our current schema). I'm also wondering if the deletes would be quicker if we used "WHERE job_id < N" rather than "WHERE job_id IN <a several-thousand-long list of IDs>" - surely so long as we don't recycle job_ids (we're using bigint, so we're not going to run out) then we can be cheeky and do the former? Either way, we should do some profiling here (a schematic comparison of the two query shapes follows this comment).
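For the "WHERE job_id < N" idea, here is a schematic comparison of the two delete shapes. This is illustrative only: table and column names are assumed, the real queries are generated elsewhere, and as noted above, profiling would be needed to confirm which is actually faster.

```python
def delete_by_id_list(table, job_ids):
    # Current shape: an IN list that can contain thousands of ids, which the
    # server has to parse and match against row by row.
    placeholders = ", ".join(["%s"] * len(job_ids))
    sql = "DELETE FROM {} WHERE job_id IN ({})".format(table, placeholders)
    return sql, list(job_ids)

def delete_by_threshold(table, oldest_surviving_job_id):
    # Proposed shape: if job ids are monotonically increasing and never
    # recycled (bigint auto-increment), everything older than the cutoff can
    # be expressed as a single range predicate, which can be satisfied
    # directly from an index on job_id.
    sql = "DELETE FROM {} WHERE job_id < %s".format(table)
    return sql, [oldest_surviving_job_id]
```

The threshold form also keeps the statement itself tiny, instead of shipping a multi-kilobyte id list to the server on every chunk.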
Priority: P1 → P3
Bug 1126943 would have helped reduce the impact of the objectstore parts of the cycle-data task, but we may still need to improve the rest, to prevent things like bug 1165984.
Depends on: 1126943
Blocks: 1165984
Component: Treeherder → Treeherder: Infrastructure

Cycle-data was moved off the default worker in bug 1176492.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → DUPLICATE