Delete pushes older than a year
Categories
(Tree Management :: Treeherder: Infrastructure, task, P3)
Tracking
(Not tracked)
People
(Reporter: sclements, Unassigned)
References
(Blocks 1 open bug)
Details
In bug 1599460, Armen discovered that we don't expire any Push data, so we have data going back to 2015. Two Perf tables - the PerformanceAlertSummary table (doesn't expire data) and PerformanceDatum (expires data older than a year) - have foreign keys on the Push table, so before we delete old data we need to determine whether keeping one year's worth of data (to align with what PerformanceDatum stores) is acceptable.
However, PerformanceDatum is quite large, so it'd also be worth re-evaluating whether we need to store a year's worth of data in Treeherder. It'd also be a good idea to establish a retention policy that we can apply to all Perfherder tables that ingest or generate data.
Dave and Ionut, can you weigh in please?
Updated•5 years ago
Comment 1•5 years ago
(In reply to Sarah Clements [:sclements] - away till Dec 2nd from comment #0)
> It'd also be a good idea to establish a retention policy that we can apply to all Perfherder tables that ingest or generate data.
I agree. Can we start with aligning on 12 months, and then review if we can reduce this further? What would be the impact of expiring data from PerformanceAlertSummary? I imagine this would at least break links to alerts, but are there other ways this could impact users of Perfherder? If we're going to break links to alerts then we should make sure this doesn't result in an unhelpful 404 page.
Can we check access logs to see how often alerts older than a year are being requested?
Comment 2•5 years ago
> I agree. Can we start with aligning on 12 months, and then review if we can reduce this further?
Yes, we can start with that.
> What would be the impact of expiring data from PerformanceAlertSummary? I imagine this would at least break links to alerts, but are there other ways this could impact users of Perfherder? If we're going to break links to alerts then we should make sure this doesn't result in an unhelpful 404 page.
It won't break the page if an alert summary id doesn't exist; a "no alerts to show" message is displayed instead of the data.
> Can we check access logs to see how often alerts older than a year are being requested?
Hrm, I don't think that will be easy to check. We keep three days' worth of logs in Papertrail and the rest are archived by date/hour. A three day window probably wouldn't be that accurate, so I'll need to create a script to parse at least a week or two of archives and grab the relevant ids that more or less correlate with a date range greater than a year (since the queries only include the id, not any sort of time range).
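A minimal sketch of such a log-parsing script. Note that the log line format and the alert summary URL pattern below are assumptions for illustration, not Treeherder's actual archive format:

```python
import re

# Hypothetical access-log lines; the real archived log format may differ.
SAMPLE_LOG = """\
Nov 28 10:01:02 treeherder router: GET /api/performance/alertsummary/23411/ 200
Nov 28 10:02:17 treeherder router: GET /api/performance/alertsummary/1045/ 200
Nov 28 10:03:45 treeherder router: GET /jobs?repo=autoland 200
"""

# Match requests for a specific alert summary; this URL pattern is an assumption.
ALERT_RE = re.compile(r"/api/performance/alertsummary/(\d+)/")

def extract_alert_ids(log_text):
    """Return the set of alert summary ids requested in the log text."""
    return {int(m.group(1)) for m in ALERT_RE.finditer(log_text)}

print(sorted(extract_alert_ids(SAMPLE_LOG)))  # → [1045, 23411]
```

The extracted ids could then be compared against the id range known to be older than a year (see the next comment for how auto-incremented ids make that feasible without timestamps).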
These links that are in bugzilla bugs: are they added as comments by a bot, or manually as part of the performance regression detection? I'm wondering under what circumstances anything older than a year would be needed (maybe Ionut would know?). As a side note, Treeherder stores only four months of data, and there are definitely old links in bugzilla bugs that won't show any data.
Comment 3•5 years ago
(In reply to Sarah Clements [:sclements] from comment #2)
> [...]
> A three day window probably wouldn't be that accurate so I'll need to create a script to parse at least a week or two of archives and grab relevant ids that are more or less correlated with a date range greater than a year (since the queries only include the id, not any sort of time range).
Those ids are auto-incremented, so they roughly preserve the relationship between an id and its creation time. Basically, you could first figure out the min and max ids that bound the specific time interval you need to search.
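That suggestion can be sketched with a binary search over a handful of sampled (creation time, id) pairs. The samples below are made-up values; in practice you'd pull them from the alert summary table:

```python
import bisect
from datetime import datetime

# Hypothetical samples of (creation time, id), sorted by time. Auto-incremented
# ids mean later times map to larger ids, so a few samples bound any interval.
samples = [
    (datetime(2018, 1, 1), 100),
    (datetime(2018, 7, 1), 5200),
    (datetime(2019, 1, 1), 11800),
    (datetime(2019, 7, 1), 19400),
    (datetime(2020, 1, 1), 27500),
]

def id_bounds(start, end):
    """Approximate the (min_id, max_id) range covering [start, end]."""
    times = [t for t, _ in samples]
    lo = bisect.bisect_left(times, start)
    hi = bisect.bisect_right(times, end)
    # Clamp to the available samples, widening by one sample on the low side
    # so the interval start is definitely covered.
    lo = max(lo - 1, 0)
    hi = min(hi, len(samples) - 1)
    return samples[lo][1], samples[hi][1]

print(id_bounds(datetime(2018, 6, 1), datetime(2019, 2, 1)))  # → (100, 19400)
```

Any id extracted from the access logs can then be checked against these bounds to decide whether the requested alert is older than a year.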
> These links that are in bugzilla bugs, are they added as comments by a bot or manually as part of the performance regression detection?
They're added manually, as part of the performance regression detection.
> I'm wondering under what circumstances anything longer than a year would be needed (maybe Ionut would know?).
In my experience, we haven't had cases where we needed data older than a year.
Comment 5•5 years ago
I'm interested in taking this. Kindly let me know how to proceed, as this is quite new for me!
Comment 6•5 years ago
Hi Shubhank,
At the root of this effort is telling Django's ORM to delete all pushes older than a year.
The model is defined here.
Manually, this can be tested like this:
$ ./manage.py shell
>>> from datetime import timedelta
>>> from django.utils import timezone
>>> from treeherder.model.models import Push
>>> Push.objects.filter(time__lt=timezone.now() - timedelta(days=365)).delete()
It is documented here how to ingest lots of pushes. Ingest a few hundred, then try deleting them manually with the steps above. You can try removing pushes older than a day first.
Once you get a grasp of what's happening, you can start looking at modifying the management command and adding a test for the logic:
You can run the management command and tests like this:
$ ./manage.py cycle_data (read the code to learn which arguments it takes)
$ pytest tests/model/test_cycle_data (you can use -k <pattern> to select a subset of tests to run)
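Conceptually, the deletion logic boils down to removing rows older than a cutoff in bounded chunks, so no single delete touches too many rows at once. A pure-Python sketch of that pattern, with `(id, time)` tuples standing in for Push rows (the actual cycle_data implementation in Treeherder's codebase may differ):

```python
from datetime import datetime, timedelta

def cycle_pushes(pushes, cutoff, chunk_size=2):
    """Delete pushes older than `cutoff` in chunks; return how many were deleted.

    `pushes` is a mutable list of (id, time) tuples standing in for Push rows.
    Deleting in small chunks keeps each transaction bounded on a large table.
    """
    deleted = 0
    while True:
        stale = [p for p in pushes if p[1] < cutoff][:chunk_size]
        if not stale:
            break
        for p in stale:  # one bounded chunk per iteration, like a LIMITed DELETE
            pushes.remove(p)
        deleted += len(stale)
    return deleted

now = datetime(2020, 1, 1)
# ids 0..4 with times 400, 300, 200, 100, 0 days before `now`
pushes = [(i, now - timedelta(days=400 - i * 100)) for i in range(5)]
print(cycle_pushes(pushes, cutoff=now - timedelta(days=365)))  # → 1
```

With a one-year cutoff, only the push from 400 days ago is removed; the rest survive, which is the behaviour we'd want from the management command.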
Updated•4 years ago
Comment 7•4 years ago
Ionut, can you confirm that we only need to keep one year's worth of Push data to align with the current performance data retention strategy?
Comment 8•4 years ago
(In reply to Sarah Clements [:sclements] from comment #7)
> Ionut, can you confirm that we only need to keep one year's worth of Push data to align with the current performance data retention strategy?
Yes, I confirm that.
Updated•3 years ago