Investigate if expired artifacts are correctly removed
Categories
(Taskcluster :: Operations and Service Requests, task)
Tracking
(Not tracked)
People
(Reporter: marco, Assigned: yarik)
References
Details
Attachments
(1 file)
(deleted), text/csv
target.crashreporter-symbols-full.tar.zst was made to expire more quickly as part of bug 1790453, and looking at a recent task we see that the expiration is indeed shorter (on https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/KHq8rOLsSBmGjRh84vOVQw/artifacts for this file it is 2023-04-18T19:13:48.598Z instead of the normal 2024-04-03T19:13:48.598Z for the other artifacts).
Unfortunately if we compare the data from June 2022 (https://console.cloud.google.com/bigquery?sq=586736774393:03952067b9d84e8e86ffb78792114d52) to March 2023 (https://console.cloud.google.com/bigquery?sq=513375772973:cc3c9c69c4774b739aa182e77e49ef64), the total size of these artifacts seems to have increased.
Is it possible we are not correctly removing expired artifacts?
Assignee
Comment 1•2 years ago
Looking at this dashboard: https://earthangel-b40313e5.influxcloud.net/d/sowO93E7k/taskcluster-artifact-storage?orgId=1&from=now-90d&to=now we can see that the expireArtifacts job is running daily, but the total number of artifacts keeps growing.
There is a chance that this background task does not remove all of the artifacts during its run, and they keep piling up.
Looking at service logs:
resource.type="k8s_container"
resource.labels.project_id="moz-fx-taskcluster-prod-4b87"
resource.labels.cluster_name="taskcluster-firefoxcitc-v1"
resource.labels.namespace_name="firefoxcitc-taskcluster"
jsonPayload.Type="monitor.periodic"
jsonPayload.Fields.name="expire-artifacts"
in the last 30 days there is not a single record, which means the expire-artifacts background job never finishes.
Assignee
Comment 2•2 years ago
I don't have access to BigQuery, and I am not sure if it is possible to see what expiry dates those artifacts have, or to count how many of them should already have expired and been removed.
Assignee
Comment 3•2 years ago
So I think we might have a bigger issue.
I managed to export artifacts by expiry date from the db directly (it took about 30 min to aggregate):
select date_trunc('day', expires) as expires_day, count(*)
from queue_artifacts group by date_trunc('day', expires);
I also exported it to Google Docs.
It shows that roughly 102,000,000 artifacts should already have expired and been deleted by today, and that about 180,000,000 more will expire during the rest of 2023.
Based on the Grafana S3 delete-request graphs, we are doing somewhere around 600,000-900,000 delete requests per day, so clearing the existing backlog alone would take roughly four to six months at that rate.
We are quite behind schedule and should consider other approaches to expiration.
Assignee
Comment 4•2 years ago
Looking at the SQL queries and their timings, it looks like we only manage to run about 9,000 queries a day, each returning 100 records, which matches the average of 900,000 deleteObject requests a day (9,000 × 100). So we are likely limited by our db rather than by AWS.
Before we try something radical, I would first try increasing the page size and using s3.deleteObjects instead, which allows removing multiple objects (up to 1,000) in one call.
I will also add more logging to expire-artifacts to better track its progress.
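A minimal sketch of what such a batched delete could look like, using the AWS SDK v3 S3 client; this is an illustration rather than the actual Taskcluster code, and the region, bucket name, and key list are placeholders:

import { S3Client, DeleteObjectsCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-west-2" }); // placeholder region

// Remove artifact objects in batches of up to 1000 keys per request,
// instead of issuing one deleteObject call per key.
async function deleteArtifactObjects(bucket: string, keys: string[]): Promise<void> {
  for (let i = 0; i < keys.length; i += 1000) {
    const batch = keys.slice(i, i + 1000);
    await s3.send(new DeleteObjectsCommand({
      Bucket: bucket,
      Delete: {
        Objects: batch.map((Key) => ({ Key })),
        Quiet: true, // only failed deletions are reported in the response
      },
    }));
  }
}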
Assignee
Comment 5•2 years ago
I found a way to query the database more efficiently for our use case.
The existing expire-artifacts job was using the get_queue_artifacts_paginated function, which was originally intended for fetching the artifacts of a single task but was later (probably) extended to also find the expired ones. It queries 4 columns, three of which are in the primary index, but expires is not.
If we simply do:
select * from queue_artifacts
where expires < '2023-01-01T00:00:00Z'
limit 1000;
then Postgres uses a single sequential scan and returns the first matching rows very quickly: 1000 rows within 200-300 ms.
This works well for queries like this one, where expires is in the past and the matching rows are likely to be found near the beginning of the scanned data. It would be less efficient for finding records far in the future, but we don't need that at the moment.
Planned changes:
- add this query to the expire-artifacts job
- increase the page size from 100 to 1000
- add logging to see how many records are being processed
After we deploy these changes we will be able to tell how much time is needed to catch up with the removals, and whether we still need to adjust the page size or the query itself.
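For illustration, the resulting loop might look roughly like the sketch below. This is a hypothetical TypeScript outline rather than the actual Taskcluster implementation; the ExpiredArtifact shape and the three helper callbacks (fetchExpired, deleteObjects, deleteRows) are assumptions standing in for the real db and S3 code:

// Hypothetical sketch of the updated expire-artifacts loop, not the real implementation.
interface ExpiredArtifact {
  taskId: string;
  name: string;
  storageKey: string;
}

type FetchExpired = (before: Date, limit: number) => Promise<ExpiredArtifact[]>;
type DeleteObjects = (keys: string[]) => Promise<void>;
type DeleteRows = (rows: ExpiredArtifact[]) => Promise<void>;

async function expireArtifacts(
  now: Date,
  fetchExpired: FetchExpired,   // wraps: select * from queue_artifacts where expires < $1 limit $2
  deleteObjects: DeleteObjects, // e.g. the batched s3.deleteObjects helper sketched above
  deleteRows: DeleteRows,       // removes the processed rows from queue_artifacts
): Promise<number> {
  let total = 0;
  for (;;) {
    const batch = await fetchExpired(now, 1000); // page size 1000 instead of 100
    if (batch.length === 0) break;
    await deleteObjects(batch.map((a) => a.storageKey));
    await deleteRows(batch);
    total += batch.length;
    console.log(`expire-artifacts: removed ${total} expired artifacts so far`); // new progress logging
  }
  return total;
}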
Assignee
Comment 6•2 years ago
The fix was merged in https://github.com/taskcluster/taskcluster/pull/6172 and released in v49.1.0.
The fix will be tested on the community deployment first.
Assignee
Comment 7•2 years ago
The Community-TC deployment seems to be working fast.
Before the deployment the job was running for somewhere around 450-500 seconds; now it finishes in under 50 s while removing 20k-25k objects.
Reporter
Comment 8•2 years ago
\o/
Assignee
Comment 9•2 years ago
The updated expire-artifacts job removed the entire backlog of artifacts in just two days, which is great :)
It now finishes in 4-10 hours daily, removing somewhere around 1M-2M artifacts.
Reporter
Comment 10•1 year ago
Yarik, can we close this as FIXED?
Assignee
Comment 11•1 year ago
Yes, let's close it. Thanks for the reminder.