make archivescraper faster
Categories
(Socorro :: General, task, P2)
Tracking
(Not tracked)
People
(Reporter: willkg, Assigned: willkg)
Details
Attachments
(1 file, deleted: text/x-github-pull-request)
I wrote archivescraper to scrape version information for betaversion lookups. It's based on ftpscraper, and when I wrote it I wanted to stay as close to ftpscraper as I could to keep the jump from one to the other as small as possible.
archivescraper takes a while to run. In a fresh local dev environment, it can take 20+ minutes. It's kind of irritating and a time sink and I run it at least once a week.
Relatedly, I wrote a verifyprocessed job. It uses multiprocessing, which significantly reduces the time it takes to run.
archivescraper has similar properties: the bulk of its run time goes to traversing links on a website, which is predominantly slow HTTP conversations. That's pretty ideal for multiprocessing with lots of workers.
This bug covers taking what I did with verifyprocessed and applying it to archivescraper.
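A rough sketch of the general shape, not the actual archivescraper or verifyprocessed code: fan a list of URLs out to a pool of workers and hand any errors back to the parent for reporting. `fake_fetch`, `scrape_one`, `scrape_all`, and the example.com URLs are all hypothetical names made up for illustration. The sketch uses `multiprocessing.pool.ThreadPool` so it runs without a `__main__` guard; `multiprocessing.Pool` exposes the same `imap_unordered` method if separate processes are wanted.

```python
from multiprocessing.pool import ThreadPool


def fake_fetch(url):
    """Hypothetical stand-in for a slow HTTP request; returns version strings."""
    listing = {
        "https://example.com/firefox/": ["68.0b1", "68.0b2"],
        "https://example.com/thunderbird/": ["68.0b1"],
    }
    if url not in listing:
        raise ValueError(f"no listing for {url}")
    return listing[url]


def scrape_one(args):
    """Run one fetch; return (url, versions, error) so failures reach the parent."""
    fetch, url = args
    try:
        return url, fetch(url), None
    except Exception as exc:
        # Send the error back as data instead of letting it kill the worker.
        return url, [], repr(exc)


def scrape_all(urls, fetch=fake_fetch, workers=4):
    """Fan the URLs out to a pool of workers and collect results and errors."""
    results, errors = {}, {}
    jobs = [(fetch, url) for url in urls]
    with ThreadPool(workers) as pool:
        # imap_unordered yields each result as soon as its worker finishes.
        for url, versions, error in pool.imap_unordered(scrape_one, jobs):
            if error is None:
                results[url] = versions
            else:
                errors[url] = error
    return results, errors
```

Returning errors as plain values keeps one bad page from stopping the whole run and lets the parent report every failure in one place at the end.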
Comment 1 • 6 years ago (Assignee)
The last fresh run I did took 40 minutes.
Comment 2 • 6 years ago (Assignee)
Grabbing this to tinker with today. I think it's straightforward except for error handling and reporting. That's a bit trickier.
Comment 3 • 6 years ago (Assignee)
Comment 4 • 6 years ago (Assignee)
Comment 5 • 6 years ago (Assignee)
This has been running on stage for a while and it's significantly faster. Yay!
We just pushed this to prod. Marking as FIXED.