Closed Bug 1455721 Opened 7 years ago Closed 7 years ago

Log parsing backlog due to hundreds of 1GB+ logs ("No logs available, log parsing is in progress")

Categories

(Tree Management :: Treeherder: Infrastructure, enhancement, P1)

enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ccoroiu, Assigned: emorley)

References

Details

Eugh. I'd spent 30 minutes digging into New Relic, finding example URLs, calculating sizes and so on, and adding them to this comment; however, Nightly hung from the excessively large logs (plus the fact that the Taskcluster task inspector defaults to displaying the log view of a task), and Bugzilla's lack of form-field preservation meant I lost it all.

The short "I'm not going to type this all out again" version is that there were hundreds of failures with logs that were 15-30MB compressed, but had a whopping 98%+ compression ratio - so 600-1600MB uncompressed. The backlog of the "log parser fail" queue lasted roughly 1910-2050 BST (the parsing of successful jobs has lower priority, so its higher-than-normal queues lasted from 1800-2250).

See:
https://rpm.newrelic.com/accounts/677903/dashboard/16676634/page/2?tw%5Bend%5D=1524261600&tw%5Bstart%5D=1524240000
https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?show_browser=false&tw%5Bend%5D=1524256233&tw%5Bstart%5D=1524244823&type=Celery#id=5b224f746865725472616e73616374696f6e2f43656c6572792f6c6f672d706172736572222c22225d

The main source of the 1-1.5GB logs appears to be this try push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1bf1e770770f2107a2e6d45d891506c7c7c48060

Though there were also a much smaller number of 500MB logs from this separate try push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3d88b96ce375aa4e56eeb10b2586b5310209a221

Bug 1295997 will make Treeherder more resilient to these kinds of issues (i.e. it will simply skip parsing logs over a certain size). In the meantime, the number of log parsing dynos can be increased temporarily to deal with these backlogs. That said, I really do feel that these excessively large logs should be handled at the harness/taskcluster/artifacts layer instead.
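For readers unfamiliar with the mitigation referenced above: the idea in bug 1295997 is to check the size of the compressed log artifact before the parser downloads it, and skip parsing if it exceeds a cap. The following is only a minimal sketch of that kind of guard, not the actual Treeherder patch; the MAX_DOWNLOAD_BYTES name and the 5 MB value are hypothetical, and it assumes the log host returns a Content-Length header on HEAD requests.

    import logging
    import requests

    logger = logging.getLogger(__name__)

    # Hypothetical cap on the compressed artifact size; the real setting name
    # and value in Treeherder may differ.
    MAX_DOWNLOAD_BYTES = 5 * 1024 * 1024

    def fetch_log_if_reasonable(log_url):
        """Return the log text, or None if the compressed artifact is too large."""
        # A HEAD request lets us check the advertised size without downloading the body.
        head = requests.head(log_url, allow_redirects=True, timeout=30)
        head.raise_for_status()
        size = int(head.headers.get("Content-Length") or 0)
        if size > MAX_DOWNLOAD_BYTES:
            logger.warning("Skipping oversized log (%d bytes compressed): %s", size, log_url)
            return None
        response = requests.get(log_url, timeout=30)
        response.raise_for_status()
        return response.text

A guard like this caps the worst case per task at the download threshold (roughly 5 MB compressed here) instead of letting a single 1.5 GB uncompressed log occupy a parser dyno for many minutes.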
Assignee: nobody → emorley
Status: NEW → RESOLVED
Closed: 7 years ago
Component: Treeherder: Data Ingestion → Treeherder: Infrastructure
Depends on: 1295997
Priority: -- → P1
Resolution: --- → FIXED
Summary: No logs available, log parsing is in progress → Log parsing backlog due to hundreds of 1GB+ logs ("No logs available, log parsing is in progress")
(In reply to Ed Morley [:emorley] from comment #1)
> The main source of the 1-1.5GB logs appears to be this try push:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=1bf1e770770f2107a2e6d45d891506c7c7c48060

Some additional stats on this try push:
* jobs: 2245
* associated text log URLs: 2245, though 190 are 404 (there are also an additional 1576 errorsummary_json logs)
* combined text log download size (gzipped): 6.7 GB
* combined text log uncompressed size: 204 GB (!)

Perhaps we also need to make it harder for people to do a `try: -b do -p all -u all -t none` push? Until at least some tests are passing, it seems pretty wasteful for try pushes to run 2000+ jobs.
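As an aside, stats like the ones above can be gathered with a short script once you have the list of text log URLs for the push (for example, exported from the Treeherder API). This sketch is not necessarily how these particular numbers were produced; it only sums the gzipped sizes reported via Content-Length and counts 404s, and the function name and input list are illustrative.

    import requests

    def summarize_logs(log_urls):
        """Count missing logs and total the compressed bytes for a list of log URLs."""
        total_compressed = 0
        missing = 0
        for url in log_urls:
            response = requests.head(url, allow_redirects=True, timeout=30)
            if response.status_code == 404:
                missing += 1
                continue
            response.raise_for_status()
            total_compressed += int(response.headers.get("Content-Length") or 0)
        return {
            "urls": len(log_urls),
            "404s": missing,
            "compressed_bytes": total_compressed,
        }

Getting the uncompressed total (the 204 GB figure) would additionally require downloading each log and decompressing it, or reading the size field in the gzip trailer, since HEAD only reports the compressed size.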