Further optimize output.sh phase by using parallel's `--pipe` and/or `--pipepart` mechanisms to provide output-file with filenames via stdin
Categories
(Webtools :: Searchfox, enhancement)
Tracking
(Not tracked)
People
(Reporter: asuth, Assigned: asuth)
Attachments
(1 file)
(deleted),
text/x-github-pull-request
The instrumentation added in bug 1567724 tells us that bootstrapping each output-file.rs invocation for mozilla-central takes about 21.5 seconds. GNU parallel's joblog indicates we currently trigger 341 of these invocations. Our m5d.2xlarge instances have 8 vCPUs each (a count that already includes hyper-threading), so (341 - 8) = 333 of these loads are redundant. Multiplying that by 21.5 seconds and dividing by the 8 vCPUs, we find that each vCPU spends about 14.9 minutes on redundant initialization that wouldn't be necessary if we passed the list of files to output over stdin to output-file. Since the output-file phase currently takes about 23 minutes, cutting out 15 minutes of waste would be quite a helpful improvement!
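The estimate above can be reproduced with a few lines of Python. This is only a restatement of the arithmetic in the description; the constant and function names are made up for illustration, and the inputs are the figures quoted above.

```python
# Figures quoted in the description (not re-measured here).
BOOTSTRAP_SECONDS = 21.5  # per output-file invocation on mozilla-central
INVOCATIONS = 341         # count taken from GNU parallel's joblog
VCPUS = 8                 # m5d.2xlarge


def redundant_minutes_per_vcpu(bootstrap_s, invocations, vcpus):
    """Minutes each vCPU spends on bootstraps beyond one per vCPU."""
    redundant_loads = invocations - vcpus  # one bootstrap per vCPU is unavoidable
    return redundant_loads * bootstrap_s / vcpus / 60


wasted = redundant_minutes_per_vcpu(BOOTSTRAP_SECONDS, INVOCATIONS, VCPUS)
print(f"{wasted:.1f} minutes of redundant initialization per vCPU")
```

The model assumes the invocations are spread evenly over the vCPUs, which is the same simplification the description makes.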
This does assume that output-file does not experience unbounded memory growth. In https://bugzilla.mozilla.org/show_bug.cgi?id=1567724#c19 I had noticed that the TreeDiffCache did grow our memory usage, but with the removal of blame-skipping logic, we no longer have a TreeDiffCache.
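The shape of the proposed change, paying the startup cost once per worker and then handling every file name that arrives on stdin, might look roughly like the sketch below. It is written in Python purely for illustration; output-file itself is a Rust program, and `expensive_bootstrap`, `process_paths`, and the sample file names are hypothetical stand-ins rather than anything from the Searchfox code.

```python
import io

bootstrap_calls = 0


def expensive_bootstrap():
    """Hypothetical stand-in for the per-process startup work."""
    global bootstrap_calls
    bootstrap_calls += 1
    return {"ready": True}


def process_paths(stream, handle_file):
    """Bootstrap once, then handle every non-empty path read from `stream`."""
    state = expensive_bootstrap()
    handled = 0
    for line in stream:
        path = line.rstrip("\n")
        if not path:
            continue  # ignore blank lines in the input
        handle_file(state, path)
        handled += 1
    return handled


# In a real worker `stream` would be sys.stdin; StringIO keeps the demo self-contained.
seen = []
handled = process_paths(
    io.StringIO("a.cpp\nb.rs\n\nc.js\n"),
    lambda state, path: seen.append(path),
)
print(handled, seen, bootstrap_calls)
```

Because one process now lives for many files, anything it caches per file stays in memory for the whole run, which is why the memory-growth caveat above matters.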
Comment 1 • Assignee • 2 years ago
When I do this, I'm also going to add something so the metrics scripts can tell how long build-codesearch.py takes.
Comment 2 • Assignee • 2 years ago
I'm going to trigger the config2 indexer job now to see how much it benefits. In particular, I want to see whether we have any problems with uneven distribution of work among the output-file invocations that might necessitate quick additional changes, like randomly shuffling all-files.
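If uneven distribution did turn out to be a problem, the shuffling idea could be sketched as follows. This is a hypothetical illustration, not the change that was made: it assumes the file list is cut into contiguous blocks, one per worker, and the function name, seed, and sample paths are invented for the example. How GNU parallel actually divides its input is not modelled here.

```python
import random


def shuffled_blocks(paths, workers, seed=0):
    """Shuffle a copy of `paths`, then cut it into at most `workers` contiguous blocks."""
    shuffled = list(paths)
    if not shuffled:
        return []
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    size = -(-len(shuffled) // workers)  # ceiling division
    return [shuffled[i:i + size] for i in range(0, len(shuffled), size)]


# Hypothetical stand-in for the contents of all-files.
all_files = [f"dir{i // 50}/file{i}.cpp" for i in range(341)]
blocks = shuffled_blocks(all_files, workers=8)
print([len(block) for block in blocks])
```

Without the shuffle, files from one expensive directory would tend to land in the same block; shuffling first spreads them across blocks at the cost of losing the original ordering.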
Comment 3 • Assignee • 2 years ago
The config2 indexing job now takes 2h49m, down from 3h55m (Thursday), which was itself down from 7h52m (before Thursday).
Below is the key excerpt of the indexer-logs-analyze.sh output, with the additional new logging and analysis logic. mozilla-beta now gets to build-codesearch.py at 28:50, down from 44:30, down from 1:44:07. But note that we now capture through check-index.sh, which is a better gauge because it also includes livegrep/codesearch index building and the time required to compress all the HTML into gzip; for mozilla-beta that timestamp is 34:58.
├── mozilla-beta
│   └──
│       script                time since start   apparent duration
│       ────────────────────────────────────────────────────────────
│       find-repo-files.py    0:00:00            0:01:23
│       build.sh              0:01:23            0:07:07
│       js-analyze.sh         0:08:30            0:03:12
│       idl-analyze.sh        0:11:42            0:00:12
│       ipdl-analyze.sh       0:11:54            0:00:00
│       crossref.sh           0:11:54            0:08:42
│       output.sh             0:20:36            0:08:14
│       build-codesearch.py   0:28:50            0:03:19
│       compress-outputs.sh   0:32:09            0:02:49
│       check-index.sh        0:34:58
│
├── mozilla-release
│   └──
│       script                time since start   apparent duration
│       ────────────────────────────────────────────────────────────
│       find-repo-files.py    0:00:00            0:01:25
│       build.sh              0:01:25            0:07:06
│       js-analyze.sh         0:08:31            0:03:12
│       idl-analyze.sh        0:11:43            0:00:12
│       ipdl-analyze.sh       0:11:55            0:00:00
│       crossref.sh           0:11:55            0:08:40
│       output.sh             0:20:35            0:08:12
│       build-codesearch.py   0:28:47            0:03:21
│       compress-outputs.sh   0:32:08            0:02:49
│       check-index.sh        0:34:57
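The "apparent duration" column in the excerpt above is consistent with each script's duration being the gap between its start time and the start time of the next script, which is also why the last row has no duration. That reading is inferred from the numbers, not taken from indexer-logs-analyze.sh itself. A small Python sketch under that assumption, using the mozilla-beta rows:

```python
from datetime import timedelta


def parse_hms(text):
    """Parse an 'H:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def apparent_durations(rows):
    """Pair each script with the gap until the next script starts."""
    parsed = [(name, parse_hms(start)) for name, start in rows]
    return [
        (name, str(next_start - start))
        for (name, start), (_, next_start) in zip(parsed, parsed[1:])
    ]


# "time since start" values copied from the mozilla-beta excerpt.
mozilla_beta = [
    ("find-repo-files.py", "0:00:00"),
    ("build.sh", "0:01:23"),
    ("js-analyze.sh", "0:08:30"),
    ("idl-analyze.sh", "0:11:42"),
    ("ipdl-analyze.sh", "0:11:54"),
    ("crossref.sh", "0:11:54"),
    ("output.sh", "0:20:36"),
    ("build-codesearch.py", "0:28:50"),
    ("compress-outputs.sh", "0:32:09"),
    ("check-index.sh", "0:34:58"),
]

for name, duration in apparent_durations(mozilla_beta):
    print(f"{name:<22}{duration}")
```

Under this reading, a script that appears to take 0:00:00, such as ipdl-analyze.sh, simply started in the same second as the script after it.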