Closed Bug 1029517 • Opened 10 years ago • Closed 10 years ago
Improve Ouija updatedb.py performance
Categories: Testing :: General, defect
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: dminor; Unassigned; Mentored
Whiteboard: [good first bug][lang=python]
Attachments (2 files, 1 obsolete file):
Patch to accept gzip encoding in https requests (deleted), patch, dminor: review+
Rebased Parallelized getCSetResults (deleted), patch, dminor: review+
+++ This bug was initially created as a clone of Bug #1027063 +++
Ouija (https://wiki.mozilla.org/Auto-tools/Projects/Ouija) is a tool for test failure rate analysis.
We update the database periodically using updatedb.py, a script which extracts build data and inserts it into the database. Unfortunately, this has become quite slow.
Currently we are inserting data a single row at a time. According to the MySQL docs, using an insert statement with multiple values() lists should be much faster. Any other performance improvements are of course welcome!
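A sketch of the multi-row approach (upload_batch is a hypothetical helper; the column list and connection parameters match the existing code):

import MySQLdb

def upload_batch(rows):
    # rows: a list of 12-tuples matching the testjobs columns below.
    db = MySQLdb.connect(unix_socket="/tmp/soc", user="root", db="ouija")
    cur = db.cursor()
    # executemany() lets the driver rewrite this into one multi-row
    # INSERT ... VALUES (...), (...), ...; parameter binding also replaces
    # the manual string formatting used today.
    cur.executemany(
        "insert into testjobs (id, log, slave, result, duration, platform,"
        " buildtype, testtype, bugid, branch, revision, date)"
        " values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
        rows)
    db.commit()
    cur.close()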
Comment 1•10 years ago
Another potential time saver could be making the HTTP requests for the CSet Results in parallel.
Comment 2•10 years ago (Reporter)
(In reply to Gautam Akiwate from comment #1)
> Another potential time saver could be making the HTTP requests for the CSet
> Results in parallel.
Sounds great!
Comment 3•10 years ago
Hi, I am new to Mozilla development. I would like to work on this bug.
However, I don't know much about Ouija, so I would need some guidance.
Comment 4•10 years ago
Hi Miheer,
Glad to see your interest in Mozilla and this bug. We have some information to get started on our wiki for this project:
https://wiki.mozilla.org/Auto-tools/Projects/Ouija
We can't answer all questions, so if you have further questions ask in irc on #ateam or in this bug. We would love to help you out as needed to help you make a difference in tools at Mozilla.
Comment 5•10 years ago
I tried doing a line_profiler run on updatedb.py with delta=2.
Below is the output from the line_profiler for uploadResults().
The select statement (at line 201) takes much more time per hit than the insert statement (at line 231).
Timer unit: 1e-06 s
File: src/updatedb.py
Function: uploadResults at line 189
Total time: 17.4581 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
189 @profile
190 def uploadResults(data, branch, revision, date):
191 22 70 3.2 0.0 db = MySQLdb.connect(unix_socket="/tmp/soc",
192 22 48 2.2 0.0 user="root",
193 22 38345 1743.0 0.2 db="ouija")
194
195 22 849 38.6 0.0 cur = db.cursor()
196
197 3404 13718 4.0 0.1 for item in data:
198
199 3382 13467 4.0 0.1 if 'result' in item:
200 3382 29358 8.7 0.2 id = '%s' % int(item['_id'])
201 3382 15473386 4575.2 88.6 cur.execute('select revision from\
testjobs where id=%s' % id)
202 3382 33236 9.8 0.2 if cur.fetchone():
... Lines omitted ...
216
217 3121 7285 2.3 0.0 sql = 'insert into testjobs (id, log,\
slave, result, duration,\
platform, buildtype,\
testtype, bugid,\
branch, revision, date) values ('
218 3121 10685 3.4 0.1 sql += '%s' % id
219 3121 31578 10.1 0.2 sql += ", '%s'" % log
220 3121 16624 5.3 0.1 sql += ", '%s'" % slave
221 3121 13753 4.4 0.1 sql += ", '%s'" % result
222 3121 12113 3.9 0.1 sql += ', %s' % duration
223 3121 10691 3.4 0.1 sql += ", '%s'" % platform
224 3121 12686 4.1 0.1 sql += ", '%s'" % buildtype
225 3121 14121 4.5 0.1 sql += ", '%s'" % testtype
226 3121 10465 3.4 0.1 sql += ", '%s'" % bugid
227 3121 10440 3.3 0.1 sql += ", '%s'" % branch
228 3121 10088 3.2 0.1 sql += ", '%s'" % revision
229 3121 41556 13.3 0.2 sql += ", '%s'" % date
230 3121 10723 3.4 0.1 sql += ')'
231 3121 951397 304.8 5.4 cur.execute(sql)
232 22 452 20.5 0.0 cur.close()
For the same run, the output for parseResults is given below.
File: src/updatedb.py
Function: parseResults at line 233
Total time: 254.606 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
233 @profile
234 def parseResults(args):
235 1 30 30.0 0.0 startdate = datetime.datetime.utcnow() -\
datetime.timedelta(hours=args.delta)
236
237 1 2 2.0 0.0 if args.branch == 'all':
238 1 1 1.0 0.0 result_branches = branches
239 else:
240 result_branches = [args.branch]
241
242 5 9 1.8 0.0 for branch in result_branches:
243 4 67334604 16833651.0 26.4 revisions = getPushLog(branch, startdate)
244
245 4 127369 31842.2 0.1 clearResults(branch, startdate)
246 26 87 3.3 0.0 for revision,date in revisions:
247 22 950 43.2 0.0 print "%s - %s" % (revision, date)
248 22 169457115 7702596.1 66.6 data = getCSetResults(branch, revision)
249 22 17685839 803901.8 6.9 uploadResults(data, branch, revision,
date)
Comment 7•10 years ago (Reporter)
Interesting! I guess we should get rid of that select statement if at all possible.
I should also do a profile on the 'production' database to see if we end up with consistent results.
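A sketch of one way to drop the select inside the existing uploadResults loop (it assumes testjobs gains a unique or primary key on id, so the database detects duplicates itself):

# Hypothetical replacement for the select-then-insert pattern: INSERT
# IGNORE skips rows whose id already exists, assuming id is a unique or
# primary key. The variable names are those already used in uploadResults.
cur.execute(
    "insert ignore into testjobs (id, log, slave, result, duration,"
    " platform, buildtype, testtype, bugid, branch, revision, date)"
    " values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
    (id, log, slave, result, duration, platform, buildtype, testtype,
     bugid, branch, revision, date))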
Comment 8•10 years ago
Very interesting.
Comment 9•10 years ago
I'll be working on parallelizing the getCSetResults.
Comment 10•10 years ago
Another improvement could be to use HTTP compression. IMO that would greatly reduce the time spent on the requests.
Comment 11•10 years ago
Created attachment 8446088 [details] [diff] [review]
Patch to accept gzip encoding in https requests

I have created a patch for accepting gzip compressed content.
I ran the patched updatedb.py with delta=2. However, it seems that the server only compresses the data sometimes; I don't know why. Even so, I noticed a 20% reduction in runtime on average. Can someone test this in the production scenario and see by how much it improves the runtime?
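A sketch of the approach (the host and path below are placeholders, not the real URLs updatedb.py fetches):

import gzip
import httplib
from StringIO import StringIO

conn = httplib.HTTPSConnection("example.mozilla.org")   # placeholder host
conn.request("GET", "/builds", headers={"Accept-Encoding": "gzip"})
response = conn.getresponse()
data = response.read()
# The server only sometimes honours the header, so check before decoding.
if response.getheader("content-encoding", "") == "gzip":
    data = gzip.GzipFile(fileobj=StringIO(data)).read()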
Comment 12•10 years ago (Reporter)
(In reply to Miheer from comment #11)
> Created attachment 8446088 [details] [diff] [review]
> Patch to accept gzip encoding in https requests
>
> I have created a patch for accepting gzip compressed content.
> I ran the patched updatedb.py with delta=2. However, it seems that the
> server only compresses the data sometimes. Don't know why. However even with
> this, I noticed a 20% reduction in runtime on the average. Can someone test
> this in the production scenario and see by how much it improves the runtime?
Miheer, I'm about to head out for the day, but I'll test it tomorrow morning and let you know.
Thanks!
Dan
Comment 13•10 years ago
Created attachment 8446606 [details] [diff] [review]
Parallelize GetResults

I have attached a patch that parallelizes the getResults call. This shows an approx. 50% improvement with 4 threads.
However, we need to make uploadResults better. Possibly allow more DB connections?
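A sketch of the split-across-threads approach (names are illustrative; the actual code is in the attachment, and getCSetResults/uploadResults are the existing functions):

import threading

def worker(revisions, branch, lock):
    # Each thread fetches its own share of revisions; the lock serializes
    # the database writes done by uploadResults.
    for revision, date in revisions:
        data = getCSetResults(branch, revision)
        lock.acquire()
        try:
            uploadResults(data, branch, revision, date)
        finally:
            lock.release()

def run_parallel(revisions, branch, nthreads=4):
    lock = threading.Lock()
    chunks = [revisions[i::nthreads] for i in range(nthreads)]
    threads = [threading.Thread(target=worker, args=(chunk, branch, lock))
               for chunk in chunks if chunk]
    for t in threads:
        t.start()
    for t in threads:
        t.join()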
Comment 14•10 years ago (Reporter)
Comment on attachment 8446088 [details] [diff] [review]
Patch to accept gzip encoding in https requests
Review of attachment 8446088 [details] [diff] [review]:
-----------------------------------------------------------------
Thanks, this looks good. There is a bit of extra whitespace, but I'll clean that up prior to landing, don't worry about updating your patch.
Running on the "production" instance, I saw the following -
baseline: real 10m50.107s user 0m1.048s sys 0m0.320s
patched: real 7m28.768s user 0m0.924s sys 0m0.148s
Thanks for your contribution! If you have any other ideas here or if you're looking for another bug to take on, please let me know.
::: src/updatedb.py
@@ +112,5 @@
> +
> + #see http://www.diveintopython.net/http_web_services/gzip_compression.html
> + if response.getheader('content-encoding', '') == 'gzip':
> + data = gzip.GzipFile(fileobj=StringIO(data)).read()
> +
Be careful of extra whitespace.
Attachment #8446088 - Flags: review+
Comment 15•10 years ago
Dan, thank you for your encouragement. Just to clarify, is it only the whitespace before the comment that you are talking about, or do you also include the one after the if statement? Also, can you profile your "production" instance, so that we can see possible bottlenecks?
Comment 16•10 years ago
I have a question.
Why has parallelizing getCSetResults() decreased the runtime? And what does the following line in the getCSetResults() docstring mean: "Some results will be in asap, others will take up to 12 hours (usually < 4 hours)"?
Is it all because some requests in getCSetResults() require time-intensive computations on the server side?
Comment 17•10 years ago (Reporter)
Comment on attachment 8446606 [details] [diff] [review]
Parallelize GetResults
Review of attachment 8446606 [details] [diff] [review]:
-----------------------------------------------------------------
Thanks, this is a good start!
This definitely helps things (--delta 2, --threads 4):
baseline:
real 10m50.107s user 0m1.048s sys 0m0.320s
this patch:
real 7m21.775s user 0m1.124s sys 0m0.212s
this patch and Miheer's patch:
real 6m26.051s user 0m0.864s sys 0m0.148s
I'd like to land this since it definitely helps.
In order to do this, please rebase on top of Miheer's patch (there is a conflict, but only in the imports) and fix the few small things I mentioned below. Also, next time, please create your diff using git format-patch -1. This preserves your name and email in the patch, so the commit will show up as yours rather than mine.
Possibly you tried this already, but rather than splitting the revisions into four groups up front, it might make sense to have a queue of revisions to fetch. Then as each thread finishes its work, it can request another revision to fetch.
I'm also interested in a queue for results to be uploaded to the database, with a single thread draining the queue and uploading results. I know you tried this before and it didn't help much, but this approach would also allow us to combine inserts into a single transaction, and I'd like to see if that makes any difference.
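A sketch of that design (the queue names and function shapes here are hypothetical; getCSetResults and uploadResults are the existing functions):

import threading
import Queue

def fetcher(revision_q, result_q, branch):
    # Worker threads pull revisions until the queue is drained.
    while True:
        try:
            revision, date = revision_q.get_nowait()
        except Queue.Empty:
            return
        result_q.put((getCSetResults(branch, revision), branch,
                      revision, date))

def uploader(result_q):
    # A single thread owns the database; this is also the natural place
    # to batch inserts into one transaction.
    while True:
        item = result_q.get()
        if item is None:  # sentinel posted after all fetchers finish
            return
        uploadResults(*item)

The main thread would fill revision_q, start the fetchers plus one uploader, join the fetchers, and then post the None sentinel.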
::: src/updatedb.py
@@ +5,5 @@
> import datetime
> import sys
> import time
> +import threading
> +import Queue
Queue unused below.
@@ +244,5 @@
> + uploadResults(data, branch, revision, date)
> + lock.release()
> +
> +
> +def getRange(tid, lenR, nthreads):
Please use len_revisions rather than lenR in this function.
Comment 18•10 years ago
All getCSetResults are IO bound which is why threads should ideally help. Essentially, we are trying to overlap the time needed for each of the IO requests.
Comment 19•10 years ago (Reporter)
(In reply to Miheer from comment #15)
> Dan, thank you for your encouragement. Just to clarify, is it only the
> whitespace before the comment that you are talking about, or do you also
> include the one after the if statement? Also, can you profile your
> "production" instance, so that we can see possible bottlenecks?
Miheer, if you click on [1] you'll see the three spots with extra whitespace indicated.
I'm planning to profile the "production" instance today - I have to do some software updates first.
Thanks!
[1] https://bugzilla.mozilla.org/page.cgi?id=splinter.html&bug=1029517&attachment=8446088
Comment 20•10 years ago (Reporter)
You can get an idea of the data we're fetching by having a look at TBPL [1]. When someone lands a change, it kicks off a bunch of build and test runs that take varying amounts of time on different platforms. Grey jobs are still in progress, other colours indicate completed runs.
Some platforms are slower than others, and for platforms where we run the tests on physical machines, sometimes there is a large queue of pending jobs which means a job might not even start for an hour or two after the change is landed.
[1] https://tbpl.mozilla.org/?tree=Mozilla-Inbound
Comment 21•10 years ago
Thanks Dan, I think I understood some parts of what you said. Surprisingly, I cannot extract any information from any of your previous links.. both links do not show any data. Is something wrong with my browser... :/
Comment 22•10 years ago
Created attachment 8446694 [details] [diff] [review]
Rebased Parallelized getCSetResults

Rebased patch with modifications as requested.
Attachment #8446606 - Attachment is obsolete: true
Comment 23•10 years ago
Hi Miheer,
What can you not access? The TBPL link or the link to your patch?
Gautam
Comment 24•10 years ago (Reporter)
Comment on attachment 8446694 [details] [diff] [review]
Rebased Parallelized getCSetResults
Review of attachment 8446694 [details] [diff] [review]:
-----------------------------------------------------------------
Thanks, looks great!
Attachment #8446694 - Flags: review+
Comment 25•10 years ago
I could not access both... But I just tried opening them in incognito mode and now it's fine.
Thanks,
Miheer
Comment 26•10 years ago
Hi Miheer,
Would you like to take up the parallelization feature additions?
I can help you get started on it.
Comment 27•10 years ago
Hi Gautam,
Sure, I would like to do it. However, I am a little busy now and will only be able to start working on it tomorrow. So the thing that needs to be done is to implement a download queue with a pool of downloaders, right? While we are at it, why restrict it to one branch? We could parallelize getCSetResults for all branches together. What do you think?
Thanks,
Miheer
Comment 28•10 years ago
Yes. That is correct. If we implement the Queue method then we should be able to add the requests and the branch could be a parameter as well.
If you have any further questions you can post them here or alternatively mail me at gakiwate[at]mozilla.com
Comment 29•10 years ago
Okay, thanks.
Comment 30•10 years ago
Hi Gautam,
Can you explain why parallelizing cset downloads has improved performance? As far as I can see, it does database updates and downloading concurrently. But is it necessary to parallelize downloads? Won't you just end up dividing the bandwidth across parallel downloads?
Regards,
Miheer
Comment 31•10 years ago
(In reply to Miheer from comment #30)
Hi Miheer,
Sure. As far as I understand, the main issue here is the latency and not bandwidth. The IO here is the network and is primarily dominated by the latency.
In order to fully utilize your network link you need to account for two things, first bandwidth and second latency. The capacity of the link is Bandwidth x Latency. By making parallel getCSetResults requests we are overlapping the latency of the requests and making better use of the capacity.
However, what you point out is correct. Too many parallel requests will in fact show no significant improvements. This is what happens if you increase the number of threads beyond 4. Presumably, this is the number that is needed to fill out the capacity of the link, and anything beyond that is going to do us no good.
A good way to think of this could be: instead of having one connection make, let's say, 4 trips back and forth, we use 4 connections to make the trips, so that we wait only 1 trip time instead of 4 trip times; that is where our dominant improvements come from.
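A toy calculation of the same argument (all numbers are made up; 22 revisions matches the profile earlier in this bug):

latency = 2.0    # assumed seconds per request, dominated by round trips
requests = 22    # revisions fetched, as in the earlier profile
for threads in (1, 2, 4, 8):
    batches = -(-requests // threads)    # ceiling division
    print "%d thread(s): about %.0f s waiting" % (threads, batches * latency)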
Hopefully, this puts into context some of the things done. If not, feel free to ask follow-up questions.
Comment 32•10 years ago
Wow, thanks for an interesting explanation... that explains the parallel downloading part.
But should we be concerned about the load it puts on the server? I have read that the tbpl server is pretty loaded during certain times of the day.
Comment 33•10 years ago
(In reply to Miheer from comment #32)
That is an excellent question.
However, considering that it is up to us when we run the update scripts, we do have sufficient control over it, so my first response is that it shouldn't be a problem. Further, I am not sure how much load the scripts put on the server. Maybe Dan or Joel could elaborate on this?
In any case, if we see a degradation, we could take it up for analysis.
Good questions. Keep them coming! :)
Comment 34•10 years ago (Reporter)
At the moment, we're running the update script twice a day, at 1 AM and 1 PM UTC.
Comment 35•10 years ago
Hi
I have issued a pull request at https://github.com/dminor/ouija/pull/16.
It does the downloading and database operations concurrently. I have added a lot of small changes on my own, so please let me know if I should remove some of them.
Thanks,
Miheer
Comment 36•10 years ago
Also, there was initially some issue on gakiwate's system. However, after adding a workaround, it was resolved.
Comment 37•10 years ago
For future reference this was the error encountered.
INFO:root:Downloading completed
Exception in thread DBHandler (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
File "updatedb.py", line 120, in run
File "/usr/lib/python2.7/Queue.py", line 168, in get
File "/usr/lib/python2.7/threading.py", line 332, in wait
: 'NoneType' object is not callable
Comment 38•10 years ago
Hi Miheer,
While the pull request is under consideration, let us look into whether there are any possible improvements to be had in uploadResults.
As is evident from our runs, if we can reduce the run time of uploadResults, we should be able to finish faster in general.
Comment 39•10 years ago (Reporter)
Sorry for the delay in looking at this. The updatedb.py performance has become much worse this week (43 minutes to run updatedb.py with --delta 2, rather than around 7 minutes last week.)
I'd like to understand what is causing this before we make any more changes.
Comment 40•10 years ago
Presumably, the baseline still takes 10 minutes? So we can rule out a heavier workload this time as opposed to last time.
Also, how many threads are we running it with?
This is odd considering in our runs we consistently get the patch to perform better.
Comment 41•10 years ago (Reporter)
Miheer noticed the (embarrassing) fact that we don't have a primary key in the table, which was causing the select statement mentioned in comment 5 to take a very large amount of time in production. With that merged, it only takes 5 seconds to run the script.
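A sketch of the fix (it assumes the key column is id; the exact definition landed in schema.sql):

# Sketch: with a primary key on id, the per-row
# 'select revision from testjobs where id=%s' becomes an index lookup
# instead of a full table scan.
cur.execute("alter table testjobs add primary key (id)")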
Comment 42•10 years ago (Reporter)
Miheer and Gautam, thanks for all your work on this!
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 43•10 years ago
(In reply to Dan Minor [:dminor] from comment #41)
> Miheer noticed the (embarrassing) fact that we don't have a primary key in
> the table, which was causing the select statement mentioned in comment 5 to
> take a very large amount of time in production. With that merged, it only
> takes 5 seconds to run the script.
Hey Dan, does schema.sql (https://github.com/dminor/ouija/blob/4939101e0dec12bc76b8101fc9c210d3c48fd78e/sql/schema.sql) need to be updated to reflect choosing a primary key? =)
Flags: needinfo?(dminor)
Comment 44•10 years ago (Reporter)
Thanks, I did an 'alter table' on the production database, but forgot to update schema.sql. Done now.
https://github.com/dminor/ouija/commit/b731093e602e7254e7ff9a985a0c512d621de954
Flags: needinfo?(dminor)
Comment 45•10 years ago
Sound!