Closed Bug 1029517 Opened 10 years ago Closed 10 years ago

Improve Ouija updatedb.py performance

Categories

(Testing :: General, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dminor, Unassigned, Mentored)

Details

(Whiteboard: [good first bug][lang=python])

Attachments

(2 files, 1 obsolete file)

+++ This bug was initially created as a clone of Bug #1027063 +++

Ouija (https://wiki.mozilla.org/Auto-tools/Projects/Ouija) is a tool for test failure rate analysis. We update the database periodically using updatedb.py, a script which extracts build data and inserts it into the database. Unfortunately, this has become quite slow. Currently we are inserting data a single row at a time. According to the MySQL docs, using an insert statement with multiple values() lists should be much faster. Any other performance improvements are of course welcome!
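A sketch of the multi-row insert the MySQL docs recommend: build one INSERT with a values() list per row. This is illustrative only - it uses sqlite3 so it runs self-contained, and the three column names are a made-up subset of the real twelve; the real script would use MySQLdb with %s placeholders instead of ?.

```python
import sqlite3

# Hypothetical rows; the real data has twelve columns (id, log, slave, ...).
rows = [
    (1, "log1", "slave1"),
    (2, "log2", "slave2"),
    (3, "log3", "slave3"),
]

# One INSERT with a values() list per row, instead of one statement
# per row; placeholders also avoid hand-quoting every field.
placeholders = ", ".join(["(?, ?, ?)"] * len(rows))
sql = "insert into testjobs (id, log, slave) values " + placeholders
flat = [value for row in rows for value in row]

conn = sqlite3.connect(":memory:")
conn.execute("create table testjobs (id integer primary key, log text, slave text)")
conn.execute(sql, flat)
conn.commit()
print(conn.execute("select count(*) from testjobs").fetchone()[0])  # → 3
```

A single round trip to the server per batch is the main win; the placeholder style is a bonus over the string-concatenated SQL in the current script.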
Another potential time saver could be making the HTTP requests for the CSet Results in parallel.
(In reply to Gautam Akiwate from comment #1)
> Another potential time saver could be making the HTTP requests for the CSet
> Results in parallel.

Sounds great!
Hi, I am new to Mozilla development. I would like to work on this bug. However, I don't know much about Ouija, so some guidance would be needed.
Hi Miheer,

Glad to see your interest in Mozilla and this bug. We have some information to get started on our wiki for this project: https://wiki.mozilla.org/Auto-tools/Projects/Ouija. We can't answer all questions here, so if you have further questions, ask on IRC in #ateam or in this bug. We would love to help you out as needed so you can make a difference in tools at Mozilla.
I tried doing a line_profiler run of updatedb.py with delta=2. Below is the output from the line_profiler for uploadResults(). The select statement (at line 201) takes much more time per hit than the insert statement (at line 231).

Timer unit: 1e-06 s

File: src/updatedb.py
Function: uploadResults at line 189
Total time: 17.4581 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   189                                           @profile
   190                                           def uploadResults(data, branch, revision, date):
   191        22           70      3.2      0.0      db = MySQLdb.connect(unix_socket="/tmp/soc",
   192        22           48      2.2      0.0                           user="root",
   193        22        38345   1743.0      0.2                           db="ouija")
   194
   195        22          849     38.6      0.0      cur = db.cursor()
   196
   197      3404        13718      4.0      0.1      for item in data:
   198
   199      3382        13467      4.0      0.1          if 'result' in item:
   200      3382        29358      8.7      0.2              id = '%s' % int(item['_id'])
   201      3382     15473386   4575.2     88.6              cur.execute('select revision from testjobs where id=%s' % id)
   202      3382        33236      9.8      0.2              if cur.fetchone():

... lines omitted ...

   217      3121         7285      2.3      0.0              sql = 'insert into testjobs (id, log, slave, result, duration, platform, buildtype, testtype, bugid, branch, revision, date) values ('
   218      3121        10685      3.4      0.1              sql += '%s' % id
   219      3121        31578     10.1      0.2              sql += ", '%s'" % log
   220      3121        16624      5.3      0.1              sql += ", '%s'" % slave
   221      3121        13753      4.4      0.1              sql += ", '%s'" % result
   222      3121        12113      3.9      0.1              sql += ', %s' % duration
   223      3121        10691      3.4      0.1              sql += ", '%s'" % platform
   224      3121        12686      4.1      0.1              sql += ", '%s'" % buildtype
   225      3121        14121      4.5      0.1              sql += ", '%s'" % testtype
   226      3121        10465      3.4      0.1              sql += ", '%s'" % bugid
   227      3121        10440      3.3      0.1              sql += ", '%s'" % branch
   228      3121        10088      3.2      0.1              sql += ", '%s'" % revision
   229      3121        41556     13.3      0.2              sql += ", '%s'" % date
   230      3121        10723      3.4      0.1              sql += ')'
   231      3121       951397    304.8      5.4              cur.execute(sql)
   232        22          452     20.5      0.0      cur.close()
For the same run, the output for parseResults is given below.

File: src/updatedb.py
Function: parseResults at line 233
Total time: 254.606 s

Line #      Hits         Time     Per Hit   % Time  Line Contents
==============================================================
   233                                              @profile
   234                                              def parseResults(args):
   235         1           30        30.0      0.0      startdate = datetime.datetime.utcnow() - datetime.timedelta(hours=args.delta)
   236
   237         1            2         2.0      0.0      if args.branch == 'all':
   238         1            1         1.0      0.0          result_branches = branches
   239                                                  else:
   240                                                      result_branches = [args.branch]
   241
   242         5            9         1.8      0.0      for branch in result_branches:
   243         4     67334604  16833651.0     26.4          revisions = getPushLog(branch, startdate)
   244
   245         4       127369     31842.2      0.1          clearResults(branch, startdate)
   246        26           87         3.3      0.0          for revision,date in revisions:
   247        22          950        43.2      0.0              print "%s - %s" % (revision, date)
   248        22    169457115   7702596.1     66.6              data = getCSetResults(branch, revision)
   249        22     17685839    803901.8      6.9              uploadResults(data, branch, revision, date)
Interesting! I guess we should get rid of that select statement if at all possible. I should also do a profile on the 'production' database to see if we end up with consistent results.
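One hedged way to drop that per-item select: fetch the set of existing ids once before the loop and test membership in Python. A minimal sketch under made-up data, with sqlite3 standing in for MySQLdb so it runs self-contained:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table testjobs (id integer primary key, revision text)")
conn.executemany("insert into testjobs values (?, ?)",
                 [(1, "abc"), (2, "def")])

# One query up front replaces the SELECT executed once per item.
existing = {row[0] for row in conn.execute("select id from testjobs")}

data = [{"_id": 1}, {"_id": 3}]          # stand-in for the fetched results
new_items = [item for item in data if int(item["_id"]) not in existing]
print([item["_id"] for item in new_items])  # → [3]
```

This trades one big query plus a set lookup per item for 3382 indexed (or, worse, unindexed) selects; whether it pays off depends on table size, so it would want measuring against the production data.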
very interesting.
I'll be working on parallelizing the getCSetResults.
Another improvement can be to use HTTP compression. IMO that will greatly reduce the time spent on the requests.
I have created a patch for accepting gzip compressed content. I ran the patched updatedb.py with delta=2. However, it seems that the server only compresses the data sometimes; I don't know why. Even so, I noticed a 20% reduction in runtime on average. Can someone test this in the production scenario and see by how much it improves the runtime?
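A minimal sketch of the technique the patch uses: advertise Accept-Encoding: gzip, then check the Content-Encoding response header before inflating, since (as noted above) the server only compresses sometimes. Shown in Python 3 syntax with a simulated response body; the script itself uses httplib and StringIO on Python 2.

```python
import gzip
import io

def decode_body(data, content_encoding):
    """Transparently inflate a gzip-compressed HTTP body.

    The server does not always honour Accept-Encoding: gzip, so the
    Content-Encoding header must be checked before decompressing.
    """
    if content_encoding == "gzip":
        return gzip.GzipFile(fileobj=io.BytesIO(data)).read()
    return data

# Simulated responses: one compressed, one not.
payload = b'{"results": []}'
compressed = io.BytesIO()
with gzip.GzipFile(fileobj=compressed, mode="wb") as f:
    f.write(payload)

print(decode_body(compressed.getvalue(), "gzip") == payload)  # → True
print(decode_body(payload, "") == payload)                    # → True
```

The uncompressed fallback matters: inflating a body the server sent as plain text would raise an error.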
(In reply to Miheer from comment #11)
> Created attachment 8446088 [details] [diff] [review]
> Patch to accept gzip encoding in https requests
>
> I have created a patch for accepting gzip compressed content.
> I ran the patched updatedb.py with delta=2. However, it seems that the
> server only compresses the data sometimes. Don't know why. However even with
> this, I noticed a 20% reduction in runtime on the average. Can someone test
> this in the production scenario and see by how much it improves the runtime?

Miheer, I'm about to head out for the day, but I'll test it tomorrow morning and let you know. Thanks!

Dan
Attached patch Parallelize GetResults (obsolete) (deleted) — Splinter Review
I have attached a patch that parallelizes the getResults call. This shows an approx. 50% improvement with 4 threads. However, we need to make the uploadResults better. Possibly allow more DB connections?
Comment on attachment 8446088 [details] [diff] [review]
Patch to accept gzip encoding in https requests

Review of attachment 8446088 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, this looks good. There is a bit of extra whitespace, but I'll clean that up prior to landing; don't worry about updating your patch.

Running on the "production" instance, I saw the following:

baseline:
real    10m50.107s
user    0m1.048s
sys     0m0.320s

patched:
real    7m28.768s
user    0m0.924s
sys     0m0.148s

Thanks for your contribution! If you have any other ideas here or if you're looking for another bug to take on, please let me know.

::: src/updatedb.py
@@ +112,5 @@
> +
> +    #see http://www.diveintopython.net/http_web_services/gzip_compression.html
> +    if response.getheader('content-encoding', '') == 'gzip':
> +        data = gzip.GzipFile(fileobj=StringIO(data)).read()
> +

Be careful of extra whitespace.
Attachment #8446088 - Flags: review+
Dan, thank you for your encouragement. Just to clarify, is it only the whitespace before the comment that you are talking about, or do you also include the one after the if statement? Also, can you profile your "production" instance, so that we can see possible bottlenecks?
I have a question: why has parallelizing getCSetResults() decreased the runtime? And what does the following line in the getCSetResults() docstring mean: "Some results will be in asap, others will take up to 12 hours (usually < 4 hours)"? Is it all because some requests in getCSetResults() require time-intensive computations on the server side?
Comment on attachment 8446606 [details] [diff] [review]
Parallelize GetResults

Review of attachment 8446606 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, this is a good start! This definitely helps things (--delta 2, --threads 4):

baseline:
real    10m50.107s
user    0m1.048s
sys     0m0.320s

this patch:
real    7m21.775s
user    0m1.124s
sys     0m0.212s

this patch and Miheer's patch:
real    6m26.051s
user    0m0.864s
sys     0m0.148s

I'd like to land this since it definitely helps. In order to do this, please rebase on top of Miheer's patch (there is a conflict, but only in the imports) and fix the few small things I mentioned below. Also, next time, please create your diff using git format-patch -1. This preserves your name and email in the patch, so the commit will show up as yours rather than mine.

Possibly you tried this already, but rather than splitting the revisions into four groups up front, it might make sense to have a queue of revisions to fetch. Then as each thread finishes its work, it can request another revision to fetch. I'm also interested in a queue for results to be uploaded to the database, with a single thread draining the queue and uploading results. I know you tried this before and it didn't help much, but this approach would also allow us to combine inserts into a single transaction, and I'd like to see if that makes any difference.

::: src/updatedb.py
@@ +5,5 @@
>  import datetime
>  import sys
>  import time
> +import threading
> +import Queue

Queue unused below.

@@ +244,5 @@
> +            uploadResults(data, branch, revision, date)
> +        lock.release()
> +
> +
> +def getRange(tid, lenR, nthreads):

Please use len_revisions rather than lenR in this function.
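The queue-of-revisions idea suggested above can be sketched as follows. This is Python 3 spelling (queue rather than Queue); the call to getCSetResults is replaced by a stand-in, and a None sentinel per worker shuts the pool down cleanly.

```python
import queue
import threading

def worker(jobs, results):
    # Pull the next revision as soon as the thread is free, instead of
    # handing each thread a fixed slice of the revision list up front.
    while True:
        revision = jobs.get()
        if revision is None:        # sentinel: no more work
            break
        # Stand-in for getCSetResults(branch, revision).
        results.put((revision, "results-for-%s" % revision))

jobs, results = queue.Queue(), queue.Queue()
threads = [threading.Thread(target=worker, args=(jobs, results))
           for _ in range(4)]
for t in threads:
    t.start()
for rev in ["r1", "r2", "r3", "r4", "r5"]:
    jobs.put(rev)
for _ in threads:
    jobs.put(None)                  # one sentinel per worker
for t in threads:
    t.join()

fetched = sorted(results.get()[0] for _ in range(results.qsize()))
print(fetched)  # → ['r1', 'r2', 'r3', 'r4', 'r5']
```

The results queue could equally be drained by a single dedicated database thread, as the review suggests, which would also let inserts be batched into one transaction.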
All the getCSetResults calls are IO bound, which is why threads should ideally help. Essentially, we are trying to overlap the time needed for each of the IO requests.
(In reply to Miheer from comment #15)
> Dan, thank you for your encouragement. Just to clarify, is it only the
> whitespace before the comment that you are talking about, or do you also
> include the one after the if statement? Also, can you profile your
> "production" instance, so that we can see possible bottlenecks?

Miheer, if you click on [1] you'll see the three spots with extra whitespace indicated. I'm planning to profile the "production" instance today - I have to do some software updates first. Thanks!

[1] https://bugzilla.mozilla.org/page.cgi?id=splinter.html&bug=1029517&attachment=8446088
You can get an idea of the data we're fetching by having a look at TBPL [1]. When someone lands a change, it kicks off a bunch of build and test runs that take varying amounts of time on different platforms. Grey jobs are still in progress, other colours indicate completed runs. Some platforms are slower than others, and for platforms where we run the tests on physical machines, sometimes there is a large queue of pending jobs which means a job might not even start for an hour or two after the change is landed. [1] https://tbpl.mozilla.org/?tree=Mozilla-Inbound
Thanks Dan, I think I understood some parts of what you said. Surprisingly, I cannot extract any information from either of your previous links; neither shows any data. Is something wrong with my browser... :/
Rebased Patch with modifications as requested.
Attachment #8446606 - Attachment is obsolete: true
Hi Miheer, What can you not access? The TBPL link or the link to your patch? Gautam
Comment on attachment 8446694 [details] [diff] [review] Rebased Parallelized getCSetResults Review of attachment 8446694 [details] [diff] [review]: ----------------------------------------------------------------- Thanks, looks great!
Attachment #8446694 - Flags: review+
I could not access either... But I just tried opening them in incognito mode and now it's fine.

Thanks,
Miheer
Hi Miheer, Would you like to take up the parallelization feature additions? I can help you get started on it.
Hi Gautam,

Sure, I would like to do it. However, I am a little busy now and will only be able to start working on it tomorrow.

So the thing that needs to be done is to implement a download queue with a pool of downloaders, right? While we are at it, why restrict it to one branch? We could parallelize getCSetResults for all branches together. What do you think?

Thanks,
Miheer
Yes. That is correct. If we implement the Queue method then we should be able to add the requests and the branch could be a parameter as well. If you have any further questions you can post them here or alternatively mail me at gakiwate[at]mozilla.com
Okay, thanks.
Hi Gautam,

Can you explain why parallelizing cset downloads has improved performance? As far as I can see, it does database updates and downloading concurrently. But is it necessary to parallelize the downloads themselves? Won't you just end up dividing the bandwidth across the parallel downloads?

Regards,
Miheer
(In reply to Miheer from comment #30)

Hi Miheer,

Sure. As far as I understand, the main issue here is latency, not bandwidth. The IO here is the network, and it is dominated by latency. To fully utilize your network link you need to account for two things: bandwidth and latency. The capacity of the link is bandwidth x latency. By making parallel getCSetResults requests we overlap the latency of the requests and make better use of that capacity.

However, what you point out is correct: too many parallel requests will in fact show no significant improvement. This is what happens if you increase the number of threads beyond 4. Presumably, that is the number needed to fill the capacity of the link, and anything beyond it does us no good.

A good way to think of this: instead of one connection making, say, 4 trips back and forth, we get 4 connections to make the trips, so we wait for only 1 trip time instead of 4, and that is where the dominant improvement comes from. Hopefully this puts some of what was done into context. If not, feel free to ask follow-up questions.
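The latency-overlap argument above can be demonstrated with sleeping threads standing in for network requests. The 50 ms latency figure is made up; the point is only that four overlapped waits cost roughly one wait, not four.

```python
import threading
import time

LATENCY = 0.05  # pretend each request has 50 ms of round-trip latency

def fetch(revision, results):
    time.sleep(LATENCY)             # stand-in for the network round trip
    results.append(revision)

results = []
start = time.time()
threads = [threading.Thread(target=fetch, args=(r, results))
           for r in ["r1", "r2", "r3", "r4"]]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start

# Four overlapped "requests" take about one latency, not four.
print(len(results), elapsed < 4 * LATENCY)  # → 4 True
```

Python's GIL does not hurt here because the threads spend their time blocked on IO (here, sleep), not computing.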
Wow, thanks for an interesting explanation... that explains the parallel downloading part. But, should we be concerned about the load it puts on the server? I have read that tbpl server is pretty loaded during certain times of the day.
(In reply to Miheer from comment #32)

That is an excellent question. However, considering that it is up to us when we run the update scripts, we do have sufficient control over it - so my first response is that it shouldn't be a problem. Further, I am not sure how much load the scripts put on the server. Maybe Dan or Joel could elaborate on this? In any case, if we see a degradation, we could take it up for analysis.

Good questions. Keep them coming! :)
At the moment, we're running the update script twice a day, at 1 AM and 1 PM UTC.
Hi I have issued a pull request at https://github.com/dminor/ouija/pull/16. It does the downloading and database operations concurrently. I have added a lot of small changes on my own, so please let me know if I should remove some of them. Thanks, Miheer
Also, there was initially an issue on gakiwate's system; after adding a workaround, it was resolved.
For future reference, this was the error encountered:

INFO:root:Downloading completed
Exception in thread DBHandler (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
  File "updatedb.py", line 120, in run
  File "/usr/lib/python2.7/Queue.py", line 168, in get
  File "/usr/lib/python2.7/threading.py", line 332, in wait
: 'NoneType' object is not callable
Hi Miheer,

While the pull request is under consideration, let us look into whether there are any possible improvements to be had in uploadResults. As is evident from our runs, if we can reduce the run time of uploadResults, we should be able to finish faster in general.
Sorry for the delay in looking at this. The updatedb.py performance has become much worse this week (43 minutes to run updatedb.py with --delta 2, rather than around 7 minutes last week.) I'd like to understand what is causing this before we make any more changes.
Presumably, the baseline still takes 10 minutes? If so, we can rule out a heavier workload this time as opposed to last time. Also, how many threads are we running it with? This is odd, considering that in our runs we consistently see the patch perform better.
Miheer noticed the (embarrassing) fact that we don't have a primary key in the table, which was causing the select statement mentioned in comment 5 to take a very large amount of time in production. With that merged, we only take 5 seconds to run the script.
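The effect of the missing primary key can be reproduced with sqlite3's EXPLAIN QUERY PLAN (used here only because it is self-contained; the production fix was an ALTER TABLE on MySQL). Without a key the lookup from comment 5 is a full table scan; with one it becomes an index search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Without a primary key, every "where id = ?" lookup scans the table.
conn.execute("create table testjobs (id integer, revision text)")
plan_scan = conn.execute(
    "explain query plan select revision from testjobs where id = 1"
).fetchone()[-1]
print("SCAN" in plan_scan)      # → True

# With id as the primary key, the same lookup uses the index.
conn.execute("create table testjobs2 (id integer primary key, revision text)")
plan_search = conn.execute(
    "explain query plan select revision from testjobs2 where id = 1"
).fetchone()[-1]
print("SEARCH" in plan_search)  # → True
```

A scan costs time proportional to the table size on every one of the ~3400 lookups per run, which matches the 88.6% hot spot in the profile from comment 5.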
Miheer and Gautam, thanks for all your work on this!
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
(In reply to Dan Minor [:dminor] from comment #41)
> Miheer noticed the (embarrassing) fact that we don't have a primary key in
> the table, which was causing the select statement mentioned in comment 5 to
> take a very large amount of time in production. With that merged, we only
> take 5 seconds to run the script.

Hey Dan, does schema.sql (https://github.com/dminor/ouija/blob/4939101e0dec12bc76b8101fc9c210d3c48fd78e/sql/schema.sql) need to be updated to reflect choosing a primary key? =)
Flags: needinfo?(dminor)
Thanks, I did an 'alter table' on the production database, but forgot to update schema.sql. Done now. https://github.com/dminor/ouija/commit/b731093e602e7254e7ff9a985a0c512d621de954
Flags: needinfo?(dminor)