Closed
Bug 777634
Opened 12 years ago
Closed 12 years ago
Finish running OPTIMIZE on tbpl_mozilla_org
Categories
(Data & BI Services Team :: DB: MySQL, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: Unfocused, Assigned: mpressman)
References
Details
Attachments
(1 file)
(deleted),
text/plain
|
Details |
We can't seem to be ale to access any build summaries or logs (via TBPL), which is kinda a big issue :\
Comment 1•12 years ago
|
||
Blair has closed mozilla-central and mozilla-inbound for this, maybe more.
eg https://tbpl.mozilla.org/php/getParsedLog.php?id=13864233&tree=Mozilla-Inbound
the B for 'Win opt' on https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=eb4f2872a6ca
The connection to tbpl is open stays for multiple minutes waiting for a response, until Firefox says 'The connection was reset'.
AFAICT the RelEng parts of this are working fine. In http://builddata.pub.build.mozilla.org/buildjson/builds-4hr.js.gz we have
{
"builder_id": 37827,
"buildnumber": 2472,
"endtime": 1343283972,
"id": 13864233,
"master_id": 31,
"properties": {
"basedir": "/e/builds/moz2_slave/m-in-w32",
"branch": "mozilla-inbound",
"builddir": "m-in-w32",
"buildername": "WINNT 5.2 mozilla-inbound build",
"buildid": "20120725222657",
"buildnumber": 2472,
"builduid": "03cd21a2173c468093d3aeac5379642d",
"comments": "Bug 746883; fix the bustage possibly caused by 746896. r=mattwoodrow",
"forced_clobber": false,
"got_revision": "eb4f2872a6ca",
"hashType": "sha512",
"log_url": "http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32/1343280417/mozilla-inbound-win32-bm12-build1-build2472.txt.gz",
"master": "http://buildbot-master12.build.scl1.mozilla.com:8001/",
"periodic_clobber": false,
"platform": "win32",
"product": "firefox",
"project": "",
"purge_actual": "57.11GB",
"purge_target": "12GB",
"purged_clobber": false,
"repository": "",
"request_ids": [
13912112
],
"revision": "eb4f2872a6ca9294d642e798b202699eef9443b2",
"scheduler": "mozilla-inbound",
"slavebuilddir": "m-in-w32",
"slavename": "w64-ix-slave12",
"stage_platform": "win32",
"toolsdir": "e:/builds/moz2_slave/m-in-w32/tools"
},
"reason": "scheduler",
"request_ids": [
13912112
],
"requesttime": 1343280417,
"result": 2,
"slave_id": 1308,
"starttime": 1343280432
},
The log_url gets a 200 response and downloads fine. I think you'll need to poke at the tbpl logs to find out what's happening.
Assignee: server-ops-releng → server-ops-devservices
Component: Server Operations: RelEng → Server Operations: Developer Services
QA Contact: arich → shyam
Reporter | ||
Comment 2•12 years ago
|
||
(In reply to Nick Thomas [:nthomas] from comment #1)
> Blair has closed mozilla-central and mozilla-inbound for this, maybe more.
Also Fx-team.
Comment 4•12 years ago
|
||
Seems like an ongoing attack of sorts :
10.8.81.216 - - [26/Jul/2012:01:48:20 -0700] "GET /php/getRevisionBuilds.php?branch=try&rev=8c57277452b9&_=1343292500026 HTTP/1.1" 200 1196 "https://tbpl.mozilla.org/?tree=Try" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0"
10.8.81.216 - - [26/Jul/2012:01:48:20 -0700] "GET /php/getRevisionBuilds.php?branch=try&rev=9702de67ae1e&_=1343292500030 HTTP/1.1" 200 3022 "https://tbpl.mozilla.org/?tree=Try" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0"
10.8.81.216 - - [26/Jul/2012:01:48:20 -0700] "GET /php/getRevisionBuilds.php?branch=try&rev=b6b254f6c496&_=1343292500028 HTTP/1.1" 200 34265 "https://tbpl.mozilla.org/?tree=Try" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0"
and top hits :
[root@pp-zlb09.phx ~]# grep tbpl /var/log/zeus/generic.access_2012-07-26-01 | awk {'print $1'} | sort -n | uniq -c | sort -nr
7637 88.114.144.236
7525 58.85.248.221
4586 71.58.64.76
4059 107.10.247.225
3429 63.245.220.240
2558 174.0.224.76
1742 66.207.208.98
1595 63.235.13.3
1083 99.115.82.101
Comment 5•12 years ago
|
||
OTOH, there's also this :
[shyam@natasha ~]$ curl -H "Host: tbpl.mozilla.org" -v "http://generic1.webapp.phx1.mozilla.com:81/php/getParsedLog.php?id=13864233&tree=Mozilla-Inbound"
* About to connect() to generic1.webapp.phx1.mozilla.com port 81
* Trying 10.8.81.91... connected
* Connected to generic1.webapp.phx1.mozilla.com (10.8.81.91) port 81
> GET /php/getParsedLog.php?id=13864233&tree=Mozilla-Inbound HTTP/1.1
> User-Agent: curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5
> Accept: */*
> Host: tbpl.mozilla.org
>
< HTTP/1.1 200 OK
< Date: Thu, 26 Jul 2012 08:53:52 GMT
< Server: Apache
< X-Backend-Server: generic1.webapp.phx1.mozilla.com
< Strict-Transport-Security: max-age=15768000 ; includeSubDomains
< Access-Control-Allow-Origin: *
< Cache-Control: max-age=900
< Expires: Thu, 26 Jul 2012 09:08:52 GMT
< Content-Length: 63
< Content-Type: text/html, charset=utf-8
Connection #0 to host generic1.webapp.phx1.mozilla.com left intact
* Closing connection #0
SQLSTATE[HY000]: General error: 2006 MySQL server has gone away
The DB seems to have a ton of queries that are locking a table, investigating further.
Comment 6•12 years ago
|
||
generic1.db.phx1 is disabled in Zeus, generic2 is locked up as shown above. This would probably be the primary cause of tbpl misbehaving at this time.
Comment 7•12 years ago
|
||
Generic1 was disabled in bug 703967 comment 39 aiui, if that helps at all?
Comment 8•12 years ago
|
||
(In reply to Ed Morley [:edmorley] from comment #7)
> Generic1 was disabled in bug 703967 comment 39 aiui, if that helps at all?
It does, I wonder if the master/slave relationship was changed. If not, generic2 is probably replicating all the changes made to generic1, but then generic2 wouldn't be r/w...
I've paged the DBAs. Sheeri is probably on PTO, but mpressman should be around. This is clearly a DB related issue that's causing tbpl some pain.
Assignee: shyam → server-ops-database
Component: Server Operations: Developer Services → Server Operations: Database
QA Contact: shyam → cshields
Comment 9•12 years ago
|
||
As noted on IRC, the culprit is:
| 61367 | system user | | tbpl_mozilla_org | Connect | 53247 | copy to tmp table | OPTIMIZE TABLE runs_logs |
Which has been running for 53247s, i.e., the better part of a day. And which has locked the whole table to writes for that period. Killing that statement should unlock everything. (Shyam said he just killed it.)
Comment 10•12 years ago
|
||
From IRC :
[02:16:51] <AryehGregor> | fox2mike, | 61367 | system user | | tbpl_mozilla_org | Connect | 53247 | copy to tmp table | OPTIMIZE TABLE runs_logs
|
[02:17:09] * | Optimizer fox2mike: Carry on soldier. Optimizer out!
[02:17:10] < AryehGregor> | For InnoDB that's equivalent to ALTER TABLE. Which takes out a write lock on the whole table.
[02:17:21] < AryehGregor> | For as long as it takes to rebuild the entire table and its indexes from scratch.
[02:17:26] < AryehGregor> | So, um, find what's doing that and stop it. :)
[02:17:34] < fox2mike> | system user
[02:17:41] < AryehGregor> | Possibly a cron job or something.
[02:17:42] < fox2mike> | isnt' that internal mysql?
[02:17:53] < AryehGregor> | Something has to be running the query, MySQL should not be running that itself.
[02:18:02] < AryehGregor> | Although hmm, you're right that it's odd.
[02:18:07] < fox2mike> | yeah
[02:18:11] < fox2mike> | if something else is
[02:18:13] < Unfocused> | Optimizer: that code doesn't look wrong, fwiw
[02:18:15] < fox2mike> | we'd have that user.
[02:18:20] < fox2mike> | not system user
[02:20:14] <AryehGregor> | fox2mike, as a workaround, can you try killing the OPTIMIZE TABLE thread
[02:20:15] < AryehGregor> | ?
[02:27:30] <AryehGregor> | fox2mike, you won't corrupt it. OPTIMIZE on InnoDB is the same as ALTER, which works by locking the table, making a copy, and then atomically overwriting the original table with the copy. If you kill it, it
will just delete the copy and unlock the original.
Therefore I ran
mysql> kill 61367;
And things are coming back to normal.
I'm leaving this open to see if we need to re-run this optimize or to see what caused this. Lowering prio.
Severity: critical → major
No longer depends on: 703967
Comment 11•12 years ago
|
||
If you do need to optimize things, I suggest you run OPTIMIZE LOCAL TABLE instead of regular OPTIMIZE TABLE, so it's not replicated. I was told the setup is master-master, so you can optimize the passive master while the active master handles all load, then switch the masters and optimize the other one. The fact that this was running as "system user" indicates (thanks, #mysql!) that it was replicated -- which was possibly unintentional, but not very good for performance either way.
Assignee | ||
Comment 12•12 years ago
|
||
Shyam is correct that the setup has changed from a master-master to a master-slave, although replication is still enabled on generic1, but in a state where it is stopped on generic1 due to an error. It also appears that read_only is on for generic1 making sure that writes cannot be done on generic1 basically presenting itself as a master-slave.
Assignee | ||
Comment 13•12 years ago
|
||
Also, comment 10 is correct in that a full table lock is acquired during an optimize/alter table setup. The table tbpl_mozilla_org.runs_logs is ~50GB on disk which is why it was taking so long.
Comment 14•12 years ago
|
||
Is this still being actively worked? I'm _sometimes_ able to get summary info on TBPL, but opening a full log always fails with "Timeout generating log." This is pretty terrible. :(
Comment 15•12 years ago
|
||
(In reply to Justin Dolske [:Dolske] from comment #14)
> Is this still being actively worked? I'm _sometimes_ able to get summary
> info on TBPL, but opening a full log always fails with "Timeout generating
> log." This is pretty terrible. :(
Is this always on the same log?
As I added to the TryServer TBPL status message, "Timeout generating log." is unfortunately cached in the DB, so logs imported at the time of the issue won't recover. You can either access the logs directly (link in TBPL status message) or retrigger.
Comment 16•12 years ago
|
||
Just tested opening a variety of full logs via TBPL and wfm, so suspect you are indeed trying a subset of the logs affected at the time of the issue.
Comment 17•12 years ago
|
||
Ah, I read that as meaning it was locally-cached and a shift-reload should clear it up!
Assignee | ||
Comment 18•12 years ago
|
||
just got an alert that this - if the table tbpl_mozilla_org.runs_logs needs to be *optimized* - I can tune the host to prioritize on that and it should run a little faster. Additionally, we are using a little bit of swap due to assigning all the system memory to the innodb_buffer_pool. This will require a daemon restart. We are going to want to change the tuning options anyway so when would be a good time to perform the restart and the maintenance?
Assignee: server-ops-database → mpressman
Comment 19•12 years ago
|
||
John/Ravi,
Going to tag this along in this downtime, we'd like to finish running the killed optimize and Matt wants to adjust some more tuning options too.
Flags: needs-treeclosure?
Summary: Unable to access build summaries and logs → Finish running OPTIMIZE on tbpl_mozilla_org
Comment 20•12 years ago
|
||
Please update https://infra.etherpad.mozilla.org/81
Comment 21•12 years ago
|
||
(In reply to Ravi Pina [:ravi] from comment #20)
> Please update https://infra.etherpad.mozilla.org/81
I already have (after I made comment #19).
Assignee | ||
Comment 22•12 years ago
|
||
tuning changes made to host and puppet
Assignee | ||
Comment 23•12 years ago
|
||
alter is finished tbpl_mozilla_org.runs_logs is down from ~50GB to ~44GB. Additionally, innodb_buffer_pool was reduced from 12GB to 8GB.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated•10 years ago
|
Product: mozilla.org → Data & BI Services Team
You need to log in
before you can comment on or make changes to this bug.
Description
•