777634 - Finish running OPTIMIZE on tbpl_mozilla_org

Blair McBride [:Unfocused] (UNAVAILABLE)

Reporter

Description

•

12 years ago

We can't seem to be ale to access any build summaries or logs (via TBPL), which is kinda a big issue :\

Nick Thomas [:nthomas] (UTC+12)

Comment 1

•

12 years ago

Blair has closed mozilla-central and mozilla-inbound for this, maybe more. eg https://tbpl.mozilla.org/php/getParsedLog.php?id=13864233&tree=Mozilla-Inbound the B for 'Win opt' on https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=eb4f2872a6ca The connection to tbpl is open stays for multiple minutes waiting for a response, until Firefox says 'The connection was reset'. AFAICT the RelEng parts of this are working fine. In http://builddata.pub.build.mozilla.org/buildjson/builds-4hr.js.gz we have { "builder_id": 37827, "buildnumber": 2472, "endtime": 1343283972, "id": 13864233, "master_id": 31, "properties": { "basedir": "/e/builds/moz2_slave/m-in-w32", "branch": "mozilla-inbound", "builddir": "m-in-w32", "buildername": "WINNT 5.2 mozilla-inbound build", "buildid": "20120725222657", "buildnumber": 2472, "builduid": "03cd21a2173c468093d3aeac5379642d", "comments": "Bug 746883; fix the bustage possibly caused by 746896. r=mattwoodrow", "forced_clobber": false, "got_revision": "eb4f2872a6ca", "hashType": "sha512", "log_url": "http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32/1343280417/mozilla-inbound-win32-bm12-build1-build2472.txt.gz", "master": "http://buildbot-master12.build.scl1.mozilla.com:8001/", "periodic_clobber": false, "platform": "win32", "product": "firefox", "project": "", "purge_actual": "57.11GB", "purge_target": "12GB", "purged_clobber": false, "repository": "", "request_ids": [ 13912112 ], "revision": "eb4f2872a6ca9294d642e798b202699eef9443b2", "scheduler": "mozilla-inbound", "slavebuilddir": "m-in-w32", "slavename": "w64-ix-slave12", "stage_platform": "win32", "toolsdir": "e:/builds/moz2_slave/m-in-w32/tools" }, "reason": "scheduler", "request_ids": [ 13912112 ], "requesttime": 1343280417, "result": 2, "slave_id": 1308, "starttime": 1343280432 }, The log_url gets a 200 response and downloads fine. I think you'll need to poke at the tbpl logs to find out what's happening.

Assignee: server-ops-releng → server-ops-devservices

Component: Server Operations: RelEng → Server Operations: Developer Services

QA Contact: arich → shyam

Blair McBride [:Unfocused] (UNAVAILABLE)

Reporter

Comment 2

•

12 years ago

(In reply to Nick Thomas [:nthomas] from comment #1) > Blair has closed mozilla-central and mozilla-inbound for this, maybe more. Also Fx-team.

Shyam Mani [:fox2mike]

Comment 3

•

12 years ago

Looking.

Assignee: server-ops-devservices → shyam

Shyam Mani [:fox2mike]

Comment 4

•

12 years ago

Seems like an ongoing attack of sorts : 10.8.81.216 - - [26/Jul/2012:01:48:20 -0700] "GET /php/getRevisionBuilds.php?branch=try&rev=8c57277452b9&_=1343292500026 HTTP/1.1" 200 1196 "https://tbpl.mozilla.org/?tree=Try" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0" 10.8.81.216 - - [26/Jul/2012:01:48:20 -0700] "GET /php/getRevisionBuilds.php?branch=try&rev=9702de67ae1e&_=1343292500030 HTTP/1.1" 200 3022 "https://tbpl.mozilla.org/?tree=Try" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0" 10.8.81.216 - - [26/Jul/2012:01:48:20 -0700] "GET /php/getRevisionBuilds.php?branch=try&rev=b6b254f6c496&_=1343292500028 HTTP/1.1" 200 34265 "https://tbpl.mozilla.org/?tree=Try" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0" and top hits : [root@pp-zlb09.phx ~]# grep tbpl /var/log/zeus/generic.access_2012-07-26-01 | awk {'print $1'} | sort -n | uniq -c | sort -nr 7637 88.114.144.236 7525 58.85.248.221 4586 71.58.64.76 4059 107.10.247.225 3429 63.245.220.240 2558 174.0.224.76 1742 66.207.208.98 1595 63.235.13.3 1083 99.115.82.101

Shyam Mani [:fox2mike]

Comment 5

•

12 years ago

OTOH, there's also this : [shyam@natasha ~]$ curl -H "Host: tbpl.mozilla.org" -v "http://generic1.webapp.phx1.mozilla.com:81/php/getParsedLog.php?id=13864233&tree=Mozilla-Inbound" * About to connect() to generic1.webapp.phx1.mozilla.com port 81 * Trying 10.8.81.91... connected * Connected to generic1.webapp.phx1.mozilla.com (10.8.81.91) port 81 > GET /php/getParsedLog.php?id=13864233&tree=Mozilla-Inbound HTTP/1.1 > User-Agent: curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5 > Accept: */* > Host: tbpl.mozilla.org > < HTTP/1.1 200 OK < Date: Thu, 26 Jul 2012 08:53:52 GMT < Server: Apache < X-Backend-Server: generic1.webapp.phx1.mozilla.com < Strict-Transport-Security: max-age=15768000 ; includeSubDomains < Access-Control-Allow-Origin: * < Cache-Control: max-age=900 < Expires: Thu, 26 Jul 2012 09:08:52 GMT < Content-Length: 63 < Content-Type: text/html, charset=utf-8 Connection #0 to host generic1.webapp.phx1.mozilla.com left intact * Closing connection #0 SQLSTATE[HY000]: General error: 2006 MySQL server has gone away The DB seems to have a ton of queries that are locking a table, investigating further.

Shyam Mani [:fox2mike]

Comment 6

•

12 years ago

Attached file show full processlist from generic2.db.phx1 (deleted) — Details

generic1.db.phx1 is disabled in Zeus, generic2 is locked up as shown above. This would probably be the primary cause of tbpl misbehaving at this time.

Ed Morley [:emorley]

Comment 7

•

12 years ago

Generic1 was disabled in bug 703967 comment 39 aiui, if that helps at all?

Shyam Mani [:fox2mike]

Comment 8

•

12 years ago

(In reply to Ed Morley [:edmorley] from comment #7) > Generic1 was disabled in bug 703967 comment 39 aiui, if that helps at all? It does, I wonder if the master/slave relationship was changed. If not, generic2 is probably replicating all the changes made to generic1, but then generic2 wouldn't be r/w... I've paged the DBAs. Sheeri is probably on PTO, but mpressman should be around. This is clearly a DB related issue that's causing tbpl some pain.

Assignee: shyam → server-ops-database

Component: Server Operations: Developer Services → Server Operations: Database

QA Contact: shyam → cshields

Ed Morley [:emorley]

Updated

•

12 years ago

Depends on: 703967

Aryeh Gregor (:ayg) (no longer with Mozilla)

Comment 9

•

12 years ago

Shyam Mani [:fox2mike]

Comment 10

•

12 years ago

From IRC : [02:16:51] <AryehGregor> | fox2mike, | 61367 | system user | | tbpl_mozilla_org | Connect | 53247 | copy to tmp table | OPTIMIZE TABLE runs_logs | [02:17:09] * | Optimizer fox2mike: Carry on soldier. Optimizer out! [02:17:10] < AryehGregor> | For InnoDB that's equivalent to ALTER TABLE. Which takes out a write lock on the whole table. [02:17:21] < AryehGregor> | For as long as it takes to rebuild the entire table and its indexes from scratch. [02:17:26] < AryehGregor> | So, um, find what's doing that and stop it. :) [02:17:34] < fox2mike> | system user [02:17:41] < AryehGregor> | Possibly a cron job or something. [02:17:42] < fox2mike> | isnt' that internal mysql? [02:17:53] < AryehGregor> | Something has to be running the query, MySQL should not be running that itself. [02:18:02] < AryehGregor> | Although hmm, you're right that it's odd. [02:18:07] < fox2mike> | yeah [02:18:11] < fox2mike> | if something else is [02:18:13] < Unfocused> | Optimizer: that code doesn't look wrong, fwiw [02:18:15] < fox2mike> | we'd have that user. [02:18:20] < fox2mike> | not system user [02:20:14] <AryehGregor> | fox2mike, as a workaround, can you try killing the OPTIMIZE TABLE thread [02:20:15] < AryehGregor> | ? [02:27:30] <AryehGregor> | fox2mike, you won't corrupt it. OPTIMIZE on InnoDB is the same as ALTER, which works by locking the table, making a copy, and then atomically overwriting the original table with the copy. If you kill it, it will just delete the copy and unlock the original. Therefore I ran mysql> kill 61367; And things are coming back to normal. I'm leaving this open to see if we need to re-run this optimize or to see what caused this. Lowering prio.

Severity: critical → major

No longer depends on: 703967

Aryeh Gregor (:ayg) (no longer with Mozilla)

Comment 11

•

12 years ago

If you do need to optimize things, I suggest you run OPTIMIZE LOCAL TABLE instead of regular OPTIMIZE TABLE, so it's not replicated. I was told the setup is master-master, so you can optimize the passive master while the active master handles all load, then switch the masters and optimize the other one. The fact that this was running as "system user" indicates (thanks, #mysql!) that it was replicated -- which was possibly unintentional, but not very good for performance either way.

Ed Morley [:emorley]

Updated

•

12 years ago

Blocks: 703967

Matt Pressman [:mpressman]

Assignee

Comment 12

•

12 years ago

Shyam is correct that the setup has changed from a master-master to a master-slave, although replication is still enabled on generic1, but in a state where it is stopped on generic1 due to an error. It also appears that read_only is on for generic1 making sure that writes cannot be done on generic1 basically presenting itself as a master-slave.

Matt Pressman [:mpressman]

Assignee

Comment 13

•

12 years ago

Also, comment 10 is correct in that a full table lock is acquired during an optimize/alter table setup. The table tbpl_mozilla_org.runs_logs is ~50GB on disk which is why it was taking so long.

Justin Dolske [:Dolske]

Comment 14

•

12 years ago

Is this still being actively worked? I'm _sometimes_ able to get summary info on TBPL, but opening a full log always fails with "Timeout generating log." This is pretty terrible. :(

Ed Morley [:emorley]

Comment 15

•

12 years ago

(In reply to Justin Dolske [:Dolske] from comment #14) > Is this still being actively worked? I'm _sometimes_ able to get summary > info on TBPL, but opening a full log always fails with "Timeout generating > log." This is pretty terrible. :( Is this always on the same log? As I added to the TryServer TBPL status message, "Timeout generating log." is unfortunately cached in the DB, so logs imported at the time of the issue won't recover. You can either access the logs directly (link in TBPL status message) or retrigger.

Ed Morley [:emorley]

Comment 16

•

12 years ago

Just tested opening a variety of full logs via TBPL and wfm, so suspect you are indeed trying a subset of the logs affected at the time of the issue.

Justin Dolske [:Dolske]

Comment 17

•

12 years ago

Ah, I read that as meaning it was locally-cached and a shift-reload should clear it up!

Matt Pressman [:mpressman]

Assignee

Comment 18

•

12 years ago

just got an alert that this - if the table tbpl_mozilla_org.runs_logs needs to be *optimized* - I can tune the host to prioritize on that and it should run a little faster. Additionally, we are using a little bit of swap due to assigning all the system memory to the innodb_buffer_pool. This will require a daemon restart. We are going to want to change the tuning options anyway so when would be a good time to perform the restart and the maintenance?

Assignee: server-ops-database → mpressman

Shyam Mani [:fox2mike]

Comment 19

•

12 years ago

John/Ravi, Going to tag this along in this downtime, we'd like to finish running the killed optimize and Matt wants to adjust some more tuning options too.

Flags: needs-treeclosure?

Summary: Unable to access build summaries and logs → Finish running OPTIMIZE on tbpl_mozilla_org

Ravi Pina [:ravi]

Comment 20

•

12 years ago

Please update https://infra.etherpad.mozilla.org/81

Shyam Mani [:fox2mike]

Comment 21

•

12 years ago

(In reply to Ravi Pina [:ravi] from comment #20) > Please update https://infra.etherpad.mozilla.org/81 I already have (after I made comment #19).

Matt Pressman [:mpressman]

Assignee

Comment 22

•

12 years ago

tuning changes made to host and puppet

Matt Pressman [:mpressman]

Assignee

Comment 23

•

12 years ago

alter is finished tbpl_mozilla_org.runs_logs is down from ~50GB to ~44GB. Additionally, innodb_buffer_pool was reduced from 12GB to 8GB.

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

10 years ago

Product: mozilla.org → Data & BI Services Team