Closed
Bug 897903
Opened 11 years ago
Closed 11 years ago
https://tbpl.mozilla.org/php/getRevisionBuilds.php calls taking longer than normal
Categories
(Infrastructure & Operations Graveyard :: WebOps: Other, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: emorley, Unassigned)
Details
Attachments
(4 files)
Over the last week or so, we've been noticing that both the initial TBPL (tbpl.mozilla.org) load and the subsequent 2-minute data refreshes have been taking much longer to complete (not all the time; it seems to come in spells). The long pole is the calls to getRevisionBuilds.php, e.g.:
https://tbpl.mozilla.org/php/getRevisionBuilds.php?branch=mozilla-inbound&rev=7c78c3a98313&_=1374744902454
This file hasn't changed recently:
https://hg.mozilla.org/webtools/tbpl/log/681fc36ecc5e/php/getRevisionBuilds.php
A few questions:
a) Has load on the webheads been higher in the last week or two at certain times throughout the day?
b) Has load on the DB been higher?
c) Is the periodic DB purge/cleanup still working as expected? (bug 779290). Wondering if the table sizes have grown.
d) Can anyone think of any optimisations either to the DB queries or to our php/apache setup?
Reporter
Comment 1•11 years ago
(In reply to Ed Morley [:edmorley UTC+1] from comment #0)
> d) Can anyone think of any optimisations either to the DB queries or to our
> php/apache setup?
Hmm, it would seem we don't have gzip compression enabled for either the PHP responses or the static content :-/
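For reference, a minimal sketch of the kind of Apache change that would enable it, assuming mod_deflate is available on the webheads (illustrative only, not an actual patch):
# Sketch only - assumes mod_deflate is loaded on the TBPL webheads.
# Compress the PHP/JSON responses and the text-based static assets.
AddOutputFilterByType DEFLATE text/html text/css text/plain
AddOutputFilterByType DEFLATE application/javascript application/json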
Reporter
Comment 2•11 years ago
(In reply to Ed Morley [:edmorley UTC+1] from comment #0)
> Few questions:
> a) Has load on the webheads been higher in the last week or two at certain
> times throughout the day?
> b) Has load on the DB been higher?
> c) Is the periodic DB purge/cleanup still working as expected? (bug 779290).
> Wondering if the table sizes have grown.
> d) Can anyone think of any optimisations either to the DB queries or to our
> php/apache setup?
Does anyone have any ideas? :-)
Comment 3•11 years ago
Seems it's getting worse, and it's affecting sheriff work since the tool is very slow.
Severity: normal → major
Comment 4•11 years ago
So, I've done some clicking around (based on what tomcat told me on IRC) and I can't reproduce this.
Is there a URL that's consistently slow?
Severity: major → normal
Reporter
Comment 5•11 years ago
Thank you for taking a look! :-)
(In reply to Shyam Mani [:fox2mike] from comment #4)
> Is there a URL that's consistently slow?
The first one in comment 0 was taking 20s+ the other day; it's down to 8s right now, but per comment 0 this seems to fluctuate during the day.
Comment 6•11 years ago
Also, I'm on Aurora 24.0a2 (2013-07-28) so if you can reproduce this with Aurora or Stable, that would be nice too :) With a list of steps, so I can see if it's something specific and poke around in the logs.
Comment 7•11 years ago
(In reply to Ed Morley [:edmorley UTC+1] from comment #5)
> First one in comment 0 was taking 20s+ the other day, is down to 8s right
> now, but per comment 0 this seems to fluctuate during the day.
Again, less than that on Aurora, and even less with curl:
* Connection #0 to host tbpl.mozilla.org left intact
* Closing connection #0
real 0m1.765s
user 0m0.040s
sys 0m0.065s
Comment 8•11 years ago
So I'm convinced this is a Nightly issue:
Lifting the server siege... done.
Transactions: 406 hits
Availability: 100.00 %
Elapsed time: 59.42 secs
Data transferred: 72.32 MB
Response time: 2.14 secs
Transaction rate: 6.83 trans/sec
Throughput: 1.22 MB/sec
Concurrency: 14.64
Successful transactions: 406
Failed transactions: 0
Longest transaction: 7.06
Shortest transaction: 0.97
This was on https://tbpl.mozilla.org/php/getRevisionBuilds.php?branch=mozilla-inbound&rev=7c78c3a98313&_=1374744902454 and so, I can't reproduce the issue. If you can, let us know, else we can close the bug.
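(For reference, the siege run above was along these lines; the concurrency and duration here are inferred from the report rather than copied exactly:)
siege -c 15 -t 60S "https://tbpl.mozilla.org/php/getRevisionBuilds.php?branch=mozilla-inbound&rev=7c78c3a98313&_=1374744902454"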
Reporter
Comment 9•11 years ago
I see no difference compared to Chrome fwiw
Reporter
Comment 10•11 years ago
(In reply to Ed Morley [:edmorley UTC+1] from comment #9)
> I see no difference compared to Chrome fwiw
(Using web console timings to compare)
Reporter
Comment 11•11 years ago
Pasting this here so I don't lose it...
14:47:27 - glob: edmorley, i was curious about the tbpl slowness
14:48:07 - edmorley: glob: yeah?
14:48:14 - glob: edmorley, i see 18s load times from here, but from people at the same time, 0.4s
14:48:26 - edmorley: ah
14:48:32 - glob: edmorley, from my co-lo box (in .au), 8.1s
14:48:50 - glob: but, rather slow .. 22k/s
14:51:21 - glob: edmorley, as a comparison, i get ~ 150k/s to the same colo server from bmo
14:52:01 - glob: edmorley, and to people, 475k/s
14:52:11 - glob: (that's tbpl --> people)
Comment 12•11 years ago
CC'ing some netops folks.
Comment 13•11 years ago
From my experience today:
It doesn't matter which browser (Nightly/Aurora).
Not much traffic from checkins or other sheriffs/devs around, like my morning (European time) -> tbpl is fast.
Much traffic once US Pacific Time wakes up -> slow.
Comment 14•11 years ago
Agreed 100% that performance strongly correlates with time of day.
Comment 15•11 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #14)
> Agreed 100% that performance strongly correlates with time of day.
What would you say are the slowest times?
Comment 16•11 years ago
(In reply to Shyam Mani [:fox2mike] from comment #15)
> (In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #14)
> > Agreed 100% that performance strongly correlates with time of day.
>
> What would you say are the slowest times?
For Pacific Time I would say 0:00 until about 6am is OK, and from there it goes downhill/gets slow.
Reporter
Comment 17•11 years ago
Is exceptionally slow for me right now...
{
[14:52:50.506] GET https://tbpl.mozilla.org/?tree=B2g-Inbound [HTTP/1.1 200 OK 1076ms]
--
[14:52:58.999] GET https://tbpl.mozilla.org/css/style.css [HTTP/1.1 200 OK 1922ms]
[14:52:59.000] GET https://tbpl.mozilla.org/js/jquery.js [HTTP/1.1 200 OK 10852ms]
[14:52:59.001] GET https://tbpl.mozilla.org/js/jquery-ui-drag.js [HTTP/1.1 200 OK 11156ms]
[14:52:59.001] GET https://tbpl.mozilla.org/js/NetUtils.js [HTTP/1.1 200 OK 11304ms]
[14:52:59.002] GET https://tbpl.mozilla.org/js/PushlogJSONParser.js [HTTP/1.1 200 OK 11454ms]
[14:52:59.002] GET https://tbpl.mozilla.org/js/MachineResult.js [HTTP/1.1 200 OK 11603ms]
[14:52:59.003] GET https://tbpl.mozilla.org/js/BuildbotDBUser.js [HTTP/1.1 200 OK 11752ms]
[14:52:59.003] GET https://tbpl.mozilla.org/js/BuildAPI.js [HTTP/1.1 200 OK 11900ms]
[14:52:59.004] GET https://tbpl.mozilla.org/js/AddCommentUI.js [HTTP/1.1 200 OK 12601ms]
[14:52:59.004] GET https://tbpl.mozilla.org/js/HiddenBuildsAdminUI.js [HTTP/1.1 200 OK 12890ms]
[14:52:59.005] GET https://tbpl.mozilla.org/js/Config.js [HTTP/1.1 200 OK 13038ms]
[14:52:59.006] GET https://tbpl.mozilla.org/js/MVTime.js [HTTP/1.1 200 OK 13489ms]
[14:52:59.007] GET https://tbpl.mozilla.org/js/Data.js [HTTP/1.1 200 OK 13831ms]
[14:52:59.007] GET https://tbpl.mozilla.org/js/SummaryLoader.js [HTTP/1.1 200 OK 15289ms]
[14:52:59.008] GET https://tbpl.mozilla.org/js/UserInterface.js [HTTP/1.1 200 OK 17394ms]
[14:52:59.008] GET https://tbpl.mozilla.org/js/LoadTracker.js [HTTP/1.1 200 OK 14434ms]
[14:52:59.009] GET https://tbpl.mozilla.org/js/Controller.js [HTTP/1.1 200 OK 14578ms]
[14:52:59.009] GET https://tbpl.mozilla.org/js/utils.js [HTTP/1.1 200 OK 14726ms]
--
[14:53:16.542] GET https://hg.mozilla.org/integration/b2g-inbound/json-pushes?full=1&maxhours=24 [HTTP/1.1 200 Script output follows 2348ms]
[14:53:16.543] GET https://tbpl.mozilla.org/php/getHiddenBuilderNames.php?branch=b2g-inbound [HTTP/1.1 200 OK 7001ms]
[14:53:16.543] GET https://secure.pub.build.mozilla.org/builddata/buildjson/builds-pending.js [HTTP/1.1 200 OK 531ms]
[14:53:16.544] GET https://secure.pub.build.mozilla.org/builddata/buildjson/builds-running.js [HTTP/1.1 200 OK 1025ms]
[14:53:16.544] GET https://treestatus.mozilla.org/b2g-inbound?format=json [HTTP/1.1 200 OK 542ms]
[14:53:16.545] GET https://tbpl.mozilla.org/images/loading_16.png [HTTP/1.1 200 OK 442ms]
[14:53:18.851] GET https://tbpl.mozilla.org/php/getRevisionBuilds.php?branch=b2g-inbound&rev=8a4c07bae3e8&_=1375365198762 [HTTP/1.1 200 OK 5341ms]
[14:53:18.852] GET https://tbpl.mozilla.org/images/tango-updown.png [HTTP/1.1 200 OK 1125ms]
[14:53:18.852] GET https://api-dev.bugzilla.mozilla.org/latest/bug?include_fields=status,summary&id=899711 [HTTP/1.1 200 OK 981ms]
[14:53:18.853] GET https://api-dev.bugzilla.mozilla.org/latest/bug?include_fields=status,summary&id=870941 [HTTP/1.1 200 OK 944ms]
}
Even the static assets are taking an age.
Reporter
Comment 18•11 years ago
>ping tbpl.mozilla.org -t
Pinging generic.zlb.phx.mozilla.net [63.245.217.86] with 32 bytes of data:
Reply from 63.245.217.86: bytes=32 time=151ms TTL=52
Reply from 63.245.217.86: bytes=32 time=148ms TTL=52
Reply from 63.245.217.86: bytes=32 time=146ms TTL=52
Reply from 63.245.217.86: bytes=32 time=153ms TTL=52
Request timed out.
Reply from 63.245.217.86: bytes=32 time=146ms TTL=52
Reply from 63.245.217.86: bytes=32 time=147ms TTL=52
Request timed out.
Reply from 63.245.217.86: bytes=32 time=147ms TTL=52
Reply from 63.245.217.86: bytes=32 time=147ms TTL=52
Reply from 63.245.217.86: bytes=32 time=147ms TTL=52
Reply from 63.245.217.86: bytes=32 time=147ms TTL=52
Reply from 63.245.217.86: bytes=32 time=148ms TTL=52
Request timed out.
Reply from 63.245.217.86: bytes=32 time=148ms TTL=52
Reply from 63.245.217.86: bytes=32 time=147ms TTL=52
Reply from 63.245.217.86: bytes=32 time=147ms TTL=52
Reply from 63.245.217.86: bytes=32 time=146ms TTL=52
Request timed out.
Request timed out.
Reply from 63.245.217.86: bytes=32 time=148ms TTL=52
Reply from 63.245.217.86: bytes=32 time=147ms TTL=52
Reply from 63.245.217.86: bytes=32 time=148ms TTL=52
Reply from 63.245.217.86: bytes=32 time=148ms TTL=52
Reply from 63.245.217.86: bytes=32 time=147ms TTL=52
Reply from 63.245.217.86: bytes=32 time=148ms TTL=52
Reply from 63.245.217.86: bytes=32 time=147ms TTL=52
Reply from 63.245.217.86: bytes=32 time=149ms TTL=52
Reply from 63.245.217.86: bytes=32 time=146ms TTL=52
Request timed out.
Reply from 63.245.217.86: bytes=32 time=147ms TTL=52
Request timed out.
Reply from 63.245.217.86: bytes=32 time=148ms TTL=52
Reply from 63.245.217.86: bytes=32 time=148ms TTL=52
Reply from 63.245.217.86: bytes=32 time=146ms TTL=52
Reply from 63.245.217.86: bytes=32 time=147ms TTL=52
Request timed out.
Request timed out.
Reply from 63.245.217.86: bytes=32 time=146ms TTL=52
Reply from 63.245.217.86: bytes=32 time=146ms TTL=52
Reply from 63.245.217.86: bytes=32 time=146ms TTL=52
Request timed out.
Ping statistics for 63.245.217.86:
Packets: Sent = 42, Received = 32, Lost = 10 (23% loss),
Approximate round trip times in milli-seconds:
Minimum = 146ms, Maximum = 153ms, Average = 147ms
====
>tracert tbpl.mozilla.org
Tracing route to generic.zlb.phx.mozilla.net [63.245.217.86]
over a maximum of 30 hops:
1 1 ms <1 ms <1 ms homeportal [192.168.50.1]
2 7 ms 7 ms 8 ms 10.228.244.1
3 9 ms 8 ms 7 ms brnt-core-2a-ae3-1509.network.virginmedia.net [80.2.147.93
4 16 ms 7 ms 8 ms brnt-bb-1a-ae12-0.network.virginmedia.net [62.253.174.53]
5 8 ms 8 ms 9 ms brnt-bb-1c-ae0-0.network.virginmedia.net [62.253.174.30]
6 * 13 ms 11 ms brhm-bb-1c-ae8-0.network.virginmedia.net [62.253.174.126]
7 16 ms 15 ms 20 ms 0.ae1.mpr1.lhr1.uk.above.net [213.161.65.149]
8 18 ms 18 ms 20 ms ge-3-0-0.mpr1.lhr2.uk.above.net [64.125.28.126]
9 92 ms 93 ms 91 ms so-1-1-0.mpr1.dca2.us.above.net [64.125.31.186]
10 93 ms 123 ms 90 ms xe-0-3-0.cr1.dca2.us.above.net [64.125.29.17]
11 129 ms 121 ms 119 ms xe-2-2-0.cr1.iah1.us.above.net [64.125.29.37]
12 140 ms 120 ms 118 ms xe-0-0-0.cr2.iah1.us.above.net [64.125.30.66]
13 142 ms 142 ms 142 ms xe-0-1-0.mpr4.phx2.us.above.net [64.125.28.73]
14 165 ms 144 ms 144 ms 64.124.201.182.mozilla.com [64.124.201.182]
15 * * * Request timed out.
16 148 ms 148 ms 148 ms generic.zlb.phx.mozilla.net [63.245.217.86]
Trace complete.
Comment 19•11 years ago
I can't duplicate that loss from home or a linode box.
--- 63.245.217.86 ping statistics ---
44 packets transmitted, 44 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 75.362/78.103/84.628/2.207 ms
--- 63.245.217.86 ping statistics ---
67 packets transmitted, 67 received, 0% packet loss, time 66085ms
rtt min/avg/max/mdev = 36.928/37.122/37.467/0.243 ms
I'm running a graph for tbpl traffic since the 20th and we'll be able to see if there's any recent change in traffic pattern.
Comment 20•11 years ago
Also no loss from the London office.
44 packets transmitted, 44 received, 0% packet loss, time 43674ms
rtt min/avg/max/mdev = 136.123/136.396/136.691/0.478 ms
Comment 21•11 years ago
BTW, I can confirm the problem with ping timeouts from my home office in Germany:
PING generic.zlb.phx.mozilla.net (63.245.217.86): 56 data bytes
64 bytes from 63.245.217.86: icmp_seq=0 ttl=47 time=181.804 ms
Request timeout for icmp_seq 1
64 bytes from 63.245.217.86: icmp_seq=1 ttl=47 time=1015.640 ms
64 bytes from 63.245.217.86: icmp_seq=2 ttl=47 time=182.797 ms
64 bytes from 63.245.217.86: icmp_seq=3 ttl=47 time=182.567 ms
64 bytes from 63.245.217.86: icmp_seq=4 ttl=47 time=181.740 ms
64 bytes from 63.245.217.86: icmp_seq=5 ttl=47 time=181.666 ms
64 bytes from 63.245.217.86: icmp_seq=6 ttl=47 time=175.996 ms
64 bytes from 63.245.217.86: icmp_seq=7 ttl=47 time=178.549 ms
64 bytes from 63.245.217.86: icmp_seq=8 ttl=47 time=177.464 ms
64 bytes from 63.245.217.86: icmp_seq=9 ttl=47 time=180.394 ms
64 bytes from 63.245.217.86: icmp_seq=10 ttl=47 time=178.051 ms
64 bytes from 63.245.217.86: icmp_seq=11 ttl=47 time=174.951 ms
64 bytes from 63.245.217.86: icmp_seq=12 ttl=47 time=181.333 ms
64 bytes from 63.245.217.86: icmp_seq=14 ttl=47 time=179.950 ms
64 bytes from 63.245.217.86: icmp_seq=15 ttl=47 time=176.077 ms
^C
--- generic.zlb.phx.mozilla.net ping statistics ---
16 packets transmitted, 15 packets received, 6.2% packet loss
round-trip min/avg/max/stddev = 174.951/235.265/1015.640/208.579 ms
Reporter
Comment 22•11 years ago
(In reply to Ed Morley [:edmorley UTC+1] from comment #1)
> (In reply to Ed Morley [:edmorley UTC+1] from comment #0)
> > d) Can anyone think of any optimisations either to the DB queries or to our
> > php/apache setup?
>
> Hmm it would seem we don't have gzip compression enabled for either the php
> responses or the static content :-/
Patch in bug 900919 for this; though it's been like that since the beginning of TBPL, so unrelated to the recent perf decline here.
Reporter
Comment 23•11 years ago
Sheeri, don't suppose you can confirm the DB sizes (on disk will be fine) for tbpl-production? (just want to check bug 779290 is working as expected :-))
Flags: needinfo?(scabral)
Comment 24•11 years ago
I'm currently seeing pings of 200+ms on tbpl, making it nearly unusable to me right now. I've been asking in #it for over an hour now without any responses. What more information do we need to provide to make progress here?
Flags: needinfo?(shyam)
Comment 25•11 years ago
Fwiw, I cannot repro the packet loss. Attempted from locations in Portland (OR, USA), Paris (France), and Sydney (Australia). Round-trip times are within acceptable norms for each location with 0 % packet loss.
While I'm glad that it appears to work for [some|most] people, the inconsistency of the results is itself somewhat troubling. :/
Comment 26•11 years ago
Here's what I see at the moment.
The path is clean from my client to the tbpl VIP. I'm getting about ~2% loss to the VIP itself, which wouldn't cause as much degradation as you're seeing here.
Alternatively, over IPv6, tbpl page fetches are fast. I'll look at the load balancer to see if those VIPs are on the same machine.
Please include a traceroute here, even if you aren't seeing the issue.
Comment 27•11 years ago
The v6 VIP is currently on zlb10 and v4 is on zlb09. I ran some rapid ping tests from the default gateway to each and found loss on 09 that didn't exist on 10.
--- 63.245.217.215 ping statistics ---
1000 packets transmitted, 1000 packets received, 0% packet loss
round-trip min/avg/max/stddev = 1.312/2.686/10.767/1.268 ms
--- 63.245.217.216 ping statistics ---
1000 packets transmitted, 991 packets received, 0.9% packet loss
round-trip min/avg/max/stddev = 1.143/2.465/37.104/1.606 ms
:jd is monitoring load balancers, and I've asked to move the generic VIP to another LB to determine if zlb09 is the actual cause.
Comment 28•11 years ago
$ tracert tbpl.mozilla.org
Tracing route to generic.zlb.phx.mozilla.net [63.245.217.86]
over a maximum of 30 hops:
1 <1 ms <1 ms <1 ms Wireless_Broadband_Router.home [192.168.1.1]
2 7 ms 6 ms 10 ms L100.PHLAPA-VFTTP-62.verizon-gni.net [96.227.184.1]
3 13 ms 9 ms 9 ms G0-10-3-7.PHLAPA-LCR-21.verizon-gni.net [130.81.188.202]
4 8 ms 8 ms 9 ms ae0-0.PHIL-BB-RTR1.verizon-gni.net [130.81.209.178]
5 16 ms 13 ms 14 ms 0.xe-3-1-0.XL3.IAD8.ALTER.NET [152.63.3.65]
6 17 ms 17 ms 16 ms TenGigE0-6-1-0.GW1.IAD8.ALTER.NET [152.63.35.137]
7 55 ms 99 ms 64 ms teliasonera-gw.customer.alter.net [63.125.125.42]
8 32 ms 29 ms 29 ms ash-bb3-link.telia.net [213.155.130.87]
9 49 ms 49 ms 51 ms dls-bb1-link.telia.net [213.248.80.142]
10 82 ms 84 ms 84 ms phx-b1-link.telia.net [80.91.245.78]
11 90 ms 90 ms 89 ms mozilla-ic-140268-phx-b1.c.telia.net [213.248.104.202]
12 96 ms 95 ms 96 ms xe-0-0-1.core1.phx1.mozilla.net [63.245.216.18]
13 92 ms 92 ms * generic.zlb.phx.mozilla.net [63.245.217.86]
14 91 ms 94 ms 91 ms generic.zlb.phx.mozilla.net [63.245.217.86]
This is actually improved over what I was seeing earlier.
Comment 29•11 years ago
Hi Casey, here is my traceroute: cbook-1119:~ cbook$ traceroute tbpl.mozilla.org
traceroute to generic.zlb.phx.mozilla.net (63.245.217.86), 64 hops max, 52 byte packets
1 172.16.2.1 (172.16.2.1) 5.700 ms 1.381 ms 0.687 ms
2 83-169-168-34-isp.superkabel.de (83.169.168.34) 7.134 ms 7.277 ms 6.267 ms
3 83-169-135-254.static.superkabel.de (83.169.135.254) 7.461 ms 12.231 ms 8.323 ms
4 83-169-128-29.static.superkabel.de (83.169.128.29) 11.974 ms 9.389 ms 10.979 ms
5 88-134-196-62-dynip.superkabel.de (88.134.196.62) 9.572 ms 10.034 ms 8.100 ms
6 88-134-202-154-dynip.superkabel.de (88.134.202.154) 15.511 ms
88-134-201-226-dynip.superkabel.de (88.134.201.226) 9.675 ms
88-134-201-222-dynip.superkabel.de (88.134.201.222) 15.545 ms
7 88-134-234-77-dynip.superkabel.de (88.134.234.77) 10.395 ms 12.125 ms 9.614 ms
8 mcn-b2-link.telia.net (213.248.103.237) 8.995 ms 13.974 ms 9.977 ms
9 ffm-bb1-link.telia.net (213.155.134.12) 16.219 ms 17.537 ms
ffm-bb1-link.telia.net (80.91.248.30) 16.703 ms
10 nyk-bb1-link.telia.net (213.155.131.145) 106.637 ms
ash-bb3-link.telia.net (80.91.246.58) 114.962 ms
nyk-bb1-link.telia.net (213.155.131.147) 189.091 ms
11 dls-bb1-link.telia.net (213.155.133.177) 159.430 ms
dls-bb1-link.telia.net (213.155.137.51) 162.362 ms
dls-bb1-link.telia.net (213.155.130.65) 148.810 ms
12 phx-b1-link.telia.net (80.91.245.78) 172.772 ms 175.821 ms 171.774 ms
13 mozilla-ic-140268-phx-b1.c.telia.net (213.248.104.202) 186.932 ms 239.438 ms 191.543 ms
14 xe-0-0-1.core1.phx1.mozilla.net (63.245.216.18) 180.842 ms 179.092 ms 186.711 ms
Comment 30•11 years ago
Still pending someone from WebOps to move the VIP. Also, I transposed my zlb numbers above: .215 is 09 and .216 is 10, and 10 is the one having issues.
Comment 31•11 years ago
What NetOps is asking is whether someone from WebOps can look deeper into this at the host level, because we are quickly exhausting the things we can troubleshoot and test. I feel it would be a better use of time if we troubleshoot in parallel rather than serially, as we are doing now.
Comment 32•11 years ago
To summarize here:
Yes, there seems to be some slowness; I think casey has tracked it down to a possible cause. I think jd (from webops) is spending some time trying to confirm it, but I don't think we're going to rush to a fix. I'll let webops reply, but I'm also sending out an email to keep people informed about this.
Comment 33•11 years ago
I have been looking into this but really have little to report. I was hoping to vidyo with folks to attempt to consolidate our various bits of information in one place. Barring that possibility I will summarize my findings here.
I have no findings. To be thorough, there are some spikes on some Zeus graphs which I initially suspected (and which may be causing their own issues), but at this point I do not think they are related, as they only occur at intervals and it seems (from the data I have received from others) that the issue is reproducible at times outside of these narrow spikes. This leads me to suspect the cause is something beyond my ability to troubleshoot.
It is possible to ignore any further troubleshooting and simply migrate this TIG away from this ZLB and this may solve the problem (or demonstrate that this TIG is the source of the problem). The issue with this approach is that this will require a CAB due to the fact that this TIG hosts a number of sites and would cause connectivity issues during the switch. It is also interesting to note that the ZLB host is not the most heavily loaded node in the cluster (coming in 3rd or 4th depending on the moment) and this will need to be taken into account when making the determination as to where to move this TIG to.
Incidentally, I have not seen any of the spikes which initially piqued my interest in quite some time (say 30 min or so).
At the end of it all I see no obvious reason that this particular ZLB would be acting poorly. Perhaps something in the TCP stack is loaded due to some use case, but again that is beyond my abilities to troubleshoot.
Finally, I would say it would be groovy if we could determine the actual cause instead of punting, but that's just my $0.02.
Regards
Comment 34•11 years ago
Running very slowly again at the moment. Ping me here or on IRC if you need any information.
Loading Config.js (a 13KB file) took 9000+ ms, according to Firefox's Network console.
getRevisionBuilds.php calls are taking anywhere from 1 to 50 seconds to finish loading.
Some of the timings from the Network tab when loading tbpl.
Comment 37•11 years ago
Just giving the data I was asked about - the purge is running smoothly, tbpl is 53G, and it keeps a month's worth of data:
MariaDB [tbpl_mozilla_org]> select min(endtime) from runs;
+---------------------+
| min(endtime) |
+---------------------+
| 2013-07-07 02:52:40 |
+---------------------+
1 row in set (0.02 sec)
Flags: needinfo?(scabral)
Reporter
Comment 38•11 years ago
Thank you Sheeri! :-)
Reporter
Comment 39•11 years ago
Response times seem to be getting worse (plus vary a lot by location):
http://status.mozilla.com/11627/141897/tbpl.mozilla.org
Comment 40•11 years ago
According to NewRelic at:
https://rpm.newrelic.com/accounts/263620/applications/2207380
I set the graphs to Fri 16 Aug, and here's what I found in the "Web transactions" - App server time section (averages):
/php/getParsedLog.php - 7,660 ms (app server traces: 57.8 s, 12.5 s, 19.4 s)
/php/submitBugzillaComment.php - 3,680 ms (app server traces: 2.4 s, 4.1 s, 2.6 s)
All other web transactions were under 1 sec / 1000 ms.
I've attached screenshots of NewRelic for the 24 hours on Friday, and I will also attach for 7 days ending on Fri of last week for comparison. The most amount of time is spent in the db, but it's always 200ms or less, and usually around 150 ms.
Comment 41•11 years ago
Here's the 7-day graphs. You can clearly see a spike on Sun 11 Aug thru Mon 12 Aug. Times in New Relic are Pacific, for reference.
Comment 42•11 years ago
Here are the graphs for the last 12 hours; you can see how spiky the graph gets (going to 24 hours loses the spikiness).
The same ParsedLog and submitBugzilla PHP files are the longest ones, and at one point the ParsedLog was traced at 99.5 seconds.
Have either of these files changed in the past few weeks, or are they called more frequently?
Reporter
Comment 43•11 years ago
Sheeri, thank you very much for taking a look! :-)
Those two files haven't been touched for 19 months and 10 months respectively (https://hg.mozilla.org/webtools/tbpl/log/tip/php/getParsedLog.php , https://hg.mozilla.org/webtools/tbpl/log/tip/php/submitBugzillaComment.php) - and are also the ones expected to take longer, since getParsedLog.php parses up to 50MB text logs and submitBugzillaComment.php has to submit data to the BzAPI.
I think what your graphs do show, however, is that the problem we're seeing here isn't on the server itself (ignoring those two files, the next highest per-file response time is 200ms, for getLogExcerpt.php, which again is sometimes expected to take that long), but in the load balancers (or whatever else sits) in front of it - which makes sense, given that we've also seen slowdowns (in the region of 5000ms-30000ms) for the static assets on the site.
Side note: Is there any way for me to get access to the New Relic page for TBPL so I can monitor for other problems in the future? :-)
Comment 44•11 years ago
needinfo :jakem for New Relic access request.
Flags: needinfo?(shyam) → needinfo?(nmaul)
Reporter
Comment 45•11 years ago
It's worth noting that for the last 24-48 hours, performance for me at least hasn't seemed quite so bad (not quite back to normal, but more usable at least). Don't know if the same has been the case for you, Ryan/Wes?
Comment 46•11 years ago
Hard to say. I've found TBPL getting "stuck" quite a bit the last few days (where it gets stuck on say 97% and only a manual reload fixes it). However, with all the other infra issues we've been having, I've been hesitant to scream too much about it.
Comment 47•11 years ago
2 days ago I moved some VIPs around on the PHX1 load balancer cluster, which hosts (among many other things) tbpl. I didn't move TBPL specifically, but I did shift other things. It's possible this may be partially responsible for the improvement Ed has seen.
@phrawzty: you already have a NR account... it's your LDAP username. You should have an email from a while back, or if not a password reset capability.
Flags: needinfo?(nmaul)
Comment 48•11 years ago
Here's some data out of New Relic:
3-month apdex for tbpl.mozilla.org:
http://dl.dropbox.com/u/135415857/Screenshots/HEHc.png
3-month response time breakdown:
http://dl.dropbox.com/u/135415857/Screenshots/3Cix.png
Database is consistently the biggest chunk by far, probably longer than it should be (this app needs more memcache usage I think), but nothing like the response times people are quoting above.
Comparing individual web servers indicates none of them are performing out of line with the others.
This is an interesting "map" of services that tbpl depends on:
http://dl.dropbox.com/u/135415857/Screenshots/Zdn8.png
From that you can see the biggest dependency *by far* is the database, with an average of 129 requests per page view.
Almost all of the total transaction time is spent in getRevisionBuilds.php:
http://dl.dropbox.com/u/135415857/Screenshots/XW09.png
*However*, the worst individual transactions are almost always for getParsedLog.php:
http://dl.dropbox.com/u/135415857/Screenshots/YsgE.png
Note that these seem relatively infrequent, but they do represent the kind of delays folks are talking about here. I'm trying to drill down into one of these, but at the moment New Relic appears to be having some trouble. I'll report back when I have more.
That's the best I've got so far.
Comment 49•11 years ago
getRevisionBuilds.php is called on the homepage, about 10 times per hit (with different query arguments), and specifies a no-cache header so Zeus can't help. That page also auto-refreshes every few seconds, so these get hit a *lot*.
Comment 50•11 years ago
New Relic is still having problems, but I have a somewhat interesting result from the database slow query log. One query in particular results in the vast majority of "slow queries", both in total response time and time-per-call. It also has an obscenely high variance-to-mean, which means the runtime can vary wildly from one call to the next.
The query in question is:
REPLACE INTO `percona`.`checksums` (db, tbl, chunk, chunk_index, lower_boundary, upper_boundary, this_cnt, this_crc) SELECT 'tbpl_mozilla_org', 'runs_logs', '1', 'PRIMARY', '5414848', '5525915', COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `id`, `buildbot_id`, `type`, `content`, CONCAT(ISNULL(`content`)))) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `tbpl_mozilla_org`.`runs_logs` FORCE INDEX(`PRIMARY`) WHERE ((`id` >= '5414848')) AND ((`id` <= '5525915')) /*checksum chunk*/\G
Unfortunately I don't know what file this is called in.
Now, one potential caveat is the slow query log (at least in its current config on this system) only records queries that take over 2 seconds. If there's another query taking, say, 500ms per call and getting called far more often, that's not currently being recorded.
It's probably best if I refrain on any additional database work and defer that to :sheeri and :cyborgshadow. It might be useful to temporarily tweak the slow query log to record more data, just in case.
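For example, something along these lines on the master would lower the threshold temporarily (values illustrative; assumes the slow query log is already enabled):
SET GLOBAL long_query_time = 0.5;  -- capture anything over 500ms for a while
-- ...collect data for an hour or so...
SET GLOBAL long_query_time = 2;    -- then revert to the current setting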
Comment 51•11 years ago
That's the table checksum that we run twice daily to ensure that the data on the master matches the data on the slave, and the backup. We can turn those off to see if there's an improvement, but they have been running forever (since it was on generic, before it moved to tbpl).
I wouldn't say these queries are the cause, but they are indeed contributing to the load that's on the camel's back.
I can try to see if the tables are horribly fragmented this weekend during the tree closure.
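(For context, the `percona`.`checksums` query in comment 50 is the sort of statement pt-table-checksum generates; the twice-daily job is roughly of this shape, with the host and options here being placeholders rather than the real cron entry:)
pt-table-checksum --replicate=percona.checksums --databases=tbpl_mozilla_org h=<tbpl master host>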
Comment 52•11 years ago
Got some details! NR was never broken, my Ghostery was blocking parts of it. :)
Here's a trace of a request that took 4.4 seconds... there seem to be a lot of these, and I now suspect they are the problem:
http://dl.dropbox.com/u/135415857/Screenshots/rWnC.png
The full statement is:
SELECT who, note, unix_timestamp(timestamp) AS timestamp FROM runs_notes WHERE run_id = :runid ORDER BY timestamp ASC;
It's a frequently run query, but in this case it appears that one particular instance of it took a long time.
Comment 53•11 years ago
It's searching by run_id, which is a primary key, and I confirmed that an EXPLAIN shows it's using the index.
We are currently logging queries that take longer than 2 seconds, and nothing on the runs_notes table has been slow in the logs, which we have going back to June 6th (on both the master and slave). (I wanted to see if maybe that query was locked, or if the slow query log had any insight, but that query is not even in the slow query log).
Can we check internal/external Zeus for times when it might be saturated? Because the query did not take more than 2 seconds inside the DB itself. I looked on the master and the slave; the slave has *no* slow queries (not even checksums, though the checksum queries are logged as slow on the master, mostly for the runs_logs table), and I verified with a sleep query that slow query logging is functioning.
There are just under 30k rows in the runs_notes table; it is 12 MB on disk.
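(The check was just a standard EXPLAIN against the query from comment 52, e.g. with a placeholder run_id:
EXPLAIN SELECT who, note, unix_timestamp(timestamp) AS timestamp FROM runs_notes WHERE run_id = 12345 ORDER BY timestamp ASC;
which reports the PRIMARY key being used.)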
Comment 54•11 years ago
Hrm, and yet looking at a trace of a call to getRevisionBuilds.php that took 4.2 seconds, it looks like the slowness was in the PHP, not the SQL:
https://rpm.newrelic.com/accounts/263620/applications/2207380/transactions#id=219719751&tab-detail_1802105587=trace_details&app_trace_id=1802105587
I'm going to log all queries on the master for an hour to see if there's anything obvious we optimize with regards to query or schema.
It's extremely clear to me, though, that the slave is *vastly* underutilized. I'm hoping that when I look at the master's queries I'll find many reads that could be moved to the slave.
Comment 55•11 years ago
I think we need to rope in the TBPL developers on this...
The sources of slowness that we've been able to isolate are:
1) A ~6ms average delay on certain database queries that are executed about 300-400 times per hit to getRevisionBuilds.php... resulting in a few seconds' delay. I suspect the variance is due simply to concurrency issues. There's probably not much we can do in the infrastructure about this... better would be to see what we can do to reduce the number of queries needed.
2) An inexplicable ~4s processing delay loading getRevisionBuilds.php sometimes. New Relic indicates this is within PHP execution, not database time. This is present with or without PHP-APC caching. Right now, that's all I know about this. I don't have a call graph or profile data or anything, so I don't know what part of the file is sometimes slow to execute.
3) getRevisionBuilds.php is itself called about 10 times per hit to the homepage, and is not cacheable (sends no-cache headers). So the above issues, though rare, are greatly magnified. Fewer calls, or a few seconds' caching, might help a lot.
If we can rework any of these, we should be better off.
Who should we NEEDINFO to investigate how feasible these things are?
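Regarding (3), a minimal sketch of what a few seconds of caching in getRevisionBuilds.php could look like, assuming APC is available (the helper name below is illustrative, not a real TBPL function):
<?php
// Sketch only: serve repeated polls within ~10s from APC instead of the DB.
$branch = $_GET['branch'];
$rev = $_GET['rev'];
$cacheKey = "revisionbuilds:$branch:$rev";
$json = apc_fetch($cacheKey, $hit);
if (!$hit) {
  $json = json_encode(fetchRevisionBuildsFromDB($branch, $rev)); // placeholder for the existing query logic
  apc_store($cacheKey, $json, 10); // keep for ~10 seconds
}
header('Content-Type: application/json');
header('Cache-Control: public, max-age=10'); // would also let Zeus cache briefly
echo $json;
?>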
Comment 56•11 years ago
I defrag'd all the tables during today's outage window. Let's see if that helps any. (I doubt it given Jake's comment 55 but I figured we should try stuff while we can, see if it helps).
Reporter
Comment 57•11 years ago
(In reply to Jake Maul [:jakem] from comment #55)
> If we can rework any of these, we should be better off.
>
>
> Who should we NEEDINFO to investigate how feasible these things are?
I agree many of these things are suboptimal, but as I mentioned in comment 43, these files haven't been touched for a year or two - and also the delays we have been seeing are often much longer than the 4s added at that layer.
We've also seen extremely poor performance on static assets (per comment 17 and comment 43), so I really think this is a load balancer issue - which seems to be corroborated by comment 47 and the improved perf we've seen since roughly then :-)
Reporter
Comment 58•11 years ago
Meant to add - thank you for the changes made in comment 47! :-)
Comment 59•11 years ago
(In reply to Ed Morley [:edmorley UTC+1] from comment #57)
> (In reply to Jake Maul [:jakem] from comment #55)
> > If we can rework any of these, we should be better off.
> >
> >
> > Who should we NEEDINFO to investigate how feasible these things are?
>
> I agree many of these things are suboptimal, but as I mentioned in comment
> 43, these files haven't been touched for a year or two - and also the delays
> we have been seeing are often much longer than the 4s added at that layer.
I believe the data set is bigger than it was a year ago... commit volume is up quite a bit, and overall contribution rate seems higher. That in itself might account for a longer delay.
In any case, that 4 seconds is not a constant value. It averages 3-4 seconds, but it does sometimes take much longer. It may not be the only problem, but it definitely has a significant performance impact on the service as a whole.
> We've also seen extremely poor performance on static assets (per comment 17
> and comment 43), so I really think this is a load balancer issue - which
> seems to be corroborated given comment 47 and us seeing improved perf since
> roughly then :-)
Glad this helped, but it doesn't make any sense to me. I wasn't seeing a significant performance bottleneck at that layer. My actions were primarily to re-balance traffic more evenly between the load balancers in case of failure, not to alleviate any preexisting condition. I don't know why this helped, or even if this was indeed helpful (perhaps it's a red herring and something else happened around the same time).
We have a plan to migrate this app over to a dedicated Rel-Eng cluster in SCL3... sort of consolidating all the tree-closing stuff together. If you're happy here, we can probably close this bug out and deal with any performance issues during that move. Downside is, it's not highly prioritized right now, so I don't know when we'll get to it. Sometime this year, probably.
Updated•11 years ago
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Reporter
Comment 60•11 years ago
IIRC this boiled down to load balancer issues; either way, it was fixed long ago.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard