Closed Bug 897903 Opened 11 years ago Closed 11 years ago

https://tbpl.mozilla.org/php/getRevisionBuilds.php calls taking longer than normal

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Unassigned)

Details

Attachments

(4 files)

Over the last week or so, we've been noticing that both the initial TBPL (tbpl.mozilla.org) load and the subsequent two-minute data refreshes have been taking much longer to complete (not all the time; it seems to come in spells). The long pole is the calls to getRevisionBuilds.php, e.g.:
https://tbpl.mozilla.org/php/getRevisionBuilds.php?branch=mozilla-inbound&rev=7c78c3a98313&_=1374744902454

This file hasn't changed recently:
https://hg.mozilla.org/webtools/tbpl/log/681fc36ecc5e/php/getRevisionBuilds.php

A few questions:
a) Has load on the webheads been higher in the last week or two at certain times throughout the day?
b) Has load on the DB been higher?
c) Is the periodic DB purge/cleanup still working as expected (bug 779290)? Wondering if the table sizes have grown.
d) Can anyone think of any optimisations, either to the DB queries or to our php/apache setup?
(In reply to Ed Morley [:edmorley UTC+1] from comment #0)
> d) Can anyone think of any optimisations either to the DB queries or to our
> php/apache setup?

Hmm, it would seem we don't have gzip compression enabled for either the PHP responses or the static content :-/
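For illustration only, a minimal sketch of what enabling compression for a PHP response could look like, assuming a stock PHP setup with zlib available. This is not the patch that later landed in bug 900919, and the payload is a placeholder:

<?php
// Hypothetical sketch only - not the actual TBPL change from bug 900919.
// Compress the JSON response when the client advertises gzip support,
// falling back to plain output buffering otherwise.
$acceptsGzip = isset($_SERVER['HTTP_ACCEPT_ENCODING']) &&
               strpos($_SERVER['HTTP_ACCEPT_ENCODING'], 'gzip') !== false;

if ($acceptsGzip) {
    // ob_gzhandler negotiates gzip/deflate and sets Content-Encoding itself.
    ob_start('ob_gzhandler');
} else {
    ob_start();
}

header('Content-Type: application/json');
echo json_encode(array('builds' => array())); // placeholder payload
ob_end_flush();

The same effect could also be had at the Apache layer (e.g. mod_deflate), which avoids touching the PHP endpoints at all.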
(In reply to Ed Morley [:edmorley UTC+1] from comment #0)
> Few questions:
> a) Has load on the webheads been higher in the last week or two at certain
> times throughout the day?
> b) Has load on the DB been higher?
> c) Is the periodic DB purge/cleanup still working as expected? (bug 779290).
> Wondering if the table sizes have grown.
> d) Can anyone think of any optimisations either to the DB queries or to our
> php/apache setup?

Does anyone have any ideas? :-)
Seems it's getting worse and is affecting sheriff work, since the tool is very slow.
Severity: normal → major
So, I've done some clicking around (based on what tomcat told me on IRC) and I can't reproduce this. Is there a URL that's consistently slow?
Severity: major → normal
Thank you for taking a look! :-)

(In reply to Shyam Mani [:fox2mike] from comment #4)
> Is there a URL that's consistently slow?

The first one in comment 0 was taking 20s+ the other day; it's down to 8s right now, but per comment 0 this seems to fluctuate during the day.
Also, I'm on Aurora 24.0a2 (2013-07-28) so if you can reproduce this with Aurora or Stable, that would be nice too :) With a list of steps, so I can see if it's something specific and poke around in the logs.
(In reply to Ed Morley [:edmorley UTC+1] from comment #5)
> First one in comment 0 was taking 20s+ the other day, is down to 8s right
> now, but per comment 0 this seems to fluctuate during the day.

Again, less than that on Aurora, and even less with curl:

* Connection #0 to host tbpl.mozilla.org left intact
* Closing connection #0

real    0m1.765s
user    0m0.040s
sys     0m0.065s
So I'm convinced this is a Nightly issue:

Lifting the server siege...      done.
Transactions:                406 hits
Availability:             100.00 %
Elapsed time:              59.42 secs
Data transferred:          72.32 MB
Response time:              2.14 secs
Transaction rate:           6.83 trans/sec
Throughput:                 1.22 MB/sec
Concurrency:               14.64
Successful transactions:     406
Failed transactions:           0
Longest transaction:        7.06
Shortest transaction:       0.97

This was on https://tbpl.mozilla.org/php/getRevisionBuilds.php?branch=mozilla-inbound&rev=7c78c3a98313&_=1374744902454 and so I can't reproduce the issue. If you can, let us know; otherwise we can close the bug.
I see no difference compared to Chrome fwiw
(In reply to Ed Morley [:edmorley UTC+1] from comment #9)
> I see no difference compared to Chrome fwiw

(Using web console timings to compare)
Pasting this here so I don't lose it...

14:47:27 - glob: edmorley, i was curious about the tbpl slowness
14:48:07 - edmorley: glob: yeah?
14:48:14 - glob: edmorley, i see 18s load times from here, but from people at the same time, 0.4s
14:48:26 - edmorley: ah
14:48:32 - glob: edmorley, from my co-lo box (in .au), 8.1s
14:48:50 - glob: but, rather slow .. 22k/s
14:51:21 - glob: edmorley, as a comparison, i get ~ 150k/s to the same colo server from bmo
14:52:01 - glob: edmorley, and to people, 475k/s
14:52:11 - glob: (that's tbpl --> people)
CC'ing some netops folks.
From my experience today, it doesn't matter which browser (Nightly or Aurora):

Not much traffic (few checkins, few sheriffs/devs around), e.g. my morning European time -> TBPL is fast.
Much traffic once US Pacific time wakes up -> slow.
Agreed 100% that performance strongly correlates with time of day.
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #14)
> Agreed 100% that performance strongly correlates with time of day.

What would you say are the slowest times?
(In reply to Shyam Mani [:fox2mike] from comment #15)
> (In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #14)
> > Agreed 100% that performance strongly correlates with time of day.
>
> What would you say are the slowest times?

In Pacific time, I would say it's OK from 0:00 until about 6am; from there it goes downhill and gets slow.
Is exceptionally slow for me right now...

{
[14:52:50.506] GET https://tbpl.mozilla.org/?tree=B2g-Inbound [HTTP/1.1 200 OK 1076ms]
--
[14:52:58.999] GET https://tbpl.mozilla.org/css/style.css [HTTP/1.1 200 OK 1922ms]
[14:52:59.000] GET https://tbpl.mozilla.org/js/jquery.js [HTTP/1.1 200 OK 10852ms]
[14:52:59.001] GET https://tbpl.mozilla.org/js/jquery-ui-drag.js [HTTP/1.1 200 OK 11156ms]
[14:52:59.001] GET https://tbpl.mozilla.org/js/NetUtils.js [HTTP/1.1 200 OK 11304ms]
[14:52:59.002] GET https://tbpl.mozilla.org/js/PushlogJSONParser.js [HTTP/1.1 200 OK 11454ms]
[14:52:59.002] GET https://tbpl.mozilla.org/js/MachineResult.js [HTTP/1.1 200 OK 11603ms]
[14:52:59.003] GET https://tbpl.mozilla.org/js/BuildbotDBUser.js [HTTP/1.1 200 OK 11752ms]
[14:52:59.003] GET https://tbpl.mozilla.org/js/BuildAPI.js [HTTP/1.1 200 OK 11900ms]
[14:52:59.004] GET https://tbpl.mozilla.org/js/AddCommentUI.js [HTTP/1.1 200 OK 12601ms]
[14:52:59.004] GET https://tbpl.mozilla.org/js/HiddenBuildsAdminUI.js [HTTP/1.1 200 OK 12890ms]
[14:52:59.005] GET https://tbpl.mozilla.org/js/Config.js [HTTP/1.1 200 OK 13038ms]
[14:52:59.006] GET https://tbpl.mozilla.org/js/MVTime.js [HTTP/1.1 200 OK 13489ms]
[14:52:59.007] GET https://tbpl.mozilla.org/js/Data.js [HTTP/1.1 200 OK 13831ms]
[14:52:59.007] GET https://tbpl.mozilla.org/js/SummaryLoader.js [HTTP/1.1 200 OK 15289ms]
[14:52:59.008] GET https://tbpl.mozilla.org/js/UserInterface.js [HTTP/1.1 200 OK 17394ms]
[14:52:59.008] GET https://tbpl.mozilla.org/js/LoadTracker.js [HTTP/1.1 200 OK 14434ms]
[14:52:59.009] GET https://tbpl.mozilla.org/js/Controller.js [HTTP/1.1 200 OK 14578ms]
[14:52:59.009] GET https://tbpl.mozilla.org/js/utils.js [HTTP/1.1 200 OK 14726ms]
--
[14:53:16.542] GET https://hg.mozilla.org/integration/b2g-inbound/json-pushes?full=1&maxhours=24 [HTTP/1.1 200 Script output follows 2348ms]
[14:53:16.543] GET https://tbpl.mozilla.org/php/getHiddenBuilderNames.php?branch=b2g-inbound [HTTP/1.1 200 OK 7001ms]
[14:53:16.543] GET https://secure.pub.build.mozilla.org/builddata/buildjson/builds-pending.js [HTTP/1.1 200 OK 531ms]
[14:53:16.544] GET https://secure.pub.build.mozilla.org/builddata/buildjson/builds-running.js [HTTP/1.1 200 OK 1025ms]
[14:53:16.544] GET https://treestatus.mozilla.org/b2g-inbound?format=json [HTTP/1.1 200 OK 542ms]
[14:53:16.545] GET https://tbpl.mozilla.org/images/loading_16.png [HTTP/1.1 200 OK 442ms]
[14:53:18.851] GET https://tbpl.mozilla.org/php/getRevisionBuilds.php?branch=b2g-inbound&rev=8a4c07bae3e8&_=1375365198762 [HTTP/1.1 200 OK 5341ms]
[14:53:18.852] GET https://tbpl.mozilla.org/images/tango-updown.png [HTTP/1.1 200 OK 1125ms]
[14:53:18.852] GET https://api-dev.bugzilla.mozilla.org/latest/bug?include_fields=status,summary&id=899711 [HTTP/1.1 200 OK 981ms]
[14:53:18.853] GET https://api-dev.bugzilla.mozilla.org/latest/bug?include_fields=status,summary&id=870941 [HTTP/1.1 200 OK 944ms]
}

Even the static assets are taking an age.
>ping tbpl.mozilla.org -t

Pinging generic.zlb.phx.mozilla.net [63.245.217.86] with 32 bytes of data:
Reply from 63.245.217.86: bytes=32 time=151ms TTL=52
Reply from 63.245.217.86: bytes=32 time=148ms TTL=52
Reply from 63.245.217.86: bytes=32 time=146ms TTL=52
Reply from 63.245.217.86: bytes=32 time=153ms TTL=52
Request timed out.
Reply from 63.245.217.86: bytes=32 time=146ms TTL=52
Reply from 63.245.217.86: bytes=32 time=147ms TTL=52
Request timed out.
Reply from 63.245.217.86: bytes=32 time=147ms TTL=52
Reply from 63.245.217.86: bytes=32 time=147ms TTL=52
Reply from 63.245.217.86: bytes=32 time=147ms TTL=52
Reply from 63.245.217.86: bytes=32 time=147ms TTL=52
Reply from 63.245.217.86: bytes=32 time=148ms TTL=52
Request timed out.
Reply from 63.245.217.86: bytes=32 time=148ms TTL=52
Reply from 63.245.217.86: bytes=32 time=147ms TTL=52
Reply from 63.245.217.86: bytes=32 time=147ms TTL=52
Reply from 63.245.217.86: bytes=32 time=146ms TTL=52
Request timed out.
Request timed out.
Reply from 63.245.217.86: bytes=32 time=148ms TTL=52
Reply from 63.245.217.86: bytes=32 time=147ms TTL=52
Reply from 63.245.217.86: bytes=32 time=148ms TTL=52
Reply from 63.245.217.86: bytes=32 time=148ms TTL=52
Reply from 63.245.217.86: bytes=32 time=147ms TTL=52
Reply from 63.245.217.86: bytes=32 time=148ms TTL=52
Reply from 63.245.217.86: bytes=32 time=147ms TTL=52
Reply from 63.245.217.86: bytes=32 time=149ms TTL=52
Reply from 63.245.217.86: bytes=32 time=146ms TTL=52
Request timed out.
Reply from 63.245.217.86: bytes=32 time=147ms TTL=52
Request timed out.
Reply from 63.245.217.86: bytes=32 time=148ms TTL=52
Reply from 63.245.217.86: bytes=32 time=148ms TTL=52
Reply from 63.245.217.86: bytes=32 time=146ms TTL=52
Reply from 63.245.217.86: bytes=32 time=147ms TTL=52
Request timed out.
Request timed out.
Reply from 63.245.217.86: bytes=32 time=146ms TTL=52
Reply from 63.245.217.86: bytes=32 time=146ms TTL=52
Reply from 63.245.217.86: bytes=32 time=146ms TTL=52
Request timed out.

Ping statistics for 63.245.217.86:
    Packets: Sent = 42, Received = 32, Lost = 10 (23% loss),
Approximate round trip times in milli-seconds:
    Minimum = 146ms, Maximum = 153ms, Average = 147ms

====

>tracert tbpl.mozilla.org

Tracing route to generic.zlb.phx.mozilla.net [63.245.217.86]
over a maximum of 30 hops:

  1     1 ms    <1 ms    <1 ms  homeportal [192.168.50.1]
  2     7 ms     7 ms     8 ms  10.228.244.1
  3     9 ms     8 ms     7 ms  brnt-core-2a-ae3-1509.network.virginmedia.net [80.2.147.93
  4    16 ms     7 ms     8 ms  brnt-bb-1a-ae12-0.network.virginmedia.net [62.253.174.53]
  5     8 ms     8 ms     9 ms  brnt-bb-1c-ae0-0.network.virginmedia.net [62.253.174.30]
  6     *       13 ms    11 ms  brhm-bb-1c-ae8-0.network.virginmedia.net [62.253.174.126]
  7    16 ms    15 ms    20 ms  0.ae1.mpr1.lhr1.uk.above.net [213.161.65.149]
  8    18 ms    18 ms    20 ms  ge-3-0-0.mpr1.lhr2.uk.above.net [64.125.28.126]
  9    92 ms    93 ms    91 ms  so-1-1-0.mpr1.dca2.us.above.net [64.125.31.186]
 10    93 ms   123 ms    90 ms  xe-0-3-0.cr1.dca2.us.above.net [64.125.29.17]
 11   129 ms   121 ms   119 ms  xe-2-2-0.cr1.iah1.us.above.net [64.125.29.37]
 12   140 ms   120 ms   118 ms  xe-0-0-0.cr2.iah1.us.above.net [64.125.30.66]
 13   142 ms   142 ms   142 ms  xe-0-1-0.mpr4.phx2.us.above.net [64.125.28.73]
 14   165 ms   144 ms   144 ms  64.124.201.182.mozilla.com [64.124.201.182]
 15     *        *        *     Request timed out.
 16   148 ms   148 ms   148 ms  generic.zlb.phx.mozilla.net [63.245.217.86]

Trace complete.
I can't duplicate that loss from home or a linode box.

--- 63.245.217.86 ping statistics ---
44 packets transmitted, 44 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 75.362/78.103/84.628/2.207 ms

--- 63.245.217.86 ping statistics ---
67 packets transmitted, 67 received, 0% packet loss, time 66085ms
rtt min/avg/max/mdev = 36.928/37.122/37.467/0.243 ms

I'm running a graph for tbpl traffic since the 20th and we'll be able to see if there's any recent change in traffic pattern.
Also no loss from the London office.

44 packets transmitted, 44 received, 0% packet loss, time 43674ms
rtt min/avg/max/mdev = 136.123/136.396/136.691/0.478 ms
BTW, I can confirm the problem with ping timeouts from my home office in Germany:

PING generic.zlb.phx.mozilla.net (63.245.217.86): 56 data bytes
64 bytes from 63.245.217.86: icmp_seq=0 ttl=47 time=181.804 ms
Request timeout for icmp_seq 1
64 bytes from 63.245.217.86: icmp_seq=1 ttl=47 time=1015.640 ms
64 bytes from 63.245.217.86: icmp_seq=2 ttl=47 time=182.797 ms
64 bytes from 63.245.217.86: icmp_seq=3 ttl=47 time=182.567 ms
64 bytes from 63.245.217.86: icmp_seq=4 ttl=47 time=181.740 ms
64 bytes from 63.245.217.86: icmp_seq=5 ttl=47 time=181.666 ms
64 bytes from 63.245.217.86: icmp_seq=6 ttl=47 time=175.996 ms
64 bytes from 63.245.217.86: icmp_seq=7 ttl=47 time=178.549 ms
64 bytes from 63.245.217.86: icmp_seq=8 ttl=47 time=177.464 ms
64 bytes from 63.245.217.86: icmp_seq=9 ttl=47 time=180.394 ms
64 bytes from 63.245.217.86: icmp_seq=10 ttl=47 time=178.051 ms
64 bytes from 63.245.217.86: icmp_seq=11 ttl=47 time=174.951 ms
64 bytes from 63.245.217.86: icmp_seq=12 ttl=47 time=181.333 ms
64 bytes from 63.245.217.86: icmp_seq=14 ttl=47 time=179.950 ms
64 bytes from 63.245.217.86: icmp_seq=15 ttl=47 time=176.077 ms
^C
--- generic.zlb.phx.mozilla.net ping statistics ---
16 packets transmitted, 15 packets received, 6.2% packet loss
round-trip min/avg/max/stddev = 174.951/235.265/1015.640/208.579 ms
(In reply to Ed Morley [:edmorley UTC+1] from comment #1)
> (In reply to Ed Morley [:edmorley UTC+1] from comment #0)
> > d) Can anyone think of any optimisations either to the DB queries or to our
> > php/apache setup?
>
> Hmm it would seem we don't have gzip compression enabled for either the php
> responses or the static content :-/

Patch in bug 900919 for this; though it's been like that since the beginning of TBPL, so unrelated to the recent perf decline here.
Sheeri, don't suppose you can confirm the DB sizes (on disk will be fine) for tbpl-production? (just want to check bug 779290 is working as expected :-))
Flags: needinfo?(scabral)
I'm currently seeing pings of 200+ms on tbpl, making it nearly unusable to me right now. I've been asking in #it for over an hour now without any responses. What more information do we need to provide to make progress here?
Flags: needinfo?(shyam)
Fwiw, I cannot repro the packet loss. Attempted from locations in Portland (OR, USA), Paris (France), and Sydney (Australia). Round-trip times are within acceptable norms for each location with 0 % packet loss. While I'm glad that it appears to work for [some|most] people, the inconsistency of the results is itself somewhat troubling. :/
Here's what I see at the moment: the path is clean from my client to the tbpl VIP. I'm getting about ~2% loss to the VIP itself, which wouldn't cause as much degradation as you see here. Alternatively, over IPv6, tbpl page fetches are fast. I'll look at the load balancer to see if those VIPs are on the same machine. Please include a traceroute here, even if you aren't seeing the issue.
The v6 VIP is currently on zlb10 and v4 is on zlb09. I ran some rapid ping tests from the default gateway to each and found loss on 09 that didn't exist on 10.

--- 63.245.217.215 ping statistics ---
1000 packets transmitted, 1000 packets received, 0% packet loss
round-trip min/avg/max/stddev = 1.312/2.686/10.767/1.268 ms

--- 63.245.217.216 ping statistics ---
1000 packets transmitted, 991 packets received, 0% packet loss
round-trip min/avg/max/stddev = 1.143/2.465/37.104/1.606 ms

:jd is monitoring the load balancers and I've asked to move the generic VIP to another LB to determine if zlb09 is the actual cause.
$ tracert tbpl.mozilla.org

Tracing route to generic.zlb.phx.mozilla.net [63.245.217.86]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  Wireless_Broadband_Router.home [192.168.1.1]
  2     7 ms     6 ms    10 ms  L100.PHLAPA-VFTTP-62.verizon-gni.net [96.227.184.1]
  3    13 ms     9 ms     9 ms  G0-10-3-7.PHLAPA-LCR-21.verizon-gni.net [130.81.188.202]
  4     8 ms     8 ms     9 ms  ae0-0.PHIL-BB-RTR1.verizon-gni.net [130.81.209.178]
  5    16 ms    13 ms    14 ms  0.xe-3-1-0.XL3.IAD8.ALTER.NET [152.63.3.65]
  6    17 ms    17 ms    16 ms  TenGigE0-6-1-0.GW1.IAD8.ALTER.NET [152.63.35.137]
  7    55 ms    99 ms    64 ms  teliasonera-gw.customer.alter.net [63.125.125.42]
  8    32 ms    29 ms    29 ms  ash-bb3-link.telia.net [213.155.130.87]
  9    49 ms    49 ms    51 ms  dls-bb1-link.telia.net [213.248.80.142]
 10    82 ms    84 ms    84 ms  phx-b1-link.telia.net [80.91.245.78]
 11    90 ms    90 ms    89 ms  mozilla-ic-140268-phx-b1.c.telia.net [213.248.104.202]
 12    96 ms    95 ms    96 ms  xe-0-0-1.core1.phx1.mozilla.net [63.245.216.18]
 13    92 ms    92 ms     *     generic.zlb.phx.mozilla.net [63.245.217.86]
 14    91 ms    94 ms    91 ms  generic.zlb.phx.mozilla.net [63.245.217.86]

This is actually improved over what I was seeing earlier.
Hi Casey, here is my traceroute:

cbook-1119:~ cbook$ traceroute tbpl.mozilla.org
traceroute to generic.zlb.phx.mozilla.net (63.245.217.86), 64 hops max, 52 byte packets
 1  172.16.2.1 (172.16.2.1)  5.700 ms  1.381 ms  0.687 ms
 2  83-169-168-34-isp.superkabel.de (83.169.168.34)  7.134 ms  7.277 ms  6.267 ms
 3  83-169-135-254.static.superkabel.de (83.169.135.254)  7.461 ms  12.231 ms  8.323 ms
 4  83-169-128-29.static.superkabel.de (83.169.128.29)  11.974 ms  9.389 ms  10.979 ms
 5  88-134-196-62-dynip.superkabel.de (88.134.196.62)  9.572 ms  10.034 ms  8.100 ms
 6  88-134-202-154-dynip.superkabel.de (88.134.202.154)  15.511 ms  88-134-201-226-dynip.superkabel.de (88.134.201.226)  9.675 ms  88-134-201-222-dynip.superkabel.de (88.134.201.222)  15.545 ms
 7  88-134-234-77-dynip.superkabel.de (88.134.234.77)  10.395 ms  12.125 ms  9.614 ms
 8  mcn-b2-link.telia.net (213.248.103.237)  8.995 ms  13.974 ms  9.977 ms
 9  ffm-bb1-link.telia.net (213.155.134.12)  16.219 ms  17.537 ms  ffm-bb1-link.telia.net (80.91.248.30)  16.703 ms
10  nyk-bb1-link.telia.net (213.155.131.145)  106.637 ms  ash-bb3-link.telia.net (80.91.246.58)  114.962 ms  nyk-bb1-link.telia.net (213.155.131.147)  189.091 ms
11  dls-bb1-link.telia.net (213.155.133.177)  159.430 ms  dls-bb1-link.telia.net (213.155.137.51)  162.362 ms  dls-bb1-link.telia.net (213.155.130.65)  148.810 ms
12  phx-b1-link.telia.net (80.91.245.78)  172.772 ms  175.821 ms  171.774 ms
13  mozilla-ic-140268-phx-b1.c.telia.net (213.248.104.202)  186.932 ms  239.438 ms  191.543 ms
14  xe-0-0-1.core1.phx1.mozilla.net (63.245.216.18)  180.842 ms  179.092 ms  186.711 ms
Still pending someone from webops to move the VIP. Also, I transposed my ZLB numbers above: .215 is 09 and .216 is 10, and 10 is the one having issues.
What Netops is asking is whether someone from Webops can look deeper into this at the host level, because we are quickly exhausting the things we can troubleshoot and test. I feel it would be a better use of time if we troubleshoot in parallel rather than serially, as we are doing now.
To summarize here: yes, there seems to be some slowness, and I think Casey's tracked it down to a possible cause. I think jd (from webops) is spending some time trying to confirm it, but I don't think we're going to rush to a fix. I'll let webops reply, but I'm also sending out an email and keeping people informed about this.
I have been looking into this but really have little to report. I was hoping to get on Vidyo with folks to attempt to consolidate our various bits of information in one place. Barring that possibility, I will summarize my findings here: I have no findings.

To be thorough, there are some spikes on some Zeus graphs which I initially suspected (and which may be causing their own issues), but at this point I do not think they are related, as they only occur at intervals and it seems (from the data I have received from others) that the issue is reproducible at times outside of these narrow spikes. This leads me to suspect the cause is something beyond my abilities to troubleshoot.

It is possible to skip any further troubleshooting and simply migrate this TIG away from this ZLB, which may solve the problem (or demonstrate that this TIG is the source of the problem). The issue with this approach is that it will require a CAB, because this TIG hosts a number of sites and the switch would cause connectivity issues. It is also worth noting that this ZLB host is not the most heavily loaded node in the cluster (coming in 3rd or 4th depending on the moment), which will need to be taken into account when deciding where to move this TIG.

Incidentally, I have not seen any of the spikes which initially piqued my interest in quite some time (say 30 min or so). At the end of it all, I see no obvious reason that this particular ZLB would be acting poorly. Perhaps something in the TCP stack is loaded due to some use case, but again, that is beyond my abilities to troubleshoot. Finally, I would say it would be groovy if we could determine the actual cause instead of punting, but that is just my $0.02.

Regards
Running very slowly again at the moment. Ping me here or on IRC if you need any information.
Loading Config.js (a 13KB file) took 9000+ ms, according to Firefox's Network console. getRevisionBuilds.php calls are taking anywhere from 1 to 50 seconds to finish loading.
Attached image tbpltimings.png (deleted) —
Some of the timings from the Network tab when loading tbpl.
Just giving the data I was asked about - the purge is running smoothly, tbpl is 53G, and it keeps a month's worth of data:

MariaDB [tbpl_mozilla_org]> select min(endtime) from runs;
+---------------------+
| min(endtime)        |
+---------------------+
| 2013-07-07 02:52:40 |
+---------------------+
1 row in set (0.02 sec)
Flags: needinfo?(scabral)
Thank you Sheeri! :-)
Response times seem to be getting worse (plus vary a lot by location): http://status.mozilla.com/11627/141897/tbpl.mozilla.org
Attached image New Relic graphs for tbpl for Friday (deleted) —
According to New Relic at https://rpm.newrelic.com/accounts/263620/applications/2207380, I set the graphs to Fri 16 Aug, and here's what I found in the "Web Transactions" - App server time section.

Average web transactions, app server time:
/php/getParsedLog.php           7,660 ms  (app server traces: 57.8 s, 12.5 s, 19.4 s)
/php/submitBugzillaComment.php  3,680 ms  (app server traces: 2.4 s, 4.1 s, 2.6 s)

All other web transactions were under 1 sec / 1000 ms.

I've attached screenshots of New Relic for the 24 hours on Friday, and I will also attach the 7 days ending on Friday of last week for comparison. The most time is spent in the DB, but it's always 200ms or less, and usually around 150 ms.
Here are the 7-day graphs. You can clearly see a spike from Sun 11 Aug through Mon 12 Aug. Times in New Relic are Pacific, for reference.
Attached image New Relic graphs for last 12 hours (deleted) —
Here are the graphs for the last 12 hours; you can see how spiky the graph gets (going to 24 hours loses the spikiness). The same getParsedLog and submitBugzillaComment PHP files are the longest ones, and at one point getParsedLog was traced at 99.5 seconds. Have either of these files changed in the past few weeks, or are they being called more frequently?
Sheeri, thank you very much for taking a look! :-)

Those two files haven't been touched for 19 months and 10 months respectively (https://hg.mozilla.org/webtools/tbpl/log/tip/php/getParsedLog.php , https://hg.mozilla.org/webtools/tbpl/log/tip/php/submitBugzillaComment.php) - and they are also the ones expected to take longer, since getParsedLog.php parses up to 50MB text logs and submitBugzillaComment.php has to submit data to the BzAPI.

I think what your graphs do show, however, is that the problem we're seeing here isn't on the server itself (ignoring those two files, the next highest per-file response time is 200ms, for getLogExcerpt.php, which again is sometimes expected to take that long), but in the load balancers (or whatever) in front of it - which makes sense, given that we've seen slowdowns (in the region of 5000ms-30000ms) for the static assets on the site too.

Side note: is there any way for me to get access to the New Relic page for TBPL so I can monitor for other problems in the future? :-)
needinfo :jakem for New Relic access request.
Flags: needinfo?(shyam) → needinfo?(nmaul)
It's worth noting that for the last 24-48 hours, performance for me at least hasn't seemed quite so bad (not quite back to normal, but more usable at least). Don't know if the same has been the case for you, Ryan/Wes?
Hard to say. I've found TBPL getting "stuck" quite a bit the last few days (where it gets stuck on say 97% and only a manual reload fixes it). However, with all the other infra issues we've been having, I've been hesitant to scream too much about it.
2 days ago I moved some VIPs around on the PHX1 load balancer cluster, which hosts (among many other things) tbpl. I didn't move TBPL specifically, but I did shift other things. It's possible this may be partially responsible for the improvement Ed has seen. @phrawzty: you already have a NR account... it's your LDAP username. You should have an email from a while back, or if not a password reset capability.
Flags: needinfo?(nmaul)
Here's some data out of New Relic:

3-month apdex for tbpl.mozilla.org: http://dl.dropbox.com/u/135415857/Screenshots/HEHc.png
3-month response time breakdown: http://dl.dropbox.com/u/135415857/Screenshots/3Cix.png

Database is consistently the biggest chunk by far, probably longer than it should be (this app needs more memcache usage, I think), but nothing like the response times people are quoting above. Comparing individual web servers indicates none of them are performing out of line with the others.

This is an interesting "map" of services that tbpl depends on: http://dl.dropbox.com/u/135415857/Screenshots/Zdn8.png
From that you can see the biggest dependency *by far* is the database, with an average of 129 requests per page view.

Almost all of the total transaction time is spent in getRevisionBuilds.php: http://dl.dropbox.com/u/135415857/Screenshots/XW09.png
*However*, the worst individual transactions are almost always for getParsedLog.php: http://dl.dropbox.com/u/135415857/Screenshots/YsgE.png
Note that these seem relatively infrequent, but they do represent the kind of delays folks are talking about here.

I'm trying to drill down into one of these, but at the moment New Relic appears to be having some trouble. I'll report back when I have more. That's the best I've got so far.
getRevisionBuilds.php is called on the homepage, about 10 times per hit (with different query arguments), and specifies a no-cache header so Zeus can't help. That page also auto-refreshes every few seconds, so these get hit a *lot*.
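Purely as an illustration of the sort of change hinted at here (not something TBPL actually does), a short-lived public cache lifetime would let an upstream cache absorb repeated identical requests between refreshes. The 30-second value and the payload below are assumptions for the sketch:

<?php
// Hypothetical sketch only: send a short public max-age instead of the
// blanket no-cache header, so an upstream cache (e.g. the Zeus load
// balancer) could serve repeated identical getRevisionBuilds.php requests
// for a few seconds. The 30s lifetime is an arbitrary illustrative value,
// not actual TBPL behaviour.
$maxAge = 30;

header('Content-Type: application/json');
header('Cache-Control: public, max-age=' . $maxAge);
header('Vary: Accept-Encoding');

echo json_encode(array('builds' => array())); // stand-in for the real data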
New Relic is still having problems, but I have a somewhat interesting result from the database slow query log. One query in particular accounts for the vast majority of "slow queries", both in total response time and time per call. It also has an obscenely high variance-to-mean ratio, which means the runtime can vary wildly from one call to the next.

The query in question is:

REPLACE INTO `percona`.`checksums` (db, tbl, chunk, chunk_index, lower_boundary, upper_boundary, this_cnt, this_crc)
SELECT 'tbpl_mozilla_org', 'runs_logs', '1', 'PRIMARY', '5414848', '5525915',
       COUNT(*) AS cnt,
       COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `id`, `buildbot_id`, `type`, `content`, CONCAT(ISNULL(`content`)))) AS UNSIGNED)), 10, 16)), 0) AS crc
FROM `tbpl_mozilla_org`.`runs_logs` FORCE INDEX(`PRIMARY`)
WHERE ((`id` >= '5414848')) AND ((`id` <= '5525915'))
/*checksum chunk*/\G

Unfortunately I don't know which file this is called from.

Now, one potential caveat is that the slow query log (at least in its current config on this system) only records queries that take over 2 seconds. If there's another query taking, say, 500ms per call and getting called far more often, that's not currently being recorded.

It's probably best if I refrain from any additional database work and defer that to :sheeri and :cyborgshadow. It might be useful to temporarily tweak the slow query log to record more data, just in case.
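For reference, the tweak being suggested is just a couple of MySQL/MariaDB global variables. The sketch below drives them from PHP only to keep these examples in one language; host and credentials are placeholders, and in practice a DBA would issue the statements straight from the mysql client:

<?php
// Hypothetical sketch: temporarily lower the slow-query threshold from 2s to
// 0.5s so moderately slow but frequent queries also get recorded. Host and
// credentials are placeholders, and the change should be reverted after the
// capture window.
$pdo = new PDO('mysql:host=db-master.example.internal;dbname=tbpl_mozilla_org',
               'admin_user', 'secret');

$pdo->exec("SET GLOBAL slow_query_log = 'ON'");
$pdo->exec("SET GLOBAL long_query_time = 0.5"); // applies to new connections

// ... capture for a while, then restore the original threshold:
$pdo->exec("SET GLOBAL long_query_time = 2");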
That's the table checksum that we run twice daily to ensure that the data on the master matches the data on the slave, and the backup. We can turn those off to see if there's an improvement, but they have been running forever (since it was on generic, before it moved to tbpl). I wouldn't say these queries are the cause, but they are indeed contributing to the load that's on the camel's back. I can try to see if the tables are horribly fragmented this weekend during the tree closure.
Got some details! NR was never broken; my Ghostery was blocking parts of it. :)

Here's a trace of a request that took 4.4 seconds... there seem to be a lot of these, and I now suspect they are the problem: http://dl.dropbox.com/u/135415857/Screenshots/rWnC.png

The full statement is:

SELECT who, note, unix_timestamp(timestamp) AS timestamp
FROM runs_notes
WHERE run_id = :runid
ORDER BY timestamp ASC;

It's a frequently run query, but in this case one particular instance of the run appears to take a long time.
It's searching by run_id, which is a primary key, and I confirmed that an EXPLAIN shows it's using the index.

We are currently logging queries that take longer than 2 seconds, and nothing on the runs_notes table has been slow in the logs, which we have going back to June 6th (on both the master and the slave). (I wanted to see if maybe that query was locked, or if the slow query log had any insight, but that query is not even in the slow query log.)

Can we check internal/external Zeus for times when it might be saturated? Because the query did not take more than 2 seconds inside the DB itself.

I looked on the master and the slave; the slave has *no* slow queries (not even checksum, though the checksum queries are logged as slow on the master, mostly for the runs_logs table), and I verified with a sleep query that slow query logging is functioning.

There are just under 30k rows in the runs_notes table; it is 12M on disk.
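For reference, the index check described above looks roughly like the sketch below, which reuses the runs_notes query quoted earlier; the host, credentials, and run id are placeholders rather than TBPL's real setup:

<?php
// Hypothetical sketch of the EXPLAIN check: run EXPLAIN on the runs_notes
// query and inspect which key MySQL picks. Connection details and the run
// id are placeholders for illustration only.
$pdo = new PDO('mysql:host=db-master.example.internal;dbname=tbpl_mozilla_org',
               'readonly_user', 'secret');

$runId = 12345; // arbitrary id, purely for illustration
$plan = $pdo->query(
    "EXPLAIN SELECT who, note, unix_timestamp(timestamp) AS timestamp
     FROM runs_notes
     WHERE run_id = $runId
     ORDER BY timestamp ASC"
)->fetchAll(PDO::FETCH_ASSOC);

foreach ($plan as $row) {
    // A healthy plan names the index in `key` with a small `rows` estimate;
    // key = NULL would indicate a full table scan.
    echo "{$row['table']}: key={$row['key']}, rows={$row['rows']}\n";
}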
Hrm, and yet looking at a trace of a call to getRevisionBuilds.php that took 4.2 seconds, it looks like the slowness was in the PHP, not the SQL: https://rpm.newrelic.com/accounts/263620/applications/2207380/transactions#id=219719751&tab-detail_1802105587=trace_details&app_trace_id=1802105587

I'm going to log all queries on the master for an hour to see if there's anything obvious we can optimize with regard to query or schema.

It's extremely clear to me, though, that the slave is *vastly* underutilized. I hope I'll see many read queries hitting the master that could be moved to the slave.
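A minimal sketch of the kind of read/write split being suggested, assuming hypothetical master and replica hostnames; TBPL's actual database plumbing may look quite different:

<?php
// Hypothetical sketch of routing reads to the replica while writes stay on
// the master. Hostnames, credentials and the run id are placeholders.
$master  = new PDO('mysql:host=db-master.example.internal;dbname=tbpl_mozilla_org',
                   'tbpl', 'secret');
$replica = new PDO('mysql:host=db-slave.example.internal;dbname=tbpl_mozilla_org',
                   'tbpl', 'secret');

// A read-only lookup such as the runs_notes query can go to the replica.
// (Replication lag is the usual caveat: freshly written notes may take a
// moment to appear there.)
$stmt = $replica->prepare(
    'SELECT who, note, unix_timestamp(timestamp) AS timestamp
     FROM runs_notes
     WHERE run_id = :runid
     ORDER BY timestamp ASC'
);
$stmt->execute(array(':runid' => 12345)); // arbitrary id for illustration
$notes = $stmt->fetchAll(PDO::FETCH_ASSOC);

// Anything that modifies data (note submissions, hidden-build changes, etc.)
// would keep using $master.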
I think we need to rope in the TBPL developers on this... The sources of slowness that we've been able to isolate are:

1) A ~6ms average delay on certain database queries that are executed about 300-400 times per hit to getRevisionBuilds.php... resulting in a few seconds' delay. I suspect the variance is due simply to concurrency issues. There's probably not much we can do about this in the infrastructure; better would be to see what we can do to reduce the number of queries needed (see the batching sketch after this comment).

2) An inexplicable ~4s processing delay loading getRevisionBuilds.php sometimes. New Relic indicates this is within PHP execution, not database time. This is present with or without PHP-APC caching. Right now, that's all I know about this; I don't have a call graph or profile data or anything, so I don't know which part of the file is sometimes slow to execute.

3) getRevisionBuilds.php is itself called about 10 times per hit to the homepage, and is not cacheable (it sends no-cache headers). So the above issues, though rare, are greatly magnified. Fewer calls, or a few seconds' caching, might help a lot.

If we can rework any of these, we should be better off.

Who should we NEEDINFO to investigate how feasible these things are?
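As an illustration of point 1 (not a patch against the actual TBPL code), the classic way to collapse hundreds of per-row lookups into one round trip is to batch the ids into a single IN () query. The sketch assumes the runs_notes query shown earlier is one of the repeated ones, and uses placeholder connection details:

<?php
// Hypothetical sketch: instead of issuing one runs_notes lookup per run
// (hundreds of round trips per getRevisionBuilds.php hit), fetch the notes
// for every run in the revision with a single IN () query and group them in
// PHP. Connection details and the id list are placeholders.
$pdo = new PDO('mysql:host=db-slave.example.internal;dbname=tbpl_mozilla_org',
               'tbpl', 'secret');

$runIds = array(101, 102, 103); // in reality, all run ids for the revision

$placeholders = implode(',', array_fill(0, count($runIds), '?'));
$stmt = $pdo->prepare(
    "SELECT run_id, who, note, unix_timestamp(timestamp) AS timestamp
     FROM runs_notes
     WHERE run_id IN ($placeholders)
     ORDER BY run_id, timestamp ASC"
);
$stmt->execute($runIds);

// Group the notes by run id so callers can still look them up per run.
$notesByRun = array();
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    $notesByRun[$row['run_id']][] = $row;
}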
I defrag'd all the tables during today's outage window. Let's see if that helps any. (I doubt it, given Jake's comment 55, but I figured we should try stuff while we can and see if it helps.)
(In reply to Jake Maul [:jakem] from comment #55)
> If we can rework any of these, we should be better off.
>
> Who should we NEEDINFO to investigate how feasible these things are?

I agree many of these things are suboptimal, but as I mentioned in comment 43, these files haven't been touched for a year or two - and the delays we have been seeing are often much longer than the 4s added at that layer.

We've also seen extremely poor performance on static assets (per comment 17 and comment 43), so I really think this is a load balancer issue - which seems to be corroborated given comment 47 and us seeing improved perf since roughly then :-)
Meant to add - thank you for the changes made in comment 47! :-)
(In reply to Ed Morley [:edmorley UTC+1] from comment #57)
> I agree many of these things are suboptimal, but as I mentioned in comment
> 43, these files haven't been touched for a year or two - and also the delays
> we have been seeing are often much longer than the 4s added at that layer.

I believe the data set is bigger than it was a year ago... commit volume is up quite a bit, and the overall contribution rate seems higher. That in itself might account for a longer delay. In any case, that 4 seconds is not a constant value: it averages 3-4 seconds, but it does sometimes take much longer. It may not be the only problem, but it definitely has a significant performance impact on the service as a whole.

> We've also seen extremely poor performance on static assets (per comment 17
> and comment 43), so I really think this is a load balancer issue - which
> seems to be corroborated given comment 47 and us seeing improved perf since
> roughly then :-)

Glad this helped, but it doesn't make sense to me. I wasn't seeing a significant performance bottleneck at that layer. My actions were primarily to re-balance traffic more evenly between the load balancers in case of failure, not to alleviate any preexisting condition. I don't know why this helped, or even whether it was indeed helpful (perhaps it's a red herring and something else happened around the same time).

We have a plan to migrate this app over to a dedicated Rel-Eng cluster in SCL3... sort of consolidating all the tree-closing stuff together. If you're happy here, we can probably close this bug out and deal with any performance issues during that move. The downside is that it's not highly prioritized right now, so I don't know when we'll get to it. Sometime this year, probably.
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
IIRC this boiled down to load balancer issues; either way, it's long since been fixed.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard