Closed Bug 839375 Opened 12 years ago Closed 12 years ago

foopy24 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86_64
Windows 7
task
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: Callek, Unassigned)

References

Details

So, foopy24 currently has on it:

bash-3.2$ ls -d tegra-*
tegra-191 tegra-201 tegra-277 tegra-279 tegra-281 tegra-283 tegra-285
tegra-192 tegra-276 tegra-278 tegra-280 tegra-282 tegra-284

and it was having issues earlier (timeouts in steps that do heavy disk I/O on the foopy: rm -rf ...; unzip -oq ...). For fear of a larger issue, I stopped clientproxy on all devices on this foopy after the poke from philor.

I'm doing some cleanup (`rm -f /builds/tegra-*/{client*,twistd}.log.*`) right now, because there were a LOT of old copies of these files around here, but disk usage is actually quite low:

bash-3.2$ df -h
Filesystem      Size   Used  Avail Capacity  Mounted on
/dev/disk0s2   465Gi   42Gi  423Gi    10%    /
devfs          183Ki  183Ki    0Bi   100%    /dev
map -hosts       0Bi    0Bi    0Bi   100%    /net
map auto_home    0Bi    0Bi    0Bi   100%    /home

Also, ganglia doesn't show any obvious problem pattern:
http://ganglia3.build.mtv1.mozilla.com/ganglia/?r=4hr&cs=&ce=&c=RelEngMTV1&h=foopy24.build.mtv1.mozilla.com&tab=m&vn=&mc=2&z=small&metric_group=ALLGROUPS

(I'll note, however, that ganglia is not showing wio stats at all on this foopy, which might just be a factor of it being a mac foopy, or something in the config.)

Rebooting this foopy in moments, before I bring all the devices back up. Having seen no obvious problems, this bug is to help track in case of future issues.

================
[Times ET]
[21:38:13] philor huh, what would produce a sudden rash of timeouts in tegra 'rm -rf build' steps?
[21:43:54] Callek|Buildduty philor: tegra rm -rf build !? odd
[21:47:14] philor Callek|Buildduty: tegra-191, tegra-279, tegra-201, tegra-278, tegra-282,
https://tbpl.mozilla.org/php/getParsedLog.php?id=19553407&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=19553415&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=19553399&tree=Mozilla-Inbound
[21:47:31] philor is that maybe all one foopy?
[21:48:09] Callek|Buildduty about to find out
[21:48:10] Callek|Buildduty
[21:48:58] Callek|Buildduty philor: yes, *ALL* one foopy
[21:49:13] Callek|Buildduty philor: sounds like a good case of "stop them all then investigate"
[21:49:14] Callek|Buildduty on it
[21:49:16] =-= philor is now known as philor|away
[21:49:59] Callek|Buildduty philor|away: I'm killing all on that one foopy
[23:05:52] Callek|Buildduty philor: sooo, I'm not seeing the rm -rf failure -- I do see however in all those logs timeouts when unzipping tests
[23:07:54] philor Callek|Buildduty: ah, five failures that all appeared at once, the last two were unzipping, the first three were rm -rf, and I didn't notice the last two weren't the same
[23:08:28] Callek|Buildduty ahh ok
[23:08:29] philor
https://tbpl.mozilla.org/php/getParsedLog.php?id=19553399&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=19553410&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=19553411&tree=Mozilla-Inbound
(In reply to Justin Wood (:Callek) from comment #0)
> I'm doing some cleanup (`rm -f /builds/tegra-*/{client*,twistd}.log.*`)
> right now, because there was a LOT of these old copies of files around here,
> but disk capacity is awfully low usage:

Ok, having thought about this, it is likely the problem. Here's why:

Foopies have a high demand on disk I/O as it is; the actual jobs need a lot of reads and writes.

twistd will enumerate and rename *every* twistd.log.* file when it needs to rotate (twistd.log.2 --> twistd.log.3, twistd.log.1 --> twistd.log.2, and so on), and the set of numbered copies grows indefinitely. [c.f. http://mxr.mozilla.org/build/source/twisted/twisted/python/logfile.py#177]

clientproxy also uses log rotation, with similar logic via RotatingFileHandler: http://docs.python.org/release/2.4.4/lib/node348.html

Combined, this can yield horrid I/O throughput if the conditions line up right. I'll be going through the foopies tomorrow to clean up old logs, due to this issue.
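To make the cost concrete, here is a minimal sketch (not the actual twisted code, just an illustration of the rename cascade described above): every rotation renames every existing numbered copy, so each rotation costs N+1 renames plus a directory scan when N old copies have accumulated -- which is why months of un-pruned logs turn a cheap operation into an I/O storm.

```python
import os
import tempfile

def rotate(path):
    """Unbounded twisted-style rotation sketch: shift every numbered
    copy up by one, then move the live log to <path>.1."""
    d = os.path.dirname(path) or "."
    base = os.path.basename(path)
    # Scan the directory for existing numbered copies (twistd.log.1, ...).
    suffixes = []
    for name in os.listdir(d):
        if name.startswith(base + "."):
            tail = name[len(base) + 1:]
            if tail.isdigit():
                suffixes.append(int(tail))
    # Rename highest-numbered first so nothing gets overwritten.
    for n in sorted(suffixes, reverse=True):
        os.rename("%s.%d" % (path, n), "%s.%d" % (path, n + 1))
    os.rename(path, path + ".1")

# Demo: with three old copies on disk, one rotation does four renames.
tmp = tempfile.mkdtemp()
log = os.path.join(tmp, "twistd.log")
for name in ("twistd.log", "twistd.log.1", "twistd.log.2", "twistd.log.3"):
    open(os.path.join(tmp, name), "w").close()
rotate(log)
print(sorted(os.listdir(tmp)))
# -> ['twistd.log.1', 'twistd.log.2', 'twistd.log.3', 'twistd.log.4']
```

By contrast, Python's logging.handlers.RotatingFileHandler caps the cascade when constructed with a backupCount, deleting the oldest copy instead of renaming forever -- which is effectively what the manual log cleanup restores here.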
Summary: foopy24 problem tracking → Rotating logfiles indefinitely on foopies leads to failures in I/O heavy buildbot steps.
Hrm, maybe not... philor just reported more issues here. Re-stopping all devices on this foopy, and I will enroll it in diagnostics (a stuck rm is not a good sign, imo) -- also, the whole thing is acting VERY slow/problematic as I ssh in right now.

Processes: 97 total, 3 running, 10 stuck, 84 sleeping, 281 threads          22:38:35
Load Avg: 0.15, 0.15, 0.16
CPU usage: 0.72% user, 2.42% sys, 96.85% idle
SharedLibs: 12M resident, 4376K data, 0B linkedit.
MemRegions: 10162 total, 648M resident, 19M private, 192M shared.
PhysMem: 1303M wired, 606M active, 425M inactive, 2333M used, 5856M free.
VM: 231G vsize, 1148M framework vsize, 17489(0) pageins, 0(0) pageouts.
Networks: packets: 2098605/2001M in, 880390/337M out.
Disks: 293904/3555M read, 736228/14G written.

PID  COMMAND   %CPU TIME     #TH #WQ #POR #MRE RPRVT RSHRD  RSIZE VPRVT VSIZE PGRP PPID STATE    UID
2909 rm        0.5  00:00.01 1   0   17   22+  104K- 224K+  380K+ 17M-  2378M 646  647  stuck    501
2908 rm        0.7  00:00.01 1   0   17   21   108K- 224K+  380K+ 17M-  2378M 667  668  stuck    501
2907 rm        0.6  00:00.01 1   0   17   21   112K+ 220K-  380K+ 17M+  2378M 634  635  stuck    501
2894 top       2.9  00:00.34 1/1 0   32   38   1040K 216K   1752K 19M   2379M 2894 2892 running  0
2892 bash      0.0  00:00.01 1   0   21   23   272K  840K   960K  17M   2378M 2892 2891 sleeping 501
2891 sshd      0.0  00:00.00 1   0   13   86   252K  1628K  816K  1504K 2413M 2889 2889 sleeping 501
2889 sshd      0.0  00:00.12 2   1   37   81   480K  1628K  2956K 712K  2412M 2889 1    sleeping 0
2887 unzip     1.0  00:00.58 1   0   18   23   228K  328K   588K  17M   2378M 494  495  stuck    501
2886 rm        0.2  00:00.43 1   0   17   21   128K  220K   396K  17M   2378M 488  489  stuck    501
2884 wget      0.2  00:00.48 2   1   40   67   652K  480K   1708K 40M   2404M 360  361  stuck    501
2781 Python    0.5  00:10.90 2   1   33   209  21M   1920K  42M   104M  2478M 548  549  stuck    501
2686 unzip     0.4  00:04.96 1   0   17   23   228K  324K   576K  17M   2378M 362  363  stuck    501
2557 distnoted 0.0  00:00.01 2   1   41   54   548K  240K   1184K 42M   2411M 2557 2540 sleeping 501
2540 launchd   0.0  00:00.09 2   0   52   48   412K  412K   844K  48M   2409M 2540 1    sleeping 501
1023 taskgated 0.0  00:00.36 1   0   30   43   720K  364K   2080K 46M   2406M 1023 1    sleeping 0
668  Python    0.0  00:02.48 2   0   11   181  18M   3568K- 13M   52M   2425M 667  1    sleeping 501
647  Python    0.0  00:02.58 2   0   11   180  18M   3568K- 13M   51M   2424M 646  1    sleeping 501
635  Python    0.0  00:02.61 2   0   11   181  18M   3568K  13M   52M   2425M 634  1    sleeping 501
604  Python    0.0  00:00.59 1   0   9    135  1288K 10M    2812K 2676K 2390M 601  602  sleeping 501
602  Python    0.0  00:00.19 2   0   10   143  1724K 10M    3520K 30M   2417M 601  1    sleeping 501
566  Python    0.0  00:02.47 2   0   11   181  16M   3568K  13M   52M   2425M 565  1    sleeping 501
551  Python    0.0  00:02.54 2   0   11   181  16M   3568K  13M   52M   2425M 550  1    sleeping 501
549  Python    0.0  00:02.68 2   0   11   181  17M   3568K  13M   52M   2425M 548  1    sleeping 501
Depends on: 839407
Blocks: tegra-191
Blocks: tegra-192
Blocks: tegra-201
Blocks: tegra-276
Blocks: tegra-277
Blocks: tegra-278
Blocks: tegra-279
Blocks: tegra-280
Blocks: tegra-281
Blocks: tegra-282
Blocks: tegra-283
Blocks: tegra-284
Blocks: tegra-285
Changing the summary to reflect that this is, in the end, about foopy24, since it was still having issues after the log cleanup and no other foopies exhibited issues.
Summary: Rotating logfiles indefinitely on foopies leads to failures in I/O heavy buildbot steps. → foopy24 problem tracking
Blocks: 843918
Blocks: 851562
Depends on: 851579
No longer depends on: 851579
So, I just ignorantly started all of these tegras up on foopy127 because I was trying to work my way through the buildduty queue. They're in production now, is this going to break things?
They seem to be green....
Resolving as "incomplete" since we never actually fixed the issues on foopy24, but foopy24 no longer hosts any of our devices, so I'm not expecting the problem to resurface on the linux foopies.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → INCOMPLETE
Product: mozilla.org → Release Engineering
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.