Closed
Bug 598757
Opened 14 years ago
Closed 10 years ago
Running out of space for symbols
Categories
(Release Engineering :: General, defect, P3)
Tracking
(Not tracked)
RESOLVED
INCOMPLETE
People
(Reporter: lsblakk, Assigned: ted)
References
(Blocks 1 open bug)
Details
(Whiteboard: [symbols])
Attachments
(1 file)
(deleted), text/plain
From Nagios:
dm-symbolpush01 disk - /mnt/netapp/breakpad DISK WARNING - free space: /mnt/netapp/breakpad 91633 MB (8% inode=94%)
This already happened back at the beginning of the year, in bug 540713
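For reference, a minimal sketch of pulling the numbers out of an alert line like the one above. The regular expression is inferred from this single sample rather than from Nagios documentation, and `parse_disk_alert` is a hypothetical helper name.

```python
import re

# Field layout inferred from the one sample alert in this bug (an assumption).
ALERT_RE = re.compile(
    r"free space: (?P<mount>\S+) (?P<free_mb>\d+) MB "
    r"\((?P<free_pct>\d+)% inode=(?P<inode_pct>\d+)%\)"
)


def parse_disk_alert(line):
    """Return the mount point, free MB and reported percentages from an alert."""
    match = ALERT_RE.search(line)
    if match is None:
        raise ValueError("unrecognised alert line: %r" % line)
    return {
        "mount": match.group("mount"),
        "free_mb": int(match.group("free_mb")),
        "free_pct": int(match.group("free_pct")),
        "inode_pct": int(match.group("inode_pct")),
    }


sample = ("dm-symbolpush01 disk - /mnt/netapp/breakpad DISK WARNING - "
          "free space: /mnt/netapp/breakpad 91633 MB (8% inode=94%)")
alert = parse_disk_alert(sample)

# Rough volume size implied by "91633 MB is 8% free". The percentage is a
# rounded figure, so this is only a ballpark estimate.
approx_total_gb = alert["free_mb"] / (alert["free_pct"] / 100.0) / 1024
```

With the sample line this puts the volume at roughly 1.1 TB, give or take the rounding in the reported percentage.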
Comment 1•14 years ago
|
||
I already e-mailed ted about this (Sat, 18 Sep 2010 14:40:48) and he is working on it, afaik.
Reporter | ||
Comment 2•14 years ago
|
||
Good news - if there's already a bug tracking this, feel free to dupe; otherwise let's use this one to track its resolution.
Assignee | ||
Comment 3•14 years ago
|
||
I have not filed another bug. I need to do some more investigation, but I think we may need to increase the amount of storage, since we're storing a lot more than we used to.
Assignee: nobody → ted.mielczarek
Assignee | ||
Comment 4•14 years ago
|
||
Ok, so, the symbols_os dir isn't contributing appreciably:
4.0G ../symbols_os
I filed bug 598928 on getting the OpenSuSE symbols dir cleaned up correctly.
I've been looking into it, and I'll have a proposal today. I think the core issue here is just that we have way more branches than we used to, so we're using a lot more space.
Assignee | ||
Comment 5•14 years ago
|
||
Okay, I reworked my old spreadsheet:
https://spreadsheets.google.com/ccc?key=0An_R0AMMILQEcEZDMUtIbU1UZUVJcndRWHhKZDVWSXc&hl=en
I filed all the low-hanging fruit I could find as bugs blocking this one. There are two sheets in that document: the first one tries to estimate current usage, and the second one shows where we could be at if we fixed the low-hanging fruit.
My "current state" estimate is off by about 350GB, which is like 30%, which makes it not a fabulous estimate, I suppose. I can probably refine that, but it's difficult because every single product has slight differences. If my estimates are anywhere near correct, we can probably save a few hundred GB by fixing the low-hanging fruit, but we still might be cutting it close on disk usage. As you can see from the spreadsheet, we are supporting *a lot* of products+branches.
Comment 6•14 years ago
|
||
I gave this volume another 250g, hopefully by the time we get close to using that up, these dependent bugs will be fixed and we will be down to 850g.
Assignee | ||
Comment 7•14 years ago
|
||
Aravind: is there anything else being stored on this volume other than the symbols_* dirs? I ran "du -sh" on each of the symbols_* dirs and it only totaled 659GB. Where's that other ~300GB?
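A rough Python equivalent of the per-directory "du -sh" pass, as a sketch only. It sums apparent file sizes, which can differ from the block counts that du reports, and the helper names are hypothetical.

```python
import glob
import os


def tree_size_bytes(root):
    """Sum the apparent sizes of regular files under root (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def symbols_usage(mount):
    """Map each symbols_* directory directly under mount to its size in bytes."""
    return {
        os.path.basename(path): tree_size_bytes(path)
        for path in sorted(glob.glob(os.path.join(mount, "symbols_*")))
        if os.path.isdir(path)
    }
```

Comparing `sum(symbols_usage(mount).values())` with the used figure from df is one way to spot space taken by anything outside the symbols_* trees.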
Comment 8•14 years ago
|
||
(In reply to comment #7)
> Aravind: is there anything else being stored on this volume other than the
> symbols_* dirs? I ran "du -sh" on each of the symbols_* dirs and it only
> totaled 659GB. Where's that other ~300GB?
Yes, this share also contains some old breakpad dumps. It's not being actively used, but it contains some historical dumps probably totaling about 300GB. I am running a du on those trees and will report back once I know for sure.
Note: these trees are not being used and are static.
Comment 9•14 years ago
|
||
(In reply to comment #8)
> Yes, this share also contains some old breakpad dumps, its not being actively
> used, but it contains some historical dumps probably totaling about 300GB. I
> am running a du on those trees and will report back once I know for sure.
>
The du is done, those static directories come up to 255 GB.
Assignee | ||
Comment 10•14 years ago
|
||
Ok, that makes a lot more sense!
Assignee | ||
Comment 13•14 years ago
|
||
The deps on this still ought to be fixed. We mitigated this a bit by fixing some of them, and bumping up the storage space, but we really should fix the other deps here.
Comment 15•13 years ago
|
||
This started to page again.
Comment 16•13 years ago
|
||
I was going to say "the sjc1 mount will go away in a week", but things are getting a little tight in phx1, too:
10.8.74.240:/vol/pio_symbols
1.9T 1.7T 264G 87% /mnt/netapp/breakpad
Assignee | ||
Comment 17•13 years ago
|
||
There are still a few deps here that could buy us some time, but odds are we're just going to have to increase the storage space. We keep adding new project branches and platforms, and the switch to rapid release added a lot more builds to keep track of, so we increased our storage requirements by quite a bit.
Comment 18•13 years ago
|
||
So let's resize the phx1 share and ignore the warnings in sjc1. Think of it as enticement to leave there :)
This is 2TB in sjc1 now (and 1.9 in scl3..), with 1.8TB used. Let's pin the current usage at 75%, which means the share should be 2.4TB.
That's on 10.8.74.240:/vol/pio_symbols
Storage folks, is that doable?
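The sizing arithmetic above can be checked with a short sketch. `share_size_for_target` is a hypothetical helper, and the inputs are the figures quoted in this comment.

```python
def share_size_for_target(used_tb, target_utilisation):
    """Size the share so that used_tb makes up target_utilisation of it."""
    if not 0 < target_utilisation <= 1:
        raise ValueError("target_utilisation must be in (0, 1]")
    return used_tb / target_utilisation


# 1.8TB used, pinned at 75% of the share.
required_tb = round(share_size_for_target(1.8, 0.75), 2)

# Extra space needed on top of the 2TB share mentioned in the comment.
extra_tb = round(required_tb - 2.0, 2)
```

That is 1.8 / 0.75 = 2.4TB, or about 0.4TB more than the 2TB share mentioned above.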
Assignee: ted.mielczarek → server-ops
Component: Release Engineering → Server Operations: Storage
QA Contact: release → dparsons
Comment 19•13 years ago
|
||
There is no more free space to allocate on that controller. The aggregate is 96% full.
Assignee: server-ops → nobody
Component: Server Operations: Storage → Release Engineering
QA Contact: dparsons → release
Comment 20•13 years ago
|
||
Dan, just to verify, the *phx1* aggregate is full? I only ask because the bug was initially about sjc1, so there's room for confusion.
Should we try to move symbols to scl3? Or buy a storage blade in phx1?
Assignee | ||
Comment 21•13 years ago
|
||
Going forward we only need symbol storage in PHX. Once we've finished migrating the symbol server (bug 688250) and the consumers of dm-symbolpush01 (bug 688186), we can get rid of the symbol store in SJC1 and not replace it.
Comment 22•13 years ago
|
||
Dan's confirmed in email that phx1 is indeed space-constrained. More is being quoted, so let's muddle along here and check in in a few months to see what the space situation looks like.
Comment 24•12 years ago
|
||
Things look particularly tight right now. Is there any additional deletion that can occur?
The new phx1 storage is not yet in place.
Comment 25•12 years ago
|
||
The new phx1 storage hasn't even been ordered yet. I hope it will be soon, but even if it was ordered today, we're looking at at least a month before it's online. We need to do something much sooner.
Assignee | ||
Comment 26•12 years ago
|
||
I don't know that there are any easy wins here at the moment. Have we verified that the symbol cleanup scripts are running on all the various symbols_* directories in this mount? (With the exception of symbols_os.)
I know we had some fiddly issues with index.txt files syncing from SJC->PHX. The cleanup scripts rely on those files. Did we accidentally break something in the move?
Comment 28•12 years ago
|
||
Step 1, don't get confused by the copy of the old sjc1 symbols we now have in scl3, which is mounted on stage.m.o. Look on symbols1.dmz.phx1.mozilla.com instead.
Step 2, look at how many manifests we have for nightly builds - there should be a maximum of 30 for each platform on a branch. In the attachment I'm trying to get linux32 mozilla-central builds, excluding all the other branches and 64-bit, and come up with 47 manifests. It's a similar story for mac (51).
It'll be the naming change between May 16 and May 17 causing the problem, which is a regression from http://hg.mozilla.org/mozilla-central/rev/a0cca6997af4 (bug 753132).
We can probably rename our way to victory, but need to do it very carefully because the naming scheme is a bit funky.
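A sketch of the Step 2 check: count manifests per group and report any group over the expected maximum of 30. The file-name layout and `hypothetical_group_key` are assumptions made for illustration; the real naming scheme is the "funky" part and is not reproduced here.

```python
import re
from collections import Counter

MAX_NIGHTLY_MANIFESTS = 30  # expected maximum per platform on a branch


def hypothetical_group_key(name):
    """Drop a 14-digit build ID field so names differing only by it group together.

    The name layout is assumed for illustration, not taken from the real store.
    """
    return re.sub(r"-\d{14}(?=-)", "", name)


def groups_over_limit(manifest_names, group_key=hypothetical_group_key,
                      limit=MAX_NIGHTLY_MANIFESTS):
    """Return {group: count} for every group holding more than limit manifests."""
    counts = Counter(group_key(name) for name in manifest_names)
    return {group: n for group, n in counts.items() if n > limit}


# Synthetic names only: 47 manifests that differ just by a made-up build ID.
example_names = ["firefox-15.0a1-Linux-2012%010d-symbols.txt" % i
                 for i in range(47)]
offenders = groups_over_limit(example_names)
```

Swapping in a different `group_key` is where the real naming rules would go, including whatever changed between May 16 and May 17.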
Comment 29•12 years ago
|
||
Changes today
* got down to 20G free, so I started looking into renaming some manifests on a less important branch like ux
* wrote a script to do the renames, ran it for ux, then ran the cleanup script at /mnt/netapp/breakpad/cleanup-breakpad-symbols.py against symbols_ffx
* did the same for mozilla-inbound, which was expected to free up ~20G
* talked to lerxst on IRC about not seeing the free space increase, and he's deleted 62G of snapshots (6 hourly + 2 daily), and disabled further snapshots
* we are now at
[177] symbols1.dmz.phx1:disk - /mnt/netapp/breakpad is WARNING: DISK WARNING - free space: /mnt/netapp/breakpad 82501 MB (4% inode=91%)
I'll keep working on other branches to get us some more headroom.
Comment 30•12 years ago
|
||
Standard8, jhopkins - can we nuke symbols_tbrd-test ? Were there any accidents that put actual release symbols into this test directory during the infra transition ? There are some 10.0.2 symbols from April 27, on rev 7d395fbcb557, which looks like a staging release because it doesn't have any tags on it. Also some nightlies which I doubt we need to keep.
We'd free up 50G if this can go.
Comment 31•12 years ago
|
||
(In reply to Nick Thomas [:nthomas] from comment #30)
> Standard8, jhopkins - can we nuke symbols_tbrd-test ? Were there any
> accidents that put actual release symbols into this test directory during
> the infra transition ? There are some 10.0.2 symbols from April 27, on rev
> 7d395fbcb557, which looks like a staging release because it doesn't have any
> tags on it. Also some nightlies which I doubt we need to keep.
>
> We'd free up 50G if this can go.
No objections to the nightly -test builds - we didn't publicise those (they could also be removed from ftp as well...). You might want to check with jhopkins about anything release-like, as I don't know what went on there.
Comment 32•12 years ago
|
||
(In reply to Mark Banner (:standard8) from comment #31)
Ok, I'll wait to hear then.
In other news, I've fixed up the manifest naming in symbols_ffx and symbols_tbrd, and after the cleanup script we have 180G free and should maintain that better. That leaves out branches like {mozilla,comm}-aurora which haven't merged the naming change yet, so I'll need to revisit them after merge.
Also haven't looked at Seamonkey or any of the other apps, since I don't have permissions to make changes. Perhaps we can just let them clean up naturally after 90 days.
Comment 33•12 years ago
|
||
<jhopkins> nthomas: IIRC we had all the test code removed before doing our beta builds, so i don't see why we'd need the test symbols
Deleted, 200G free now. dustin, could you please remove
symbols1.dmz.phx1:/mnt/netapp/breakpad/symbols_tbrd-test
Comment 34•12 years ago
|
||
symbols_xr done too, only 10G more space from that.
HOWTO:
For {mozilla,comm}-{central,aurora} nightlies you'd do the likes of this:
python /home/ffxbld/bug598757/rename-trunk.py /mnt/netapp/breakpad/symbols_ffx
python /mnt/netapp/breakpad/cleanup-breakpad-symbols.py \
/mnt/netapp/breakpad/symbols_ffx
For Firefox branches which are peers of mozilla-central, like profiling, fx-team etc, do this:
python /home/ffxbld/bug598757/rename-peer.py \
/mnt/netapp/breakpad/symbols_ffx profiling
python /mnt/netapp/breakpad/cleanup-breakpad-symbols.py \
/mnt/netapp/breakpad/symbols_ffx
Assignee | ||
Comment 35•12 years ago
|
||
Thanks for the detective work, Nick! I should have known better...
I've fiddled with this so many times that I just get depressed every time I think about it.
Comment 36•12 years ago
|
||
(In reply to Nick Thomas [:nthomas] from comment #33)
> Deleted, 200G free now. dustin, could you please remove
> symbols1.dmz.phx1:/mnt/netapp/breakpad/symbols_tbrd-test
done
Comment 37•12 years ago
|
||
Turns out this is going to be a big problem.
Rapid betas are going to cause a huge spike in the amount of storage needed, and we're already really borderline. We need to get more space - ideally we need to double what we have from 2TB to 4TB. Should I file a separate bug for that? Which component should I file it in?
Rapid betas start on July 17, so time is super short.
Severity: normal → critical
Updated•12 years ago
|
Blocks: daily_beta_tracking
Comment 38•12 years ago
|
||
There simply is not enough disk space in phx1 to do this. The most I could give you is another 200GB. We ordered additional capacity, but it will probably be 30 to 90 days before it is online.
Comment 39•12 years ago
|
||
I understand from akeybl that the rapid beta date has literally just been pushed back to 8/28 (woohoo!) which gives us some extra breathing room for the new storage to come online. We'll prune aggressively and try and stay inside the 200GB extra. ted is going to do some more analysis and see what he can cut.
Comment 40•12 years ago
|
||
OK great. I just talked to :dmoore and we might be able to get extra capacity online in the beginning of July. Note that this will be a totally separate system, so we'll have to migrate the existing volume in order to give it more space.
Assignee | ||
Comment 41•12 years ago
|
||
So I think the biggest remaining problem here is that we don't currently clean up beta builds, and with the switch to rapid release we have a *lot* of beta builds.
Rapid betas would just make this already bad problem really really bad.
Severity: critical → normal
Assignee | ||
Updated•12 years ago
|
Severity: normal → critical
Comment 42•12 years ago
|
||
I just gave it +200GB. Just so everyone's clear, after this space is used up, there is literally nothing else I can do to give you more space on that volume.
Assignee | ||
Comment 43•12 years ago
|
||
Just so I don't drop this on the floor, a few numbers I've been poking at:
[tmielczarek@symbols1.dmz.phx1 symbols_ffx]$ du -sh .
796G .
18G symbols_fedora
110G symbols_ubuntu/
I'm looking into lowering the 90 day cleanup time down to 45 days, which will buy us ~90GB of space just from the Firefox+Thunderbird directories alone.
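To make the effect of the retention change concrete, here is a small sketch that picks out entries older than a cutoff. It assumes age is judged by a per-file timestamp such as mtime, which may not be how the real cleanup script decides; `expired` is a hypothetical helper.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def expired(timestamps, max_age_days, now=None):
    """Return the names whose timestamp is more than max_age_days before now.

    timestamps maps a name to seconds since the epoch (for example an mtime).
    """
    if now is None:
        now = time.time()
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return sorted(name for name, stamp in timestamps.items() if stamp < cutoff)


# Made-up ages, in days before "now", purely for illustration.
now = 1000 * SECONDS_PER_DAY
ages = {"a": 100, "b": 60, "c": 10}
stamps = {name: now - days * SECONDS_PER_DAY for name, days in ages.items()}

at_90_days = expired(stamps, 90, now)
at_45_days = expired(stamps, 45, now)
```

Everything between 45 and 90 days old (here "b") is what the shorter window would additionally free.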
Assignee | ||
Comment 44•12 years ago
|
||
4.3G symbols_camino
Assignee | ||
Comment 45•12 years ago
|
||
54G /mnt/netapp/breakpad/symbols_opensuse
Comment 46•12 years ago
|
||
There are some orphans too. Looking in the mac directory symbols_ffx/XUL, there are 188 directories that aren't mentioned in the manifests, using 42G of space. There is a big range of ages, some as recent as 2012-04 others back to 2007-09. Also 3642 empty dirs making ls operations slow (but only wasting 14MB).
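One way to find such orphans, sketched under an assumption about the manifest format: each line is taken to be a relative path of the form `<debug file>/<id>/<file>`, so the second component names a directory that should exist on disk. Both helper names are hypothetical.

```python
import os


def referenced_dirs(manifest_lines, debug_file):
    """Collect the second path component of lines that start with debug_file/."""
    ids = set()
    for line in manifest_lines:
        parts = line.strip().split("/")
        if len(parts) >= 3 and parts[0] == debug_file:
            ids.add(parts[1])
    return ids


def find_orphans(store_root, debug_file, manifest_lines):
    """List subdirectories of store_root/debug_file that no manifest line mentions."""
    base = os.path.join(store_root, debug_file)
    on_disk = {
        entry for entry in os.listdir(base)
        if os.path.isdir(os.path.join(base, entry))
    }
    return sorted(on_disk - referenced_dirs(manifest_lines, debug_file))
```

For the case above this would be called with the symbols_ffx directory as `store_root`, "XUL" as `debug_file`, and the concatenated lines of every manifest in the store. Anything it returns should be double-checked before deletion, since a manifest that failed to sync would make live symbols look orphaned.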
Assignee | ||
Comment 47•12 years ago
|
||
461G symbols_tbrd
Assignee | ||
Comment 48•12 years ago
|
||
28G /mnt/netapp/breakpad/symbols_os/
Assignee | ||
Comment 49•12 years ago
|
||
Longer term fix: bug 684251
Assignee | ||
Updated•12 years ago
|
Assignee: nobody → ted.mielczarek
Assignee | ||
Comment 50•12 years ago
|
||
I think there are only two easy fixes we could do in the short term:
1) Clean up the swath of old beta/RC builds that we've accumulated in the switch to rapid release. (+ the few misnamed ESR nightlies)
2) Per comment 46, clean up orphaned symbol files.
Disk usage seems to be okay for the moment:
Filesystem Size Used Avail Use% Mounted on
10.8.74.240:/vol/pio_symbols
2.2T 1.8T 351G 84% /mnt/netapp/breakpad
I'd actually expect it to drop somewhat over the next 30+ days as the fix for bug 587073 rolls out.
Comment 51•12 years ago
|
||
This is alerting 95% full.
10.8.74.240:/vol/pio_symbols
2.0T 1.8T 117G 95% /mnt/netapp/breakpad
Comment 52•12 years ago
|
||
I'm showing 266GB free now... is this still a problem?
Comment 53•12 years ago
|
||
This alerted again, so I think the answer is that it is an intermittent problem and it'd be good to have more headroom on this filesystem. I ran the cleanup script /mnt/netapp/breakpad/cleanup-breakpad-symbols.sh and it got down to 94% full which Nagios was happy with but more disk space or less data would be good here.
Lowering priority to normal to reflect the situation better.
Severity: critical → normal
Comment 54•12 years ago
|
||
I gave the volume another 100GB, it's at 89% full now.
What needs to happen in order to mark this bug as R/F?
Comment 56•12 years ago
|
||
By moving some unrelated volumes off this filer, we've quietly nursed our way to absorbing another 800g of symbols in the last 4 months, but we're at the point of diminishing returns and can't keep up that rate.
So, this is a friendly early warning that we're going to be showing another episode of "Dude, Where's my Diskspace Cleanup?" on the symbols volume in the near future.
Assignee | ||
Comment 57•12 years ago
|
||
Exciting. I will make time to do some analysis next week and see if there's anything stupid going on. I would hazard a guess that it's just growth, with new platforms and B2G and everything, but sometimes there are silly things that slip in.
Assignee | ||
Comment 58•12 years ago
|
||
(We are also working on a design for storing a large portion of this data in Postgres, which should greatly reduce our storage requirements.)
Assignee | ||
Comment 59•12 years ago
|
||
I suspect I need to buckle down and fix bug 599347. I think the rapid release cycle is mostly what's screwing us here, especially since we pile up beta builds that aren't currently cleaned up.
Comment 61•12 years ago
|
||
This is lighting off alarms for our SREs (bug 849494).
If we're not going to have an automated solution RSN, we need at least some manual cleanup for a limpalong.
Comment 62•12 years ago
|
||
Now at 95% and still alerting.
<nagios-phx1:#sysadmins> Mon 05:39:18 PDT [191]
symbols1.dmz.phx1.mozilla.com:NFS Mounts - /mnt/netapp/breakpad is WARNING:
DISK WARNING - free space: /mnt/netapp/breakpad 173706 MB (5% inode=91%):
(http://m.allizom.org/NFS+Mounts+-+/mnt/netapp/breakpad)
10.8.74.240:/vol/pio_symbols
3145728000 2969250656 176477344 95% /mnt/netapp/breakpad
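The 95% figure can be reproduced from the raw 1K-block counts above. This sketch assumes df reports used / (used + available) rounded up to a whole percent, which is my understanding of GNU df's behaviour; `df_use_percent` is a hypothetical helper.

```python
def df_use_percent(used_blocks, avail_blocks):
    """Used share of used+available, rounded up to a whole percent."""
    total = used_blocks + avail_blocks
    return -(-used_blocks * 100 // total)  # integer ceiling division


# Figures from the df line above, in 1K blocks.
TOTAL_BLOCKS = 3145728000
USED_BLOCKS = 2969250656
AVAIL_BLOCKS = 176477344

use_pct = df_use_percent(USED_BLOCKS, AVAIL_BLOCKS)
total_gib = TOTAL_BLOCKS // (1024 * 1024)
avail_gib = AVAIL_BLOCKS / (1024.0 * 1024.0)
```

The exact ratio is about 94.4%, which rounds up to the 95% shown, on a volume of 3000 GiB with roughly 168 GiB available.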
Assignee | ||
Comment 63•12 years ago
|
||
Okay, I started working on a change to the cleanup script to let us clean out some old data (as a one-off first, and then I'll figure out how to make it a regular part of cleanup). I think the easiest thing to do first is clean out old XULRunner release symbols. I don't think we're actually using them for anything, and there's almost 500GB of XR symbols total. I'll follow up on IRC or in the bug today when I have it ready.
Assignee | ||
Comment 64•12 years ago
|
||
Following up with pir on IRC. Looks like we should be able to reclaim ~322GB by removing old XULRunner release symbols.
Comment 65•12 years ago
|
||
[root@symbols1.dmz.phx1 breakpad]# df -h /mnt/netapp/breakpad
Filesystem Size Used Avail Use% Mounted on
10.8.74.240:/vol/pio_symbols
3.0T 2.8T 166G 95% /mnt/netapp/breakpad
[root@symbols1.dmz.phx1 pradcliffe]# cd /mnt/netapp/breakpad/symbols_xr
[root@symbols1.dmz.phx1 symbols_xr]# ~tmielczarek/cleanup-breakpad-symbols.py -r /mnt/netapp/breakpad/symbols_xr *-1.9* *-4.0* *-5.0* *-6.0* *-7.0* *-8.0* *-9.0* *-10.0* *-11.0* *-12.0* *-13.0* *-14.0* *-15.0* *-16.0* *-17.0* *-18.0*
[1/4] Reading symbol index files...
[2/4] Looking for symbols to delete...
[3/4] Deleting symbols...
[4/4] Pruning empty directories...
[root@symbols1.dmz.phx1 symbols_xr]# df -h /mnt/netapp/breakpad
Filesystem Size Used Avail Use% Mounted on
10.8.74.240:/vol/pio_symbols
3.0T 2.6T 388G 88% /mnt/netapp/breakpad
Comment 66•12 years ago
|
||
[root@symbols1.dmz.phx1 symbols_sea]# cd /mnt/netapp/breakpad/symbols_sea; ~tmielczarek/cleanup-breakpad-symbols.py -r /mnt/netapp/breakpad/symbols_sea *-2.0* *-2.1-* *-2.1.* *-2.2* *-2.3* *-2.4* *-2.5* *-2.6* *-2.7* *-2.8* *-2.9* *-2.10* *-2.11* *-2.12* *-2.13* *-2.14*
[1/4] Reading symbol index files...
[2/4] Looking for symbols to delete...
[3/4] Deleting symbols...
df -h /mnt/netapp/breakpad
[4/4] Pruning empty directories...
[root@symbols1.dmz.phx1 symbols_sea]# df -h /mnt/netapp/breakpad
Filesystem Size Used Avail Use% Mounted on
10.8.74.240:/vol/pio_symbols
3.0T 2.5T 511G 83% /mnt/netapp/breakpad
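A note on the glob list in the symbols_sea run: 2.1 is spelled as `*-2.1-*` and `*-2.1.*` rather than `*-2.1*`, presumably so that versions above 2.14 are not caught. The sketch below checks that with Python's `fnmatch`, using hypothetical index file names; the real names in the store may differ.

```python
from fnmatch import fnmatchcase

# Patterns copied from the symbols_sea command above.
PATTERNS = [
    "*-2.0*", "*-2.1-*", "*-2.1.*", "*-2.2*", "*-2.3*", "*-2.4*",
    "*-2.5*", "*-2.6*", "*-2.7*", "*-2.8*", "*-2.9*", "*-2.10*",
    "*-2.11*", "*-2.12*", "*-2.13*", "*-2.14*",
]


def selected(name, patterns=PATTERNS):
    """True if name matches any of the glob patterns."""
    return any(fnmatchcase(name, pattern) for pattern in patterns)


# Hypothetical file names, for illustration only.
names = [
    "seamonkey-2.1-Linux-20110610000000-symbols.txt",
    "seamonkey-2.1.1-Linux-20110701000000-symbols.txt",
    "seamonkey-2.14.1-Linux-20121130000000-symbols.txt",
    "seamonkey-2.15-Linux-20130104000000-symbols.txt",
]
matches = {name: selected(name) for name in names}

# A looser pattern for 2.1 would also catch the 2.15 name.
loose_hits_2_15 = fnmatchcase(names[3], "*-2.1*")
```

With these names, the 2.1, 2.1.1 and 2.14.1 files are selected and the 2.15 file is left alone, whereas a bare `*-2.1*` would have selected it too.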
Assignee | ||
Comment 67•12 years ago
|
||
That should be enough breathing room for now. I'm going to leave this open until I fix bug 599347, which should be a longer-term fix. I also filed bug 849808 to figure out how to archive old release symbols.
Assignee | ||
Comment 68•12 years ago
|
||
I pushed the changes I made to the cleanup script that I wrote for pir to use above:
http://hg.mozilla.org/build/tools/diff/4362a609d8d8/buildfarm/breakpad/cleanup-breakpad-symbols.py
Comment 69•11 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #67)
> That should be enough breathing room for now. I'm going to leave this open
> until I fix bug 599347, which should be a longer-term fix. I also filed bug
> 849808 to figure out how to archive old release symbols.
Found in triage.
Anything left to do here? If I read the last few comments correctly, this is now a tracking bug, and all remaining work is in depbugs?
Flags: needinfo?(ted)
Assignee | ||
Comment 70•11 years ago
|
||
Yeah. If you think closing this bug or annotating it separately is useful we can do that. We should probably get bugs like this out of RelEng components, but we don't have a useful component for them right now.
Also, bug 889691 moved the storage for this and made this less of a pressing issue.
Flags: needinfo?(ted)
Updated•11 years ago
|
Product: mozilla.org → Release Engineering
Comment 72•11 years ago
|
||
Might need to run the cleanup script again:
<#sysadmins>Wed 02:42:38 PST [1964] symbols1.dmz.phx1.mozilla.com:NFS Mounts - /mnt/netapp/breakpad is WARNING: DISK WARNING - free space: /mnt/netapp/breakpad 261897 MB (5% inode=89%):
Assignee | ||
Comment 73•10 years ago
|
||
This is going to be irrelevant due to bug 1071724.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → INCOMPLETE