Closed Bug 751557 Opened 13 years ago Closed 8 years ago

Adding new jobs in a Jenkins CI causes a lot of leaks and ends up in an OOM [@ mozalloc_abort(char const*) | Abort ]

Categories

(Tech Evangelism Graveyard :: Other, defect)

All
macOS
defect
Not set
major

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: whimboo, Unassigned)

References


Details

(Keywords: hang, memory-leak, Whiteboard: [MemShrink:P2])

Crash Data

Attachments

(1 file, 5 obsolete files)

Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:14.0) Gecko/20120502 Firefox/14.0a2

I first saw this issue some weeks ago and talked to Olli about it. Since then it has only recurred sporadically. The last time was at the end of last week, when I didn't have time for a deeper analysis. Today it happened again, and I tried to find steps to reproduce, but so far I haven't been lucky. The following information is everything I have at the moment:

* It happens in my daily profile.
* Resident memory usage grows up to 2GB within an hour or so.
* GC durations go up to around 4s(!) and make the browser unusable.
* It happens when I have the following Jenkins pages for Mozmill CI loaded (they are on the MV VPN):
  http://release3.qa.mtv1.mozilla.com:8080 (daily tests)
  http://release8-osx-107.qa.mtv1.mozilla.com:8080/ (l10n tests)

First I thought it was related to auto-refresh being enabled, but that doesn't look like the reason. Another assumption is that it happens when new job requests get added via the Jenkins API; that's what I will test next. Otherwise I will try to reduce my profile as much as possible so I have a minimized profile at hand.

Based on the cc log I have sent to Olli, there were about 32,544 nsDOMStorageItem objects and over 100,000 nsXPCWrappedJS (nsISupports) objects listed. For the first issue Olli already filed bug 751529; for the latter no action has happened yet. I will leave it up to Olli because he is currently analyzing my cc log. I hope I can come up with better STR soon.
Keywords: qawanted
Henrik, while you investigate STR, can you also check to see if:
A) This is any better or worse in Firefox 15 Nightly
B) This has gotten worse over time (for example, Firefox 12, 11, 10, etc.)
CC times or GC times are 4s? Though I guess both could be bad given how much memory it is using up.
Attached file MemChaser log (obsolete) (deleted) —
It's GC only. But I wanted to add a MemChaser log which can hopefully give some more information about what is getting executed.
There are a number of long CCs (one is 12 seconds!), but the GCs do look a little longer. I'd guess that GC times are long because the page is using an obscene amount of memory. If you could get the contents of about:memory?verbose when you are having this problem, that might help.
To some extent, the long GCs are a symptom of all the xpconnect stuff going on. We're spending ~100ms in root marking, which is very unusual. Another ~90ms is in mark_other. This means that we have to enumerate a bunch of xpconnect gray roots, and we're also spending a lot of time in gray marking. I'm not sure why the sweeping times are so long. That's been a persistent issue for us, though, so it's known.
Attached file MemChaser log (obsolete) (deleted) —
Updated MemChaser log, now with a memory consumption of about 2GB of RAM. The log shows the constant rise of the usage over time. I think that this leak happens when new jobs are added to the queue. I will have to test this now on my local instance. If I can prove it, I will give detailed instructions on how to reproduce.
Attachment #620738 - Attachment is obsolete: true
The same effect can also be seen in Safari. I have had the Jenkins website for our master open for a couple of days now, and Safari claimed about 3.25GB of memory.
For people with VPN access, simply open the following URL and keep the browser open for a couple of days. You will have to catch the time when new job requests get added, that is, when builds have finished. That's early in the day PT, so you should keep the browser open. When you have the Jenkins web page open, please make sure not to change any configuration aspects. Thanks.
Olli, would you mind giving an update based on the latest cc log I have sent to you? You mentioned about 170,000 XPCVariant objects; is that something we should file as a new bug?
Blocks: 702495
Assignee: nobody → english-us
Component: General → English US
Product: Core → Tech Evangelism
QA Contact: general → english-us
Whiteboard: [MemShrink]
Version: 14 Branch → unspecified
Over to tech evang; given comment 7, this doesn't sound like our bug. It would be great if pages which created giant heaps didn't cause GC pauses, but we already have bugs on that.
Attached file Detailed MemChaser log (obsolete) (deleted) —
This is a detailed log from MemChaser which I have wanted to attach for ages. Sorry for the delay. It hopefully contains all the information we need to sort out this issue.
Attachment #622357 - Attachment is obsolete: true
I have had Jenkins open for an hour or so, and Firefox's memory rose from 500MB to 1.1GB. Closing the Jenkins tab froze Firefox for about 1 minute. As the log shows, there are CC cycles of about 27s in duration, but also iGCs of 6s. I wonder if we can do something here rather than leaving it in the evangelism section only.
Keywords: hang
The MemChaser log doesn't give much information. GC and CC logs are needed: https://wiki.mozilla.org/Performance:Leak_Tools#Cycle_collector_heap_dump IIRC you once gave me some logs, but I don't have them on this laptop.
But Justin is right. If the web app leaks, then it leaks, and that will at some point slow down performance in any browser. We can of course try to make CC and GC faster, but the web app needs to be fixed.
Olli, if you have VPN access to MV please open http://10.250.73.243:8080/ and enable 'auto refresh' at the top right corner. Leave the tab open for a while and you will notice the effect really soon.
The non-closing CC times could probably be improved, as the CC is looking at 200k GC objects and not freeing any. DOM merging could potentially help the closing CC, as there are 1.2 million ref counted nodes in the graph, but it depends on what exactly they are. I'm not sure if anything can be done for the GC. The page is just allocating a huge amount of memory and never letting it go. A large percentage of the GCs are nonincremental because the page is allocating so much (and I guess the GC is falling behind).
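(For reference, a quick way to see what dominates a CC graph like this is to tally the object labels in a cc-edges log. Below is a minimal sketch, assuming the plain-text log format where node lines start with an address such as "0x..." followed by a bracketed refcount/GC marker and a label, and edge lines start with ">"; adjust the parsing if your log differs.)

// tally-cc-log.ts -- rough tally of object labels in a cc-edges log (assumed format)
import * as fs from "fs";
import * as readline from "readline";

async function tallyCcLog(path: string): Promise<void> {
  const counts = new Map<string, number>();
  const rl = readline.createInterface({ input: fs.createReadStream(path), crlfDelay: Infinity });

  for await (const line of rl) {
    // Node lines look roughly like "0xADDR [rc=N] Label ..." or "0xADDR [gc] Label ...";
    // edge lines start with ">" and are skipped here.
    if (!line.startsWith("0x")) continue;
    const m = line.match(/^0x[0-9a-fA-F]+ \[[^\]]*\] (.+)$/);
    if (m) counts.set(m[1], (counts.get(m[1]) ?? 0) + 1);
  }

  const top = [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, 20);
  for (const [label, n] of top) console.log(`${n}\t${label}`);
}

tallyCcLog(process.argv[2] ?? "cc-edges.log").catch(console.error);

Run against a log like the ones discussed here, this should show whether nsXPCWrappedJS, DOM nodes, or something else makes up the bulk of the graph.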
Olli, I can get a log for you, if you don't regularly VPN in, as I'll be in the MV office.
Downloadable from Dropbox is a GC/CC log of the issue with an iGC of about 200ms and a low CC duration. That should make it easier for Olli to process it. https://www.dropbox.com/s/820qcp5vtactep4/cc_gc_log.zip
Olli, here an updated version of the logs with the XUL cache pref removed. https://www.dropbox.com/s/820qcp5vtactep4/cc_gc_log.zip
Henrik, which version of FF are you using?
Henrik, we think an addon might be involved. Could you try this again with a fresh profile? Also, can you try to get minimal steps to reproduce this problem? I tried it myself, but I'm not getting the same kind of objects that are showing up in your CC log.
We're seeing some weird objects, and we're not sure where they came from. Our theory is that they are weak map related, because we don't log weak maps very well right now.
Right now this log was made with Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:17.0) Gecko/17.0 Firefox/17.0 ID:20120930042009. I can try a fresh profile with only MemChaser installed.
Actually, could you turn off MemChaser as well? I'm afraid it might affect things, since it uses Jetpack. Could you just use the CC times in the Error Console instead?
Here are the logs with a fresh profile: https://www.dropbox.com/s/k5nu8ygv4w61yij/gg_cc_log_memchaser.zip https://www.dropbox.com/s/uikivvn8ive8yxn/gg_cc_log_plain.zip Without MemChaser I'm not able to reproduce this issue.
I haven't managed to reproduce this using a trunk build. The CC graph stays really small (<500) almost all the time; only the automatic page reload increases it temporarily, and CC times are between 6ms and 46ms.
> Without memchaser I'm not able to reproduce this issue.

Updating the bug metadata to reflect that this is caused by MemChaser. This would be a significant issue if the problem turned out to be Jetpack and not simply MemChaser.
Assignee: english-us → nobody
Component: English US → Add-ons
Keywords: qawanted
Summary: Ultra long GC durations with active Jenkins web pages opened → MemChaser (perhaps simply Jetpack?) causes ultra long GC durations with active Jenkins web pages opened
Whiteboard: [MemShrink]
Kris says that MemChaser is using version 1.9 of the add-on SDK, which is pretty recent. So yeah, the real question here is whether this is an add-on SDK problem or something MemChaser-specific.
Whiteboard: [MemShrink] → [MemShrink:P2]
Assignee: nobody → kmaglione+bmo
If what we're worried about is memory usage after http://10.250.73.243:8080/ is closed, it looks like it's mostly heap-unclassified: https://gist.github.com/3830049 I'm getting about 60MB growth in heap-unclassified with MemChaser installed, and a bit of growth elsewhere, as opposed to nothing without it installed. I'll try stripping down MemChaser to narrow down the cause when I get a chance.
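(A rough sketch of how two saved about:memory reports — e.g. with and without the add-on — can be diffed to see which paths grew. This assumes the reports have been decompressed to plain JSON with a top-level "reports" array of { process, path, amount } entries; the exact format may vary between Firefox versions, so treat this as a starting point only.)

// diff-memory-reports.ts -- compare two about:memory JSON dumps (assumed format)
import * as fs from "fs";

type MemReport = { process: string; path: string; amount: number };

function loadReport(file: string): Map<string, number> {
  // Sum amounts per path across processes.
  const data = JSON.parse(fs.readFileSync(file, "utf8"));
  const byPath = new Map<string, number>();
  for (const r of data.reports as MemReport[]) {
    byPath.set(r.path, (byPath.get(r.path) ?? 0) + r.amount);
  }
  return byPath;
}

const before = loadReport(process.argv[2]);  // e.g. without the add-on
const after = loadReport(process.argv[3]);   // e.g. with the add-on

const deltas = [...after.entries()]
  .map(([path, amount]) => [path, amount - (before.get(path) ?? 0)] as const)
  .filter(([, delta]) => delta > 1024 * 1024)          // only growth above 1 MiB
  .sort((a, b) => b[1] - a[1]);

for (const [path, delta] of deltas) {
  console.log(`${(delta / (1024 * 1024)).toFixed(1)} MiB  ${path}`);
}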
This might be similar to bug 740162, where I also had massive amounts of memory being used by Jenkins. I didn't have any add-ons, though.
I have uploaded a 0.5pre XPI of MemChaser which I will test over the next few days to see if it changes the behavior here: https://github.com/downloads/mozilla/memchaser/memchaser-0.5pre-201211061117.xpi
Depends on: 833783
Attached file about:memory?verbose (obsolete) (deleted) —
Looks like we have a lot of orphaned nodes for both Jenkins instances:
* 97,530,664 B (20.59%) ── orphan-nodes
* 110,193,800 B (23.27%) ── orphan-nodes
I have to say this is with a profile with all extensions disabled.
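(For context: "orphan nodes" are DOM nodes that have been removed from the document but are still reachable from live JS, so neither GC nor CC can free them. A hypothetical sketch — not taken from Jenkins — of the kind of auto-refresh pattern that produces them; the element id and endpoint name are made up for illustration.)

// Hypothetical page script: an auto-refresh loop that produces orphan nodes
// by keeping references to elements it has removed from the document.
const seenRows: HTMLElement[] = [];           // long-lived structure in page JS

async function refreshBuildList(): Promise<void> {
  const list = document.getElementById("buildQueue");   // made-up element id
  if (!list) return;

  // The leak: the old rows are detached from the document below, but this
  // array still references them, so JS keeps them (and their subtrees and
  // listeners) alive. about:memory reports such nodes as "orphan-nodes".
  for (const row of Array.from(list.children)) {
    seenRows.push(row as HTMLElement);
  }

  const html = await (await fetch("/ajaxBuildQueue")).text();  // assumed endpoint
  list.innerHTML = html;   // detaches the old rows without releasing them
}

setInterval(refreshBuildList, 5000);          // "ENABLE AUTO REFRESH"-style polling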
Attached file cc-edges.log (obsolete) (deleted) —
cc log from the same state as the about:memory dump.
So this is indeed the Jenkins CI page which is leaking, or at least keeping orphaned elements around. Closing the tab and minimizing the memory a couple of times restores the previous memory usage. You can find the latest GC/CC logs here: http://people.mozilla.com/~hskupin/downloads/

For everyone interested, here is the dump of the conversation between Olli and myself on IRC:

14:29 smaug: whimboo: any chance you could create new GC/CC logs? hopefully immediately when you start to see the leak, so that the logs don't get huge
14:29 smaug will be offline soon, but could perhaps analyze the logs tomorrow
14:30 whimboo: smaug: would your addon help here
14:30 whimboo: ?
14:31 smaug: whimboo: if you know what to look at, then yes
14:31 smaug: you could check if documents are leaded
14:31 smaug: whimboo: it is possible that gc log is need too, but my tool is for cc log only
14:31 whimboo: k
14:31 whimboo: i will check right now
14:31 smaug: my local cc analyzing scripts do just the same things as about:cc
14:32 whimboo: smaug: do you want to put this xpi up somewhere?
14:32 whimboo: i always have trouble to find it
14:34 smaug: whimboo: well, I was going to put it to AMO but that was too difficult
14:34 smaug: perhaps I should get it to tree
14:34 smaug: make it a bit prettier first
14:34 whimboo: that would be great
14:34 smaug: it is very tiny addon
14:35 whimboo: smaug: k. so i will let you have a look at
14:35 whimboo: i will produce the cc now
14:38 whimboo: smaug: cc is about 1MB. that's ok?
14:38 whimboo: gc 314MB
14:38 smaug: 1MB is ok
14:38 smaug: 314GC, huh
14:39 smaug: that is less than tiny
14:39 smaug: er, "less than tiny" doesn't make sense
14:40 whimboo: heh. so i will attach the CC log to the bug
14:41 whimboo: smaug: log is up
14:42 whimboo: smaug: so good thing is with Nightly: 82.10 MB (13.19%) ── heap-unclassified
14:42 whimboo: that's way lesser as with aurora
14:45 smaug: whimboo: that is not too big cc graph
14:45 smaug: 8000 objects
14:45 smaug: sure, too big, but still...
14:45 smaug: but ok, looking
14:45 whimboo: sweet
14:45 whimboo: if it's too large I think i can give you a smaller one later
14:46 whimboo: given that I know how to produce it
14:46 smaug: so we're leaking all the XBL document
14:46 smaug: whimboo: do you have XUL cache disabled?
14:46 smaug: whimboo: it is small enough
14:47 whimboo: ups. that was a copy of an old profile
14:47 whimboo: yes it was disabled
14:47 whimboo: but should we leak in such a case?
14:47 smaug: whimboo: disabling XUL cache is not supported setup
14:47 smaug: yes, we don't care about that case
14:48 smaug: since everything behaves differently anyway
14:48 smaug: it is development only setup
14:48 whimboo: ok, so let me restart then
14:48 smaug: whimboo: we're talking about nglayout.debug.disable_xul_cache here, right :)
14:49 whimboo: i know :)
14:50 whimboo: smaug: looks like it's starting again
14:50 whimboo: 2.03 MB (01.32%) ── orphan-nodes
14:51 whimboo: smaug: shall i wait until it's a big higher?
14:52 smaug: whimboo: you could wait a bit...
14:53 whimboo: smaug: k, how long would you prefer? until it reaches about 10MB or so?
14:54 smaug: whimboo: do you have about:cc installed?
14:54 whimboo: smaug: yes
14:54 whimboo: but not sure what to do with it
14:54 smaug: could you run cc and see the size of the graph
14:55 smaug: graph size ~10000 should be enough
14:55 smaug: whimboo: run CC few times to get stable results
14:55 whimboo: smaug: where do i see that?
14:55 smaug: whimboo: in the addon
14:55 whimboo: smaug: sure.. number of objects?
14:56 whimboo: or roots, edges
14:56 smaug: whimboo: number of objects
14:56 whimboo: k, at 353 right now
14:56 smaug: whimboo: I'd expect something like 10000
14:57 smaug: though, the previous log had 8000
14:57 smaug: whimboo: if you get it to 5000
14:57 whimboo: smaug: orphaned nodes are already at 15MB
14:57 smaug: sure
14:57 smaug: JS keeping nodes alive
14:57 whimboo: so you still want to have the 10.000 limit?
14:57 smaug: but the js is black (certainly alive) so those nodes can't be released
14:58 smaug: whimboo: 5000 could be enough
14:58 whimboo: k lets wait when I reach that
14:59 whimboo: smaug: so when I run cc it always drops down to about 200 again
14:59 whimboo: seems I can't get over 500
14:59 smaug: whimboo: ahaa
15:00 smaug: whimboo: so do you actually get leak
15:00 smaug: I mean if you close the tab
15:00 smaug: now that xul cache is enabled
15:01 whimboo: smaug: shall I close the tab and run cc?
15:01 smaug: whimboo: yes, and check what about:memory says
15:02 whimboo: smaug: hm, i think the addon initializes wrongly
15:02 whimboo: when I tried to run 'Run cycle collector' it did nothing
15:02 whimboo: now after closing the tab I get 220000 objects
15:03 smaug: that is very possibe
15:03 smaug: whimboo: run cc few times
15:03 whimboo: oh, right
15:03 whimboo: yes. it cleans up everything
15:03 whimboo: back at 200 objects
15:03 smaug: so it was garbage that we collected
15:03 whimboo: and about:memory doesn't list the page anymore
15:04 whimboo: that means it's really the website which is leaking memory
15:04 smaug: and memory usage is back to normal levels?
15:04 mdas has joined (mdas@50AE257C.D30B51A1.412CF160.IP)
15:04 whimboo: just a sec. will repeat iwth the other tab
15:04 smaug: well, "leaking", but yes, the website keeps objects alive
15:05 whimboo: smaug: err, now I'm at 1GB
15:06 whimboo: is that because of your addon?
15:06 whimboo: seems like
15:06 whimboo: 391.25 MB (61.12%) -- top(about:cc, id=28)/active/window(about:cc)
15:07 whimboo: smaug: wohoo. back to normal.. 240MB
15:07 smaug: whimboo: well, about:cc may use a lot of memory if the graph is huge
15:08 whimboo: smaug: so if I can help to analyze which js is holding that please tell me the steps
15:08 smaug: thinking...
15:09 smaug: whimboo: perhaps the best is to get cc and gc logs
15:09 smaug: before they are too huge
15:09 smaug: especially gc log is needed
15:09 smaug: since it is probably page keeping some js alive which ends up keeping nodes alive ...
15:10 whimboo: ok, will give those in a bit
15:11 whimboo: smaug: is 50MB for GC fine?
15:12 smaug: whimboo: uh, that is quite big one
15:12 whimboo: gna. let me restart then
15:13 smaug: whimboo: so, I need to go soon, but could you record this conversation to the bug, so that it is clear that it is the page which is leaking
15:13 whimboo: smaug: i cna't get it under 40MB
15:14 smaug: whimboo: put it to some server and I'll download it from there?
15:14 whimboo: sure
15:14 whimboo: thanks
15:14 whimboo: will comemnt in the bug
15:14 smaug: and put cc log too
15:14 smaug: thanks
15:14 whimboo: sure
Component: Add-ons → Other
Summary: MemChaser (perhaps simply Jetpack?) causes ultra long GC durations with active Jenkins web pages opened → Jenkins CI web page causes a lot of leaks when jobs are running
Comment on attachment 706361 [details] cc-edges.log This was collected with XUL cache disabled.
Attachment #706361 - Attachment is obsolete: true
Updated URLs for our own pages:
http://10.250.73.243:8080/
http://mm-ci-master.qa.scl3.mozilla.com:8080/

Open one of them and wait until jobs are running. The number of orphaned nodes will increase constantly and reach 10MB after a couple of minutes.
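(For anyone without access to these instances, a hypothetical helper for a local reproduction attempt: queue a batch of builds against a local Jenkins via its standard POST /job/<name>/build endpoint while the dashboard tab is open. The job name and base URL are made up, and it is assumed no authentication or CSRF crumb is required; newer Jenkins versions may need both. Requires a runtime with a global fetch, e.g. Node 18+.)

// Hypothetical reproduction helper: queue builds so that new job requests
// show up while the Jenkins dashboard tab is open in the browser.
const JENKINS_URL = "http://localhost:8080";
const JOB_NAME = "leak-test-job";             // made-up job name

async function queueBuilds(count: number): Promise<void> {
  for (let i = 0; i < count; i++) {
    // Jenkins' standard build trigger endpoint: POST /job/<name>/build
    const res = await fetch(`${JENKINS_URL}/job/${JOB_NAME}/build`, { method: "POST" });
    console.log(`queued build ${i + 1}/${count}: HTTP ${res.status}`);
    await new Promise((resolve) => setTimeout(resolve, 1000));  // pace the requests
  }
}

queueBuilds(100).catch(console.error);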
No longer depends on: 833783
Depends on: 600307
No longer depends on: 751529
Not sure why I haven't commented on this bug with the location of the CC and GC logs I have uploaded to my people account. Olli, whenever you have time please have a look at those.
http://people.mozilla.com/~hskupin/downloads/cc-edges.88911.log
http://people.mozilla.com/~hskupin/downloads/gc-edges.88911.log
Requesting info-needed from Olli for my latest cc/gc logs for his analysis. I hope you will have a bit of time soon. Thanks.
Flags: needinfo?(bugs)
Olli, can you find a bit of time to have a look at this?
I got an old version of Firefox (15.0) to crash in js::gc::MarkObjectUnbarried after adding hundreds of job requests to the Jenkins instance. I have now updated Firefox to the latest release and will check again after Easter.
Seems like Kris isn't working on this. Back into the pool.
Assignee: kmaglione+bmo → other
I got the crash again today with the latest release of Firefox: bp-6e5c9d5a-08dd-4a6b-95ed-6c6182130404. As the crash info shows, Firefox crashes due to an OOM. Is there something we can do to prevent this from happening?
Crash Signature: [@ mozalloc_abort(char const*) | Abort ]
Keywords: mlk
Summary: Jenkins CI web page causes a lot of leaks when jobs are running → Adding new jobs in a Jenkins CI causes a lot of leaks and ends up in an OOM [@ mozalloc_abort(char const*) | Abort ]
Sorry about not dealing with this. Henrik, could you re-test and report whether there are any changes.
Flags: needinfo?(bugs)
Yes, this is still an issue with Aurora. I know that information about Nightly would be better, but I can't deliver that today. If you need it, I can check tomorrow. Otherwise I would really appreciate it if you could test it yourself.
Flags: needinfo?(bugs)
I left our Jenkins page open for half a day with the latest Aurora (26.0a2), and now I have a memory usage of 1.6GB, of which the referenced web page alone uses 650MB! 480MB of that is in orphaned nodes which cannot be released even via GC. I would love it if we could get some tracking on this bug. It has been waiting for months now. :(

1,004.78 MB (63.90%) -- window-objects
│  ├────642.13 MB (40.83%) -- top(http://mm-ci-master.qa.scl3.mozilla.com:8080/, id=83)
│  │  ├──637.28 MB (40.53%) -- active
│  │  │  ├──636.97 MB (40.51%) -- window(http://mm-ci-master.qa.scl3.mozilla.com:8080/)
│  │  │  │  ├──483.43 MB (30.74%) -- dom
│  │  │  │  │  ├──482.45 MB (30.68%) ── orphan-nodes
│  │  │  │  │  └────0.98 MB (00.06%) ++ (4 tiny)
│  │  │  │  ├──145.40 MB (09.25%) -- js-compartment(http://mm-ci-master.qa.scl3.mozilla.com:8080/)
│  │  │  │  │  ├──130.26 MB (08.28%) -- gc-heap
│  │  │  │  │  │  ├──129.73 MB (08.25%) -- objects
│  │  │  │  │  │  │  ├───79.33 MB (05.04%) ── ordinary
│  │  │  │  │  │  │  ├───34.48 MB (02.19%) ── function
│  │  │  │  │  │  │  └───15.92 MB (01.01%) ── dense-array
│  │  │  │  │  │  └────0.54 MB (00.03%) ++ (3 tiny)
│  │  │  │  │  └───15.13 MB (00.96%) ++ (5 tiny)
│  │  │  │  └────8.15 MB (00.52%) ++ (3 tiny)
│  │  │  └────0.31 MB (00.02%) ++ window(about:blank)
│  │  └────4.85 MB (00.31%) ++ js-zone(0x7fbb3c6f4800)
Yeah, I know. I or someone should find time to find the memory leak in Jenkins. (Of course it would be nice if the Jenkins devs would do it.)
Flags: needinfo?(bugs)
Flags: needinfo?(bugs)
Henrik, have you tried contacting the Jenkins developers? This pretty clearly is a leak in their page.
Sure, I even talked with the lead developer. But sadly they never followed up on it, and on IRC I didn't receive helpful answers. But let's wait here for a moment. We are currently upgrading our local Jenkins instance to the latest LTS (see https://github.com/mozilla/mozmill-ci/pull/260); nearly a year has passed since we did the last upgrade. Let's see if the new version is better. I will report back in the next couple of days.
So we updated our production environment to the latest Jenkins LTS version today, but the issue still persists. With that in mind I will try to reproduce this with a local Jenkins installation with plugins disabled; one of those might also be causing this problem.
Not sure where all this unclassified memory comes from, but I suspect it is from this bug. Has anything regressed here in terms of collecting orphaned nodes? I'm not sure what I should do to get more details here.

2,202.88 MB (100.0%) -- explicit
├──1,035.84 MB (47.02%) ── heap-unclassified
Orphaned nodes held alive by JS should be tracked in about:memory now.
Andrew, what do you mean by "now"? Which version of Firefox? I'm currently running 28.0a2, and the memory situation is still pretty bad, as shown above.
I'm suggesting that the heap-unclassified probably isn't orphan nodes, but it could be. The only way to figure out what heap-unclassified consists of is to run a DMD build. https://wiki.mozilla.org/Performance/MemShrink/DMD#Build
And orphan nodes get reported -- look for "orphan-nodes" in about:memory. It's possible that some are being missed, but in general it's not immediately obvious that high heap-unclassified implies orphan nodes. As mccr8 says, DMD is the way to analyze this.
Blocks: 791695
We need some DMD analysis here, I think. But I could take a look at fresh cc logs - both "concise" and "verbose" (when created using about:memory). If possible, create logs with a clean profile (no addons).
Flags: needinfo?(hskupin)
Someone who can reproduce this needs to do some DMD analysis here.
Flags: needinfo?(bugs)
Early last week we updated our Jenkins installation to the most recent LTS version. I can now have a look at whether anything has changed. So I set up a new profile with only MemChaser installed. I will watch the memory usage throughout the day and report back with logs as requested. I will also try to get a DMD build working.
Flags: needinfo?(hskupin)
Attached file memchaser.log (deleted) —
Ok, so this is still a problem with the latest Jenkins. Memory usage of a fresh profile with only MemChaser installed went up from about 150MB to 1900MB. Firefox was barely responsive at that point. So for now I have the MemChaser logs. I will provide the other requested information once Firefox is again using about 1GB of memory.
Attachment #666566 - Attachment is obsolete: true
Attachment #706358 - Attachment is obsolete: true
(In reply to Olli Pettay [:smaug] from comment #56)
> But I could take a look at fresh cc logs - both "concise" and "verbose"
> (when created using about:memory).

Ok, so I saved a concise log: https://www.dropbox.com/s/2o3xhyttkit5wcw/concise.zip

Sadly all the memory was released right afterward, so for a verbose log I will have to wait again until the memory usage is at a specific level. As it looks, this really is a memory leak on that site.
Here is the verbose cc log: https://www.dropbox.com/s/7y9w1cpiq03kwf0/verbose.zip Olli, can you please have a look at it?
Flags: needinfo?(bugs)
Flags: needinfo?(bugs)
I still see this problem with current Nightly builds. The important fact now is that even when I close the tab, the memory doesn't go down but rises from about 2GB to 5GB! Then Firefox has 100% CPU load, and memory goes up and down, but the application is no longer usable. It completely hangs.
I have seen this as well.
Could you perhaps upload new GC/CC logs?
Product: Tech Evangelism → Tech Evangelism Graveyard
Crash Signature: [@ mozalloc_abort(char const*) | Abort ] → [@ mozalloc_abort(char const*) | Abort ] [@ mozalloc_abort | Abort ]
We haven't made any progress here, so I doubt anything will happen in the future. We are also most likely going to stop using Jenkins within the next 3 months, so after that we won't be able to reproduce this behavior anymore; I'm not even sure we can still reproduce it now. So I'm closing this bug as incomplete for now.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INCOMPLETE