Closed Bug 756549
Opened 13 years ago
Closed 13 years ago
Sync updates cause explosive memory usage leading to OOM
Categories: (Firefox :: Sync, defect)
Status: RESOLVED WORKSFORME
People: (Reporter: gcp, Assigned: till)
References: (Blocks 1 open bug)
Keywords: regression
Whiteboard: [MemShrink][js:p0]
Situation: I came back after a workweek in London where I used Android Nightlies with Sync extensively, and had a desktop Firefox Beta on the laptop I used.
On my desktop machine at home, which has Desktop Nightlies, Firefox started crashing every 2 minutes. When it didn't crash, I got a Sync error, even though status.mozilla.com claims Sync is working fine.
Looking at the Error Console, I see the following errors:
Timestamp: 18/05/2012 20:58:53
Error: out of memory
Source File: resource://services-sync/record.js
Line: 638
Timestamp: 18/05/2012 20:58:53
Error: [Exception... "NS_ERROR_XPC_JAVASCRIPT_ERROR_WITH_DETAILS" nsresult: "0x80570021 (NS_ERROR_XPC_JAVASCRIPT_ERROR_WITH_DETAILS)" location: "JS frame :: resource:///modules/services-sync/resource.js :: Channel_onStopRequest :: line 528" data: no]
Source File: resource:///modules/services-sync/resource.js
Line: 528
I did a test, watching memory usage in the Windows Task Manager, which was stable at 400MB. When I click "Retry Sync" (or whatever the button is called when you get the sync-failed message), memory usage immediately increases very rapidly, after which Firefox crashes due to OOM.
Comment 1•13 years ago
Asa also observed this:
12:37:22 < Asa> rnewman: I'm OOMing repeatedly and that was after a failed sync connection. I think it's sync killing Firefox. What can I do? (nightly, windows 7)
gps is attempting to repro.
Comment 2•13 years ago
Possible dupe of Bug 755641. This was reported on Reddit around the 16th.
Comment 3•13 years ago
I can repro on OS X using official Nightly builds.
From GC logs:
GC(T+626.7) Total Time: 21.4ms, Compartments Collected: 191, Total Compartments: 243, MMU (20ms): 0%, MMU (50ms): 57%, Reason: SET_NEW_DOCUMENT, Nonincremental Reason: GC mode, Allocated: 27MB, +Chunks: 0, -Chunks: 0
Totals: Purge: 0.1ms, Mark: 11.3ms, Mark Roots: 1.9ms, Mark Other: 0.5ms, Sweep: 9.6ms, Sweep Compartments: 5.8ms, Sweep Object: 0.9ms, Sweep String: 0.1ms, Sweep Script: 0.2ms, Sweep Shape: 1.2ms, Discard Code: 0.3ms, Discard Analysis: 3.3ms, Sweep Types: 0.2ms, Clear Script Analysis: 0.2ms, Finalize End Callback: 0.7ms, End Callback: 0.3ms
GC(T+631.4) Total Time: 97.4ms, Compartments Collected: 252, Total Compartments: 305, MMU (20ms): 0%, MMU (50ms): 0%, Reason: SET_NEW_DOCUMENT, Nonincremental Reason: GC mode, Allocated: 32MB, +Chunks: 0, -Chunks: 0
Totals: Purge: 0.1ms, Mark: 13.0ms, Mark Roots: 1.8ms, Mark Other: 0.5ms, Finalize Start Callback: 0.2ms, Sweep: 83.5ms, Sweep Compartments: 6.5ms, Sweep Object: 1.1ms, Sweep Script: 0.2ms, Sweep Shape: 1.4ms, Discard Code: 0.4ms, Discard Analysis: 3.9ms, Sweep Types: 0.3ms, Clear Script Analysis: 0.3ms, Finalize End Callback: 0.8ms, End Callback: 0.6ms
GC(T+647.9) Total Time: 27.0ms, Compartments Collected: 269, Total Compartments: 323, MMU (20ms): 0%, MMU (50ms): 45%, Reason: CC_WAITING, Nonincremental Reason: GC mode, Allocated: 34MB, +Chunks: 2, -Chunks: 0
Totals: Mark: 14.9ms, Mark Roots: 1.6ms, Mark Other: 0.8ms, Finalize Start Callback: 0.1ms, Sweep: 11.0ms, Sweep Compartments: 6.3ms, Sweep Object: 1.3ms, Sweep String: 0.1ms, Sweep Script: 0.2ms, Sweep Shape: 1.5ms, Discard Code: 0.4ms, Discard Analysis: 3.6ms, Sweep Types: 0.3ms, Clear Script Analysis: 0.3ms, Finalize End Callback: 0.9ms, End Callback: 0.9ms
The number of collected compartments is rather alarming, I think. Perhaps we regressed in compartments per global?
OS: Windows 7 → All
Comment 5•13 years ago
On the Nightly from May 3, the number of compartments collected during Sync never goes above 62, and total memory reported by MemChaser stays in the ~250MB range. My MBP's fan kicks in, so the computer is working pretty hard.
On the Nightly from May 6, there are more compartments, but the memory profile remains about the same.
Using the Nightly from May 12, memory goes crazy.
I've also isolated the memory spike to the history engine. Did anything change in Places recently?
Comment 6•13 years ago
That number of compartments is totally "normal" with CPG and a tab or three open.
Comment 7•13 years ago
This is at a low point during sync. string-chars jumped to over 1GB at one point. This is with the May 12 Nightly.
1,047,332,184 B (100.0%) -- explicit
├────871,282,408 B (83.19%) -- js
│ ├──590,434,392 B (56.38%) -- compartment([System Principal])
│ │ ├──476,512,856 B (45.50%) ── string-chars [130]
│ │ ├───80,068,608 B (07.65%) -- gc-heap
│ │ │ ├──42,122,064 B (04.02%) -- objects
│ │ │ │ ├──35,929,072 B (03.43%) ── non-function [181]
│ │ │ │ └───6,192,992 B (00.59%) ── function [177]
│ │ │ ├──14,060,512 B (01.34%) ── strings [130]
│ │ │ ├──12,035,880 B (01.15%) -- arena
│ │ │ │ ├──10,829,072 B (01.03%) ── unused [181]
│ │ │ │ ├─────625,536 B (00.06%) ── headers [181]
│ │ │ │ └─────581,272 B (00.06%) ── padding [181]
│ │ │ ├───8,708,200 B (00.83%) -- shapes
│ │ │ │ ├──4,458,560 B (00.43%) ── tree [181]
│ │ │ │ ├──2,361,880 B (00.23%) ── dict [177]
│ │ │ │ └──1,887,760 B (00.18%) ── base [181]
│ │ │ ├───2,409,760 B (00.23%) ── scripts [177]
│ │ │ ├─────731,232 B (00.07%) ── type-objects [181]
│ │ │ └─────────960 B (00.00%) ── xml [6]
│ │ ├────9,409,344 B (00.90%) -- shapes-extra
│ │ │ ├──6,874,112 B (00.66%) ── compartment-tables [182]
│ │ │ ├──1,019,520 B (00.10%) ── tree-tables [117]
│ │ │ ├────968,448 B (00.09%) ── dict-tables [177]
│ │ │ └────547,264 B (00.05%) ── tree-shape-kids [177]
│ │ ├────9,289,568 B (00.89%) ── analysis-temporary [56]
│ │ ├────9,025,872 B (00.86%) -- objects
│ │ │ ├──5,462,912 B (00.52%) ── elements [134]
│ │ │ ├──3,535,552 B (00.34%) ── slots [181]
│ │ │ └─────27,408 B (00.00%) ── misc [181]
│ │ ├────5,340,096 B (00.51%) ── script-data [177]
│ │ ├──────598,800 B (00.06%) -- type-inference
│ │ │ ├──523,408 B (00.05%) ── script-main [125]
│ │ │ ├───49,888 B (00.00%) ── tables [13]
│ │ │ └───25,504 B (00.00%) ── object-main [12]
│ │ └──────189,248 B (00.02%) -- mjit
│ │ └──189,248 B (00.02%) ── data [5]
│ ├──218,189,824 B (20.83%) ── gc-heap-chunk-dirty-unused
│ ├───26,214,400 B (02.50%) ── gc-heap-chunk-clean-unused
│ ├────7,385,016 B (00.71%) ── xpconnect
│ ├────7,018,592 B (00.67%) -- runtime
│ │ ├──4,194,304 B (00.40%) ── stack-committed
│ │ ├──1,048,576 B (00.10%) ── atoms-table
│ │ ├────850,016 B (00.08%) ── mjit-code
│ │ ├────323,584 B (00.03%) ── temporary
│ │ ├────262,144 B (00.03%) ── gc-marker
│ │ ├────217,088 B (00.02%) ── runtime-object
│ │ ├─────54,952 B (00.01%) ── unused-code-memory
│ │ ├─────51,296 B (00.00%) ── contexts
│ │ ├─────12,536 B (00.00%) ── regexp-code
│ │ └──────4,096 B (00.00%) ── normal
Updated•13 years ago
Keywords: regressionwindow-wanted
Comment 8•13 years ago
I did not experience this problem until today. I had a failed sync warning bar and then I started crashing regularly. If this was in earlier builds it wasn't affecting me until today. Since about noon I've had over 20 crashes from it.
Comment 9•13 years ago
Bug 754771 would probably help a lot with debugging this, because then you could see which system compartment we're talking about here.
Updated•13 years ago
Whiteboard: [memshrink]
Comment 10•13 years ago
With patches from bug 754771 applied, I've isolated the modules causing the explosion:
1,329,651,304 B (100.0%) -- explicit
├──1,230,743,744 B (92.56%) -- js
│ ├────555,525,832 B (41.78%) -- compartment([System Principal], resource://services-sync/record.js)
│ │ ├──553,818,320 B (41.65%) ── string-chars
│ ├────554,112,096 B (41.67%) -- compartment([System Principal], resource://services-sync/resource.js)
│ │ ├──553,733,776 B (41.65%) ── string-chars
record.js contains the logic for dealing with Sync records. It takes raw data objects, does crypto, creates real record objects, then serializes them to JSON. It then passes them over a wall to resource.js for HTTP request foo. Incoming records work the opposite way.
In this process, there is lots of overhead. The crypto uses js-ctypes (WeaveCrypto) to talk to NSS. There are lots of temporary JS strings and objects being created. We plan to replace this with a single C++ function in the next major Sync release. For now, we have to live with it.
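To make that overhead concrete, here is a rough, hypothetical sketch of the hand-off just described. The names (encryptPayload, prepareOutgoingRecord, uploadRecord) and the URL are stand-ins, not the real Sync APIs; the point is how many temporary objects and strings get created per record, and that each hand-off crosses a module (and thus compartment) boundary.

// Hypothetical sketch only -- not the actual record.js/resource.js code.
// record.js side: raw data -> crypto -> record object -> JSON string.
function encryptPayload(cleartext) {
  // The real code calls into NSS via js-ctypes (WeaveCrypto);
  // modeled here as a plain JSON round-trip.
  return JSON.stringify(cleartext); // temporary string
}
function prepareOutgoingRecord(rawData) {
  const record = { id: rawData.id, payload: encryptPayload(rawData) }; // record object
  return JSON.stringify(record); // another temporary string, handed to resource.js
}
// resource.js side: receives the serialized string and does the HTTP work.
function uploadRecord(uri, body) {
  // The real code issues the request via an HTTP channel; just log here.
  console.log("PUT " + uri + " (" + body.length + " chars)");
}
uploadRecord("https://sync.example.com/storage/history/abc",
             prepareOutgoingRecord({ id: "abc", histUri: "http://example.org/" }));

Incoming records run the same pipeline in reverse, so the same temporaries appear on download as well.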
I'm surprised that resource.js has as many string-chars bytes as record.js. I would expect most of the memory allocation to be in record.js and resource.js would "inherit" the subset/final product. Of course, I'm naive as to how the JS internals work.
Is there any way for me to debug what exactly is allocating these string-chars?
Comment 11•13 years ago
Interestingly, outgoing-only changes (i.e. a first sync upload) aren't too bad. The problem appears to occur only when processing incoming data.
And history sync seems to be the bit that triggers it. It is unknown whether it is something specific to history sync or whether it is proportional to the number of processed records (history typically has the most).
Comment 12•13 years ago
> Is there any way for me to debug what exactly is allocating these string-chars?
Maybe Andrew has a suggestion?
My only suggestion is to look at how data is passed between record.js and resource.js. The amount allocated is almost identical between the two compartments, which suggests to me that somehow the data is being passed from one compartment to the other. Perhaps that narrows things down?
Updated•13 years ago
Whiteboard: [memshrink] → [MemShrink]
Comment 13•13 years ago
Sorry, I have no idea how JS engine string allocation works. Can you use some sort of trace-malloc thing to get stack traces from where the allocations are happening? Of course, if it is all just JS code it won't be very useful.
Comment 14•13 years ago
Regression is somewhere between 93302:835d5fab89dc and 93609:6fe7dd2f8f57, inclusive. That's May 7 to May 10, roughly.
That range includes a services-central merge, which was 95% refactoring. I would be pretty surprised if anything in the merge caused this. Then again, the JS engine is at least half magic, so who knows.
I should have a full bisect within 24 hours.
Severity: normal → major
status-firefox15: --- → affected
Comment 15•13 years ago
That date range includes CPG, I believe.
Comment 16•13 years ago
Narrowed it down to between 93548:dd29535bac5f and 93609:6fe7dd2f8f57.
I'm going to take a break for a little. The house is still accepting bets on the commit.
Comment 17•13 years ago
/me puts his chips on http://hg.mozilla.org/mozilla-central/rev/e6529138e338
Comment 18•13 years ago
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #17)
> /me puts his chips on http://hg.mozilla.org/mozilla-central/rev/e6529138e338
That would have been my guess as well. However, the range has been reduced to 93543:b8f2a0ff6945 through 93563:075804813c58.
Comment 19•13 years ago
http://hg.mozilla.org/mozilla-central/rev/2f63829658ff is next on my betting list then.
Comment 20•13 years ago
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #19)
> http://hg.mozilla.org/mozilla-central/rev/2f63829658ff is next on my betting
> list then.
We have a winner!
Before bug 723350, Sync allocates a lot of memory but nothing gets out of hand. After bug 723350, total compartment size jumps 2-3x to the ~1GB range during the peak of history sync (on my test profile, anyway).
My casual observation is that GC seems to be running less frequently or less "completely" after bug 723350.
So, next steps. I'm not a release manager, but I'm pretty sure they wouldn't like things in the current state. Is backing out bug 723350 a reasonable request?
On a somewhat related topic: with CPG, if I have a 100MB JS string passed between modules, does that string get copied into the new compartment, or does the JS engine "copy by reference"? How would this be reflected in the per-compartment memory reporting from bug 754771?
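For illustration, the pattern being asked about looks roughly like the following sketch (purely hypothetical names; whether the engine shares or copies the characters when the string crosses the compartment boundary is exactly the open question, and this sketch does not answer it).

// Hypothetical illustration of the access pattern, not an answer.
function makeLargePayload(megabytes) {
  // SpiderMonkey stored string chars as 2-byte units at the time,
  // so ~N MB of string-chars is roughly N*1024*1024/2 characters.
  return "x".repeat((megabytes * 1024 * 1024) / 2);
}
// "record.js" produces the serialized batch...
const serialized = makeLargePayload(100);
// ...and "resource.js" (a different compartment under CPG) consumes it.
// If crossing the boundary copies the chars, both compartments report ~100MB.
function uploadViaResource(body) {
  return body.length;
}
console.log(uploadViaResource(serialized));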
Depends on: 723350
Keywords: regressionwindow-wanted
Updated•13 years ago
tracking-firefox15: --- → ?
Keywords: regression
Comment 21•13 years ago
Before bug 723350 we had both global and per-compartment malloc accounting. Now we have removed the global one and let the compartments do their own accounting.
From the API side, we only changed the updateMallocCounter logic.
I can think of two scenarios for why this happens:
1) updateMallocCounter gets called without a context.
updateMallocCounter has 2 arguments: bytes and optional JSContext.
In the old version we updated the runtime counter and cx->compartment if available.
Now we only update the counter if cx && cx->compartment.
2) CPG increased the number of compartments and the trigger logic is off.
Comment 7 shows that most memory is allocated in a single compartment, right? So back to 1).
Another thing I noticed is that we delete the mallocCounter for all compartments even if they are not GCed. This was not changed in bug 723350 but it might be part of the problem.
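To make scenario 1 concrete, here is a simplified JavaScript model of the accounting change. The real code is C++ inside the JS engine; the trigger threshold, field names, and structure below are made-up assumptions for illustration, not the actual implementation.

// Simplified model: malloc bytes reported without a context used to count
// toward a GC trigger; after the change they are dropped on the floor.
const GC_TRIGGER_BYTES = 128 * 1024 * 1024; // made-up threshold
// Old behaviour: always bump the runtime-wide counter, plus the compartment
// counter when a context (and therefore a compartment) is available.
function updateMallocCounterOld(rt, bytes, cx) {
  rt.mallocBytes += bytes;
  if (cx && cx.compartment) {
    cx.compartment.mallocBytes += bytes;
  }
  return rt.mallocBytes >= GC_TRIGGER_BYTES; // would schedule a GC
}
// New behaviour (as described above): only per-compartment accounting, so
// context-less allocations are never counted and never trigger a GC.
function updateMallocCounterNew(rt, bytes, cx) {
  if (cx && cx.compartment) {
    cx.compartment.mallocBytes += bytes;
    return cx.compartment.mallocBytes >= GC_TRIGGER_BYTES;
  }
  return false;
}
// Many context-less allocations: only the old logic ever reaches the trigger.
const rt = { mallocBytes: 0 };
let oldTriggers = false, newTriggers = false;
for (let i = 0; i < 2000; i++) {
  oldTriggers = updateMallocCounterOld(rt, 1024 * 1024, null) || oldTriggers;
  newTriggers = updateMallocCounterNew(rt, 1024 * 1024, null) || newTriggers;
}
console.log(oldTriggers, newTriggers); // true false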
Comment 22•13 years ago
> Comment 7 shows that most memory is allocated in a single compartment right?
No, it just looks that way because with the about:memory used in comment 7, we merge all allocations from system compartments together.
Comment 10 has the system compartments split out, and we can see that the memory lives in two compartments. Indeed, the value in the two compartments is almost identical, which suggests there may be some copying going on here.
Comment 23•13 years ago (Assignee)
I'll look at the scenarios gwagner outlined in comment 21 and will update as soon as I know more.
Assignee: nobody → tschneidereit+bmo
Comment 24•13 years ago (Assignee)
Seems like scenario 1 is the winner: starting the current Nightly with a fresh profile and a single about:blank tab causes over 2200 calls to updateMallocCounter without a context.
I'll next see if there's anything we can do about that. If that's not the case, backing out http://hg.mozilla.org/mozilla-central/rev/2f63829658ff might be the best option, I'm afraid.
Comment 25•13 years ago (Assignee)
Is there any chance of getting a solid set of STRs, or maybe even a profile that exhibits the explosive memory usage described here?
I'm trying to get to the bottom of the problem, but I can't really be sure that what I'm doing will help without STRs, I'm afraid.
Comment 26•13 years ago
(In reply to Till Schneidereit [:till] from comment #25)
> Is there any chance of getting a solid set of STRs, or maybe even a profile
> that exhibits the explosive memory usage described here?
>
> I'm trying to get to the bottom of the problem, but I can't really be sure
> that what I'm doing will help without STRs, I'm afraid.
gps should have STRs, but I speculate that you can hit the problem by syncing a large profile, perhaps choosing "Reset Sync > Replace the contents of this device" to ensure a large download.
QA should be able to provide a fat Sync account if you need one: tracy, tchung?
Comment 27•13 years ago
I've been using a version of our largest account this morning and have only hit this a couple of times. It's not reliably failing here. The account only has 1,000 or so bookmarks.
Comment 28•13 years ago
Just to make sure.. Could you go to about:memory and click "Minimize memory usage" and see if memory goes back to normal?
(In reply to Gregor Wagner [:gwagner] from comment #29)
> Just to make sure.. Could you go to about:memory and click "Minimize memory
> usage" and see if memory goes back to normal?
It wasn't leaking the memory for me when I was testing this the other day.
Comment 30•13 years ago
(In reply to Richard Newman [:rnewman] from comment #26)
> (In reply to Till Schneidereit [:till] from comment #25)
> > Is there any chance of getting a solid set of STRs, or maybe even a profile
> > that exhibits the explosive memory usage described here?
> >
> > I'm trying to get to the bottom of the problem, but I can't really be sure
> > that what I'm doing will help without STRs, I'm afraid.
The problem is most pronounced with history sync. I am able to see it when uploading a large amount of history records (like setting up Sync from your main profile where you do all your surfing) or when pulling down synced history records to a fresh profile (new sync). I did all the bisecting through download because I think it is slightly more pronounced. But, I could be wrong.
If you don't want to mess with your main profile, create a fresh profile, sync it with your existing Sync account, disconnect the device from Sync (preserving all data), then set up a new Sync account with a separate email (e.g. gps+bug756549@mozilla.com).
Let me see if I can code up a "stress test" that generates a bunch of "random" places data. I'm sure I will be reinventing the wheel here. If someone knows of existing code to save me the headache, please make a comment!
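For what it's worth, a rough sketch of such a generator might look like the following. It assumes the mozIAsyncHistory API that Places exposes as PlacesUtils.asyncHistory.updatePlaces(); the callback signatures and visit fields are approximations rather than tested code.

// Rough, untested sketch -- assumes PlacesUtils.asyncHistory.updatePlaces()
// accepts { uri, title, visits } place-info objects.
const { interfaces: Ci, utils: Cu } = Components;
Cu.import("resource://gre/modules/Services.jsm");
Cu.import("resource://gre/modules/PlacesUtils.jsm");
function generateFakeHistory(count) {
  const now = Date.now() * 1000; // visitDate is PRTime, in microseconds
  const places = [];
  for (let i = 0; i < count; i++) {
    places.push({
      uri: Services.io.newURI("http://example.com/page/" + i, null, null),
      title: "Fake page " + i,
      visits: [{
        visitDate: now - i * 60 * 1000000, // one visit per minute, going back
        transitionType: Ci.nsINavHistoryService.TRANSITION_LINK,
      }],
    });
  }
  return places;
}
// Insert enough places that the history engine has plenty of records to sync.
PlacesUtils.asyncHistory.updatePlaces(generateFakeHistory(50000), {
  handleError: function (resultCode, placeInfo) {},
  handleResult: function (placeInfo) {},
});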
Comment 31•13 years ago
Adding brains for Comment 31.
Updated•13 years ago
Whiteboard: [MemShrink] → [MemShrink][js:p0]
Can someone re-test this now that bug 723350 has been backed out?
Oops, didn't mean to unset the tracking flag.
Comment 34•13 years ago (Assignee)
I can confirm that backing out bug 723350 fixed the regression: setting up a new profile to sync with my main profile causes RSS to go up to ~3.3GB without the backout, whereas it never gets beyond ~650MB with the backout applied.
Unfortunately, my attempts at alleviating the regression haven't yet yielded much progress. Nevertheless: with bug 723350 backed out, this should be fixed and closeable.
Comment 35•13 years ago
I can also confirm that backing out bug 723350 fixed things. So, resolving WORKSFORME.
From comment 35, it appears that others are able to reproduce and thus can test changes. If you would like Sync code to generate data for testing, please ping me on bug 723350.
Comment 36•13 years ago (Assignee)
Thanks Gregory - both for confirming the fix and for the Sync code offer. You're right, though: I can reproduce memory usage exploding when syncing. The only reason I don't experience OOMs seems to be that I have enough memory on a 64bit system.
Updated•12 years ago
tracking-firefox15: --- → +
Updated•6 years ago
Component: Firefox Sync: Backend → Sync
Product: Cloud Services → Firefox