Closed Bug 756549
Opened 13 years ago
Closed 13 years ago
Sync updates cause explosive memory usage leading to OOM
Categories: (Firefox :: Sync, defect)
Status: RESOLVED WORKSFORME
People: (Reporter: gcp, Assigned: till)
References: (Blocks 1 open bug)
Keywords: regression
Whiteboard: [MemShrink][js:p0]
Situation: I came back after a workweek in London where I used Android Nightlies with Sync extensively, and had a desktop Firefox Beta on the laptop I used.
On my desktop machine at home, which has Desktop Nightlies, Firefox started crashing every 2 minutes. When it didn't crash, I got a Sync error, even though status.mozilla.com claims Sync is working fine.
Looking at the Error Console, I see the following errors:
Timestamp: 18/05/2012 20:58:53
Error: out of memory
Source File: resource://services-sync/record.js
Line: 638
Timestamp: 18/05/2012 20:58:53
Error: [Exception... "NS_ERROR_XPC_JAVASCRIPT_ERROR_WITH_DETAILS" nsresult: "0x80570021 (NS_ERROR_XPC_JAVASCRIPT_ERROR_WITH_DETAILS)" location: "JS frame :: resource:///modules/services-sync/resource.js :: Channel_onStopRequest :: line 528" data: no]
Source File: resource:///modules/services-sync/resource.js
Line: 528
I did a test, watching memory usage in the Windows Task Manager, which was stable at 400MB. When I click "Retry Sync" (or whatever the button is called when you get the sync-failed message), memory usage immediately increases very rapidly, after which Firefox crashes due to OOM.
Comment 1•13 years ago
Asa also observed this:
12:37:22 < Asa> rnewman: I'm OOMing repeatedly and that was after a failed sync connection. I think it's sync killing Firefox. What can I do? (nightly, windows 7)
gps is attempting to repro.
Comment 2•13 years ago
Possible dupe of Bug 755641. This was reported on Reddit around the 16th.
Comment 3•13 years ago
I can repro on OS X using official Nightly builds.
From GC logs:
GC(T+626.7) Total Time: 21.4ms, Compartments Collected: 191, Total Compartments: 243, MMU (20ms): 0%, MMU (50ms): 57%, Reason: SET_NEW_DOCUMENT, Nonincremental Reason: GC mode, Allocated: 27MB, +Chunks: 0, -Chunks: 0
Totals: Purge: 0.1ms, Mark: 11.3ms, Mark Roots: 1.9ms, Mark Other: 0.5ms, Sweep: 9.6ms, Sweep Compartments: 5.8ms, Sweep Object: 0.9ms, Sweep String: 0.1ms, Sweep Script: 0.2ms, Sweep Shape: 1.2ms, Discard Code: 0.3ms, Discard Analysis: 3.3ms, Sweep Types: 0.2ms, Clear Script Analysis: 0.2ms, Finalize End Callback: 0.7ms, End Callback: 0.3ms
GC(T+631.4) Total Time: 97.4ms, Compartments Collected: 252, Total Compartments: 305, MMU (20ms): 0%, MMU (50ms): 0%, Reason: SET_NEW_DOCUMENT, Nonincremental Reason: GC mode, Allocated: 32MB, +Chunks: 0, -Chunks: 0
Totals: Purge: 0.1ms, Mark: 13.0ms, Mark Roots: 1.8ms, Mark Other: 0.5ms, Finalize Start Callback: 0.2ms, Sweep: 83.5ms, Sweep Compartments: 6.5ms, Sweep Object: 1.1ms, Sweep Script: 0.2ms, Sweep Shape: 1.4ms, Discard Code: 0.4ms, Discard Analysis: 3.9ms, Sweep Types: 0.3ms, Clear Script Analysis: 0.3ms, Finalize End Callback: 0.8ms, End Callback: 0.6ms
GC(T+647.9) Total Time: 27.0ms, Compartments Collected: 269, Total Compartments: 323, MMU (20ms): 0%, MMU (50ms): 45%, Reason: CC_WAITING, Nonincremental Reason: GC mode, Allocated: 34MB, +Chunks: 2, -Chunks: 0
Totals: Mark: 14.9ms, Mark Roots: 1.6ms, Mark Other: 0.8ms, Finalize Start Callback: 0.1ms, Sweep: 11.0ms, Sweep Compartments: 6.3ms, Sweep Object: 1.3ms, Sweep String: 0.1ms, Sweep Script: 0.2ms, Sweep Shape: 1.5ms, Discard Code: 0.4ms, Discard Analysis: 3.6ms, Sweep Types: 0.3ms, Clear Script Analysis: 0.3ms, Finalize End Callback: 0.9ms, End Callback: 0.9ms
The number of collected compartments is rather alarming, I think. Perhaps we regressed in compartments per global?
OS: Windows 7 → All
Comment 5•13 years ago
On the Nightly from May 3, the number of compartments collected during Sync never goes above 62, and total memory reported by MemChaser stays in the ~250MB range. My MBP's fan kicks in, so the computer is working pretty hard.
On the Nightly from May 6, there are more compartments, but the memory profile remains about the same.
Using the Nightly from May 12, memory goes crazy.
I've also isolated the memory spike to the history engine. Did anything change in Places recently?
Comment 6•13 years ago
That number of compartments is totally "normal" with CPG and a tab or three open.
Comment 7•13 years ago
This is at a low point during sync. string-chars jumped to over 1GB at one point. This is with the May 12 Nightly.
1,047,332,184 B (100.0%) -- explicit
├────871,282,408 B (83.19%) -- js
│ ├──590,434,392 B (56.38%) -- compartment([System Principal])
│ │ ├──476,512,856 B (45.50%) ── string-chars [130]
│ │ ├───80,068,608 B (07.65%) -- gc-heap
│ │ │ ├──42,122,064 B (04.02%) -- objects
│ │ │ │ ├──35,929,072 B (03.43%) ── non-function [181]
│ │ │ │ └───6,192,992 B (00.59%) ── function [177]
│ │ │ ├──14,060,512 B (01.34%) ── strings [130]
│ │ │ ├──12,035,880 B (01.15%) -- arena
│ │ │ │ ├──10,829,072 B (01.03%) ── unused [181]
│ │ │ │ ├─────625,536 B (00.06%) ── headers [181]
│ │ │ │ └─────581,272 B (00.06%) ── padding [181]
│ │ │ ├───8,708,200 B (00.83%) -- shapes
│ │ │ │ ├──4,458,560 B (00.43%) ── tree [181]
│ │ │ │ ├──2,361,880 B (00.23%) ── dict [177]
│ │ │ │ └──1,887,760 B (00.18%) ── base [181]
│ │ │ ├───2,409,760 B (00.23%) ── scripts [177]
│ │ │ ├─────731,232 B (00.07%) ── type-objects [181]
│ │ │ └─────────960 B (00.00%) ── xml [6]
│ │ ├────9,409,344 B (00.90%) -- shapes-extra
│ │ │ ├──6,874,112 B (00.66%) ── compartment-tables [182]
│ │ │ ├──1,019,520 B (00.10%) ── tree-tables [117]
│ │ │ ├────968,448 B (00.09%) ── dict-tables [177]
│ │ │ └────547,264 B (00.05%) ── tree-shape-kids [177]
│ │ ├────9,289,568 B (00.89%) ── analysis-temporary [56]
│ │ ├────9,025,872 B (00.86%) -- objects
│ │ │ ├──5,462,912 B (00.52%) ── elements [134]
│ │ │ ├──3,535,552 B (00.34%) ── slots [181]
│ │ │ └─────27,408 B (00.00%) ── misc [181]
│ │ ├────5,340,096 B (00.51%) ── script-data [177]
│ │ ├──────598,800 B (00.06%) -- type-inference
│ │ │ ├──523,408 B (00.05%) ── script-main [125]
│ │ │ ├───49,888 B (00.00%) ── tables [13]
│ │ │ └───25,504 B (00.00%) ── object-main [12]
│ │ └──────189,248 B (00.02%) -- mjit
│ │ └──189,248 B (00.02%) ── data [5]
│ ├──218,189,824 B (20.83%) ── gc-heap-chunk-dirty-unused
│ ├───26,214,400 B (02.50%) ── gc-heap-chunk-clean-unused
│ ├────7,385,016 B (00.71%) ── xpconnect
│ ├────7,018,592 B (00.67%) -- runtime
│ │ ├──4,194,304 B (00.40%) ── stack-committed
│ │ ├──1,048,576 B (00.10%) ── atoms-table
│ │ ├────850,016 B (00.08%) ── mjit-code
│ │ ├────323,584 B (00.03%) ── temporary
│ │ ├────262,144 B (00.03%) ── gc-marker
│ │ ├────217,088 B (00.02%) ── runtime-object
│ │ ├─────54,952 B (00.01%) ── unused-code-memory
│ │ ├─────51,296 B (00.00%) ── contexts
│ │ ├─────12,536 B (00.00%) ── regexp-code
│ │ └──────4,096 B (00.00%) ── normal
Updated•13 years ago
Keywords: regressionwindow-wanted
Comment 8•13 years ago
I did not experience this problem until today. I had a failed sync warning bar and then I started crashing regularly. If this was in earlier builds it wasn't affecting me until today. Since about noon I've had over 20 crashes from it.
Comment 9•13 years ago
Bug 754771 would probably help a lot with debugging this, because then you could see which system compartment we're talking about here.
Updated•13 years ago
Whiteboard: [memshrink]
Comment 10•13 years ago
With patches from bug 754771 applied, I've isolated the modules causing the explosion:
1,329,651,304 B (100.0%) -- explicit
├──1,230,743,744 B (92.56%) -- js
│ ├────555,525,832 B (41.78%) -- compartment([System Principal], resource://services-sync/record.js)
│ │ ├──553,818,320 B (41.65%) ── string-chars
│ ├────554,112,096 B (41.67%) -- compartment([System Principal], resource://services-sync/resource.js)
│ │ ├──553,733,776 B (41.65%) ── string-chars
record.js contains the logic for dealing with Sync records. It takes raw data objects, does crypto, creates real record objects, then serializes them to JSON. It then passes them over a wall to resource.js for HTTP request foo. Incoming records work the opposite way.
In this process, there is lots of overhead. The crypto uses js-ctypes (WeaveCrypto) to talk to NSS. There are lots of temporary JS strings and objects being created. We plan to replace this with a single C++ function in the next major Sync release. For now, we have to live with it.
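To make that overhead concrete, here is a rough, hypothetical sketch of the hand-off just described. The names (encryptPayload, prepareOutgoingRecord, uploadRecord) and the URL are stand-ins, not the real Sync APIs; the point is how many temporary objects and strings get created per record, and that each hand-off crosses a module (and thus compartment) boundary.

// Hypothetical sketch only -- not the actual record.js/resource.js code.
// record.js side: raw data -> crypto -> record object -> JSON string.
function encryptPayload(cleartext) {
  // The real code calls into NSS via js-ctypes (WeaveCrypto);
  // modeled here as a plain JSON round-trip.
  return JSON.stringify(cleartext); // temporary string
}
function prepareOutgoingRecord(rawData) {
  const record = { id: rawData.id, payload: encryptPayload(rawData) }; // record object
  return JSON.stringify(record); // another temporary string, handed to resource.js
}
// resource.js side: receives the serialized string and does the HTTP work.
function uploadRecord(uri, body) {
  // The real code issues the request via an HTTP channel; just log here.
  console.log("PUT " + uri + " (" + body.length + " chars)");
}
uploadRecord("https://sync.example.com/storage/history/abc",
             prepareOutgoingRecord({ id: "abc", histUri: "http://example.org/" }));

Incoming records run the same pipeline in reverse, so the same temporaries appear on download as well.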
I'm surprised that resource.js has as many string-chars bytes as record.js. I would expect most of the memory allocation to be in record.js and resource.js would "inherit" the subset/final product. Of course, I'm naive as to how the JS internals work.
Is there any way for me to debug what exactly is allocating these string-chars?
Comment 11•13 years ago
Interestingly, outgoing-only changes (i.e. a first sync upload) aren't too bad. The problem appears to occur only when processing incoming data.
And history sync seems to be the bit that triggers it. It is unknown whether it is something specific to history sync or whether it is proportional to the number of processed records (history typically has the most).
Comment 12•13 years ago
> Is there any way for me to debug what exactly is allocating these string-chars?
Maybe Andrew has a suggestion?
My only suggestion is to look at how data is passed between record.js and resource.js. The amount allocated is almost identical between the two compartments, which suggests to me that somehow the data is being passed from one compartment to the other. Perhaps that narrows things down?
Updated•13 years ago
Whiteboard: [memshrink] → [MemShrink]
Comment 13•13 years ago
Sorry, I have no idea how JS engine string allocation works. Can you use some sort of trace-malloc thing to get stack traces from where the allocations are happening? Of course, if it is all just JS code it won't be very useful.
Comment 14•13 years ago
Regression is somewhere between 93302:835d5fab89dc and 93609:6fe7dd2f8f57, inclusive. That's May 7 to May 10, roughly.
That range includes a services-central merge, which was 95% refactoring. I would be pretty surprised if anything in the merge caused this. Then again, the JS engine is at least half magic, so who knows.
I should have a full bisect within 24 hours.
Severity: normal → major
status-firefox15: --- → affected
Comment 15•13 years ago
That date range includes CPG, I believe.
Comment 16•13 years ago
Narrowed it down to between 93548:dd29535bac5f and 93609:6fe7dd2f8f57.
I'm going to take a break for a little. The house is still accepting bets on the commit.
Comment 17•13 years ago
/me puts his chips on http://hg.mozilla.org/mozilla-central/rev/e6529138e338
Comment 18•13 years ago
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #17)
> /me puts his chips on http://hg.mozilla.org/mozilla-central/rev/e6529138e338
That would have been my guess as well. However, the range has been reduced to 93543:b8f2a0ff6945 through 93563:075804813c58.
Comment 19•13 years ago
http://hg.mozilla.org/mozilla-central/rev/2f63829658ff is next on my betting list then.
Comment 20•13 years ago
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #19)
> http://hg.mozilla.org/mozilla-central/rev/2f63829658ff is next on my betting
> list then.
We have a winner!
Before bug 723350, Sync allocates a lot of memory but nothing gets out of hand. After bug 723350, total compartment size jumps 2-3x to the ~1GB range during the peak of history sync (on my test profile, anyway).
My casual observation is that GC seems to be running less frequently or less "completely" after bug 723350.
So, next steps. I'm not a release manager, but I'm pretty sure they wouldn't like things in the current state. Is backing out bug 723350 a reasonable request?
On a somewhat related topic: with CPG, if I have a 100MB JS string passed between modules, does that string get copied into the new compartment, or does the JS engine "copy by reference"? How would this be reflected in the per-compartment memory reporting from bug 754771?
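For illustration, the pattern being asked about looks roughly like the following sketch (purely hypothetical names; whether the engine shares or copies the characters when the string crosses the compartment boundary is exactly the open question, and this sketch does not answer it).

// Hypothetical illustration of the access pattern, not an answer.
function makeLargePayload(megabytes) {
  // SpiderMonkey stored string chars as 2-byte units at the time,
  // so ~N MB of string-chars is roughly N*1024*1024/2 characters.
  return "x".repeat((megabytes * 1024 * 1024) / 2);
}
// "record.js" produces the serialized batch...
const serialized = makeLargePayload(100);
// ...and "resource.js" (a different compartment under CPG) consumes it.
// If crossing the boundary copies the chars, both compartments report ~100MB.
function uploadViaResource(body) {
  return body.length;
}
console.log(uploadViaResource(serialized));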
Depends on: 723350
Keywords: regressionwindow-wanted
Updated•13 years ago
tracking-firefox15: --- → ?
Keywords: regression
Comment 21•13 years ago
Before bug 723350 we had both global and per-compartment malloc accounting. Now we have removed the global one and let the compartments do their own accounting.
From the API side, we only changed the updateMallocCounter logic.
I can think of two scenarios for why this happens:
1) updateMallocCounter gets called without a context.
updateMallocCounter has 2 arguments: bytes and optional JSContext.
In the old version we updated the runtime counter and cx->compartment if available.
Now we only update the counter if cx && cx->compartment.
2) CPG increased the number of compartments and the trigger logic is off.
Comment 7 shows that most memory is allocated in a single compartment, right? So back to 1).
Another thing I noticed is that we delete the mallocCounter for all compartments even if they are not GCed. This was not changed in bug 723350 but it might be part of the problem.
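To make scenario 1 concrete, here is a simplified JavaScript model of the accounting change. The real code is C++ inside the JS engine; the trigger threshold, field names, and structure below are made-up assumptions for illustration, not the actual implementation.

// Simplified model: malloc bytes reported without a context used to count
// toward a GC trigger; after the change they are dropped on the floor.
const GC_TRIGGER_BYTES = 128 * 1024 * 1024; // made-up threshold
// Old behaviour: always bump the runtime-wide counter, plus the compartment
// counter when a context (and therefore a compartment) is available.
function updateMallocCounterOld(rt, bytes, cx) {
  rt.mallocBytes += bytes;
  if (cx && cx.compartment) {
    cx.compartment.mallocBytes += bytes;
  }
  return rt.mallocBytes >= GC_TRIGGER_BYTES; // would schedule a GC
}
// New behaviour (as described above): only per-compartment accounting, so
// context-less allocations are never counted and never trigger a GC.
function updateMallocCounterNew(rt, bytes, cx) {
  if (cx && cx.compartment) {
    cx.compartment.mallocBytes += bytes;
    return cx.compartment.mallocBytes >= GC_TRIGGER_BYTES;
  }
  return false;
}
// Many context-less allocations: only the old logic ever reaches the trigger.
const rt = { mallocBytes: 0 };
let oldTriggers = false, newTriggers = false;
for (let i = 0; i < 2000; i++) {
  oldTriggers = updateMallocCounterOld(rt, 1024 * 1024, null) || oldTriggers;
  newTriggers = updateMallocCounterNew(rt, 1024 * 1024, null) || newTriggers;
}
console.log(oldTriggers, newTriggers); // true false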
Comment 22•13 years ago
> Comment 7 shows that most memory is allocated in a single compartment right?
No, it just looks that way because with the about:memory used in comment 7, we merge all allocations from system compartments together.
Comment 10 has the system compartments split out, and we can see that the memory lives in two compartments. Indeed, the value in the two compartments is almost identical, which suggests there may be some copying going on here.
Comment 23•13 years ago (Assignee)
I'll look at the scenarios gwagner outlined in comment 21 and will update as soon as I know more.
Assignee: nobody → tschneidereit+bmo
Comment 24•13 years ago (Assignee)
Seems like scenario 1 is the winner: starting the current Nightly with a fresh profile and a single about:blank tab causes over 2200 calls to updateMallocCounter without a context.
I'll next see if there's anything we can do about that. If that's not the case, backing out http://hg.mozilla.org/mozilla-central/rev/2f63829658ff might be the best option, I'm afraid.
Comment 25•13 years ago (Assignee)
Is there any chance of getting a solid set of STRs, or maybe even a profile that exhibits the explosive memory usage described here?
I'm trying to get to the bottom of the problem, but I can't really be sure that what I'm doing will help without STRs, I'm afraid.
Comment 26•13 years ago
(In reply to Till Schneidereit [:till] from comment #25)
> Is there any chance of getting a solid set of STRs, or maybe even a profile
> that exhibits the explosive memory usage described here?
>
> I'm trying to get to the bottom of the problem, but I can't really be sure
> that what I'm doing will help without STRs, I'm afraid.
gps should have STRs, but I speculate that you can hit the problem by syncing a large profile, perhaps choosing "Reset Sync > Replace the contents of this device" to ensure a large download.
QA should be able to provide a fat Sync account if you need one: tracy, tchung?
Comment 27•13 years ago
I've been using a version of our largest account this morning and have only hit this a couple of times. It's not reliably failing here. The account only has 1,000 or so bookmarks.
Comment 28•13 years ago
Just to make sure.. Could you go to about:memory and click "Minimize memory usage" and see if memory goes back to normal?
(In reply to Gregor Wagner [:gwagner] from comment #29)
> Just to make sure.. Could you go to about:memory and click "Minimize memory
> usage" and see if memory goes back to normal?
It wasn't leaking the memory for me when I was testing this the other day.
Comment 30•13 years ago
(In reply to Richard Newman [:rnewman] from comment #26)
> (In reply to Till Schneidereit [:till] from comment #25)
> > Is there any chance of getting a solid set of STRs, or maybe even a profile
> > that exhibits the explosive memory usage described here?
> >
> > I'm trying to get to the bottom of the problem, but I can't really be sure
> > that what I'm doing will help without STRs, I'm afraid.
The problem is most pronounced with history sync. I am able to see it when uploading a large amount of history records (like setting up Sync from your main profile where you do all your surfing) or when pulling down synced history records to a fresh profile (new sync). I did all the bisecting through download because I think it is slightly more pronounced. But, I could be wrong.
If you don't want to mess with your main profile, create a fresh profile, sync it with your existing Sync account, disconnect the device from Sync (preserving all data), then set up a new Sync account with a separate email (e.g. gps+bug756549@mozilla.com).
Let me see if I can code up a "stress test" that generates a bunch of "random" places data. I'm sure I will be reinventing the wheel here. If someone knows of existing code to save me the headache, please make a comment!
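For what it's worth, a rough sketch of such a generator might look like the following. It assumes the mozIAsyncHistory API that Places exposes as PlacesUtils.asyncHistory.updatePlaces(); the callback signatures and visit fields are approximations rather than tested code.

// Rough, untested sketch -- assumes PlacesUtils.asyncHistory.updatePlaces()
// accepts { uri, title, visits } place-info objects.
const { interfaces: Ci, utils: Cu } = Components;
Cu.import("resource://gre/modules/Services.jsm");
Cu.import("resource://gre/modules/PlacesUtils.jsm");
function generateFakeHistory(count) {
  const now = Date.now() * 1000; // visitDate is PRTime, in microseconds
  const places = [];
  for (let i = 0; i < count; i++) {
    places.push({
      uri: Services.io.newURI("http://example.com/page/" + i, null, null),
      title: "Fake page " + i,
      visits: [{
        visitDate: now - i * 60 * 1000000, // one visit per minute, going back
        transitionType: Ci.nsINavHistoryService.TRANSITION_LINK,
      }],
    });
  }
  return places;
}
// Insert enough places that the history engine has plenty of records to sync.
PlacesUtils.asyncHistory.updatePlaces(generateFakeHistory(50000), {
  handleError: function (resultCode, placeInfo) {},
  handleResult: function (placeInfo) {},
});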
Comment 31•13 years ago
Adding brains for Comment 31.
Updated•13 years ago
Whiteboard: [MemShrink] → [MemShrink][js:p0]
Can someone re-test this now that bug 723350 has been backed out?
Oops, didn't mean to unset the tracking flag.
Comment 34•13 years ago (Assignee)
I can confirm that backing out bug 723350 fixed the regression: setting up a new profile to sync with my main profile causes RSS to go up to ~3.3GB without the backout, whereas it never gets beyond ~650MB with the backout applied.
Unfortunately, my attempts at alleviating the regression haven't yet yielded much progress. Nevertheless: with bug 723350 backed out, this should be fixed and closeable.
Comment 35•13 years ago
I can also confirm that backing out bug 723350 fixed things. So, resolving WORKSFORME.
From comment 35, it appears that others are able to reproduce and thus can test changes. If you would like Sync code to generate data for testing, please ping me on bug 723350.
Comment 36•13 years ago (Assignee)
Thanks Gregory - both for confirming the fix and for the Sync code offer. You're right, though: I can reproduce memory usage exploding when syncing. The only reason I don't experience OOMs seems to be that I have enough memory on a 64bit system.
Updated•12 years ago
tracking-firefox15: --- → +
Updated•6 years ago
Component: Firefox Sync: Backend → Sync
Product: Cloud Services → Firefox