Closed Bug 668594 Opened 13 years ago Closed 13 years ago

While running reftest-style tests, we seem to have a memory leak and fennec hangs

Categories

(Firefox for Android Graveyard :: General, defect)

Platform: ARM / Android
Priority: Not set
Severity: normal

Tracking

(Not tracked)

Status: RESOLVED FIXED
Target Milestone: Firefox 8

People

(Reporter: jmaher, Unassigned)

References

Details

(Whiteboard: [android][tegra][mobile_unittests][inbound][qa?])

Attachments

(1 file)

In debugging some failures on tinderbox for jsreftest, crashtest and reftest, I find that we are failing tests part way through a run (usually in the same spot, but not always) and fennec is still running until the harness times out after 2400 seconds.

While looking at logcat (http://people.mozilla.org/~jmaher/android_dumps/jsreftest-2.log), I don't see anything useful other than the browser starting and no other gecko information. What I do see while looking at a process list when this happens is that we only have a few processes running, not the usual 25-30. [theory]I suspect this happens because we leak memory and android frees up space by closing down non-essential processes.[/theory]

To reproduce this, grab a tinderbox-ready tegra and a tests.zip (or run 'make package-tests' in your objdir), then run:

cd reftests
python remotereftest.py --deviceIP=192.168.1.101 --app=org.mozilla.fennec --xre-path=../../bin --extra-profile-file=jsreftest/tests/user.js --enable-privilege --total-chunks=2 --this-chunk=1 jsreftest/tests/jstests.list

where 192.168.1.101 is the IP address of your tinderbox tegra. This reproduces the problem consistently on my tegra.
Blocks: 662468, 663657
(In reply to comment #0)
> In debugging some failures on tinderbox for jsreftest, crashtest and
> reftest, I find that we are failing tests part way through a run (usually in
> the same spot, but not always) and fennec is still running until the harness
> times out after 2400 seconds.
>
> While looking at logcat
> (http://people.mozilla.org/~jmaher/android_dumps/jsreftest-2.log), I don't
> see anything useful other than the browser starting and no other gecko
> information. What I do see while looking at a process list when this
> happens is we only have a few processes running, not the usual 25-30
> processes. [theory]I suspect this happens because we leak memory and
> android is freeing up space by closing down non essential processes [/theory]

Any idea how to test that hypothesis?

> To reproduce this, grab a tinderbox ready tegra,

What's a tinderbox-ready tegra, and how would I get one? We have a tegra running text-console Ubuntu. Is that good enough?

> and a tests.zip (or a 'make
> package-tests' in your objdir) file and run:
> cd reftests
> python remotereftest.py --deviceIP=192.168.1.101 --app=org.mozilla.fennec
> --xre-path=../../bin --extra-profile-file=jsreftest/tests/user.js
> --enable-privilege --total-chunks=2 --this-chunk=1
> jsreftest/tests/jstests.list
>
> where 192.168.1.101 is the ip address of your tinderbox tegra.
>
> this reproduces the problem consistently on my tegra.
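One way to test the hypothesis, as a minimal sketch (not from the bug itself): Android's kernel low-memory killer logs each kill to the kernel log, so polling dmesg on the device during a run would show whether the OS is reaping processes. This assumes adb access to the tegra, which (per the next comment) can be flaky on these boards; the polling interval is arbitrary.

    import subprocess, time

    # Poll the device's kernel log for low-memory-killer kills during a test
    # run. Assumes 'adb' is on PATH and can reach the device; the exact
    # lowmemorykiller message format varies by Android version.
    seen = set()
    while True:
        dmesg = subprocess.check_output(["adb", "shell", "dmesg"]).decode("utf-8", "replace")
        for line in dmesg.splitlines():
            if "lowmemorykiller" in line and line not in seen:
                seen.add(line)
                print(time.strftime("%H:%M:%S"), line)
        time.sleep(10)

If kills show up in step with the process count dropping, that would confirm the theory that Android is closing processes to free memory.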
I need to monitor the total memory consumption of the tegra during the test. It would be nice to figure out how to query (or dump to the log file) the total memory that Fennec thinks it is using. Other than that, we need to figure out why all the processes are going away; as soon as they do, we stop running tests. Any tips for dumping the total memory that Fennec/Firefox thinks it is consuming from inside javascript/XUL?

A tinderbox-ready tegra is a tegra with the sutagent installed and accessible on the same network your machine is on. There is code to run these tests through ADB (usb cable), but we have had trouble getting that to work on a few different devices (including tegra).

Does the ubuntu installation on the tegra support python? There is a python version of SUTAgent which I have used to develop and fix bugs in testing remotely. It can be found here: http://people.mozilla.org/~jmaher/remotetesting/

The reason this is in a people account rather than version control is that it contains some newer code to support all the installation and tegra management stuff, which has only been tested on linux (ubuntu), not osx or win32.
So I dumped some system-level stats every 10 seconds to see what available memory we had and who was using it up. Lo and behold, plugin-container spikes and consumes all the available memory. Here is a log: http://people.mozilla.org/~jmaher/android_dumps/jsreftest4.log

Search for MemTotal or MemFree in the log and you will see the total system memory stats. In addition, scroll down past that and see a procrank (top for memory):

  PID      Vss      Rss      Pss      Uss  cmdline
 1622   82028K   75844K   47865K   41276K  org.mozilla.fennec
 1027   49320K   43308K   20339K   16740K  system_server
 1657   15492K   15492K   10918K    7420K  /data/data/org.mozilla.fennec/plugin-container

Near the end, when the process hangs, we see:

  PID      Vss      Rss      Pss      Uss  cmdline
 1657  720284K  720284K  715179K  711176K  /data/data/org.mozilla.fennec/plugin-container
 1622   81208K   75160K   46649K   39552K  org.mozilla.fennec
 1027   49584K   43548K   20576K   16980K  system_server

This reproduces every time for me, so the repro steps are still valid and very accurate.
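A minimal sketch of the kind of 10-second poller described above (the actual script used is not attached to the bug). It assumes adb access and that procrank is available on the device, as it was on the tegra images; /proc/meminfo is standard.

    import subprocess, time

    def adb_shell(cmd):
        # Run a shell command on the device and return its output as text.
        # Assumes 'adb' is on PATH and the tegra is the only attached device.
        return subprocess.check_output(["adb", "shell", cmd]).decode("utf-8", "replace")

    while True:
        # Headline system memory numbers from the kernel.
        for line in adb_shell("cat /proc/meminfo").splitlines():
            if line.startswith(("MemTotal", "MemFree")):
                print(line.strip())
        # procrank sorts processes by Pss; it produced the tables quoted above.
        # It ships on the tegra system images but not on every Android build.
        print("\n".join(adb_shell("procrank").splitlines()[:10]))
        time.sleep(10)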
I found 2 files so far which seem to be problematic:

http://mxr.mozilla.org/mozilla-central/source/js/src/tests/e4x/GC/regress-280844-1.js
http://mxr.mozilla.org/mozilla-central/source/js/src/tests/e4x/GC/regress-280844-2.js

With these commented out, we don't see plugin-container memory problems until e4x/Regress.
(In reply to comment #1)
> > To reproduce this, grab a tinderbox ready tegra,
>
> What's a tinderbox-ready tegra, and how would I get one? We have a tegra
> running text-console Ubuntu. Is that good enough?

Dave, Bmoss and I have one down here (2nd floor) that's not being used at the moment if you want to borrow it. (It's got Android on it, which may be required to repro. Not sure if this issue will repro on your ubuntu setup.)
Running the e4x/GC tests with code similar to about:memory (http://mxr.mozilla.org/mozilla-central/source/toolkit/components/aboutmemory/content/aboutMemory.js) after each test completes, I get this in the output:

http://people.mozilla.org/~jmaher/android_dumps/e4x_gc_memory.log

explicit/js/gc-heap is by far the fastest-growing reporter I see, but you can see everything else too. I think my logic for distinguishing the main vs content process is a bit lacking in the accuracy department, but I can continue to work on this as time goes on.
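To turn a dump like that into per-run growth numbers, a small parser helps. This is a sketch only: it assumes each reporter line in the log looks like "explicit/js/gc-heap 12345678" (path, then amount in bytes); the real log format may differ, so the parsing would need adjusting.

    import sys
    from collections import defaultdict

    # Accumulate growth per memory reporter across successive dumps in the log.
    last = {}
    growth = defaultdict(int)
    for line in open(sys.argv[1]):
        parts = line.split()
        # Assumed format: "<reporter-path> <bytes>"; skip everything else.
        if len(parts) == 2 and parts[0].startswith("explicit/") and parts[1].isdigit():
            path, amount = parts[0], int(parts[1])
            if path in last:
                growth[path] += amount - last[path]
            last[path] = amount

    # Show the ten reporters that grew the most over the whole run.
    for path, delta in sorted(growth.items(), key=lambda kv: -kv[1])[:10]:
        print("%12d  %s" % (delta, path))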
Several JS reftests are expected to use a lot of memory. These aren't leaks (well, in theory there could also be a leak); they are tests for problems that involve large amounts of memory. We recently disabled a few that were OOMing on desktop (after the script stack quota removal landed, they became more of a problem); if the tegras have less memory, we might need to disable some more for them.

I ran |make jstestbrowser| on desktop fennec now, and it looks like memory usage reaches the 512MB-1GB range (for the plugin-container process) several times during the run. It's probably very similar on Android.
Note that js browser tests change prefs like gczeal, which can affect memory usage, but this is broken in the multiprocess case (you can't set prefs in the child process). I filed bug 669949 for this. There is some chance fixing that would have an effect here.
This patch skips what we believe to be the slowest and most resource-intensive jsreftests. It passes on try server and in local testing on my tegra. Keep in mind this is jsreftest only; we should take a closer look at crashtest and reftest, as we see similar behavior there.
Attachment #544795 - Flags: review?(bclary)
Comment on attachment 544795 [details] [diff] [review]
skip slow jsreftests on android (1.0)

thanks!
Attachment #544795 - Flags: review?(bclary) → review+
Whiteboard: [android][tegra][mobile_unittests][mobile_dev_needed] → [android][tegra][mobile_unittests][inbound]
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 8
Marking tests as skip-if(Android) without comments isn't very future-proof. It makes it unlikely that our tests or code will be fixed. It also causes us to duplicate effort: if we port to iOS, change desktop to use content processes, or try to run our tests under Valgrind, we'll have to evaluate all these tests again.

The tests in this bug should be annotated with things like:

slow-if(MaxRAM<500)  <-- changes the timeout and skips it in some test runs
skip-if(MaxVM<500)   <-- relevant for devices with swapping disabled
expect-OOM           <-- meaning this is a test of OOM behavior

And elsewhere:

skip-if(OOPContent)
fails-if(screenWidth<600)
skip-if(ARM)
skip-if(AndroidWidget)

In many cases, there should also be bugs filed on fixing the tests or fixing the code, and those bugs should be referenced in comments in the manifest.
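For concreteness, a hypothetical jstests.list entry along those lines might look like the following (illustrative only: the manifest parser would need to learn these conditions, and the bug number is a placeholder):

    slow-if(MaxRAM<500) script e4x/GC/regress-280844-1.js # bug NNNNNN: drives plugin-container past 700MB on tegras (see bug 668594)

That way anyone re-evaluating the skip later knows exactly what resource the test needs and which bug tracks the underlying problem.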
Jesse, I agree with you here. There are hundreds of these cases, and in some scenarios large directories are commented out. It would take one person at least a month (probably closer to two) to go through all of these, properly categorize them by failure type, and file appropriate bugs.
Whiteboard: [android][tegra][mobile_unittests][inbound] → [android][tegra][mobile_unittests][inbound][qa?]