Closed Bug 1633339 Opened 5 years ago Closed 4 years ago

SeaMonkey consumes 100% CPU

Categories

(SeaMonkey :: General, defect)

defect

Tracking

(seamonkey2.53+ fixed, seamonkey2.57esr? affected)

RESOLVED FIXED
seamonkey2.53
Tracking Status
seamonkey2.53 + fixed
seamonkey2.57esr ? affected

People

(Reporter: commander4bugs, Assigned: buc, NeedInfo)

References

()

Details

(Whiteboard: SM2.53.8)

User Story

The issue discovered.

In short:

Mozilla-derived application, without (or switched off) e10s support, at the time of working with WebSockets, receives a "link-status-changed" event from the network interface. The execution of the corresponding code follows a path that has not been properly tested without e10s, which leads to a false positive on a check and an infinite loop.

GitHub uses websockets. The polyfill addon itself spoils nothing there -- it just causes things to work. But unlike in Firefox, e10s is not used neither in PaleMoon nor in SeaMonkey, and this explains why the issue appears.
To reproduce, you need any active websocket connections at the time of the link change event. Probably just the going to github from the clean profile does not provide this yet, which explains the difficulties with reproduction. For me, it was enough to have 3 tabs: "https://github.com/notifications?query=in:unread", "https://github.com/notifications" and some page from some of the projects. Make sure you are logged in, and let these tabs be restored at any new startup.

The issue is triggered by the "link-status-changed" "change" event. It can appear either right after the wakeup, or just sporadic during the normal work (esp. when ipv6 is on).
It can be caused manually -- fe. on Linux just add (or remove) an additional address to/from the network interface (ie. "ip addr add/del 192.168.100.100/24 dev ethX"), or even just reconnect the cable (if any).

To see what's going on, use MOZ_LOG="nsWebSocket:5 nsNotifyAddr:5" environ, as well as a proper MOZ_LOG_FILE= (instead of stderr). Make sure you exit soon after the hog, since the log file might be huge.
The "link-status-changed" event triggers WebSocketChannel::Observe() which then invokes WebSocketChannel::OnNetworkChanged(), providing it to be on a "target" thread. WebSocketChannel::OnNetworkChanged(), being under the "target" thread, performs a check, and then re-invokes itself under the socket thread.
Thus, the order is "Observe" --> "Check for data started under target thread" --> "send ping under socket thread".

Considering your logs (note "link-status-changed" and "on target thread"), you can see the order "Main Thread" --> "DOM Worker" --> "Socket Thread". But if you run firefox with the same test, you will see a difference: "Main Thread" --> "Main Thread" --> "Socket Thread", ie. the target thread is "Main" rather than "DOM Worker". With e10s, a separate process is used to handle content, therefore a stuff previously been at "DOM Worker" appears at the "Main Thread" of such a process.

The problem is IsOnTargetThread() check, which always returns true without e10s. This leads to an infinite loop.

WebSocketChannel::IsOnTargetThread() uses IsOnCurrentThread() method, which actually is WebSocketImpl::IsOnCurrentThread(), and utilizes WebSocketImpl::IsTargetThread(), which is kinda "true/false == true/false" primitive check.

Under both "DOM Worker" and "Socket Thread", it is "false == false" condition, which always return true.

So it seems that dom/base/WebSocket.cpp stuff (note, it is moved to dom/websocket/WebSocket.cpp in recent Firefox) was written in assumption that anything supports e10s nowadays, therefore its IsOnCurrentThread() implementation appears incomplete.

Fortunately, the algorithm in the affected code seems clear enough. We can implement this in a slightly different way that does not use the problem method call.

Workaround: set network.notify.changed" to "false"

Attachments

(6 files)

Attached image CPU.png (deleted) —

User Agent: Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0 SeaMonkey/2.53.1 Lightning/5.8.1

Steps to reproduce:

Every once in a while, the browser starts consuming 100% CPU
(100% of one core, to be exact, see the screenshot - 25% of my 4 core CPU).
This started happening after upgrading to 2.53.1, it never happened
in the previous version (2.49.5). All versions are 32-bit running
on 32-bit Windows 7.

The steps to reproduce this are vague, I know, but there seems
to be a pattern. As far as I can tell, it is related to memory
and the garbage collector going on some sort of an infinite loop.
When this happens, the browser is essentially not usable, but
is working although with VERY VERY slow response to scrolling,
closing/switching tabs etc.

The pattern I observed:

  • The "age" of the browser matters.

This seems to be related to how long the browser has been running.
I have never seen it in a freshly open browser. Instead, it usually happens
at the end of the day when I have been using it for the entire day.

  • The number of tabs does not seem to matter

I have seen this happen with just a few tabs. At the time I captured
the attached logs, I had only 2 tabs open. However, the number of tabs
does seem to make it easier for this bug to happen. The most common
pattern - open a web site (I usually do this with a news website,
http://www.lenta.ru), then Ctrl+Shift+Click on multiple new articles
so that they are open in background tabs without switching to them.
Having said that, even opening 20-40 tabs does not trigger this reliably.
The critical part, I think, is how many tabs are being open at the same time,
not how many tabs are actually open in a "complete" state.

  • The actual URL does not matter

Most commonly I see this on http://www.lenta.ru, but I suspect this
is just because of my browsing pattern. I have seen this happen
with other sites too. What is interesting is that even when the browser
gets into this "loop" state, if I kill the browser in the Task Manager
and restart it, when the browser restarts and opens all the same tabs
again, everything is working.

Bottom line - this seems to be a race condition of sorts somewhere
in the memory allocation or garbage collection. It is not a concrete
error on any one concrete page.

  • Closing all tabs stops the loop

If I am patient enough and close all tabs, the browser becomes
responsive again. The attached logs were captured after closing
all tabs after getting into this "loop" state without restarting
the browser.

Unfortunately, this is as much information as I can gather.
This happens almost every day to me, but I do not have a very reliable
way to reproduce this. I am attaching concise and verbose memory
logs here. Please let me know if there is anything else I can capture,
preferably from the outside of the browser as getting it done
inside takes forever.

Attached file memory-report.json.gz (deleted) —

Attaching the memory report.

Attached file concise-logs.7z (deleted) —

Concise memory logs

Verbose memory logs

Verbose memory logs

I sometimes have similar issues. Try to open page about:memory and do a CC (Cycle Collection). Does that help?

Forgot to mention, there is another observation - once the browser
gets into this state, even if I manage to get it to stop loop, when I close
it, the process does not actually end. Instead, it sits there forever
doing nothing (consuming 0% CPU). Opening the browser again
complains about another instance running.

Try to open page about:memory and do a CC (Cycle Collection). Does that help?

Yes, it does. This is actually what led me to believe this is related to memory.
However, I did not try to do just CC - I usually click all three buttons
(GC, CC, Minimize) several times.

Has this x86 7 Windows only 2 GB of memory?

Has this x86 7 Windows only 2 GB of memory?

This system has 4 GB RAM.

Having said that, I do not think this is related to the limit on the user space memory
or the process going into the swap file. When I captured the logs for this bug,
the SeaMonkey was consuming about 400MB of RAM (see the attached screenshot),
which is way below the user space limit. As to the swap, the browser was more or
less the only application running at that time (there were a few tiny applications,
but they do not matter). Overall, the system was using about one third
of the physical RAM available and not swapping anything at all.

Because this bug's Severity has not been changed from the default since it was filed, and it's Priority is -- (non,) indicating it has has not been previously triaged, the bug's Severity is being updated to -- (default, untriaged.)

Severity: normal → --

Is this perhaps related to or a duplicate of Bug 1616551?

I increased my memory to 6GB and it happens here. Periodically just ignores any input for 3+/- seconds. Doing a GC/CC/Memory Optimiziation removes the problem. Even though the memory footprint does not change significantly through it.

Some additional observations.

It appears that there are two sites that are more prone to trigger this behavior:

I can not specifically pin point either one of these sites. Most of the time they work just fine. However, in most cases when I see this bug, I am visiting either Amazon or Google Maps. If I had to guess, I would say that on Amazon the culprit is opening any item and then moving the cursor over the picture so that it brings up a zoomed version of the image. And on Google Maps it is just zooming in/out and dragging the map around. For Google Maps, it does not have to be that specific site. Some sites embed the map, so that works as a trigger too.

WARNING: It does not mean that just opening a fresh browser and opening Amazon is going to reproduce this bug. No. On most days both sites work just fine. However, it appears that these sites have something that is more likely to trigger the bug than other sites. If I had to put a number to this statement, I would say that in 70-80% of occurrences of this bug (but NOT always), I was just playing with either Amazon or Google Maps.

In all cases the symptoms are the same: the browser starts being really non-responsive and reacts to all user actions with a long delay. When I look in the Task Manager, the SeaMonkey process is consuming a single core (25% out of 4 cores in my cases) and consumes a few hundreds of megabytes of memory, usually 500-600-700 MB. If I manage to open about:memory and click "Minimize memory usage", the first cycle is VERY slow, but it seems to drop the CPU consumption to 0% after that (and releases some memory).

What is interesting is that even though the minimize button solves the problem at that specific moment, the browser appears to be already in some bad state. Continuing using the browser in this state is very likely to bring back the same CPU consumption very quickly. The only guaranteed way to get out of this state is to do a full restart of the browser. Even if open the same pages again (even if they are opened through the saved session at the same time), the browser works perfectly fine after that.

All of that suggests some form of a memory corruption, probably due to a race condition caused by heavy Java Script usage (both Amazon and Google Maps are like that). Once things are corrupted, the restart is the only cure. Another symptom that suggests a corruption of sorts is this: when this bug happens, there is a small chance to get not only 100% CPU usage, but also an actual visual corruption of the browser. In most cases it looks as if browser tabs suddenly stopped rendering themselves and become transparent. Clicking from tab to tab changes the title bar of the browser, but the HTML view inside the tab does not actually change. In fact, it is not even a proper HTML view anymore. It is just bits and pieces left of what was on the screen in that area before. By resizing and moving the browser window, I can get random pieces, borders, corners of images, broken text etc. Once the browser gets that bad, I can not even close it anymore. The only option is the forceful kill in the Task Manager.

Will it help with the diagnosis if I generate a full memory dump for the SeaMonkey process when this happens? It will be huge though, about 700 MB raw, 150 MB compressed.

Will it help with the diagnosis if I generate a full memory dump for the SeaMonkey process when this happens?
No.

Are you still running the 32 bit version? More ram won't help here then becuase it only has a small address space of about 3GB. I am using Amazon frequently and maps occasionally. They can use extensive memory at times and become a bit sluggish for a few seconds when they reload stuff but tthat is it about it. I would pick the x64 version and see if the problem goes away. Uninstall the x86 first. Makes no sense to run the x86 on a 64 bit OS with 4 or more GB ram.

(In reply to Frank-Rainer Grahl (:frg) from comment #15)

Are you still running the 32 bit version?

Yes.

More ram won't help here then becuase it only has a small address space of about 3GB.

The overall amount of RAM is definitely not the limit here. When this bug
happens, the SeaMonkey process is consuming way less than 1 GB. And the system
overall is not stressed by any means, there is plenty of RAM available and nothing
is being swapped.

I am using Amazon frequently and maps occasionally. They can use extensive
memory at times and become a bit sluggish for a few seconds when they reload
stuff but tthat is it about it.

Yes, that is my experience too. MOST OF THE TIMES. However, there seems to be
some non-trivial correlation between these two sites and this bug. Every once
in a while this bug happens, and more likely than not, I was just visiting
either Amazon or Google Maps.

I would pick the x64 version and see if the problem goes away. Uninstall the x86 first.
Makes no sense to run the x86 on a 64 bit OS with 4 or more GB ram.

No, can do :) My operating system is 32-bit.

Assignee: nobody → dmitry
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
OS: Unspecified → All
Hardware: Unspecified → All
Whiteboard: SM2.53.8
Version: SeaMonkey 2.53 Branch → Trunk

Big thanks to buc. r/a+ for SeaMonkey 2.53 and 2.57 releases branches only.

This seems to be the primary cause for reported cpu hogging problems. There might be others but so far I have not seen serious garbage collection problems in the latest releases.

So lest see what problems remain with this patch in place.

Attachment #9218795 - Flags: review+
Attachment #9218795 - Flags: approval-comm-release+
Attachment #9218795 - Flags: approval-comm-esr60+

Well, there is a workaround for websocket issue -- set "network.notify.changed" to false in about:config .

If such a workaround does not help here, then it is another issue.

It could be fine if you can check this in your cases.

Flags: needinfo?(frank)
Flags: needinfo?(commander4bugs)

I've built a version of 2.53.7.1 with the 1633339-websocketloop-2538.patch and have been running it on a (for nowadays) relatively unpowered netbook for the last two days. I can confirm that CPU usage is way down and responsiveness is much better (though responsiveness is not quite as good as I remember from the 2.49.x series). However, the severe memory usage reported in duped bugs such as Bug 1616551 remains; after a few hours of browsing htop reports 10.3 GB of virtual memory used. I can try to provide more detailed or rigorous comparisons between the patched 2.53.7.1 and 2.49.5 if you wish and if you can give me some idea of what information you need.

(In reply to Tristan Miller from comment #19)

I can try to provide more detailed or rigorous comparisons between the patched 2.53.7.1 and 2.49.5 if you wish and if you can give me some idea of what information you need.

Try to figure out if the increase in memory occurs smoothly from SM starting, or if it starts to grow from a certain point later.

When such a memory growing already started, try to close tabs step by step in a hope to see what inspires this.

Try to notice the difference in behavior between 2.49.5 and 2.53.7.1 on some used sites. If this is due to the support of new features in 2.53.x, try to disable such features (some prefs in about:config etc.).

As usual, for such bugs a reliable reproducible case is needed. Without it, nothing can be done by developers. Such a test case could be kinda start from the clean profile, open some pages (but not so much), wait some amount of time (but better minutes rather than hours).

Sorry for a late reply - did not notice this "needinfo" in a timely manner.

(In reply to Dmitry Butskoy from comment #18)

Well, there is a workaround for websocket issue -- set "network.notify.changed" to false in about:config .

I tried setting this value to "false" and jumped around Google Maps and Amazon.
So far SeaMonkey appears to be stable without any slowdown, although I did not
try to run this for too long: just 10-20 minutes of browsing. I will continue running
with this option set and see how it goes.

P.S. I am not sure if this is helping, but just in case. In one of the cases when
my SeaMonkey became bogged down, I closed all tabs and went to about:memory
and tried clicking "Minimize memory usage". Each click resulted in a substantial
delay (several seconds) before the "completed" message was updated.
When I looked in the Task Manager, the SeaMonkey process was consuming about 500 MB
of RAM and 100% of a single core (Task Manager was showing 25% because I have 4 cores
in my CPU).

The way I see it - the garbage collection code seems to be entering some form
of an infinite loop. Or alternatively, it goes into a such long cycle that runs
until the next garbage collection cycle is supposed to kick in. Either way,
in this state SeaMonkey sees to be running garbage collection in perpetuity
without ever leaving this state. Obviously, it makes the whole process very sluggish
and even unresponsive at time.

Flags: needinfo?(commander4bugs)

(In reply to Commander from comment #21)

Sorry for a late reply - did not notice this "needinfo" in a timely manner.

(In reply to Dmitry Butskoy from comment #18)

Well, there is a workaround for websocket issue -- set "network.notify.changed" to false in about:config .

I tried setting this value to "false" and jumped around Google Maps and Amazon.
So far SeaMonkey appears to be stable without any slowdown, although I did not
try to run this for too long: just 10-20 minutes of browsing. I will continue running
with this option set and see how it goes.

Some updates after using the browser with the given option for a week or so.

  1. The change DOES have a positive effect on the browser stability and memory
    consumption. I can finally use Google Maps and Amazon in SeaMonkey.

  2. However, the change does NOT seem to solve the problem completely.
    Today, after closing the tab with Google Maps, SeaMonkey went into the 100% CPU
    mode again. The Task Manager showed 25% usage (a single core out of 4 cores
    in my system) and memory slowly oscillating between 540 and 510 MB. The memory
    would go up and down with a period of a few seconds. The interesting thing
    is that I observed this behavior for 5 or 10 minutes and then left the computer
    to see what would happen. Surprisingly, when I can back to it 1.5 hours later,
    the browser was down to about 400 MB and no CPU usage AT ALL. It means that
    eventually, it was able to get out of the funk. I do not know how long it took.
    Definitely longer than 5-10 minutes.

  3. Changing this setting to disable notifications breaks Linked-in. When this override
    is active, Linked-in basically hangs right after asking for the user name/password
    on the "progress" page before it gets to the user landing page. I confirmed that
    removing the override fixes Linked-in and it works correctly after that.

(In reply to Commander from comment #22)

Today, after closing the tab with Google Maps, SeaMonkey went into the 100% CPU
mode again.

It could be fine if you can try to obtain a repeatable case here (even on your regular profile). While you still remember what you did before it happened. :)

Changing this setting to disable notifications breaks Linked-in. When this override
is active, Linked-in basically hangs right after asking for the user name/password
on the "progress" page before it gets to the user landing page. I confirmed that
removing the override fixes Linked-in and it works correctly after that.

It seems very strange. Why the disabling of the notifications about the "link scope" events (kinda local host <--> local router) affects the behaviour of remote site? Looks like after the "asking for the user name/password" your network interface changes its address, which breaks websockets (since nobody inform them about this).
Anyway, such a disabling is no more needed since 2.53.8 will be released (with the proper patch instead of the work-around).

https://gitlab.com/seamonkey-project/seamonkey-2.53-mozilla/-/commit/906104481fa5049a8982ce4e675019e21cbe3704
Fix infinite loop when network changes and e10s is disabled. r=frg a=frg

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → seamonkey2.53
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: