Enable Stack clash protection on supported OS / arch
Categories
(Firefox Build System :: General, task, P2)
Tracking
(relnote-firefox -, firefox86 fixed)
People
(Reporter: Sylvestre, Assigned: Sylvestre)
References
Details
(Keywords: sec-want, Whiteboard: [adv-main86-])
Attachments
(6 files, 1 obsolete file)
This is a feature provided by gcc for a few years now.
clang doesn't have it and it has been marked as a blocker for redhat/fedora to use clang to build Firefox.
Serge started the work on the clang side to implement it:
http://lists.llvm.org/pipermail/llvm-dev/2019-October/135786.html
https://reviews.llvm.org/D68720
He reached out to me to evaluate the change for Firefox performances.
Opening a bug for transparency and sharing results
If there isn't performance regression, we might want to add this to our official builds.
Comment 1•5 years ago
|
||
Is it possible for him to make a backport of the patch to the clang 8 branch? That would be easiest for us to apply and test.
Assignee | ||
Comment 2•5 years ago
|
||
We moved to clang 9 and Serge shared with me a backported patch
https://hg.mozilla.org/try/rev/37fe420914ca37a20a75dd8ba42089579fa72b5a
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1db6f0e4ceca73e401ca60d0e93ad29cbe1c3949&selectedJob=271266815
Assignee | ||
Comment 3•5 years ago
|
||
The linux pgo build with the option:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3850d59d0d33e2ca0c746c54f059304306368972
Assignee | ||
Comment 4•5 years ago
|
||
With the perf results:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=3850d59d0d33e2ca0c746c54f059304306368972&framework=1&selectedTimeRange=172800
I do see an improvement from 3 to 4% on Linux.
I am a bit confused to see Windows improvement as I didn't touch it (the stack clash patch is only applied on clang Linux) but the confidence is smaller.
Anyway, it matches what Serge told me. He saw improvements on some projects (and no regression yet).
I would however appreciate a confirmation by an actual perf expert.
So, if this is the case, once landed, we could even take the patch to clang 9 to avoid having to wait a few more months (clang 10 release)
Comment 5•5 years ago
|
||
:jmaher is who I usually go to for such requests, but he is on leave.
Would Serge be open to improving his patch to include Windows? Or at least documenting what would be needed to do so?
Assignee | ||
Comment 6•5 years ago
|
||
Comparing try with try isn't giving the same kind of results (slower)
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&newProject=try&newRevision=3850d59d0d33e2ca0c746c54f059304306368972&framework=1&selectedTimeRange=172800
I am a bit lost on how to understand these results.
Assignee | ||
Comment 7•5 years ago
|
||
On dmajor suggestion, I reused 1db6f0e4ceca73e401ca60d0e93ad29cbe1c3949 as base line (the clang patch but not enabled in our builds) and retriggered the talos jobs
Here is the comparison:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1db6f0e4ceca73e401ca60d0e93ad29cbe1c3949&newProject=try&newRevision=3850d59d0d33e2ca0c746c54f059304306368972&framework=1
Seems that there isn't any significant improvement or regression after all. Except maybe a regression in "about_preferences_basic opt e10s stylo" on Linux opt (not pgo)
Please also run raptor tests, at least speedometer. Should we also test this on Windows builds?
Assignee | ||
Comment 9•5 years ago
|
||
Serge told me that it should not have impact on Windows (and not needed there)
Comment 10•5 years ago
|
||
@dmajor, @tjr : The windows ABI already requires a touch on each stack page, so Windows shouldn't be affected, (not the case of the patch you're testing with, but should be the case in the near future). Basically it means that clang-cl will support -fstack-clash-protection
, but it should be a no-op for Windows.
@sylvestre: thanks for the feedback. Having firefox passing its validation with this patch is very good news. I've made several significant changes to the original patch, so an extra round may be needed, keep in touch.
@all: once the patch lands in llvm's master, I'll provide a backport if you want.
Assignee | ||
Comment 11•5 years ago
|
||
No diff with raptor/speedometer
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1db6f0e4ceca73e401ca60d0e93ad29cbe1c3949&newProject=try&newRevision=3850d59d0d33e2ca0c746c54f059304306368972&framework=10
(thanks again david ;)
Assignee | ||
Comment 12•5 years ago
|
||
Serge updated the patch and provided me a clang-9 patch.
Trying it here:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0e38b4343b7b64bdea67a6adca12844631357280
https://hg.mozilla.org/try/rev/bec46adab294c7b1547d447ac5f8d15b34b87fd0
Assignee | ||
Comment 13•5 years ago
|
||
Seems that there isn't any performance change (but I might be wrong as I am no expert in perfherder)
Assignee | ||
Updated•5 years ago
|
Assignee | ||
Comment 14•5 years ago
|
||
Dave Hunt told me the following:
If so, this suggests that your patch has caused a high confidence regression to the kraken results across all platforms. The kraken benchmark measures JavaScript performance, details can be found at https://wiki.mozilla.org/TestEngineering/Performance/Talos/Tests#kraken. There also appears to be a high confidence improvement to tabpaint across all platforms, which is documented here: https://wiki.mozilla.org/TestEngineering/Performance/Talos/Tests#tabpaint. There are a few other medium to high confidence improvements and regressions.
I see references to linux64, macos64, and win64 in your patch here https://hg.mozilla.org/try/rev/408e7270af623961e8cb466aa31b18248aa3e512 could these explain the cross-platform impact? Were the tasks executed around the same time? The reason I ask is that if the base tasks were triggered first, and the new triggered much later, perhaps an infrastructure change could have been introduced in the meantime.
Serge told me that he will have a look why Windows performances are impacted
Assignee | ||
Comment 15•5 years ago
|
||
After discussing with dmajor, he told me that I have been probably impacted by another unrelated issue.
So, comparing with and without the option activated, I saw that:
Seems that we are only regressing on Windows 7:
tp5n main_startup_fileio opt e10s stylo
tp5n time_to_session_store_window_restored_ms opt e10s stylo
Assignee | ||
Comment 16•5 years ago
|
||
for legacy
Comment 17•5 years ago
|
||
Seems that we are only regressing on Windows 7:
tp5n main_startup_fileio opt e10s stylo
tp5n time_to_session_store_window_restored_ms opt e10s stylo
I'm not worried about these two, historically they haven't been super reliable for me.
Comment 18•5 years ago
|
||
@sylvestre: I've explicitly disabled the flag for non-Linux targets, users receive a warning if they try to use it for non-Linux, non-x86:x64 platform. I've also setup an extra validation step for the review, it should be fine : https://github.com/serge-sans-paille/llvm-project/pull/3/checks
Assignee | ||
Comment 19•5 years ago
|
||
Serge provided me with a new patch to skip the execution of this code on Windows & Mac.
I just started some talos job;
-
without the option enabled:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b67d19f2d016cfd19e2854fdd2334c7540a96b31
(just clang + a normal fx build)
https://hg.mozilla.org/try/rev/41b41afc9bc18557a86db05becdb0f6b6f154a10 -
with the option activated:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a405edf572eb19ba33b829bb691a3a3095e68cc6
https://hg.mozilla.org/try/rev/cc533026241361425a0991c01e4bff3bb4239605
I kept the Windows & Mac part to make sure that it behaves correctly
Assignee | ||
Comment 20•5 years ago
|
||
Looking at the results ( https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=a405edf572eb19ba33b829bb691a3a3095e68cc6&newProject=try&newRevision=b67d19f2d016cfd19e2854fdd2334c7540a96b31&framework=1 ) . I don't seen any regression on Windows or Mac anymore
Just a small regression with displaylist_mutate opt e10s stylo on Linux 64 (-1.2%)
Comment 21•5 years ago
|
||
(In reply to Sylvestre Ledru [:Sylvestre] from comment #20)
Looking at the results ( https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=a405edf572eb19ba33b829bb691a3a3095e68cc6&newProject=try&newRevision=b67d19f2d016cfd19e2854fdd2334c7540a96b31&framework=1 ) . I don't seen any regression on Windows or Mac anymore
Just a small regression with displaylist_mutate opt e10s stylo on Linux 64 (-1.2%)
It looks like there are only five runs before and after. At that level, we can't draw a conclusion because the tests have too much variance. Also there aren't any raptor runs in there. Depending on how intense you want to be about this experiment, I would say we need more data. But if I understand correctly this is sort of just a casual run, so maybe this combined with the previous testing is sufficient.
Assignee | ||
Comment 22•4 years ago
|
||
Not sure we want to enable it as it might have perf impact
Assignee | ||
Comment 23•4 years ago
|
||
Serge told me that we would need:
https://github.com/llvm/llvm-project/commit/0f60bcc36c34522618bd1425a45f8c6006568fb6
Comment 24•4 years ago
|
||
(In reply to Sylvestre Ledru [:Sylvestre] from comment #23)
Serge told me that we would need:
https://github.com/llvm/llvm-project/commit/0f60bcc36c34522618bd1425a45f8c6006568fb6
For the record: that patch fixes probing for dynamic alloca. If firefox is not using VLA or explicit call to alloca
(or any other construct that generates dynamic alloca generation (?)), this patch should have no impact.
Assignee | ||
Comment 25•4 years ago
|
||
Assignee | ||
Comment 26•4 years ago
|
||
caused:
[task 2020-11-05T10:14:26.012Z] 10:14:26 INFO - In file included from Unified_cpp_sandbox_linux2.cpp:137:
[task 2020-11-05T10:14:26.012Z] 10:14:26 ERROR - /builds/worker/checkouts/gecko/security/sandbox/chromium/sandbox/linux/seccomp-bpf/syscall.cc:369:3: error: Unable to protect inline asm that clobbers stack pointer against stack clash [-Werror,-Wstack-protector]
[task 2020-11-05T10:14:26.012Z] 10:14:26 INFO - asm volatile(
[task 2020-11-05T10:14:26.013Z] 10:14:26 INFO - ^
[task 2020-11-05T10:14:26.013Z] 10:14:26 INFO - 1 error generated.
Depends on D97566
Assignee | ||
Comment 27•4 years ago
|
||
Depends on D97567
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Assignee | ||
Comment 28•4 years ago
|
||
AFAIK, no perf regression:
https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=a47c98b909b61035dae2e1e00883f2ade0fef129&newProject=try&newRevision=62108fa48bd15fe01f1a0f1ffab133af9b4207cc&framework=13
For now, enabling it when built with clang >= 11.0.0 on x86, x86_64, s390x & ppc64 on non Windows.
According to the clang author (Serge), it is built-in on Windows
Assignee | ||
Updated•4 years ago
|
Comment 29•4 years ago
|
||
(In reply to Sylvestre Ledru [:Sylvestre] from comment #28)
AFAIK, no perf regression:
I added a few more test suites that usually give me trouble.
The webaudio test, which is very sensitive to c++ changes, does take a couple-percent hit in vanilla builds: https://treeherder.mozilla.org/perfherder/comparesubtest?originalProject=try&newProject=try&newRevision=62108fa48bd15fe01f1a0f1ffab133af9b4207cc&originalSignature=1726480&newSignature=1726480&framework=10&originalRevision=a47c98b909b61035dae2e1e00883f2ade0fef129
But in PGO builds, which are what really matter, it's fine: https://treeherder.mozilla.org/perfherder/comparesubtest?originalProject=try&newProject=try&newRevision=62108fa48bd15fe01f1a0f1ffab133af9b4207cc&originalSignature=1904137&newSignature=1904137&framework=10&originalRevision=a47c98b909b61035dae2e1e00883f2ade0fef129
Updated•4 years ago
|
Comment 30•4 years ago
|
||
Assignee | ||
Updated•4 years ago
|
Comment 31•4 years ago
|
||
bugherder |
Comment 32•4 years ago
|
||
Comment 33•4 years ago
|
||
bugherder |
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Comment 34•4 years ago
|
||
Comment 35•4 years ago
|
||
bugherder |
Comment 36•4 years ago
|
||
Comment 37•4 years ago
|
||
bugherder |
Assignee | ||
Comment 38•4 years ago
|
||
The currently nightly has it \o/
Tom, As I was saying, we are going to write a blog post with Serge about it.
Do you think it should be mentioned in the release notes? I know it isn't much but some press like this kind security improvements.
Updated•4 years ago
|
Comment 39•4 years ago
|
||
I'm not sure about the release notes; I suppose add it and see what the people who manage that say.
Comment 40•4 years ago
|
||
We've seen some test_jsctypes.js
test failures on Try and locally. I tracked it down to alloca(0)
with -fstack-clash-protection
. It generates buggy machine code, a fix for that landed in LLVM:
https://github.com/llvm/llvm-project/commit/9573c9f2a363da71b2c07a3add4e52721e6028a0
https://www.mail-archive.com/llvm-branch-commits@lists.llvm.org/msg05187.html
I can reproduce this locally on central tip (hg revision f4001dfef5bc). I don't know why this is not perma-orange though.
Assignee | ||
Comment 41•4 years ago
|
||
Jandem, I opened bug 1680276 for this
Comment 42•4 years ago
|
||
Backed out from central for causing frequently test failures on try. (eg: https://treeherder.mozilla.org/jobs?repo=try&resultStatus=testfailed%2Cbusted%2Cexception&revision=754947316770e3b297b5b18a99ad39ab98cd77a5&selectedTaskRun=As7HGID9TNqBbHJ5eLaJhQ.0)
Backout link: https://hg.mozilla.org/mozilla-central/rev/4f4c8ad9952a6a3e74de5c4419e121c2f4ccd831
Assignee | ||
Comment 43•4 years ago
|
||
ok
I will rebase upstream fixes or wait for 11.0.1
Assignee | ||
Comment 44•4 years ago
|
||
From upstream repo:
a1e0363c7402f7aa58e24e0e6dfa447ebabc1910 to bbe6cbbed8c7460a7e8477373b9250543362e771
Comment 45•4 years ago
|
||
Assignee | ||
Comment 46•4 years ago
|
||
For now, it needs a patch version of clang 11.0.0 to work correctly.
I will remove it once 11.0.1 is released.
Updated•4 years ago
|
Comment 47•4 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/e45001cddaf4
https://hg.mozilla.org/mozilla-central/rev/f0b02db01894
Assignee | ||
Comment 48•4 years ago
|
||
This might be too technical but it shows some security improvement on Linux and I often see the press mentioning that.
Feel free to ignore it!
Release Note Request (optional, but appreciated)
[Why is this notable]: Prevent some classes of security issues
[Affects Firefox for Android]: Yes
[Suggested wording]: On Linux and Android, the protection to mitigate the stack clash attack has been activated.
[Links (documentation, blog post, etc)]: I have been writing a draft with Red Hat & Rust for the llvm blog.
A few more links for the curious:
https://blog.qualys.com/vulnerabilities-research/2017/06/19/the-stack-clash
https://pagure.io/fesco/issue/2020
https://reviews.llvm.org/D68720
https://www.phoronix.com/scan.php?page=news_item&px=LLVM-Stack-Clash-Protection-20
Comment 49•4 years ago
|
||
This had to be backed out for frequent crashes on Linux x64 debug affecting Try pushes (Bug 1679994):
https://hg.mozilla.org/mozilla-central/rev/5e25722bcc7c838934d95ca5b9ba11da0b16d1ab
https://hg.mozilla.org/integration/autoland/rev/5e25722bcc7c838934d95ca5b9ba11da0b16d1ab
Comment 50•4 years ago
|
||
(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #49)
This had to be backed out for frequent crashes on Linux x64 debug affecting Try pushes (Bug 1679994):
The issue in bug 1679994 should have been fixed by the earlier backout here in comment 42, and we should not be seeing those issues anymore on the relanding in comment 47. Was there a different set of failures that prompted the second backout? Are there any links to the logs available?
Comment 51•4 years ago
|
||
Comment 52•4 years ago
|
||
Thanks Aryx.
The minidump from this GTest task has
00007f63`d80b1fab 4883c00f add rax,0Fh
00007f63`d80b1faf 4883e0f0 and rax,0FFFFFFFFFFFFFFF0h
00007f63`d80b1fb3 4989e5 mov r13,rsp
00007f63`d80b1fb6 4929c5 sub r13,rax
00007f63`d80b1fb9 4c89a5c8feffff mov qword ptr [rbp-138h],r12
00007f63`d80b1fc0 4939e5 cmp r13,rsp
00007f63`d80b1fc3 7c11 jl libxul+0x870ffd6 (00007f63`d80b1fd6)
00007f63`d80b1fc5 48c7042400000000 mov qword ptr [rsp],0
00007f63`d80b1fcd 4881ec00100000 sub rsp,1000h
00007f63`d80b1fd4 ebea jmp libxul+0x870ffc0 (00007f63`d80b1fc0)
00007f63`d80b1fd6 4c89ec mov rsp,r13
I don't think this does the right thing when rax
is initially 0.
The compiler for this build had this squashed patch on top of llvmorg-11.0.0
.
@sguelton: Does the above code look as expected to you?
Comment 53•4 years ago
|
||
Well, the story gets worse: I pulled the preprocessed quant_bands.c from CI, and downloaded the same compiler artifact that the builds are using, and on my machine I get different output (jge
).
Could this be a caching problem? It looks very strange that https://treeherder.mozilla.org/jobs?repo=autoland&revision=f0b02db0189473f12beb4b483dab17708ffde615&selectedTaskRun=eSFOksz8Rd67FeVxcky7Vw.0 built a brand-new clang, yet the build task claims sccache hit rate debug
was 0.6!
Comment 54•4 years ago
|
||
It's not strange, actually. That's the power of dynamic libraries. sccache uses the compiler binary (clang) for hashing, but the patch only touches libllvm... so from sccache's perspective, there was no change to the compiler. I don't think this is the first time this happens.
Assignee | ||
Comment 55•4 years ago
|
||
So, clearing the cache before landing this patch would fix the issue?
If so, how can we clear the cache ? :)
merci!
Comment 56•4 years ago
|
||
Clearing the cache would mean that builds using the older clang may get objects from a build with the new clang. It would be better to find a way to make the clang binary different.
Assignee | ||
Comment 57•4 years ago
|
||
Would it be possible to include libllvm in the hashing? (and/or all llvm/clang key files)
This kind of issue is pretty tricky to debug
Comment 58•4 years ago
|
||
That would need sccache to be aware of such details of the compiler...
Updated•4 years ago
|
Assignee | ||
Comment 59•4 years ago
|
||
Let's try to land it again now that we are using clang 11.0.1 (which includes the fixes), so that we don't have the cache issue one more time :)
Comment 60•4 years ago
|
||
Updated•4 years ago
|
Comment 61•4 years ago
|
||
Comment 62•4 years ago
|
||
bugherder |
Comment 63•4 years ago
|
||
bugherder |
Assignee | ||
Comment 64•4 years ago
|
||
If release managers want to relnote it, we (redhat, rust and firefox) wrote:
https://blog.llvm.org/posts/2021-01-05-stack-clash-protection/
Updated•4 years ago
|
Updated•4 years ago
|
Description
•