Closed Bug 1759196 Opened 3 years ago Closed 3 years ago

The seccomp-bpf compiler is overly strict when filtering 32-bit arguments (breaks the profiler via clock_gettime)

Categories

(Core :: Security: Process Sandboxing, defect, P2)

Unspecified
Linux
defect

Tracking

()

VERIFIED FIXED
101 Branch
Tracking Status
firefox-esr91 --- verified
firefox98 --- unaffected
firefox99 --- wontfix
firefox100 --- verified
firefox101 --- verified

People

(Reporter: tsebrenko, Assigned: jld)

References

(Regression)

Details

(Keywords: regression)

Attachments

(4 files, 1 obsolete file)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0

Steps to reproduce:

I am trying to launch the profiler with a button from the toolbar or from Web developer tools.

Actual results:

Кunning the profiler causes each page to freeze - the loading icon turns when switching tabs, no new tab is loaded. Stopping the profiler does not return FF to work - all tabs are hangs. You need to close the window and manually close the firefox process in the task manager.

Expected results:

On the latest ESR version, the profiler launched from Web developer tools - works as needed - allows you to capture and store a profile.

Thanks for your message. Indeed the new profiler is very different to the old one, especially it captures a lot more data, and therefore we may run into issues in some environments.
To unblock you, please know that you can reenable the old panel in the devtools settings.

Otherwise it would be good to know if you have the same bug on Nighly or Beta versions of Firefox. Also could you please share what computer you're using here.

Thanks for your help!

(In reply to Julien Wajsberg [:julienw] from comment #1)

To unblock you, please know that you can reenable the old panel in the devtools settings.

Unlocking the old profiler did not help - exactly the same symptoms - freezing immediately after pressing the start button.

Otherwise it would be good to know if you have the same bug on Nighly or Beta versions of Firefox.
With FF-100.0a1 (2022-03-11) - absolutely the same problem.

Also could you please share what computer you're using here.
AMD Ryzen 5 2600, 32 Gb, GF1650S with Nvidia driver 510.54 (also tested with 470.103.01).
Gentoo x64, Openbox without compositor and tested with picom-9.1 (glx and xrender), XFCE-4.16 with compositing.
All firefox builds are binary - 91.7.0esr, 98, 100.0a1 (2022-03-11).
Tested with new profile.

Then I would suggest to run mozregression (see https://mozilla.github.io/mozregression/ for more information) to pinpoint the exact change that produced this behavior. This would help us immensely.

Thanks a lot again

OK, I'm using mozregression-gui.
What settings need to be made?
Build type: shippable, opt, pgo, debug, asan or asan-debug?
Which repository?
What date range should I specify?

(In reply to tsebrenko from comment #4)

OK, I'm using mozregression-gui.
What settings need to be made?
Build type: shippable, opt, pgo, debug, asan or asan-debug?
I'd chose "opt" (I believe this is the default?)

Which repository?
leave blank

What date range should I specify?

Instead of "date" choose "release", we know 91 is working and 98 isn't.

You can look at the video you see how this works.

I'm more used to the command line myself to be honest ;-)
The command line would look like this:

  mozregression --good 91 --bad 98

You can install it with pip but you need a recent enough version of python and pip.

Thanks again

Hmm. When I launch a new profiler with default settings from Web developer tools, it freezes starting from version 91.
If I uncheck "enable new performance recorder" in the settings, then the profiler works fine, also in FF-98.

I'm confused now... In comment 2 you mentioned that the old panel wasn't working...

I'm out of options now, hopefully other folks will have ideas about how to debug this further.

My original video with FF-98:
https://vk.cc/cbOpPK

My original video with FF:ESR:
https://vk.cc/cbOq3d

When you reproduce using nightly, can you launch it from a terminal, and see if anything suspicious is written there ?

Last idea is to use rr and share with us the result, but that's more involved...

(In reply to Julien Wajsberg [:julienw] (away -> Mar 21) from comment #9)

When you reproduce using nightly, can you launch it from a terminal, and see if anything suspicious is written there ?

Last idea is to use rr and share with us the result, but that's more involved...

Ok, new video: https://vk.cc/cbOtDW

Attached file gdb stack (deleted) —

I have a similar issue on my Linux machine. Running the profiler in my optimized build deadlocks all content processes. For some reason my debug build works. I tried a while ago to check if it was a regression, and it looks like it isn't. I checked old versions of Firefox that I know I captured profiles from in the past and they also reproduce the bug. Maybe something has changed in an updated system library somehow?

I grabbed a stack using gdb. At my first attempt the stack was showing we were blocked trying to lock a mutex to capture a stack for a marker from the I/O interposer. To check if the problem was specific to the I/O interposer, I tried again with the noiostacks feature and now the stack shows the profiler attempting to lock a mutex to get a stack for a NotifyObserver marker; so the bug is unrelated to the I/O interposer.

Gerald, is the stack helpful enough to guess what the problem is?

Flags: needinfo?(gsquelart)
Attached file gdb showing other threads (deleted) —

Looking at the other threads, I see no ProfilerChild thread, and the "BHMgr Monitor" thread is also blocked waiting for a profiler mutex.

BHR is not enabled on release builds, so after all what I'm seeing might be different from what this bug was originally about.

(In reply to Florian Quèze [:florian] from comment #12)

the "BHMgr Monitor" thread is also blocked waiting for a profiler mutex.

Coincidentally(?) I've seen some other reports involving BHR recently.
Though a likely scenario is that the profiler really deadlocks by itself somehow (still to be determined), and later BHR wants to report that hang but itself gets caught in the same deadlock when it tries to use the profiler's stack sampler!

BHR is not enabled on release builds, so after all what I'm seeing might be different from what this bug was originally about.

It'd be great if we could determine the order in which things go wrong. If my assumption above is correct, BHR may not be the cause (but suffers from the same issue), so you may still have reproduced the original bug.
I should really try to reproduce it myself on Linux... rr would be ideal there, if it can be used.

I'm also seeing profiler hang. This seems to be sandbox-related - security.sandbox.content.level of 0 works, 1 or higher does not (tested on 0/1/2 with mozregression --launch 2022-03-27 --pref security.sandbox.content.level:0 etc.)

Some more debugging results.

The hang seems to happen (in the content process) due to the ProfilerChild thread dying without releasing a mutex that other threads wait for (SamplerThread in SamplerThread::Run, StreamTrans thread(s) when registering themselves). The thread dies at the following point:

[Switching to Thread 0x7fa1c9c35640 (LWP 1421522)]                                                                                    
                                                                                                                                      
Thread 14 "ProfilerChild" hit Catchpoint 1 (call to syscall clock_gettime), 0x00007fff4923a8d2 in clock_gettime ()                    
#0  0x00007fff4923a8d2 in clock_gettime ()                                                                                            
#1  0x00007fa1e9ae3389 in __GI___clock_gettime (clock_id=<optimized out>, tp=<optimized out>) at ../sysdeps/unix/sysv/linux/clock_gettime.c:42
#2  0x00007fa1e085abda in mozilla::profiler::ThreadRegistrationUnlockedConstReaderAndAtomicRW::GetNewCpuTimeInNs() () at /tmp/tmp42l2gm8m/firefox/libxul.so
#3  0x00007fa1e0846c5a in locked_profiler_start(PSAutoLock const&, mozilla::PowerOfTwo<unsigned int>, double, unsigned int, char const**, unsigned int, unsigned long, mozilla::Maybe<double> const&) () at /tmp/tmp42l2gm8m/firefox/libxul.so
#4  0x00007fa1e0849591 in profiler_start(mozilla::PowerOfTwo<unsigned int>, double, unsigned int, char const**, unsigned int, unsigned long, mozilla::Maybe<double> const&) () at /tmp/tmp42l2gm8m/firefox/libxul.so
#5  0x00007fa1e0870a24 in mozilla::ProfilerChild::RecvStart(mozilla::ProfilerInitParams const&, std::function<void (bool const&)>&&) () at /tmp/tmp42l2gm8m/firefox/libxul.so
#6  0x00007fa1e31193c3 in mozilla::PProfilerChild::OnMessageReceived(IPC::Message const&) () at /tmp/tmp42l2gm8m/firefox/libxul.so
#7  0x00007fa1e1c2819f in mozilla::ipc::MessageChannel::MessageTask::Run() () at /tmp/tmp42l2gm8m/firefox/libxul.so                                                                                                                                                          
#8  0x00007fa1e1bdf8b0 in nsThread::ProcessNextEvent(bool, bool*) () at /tmp/tmp42l2gm8m/firefox/libxul.so                            
#9  0x00007fa1e1c2b5b5 in mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) () at /tmp/tmp42l2gm8m/firefox/libxul.so                               
#10 0x00007fa1e27b5f2f in MessageLoop::Run() () at /tmp/tmp42l2gm8m/firefox/libxul.so                                                                                                                                                                                        
#11 0x00007fa1e25d3ec1 in nsThread::ThreadFunc(void*) () at /tmp/tmp42l2gm8m/firefox/libxul.so                                                                                                                                                                               
#12 0x00007fa1e99f6233 in _pt_root () at /tmp/tmp42l2gm8m/firefox/libnspr4.so                                                                                                                                                                                                
#13 0x000055fb788ac7be in set_alt_signal_stack_and_start(PthreadCreateParams*) ()                                                                                                                                                                                            
#14 0x00007fa1e9a9927a in start_thread (arg=<optimized out>) at pthread_create.c:442                                                                                                                                                                                         
#15 0x00007fa1e9b1aee0 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100                                                                                                                                                                                          
[Thread 0x7fa1c9c35640 (LWP 1421522) exited]                                                                                                                                                                                                                                 

and strace on the previous attempt said the following for the ProfilerChild thread

getpid()                                = 1420192                                                                                                                                                                                                                            
gettid()                                = 1420224                                                                                                                                                                                                                            
gettid()                                = 1420224                                                                                                                                                                                                                            
mmap(NULL, 823296, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f5d1c637000                                                                                                                                                                                   
mprotect(0x7f5d1c638000, 819200, PROT_READ|PROT_WRITE) = 0                                                                                                                                                                                                                   
rt_sigprocmask(SIG_BLOCK, ~[], [], 8)   = 0                                                                                                                                                                                                                                  
clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, child_tid=0x7f5d1c6ff910, parent_tid=0x7f5d1c6ff910, exit_signal=0, stack=0x7f5d1c637000, stack_size=0xc7ec0, tls=0x7f5d1c6ff640}
, 88) = -1 ENOSYS (Function not implemented)                                                                                                                                                                                                                                 
clone(child_stack=0x7f5d1c6feeb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[1420738], tls=0x7f5d1c6ff640, child_tidptr=0x7f5d1c6ff910) = 1420738                       
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0                                                                                                                                                                                                                                 
gettid()                                = 1420224                                                                                                                                                                                                                            
clock_gettime(0xff52a2fe /* CLOCK_??? */,  <unfinished ...>) = ?                                                                                                                                                                                                             
+++ killed by SIGSYS +++                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     

Whitelisting clock_gettime with mozregression's --pref security.sandbox.content.syscall_whitelist:0,228 also works around the problem, although I don't understand why the existing checks are not enough (looks like this case should be covered).

The sandbox configuration for the syscall clock_gettime is here: https://searchfox.org/mozilla-central/rev/625c3d0c8ae46502aed83f33bd530cb93e926e9f/security/sandbox/linux/SandboxFilter.cpp#741-774

I'm surprised by the last line in your strace:
0xff52a2fe /* CLOCK_??? *

Is it possible that the configuration related to the profiler (at the end of the block) is incorrect in some situations?

Regressed by: 1755055
Has Regression Range: --- → yes

The syscalls were added to the sandbox in bug 1749606.

When investigating this, note that getting the SIGSYS in the debugger is entirely normal, that's the syscall being intercepted and the seccomp-bpf filter being invoked. If you use ./mach --debug, it should set up GDB rules to ignore them. So the trace in comment 15 may be entirely normal (though the killed is suspicious, as you just straced and didn't catch it in the debugger).

One quick thing to check what happens if the rules https://searchfox.org/mozilla-central/rev/625c3d0c8ae46502aed83f33bd530cb93e926e9f/security/sandbox/linux/SandboxFilter.cpp#743 were changed to just Allow(). If everything works then, it narrows down the possibilities.

clock_gettime(0xff52a2fe /* CLOCK_??? */, <unfinished ...>) = ?

0xff52a2fe & 0x7 = 0b110 so that's CPUCLOCK_SCHED | CPUCLOCK_PERTHREAD_MASK which are allowed here: https://searchfox.org/mozilla-central/rev/625c3d0c8ae46502aed83f33bd530cb93e926e9f/security/sandbox/linux/SandboxFilter.cpp#767

Set release status flags based on info from the regressing bug 1755055

The other mystery here is why the VDSO isn't being used. Reporter shared a log where the sandbox was blocking loading attempts of the Linux VDSO, which AFAIK is extremely weird as the VDSO isn't a physical file.

(summarizing the matrix discussion)
vDSO is being used as expected (#0 0x00007fff4923a8d2 in clock_gettime () comes from it), but it cannot directly handle this clock ID, so it does the fallback syscall.

In the syscall the clock_id goes to the rdi register, and it looks like the upper 32 bits of it can be garbage (in another test the register value was 0x1ff3ff3be, which is 0x1 upper half /0xff3ff3be lower half). However, the syscall filter has assumptions about that

    // On 64-bit platforms, the upper 32-bits may be 0 or ~0; but we only allow
    // ~0 if the sign bit of the lower 32-bits is set too:

It looks like disabling the VDSO makes this work, so maybe it's exactly the VDSO fallback that violates the ABI expectation.

So maybe I'm on the wrong track here, but this is the VDSO fallback code:
https://github.com/torvalds/linux/blob/65c61de9d090edb8a3cfb3f45541e268eb2cdb13/arch/x86/include/asm/vdso/gettimeofday.h#L76

This passes RDI unchanged through to the syscall. Looking at all the callers, everything is manipulating the clock as a clockid_t though, so it's not clear to me how there can get junk in there.

With respect to the ABI confusion, something similar happened in bug 1215681 comment #17: Clang/LLVM passed a 32-bit parameter in a register with some garbage in the upper half, and the callee was written in assembly and used the full 64-bit register without extending it. In that case, the optimization responsible for the bits in the upper half was combining two 32-bit loads into a 64-bit load and a shift. In this case, it might be the optimizer widening the 32-bit operations used to construct the thread clock into 64-bit, because the extra bits “shouldn't matter” and there's some other reason to prefer the 64-bit form.

The mysterious thread deaths are from SECCOMP_RET_KILL_THREAD; originally this was named simply SECCOMP_RET_KILL, but later SECCOMP_RET_KILL_PROCESS was added, probably because threads mysteriously exiting isn't the nicest behavior to debug, as seen here. It looks like we could override that behavior with PolicyCompiler::SetPanicFunc and get a more helpful error report instead of a hang. It would even be possible to check if the syscall is clock_gettime and fix up the arguments, but that's kind of a hack.

And, unless I'm missing something, this could affect any syscall; e.g., here's the beginning of glibc's dup (the rest is handling the return value):

(gdb) disas dup
Dump of assembler code for function dup:
   0x00000000000ead40 <+0>:     mov    $0x20,%eax
   0x00000000000ead45 <+5>:     syscall 

Interestingly, musl libc sign-extends the argument:

(gdb) disas dup
Dump of assembler code for function dup:
   0x00000000000707b0 <+0>:     sub    $0x8,%rsp
   0x00000000000707b4 <+4>:     movslq %edi,%rdi
   0x00000000000707b7 <+7>:     mov    $0x20,%eax
   0x00000000000707bc <+12>:    syscall 

It looks like glibc's syscall stubs are written in (macro-generated) assembly, while musl uses C code with macros that cast the arguments to long before using inline asm.

In any case, the kernel appears to ignore the extra bits; I tested with a small program calling dup.

So, if we remove the upper-half checks from the seccomp-bpf program… I guess the failure mode there is if there's a syscall where the kernel actually uses the full register, but we test its value as a 32-bit type by accident (or because the documentation isn't accurate), then we'd allow inputs that could be unsafe… but I can't think of any way to avoid that without causing bugs like this one.

Component: Gecko Profiler → Security: Process Sandboxing
Flags: needinfo?(gsquelart)
OS: Unspecified → Linux

I'm going to mark this S2 because it observably makes the profiler unusable for the larger part of the Linux userbase (it affects Ubuntu 20.04).

Severity: -- → S2
Priority: -- → P2

I ran mozregression on the VM and found this pushlog:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=809b33fc0250a84ed7ad8a44067351d083ac1f57&tochange=e8346137ae4ebf136fb69dd07a5d34e1d35d00c7

Bug 1687631 is in there and I believe it's related.

Obviously there probably is a change in the VM too, maybe a linux kernel change. My machine (working) runs 5.15.15, the VM (broken) runs 5.13. I have a 5.13 on this machine too so I can try booting it.

Regressed by: 1687631

(In reply to Julien Wajsberg [:julienw] from comment #28)

Bug 1687631 is in there and I believe it's related.

It's related, as it's where we enabled the cpu measurement feature by default, and clock_gettime is how we get the cpu use data on Linux.

(In reply to Julien Wajsberg [:julienw] from comment #28)

Obviously there probably is a change in the VM too, maybe a linux kernel change. My machine (working) runs 5.15.15, the VM (broken) runs 5.13. I have a 5.13 on this machine too so I can try booting it.

I tried 5.13.14 on my (Debian stable) machine, and the profiler works fine with it. Not sure what to conclude from that. It may not be exactly similar, or there may be another difference in the libc .

(In reply to Florian Quèze [:florian] from comment #29)

(In reply to Julien Wajsberg [:julienw] from comment #28)

Bug 1687631 is in there and I believe it's related.

It's related, as it's where we enabled the cpu measurement feature by default, and clock_gettime is how we get the cpu use data on Linux.

Yeah, that's what I meant :) thanks for expliciting it.

A few quick comments:

  • I think Chromium is wrong about the syscall ABI, and has been for the decade or so since seccomp-bpf was created.
  • However, AMD64 is such that the bad cases are unlikely to happen: 32-bit operations zero-extend their outputs, and they're either shorter (if only low registers are used) or the same size as 64-bit. So compilers will tend not to create values with surprising upper halves in the first place, and if a contaminated value is ever spilled or moved between registers, it will be zero-extended.
  • But it's really up to the whims of the compiler. Here's an interesting example with Clang, where details of the function call ABI aren't entirely optimized away after inlining (and I think this is what's happening in the clock case here; interestingly, GCC doesn't have that problem). Also, while vDSO on Ubuntu 20.04 passes rdi through unchanged, on Debian unstable there are register moves that normalize the upper half; that's compiled from C code (and I think the same C code?), but probably with different compilers / flags.

It's a hack, but it's possible to write something like this:

int clean_clock_gettime(clockid_t clockid, struct timespec *tp) {
   asm("mov %1, %0" : "=r"(clockid) : "r"(clockid));
   return clock_gettime(clockid, tp);
}

Or, as a preload library:

#include <time.h>
#include <dlfcn.h>

int clock_gettime(clockid_t clockid, struct timespec *tp)
{
        static const auto real_clock_gettime =
            reinterpret_cast<decltype(clock_gettime)*>(dlsym(RTLD_NEXT, "clock_gettime"));
        asm("mov %1, %0" : "=r"(clockid) : "r"(clockid));
        return real_clock_gettime(clockid, tp);
}

// c++ -fPIC -O2 -Wall -shared -o clock_cleaner.so clock_cleaner.cc -ldl

But I think the correct fix is to make the sandbox ignore the top halves of 32-bit arguments.

Hey Jed,

what do you think is the right path forward? How do we fix the sandbox? Do we need to fix it upstream before fixing this in our copy?

Flags: needinfo?(jld)
Status: UNCONFIRMED → NEW
Ever confirmed: true
Version: Firefox 98 → unspecified
Summary: Profiler in FF-98 cause hang → Profiler causes hang due to vdso and sandbox bug in the clock_gettime syscall implementation

This should block the removal of the old performance panel, because it blocks some users from using the performance panel in DevTools

Blocks: 1668219

[In reply to Julian Descottes [:jdescottes] from comment #33)

This should block the removal of the old performance panel, because it blocks some users from using the performance panel in DevTools

Mmm I'm not so sure, as in comment 2, the reporter said this is happening with the old panel as well. It would be good to double check though.

Oh thanks for the heads up, I had not seen that. We just got another (potential duplicate) at Bug 1764551, so I'm asking there if it also impacts the old panel as well. Let's remove the dependency for now.

No longer blocks: 1668219

Idea: allow clock_gettime and friends unconditionally (without checking arguments) until the underlying sandbox bug is fixed. Could that lead to a big security problem in practice?

(In reply to Julien Wajsberg [:julienw] from comment #32)

what do you think is the right path forward? How do we fix the sandbox? Do we need to fix it upstream before fixing this in our copy?

We don't need to fix it upstream first; we try to minimize patches here but we've carried nontrivial changes before (e.g., Windows handle brokering).

There are two possible fixes, and I have working implementations of both:

  1. Ignore the upper halves of the arguments. This is fast and simple, but the failure mode is that if we're wrong about the argument size (a mistake on our side, an error in documentation, or maybe cases where the syscall interface isn't the same as the C function and we don't notice), we'll allow unwanted inputs that could be dangerous.
  2. Use SECCOMP_RET_TRAP to go back to userspace to fix the argument and re-issue the syscall. This is safe, but it will impose some overhead on the syscalls that need to be fixed. (I haven't measured it yet. If it's a problem in practice then something like the clean_clock_gettime example from comment #31 could be used.)

I'm in favor of option #2; I've written some test cases to verify that arguments are being passed correctly and it all seems to work (as does the profiler, tested in an Ubuntu 20.04 VM which is affected otherwise).

(In reply to Julien Wajsberg [:julienw] from comment #36)

Idea: allow clock_gettime and friends unconditionally (without checking arguments) until the underlying sandbox bug is fixed. Could that lead to a big security problem in practice?

It means the process can monitor the CPU usage of other processes. That's not great, because it shouldn't be able to interact with other processes at all (and especially for GMP we try to avoid obvious privacy/fingerprinting holes where possible), but it's not the end of the world.

But, in terms of minimal workarounds, the inline asm trick from comment #31 should also work (with enough ifdefs).

Assignee: nobody → jld
Flags: needinfo?(jld)
Summary: Profiler causes hang due to vdso and sandbox bug in the clock_gettime syscall implementation → The seccomp-bpf compiler is overly strict when filtering 32-bit arguments (breaks the profiler via clock_gettime)

Background: When 32-bit types are passed in registers on x86-64 (and
probably other platforms?), the function call ABI does not specify the
contents of the upper half, and the Linux kernel syscall ABI appears to
have the same behavior.

In practice, the upper half is usually zero (or maybe sign-extended from
the lower half), because 64-bit operations aren't cheaper than 32-bit,
and 32-bit operations zero-extend their outputs; therefore, this case
usually doesn't happen in the first place, and any kind of spill or
register move will zero the upper half. However, arbitrary values are
possible, and a case like this has occurred with the Firefox profiler
using clock_gettime. (This paragraph is applicable to x86-64 and
ARM64; other 64-bit architecutures may behave differently.)

But the Chromium seccomp-bpf compiler, when testing the value of a 32-bit
argument on a 64-bit platform, requires that the value be zero-extended
or sign-extended, and (incorrectly, as far as I can tell) considers
anything else an ABI violation.

With this patch, when that case is detected, we use the SIGSYS handler
to zero-extend the problematic argument and re-issue the syscall.

(It would also be possible to just ignore the upper half, and that would
be faster, but that could lead to subtle security holes if the type
used in bpf_dsl is incorrect and the kernel really does treat it as
64-bit.)

We now have a crash signature, via bug 1762540.

Crash Signature: [@ mozilla::SetCurrentProcessSandbox::$::operator() const::{lambda#1}::__invoke ]

Thanks for this amazing work Jed!

I was wondering if you also had some idea to uplift a fix to beta and even to ESR? I assume that it's not necessary to uplift to release.

Pushed by jedavis@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/f177a4875b01 Fix the Linux sandbox's handling of 32-bit arguments on 64-bit platforms. r=gcp,bobowen

(In reply to Julien Wajsberg [:julienw] from comment #41)

I was wondering if you also had some idea to uplift a fix to beta and even to ESR? I assume that it's not necessary to uplift to release.

Beta: Yes; I should be able to argue that this is low risk given that it affects only the case that was already broken.

ESR: Shouldn't be needed; the reporter mentioned that ESR91 was unaffected by the profiler issue. In theory anything could be affected, but there are no reports of it. (Bug 1762540 might be worth uplifting to ESR so that we get crash reports just in case….)

Backed out for causing build bustages on Unified_cpp_sandbox_common0.o

Backout link

Push with failures

Failure log // Failure log 2

Flags: needinfo?(jld)

A slight mistake in the tests; I have a fix for that and I'm validating it on Try.

Pushed by jedavis@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/d6c728f02fb4 Fix the Linux sandbox's handling of 32-bit arguments on 64-bit platforms. r=gcp,bobowen

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #43)

(In reply to Julien Wajsberg [:julienw] from comment #41)

I was wondering if you also had some idea to uplift a fix to beta and even to ESR? I assume that it's not necessary to uplift to release.

Beta: Yes; I should be able to argue that this is low risk given that it affects only the case that was already broken.

Thanks!

ESR: Shouldn't be needed; the reporter mentioned that ESR91 was unaffected by the profiler issue. In theory anything could be affected, but there are no reports of it. (Bug 1762540 might be worth uplifting to ESR so that we get crash reports just in case….)

ESR91 is unaffected for the reporter in the sense that the old devtools panel is unaffected in this version (note: it is affected since bug 1755055 which is after v91).

However the profiler itself is affected since bug 1687631 (I tested it myself). There's a workaround for that issue though, that is unchecking the feature "CPU utilization" in about:profiling (it is enabled by default since that bug). So maybe with this workaround we can argue that an uplift to ESR isn't needed.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 101 Branch

[Tracking Requested - why for this release]: It would be good to fix this for the next release, as we get several reports from ubuntu users that trigger this error.

https://github.com/firefox-devtools/profiler/issues/3989
https://github.com/firefox-devtools/profiler/issues/4001
https://bugzilla.mozilla.org/show_bug.cgi?id=1764551

Crash Signature: [@ mozilla::SetCurrentProcessSandbox::$::operator() const::{lambda#1}::__invoke ]

[Beta uplift version; previously r=gcp,bobowen]

Background: When 32-bit types are passed in registers on x86-64 (and
probably other platforms?), the function call ABI does not specify the
contents of the upper half, and the Linux kernel syscall ABI appears to
have the same behavior.

In practice, the upper half is usually zero (or maybe sign-extended from
the lower half), because 64-bit operations aren't cheaper than 32-bit,
and 32-bit operations zero-extend their outputs; therefore, this case
usually doesn't happen in the first place, and any kind of spill or
register move will zero the upper half. However, arbitrary values are
possible, and a case like this has occurred with the Firefox profiler
using clock_gettime. (This paragraph is applicable to x86-64 and
ARM64; other 64-bit architecutures may behave differently.)

But the Chromium seccomp-bpf compiler, when testing the value of a 32-bit
argument on a 64-bit platform, requires that the value be zero-extended
or sign-extended, and (incorrectly, as far as I can tell) considers
anything else an ABI violation.

With this patch, when that case is detected, we use the SIGSYS handler
to zero-extend the problematic argument and re-issue the syscall.

(It would also be possible to just ignore the upper half, and that would
be faster, but that could lead to subtle security holes if the type
used in bpf_dsl is incorrect and the kernel really does treat it as
64-bit.)

[ESR91 uplift version; previously r=gcp,bobowen]

Background: When 32-bit types are passed in registers on x86-64 (and
probably other platforms?), the function call ABI does not specify the
contents of the upper half, and the Linux kernel syscall ABI appears to
have the same behavior.

In practice, the upper half is usually zero (or maybe sign-extended from
the lower half), because 64-bit operations aren't cheaper than 32-bit,
and 32-bit operations zero-extend their outputs; therefore, this case
usually doesn't happen in the first place, and any kind of spill or
register move will zero the upper half. However, arbitrary values are
possible, and a case like this has occurred with the Firefox profiler
using clock_gettime. (This paragraph is applicable to x86-64 and
ARM64; other 64-bit architecutures may behave differently.)

But the Chromium seccomp-bpf compiler, when testing the value of a 32-bit
argument on a 64-bit platform, requires that the value be zero-extended
or sign-extended, and (incorrectly, as far as I can tell) considers
anything else an ABI violation.

With this patch, when that case is detected, we use the SIGSYS handler
to zero-extend the problematic argument and re-issue the syscall.

(It would also be possible to just ignore the upper half, and that would
be faster, but that could lead to subtle security holes if the type
used in bpf_dsl is incorrect and the kernel really does treat it as
64-bit.)

Comment on attachment 9273740 [details]
Bug 1759196 - Fix the Linux sandbox's handling of 32-bit arguments on 64-bit platforms.

Beta/Release Uplift Approval Request

  • User impact if declined: Trying to use the profiler will lock up the browser on common Linux distributions (Ubuntu 20.04 LTS, probably other recent versions).
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce: There's an automated test (added by this patch) to cover the low-level issue, but if high-level verification is desired: enable the profiler (profiler.firefox.com) and start a profile (might need to select Firefox Platform settings?); the browser UI will still work but pages will become nonresponsive if this bug is present.
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Low risk because it affects only the case that was previously broken (where the calling thread was immediately terminated). The patch had to be edited for uplift, but the changes are only to test code and are very minor.
  • String changes made/needed: none
  • Is Android affected?: No
Flags: needinfo?(jld)
Attachment #9273740 - Flags: approval-mozilla-beta?

Comment on attachment 9273741 [details]
Bug 1759196 - Fix the Linux sandbox's handling of 32-bit arguments on 64-bit platforms.

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: The Firefox Profiler is broken for a relatively large fraction of a Tier-1 platform, and the risk is low.
  • User impact if declined: Trying to use the Firefox Profiler (but not the devtools profiler for Web content) will lock up the browser on common Linux distributions (Ubuntu 20.04 LTS, probably other recent versions).
  • Fix Landed on Version: 101
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Low risk because it affects only the case that was previously broken (where the calling thread was immediately terminated). The patch had to be edited for uplift, but the changes are only to test code and are very minor.
Attachment #9273741 - Flags: approval-mozilla-esr91?

:jld the nominated patch needs a reviewer. Also this would need to go into release since we are in RC week and no longer in beta. Do you want to wait until beta next week for 101 or does this have to go into 100 release candidate?

Flags: needinfo?(jld)

(In reply to Dianna Smith [:diannaS] from comment #54)

:jld the nominated patch needs a reviewer.

D144642 and D144643 are copies of D143964 with trivial changes to the part of the patch that adds tests, because the original didn't merge cleanly; the part that actually ships is unchanged. That level of change normally wouldn't need re-review, and according to RyanVM on Matrix #developers the patches wouldn't need re-review for uplift. (I'm still a little confused about how this use case is supposed to work with Phabricator; the old workflow with text attachments and Splinter was more user-friendly in some ways.)

Also this would need to go into release since we are in RC week and no longer in beta. Do you want to wait until beta next week for 101 or does this have to go into 100 release candidate?

I'd prefer to get this fix into the 100 release, if possible.

Flags: needinfo?(jld)

Comment on attachment 9273740 [details]
Bug 1759196 - Fix the Linux sandbox's handling of 32-bit arguments on 64-bit platforms.

Approved for 100.0rc2

Attachment #9273740 - Flags: approval-mozilla-beta? → approval-mozilla-release+
Flags: qe-verify+
QA Whiteboard: [qa-triaged]

I have managed to reproduce this issue on Nightly v101.0a1 (2022-04-20) with the steps in comment 48 and verified the fix in Nightly v101.0a1 (2022-05-01), Beta (RC2) v100.0. After this fix, the browser is still usable after starting a Profiler Recording.

Status: RESOLVED → VERIFIED
QA Whiteboard: [qa-triaged]
Flags: qe-verify+
Attachment #9273740 - Attachment is obsolete: true

Comment on attachment 9273741 [details]
Bug 1759196 - Fix the Linux sandbox's handling of 32-bit arguments on 64-bit platforms.

Approved for 91.10esr.

Attachment #9273741 - Flags: approval-mozilla-esr91? → approval-mozilla-esr91+

I have reproduced this issue in Ubuntu 20.04.4 LTS on Firefox ESR v91.9.esr1 and I have verified the fix in ESR v91.10esr.
The pikabu.ru website is properly working after starting the firefox profiler.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: