Elly Fong-Jones | fb88bfb | 2024-06-11 15:46:15

# The Linux Sandbox

The Linux sandbox provides an API for restricting the capabilities of a process.
The overall design philosophy of the sandbox is documented
[elsewhere](../docs/design/sandbox.md); this document explains how it works on
Linux.

## Overall Design

There are several different sandboxing mechanisms available on Linux:

* setuid(2)
* namespaces
* seccomp(2) BPF
* seccomp(2) legacy
* selinux(7)
* apparmor(7)
* landlock(7)

Chromium chooses which mechanisms to use based on which kernel features are
available. We also generally use multiple layers of sandboxing, to achieve both
confinement for the process and reduction of the exposed kernel attack surface.
Of these mechanisms, Chromium uses:

* setuid(2) everywhere
* namespaces where supported (modern Linux kernels)
* seccomp(2) BPF where supported (modern Linux kernels)

And we used to use, but no longer use:

* selinux(7)
* apparmor(7)

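As a rough illustration of choosing layers from the available kernel features, here is a minimal sketch. It is not how Chromium performs detection: the function names `probe_sandbox_features` and `choose_layers` are hypothetical, and the probes only check for files that namespaces(7) and seccomp(2) describe. The seccomp file is documented as present since Linux 4.14, so older kernels that do support seccomp BPF would be reported as lacking it.

```python
import os

# Namespace types the text mentions; each has an entry under /proc/self/ns
# when the running kernel supports it (see namespaces(7)).
NAMESPACE_ENTRIES = {"pid": "/proc/self/ns/pid", "net": "/proc/self/ns/net"}

# Documented in seccomp(2) as present since Linux 4.14. Older kernels may
# support seccomp BPF without exposing this file, so a miss is not conclusive.
SECCOMP_ACTIONS_PATH = "/proc/sys/kernel/seccomp/actions_avail"


def probe_sandbox_features():
    """Return a dict mapping a feature name to whether it appears available."""
    features = {
        f"{name}_namespace": os.path.exists(path)
        for name, path in NAMESPACE_ENTRIES.items()
    }
    features["seccomp_bpf"] = os.path.exists(SECCOMP_ACTIONS_PATH)
    return features


def choose_layers(features):
    """Pick sandbox layers, mirroring the list of mechanisms in the text."""
    layers = ["setuid"]  # the text says setuid(2) is used everywhere
    if features["pid_namespace"] and features["net_namespace"]:
        layers.append("namespaces")
    if features["seccomp_bpf"]:
        layers.append("seccomp-bpf")
    return layers
```

Calling `choose_layers(probe_sandbox_features())` returns the layers this sketch would stack on the current machine.
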
## setuid(2)

The setuid(2) sandbox takes advantage of the fact that privileged processes on
Linux are allowed to create new namespaces (see namespaces(7)): it sandboxes the
renderer by creating empty namespaces for it at launch time. It relies on a
setuid binary, usually installed at `/opt/google/chrome/chrome-sandbox`, which:

* Enters new PID and network namespaces, preventing the sandboxed process from
  directly accessing the network or seeing any other processes.
* chroot()s into a "safe" directory (currently inside the process's own /proc
  directory) by spawning a privileged helper process which shares its fs state
  (using `CLONE_FS`) and having that helper chroot() it, which leaves the
  process in an empty, read-only root directory.
* Marks itself as un-dumpable using `prctl(2)`, which prevents any process
  without `CAP_SYS_PTRACE` from tracing it. In theory this would keep renderers
  from debugging each other, but in practice they are isolated from each other
  by PID namespaces anyway.
* Uses capset(2) to drop all inherited capabilities.
* Drops from root back to the uid, gid, etc. of the user running the browser.

| 55 | but support for them varies between kernel versions, so the strength of the |
| 56 | setuid sandbox is variable, with newer kernels providing better security. |
| 57 | |
| 58 | The setuid sandbox is implemented in [suid/](suid/). |
| 59 | |
| 60 | If you need to disable it, you can use `--disable-setuid-sandbox`. You should |
| 61 | also see |
| 62 | [docs/linux/suid_sandbox_development.md](../../docs/linux/suid_sandbox_development.md) |
| 63 | for advice on developing the setuid sandbox itself. |
| 64 | |
## seccomp(2) BPF

On modern Linux kernels, we use the filter mode of seccomp(2), which allows us
to supply a program (written in a domain-specific language called "BPF", see
bpf(2) and bpfc(1)). The kernel evaluates this program every time the sandboxed
process makes a syscall to decide whether the syscall should be allowed. The
seccomp filters are compiled and applied "early" in syscall handling, so this
both constrains what the process can do and reduces the attack surface of the
kernel.

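To make the filter model concrete, here is a minimal sketch of an allowlist written as classic BPF instructions, together with a tiny user-space interpreter that evaluates it the way the kernel would for one syscall. It is not Chromium's `bpf_dsl` and it never installs a filter. The names `build_allowlist`, `serialize` and `evaluate` are hypothetical, the interpreter handles only the three opcodes the program uses, and the architecture check assumes x86-64. The opcode and return-value constants are taken from `<linux/bpf_common.h>`, `<linux/seccomp.h>` and `<linux/audit.h>`.

```python
import struct

# Classic BPF opcodes, combined from the constants in <linux/bpf_common.h>.
BPF_LD_W_ABS = 0x20  # BPF_LD | BPF_W | BPF_ABS: load a 32-bit word into A
BPF_JEQ_K = 0x15     # BPF_JMP | BPF_JEQ | BPF_K: branch on A == k
BPF_RET_K = 0x06     # BPF_RET | BPF_K: return the constant k

# Filter return values and the x86-64 architecture tag.
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_ALLOW = 0x7FFF0000
AUDIT_ARCH_X86_64 = 0xC000003E

# Byte offsets of the first two fields of struct seccomp_data.
OFFSET_NR = 0
OFFSET_ARCH = 4


def build_allowlist(allowed_nrs):
    """Build (code, jt, jf, k) instructions allowing only the given syscalls."""
    prog = [
        (BPF_LD_W_ABS, 0, 0, OFFSET_ARCH),         # A = seccomp_data.arch
        (BPF_JEQ_K, 1, 0, AUDIT_ARCH_X86_64),      # expected arch: skip the kill
        (BPF_RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),
        (BPF_LD_W_ABS, 0, 0, OFFSET_NR),           # A = seccomp_data.nr
    ]
    for nr in allowed_nrs:
        prog.append((BPF_JEQ_K, 0, 1, nr))         # no match: skip the ALLOW
        prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW))
    prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS))  # default deny
    return prog


def serialize(prog):
    """Pack instructions as struct sock_filter {u16 code; u8 jt; u8 jf; u32 k}."""
    return b"".join(struct.pack("=HBBI", *insn) for insn in prog)


def evaluate(prog, nr, arch=AUDIT_ARCH_X86_64):
    """Run the program against one syscall and return the filter's verdict."""
    data = struct.pack("=iI", nr, arch)  # first 8 bytes of seccomp_data
    acc, pc = 0, 0
    while True:
        code, jt, jf, k = prog[pc]
        pc += 1
        if code == BPF_LD_W_ABS:
            (acc,) = struct.unpack_from("=I", data, k)
        elif code == BPF_JEQ_K:
            pc += jt if acc == k else jf  # offsets are relative to the next insn
        elif code == BPF_RET_K:
            return k
        else:
            raise ValueError(f"unsupported opcode {code:#x}")
```
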
The seccomp sandbox is implemented in [seccomp-bpf/](seccomp-bpf/), and our
tools for working with the BPF DSL are in [bpf_dsl/](bpf_dsl/). The actual
baseline policies we use are in [seccomp-bpf-helpers/](seccomp-bpf-helpers/).

Since the seccomp sandbox has a filter that is applied to all syscalls being
made, to use it you must have an exhaustive list of syscalls that could be made
by the code being sandboxed, including both code you wrote and code you didn't.
Generating that list of syscalls can be difficult, so it is helpful to have
very good test coverage **which runs under the sandbox** to ensure you are
exercising any code paths that could lead to syscalls.

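One way to bootstrap a candidate syscall list is to trace the code outside the sandbox and compare what it called against the policy. The sketch below assumes a log in the line format that strace(1) prints. It is an illustration rather than a Chromium tool, and the helper names `syscalls_in_trace` and `missing_from_policy` are hypothetical. A trace only covers the code paths that actually ran, so it complements rather than replaces tests that run under the sandbox.

```python
import re

# Matches the syscall name at the start of a strace line, optionally preceded
# by a "[pid N]" or bare PID prefix as printed when following child processes.
_SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def syscalls_in_trace(lines):
    """Return the set of syscall names that appear in strace-style lines."""
    names = set()
    for line in lines:
        match = _SYSCALL_RE.match(line)
        if match:  # signal, exit and "resumed" lines do not match
            names.add(match.group(1))
    return names


def missing_from_policy(observed, allowed):
    """Return observed syscalls that the allowlist does not cover, sorted."""
    return sorted(set(observed) - set(allowed))
```
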
## landlock(7)

We currently don't use Landlock, but we'd like to:
[345514921](https://issues.chromium.org/issues/345514921).