# Containing a Real Vulnerability

In the previous two posts we talked about gVisor's
[security design principles](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/)
as well as how those are applied in the
[context of networking](https://gvisor.dev/blog/2020/04/02/gvisor-networking-security/).
Recently, a new container escape vulnerability
([CVE-2020-14386](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-14386))
was announced that ties these topics together well. gVisor is
[not vulnerable](https://seclists.org/oss-sec/2020/q3/168) to this specific
issue, but it provides an interesting case study to continue our exploration of
gVisor's security. While gVisor is not immune to vulnerabilities,
[we take several steps](https://gvisor.dev/security/) to minimize the impact of
a vulnerability and to remediate it if one is found.

## Escaping the Container

First, let’s describe how the discovered vulnerability works. There are numerous
ways one can send and receive bytes over the network with Linux. One of the most
performant ways is to use a ring buffer, which is a memory region shared by the
application and the kernel. These rings are created by calling
[setsockopt(2)](https://man7.org/linux/man-pages/man2/setsockopt.2.html) with
[`PACKET_RX_RING`](https://man7.org/linux/man-pages/man7/packet.7.html) for
receiving and
[`PACKET_TX_RING`](https://man7.org/linux/man-pages/man7/packet.7.html) for
sending packets.

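As a rough illustration of the setup described above, the following C sketch
fills in a `struct tpacket_req` and hands it to `setsockopt(2)` with
`PACKET_RX_RING`. The helper names `make_ring_request` and `enable_rx_ring` are
hypothetical and chosen only for this example. Creating the `AF_PACKET` socket
itself requires privileges, so the sketch assumes the caller already holds an
open descriptor; on success a real program would typically go on to `mmap(2)`
the socket to reach the shared region.

```c
#include <assert.h>
#include <linux/if_packet.h>  /* struct tpacket_req, PACKET_RX_RING */
#include <string.h>
#include <sys/socket.h>       /* setsockopt, SOL_PACKET */

/* Hypothetical helper: describe a ring made of `blocks` blocks, each holding
 * `frames_per_block` frames of `frame_size` bytes. The kernel also expects
 * the block size to be a multiple of the page size; picking suitable values
 * is left to the caller in this sketch. */
struct tpacket_req make_ring_request(unsigned int frame_size,
                                     unsigned int frames_per_block,
                                     unsigned int blocks)
{
    struct tpacket_req req;

    assert(frame_size > 0 && frames_per_block > 0 && blocks > 0);
    memset(&req, 0, sizeof(req));
    req.tp_frame_size = frame_size;
    req.tp_block_size = frame_size * frames_per_block;
    req.tp_block_nr = blocks;
    req.tp_frame_nr = frames_per_block * blocks;
    return req;
}

/* Ask the kernel to create a receive ring on the packet socket `fd`.
 * Returns 0 on success and -1 (with errno set) on failure. */
int enable_rx_ring(int fd, const struct tpacket_req *req)
{
    return setsockopt(fd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
}
```
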
The vulnerability is in the code that reads packets when `PACKET_RX_RING` is
enabled. There is another option
([`PACKET_RESERVE`](https://man7.org/linux/man-pages/man7/packet.7.html)) that
asks the kernel to leave some space in the ring buffer before each packet for
anything the application needs, e.g. control structures. When a packet is
received, the kernel calculates where to copy the packet to, taking the amount
reserved before each packet into consideration. If the amount reserved is large,
the calculation overflows a 16-bit offset, which can lead to an out-of-bounds
write of up to 10 bytes whose contents are controlled by the attacker. The
attacker can easily control the data by sending a crafted packet over the
loopback interface and receiving it through a `PACKET_RX_RING` with a carefully
selected `PACKET_RESERVE` size.

```c
static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
               struct packet_type *pt, struct net_device *orig_dev)
{
// ...
    if (sk->sk_type == SOCK_DGRAM) {
        macoff = netoff = TPACKET_ALIGN(po->tp_hdrlen) + 16 +
                  po->tp_reserve;
    } else {
        unsigned int maclen = skb_network_offset(skb);
        // tp_reserve is unsigned int, netoff is unsigned short.
        // Addition can overflow netoff
        netoff = TPACKET_ALIGN(po->tp_hdrlen +
                       (maclen < 16 ? 16 : maclen)) +
                       po->tp_reserve;
        if (po->has_vnet_hdr) {
            netoff += sizeof(struct virtio_net_hdr);
            do_vnet = true;
        }
        // Attacker controls netoff and can make macoff be smaller
        // than sizeof(struct virtio_net_hdr)
        macoff = netoff - maclen;
    }
// ...
    // "macoff - sizeof(struct virtio_net_hdr)" can be negative,
    // resulting in a pointer before h.raw
    if (do_vnet &&
        virtio_net_hdr_from_skb(skb, h.raw + macoff -
                    sizeof(struct virtio_net_hdr),
                    vio_le(), true, 0)) {
// ...
```

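To make the arithmetic concrete, the following user-space sketch replays the
offset calculation from the excerpt with 16-bit variables. It is not kernel
code: `compute_offsets` and `struct ring_offsets` are hypothetical names, the
10-byte constant stands in for `sizeof(struct virtio_net_hdr)`, and the inputs
used below (a 52-byte `tp_hdrlen` and a 14-byte Ethernet header) are assumed
values picked for illustration. With a reserve of 0 the header lands 66 bytes
into the frame; with a reserve of 65460 the 16-bit `netoff` wraps around,
`macoff` becomes 0, and the header would be written 10 bytes before the start
of the frame.

```c
#include <assert.h>
#include <stdint.h>

#define VNET_HDR_LEN 10u                 /* sizeof(struct virtio_net_hdr) */
#define ALIGN16(x) (((x) + 15u) & ~15u)  /* stand-in for TPACKET_ALIGN */

struct ring_offsets {
    uint16_t netoff;   /* 16 bits wide, like the kernel's unsigned short */
    uint16_t macoff;
    int vnet_hdr_off;  /* macoff - VNET_HDR_LEN, relative to h.raw */
};

/* Replays the non-SOCK_DGRAM branch of the excerpt with has_vnet_hdr set. */
struct ring_offsets compute_offsets(unsigned int hdrlen, unsigned int maclen,
                                    unsigned int reserve)
{
    struct ring_offsets o;
    unsigned int l2 = maclen < 16 ? 16 : maclen;

    assert(maclen <= UINT16_MAX);  /* link-layer headers are short */

    /* The sum is computed in 32 bits, then truncated to 16 bits on store. */
    o.netoff = (uint16_t)(ALIGN16(hdrlen + l2) + reserve);
    o.netoff = (uint16_t)(o.netoff + VNET_HDR_LEN);
    o.macoff = (uint16_t)(o.netoff - maclen);

    /* A negative value means the write starts before the frame. */
    o.vnet_hdr_off = (int)o.macoff - (int)VNET_HDR_LEN;
    return o;
}
```
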
The [`CAP_NET_RAW`](https://man7.org/linux/man-pages/man7/capabilities.7.html)
capability is required to create the socket above. However, in order to support
common debugging tools like `ping` and `tcpdump`, Docker containers, including
those created for Kubernetes, are given `CAP_NET_RAW` by default and thus may be
able to trigger this vulnerability to elevate privileges and escape the
container.

Next, we are going to explore why this vulnerability doesn’t work in gVisor, and
how gVisor could prevent the escape even if a similar vulnerability existed
inside gVisor’s kernel.

## Default Protections

gVisor does not implement `PACKET_RX_RING`, but **does** support raw sockets,
which are required for `PACKET_RX_RING`. Raw sockets are a controversial feature
to support in a sandbox environment. While they allow great customization for
essential tools like `ping`, they may also allow packets to be written to the
network without any validation. In general, allowing an untrusted application to
write crafted packets to the network is a questionable idea and a historical
source of vulnerabilities. With that in mind, if `CAP_NET_RAW` were enabled by
default, running untrusted applications would not be _secure by default_.

After multiple discussions when raw sockets were first implemented, we decided
to disable raw sockets by default, **even if `CAP_NET_RAW` is given to the
application**. Instead, enabling raw sockets in gVisor requires the admin to
pass the `--net-raw` flag to `runsc` when configuring the runtime, in addition
to granting the `CAP_NET_RAW` capability to the application. This comes at the
cost that some tools may not work out of the box, but as part of our
[secure-by-default](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#secure-by-default)
principle, we felt that it was important for the “less secure” configuration to
be explicit.

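The resulting rule can be summarized as a logical AND of two independent
switches. The sketch below is only a model of that policy, not gVisor's
implementation; the struct and function names are hypothetical. It shows that
granting the capability alone, or setting the runtime flag alone, is not enough
to get a raw socket.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the policy described above, not gVisor's code. */
struct sandbox_config {
    bool net_raw_flag;  /* the runtime was configured with --net-raw */
};

struct app_creds {
    bool cap_net_raw;   /* the application holds CAP_NET_RAW */
};

/* Raw sockets are available only when BOTH switches are on. */
bool raw_sockets_allowed(const struct sandbox_config *cfg,
                         const struct app_creds *creds)
{
    assert(cfg != NULL && creds != NULL);
    return cfg->net_raw_flag && creds->cap_net_raw;
}
```
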
Since this bug was due to an overflow in the specific Linux implementation of
the packet ring, gVisor's raw socket implementation is not affected. However,
even if there were a similar vulnerability in gVisor, containers would not be
able to exploit it by default.

As an alternative way to implement this same constraint, Kubernetes allows
[admission controllers](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/)
to be configured to customize requests. Cloud providers can use this to
implement more stringent policies. For example, GKE implements an admission
controller for gVisor that
[removes `CAP_NET_RAW` from gVisor pods](https://cloud.google.com/kubernetes-engine/docs/concepts/sandbox-pods#capabilities)
unless it has been explicitly set in the pod spec.

## Isolated Kernel

gVisor has its own application kernel, called the Sentry, that is distinct from
the host kernel. As you would expect from a kernel, the Sentry has a memory
management subsystem, a virtual file system, and a full network stack. The host
network is only used as a transport to carry packets in and out of the
sandbox[^1]. The loopback interface, which is used in the exploit, stays
completely inside the sandbox, never reaching the host.

Therefore, even if the Sentry were vulnerable to the attack, two factors would
prevent a container escape from happening. First, the vulnerability would be
limited to the Sentry, and the attacker would compromise only the application
kernel, which is bound by a restricted set of
[seccomp](https://en.wikipedia.org/wiki/Seccomp) filters, discussed in more
depth below. Second, the Sentry is a distinct implementation of the API, written
in Go, which provides bounds checking that would likely have prevented access
past the bounds of the shared region (e.g. see
[aio](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/syscalls/linux/vfs2/aio.go;l=210;drc=a11061d78a58ed75b10606d1a770b035ed944b66?q=file:aio&ss=gvisor%2Fgvisor)
or
[kcov](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/kernel/kcov.go;l=272?q=file:kcov&ss=gvisor%2Fgvisor),
which have similar shared regions).

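Go performs this kind of check automatically on every slice access and panics
when an index is out of range. For comparison, the C sketch below spells out
the equivalent check by hand; `checked_write` is a hypothetical helper written
for this example and does not come from either kernel. A write that starts 10
bytes before the region, like the one in the exploit, is rejected instead of
corrupting adjacent memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy n bytes from src into region[offset, offset + n), refusing any write
 * that would fall outside the region. Returns 0 on success, -1 otherwise. */
int checked_write(unsigned char *region, size_t region_len, ptrdiff_t offset,
                  const void *src, size_t n)
{
    assert(region != NULL && src != NULL);

    if (offset < 0)
        return -1;  /* would start before the region */
    if ((size_t)offset > region_len || n > region_len - (size_t)offset)
        return -1;  /* would run past the end of the region */

    memcpy(region + offset, src, n);
    return 0;
}
```
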
Here, Kubernetes warrants slightly more explanation. gVisor makes pods the unit
of isolation and a pod can run multiple containers. In other words, each pod is
a gVisor instance, and each container is a set of processes running inside
gVisor, isolated via Sentry-internal namespaces like regular containers inside a
pod. If there were a vulnerability in gVisor, the privilege escalation would
allow a container inside the pod to break out to other **containers inside the
same pod**, but the container still **cannot break out of the pod**.

## Defense in Depth

gVisor follows a
[common security principle used at Google](https://cloud.google.com/security/infrastructure/design/resources/google_infrastructure_whitepaper_fa.pdf)
that the system should have two layers of protection, and those layers should
require different compromises to be broken. We apply this principle by assuming
that the Sentry (first layer of defense)
[will be compromised and should not be trusted](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#defense-in-depth).
In order to protect the host kernel from a compromised Sentry, we wrap the
Sentry in many security and isolation features to ensure that only the minimal
set of functionality from the host kernel is exposed.

![Figure 1](/assets/images/2020-09-18-containing-a-real-vulnerability-figure1.png "Protection layers.")

First, the sandbox runs inside a cgroup that can limit and throttle host
resources being used. Second, the sandbox joins empty namespaces, including user
and mount, to further isolate itself from the host. Next, it changes the process
root to a read-only directory that contains only `/proc` and nothing else. Then,
it executes with the unprivileged user/group
[`nobody`](https://en.wikipedia.org/wiki/Nobody_\(username\)) with all
capabilities stripped. Last and most importantly, a seccomp filter is added to
tightly restrict which parts of the Linux syscall surface gVisor is allowed to
access. The allowed host surface is a far smaller set of syscalls than the
Sentry implements for applications to use. The filter not only restricts which
syscalls can be called, but also checks that the arguments to these syscalls are
within the expected set. Dangerous syscalls like `execve(2)`, `open(2)`, and
`socket(2)` are prohibited, so an attacker isn’t able to execute binaries or
acquire new resources on the host.

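The idea of filtering on both the syscall number and its arguments can be
modeled in a few lines of C. The sketch below is a user-space model of a
default-deny allowlist, not a real seccomp BPF program. The rules in it are
invented for illustration and are not gVisor's actual filter set, and
`struct rule` and `syscall_allowed` are hypothetical names. In this model
`fcntl(2)` is allowed only with the `F_GETFL` command, and anything not listed,
such as `execve(2)` or `socket(2)`, is rejected.

```c
#include <assert.h>
#include <fcntl.h>        /* F_GETFL */
#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>  /* SYS_* numbers */

/* One allowlist entry: a syscall number and, optionally, the only value
 * permitted for one of its arguments (arg_index < 0 means "any"). */
struct rule {
    long nr;
    int arg_index;
    unsigned long arg_value;
};

/* Hypothetical rules for illustration; NOT gVisor's actual filters. */
static const struct rule allowlist[] = {
    { SYS_read,  -1, 0 },
    { SYS_write, -1, 0 },
    { SYS_fcntl,  1, F_GETFL },  /* only the F_GETFL command */
};

/* Default deny: a call passes only if some rule matches both the syscall
 * number and, where the rule asks for it, the argument value. */
bool syscall_allowed(long nr, const unsigned long args[6])
{
    assert(args != NULL);

    for (size_t i = 0; i < sizeof(allowlist) / sizeof(allowlist[0]); i++) {
        const struct rule *r = &allowlist[i];

        if (r->nr != nr)
            continue;
        if (r->arg_index < 0 || args[r->arg_index] == r->arg_value)
            return true;
    }
    return false;
}
```
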
If there were a vulnerability in gVisor that allowed an attacker to execute code
inside the Sentry, the attacker would still have extremely limited privileges on
the host. In fact, a compromised Sentry is much more restricted than a
non-compromised regular container. For CVE-2020-14386 in particular, the attack
would be blocked by more than one security layer: the non-privileged user, the
lack of capabilities, and the seccomp filters.

Although the surface is drastically reduced, there is still a chance that there
is a vulnerability in one of the allowed syscalls. That’s why it’s important to
keep the surface small and carefully consider what syscalls are allowed. You can
find the full set of allowed syscalls
[here](https://cs.opensource.google/gvisor/gvisor/+/master:runsc/boot/filter/).

Another possible attack vector is resources that are present in the Sentry, like
open file descriptors. The Sentry has file descriptors that an attacker could
potentially use, such as log files, platform files (e.g. `/dev/kvm`), an RPC
endpoint that allows external communication with the Sentry, and a Netstack
endpoint that connects the sandbox to the network. The Netstack endpoint in
particular is a concern because it gives direct access to the network. It’s an
`AF_PACKET` socket that allows arbitrary L2 packets to be written to the
network. In the normal case, Netstack assembles the packets that go out to the
network, giving the container control over only the payload. But if the Sentry
is compromised, an attacker can write crafted packets directly to the network.
In many ways this is similar to anyone sending random packets over the internet,
but it is still a place where the exposed host kernel surface is larger than we
would like it to be.

## Conclusion

Security comes with many tradeoffs that are often hard to make, such as the
decision to disable raw sockets by default. However, these tradeoffs have served
us well and have paid off over time. CVE-2020-14386 offers great insight into
how multiple layers of protection can be effective against such an attack.

We cannot guarantee that a container escape will never happen in gVisor, but we
do our best to make it as hard as we possibly can.

If you have not tried gVisor yet, it’s easier than you think. Just follow the
steps [here](https://gvisor.dev/docs/user_guide/install/).
<br>
<br>

--------------------------------------------------------------------------------

[^1]: Those packets are eventually handled by the host, as it needs to route
    them to local containers or send them out the NIC. The packet will be
    handled by many switches, routers, proxies, servers, etc. along the way,
    which may be subject to their own vulnerabilities.