# Containing a Real Vulnerability

In the previous two posts we talked about gVisor's
[security design principles](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/)
as well as how those are applied in the
[context of networking](https://gvisor.dev/blog/2020/04/02/gvisor-networking-security/).
Recently, a new container escape vulnerability
([CVE-2020-14386](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-14386))
was announced that ties these topics together well. gVisor is
[not vulnerable](https://seclists.org/oss-sec/2020/q3/168) to this specific
issue, but it provides an interesting case study to continue our exploration of
gVisor's security. While gVisor is not immune to vulnerabilities,
[we take several steps](https://gvisor.dev/security/) to minimize the impact and
remediate if a vulnerability is found.

## Escaping the Container

First, let’s describe how the discovered vulnerability works. There are numerous
ways one can send and receive bytes over the network with Linux. One of the most
performant ways is to use a ring buffer, which is a memory region shared by the
application and the kernel. These rings are created by calling
[setsockopt(2)](https://man7.org/linux/man-pages/man2/setsockopt.2.html) with
[`PACKET_RX_RING`](https://man7.org/linux/man-pages/man7/packet.7.html) for
receiving and
[`PACKET_TX_RING`](https://man7.org/linux/man-pages/man7/packet.7.html) for
sending packets.

The vulnerability is in the code that reads packets when `PACKET_RX_RING` is
enabled. There is another option
([`PACKET_RESERVE`](https://man7.org/linux/man-pages/man7/packet.7.html)) that
asks the kernel to leave some space in the ring buffer before each packet for
anything the application needs, e.g. control structures.
When a packet is
received, the kernel calculates where to copy the packet to, taking the amount
reserved before each packet into consideration. If the amount reserved is large,
the kernel performs an incorrect calculation which can cause an overflow
leading to an out-of-bounds write of up to 10 bytes, controlled by the attacker.
The data in the write is easily controlled by using the loopback to send a
crafted packet and receiving it using a `PACKET_RX_RING` with a carefully
selected `PACKET_RESERVE` size.

```c
static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
                       struct packet_type *pt, struct net_device *orig_dev)
{
// ...
    if (sk->sk_type == SOCK_DGRAM) {
        macoff = netoff = TPACKET_ALIGN(po->tp_hdrlen) + 16 +
                          po->tp_reserve;
    } else {
        unsigned int maclen = skb_network_offset(skb);
        // tp_reserve is unsigned int, netoff is unsigned short.
        // Addition can overflow netoff
        netoff = TPACKET_ALIGN(po->tp_hdrlen +
                               (maclen < 16 ? 16 : maclen)) +
                               po->tp_reserve;
        if (po->has_vnet_hdr) {
            netoff += sizeof(struct virtio_net_hdr);
            do_vnet = true;
        }
        // Attacker controls netoff and can make macoff be smaller
        // than sizeof(struct virtio_net_hdr)
        macoff = netoff - maclen;
    }
// ...
    // "macoff - sizeof(struct virtio_net_hdr)" can be negative,
    // resulting in a pointer before h.raw
    if (do_vnet &&
        virtio_net_hdr_from_skb(skb, h.raw + macoff -
                                sizeof(struct virtio_net_hdr),
                                vio_le(), true, 0)) {
// ...
```

The [`CAP_NET_RAW`](https://man7.org/linux/man-pages/man7/capabilities.7.html)
capability is required to create the socket above. However, in order to support
common debugging tools like `ping` and `tcpdump`, Docker containers, including
those created for Kubernetes, are given `CAP_NET_RAW` by default and thus may be
able to trigger this vulnerability to elevate privileges and escape the
container.
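
To make the arithmetic in the excerpt above concrete, here is a minimal user-space sketch of the same offset calculation. It is not kernel code: `vnet_hdr_offset`, `ALIGN16` and `VNET_HDR_LEN` are hypothetical names introduced for illustration, `ALIGN16` is an assumed stand-in for `TPACKET_ALIGN`, and the header length passed in is an arbitrary example value rather than the real `tp_hdrlen`.

```c
#include <stdint.h>

/* Assumed stand-in for the kernel's TPACKET_ALIGN: round up to a multiple of 16. */
#define ALIGN16(x) (((x) + 15u) & ~15u)

/* Size in bytes of struct virtio_net_hdr (hypothetical constant name). */
#define VNET_HDR_LEN 10u

/*
 * Models the non-SOCK_DGRAM branch of tpacket_rcv with has_vnet_hdr set.
 * Returns the signed offset, relative to the start of the frame (h.raw),
 * at which the virtio_net_hdr would be written. A negative result means
 * the write starts before the frame.
 */
long vnet_hdr_offset(unsigned int hdrlen, unsigned int maclen,
                     unsigned int reserve)
{
    /* netoff and macoff are 16 bits wide, as in the vulnerable code,
     * so a large reserve value silently wraps around. */
    uint16_t netoff = (uint16_t)(ALIGN16(hdrlen + (maclen < 16 ? 16 : maclen)) +
                                 reserve);
    netoff = (uint16_t)(netoff + VNET_HDR_LEN);

    uint16_t macoff = (uint16_t)(netoff - maclen);

    return (long)macoff - (long)VNET_HDR_LEN;
}
```

With an example header length of 32, a 14-byte Ethernet header and a reserve of 0, the function returns 34, which is safely inside the frame. With a reserve of 65496 the 16-bit `netoff` wraps and the function returns -6, that is, a write starting 6 bytes before the frame.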

Next, we are going to explore why this vulnerability doesn’t work in gVisor, and
how gVisor could prevent the escape even if a similar vulnerability existed
inside gVisor’s kernel.

## Default Protections

gVisor does not implement `PACKET_RX_RING`, but **does** support raw sockets
which are required for `PACKET_RX_RING`. Raw sockets are a controversial feature
to support in a sandbox environment. While they allow great customization for
essential tools like `ping`, they may allow packets to be written to the network
without any validation. In general, allowing an untrusted application to write
crafted packets to the network is a questionable idea and a historical source of
vulnerabilities. With that in mind, if `CAP_NET_RAW` were enabled by default, it
would not be _secure by default_ to run untrusted applications.

After multiple discussions when raw sockets were first implemented, we decided
to disable raw sockets by default, **even if `CAP_NET_RAW` is given to the
application**. Instead, enabling raw sockets in gVisor requires the admin to set
the `--net-raw` flag on runsc when configuring the runtime, in addition to
requiring the `CAP_NET_RAW` capability in the application. It comes at the
expense that some tools may not work out of the box, but as part of our
[secure-by-default](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#secure-by-default)
principle, we felt that it was important for the “less secure” configuration to
be explicit.

Since this bug was due to an overflow in the specific Linux implementation of
the packet ring, gVisor's raw socket implementation is not affected. However, if
there were a vulnerability in gVisor, containers would not be allowed to exploit
it by default.
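
The two-condition rule described above can be modeled in a few lines. This is only an illustrative sketch, not gVisor's actual implementation (the Sentry is written in Go); the struct and function names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical model of the sandbox-wide setting chosen by the admin. */
struct sandbox_config {
    bool net_raw; /* true only when runsc was started with --net-raw */
};

/* Hypothetical model of the capabilities granted to the application. */
struct task_creds {
    bool cap_net_raw; /* true when the container holds CAP_NET_RAW */
};

/*
 * Raw sockets are allowed only when BOTH the runtime flag and the
 * capability are present. Holding CAP_NET_RAW alone is not enough.
 */
bool raw_socket_allowed(const struct sandbox_config *cfg,
                        const struct task_creds *creds)
{
    return cfg->net_raw && creds->cap_net_raw;
}
```

Under this model, a default Docker container (which holds `CAP_NET_RAW`) running in a sandbox started without `--net-raw` is still denied raw sockets.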

As an alternative way to implement this same constraint, Kubernetes allows
[admission controllers](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/)
to be configured to customize requests. Cloud providers can use this to
implement more stringent policies. For example, GKE implements an admission
controller for gVisor that
[removes `CAP_NET_RAW` from gVisor pods](https://cloud.google.com/kubernetes-engine/docs/concepts/sandbox-pods#capabilities)
unless it has been explicitly set in the pod spec.

## Isolated Kernel

gVisor has its own application kernel, called the Sentry, that is distinct from
the host kernel. Just like what you would expect from a kernel, gVisor has a
memory management subsystem, virtual file system, and a full network stack. The
host network is only used as a transport to carry packets in and out of the
sandbox[^1]. The loopback interface which is used in the exploit stays
completely inside the sandbox, never reaching the host.

Therefore, even if the Sentry were vulnerable to the attack, there would be two
factors that would prevent a container escape from happening. First, the
vulnerability would be limited to the Sentry, and the attacker would compromise
only the application kernel, bound by a restricted set of
[seccomp](https://en.wikipedia.org/wiki/Seccomp) filters, discussed more in
depth below. Second, the Sentry is a distinct implementation of the API, written
in Go, which provides bounds checking that would have likely prevented access
past the bounds of the shared region (e.g.
see
[aio](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/syscalls/linux/vfs2/aio.go;l=210;drc=a11061d78a58ed75b10606d1a770b035ed944b66?q=file:aio&ss=gvisor%2Fgvisor)
or
[kcov](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/kernel/kcov.go;l=272?q=file:kcov&ss=gvisor%2Fgvisor),
which have similar shared regions).

Here, Kubernetes warrants slightly more explanation. gVisor makes pods the unit
of isolation and a pod can run multiple containers. In other words, each pod is
a gVisor instance, and each container is a set of processes running inside
gVisor, isolated via Sentry-internal namespaces like regular containers inside a
pod. If there were a vulnerability in gVisor, the privilege escalation would
allow a container inside the pod to break out to other **containers inside the
same pod**, but the container still **cannot break out of the pod**.

## Defense in Depth

gVisor follows a
[common security principle used at Google](https://cloud.google.com/security/infrastructure/design/resources/google_infrastructure_whitepaper_fa.pdf)
that the system should have two layers of protection, and those layers should
require different compromises to be broken. We apply this principle by assuming
that the Sentry (first layer of defense)
[will be compromised and should not be trusted](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#defense-in-depth).
In order to protect the host kernel from a compromised Sentry, we wrap the
Sentry in many security and isolation features to ensure only the minimal set of
functionality from the host kernel is exposed.

![Figure 1](/assets/images/2020-09-18-containing-a-real-vulnerability-figure1.png "Protection layers.")

First, the sandbox runs inside a cgroup that can limit and throttle host
resources being used.
Second, the sandbox joins empty namespaces, including user
and mount, to further isolate from the host. Next, it changes the process root
to a read-only directory that contains only `/proc` and nothing else. Then, it
executes with the unprivileged user/group
[`nobody`](https://en.wikipedia.org/wiki/Nobody_\(username\)) with all
capabilities stripped. Last and most importantly, a seccomp filter is added to
tightly restrict which parts of the Linux syscall surface gVisor is allowed
to access. The allowed host surface is a far smaller set of syscalls than the
Sentry implements for applications to use. The filter not only restricts which
syscalls can be called, but also checks that the arguments to these syscalls are
within the expected set. Dangerous syscalls like <code>execve(2)</code>,
<code>open(2)</code>, and <code>socket(2)</code> are prohibited, thus an
attacker isn’t able to execute binaries or acquire new resources on the host.

If there were a vulnerability in gVisor that allowed an attacker to execute code
inside the Sentry, the attacker would still have extremely limited privileges on
the host. In fact, a compromised Sentry is much more restricted than a
non-compromised regular container. For CVE-2020-14386 in particular, the attack
would be blocked by more than one security layer: non-privileged user, no
capabilities, and seccomp filters.

Although the surface is drastically reduced, there is still a chance that there
is a vulnerability in one of the allowed syscalls. That’s why it’s important to
keep the surface small and carefully consider what syscalls are allowed. You can
find the full set of allowed syscalls
[here](https://cs.opensource.google/gvisor/gvisor/+/master:runsc/boot/filter/).

Another possible attack vector is resources that are present in the Sentry, like
open file descriptors.
The Sentry has file descriptors that an attacker could
potentially use, such as log files, platform files (e.g. `/dev/kvm`), an RPC
endpoint that allows external communication with the Sentry, and a Netstack
endpoint that connects the sandbox to the network. The Netstack endpoint in
particular is a concern because it gives direct access to the network. It’s an
`AF_PACKET` socket that allows arbitrary L2 packets to be written to the
network. In the normal case, Netstack assembles packets that go out to the
network, giving the container control over only the payload. But if the Sentry
is compromised, an attacker can craft arbitrary packets and write them to the
network. In many ways this is similar to anyone sending random packets over the
internet, but still this is a place where the host kernel surface exposed is
larger than we would like it to be.

## Conclusion

Security comes with many tradeoffs that are often hard to make, such as the
decision to disable raw sockets by default. However, these tradeoffs have served
us well, and we've found them to have paid off over time. CVE-2020-14386 offers
great insight into how multiple layers of protection can be effective against
such an attack.

We cannot guarantee that a container escape will never happen in gVisor, but we
do our best to make it as hard as we possibly can.

If you have not tried gVisor yet, it’s easier than you think. Just follow the
steps [here](https://gvisor.dev/docs/user_guide/install/).
<br>
<br>

--------------------------------------------------------------------------------

[^1]: Those packets are eventually handled by the host, as it needs to route
    them to local containers or send them out the NIC. The packet will be
    handled by many switches, routers, proxies, servers, etc. along the way,
    which may be subject to their own vulnerabilities.