github.com/SagerNet/gvisor@v0.0.0-20210707092255-7731c139d75c/website/blog/2020-09-18-containing-a-real-vulnerability.md

# Containing a Real Vulnerability

In the previous two posts we talked about gVisor's
[security design principles](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/)
as well as how those are applied in the
[context of networking](https://gvisor.dev/blog/2020/04/02/gvisor-networking-security/).
Recently, a new container escape vulnerability
([CVE-2020-14386](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-14386))
was announced that ties these topics together well. gVisor is
[not vulnerable](https://seclists.org/oss-sec/2020/q3/168) to this specific
issue, but it provides an interesting case study to continue our exploration of
gVisor's security. While gVisor is not immune to vulnerabilities,
[we take several steps](https://gvisor.dev/security/) to minimize the impact and
remediate if a vulnerability is found.

## Escaping the Container

First, let’s describe how the discovered vulnerability works. There are numerous
ways one can send and receive bytes over the network with Linux. One of the most
performant ways is to use a ring buffer, which is a memory region shared by the
application and the kernel. These rings are created by calling
[setsockopt(2)](https://man7.org/linux/man-pages/man2/setsockopt.2.html) with
[`PACKET_RX_RING`](https://man7.org/linux/man-pages/man7/packet.7.html) for
receiving and
[`PACKET_TX_RING`](https://man7.org/linux/man-pages/man7/packet.7.html) for
sending packets.

The vulnerability is in the code that reads packets when `PACKET_RX_RING` is
enabled. There is another option
([`PACKET_RESERVE`](https://man7.org/linux/man-pages/man7/packet.7.html)) that
asks the kernel to leave some space in the ring buffer before each packet for
anything the application needs, e.g. control structures. When a packet is
received, the kernel calculates where to copy the packet to, taking the amount
reserved before each packet into consideration. If the amount reserved is large,
the kernel performs an incorrect calculation that can overflow, leading to an
out-of-bounds write of up to 10 bytes controlled by the attacker. The data in
the write is easily controlled by sending a crafted packet over the loopback
interface and receiving it using a `PACKET_RX_RING` with a carefully selected
`PACKET_RESERVE` size.
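
As a rough illustration of the setup described above, the sketch below opens a
packet socket, sets `PACKET_RESERVE`, and then requests a receive ring. The
helper names (`make_ring_req`, `open_rx_ring`) and the ring dimensions used with
them are assumptions made for illustration. Creating the socket requires
`CAP_NET_RAW`, and a real consumer would go on to `mmap(2)` the ring, which is
omitted here.

```c
#include <arpa/inet.h>       /* htons */
#include <errno.h>
#include <linux/if_ether.h>  /* ETH_P_ALL */
#include <linux/if_packet.h> /* PACKET_RX_RING, PACKET_RESERVE, struct tpacket_req */
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: describe a ring of block_nr blocks, each split into
 * fixed-size frames. tp_frame_nr must equal frames-per-block times the number
 * of blocks. */
struct tpacket_req make_ring_req(unsigned int block_size,
                                 unsigned int block_nr,
                                 unsigned int frame_size)
{
    struct tpacket_req req;

    req.tp_block_size = block_size;
    req.tp_block_nr = block_nr;
    req.tp_frame_size = frame_size;
    req.tp_frame_nr = frame_size ? (block_size / frame_size) * block_nr : 0;
    return req;
}

/* Hypothetical helper: returns a socket fd with an RX ring configured, or a
 * negative errno value. PACKET_RESERVE is set before the ring is created so
 * that the headroom applies to the ring's frames. */
int open_rx_ring(unsigned int reserve, const struct tpacket_req *req)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

    if (fd < 0)
        return -errno;
    if (setsockopt(fd, SOL_PACKET, PACKET_RESERVE,
                   &reserve, sizeof(reserve)) < 0 ||
        setsockopt(fd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req)) < 0) {
        int err = errno;

        close(fd);
        return -err;
    }
    return fd;
}
```

For example, `make_ring_req(4096, 4, 2048)` describes four 4 KiB blocks holding
two frames each. Without `CAP_NET_RAW`, `open_rx_ring` fails at the `socket(2)`
call and returns the negated error.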

```c
static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
                       struct packet_type *pt, struct net_device *orig_dev)
{
    // ...
    if (sk->sk_type == SOCK_DGRAM) {
        macoff = netoff = TPACKET_ALIGN(po->tp_hdrlen) + 16 +
                          po->tp_reserve;
    } else {
        unsigned int maclen = skb_network_offset(skb);
        // tp_reserve is unsigned int, netoff is unsigned short.
        // Addition can overflow netoff
        netoff = TPACKET_ALIGN(po->tp_hdrlen +
                               (maclen < 16 ? 16 : maclen)) +
                 po->tp_reserve;
        if (po->has_vnet_hdr) {
            netoff += sizeof(struct virtio_net_hdr);
            do_vnet = true;
        }
        // Attacker controls netoff and can make macoff be smaller
        // than sizeof(struct virtio_net_hdr)
        macoff = netoff - maclen;
    }
    // ...
    // "macoff - sizeof(struct virtio_net_hdr)" can be negative,
    // resulting in a pointer before h.raw
    if (do_vnet &&
        virtio_net_hdr_from_skb(skb, h.raw + macoff -
                                sizeof(struct virtio_net_hdr),
                                vio_le(), true, 0)) {
    // ...
```
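
The arithmetic in the excerpt above can be reproduced in user space. The
following is a minimal sketch, not kernel code: `vnet_hdr_offset` is a
hypothetical function that mirrors the non-`SOCK_DGRAM` branch with
`has_vnet_hdr` set, and the inputs used with it below (a header length of 68
and a 14-byte Ethernet header) are assumed values chosen for illustration. The
10-byte constant is the size of `struct virtio_net_hdr`.

```c
#define TPACKET_ALIGNMENT 16u
#define TPACKET_ALIGN(x) \
    (((x) + TPACKET_ALIGNMENT - 1) & ~(TPACKET_ALIGNMENT - 1))
#define VNET_HDR_LEN 10u /* sizeof(struct virtio_net_hdr) */

/* Returns the offset, relative to the start of the frame (h.raw), at which
 * the virtio_net_hdr would be written. A negative result means the write
 * starts before the frame. */
long vnet_hdr_offset(unsigned int tp_hdrlen, unsigned int maclen,
                     unsigned int tp_reserve)
{
    /* 16-bit types, as in the vulnerable code. */
    unsigned short netoff, macoff;

    /* The sum is computed in 32 bits and then truncated to 16 bits. */
    netoff = (unsigned short)(TPACKET_ALIGN(tp_hdrlen +
                                            (maclen < 16 ? 16 : maclen)) +
                              tp_reserve);
    netoff = (unsigned short)(netoff + VNET_HDR_LEN);
    macoff = (unsigned short)(netoff - maclen);

    return (long)macoff - (long)VNET_HDR_LEN;
}
```

With these assumed inputs, `vnet_hdr_offset(68, 14, 0)` yields 82, a location
inside the frame. With `tp_reserve` set to 65448 the 16-bit `netoff` wraps
around, `macoff` becomes 4, and the function yields -6: the header write would
begin six bytes before the frame.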

The [`CAP_NET_RAW`](https://man7.org/linux/man-pages/man7/capabilities.7.html)
capability is required to create the socket above. However, in order to support
common debugging tools like `ping` and `tcpdump`, Docker containers, including
those created for Kubernetes, are given `CAP_NET_RAW` by default and thus may be
able to trigger this vulnerability to elevate privileges and escape the
container.

Next, we are going to explore why this vulnerability doesn’t work in gVisor, and
how gVisor could prevent the escape even if a similar vulnerability existed
inside gVisor’s kernel.

## Default Protections

gVisor does not implement `PACKET_RX_RING`, but **does** support raw sockets
which are required for `PACKET_RX_RING`. Raw sockets are a controversial feature
to support in a sandbox environment. While they allow great customization for
essential tools like `ping`, they may also allow packets to be written to the
network without any validation. In general, allowing an untrusted application to
write crafted packets to the network is a questionable idea and a historical
source of vulnerabilities. With that in mind, if `CAP_NET_RAW` were enabled by
default, running untrusted applications would not be _secure by default_.

After multiple discussions when raw sockets were first implemented, we decided
to disable raw sockets by default, **even if `CAP_NET_RAW` is given to the
application**. Instead, enabling raw sockets in gVisor requires the admin to set
the `--net-raw` flag on runsc when configuring the runtime, in addition to
requiring the `CAP_NET_RAW` capability in the application. This comes at the
expense that some tools may not work out of the box, but as part of our
[secure-by-default](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#secure-by-default)
principle, we felt that it was important for the “less secure” configuration to
be explicit.
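
One way to observe this behavior from inside a container is to try to create a
raw socket and look at the result. The sketch below does only that;
`probe_raw_socket` is a hypothetical name, and the code makes no assumption
about which error gVisor reports when raw sockets are disabled, since that is
not stated here.

```c
#include <errno.h>
#include <netinet/in.h> /* IPPROTO_ICMP */
#include <sys/socket.h>
#include <unistd.h>

/* Returns 0 if a raw ICMP socket could be created, or the errno value
 * reported by socket(2) otherwise. */
int probe_raw_socket(void)
{
    int fd = socket(AF_INET, SOCK_RAW, IPPROTO_ICMP);

    if (fd < 0)
        return errno;
    close(fd);
    return 0;
}
```

A zero result means the environment allows raw sockets for this process. A
non-zero result means either the capability or the runtime configuration (or
both) is withholding them.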

Since this bug was due to an overflow in the specific Linux implementation of
the packet ring, gVisor's raw socket implementation is not affected. However,
even if there were a similar vulnerability in gVisor, containers would not be
able to exploit it by default, because raw sockets are disabled.

As an alternative way to implement this same constraint, Kubernetes allows
[admission controllers](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/)
to be configured to customize requests. Cloud providers can use this to
implement more stringent policies. For example, GKE implements an admission
controller for gVisor that
[removes `CAP_NET_RAW` from gVisor pods](https://cloud.google.com/kubernetes-engine/docs/concepts/sandbox-pods#capabilities)
unless it has been explicitly set in the pod spec.

## Isolated Kernel

gVisor has its own application kernel, called the Sentry, that is distinct from
the host kernel. Just like what you would expect from a kernel, gVisor has a
memory management subsystem, virtual file system, and a full network stack. The
host network is only used as a transport to carry packets in and out of the
sandbox[^1]. The loopback interface which is used in the exploit stays
completely inside the sandbox, never reaching the host.

Therefore, even if the Sentry were vulnerable to the attack, there would be two
factors that would prevent a container escape from happening. First, the
vulnerability would be limited to the Sentry, and the attacker would compromise
only the application kernel, bound by a restricted set of
[seccomp](https://en.wikipedia.org/wiki/Seccomp) filters, discussed in more
depth below. Second, the Sentry is a distinct implementation of the API, written
in Go, which provides bounds checking that would have likely prevented access
past the bounds of the shared region (e.g. see
[aio](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/syscalls/linux/vfs2/aio.go;l=210;drc=a11061d78a58ed75b10606d1a770b035ed944b66?q=file:aio&ss=gvisor%2Fgvisor)
or
[kcov](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/kernel/kcov.go;l=272?q=file:kcov&ss=gvisor%2Fgvisor),
which have similar shared regions).
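
Go inserts this kind of check automatically on every slice access and panics
when it fails. To make the idea concrete in the same language as the kernel
excerpt above, the sketch below spells the check out by hand in C.
`checked_write` is a hypothetical helper for illustration, not code from Linux
or gVisor.

```c
#include <stddef.h>
#include <string.h>

/* Copies n bytes from src into region at offset off, but only if the whole
 * destination range [off, off + n) lies inside the region. Returns 0 on
 * success and -1 if the write would fall outside the region. */
int checked_write(unsigned char *region, size_t region_len, long off,
                  const void *src, size_t n)
{
    if (off < 0)
        return -1; /* would start before the region */
    if ((size_t)off > region_len || n > region_len - (size_t)off)
        return -1; /* would run past the end of the region */
    memcpy(region + off, src, n);
    return 0;
}
```

With a check like this in place, a negative offset is rejected instead of
turning into a pointer before the start of the shared region.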

Here, Kubernetes warrants slightly more explanation. gVisor makes pods the unit
of isolation and a pod can run multiple containers. In other words, each pod is
a gVisor instance, and each container is a set of processes running inside
gVisor, isolated via Sentry-internal namespaces like regular containers inside a
pod. If there were a vulnerability in gVisor, the privilege escalation would
allow a container inside the pod to break out to other **containers inside the
same pod**, but the container still **cannot break out of the pod**.

## Defense in Depth

gVisor follows a
[common security principle used at Google](https://cloud.google.com/security/infrastructure/design/resources/google_infrastructure_whitepaper_fa.pdf)
that the system should have two layers of protection, and those layers should
require different compromises to be broken. We apply this principle by assuming
that the Sentry (first layer of defense)
[will be compromised and should not be trusted](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#defense-in-depth).
In order to protect the host kernel from a compromised Sentry, we wrap it in
many security and isolation features to ensure only the minimal set of
functionality from the host kernel is exposed.

![Figure 1](/assets/images/2020-09-18-containing-a-real-vulnerability-figure1.png "Protection layers.")

First, the sandbox runs inside a cgroup that can limit and throttle host
resources being used. Second, the sandbox joins empty namespaces, including user
and mount, to further isolate from the host. Next, it changes the process root
to a read-only directory that contains only `/proc` and nothing else. Then, it
executes with the unprivileged user/group
[`nobody`](https://en.wikipedia.org/wiki/Nobody_\(username\)) with all
capabilities stripped. Last and most importantly, a seccomp filter is added to
tightly restrict which parts of the Linux syscall surface gVisor is allowed to
access. The allowed host surface is a far smaller set of syscalls than the
Sentry implements for applications to use. The filter not only restricts which
syscalls can be called, but also checks that arguments to these syscalls are
within the expected set. Dangerous syscalls like `execve(2)`, `open(2)`, and
`socket(2)` are prohibited, thus an attacker isn’t able to execute binaries or
acquire new resources on the host.
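
To give a feel for syscall and argument filtering, here is a minimal seccomp-BPF
sketch. It is not gVisor's filter: gVisor uses an allowlist, whereas this
example, to stay short, allows everything except `socket(2)` calls whose first
argument is `AF_PACKET`, which fail with `EPERM`. The function name
`install_packet_socket_filter`, the choice of `EPERM`, and the two supported
architectures are assumptions made for illustration.

```c
#include <errno.h>
#include <stddef.h>        /* offsetof */
#include <linux/audit.h>   /* AUDIT_ARCH_* */
#include <linux/filter.h>  /* struct sock_filter, BPF_STMT, BPF_JUMP */
#include <linux/seccomp.h> /* struct seccomp_data, SECCOMP_RET_* */
#include <sys/prctl.h>
#include <sys/socket.h>    /* AF_PACKET */
#include <sys/syscall.h>   /* __NR_socket */

#if defined(__x86_64__)
#define SKETCH_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define SKETCH_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "this sketch only covers x86_64 and aarch64"
#endif

/* Installs the filter on the calling thread. Returns 0 on success or a
 * negative errno value. The filter cannot be removed once installed. */
int install_packet_socket_filter(void)
{
    struct sock_filter insns[] = {
        /* [0] Load the architecture and [1] skip the kill if it matches. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SKETCH_AUDIT_ARCH, 1, 0),
        /* [2] Unexpected architecture: kill. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        /* [3] Load the syscall number; [4] anything but socket jumps to [8]. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_socket, 0, 3),
        /* [5] Load the low 32 bits of the first argument (little-endian). */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, args[0])),
        /* [6] AF_PACKET falls through to [7]; other domains jump to [8]. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AF_PACKET, 0, 1),
        /* [7] Deny with EPERM. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        /* [8] Allow. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(insns) / sizeof(insns[0])),
        .filter = insns,
    };

    /* Required so that an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -errno;
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
        return -errno;
    return 0;
}
```

After a successful call, `socket(AF_PACKET, ...)` fails with `EPERM` regardless
of capabilities, while sockets in other domains are unaffected. A production
filter would instead enumerate the permitted syscalls and reject everything
else.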

If there were a vulnerability in gVisor that allowed an attacker to execute code
inside the Sentry, the attacker would still have extremely limited privileges on
the host. In fact, a compromised Sentry is much more restricted than a
non-compromised regular container. For CVE-2020-14386 in particular, the attack
would be blocked by more than one security layer: non-privileged user, no
capabilities, and seccomp filters.

Although the surface is drastically reduced, there is still a chance that there
is a vulnerability in one of the allowed syscalls. That’s why it’s important to
keep the surface small and carefully consider what syscalls are allowed. You can
find the full set of allowed syscalls
[here](https://cs.opensource.google/gvisor/gvisor/+/master:runsc/boot/filter/).

Another possible attack vector is resources that are present in the Sentry, like
open file descriptors. The Sentry has file descriptors that an attacker could
potentially use, such as log files, platform files (e.g. `/dev/kvm`), an RPC
endpoint that allows external communication with the Sentry, and a Netstack
endpoint that connects the sandbox to the network. The Netstack endpoint in
particular is a concern because it gives direct access to the network. It’s an
`AF_PACKET` socket that allows arbitrary L2 packets to be written to the
network. In the normal case, Netstack assembles the packets that go out to the
network, giving the container control over only the payload. But if the Sentry
is compromised, an attacker can craft arbitrary packets and send them to the
network. In many ways this is similar to anyone sending random packets over the
internet, but this is still a place where the exposed host kernel surface is
larger than we would like it to be.

## Conclusion

Security comes with many tradeoffs that are often hard to make, such as the
decision to disable raw sockets by default. However, these tradeoffs have served
us well, and we've found them to have paid off over time. CVE-2020-14386 offers
great insight into how multiple layers of protection can be effective against
such an attack.

We cannot guarantee that a container escape will never happen in gVisor, but we
do our best to make it as hard as we possibly can.

If you have not tried gVisor yet, it’s easier than you think. Just follow the
steps [here](https://gvisor.dev/docs/user_guide/install/).
<br>
<br>

--------------------------------------------------------------------------------

[^1]: Those packets are eventually handled by the host, as it needs to route
    them to local containers or send them out the NIC. The packet will be
    handled by many switches, routers, proxies, servers, etc. along the way,
    which may be subject to their own vulnerabilities.