# Ztunnel

This document provides an overview of the architecture and design decisions around Ztunnel, the node-proxy component in ambient mode.

## Background and motivation

Motivations to implement ztunnel generally came from two areas.

First, and most importantly, it serves as a means to implement the real goal: waypoints.
For various reasons outside the scope of this document, there is a desire to move from a sidecar-based architecture to a "remote proxy" architecture.
However, this has one glaring issue: how do we get the traffic to the remote proxies, while maintaining the zero-trust properties that Istio is built upon?

A secondary goal was to enable a smoother on-ramp from "Zero" to "Getting some value".
Historically, Istio had to be consumed all-or-nothing for things to work as expected.
In particular, an easy answer to "I just want to have mTLS everywhere, then I can think about adopting the rest of service mesh" was desired.

## Goals

Ztunnel should:
* **Not break users**. This means that deploying Ztunnel should retain all existing Kubernetes behavior.
  * This includes UDP, non-compliant HTTP, server-first protocols, stateful sets, external services, etc.
  * Explicitly opting into behavioral changes can be acceptable. For example, introducing Istio multi-cluster semantics.
* Ensure traffic between mesh workloads is securely encrypted with an Istio identity.
* Be lightweight enough to not limit adoption.
  * This puts a much tighter budget on CPU, memory, latency, and throughput requirements than traditional Istio sidecars.

Ztunnel was not designed to be a feature-rich data plane.
Quite the opposite - an *aggressively* small feature set is the key feature that makes ztunnel viable.
For instance, it very intentionally does not offer L7 (HTTP) functionality, which would likely violate some of the goals above without contributing to them.
Instead, the rich functionality that service mesh is traditionally associated with is deferred to the waypoints.
The ztunnel is primarily a mechanism to get traffic to the waypoints, securely.

## Proxy implementation

In its initial implementations, the ztunnel was actually implemented in 3 different ways: a bespoke Rust implementation, a bespoke Go implementation, and in Envoy.

In the end, [after evaluation](https://docs.google.com/document/d/1c2123cKuYsBDpIon9FFdctWTUIMFweSjgwG7r8l3818/edit), the decision was to move forward with a Rust implementation.
This offered performance benefits that were too large to leave on the table, as well as opportunities to tune to our specific needs.

## Configuration protocol

Ztunnel, of course, needs to be dynamically configured in order to make decisions on how it should handle traffic.
For this purpose, we chose to use the xDS transport protocol due to our expertise and existing infrastructure, and because the protocol is well suited to our needs.

However, while we chose to use the xDS *transport protocol*, we chose not to use the xDS resource types, such as Clusters and Listeners.
In our experience and testing, these types force us to represent data in inefficient ways because they are general purpose.
Ztunnel is not general purpose; it has an extremely tight goal.
We can exploit this to make a more efficient protocol, which is critical to achieving our resource footprint goals.

For example, configuring Istio mTLS in Envoy takes roughly 50 lines of JSON (the configuration is Protobuf on the wire, of course, but the comparison is still relevant).
Because Ztunnel can have Istio semantics baked in, we do not need to encode all this information on the wire.
Instead, an Istio-specific field like `ExpectedTLSIdentity: spiffe://foo.bar` can encode the same information, at a fraction of the cost.
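
As a hedged illustration of this difference (the field name and shape below are hypothetical, not the actual workload API, which is defined in `workload.proto`):

```rust
// Hypothetical, for illustration only: with Istio mTLS semantics baked in,
// a single strongly-typed field can stand in for the ~50-line general-purpose
// Envoy TLS configuration.
struct UpstreamConfig {
    /// The SPIFFE identity the server is expected to present.
    expected_tls_identity: String, // e.g. "spiffe://foo.bar"
}
```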

In our testing, even the most generous representations give custom types a 10x edge (in size, allocations, and CPU time) over Envoy types.
In addition, they are clearer and strictly typed; using Envoy types would require us to put a lot of information in untyped `metadata` maps.

With this in mind, Ztunnel supports two xDS resources: `Address` and `Authorization`.

### Address Type

The primary configuration consumed by Ztunnel is the [`Address` resource](../../pkg/workloadapi/workload.proto).
As the name suggests, an `Address` represents a particular IP address.
This can be a `Service` or a `Workload`.

The address type has the following goals:
* It should support (but not require) on-demand lookups.
  * Specifically, ztunnel should be able to send a request to the control plane to answer "I got a request to send traffic to 1.1.1.1, what is 1.1.1.1?"
  * While this is not needed at small scales, it is important for the long tail of massive clusters (think 1 million endpoints), where the entire set of endpoints cannot reasonably be replicated to each ztunnel.
* It should not be client-specific.
  * In Istio sidecars, historically we had a lot of client-specific xDS. For example, putting the xDS client's IP back into the xDS response. This makes efficient control plane implementation (most notably, caching) extremely challenging.
  * In practice, this largely means that references are fully qualified in the API. IP addresses (generally) have a network associated with them, node names have a cluster associated with them, etc.

See the [XDS Evolution](https://docs.google.com/document/d/1V5wkeBHbLSLMzAMbwFlFZNHdZPyUEspG4lHbnB0UaCg/edit) document for more history and details.

The `Workload` aims to represent everything about a workload (generally a `Pod` or `WorkloadEntry`).
This includes things like its IP address, identity, metadata (name, namespace, app, version, etc.), and whether it has a waypoint proxy associated.

The `Service` aims to represent everything about a service (generally a `Service` or `ServiceEntry`).
This includes things like its IP addresses, ports, and an associated waypoint proxy, if it has one.
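
A hedged sketch of the shape of this resource, as illustrative Rust types (the authoritative definitions live in `workload.proto`; fields here are abbreviated):

```rust
use std::net::IpAddr;

// Illustrative only; see pkg/workloadapi/workload.proto for the real schema.
enum Address {
    Workload(Workload),
    Service(Service),
}

struct Workload {
    ip: IpAddr,
    identity: String, // e.g. "spiffe://<trust domain>/ns/<ns>/sa/<sa>"
    name: String,
    namespace: String,
    waypoint: Option<IpAddr>, // associated waypoint proxy, if any
}

struct Service {
    ips: Vec<IpAddr>,
    ports: Vec<u16>,
    waypoint: Option<IpAddr>,
}
```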

### Authorization Type

A secondary configuration consumed by Ztunnel is the [`Authorization` resource](../../pkg/workloadapi/security/authorization.proto).
[Original Design Doc](https://docs.google.com/document/d/17mRVzXe8PS7VoligvIx52T10tOP7xPQ9noeOzoLO2cY/edit).

This resource aims to represent the relatively small set of authorization policies that Ztunnel supports.
Most notably, only L4 policies are supported.

Most of the API is fairly straightforward.
However, one interesting aspect is how these policies associate with workloads.
Istio's AuthorizationPolicy has label selectors.
However, we intentionally do not send those as part of the Workload API, in order to keep the size low.

The obvious solution to this is to put the list of selected workloads into the policy itself.
However, this means that anytime a workload changes (which happens often), we need to update the policy.

Instead, the opposite was chosen: each workload lists the policies that select it.
This works out to be more efficient in the common case, where policies change much less often than workloads.
This only applies to selector-based policies; namespaced and global policies can be handled without needing to list them out in the Workload API.
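
A hedged sketch of this association model, with illustrative types (the real resources are protobufs in `workload.proto` and `security/authorization.proto`):

```rust
// Illustrative only. The key point: the Workload references policies by
// name, rather than the Authorization embedding the workloads it selects.
struct Workload {
    // ... addresses, identity, metadata ...
    /// Names of the selector-based policies that select this workload.
    /// Workload churn touches only Workload resources; policy rule changes
    /// touch only the Authorization resource.
    authorization_policies: Vec<String>,
}

struct Authorization {
    name: String,
    // ... L4 rules; no embedded list of selected workloads ...
}
```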

## Redirection

As ztunnel aims to transparently encrypt and route users' traffic, we need a mechanism to capture all traffic entering and leaving "mesh" pods.
This is a security-critical task: if the ztunnel can be bypassed, authorization policies can be bypassed.

Redirection must meet these requirements:
* All traffic *egressing* a pod in the mesh should be redirected to the node-local ztunnel on port 15001.
  * It is critical that this path preserves the Service IP, if the traffic was destined for a Service.
* All traffic *ingressing* a pod on port 15008 in the mesh is assumed to be HBONE, and should be redirected to the node-local ztunnel on port 15008 (more on this later).
* All other traffic *ingressing* a pod in the mesh should be redirected to the node-local ztunnel on port 15006, regardless of the intended original destination port.
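
These requirements amount to a small port-selection rule. A hedged sketch, expressed as code purely for clarity (the real mechanism is in-kernel traffic redirection, not ztunnel logic):

```rust
// Illustrative only: which node-local ztunnel port mesh-pod traffic is steered to.
enum Direction {
    Egress,  // traffic leaving a mesh pod
    Ingress, // traffic entering a mesh pod
}

fn ztunnel_port(direction: Direction, dst_port: u16) -> u16 {
    match direction {
        Direction::Egress => 15001,                       // outbound
        Direction::Ingress if dst_port == 15008 => 15008, // assumed HBONE
        Direction::Ingress => 15006,                      // inbound passthrough
    }
}
```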

TODO: fill in implementation details of how redirection is actually implemented.

## HBONE

Along with pass-through traffic, Ztunnel supports the "HBONE" (HTTP-Based Overlay Network) protocol.
This is not so much a new protocol as a name for the expectations of clients and servers communicating in the mesh.

HBONE is just a standard HTTP `CONNECT` tunnel, over mutual TLS with mesh (SPIFFE) certificates, on a well-known port (15008).
The target destination address is set in the `:authority` header, and additional headers can be included as well.
Currently, only HTTP/2 is supported, though HTTP/1.1 and HTTP/3 are planned.
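
As a hedged sketch (using the Rust `http` crate's types purely for illustration, not ztunnel's actual code), the request that establishes the tunnel looks like:

```rust
use http::{Method, Request, Version};

// Illustrative only: an HTTP/2 CONNECT request forming an HBONE tunnel.
// The target workload address ends up in the :authority pseudo-header.
fn hbone_connect(target: &str) -> http::Result<Request<()>> {
    Request::builder()
        .method(Method::CONNECT)
        .uri(target) // e.g. "10.0.1.5:8080"
        .version(Version::HTTP_2)
        .body(())
}
```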

Currently, SNI is not set by Istio clients and is ignored by Istio servers.
This makes identifying which certificate to use problematic for Ztunnel.
To handle this, requests to Ztunnel are sent to `DestinationPod:15008` and redirected to ztunnel, rather than to `ZtunnelPod:15008`.
The original destination is then extracted to determine which certificate to use.
SNI is not used because it is illegal to use IPs in SNI, and there is no other existing standard format to represent what we need.
Additionally, using the redirection mechanism reduces the need for clients to know the destination's ztunnel address.

Below is an example outbound request. The "target" path is what the client sends, while the "actual" path is the real network flow after redirection.

```mermaid
graph LR
    subgraph Client Node
        Client
        CZ["Ztunnel"]
    end
    subgraph Server Node
        Server
        SZ["Ztunnel"]
    end
    Client--Plain-->CZ
    CZ-."HBONE (target)".->Server
    CZ--"HBONE (actual)"-->SZ
    SZ--Plain-->Server
```

### Pooling

User connections can be multiplexed over shared HBONE connections.
This is done through standard HTTP/2 pooling.
The pooling is keyed off the `{source identity, destination identity, destination ip}`.
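
A hedged sketch of that key as an illustrative type:

```rust
use std::net::IpAddr;

// Illustrative only: connections sharing a PoolKey can be multiplexed as
// separate HTTP/2 CONNECT streams over one mTLS connection.
#[derive(PartialEq, Eq, Hash)]
struct PoolKey {
    source_identity: String,      // SPIFFE identity of the source
    destination_identity: String, // SPIFFE identity of the destination
    destination_ip: IpAddr,
}
```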

### Headers

Ztunnel uses the following well-known headers in HBONE:

| Header        | Purpose |
|---------------|---------|
| `:authority`  | Required in `CONNECT`; this is the target destination. |
| `Forwarded`   | For outgoing requests, the original source IP. Note that since we spoof IPs in most cases, this is usually the same as the actual IP seen. For incoming requests, this is used only for traffic from waypoints (which are trusted and cannot spoof IPs). |
| `Baggage`     | (Experimental, likely to be removed) This contains metadata about the source/destination workload for telemetry purposes. |
| `Traceparent` | (Experimental) This maintains tracing information. Note this is tracing of *connections*, and is not correlated to tracing of the user's own HTTP requests. However, it is useful to follow a connection across ztunnels. |

## Traffic routing

Matching the three [redirection](#redirection) paths, ztunnel handles three primary types of traffic.

### Outbound

Requests leaving a pod go through the "outbound" code path on port 15001.
This is where most of Ztunnel's logic lives.

For outbound traffic, we first need to determine where the traffic is destined.
As Ztunnel operates at L4, we only have the destination IP/port (recovered via `SO_ORIGINAL_DST`).
This may be the IP of a Service, a Pod, or something outside the cluster.
Ztunnel will look up the destination in the [addresses](#address-type) it is configured with.

For traffic to unknown addresses, or to workloads that are not a part of the mesh, the traffic will just be passed through as-is.
To make ztunnel more transparent, the original source IP address will be spoofed.
Additionally, `splice` will be used to make this proxying more efficient when possible.
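
As a hedged illustration of what source spoofing involves on Linux (this sketch uses the `socket2` crate with its `all` feature enabled and requires elevated privileges; it is not ztunnel's actual code):

```rust
use std::net::{SocketAddr, TcpStream};
use socket2::{Domain, Protocol, Socket, Type};

// Illustrative only: connect upstream while presenting the original
// client's source address, as a transparent proxy does.
fn connect_spoofed(original_src: SocketAddr, dst: SocketAddr) -> std::io::Result<TcpStream> {
    let sock = Socket::new(Domain::IPV4, Type::STREAM, Some(Protocol::TCP))?;
    // IP_TRANSPARENT permits binding to a non-local address
    // (needs CAP_NET_ADMIN/CAP_NET_RAW).
    sock.set_ip_transparent(true)?;
    sock.bind(&original_src.into())?;
    sock.connect(&dst.into())?;
    Ok(sock.into())
}
```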

For traffic in the mesh, things are a bit more complex:

1. If the destination has a waypoint proxy, we must send the traffic to the waypoint (using HBONE).
   When we do this, we will want to preserve the original destination Service IP, as the waypoint can do a better job picking a backend pod than we can.
   Note: the application itself may have already resolved the Service IP to a specific pod if it has Kubernetes-native routing built in; since we don't have the Service information in this case, we will use the destination IP we received (a pod). Most notably, sidecar proxies behave this way.
1. If the destination is on our node, we "fast path" the request and convert it into an inbound request.
   This has the same semantics as if we had sent a request back to ourselves, but is more efficient and reduces complexity in the Ztunnel.
1. Otherwise, we forward the request to the destination using HBONE. If the destination is a Service, we resolve this to a specific pod IP.

In all cases, we spoof the original source IP.
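
A hedged sketch of this routing decision as pure logic over illustrative types (the real lookup and types differ):

```rust
use std::net::IpAddr;

// Illustrative only.
struct Workload {
    ip: IpAddr,
    node: String,             // node the workload runs on
    waypoint: Option<IpAddr>, // waypoint proxy, if any
    in_mesh: bool,
}

enum Outbound {
    /// Unknown or non-mesh destination: pass bytes through as-is.
    Passthrough,
    /// Send HBONE to the destination's waypoint, preserving the original
    /// destination in the CONNECT :authority.
    ToWaypoint(IpAddr),
    /// Destination is on our own node: convert to an inbound request.
    FastPathLocal,
    /// Send HBONE directly toward the destination workload.
    DirectHbone(IpAddr),
}

fn route_outbound(dst: Option<&Workload>, local_node: &str) -> Outbound {
    match dst {
        None => Outbound::Passthrough,
        Some(w) if !w.in_mesh => Outbound::Passthrough,
        Some(w) => match w.waypoint {
            Some(wp) => Outbound::ToWaypoint(wp),
            None if w.node == local_node => Outbound::FastPathLocal,
            None => Outbound::DirectHbone(w.ip),
        },
    }
}
```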

### Inbound Passthrough

Traffic entering a pod that is not transmitted over HBONE (i.e. with a destination port != 15008) is handled by the "inbound passthrough" code path, on ztunnel's port 15006.

This is fairly straightforward.

First, we need to check that this traffic is allowed.
Traffic may be denied by RBAC policies (especially under `STRICT` mode enforcement, which denies plaintext traffic).

If it is allowed, we will forward it to the target destination.

#### Hairpin

In the case that the destination has a waypoint, that waypoint must have been bypassed to reach the inbound passthrough code path.
How we handle this is [under discussion](https://docs.google.com/document/d/1uM1c3zzoehiijh1ZpZuJ1-SzuVVupenv8r5yuCaFshs/edit#heading=h.dwbqvwmg6ud3).

### Inbound

Traffic entering a pod over HBONE will be handled by the "inbound" code path, on port 15008.

Incoming requests have multiple "layers": TLS wrapping an HTTP CONNECT that is wrapping the user's connection.

To unwrap the first layer, we terminate TLS.
As part of this, we need to pick the correct certificate to serve on behalf of the destination workload.
As discussed in [HBONE](#hbone), this is based on the destination IP.
Additionally, we enforce that the peer has a valid mesh identity (but do not assert _which_ identity, yet).

Next, we terminate the CONNECT.
From the [headers](#headers), we know the target destination.
If the target destination has a waypoint, we enforce that the request is coming from that waypoint; otherwise, the request is rejected.
If there is no waypoint, ztunnel will enforce RBAC policies against the request.

If all checks pass, ztunnel will open a connection to the target. This will spoof the source IP (taken from `Forwarded` for waypoints, or the incoming IP otherwise).
Once the connection is established, we return a 200 HTTP code and bidirectionally copy data between the tunnel and the destination.
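
A hedged sketch of that final step, using tokio's `copy_bidirectional` as one way it could be implemented:

```rust
use tokio::io::{copy_bidirectional, AsyncRead, AsyncWrite};

// Illustrative only: after the 200 response, bytes are shuttled in both
// directions until either side closes.
async fn proxy<A, B>(tunnel: &mut A, upstream: &mut B) -> std::io::Result<(u64, u64)>
where
    A: AsyncRead + AsyncWrite + Unpin,
    B: AsyncRead + AsyncWrite + Unpin,
{
    // Returns (bytes tunnel->upstream, bytes upstream->tunnel).
    copy_bidirectional(tunnel, upstream).await
}
```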

## Certificates

Ztunnel certificates are based on the standard Istio SPIFFE format: `spiffe://<trust domain>/ns/<ns>/sa/<sa>`.

However, the identities in the certificates are those of the actual user workloads, not Ztunnel's own identity.
This means Ztunnel will have multiple distinct certificates at a time, one for each unique identity (service account) running on its node.

When fetching certificates, ztunnel will authenticate to the CA with its own identity, but request the identity of another workload.
Critically, the CA must enforce that the ztunnel has permission to request that identity.
Requests for identities not running on the node are rejected.
This is critical to ensure that a compromised node does not compromise the entire mesh.

This CA enforcement is done by Istio's CA, and is a requirement for any alternative CAs integrating with Ztunnel.

Note: Ztunnel authenticates to the CA with a Kubernetes Service Account JWT token, which encodes the pod information; this is what enables the CA to perform this enforcement.

Ztunnel will request certificates for all identities on the node.
It determines this based on the Workload xDS configuration it receives.
When a new identity is discovered on the node, it will be enqueued for fetching at a low priority, as an optimization.
However, if a request needs a certain identity that we have not fetched yet, it will be requested immediately.

Ztunnel additionally handles the rotation of these certificates (which typically have a 24hr expiration) as they approach expiry.
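
A hedged sketch of a rotation check (the halfway threshold is an assumption for illustration; ztunnel's actual policy may differ):

```rust
use std::time::{Duration, SystemTime};

// Illustrative only.
struct IdentityCert {
    identity: String, // e.g. "spiffe://<trust domain>/ns/<ns>/sa/<sa>"
    not_before: SystemTime,
    not_after: SystemTime,
}

impl IdentityCert {
    /// Refresh once half the lifetime has elapsed, leaving ample margin
    /// before the (typically 24hr) expiry.
    fn needs_refresh(&self, now: SystemTime) -> bool {
        let lifetime = self
            .not_after
            .duration_since(self.not_before)
            .unwrap_or(Duration::ZERO);
        match now.duration_since(self.not_before) {
            Ok(age) => age >= lifetime / 2,
            Err(_) => false, // not yet valid
        }
    }
}
```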

## Telemetry

Ztunnel emits the full set of [Istio Standard Metrics](https://istio.io/latest/docs/reference/config/metrics/) for the four TCP metrics.