# Ztunnel

This document provides an overview of the architecture and design decisions around Ztunnel, the node-proxy component in ambient mode.

## Background and motivation

Motivations to implement ztunnel generally came from two areas.

First, and most importantly, it serves as a means to implement the real goal: waypoints.
For various reasons outside the scope of this document, there is a desire to move from a sidecar-based architecture to a "remote proxy" architecture.
However, this has one glaring issue: how do we get the traffic to the remote proxies, while maintaining the zero-trust properties that Istio is built upon?

A secondary goal was to enable a smoother on-ramp from "Zero" to "Getting some value".
Historically, Istio had to be consumed all-or-nothing for things to work as expected.
In particular, an easy answer to "I just want to have mTLS everywhere, then I can think about adopting the rest of service mesh" was desired.

## Goals

Ztunnel should:
* **Not break users**. This means that deploying Ztunnel should retain all existing Kubernetes behavior.
  * This includes UDP, non-compliant HTTP, server-first protocols, stateful sets, external services, etc.
  * Explicitly opting into behavioral changes can be acceptable. For example, introducing Istio multi-cluster semantics.
* Ensure traffic between mesh workloads is securely encrypted with an Istio identity.
* Be lightweight enough to not limit adoption.
  * This puts a much tighter budget on CPU, memory, latency, and throughput requirements than traditional Istio sidecars.

Ztunnel was not designed to be a feature-rich data plane.
Quite the opposite - an *aggressively* small feature set is the key feature that makes ztunnel viable.
For instance, it very intentionally does not offer L7 (HTTP) functionality, which would likely violate some of the goals above without contributing to them.
Instead, the rich functionality that service mesh is traditionally associated with is deferred to the waypoints.
The ztunnel is primarily a mechanism to get traffic to the waypoints, securely.

## Proxy implementation

In its initial implementations, the ztunnel was actually implemented in 3 different ways: a bespoke Rust implementation, a bespoke Go implementation, and in Envoy.

In the end, [after evaluation](https://docs.google.com/document/d/1c2123cKuYsBDpIon9FFdctWTUIMFweSjgwG7r8l3818/edit), the decision was to move forward with a Rust implementation.
This offered performance benefits that were too large to leave on the table, as well as opportunities to tune to our specific needs.

## Configuration protocol

Ztunnel, of course, needs to be dynamically configured in order to make decisions on how it should handle traffic.
For this purpose, we chose to use the xDS transport protocol due to our expertise and existing infrastructure, and because the protocol is well suited to our needs.

However, while we chose to use the xDS *transport protocol*, we chose not to use the xDS resource types, such as Clusters and Listeners.
In our experience and testing, these types force us to represent data in inefficient ways because they are general purpose.
Ztunnel is not general purpose; it has an extremely tight goal.
We can exploit this to make a more efficient protocol, which is critical to achieving our resource footprint goals.

For example, configuring Istio mTLS in Envoy takes roughly 50 lines of JSON (the configuration is Protobuf on the wire, of course, but the comparison is still relevant).
Because Ztunnel can have Istio semantics baked in, we do not need to encode all this information on the wire.
Instead, an Istio-specific field like `ExpectedTLSIdentity: spiffe://foo.bar` can encode the same information, at a fraction of the cost.
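
As a hedged illustration of this difference (the field name and shape below are hypothetical, not the actual workload API, which is defined in `workload.proto`):

```rust
// Hypothetical, for illustration only: with Istio mTLS semantics baked in,
// a single strongly-typed field can stand in for the ~50-line general-purpose
// Envoy TLS configuration.
struct UpstreamConfig {
    /// The SPIFFE identity the server is expected to present.
    expected_tls_identity: String, // e.g. "spiffe://foo.bar"
}
```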

In our testing, even the most generous representations give custom types a 10x edge (in size, allocations, and CPU time) over Envoy types.
In addition, they are clearer and strictly typed; using Envoy types would require us to put a lot of information in untyped `metadata` maps.

With this in mind, Ztunnel supports two xDS resources: `Address` and `Authorization`.

### Address Type

The primary configuration consumed by Ztunnel is the [`Address` resource](../../pkg/workloadapi/workload.proto).
As the name suggests, an `Address` represents a particular IP address.
This can be a `Service` or a `Workload`.

The address type has the following goals:
* It should support (but not require) on-demand lookups.
  * Specifically, ztunnel should be able to send a request to the control plane to answer "I got a request to send traffic to 1.1.1.1, what is 1.1.1.1?"
  * While this is not needed at small scales, it is important for the long tail of massive clusters (think 1 million endpoints), where the entire set of endpoints cannot reasonably be replicated to each ztunnel.
* It should not be client-specific.
  * In Istio sidecars, historically we had a lot of client-specific xDS. For example, putting the xDS client's IP back into the xDS response. This makes efficient control plane implementation (most notably, caching) extremely challenging.
  * In practice, this largely means that references are fully qualified in the API. IP addresses (generally) have a network associated with them, node names have a cluster associated with them, etc.

See the [XDS Evolution](https://docs.google.com/document/d/1V5wkeBHbLSLMzAMbwFlFZNHdZPyUEspG4lHbnB0UaCg/edit) document for more history and details.

The `Workload` aims to represent everything about a workload (generally a `Pod` or `WorkloadEntry`).
This includes things like its IP address, identity, metadata (name, namespace, app, version, etc.), and whether it has a waypoint proxy associated.

The `Service` aims to represent everything about a service (generally a `Service` or `ServiceEntry`).
This includes things like its IP addresses, ports, and an associated waypoint proxy, if it has one.
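
A hedged sketch of the shape of this resource, as illustrative Rust types (the authoritative definitions live in `workload.proto`; fields here are abbreviated):

```rust
use std::net::IpAddr;

// Illustrative only; see pkg/workloadapi/workload.proto for the real schema.
enum Address {
    Workload(Workload),
    Service(Service),
}

struct Workload {
    ip: IpAddr,
    identity: String, // e.g. "spiffe://<trust domain>/ns/<ns>/sa/<sa>"
    name: String,
    namespace: String,
    waypoint: Option<IpAddr>, // associated waypoint proxy, if any
}

struct Service {
    ips: Vec<IpAddr>,
    ports: Vec<u16>,
    waypoint: Option<IpAddr>,
}
```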

### Authorization Type

A secondary configuration consumed by Ztunnel is the [`Authorization` resource](../../pkg/workloadapi/security/authorization.proto).
[Original Design Doc](https://docs.google.com/document/d/17mRVzXe8PS7VoligvIx52T10tOP7xPQ9noeOzoLO2cY/edit).

This resource aims to represent the relatively small set of authorization policies that Ztunnel supports.
Most notably, only L4 policies are supported.

Most of the API is fairly straightforward.
However, one interesting aspect is how these policies associate with workloads.
Istio's AuthorizationPolicy has label selectors.
However, we intentionally do not send those as part of the Workload API, in order to keep the size low.

The obvious solution to this is to put the list of selected workloads into the policy itself.
However, this means that anytime a workload changes (which happens often), we need to update the policy.

Instead, the opposite was chosen: each workload lists the policies that select it.
This works out to be more efficient in the common case, where policies change much less often than workloads.
This only applies to selector-based policies; namespaced and global policies can be handled without needing to list them out in the Workload API.
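
A hedged sketch of this association model, with illustrative types (the real resources are protobufs in `workload.proto` and `security/authorization.proto`):

```rust
// Illustrative only. The key point: the Workload references policies by
// name, rather than the Authorization embedding the workloads it selects.
struct Workload {
    // ... addresses, identity, metadata ...
    /// Names of the selector-based policies that select this workload.
    /// Workload churn touches only Workload resources; policy rule changes
    /// touch only the Authorization resource.
    authorization_policies: Vec<String>,
}

struct Authorization {
    name: String,
    // ... L4 rules; no embedded list of selected workloads ...
}
```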

## Redirection

As ztunnel aims to transparently encrypt and route users' traffic, we need a mechanism to capture all traffic entering and leaving "mesh" pods.
This is a security-critical task: if the ztunnel can be bypassed, authorization policies can be bypassed.

Redirection must meet these requirements:
* All traffic *egressing* a pod in the mesh should be redirected to the node-local ztunnel on port 15001.
  * It is critical that this path preserves the Service IP, if the traffic was destined for a Service.
* All traffic *ingressing* a pod on port 15008 in the mesh is assumed to be HBONE, and should be redirected to the node-local ztunnel on port 15008 (more on this later).
* All other traffic *ingressing* a pod in the mesh should be redirected to the node-local ztunnel on port 15006, regardless of the intended original destination port.
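
These requirements amount to a small port-selection rule. A hedged sketch, expressed as code purely for clarity (the real mechanism is in-kernel traffic redirection, not ztunnel logic):

```rust
// Illustrative only: which node-local ztunnel port mesh-pod traffic is steered to.
enum Direction {
    Egress,  // traffic leaving a mesh pod
    Ingress, // traffic entering a mesh pod
}

fn ztunnel_port(direction: Direction, dst_port: u16) -> u16 {
    match direction {
        Direction::Egress => 15001,                       // outbound
        Direction::Ingress if dst_port == 15008 => 15008, // assumed HBONE
        Direction::Ingress => 15006,                      // inbound passthrough
    }
}
```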

TODO: fill in implementation details of how redirection is actually implemented.

## HBONE

Along with pass-through traffic, Ztunnel supports the "HBONE" (HTTP-Based Overlay Network) protocol.
This is not so much a new protocol as a name for the expectations of clients and servers communicating in the mesh.

HBONE is just a standard HTTP `CONNECT` tunnel, over mutual TLS with mesh (SPIFFE) certificates, on a well-known port (15008).
The target destination address is set in the `:authority` header, and additional headers can be included as well.
Currently, only HTTP/2 is supported, though HTTP/1.1 and HTTP/3 are planned.
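
As a hedged sketch (using the Rust `http` crate's types purely for illustration, not ztunnel's actual code), the request that establishes the tunnel looks like:

```rust
use http::{Method, Request, Version};

// Illustrative only: an HTTP/2 CONNECT request forming an HBONE tunnel.
// The target workload address ends up in the :authority pseudo-header.
fn hbone_connect(target: &str) -> http::Result<Request<()>> {
    Request::builder()
        .method(Method::CONNECT)
        .uri(target) // e.g. "10.0.1.5:8080"
        .version(Version::HTTP_2)
        .body(())
}
```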

Currently, SNI is not set by Istio clients and is ignored by Istio servers.
This makes identifying which certificate to use problematic for Ztunnel.
To handle this, requests to Ztunnel are sent to `DestinationPod:15008` and redirected to ztunnel, rather than to `ZtunnelPod:15008`.
The original destination is then extracted to determine which certificate to use.
SNI is not used because it is illegal to use IPs in SNI, and there is no other existing standard format to represent what we need.
Additionally, using the redirection mechanism reduces the need for clients to know the destination's ztunnel address.

Below is an example outbound request. The "target" path is what the client sends, while the "actual" path is the real network flow after redirection.

```mermaid
graph LR
    subgraph Client Node
        Client
        CZ["Ztunnel"]
    end
    subgraph Server Node
        Server
        SZ["Ztunnel"]
    end
    Client--Plain-->CZ
    CZ-."HBONE (target)".->Server
    CZ--"HBONE (actual)"-->SZ
    SZ--Plain-->Server
```

### Pooling

User connections can be multiplexed over shared HBONE connections.
This is done through standard HTTP/2 pooling.
The pooling is keyed off the `{source identity, destination identity, destination ip}`.
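
A hedged sketch of that key as an illustrative type:

```rust
use std::net::IpAddr;

// Illustrative only: connections sharing a PoolKey can be multiplexed as
// separate HTTP/2 CONNECT streams over one mTLS connection.
#[derive(PartialEq, Eq, Hash)]
struct PoolKey {
    source_identity: String,      // SPIFFE identity of the source
    destination_identity: String, // SPIFFE identity of the destination
    destination_ip: IpAddr,
}
```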

### Headers

Ztunnel uses the following well-known headers in HBONE:

| Header        | Purpose |
|---------------|---------|
| `:authority`  | Required in `CONNECT`; this is the target destination. |
| `Forwarded`   | For outgoing requests, the original source IP. Note that since we spoof IPs in most cases, this is usually the same as the actual IP seen. For incoming requests, this is used only for traffic from waypoints (which are trusted and cannot spoof IPs). |
| `Baggage`     | (Experimental, likely to be removed) This contains metadata about the source/destination workload for telemetry purposes. |
| `Traceparent` | (Experimental) This maintains tracing information. Note this is tracing of *connections*, and is not correlated to tracing of the user's own HTTP requests. However, it is useful to follow a connection across ztunnels. |

## Traffic routing

Matching the three [redirection](#redirection) paths, ztunnel handles three primary types of traffic.

### Outbound

Requests leaving a pod go through the "outbound" code path on port 15001.
This is where most of Ztunnel's logic lives.

For outbound traffic, we first need to determine where the traffic is destined.
As Ztunnel operates at L4, we only have the destination IP/port (recovered via `SO_ORIGINAL_DST`).
This may be the IP of a Service, a Pod, or something outside the cluster.
Ztunnel will look up the destination in the [addresses](#address-type) it is configured with.

For traffic to unknown addresses, or to workloads that are not a part of the mesh, the traffic will just be passed through as-is.
To make ztunnel more transparent, the original source IP address will be spoofed.
Additionally, `splice` will be used to make this proxying more efficient when possible.
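
As a hedged illustration of what source spoofing involves on Linux (this sketch uses the `socket2` crate with its `all` feature enabled and requires elevated privileges; it is not ztunnel's actual code):

```rust
use std::net::{SocketAddr, TcpStream};
use socket2::{Domain, Protocol, Socket, Type};

// Illustrative only: connect upstream while presenting the original
// client's source address, as a transparent proxy does.
fn connect_spoofed(original_src: SocketAddr, dst: SocketAddr) -> std::io::Result<TcpStream> {
    let sock = Socket::new(Domain::IPV4, Type::STREAM, Some(Protocol::TCP))?;
    // IP_TRANSPARENT permits binding to a non-local address
    // (needs CAP_NET_ADMIN/CAP_NET_RAW).
    sock.set_ip_transparent(true)?;
    sock.bind(&original_src.into())?;
    sock.connect(&dst.into())?;
    Ok(sock.into())
}
```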

For traffic in the mesh, things are a bit more complex:

1. If the destination has a waypoint proxy, we must send the traffic to the waypoint (using HBONE).
   When we do this, we will want to preserve the original destination Service IP, as the waypoint can do a better job picking a backend pod than we can.
   Note: the application itself may have already resolved the Service IP to a specific pod if it has Kubernetes-native routing built in; since we don't have the Service information in this case, we will use the destination IP we received (a pod). Most notably, sidecar proxies behave this way.
1. If the destination is on our node, we "fast path" the request and convert it into an inbound request.
   This has the same semantics as if we had sent a request back to ourselves, but is more efficient and reduces complexity in the Ztunnel.
1. Otherwise, we forward the request to the destination using HBONE. If the destination is a Service, we resolve this to a specific pod IP.

In all cases, we spoof the original source IP.
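
A hedged sketch of this routing decision as pure logic over illustrative types (the real lookup and types differ):

```rust
use std::net::IpAddr;

// Illustrative only.
struct Workload {
    ip: IpAddr,
    node: String,             // node the workload runs on
    waypoint: Option<IpAddr>, // waypoint proxy, if any
    in_mesh: bool,
}

enum Outbound {
    /// Unknown or non-mesh destination: pass bytes through as-is.
    Passthrough,
    /// Send HBONE to the destination's waypoint, preserving the original
    /// destination in the CONNECT :authority.
    ToWaypoint(IpAddr),
    /// Destination is on our own node: convert to an inbound request.
    FastPathLocal,
    /// Send HBONE directly toward the destination workload.
    DirectHbone(IpAddr),
}

fn route_outbound(dst: Option<&Workload>, local_node: &str) -> Outbound {
    match dst {
        None => Outbound::Passthrough,
        Some(w) if !w.in_mesh => Outbound::Passthrough,
        Some(w) => match w.waypoint {
            Some(wp) => Outbound::ToWaypoint(wp),
            None if w.node == local_node => Outbound::FastPathLocal,
            None => Outbound::DirectHbone(w.ip),
        },
    }
}
```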

### Inbound Passthrough

Traffic entering a pod that is not transmitted over HBONE (i.e. with a destination port != 15008) is handled by the "inbound passthrough" code path, on ztunnel's port 15006.

This is fairly straightforward.

First, we need to check that this traffic is allowed.
Traffic may be denied by RBAC policies (especially under `STRICT` mode enforcement, which denies plaintext traffic).

If it is allowed, we will forward it to the target destination.

#### Hairpin

In the case that the destination has a waypoint, that waypoint must have been bypassed to reach the inbound passthrough code path.
How we handle this is [under discussion](https://docs.google.com/document/d/1uM1c3zzoehiijh1ZpZuJ1-SzuVVupenv8r5yuCaFshs/edit#heading=h.dwbqvwmg6ud3).

### Inbound

Traffic entering a pod over HBONE will be handled by the "inbound" code path, on port 15008.

Incoming requests have multiple "layers": TLS wrapping an HTTP CONNECT that is wrapping the user's connection.

To unwrap the first layer, we terminate TLS.
As part of this, we need to pick the correct certificate to serve on behalf of the destination workload.
As discussed in [HBONE](#hbone), this is based on the destination IP.
Additionally, we enforce that the peer has a valid mesh identity (but do not assert _which_ identity, yet).

Next, we terminate the CONNECT.
From the [headers](#headers), we know the target destination.
If the target destination has a waypoint, we enforce that the request is coming from that waypoint; otherwise, the request is rejected.
If there is no waypoint, ztunnel will enforce RBAC policies against the request.

If all checks pass, ztunnel will open a connection to the target. This will spoof the source IP (taken from `Forwarded` for waypoints, or the incoming IP otherwise).
Once the connection is established, we return a 200 HTTP code and bidirectionally copy data between the tunnel and the destination.
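
A hedged sketch of that final step, using tokio's `copy_bidirectional` as one way it could be implemented:

```rust
use tokio::io::{copy_bidirectional, AsyncRead, AsyncWrite};

// Illustrative only: after the 200 response, bytes are shuttled in both
// directions until either side closes.
async fn proxy<A, B>(tunnel: &mut A, upstream: &mut B) -> std::io::Result<(u64, u64)>
where
    A: AsyncRead + AsyncWrite + Unpin,
    B: AsyncRead + AsyncWrite + Unpin,
{
    // Returns (bytes tunnel->upstream, bytes upstream->tunnel).
    copy_bidirectional(tunnel, upstream).await
}
```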

## Certificates

Ztunnel certificates are based on the standard Istio SPIFFE format: `spiffe://<trust domain>/ns/<ns>/sa/<sa>`.

However, the identities in the certificates are those of the actual user workloads, not Ztunnel's own identity.
This means Ztunnel will have multiple distinct certificates at a time, one for each unique identity (service account) running on its node.

When fetching certificates, ztunnel will authenticate to the CA with its own identity, but request the identity of another workload.
Critically, the CA must enforce that the ztunnel has permission to request that identity.
Requests for identities not running on the node are rejected.
This is critical to ensure that a compromised node does not compromise the entire mesh.

This CA enforcement is done by Istio's CA, and is a requirement for any alternative CAs integrating with Ztunnel.

Note: Ztunnel authenticates to the CA with a Kubernetes Service Account JWT token, which encodes the pod information; this is what enables the CA to perform this enforcement.

Ztunnel will request certificates for all identities on the node.
It determines this based on the Workload xDS configuration it receives.
When a new identity is discovered on the node, it will be enqueued for fetching at a low priority, as an optimization.
However, if a request needs a certain identity that we have not fetched yet, it will be requested immediately.

Ztunnel additionally handles the rotation of these certificates (which typically have a 24hr expiration) as they approach expiry.
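
A hedged sketch of a rotation check (the halfway threshold is an assumption for illustration; ztunnel's actual policy may differ):

```rust
use std::time::{Duration, SystemTime};

// Illustrative only.
struct IdentityCert {
    identity: String, // e.g. "spiffe://<trust domain>/ns/<ns>/sa/<sa>"
    not_before: SystemTime,
    not_after: SystemTime,
}

impl IdentityCert {
    /// Refresh once half the lifetime has elapsed, leaving ample margin
    /// before the (typically 24hr) expiry.
    fn needs_refresh(&self, now: SystemTime) -> bool {
        let lifetime = self
            .not_after
            .duration_since(self.not_before)
            .unwrap_or(Duration::ZERO);
        match now.duration_since(self.not_before) {
            Ok(age) => age >= lifetime / 2,
            Err(_) => false, // not yet valid
        }
    }
}
```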

## Telemetry

Ztunnel emits the full set of [Istio Standard Metrics](https://istio.io/latest/docs/reference/config/metrics/) for the four TCP metrics.