# Architecture of Istiod

This document describes the high level architecture of the Istio control plane, Istiod.
Istiod is structured as a modular monolith, housing a wide range of functionality: certificate signing, proxy configuration (XDS), traditional Kubernetes controllers, and more.

## Proxy Configuration

Istiod's primary role - and most code - is to dynamically configure proxies (Envoy sidecars and ingress, gRPC, ztunnel, and more). This roughly consists of 3 parts:
1. Config ingestion (inputs to the system)
1. Config translation
1. Config serving (XDS)

### Config Ingestion

Istio reads from over 20 different resource types and aggregates them together to build the proxy configuration. These resources can be sourced from Kubernetes (via watches), files, or over xDS; Kubernetes is by far the most common source, though.

Primarily for historical reasons, ingestion is split into a few components.

#### ConfigStore

The `ConfigStore` reads a variety of resources and exposes them over a standard interface (Get, List, etc). These types are wrapped in a common `config.Config` struct, contrasting with typical Kubernetes clients which use per-resource types. The most common implementation reads from Kubernetes via the `crdclient` package.
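
For orientation, a minimal sketch of the shape of this layer, using simplified hypothetical types (the real interface also supports writes, schema registration, and event handlers):

```go
package ingestion

// GroupVersionKind identifies a resource type (e.g. VirtualService).
type GroupVersionKind struct{ Group, Version, Kind string }

// Config is the common wrapper used for every resource type, in contrast
// to typical Kubernetes clients which use one Go type per resource.
type Config struct {
	GroupVersionKind GroupVersionKind
	Name, Namespace  string
	Spec             any // the type-specific payload
}

// ConfigStore exposes a uniform read interface over all resource types.
// Implementations include the Kubernetes CRD client, the file client, and
// the XDS client; an aggregate store fans reads out across them.
type ConfigStore interface {
	Get(typ GroupVersionKind, name, namespace string) *Config
	List(typ GroupVersionKind, namespace string) []Config
}

// aggregateStore joins several stores behind the same interface.
type aggregateStore struct{ stores []ConfigStore }

func (a aggregateStore) Get(typ GroupVersionKind, name, namespace string) *Config {
	for _, s := range a.stores {
		if c := s.Get(typ, name, namespace); c != nil {
			return c
		}
	}
	return nil
}

func (a aggregateStore) List(typ GroupVersionKind, namespace string) []Config {
	var out []Config
	for _, s := range a.stores {
		out = append(out, s.List(typ, namespace)...)
	}
	return out
}
```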

```mermaid
graph TD
    subgraph ConfigStore
        xcs(XDS Client)
        ccs(CRD Client)
        fcs(Filesystem Client)
        acs(Aggregate)
        xcs-->acs
        ccs-->acs
        fcs-->acs
    end
```

#### ServiceDiscovery

The other primary interface is the ServiceDiscovery. Similar to ConfigStore, this aggregates over a variety of resources. However, it does not provide generic resource access, and instead precomputes a variety of service-oriented internal resources, such as `model.Service` and `model.ServiceInstance`.
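
A rough sketch of the shape of this layer, with heavily simplified hypothetical types (the real interface exposes many more methods and attributes):

```go
package ingestion

// Service is a precomputed, implementation-agnostic view of a service,
// derived from Kubernetes Services or Istio ServiceEntries.
type Service struct {
	Hostname  string
	Namespace string
	Ports     []int
}

// ServiceInstance ties a service to a concrete workload (a Pod or a
// WorkloadEntry) backing it.
type ServiceInstance struct {
	Service *Service
	Address string
	Port    int
	Labels  map[string]string
}

// ServiceDiscovery aggregates service-oriented state from the Kubernetes
// controller and the ServiceEntry controller.
type ServiceDiscovery interface {
	Services() []*Service
	Instances(svc *Service, port int) []*ServiceInstance
}
```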

This is composed of two controllers - one driven from core Kubernetes types ("Kube Controller") and one by Istio types ("ServiceEntry controller").

```mermaid
graph TD
    subgraph Kube Controller
        s(Services)
        e(Endpoints)
        p(Pods)
        ksi(ServiceInstances)
        kwi(WorkloadInstances)
        s-->ksi
        e-->ksi
        p-->kwi
    end
    subgraph ServiceEntry Controller
        se(ServiceEntry)
        we(WorkloadEntry)
        ssi(ServiceInstances)
        swi(WorkloadInstances)
        se-->ssi
        swi-->ssi
        we-->swi
    end
    kwi-->ssi
    swi-->ksi
```

For the most part, this is fairly straightforward. However, we support `ServiceEntry` selecting `Pod`, and `Service` selecting `WorkloadEntry`, which leads to cross-controller communication.

Note: the asymmetry, with `Pods` not contributing to the Kube controller's `ServiceInstances`, is due to the use of `Endpoints`, which is itself derived from `Pods` by Kubernetes core.

#### PushContext

`PushContext` is an immutable snapshot of the current state of the world. It is regenerated (usually partially) on each configuration push (more on this below). Due to being a snapshot, most lookups are lock-free.

`PushContext` is built up by querying the above layers. For some simple use cases, this is as simple as storing something like `configstore.List(SomeType)`; in this case, the only difference from directly exposing the configstore is to snapshot the current state. In other cases, pre-computed indexes are built to make later accesses efficient.
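
Continuing the hypothetical types from the `ConfigStore` sketch above, the snapshot-plus-index pattern looks roughly like this (field and function names are illustrative only):

```go
package ingestion

// PushContext is an immutable snapshot: it is fully built before use and
// never mutated afterwards, so readers need no locks.
type PushContext struct {
	// A simple case: a straight copy of the store's current contents.
	wasmPlugins []Config

	// A precomputed index, e.g. destination rules keyed by host, so that
	// per-proxy lookups during translation are cheap.
	destinationRulesByHost map[string]Config
}

// initContext queries the ingestion layers once and builds all indexes.
func initContext(store ConfigStore) *PushContext {
	pc := &PushContext{destinationRulesByHost: map[string]Config{}}
	pc.wasmPlugins = store.List(GroupVersionKind{Kind: "WasmPlugin"}, "")
	for _, dr := range store.List(GroupVersionKind{Kind: "DestinationRule"}, "") {
		pc.destinationRulesByHost[hostOf(dr)] = dr
	}
	return pc
}

// hostOf extracts the host a DestinationRule applies to (details elided).
func hostOf(c Config) string { return c.Name }
```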

#### Endpoints

Endpoints have an optimized code path, as they are by far the most frequently updated resource - in a steady cluster, this will often be the *only* change, caused by scale up/down.

As a result, they do not go through `PushContext`, and changes do not trigger a `PushContext` recomputation. Instead, the current state is incrementally computed based on events from `ServiceDiscovery`.
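
Conceptually, the endpoint path maintains an incremental index that is patched in place on each event rather than rebuilt; the sketch below uses hypothetical names and elides the real sharding and locality handling:

```go
package ingestion

import "sync"

// EndpointIndex keeps the current endpoints per service, updated
// incrementally from ServiceDiscovery events instead of being rebuilt
// as part of a PushContext recomputation.
type EndpointIndex struct {
	mu        sync.RWMutex
	byService map[string][]string // service hostname -> endpoint addresses
}

// UpdateService replaces only the entry for the changed service, leaving
// everything else untouched; a scale up/down therefore costs in proportion
// to the changed endpoints, not the whole cluster.
func (e *EndpointIndex) UpdateService(hostname string, addrs []string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.byService == nil {
		e.byService = map[string][]string{}
	}
	if len(addrs) == 0 {
		delete(e.byService, hostname)
		return
	}
	e.byService[hostname] = addrs
}
```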

#### Conclusion

Overall, the high level config ingestion flow:

```mermaid
graph TD
    sd(Service Discovery)
    cs(ConfigStore)
    ep(Endpoints)
    pc(PushContext)
    sd-->pc
    cs-->pc
    sd-->ep
```

### Config Translation

Config Translation turns the above inputs into the actual types consumed by the connected XDS clients (typically Envoy). This is done by `Generators`, which register a function to build a given type. For example, there is a `RouteGenerator` responsible for building `Routes`. Along with the core Envoy XDS types, there are a few custom Istio types, such as our `NameTable` type used for DNS, as well as debug interfaces.

`Generators` get as input the `Proxy` (a representation of the current client), the current `PushContext` snapshot, and a list of config updates that caused the change.

The `Proxy` as an input parameter is important, and a major distinction from some other XDS implementations. We are not able to statically translate inputs to XDS without per-client information. For example, we rely on the client's labels to determine the set of policies applied. While this is necessary to implement Istio's APIs, it does limit performance substantially.
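
A simplified sketch of the generator shape (hypothetical signatures; the real registration lives in the `xds` and `model` packages and operates on protobuf types):

```go
package translation

// Proxy describes the connected client: its type (sidecar, gateway,
// ztunnel, ...), labels, namespace, and other per-client metadata that
// translation depends on.
type Proxy struct {
	Type      string
	Namespace string
	Labels    map[string]string
}

// PushContext stands in for the immutable config snapshot described above.
type PushContext struct{}

// ConfigUpdate names a resource whose change triggered this push.
type ConfigUpdate struct{ Kind, Name, Namespace string }

// Resource is an already-encoded XDS resource (a protobuf Any in practice).
type Resource struct {
	Name string
	Data []byte
}

// Generator builds all resources of one XDS type (Clusters, Routes,
// NameTable, ...) for a single proxy.
type Generator interface {
	Generate(proxy *Proxy, push *PushContext, updates []ConfigUpdate) []Resource
}

// generators maps an XDS type URL to the generator responsible for it.
var generators = map[string]Generator{
	// e.g. the route type URL -> a RouteGenerator implementation
}
```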

#### Caching

Config translation typically accounts for the overwhelming majority of Istiod's resource usage, in particular protobuf encoding. As a result, caching has been introduced, storing the already-encoded `protobuf.Any` for a given resource.

This caching depends on declaring all inputs to the given generator as part of the cache key. This is extremely error-prone, as there is nothing preventing generators from consuming inputs that are *not* part of the key. When this happens, different clients will non-deterministically get incorrect configuration. This type of bug has historically resulted in CVEs.

There are a few ways to prevent these:
* Only pass the cache key itself into the generation logic, so no other unaccounted inputs can be used. Unfortunately, this has not been done for any generators today.
* Be very, very careful.
* The cache has a built-in test, enabled with `UNSAFE_PILOT_ENABLE_RUNTIME_ASSERTIONS=true`, that runs in CI. This will panic if any key is written to with a different value.
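
To make the hazard concrete, here is a sketch of the caching pattern with hypothetical types: the key struct must enumerate every input that generation reads, because nothing else prevents a stale entry from being served.

```go
package translation

import "fmt"

// cacheKey must include *every* input the cached generation reads. If a
// generator consults something not captured here, two proxies with the
// same key can receive different (cached vs freshly built) output.
type cacheKey struct {
	clusterName  string
	proxyVersion string
	serviceEpoch int // bumped whenever the backing service changes
}

func (k cacheKey) String() string {
	return fmt.Sprintf("%s|%s|%d", k.clusterName, k.proxyVersion, k.serviceEpoch)
}

// cache stores encoded resources keyed by the declared inputs.
var cache = map[string][]byte{}

// buildCluster returns a cached resource when the declared inputs match,
// and otherwise generates and stores it.
func buildCluster(key cacheKey, generate func() []byte) []byte {
	if v, ok := cache[key.String()]; ok {
		return v
	}
	v := generate()
	cache[key.String()] = v
	return v
}
```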

#### Partial Computations

Along with caching, partial computations are a critical performance optimization to ensure that we do not need to build (or send) every resource to every proxy on every change. This is discussed more in the Config Serving section.

### Config Serving

Config serving is the layer that actually accepts proxy clients, connected over bidirectional gRPC streams, and serves them the required configuration.

There are two triggers for sending config: requests and pushes.

#### Requests

Requests come from the client specifically asking for a set of resources. This could be requesting the initial set of resources on a new connection, or from a new dependency. For example, a push of `Cluster X` referencing `Endpoint Y` may lead to a request for `Endpoint Y` if it is not already known to the client.

Note that clients can actually send three types of messages - requests, ACKs of previous pushes, and NACKs of previous pushes. Unfortunately, these are not clearly distinguished in the API, so there is some logic to split these out (`shouldRespond`).
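
A simplified illustration of that classification (hypothetical fields; the real `shouldRespond` also compares subscribed resource names against what has already been sent):

```go
package serving

// discoveryRequest holds the subset of an XDS DiscoveryRequest relevant to
// classification.
type discoveryRequest struct {
	TypeUrl       string
	ResourceNames []string
	ResponseNonce string // echoes the nonce of the response being ACKed/NACKed
	ErrorDetail   error  // set by the client when rejecting a response
}

type requestKind int

const (
	kindRequest requestKind = iota // new subscription or changed resource names
	kindACK                        // client accepted our last push
	kindNACK                       // client rejected our last push
)

// classify decides how to treat a message. Only genuine requests need a
// response; ACKs and NACKs are recorded but not answered.
func classify(req discoveryRequest) requestKind {
	if req.ResponseNonce == "" {
		// Not referencing any previous response: a fresh request.
		return kindRequest
	}
	if req.ErrorDetail != nil {
		return kindNACK
	}
	return kindACK
}
```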

#### Pushes

A push occurs when Istiod detects that an update of some set of configuration is needed. This produces roughly the same result as a Request (new configuration is pushed to the client); it is just triggered by a different source.

Various components described in Config Ingestion can trigger a Config Update. These are batched up ("debounced"), to avoid excessive activity when many changes happen in succession, and eventually enqueued in the Push Queue.

The Push Queue is mostly a normal queue, but it has some special logic to merge push requests for each given proxy. This results in each proxy having 0 or 1 outstanding push requests; if additional updates come in, the existing push request is just expanded.
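
A sketch of that merging behaviour with hypothetical types (the real queue also tracks connection state and the full push request contents):

```go
package serving

import "sync"

// pushRequest describes one pending push for a proxy: whether it is Full
// and which configs triggered it.
type pushRequest struct {
	Full           bool
	ConfigsUpdated map[string]struct{}
}

// merge combines two outstanding requests into one.
func (p *pushRequest) merge(other *pushRequest) {
	p.Full = p.Full || other.Full
	if p.ConfigsUpdated == nil {
		p.ConfigsUpdated = map[string]struct{}{}
	}
	for k := range other.ConfigsUpdated {
		p.ConfigsUpdated[k] = struct{}{}
	}
}

// pushQueue keeps at most one outstanding request per proxy.
type pushQueue struct {
	mu      sync.Mutex
	pending map[string]*pushRequest // proxy ID -> merged request
	order   []string                // FIFO of proxy IDs with pending work
}

// Enqueue adds or merges a request for the given proxy; a separate job
// pops proxies off in order and triggers their pushes.
func (q *pushQueue) Enqueue(proxyID string, req *pushRequest) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if existing, ok := q.pending[proxyID]; ok {
		existing.merge(req) // 0 or 1 outstanding requests per proxy
		return
	}
	if q.pending == nil {
		q.pending = map[string]*pushRequest{}
	}
	q.pending[proxyID] = req
	q.order = append(q.order, proxyID)
}
```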

Another job polls this queue and triggers each client to start a push.

```mermaid
graph TD
    subgraph Config Flow
        cu(Config Update)
        db(Debounce)
        pc(Recompute Push Context)
        pq(Push Queue)
        cu-->db
        db--Trigger Once Steady-->pc
        pc--Enqueue All Clients-->pq
    end
    subgraph Proxy
        c(Client)
    end
    subgraph Pusher
        pj(Push Job)
        pj--read-->pq
        pj--trigger-->c
    end
```

At a high level, each client job will find the correct generator for the request, generate the required configuration, and send it.

#### Optimizations

A naive implementation would simply regenerate all resources, of all subscribed types, for each client, on any configuration change. However, this scales poorly. As a result, we have many levels of optimizations to avoid doing this work.

First, we have a concept of a `Full` push. Only `Full` pushes will recompute `PushContext` on change; otherwise this is skipped and the last `PushContext` is re-used. Note: even when `Full`, we try to copy as much from the previous `PushContext` as possible. For example, if only a `WasmPlugin` changed, we would not recompute service indexes.
Note: `Full` only refers to whether a `PushContext` recomputation is needed. Even within a `Full` push, we keep track of which configuration updates triggered this, so we could have "Full update of Config X" or "Full update of all configs".

Next, for an individual proxy we will check if it could possibly be impacted by the change. For example, we know a sidecar is never impacted by a `Gateway` update, and we can also look at scoping (from `Sidecar.egress.hosts`) to further restrict update scopes.
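
As an illustration of this kind of filtering (a hypothetical helper, not the actual implementation), a coarse kind-level check might look like:

```go
package serving

// configUpdate names the kind of a changed resource.
type configUpdate struct{ Kind string }

// proxyNeedsPush reports whether a proxy could be affected by any of the
// updates. Kind-level rules are only the coarse filter; the real code
// additionally checks Sidecar scoping (egress hosts) and namespaces.
func proxyNeedsPush(isGateway bool, updates []configUpdate) bool {
	for _, u := range updates {
		switch u.Kind {
		case "Gateway":
			// Gateway updates only affect gateway proxies, never sidecars.
			if isGateway {
				return true
			}
		default:
			// Be conservative: unknown kinds trigger a push.
			return true
		}
	}
	return false
}
```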

Once we determine the proxy may be impacted, we determine which *types* may be impacted. For example, we know a `WasmPlugin` does not impact the `Cluster` type, so we can skip generating `Cluster` in this case. Warning: Envoy currently has a bug that *requires* `Endpoints` to be pushed any time the corresponding `Cluster` is pushed, so this optimization is intentionally turned off in this specific case.

Finally, we determine which subset of the type we need to generate. XDS has two modes - "State of the World (SotW)" and "Delta". In SotW, we generally need to generate all resources of the type, even if only one changed. Note that we actually need to *generate* all of them, typically, as we do not store previously generated resources (mostly because they are generated per-client). This also means that whenever we determine whether a change is required, we do so based on careful code analysis, not at runtime.
Despite this expectation in SotW, due to a quirk in the protocol we can actually enable one of our most important optimizations. XDS types form a tree, with CDS and LDS as the roots of the tree for Envoy. For root types, we *must* always generate the full set of resources - missing resources are treated as deletions.
However, all other types *cannot* be deleted explicitly, and instead are cleaned up when all references are removed. This means we can send partial updates for non-root types, without deleting unsent resources. This effectively allows doing delta updates over SotW. This optimization is critical for our endpoints generator, ensuring that when pods scale up or down we only need to update the affected endpoints.

Istio currently supports both the SotW and Delta protocols. However, the delta implementation is not yet optimized well, so it performs mostly the same as SotW.

## Controllers

Istiod consists of a collection of controllers. Per Kubernetes, "controllers are control loops that watch the state of your cluster, then make or request changes where needed."

In Istio, we use the term a bit more liberally. Istio controllers watch more than just the state of *a* cluster -- many are reading from multiple clusters, or even external sources (files and XDS). Generally, Kubernetes controllers are then writing state back to the cluster; Istio does have a few of these controllers, but most of them are centered around driving the [Proxy Configuration](#proxy-configuration).

### Writing controllers

Istio provides a few helper libraries to get started writing a controller. While these libraries help, there are still a lot of subtleties in writing (and testing!) a controller properly.

To get started writing a controller, review the [Example Controller](../../pkg/kube/controllers/example_test.go).
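
The common shape these controllers take is an event-driven reconcile loop; the schematic below uses only placeholder types (the real helpers add rate limiting, retries, and informer wiring), so treat it as a sketch of the pattern rather than the library API:

```go
package controllers

// Event is a notification that an object changed; the helper libraries
// deliver these from Kubernetes watches.
type Event struct{ Name, Namespace string }

// Controller is the shape most Istiod controllers follow: events from
// watches are funneled into a queue, and a reconcile function processes
// one key at a time, re-reading current state rather than trusting the
// event payload.
type Controller struct {
	queue     chan Event
	reconcile func(Event) error
}

// Run processes items until the stop channel closes. Real controllers use
// the shared queue helpers (with retries and rate limiting) instead of a
// bare channel; this is only the conceptual loop.
func (c *Controller) Run(stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return
		case ev := <-c.queue:
			// Errors would normally be requeued with backoff.
			_ = c.reconcile(ev)
		}
	}
}
```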

### Controllers overview

Below is a high level overview of the controllers in Istiod. For more information about each controller, consult its Go docs.

```mermaid
graph BT
    crd("CRD Watcher")
    subgraph Service Discovery
        ksd("Kubernetes Controller")
        sesd("Service Entry Controller")
        msd("Memory Controller")
        asd("Aggregate")
        ksd--Join-->asd
        sesd--Join-->asd
        msd--Join-->asd
        ksd<--"Data Sharing"-->sesd
    end
    subgraph ConfigStore
        ccs("CRD Client")
        xcs("XDS Store")
        fcs("File Store")
        mcs("Memory Store")
        acs("Aggregate")
        ccs--Join-->acs
        xcs--Join-->acs
        fcs--Join-->acs
        mcs--Join-->acs
    end
    subgraph VMs
        vmhc("Health Check")
        vmar("Auto Registration")
    end
    subgraph Gateway
        twc("Tag Watcher")
        gdc("Gateway Deployment")
        gcc("Gateway Class")
        twc--Depends-->gdc
        gdc-.-gcc
    end
    subgraph Ingress
        ic("Ingress Controller")
        isc("Ingress Status Controller")
        ic-.-isc
    end
    mcsc("Multicluster Secret")
    scr("Credentials Controller")
    mcsc--"1 per cluster"-->scr
    mcsc--"1 per cluster"-->ksd
    crd--Depends-->ccs

    iwhc("Injection Webhook")
    vwhc("Validation Webhook")
    nsc("Namespace Controller")
    ksd--"External Istiod"-->nsc
    ksd--"External Istiod"-->iwhc

    df("Discovery Filter")

    axc("Auto Export Controller")

    mcfg("Mesh Config")
    dfc("Default Revision Controller")
```

As you can see, the landscape of controllers is pretty extensive at this point.

[Service Discovery](#servicediscovery) and [Config Store](#configstore) were already discussed above, so do not need more explanation here.

#### Mesh Config

The Mesh Config controller is a pretty simple controller, reading from `ConfigMap`(s) (multiple if `SHARED_MESH_CONFIG` is used), then processing and merging these into the typed `MeshConfig`. It then exposes this over a simple `mesh.Watcher`, which just provides a way to access the current `MeshConfig` and get notified when it changes.
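
The consumer-facing shape is roughly the following (a simplified sketch; the real watcher lives in the mesh config packages and carries the full protobuf `MeshConfig`):

```go
package meshwatcher

import "sync"

// MeshConfig stands in for the full typed mesh configuration.
type MeshConfig struct {
	RootNamespace string
	// ... many more fields
}

// Watcher exposes the current merged MeshConfig and change notification;
// this mirrors the shape of the real mesh.Watcher, with details elided.
type Watcher struct {
	mu       sync.RWMutex
	current  *MeshConfig
	handlers []func()
}

// Mesh returns the currently active configuration.
func (w *Watcher) Mesh() *MeshConfig {
	w.mu.RLock()
	defer w.mu.RUnlock()
	return w.current
}

// AddMeshHandler registers a callback invoked whenever the ConfigMap(s)
// are re-read and the merged config changes.
func (w *Watcher) AddMeshHandler(h func()) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.handlers = append(w.handlers, h)
}

// set installs a new config and notifies handlers (called by the
// controller after re-merging the ConfigMaps).
func (w *Watcher) set(mc *MeshConfig) {
	w.mu.Lock()
	w.current = mc
	handlers := append([]func(){}, w.handlers...)
	w.mu.Unlock()
	for _, h := range handlers {
		h()
	}
}
```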

#### Ingress

In addition to `VirtualService` and `Gateway`, Istio supports the `Ingress` core resource type. Like CRDs, the `Ingress` controller implements `ConfigStore`, but a bit differently. `Ingress` resources are converted on the fly to `VirtualService` and `Gateway`, so while the controller reads `Ingress` resources (and a few related types like `IngressClass`), it emits other types. This allows the rest of the code to be unaware of Ingress and just focus on the core types.

In addition to this conversion, `Ingress` requires writing the address at which it can be reached to its status. This is done by the Ingress Status controller.

#### Gateway

Gateway (referring to the [Kubernetes API](http://gateway-api.org/), not the same-named Istio type) works very similarly to [Ingress](#ingress). The Gateway controller also converts Gateway API types into `VirtualService` and `Gateway`, implementing the `ConfigStore` interface.

However, there is also a bit of additional logic. Gateway types have extensive status reporting. Unlike Ingress, this status reporting is done inline in the main controller, allowing status generation to be done directly in the logic processing the resources.

Additionally, Gateway involves two components writing to the cluster:
* The Gateway Class controller is a simple controller that just writes a default `GatewayClass` object describing our implementation.
* The Gateway Deployment controller enables users to create a Gateway which actually provisions the underlying resources for the implementation (Deployment and Service). This is more like a traditional "operator". Part of this logic is determining which Istiod revision should handle the resource based on `istio.io/rev` labeling (mirroring sidecar injection); as a result, this takes a dependency on the "Tag Watcher" controller.

#### CRD Watcher

For watches against custom types (CRDs), we want to gracefully handle missing CRDs. Naively starting informers against the missing types would result in errors and block startup. Instead, we introduce a "CRD Watcher" component that watches the CRDs in the cluster to determine if they are available or not.

This is consumed in two ways:
* Some components just block on `watcher.WaitForCRD(...)` before doing the work they need (see the sketch after this list).
* `kclient.NewDelayedInformer` can also fully abstract this away, by providing a client that handles this behind the scenes.
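
A hedged sketch of the first pattern; the interface, signature, and helper names below are assumptions for illustration, not the real API:

```go
package crdwatch

// GVR identifies a resource type by group/version/resource.
type GVR struct{ Group, Version, Resource string }

// CRDWatcher reports when a given CRD becomes available in the cluster.
// The method name mirrors the usage described above; the signature is an
// assumption for illustration.
type CRDWatcher interface {
	// WaitForCRD blocks until the CRD exists or stop closes, returning
	// true only in the former case.
	WaitForCRD(gvr GVR, stop <-chan struct{}) bool
}

// runGatewayController only starts informers against Gateway API types
// once their CRDs are present, avoiding errors at startup.
func runGatewayController(w CRDWatcher, start func(), stop <-chan struct{}) {
	gatewayClass := GVR{Group: "gateway.networking.k8s.io", Version: "v1", Resource: "gatewayclasses"}
	go func() {
		if !w.WaitForCRD(gatewayClass, stop) {
			return // shutting down before the CRD appeared
		}
		start()
	}()
}
```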

#### Credentials Controller

The Credentials controller exposes access to TLS certificate information, stored in the cluster as `Secrets`. Aside from simply accessing certificates, it also has an authorization component that can verify whether a requester has access to read `Secrets` in its namespace.

#### Discovery Filter

The Discovery Filter controller is used to implement the `discoverySelectors` field of `MeshConfig`. This controller reads `Namespace`s in the cluster to determine if they should be "selected". Many controllers consume this filter to only process a subset of configurations.

#### Multicluster

Various controllers read from multiple clusters.

This is rooted in the Multicluster Secret controller, which reads `kubeconfig` files (stored as `Secrets`) and creates Kubernetes clients for each. The controller allows registering handlers which can process Add/Update/Delete of clusters.

These handlers have two implementations:
* The Credentials controller is responsible for reading TLS certificates, stored as Secrets.
* The Kubernetes Service Discovery controller is a bit of a monolith, and spins off a bunch of other sub-controllers in addition to the core service discovery controller.

Because of this monolithic complexity, it helps to see this magnified a bit:

```mermaid
graph BT
    mcsc("Multicluster Secret")
    scr("Credentials Controller")
    ksd("Kubernetes Service Controller")
    nsc("Namespace Controller")
    wes("Workload Entry Store")
    iwh("Injection Patcher")
    aex("Auto Service Export")
    scr-->mcsc
    ksd-->mcsc
    nsc-->ksd
    wes-->ksd
    iwh-->ksd
    aex-->ksd
```

#### VMs

Virtual Machine support consists of two controllers.

The Auto Registration controller is pretty unique as a controller - the inputs to the controller are XDS connections. In response to each XDS connection, a `WorkloadEntry` is created to register the XDS client (which is generally `istio-proxy` running on a VM) to the mesh. This `WorkloadEntry` is tied to the lifecycle of the connection, with some logic to ensure that temporary downtime (reconnecting, etc.) does not remove the `WorkloadEntry`.

The Health Check controller additionally controls the health status of the `WorkloadEntry`. The health is reported over the XDS client and synced with the `WorkloadEntry`.

#### Webhooks

Istio contains both Validation and Mutating webhook configurations. These need a `caBundle` specified in order to provision the TLS trust. Because Istiod's CA certificate is somewhat dynamic, this is patched at runtime (rather than as part of the install). The webhook controllers handle this patching.

These controllers are very similar but are distinct components for a variety of reasons.