// SPDX-License-Identifier: Apache-2.0
// Copyright Authors of Cilium

// Package fqdn handles some of the DNS-based policy functions:
//   - A DNS lookup cache used to populate toFQDNs rules in the policy layer.
//   - A NameManager that coordinates distributing IPs to matching toFQDNs
//     selectors.
//   - A DNS Proxy that applies L7 DNS rules and populates the lookup cache with
//     IPs from allowed/successful DNS lookups.
//   - (deprecated) A DNS Poller that actively polls all L3 toFQDNs.MatchName
//     entries and populates the DNS lookup cache.
//
// Note: two different kinds of requests are handled here: the DNS lookup
// itself and the later connection to the domain returned by that lookup.
//
// Proxy redirection and L3 policy calculations are handled by the datapath and
// policy layer, respectively.
//
// DNS data is tracked per-endpoint but collected globally in each cilium-agent
// when calculating policy. This differs from toEndpoints rules, which use
// cluster-global information, and toCIDR rules, which use static information
// in the policy. toServices rules are similar, but they are cluster-global and
// have neither a TTL nor a distinct lookup request from the endpoint.
// Furthermore, toFQDNs cannot handle in-cluster IPs but toServices can.
//
//	+-------------+   +----------------+        +---------+     +---------+
//	|             |   |                |        |         |     |         |
//	|             +<--+   NameManager  +<-------+         |     |         |
//	|             |   |                | Update |         |     |         |
//	|   Policy    |   +-------+--------+ Trigger|   DNS   |     |         |
//	|  Selectors  |           ^                 |  Proxy  +<--->+ Network |
//	|             |           |                 |         |     |         |
//	|             |   +-------+--------+        |         |     |         |
//	|             |   |      DNS       |        |         |     |         |
//	|             |   |  Lookup Cache  +<-------+         |     |         |
//	+------+------+   |                |   DNS  +----+----+     +----+----+
//	       |          +----------------+   Data      ^               ^
//	       v                                         |               |
//	+------+------+--------------------+             |               |
//	|             |                    |             |               |
//	|   Datapath  |                    |             |               |
//	|             |                    |   DNS Lookup|               |
//	+-------------+                    +<------------+               |
//	|                                  |                             |
//	|                Pod               |                             |
//	|                                  |                   HTTP etc. |
//	|                                  +<----------------------------+
//	|                                  |
//	+----------------------------------+
//
// === L7 DNS ===
// L7 DNS is handled by the DNS Proxy. The proxy is always running within
// cilium-agent but traffic is only redirected to it when an L7 rule includes a
// DNS section such as:
//
//	---
//	- toEndpoints:
//	  toPorts:
//	  - ports:
//	     - port: "53"
//	       protocol: ANY
//	    rules:
//	      dns:
//	        - matchPattern: "*"
//	        - matchName: "cilium.io"
//
// These redirects are implemented by the datapath, and the management logic is
// shared with other proxies in cilium (envoy and kafka). L7 DNS rules can
// apply to an endpoint from various policies and, if any allow a request, it
// will be forwarded to the original target of the DNS packet. This target is
// often configured in /etc/resolv.conf for a pod, and k8s sets this automatically
// (https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-config).
// In the example above `matchPattern: "*"` allows all requests and makes
// `matchName: "cilium.io"` redundant.
// Notes:
//   - The forwarded requests are sent from cilium-agent on the host interface
//     and not from the endpoint.
//   - Users must explicitly allow `*.*.svc.cluster.local.` in k8s clusters.
//     This is not automatic.
//   - L7 DNS rules are egress-only.
//   - The proxy emits L7 cilium-monitor events: one for the request, an
//     accept/reject event, and the final response.
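//
// To make the matching behaviour concrete, the sketch below compiles
// matchName and matchPattern entries into a single anchored regular
// expression and allows a query when any alternative matches. This is a
// self-contained illustration, not the proxy's actual matching code; the
// function name allowedDNSRegexp and the character class substituted for "*"
// are assumptions made for this example.
//
//	package main
//
//	import (
//		"fmt"
//		"regexp"
//		"strings"
//	)
//
//	// allowedDNSRegexp builds one anchored regular expression: matchName
//	// entries are quoted literally and each "*" in a matchPattern may match
//	// any run of DNS name characters (a simplification for this sketch).
//	func allowedDNSRegexp(matchNames, matchPatterns []string) (*regexp.Regexp, error) {
//		alts := make([]string, 0, len(matchNames)+len(matchPatterns))
//		for _, n := range matchNames {
//			alts = append(alts, regexp.QuoteMeta(strings.ToLower(n)))
//		}
//		for _, p := range matchPatterns {
//			quoted := regexp.QuoteMeta(strings.ToLower(p))
//			alts = append(alts, strings.ReplaceAll(quoted, `\*`, `[-a-z0-9_.]*`))
//		}
//		return regexp.Compile(`^(?:` + strings.Join(alts, `|`) + `)\.?$`)
//	}
//
//	func main() {
//		// The union of all rules applying to the endpoint decides: if any
//		// alternative matches, the request is forwarded.
//		re, _ := allowedDNSRegexp([]string{"cilium.io"}, []string{"*.cilium.io"})
//		fmt.Println(re.MatchString("docs.cilium.io.")) // true
//		fmt.Println(re.MatchString("example.com."))    // false
//	}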
//
// Apart from allowing or denying DNS requests, the DNS proxy is used to
// observe DNS lookups in order to then allow L3 connections with the response
// information. These connections must be separately allowed with toFQDNs L3
// rules. The example above is a common "visibility" policy that allows all
// requests but ensures that they traverse the proxy. This information is then
// placed in the per-Endpoint and global DNS lookup caches and propagates from
// there.
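//
// Conceptually, what the proxy records from an allowed lookup is a
// (name, IPs, TTL) tuple. A rough sketch of that extraction using the
// github.com/miekg/dns message types follows; the function name
// responseToIPs is hypothetical and error handling is omitted.
//
//	package example
//
//	import (
//		"net/netip"
//
//		"github.com/miekg/dns"
//	)
//
//	// responseToIPs pulls out the data relevant to toFQDNs policy from one
//	// allowed DNS response: the queried name, the answer IPs, and the
//	// smallest TTL seen in the answer section.
//	func responseToIPs(msg *dns.Msg) (name string, ips []netip.Addr, minTTL uint32) {
//		if len(msg.Question) > 0 {
//			name = msg.Question[0].Name
//		}
//		minTTL = ^uint32(0) // no A/AAAA answers leaves minTTL at its maximum
//		for _, rr := range msg.Answer {
//			var ip netip.Addr
//			var ok bool
//			switch a := rr.(type) {
//			case *dns.A:
//				ip, ok = netip.AddrFromSlice(a.A)
//			case *dns.AAAA:
//				ip, ok = netip.AddrFromSlice(a.AAAA)
//			default:
//				continue
//			}
//			if !ok {
//				continue
//			}
//			ips = append(ips, ip.Unmap())
//			if ttl := rr.Header().Ttl; ttl < minTTL {
//				minTTL = ttl
//			}
//		}
//		return name, ips, minTTL
//	}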
//
// === L3 DNS ===
// L3 DNS rules control L3 connections and not the DNS requests themselves.
// They rely on DNS lookup cache information, which must come from the DNS
// proxy, i.e. via an L7 DNS rule.
//
//	---
//	- toFQDNs:
//	    - matchName: "my-remote-service.com"
//	    - matchPattern: "bucket.*.my-remote-service.com"
//
// IPs seen in a DNS response (i.e. the request was allowed by an L7 policy)
// that are also selected by a DNS L3 rule matchPattern or matchName have a /32
// or /128 CIDR identity created. This occurs when they are first passed to the
// toFQDN selectors from NameManager. These identities are not special in any
// way and can overlap with toCIDR rules in policies. They are placed in the
// node-local ipcache and in the policy map of each endpoint that is allowed to
// connect to them (i.e. defined in the L3 DNS rule).
// Notes:
//   - Generally speaking, toFQDNs can only handle non-cluster IPs. In-cluster
//     policy should use toEndpoints and toServices. This is partly historical,
//     but also due to ipcache limitations when mapping ip->identity: Endpoint
//     identities can clobber the FQDN IP identity.
//   - Despite being tracked per-Endpoint, DNS lookup IPs are collected into a
//     global cache. This is historical and can be changed.
//     The original implementation created policy documents in the policy
//     repository to represent the IPs being allowed and could not distinguish
//     between endpoints. The current implementation uses selectors that also do
//     not distinguish between Endpoints. There is some provision for this,
//     however, and it just requires better plumbing in how we place data in the
//     Endpoint's datapath.
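//
// The "CIDR identity per IP" behaviour amounts to turning each response IP
// into a single-address prefix. A minimal sketch of that step using net/netip
// follows; the function name fqdnPrefix is illustrative and is not the actual
// NameManager code.
//
//	package main
//
//	import (
//		"fmt"
//		"net/netip"
//	)
//
//	// fqdnPrefix turns one IP from a DNS response into the /32 or /128
//	// prefix for which a CIDR-like identity is allocated.
//	func fqdnPrefix(ip netip.Addr) netip.Prefix {
//		ip = ip.Unmap() // treat IPv4-mapped IPv6 addresses as IPv4
//		return netip.PrefixFrom(ip, ip.BitLen())
//	}
//
//	func main() {
//		fmt.Println(fqdnPrefix(netip.MustParseAddr("104.198.14.52"))) // 104.198.14.52/32
//		fmt.Println(fqdnPrefix(netip.MustParseAddr("2001:db8::1")))   // 2001:db8::1/128
//	}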
//
// === Caching, Long-Lived Connections & Garbage Collection ===
// DNS requests are distinct traffic from the connections that pods make with
// the response information. This makes it difficult to correlate a DNS lookup
// with a later connection; a pod may reuse the IPs in a DNS response an
// arbitrary time after the lookup occurred, even past the DNS TTL. The
// solution is multi-layered for historical reasons (a sketch of the caching
// layers follows this list):
//   - Keep a per-Endpoint cache that can be stored to disk and restored on
//     startup. These caches apply TTL expiration and limit the IP count per domain.
//   - Keep a global cache to combine all this DNS information and send it to the
//     policy system. This cache applies TTL but not per-domain limits.
//     This causes a DNS lookup in one endpoint to leak to another!
//   - Track live connections allowed by DNS policy and delay expiring that data
//     while the connection is open. If the policy itself is removed, however, the
//     connection is interrupted.
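//
// The following is a small, self-contained sketch of the first two layers:
// per-name entries with an expiry, a per-name IP limit in the per-Endpoint
// layer, and a global layer that merges the same updates without a limit. The
// type and field names (nameCache, ipEntry, perNameLimit) are invented for
// illustration; the real implementation is the DNSCache type in this package.
//
//	package main
//
//	import (
//		"net/netip"
//		"sort"
//		"time"
//	)
//
//	// ipEntry records when one IP learned for a name stops being valid.
//	type ipEntry struct {
//		ip      netip.Addr
//		expires time.Time
//	}
//
//	// nameCache is an illustrative stand-in for a per-Endpoint (limited) or
//	// global (unlimited) DNS lookup cache.
//	type nameCache struct {
//		minTTL       time.Duration // lower bound applied to every TTL
//		perNameLimit int           // 0 means unlimited (global cache)
//		entries      map[string][]ipEntry
//	}
//
//	// Update records the IPs from one DNS response for name, clamping the
//	// TTL to minTTL and evicting the oldest IPs once the per-name limit is
//	// exceeded.
//	func (c *nameCache) Update(now time.Time, name string, ips []netip.Addr, ttl time.Duration) {
//		if ttl < c.minTTL {
//			ttl = c.minTTL
//		}
//		if c.entries == nil {
//			c.entries = make(map[string][]ipEntry)
//		}
//		for _, ip := range ips {
//			c.entries[name] = append(c.entries[name], ipEntry{ip: ip, expires: now.Add(ttl)})
//		}
//		if c.perNameLimit > 0 && len(c.entries[name]) > c.perNameLimit {
//			es := c.entries[name]
//			sort.Slice(es, func(i, j int) bool { return es[i].expires.Before(es[j].expires) })
//			c.entries[name] = es[len(es)-c.perNameLimit:] // keep the newest entries
//		}
//	}
//
//	// Lookup returns the not-yet-expired IPs for name.
//	func (c *nameCache) Lookup(now time.Time, name string) []netip.Addr {
//		var out []netip.Addr
//		for _, e := range c.entries[name] {
//			if e.expires.After(now) {
//				out = append(out, e.ip)
//			}
//		}
//		return out
//	}
//
//	func main() {
//		now := time.Now()
//		perEP := &nameCache{minTTL: time.Hour, perNameLimit: 50}
//		global := &nameCache{minTTL: time.Hour} // no per-name limit
//		ips := []netip.Addr{netip.MustParseAddr("104.198.14.52")}
//		// Every per-Endpoint update is mirrored into the global cache.
//		perEP.Update(now, "cilium.io", ips, 30*time.Second)
//		global.Update(now, "cilium.io", ips, 30*time.Second)
//		_ = perEP.Lookup(now, "cilium.io")
//		_ = global.Lookup(now, "cilium.io")
//	}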
//
// The same DNSCache type is used in all cases. DNSCache instances remain
// consistent regardless of update order, and merging multiple caches is
// equivalent to applying the constituent updates individually. As a result,
// DNS data is all inserted into a single global cache from which the policy
// layer receives information. This is historical and per-Endpoint handling
// can be added. The data is internally tracked per IP because overlapping DNS
// responses may have different TTLs for IPs that appear in both.
// Notes:
//   - The default configurable minimum TTL in the caches is 1 hour. This is
//     mostly for identity stability, as short TTLs would cause more identity
//     churn. This is mostly historical, as CIDR identities now have near-zero
//     allocation overhead.
//   - DNSCache deletes currently occur only when the cilium API clears the
//     cache or when the garbage collector evicts entries.
//   - The combination of per-Endpoint and global caches must manage disparate
//     pod behaviours. The worst-case scenario is one pod making many requests
//     to a target with changing IPs (like S3) while another makes few,
//     long-lived requests. We need to ensure "fairness" where one does not
//     starve the other. The limits in the per-Endpoint caches allow this, and
//     the global cache acts as a collector across different Endpoints (without
//     restrictions).
//
// Expiration of DNS data is handled by the dns-garbage-collector-job controller.
// Historically, the only expiration was TTL based, and the per-Endpoint and
// global caches would expire data at the same time without added logic. This
// no longer holds once per-host IP limits are applied in the cache; these
// default to 50 IPs for a given domain, per Endpoint. To account for these
// evictions, the controller handles both TTL and IP-limit evictions, ensuring
// that the global cache stays consistent with the per-Endpoint caches. The
// result is that the actual expiration is imprecise (TTL especially). The
// caches mark to-evict data internally and only evict it on GC method calls
// from the controller.
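//
// Continuing the illustrative nameCache sketch from above, the division of
// labour can be drawn roughly as follows: entries past their expiry are only
// removed when an explicit GC pass runs, and one controller iteration drives
// that pass for every per-Endpoint cache and the global cache. The function
// name runDNSGarbageCollection is invented for this sketch; the real logic
// lives in the dns-garbage-collector-job controller.
//
//	// GC removes expired entries from an illustrative nameCache (see the
//	// earlier sketch) and reports which names lost data, so the caller can
//	// keep the global cache and the datapath in sync.
//	func (c *nameCache) GC(now time.Time) (affectedNames []string) {
//		for name, es := range c.entries {
//			kept := es[:0]
//			for _, e := range es {
//				if e.expires.After(now) {
//					kept = append(kept, e)
//				}
//			}
//			if len(kept) != len(es) {
//				affectedNames = append(affectedNames, name)
//			}
//			c.entries[name] = kept
//		}
//		return affectedNames
//	}
//
//	// runDNSGarbageCollection is one controller iteration: per-Endpoint
//	// caches are collected first, then the global cache, and the affected
//	// names are handed to the policy layer for re-evaluation.
//	func runDNSGarbageCollection(now time.Time, perEP []*nameCache, global *nameCache) []string {
//		affected := make(map[string]struct{})
//		for _, c := range perEP {
//			for _, name := range c.GC(now) {
//				affected[name] = struct{}{}
//			}
//		}
//		for _, name := range global.GC(now) {
//			affected[name] = struct{}{}
//		}
//		names := make([]string, 0, len(affected))
//		for name := range affected {
//			names = append(names, name)
//		}
//		return names
//	}
//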
// When DNS data is evicted from any per-Endpoint cache, for any reason, each
// IP is retained as a "zombie" in type fqdn.DNSZombieMapping. These "zombies"
// represent IPs that were previously associated with a resolved DNS name, but
// the DNS name is no longer known (for example because of TTL expiry).
// However, there may still be an active connection associated with the zombie
// IP. Externally, related options use the term "deferred connection delete".
// Zombies are tracked per IP for the endpoint they come from (with a default
// limit of 10000 set by defaults.ToFQDNsMaxDeferredConnectionDeletes). When
// the Connection Tracking garbage collector runs, it marks any zombie IP that
// correlates to a live connection by that endpoint as "alive". At the next
// iteration of the dns-garbage-collector-job controller, the not-alive
// zombies are finally evicted and their IPs are no longer placed into the
// global cache on behalf of this endpoint. Other endpoints may have live DNS
// TTLs or connections to the same IPs, however, so these IPs may be inserted
// into the global cache for the same domain or a different one (or both).
//
// Note: The CT GC has a variable run period. This ranges from 30s to 12 hours
// and is shorter when more connection churn is observed (the constants are
// ConntrackGCMinInterval and ConntrackGCMaxLRUInterval in package defaults).
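//
// The zombie mechanism can be sketched as below, again with invented names
// (zombie, zombieMap); the real per-Endpoint structure is DNSZombieMappings
// in this package, and it additionally enforces the per-endpoint limit
// mentioned above. Evicted IPs are parked together with the names they
// served, the CT GC marks those that still back live connections, and the
// next DNS GC pass deletes the rest.
//
//	package main
//
//	import (
//		"net/netip"
//		"time"
//	)
//
//	// zombie is an IP evicted from a per-Endpoint lookup cache whose
//	// connections may still be alive.
//	type zombie struct {
//		names     []string  // DNS names this IP was last associated with
//		deletedAt time.Time // when the cache entry was evicted
//		alive     bool      // set by the CT GC if a live connection still uses the IP
//	}
//
//	// zombieMap is a per-Endpoint collection of zombies, keyed by IP.
//	type zombieMap map[netip.Addr]*zombie
//
//	// Upsert parks an evicted IP instead of releasing it immediately.
//	func (z zombieMap) Upsert(now time.Time, ip netip.Addr, names ...string) {
//		z[ip] = &zombie{names: names, deletedAt: now}
//	}
//
//	// MarkAlive is called while the connection-tracking GC walks live entries.
//	func (z zombieMap) MarkAlive(ip netip.Addr) {
//		if zb, ok := z[ip]; ok {
//			zb.alive = true
//		}
//	}
//
//	// GC evicts zombies that no live connection claimed since the last pass
//	// and returns them; those IPs finally stop being fed into the global
//	// cache on behalf of this endpoint.
//	func (z zombieMap) GC() (dead []netip.Addr) {
//		for ip, zb := range z {
//			if zb.alive {
//				zb.alive = false // must be re-marked before the next pass
//				continue
//			}
//			dead = append(dead, ip)
//			delete(z, ip)
//		}
//		return dead
//	}
//
//	func main() {
//		z := zombieMap{}
//		ip := netip.MustParseAddr("104.198.14.52")
//		z.Upsert(time.Now(), ip, "cilium.io")
//		z.MarkAlive(ip) // the CT GC found a live connection to this IP
//		_ = z.GC()      // ip survives this pass; unmarked zombies are evicted
//	}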
//
// === Flow of DNS data ===
//
//	+---------------------+
//	|      DNS Proxy      |
//	+----------+----------+
//	           |
//	           v
//	+----------+----------+
//	| per-EP Lookup Cache |
//	+----------+----------+
//	           |
//	           v
//	+----------+----------+
//	| per-EP Zombie Cache |
//	+----------+----------+
//	           |
//	           v
//	+----------+----------+
//	|  Global DNS Cache   |
//	+----------+----------+
//	           |
//	           v
//	+----------+----------+
//	|     NameManager     |
//	+----------+----------+
//	           |
//	           v
//	+----------+----------+
//	|   Policy toFQDNs    |
//	|      Selectors      |
//	+----------+----------+
//	           |
//	           v
//	+----------+----------+
//	|   per-EP Datapath   |
//	+---------------------+
package fqdn