// SPDX-License-Identifier: Apache-2.0
// Copyright Authors of Cilium

// Package fqdn handles some of the DNS-based policy functions:
//   - A DNS lookup cache used to populate toFQDNs rules in the policy layer.
//   - A NameManager that coordinates distributing IPs to matching toFQDNs
//     selectors.
//   - A DNS Proxy that applies L7 DNS rules and populates the lookup cache
//     with IPs from allowed/successful DNS lookups.
//   - (deprecated) A DNS Poller that actively polls all L3 toFQDNs.MatchName
//     entries and populates the DNS lookup cache.
//
// Note: There are two different requests handled here: the DNS lookup and the
// subsequent connection to the domain resolved by that lookup.
//
// Proxy redirection and L3 policy calculations are handled by the datapath and
// policy layer, respectively.
//
// DNS data is tracked per-endpoint but collected globally in each cilium-agent
// when calculating policy. This differs from toEndpoints rules, which use
// cluster-global information, and toCIDR rules, which use static information
// in the policy. toServices rules are similar but they are cluster-global and
// have neither a TTL nor a distinct lookup request from the endpoint.
// Furthermore, toFQDNs cannot handle in-cluster IPs but toServices can.
//
//	+-------------+   +----------------+        +---------+     +---------+
//	|             |   |                |        |         |     |         |
//	|             +<--+  NameManager   +<-------+         |     |         |
//	|             |   |                | Update |         |     |         |
//	|   Policy    |   +-------+--------+ Trigger|   DNS   |     |         |
//	|  Selectors  |           ^                 |  Proxy  +<--->+ Network |
//	|             |           |                 |         |     |         |
//	|             |   +-------+--------+        |         |     |         |
//	|             |   |      DNS       |        |         |     |         |
//	|             |   |  Lookup Cache  +<-------+         |     |         |
//	+------+------+   |                |  DNS   +----+----+     +----+----+
//	       |          +----------------+  Data       ^               ^
//	       v                                         |               |
//	+------+------+--------------------+             |               |
//	|             |                    |             |               |
//	|  Datapath   |                    |             |               |
//	|             |          DNS Lookup|             |               |
//	+-------------+                    +<------------+               |
//	|                                  |                             |
//	|     Pod                          |                             |
//	|                                  |  HTTP etc.                  |
//	|                                  +<----------------------------+
//	|                                  |
//	+----------------------------------+
//
// === L7 DNS ===
// L7 DNS is handled by the DNS Proxy. The proxy is always running within
// cilium-agent but traffic is only redirected to it when an L7 rule includes a
// DNS section such as:
//
//	---
//	- toEndpoints:
//	  toPorts:
//	  - ports:
//	    - port: "53"
//	      protocol: ANY
//	    rules:
//	      dns:
//	      - matchPattern: "*"
//	      - matchName: "cilium.io"
//
// These redirects are implemented by the datapath, and the management logic is
// shared with the other proxies in Cilium (Envoy and Kafka). L7 DNS rules can
// apply to an endpoint from various policies and, if any of them allows a
// request, it is forwarded to the original target of the DNS packet. This
// target is usually the DNS server configured in the pod's /etc/resolv.conf,
// which k8s sets automatically
// (https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-config).
// In the example above, `matchPattern: "*"` allows all requests and makes
// `matchName: "cilium.io"` redundant.
// Notes:
//   - The forwarded requests are sent from cilium-agent on the host interface
//     and not from the endpoint.
//   - Users must explicitly allow `*.*.svc.cluster.local.` in k8s clusters.
//     This is not automatic.
//   - L7 DNS rules are egress-only.
//   - The proxy emits L7 cilium-monitor events: one for the request, an
//     accept/reject event, and the final response.
//
// Apart from allowing or denying DNS requests, the DNS proxy is used to
// observe DNS lookups in order to then allow L3 connections with the response
// information. These connections must be separately allowed with toFQDNs L3
// rules. The example above is a common "visibility" policy that allows all
// requests but ensures that they traverse the proxy. This information is then
// placed in the per-Endpoint and global DNS lookup caches and propagates from
// there.
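//
// As a minimal, self-contained sketch (this is not the matching code used by
// the DNS proxy, and the "*" handling below is a simplification), matchName
// and matchPattern entries can be thought of as anchored regular expressions
// over the lowercased FQDN form of the query name:
//
//	package main
//
//	import (
//		"fmt"
//		"regexp"
//		"strings"
//	)
//
//	// patternToRegexp turns a DNS-rule pattern such as "*" or "cilium.io"
//	// into an anchored regular expression. "*" is treated as a wildcard over
//	// DNS name characters; the real logic in this package is stricter.
//	func patternToRegexp(pattern string) *regexp.Regexp {
//		p := strings.ToLower(pattern)
//		if !strings.HasSuffix(p, ".") {
//			p += "." // normalize to FQDN form with a trailing dot
//		}
//		p = regexp.QuoteMeta(p)                         // escape the literal dots
//		p = strings.ReplaceAll(p, `\*`, `[-a-z0-9_.]*`) // re-introduce the wildcard
//		return regexp.MustCompile("^" + p + "$")
//	}
//
//	func main() {
//		allowAll := patternToRegexp("*")           // matchPattern: "*"
//		onlyCilium := patternToRegexp("cilium.io") // matchName: "cilium.io"
//
//		for _, q := range []string{"cilium.io.", "example.com."} {
//			fmt.Printf("%-13s allowAll=%-5v onlyCilium=%v\n",
//				q, allowAll.MatchString(q), onlyCilium.MatchString(q))
//		}
//	}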
//
// === L3 DNS ===
// L3 DNS rules control L3 connections and not the DNS requests themselves.
// They rely on DNS lookup cache information, which must come from the DNS
// proxy, i.e. via an L7 DNS rule.
//
//	---
//	- toFQDNs:
//	  - matchName: "my-remote-service.com"
//	  - matchPattern: "bucket.*.my-remote-service.com"
//
// IPs seen in a DNS response (i.e. the request was allowed by an L7 policy)
// that are also selected by a matchPattern or matchName in an L3 DNS rule have
// a /32 or /128 CIDR identity created for them. This occurs when they are
// first passed to the toFQDN selectors from NameManager. These identities are
// not special in any way and can overlap with toCIDR rules in policies. They
// are placed in the node-local ipcache and in the policy map of each endpoint
// that is allowed to connect to them (i.e. defined in the L3 DNS rule).
// Notes:
//   - Generally speaking, toFQDNs can only handle non-cluster IPs. In-cluster
//     policy should use toEndpoints and toServices. This is partly historical,
//     but is also due to ipcache limitations when mapping ip->identity:
//     endpoint identities can clobber the FQDN IP identity.
//   - Despite being tracked per-Endpoint, DNS lookup IPs are collected into a
//     global cache. This is historical and can be changed.
//     The original implementation created policy documents in the policy
//     repository to represent the IPs being allowed and could not distinguish
//     between endpoints. The current implementation uses selectors that also
//     do not distinguish between Endpoints. There is some provision for this,
//     however, and it just requires better plumbing in how we place data in
//     the Endpoint's datapath.
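//
// As a small illustration of the granularity involved (this is not the
// identity-allocation code; the hostPrefix helper below is hypothetical), each
// resolved address corresponds to a single-address prefix, /32 for IPv4 and
// /128 for IPv6:
//
//	package main
//
//	import (
//		"fmt"
//		"net/netip"
//	)
//
//	// hostPrefix returns the single-address CIDR for addr, e.g.
//	// 198.51.100.10/32 or 2001:db8::10/128.
//	func hostPrefix(addr netip.Addr) netip.Prefix {
//		return netip.PrefixFrom(addr, addr.BitLen())
//	}
//
//	func main() {
//		for _, a := range []string{"198.51.100.10", "2001:db8::10"} {
//			fmt.Println(hostPrefix(netip.MustParseAddr(a)))
//		}
//	}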
//
// === Caching, Long-Lived Connections & Garbage Collection ===
// DNS requests are distinct traffic from the connections that pods make with
// the response information. This makes it difficult to correlate one DNS
// lookup to a later connection; a pod may reuse the IPs in a DNS response an
// arbitrary time after the lookup occurred, even past the DNS TTL. The
// solution is multi-layered for historical reasons:
//   - Keep a per-Endpoint cache that can be stored to disk and restored on
//     startup. These caches apply TTL expiration and limit the IP count per
//     domain.
//   - Keep a global cache to combine all this DNS information and send it to
//     the policy system. This cache applies TTL but not per-domain limits.
//     This causes a DNS lookup in one endpoint to leak to another!
//   - Track live connections allowed by DNS policy and delay expiring that
//     data while the connection is open. If the policy itself is removed,
//     however, the connection is interrupted.
//
// The same DNSCache type is used in all cases. DNSCache instances remain
// consistent regardless of update order, and merging multiple caches is
// equivalent to applying the constituent updates individually. As a result,
// DNS data is all inserted into a single global cache from which the policy
// layer receives information. This is historical and per-Endpoint handling can
// be added. The data is internally tracked per IP because overlapping DNS
// responses may have different TTLs for IPs that appear in both.
// Notes:
//   - The default (configurable) minimum TTL in the caches is 1 hour. This is
//     mostly for identity stability, as short TTLs would cause more identity
//     churn. It is largely historical now that CIDR identities have near-zero
//     allocation overhead.
//   - DNSCache deletes currently only occur when the cilium API clears the
//     cache or when the garbage collector evicts entries.
//   - The combination of caches (per-Endpoint and global) must manage
//     disparate pod behaviours. The worst-case scenario is one where one pod
//     makes many requests to a target with changing IPs (like S3) while
//     another makes few, long-lived requests. We need to ensure "fairness",
//     where one does not starve the other. The limits in the per-Endpoint
//     caches allow this, and the global cache acts as a collector across
//     different Endpoints (without restrictions).
//
// Expiration of DNS data is handled by the dns-garbage-collector-job
// controller. Historically, the only expiration was TTL-based and the
// per-Endpoint and global caches would expire data at the same time without
// added logic. This is not true when we apply per-host IP limits in the cache.
// These default to 50 IPs for a given domain, per Endpoint. To account for
// these evictions, the controller handles both TTL and IP-limit evictions.
// This ensures that the global cache is consistent with the per-Endpoint
// caches. The result is that the actual expiration is imprecise (TTL
// especially). The caches mark to-evict data internally and only evict it on
// GC method calls from the controller.
// When DNS data is evicted from any per-Endpoint cache, for any reason, each
// IP is retained as a "zombie" in type fqdn.DNSZombieMapping. These "zombies"
// represent IPs that were previously associated with a resolved DNS name, but
// the DNS name is no longer known (for example because of TTL expiry).
// However, there may still be an active connection associated with the zombie
// IP. Externally, related options use the term "deferred connection delete".
// Zombies are tracked per IP for the endpoint they come from (with a default
// limit of 10000 set by defaults.ToFQDNsMaxDeferredConnectionDeletes). When
// the Connection Tracking garbage collector runs, it marks any zombie IP that
// correlates to a live connection by that endpoint as "alive". At the next
// iteration of the dns-garbage-collector-job controller, the not-live zombies
// are finally evicted. These IPs are then no longer placed into the global
// cache on behalf of this endpoint. Other endpoints may have live DNS TTLs or
// connections to the same IPs, however, so these IPs may be inserted into the
// global cache for the same domain or a different one (or both).
// Note: The CT GC has a variable run period. This ranges from 30s to 12 hours
// and is shorter when more connection churn is observed (the constants are
// ConntrackGCMinInterval and ConntrackGCMaxLRUInterval in package defaults).
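//
// The following self-contained sketch is illustrative only: it is not the
// DNSCache or DNSZombieMappings implementation, and it collapses the separate
// CT GC "mark alive" pass and the DNS GC eviction pass into a single call.
// It shows the three behaviours described above: TTL expiry, a per-name IP
// limit, and zombie retention for IPs that may still back a live connection:
//
//	package main
//
//	import (
//		"fmt"
//		"time"
//	)
//
//	// entry is one cached IP for a DNS name, with its expiry time.
//	type entry struct {
//		ip      string
//		expires time.Time
//	}
//
//	// epCache is a toy per-Endpoint cache: names map to IPs with TTLs, and
//	// IPs evicted for any reason are kept as zombies until a GC pass
//	// confirms that no live connection still uses them.
//	type epCache struct {
//		names   map[string][]entry
//		perName int             // per-name IP limit (50 per domain by default)
//		zombies map[string]bool // zombie IP -> marked alive by the CT GC?
//	}
//
//	func newEPCache(perName int) *epCache {
//		return &epCache{
//			names:   map[string][]entry{},
//			perName: perName,
//			zombies: map[string]bool{},
//		}
//	}
//
//	// update records a lookup result, evicting the oldest IPs over the
//	// per-name limit. Evicted IPs become zombies rather than disappearing.
//	func (c *epCache) update(now time.Time, name string, ips []string, ttl time.Duration) {
//		for _, ip := range ips {
//			c.names[name] = append(c.names[name], entry{ip, now.Add(ttl)})
//		}
//		for len(c.names[name]) > c.perName {
//			c.zombies[c.names[name][0].ip] = false // not yet known to be alive
//			c.names[name] = c.names[name][1:]
//		}
//	}
//
//	// gc expires entries past their TTL (turning them into zombies), marks
//	// zombies with a live connection as alive, and evicts the rest.
//	func (c *epCache) gc(now time.Time, liveConns map[string]bool) {
//		for name, entries := range c.names {
//			kept := entries[:0]
//			for _, e := range entries {
//				if now.After(e.expires) {
//					c.zombies[e.ip] = false
//				} else {
//					kept = append(kept, e)
//				}
//			}
//			c.names[name] = kept
//		}
//		for ip := range c.zombies {
//			c.zombies[ip] = liveConns[ip] // CT GC: mark alive while connected
//		}
//		for ip, alive := range c.zombies {
//			if !alive {
//				delete(c.zombies, ip) // DNS GC: dead zombies are finally evicted
//			}
//		}
//	}
//
//	func main() {
//		c := newEPCache(2)
//		now := time.Now()
//		c.update(now, "s3.example.com.", []string{"192.0.2.1", "192.0.2.2", "192.0.2.3"}, time.Minute)
//		// 192.0.2.1 exceeded the per-name limit and became a zombie; after the
//		// TTL passes, only the IP with a live connection survives as a zombie.
//		c.gc(now.Add(2*time.Minute), map[string]bool{"192.0.2.2": true})
//		fmt.Println(c.names, c.zombies) // map[s3.example.com.:[]] map[192.0.2.2:true]
//	}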
//
// === Flow of DNS data ===
//
//	+---------------------+
//	|      DNS Proxy      |
//	+----------+----------+
//	           |
//	           v
//	+----------+----------+
//	| per-EP Lookup Cache |
//	+----------+----------+
//	           |
//	           v
//	+----------+----------+
//	| per-EP Zombie Cache |
//	+----------+----------+
//	           |
//	           v
//	+----------+----------+
//	|  Global DNS Cache   |
//	+----------+----------+
//	           |
//	           v
//	+----------+----------+
//	|     NameManager     |
//	+----------+----------+
//	           |
//	           v
//	+----------+----------+
//	|   Policy toFQDNs    |
//	|      Selectors      |
//	+----------+----------+
//	           |
//	           v
//	+----------+----------+
//	|   per-EP Datapath   |
//	+---------------------+
package fqdn