# GEP-1364: Status and Conditions Update

* Issue: [#1364](https://github.com/kubernetes-sigs/gateway-api/issues/1364)
* Status: Standard

## TLDR

Status, and particularly Conditions, has grown organically across the whole
Gateway API, and so has many inconsistencies and odd behaviors.
This GEP covers doing a review and consolidation to make Condition behavior consistent
across the whole API.

## Goals

* Update Conditions design to be consistent across Gateway API resources
* Provide a model and guidelines for Conditions for future new resources
* Specify changes to conformance required for Condition updates

## Non-Goals

* Define the full set of Conditions that will ever be used with Gateway API

## Introduction

Gateway API currently has a lot of issues related to status, especially that
status is inconsistent ([#1111][1111]), that names are hard to understand ([#1110][1110]),
and that Reasons aren't explained properly ([#1362][1362]).

As the API has grown, the way we talk about resources has changed a lot, and some of the
status design hasn't been updated since resources were created.

So, for example, we have GatewayClass with `Accepted`, Gateway with `Scheduled`,
the Gateway Listeners with `Detached` (which you want to be `false`, unlike the previous
two), and then Gateways and Gateway Listeners have `Ready` (which you also want to
be `true`), but Route doesn't.

This document lays out large-scale changes to the way that we talk about resources,
and the Conditions to match them. This means that there will be an unavoidable break
in what constitutes a healthy or unhealthy resource, and there will be changes
required for all implementations to be conformant with the release that includes
these changes.

The constants that mark the deprecated types will also be marked as deprecated,
and will no longer be tested as part of conformance. They'll still be present,
and will work, but they won't be part of the spec any more. This should give
implementations and users a release to transition to the new design (in UX terms).
This grace period should be one release (so, the constants will be removed in
v0.7.0.)

This level of change is not optimal, and the intent is to make this a one-off change
that can be built upon for future resources - since there are definitely more resources
on the way.

## Background: Kubernetes API conventions and prior art on Conditions

Because this GEP is mainly concerned with updating the Conditions we are setting in
Gateway API resources' `status`, it's worth reviewing some important points about
Conditions. (This information is mainly taken from the [Typical status properties][typstatus]
section of the API conventions document.)

1. Conditions are a standard type used to represent arbitrary higher-level status from
   a controller.
2. They are a listMapType, a list that is enforced by the apiserver to have only
   one entry of each item, using the `type` field as a key. (So, this is effectively
   a map that looks like a list in YAML form).
3. Each has a number of fields, the most important of which for this discussion
   are `type`, `status`, `reason`, and `observedGeneration`.

    * `type` is a string value indicating the Condition type. `Accepted`, `Scheduled`,
      and `Ready` are current examples.
    * `status` indicates the state of the condition, and can be one of three values,
      `True`, `False`, or `Unknown`. `Unknown` in particular is important, because it
      means that the controller is unable to determine the status for some reason.
      (Also notable is that `""` is also valid, and must be treated as `Unknown`.
      Controllers must not set the value to `""`, but consumers should accept it
      as meaning the same thing as `Unknown`.)
    * `reason` is a CamelCase string that is a brief description of the reason why
      the `status` is set the way it is.
    * `observedGeneration` is an optional field that records what the `metadata.generation`
      field was when the controller last saw a resource. Note that this is optional
      _in the struct_, but is required for Gateway API conditions. This will be
      enforced in the conformance tests in the future.

4. Conditions should describe the _current state_ of the resource at observation
   time, which means that they should be an adjective (like `Ready`), or a past-tense
   verb (like `Accepted`). This one in particular is documented pretty closely in the
   [Typical status properties][typstatus] section of the guidelines.
5. Conditions should be applied to a resource the first time the controller sees
   the resource. This seems to imply that _all conditions should be present on every
   resource owned by a controller_, but the rest of the conventions don't make this
   clear, and it is often not complied with.
6. It's helpful to have a top-level condition which summarizes more detailed conditions.
   The guidelines suggest using either `Ready` for long-running processes, or `Succeeded`
   for bounded execution.

From these guidelines, we can see that Conditions can be either _positive polarity_
(healthy resources have them as `status: true`) or _negative polarity_ (healthy
resources have them as `status: false`). `Ready` is an example of a positive polarity
condition, and conditions like `Conflicted` from Listener or `NetworkUnavailable`,
`MemoryPressure`, or `DiskPressure` from the Node resource are examples of
negative-polarity conditions.
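
To make the field semantics above concrete, here is a minimal Go sketch of how a consumer might read Conditions. The `Condition` struct is a simplified, hypothetical stand-in that keeps only the four fields discussed here, and the helper names (`NormalizeStatus`, `FindCondition`, `IsCurrent`) are illustrative assumptions rather than part of any Gateway API library.

```go
package main

import "fmt"

// ConditionStatus holds one of the three status values from the API conventions.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// Condition is a simplified, hypothetical stand-in for the standard
// Condition type; only the fields discussed in the text are included.
type Condition struct {
	Type               string
	Status             ConditionStatus
	Reason             string
	ObservedGeneration int64
}

// NormalizeStatus treats the empty string as Unknown, as consumers should.
func NormalizeStatus(s ConditionStatus) ConditionStatus {
	if s == "" {
		return ConditionUnknown
	}
	return s
}

// FindCondition returns the condition with the given type. It relies on the
// list-map guarantee that each type appears at most once in the list.
func FindCondition(conds []Condition, condType string) (Condition, bool) {
	for _, c := range conds {
		if c.Type == condType {
			return c, true
		}
	}
	return Condition{}, false
}

// IsCurrent reports whether the condition was computed against the given
// metadata.generation of the resource, that is, whether it is not stale.
func IsCurrent(c Condition, generation int64) bool {
	return c.ObservedGeneration == generation
}

func main() {
	// A resource at metadata.generation 3 whose Ready condition is stale
	// and has an empty status.
	conds := []Condition{
		{Type: "Accepted", Status: ConditionTrue, Reason: "Accepted", ObservedGeneration: 3},
		{Type: "Ready", Status: "", Reason: "Pending", ObservedGeneration: 2},
	}
	for _, t := range []string{"Accepted", "Ready"} {
		c, _ := FindCondition(conds, t)
		fmt.Printf("%s: status=%s current=%t\n", c.Type, NormalizeStatus(c.Status), IsCurrent(c, 3))
	}
}
```

Comparing `observedGeneration` with the resource's current generation is what lets a reader tell a fresh `True` from one that was computed before the latest spec change.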

There is also some extra context that's not in the API conventions doc:

SIG-API Machinery has been reluctant to add fields that would aid in machine-parsing
of Conditions, especially fields that would indicate the polarity, because they
are intended more for human consumption than machine consumption. Probably the best
example of this was in the PR [#4521](https://github.com/kubernetes/community/pull/4521#issuecomment-64894206).

This means that there's no guidance from upstream about condition polarity. We'll
discuss this more when we talk about new conditions.

The guidance about Conditions being added as soon as a controller sees a resource
is a bit unclear - as written in the conventions, it seems to imply that _all_
relevant conditions should always be added, even if their status has to be set to
`unknown`.
Gateway API resources do not currently require this, and the practice seems to be
uncommon.

## Proposed changes

### Proposed changes summary

* All the current Conditions that indicate that the resource is okay and ready
  for processing will be replaced with `Accepted`.
* In general, resources should be considered `Accepted` if their config is valid
  enough to generate some config in the underlying data plane. Examples are provided
  below.
* There will be a limited set of positive polarity summary conditions, and a number
  of other specific negative-polarity error conditions.
* All relevant positive-polarity summary Conditions for a resource must be added
  when it's observed.
  For example, HTTPRoutes must always have `Accepted` and `ResolvedRefs`, regardless
  of their state.
* Negative polarity error conditions must only be added when the error is True.
* The `Ready` condition will be moved to Extended conformance, and we'll re-evaluate
  if it's used by any implementations after some time has passed.
  If not, it may be removed.
* To capture the behavior that `Ready` currently captures, `Programmed` will be
  introduced. This means that the implementation has seen the config, has everything
  it needs, parsed it, and sent configuration off to the data plane. The configuration
  should be available "soon". We'll leave "soon" undefined for now.
* To resolve a comment that came up, documentation will be added to clarify that
  it's okay to add your own Conditions, and that implementations should namespace
  their custom Conditions with a domain prefix (so `implementation.io/CustomType`
  rather than just `CustomType`), or run the risk of using a word that's reserved later.
* It's recommended that implementations publish both new and old conditions to
  provide a smoother transition, but conformance tests will only require the new
  conditions.

The exact list of changes is detailed below. The next few sections detail
the reasons for these large-scale changes.

### Conceptual and language changes

Gateway API resources are, conceptually, all about breaking up the configuration for a
data plane into separate resources that are _expressive_ and _extensible_, while being
split up along _role-oriented_ boundaries.

So, when we talk about Gateway API, it's _always_ about a _system of related resources_.

We already acknowledge this when we talk about Routes "attaching" to Gateways, or Gateways
referencing Services, or Gateways requiring a GatewayClass in their spec.

However, this GEP is proposing that we move all our discussion into using
"accepted" to indicate that a resource has attached correctly enough to be
_accepted_ for processing.

So resources are `Accepted` for processing when their attachment succeeds enough
to generate some configuration. This allows us to make calls about when partially
valid objects should be accepted and when they shouldn't.

Of course, because we're using all of this configuration to describe some sort of data
path from "outside"/lacking cluster context to "inside"/enriched with cluster context,
we also need a way to describe when that data path is configured and working.

We already have a word in the Kubernetes API, but it comes with some expectations
that implementations are not currently able to meet. That word is `Ready`, but it
implies that the data path is Ready _when you read the status_, rather than that
it _will be ready soon_ (which is what most implementations can guarantee currently.)

So we have an unresolved question as to what to do with the `Ready` condition.
This is addressed further below.

### Condition polarity

In terms of the polarity of conditions, we have three options, of which only two are
really viable:

* All conditions must be negative polarity
* All conditions must be positive polarity
* Some conditions can be positive polarity, but most should be negative.

The user experience of `Ready`, or conditions like `Accepted`, being `true`
in the healthy case is much better, which rules out the first option. So we are left to
decide between enforcing that all conditions are positive, or that we have a mix.

Having an arbitrary mix will make doing machine-based extraction of information
much harder, so here I'm going to talk about the distinction between having all
conditions positive, or having some summary conditions positive and the rest negative.

#### All Conditions Positive

In this case, all Condition types are written in such a way that they're positive
polarity, and are `true` in the healthy case.

As already discussed, `Ready` and `Accepted` are current examples, but another
one that's a little more important here is `ResolvedRefs`, which is set to `true`
when all references to other resources have been successfully resolved. This is
not a _blocking_ Condition that affects the `Ready` condition, since having _some_
references valid is enough to produce some configuration in the underlying data
plane.

So, All Conditions Positive pros:

* We're close already. Most conditions in the API are currently positive polarity.
* Easier to understand - there are no double negatives. "Good: true" is less
  cognitive overhead than "NotGood: false".

Cons:

* Reduces flexibility - it can be surprisingly difficult to avoid double negatives for
  conditions that describe error states, as in general programmers are more used
  to reporting "something went wrong" than they are "everything's okay".

Not sure if pro or con:

* Leans the design towards favoring conditions always being present, since you
  can't be sure if everything is good unless you see `AllGood: true`. The absence
  of a positive-polarity condition implies that the condition could be false. This
  puts this option more in line with the API guidelines on this point.

#### Some Conditions Positive

In this case, only a limited set of summary conditions are positive, and the rest
are negative.

Pros:

* Error states can be described with `Error: true` instead of `NoError: false`.
* Negative polarity error conditions are more friendly to not being present (since
  absence of `Error: true` implies everything's okay).

Cons:

* Any code handling conditions will need a list of the positive ones, and will
  need to assume that any others are negative.
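
The con above can be sketched in Go roughly as follows. This is an illustration under stated assumptions, not a prescribed implementation: the `Condition` struct is a simplified stand-in, the contents of `positiveTypes` are only examples drawn from the conditions mentioned so far, and the names `Healthy` and `Unhealthy` are hypothetical.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for a status condition.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// positiveTypes is the list a consumer has to carry around under the
// "some conditions positive" option. Its contents are illustrative only.
var positiveTypes = map[string]bool{
	"Accepted":     true,
	"ResolvedRefs": true,
	"Ready":        true,
}

// Healthy reports whether a single condition is in its healthy state.
// Known positive types must be "True"; every other type is assumed to be
// negative polarity, so it is unhealthy only when it is "True".
func Healthy(c Condition) bool {
	if positiveTypes[c.Type] {
		return c.Status == "True"
	}
	return c.Status != "True"
}

// Unhealthy returns the types of all conditions that are not healthy,
// in the order they appear in the input.
func Unhealthy(conds []Condition) []string {
	var bad []string
	for _, c := range conds {
		if !Healthy(c) {
			bad = append(bad, c.Type)
		}
	}
	return bad
}

func main() {
	conds := []Condition{
		{Type: "Accepted", Status: "True"},
		{Type: "ResolvedRefs", Status: "False"},
		{Type: "Conflicted", Status: "True"},
	}
	fmt.Println(Unhealthy(conds))
}
```

Note that this sketch only inspects conditions that are present. A missing positive-polarity condition would go unnoticed, which is one reason the question of whether conditions must always be set matters.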

#### Decision

Gateway API conditions will be positive for conditions that describe the happy
state of the object, which are currently `Accepted` and `ResolvedRefs`, and will
also include the new `Programmed` condition, and the newly-Extended condition
`Ready`. A separate set of negative-polarity Error conditions will be set on an
object when they are true.


### Should conditions always be added?

Not all of them.

Positive polarity Conditions that describe the desirable state of the object must
always be set. These are currently `Accepted`, `ResolvedRefs`, and `Programmed`.
Implementations that use `Ready` must also add it before programming the Route.

### Partial validity and Conditions

One of the trickiest parts of Gateway API objects is that it's very possible to
end up with an object that has some parts with valid configuration and some that
don't. We refer to this as _partial validity_, and communicating this via status
conditions is difficult.

The intent with the `Accepted` condition is that it serves as an indicator that
_something_ is working, that _some traffic_ from what the config specifies will
be routed as configured.

At this time, we haven't added a "no errors at all present" Condition, choosing
to have a "some config is working" condition, with specific errors to aid in
finding the exact problem with the objects. We could conceivably add this later
if users find `Accepted` insufficient, but we're erring on the side of having
fewer positive Conditions for now.

### New and Updated Conditions

#### `Accepted`

This GEP proposes replacing all conditions that indicate syntactic and semantic
validity with a single `Accepted` condition type.

That is, the proposal is to replace:

* `Scheduled` on Gateway
* `Detached` on Listener

with `Accepted` in all these locations.
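
As a rough illustration of this replacement, the Go sketch below derives an `Accepted` condition from one of the deprecated types. It assumes that `Scheduled` carries over with the same status and that `Detached`, being negative polarity, carries over with its status inverted. That mapping is an approximation made for the example (`Accepted` has a broader meaning than either deprecated type), and the `Condition` struct and function names are hypothetical.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for a status condition.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// invert flips "True" and "False" and leaves any other value untouched,
// so "Unknown" stays "Unknown".
func invert(status string) string {
	switch status {
	case "True":
		return "False"
	case "False":
		return "True"
	default:
		return status
	}
}

// AcceptedFromDeprecated derives an Accepted condition from a deprecated
// condition on the given kind. The second return value is false when the
// input is not one of the two deprecated types being replaced.
// Assumption: Scheduled maps directly, Detached maps with inverted status.
func AcceptedFromDeprecated(kind string, old Condition) (Condition, bool) {
	switch {
	case kind == "Gateway" && old.Type == "Scheduled":
		return Condition{Type: "Accepted", Status: old.Status}, true
	case kind == "Listener" && old.Type == "Detached":
		return Condition{Type: "Accepted", Status: invert(old.Status)}, true
	}
	return Condition{}, false
}

func main() {
	c, ok := AcceptedFromDeprecated("Listener", Condition{Type: "Detached", Status: "False"})
	fmt.Println(c.Type, c.Status, ok)
}
```

A helper along these lines could be one way for an implementation to publish both old and new conditions during the transition release.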

GatewayClass and Route will maintain the `Accepted` condition.

All of these conditions share the following meanings:

* The resource has been accepted for processing by the controller
* The resource is syntactically and semantically valid, and internally consistent
* The resource fits into a larger system of Gateway API resources, and there
  is no missing information, including but not limited to:
  * Any mandatory references resolve to existing resources (examples here are the
    Gateway's gatewayClass field, or the `parentRefs` field in Route resources)
  * Any specified TLS secrets exist
* The resource is supported by the controller, which means things like:
  * Any Kinds being referred to by the resource are supported
  * Features being used by the resource are supported

All of these rules can be summarized into:

* The resource is valid enough to produce some configuration in the underlying
  data plane.

For Gateway, `Accepted` also subsumes the functions of `Scheduled`: `Accepted`
set to `true` means that sufficient capacity exists on underlying infrastructure
for the Gateway to be provisioned. If that capacity does not exist, then the
Gateway cannot be reconciled successfully, and so fails to attach to the
owning GatewayClass, and cannot be accepted.

Note that some classes of inter-resource reference failure do _not_ cause a resource
to become unattached and stop being accepted (that is, to have the `Accepted`
condition set to `status: false`).

* Non-existent Service backends - if the backend does not exist on a HTTPRoute that
  is otherwise okay, then the data plane must generate 500s for traffic that matches
  that HTTPRoute. In this case, the `Accepted` Condition must be true, and the
  `ResolvedRefs` Condition must be false, with reasons and messages indicating that
  the backend services do not exist.
* HTTPRoutes with *all* backends in other namespaces, but not permitted by ReferenceGrants.
  In this case, the "non-existent service backends" rules apply, and 500s must be
  generated. Again, the `Accepted` condition is true, and the
  `ResolvedRefs` Condition is false, with reasons and messages indicating that the
  backend services are not reachable.

For ReferenceGrant or not-designed-yet Policy resources, `Accepted` means that:

* the resource has a correctly-defined set of resources that it applies to
* the resource has a syntactically and semantically valid `spec`

Note that having a correctly-defined set of resources that is empty does not make
these resources unattached, as long as it's possible to create some config in the
underlying data plane. By "empty" here we mean that there are no backends,
not that the config is incomplete or missing references. So you can have a
GatewayClass, Gateway, HTTPRoute and Service all present and referred to correctly
when there are no endpoints in the Service, and the resource will not stop being
accepted, because HTTPRoute contains rules about what to program in the data plane
if there are no endpoints (that is, it should return 500 for any matching request).

Note that for other Route types that don't have a clear mechanism like HTTP does
for indicating a server failure (like the HTTP code 500 does), not having existing
backends may not produce any configuration in the data plane, and so may cause
the resource to fail to attach. (An example here could be a TCP Route with
no backends: we need to decide if that means that a port should be opened that
actively closes connections, or if no port should be opened.)

Examples of Conditions:

* HTTPRoute with one match with one backend that is valid. `Accepted` is true,
  `ResolvedRefs` is true.
* HTTPRoute with one match with one backend that is a non-existent Service backend.
  The `Accepted` Condition is true, the `ResolvedRefs` condition is false, with
  a reason of `BackendNotFound`. `Accepted` is true in this case because the data
  path must respond to requests that would be sent to that backend with a 500 response.
* HTTPRoute with one match with two backends, one of which is a non-existent Service
  backend. The `Accepted` Condition is true, the `ResolvedRefs` condition is false.
  `Accepted` is true in this case because the data path must respond with a 500 to the
  percentage of requests matching the rule that corresponds to the weighting of the
  non-existent backend (which would be fifty percent unless weights are applied).
* HTTPRoute with one match with one backend that is in a different namespace, and
  does _not_ have a ReferenceGrant permitting that access. The `Accepted` condition
  is true, and the `ResolvedRefs` Condition is false, with a reason of `RefNotPermitted`.
  As before, `Accepted` is true because in this case, the data path must be
  programmed with 500s for the match.
* TCPRoute with one match with a backend that is a non-existent Service. `Accepted`
  is false, and `ResolvedRefs` is false. `Accepted` is false in this case because
  there is not enough information to program any rules to handle the traffic in the
  underlying data plane - TCP doesn't have a way to say "this is a valid destination
  that has something wrong with it".
* HTTPRoute with one filter with Custom support that the implementation does not
  support. Our spec is currently unclear on what happens in this case, but
  custom HTTP Filters require the use of the `ExtensionRef` filter type, and the
  setting of the ExtensionRef field to the name, group, version, and kind of a
  custom resource that describes the filter.
  If that custom resource is not supported,
  it seems reasonable to say that this should be a reference failure, and be treated
  like other reference failures (`Accepted` will be set to true, `ResolvedRefs` to
  false with an `InvalidKind` Reason, and traffic that would have matched the filter
  should receive a 500 error.)
* A HTTPRoute with one rule that specifies a HTTPRequestRedirect filter _and_ a
  HTTPURLRewrite filter. `Accepted` must be false, because there's only one rule,
  and this configuration for the rule is invalid (see [reference][httpreqredirect]).
  The error condition in this case is undefined currently - we should define it,
  thanks @sunjayBhatia.
* A HTTPRoute with two rules, one valid and one which specifies a HTTPRequestRedirect
  filter _and_ a HTTPURLRewrite filter. `Accepted` is true, because the valid rule
  can produce some config in the data plane. We'll need to raise the more specific
  error condition for an incompatible filter combination as well to make the partial
  validity clear.


#### Ready

Currently, the `Ready` condition text for Gateway says:
```go
// This condition is true when the Gateway is expected to be able
// to serve traffic. Note that this does not indicate that the
// Gateway configuration is current or even complete (e.g. the
// controller may still not have reconciled the latest version,
// or some parts of the configuration could be missing).
```

This is pretty unclear - how can the Gateway serve traffic if config is missing?
In the past, we've been asked to have a Condition that only flips to `true` when
*all* required configuration is present.

For many implementations (certainly for Envoy-based ones), getting this information
correctly and avoiding races on applying it is surprisingly difficult.

For this reason, this GEP proposes that we exclude the `Ready` condition from Core
conformance, and make it a feature that implementations may opt in to - making it
an Extended condition.

It will have the following behavior:

* `Ready` is an optional Condition that has Extended support, with conformance
  tests to verify the behavior.
* When it's set, the condition indicates that traffic is ready to flow through
  the data plane _immediately_, not at some eventual point in the future.

We'll need to add conformance testing for this.

#### Programmed

The `Programmed` condition is being added to replicate the functionality that the
`Ready` condition currently indicates, namely that all the resources in the set
are valid enough to produce some data plane configuration, and that configuration
has been sent to the data plane, and should be ready soon.

It is a positive-polarity summary condition, and so should always be present on
the resource. It should be set to `Unknown` if the implementation performs updates
to the status before it has all the information it needs to be able to determine
if the condition is true.


## Alternatives

(Most alternatives have been discussed inline. Please comment here if this section
needs updating.)

## References
[kep-status]: https://github.com/kubernetes/enhancements/blob/master/keps/NNNN-kep-template/kep.yaml#L9

[1111]: https://github.com/kubernetes-sigs/gateway-api/issues/1111
[1110]: https://github.com/kubernetes-sigs/gateway-api/issues/1110
[1362]: https://github.com/kubernetes-sigs/gateway-api/issues/1362

[typstatus]: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties
[httpreqredirect]: https://gateway-api.sigs.k8s.io/reference/spec/#gateway.networking.k8s.io%2fv1beta1.HTTPRequestRedirectFilter