     1  # GEP-1364: Status and Conditions Update
     2  
     3  * Issue: [#1364](https://github.com/kubernetes-sigs/gateway-api/issues/1364)
     4  * Status: Standard
     5  
     6  ## TLDR
     7  
Status, and Conditions in particular, have grown organically across Gateway API, and so
have accumulated many inconsistencies and odd behaviors.
This GEP covers a review and consolidation to make Condition behavior consistent
across the whole API.
    12  
    13  ## Goals
    14  
    15  * Update Conditions design to be consistent across Gateway API resources
    16  * Provide a model and guidelines for Conditions for future new resources
    17  * Specify changes to conformance required for Condition updates
    18  
    19  ## Non-Goals
    20  
    21  * Define the full set of Conditions that will ever be used with Gateway API
    22  
    23  ## Introduction
    24  
    25  Gateway API currently has a lot of issues related to status, especially that
    26  status is inconsistent ([#1111][1111]), that names are hard to understand ([#1110][1110]),
    27  and that Reasons aren't explained properly ([#1362][1362]).
    28  
    29  As the API has grown, the way we talk about resources has changed a lot, and some of the
    30  status design hasn't been updated since resources were created.
    31  
So, for example, we have GatewayClass with `Accepted`, Gateway with `Scheduled`,
and Gateway Listeners with `Detached` (which you want to be `false`, unlike the previous
two). Gateways and Gateway Listeners also have `Ready` (which, again, you want to be
`true`), but Route doesn't.
    36  
    37  This document lays out large-scale changes to the way that we talk about resources,
    38  and the Conditions to match them. This means that there will be an unavoidable break
    39  in what constitutes a healthy or unhealthy resource, and there will be changes
    40  required for all implementations to be conformant with the release that includes
    41  these changes.
    42  
The constants for the deprecated Condition types will also be marked as deprecated,
and will no longer be tested as part of conformance. They'll still be present,
and will work, but they won't be part of the spec any more. This should give
implementations and users a release to transition to the new design (in UX terms).
This grace period should be one release (so the constants will be removed in
v0.7.0).
    49  
    50  This level of change is not optimal, and the intent is to make this a one-off change
    51  that can be built upon for future resources - since there are definitely more resources
    52  on the way.
    53  
    54  ## Background: Kubernetes API conventions and prior art on Conditions
    55  
    56  Because this GEP is mainly concerned with updating the Conditions we are setting in
    57  Gateway API resources' `status`, it's worth reviewing some important points about
    58  Conditions. (This information is mainly taken from the [Typical status properties][typstatus]
    59  section of the API conventions document.)
    60  
    61  1. Conditions are a standard type used to represent arbitrary higher-level status from
    62  a controller.
2. They use `listType=map`: a list that is enforced by the apiserver to have only
one entry of each item, using the `type` field as a key. (So, this is effectively
a map that looks like a list in YAML form.)
    66  3. Each has a number of fields, the most important of which for this discussion
    67  are `type`, `status`, `reason`, and `observedGeneration`.
    68  
    69      * `type` is a string value indicating the Condition type. `Accepted`, `Scheduled`,
    70      and `Ready` are current examples.
    * `status` indicates the state of the condition, and can be one of three values,
    `True`, `False`, or `Unknown`. `Unknown` in particular is important, because it
    means that the controller is unable to determine the status for some reason.
    (Note that "" is also valid, and must be treated as `Unknown`.
    Controllers must not set the value to "", but consumers should accept it
    as meaning the same thing as `Unknown`.)
    77      * `reason` is a CamelCase string that is a brief description of the reason why
    78      the `status` is set the way it is.
    79      * `observedGeneration` is an optional field that sets what the `metadata.generation`
    80      field was when the controller last saw a resource. Note that this is optional
    81      _in the struct_, but is required for Gateway API conditions. This will be
    82      enforced in the conformance tests in the future.
    83  
    84  4. Conditions should describe the _current state_ of the resource at observation
    85  time, which means that they should be an adjective (like `Ready`), or a past-tense
    86  verb (like `Accepted`). This one in particular is documented pretty closely in the
    87  [Typical status properties][typstatus] section of the guidelines.
    88  5. Conditions should be applied to a resource the first time the controller sees
    89  the resource. This seems to imply that _all conditions should be present on every
    90  resource owned by a controller_, but the rest of the conventions don't make this
    91  clear, and it is often not complied with.
    92  6. It's helpful to have a top-level condition which summarizes more detailed conditions.
    93  The guidelines suggest using either `Ready` for long-running processes, or `Succeeded`
    94  for bounded execution.
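
The field rules in point 3 can be sketched from a consumer's point of view. This is a minimal illustration, not the upstream API: the `Condition` struct below is a simplified stand-in for the real Kubernetes type, and `normalizeStatus` and `isStale` are hypothetical helper names. Treating unrecognized status values as `Unknown` is an assumption that goes slightly beyond the rule for `""` stated above.

```go
package main

import "fmt"

// ConditionStatus holds one of the three allowed status values.
type ConditionStatus string

const (
	StatusTrue    ConditionStatus = "True"
	StatusFalse   ConditionStatus = "False"
	StatusUnknown ConditionStatus = "Unknown"
)

// Condition is a simplified, hypothetical stand-in for the real Condition type.
type Condition struct {
	Type               string
	Status             ConditionStatus
	Reason             string
	ObservedGeneration int64
}

// normalizeStatus reads "" as Unknown, as consumers are expected to.
// Any other unrecognized value is also mapped to Unknown (an assumption).
func normalizeStatus(s ConditionStatus) ConditionStatus {
	switch s {
	case StatusTrue, StatusFalse:
		return s
	default:
		return StatusUnknown
	}
}

// isStale reports whether the condition was computed against an older
// generation of the resource than the one currently stored.
func isStale(c Condition, generation int64) bool {
	return c.ObservedGeneration < generation
}

func main() {
	c := Condition{Type: "Accepted", Status: "", Reason: "Pending", ObservedGeneration: 2}
	fmt.Println(normalizeStatus(c.Status), isStale(c, 3))
}
```

A consumer would typically check staleness first, since a `True` status recorded against an older generation says nothing about the current spec.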
    95  
    96  From these guidelines, we can see that Conditions can be either _positive polarity_
    97  (healthy resources have them as `status: true`) or _negative polarity_ (healthy
    98  resources have them as `status: false`). `Ready` is an example of a positive polarity
    99  condition, and conditions like `Conflicted` from Listener or `NetworkUnavailable`,
   100  `MemoryPressure`, or `DiskPressure` from the Node resource are examples of
   101  negative-polarity conditions.
   102  
   103  There is also some extra context that's not in the API conventions doc:
   104  
   105  SIG-API Machinery has been reluctant to add fields that would aid in machine-parsing
   106  of Conditions, especially fields that would indicate the polarity, because they
   107  are intended more for human consumption than machine consumption. Probably the best
   108  example of this was in the PR [#4521](https://github.com/kubernetes/community/pull/4521#issuecomment-64894206).
   109  
   110  This means that there's no guidance from upstream about condition polarity. We'll
   111  discuss this more when we talk about new conditions.
   112  
   113  The guidance about Conditions being added as soon as a controller sees a resource
   114  is a bit unclear - as written in the conventions, it seems to imply that _all_ 
   115  relevant conditions should always be added, even if their status has to be set to
   116  `unknown`.
   117  Gateway API resources do not currently require this, and the practice seems to be
   118  uncommon.
   119  
   120  ## Proposed changes
   121  
   122  ### Proposed changes summary
   123  
   124  * All the current Conditions that indicate that the resource is okay and ready
   125  for processing will be replaced with `Accepted`.
   126  * In general, resources should be considered `Accepted` if their config is valid
   127  enough to generate some config in the underlying data plane. Examples are provided
   128  below.
   129  * There will be a limited set of positive polarity summary conditions, and a number
   130  of other specific negative-polarity error conditions.
   131  * All relevant positive-polarity summary Conditions for a resource must be added
   132  when it's observed.
   133  For example, HTTPRoutes must always have `Accepted` and `ResolvedRefs`, regardless
   134  of their state.
   135  * Negative polarity error conditions must only be added when the error is True.
   136  * The `Ready` condition will be moved to Extended conformance, and we'll re-evaluate
   137  if it's used by any implementations after some time has passed. If not, it may be
   138  removed.
   139  * To capture the behavior that `Ready` currently captures, `Programmed` will be
   140  introduced. This means that the implementation has seen the config, has everything
   141  it needs, parsed it, and sent configuration off to the data plane. The configuration
   142  should be available "soon". We'll leave "soon" undefined for now.
   143  * Resolving a comment that came up, documentation will be added to clarify that
   144  it's okay to add your own Conditions, and that implementations should namespace
   145  their custom Conditions with a domain prefix (so `implementation.io/CustomType`
   146  rather than just `CustomType`), or run the risk of using a word that's reserved later.
   147  * It's recommended that implementations publish both new and old conditions to
   148  provide a smoother transition, but conformance tests will only require the new
   149  conditions.
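
The namespacing recommendation for custom Conditions could be checked with something like the following sketch. `isDomainPrefixed` is a hypothetical helper, not part of Gateway API, and the rule it applies (a dotted domain, a slash, then a non-empty name) is only an assumption about what "a domain prefix" means here.

```go
package main

import (
	"fmt"
	"strings"
)

// isDomainPrefixed reports whether a Condition type looks like
// "implementation.io/CustomType": a dotted domain, a slash, then a name.
// Hypothetical helper; the exact rule is an assumption.
func isDomainPrefixed(condType string) bool {
	domain, name, found := strings.Cut(condType, "/")
	return found && name != "" && strings.Contains(domain, ".")
}

func main() {
	for _, t := range []string{"implementation.io/CustomType", "CustomType"} {
		fmt.Printf("%s namespaced=%v\n", t, isDomainPrefixed(t))
	}
}
```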
   150  
   151  The exact list of changes is detailed below. The next few sections detail
   152  the reasons for these large-scale changes.
   153  
   154  ### Conceptual and language changes
   155  
   156  Gateway API resources are, conceptually, all about breaking up the configuration for a
   157  data plane into separate resources that are _expressive_ and _extensible_, while being
   158  split up along _role-oriented_ boundaries.
   159  
   160  So, when we talk about Gateway API, it's _always_ about a _system of related resources_.
   161  
   162  We already acknowledge this when we talk about Routes "attaching" to Gateways, or Gateways
   163  referencing Services, or Gateways requiring a GatewayClass in their spec.
   164  
   165  However, this GEP is proposing that we move all our discussion into using
   166  "accepted" to indicate that a resource has attached correctly enough to be
   167  _accepted_ for processing.
   168  
   169  So resources are `Accepted` for processing when their attachment succeeds enough
   170  to generate some configuration. This allows us to make calls about when partially
   171  valid objects should be accepted and when they shouldn't.
   172  
   173  Of course, because we're using all of this configuration to describe some sort of data
   174  path from "outside"/lacking cluster context to "inside"/enriched with cluster context,
   175  we also need a way to describe when that data path is configured and working.
   176  
   177  We already have a word in the Kubernetes API, but it comes with some expectations
   178  that implementations are not currently able to meet. That word is `Ready`, but it
   179  implies that the data path is Ready _when you read the status_, rather than that
   180  it _will be ready soon_ (which is what most implementations can guarantee currently.)
   181  
   182  So we have an unresolved question as to what to do with the `Ready` condition.
   183  This is addressed further below.
   184  
   185  ### Condition polarity
   186  
   187  In terms of the polarity of conditions, we have three options, of which only two are
   188  really viable:
   189  * All conditions must be negative polarity
   190  * All conditions must be positive polarity
   191  * Some conditions can be positive polarity, but most should be negative.
   192  
   193  The fact that the user experience of `Ready` or conditions like `Accepted` being `true`
   194  in the healthy case is much better rules out the first option, so we are left to
   195  decide between enforcing that all conditions are positive, or that we have a mix.
   196  
Having an arbitrary mix will make machine-based extraction of information
much harder, so the comparison here is between having all conditions positive,
and having a limited set of summary conditions positive with the rest negative.
   200  
   201  #### All Conditions Positive
   202  
   203  In this case, all Condition types are written in such a way that they're positive
   204  polarity, and are `true` in the healthy case.
   205  
As already discussed, `Ready` and `Accepted` are current examples, but another
   207  one that's a little more important here is `ResolvedRefs` which is set to `true`
   208  when all references to other resources have been successfully resolved. This is
   209  not a _blocking_ Condition that affects the `Ready` condition, since having _some_
   210  references valid is enough to produce some configuration in the underlying data
   211  plane.
   212  
   213  So, All Conditions Positive pros:
   214  * We're close already. Most conditions in the API are currently positive polarity.
   215  * Easier to understand - there are no double negatives. "Good: true" is less
   216  cognitive overhead than "NotGood: false".
   217  
   218  Cons:
* Reduces flexibility - it can be surprisingly difficult to avoid double negatives for
conditions that describe error states, as in general programmers are more used
to reporting "something went wrong" than they are "everything's okay".
   222  
   223  Not sure if pro or con:
   224  * Leans the design towards favoring conditions always being present, since you
   225  can't be sure if everything is good unless you see `AllGood: true`. The absence
   226  of a positive-polarity condition implies that the condition could be false. This
   227  puts this option more in line with the API guidelines on this point.
   228  
   229  #### Some Conditions Positive
   230  
   231  In this case, only a limited set of summary conditions are positive, and the rest
   232  are negative.
   233  
   234  Pros:
   235  * Error states can be described with `Error: true` instead of `NoError: false`.
   236  * Negative polarity error conditions are more friendly to not being present (since
   237  absence of `Error: true` implies everything's okay).
   238  
   239  Cons:
   240  * Any code handling conditions will need a list of the positive ones, and will
   241  need to assume that any others are negative.
   242  
   243  #### Decision
   244  
Gateway API conditions will be positive for conditions that describe the happy
state of the object. These are currently `Accepted` and `ResolvedRefs`, and will
also include the new `Programmed` condition and the newly-Extended condition
`Ready`. A separate set of negative-polarity Error conditions will be set on an
object when they are true.
   250  
   251  
   252  ### Should conditions always be added?
   253  
   254  Not all of them.
   255  
   256  Positive polarity Conditions that describe the desirable state of the object must
   257  always be set. These are currently `Accepted`, `ResolvedRefs`, and `Programmed`.
   258  Implementations that use `Ready` must also add it before programming the Route.
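
As a rough sketch of "must always be set", a controller could fill in any missing summary Condition with status `Unknown` the first time it observes a resource. The `Condition` struct is a simplified stand-in, `ensureConditions` is a hypothetical helper, and the `Pending` reason string is an illustrative placeholder rather than something this GEP specifies. The list of required types is a parameter because it differs by resource kind.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for the real Condition type.
type Condition struct {
	Type               string
	Status             string // "True", "False", or "Unknown"
	Reason             string
	ObservedGeneration int64
}

// ensureConditions appends every required condition type that is missing,
// with status "Unknown", and leaves existing conditions untouched.
func ensureConditions(conds []Condition, required []string, generation int64) []Condition {
	present := make(map[string]bool, len(conds))
	for _, c := range conds {
		present[c.Type] = true
	}
	for _, t := range required {
		if !present[t] {
			conds = append(conds, Condition{
				Type: t, Status: "Unknown", Reason: "Pending", ObservedGeneration: generation,
			})
		}
	}
	return conds
}

func main() {
	existing := []Condition{{Type: "Accepted", Status: "True", Reason: "Accepted", ObservedGeneration: 1}}
	for _, c := range ensureConditions(existing, []string{"Accepted", "ResolvedRefs"}, 1) {
		fmt.Println(c.Type, c.Status)
	}
}
```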
   259  
   260  ### Partial validity and Conditions
   261  
   262  One of the trickiest parts of Gateway API objects is that it's very possible to
   263  end up with an object that has some parts with valid configuration and some that
   264  don't. We refer to this as _partial validity_, and communicating this via status
   265  conditions is difficult.
   266  
   267  The intent with the `Accepted` condition is that it serves as an indicator that
   268  _something_ is working, that _some traffic_ from what the config specifies will
   269  be routed as configured. 
   270  
At this time, we haven't added a "no errors at all present" Condition, choosing
to have a "some config is working" condition, with specific errors to aid in
finding the exact problem with the objects. We could conceivably add this later
if users find `Accepted` insufficient, but we're erring on the side of having
fewer positive Conditions for now.
   276  
   277  ### New and Updated Conditions
   278  
   279  #### `Accepted`
   280  
This GEP proposes replacing all conditions that indicate syntactic and semantic
validity with a single `Accepted` condition type.
   283  
   284  That is, the proposal is to replace:
   285  
   286  * `Scheduled` on Gateway
   287  * `Detached` on Listener
   288  
   289  with `Accepted` in all these locations.
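
The polarity difference between the two deprecated types matters during the transition. The sketch below shows how a consumer that still sees only a deprecated condition might approximate `Accepted` from it. `acceptedFromLegacy` and `invert` are hypothetical helpers, and the mapping is only an approximation, since `Accepted` carries a broader meaning than either deprecated type.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for the real Condition type.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// invert swaps True and False; anything else becomes Unknown.
func invert(status string) string {
	switch status {
	case "True":
		return "False"
	case "False":
		return "True"
	default:
		return "Unknown"
	}
}

// acceptedFromLegacy approximates Accepted from a deprecated condition.
// Scheduled is positive polarity, so its status carries over unchanged.
// Detached is negative polarity, so its status is inverted.
func acceptedFromLegacy(legacy Condition) (Condition, bool) {
	switch legacy.Type {
	case "Scheduled":
		return Condition{Type: "Accepted", Status: legacy.Status}, true
	case "Detached":
		return Condition{Type: "Accepted", Status: invert(legacy.Status)}, true
	default:
		return Condition{}, false
	}
}

func main() {
	legacy := []Condition{
		{Type: "Scheduled", Status: "True"},
		{Type: "Detached", Status: "True"},
	}
	for _, l := range legacy {
		if c, ok := acceptedFromLegacy(l); ok {
			fmt.Printf("%s=%s -> %s=%s\n", l.Type, l.Status, c.Type, c.Status)
		}
	}
}
```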
   290  
   291  GatewayClass and Route will maintain the `Accepted` condition.
   292  
   293  All of these conditions share the following meanings:
   294  
   295  * The resource has been accepted for processing by the controller
   296  * The resource is syntactically and semantically valid, and internally consistent
* The resource fits into a larger system of Gateway API resources, and there
is no missing information, including but not limited to:
   299    * Any mandatory references resolve to existing resources (examples here are the
   300    Gateway's gatewayClass field, or the `parentRefs` field in Route resources)
   301    * Any specified TLS secrets exist
   302  * The resource is supported by the controller by ensuring things like:
   303    * Any Kinds being referred to by the resource are supported
   304    * Features being used by the resource are supported
   305  
   306  All of these rules can be summarized into:
   307  
   308  * The resource is valid enough to produce some configuration in the underlying
   309  data plane.
   310  
   311  For Gateway, `Accepted` also subsumes the functions of `Scheduled`: `Accepted`
   312  set to `true` means that sufficient capacity exists on underlying infrastructure
   313  for the Gateway to be provisioned. If that capacity does not exist, then the
   314  Gateway cannot be reconciled successfully, and so fails to attach to the
   315  owning GatewayClass, and cannot be accepted.
   316  
   317  Note that some classes of inter-resource reference failure do _not_ cause a resource
   318  to become unattached and stop being accepted (that is, to have the `Accepted`
   319  condition set to `status: false`).
   320  
* Non-existent Service backends - if the backend does not exist on an HTTPRoute that
is otherwise okay, then the data plane must generate 500s for traffic that matches
that HTTPRoute. In this case, the `Accepted` Condition must be true, and the
`ResolvedRefs` Condition must be false, with reasons and messages indicating that
the backend services do not exist.
* HTTPRoutes with *all* backends in other namespaces, but not permitted by ReferenceGrants.
Here, the "non-existent service backends" rules apply, and 500s must be
generated. Again, the `Accepted` condition is true, and the
`ResolvedRefs` Condition is false, with reasons and messages indicating that the
backend services are not reachable.
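
The two cases above can be sketched as a function from backend resolution results to Conditions. `BackendState`, `resolvedRefs`, and `httpRouteConditions` are hypothetical names. The `BackendNotFound` and `RefNotPermitted` reasons come from the examples in this document, while the reason strings used for the `True` statuses are illustrative assumptions.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for the real Condition type.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// BackendState is a hypothetical summary of how one backend reference resolved.
type BackendState int

const (
	BackendOK BackendState = iota
	BackendMissing
	BackendNotPermitted
)

// resolvedRefs returns the ResolvedRefs condition, reporting the first failure found.
func resolvedRefs(backends []BackendState) Condition {
	for _, b := range backends {
		switch b {
		case BackendMissing:
			return Condition{Type: "ResolvedRefs", Status: "False", Reason: "BackendNotFound"}
		case BackendNotPermitted:
			return Condition{Type: "ResolvedRefs", Status: "False", Reason: "RefNotPermitted"}
		}
	}
	return Condition{Type: "ResolvedRefs", Status: "True", Reason: "ResolvedRefs"}
}

// httpRouteConditions keeps Accepted true in both failure cases, because the
// data plane can still be programmed to answer matching requests with 500s.
func httpRouteConditions(backends []BackendState) []Condition {
	return []Condition{
		{Type: "Accepted", Status: "True", Reason: "Accepted"},
		resolvedRefs(backends),
	}
}

func main() {
	for _, c := range httpRouteConditions([]BackendState{BackendOK, BackendMissing}) {
		fmt.Printf("%s=%s (%s)\n", c.Type, c.Status, c.Reason)
	}
}
```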
   331  
   332  For ReferenceGrant or not-designed-yet Policy resources, `Accepted` means that:
   333  
   334  * the resource has a correctly-defined set of resources that it applies to
   335  * the resource has a syntactically and semantically valid `spec`
   336  
   337  Note that having a correctly-defined set of resources that is empty does not make
   338  these resources unattached, as long as it's possible to create some config in the
   339  underlying data plane. By "empty" here we mean that there are no backends,
   340  not that the config is incomplete or missing references. So you can have a
   341  GatewayClass, Gateway, HTTPRoute and Service all present and referred to correctly
   342  when there are no endpoints in the Service, and the resource will not stop being
   343  accepted, because HTTPRoute contains rules about what to program in the data plane
   344  if there are no endpoints (that is, it should return 500 for any matching request).
   345  
   346  Note that for other Route types that don't have a clear mechanism like HTTP does
   347  for indicating a server failure (like the HTTP code 500 does), not having existing
   348  backends may not produce any configuration in the data plane, and so may cause
the resource to fail to attach. (An example here could be a TCP Route with
no backends: we need to decide if that means that a port should be opened that
actively closes connections, or if no port should be opened.)
   352  
   353  Examples of Conditions:
   354  
   355  * HTTPRoute with one match with one backend that is valid. `Accepted` is true,
   356  `ResolvedRefs` is true.
   357  * HTTPRoute with one match with one backend that is a non-existent Service backend.
   358  The `Accepted` Condition is true, the `ResolvedRefs` condition is false, with
   359  a reason of `BackendNotFound`. `Accepted` is true in this case because the data
   360  path must respond to requests that would be sent to that backend with a 500 response.
   361  * HTTPRoute with one match with two backends, one of which is a non-existent Service
   362  backend. The `Accepted` Condition is true, the `ResolvedRefs` condition is false.
   363  `Accepted` is true in this case because the data path must respond to a percentage
   364  of the requests matching the rule corresponding to the weighting of the non-existent
   365  backend (which would be fifty percent unless weights are applied).
   366  * HTTPRoute with one match with one backend that is in a different namespace, and
   367  does _not_ have a ReferenceGrant permitting that access. The `Accepted` condition
   368  is true, and the `ResolvedRefs` Condition is false, with a reason of `RefNotPermitted`.
   369  As before, `Accepted` is true because in this case, the data path must be
   370  programmed with 500s for the match.
   371  * TCPRoute with one match with a backend that is a non-existent Service. `Accepted`
   372  is false, and `ResolvedRefs` is false. `Accepted` is false in this case because
   373  there is not enough information to program any rules to handle the traffic in the
   374  underlying data plane - TCP doesn't have a way to say "this is a valid destination
   375  that has something wrong with it".
   376  * HTTPRoute with one Custom supported filter added that is not supported by the
   377  implementation. Our spec is currently unclear on what happens in this case, but
   378  custom HTTP Filters require the use of the `ExtensionRef` filter type, and the
   379  setting of the ExtensionRef field to the name, group, version, and kind of a
   380  custom resource that describes the filter. If that custom resource is not supported,
   381  it seems reasonable to say that this should be a reference failure, and be treated
like other reference failures (`Accepted` will be set to true, `ResolvedRefs` to
false with an `InvalidKind` Reason, and traffic that would have matched the filter
should receive a 500 error).
* An HTTPRoute with one rule that specifies an HTTPRequestRedirect filter _and_ an
HTTPURLRewrite filter. `Accepted` must be false, because there's only one rule,
and this configuration for the rule is invalid (see [reference][httpreqredirect]).
The error condition in this case is currently undefined - we should define it,
thanks @sunjayBhatia.
* An HTTPRoute with two rules, one valid and one which specifies an HTTPRequestRedirect
filter _and_ an HTTPURLRewrite filter. `Accepted` is true, because the valid rule
can produce some config in the data plane. We'll need to raise the more specific
error condition for an incompatible filter combination as well to make the partial
validity clear.
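
The two-backend example above (one valid backend, one non-existent) can be worked through numerically. This is only a sketch: `Backend` and `errorFraction` are hypothetical names, and treating a total weight of zero as "every request fails" is an assumption made for the illustration.

```go
package main

import "fmt"

// Backend is a hypothetical view of one weighted backend reference.
type Backend struct {
	Name   string
	Weight int
	Exists bool
}

// errorFraction returns the share of matching requests that should receive
// a 500, in proportion to the weight of the non-existent backends.
func errorFraction(backends []Backend) float64 {
	total, missing := 0, 0
	for _, b := range backends {
		total += b.Weight
		if !b.Exists {
			missing += b.Weight
		}
	}
	if total == 0 {
		return 1 // assumption: with no usable weight, every request fails
	}
	return float64(missing) / float64(total)
}

func main() {
	backends := []Backend{
		{Name: "svc-a", Weight: 1, Exists: true},
		{Name: "svc-missing", Weight: 1, Exists: false},
	}
	fmt.Println(errorFraction(backends))
}
```

With equal weights this gives one half, matching the "fifty percent unless weights are applied" note in the example. Changing the weights to 3 and 1 would give one quarter.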
   395  
   396  
   397  #### Ready
   398  
   399  Currently, the `Ready` condition text for Gateway says:
```go
	// This condition is true when the Gateway is expected to be able
	// to serve traffic. Note that this does not indicate that the
	// Gateway configuration is current or even complete (e.g. the
	// controller may still not have reconciled the latest version,
	// or some parts of the configuration could be missing).
```
   407  
   408  This is pretty unclear - how can the Gateway serve traffic if config is missing?
   409  In the past, we've been asked to have a Condition that only flips to `true` when
   410  *all* required configuration is present.
   411  
   412  For many implementations (certainly for Envoy-based ones), getting this information
   413  correctly and avoiding races on applying it is surprisingly difficult. 
   414  
   415  For this reason, this GEP proposes that we exclude the `Ready` condition from Core
   416  conformance, and make it a feature that implementations may opt in to - making it
   417  an Extended condition.
   418  
   419  It will have the following behavior:
   420  
   421  * `Ready` is an optional Condition that has Extended support, with conformance
   422  tests to verify the behavior.
   423  * When it's set, the condition indicates that traffic is ready to flow through
   424  the data plane _immediately_, not at some eventual point in the future.
   425  
   426  We'll need to add conformance testing for this.
   427  
   428  #### Programmed
   429  
   430  The `Programmed` condition is being added to replicate the functionality that the
   431  `Ready` condition currently indicates, namely that all the resources in the set
   432  are valid enough to produce some data plane configuration, and that configuration
   433  has been sent to the data plane, and should be ready soon.
   434  
   435  It is a positive-polarity summary condition, and so should always be present on
   436  the resource. It should be set to `Unknown` if the implementation performs updates
   437  to the status before it has all the information it needs to be able to determine
   438  if the condition is true.
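
A minimal sketch of how an implementation might derive `Programmed`, assuming it tracks whether configuration has been sent to the data plane. `ProgramState` and `programmedCondition` are hypothetical names, and the reason strings are illustrative placeholders rather than values this GEP defines.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for the real Condition type.
type Condition struct {
	Type               string
	Status             string
	Reason             string
	ObservedGeneration int64
}

// ProgramState is a hypothetical view of what the controller knows so far.
type ProgramState int

const (
	StateUndetermined ProgramState = iota // not enough information yet
	StateSent                             // config sent to the data plane
	StateRejected                         // config could not be produced or sent
)

// programmedCondition always returns a Programmed condition, defaulting to
// Unknown until the controller can tell whether config has been sent.
func programmedCondition(state ProgramState, generation int64) Condition {
	c := Condition{Type: "Programmed", Status: "Unknown", Reason: "Pending", ObservedGeneration: generation}
	switch state {
	case StateSent:
		c.Status, c.Reason = "True", "Programmed"
	case StateRejected:
		c.Status, c.Reason = "False", "Invalid"
	}
	return c
}

func main() {
	for _, s := range []ProgramState{StateUndetermined, StateSent, StateRejected} {
		c := programmedCondition(s, 4)
		fmt.Printf("%s=%s (%s)\n", c.Type, c.Status, c.Reason)
	}
}
```

Note that `True` here means only that configuration has been sent and should be ready soon. The stronger "traffic can flow immediately" guarantee belongs to `Ready`.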
   439  
   440  
   441  ## Alternatives
   442  
   443  (Most alternatives have been discussed inline. Please comment here if this section
   444  needs updating.)
   445  
   446  ## References
   447  [kep-status]: https://github.com/kubernetes/enhancements/blob/master/keps/NNNN-kep-template/kep.yaml#L9
   448  
   449  [1111]: https://github.com/kubernetes-sigs/gateway-api/issues/1111
   450  [1110]: https://github.com/kubernetes-sigs/gateway-api/issues/1110
   451  [1362]: https://github.com/kubernetes-sigs/gateway-api/issues/1362
   452  
   453  [typstatus]: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties
   454  [httpreqredirect]: https://gateway-api.sigs.k8s.io/reference/spec/#gateway.networking.k8s.io%2fv1beta1.HTTPRequestRedirectFilter