go.chromium.org/luci@v0.0.0-20240309015107-7cdc2e660f33/server/quota/doc.go

     1  // Copyright 2022 The LUCI Authors.
     2  //
     3  // Licensed under the Apache License, Version 2.0 (the "License");
     4  // you may not use this file except in compliance with the License.
     5  // You may obtain a copy of the License at
     6  //
     7  //      http://www.apache.org/licenses/LICENSE-2.0
     8  //
     9  // Unless required by applicable law or agreed to in writing, software
    10  // distributed under the License is distributed on an "AS IS" BASIS,
    11  // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    12  // See the License for the specific language governing permissions and
    13  // limitations under the License.
    14  
    15  // Package quota provides an implementation for server quotas which are backed
    16  // by Redis.
    17  //
    18  // # Rationale
    19  //
    20  // Quotas are a way to restrict shared resource consumption in order to provide
    21  // fairness and prevent abuse. The quota library implements a way to configure
    22  // and track resource limits for users for application-specific resources.
    23  //
    24  // We intend for this library to be a 'good enough' implementation that can
    25  // serve the needs of many (if not all) LUCI services and provide additional
    26  // common benefits (logging, metrics, administration ACLs/API/UI) so that each
    27  // individual service doesn't need to re-invent these mechanisms.
    28  //
    29  // The current implementation is based on Redis and is fully synchronous.
    30  // There's a possibility in the future that we could extend the implementation
    31  // to other datastores or to allow the application to make a tradeoff between
    32  // accuracy and latency.
    33  //
    34  // # Data Model
    35  //
    36  // There are two types of entities managed by the quota library: Policies
    37  // (grouped into a PolicyConfig) and Accounts. The library provides a variety of
    38  // Operations which all work in terms of these entities.
    39  //
    40  // # Data Model - Entity identities
    41  //
    42  // All entities have an identity which is composed of the following 'atoms'.
    43  // Some of these atoms need structure which is meaningful to the application.
    44  // The quota library has a convention for such atoms called "ASIs" (Application
    45  // specific identifiers). See that section for what/why. Note that all of these
    46  // identifiers end up as Redis keys (or hash keys) one way or the other, so all
    47  // the usual caveats around absurd key lengths apply here. However, Redis allows
    48  // keys up to 512MB, so have fun...
    49  //
    50  // Common identifier atoms:
    51  //
    52  //   - app_id - The app_id allows multiple logical applications to share the same
    53  //     Redis instance. This should reflect the service that the account or policy
    54  //     belongs to. For example, this would allow a single deployment to have quota
    55  //     accounts/policies for the applications "cv" and "rdb" in the same binary.
    56  //   - realm - For administration purposes, Accounts and PolicyConfigs belong to
    57  //     a realm (though likely not the same one). Typically, PolicyConfigs will
    58  //     belong to a project's @project realm. Accounts will belong to realms which
    59  //     make sense in the context of the application. `realm` here is a global
    60  //     realm (i.e. `project:something`).
    61  //   - resource_type (ASI) - A given Policy or Account can only deal in a single
    62  //     resource_type. This value only needs to make sense to the application.
    63  //   - namespace (ASI) - Namespace allows the Application to segment a given
    64  //     realm into multiple sub-domains. For example, Buildbucket could use the
    65  //     namespace to indicate that a given Account is being used for a single
    66  //     builder within a bucket. This only needs to make sense to the application.
    67  //   - name (ASI) - Name is the name of the entity. This only needs to make sense
    68  //     to the application.
    69  //
    70  // # Data Model - PolicyConfig
    71  //
    72  // ID: app_id ~ realm ~ version
    73  //
    74  // A PolicyConfig is an immutable group (Redis Hash) of Policies.
    75  // Typically this will be in a @project realm of some LUCI project, as current
    76  // users will likely derive a PolicyConfig from some other LUCI project
    77  // configuration.
    78  //
    79  // The realm indicates which realm this PolicyConfig is administered under, but
    80  // it doesn't need to (and likely will not) match the realm for Accounts using
    81  // the Policies within it.
    82  //
    83  // In the PolicyConfig ID, the `version` field is a content hash (starting with
    84  // `$`) or is manually supplied ("#" followed by an ASI). Once written,
    85  // PolicyConfigs cannot be modified (but they can be purged). It's recommended
    86  // to use the content hash versioning scheme (this will also do implicit
    87  // deduplication when configs change without policy changes). However, some
    88  // applications may find it more convenient to tie the PolicyConfig version to
    89  // an external version identifier (like a git commit id of the overall configs),
    90  // so manually versioning the PolicyConfigs is an option.
    91  //
    92  // Purging PolicyConfigs results in the deletion of a PolicyConfig and should
    93  // only be used for PolicyConfigs that the application knows are no longer in
    94  // use. However, in the event that a PolicyConfig is purged while Accounts still
    95  // reference it:
    96  //   - Operations on those Accounts without supplying a new Policy reference
    97  //     will continue to use the snapshot of the policy stored in the Account.
    98  //     We could potentially make this produce a warning or error, however.
    99  //   - Operations on Accounts that supply a new Policy reference must have that
   100  //     Policy exist, as usual, and it will replace the referenced/snapshotted
   101  //     policy in the Account.
   102  //
   103  // # Data Model - Policy
   104  //
   105  // Key (within a PolicyConfig): namespace ~ name ~ resource_type
   106  //
   107  // A Policy is an immutable member of a PolicyConfig, and stores a numeric
   108  // Default and Limit, an Options bit field, a Refill, and a Lifetime.
   109  //   - Default - The value to set a previously nonexistent Account to when
   110  //     first accessing it.
   111  //   - Limit - The maximum value an Account can have.
   112  //   - Options - Bit field indicating various options. Currently the only option
   113  //     is `ABSOLUTE_RESOURCE` which indicates that this policy constrains
   114  //     a resource which is managed exclusively by the application (for example,
   115  //     represents the current number of in-flight builds, etc.). This will
   116  //     disable the `quota.accounts.write` permission for accounts managed with
   117  //     this Policy.
   118  //   - Lifetime - The number of seconds to wait before garbage collecting an
   119  //     Account after its last update. This is implemented with a Redis TTL which
   120  //     is refreshed on the Account each time it's written.
   121  //
   122  // Refill is a numeric triple (see the "Refill Behavior" section for details of
   123  // how refill works):
   124  //   - Units - The number of units to add.
   125  //   - Interval - The number of seconds between refill events. This must evenly
   126  //     divide 24 hours (86400 seconds). Intervals are synchronized to UTC
   127  //     midnight + Offset. See the "Refill Behavior" section for a discussion of
   128  //     how Refill is implemented. Note that there is no cron or "stampede" from
   129  //     synchronizing refill events in this way.
   130  //   - Offset - The number of seconds after UTC midnight at which the 0th daily
   131  //     interval starts.
   132  //
   133  // # Data Model - Account
   134  //
   135  // ID: app_id ~ realm ~ namespace ~ name ~ resource_type
   136  //
   137  // Accounts hold the balance of a specific owning identity for a specific
   138  // resource. They contain:
   139  //   - Balance - Current number of units held.
   140  //   - LastUpdate - Time when this Account was last updated.
   141  //   - LastRefill - Time when this Account was last refilled (always <=
   142  //     LastUpdate).
   143  //   - LastPolicyChange - Time when the currently applied Policy was first
   144  //     set.
   145  //   - PolicyConfig - Redis key for the versioned PolicyConfig last used for this
   146  //     Account.
   147  //   - PolicyKey - Hash key (namespace ~ name ~ resource_type) in the PolicyConfig
   148  //     for the Policy last used for this Account.
   149  //   - PolicyRaw - Raw encoded snapshot of the last-used policy for this Account.
   150  //     This is necessary to allow the quota library to interact with an Account
   151  //     under its last-applied policy without needing to re-read the original
   152  //     policy (which is technically difficult to do in Redis scripts because
   153  //     they need to have all Redis keys supplied to them in advance of their
   154  //     execution).
   155  //
   156  // # Operations
   157  //
   158  // Operations combine a Policy with an Account, plus a delta.
   159  //
   160  // Operations have:
   161  //   - account - The ID of the account to apply to.
   162  //   - policy - (optional) The PolicyConfig ID + Policy key to set on this
   163  //     Account.
   164  //   - delta - An offset from the value specified by `relative_to`.
   165  //   - relative_to - Enum with values CURRENT_BALANCE, ZERO, DEFAULT, and LIMIT.
   166  //   - options - May include IGNORE_POLICY_BOUNDS, which allows
   167  //     `$relative_to + delta` to bring the balance outside of the Policy's
   168  //     (0,limit) range.
   169  //
   170  // An Operation is applied by:
   171  //   - Creating the Account if it is missing and populating it with the provided
   172  //     Policy default; otherwise, applying any refill to the existing Account
   173  //     balance under the Account's existing policy.
   174  //   - If the Operation includes a Policy, setting that Policy on the Account.
   175  //   - Calculating the new balance and checking if it is within the current/new
   176  //     Policy bounds.
   177  //   - Saving the new Account balance, policy, and resetting the Account TTL.
   178  //
   179  // Operations can fail in one of three ways:
   180  //   - FAIL_OUT_OF_BOUNDS - The Operation would have brought the Account out of
   181  //     (0, Policy.Limit), and options=IGNORE_POLICY_BOUNDS was unset.
   182  //   - FAIL_UNKNOWN_POLICY - The Operation included a policy which wasn't
   183  //     loaded.
   184  //   - FAIL_MISSING_ACCOUNT - The Operation referred to an Account which doesn't
   185  //     exist and didn't set a policy, meaning that the Operation couldn't create
   186  //     the Account.
   187  //
   188  // NOTE: For Accounts where the balance is ALREADY out of bounds, Operations which
   189  // bring the balance closer to in-bounds ARE allowed. For example, a delta
   190  // CURRENT_BALANCE+1 would be allowed for an Account whose balance was -10, and
   191  // a delta CURRENT_BALANCE-10 would be allowed for an Account whose balance was
   192  // 19 with a limit of 10.
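        //
        // As a rough illustration of the bounds rule above, the following sketch
        // computes `relative_to + delta` and decides whether the result is
        // acceptable. All names here (Policy, newBalance, ErrOutOfBounds) are
        // hypothetical and are not this package's API. Treating both ends of the
        // valid range as inclusive, and requiring a strictly closer balance for
        // the "already out of bounds" case, are assumptions of the sketch.
        //
```go
package main

import (
	"errors"
	"fmt"
)

// RelativeTo mirrors the relative_to enum described above.
type RelativeTo int

const (
	CurrentBalance RelativeTo = iota
	Zero
	Default
	Limit
)

// Policy holds only the two fields this sketch needs (hypothetical type).
type Policy struct {
	Default int64
	Limit   int64
}

// ErrOutOfBounds stands in for FAIL_OUT_OF_BOUNDS.
var ErrOutOfBounds = errors.New("FAIL_OUT_OF_BOUNDS")

// newBalance computes `relative_to + delta` and applies the bounds rule.
// On failure it returns the unchanged current balance and an error.
func newBalance(cur int64, p Policy, rel RelativeTo, delta int64, ignoreBounds bool) (int64, error) {
	var base int64
	switch rel {
	case CurrentBalance:
		base = cur
	case Zero:
		base = 0
	case Default:
		base = p.Default
	case Limit:
		base = p.Limit
	default:
		return cur, fmt.Errorf("unknown relative_to: %d", rel)
	}
	next := base + delta

	// Assumption: the valid range is inclusive on both ends.
	inBounds := next >= 0 && next <= p.Limit
	// An out-of-bounds result is still accepted when it moves an already
	// out-of-bounds balance back toward the valid range.
	closer := (next < 0 && cur < next) || (next > p.Limit && cur > next)
	if ignoreBounds || inBounds || closer {
		return next, nil
	}
	return cur, ErrOutOfBounds
}

func main() {
	p := Policy{Default: 5, Limit: 10}
	fmt.Println(newBalance(-10, p, CurrentBalance, 1, false))  // allowed: moves toward 0
	fmt.Println(newBalance(19, p, CurrentBalance, -10, false)) // allowed: lands in bounds
	fmt.Println(newBalance(0, p, CurrentBalance, -1, false))   // rejected: leaves the range
}
```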
   193  //
   194  // There is also a Get operation which ONLY reads the data, returning the
   195  // full Account data and also the projected value (e.g. after refills). This
   196  // operation does NOT change the Account at all (i.e. last_refill, TTL, etc.
   197  // are all left as-is).
   198  //
   199  // # Application-specific identifiers (ASIs)
   200  //
   201  // The quota library has several application-specific identifiers (ASIs). These
   202  // ASIs end up ~verbatim in Redis as row keys. This means that your storage
   203  // costs and lookup performance will be proportional to their length.
   204  //
   205  // The quota library reserves the character "~" for partitioning ASIs when
   206  // synthesizing a full Redis key.
   207  //
   208  // Additionally, two characters will be treated specially as a convention:
   209  //   - "|" is available to separate sections within an ASI.
   210  //   - "{", if the first character in an ASI section, indicates that the
   211  //     remainder of that section is encoded with ascii85 (an encoding which
   212  //     conveniently excludes "~", "|", and "{"). Functions in this library
   213  //     which attempt to do this interpretation will return the raw string
   214  //     instead of failing (e.g. if you had `{z` in a section, it would be
   215  //     returned as `{z` rather than as an error).
   216  //
   217  // The quota library provides functions to encode/decode a series of arbitrary
   218  // section strings to/from a single ASI string.
   219  //
   220  // The quota library may use "|" as a way to group related keys together when
   221  // displaying a large collection of quota Account or Policy data. Think of it
   222  // similarly to how GCS treats "/". It's a visual delimiter, but the underlying
   223  // service doesn't really care if you use it or not. Similarly, sections
   224  // starting with '{' will attempt to decode in certain contexts (like the UI),
   225  // but if decoding fails it will return the original string. If your application
   226  // doesn't care about this functionality at all, it's free to use any string it
   227  // likes as an ASI, as long as it doesn't contain `~`.
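        //
        // To make the convention concrete, here is a minimal sketch of joining
        // sections into an ASI and splitting them back out. It uses Go's standard
        // encoding/ascii85 package. The helper names (encodeSection, decodeSection,
        // encodeASI, decodeASI) are hypothetical, and the quota library's own
        // functions may differ in their details.
        //
```go
package main

import (
	"encoding/ascii85"
	"fmt"
	"strings"
)

// encodeSection returns s unchanged unless it contains a reserved character
// ("~" or "|") or starts with "{". In those cases it is ascii85-encoded and
// prefixed with "{".
func encodeSection(s string) string {
	if !strings.ContainsAny(s, "~|") && !strings.HasPrefix(s, "{") {
		return s
	}
	buf := make([]byte, ascii85.MaxEncodedLen(len(s)))
	n := ascii85.Encode(buf, []byte(s))
	return "{" + string(buf[:n])
}

// decodeSection reverses encodeSection. If the text after "{" is not valid
// ascii85 (or is empty), the raw section is returned instead of an error.
func decodeSection(s string) string {
	if !strings.HasPrefix(s, "{") {
		return s
	}
	src := []byte(s[1:])
	// Generously sized so that Decode never stops early for lack of space.
	dst := make([]byte, 4*len(src)+4)
	n, _, err := ascii85.Decode(dst, src, true)
	if err != nil || n == 0 {
		return s
	}
	return string(dst[:n])
}

// encodeASI joins the encoded sections with "|".
func encodeASI(sections ...string) string {
	encoded := make([]string, len(sections))
	for i, s := range sections {
		encoded[i] = encodeSection(s)
	}
	return strings.Join(encoded, "|")
}

// decodeASI splits an ASI on "|" and decodes each section.
func decodeASI(asi string) []string {
	parts := strings.Split(asi, "|")
	for i, p := range parts {
		parts[i] = decodeSection(p)
	}
	return parts
}

func main() {
	asi := encodeASI("bucket", "weird|name~here")
	fmt.Println(asi)
	fmt.Printf("%q\n", decodeASI(asi))
}
```
        //
        // Sections without reserved characters pass through untouched, so simple
        // ASIs stay readable in Redis.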
   228  //
   229  // # Refill Behavior
   230  //
   231  // Refills in the quota library are intended to mimic the behavior of a cron job
   232  // which runs every second, scanning all Accounts, checking whether their
   233  // Interval has elapsed, and refilling them.
   234  //
   235  // However, such an implementation would be terribly slow. Instead, the quota
   236  // library remembers the policy details for each Account. When an Operation
   237  // touches the Account, the library applies the refill owed for the real
   238  // elapsed time under the previous Policy.
   239  //
   240  // Refills are synchronized to UTC midnight plus an offset. This means that if
   241  // you specify 17 units with an interval of 21600 (i.e. 6 hours) and an offset
   242  // of 0, then every 6 hours after UTC midnight, 17 units are added to the
   243  // account. If the account was created at, say, 0740 UTC, then the next refill
   244  // event would occur at 1200 UTC.
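        //
        // One way to picture this without a cron job is to count how many
        // synchronized refill boundaries fall between the last refill and now. The
        // sketch below is an assumption about how such a count could be computed,
        // not the library's Lua implementation. The function names are
        // hypothetical. It relies on the Unix epoch being UTC midnight and on the
        // Interval evenly dividing 86400.
        //
```go
package main

import (
	"fmt"
	"time"
)

const secondsPerDay = 86400

// floorDiv divides a by b (b > 0), rounding toward negative infinity.
func floorDiv(a, b int64) int64 {
	q := a / b
	if a%b < 0 {
		q--
	}
	return q
}

// refillEvents counts the refill boundaries in the window (lastRefill, now],
// where boundaries occur at UTC midnight + offset + k*interval. All times are
// Unix seconds.
func refillEvents(lastRefill, now, interval, offset int64) (int64, error) {
	if interval <= 0 || secondsPerDay%interval != 0 {
		return 0, fmt.Errorf("interval %d does not evenly divide %d", interval, secondsPerDay)
	}
	// index(t) is the number of the most recent boundary at or before t.
	index := func(t int64) int64 { return floorDiv(t-offset, interval) }
	n := index(now) - index(lastRefill)
	if n < 0 {
		n = 0 // Clock went backwards; never "un-refill".
	}
	return n, nil
}

func main() {
	created := time.Date(2024, 3, 9, 7, 40, 0, 0, time.UTC).Unix()
	noon := time.Date(2024, 3, 9, 12, 0, 0, 0, time.UTC).Unix()
	n, err := refillEvents(created, noon, 21600, 0)
	if err != nil {
		panic(err)
	}
	fmt.Println(n * 17) // units owed at 1200 UTC for a 17-unit refill
}
```
        //
        // Under these assumptions, the account created at 0740 UTC sees exactly one
        // refill event by 1200 UTC.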
   245  //
   246  // Offset allows you to 'rotate' this cycle so that a given policy's "midnight"
   247  // occurs at a different time of day. (NOTE: Theoretically this offset could be
   248  // per-Account rather than per-Policy. If this becomes a necessary use case, it
   249  // wouldn't be hard to add, but for now we're keeping it simple).
   250  //
   251  // Please also refer to "Implementation notes - Refill Interval" and
   252  // "Implementation notes - Refill Synchronization" for a discussion on why we
   253  // picked this Refill system vs. a simpler units/second alternative and why we
   254  // tie refills to the wall clock time.
   255  //
   256  // # Behavior when switching Policies
   257  //
   258  // Over time, it is likely that a single Account will go through multiple
   259  // different Policies which apply to it, or where those Policies change
   260  // parameters over time.
   261  //
   262  // Account names should always be stable, comprising a who/what/where of
   263  // a resource. When policies shift for an Account, the quota library will
   264  // maintain the previous balance of the Account, except that no Refill will take
   265  // place if the Account is over its limit. Additionally, no matter how far out
   266  // of spec an Account is, it will always be permitted to make an over-limit
   267  // account smaller, or an under-zero account larger.
   268  //
   269  // So, say an account had a policy which had a limit of 20, with a balance of
   270  // 18, and switched to a policy with a limit of 15. It would maintain its
   271  // balance of 18 until debited, but any positive refill policy would have no
   272  // effect.
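        //
        // The rule above can be sketched as a small function. This is an
        // illustration only: applyRefill is a hypothetical name, the sketch
        // handles only positive refill Units, and capping the refilled balance at
        // the Limit is an assumption based on the Limit being the maximum value an
        // Account can have.
        //
```go
package main

import "fmt"

// applyRefill adds events*units to balance, capped at limit. An Account that
// is already at or over its limit is left untouched.
func applyRefill(balance, limit, units, events int64) int64 {
	if units <= 0 || events <= 0 || balance >= limit {
		return balance
	}
	refilled := balance + units*events
	if refilled > limit {
		refilled = limit
	}
	return refilled
}

func main() {
	// Balance of 18 carried over to a policy with a limit of 15: no refill.
	fmt.Println(applyRefill(18, 15, 5, 3))
	// After being debited down to 12, refills resume and stop at the limit.
	fmt.Println(applyRefill(12, 15, 5, 1))
}
```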
   273  //
   274  // # Access control and Administration
   275  //
   276  // The quota library implements an administration service API. This is an
   277  // auxiliary API to read/write the values manipulated by the quota library, to
   278  // be used for debugging or manual intervention (rather than directly poking the
   279  // underlying Redis data).
   280  //
   281  // The `self` binding context attribute has the value "1" if the Account ID's
   282  // identity field matches the current auth identity, "0" otherwise.
   283  //
   284  // Access via this service is granted via realm permissions:
   285  //   - quota.accounts.read - Allows reading single accounts within a realm.
   286  //     Binding context: {app_id, resource_type, namespace, self}
   287  //   - quota.accounts.list - Allows listing accounts.
   288  //     Binding context: {app_id, resource_type, namespace}
   289  //   - quota.accounts.write - Allows modifying accounts. Note that this only
   290  //     applies to accounts which do not have the option ABSOLUTE_RESOURCE.
   291  //     Binding context: {app_id, resource_type, namespace, self}
   292  //   - quota.policies.read - Allows reading policy contents.
   293  //     Binding context: {app_id}
   294  //   - quota.policies.write - Allows writing new content-addressed policy
   295  //     configs. Binding context: {app_id}
   296  //   - quota.policies.overrideVersion - If granted in conjunction with
   297  //     `quota.policies.write`, allows writing new manually-versioned policy
   298  //     configs. Binding context: {app_id}. Note that manually-versioned policy
   299  //     configs are not verifiable by the quota library and could allow users
   300  //     with this permission to 'poison' a quota policy version.
   301  //   - quota.policies.purge - Allows purging PolicyConfigs.
   302  //     Binding context: {app_id}.
   303  //
   304  // Permission checks require one of:
   305  //   - hasPermission(perm, operation_realm) OR
   306  //   - hasPermission(perm, "@internal:<service-app-id>")
   307  //
   308  // That is, internal permissions can be granted to service deployment Admins.
   309  // Additionally, permissions granted in this realm will ignore the
   310  // ABSOLUTE_RESOURCE flag on accounts, because it's presumed that service
   311  // deployment Admins understand the nuances of manually adjusting such Accounts.
   312  //
   313  // NOTE: These access controls ONLY apply to requests via the Administration
   314  // service API. Interactions with the quotas via the Go API do not do any access
   315  // checking, because it is assumed that the application has already done
   316  // appropriate access checks before computing the Accounts/Policies to interact
   317  // with.
   318  //
   319  // # Implementation notes - Refill Interval
   320  //
   321  // Initially the Quota library implemented a "units/second" refill system. This
   322  // made the implementation nice due to its simplicity, but had two noticeable
   323  // drawbacks:
   324  //
   325  //  1. Low quantity quotas (e.g. builds per day) were difficult to express
   326  //     naturally (for example, the application would have to have accounts in
   327  //     fractional builds, like 100,000 == one build).
   328  //  2. Even if the application expressed account values in this way, the
   329  //     result is an effectively "analog" replenishment system which would lead
   330  //     to mistakes when setting quotas.
   331  //
   332  // Consider the case where you want to restrict users to "10 builds per day".
   333  // You first make the accounts hold hundred-thousandths of a build, and then set
   334  // a policy with (limit=1000000, refill_each_sec=11). Ignoring the fact that the
   335  // refill should actually be something like 11.574, we've basically achieved
   336  // what we want, right? A user can only run 10 builds (a bit less) per day.
   337  //
   338  // Not quite. Consider that the user can wait until their quota is full (10
   339  // builds) and then they:
   340  //   - Run 10 builds in hour 0
   341  //   - Run one build every ~2 hours for the next 24 hours.
   342  //
   343  // Oops... our 10/day quota actually allows the user to burst up to 19/day.
   344  // Mondays are gonna be spicy.
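        //
        // The arithmetic behind that 19 can be checked with a few lines of Go. This
        // is only a worked example of the numbers quoted above (100,000 units per
        // build, limit=1000000, refill_each_sec=11), and the constant and function
        // names are made up for the illustration.
        //
```go
package main

import "fmt"

const (
	unitsPerBuild = 100_000   // one build, as in the example above
	limit         = 1_000_000 // ten builds
	refillPerSec  = 11        // the rounded-down units/second refill
	secondsPerDay = 86_400
)

// maxBuildsInFirstDay returns how many builds a user can run in 24 hours if
// they start with a full account and spend units as soon as they have enough
// for another build. Because the balance never sits at the limit, no refill
// is lost to the cap.
func maxBuildsInFirstDay() int {
	available := limit + refillPerSec*secondsPerDay
	return available / unitsPerBuild
}

func main() {
	fmt.Println(maxBuildsInFirstDay())
}
```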
   345  //
   346  // Another aspect of the current implementation is that the Interval MUST
   347  // cleanly divide one day. This allows the Interval to have a daily cycle and
   348  // reduces the possible edge cases when switching policies for an Account where
   349  // the Policies have different refill periods. Otherwise, oddball intervals
   350  // (like 13h) would skew by an hour each day, and when we eventually switch
   351  // policies, the Account would lose an unpredictable amount of refill time.
   352  //
   353  // # Implementation notes - Refill Synchronization
   354  //
   355  // Quota refills are tricky; originally we started the clock at account creation
   356  // time, but realized this would lead to two issues:
   357  //
   358  //  1. Every quota account would refresh at seemingly-random times, which makes
   359  //     debugging more difficult. This would not be beneficial for 'load
   360  //     distribution' in a system (it should explicitly use short-term quotas or
   361  //     some other rate limiting technique instead).
   362  //  2. This would lead to behaviors which are very difficult to reason about
   363  //     when policies change for a given account.
   364  //
   365  // In the case of policy changes, the only sensible thing to do while
   366  // maintaining the interval based refill events would be to reset the refill
   367  // timer when changing policies on an account. However, for Refill policies with
   368  // long intervals, this could lead to artifacts where users are inexplicably
   369  // starved for quota. Consider a situation where a user is allowed 10 builds per
   370  // day. They exhaust their quota at hour 23 of the day and complain to a trooper
   371  // who then moves them to a higher-tier policy group with 20 builds per day.
   372  //
   373  // However, when hour 24 rolls around, the user's account not only doesn't get
   374  // 20 builds added to it, it doesn't even get the original 10. Instead the user
   375  // has to wait an ADDITIONAL 24h before their quota replenishes.
   376  //
   377  // Synchronizing refill events significantly improves the predictability of the
   378  // system here.
   379  //
   380  // # Implementation notes - Deduplication
   381  //
   382  // The quota library has a simple deduplication scheme which is intended to
   383  // prevent accidentally applying Operations multiple times (for example,
   384  // applying a Op(-10) operation twice when you only wanted to apply it once
   385  // could be pretty bad).
   386  //
   387  // When any actor interacts with the Quota library (either via the Go interface
   388  // or the Administration API), they provide a request ID. The quota library then
   389  // calculates if ALL of the Operations in the request can proceed with the
   390  // current Account state, and, if so, applies ALL of the Operations atomically*,
   391  // followed by recording the RequestID into Redis with a TTL (defaulting to
   392  // 2 hours), a hash of the requested operations, plus the returned value for the
   393  // Account balances after applying all of the Operations. If a subsequent
   394  // request comes in with the same RequestID, the hash of the Operations is
   395  // checked, and if it matches the stored value, the original result will be
   396  // returned without error.
   397  //
   398  // (* I put the scary asterisk on atomically, because _as far as I can tell_,
   399  // EVAL scripts in Redis are either fully applied, or not applied at all.
   400  // However the statements in the docs aren't as strong as I'd like to this
   401  // effect. The docs do state that EVAL (or FUNCTIONs) is our best bet.)
   402  //
   403  // Supplying a different set of Operations with the same RequestID is an error,
   404  // and the request will be rejected.
   405  //
   406  // Where this departs from "normal" deduplication is that _negative_ (error)
   407  // results are NOT recorded. That is, if you attempt to debit an account "A"
   408  // by 1 unit, but the balance is currently 0, this will return an "underflow"
   409  // error, but the RequestID will not be consumed (so retrying this exact same
   410  // request later may succeed, if the balance of "A" has risen to at least 1).
   411  //
   412  // We speculate that this mode is more intuitive, since many of the places we
   413  // expect applications to interact with the quota library are attempting to make
   414  // rapid, otherwise stateless, decisions about what to do next, where generating
   415  // the RequestID deterministically in the context of that decision is
   416  // convenient. If we stored the rejection via the RequestID, it would require
   417  // these stateless invocations to likely store the fact that a RequestID was
   418  // consumed, or to pick randomized RequestIDs (which then gets you in trouble
   419  // when multiple processes are attempting to make the same decision and would
   420  // only fail out on a transaction after communicating intent to the quota
   421  // service).
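        //
        // The scheme described in this section can be sketched with an in-memory
        // map standing in for Redis. Everything here (Store, Op, Apply, the error
        // values, and the way operations are hashed) is hypothetical and only
        // illustrates the described behavior. The TTL on recorded RequestIDs is
        // omitted for brevity.
        //
```go
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
)

// Op is a simplified Operation: a delta against one account.
type Op struct {
	Account string
	Delta   int64
}

var (
	ErrUnderflow       = errors.New("underflow")
	ErrRequestIDReused = errors.New("request ID reused with different operations")
)

// record is what gets remembered for a successful request.
type record struct {
	opsHash  [sha256.Size]byte
	balances []int64
}

// Store is an in-memory stand-in for the state kept in Redis.
type Store struct {
	Balances map[string]int64
	requests map[string]record
}

func NewStore(balances map[string]int64) *Store {
	return &Store{Balances: balances, requests: map[string]record{}}
}

// hashOps fingerprints the requested operations.
func hashOps(ops []Op) [sha256.Size]byte {
	h := sha256.New()
	for _, op := range ops {
		fmt.Fprintf(h, "%q %d\n", op.Account, op.Delta)
	}
	var sum [sha256.Size]byte
	copy(sum[:], h.Sum(nil))
	return sum
}

// Apply applies all ops or none of them. Successful requests are recorded
// under requestID and replayed on retry; failed requests are not recorded.
func (s *Store) Apply(requestID string, ops []Op) ([]int64, error) {
	sum := hashOps(ops)
	if rec, ok := s.requests[requestID]; ok {
		if rec.opsHash != sum {
			return nil, ErrRequestIDReused
		}
		return rec.balances, nil // Replay the original result.
	}
	// Stage every change first so that a failure leaves no trace.
	staged := map[string]int64{}
	result := make([]int64, len(ops))
	for i, op := range ops {
		cur, ok := staged[op.Account]
		if !ok {
			cur = s.Balances[op.Account]
		}
		cur += op.Delta
		if cur < 0 {
			return nil, ErrUnderflow // Not recorded: requestID stays usable.
		}
		staged[op.Account] = cur
		result[i] = cur
	}
	for account, balance := range staged {
		s.Balances[account] = balance
	}
	s.requests[requestID] = record{opsHash: sum, balances: result}
	return result, nil
}

func main() {
	s := NewStore(map[string]int64{"A": 0})
	debit := []Op{{Account: "A", Delta: -1}}
	_, err := s.Apply("req-1", debit)
	fmt.Println(err) // rejected; "req-1" is not consumed
	s.Balances["A"] = 5
	fmt.Println(s.Apply("req-1", debit)) // succeeds and is recorded
	fmt.Println(s.Apply("req-1", debit)) // replayed; balance is not debited again
}
```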
   422  //
   423  // # Implementation notes - Redis encoding
   424  //
   425  // This library makes use of `msgpack` to encode both Accounts and Policies in
   426  // Redis. Unfortunately, because we need to implement quota manipulation in
   427  // `lua`, regular protobuf wasn't an option for these.
   428  //
   429  // See the go.chromium.org/luci/common/proto/msgpackpb package for documentation
   430  // on this encoding form.
   431  //
   432  // This encoding form intends to preserve protobuf's backwards compatibility
   433  // semantics, which (hopefully) will make forward schema migrations easy to
   434  // implement without requiring total cache eviction.
   435  //
   436  // # Implementation notes - Debugging lua code
   437  //
   438  // I don't have any great strategy for this, but I did add a `DUMP` global
   439  // function which is available in both `internal/luatest` as well as
   440  // `quotatestmonkeypatch`. This will dump (print) all arguments, and will
   441  // serialize any tables given to it with `cjson.encode`, which is usually good
   442  // enough for quick debugging.
   443  package quota