go.chromium.org/luci@v0.0.0-20240309015107-7cdc2e660f33/server/quota/doc.go (about) 1 // Copyright 2022 The LUCI Authors. 2 // 3 // Licensed under the Apache License, Version 2.0 (the "License"); 4 // you may not use this file except in compliance with the License. 5 // You may obtain a copy of the License at 6 // 7 // http://www.apache.org/licenses/LICENSE-2.0 8 // 9 // Unless required by applicable law or agreed to in writing, software 10 // distributed under the License is distributed on an "AS IS" BASIS, 11 // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 // See the License for the specific language governing permissions and 13 // limitations under the License. 14 15 // Package quota provides an implementation for server quotas which are backed 16 // by Redis. 17 // 18 // # Rationale 19 // 20 // Quotas are a way to restrict shared resource consumption in order to provide 21 // fairness and prevent abuse. The quota library implements a way to configure 22 // and track resource limits for users for application-specific resources. 23 // 24 // We intend that this library be a 'good enough' implementation that it can 25 // serve the needs of many (if not all) LUCI services and provide additional 26 // common benefits (logging, metrics, administration ACLs/API/UI) so that each 27 // individual service doesn't need to re-invent these mechanisms. 28 // 29 // The current implementation is based on Redis and is fully synchronous. 30 // There's a possibility in the future that we could extend the implementation 31 // to other datastores or to allow the application to make a tradeoff between 32 // accuracy and latency. 33 // 34 // # Data Model 35 // 36 // There are 2 different types of entities managed by the quota libary: Policies 37 // (grouped into a PolicyConfig) and Accounts. The library provides a variety of 38 // Operations which all work in terms of these entities. 39 // 40 // # Data Model - Entity identities 41 // 42 // All entities have an identity which is composed of the following 'atoms'. 43 // Some of these atoms need structure which is meaningful to the application. 44 // The quota library has a convention for such atoms called "ASIs" (Application 45 // specific identifiers). See that section for what/why. Note that all of these 46 // identifiers end up as Redis keys (or hash keys) one way or the other, so all 47 // the usual caveats around absurd key lengths apply here. However, Redis allows 48 // keys up to 512MB, so have fun... 49 // 50 // Common identifier atoms: 51 // 52 // - app_id - The app_id allows multiple logical applications to share the same 53 // Redis instance. This should reflect the service that the account or policy 54 // belongs to. For example this would allow a single deployment to have quota 55 // accounts/policies for an application "cv" and "rdb" in the same binary. 56 // - realm - For administration purposes, Accounts and PolicyConfigs belong to 57 // a realm (though likely not the same one). Typically, PolicyConfigs will 58 // belong to a project's @project realm. Accounts will belong to realms which 59 // make sense in the context of the application. `realm` here is a global 60 // realm (i.e. `project:something`). 61 // - resource_type (ASI) - A given Policy or Account can only deal in a single 62 // resource_type. This value only needs to make sense to the application. 63 // - namespace (ASI) - Namespace allows the Application to segment a given 64 // realm into multiple sub-domains. For example, Buildbucket could use the 65 // namespace to indicate that a given Account is being used for a single 66 // builder within a bucket. This only needs to make sense to the application. 67 // - name (ASI) - Name is the name of the entity. This only needs to make sense 68 // to the application. 69 // 70 // # Data Model - PolicyConfig 71 // 72 // ID: app_id ~ realm ~ version 73 // 74 // A PolicyConfig is an immutable group (Redis Hash) of Policies. 75 // Typically this will be in a @project realm of some LUCI project, as current 76 // users will likely derive a PolicyConfig from some other LUCI project 77 // configuration. 78 // 79 // The realm indicates which realm this PolicyConfig is administered under, but 80 // it doesn't need to (and likely will not) match the realm for Accounts using 81 // the Policies within it. 82 // 83 // In the PolicyConfig ID, the `version` field is a content hash (starting with 84 // `$`), or manually supplied ("#" followed by an ASI). Once written, 85 // PolicyConfigs cannot be modified (but they can be purged). It's recommended 86 // to use the content hash versioning scheme (this will also do implicit 87 // deduplication when configs change without policy changes). However, some 88 // applications may find it more convenient to tie the PolicyConfig version to 89 // an external version identifier (like a git commit id of the overall configs), 90 // so manually versioning the PolicyConfigs is an option. 91 // 92 // Purging PolicyConfigs results in the deletion of a PolicyConfig and should 93 // only be used for PolicyConfigs that the application knows are no longer in 94 // use. However, in the event that a PolicyConfig is purged while Accounts still 95 // reference it: 96 // - Operations on those Accounts without supplying a new Policy reference 97 // will continue to use the snapshot of the policy stored in the Account. 98 // We could potentially make this produce a warning or error, however. 99 // - Operations on Accounts that supply a new Policy reference must have that 100 // Policy exist, as usual, and it will replace the referenced/snapshotted 101 // policy in the Account. 102 // 103 // # Data Model - Policy 104 // 105 // Key (within a PolicyConfig): namespace ~ name ~ resource_type 106 // 107 // A Policy is an immutable member of a PolicyConfig, and stores a numeric 108 // Default, Limit, Refill, and a Lifetime. 109 // - Default - The value to set a previously non-existant Account to when 110 // first accessing it. 111 // - Limit - The maximum value an Account can have. 112 // - Options - Bit field indicating various options. Currently the only option 113 // is `ABSOLUTE_RESOURCE` which indicates that this policy constrains 114 // a resource which is managed exclusively by the application (for example, 115 // represents the current number of in-flight builds, etc.). This will 116 // disable the `quota.accounts.write` permission for accounts managed with 117 // this Policy. 118 // - Lifetime - The number of seconds to wait before garbage collecting an 119 // Account after its last update. This is implemented with a Redis TTL which 120 // is refreshed on the Account each time it's written. 121 // 122 // Refill is a numeric triple (see the "Refill Behavior" section for details of 123 // how refill works): 124 // - Units - The number of units to add. 125 // - Interval - The number of seconds in between fill events. Intervals are 126 // synchronized to UTC midnight + Offset. See the "Refill Behavior" section 127 // for a discussion on how Refill is implemented. Note that there is no cron 128 // or "stampede" from synchronizing refill events in this way. This must 129 // evenly divide 24 hours (86400 seconds). 130 // - Offset - The number of seconds to offset UTC midnight to the 0th daily 131 // interval. 132 // 133 // # Data Model - Account 134 // 135 // ID: app_id ~ realm ~ namespace ~ name ~ resource_type 136 // 137 // Accounts hold the balance of a specific owning identity for a specific 138 // resource. They contain: 139 // - Balance - Current number of units held. 140 // - LastUpdate - Time when this Account was last updated. 141 // - LastRefill - Time when this Account was last refilled (always <= 142 // LastUpdate). 143 // - LastPolicyChange - Time when the currently applied Policy was first 144 // set. 145 // - PolicyConfig - Redis key for the versioned PolicyConfig last used for this 146 // Account. 147 // - PolicyKey - Hash key (namespace ~ name ~ resource_type) in the PolicyConfig 148 // for the Policy last used for this Account. 149 // - PolicyRaw - Raw encoded snapshot of the last-used policy for this Account. 150 // This is necessary to allow the quota library to interact with an Account 151 // under it's last-applied policy without needing to re-read the original 152 // policy (which is technically difficult to do in Redis scripts because 153 // they need to have all Redis keys supplied to them in advance of their 154 // execution). 155 // 156 // # Operations 157 // 158 // Operations combine a Policy with an Account, plus a delta. 159 // 160 // Operations have: 161 // - account - The ID of the account to apply to. 162 // - policy - (optional) The PolicyConfig ID + Policy key to set on this 163 // Account. 164 // - delta - An offset from the value specified by `relative_to`. 165 // - relative_to - Enum with values CURRENT_BALANCE, ZERO, DEFAULT, and LIMIT. 166 // - options - 167 // - IGNORE_POLICY_BOUNDS - This allows `$relative_to + delta` to bring 168 // balance outside of the Policy's (0,limit) range. 169 // 170 // An Operation is applied by: 171 // - Creating the Account if it is missing, populating it with the provided 172 // Policy default, applying any refill to the existing Account balance 173 // under the Account's existing policy. 174 // - If the Operation includes a Policy, setting that Policy on the Account. 175 // - Calculating the new balance and checking if it is within the current/new 176 // Policy bounds. 177 // - Saving the new Account balance, policy, and resetting the Account TTL. 178 // 179 // Operations can fail in one of three ways: 180 // - FAIL_OUT_OF_BOUNDS - The Operation would have brought the Account out of 181 // (0, Policy.Limit), and options=IGNORE_POLICY_BOUNDS was unset. 182 // - FAIL_UNKNOWN_POLICY - The Operation included a policy which wasn't 183 // loaded. 184 // - FAIL_MISSING_ACCOUNT - The Operation referred to an Account, but also 185 // didn't set a policy, meaning that the Operation couldn't create the 186 // Account. 187 // 188 // NOTE: For Accounts where the balance is ALREADY out bounds, Operations which 189 // bring the balance closer to in-bounds ARE allowed. For example, a delta 190 // CURRENT_BALANCE+1 would be allowed for an Account whose balance was -10, and 191 // a delta CURRENT_BALANCE-10 would be allowed for an Account whose balance was 192 // 19 with a limit of 10. 193 // 194 // There is also a Get operation which ONLY reads the data, returning the 195 // full Account data and also the projected value (e.g. after refills). This 196 // operation does NOT change the Account at all (i.e. last_refill, TTL, etc. 197 // are all left as-is). 198 // 199 // # Application-specific identifiers (ASIs) 200 // 201 // The quota library has several application-specific identifiers (ASIs). These 202 // ASIs end up ~verbatim in Redis as row keys. This means that your storage 203 // costs and lookup performance will be proportional to their length. 204 // 205 // The quota libary reserves the character "~" for partitioning ASIs when 206 // synthesizing a full Redis key. 207 // 208 // Additionally, two characters will be treated specially as a convention: 209 // - "|" is available to separate sections within an ASI. 210 // - "{", if the first character in an ASI section, indicates that the 211 // remainder of that section is encoded with ascii85 (an encoding which 212 // conveniently excludes "~", "|", and "{"). Functions in this library 213 // which attempt to do this interpretation will return the raw string 214 // instead of failing (e.g. if you had `{z` in a section, it would be 215 // returned as `{z` rather than as an error). 216 // 217 // The quota library provides functions to encode/decode a series of arbitrary 218 // section strings to/from a single ASI string. 219 // 220 // The quota library may use "|" as a way to group related keys together when 221 // displaying a large collection of quota Account or Policy data. Think of it 222 // similarly to how GCS treats "/". It's a visual delimiter, but the underlying 223 // service doesn't really care if you use it or not. Similarly, sections 224 // starting with '{' will attempt to decode in certain contexts (like the UI), 225 // but if decoding fails it will return the original string. If your application 226 // dosen't care about this functionality at all, it's free to use any string it 227 // likes as an ASI, as long as it doesn't contain `~`. 228 // 229 // # Refill Behavior 230 // 231 // Refills in the quota library are intended to mimic the behavior of a cron job 232 // which runs every second, scanning all Accounts, seeing if their Interval is 233 // past and refilling them. 234 // 235 // However, such an implementation would be terribly slow. Instead, the quota 236 // library remembers the policy details for each account and then when 237 // interacting with the Account as part of an Operation, this will refill based 238 // on the real elapsed time under the previous Policy. 239 // 240 // Refills are synchronized to UTC plus an offset. This means if you specify 17 241 // units with an interval of "21600" (i.e. 6 hours), and an offset of 0, then 242 // each 6 hours after UTC midnight, 17 units would be added to the account. If 243 // the account was created at, say, 0740 UTC, then the next refill event would 244 // occur at 1200 UTC. 245 // 246 // Offset allows you to 'rotate' this cycle so that a given policy's "midnight" 247 // occurs at a different time of day. (NOTE: Theoretically this offset could be 248 // per-Account rather than per-Policy. If this becomes a necessary usecase, it 249 // wouldn't be hard to add, but for now we're keeping it simple). 250 // 251 // Please also refer to "Implementation notes - Refill Interval" and 252 // "Implementation notes - Refill Synchronization" for a discussion on why we 253 // picked this Refill system vs. a simpler units/second alternative and why we 254 // tie refills to the wall clock time. 255 // 256 // # Behavior when switching Policies 257 // 258 // Over time, it is likely that a single Account will go through multiple 259 // different Policies which apply to it, or where those Policies change 260 // parameters over time. 261 // 262 // Account names should always be stable, comprising a who/what/where of 263 // a resource. When policies shift for an Account, the quota library will 264 // maintain the previous balance of the Account, except that no Refill will take 265 // place if the Account is over its limit. Additionally, no matter how far out 266 // of spec an Account is, it will always be permitted to make an over-limit 267 // account smaller, or an under-zero account larger. 268 // 269 // So, say an account had a policy which had a limit of 20, with a balance of 270 // 18, and switched to a policy with a balance of 15. It would maintain its 271 // balance of 18 until debited, but any positive refill policy would have no 272 // effect. 273 // 274 // # Access control and Administration 275 // 276 // The quota library implements an administration service API. This is an 277 // auxilliary API to read/write the values manipulated by the quota library, to 278 // be used for debugging or manual intervention (rather than directly poking the 279 // underlying Redis data). 280 // 281 // The `self` binding context attribute has the value "1" if the Account ID's 282 // identity field matches the current auth identity, "0" otherwise. 283 // 284 // Access via this service is granted via realm permissions: 285 // - quota.accounts.read - Allows reading single accounts within a realm. 286 // Binding context: {app_id, resource_type, namespace, self} 287 // - quota.accounts.list - Allows listing accounts 288 // Binding context: {app_id, resource_type, namespace} 289 // - quota.accounts.write - Allows modifying accounts. Note that this only 290 // applies to accounts which do not have the option ABSOLUTE_RESOURCE. 291 // Binding context: {app_id, resource_type, namespace, self} 292 // - quota.policies.read - Allows reading policy contents. 293 // Binding context: {app_id} 294 // - quota.policies.write - Allows writing new content-addressed policy 295 // configs. Binding context: {app_id} 296 // - quota.policies.overrideVersion - If granted in conjunction with 297 // `quota.policies.write`, allows writing new manually-versioned policy 298 // configs. Binding context: {app_id}. Note that manually-versioned policy 299 // configs are not verifiable by the quota library and could allow users 300 // with this permission to 'poison' a quota policy version. 301 // - quota.policies.purge - Allows perging PolicyConfigs. 302 // Binding context: {app_id}. 303 // 304 // Permission checks require one of: 305 // - hasPermission(perm, operation_realm) OR 306 // - hasPermission(perm, "@internal:<service-app-id>") 307 // 308 // That is, internal permissions can be granted to service deployment Admins. 309 // Additionally, permissions granted in this realm will ignore the 310 // ABSOLUTE_RESOURCE flag on accounts, becuase it's presumed that service 311 // deployment Admins understand the nuances of manually adjusting such Accounts. 312 // 313 // NOTE: These access controls ONLY apply to requests via the Administration 314 // service API. Interaction with the quotas via the Go API do not do any access 315 // checking, because it is assumed that the application has already done 316 // appropriate access checks before computing the Accounts/Policies to interact 317 // with. 318 // 319 // # Implementation notes - Refill Interval 320 // 321 // Initially the Quota library implemented a "units/second" refill system. This 322 // made the implementation nice due to its simplicity, but had two noticeable 323 // drawbacks: 324 // 325 // 1. Low quantity quotas (e.g. builds per day) were difficult to express 326 // naturally (for example, the application would have to have accounts in 327 // fractional builds, like 100,000 == one build). 328 // 2. Even if the application expressed account values in this way, this leads 329 // to an effectively "analog" replenismhent system which would lead to 330 // mistakes when setting quotas. 331 // 332 // Consider the case where you want to restrict users to "10 builds per day". 333 // You first make the accounts hold thousandths of a build, and then set 334 // a policy with (limit=1000000, refill_each_sec=11). Ignoring the fact that the 335 // refill should actually be something like 11.574, we've basically achieved 336 // what we want, right? A user can only run 10 builds (a bit less) per day. 337 // 338 // Not quite. Consider that the user can wait until their quota is full (10 339 // builds) and then they: 340 // - Run 10 builds in hour 0 341 // - Run one build every ~2 hours for the next 24 hours. 342 // 343 // Oops... our 10/day quota actually allows the user to burst up to 19/day. 344 // Mondays are gonna be spicy. 345 // 346 // Another aspect of the current implementation is that the Interval MUST 347 // cleanly divide one day. This allows the Interval to have a daily cycle and 348 // reduces the possible edge cases when switching policies for an Acccount where 349 // the Policies have different refill periods. Otherwise, oddball intervals 350 // (like 13h) would skew by an hour each day, and when we eventually switch 351 // policies, the Account would lose an unpredictable amount of refill time. 352 // 353 // # Implementation notes - Refill Synchronization 354 // 355 // Quota refills are tricky; originally we started the clock at account creation 356 // time, but realized this would lead to two issues: 357 // 358 // 1. Every quota account would refresh at seemingly-random times, which makes 359 // debugging more difficult. This would not be beneficial for 'load 360 // distribution' in a system (it should explicitly use short term quotas or 361 // some othe rate limiting techniques instead). 362 // 2. This would lead to very difficult to reason-about behaviors when 363 // policies change for a given account. 364 // 365 // In the case of policy changes, the only sensible thing to do while 366 // maintaining the interval based refill events would be to reset the refill 367 // timer when changing policies on an account. However, for Refill policies with 368 // long intervals, this could lead to artifacts where users are inexplicably 369 // starved for quota. Consider a situation where a user is allowed 10 builds per 370 // day. They exhaust their quota at hour 23 of the day and complain to a trooper 371 // who then moves them to a higher-tier policy group with 20 builds per day. 372 // 373 // However, when hour 24 rolls around, the user's account not only doesn't get 374 // 20 builds added to it, it doesn't even get the original 10. Instead the user 375 // has to wait an ADDITIONAL 24h before their quota replenishes. 376 // 377 // Synchronizing refill events significantly improves the predictability of the 378 // system here. 379 // 380 // # Implementation notes - Deduplication 381 // 382 // The quota library has a simple deduplication scheme which is indended to 383 // prevent accidentally applying Operations multiple times (for example, 384 // applying a Op(-10) operation twice when you only wanted to apply it once 385 // could be pretty bad). 386 // 387 // When any actor interacts with the Quota library (either via the Go interface 388 // or the Administration API), they provide a request ID. The quota library then 389 // calculates if ALL of the Operations in the request can proceed with the 390 // current Account state, and, if so, applies ALL of the Operations atomically*, 391 // followed by recording the RequestID into Redis with a TTL (defaulting to 392 // 2 hours), a hash of the requested operations, plus the returned value for the 393 // Account balances after applying all of the Operations. If a subsequent 394 // request comes in with the same RequestID, the hash of the Operations is 395 // checked, and if it matches the stored value, the original result will be 396 // returned without error. 397 // 398 // (* I put the scary asterisk on atomically, because _as far as I can tell_, 399 // EVAL scripts in Redis are either fully applied, or not applied at all. 400 // However the statements in the docs aren't as strong as I'd like to this 401 // effect. The docs do state that EVAL (or FUNCTIONs) is our best bet.) 402 // 403 // Supplying a different set of Operations with the same RequestID is an error, 404 // and the request will be rejected. 405 // 406 // Where this departs from "normal" deduplication is that _negative_ (error) 407 // results are NOT recorded; That is, if you attempt to debit an account "A" 408 // by 1 unit, but the balance is currently 0, this will return an "underflow" 409 // error, but the RequestID will not be consumed (so retrying this exact same 410 // request later may succeed, if the balance of "A" has risen above 1. 411 // 412 // We speculate that this mode is more intuitive, since many of the places we 413 // expect applications to interact with the quota library are attempting to make 414 // rapid, otherwise stateless, decisions about what to do next, where generating 415 // the RequestID deterministically in the context of that decision is 416 // convenient. If we stored the rejection via the RequestID, it would require 417 // these stateless invocations to likely store the fact that a RequestID was 418 // consumed, or to pick randomized RequestIDs (which then gets you in trouble 419 // when multiple processes are attempting to make the same decision and would 420 // only fail out on a transaction after communicating intent to the quota 421 // service). 422 // 423 // # Implementation notes - Redis encoding 424 // 425 // This library makes use of `msgpack` to encode both Accounts and Policies in 426 // Redis. Unfortunately, because we need to implement quota manipulation in 427 // `lua`, regular protobuf wasn't an option for these. 428 // 429 // See the go.chromium.org/luci/common/proto/msgpackpb for documentation on this 430 // encoding form. 431 // 432 // This encoding form intends to preserve protobuf's backwards compatibility 433 // semantics, which (hopefully) will make forward schema migrations easy to 434 // implement without requiring total cache eviction. 435 // 436 // # Implementation notes - Debugging lua code 437 // 438 // I don't have any great strategy for this, but I did add a `DUMP` global 439 // function which is available in both `internal/luatest` as well as 440 // `quotatestmonkeypatch`. This will dump (print) all arguments, and will 441 // serialize any tables given to it with `cjson.encode`, which is usually good 442 // enough for quick debugging. 443 package quota