# Task management API: requirements

Multiple subsystems require reliable long-lived distributed operations.  We shall support them
with a task queue subsystem.  Concepts of the API are intended to parallel concepts in the
[Celery][celery-api] and/or [Machinery][machinery-api] APIs.  We do not use either of them, to
reduce the number of required dependencies.  A future version may support external queues to
achieve better performance or robustness, at the price of increased ops or cost.  These will
likely be queues such as Kafka or SQS rather than full Celery or Machinery.

In practice this generally means providing APIs similar to those of Machinery (which is more
Go-like than Celery) for constructing task flows and for registering workers.

In particular:
1. We provide similar concepts for building task flows as do existing task queues.
1. We use similar terminology.
1. We do *not* require the entire API of an existing task queue.
1. We do *not* use the verbs or API calls of an existing task queue.

This API definition comes with implementation sketches for how to use these APIs to implement
the branch export story.  We shall also (re-)implement retention expiry to use these APIs for
better ops; that story is considerably easier to imagine.

## API

### Concepts

#### Tasks

A task is the basic atom of task management.  It represents a single unit of work to
perform, and can succeed or fail.  Tasks may be retried on failure, so _executing a task
must be idempotent_.

Tasks connect to application code via an action and a body.  The _action_ identifies the
operation to complete this task.  Examples of actions can include "copy a file", "delete a
path", "report success".  It is essentially the _name_ of a procedure to perform at some future
time.  The _body_ of a task gives information necessary to configure the specific task.
Examples of bodies can include "source file path X, destination path Z" (for a copy task), "path
Z" (for a delete task), or "date started, number of objects and a message" (for a report task).
It essentially holds the _parameters_ the action uses to perform the task.

Tasks include these attributes:
- `Id`: a unique identifier for the task.  Use a known-unique substring in the identifier
  (e.g. a UUID or [nanoid][nanoid]) to avoid collisions, or a well-known identifier to ensure
  only one task of a type can exist.
- `Action`: the type of action to perform for this task.  Workers pick tasks to perform and the
  actions to perform on them according to this field.
- `Body`: a description of parameters for this task.  E.g. in a "copy file" task the body might
  specify source key, ETag and destination key.
- `StatusCode`: the internally-used state of the task in its lifecycle, see [life of a
  task](#life-of-a-task) below.
- `Status`: a textual description of the current status, generated by application code.
- `NumSignals`: number of tasks that must signal this task before it can be performed.
  Initially equal to the number of tasks on which it appears in the `ToSignalAfter` array.
- `MaxTries`: the maximal number of times to try to execute the task if it keeps being returned
  to state `pending`.
- `ActorId`: the unique string identifier chosen by a worker which is currently performing the
  task.  Useful for monitoring.
- `ActionDeadline`: a time by which the worker currently performing the task has committed to
  finish it.
- `ToSignalAfter`: an array of task IDs that cannot start before this task ends, and will therefore
  be signalled when it does.

Tasks provide these additional facilities (and include fields not listed here to support them):
- **Retries**.  A task repeatedly placed back into state `pending` will stop being retried
  once it has been tried `MaxTries` times.
- **Dependencies**.  A task can be set to run only after some other tasks are done.

A task is performed by a single worker; if that worker does not finish processing it and an
action deadline was set, it will be given to another worker.

#### Life of a task

```
                     |
                     | InsertTasks
                     |
                     |
               +-----v-----+
           +-->|  pending  |
           |   +-----+-----+
 ReturnTask|         |
 (to       |         | OwnTasks
 pending)  |         |
           |   +-----v-----+
           +---+in-progress|
               +-----------+
                     |
        +------------+------------+      ReturnTask
        |                         |
   +----v---+                +----v----+
   |aborted |                |completed|
   +--------+                +---------+
```

A task arrives complete with dependencies: a count of the number of preceding tasks that
must "signal" it before it may be executed.  When the task completes it signals all of its
dependent tasks.

Tasks are inserted in state `pending`.  Multiple workers call `OwnTasks` to get tasks.  A
task may only be claimed by a call to `OwnTasks` if:
* Its action is specified as acceptable to that call.
* All dependencies of the task have been settled: all tasks specifying its task ID in their
  `ToSignalAfter` lists have completed.
* The task is not claimed by another worker.  Either:
  - the task is in state `pending`, or
  - the task is in state `in-progress`, but its `ActionDeadline` has elapsed (see "ownership
    expiry", below).

`OwnTasks` returns task IDs and, for each returned task, a "performance token" for this
performance of it.  Both ID and token must be provided to _return_ the task from ownership.
(The performance token is used to resolve conflicts during "ownership expiry", below.)

A typical use is a worker loop that repeatedly calls `OwnTasks` on one or more actions and
dispatches each task to a separate function.  The application controls concurrency by setting the
number of concurrent worker loops.  For instance, it might set 20 worker loops to perform "copy"
and "delete" tasks and a single worker loop to perform "report to DataDog".

Once a worker owns a task, it performs it.  It can decide to _complete_, _abort_ or _retry_ the
task by calling `ReturnTask`, which returns the task to the task queue.  Once completed, all
dependents of the task are signalled, causing any dependent that has received all its
required signals to be eligible for return by `OwnTasks`.

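The worker-loop shape described above can be sketched as follows.  This is not the real API: `memQueue`, `ownedTask` and their methods are a minimal in-memory stand-in invented here, just so the dispatch pattern is runnable:

```go
package main

import (
	"fmt"
	"sync"
)

// ownedTask and memQueue are hypothetical stand-ins for the task queue; only
// what the worker loop needs is stubbed.
type ownedTask struct {
	Id     string
	Action string
	Body   string
}

type memQueue struct {
	mu    sync.Mutex
	tasks []ownedTask
}

// OwnTasks claims up to maxTasks tasks whose action is in actions.
func (q *memQueue) OwnTasks(actions []string, maxTasks int) []ownedTask {
	q.mu.Lock()
	defer q.mu.Unlock()
	var owned []ownedTask
	for i := 0; i < len(q.tasks) && len(owned) < maxTasks; {
		if contains(actions, q.tasks[i].Action) {
			owned = append(owned, q.tasks[i])
			q.tasks = append(q.tasks[:i], q.tasks[i+1:]...)
			continue
		}
		i++
	}
	return owned
}

// ReturnTask records the task's result (elided in this sketch).
func (q *memQueue) ReturnTask(id, status string) {}

func contains(ss []string, s string) bool {
	for _, x := range ss {
		if x == s {
			return true
		}
	}
	return false
}

// workerLoop repeatedly owns tasks for the handled actions and dispatches
// each one to its handler, returning each task when done.
func workerLoop(q *memQueue, handlers map[string]func(body string) error) int {
	done := 0
	for {
		actions := make([]string, 0, len(handlers))
		for a := range handlers {
			actions = append(actions, a)
		}
		owned := q.OwnTasks(actions, 10)
		if len(owned) == 0 {
			return done // a real worker would sleep and poll again
		}
		for _, t := range owned {
			if err := handlers[t.Action](t.Body); err != nil {
				q.ReturnTask(t.Id, "failed") // a real worker would retry or abort
			} else {
				q.ReturnTask(t.Id, "completed")
			}
			done++
		}
	}
}

func main() {
	q := &memQueue{tasks: []ownedTask{{"t1", "copy", "a->b"}, {"t2", "delete", "c"}}}
	n := workerLoop(q, map[string]func(string) error{
		"copy":   func(body string) error { fmt.Println("copy", body); return nil },
		"delete": func(body string) error { fmt.Println("delete", body); return nil },
	})
	fmt.Println("handled", n)
}
```

Running several such loops concurrently, each with its own handler map, is how the application would tune per-action concurrency.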
#### Ownership expiry

Processes can fail.  To allow restarting a failed process, calls to `OwnTasks` may specify a
deadline.  The lease granted to an owning worker expires after this deadline, allowing
another worker to own the task.  Only the _last_ worker granted ownership may call
`ReturnTask` on the task.  A delayed worker should still return the task, in case the task
has not yet been granted to another worker.

#### Basic API

This is a sample API.  All details are fully subject to change, of course!  Note that most
`func`s are probably going to be methods on some object, which we assume will carry DB
connection information etc.

##### TaskData

```go
type TaskId string

type ActorId string

type PerformanceToken pgtype.UUID // With added stringifiers

// TaskData describes a task to perform.
type TaskData struct {
	Id         TaskId              // Unique ID of task
	Action     string              // Action to perform, used to fetch in OwnTasks
	Body       *string             // Body containing details of action, used by clients only
	Status     *string             // Human- and client-readable status
	StatusCode TaskStatusCodeValue // Status code, used by task queue
	NumTries   int                 // Number of times this task has moved from pending to in-progress
	MaxTries   *int                // Maximal number of times to try this task
	// Dependencies might be stored or handled differently, depending on what gives reasonable
	// performance.
	TotalDependencies *int              // Number of tasks which must signal before this task can be owned
	ToSignalAfter     []TaskId          // Tasks to signal after this task is done
	ActorId           ActorId           // ID of current actor performing this task (if in-progress)
	ActionDeadline    *time.Time        // Deadline for current actor to finish performing this task (if in-progress)
	PerformanceToken  *PerformanceToken // Token to allow ReturnTask
	PostResult        bool              // If set, allow waiting for this task using WaitForTask
}
```

##### InsertTasks

```go
// InsertTasks atomically adds all tasks to the queue: if any task cannot be added (typically because
// it re-uses an existing key) then no tasks will be added.  If PostResult was set on any tasks then
// they can be waited upon after InsertTasks returns.
func InsertTasks(ctx context.Context, source *taskDataIterator) error
```

A variant allows inserting a task _by force_:

```go
// ReplaceTasks atomically adds all tasks to the queue.  If a task with the same ID exists and is
// not yet in-progress, then _replace it_, as though it were atomically aborted before this insert.
// If PostResult was set on any tasks then they can be waited upon after ReplaceTasks returns.
// Tasks that are in-progress cannot be replaced.
func ReplaceTasks(ctx context.Context, source *taskDataIterator) error
```

##### OwnTasks

```go
// OwnedTaskData is a task returned from OwnTasks.
type OwnedTaskData struct {
	Id     TaskId           `db:"task_id"`
	Token  PerformanceToken `db:"token"`
	Action string
	Body   *string
}

// OwnTasks acquires ownership, on behalf of actor, of up to maxTasks tasks for performing any of
// actions, setting the lifetime of each returned owned task to maxDuration.
func OwnTasks(ctx context.Context, actor ActorId, maxTasks int, actions []string, maxDuration *time.Duration) ([]OwnedTaskData, error)
```

`maxDuration` should be a time during which no other worker can access the task.  It does not
have to be the time to _complete_ the task: workers can periodically call `ExtendTasksOwnership`
to extend the lifetime.

##### ExtendTasksOwnership

```go
// ExtendTasksOwnership extends the current action lifetime of each task by another maxDuration,
// if that task is still owned by this actor with that performance token.  It returns true for each
// task that is still owned, or false if ownership extension failed because the task is no longer
// owned.
func ExtendTasksOwnership(ctx context.Context, actor ActorId, toExtend []OwnedTaskData, maxDuration time.Duration) ([]bool, error)
```

##### ReturnTask

```go
// ReturnTask returns taskId which was acquired using the specified performanceToken, giving it
// resultStatus and resultStatusCode.  It returns InvalidTokenError if the performanceToken is
// invalid; this happens when ReturnTask is called after its deadline expires, or due to a logic
// error.  If resultStatusCode is ABORT, abort all succeeding tasks.
func ReturnTask(ctx context.Context, taskId TaskId, token PerformanceToken, resultStatus string, resultStatusCode TaskStatusCodeValue) error
```

##### WaitForTask

```go
// WaitForTask waits for taskId (which must have been started with PostResult) to finish and
// returns it.  It returns immediately if the task has already finished.
func WaitForTask(ctx context.Context, taskId TaskId) (TaskData, error)
```

##### AddDependencies

```go
// TaskDependency describes a single dependency: task Run must run after task After.
type TaskDependency struct {
	After, Run TaskId
}

// AddDependencies atomically adds all the given dependencies.
func AddDependencies(ctx context.Context, dependencies []TaskDependency) error
```

##### Monitoring

We also require some routine as a basis for monitoring: it gives the number and status of each of a number
of actions and task IDs, possibly with some filtering.  The exact nature depends on the
implementation chosen, but we _do_ require its availability.

#### Differences from the Celery model

This task management model is at a somewhat lower level than the Celery model:
* **Workers explicitly loop to own and handle tasks.** Emulate the Celery model by writing an
  explicit function that takes "handlers" for the different actions.  We may well do this.

  _Why change?_  Writing the loop is rarely an important factor.  Flexibility in specifying the
  action parameter of OwnTasks allows variable actions, for instance handling particular action
  types only when a particular environmental condition is met (say, system load), or
  incorporating side data in action names (and not only in task IDs).  Flexibility in timing
  allows per-process rate limiters for particular actions: filter out expensive actions when
  their token bucket runs out.  Flexibility in specifying _when_ OwnTasks is called allows
  controlling load on the queuing component.  Flexibility in specifying action dispatch allows
  controlling _how many goroutines_ run particular actions concurrently.  All this without
  having to add configurable structures to the task manager.
* **No explicit graph structures.** Emulate these using the section [Structures](#structures)
  below.
* **No implicit argument serialization.** Rather than flatten an "args" array we pass a stringy
  "body".  In practice "args" anyway require serialization; incorporating them into the queue
  requires either configuring the queue with relevant serialization or allowing only primitives.
  Celery selects the first, Machinery the second.  In both cases a Go client library must place
  most of the serialization burden on application code -- simplest is to do so explicitly.

#### Structures

We can implement what Celery calls _Chains_, _Chords_ and _Groups_ using the basic API: these
are just ways to describe structured dependencies which form [parallel/serial
networks][parallel-series].  Drawings appear below.

##### Chains

```
    +----------+
    |  task 1  |
    +----------+
         |
         |
    +----v-----+
    |  task 2  |
    +----+-----+
         |
         |
    +----v-----+
    |  task 3  |
    +----------+
```

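A chain is encoded with the dependency fields alone: each task signals its successor, and each successor waits for exactly one signal.  A sketch, using a hypothetical `taskSpec` type that mirrors only the dependency-related fields of `TaskData`:

```go
package main

import "fmt"

// taskSpec mirrors the dependency-related fields of TaskData, enough to show
// how a chain is wired; it is not the real type.
type taskSpec struct {
	Id                string
	ToSignalAfter     []string
	TotalDependencies int
}

// chain wires the given tasks so each may run only after the previous one:
// task i signals task i+1, which therefore waits for exactly one signal.
func chain(ids ...string) []taskSpec {
	specs := make([]taskSpec, len(ids))
	for i, id := range ids {
		specs[i] = taskSpec{Id: id}
		if i > 0 {
			specs[i-1].ToSignalAfter = []string{id}
			specs[i].TotalDependencies = 1
		}
	}
	return specs
}

func main() {
	for _, s := range chain("task-1", "task-2", "task-3") {
		fmt.Printf("%s: waits for %d signal(s), then signals %v\n",
			s.Id, s.TotalDependencies, s.ToSignalAfter)
	}
}
```

The resulting specs would all go into a single `InsertTasks` call, so the whole chain appears atomically.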
##### Chords (and Groups)

```
                    +--------+
               +--->|  task1 +-----+
               |    +--------+     |
               |                   |
               |                   |
               |    +--------+     |
               +--->|  task2 +-----+
               |    +--------+     |
 +-------+     |                   |     +-------------+
 |  prev |-----+                   +---->|(spontaneous)|
 +-------+     |    +--------+     |     +-------------+
               +--->|  task3 +-----+
               |    +--------+     |
               |                   |
               |                   |
               |    +--------+     |
               +--->|  task4 +-----+
                    +--------+
```

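A chord is wired the same way: the fan-in task waits for as many signals as there are parallel tasks.  A sketch with the same hypothetical `taskSpec` type as above (not the real API):

```go
package main

import "fmt"

// taskSpec mirrors the dependency-related fields of TaskData, enough to show
// how a chord is wired; it is not the real type.
type taskSpec struct {
	Id                string
	ToSignalAfter     []string
	TotalDependencies int
}

// chord wires prev to fan out to every middle task, and every middle task to
// signal collector; collector waits for as many signals as there are middle
// tasks, so it becomes available only when all of them are done.
func chord(prev, collector string, middle ...string) []taskSpec {
	specs := []taskSpec{{Id: prev, ToSignalAfter: middle}}
	for _, m := range middle {
		specs = append(specs, taskSpec{
			Id:                m,
			ToSignalAfter:     []string{collector},
			TotalDependencies: 1, // signalled by prev
		})
	}
	specs = append(specs, taskSpec{Id: collector, TotalDependencies: len(middle)})
	return specs
}

func main() {
	for _, s := range chord("prev", "spontaneous", "task1", "task2", "task3", "task4") {
		fmt.Printf("%-12s waits for %d, signals %v\n", s.Id, s.TotalDependencies, s.ToSignalAfter)
	}
}
```

A group is the same wiring without the collector; dropping `prev` instead gives a plain fan-in.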
## Implementing "user" stories with the API

### Branch export

Each branch uses separate tasks arranged in a cycle.  These names are task IDs with a matching
action name: e.g. the action name for `next-export-{branch}` is `next-export`.  (The branch name
also appears in the body.)

* `next-export-{branch}` exists to start the next export, even if an export is already
  underway, by creating a task `start-export-{branch}`.
* `start-export-{branch}` handles the actual logic of generating the copy tasks in a network,
  leading eventually to the task `done-export-{branch}` (which it also generates) becoming
  available.
* `done-export-{branch}` is there so that `next-export-{branch}` can depend on it -- and not
  start before the current export operation terminates.  (If it does not exist,
  `next-export-{branch}` has no dependency blocking it and can run immediately.)

The actual steps:
1. Under the merge/commit lock for the branch: _replace_ the task with ID `next-export-{branch}`
   with a task to export _this_ commit ID, and add a dependency on `done-export-{branch}`
   (which may fail if that task has completed; that is safe).
1. To handle `next-export-{branch}`: create `start-export-{branch}` (the previous one must have
   ended), and return the task.
1. To handle `start-export-{branch}`:
   1. Generate a task to copy or delete each file object (this is an opportunity to batch
      multiple file objects if performance doesn't match).  `done-export-{branch}` depends on
      each of these tasks (and cannot have been deleted, since `start-export-{branch}` has not
      yet been returned).  For every prefix for which such an object is configured, add a task
      to generate its `.../_SUCCESS` object on S3, dependent on all the objects under that
      prefix (or, to handle objects in sub-prefixes, on just the `_SUCCESS` of that sub-prefix).
   1. Add a task to generate manifests, dependent on `done-export-{branch}`.
   1. Return `start-export-{branch}` as completed.
1. To handle a copy or delete operation, perform it.
1. To handle `done-export-{branch}`: just return it; it can be spontaneous (if the task queue
   supports that).

`next-export-{branch}` is used to serialize branch exports, achieving the requirement of a single
concurrent export per branch.  Per-prefix `_SUCCESS` objects are generated on time due to their
dependencies.  (As an option, we could set priorities and return tasks in priority order from
`OwnTasks`, to allow `_SUCCESS` objects to be created before copying other objects.)  Retries
are handled by setting multiple per-copy attempts.

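The dependency wiring that `start-export-{branch}` generates can be sketched as below.  This is a simplification of the story above: `spec`, `exportPlan`, the action names and the ID scheme are all invented here for illustration, it only handles copies (no deletes), and it hangs each prefix's `_SUCCESS` task directly off that prefix's objects, ignoring the sub-prefix refinement:

```go
package main

import (
	"fmt"
	"path"
	"sort"
)

// spec mirrors the dependency-related fields of TaskData; ids and actions
// follow the naming of the story above but are illustrative only.
type spec struct {
	Id            string
	Action        string
	NumSignals    int
	ToSignalAfter []string
}

// exportPlan builds one copy task per object, one "_SUCCESS" task per prefix
// (waiting on that prefix's copies), and a done task waiting on every copy.
func exportPlan(branch string, keys []string) []spec {
	doneId := "done-export-" + branch
	byPrefix := map[string][]string{}
	var specs []spec
	for _, k := range keys {
		prefix := path.Dir(k)
		byPrefix[prefix] = append(byPrefix[prefix], k)
		specs = append(specs, spec{
			Id:     "copy-" + branch + "-" + k,
			Action: "copy",
			// Each copy signals both its prefix's _SUCCESS task and done.
			ToSignalAfter: []string{doneId, "success-" + branch + "-" + prefix},
		})
	}
	prefixes := make([]string, 0, len(byPrefix))
	for p := range byPrefix {
		prefixes = append(prefixes, p)
	}
	sort.Strings(prefixes) // deterministic output order
	for _, p := range prefixes {
		specs = append(specs, spec{
			Id:         "success-" + branch + "-" + p,
			Action:     "write-success",
			NumSignals: len(byPrefix[p]),
		})
	}
	specs = append(specs, spec{Id: doneId, Action: "done-export", NumSignals: len(keys)})
	return specs
}

func main() {
	for _, s := range exportPlan("main", []string{"data/a", "data/b", "logs/x"}) {
		fmt.Println(s.Id, "waits for", s.NumSignals, "and signals", s.ToSignalAfter)
	}
}
```

In the real flow these specs, plus the manifest task, would go into a single `InsertTasks` call while `start-export-{branch}` is still owned.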
## References

### Well-known task queues
1. [Celery][celery-api]
2. [Machinery][machinery-api]

### Modules
1. [nanoid][nanoid]

### Graphs
1. [Parallel series][parallel-series]

[celery-api]: https://docs.celeryproject.org/en/stable/userguide/index.html
[machinery-api]: https://github.com/RichardKnop/machinery#readme
[nanoid]: https://www.npmjs.com/package/nanoid
[parallel-series]: https://www.cpp.edu/~elab/projects/project_05/index.html