# Life-cycle of a pod in rkt

Throughout this document `$var` is used to refer to the directory `/var/lib/rkt/pods`, and `$uuid` refers to a pod's UUID, e.g. "076292e6-54c4-4cc8-9fa7-679c5f7dcfd3".

Due to rkt's [architecture][rkt-arch] - and specifically its lack of any management daemon process - a combination of advisory file locking and atomic directory renames (via [`rename(2)`][man-rename]) is used to represent and transition the basic pod states.

At times where a state must be reliably coupled to an executing process, that process is executed with an open file descriptor possessing an exclusive advisory lock on the respective pod's directory.
Should that process exit for any reason, its open file descriptors will automatically be closed by the kernel, implicitly unlocking the pod's directory.
By attempting to acquire a shared non-blocking advisory lock on a pod directory we're able to poll for these process-bound states.
Additionally, by employing a blocking acquisition mode we may reliably synchronize indirectly with the exit of such processes, effectively providing us with a wake-up event the moment such a state transitions.
For more information on advisory locks see the [`flock(2)`][man-flock] man page.

At this time there are four distinct phases of a pod's life which involve process-bound states:

* Prepare
* Run
* ExitedGarbage
* Garbage

Each of these phases involves an exclusive lock on a given pod's directory.
As an exclusive lock by itself cannot express both the phase and process-bound activity within that phase, we combine the lock with the pod's directory location to represent the whole picture:

| Phase         | Directory                   | Locked exclusively      | Unlocked                 |
|---------------|-----------------------------|-------------------------|--------------------------|
| Prepare       | "$var/prepare/$uuid"        | preparing               | prepare-failed           |
| Run           | "$var/run/$uuid"            | running                 | exited                   |
| ExitedGarbage | "$var/exited-garbage/$uuid" | exited+deleting         | exited+gc-marked         |
| Garbage       | "$var/garbage/$uuid"        | prepare-failed+deleting | prepare-failed+gc-marked |

To prevent the period between first creating a pod's directory and acquiring its lock from appearing as prepare-failed in the Prepare phase, and to provide a phase for prepared pods where they may dwell and the lock may be acquired prior to entering the Run phase, two additional directories are employed where locks have no meaning:

| Phase           | Directory                   | Locked exclusively      | Unlocked                 |
|-----------------|-----------------------------|-------------------------|--------------------------|
| Embryo          | "$var/embryo/$uuid"         | -                       | -                        |
| Prepare         | "$var/prepare/$uuid"        | preparing               | prepare-failed           |
| Prepared        | "$var/prepared/$uuid"       | -                       | -                        |
| Run             | "$var/run/$uuid"            | running                 | exited                   |
| ExitedGarbage   | "$var/exited-garbage/$uuid" | exited+deleting         | exited+gc-marked         |
| Garbage         | "$var/garbage/$uuid"        | prepare-failed+deleting | prepare-failed+gc-marked |
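
The table above amounts to a lookup from (directory, lock status) to a state name. The following Go sketch encodes it directly. The `podState` helper and its types are hypothetical, and returning the directory name for Embryo and Prepared (where the table shows "-") is an assumption made for illustration.

```go
package main

import "fmt"

// lockStates holds the state names for one phase directory.
type lockStates struct {
	locked   string // state while an exclusive lock is held
	unlocked string // state once the lock has been released
}

// phaseStates mirrors the table, keyed by the directory name under $var.
var phaseStates = map[string]lockStates{
	"prepare":        {"preparing", "prepare-failed"},
	"run":            {"running", "exited"},
	"exited-garbage": {"exited+deleting", "exited+gc-marked"},
	"garbage":        {"prepare-failed+deleting", "prepare-failed+gc-marked"},
}

// podState (a hypothetical helper) derives a pod's state from where its
// directory lives and whether that directory is exclusively locked.
func podState(phaseDir string, lockedExclusively bool) (string, error) {
	switch phaseDir {
	case "embryo", "prepared":
		// Locks have no meaning here; the directory alone names the state.
		return phaseDir, nil
	}
	s, ok := phaseStates[phaseDir]
	if !ok {
		return "", fmt.Errorf("unknown phase directory %q", phaseDir)
	}
	if lockedExclusively {
		return s.locked, nil
	}
	return s.unlocked, nil
}
```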

## App

The experimental `rkt app` family of subcommands allows mutating operations on a running pod: namely, adding, starting, stopping, and removing applications.
To be able to use these subcommands the environment variable `RKT_EXPERIMENT_APP=true` must be set.
The `rkt app sandbox` subcommand transitions to the Run phase as described above, whereas the remaining subcommands mutate the pod while staying in the Run phase.
To synchronize operations inside the Run phase an additional advisory lock, `$var/run/$uuid/pod.lck`, is introduced.
Locking the `$var/run/$uuid/pod` manifest itself won't work because changes to it need to be atomic, which is realized by overwriting the original manifest.
If `$var/run/$uuid/pod.lck` is locked, the pod is undergoing a mutation. Note that only `rkt app add/rm` operations are synchronized.
To retain consistency for all other operations (e.g. `rkt list`) that need to read the `$var/run/$uuid/pod` manifest, all mutating operations are atomic.

The `app add/start/stop/rm` subcommands all run within the Run phase where the exclusive advisory lock on the `$var/run/$uuid` directory is held by the systemd-nspawn process.
The following table gives an overview of the states when a lock on `$var/run/$uuid/pod.lck` is being held:

| Phase  | Locked exclusively | Unlocked |
|--------|--------------------|----------|
| Add    | adding             | added    |
| Start  | -                  | -        |
| Stop   | -                  | -        |
| Remove | removing           | removed  |

These phases, their function, and how they proceed through their respective states are explained in more detail below.

## Embryo

`rkt run` and `rkt prepare` instantiate a new pod by creating an empty directory at `$var/embryo/$uuid`.

An exclusive lock is immediately acquired on the created directory which is then renamed to `$var/prepare/$uuid`, transitioning to the `Prepare` phase.

## Prepare

`rkt run` and `rkt prepare` enter this phase identically, holding an exclusive lock on the pod directory `$var/prepare/$uuid`.

After preparation completes, while still holding the exclusive lock (the lock is held for the duration):

* `rkt prepare` transitions to `Prepared` by renaming `$var/prepare/$uuid` to `$var/prepared/$uuid`.
* `rkt run` transitions directly from `Prepare` to `Run` by renaming `$var/prepare/$uuid` to `$var/run/$uuid`, entirely skipping the `Prepared` phase.

Should `Prepare` fail or be interrupted, `$var/prepare/$uuid` will be left in an unlocked state.
Any directory in `$var/prepare` in an unlocked state is considered a failed prepare.
`rkt gc` identifies failed prepares in need of clean up by trying to acquire a shared lock on all directories in `$var/prepare`, renaming successfully locked directories to `$var/garbage` where they are then deleted.

## Prepared

`rkt prepare` concludes successfully by leaving the pod directory at `$var/prepared/$uuid` in an unlocked state before returning `$uuid` to the user.

`rkt run-prepared` resumes where `rkt prepare` concluded by exclusively locking the pod at `$var/prepared/$uuid` before renaming it to `$var/run/$uuid`, specifically acquiring the lock prior to entering the `Run` phase.

`rkt run` never enters this phase, skipping directly from `Prepare` to `Run` with the lock held.

## Run

`rkt run` and `rkt run-prepared` both arrive here with the pod at `$var/run/$uuid` while holding the exclusive lock.

The pod is then executed while holding this lock.
It is required that the stage1 `coreos.com/rkt/stage1/run` entrypoint keep the file descriptor representing the exclusive lock open for the lifetime of the pod's process.
All this requires is that the stage1 implementation not close the inherited file descriptor.
This is facilitated by supplying stage1 its number in the `RKT_LOCK_FD` environment variable.

What follows applies equally to `rkt run` and `rkt run-prepared`.

## Death / exit

A pod is considered exited if a shared lock can be acquired on `$var/run/$uuid`.
Upon exit of a pod's process, the exclusive lock acquired before entering the `Run` phase is released by the kernel.

## Garbage collection

Exited pods are discarded using a common mark-and-sweep style of garbage collection by invoking the `rkt gc` command.
This relatively simple approach lends itself well to a minimal file-system based implementation utilizing no additional daemons or record keeping with good efficiency.
The process is performed in two distinct passes explained in detail below.

### Pass 1: mark

All directories found in `$var/run` are tested for exited status by trying to acquire a shared advisory lock on each directory.

When a directory's lock cannot be acquired, the directory is skipped, since a held lock indicates the pod is currently executing.

When the lock is successfully acquired, the directory is renamed from `$var/run/$uuid` to `$var/exited-garbage/$uuid`.
This renaming effectively implements the "mark" operation.
Since the locks are immediately released, operations like `rkt status` may safely execute concurrently with `rkt gc`.

Marked exited pods dwell in the `$var/exited-garbage` directory for a grace period during which their status may continue to be queried by `rkt status`.
The rename from `$var/run/$uuid` to `$var/exited-garbage/$uuid` serves in part to keep marked pods from cluttering the `$var/run` directory during their respective dwell periods.

### Pass 2: sweep

A side-effect of the rename operation responsible for moving a pod from `$var/run` to `$var/exited-garbage` is an update to the pod directory's change time.
The sweep operation takes this updated file change time as the beginning of the "dwell" grace period, and discards exited pods at the expiration of that period.
This grace period currently defaults to 30 minutes, and may be explicitly specified using the `--grace-period=duration` flag with `rkt gc`.
Note that this grace period begins from the time a pod was marked by `rkt gc`, not when the pod exited.
A pod becomes eligible for marking when it exits, but will not actually be marked for collection until a subsequent `rkt gc`.

The change times of all directories found in `$var/exited-garbage` are compared against the current time.
Directories having sufficiently old change times are locked exclusively and cleaned up.
If a lock acquisition fails, the directory is skipped.
`rkt gc` may fail to acquire an exclusive lock if the pod to be collected is currently being accessed, by `rkt status` or another `rkt gc`, for example.
The skipped pods will be revisited on a subsequent `rkt gc` invocation's sweep pass.
During the cleanup, the pod's stage1 gc entrypoint is first executed.
This gives the stage1 a chance to clean up anything related to the environment shared between containers.
The default stage1 uses the gc entrypoint to clean up the private networking artifacts.
After the completion of the gc entrypoint, the pod directory is recursively deleted.

## Pulse

To answer the questions "Has this pod exited?" and "Is this pod being deleted?" the pod's UUID is looked for in `$var/run` and `$var/exited-garbage`, respectively.
Pods found in the `$var/exited-garbage` directory must already be exited, and a shared lock acquisition may be used to determine if the garbage pod is actively being deleted.
Those found in the `$var/run` directory may be exited or running, and a failed shared lock acquisition indicates a pod in `$var/run` is alive at the time of the failed acquisition.

Care must be taken when acting on what is effectively always going to be stale knowledge of pod state; though a pod's status may be found to be "running" by the mechanisms documented here, this was an instantaneously sampled state that was true at the time sampled (failed lock attempt at `$var/run/$uuid`), and may cease to be true by the time code execution has progressed to acting on that sample.
Pod exit is totally asynchronous and cannot be prevented; relevant code (e.g. `rkt enter`) must take this into consideration and be tolerant of states progressing.

For example, two `rkt run-prepared` invocations for the same UUID may occur simultaneously.
Only one of these will successfully transition the pod from `Prepared` to `Run` due to rename's atomicity, which is exactly what we want.
The loser of this race needs to simply inform the user of the inability to transition the pod to the run state, perhaps with a check to see if the pod transitioned independently and a useful message mentioning it.

Another example would be two `rkt gc` commands finding the same exited pods and attempting to transition them to the `ExitedGarbage` phase concurrently.
They can't both perform the transitions; one will lose the race at each pod.
This needs to be considered in the error handling of the transition callers as perfectly normal.
Simply ignoring ENOENT errors propagated from the loser's rename calls can suffice.

[man-flock]: http://man7.org/linux/man-pages/man2/flock.2.html
[man-rename]: http://man7.org/linux/man-pages/man2/rename.2.html
[rkt-arch]: ../devel/architecture.md