# Life-cycle of a pod in rkt

Throughout this document `$var` is used to refer to the directory `/var/lib/rkt/pods`, and `$uuid` refers to a pod's UUID e.g. "076292e6-54c4-4cc8-9fa7-679c5f7dcfd3".

Due to rkt's [architecture][rkt-arch] - and specifically its lack of any management daemon process - a combination of advisory file locking and atomic directory renames (via [`rename(2)`][man-rename]) is used to represent and transition the basic pod states.

At times where a state must be reliably coupled to an executing process, that process is executed with an open file descriptor possessing an exclusive advisory lock on the respective pod's directory.
Should that process exit for any reason, its open file descriptors will automatically be closed by the kernel, implicitly unlocking the pod's directory.
By attempting to acquire a shared non-blocking advisory lock on a pod directory we're able to poll for these process-bound states.
Additionally, by employing a blocking acquisition mode we may reliably synchronize indirectly with the exit of such processes, effectively providing us with a wake-up event the moment such a state transitions.
For more information on advisory locks see the [`flock(2)`][man-flock] man page.

At this time there are four distinct phases of a pod's life which involve process-bound states:

* Prepare
* Run
* ExitedGarbage
* Garbage

Each of these phases involves an exclusive lock on a given pod's directory.
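
The polling and blocking lock acquisitions described above can be sketched as follows. This is a minimal illustration in Python rather than rkt's own Go code, and the helper names `open_dir`, `is_locked`, and `wait_for_unlock` are hypothetical, not part of rkt.

```python
import fcntl
import os
import tempfile


def open_dir(path):
    """Open a directory read-only; the descriptor can carry a flock(2) lock."""
    return os.open(path, os.O_RDONLY)


def is_locked(path):
    """Poll: True if some other descriptor holds an exclusive lock on path."""
    fd = open_dir(path)
    try:
        # Non-blocking shared lock: fails at once if an exclusive lock exists.
        fcntl.flock(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
    except BlockingIOError:
        return True
    else:
        return False
    finally:
        os.close(fd)  # closing the descriptor also drops our shared lock


def wait_for_unlock(path):
    """Block until the exclusive holder unlocks or exits."""
    fd = open_dir(path)
    try:
        fcntl.flock(fd, fcntl.LOCK_SH)  # blocking acquisition
    finally:
        os.close(fd)
```

A caller would take `fcntl.LOCK_EX` on a descriptor from `open_dir` to play the role of the process-bound state; `is_locked` then reports the lock until that descriptor is closed.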
As an exclusive lock by itself cannot express both the phase and process-bound activity within that phase, we combine the lock with the pod's directory location to represent the whole picture:

| Phase         | Directory                   | Locked exclusively      | Unlocked                 |
|---------------|-----------------------------|-------------------------|--------------------------|
| Prepare       | "$var/prepare/$uuid"        | preparing               | prepare-failed           |
| Run           | "$var/run/$uuid"            | running                 | exited                   |
| ExitedGarbage | "$var/exited-garbage/$uuid" | exited+deleting         | exited+gc-marked         |
| Garbage       | "$var/garbage/$uuid"        | prepare-failed+deleting | prepare-failed+gc-marked |

To prevent the period between first creating a pod's directory and acquiring its lock from appearing as prepare-failed in the Prepare phase, and to provide a phase for prepared pods where they may dwell and the lock may be acquired prior to entering the Run phase, two additional directories are employed where locks have no meaning:

| Phase           | Directory                   | Locked exclusively      | Unlocked                 |
|-----------------|-----------------------------|-------------------------|--------------------------|
| Embryo          | "$var/embryo/$uuid"         | -                       | -                        |
| Prepare         | "$var/prepare/$uuid"        | preparing               | prepare-failed           |
| Prepared        | "$var/prepared/$uuid"       | -                       | -                        |
| Run             | "$var/run/$uuid"            | running                 | exited                   |
| ExitedGarbage   | "$var/exited-garbage/$uuid" | exited+deleting         | exited+gc-marked         |
| Garbage         | "$var/garbage/$uuid"        | prepare-failed+deleting | prepare-failed+gc-marked |

## App

The `rkt app` experimental family of subcommands allows mutating operations on a running pod: namely, adding, starting, stopping, and removing applications.
To be able to use these subcommands the environment variable `RKT_EXPERIMENT_APP=true` must be set.
The `rkt app sandbox` subcommand transitions to the Run phase as described above, whereas the remaining subcommands mutate the pod while staying in the Run phase.
To synchronize operations inside the Run phase an additional advisory lock `$var/run/$uuid/pod.lck` is introduced.
Locking the `$var/run/$uuid/pod` manifest itself won't work because changes to it need to be atomic, which is realized by overwriting the original manifest.
If the `pod.lck` file is locked, the pod is undergoing a mutation.
Note that only `rkt app add` and `rkt app rm` operations are synchronized.
To retain consistency for all other operations (e.g. `rkt list`) that need to read the `$var/run/$uuid/pod` manifest, all mutating operations are atomic.

The `app add/start/stop/rm` subcommands all run within the Run phase where the exclusive advisory lock on the `$var/run/$uuid` directory is held by the systemd-nspawn process.
The following table gives an overview of the states when a lock on `$var/run/$uuid/pod.lck` is being held:

| Phase  | Locked exclusively | Unlocked |
|--------|--------------------|----------|
| Add    | adding             | added    |
| Start  | -                  | -        |
| Stop   | -                  | -        |
| Remove | removing           | removed  |

These phases, their function, and how they proceed through their respective states are explained in more detail below.

## Embryo

`rkt run` and `rkt prepare` instantiate a new pod by creating an empty directory at `$var/embryo/$uuid`.

An exclusive lock is immediately acquired on the created directory which is then renamed to `$var/prepare/$uuid`, transitioning to the `Prepare` phase.

## Prepare

`rkt run` and `rkt prepare` enter this phase identically, holding an exclusive lock on the pod directory `$var/prepare/$uuid`.

After preparation completes, while still holding the exclusive lock (the lock is held for the duration):

`rkt prepare` transitions to `Prepared` by renaming `$var/prepare/$uuid` to `$var/prepared/$uuid`.

`rkt run` transitions directly from `Prepare` to `Run` by renaming `$var/prepare/$uuid` to `$var/run/$uuid`, entirely skipping the `Prepared` phase.
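
The Embryo and Prepare transitions above amount to "create, lock, rename". The following is a rough Python sketch under the assumption that the phase directories live under a caller-supplied `var` path; `new_pod` and `transition` are hypothetical helper names, and rkt's real implementation is in Go.

```python
import fcntl
import os
import tempfile
import uuid


def new_pod(var):
    """Create var/embryo/<uuid>, lock it exclusively, rename it into prepare/."""
    pod_uuid = str(uuid.uuid4())
    embryo = os.path.join(var, "embryo", pod_uuid)
    os.makedirs(embryo)
    fd = os.open(embryo, os.O_RDONLY)
    fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    # The lock belongs to the open descriptor, not the path, so it
    # survives the rename into the next phase directory.
    transition(var, pod_uuid, "embryo", "prepare")
    return pod_uuid, fd


def transition(var, pod_uuid, src_phase, dst_phase):
    """Move a pod between phase directories with one atomic rename(2)."""
    os.makedirs(os.path.join(var, dst_phase), exist_ok=True)
    os.rename(
        os.path.join(var, src_phase, pod_uuid),
        os.path.join(var, dst_phase, pod_uuid),
    )
```

With these helpers, `rkt prepare` would correspond to `transition(var, u, "prepare", "prepared")` followed by closing the descriptor, while `rkt run` would correspond to `transition(var, u, "prepare", "run")` with the descriptor kept open.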

Should `Prepare` fail or be interrupted, `$var/prepare/$uuid` will be left in an unlocked state.
Any directory in `$var/prepare` in an unlocked state is considered a failed prepare.
`rkt gc` identifies failed prepares in need of clean up by trying to acquire a shared lock on all directories in `$var/prepare`, renaming successfully locked directories to `$var/garbage` where they are then deleted.

## Prepared

`rkt prepare` concludes successfully by leaving the pod directory at `$var/prepared/$uuid` in an unlocked state before returning `$uuid` to the user.

`rkt run-prepared` resumes where `rkt prepare` concluded by exclusively locking the pod at `$var/prepared/$uuid` before renaming it to `$var/run/$uuid`, specifically acquiring the lock prior to entering the `Run` phase.

`rkt run` never enters this phase, skipping directly from `Prepare` to `Run` with the lock held.

## Run

`rkt run` and `rkt run-prepared` both arrive here with the pod at `$var/run/$uuid` while holding the exclusive lock.

The pod is then executed while holding this lock.
It is required that the stage1 `coreos.com/rkt/stage1/run` entrypoint keep the file descriptor representing the exclusive lock open for the lifetime of the pod's process.
All this requires is that the stage1 implementation not close the inherited file descriptor.
This is facilitated by supplying stage1 with the descriptor's number in the `RKT_LOCK_FD` environment variable.

What follows applies equally to `rkt run` and `rkt run-prepared`.

## Death / exit

A pod is considered exited if a shared lock can be acquired on `$var/run/$uuid`.
Upon exit of a pod's process, the exclusive lock acquired before entering the `Run` phase is released by the kernel.

## Garbage collection

Exited pods are discarded using a common mark-and-sweep style of garbage collection by invoking the `rkt gc` command.
This relatively simple approach lends itself well to an efficient, minimal file-system based implementation that needs no additional daemons or record keeping.
The process is performed in two distinct passes explained in detail below.

### Pass 1: mark

All directories found in `$var/run` are tested for exited status by trying to acquire a shared advisory lock on each directory.

When a directory's lock cannot be acquired, the directory is skipped, as this indicates the pod is currently executing.

When the lock is successfully acquired, the directory is renamed from `$var/run/$uuid` to `$var/exited-garbage/$uuid`.
This renaming effectively implements the "mark" operation.
Since the locks are immediately released, operations like `rkt status` may safely execute concurrently with `rkt gc`.

Marked exited pods dwell in the `$var/exited-garbage` directory for a grace period during which their status may continue to be queried by `rkt status`.
The rename from `$var/run/$uuid` to `$var/exited-garbage/$uuid` serves in part to keep marked pods from cluttering the `$var/run` directory during their respective dwell periods.

### Pass 2: sweep

A side-effect of the rename operation responsible for moving a pod from `$var/run` to `$var/exited-garbage` is an update to the pod directory's change time.
The sweep operation takes this updated file change time as the beginning of the "dwell" grace period, and discards exited pods at the expiration of that period.
This grace period currently defaults to 30 minutes, and may be explicitly specified using the `--grace-period=duration` flag with `rkt gc`.
Note that this grace period begins from the time a pod was marked by `rkt gc`, not when the pod exited.
A pod becomes eligible for marking when it exits, but will not actually be marked for collection until a subsequent `rkt gc`.
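
The mark pass and the grace-period test can be sketched as below. This is an illustrative Python approximation, not rkt's Go implementation: `mark`, `expired`, and `GRACE_PERIOD` are hypothetical names, and the sketch assumes `var/run` already exists.

```python
import fcntl
import os
import tempfile
import time

GRACE_PERIOD = 30 * 60  # seconds; mirrors the 30 minute default described above


def mark(var):
    """Pass 1: rename every unlocked pod in run/ to exited-garbage/."""
    run_dir = os.path.join(var, "run")
    garbage_dir = os.path.join(var, "exited-garbage")
    os.makedirs(garbage_dir, exist_ok=True)
    marked = []
    for name in sorted(os.listdir(run_dir)):
        src = os.path.join(run_dir, name)
        fd = os.open(src, os.O_RDONLY)
        try:
            fcntl.flock(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
        except BlockingIOError:
            continue  # exclusive lock still held: the pod is running
        else:
            os.rename(src, os.path.join(garbage_dir, name))  # the "mark"
            marked.append(name)
        finally:
            os.close(fd)  # releases the shared lock immediately
    return marked


def expired(pod_dir, grace=GRACE_PERIOD, now=None):
    """True once the directory's change time is at least `grace` seconds old."""
    now = time.time() if now is None else now
    return now - os.stat(pod_dir).st_ctime >= grace
```

The sweep pass would then consider only those directories in `exited-garbage` for which `expired` returns true.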

The change times of all directories found in `$var/exited-garbage` are compared against the current time.
Directories having sufficiently old change times are locked exclusively and cleaned up.
If a lock acquisition fails, the directory is skipped.
`rkt gc` may fail to acquire an exclusive lock if the pod to be collected is currently being accessed, by `rkt status` or another `rkt gc`, for example.
The skipped pods will be revisited on a subsequent `rkt gc` invocation's sweep pass.
During the cleanup, the pod's stage1 gc entrypoint is first executed.
This gives the stage1 a chance to clean up anything related to the environment shared between containers.
The default stage1 uses the gc entrypoint to clean up the private networking artifacts.
After the completion of the gc entrypoint, the pod directory is recursively deleted.

## Pulse

To answer the questions "Has this pod exited?" and "Is this pod being deleted?" the pod's UUID is looked for in `$var/run` and `$var/exited-garbage`, respectively.
Pods found in the `$var/exited-garbage` directory must already be exited, and a shared lock acquisition may be used to determine if the garbage pod is actively being deleted.
Those found in the `$var/run` directory may be exited or running, and a failed shared lock acquisition indicates a pod in `$var/run` is alive at the time of the failed acquisition.

Care must be taken when acting on what is effectively always going to be stale knowledge of pod state.
Though a pod's status may be found to be "running" by the mechanisms documented here, this is an instantaneously sampled state that was true at the time it was sampled (a failed lock attempt at `$var/run/$uuid`), and it may cease to be true by the time code execution progresses to acting on that sample.
Pod exit is totally asynchronous and cannot be prevented, so relevant code (e.g. `rkt enter`) must take this into consideration and be tolerant of states progressing.

For example, two `rkt run-prepared` invocations for the same UUID may occur simultaneously.
Only one of these will successfully transition the pod from `Prepared` to `Run` due to rename's atomicity, which is exactly what we want.
The loser of this race needs to simply inform the user of the inability to transition the pod to the run state, perhaps with a check to see if the pod transitioned independently and a useful message mentioning it.

Another example would be two `rkt gc` commands finding the same exited pods and attempting to transition them to the `ExitedGarbage` phase concurrently.
They can't both perform the transitions; one will lose the race at each pod.
This needs to be considered in the error handling of the transition callers as perfectly normal.
Simply ignoring ENOENT errors propagated from the loser's rename calls can suffice.


[man-flock]: http://man7.org/linux/man-pages/man2/flock.2.html
[man-rename]: http://man7.org/linux/man-pages/man2/rename.2.html
[rkt-arch]: ../devel/architecture.md
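
The race tolerance described in the Pulse section, where the loser of a concurrent rename simply ignores ENOENT, might look like the following Python sketch. The function name `try_transition` is hypothetical, and the sketch assumes the destination's parent directory already exists, since a missing parent would also surface as ENOENT.

```python
import os
import tempfile


def try_transition(src, dst):
    """Attempt an atomic rename; return False if src has already been moved."""
    try:
        os.rename(src, dst)
    except FileNotFoundError:
        # ENOENT: another process won the race and renamed the pod first.
        return False
    return True
```

A caller that receives `False` can treat the result as normal, optionally checking whether the pod now exists at the destination before reporting to the user.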