
# rkt architecture

This document discusses the rkt architecture in detail.
For a more hands-on guide to inspecting rkt's internals, check [Inspect how rkt works](inspect-containers.md).

## Overview

rkt's primary interface is a command-line tool, `rkt`, which does not require a long-running daemon.
This architecture allows rkt to be updated in-place without affecting application containers which are currently running.
It also means that levels of privilege can be separated out between different operations.

All state in rkt is communicated via the filesystem.
Facilities like file-locking are used to ensure co-operation and mutual exclusion between concurrent invocations of the `rkt` command.
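
For example, mutual exclusion on a pod can be implemented with an advisory lock on its directory. The following is a minimal Go sketch of the idea, with a hypothetical pod path (rkt's real locking code is more involved):

```go
package main

import (
	"fmt"
	"log"
	"syscall"
)

// lockPodDir takes an exclusive, non-blocking advisory lock on a pod
// directory, so that concurrent `rkt` invocations can detect that the
// pod is busy instead of racing on its state.
func lockPodDir(dir string) (int, error) {
	fd, err := syscall.Open(dir, syscall.O_RDONLY|syscall.O_DIRECTORY, 0)
	if err != nil {
		return -1, err
	}
	if err := syscall.Flock(fd, syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		syscall.Close(fd)
		return -1, fmt.Errorf("pod in use: %v", err)
	}
	// The lock is released when fd is closed or the process exits.
	return fd, nil
}

func main() {
	fd, err := lockPodDir("/var/lib/rkt/pods/run/some-uuid") // hypothetical path
	if err != nil {
		log.Fatal(err)
	}
	defer syscall.Close(fd)
	// ... operate on the pod ...
}
```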

## Stages

Execution with rkt is divided into several distinct stages.

**NB** The goal is for the ABI between stages to be relatively fixed, but while rkt is under heavy development it is still evolving.

After calling `rkt`, the execution chain follows the numbering of the stages, in the following general order:

![execution-flow](execution-flow.png)

1. invoking process -> stage0:
The invoking process uses its own mechanism to invoke the rkt binary (stage0). When started via a regular shell or a supervisor, stage0 is usually forked and exec'ed, becoming a child process of the invoking shell or supervisor.

2. stage0 -> stage1:
An ordinary [`exec(3)`][man-exec] is used to replace the stage0 process with the stage1 entrypoint. The entrypoint is referenced by the `coreos.com/rkt/stage1/run` annotation in the stage1 image manifest.

3. stage1 -> stage2:
The stage1 entrypoint uses its own mechanism to invoke the stage2 app executables. The app executables are referenced by the `apps.app.exec` settings in the stage2 image manifest.

The details of the execution flow vary across the different stage1 implementations.

### Stage 0

The first stage is the actual `rkt` binary itself.
When running a pod, this binary is responsible for performing a number of initial preparatory tasks:

- Fetching the specified ACIs, including the stage1 ACI given via `--stage1-{url,path,name,hash,from-dir}`, if specified.
- Generating a pod UUID
- Generating a pod manifest
- Creating a filesystem for the pod
- Setting up stage1 and stage2 directories in the filesystem
- Unpacking the stage1 ACI into the pod filesystem
- Unpacking the ACIs and copying each app into the stage2 directories

Given a run command such as:

```
# rkt run app1.aci app2.aci
```

a pod manifest compliant with the ACE spec will be generated, and the filesystem created by stage0 should look like:

```
/pod
/stage1
/stage1/manifest
/stage1/rootfs/init
/stage1/rootfs/opt
/stage1/rootfs/opt/stage2/${app1-name}
/stage1/rootfs/opt/stage2/${app2-name}
```

where:

- `pod` is the pod manifest file
- `stage1` is a copy of the stage1 ACI that is safe for read/write
- `stage1/manifest` is the manifest of the stage1 ACI
- `stage1/rootfs` is the rootfs of the stage1 ACI
- `stage1/rootfs/init` is the actual stage1 binary to be executed (this path may vary according to the `coreos.com/rkt/stage1/run` annotation of the stage1 ACI)
- `stage1/rootfs/opt/stage2` contains a copy of each unpacked app ACI

At this point the stage0 execs `/stage1/rootfs/init` with the current working directory set to the root of the new filesystem.
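
In Go terms, the hand-off is roughly the following. This is a simplified sketch; the real stage0 resolves the entrypoint from the `coreos.com/rkt/stage1/run` annotation rather than hard-coding it:

```go
package main

import (
	"log"
	"os"
	"syscall"
)

func main() {
	podRoot := os.Args[1] // the pod filesystem prepared by stage0

	// Set the working directory to the root of the new filesystem...
	if err := os.Chdir(podRoot); err != nil {
		log.Fatal(err)
	}

	// ...then replace the current (stage0) process with the stage1
	// entrypoint via exec(3). On success this call never returns.
	const init = "stage1/rootfs/init"
	if err := syscall.Exec(init, []string{init}, os.Environ()); err != nil {
		log.Fatal(err)
	}
}
```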

### Stage 1

The next stage is a binary that the user trusts. It is responsible for taking the pod filesystem created by stage0 and setting up the container isolation, networking, and mounts needed to launch the pod.
Specifically, it must:

- Read the Image and Pod Manifests. The Image Manifest defines the default `exec` specifications of each application; the Pod Manifest defines the ordering of the units, as well as any overrides. (See the manifest-loading sketch after this list.)
- Set up/execute the actual isolation environment for the target pod, called the "stage1 flavor". Currently there are three flavors implemented:
  - fly: a simple chroot-only environment.
  - systemd/nspawn: a cgroup/namespace-based isolation environment using systemd and systemd-nspawn.
  - kvm: a fully isolated kvm environment.
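
As a rough illustration of the first step, the pod manifest written by stage0 can be parsed with the appc schema types. This is only a sketch; real stage1 entrypoints perform considerably more validation and merging:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"

	"github.com/appc/spec/schema"
)

func main() {
	// stage0 writes the pod manifest to the "pod" file at the pod root.
	data, err := os.ReadFile("pod")
	if err != nil {
		log.Fatal(err)
	}

	var pm schema.PodManifest
	if err := json.Unmarshal(data, &pm); err != nil {
		log.Fatal(err)
	}

	// Each entry carries the app name and its (possibly overridden)
	// exec settings, which determine the stage2 executables to launch.
	for _, ra := range pm.Apps {
		if ra.App != nil {
			fmt.Println(ra.Name, ra.App.Exec)
		}
	}
}
```
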
There are also out-of-tree stage1 implementations:
- [stage1-xen](https://github.com/rkt/stage1-xen), a stage1 based on the Xen hypervisor
- [volo](https://github.com/lucab/rkt-volo), a stage1 written in Rust, akin to stage1-fly
- [docker-skim](https://github.com/coreos/docker-skim), a weaker version of stage1-fly using `LD_PRELOAD` tricks instead of chroot
- [dgr/aci-builder](https://github.com/blablacar/dgr), a stage1 used internally by the container build and runtime tool dgr
- [stage1-builder](https://github.com/kinvolk/stage1-builder), scripts and tools to generate KVM-based stage1 images with customized Linux kernels

### Stage 2

The final stage, stage2, is the actual environment in which the applications run, as launched by stage1.

## Flavors

### systemd/nspawn flavors

The "host", "src", and "coreos" flavors (referred to as the systemd/nspawn flavors) use `systemd-nspawn` and `systemd` to set up the execution chain.
They include a very minimal systemd that takes care of launching the apps in each pod, applying per-app resource isolators, and making sure the apps finish in an orderly manner.

These flavors will:
- Read the image and pod manifests
- Generate systemd unit files from those manifests
- Create and enter a network namespace if rkt is not started with `--net=host`
- Start systemd-nspawn (which takes care of the following steps)
    - Set up any external volumes
    - Launch systemd as PID 1 in the pod within the appropriate cgroups and namespaces
    - Have systemd inside the pod launch the app(s).

The process is slightly different for the qemu-kvm stage1, but it follows a similar workflow, starting with `exec()`'ing kvm instead of systemd-nspawn.

We will now detail how the starting, shutdown, and exit status collection of the apps in a pod are implemented internally.

### Immutable vs. mutable pods

rkt supports two kinds of pod runtime environments: an _immutable pod_ runtime environment, and a new, experimental _mutable pod_ runtime environment.

The immutable runtime environment is currently the default, i.e. the one used when executing any `rkt prepare` or `rkt run` command.
Once a pod has been created in this mode, no modifications can be applied.

Conversely, the mutable runtime environment allows users to add, remove, start, and stop applications after a pod has been started.
Currently this mode is only available in the experimental `rkt app` family of subcommands; see the [app pod lifecycle documentation](pod-lifecycle.md#app) for a more detailed description.

Both runtime environments are supervised internally by systemd, using a custom dependency graph.
The differences between the two dependency graphs are described below.

#### Immutable runtime environment

![rkt-systemd](rkt-systemd.png)

There's a systemd rkt apps target (`default.target`) which has a [*Wants*][systemd-wants] and [*After*][systemd-beforeafter] dependency on each app's service file, making sure they all start.
Once this target is reached, the pod is in its steady state. This is signaled by the pod supervisor via a dedicated `supervisor-ready.service`, which is triggered by `default.target` with a [*Wants*][systemd-wants] dependency on it.

Each app's service has a *Wants* dependency on an associated reaper service that deals with writing the app's exit status.
Each reaper service has a *Wants* and *After* dependency on `shutdown.service`, which simply shuts down the pod.

The reaper services and the `shutdown.service` all start at the beginning but do nothing and remain after exit (via the [*RemainAfterExit*][systemd-remainafterexit] flag).
Thanks to the [*StopWhenUnneeded*][systemd-stopwhenunneeded] flag, whenever they stop being referenced, they do the actual work via the *ExecStop* command.

This means that when an app's service is stopped, its associated reaper will run and write the app's exit status to `/rkt/status/${app}`, while the other apps continue running.
When all apps' services stop, their associated reaper services will also stop and will cease referencing `shutdown.service`, causing the pod to exit.
Every app service has an [*OnFailure*][systemd-onfailure] flag that starts the `halt.target`.
This means that if any app in the pod exits with a failed status, the systemd shutdown process will start, the other apps' services will automatically stop, and the pod will exit.
In this case, the failed app's exit status will get propagated to rkt.

A [*Conflicts*][systemd-conflicts] dependency was also added between each reaper service and the halt and poweroff targets (these are triggered when the pod is stopped from the outside, e.g. when rkt receives `SIGINT`).
This activates all the reaper services when one of the targets is activated, causing the exit statuses to be saved and the pod to finish as described in the previous paragraph.
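
rkt generates these unit files programmatically. As an illustration of the graph above, the key options of a reaper service could be assembled with the go-systemd `unit` package roughly as follows (a sketch, not a verbatim excerpt from rkt; the reaper helper path is hypothetical):

```go
package main

import (
	"io"
	"os"

	"github.com/coreos/go-systemd/unit"
)

// reaperOptions builds reaper-[app].service for the immutable environment:
// the unit starts early and stays "active" doing nothing (RemainAfterExit),
// and only does real work in ExecStop, which runs once nothing references
// the unit anymore (StopWhenUnneeded) or when the halt/poweroff targets
// start (Conflicts), recording the app's exit status.
func reaperOptions(app string) []*unit.UnitOption {
	return []*unit.UnitOption{
		unit.NewUnitOption("Unit", "Description", "app reaper for "+app),
		unit.NewUnitOption("Unit", "Wants", "shutdown.service"),
		unit.NewUnitOption("Unit", "After", "shutdown.service"),
		unit.NewUnitOption("Unit", "Conflicts", "halt.target"),
		unit.NewUnitOption("Unit", "Conflicts", "poweroff.target"),
		unit.NewUnitOption("Unit", "StopWhenUnneeded", "yes"),
		unit.NewUnitOption("Service", "RemainAfterExit", "yes"),
		unit.NewUnitOption("Service", "ExecStop", "/reaper.sh "+app), // hypothetical helper
	}
}

func main() {
	// Serialize the options into unit-file syntax, e.g. for reaper-app1.service.
	io.Copy(os.Stdout, unit.Serialize(reaperOptions("app1")))
}
```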

#### Mutable runtime environment

The initial mutable runtime environment is very simple and resembles a minimal systemd system without any applications installed.
Once `default.target` has been reached, apps can be added or removed.
Unlike in the immutable runtime environment, `default.target` has no dependencies on any apps, but only on `supervisor-ready.service` and `systemd-journald.service`, to ensure the journald daemon is started before apps are added.

So that the pod does not shut down immediately upon creation, `default.target` has `Before` and `Conflicts` dependencies on `halt.target`.
This "deadlock" state between `default.target` and `halt.target` keeps the mutable pod alive.
`halt.target` has `After` and `Requires` dependencies on `shutdown.service`.

![rkt-mutable](mutable.png)

When an app is added, the corresponding application service units `[app].service` and `reaper-[app].service` are generated (where `[app]` is the actual app name).
So that the pod does not shut down when all apps stop, these units have no dependency on `shutdown.service`.
The `OnFailure` behavior is the same as in the immutable environment.
When an app fails, `halt.target` and `shutdown.service` will be started, and `default.target` will be stopped.

![rkt-mutable-app](mutable-app.png)

The following table enumerates the differences in service unit behavior between the two environments:

Unit | Immutable | Mutable
-----|-----------|--------
`shutdown.service` | In **Started** state when the pod starts. *Stopped* when there is no dependency on it (`StopWhenUnneeded`) or on `OnFailure` of any app. | In **Stopped** state when the pod starts. *Started* at explicit shutdown or on `OnFailure` of any app.
`reaper-app.service` | `Wants` and `After` dependency on `shutdown.service`; `Conflicts` and `Before` dependency on `halt.target`. | `Conflicts` and `Before` dependency on `halt.target`.

#### Execution chain

We will now detail the execution chain for the stage1 systemd/nspawn flavors. The entrypoint, implemented in `stage1/init/init.go`, sets up the following execution chain:

1. "ld-linux-*.so.*": Depending on the architecture, the appropriate loader helper in the stage1 rootfs is invoked using "exec". This makes sure that subsequent binaries load shared libraries from the stage1 rootfs and not from the host file system.

2. "systemd-nspawn": Used for starting the actual container. systemd-nspawn registers the started container in "systemd-machined" on the host, if available. It is parameterized with the `--boot` option to instruct it to "fork+exec" systemd as the supervisor in the started container.

3. "systemd": Used as the supervisor in the started container. As on a regular host system, it uses "fork+exec" to execute the child app processes.

The following diagram illustrates the execution chain:

![execution-flow-systemd](execution-flow-systemd.png)

The resulting process tree reveals the parent-child relationships. Note that "exec"ed processes do not appear in the tree:

```
$ ps auxf
...
\_ -bash
    \_ stage1/rootfs/usr/lib/ld-linux-x86-64.so.2 stage1/rootfs/usr/bin/systemd-nspawn
        \_ /usr/lib/systemd/systemd
            \_ /usr/lib/systemd/systemd-journald
            \_ nginx
```

#### External resource limits

Depending on how rkt is executed, certain external resource limits may or may not be applied.

If rkt is executed within a systemd service, the container will inherit the cgroup resource limits applied to the service itself, as well as any ulimit-like limits.

![rkt-process-model-with-systemd](rkt-process-model-with-systemd.svg)

If rkt is executed, say, from a terminal, the container will inherit ulimit-like limits, but not cgroup resource limits.
The reason is that systemd moves the container to a new `machine` slice.

![rkt-process-model](rkt-process-model.svg)

### fly flavor

The "fly" flavor uses a very simple mechanism and is limited to executing a single child app process. The entrypoint is implemented in `stage1_fly/run/main.go`. After setting up a chroot'ed environment, it simply exec's the target app without any further internal supervision:

![execution-flow-fly](execution-flow-fly.png)
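
A minimal Go sketch of this chroot-then-exec pattern (the real entrypoint also prepares mounts, privileges, and the app's environment):

```go
package main

import (
	"log"
	"os"
	"syscall"
)

func main() {
	rootfs, app := os.Args[1], os.Args[2]

	// Confine the app to the prepared pod filesystem...
	if err := syscall.Chroot(rootfs); err != nil {
		log.Fatal(err)
	}
	if err := os.Chdir("/"); err != nil {
		log.Fatal(err)
	}

	// ...then replace this process with the app. No supervisor remains
	// in between, so the app becomes a direct child of the invoker.
	if err := syscall.Exec(app, []string{app}, os.Environ()); err != nil {
		log.Fatal(err)
	}
}
```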

The resulting example process tree shows the target process as a direct child of the invoking process:

```
$ ps auxf
...
\_ -bash
    \_ nginx
```

## Image lifecycle

rkt commands like `prepare` and `run` first need to retrieve all the images requested on the command line and prepare the stage2 directories with the application contents.

This is done with the following chain:

![image-chain](image-chain.png)

* Fetch: in the fetch phase rkt retrieves the requested images. The fetching implementation depends on the kind of image argument provided, such as an image string (e.g. `example.com/app:v1.0`), a hash, an HTTPS URL, or a file.
* Store: in the store phase the fetched images are saved to the local store. The local store is a cache for fetched images and related data.
* Render: in the render phase, a renderer pulls the required images from the store and renders them so they can be easily used as stage2 content.
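
Conceptually, the chain can be summarized by three narrow interfaces. This is an illustrative sketch, not rkt's actual internal API:

```go
package images

// Fetcher resolves an image argument (name, hash, URL, or file) and
// retrieves the image, consulting the store first to avoid re-downloads.
type Fetcher interface {
	Fetch(imageArg string) (hash string, err error)
}

// Store caches fetched images and their related data, keyed by hash.
type Store interface {
	Read(hash string) ([]byte, error)
	Write(hash string, data []byte) error
}

// Renderer expands a stored image (with its dependencies) into an
// on-disk rootfs directly usable as stage2 content.
type Renderer interface {
	Render(hash, dst string) error
}
```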

These three logical blocks are implemented inside rkt in this way:

![image-logical-blocks](image-logical-blocks.png)

Currently rkt implements the [appc spec][appc-spec] internally, converting other container image formats to it for compatibility. In the future, additional formats like the [OCI image spec][oci-img-spec] may be added to rkt, keeping the same basic scheme for fetching, storing, and rendering application container images.

* Fetchers: fetchers retrieve images from either a provided URL, or a URL found by [image discovery][appc-discovery] on a given image string. Fetchers read data from the Image Store to check if an image is already present. Once fetched, images are verified with their signatures, then saved in the Image Store. An image's [dependencies][appc-dependencies] are also discovered and fetched. For details, see the [image fetching][rkt-image-fetching] documentation.
* Image Store: the Image Store is used to store images (currently ACIs) and their related information.
* The render phase can be done in different ways:
  * Directly render the stage1-2 contents inside a pod. This requires more disk space and more stage1-2 preparation time.
  * Render into the treestore. The treestore is a cache of rendered images (currently ACIs). When using the treestore, rkt mounts an overlayfs with the treestore-rendered image as its lower directory.

When using stage1-2 with overlayfs, a pod contains references to the required treestore-rendered images, so there is a hard connection between pods and the treestore.
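
That overlayfs setup boils down to a single mount call. A sketch with hypothetical directory names (the actual treestore and pod paths are internal to rkt):

```go
package main

import (
	"fmt"
	"log"
	"syscall"
)

func main() {
	// lower: the read-only rendered image from the treestore;
	// upper/work: the pod's writable layers; target: the app's rootfs.
	lower := "/var/lib/rkt/cas/tree/<treestore-id>/rootfs" // hypothetical paths
	upper := "/var/lib/rkt/pods/run/<uuid>/overlay/upper"
	work := "/var/lib/rkt/pods/run/<uuid>/overlay/work"
	target := "/var/lib/rkt/pods/run/<uuid>/stage1/rootfs"

	opts := fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s", lower, upper, work)
	if err := syscall.Mount("overlay", target, "overlay", 0, opts); err != nil {
		log.Fatal(err)
	}
}
```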

*ACI renderer*

Both stage1-2 render modes internally use the [aci renderer][acirenderer].
Since an ACI may depend on other ACIs, the acirenderer may require those as well.
The acirenderer relies only on the ACIStore, so all the required ACIs must already be available in the store.
Additionally, since appc dependencies can be found only via discovery, a dependency may be updated over time, and so there can be multiple rendered images for the same ACI.

Given this 1:N relation between an ACI and its rendered images, the ACIStore and TreeStore are decoupled.

## Logging and attaching

Applications running inside a rkt pod can produce output on stdout/stderr, which can be redirected at runtime. Optionally, they can receive input on stdin from an external component that can be attached/detached during execution.

The internal architecture for attaching (TTY and single streams) and logging is described in full detail in the [logging and attaching design document][attach-design].

For each application, rkt supports separately configuring stdin/stdout/stderr via runtime command-line flags. The following modes are available:
* interactive: the application will be run under the TTY of the parent process. A single application is allowed in the pod; it is tied to the lifetime of the parent terminal and cannot be re-attached later.
* TTY: selected I/O streams will be run under a newly allocated TTY, which can later be used for external attaching.
* streaming: selected I/O streams will be supervised by a separate multiplexing process (running in the pod context). They can later be attached to externally.
* logging: selected output streams will be supervised by a separate logging process (running in the pod context). Output entries will be handled as log entries, and the application cannot be re-attached later.
* null: selected I/O streams will be closed. The application will not receive the file descriptor for the corresponding stream, and it cannot be re-attached later.

From a UI perspective, the main consumers of the logging and attaching subsystem are the `rkt attach` subcommand and the `--stdin`, `--stdout`, and `--stderr` runtime options.

[acirenderer]: https://github.com/appc/spec/tree/master/pkg/acirenderer
[attach-design]: ./log-attach-design.md
[appc-spec]: https://github.com/appc/spec
[appc-dependencies]: https://github.com/appc/spec/blob/master/spec/aci.md#image-manifest-schema
[appc-discovery]: https://github.com/appc/spec/blob/master/spec/discovery.md
[man-exec]: http://man7.org/linux/man-pages/man3/exec.3.html
[oci-img-spec]: https://github.com/opencontainers/image-spec
[rkt-image-fetching]: ../image-fetching-behavior.md
[systemd-beforeafter]: https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Before=
[systemd-conflicts]: https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Conflicts=
[systemd-wants]: https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Wants=
[systemd-onfailure]: https://www.freedesktop.org/software/systemd/man/systemd.unit.html#OnFailure=
[systemd-remainafterexit]: https://www.freedesktop.org/software/systemd/man/systemd.service.html#RemainAfterExit=
[systemd-stopwhenunneeded]: https://www.freedesktop.org/software/systemd/man/systemd.unit.html#StopWhenUnneeded=