     1  This package contains:
     2  
     3  -   A (partial) emulation of the "core Linux kernel", which governs task
     4      execution and scheduling, system call dispatch, and signal handling. See
     5      below for details.
     6  
     7  -   The top-level interface for the sentry's Linux kernel emulation in general,
     8      used by the `main` function of all versions of the sentry. This interface
     9      revolves around the `Kernel` type (defined in `kernel.go`).
    10  
    11  # Background
    12  
    13  In Linux, each schedulable context is referred to interchangeably as a "task" or
    14  "thread". Tasks can be divided into userspace and kernel tasks. In the sentry,
    15  scheduling is managed by the Go runtime, so each schedulable context is a
    16  goroutine; only "userspace" (application) contexts are referred to as tasks and
    17  are represented by `Task` objects. (From this point forward, "task" refers to the
    18  sentry's notion of a task unless otherwise specified.)
    19  
    20  At a high level, Linux application threads can be thought of as repeating a "run
    21  loop":
    22  
    23  -   Some amount of application code is executed in userspace.
    24  
    25  -   A trap (explicit syscall invocation, hardware interrupt or exception, etc.)
    26      causes control flow to switch to the kernel.
    27  
    28  -   Some amount of kernel code is executed in kernelspace, e.g. to handle the
    29      cause of the trap.
    30  
    31  -   The kernel "returns from the trap" into application code.
    32  
    33  Analogously, each task in the sentry is associated with a *task goroutine* that
    34  executes that task's run loop (`Task.run` in `task_run.go`). However, the
    35  sentry's task run loop differs in structure in order to support saving execution
    36  state to, and resuming execution from, checkpoints.
    37  
    38  While in kernelspace, a Linux thread can be descheduled (cease execution) in a
    39  variety of ways:
    40  
    41  -   It can yield or be preempted, becoming temporarily descheduled but still
    42      runnable. At present, the sentry delegates scheduling of runnable threads to
    43      the Go runtime.
    44  
    45  -   It can exit, becoming permanently descheduled. The sentry's equivalent is
    46      returning from `Task.run`, terminating the task goroutine.
    47  
    48  -   It can enter interruptible sleep, a state in which it can be woken by a
    49      caller-defined wakeup or the receipt of a signal. In the sentry,
    50      interruptible sleep (referred to, somewhat ambiguously, as *blocking*) is
    51      implemented by delivering every event that can end blocking (including
    52      signal notifications) over a Go channel and using `select` to multiplex
    53      the wakeup sources; see `task_block.go`.
    54  
    55  -   It can enter uninterruptible sleep, a state in which it can only be woken by
    56      a caller-defined wakeup. Killable sleep is a closely related variant in
    57      which the task can also be woken by SIGKILL. (These definitions also include
    58      Linux's "group-stopped" (`TASK_STOPPED`) and "ptrace-stopped"
    59      (`TASK_TRACED`) states.)
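
The channel-based form of interruptible sleep described above can be sketched as follows. This is a minimal illustration, not the code in `task_block.go`: the `block` function, the `errInterrupted` value, and both channel names are hypothetical and exist only to show `select` multiplexing a caller-defined wakeup against a signal notification.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errInterrupted is a hypothetical stand-in for the error a blocked operation
// returns when a signal, rather than the awaited event, ends the sleep.
var errInterrupted = errors.New("interrupted")

// block sleeps until either the caller-defined wakeup or the interrupt
// notification arrives. Both channels are illustrative assumptions.
func block(wakeup, interrupt <-chan struct{}) error {
	select {
	case <-wakeup:
		return nil
	case <-interrupt:
		return errInterrupted
	}
}

func main() {
	wakeup := make(chan struct{}, 1)
	interrupt := make(chan struct{}, 1)

	// Simulate a signal arriving while the task is blocked.
	go func() {
		time.Sleep(10 * time.Millisecond)
		interrupt <- struct{}{}
	}()
	fmt.Println(block(wakeup, interrupt)) // interrupted

	// Simulate the awaited event being ready before the task blocks.
	wakeup <- struct{}{}
	fmt.Println(block(wakeup, interrupt)) // <nil>
}
```

An uninterruptible sleep would, under the same assumptions, simply omit the `interrupt` case from the `select`.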
    60  
    61  To maximize compatibility with Linux, sentry checkpointing appears as a spurious
    62  signal-delivery interrupt on all tasks; interrupted system calls return `EINTR`
    63  or are automatically restarted as usual. However, these semantics require that
    64  uninterruptible and killable sleeps do not appear to be interrupted. In other
    65  words, the state of the task, including its progress through the interrupted
    66  operation, must be preserved by checkpointing. For many such sleeps, the wakeup
    67  condition is application-controlled, making it infeasible to wait for the sleep
    68  to end before checkpointing. Instead, we must support checkpointing progress
    69  through sleeping operations.
    70  
    71  # Implementation
    72  
    73  We break the task's control flow graph into *states*, delimited by:
    74  
    75  1.  Points where uninterruptible and killable sleeps may occur. For example,
    76      there exists a state boundary between signal dequeueing and signal delivery
    77      because there may be an intervening ptrace signal-delivery-stop.
    78  
    79  2.  Points where sleep-induced branches may "rejoin" normal execution. For
    80      example, the syscall exit state exists because it can be reached immediately
    81      following a synchronous syscall, or after a task that is sleeping in
    82      `execve()` or `vfork()` resumes execution.
    83  
    84  3.  Points containing large branches. This is strictly for organizational
    85      purposes. For example, the state that processes interrupt-signaled
    86      conditions is kept separate from the main "app" state to reduce the size of
    87      the latter.
    88  
    89  4.  `SyscallReinvoke`, which does not correspond to anything in Linux and
    90      exists solely to serve the autosave feature.
    91  
    92  ![dot -Tpng -Goverlap=false -orun_states.png run_states.dot](g3doc/run_states.png "Task control flow graph")
    93  
    94  States before which a stop (defined below) may occur are represented as
    95  implementations of the `taskRunState` interface named `run(state)`, allowing
    96  them to be saved and restored. States that cannot be immediately preceded by a
    97  stop are simply `Task` methods named `do(state)`.
    98  
    99  Conditions that can require task goroutines to cease execution for unknown
   100  lengths of time are called *stops*. Stops are divided into *internal stops*,
   101  which are stops whose start and end conditions are implemented within the
   102  sentry, and *external stops*, which are stops whose start and end conditions are
   103  not known to the sentry. Hence all uninterruptible and killable sleeps are
   104  internal stops, and the existence of a pending checkpoint operation is an
   105  external stop. Internal stops are reified into instances of the `TaskStop` type,
   106  while external stops are merely counted. The task run loop alternates between
   107  checking for stops and advancing the task's state. This allows checkpointing to
   108  hold tasks in a stopped state while waiting for all tasks in the system to stop.