
---
layout: "docs"
page_title: "Scheduling"
sidebar_current: "docs-internals-scheduling"
description: |-
  Learn about how scheduling works in Nomad.
---

# Scheduling

Scheduling is a core function of Nomad. It is the process of assigning tasks
from jobs to client machines. This process must respect the constraints
declared in the job and optimize for resource utilization. This page documents
how scheduling works in Nomad to help both users and developers build a mental
model. The design is heavily inspired by Google's work on both
[Omega: flexible, scalable schedulers for large compute clusters][Omega] and
[Large-scale cluster management at Google with Borg][Borg].

~> **Advanced Topic!** This page covers technical details of Nomad. You do not
~> need to understand these details to effectively use Nomad. The details are
~> documented here for those who wish to learn about them without having to
~> go spelunking through the source code.

# Scheduling in Nomad

[![Nomad Data Model][img-data-model]][img-data-model]

There are four primary "nouns" in Nomad: jobs, nodes, allocations, and
evaluations. Jobs are submitted by users and represent a _desired state_. A job
is a declarative description of tasks to run, which are bounded by constraints
and require resources. Tasks can be scheduled on nodes in the cluster running
the Nomad client. The mapping of tasks in a job to clients is done using
allocations. An allocation declares that a set of tasks in a job should be run
on a particular node. Scheduling is the process of determining the appropriate
allocations and is done as part of an evaluation.

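The relationships between these four nouns can be sketched in miniature (a
simplified illustration; the type and field names here are assumptions, not
Nomad's actual schema):

```python
from dataclasses import dataclass, field

# Hypothetical, pared-down versions of Nomad's four core nouns.
@dataclass
class Job:
    name: str
    tasks: list                 # declarative description of tasks to run
    constraints: list = field(default_factory=list)

@dataclass
class Node:
    name: str                   # a client machine running the Nomad client

@dataclass
class Allocation:               # maps a job's tasks to a particular node
    job: Job
    node: Node
    tasks: list

@dataclass
class Evaluation:               # created whenever desired or emergent state changes
    job: Job
    status: str = "pending"

web = Job(name="web", tasks=["frontend", "logger"])
alloc = Allocation(job=web, node=Node(name="client-1"), tasks=web.tasks)
```

An evaluation for `web` would then decide whether `alloc` and its siblings
still match the desired state.
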
An evaluation is created any time the external state, either desired or
emergent, changes. The desired state is based on jobs, meaning the desired
state changes if a new job is submitted, an existing job is updated, or a job
is deregistered. The emergent state is based on the client nodes, and so we
must handle the failure of any clients in the system. These events trigger the
creation of a new evaluation, as Nomad must _evaluate_ the state of the world
and reconcile it with the desired state.

This diagram shows the flow of an evaluation through Nomad:

[![Nomad Evaluation Flow][img-eval-flow]][img-eval-flow]

The lifecycle of an evaluation begins with an event causing the evaluation to
be created. Evaluations are created in the `pending` state and are enqueued
into the evaluation broker. There is a single evaluation broker, which runs on
the leader server. The evaluation broker manages the queue of pending
evaluations, provides priority ordering, and ensures at-least-once delivery.

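A queue with these three properties can be sketched as a priority heap with
explicit acknowledgement (a toy model, not Nomad's implementation; all names
are invented):

```python
import heapq

class EvalBroker:
    """Toy broker: priority ordering plus at-least-once delivery."""
    def __init__(self):
        self._heap = []      # entries are (negated priority, seq, eval_id)
        self._seq = 0        # tie-breaker so equal priorities stay FIFO
        self._unacked = {}   # eval_id -> heap entry, redelivered on nack

    def enqueue(self, eval_id, priority):
        heapq.heappush(self._heap, (-priority, self._seq, eval_id))
        self._seq += 1

    def dequeue(self):
        entry = heapq.heappop(self._heap)   # highest priority first
        self._unacked[entry[2]] = entry
        return entry[2]

    def ack(self, eval_id):
        del self._unacked[eval_id]          # processing finished cleanly

    def nack(self, eval_id):
        # Failed processing: requeue so delivery happens at least once.
        heapq.heappush(self._heap, self._unacked.pop(eval_id))

broker = EvalBroker()
broker.enqueue("eval-batch", priority=50)
broker.enqueue("eval-service", priority=80)
first = broker.dequeue()   # "eval-service": higher priority wins
```

A nacked evaluation is pushed back onto the heap, which is what makes delivery
at-least-once rather than exactly-once.
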
Nomad servers run scheduling workers, defaulting to one per CPU core, which are
used to process evaluations. The workers dequeue evaluations from the broker,
and then invoke the appropriate scheduler as specified by the job. Nomad ships
with a `service` scheduler that optimizes for long-lived services, a `batch`
scheduler that is used for fast placement of batch jobs, a `system` scheduler
that is used to run jobs on every node, and a `core` scheduler which is used
for internal maintenance. Nomad can be extended to support custom schedulers as
well.

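The worker's dispatch step amounts to a lookup from the job's declared
scheduler type to a handler (illustrative only; the function names are made
up):

```python
# Each scheduler type gets its own planning logic; `core` is internal-only.
def service_scheduler(ev): return f"service plan for {ev}"
def batch_scheduler(ev):   return f"batch plan for {ev}"
def system_scheduler(ev):  return f"system plan for {ev}"

SCHEDULERS = {
    "service": service_scheduler,   # long-lived services
    "batch":   batch_scheduler,     # fast placement of batch jobs
    "system":  system_scheduler,    # one instance on every node
}

def process(evaluation, job_type):
    # A worker dequeues an evaluation and invokes the scheduler the job names.
    return SCHEDULERS[job_type](evaluation)
```

Registering a custom scheduler would amount to adding another entry to this
table.
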
Schedulers are responsible for processing an evaluation and generating an
allocation _plan_. The plan is the set of allocations to evict, update, or
create. The specific logic used to generate a plan may vary by scheduler, but
generally the scheduler needs to first reconcile the desired state with the
real state to determine what must be done. New allocations need to be placed
and existing allocations may need to be updated, migrated, or stopped.

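The reconciliation step can be pictured as a diff between desired and running
instances (a hypothetical sketch; the instance naming and version strings are
invented):

```python
# Diff desired instances against existing allocations to decide
# what to place, update, or stop.
def reconcile(desired, existing):
    """desired: {name: spec}; existing: {name: spec} of running allocs."""
    place  = [n for n in desired if n not in existing]
    stop   = [n for n in existing if n not in desired]
    update = [n for n in desired
              if n in existing and desired[n] != existing[n]]
    return {"place": place, "update": update, "stop": stop}

plan = reconcile(
    desired={"web.0": "v2", "web.1": "v2"},   # job now wants two v2 instances
    existing={"web.0": "v1", "web.2": "v1"},  # one stale, one no longer desired
)
```
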
Placing allocations is split into two distinct phases: feasibility checking
and ranking. In the first phase the scheduler finds nodes that are feasible by
filtering out unhealthy nodes, those missing necessary drivers, and those
failing the specified constraints.

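Feasibility checking is a straight filter pass (the node fields here are
illustrative assumptions, not Nomad's node schema):

```python
# Keep only nodes that are healthy, have the required driver,
# and satisfy the job's constraint.
def feasible_nodes(nodes, required_driver, constraint):
    return [
        n for n in nodes
        if n["healthy"]
        and required_driver in n["drivers"]
        and constraint(n)
    ]

nodes = [
    {"name": "a", "healthy": True,  "drivers": ["docker"], "dc": "us-east"},
    {"name": "b", "healthy": False, "drivers": ["docker"], "dc": "us-east"},
    {"name": "c", "healthy": True,  "drivers": ["java"],   "dc": "us-east"},
    {"name": "d", "healthy": True,  "drivers": ["docker"], "dc": "eu-west"},
]
ok = feasible_nodes(nodes, "docker", lambda n: n["dc"] == "us-east")
```

Only node `a` survives: `b` is unhealthy, `c` lacks the driver, and `d` fails
the datacenter constraint.
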
The second phase is ranking, where the scheduler scores feasible nodes to find
the best fit. Scoring is primarily based on bin packing, which is used to
optimize the resource utilization and density of applications, but is also
augmented by affinity and anti-affinity rules. Nomad automatically applies a job
anti-affinity rule which discourages colocating multiple instances of a task
group. The combination of this anti-affinity and bin packing optimizes for
density while reducing the probability of correlated failures.

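A toy scoring function can show how bin packing and the job anti-affinity
interact (the formula and penalty weight are invented for illustration;
Nomad's actual scoring differs):

```python
# Bin packing prefers fuller nodes; the job anti-affinity subtracts a
# penalty for each instance of the same task group already on the node.
def score(node_used, node_capacity, task_need, colocated_instances,
          anti_affinity_penalty=0.5):
    if node_used + task_need > node_capacity:
        return None  # infeasible: does not fit
    fill = (node_used + task_need) / node_capacity  # denser is better
    return fill - anti_affinity_penalty * colocated_instances

# The fuller node would win on bin packing alone (0.7 vs 0.4 fill), but the
# penalty for its existing instance lets the emptier node outrank it.
busy  = score(node_used=5, node_capacity=10, task_need=2, colocated_instances=1)
empty = score(node_used=2, node_capacity=10, task_need=2, colocated_instances=0)
```
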
Once the scheduler has ranked enough nodes, the highest ranking node is
selected and added to the allocation plan.

When planning is complete, the scheduler submits the plan to the leader, which
adds it to the plan queue. The plan queue manages pending plans, provides
priority ordering, and allows Nomad to handle concurrency races. Multiple
schedulers run in parallel without locking or reservations, making Nomad
optimistically concurrent. As a result, schedulers might overlap work on the
same node and cause resource over-subscription. The plan queue allows the
leader node to protect against this and do partial or complete rejections of a
plan.

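The leader's conflict check can be sketched as re-validating each requested
placement against committed capacity (a toy model in which resource accounting
is reduced to a single number per node):

```python
# Authoritative state held by the leader.
capacity  = {"node-1": 4, "node-2": 4}
committed = {"node-1": 0, "node-2": 0}

def apply_plan(plan):
    """plan: list of (node, resources). Returns (accepted, rejected)."""
    accepted, rejected = [], []
    for node, need in plan:
        if committed[node] + need <= capacity[node]:
            committed[node] += need        # no conflict: create the allocation
            accepted.append((node, need))
        else:
            rejected.append((node, need))  # feedback for the scheduler
    return accepted, rejected

# Two schedulers planned against the same snapshot; the second plan is
# partially rejected, and its scheduler can explore an alternate placement.
apply_plan([("node-1", 3)])
accepted, rejected = apply_plan([("node-1", 3), ("node-2", 2)])
```
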
As the leader processes plans, it creates allocations when there is no conflict
and otherwise informs the scheduler of a failure in the plan result. The plan
result provides feedback to the scheduler, allowing it to terminate or explore
alternate plans if the previous plan was partially or completely rejected.

Once the scheduler has finished processing an evaluation, it updates the status
of the evaluation and acknowledges delivery with the evaluation broker. This
completes the lifecycle of an evaluation. Allocations that were created,
modified, or deleted as a result will be picked up by client nodes and will
begin execution.

[Omega]: https://research.google.com/pubs/pub41684.html
[Borg]: https://research.google.com/pubs/pub43438.html
[img-data-model]: /assets/images/nomad-data-model.png
[img-eval-flow]: /assets/images/nomad-evaluation-flow.png