---
layout: "guides"
page_title: "Rolling Upgrades - Operating a Job"
sidebar_current: "guides-operating-a-job-updating-rolling-upgrades"
description: |-
  In order to update a service while reducing downtime, Nomad provides a
  built-in mechanism for rolling upgrades. Rolling upgrades incrementally
  transition jobs between versions, using health check information to
  reduce downtime.
---

# Rolling Upgrades

Nomad supports rolling updates as a first-class feature. To enable rolling
updates, a job or task group is annotated with a high-level description of the
update strategy using the [`update` stanza][update]. Under the hood, Nomad
handles limiting parallelism, interfacing with Consul to determine service
health, and even automatically reverting to an older, healthy job when a
deployment fails.

## Enabling Rolling Updates

Rolling updates are enabled by adding the [`update` stanza][update] to the job
specification. The `update` stanza may be placed at the job level or in an
individual task group. When placed at the job level, the update strategy is
inherited by all task groups in the job. When placed at both the job and group
level, the `update` stanzas are merged, with group stanzas taking precedence
over job-level stanzas. See the [`update` stanza
documentation](/docs/job-specification/update.html#upgrade-stanza-inheritance)
for an example.
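
For instance, a job-level `update` stanza can provide defaults that an
individual group overrides. In the brief sketch below (the parameter values are
only illustrative), the "api-server" group inherits `healthy_deadline` from the
job but supplies its own `max_parallel`:

```hcl
job "geo-api-server" {
  # Job-level defaults inherited by every task group.
  update {
    max_parallel     = 1
    healthy_deadline = "10m"
  }

  group "api-server" {
    # Merged with the job-level stanza; this group overrides max_parallel
    # and keeps the inherited healthy_deadline.
    update {
      max_parallel = 2
    }

    # ...
  }
}
```

The rest of this guide uses a single group-level `update` stanza: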

```hcl
job "geo-api-server" {
  # ...

  group "api-server" {
    count = 6

    # Add an update stanza to enable rolling updates of the service
    update {
      max_parallel = 2
      min_healthy_time = "30s"
      healthy_deadline = "10m"
    }

    task "server" {
      driver = "docker"

      config {
        image = "geo-api-server:0.1"
      }

      # ...
    }
  }
}
```

In this example, by adding the simple `update` stanza to the "api-server" task
group, we inform Nomad that updates to the group should be handled with a
rolling update strategy.

Thus, when a change is made to the job file that requires new allocations to be
placed, Nomad will deploy 2 allocations at a time and require that they be
running in a healthy state for 30 seconds before deploying more allocations of
the new version.

By default, Nomad determines allocation health by ensuring that all tasks in
the group are running and that any [service
checks](/docs/job-specification/service.html#check-parameters) the tasks
register are passing.
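
For example, a task might register a service with an HTTP check along the lines
of the sketch below (the service name, port label, and `/health` path are
hypothetical and not part of the job above); Nomad will not count an allocation
as healthy until checks like this are passing:

```hcl
task "server" {
  driver = "docker"

  config {
    image = "geo-api-server:0.1"
  }

  resources {
    network {
      # Hypothetical dynamic port used by the service registration below.
      port "http" {}
    }
  }

  service {
    name = "geo-api"
    port = "http"

    # During a deployment, allocation health also waits on this check.
    check {
      type     = "http"
      path     = "/health"
      interval = "10s"
      timeout  = "2s"
    }
  }
}
```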

## Planning Changes

Suppose we make a change to the job file to upgrade the version of the Docker
container, keeping the same rolling update strategy from above.

```diff
@@ -2,6 +2,8 @@ job "geo-api-server" {
   group "api-server" {
     task "server" {
       driver = "docker"

       config {
-        image = "geo-api-server:0.1"
+        image = "geo-api-server:0.2"
```

The [`nomad job plan` command](/docs/commands/job/plan.html) allows
us to visualize the series of steps the scheduler would perform. We can analyze
this output to confirm it is correct:

```text
$ nomad job plan geo-api-server.nomad
+/- Job: "geo-api-server"
+/- Task Group: "api-server" (2 create/destroy update, 4 ignore)
  +/- Task: "server" (forces create/destroy update)
    +/- Config {
      +/- image: "geo-api-server:0.1" => "geo-api-server:0.2"
    }

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 7
To submit the job with version verification run:

nomad job run -check-index 7 geo-api-server.nomad

When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
```

Here we can see that Nomad will begin the rolling update by creating and
destroying 2 allocations first while ignoring the other 4 allocations for the
time being, matching our configured `max_parallel`.
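
To submit exactly the change we just reviewed, we can pass the modify index
reported by the plan (7 in this example); Nomad will reject the submission if
the job was changed on the server in the meantime:

```text
$ nomad job run -check-index 7 geo-api-server.nomad
```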

## Inspecting a Deployment

After reviewing the plan, we can submit the updated job by running `nomad job
run` (with the `-check-index` flag as shown above, if desired). Once submitted,
Nomad will begin the rolling upgrade of our service by placing 2 allocations of
the new version at a time and taking 2 of the old allocations down.

We can inspect the current state of a rolling deployment using `nomad status`:

```text
$ nomad status geo-api-server
ID            = geo-api-server
Name          = geo-api-server
Submit Date   = 07/26/17 18:08:56 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api-server  0       0         6        0       4         0

Latest Deployment
ID          = c5b34665
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
api-server  6        4       2        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
14d288e8  f7b1ee08  api-server  1        run      running   07/26/17 18:09:17 UTC
a134f73c  f7b1ee08  api-server  1        run      running   07/26/17 18:09:17 UTC
a2574bb6  f7b1ee08  api-server  1        run      running   07/26/17 18:08:56 UTC
496e7aa2  f7b1ee08  api-server  1        run      running   07/26/17 18:08:56 UTC
9fc96fcc  f7b1ee08  api-server  0        run      running   07/26/17 18:04:30 UTC
2521c47a  f7b1ee08  api-server  0        run      running   07/26/17 18:04:30 UTC
6b794fcb  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
9bc11bd7  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
691eea24  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
af115865  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
```

Here we can see that Nomad has created a deployment to conduct the rolling
upgrade from job version 0 to 1, has placed 4 instances of the new job, and has
stopped 4 of the old instances. If we look at the deployed allocations, we can
also see that Nomad has placed 4 instances of job version 1 but only considers
2 of them healthy. This is because the 2 newest placed allocations haven't been
healthy for the required 30 seconds yet.
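
We can also follow the deployment itself with the `nomad deployment status`
command, passing the deployment ID reported above; its output summarizes the
same deployment status and per-group health counts shown under `Latest
Deployment` and `Deployed`:

```text
$ nomad deployment status c5b34665
```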

If we wait for the deployment to complete and re-run `nomad status`, we get the
following:

```text
$ nomad status geo-api-server
ID            = geo-api-server
Name          = geo-api-server
Submit Date   = 07/26/17 18:08:56 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api-server  0       0         6        0       6         0

Latest Deployment
ID          = c5b34665
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
api-server  6        6       6        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
d42a1656  f7b1ee08  api-server  1        run      running   07/26/17 18:10:10 UTC
401daaf9  f7b1ee08  api-server  1        run      running   07/26/17 18:10:00 UTC
14d288e8  f7b1ee08  api-server  1        run      running   07/26/17 18:09:17 UTC
a134f73c  f7b1ee08  api-server  1        run      running   07/26/17 18:09:17 UTC
a2574bb6  f7b1ee08  api-server  1        run      running   07/26/17 18:08:56 UTC
496e7aa2  f7b1ee08  api-server  1        run      running   07/26/17 18:08:56 UTC
9fc96fcc  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
2521c47a  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
6b794fcb  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
9bc11bd7  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
691eea24  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
af115865  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
```

Nomad has successfully transitioned the group to running the updated version
and did so with no downtime to our service by ensuring only two allocations
were changed at a time and that the newly placed allocations ran successfully.
Had any of the newly placed allocations failed their health check, Nomad would
have aborted the deployment and stopped placing new allocations. If configured,
Nomad can automatically revert back to the old job definition when the
deployment fails.

## Auto Reverting on Failed Deployments

If we do a deployment in which the new allocations are unhealthy, Nomad will
fail the deployment and stop placing new instances of the job. It optionally
supports automatically reverting to the last stable job version on deployment
failure. Nomad keeps a history of submitted jobs and whether each job version
was stable. A job version is considered stable if all its allocations are
healthy.

To enable this, we simply add the `auto_revert` parameter to the `update`
stanza:

```hcl
update {
  max_parallel = 2
  min_healthy_time = "30s"
  healthy_deadline = "10m"

  # Enable automatically reverting to the last stable job on a failed
  # deployment.
  auto_revert = true
}
```

Now imagine we want to update our image to "geo-api-server:0.3" but instead
mistakenly update it to "geo-api-server:0.33", as shown below, and run the job:

```diff
@@ -2,6 +2,8 @@ job "geo-api-server" {
   group "api-server" {
     task "server" {
       driver = "docker"

       config {
-        image = "geo-api-server:0.2"
+        image = "geo-api-server:0.33"
```

If we run `nomad job deployments`, we can see that the deployment fails and
Nomad auto-reverts to the last stable job:

```text
$ nomad job deployments geo-api-server
ID        Job ID          Job Version  Status      Description
0c6f87a5  geo-api-server  3            successful  Deployment completed successfully
b1712b7f  geo-api-server  2            failed      Failed due to unhealthy allocations - rolling back to job version 1
3eee83ce  geo-api-server  1            successful  Deployment completed successfully
72813fcf  geo-api-server  0            successful  Deployment completed successfully
```

Nomad job versions increment monotonically, so even though Nomad reverted to
the job specification at version 1, it did so by creating a new job version. We
can see the differences between a job's versions and how Nomad auto-reverted
the job using the `job history` command:

```text
$ nomad job history -p geo-api-server
Version     = 3
Stable      = true
Submit Date = 07/26/17 18:44:18 UTC
Diff        =
+/- Job: "geo-api-server"
+/- Task Group: "api-server"
  +/- Task: "server"
    +/- Config {
      +/- image: "geo-api-server:0.33" => "geo-api-server:0.2"
        }

Version     = 2
Stable      = false
Submit Date = 07/26/17 18:45:21 UTC
Diff        =
+/- Job: "geo-api-server"
+/- Task Group: "api-server"
  +/- Task: "server"
    +/- Config {
      +/- image: "geo-api-server:0.2" => "geo-api-server:0.33"
        }

Version     = 1
Stable      = true
Submit Date = 07/26/17 18:44:18 UTC
Diff        =
+/- Job: "geo-api-server"
+/- Task Group: "api-server"
  +/- Task: "server"
    +/- Config {
      +/- image: "geo-api-server:0.1" => "geo-api-server:0.2"
        }

Version     = 0
Stable      = true
Submit Date = 07/26/17 18:43:43 UTC
```

We can see that Nomad considered the job versions running "geo-api-server:0.1"
and "geo-api-server:0.2" stable, but job version 2, which submitted the
incorrect image, is marked as unstable. This is because the placed allocations
failed to start. Nomad detected that the deployment failed and, as a result,
created job version 3, which reverted back to the last healthy job.
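
Because Nomad retains this version history, a bad update can also be rolled
back by hand when `auto_revert` is not set; for example, the `nomad job revert`
command re-submits a previous version of the job:

```text
$ nomad job revert geo-api-server 1
```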

[update]: /docs/job-specification/update.html "Nomad update Stanza"