---
layout: "docs"
page_title: "Rolling Upgrades - Operating a Job"
sidebar_current: "docs-operating-a-job-updating-rolling-upgrades"
description: |-
  In order to update a service while reducing downtime, Nomad provides a
  built-in mechanism for rolling upgrades. Rolling upgrades incrementally
  transition jobs between versions, using health check information to
  reduce downtime.
---

# Rolling Upgrades

Nomad supports rolling updates as a first-class feature. To enable rolling
updates, a job or task group is annotated with a high-level description of the
update strategy using the [`update` stanza][update]. Under the hood, Nomad
handles limiting parallelism, interfacing with Consul to determine service
health, and even automatically reverting to an older, healthy job when a
deployment fails.

## Enabling Rolling Updates

Rolling updates are enabled by adding the [`update` stanza][update] to the job
specification. The `update` stanza may be placed at the job level or in an
individual task group. When placed at the job level, the update strategy is
inherited by all task groups in the job. When placed at both the job and group
level, the `update` stanzas are merged, with group stanzas taking precedence
over job-level stanzas. See the [`update` stanza
documentation](/docs/job-specification/update.html#upgrade-stanza-inheritance)
for an example.

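For illustration, a minimal sketch of this merging behavior (the job and group
names here are hypothetical) might look like the following:

```hcl
job "example" {
  # Job-level defaults inherited by every group in the job.
  update {
    max_parallel     = 2
    min_healthy_time = "30s"
  }

  group "api" {
    # Merged with the job-level stanza: this group keeps max_parallel = 2
    # but overrides min_healthy_time.
    update {
      min_healthy_time = "60s"
    }

    # ...
  }

  group "worker" {
    # No group-level stanza, so the job-level update strategy applies as-is.
    # ...
  }
}
```

Returning to our service, the example below enables rolling updates by placing
the `update` stanza at the group level:
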
```hcl
job "geo-api-server" {
  # ...

  group "api-server" {
    count = 6

    # Add an update stanza to enable rolling updates of the service
    update {
      max_parallel = 2
      min_healthy_time = "30s"
      healthy_deadline = "10m"
    }

    task "server" {
      driver = "docker"

      config {
        image = "geo-api-server:0.1"
      }

      # ...
    }
  }
}
```

In this example, by adding the simple `update` stanza to the "api-server" task
group, we inform Nomad that updates to the group should be handled with a
rolling update strategy.

Thus, when a change is made to the job file that requires new allocations to be
made, Nomad will deploy 2 allocations at a time and require that the allocations
be running in a healthy state for 30 seconds before deploying more allocations
of the new group.

By default, Nomad determines allocation health by ensuring that all tasks in the
group are running and that any [service
checks](/docs/job-specification/service.html#check-parameters) the tasks register
are passing.

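For reference, a minimal sketch of what such a service registration with a
check might look like inside the task (the service name, port label, and check
endpoint here are illustrative):

```hcl
task "server" {
  driver = "docker"

  config {
    image = "geo-api-server:0.1"
  }

  # Hypothetical service registration. During a deployment, Nomad waits for
  # this check to pass (in addition to all tasks in the group running) before
  # counting the allocation as healthy.
  service {
    name = "geo-api-server"
    port = "http" # assumes an "http" port label defined in the task resources

    check {
      type     = "http"
      path     = "/health"
      interval = "10s"
      timeout  = "2s"
    }
  }
}
```
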
## Planning Changes

Suppose we make a change to the job file to upgrade the version of the Docker
container that is configured with the same rolling update strategy from above.

```diff
@@ -2,6 +2,8 @@ job "geo-api-server" {
   group "api-server" {
     task "server" {
       driver = "docker"

       config {
-        image = "geo-api-server:0.1"
+        image = "geo-api-server:0.2"
```

The [`nomad plan` command](/docs/commands/plan.html) allows us to visualize the
series of steps the scheduler would perform. We can analyze this output to
confirm it is correct:

```text
$ nomad plan geo-api-server.nomad
```

Here is some sample output:

```text
+/- Job: "geo-api-server"
+/- Task Group: "api-server" (2 create/destroy update, 4 ignore)
  +/- Task: "server" (forces create/destroy update)
    +/- Config {
      +/- image: "geo-api-server:0.1" => "geo-api-server:0.2"
    }

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 7
To submit the job with version verification run:

nomad run -check-index 7 geo-api-server.nomad

When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
```

Here we can see that Nomad will begin a rolling update by first creating and
destroying 2 allocations and, for the time being, ignoring 4 of the old
allocations, matching our configured `max_parallel`.

## Inspecting a Deployment

After running the plan, we can submit the updated job by simply running `nomad
run`. Once run, Nomad will begin the rolling upgrade of our service by placing
2 allocations of the new job at a time and taking 2 of the old allocations down.
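
For example, using the check-index reported by the plan above:

```text
$ nomad run -check-index 7 geo-api-server.nomad
```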

We can inspect the current state of a rolling deployment using `nomad status`:

```text
$ nomad status geo-api-server
ID            = geo-api-server
Name          = geo-api-server
Submit Date   = 07/26/17 18:08:56 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api-server  0       0         6        0       4         0

Latest Deployment
ID          = c5b34665
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
api-server  6        4       2        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
14d288e8  f7b1ee08  api-server  1        run      running   07/26/17 18:09:17 UTC
a134f73c  f7b1ee08  api-server  1        run      running   07/26/17 18:09:17 UTC
a2574bb6  f7b1ee08  api-server  1        run      running   07/26/17 18:08:56 UTC
496e7aa2  f7b1ee08  api-server  1        run      running   07/26/17 18:08:56 UTC
9fc96fcc  f7b1ee08  api-server  0        run      running   07/26/17 18:04:30 UTC
2521c47a  f7b1ee08  api-server  0        run      running   07/26/17 18:04:30 UTC
6b794fcb  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
9bc11bd7  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
691eea24  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
af115865  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
```

Here we can see that Nomad has created a deployment to conduct the rolling
upgrade from job version 0 to 1, has placed 4 instances of the new job version,
and has stopped 4 of the old instances. Looking at the deployed allocations, we
can also see that Nomad has placed 4 instances of job version 1 but only
considers 2 of them healthy. This is because the 2 most recently placed
allocations haven't been healthy for the required 30 seconds yet.

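The deployment can also be inspected on its own by passing the deployment ID
from the output above to `nomad deployment status`:

```text
$ nomad deployment status c5b34665
```
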
If we wait for the deployment to complete and re-run `nomad status`, we get the
following:

```text
$ nomad status geo-api-server
ID            = geo-api-server
Name          = geo-api-server
Submit Date   = 07/26/17 18:08:56 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api-server  0       0         6        0       6         0

Latest Deployment
ID          = c5b34665
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
api-server  6        6       6        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
d42a1656  f7b1ee08  api-server  1        run      running   07/26/17 18:10:10 UTC
401daaf9  f7b1ee08  api-server  1        run      running   07/26/17 18:10:00 UTC
14d288e8  f7b1ee08  api-server  1        run      running   07/26/17 18:09:17 UTC
a134f73c  f7b1ee08  api-server  1        run      running   07/26/17 18:09:17 UTC
a2574bb6  f7b1ee08  api-server  1        run      running   07/26/17 18:08:56 UTC
496e7aa2  f7b1ee08  api-server  1        run      running   07/26/17 18:08:56 UTC
9fc96fcc  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
2521c47a  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
6b794fcb  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
9bc11bd7  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
691eea24  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
af115865  f7b1ee08  api-server  0        stop     complete  07/26/17 18:04:30 UTC
```

Nomad has successfully transitioned the group to running the updated version of
the job and did so with no downtime to our service by ensuring only two
allocations were changed at a time and that the newly placed allocations ran
successfully. Had any of the newly placed allocations failed their health
checks, Nomad would have aborted the deployment and stopped placing new
allocations. If configured, Nomad can automatically revert back to the old job
definition when the deployment fails.

## Auto Reverting on Failed Deployments

If we do a deployment in which the new allocations are unhealthy, Nomad will
fail the deployment and stop placing new instances of the job. It optionally
supports automatically reverting back to the last stable job version on
deployment failure. Nomad keeps a history of submitted jobs and whether each
job version was stable. A job version is considered stable if all its
allocations are healthy.

To enable this, we simply add the `auto_revert` parameter to the `update`
stanza:

```hcl
update {
  max_parallel = 2
  min_healthy_time = "30s"
  healthy_deadline = "10m"

  # Enable automatically reverting to the last stable job on a failed
  # deployment.
  auto_revert = true
}
```

Now imagine we want to update our image to "geo-api-server:0.3", but we instead
mistakenly update it to the incorrect image below and run the job:

```diff
@@ -2,6 +2,8 @@ job "geo-api-server" {
   group "api-server" {
     task "server" {
       driver = "docker"

       config {
-        image = "geo-api-server:0.2"
+        image = "geo-api-server:0.33"
```

If we run `nomad job deployments`, we can see that the deployment fails and
Nomad auto-reverts to the last stable job:

```text
$ nomad job deployments geo-api-server
ID        Job ID          Job Version  Status      Description
0c6f87a5  geo-api-server  3            successful  Deployment completed successfully
b1712b7f  geo-api-server  2            failed      Failed due to unhealthy allocations - rolling back to job version 1
3eee83ce  geo-api-server  1            successful  Deployment completed successfully
72813fcf  geo-api-server  0            successful  Deployment completed successfully
```

Nomad job versions increment monotonically, so even though Nomad reverted to
the job specification at version 1, it did so by creating a new job version. We
can see the differences between a job's versions and how Nomad auto-reverted
the job using the `job history` command:

```text
$ nomad job history -p geo-api-server
Version     = 3
Stable      = true
Submit Date = 07/26/17 18:44:18 UTC
Diff        =
+/- Job: "geo-api-server"
+/- Task Group: "api-server"
  +/- Task: "server"
    +/- Config {
      +/- image: "geo-api-server:0.33" => "geo-api-server:0.2"
        }

Version     = 2
Stable      = false
Submit Date = 07/26/17 18:45:21 UTC
Diff        =
+/- Job: "geo-api-server"
+/- Task Group: "api-server"
  +/- Task: "server"
    +/- Config {
      +/- image: "geo-api-server:0.2" => "geo-api-server:0.33"
        }

Version     = 1
Stable      = true
Submit Date = 07/26/17 18:44:18 UTC
Diff        =
+/- Job: "geo-api-server"
+/- Task Group: "api-server"
  +/- Task: "server"
    +/- Config {
      +/- image: "geo-api-server:0.1" => "geo-api-server:0.2"
        }

Version     = 0
Stable      = true
Submit Date = 07/26/17 18:43:43 UTC
```

We can see that Nomad considered the job versions running "geo-api-server:0.1"
and "geo-api-server:0.2" stable, but job version 2, which submitted the
incorrect image, is marked as unstable. This is because the placed allocations
failed to start. Nomad detected that the deployment failed and, as such,
created job version 3, which reverted back to the last healthy job.

[update]: /docs/job-specification/update.html "Nomad update Stanza"