---
layout: "guides"
page_title: "Blue/Green & Canary Deployments - Operating a Job"
sidebar_current: "guides-operating-a-job-updating-blue-green-deployments"
description: |-
  Nomad has built-in support for doing blue/green and canary deployments to more
  safely update existing applications and services.
---

# Blue/Green & Canary Deployments

Sometimes [rolling
upgrades](/guides/operating-a-job/update-strategies/rolling-upgrades.html) do not
offer the required flexibility for updating an application in production. Often
organizations prefer to put a "canary" build into production or utilize a
technique known as a "blue/green" deployment to ensure a safe application
rollout to production while minimizing downtime.

## Blue/Green Deployments

Blue/Green deployments have several other names including Red/Black or A/B, but
the concept is generally the same. In a blue/green deployment, there are two
application versions. Only one application version is active at a time, except
during the transition phase from one version to the next. The term "active"
tends to mean "receiving traffic" or "in service".

Imagine a hypothetical API server with five instances deployed to production at
version 1.3, and we want to safely upgrade to version 1.4. We want to create
five new instances at version 1.4 and, once they are operating correctly,
promote them and take down the five instances running 1.3. In the event of
failure, we can quickly roll back to 1.3.

To start, we examine the job that is running in production:

```hcl
job "docs" {
  # ...

  group "api" {
    count = 5

    update {
      max_parallel     = 1
      canary           = 5
      min_healthy_time = "30s"
      healthy_deadline = "10m"
      auto_revert      = true
    }

    task "api-server" {
      driver = "docker"

      config {
        image = "api-server:1.3"
      }
    }
  }
}
```

The `update` stanza sets `canary` equal to the desired count, which is what
allows us to easily model blue/green deployments. When we change the job to run
the "api-server:1.4" image, Nomad will create 5 new allocations without
touching the original "api-server:1.3" allocations. Below we can see how this
works by changing the image to run the new version:

```diff
@@ -2,6 +2,8 @@ job "docs" {
  group "api" {
    task "api-server" {
      config {
-       image = "api-server:1.3"
+       image = "api-server:1.4"
```

Next we plan and run these changes:

```text
$ nomad job plan docs.nomad
+/- Job: "docs"
+/- Task Group: "api" (5 canary, 5 ignore)
  +/- Task: "api-server" (forces create/destroy update)
    +/- Config {
      +/- image: "api-server:1.3" => "api-server:1.4"
        }

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 7
To submit the job with version verification run:

nomad job run -check-index 7 docs.nomad

When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

$ nomad job run docs.nomad
# ...
```

We can see from the plan output that Nomad is going to create 5 canaries
running the "api-server:1.4" image and ignore all the allocations running the
older image. Now if we examine the status of the job we can see that both the
blue ("api-server:1.3") and green ("api-server:1.4") sets are running.

```text
$ nomad status docs
ID            = docs
Name          = docs
Submit Date   = 07/26/17 19:57:47 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         10       0       0         0

Latest Deployment
ID          = 32a080c1
Status      = running
Description = Deployment is running but requires promotion

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy
api         true         false     5        5         5       5        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created At
6d8eec42  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
7051480e  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
36c6610f  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
410ba474  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
85662a7a  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
3ac3fe05  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
4bd51979  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
2998387b  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
35b813ee  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
b53b4289  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
```

Now that the new set is in production, we can route traffic to it and validate
that the new job version is working properly. Depending on whether it is, we
will want to either promote or fail the deployment.

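Before deciding, it can help to inspect the deployment and the canary allocations directly. A minimal sketch, using the deployment and allocation IDs taken from the status output above:

```text
$ nomad deployment status 32a080c1
$ nomad alloc-status 6d8eec42
```

`nomad deployment status` reports the same "Deployed" table shown in the job status, and `nomad alloc-status` shows the canary allocation's task events and resource usage, which can reveal crash loops or failing health checks before any traffic is shifted.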
### Promoting the Deployment

After deploying the new image alongside the old version and determining it is
functioning properly, we want to transition fully to the new version. Doing so
is as simple as promoting the deployment:

```text
$ nomad deployment promote 32a080c1
==> Monitoring evaluation "61ac2be5"
    Evaluation triggered by job "docs"
    Evaluation within deployment: "32a080c1"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "61ac2be5" finished with status "complete"
```

If we look at the job's status we see that after promotion, Nomad stopped the
older allocations and is running only the new ones. This completes our
blue/green deployment.

```text
$ nomad status docs
ID            = docs
Name          = docs
Submit Date   = 07/26/17 19:57:47 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         5        0       5         0

Latest Deployment
ID          = 32a080c1
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy
api         true         true      5        5         5       5        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
6d8eec42  087852e2  api         1        run      running   07/26/17 19:57:47 UTC
7051480e  087852e2  api         1        run      running   07/26/17 19:57:47 UTC
36c6610f  087852e2  api         1        run      running   07/26/17 19:57:47 UTC
410ba474  087852e2  api         1        run      running   07/26/17 19:57:47 UTC
85662a7a  087852e2  api         1        run      running   07/26/17 19:57:47 UTC
3ac3fe05  087852e2  api         0        stop     complete  07/26/17 19:53:56 UTC
4bd51979  087852e2  api         0        stop     complete  07/26/17 19:53:56 UTC
2998387b  087852e2  api         0        stop     complete  07/26/17 19:53:56 UTC
35b813ee  087852e2  api         0        stop     complete  07/26/17 19:53:56 UTC
b53b4289  087852e2  api         0        stop     complete  07/26/17 19:53:56 UTC
```

### Failing the Deployment

After deploying the new image alongside the old version and determining it is
not functioning properly, we want to roll back to the old version. Doing so is
as simple as failing the deployment:

```text
$ nomad deployment fail 32a080c1
Deployment "32a080c1-de5a-a4e7-0218-521d8344c328" failed. Auto-reverted to job version 0.

==> Monitoring evaluation "6840f512"
    Evaluation triggered by job "docs"
    Evaluation within deployment: "32a080c1"
    Allocation "0ccb732f" modified: node "36e7a123", group "api"
    Allocation "64d4f282" modified: node "36e7a123", group "api"
    Allocation "664e33c7" modified: node "36e7a123", group "api"
    Allocation "a4cb6a4b" modified: node "36e7a123", group "api"
    Allocation "fdd73bdd" modified: node "36e7a123", group "api"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "6840f512" finished with status "complete"
```

If we now look at the job's status, we can see that after failing the
deployment, Nomad stopped the new allocations, is running only the old ones,
and reverted the working copy of the job back to the original specification
running "api-server:1.3".

```text
$ nomad status docs
ID            = docs
Name          = docs
Submit Date   = 07/26/17 19:57:47 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         5        0       5         0

Latest Deployment
ID          = 6f3f84b3
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Auto Revert  Desired  Placed  Healthy  Unhealthy
api         true         5        5       5        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
27dc2a42  36e7a123  api         1        stop     complete  07/26/17 20:07:31 UTC
5b7d34bb  36e7a123  api         1        stop     complete  07/26/17 20:07:31 UTC
983b487d  36e7a123  api         1        stop     complete  07/26/17 20:07:31 UTC
d1cbf45a  36e7a123  api         1        stop     complete  07/26/17 20:07:31 UTC
d6b46def  36e7a123  api         1        stop     complete  07/26/17 20:07:31 UTC
0ccb732f  36e7a123  api         2        run      running   07/26/17 20:06:29 UTC
64d4f282  36e7a123  api         2        run      running   07/26/17 20:06:29 UTC
664e33c7  36e7a123  api         2        run      running   07/26/17 20:06:29 UTC
a4cb6a4b  36e7a123  api         2        run      running   07/26/17 20:06:29 UTC
fdd73bdd  36e7a123  api         2        run      running   07/26/17 20:06:29 UTC

$ nomad job deployments docs
ID        Job ID  Job Version  Status      Description
6f3f84b3  docs    2            successful  Deployment completed successfully
32a080c1  docs    1            failed      Deployment marked as failed - rolling back to job version 0
c4c16494  docs    0            successful  Deployment completed successfully
```

## Canary Deployments

Canary updates are a useful way to test a new version of a job before beginning
a rolling upgrade. The `update` stanza supports setting the number of canaries
the job operator would like Nomad to create when the job changes via the
`canary` parameter. When the job specification is updated, Nomad creates the
canaries without stopping any allocations from the previous job.

This pattern allows operators to achieve higher confidence in the new job
version because they can route traffic, examine logs, etc., to determine that
the new application is performing properly.

```hcl
job "docs" {
  # ...

  group "api" {
    count = 5

    update {
      max_parallel     = 1
      canary           = 1
      min_healthy_time = "30s"
      healthy_deadline = "10m"
      auto_revert      = true
    }

    task "api-server" {
      driver = "docker"

      config {
        image = "api-server:1.3"
      }
    }
  }
}
```

In the example above, the `update` stanza tells Nomad to create a single canary
when the job specification is changed. Below we can see how this works by
changing the image to run the new version:

```diff
@@ -2,6 +2,8 @@ job "docs" {
  group "api" {
    task "api-server" {
      config {
-       image = "api-server:1.3"
+       image = "api-server:1.4"
```

Next we plan and run these changes:

```text
$ nomad job plan docs.nomad
+/- Job: "docs"
+/- Task Group: "api" (1 canary, 5 ignore)
  +/- Task: "api-server" (forces create/destroy update)
    +/- Config {
      +/- image: "api-server:1.3" => "api-server:1.4"
        }

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 7
To submit the job with version verification run:

nomad job run -check-index 7 docs.nomad

When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

$ nomad job run docs.nomad
# ...
```

We can see from the plan output that Nomad is going to create 1 canary that
will run the "api-server:1.4" image and ignore all the allocations running the
older image. If we inspect the status, we see that the canary is running
alongside the older version of the job:

```text
$ nomad status docs
ID            = docs
Name          = docs
Submit Date   = 07/26/17 19:57:47 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         6        0       0         0

Latest Deployment
ID          = 32a080c1
Status      = running
Description = Deployment is running but requires promotion

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy
api         true         false     5        1         1       1        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created At
85662a7a  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
3ac3fe05  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
4bd51979  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
2998387b  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
35b813ee  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
b53b4289  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
```

Now if we promote the canary, this will trigger a rolling update to replace the
remaining allocations running the older image. The rolling update will happen at
a rate of `max_parallel`, so in this case one allocation at a time:

```text
$ nomad deployment promote ed28f6c2
==> Monitoring evaluation "37033151"
    Evaluation triggered by job "docs"
    Evaluation within deployment: "ed28f6c2"
    Allocation "f5057465" created: node "f6646949", group "api"
    Allocation "f5057465" status changed: "pending" -> "running"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "37033151" finished with status "complete"

$ nomad status docs
ID            = docs
Name          = docs
Submit Date   = 07/26/17 20:28:59 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         5        0       2         0

Latest Deployment
ID          = ed28f6c2
Status      = running
Description = Deployment is running

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy
api         true         true      5        1         2       1        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
f5057465  f6646949  api         1        run      running   07/26/17 20:29:23 UTC
b1c88d20  f6646949  api         1        run      running   07/26/17 20:28:59 UTC
1140bacf  f6646949  api         0        run      running   07/26/17 20:28:37 UTC
1958a34a  f6646949  api         0        run      running   07/26/17 20:28:37 UTC
4bda385a  f6646949  api         0        run      running   07/26/17 20:28:37 UTC
62d96f06  f6646949  api         0        stop     complete  07/26/17 20:28:37 UTC
f58abbb2  f6646949  api         0        stop     complete  07/26/17 20:28:37 UTC
```

Alternatively, if the canary is not performing properly, we can abandon the
change using the `nomad deployment fail` command, as in the blue/green
example.
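
After failing a canary deployment, one way to confirm the automatic revert is to list the job's versions; a sketch:

```text
$ nomad job history docs
```

Each entry shows a job version and its submit time, and passing `-p` displays the diff between versions, making it easy to verify that the working copy of the job is back on the "api-server:1.3" specification.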