---
layout: "guides"
page_title: "Blue/Green & Canary Deployments - Operating a Job"
sidebar_current: "guides-operating-a-job-updating-blue-green-deployments"
description: |-
  Nomad has built-in support for doing blue/green and canary deployments to more
  safely update existing applications and services.
---

# Blue/Green & Canary Deployments

Sometimes [rolling
upgrades](/guides/operating-a-job/update-strategies/rolling-upgrades.html) do not
offer the required flexibility for updating an application in production. Often
organizations prefer to put a "canary" build into production or utilize a
technique known as a "blue/green" deployment to ensure a safe application
rollout to production while minimizing downtime.

## Blue/Green Deployments

Blue/green deployments go by several other names, including red/black and A/B,
but the concept is generally the same. In a blue/green deployment, there are two
application versions. Only one application version is active at a time, except
during the transition phase from one version to the next. The term "active"
tends to mean "receiving traffic" or "in service".

Imagine a hypothetical API server with five instances deployed to production at
version 1.3, and we want to safely upgrade to version 1.4. We want to create
five new instances at version 1.4 and, if they are operating correctly, promote
them and take down the five instances running 1.3. In the event of failure, we
can quickly roll back to 1.3.

To start, we examine our job which is running in production:

```hcl
job "docs" {
  # ...

  group "api" {
    count = 5

    update {
      max_parallel     = 1
      canary           = 5
      min_healthy_time = "30s"
      healthy_deadline = "10m"
      auto_revert      = true
      auto_promote     = false
    }

    task "api-server" {
      driver = "docker"

      config {
        image = "api-server:1.3"
      }
    }
  }
}
```

We see that it has an `update` stanza that sets `canary` equal to the desired
count. This is what allows us to easily model blue/green deployments. When we
change the job to run the "api-server:1.4" image, Nomad will create 5 new
allocations without touching the original "api-server:1.3" allocations. Below we
can see how this works by changing the image to run the new version:

```diff
@@ -2,6 +2,8 @@ job "docs" {
  group "api" {
    task "api-server" {
      config {
-       image = "api-server:1.3"
+       image = "api-server:1.4"
```

Next we plan and run these changes:

```text
$ nomad job plan docs.nomad
+/- Job: "docs"
+/- Task Group: "api" (5 canary, 5 ignore)
  +/- Task: "api-server" (forces create/destroy update)
    +/- Config {
      +/- image: "api-server:1.3" => "api-server:1.4"
        }

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 7
To submit the job with version verification run:

nomad job run -check-index 7 docs.nomad

When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

$ nomad job run docs.nomad
# ...
```

We can see from the plan output that Nomad is going to create 5 canaries
running the "api-server:1.4" image and ignore all the allocations running the
older image. Now if we examine the status of the job, we can see that both the
blue ("api-server:1.3") and green ("api-server:1.4") sets are running.

```text
$ nomad status docs
ID            = docs
Name          = docs
Submit Date   = 07/26/17 19:57:47 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         10       0       0         0

Latest Deployment
ID          = 32a080c1
Status      = running
Description = Deployment is running but requires manual promotion

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy
api         true         false     5        5         5       5        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created At
6d8eec42  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
7051480e  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
36c6610f  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
410ba474  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
85662a7a  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
3ac3fe05  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
4bd51979  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
2998387b  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
35b813ee  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
b53b4289  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
```

Now that we have the new set in production, we can route traffic to it and
validate that the new job version is working properly. Depending on whether the
new version is functioning correctly, we will want to either promote or fail
the deployment.

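A common way to steer traffic between the blue and green sets is to register the canaries in service discovery under distinct tags. The `service` stanza's `canary_tags` parameter applies a different tag list while an allocation is an unpromoted canary; the service name, port label, and tags below are illustrative:

```hcl
service {
  name = "api-server"
  port = "http"

  # Tags advertised by the active (promoted) set.
  tags = ["live"]

  # Tags advertised while the allocation is an unpromoted canary.
  # A load balancer can subscribe to the "canary" tag to send test
  # traffic to the green set before promotion.
  canary_tags = ["canary"]
}
```

Once the deployment is promoted, the canaries re-register with the regular `tags`, so routing can follow the deployment state automatically.
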
### Promoting the Deployment

After deploying the new image alongside the old version, we have determined it
is functioning properly and we want to transition fully to the new version.
Doing so is as simple as promoting the deployment:

```text
$ nomad deployment promote 32a080c1
==> Monitoring evaluation "61ac2be5"
    Evaluation triggered by job "docs"
    Evaluation within deployment: "32a080c1"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "61ac2be5" finished with status "complete"
```

If we look at the job's status after promotion, we see that Nomad has stopped
the older allocations and is running only the new ones. This completes our
blue/green deployment.

```text
$ nomad status docs
ID            = docs
Name          = docs
Submit Date   = 07/26/17 19:57:47 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         5        0       5         0

Latest Deployment
ID          = 32a080c1
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy
api         true         true      5        5         5       5        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
6d8eec42  087852e2  api         1        run      running   07/26/17 19:57:47 UTC
7051480e  087852e2  api         1        run      running   07/26/17 19:57:47 UTC
36c6610f  087852e2  api         1        run      running   07/26/17 19:57:47 UTC
410ba474  087852e2  api         1        run      running   07/26/17 19:57:47 UTC
85662a7a  087852e2  api         1        run      running   07/26/17 19:57:47 UTC
3ac3fe05  087852e2  api         0        stop     complete  07/26/17 19:53:56 UTC
4bd51979  087852e2  api         0        stop     complete  07/26/17 19:53:56 UTC
2998387b  087852e2  api         0        stop     complete  07/26/17 19:53:56 UTC
35b813ee  087852e2  api         0        stop     complete  07/26/17 19:53:56 UTC
b53b4289  087852e2  api         0        stop     complete  07/26/17 19:53:56 UTC
```

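The job specifications in this guide set `auto_promote = false`, making promotion a manual step. If the canaries' health checks are trusted, the `update` stanza can instead promote automatically once every canary is healthy; a minimal sketch:

```hcl
update {
  max_parallel     = 1
  canary           = 5
  min_healthy_time = "30s"
  healthy_deadline = "10m"
  auto_revert      = true

  # Promote the canaries automatically once all of them have been
  # healthy for min_healthy_time, with no operator intervention.
  auto_promote = true
}
```

Manual promotion remains the safer choice when validation involves steps Nomad cannot observe, such as inspecting logs or running smoke tests against the new set.
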
### Failing the Deployment

After deploying the new image alongside the old version, we have determined it
is not functioning properly and we want to roll back to the old version. Doing
so is as simple as failing the deployment:

```text
$ nomad deployment fail 32a080c1
Deployment "32a080c1-de5a-a4e7-0218-521d8344c328" failed. Auto-reverted to job version 0.

==> Monitoring evaluation "6840f512"
    Evaluation triggered by job "docs"
    Evaluation within deployment: "32a080c1"
    Allocation "0ccb732f" modified: node "36e7a123", group "api"
    Allocation "64d4f282" modified: node "36e7a123", group "api"
    Allocation "664e33c7" modified: node "36e7a123", group "api"
    Allocation "a4cb6a4b" modified: node "36e7a123", group "api"
    Allocation "fdd73bdd" modified: node "36e7a123", group "api"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "6840f512" finished with status "complete"
```

If we now look at the job's status, we can see that after failing the
deployment, Nomad stopped the new allocations, is running only the old ones,
and has reverted the working copy of the job back to the original specification
running "api-server:1.3".

```text
$ nomad status docs
ID            = docs
Name          = docs
Submit Date   = 07/26/17 19:57:47 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         5        0       5         0

Latest Deployment
ID          = 6f3f84b3
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Auto Revert  Desired  Placed  Healthy  Unhealthy
api         true         5        5       5        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
27dc2a42  36e7a123  api         1        stop     complete  07/26/17 20:07:31 UTC
5b7d34bb  36e7a123  api         1        stop     complete  07/26/17 20:07:31 UTC
983b487d  36e7a123  api         1        stop     complete  07/26/17 20:07:31 UTC
d1cbf45a  36e7a123  api         1        stop     complete  07/26/17 20:07:31 UTC
d6b46def  36e7a123  api         1        stop     complete  07/26/17 20:07:31 UTC
0ccb732f  36e7a123  api         2        run      running   07/26/17 20:06:29 UTC
64d4f282  36e7a123  api         2        run      running   07/26/17 20:06:29 UTC
664e33c7  36e7a123  api         2        run      running   07/26/17 20:06:29 UTC
a4cb6a4b  36e7a123  api         2        run      running   07/26/17 20:06:29 UTC
fdd73bdd  36e7a123  api         2        run      running   07/26/17 20:06:29 UTC

$ nomad job deployments docs
ID        Job ID   Job Version  Status      Description
6f3f84b3  docs     2            successful  Deployment completed successfully
32a080c1  docs     1            failed      Deployment marked as failed - rolling back to job version 0
c4c16494  docs     0            successful  Deployment completed successfully
```

## Canary Deployments

Canary updates are a useful way to test a new version of a job before beginning
a rolling upgrade. The `update` stanza supports setting the number of canaries
the job operator would like Nomad to create when the job changes via the
`canary` parameter. When the job specification is updated, Nomad creates the
canaries without stopping any allocations from the previous job.

This pattern allows operators to achieve higher confidence in the new job
version because they can route traffic, examine logs, etc., to verify that the
new application is performing properly.

```hcl
job "docs" {
  # ...

  group "api" {
    count = 5

    update {
      max_parallel     = 1
      canary           = 1
      min_healthy_time = "30s"
      healthy_deadline = "10m"
      auto_revert      = true
      auto_promote     = false
    }

    task "api-server" {
      driver = "docker"

      config {
        image = "api-server:1.3"
      }
    }
  }
}
```
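
Whether a canary counts as healthy is driven by its tasks' health checks together with `min_healthy_time`. For example, the group's `service` stanza might register an HTTP check like the following sketch (the service name, port label, path, and timings are illustrative):

```hcl
service {
  name = "api-server"
  port = "http"

  # The canary must keep passing this check for min_healthy_time
  # ("30s" above) before Nomad counts it as healthy and the
  # deployment becomes eligible for promotion.
  check {
    type     = "http"
    path     = "/health"
    interval = "10s"
    timeout  = "2s"
  }
}
```
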

In the example above, the `update` stanza tells Nomad to create a single canary
when the job specification is changed. Below we can see how this works by
changing the image to run the new version:

```diff
@@ -2,6 +2,8 @@ job "docs" {
  group "api" {
    task "api-server" {
      config {
-       image = "api-server:1.3"
+       image = "api-server:1.4"
```

Next we plan and run these changes:

```text
$ nomad job plan docs.nomad
+/- Job: "docs"
+/- Task Group: "api" (1 canary, 5 ignore)
  +/- Task: "api-server" (forces create/destroy update)
    +/- Config {
      +/- image: "api-server:1.3" => "api-server:1.4"
        }

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 7
To submit the job with version verification run:

nomad job run -check-index 7 docs.nomad

When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

$ nomad job run docs.nomad
# ...
```

We can see from the plan output that Nomad is going to create 1 canary that
will run the "api-server:1.4" image and ignore all the allocations running the
older image. If we inspect the status, we see that the canary is running
alongside the older version of the job:

```text
$ nomad status docs
ID            = docs
Name          = docs
Submit Date   = 07/26/17 19:57:47 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         6        0       0         0

Latest Deployment
ID          = 32a080c1
Status      = running
Description = Deployment is running but requires manual promotion

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy
api         true         false     5        1         1       1        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created At
85662a7a  087852e2  api         1        run      running  07/26/17 19:57:47 UTC
3ac3fe05  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
4bd51979  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
2998387b  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
35b813ee  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
b53b4289  087852e2  api         0        run      running  07/26/17 19:53:56 UTC
```

Now if we promote the canary, this triggers a rolling update to replace the
remaining allocations running the older image. The rolling update happens at a
rate of `max_parallel`, so in this case one allocation at a time:

```text
$ nomad deployment promote ed28f6c2
==> Monitoring evaluation "37033151"
    Evaluation triggered by job "docs"
    Evaluation within deployment: "ed28f6c2"
    Allocation "f5057465" created: node "f6646949", group "api"
    Allocation "f5057465" status changed: "pending" -> "running"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "37033151" finished with status "complete"

$ nomad status docs
ID            = docs
Name          = docs
Submit Date   = 07/26/17 20:28:59 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         5        0       2         0

Latest Deployment
ID          = ed28f6c2
Status      = running
Description = Deployment is running

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy
api         true         true      5        1         2       1        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
f5057465  f6646949  api         1        run      running   07/26/17 20:29:23 UTC
b1c88d20  f6646949  api         1        run      running   07/26/17 20:28:59 UTC
1140bacf  f6646949  api         0        run      running   07/26/17 20:28:37 UTC
1958a34a  f6646949  api         0        run      running   07/26/17 20:28:37 UTC
4bda385a  f6646949  api         0        run      running   07/26/17 20:28:37 UTC
62d96f06  f6646949  api         0        stop     complete  07/26/17 20:28:37 UTC
f58abbb2  f6646949  api         0        stop     complete  07/26/17 20:28:37 UTC
```

Alternatively, if the canary was not performing properly, we could abandon the
change using the `nomad deployment fail` command, similar to the blue/green
example.