github.com/apprenda/kismatic@v1.12.0/docs/design-decisions/upgrades.md

# Cluster Upgrade

Status: Implemented.

This document describes the initial implementation of upgrades in Kismatic. An upgrade
is defined as the replacement of _binaries_ or _configuration_ of a cluster created by Kismatic.
An upgrade does not include the operating system, or any packages that are not managed by Kismatic.

This upgrade implementation is concerned with upgrading the following:
* Etcd clusters (Kubernetes and Networking)
* Kubernetes components
* Docker (if we decide to support a newer version)
* Calico
* On-cluster services (e.g. DNS, Dashboard, etc.)

## Versions
Every component on a cluster has a *current* version, and may be transitioning to
exactly one *target version*.

A cluster’s components may therefore be at any number of current versions, while the
cluster as a whole may be transitioning to exactly one target version. The cluster is
said to be “at version X” only if all of its components are at that version. Some
operations may not be performed on a cluster in transition, because of the complexity
of applying cluster-level decisions to a system in an unknown state.

Each Kismatic release has a single target version. We will attempt to support { some number }
of Kismatic versions back in time.

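The “at version X” rule above can be sketched as follows; the types and names here are illustrative, not Kismatic’s actual API.

```go
package main

import "fmt"

// componentVersion is an illustrative record of one component's current version.
type componentVersion struct {
	Name    string
	Current string
}

// clusterVersion returns ("X", true) only when every component is at the same
// current version; otherwise the cluster is considered "in transition".
func clusterVersion(components []componentVersion) (string, bool) {
	if len(components) == 0 {
		return "", false
	}
	v := components[0].Current
	for _, c := range components[1:] {
		if c.Current != v {
			return "", false
		}
	}
	return v, true
}

func main() {
	comps := []componentVersion{
		{Name: "etcd", Current: "v1.12.0"},
		{Name: "kubernetes", Current: "v1.12.0"},
		{Name: "calico", Current: "v1.11.2"}, // still transitioning
	}
	if v, ok := clusterVersion(comps); ok {
		fmt.Println("cluster at version", v)
	} else {
		fmt.Println("cluster in transition; cluster-level operations refused")
	}
}
```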
## Can I add a feature during an upgrade?
No. You need a cluster at a consistent, known version to add a feature.

## Safety
Safety is the first concern of upgrading Kubernetes. An unsafe upgrade is one that results in
loss of data or critical functionality.

Kubernetes does not have a concept of a stable workload installation, but by default it won’t
move pods unless there is a problem with them. This relative stability means it’s possible to
use Kubernetes to stand up workloads in configurations that aren’t at all safe,
such as a database accepting writes while running in a single pod with an emptyDir volume.

It is not Kismatic’s responsibility to fix these workloads. However, it is also not acceptable
for Kismatic to identify that a workload is unsafe and then perform an action that would cause
it to lose data; something as simple as disconnecting the Kubelet long enough for Kubernetes to
re-schedule the pod could do so.

## Availability
Availability is the second concern of upgrading Kubernetes. An upgrade interrupts
cluster availability if it results in the loss of a global cluster function
(such as removing the last master or ingress node, or breaking etcd quorum),
and it interrupts workload availability if it reduces a service to 0 active pods.

## Upgrade Safety and Availability
The cluster operator will be able to choose between two upgrade modalities.

### Offline upgrade
The offline upgrade is the most basic upgrade supported by Kismatic. In this mode, the cluster
will be upgraded regardless of potential safety or availability issues. More specifically,
Kismatic will not perform any safety or availability checks before performing the upgrade, nor will it
cordon or drain nodes during the upgrade. This is suitable for clusters that are not housing production
workloads.

### Online upgrade
The online upgrade is gated by safety and availability checks. In this mode, Kismatic will report
any upgrade condition that is potentially unsafe or likely to cause a loss of availability.
When faced with this situation, the upgrade will not proceed.

Operators may address the safety and availability concerns by:
* Manually scaling out nodes or pods, where applicable
* Manually scaling down or removing the unsafe workload
* Forcing the upgrade by using the offline modality

The following table outlines the conditions Kismatic will use to determine upgrade
safety and availability.

| Detected condition                        | Reasoning                                                                 |
|-------------------------------------------|---------------------------------------------------------------------------|
| Pod not managed by RC, RS, Job, DS, or SS | Potentially unsafe: unmanaged pod will not be rescheduled                 |
| Pods without peers (i.e. replicas = 1)    | Potentially unavailable: singleton pod will be unavailable during upgrade |
| DaemonSet scheduled on a single node      | Potentially unavailable: singleton pod will be unavailable during upgrade |
| Pod using EmptyDir volume                 | Potentially unsafe: pod will lose the data in this volume                 |
| Pod using HostPath volume                 | Potentially unsafe: pod will lose the data in this volume                 |
| Pod using HostPath persistent volume      | Potentially unsafe: pod will lose the data in this volume                 |
| Master node in a cluster with <2 masters  | Unavailable: upgrading the master will bring the control plane down       |
| Worker node in a cluster with <2 workers  | Unavailable: upgrading the worker will bring all workloads down           |
| Ingress node                              | Unavailable: we can't ensure that ingress nodes are load balanced         |
| Gluster node                              | Potentially unavailable: brick on node will become unavailable            |

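A minimal sketch of how the table’s pod-level conditions might be evaluated, assuming a simplified pod record rather than the real Kubernetes API types:

```go
package main

import "fmt"

// podInfo is a simplified, illustrative view of a pod for upgrade checks;
// it is not a Kubernetes API type.
type podInfo struct {
	Name         string
	Managed      bool // owned by an RC, RS, Job, DS, or SS
	Replicas     int
	UsesEmptyDir bool
	UsesHostPath bool
}

// checkPod returns the findings for one pod, following the conditions table.
func checkPod(p podInfo) []string {
	var findings []string
	if !p.Managed {
		findings = append(findings, "potentially unsafe: unmanaged pod will not be rescheduled")
	}
	if p.Replicas == 1 {
		findings = append(findings, "potentially unavailable: singleton pod will be unavailable during upgrade")
	}
	if p.UsesEmptyDir || p.UsesHostPath {
		findings = append(findings, "potentially unsafe: pod will lose the data in a node-local volume")
	}
	return findings
}

func main() {
	p := podInfo{Name: "db-0", Managed: true, Replicas: 1, UsesEmptyDir: true}
	for _, f := range checkPod(p) {
		fmt.Printf("%s: %s\n", p.Name, f)
	}
}
```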
## Readiness
Validation (a.k.a. preflight) during an upgrade is about node readiness. In other words,
validation answers the question: can the node be expected to safely install
the new software and configuration?

The following checks are performed on each node to determine readiness:
1. Disk space: ensure that there is enough disk space on the node for upgrading.
2. Packages: when package installation by Kismatic is disabled, ensure that the new packages are already installed.

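The two readiness checks could look roughly like this; the 1 GiB threshold and the package names are assumptions for the sketch, not Kismatic’s actual values.

```go
package main

import "fmt"

// nodeFacts is an illustrative snapshot gathered from a node during preflight.
type nodeFacts struct {
	FreeDiskBytes     uint64
	InstalledPackages map[string]bool
}

// checkReadiness applies the two checks above: enough disk space, and, when
// Kismatic's package installation is disabled, presence of the new packages.
func checkReadiness(n nodeFacts, pkgInstallDisabled bool, required []string) []string {
	var problems []string
	if n.FreeDiskBytes < 1<<30 { // assumed 1 GiB minimum
		problems = append(problems, "insufficient disk space for upgrade")
	}
	if pkgInstallDisabled {
		for _, pkg := range required {
			if !n.InstalledPackages[pkg] {
				problems = append(problems, "missing package: "+pkg)
			}
		}
	}
	return problems
}

func main() {
	n := nodeFacts{FreeDiskBytes: 2 << 30, InstalledPackages: map[string]bool{"kubelet": true}}
	fmt.Println(checkReadiness(n, true, []string{"kubelet", "docker"}))
}
```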
## Order of upgrade
1. All etcd nodes
2. All master nodes
3. All worker nodes (regardless of specialization)
4. The “on-cluster” Docker Registry
5. Other on-cluster systems (DNS, dashboard, etc.)

Nodes are upgraded one node at a time.

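The ordering above can be sketched as a fixed sequence of node roles, upgraded one node at a time; role names and function signatures are illustrative.

```go
package main

import "fmt"

// upgradeOrder encodes the sequence above: etcd first, then masters, then
// workers. On-cluster systems (registry, DNS, dashboard) follow afterwards.
var upgradeOrder = []string{"etcd", "master", "worker"}

// upgradeCluster walks the roles in order, upgrading one node at a time and
// stopping at the first failure.
func upgradeCluster(nodesByRole map[string][]string, upgradeNode func(string) error) error {
	for _, role := range upgradeOrder {
		for _, node := range nodesByRole[role] {
			if err := upgradeNode(node); err != nil {
				return fmt.Errorf("upgrading %s node %s: %v", role, node, err)
			}
		}
	}
	return nil
}

func main() {
	nodes := map[string][]string{
		"etcd":   {"etcd1", "etcd2", "etcd3"},
		"master": {"master1"},
		"worker": {"worker1", "worker2"},
	}
	_ = upgradeCluster(nodes, func(n string) error {
		fmt.Println("upgrading", n)
		return nil
	})
}
```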
## Partial upgrade
Both the offline and online upgrade modalities will allow for partially upgrading a cluster.
A partial upgrade involves upgrading only the nodes that did not report a problem. This enables
an operator to upgrade part of the cluster online and to upgrade the rest of the cluster in a
smaller downtime window.

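A partial upgrade amounts to pruning problem nodes from the plan; a minimal sketch, with illustrative names:

```go
package main

import "fmt"

// prunePlan splits the planned nodes into those safe to upgrade and those
// skipped because they reported readiness, safety, or availability problems.
func prunePlan(planned []string, problems map[string][]string) (upgrade, skipped []string) {
	for _, node := range planned {
		if len(problems[node]) > 0 {
			skipped = append(skipped, node)
			continue
		}
		upgrade = append(upgrade, node)
	}
	return upgrade, skipped
}

func main() {
	planned := []string{"worker1", "worker2", "worker3"}
	problems := map[string][]string{"worker2": {"insufficient disk space"}}
	upgrade, skipped := prunePlan(planned, problems)
	fmt.Println("upgrading:", upgrade)
	fmt.Println("skipped:", skipped)
}
```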
## User Experience
```
kismatic info [-f planfile]
```
Prints the version of the cluster and of all its nodes.

```
kismatic upgrade online [-f plan]
```
Computes the upgrade plan for a cluster, detecting nodes NOT at the target version.
Checks node readiness for upgrade. If no unsafe or unavailable workloads are detected,
the plan is executed immediately. If any nodes are unready, unsafe, or unavailable,
the command prints them and quits.

```
kismatic upgrade online [-f plan] --partial-ok
```
Computes the upgrade plan for a cluster, detecting nodes NOT at the target version.
Checks node readiness for upgrade. Any nodes that are unready, unsafe, or unavailable
are pruned from the plan, which is then executed immediately. The pruned nodes are
printed before the command quits.

```
kismatic upgrade offline [-f plan]
```
Computes the upgrade plan for a cluster, detecting nodes NOT at the target version.
Checks node readiness for upgrade. If no unready nodes are detected, the plan is
executed immediately. If any nodes are unready, the command prints them and quits.

```
kismatic upgrade offline [-f plan] --partial-ok
```
Computes the upgrade plan for a cluster, detecting nodes NOT at the target version.
Checks node readiness for upgrade. All unready nodes are pruned from the plan,
which is then executed immediately. The unready nodes are printed, then the command quits.