# Cluster Upgrade

Status: Implemented.

This document describes the initial implementation of upgrades in Kismatic. An upgrade
is defined as the replacement of _binaries_ or _configuration_ of a cluster created by Kismatic.
An upgrade does not include the operating system, or any packages that are not managed by Kismatic.

This upgrade implementation is concerned with upgrading the following:
* Etcd clusters (Kubernetes and Networking)
* Kubernetes components
* Docker (if we decide to support a newer version)
* Calico
* On-cluster services (e.g. DNS, Dashboard, etc.)

## Versions
Every component on a cluster has a *current* version, and may be transitioning to
exactly one *target* version.

Every cluster has components with some number of current versions, and may be transitioning
to exactly one target version. The cluster is said to be “at version X” only
if all components are at that version. Some operations will not be
performed on a cluster in transition because of the complexity of applying cluster-level
decisions to a system in an unknown state.

Each Kismatic release has a single target version. We will attempt to support { some number }
of Kismatic versions back in time.

## Can I add a feature during an upgrade?
No. You need a cluster at a consistent, known version to add a feature.

## Safety
Safety is the first concern of upgrading Kubernetes. An unsafe upgrade is one that results in
loss of data or critical functionality.

Kubernetes does not have a concept of a stable workload installation, but by default it won’t
move pods unless there is a problem with them. This relative stability means it’s possible to
use Kubernetes to stand up workloads in configurations that aren’t at all safe,
such as a database accepting writes while running in a single pod with an emptyDir volume.

It is not Kismatic’s responsibility to fix these workloads. However, it is also not acceptable
for us to identify a workload as unsafe and then perform an action that causes it to lose data,
which could be as simple as disconnecting the kubelet long enough that Kubernetes reschedules the pod.

## Availability
Availability is the second concern of upgrading Kubernetes. An upgrade interrupts
cluster availability if it results in the loss of a global cluster function
(such as removing the last master or ingress node, or breaking etcd cluster quorum),
and it interrupts workload availability if it results in the reduction of a service to 0 active pods.

## Upgrade Safety and Availability
The cluster operator will be able to choose between two upgrade modalities.

### Offline upgrade
The offline upgrade is the most basic upgrade supported by Kismatic. In this mode, the cluster
will be upgraded regardless of potential safety or availability issues. More specifically,
Kismatic will not perform any safety or availability checks before performing the upgrade, nor will it
cordon or drain nodes during the upgrade. This mode is suitable for clusters that are not hosting
production workloads.

### Online upgrade
The online upgrade is gated by safety and availability checks. In this mode, Kismatic will report
any upgrade condition that is potentially unsafe or likely to cause a loss of availability.
When such a condition is detected, the upgrade will not proceed.
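
To make the gating concrete, here is a minimal sketch of how one condition from the table
below (a pod not managed by a controller) could be detected with client-go. The function,
kubeconfig path, and node name are illustrative assumptions, not Kismatic’s actual
implementation:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// unmanagedPods returns pods scheduled on the given node that have no
// controller owner (no ReplicaSet, DaemonSet, StatefulSet, Job, etc.).
// Such pods are potentially unsafe to upgrade around, because nothing
// will reschedule them if the node is drained.
func unmanagedPods(client kubernetes.Interface, nodeName string) ([]string, error) {
	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return nil, err
	}
	var names []string
	for i := range pods.Items {
		if metav1.GetControllerOf(&pods.Items[i]) == nil {
			names = append(names, pods.Items[i].Namespace+"/"+pods.Items[i].Name)
		}
	}
	return names, nil
}

func main() {
	// Assumed kubeconfig location and node name, for illustration only.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	pods, err := unmanagedPods(client, "worker-1")
	if err != nil {
		panic(err)
	}
	for _, p := range pods {
		fmt.Println("potentially unsafe (unmanaged pod):", p)
	}
}
```
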
Operators may address the safety and availability concerns by:
* Manually scaling out nodes or pods, where applicable
* Manually scaling down or removing the unsafe workload
* Forcing the upgrade by using the offline modality

The following table outlines the conditions Kismatic will use to determine upgrade
safety and availability.

| Detected condition                         | Reasoning                                                                 |
|--------------------------------------------|---------------------------------------------------------------------------|
| Pod not managed by RC, RS, Job, DS, or SS  | Potentially unsafe: unmanaged pod will not be rescheduled                 |
| Pod without peers (i.e. replicas = 1)      | Potentially unavailable: singleton pod will be unavailable during upgrade |
| DaemonSet scheduled on a single node       | Potentially unavailable: singleton pod will be unavailable during upgrade |
| Pod using an emptyDir volume               | Potentially unsafe: pod will lose the data in this volume                 |
| Pod using a hostPath volume                | Potentially unsafe: pod will lose the data in this volume                 |
| Pod using a hostPath persistent volume     | Potentially unsafe: pod will lose the data in this volume                 |
| Master node in a cluster with <2 masters   | Unavailable: upgrading the master will bring the control plane down      |
| Worker node in a cluster with <2 workers   | Unavailable: upgrading the worker will bring all workloads down          |
| Ingress node                               | Unavailable: we can't ensure that ingress nodes are load balanced        |
| Gluster node                               | Potentially unavailable: brick on node will become unavailable           |

## Readiness
Validation (a.k.a. preflight) during an upgrade is about node readiness. In other words,
validation answers the question: can the node be expected to safely install
the new software and configuration?

The following checks are performed on each node to determine readiness:
1. Disk space: ensure that there is enough disk space on the node for the upgrade.
2. Packages: when package installation is disabled, ensure that the new packages are already installed.

## Order of upgrade
1. All etcd nodes
2. All master nodes
3. All worker nodes (regardless of specialization)
4. The “on-cluster” Docker Registry
5. Other on-cluster systems (DNS, dashboard, etc.)

Nodes are upgraded one node at a time.

## Partial upgrade
Both the offline and online upgrade modalities allow for partially upgrading a cluster.
A partial upgrade involves upgrading only the nodes that did not report a problem. This enables
an operator to upgrade part of the cluster online, and to upgrade the rest of the cluster
in a smaller downtime window.

## User Experience
```
kismatic info [-f planfile]
```
Prints the version of the cluster and all nodes.

```
kismatic upgrade online [-f plan]
```
Computes the upgrade plan for a cluster, detecting nodes NOT at the target version.
Checks node readiness for upgrade. If no unsafe or unavailable workloads are detected,
the plan is immediately executed. If any nodes are unready, unsafe, or unavailable,
the command prints them and quits.

```
kismatic upgrade online [-f plan] --partial-ok
```
Computes the upgrade plan for a cluster, detecting nodes NOT at the target version.
Checks node readiness for upgrade. All unready, unsafe, or unavailable nodes are pruned
from the plan, which is then immediately executed. The pruned nodes are printed before
the command quits.
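
A minimal sketch of what this pruning step could look like, assuming the plan is represented
as an ordered list of nodes annotated with any problems found during the checks. The types
and helper below are hypothetical, not taken from the Kismatic codebase:

```go
package main

import "fmt"

// nodePlan is a hypothetical representation of one node in an upgrade plan.
type nodePlan struct {
	Name   string
	Role   string   // "etcd", "master", or "worker"
	Issues []string // readiness/safety/availability problems found during checks
}

// prunePlan models the --partial-ok behavior described above: nodes with
// detected issues are dropped from the plan; the remaining nodes keep their
// original order (etcd first, then masters, then workers).
func prunePlan(plan []nodePlan) (upgrade, pruned []nodePlan) {
	for _, n := range plan {
		if len(n.Issues) > 0 {
			pruned = append(pruned, n)
			continue
		}
		upgrade = append(upgrade, n)
	}
	return upgrade, pruned
}

func main() {
	plan := []nodePlan{
		{Name: "etcd-1", Role: "etcd"},
		{Name: "master-1", Role: "master", Issues: []string{"only master in the cluster"}},
		{Name: "worker-1", Role: "worker", Issues: []string{"hosts a pod with an emptyDir volume"}},
		{Name: "worker-2", Role: "worker"},
	}
	upgrade, pruned := prunePlan(plan)
	for _, n := range pruned {
		fmt.Printf("pruned %s: %v\n", n.Name, n.Issues)
	}
	for _, n := range upgrade {
		fmt.Printf("will upgrade %s (%s)\n", n.Name, n.Role)
	}
}
```
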
```
kismatic upgrade offline [-f plan]
```
Computes the upgrade plan for a cluster, detecting nodes NOT at the target version.
Checks node readiness for upgrade. If no unready nodes are detected, the plan is
immediately executed. If any nodes are unready, the command prints them and quits.

```
kismatic upgrade offline [-f plan] --partial-ok
```
Computes the upgrade plan for a cluster, detecting nodes NOT at the target version.
Checks node readiness for upgrade. All unready nodes are pruned from the plan, which is
then immediately executed. The pruned nodes are printed before the command quits.
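
The readiness referenced by these commands is the one defined in the Readiness section.
As an illustration, the sketch below shows a disk-space check of the kind listed there,
assuming Linux nodes; the path and threshold are invented for the example and are not the
values Kismatic actually uses:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// enoughDisk reports whether the filesystem containing path has at least
// minBytes available, mirroring the "disk space" readiness check described
// in the Readiness section. Linux-only (uses statfs).
func enoughDisk(path string, minBytes uint64) (bool, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(path, &st); err != nil {
		return false, err
	}
	free := st.Bavail * uint64(st.Bsize) // bytes available to unprivileged users
	return free >= minBytes, nil
}

func main() {
	// Hypothetical values: require 1 GiB free under /var.
	ok, err := enoughDisk("/var", 1<<30)
	if err != nil {
		panic(err)
	}
	fmt.Println("node ready (disk space):", ok)
}
```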