# Cluster Upgrade

Status: Implemented.

This document describes the initial implementation of upgrades in Kismatic. An upgrade
is defined as the replacement of _binaries_ or _configuration_ of a cluster created by Kismatic.
An upgrade does not include the operating system, or any packages that are not managed by Kismatic.

This upgrade implementation is concerned with upgrading the following:
* Etcd clusters (Kubernetes and Networking)
* Kubernetes components
* Docker (if we decide to support a newer version)
* Calico
* On-cluster services (e.g. DNS, Dashboard, etc.)

## Versions
Every component on a cluster has a *current* version, and may be transitioning to
exactly one *target* version.

Every cluster has components with some number of current versions, and may be transitioning
to exactly one target version. The cluster is said to be “at version X” only
if all components are at that version. Some operations will not be
performed on a cluster in transition because of the complexity of applying cluster-level
decisions to a system in an unknown state.

Each Kismatic release has a single target version. We will attempt to support { some number }
of Kismatic versions back in time.

## Can I add a feature during an upgrade?
No. You need a cluster at a consistent, known version to add a feature.

## Safety
Safety is the first concern of upgrading Kubernetes. An unsafe upgrade is one that results in
loss of data or critical functionality.

Kubernetes does not have a concept of a stable workload installation, but by default it won’t
move pods unless there is a problem with them. This relative stability means it’s possible to
use Kubernetes to stand up workloads in configurations that aren’t at all safe,
such as a database accepting writes while running in a single pod with an emptyDir volume.

It is not Kismatic’s responsibility to fix these workloads. However, it is also not acceptable
for us to identify a workload as unsafe and then perform an action that causes it to lose data,
which could be as simple as disconnecting the kubelet long enough that Kubernetes reschedules the pod.

## Availability
Availability is the second concern of upgrading Kubernetes. An upgrade interrupts
cluster availability if it results in the loss of a global cluster function
(such as removing the last master or ingress node, or breaking etcd cluster quorum),
and it interrupts workload availability if it results in the reduction of a service to 0 active pods.

## Upgrade Safety and Availability
The cluster operator will be able to choose between two upgrade modalities.

### Offline upgrade
The offline upgrade is the most basic upgrade supported by Kismatic. In this mode, the cluster
will be upgraded regardless of potential safety or availability issues. More specifically,
Kismatic will not perform any safety or availability checks before performing the upgrade, nor will it
cordon or drain nodes during the upgrade. This mode is suitable for clusters that are not hosting
production workloads.

### Online upgrade
The online upgrade is gated by safety and availability checks. In this mode, Kismatic will report
any upgrade condition that is potentially unsafe or likely to cause a loss of availability.
When such a condition is detected, the upgrade will not proceed.
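
To make the gating concrete, here is a minimal sketch of how one condition from the table
below (a pod not managed by a controller) could be detected with client-go. The function,
kubeconfig path, and node name are illustrative assumptions, not Kismatic’s actual
implementation:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// unmanagedPods returns pods scheduled on the given node that have no
// controller owner (no ReplicaSet, DaemonSet, StatefulSet, Job, etc.).
// Such pods are potentially unsafe to upgrade around, because nothing
// will reschedule them if the node is drained.
func unmanagedPods(client kubernetes.Interface, nodeName string) ([]string, error) {
	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return nil, err
	}
	var names []string
	for i := range pods.Items {
		if metav1.GetControllerOf(&pods.Items[i]) == nil {
			names = append(names, pods.Items[i].Namespace+"/"+pods.Items[i].Name)
		}
	}
	return names, nil
}

func main() {
	// Assumed kubeconfig location and node name, for illustration only.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	pods, err := unmanagedPods(client, "worker-1")
	if err != nil {
		panic(err)
	}
	for _, p := range pods {
		fmt.Println("potentially unsafe (unmanaged pod):", p)
	}
}
```
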
Operators may address the safety and availability concerns by:
* Manually scaling out nodes or pods, where applicable
* Manually scaling down or removing the unsafe workload
* Forcing the upgrade by using the offline modality

The following table outlines the conditions Kismatic will use to determine upgrade
safety and availability.

| Detected condition                         | Reasoning                                                                 |
|--------------------------------------------|---------------------------------------------------------------------------|
| Pod not managed by RC, RS, Job, DS, or SS  | Potentially unsafe: unmanaged pod will not be rescheduled                 |
| Pod without peers (i.e. replicas = 1)      | Potentially unavailable: singleton pod will be unavailable during upgrade |
| DaemonSet scheduled on a single node       | Potentially unavailable: singleton pod will be unavailable during upgrade |
| Pod using an emptyDir volume               | Potentially unsafe: pod will lose the data in this volume                 |
| Pod using a hostPath volume                | Potentially unsafe: pod will lose the data in this volume                 |
| Pod using a hostPath persistent volume     | Potentially unsafe: pod will lose the data in this volume                 |
| Master node in a cluster with <2 masters   | Unavailable: upgrading the master will bring the control plane down      |
| Worker node in a cluster with <2 workers   | Unavailable: upgrading the worker will bring all workloads down          |
| Ingress node                               | Unavailable: we can't ensure that ingress nodes are load balanced        |
| Gluster node                               | Potentially unavailable: brick on node will become unavailable           |

## Readiness
Validation (a.k.a. preflight) during an upgrade is about node readiness. In other words,
validation answers the question: can the node be expected to safely install
the new software and configuration?

The following checks are performed on each node to determine readiness:
1. Disk space: ensure that there is enough disk space on the node for the upgrade.
2. Packages: when package installation is disabled, ensure that the new packages are already installed.

## Order of upgrade
1. All etcd nodes
2. All master nodes
3. All worker nodes (regardless of specialization)
4. The “on-cluster” Docker Registry
5. Other on-cluster systems (DNS, dashboard, etc.)

Nodes are upgraded one node at a time.

## Partial upgrade
Both the offline and online upgrade modalities allow for partially upgrading a cluster.
A partial upgrade involves upgrading only the nodes that did not report a problem. This enables
an operator to upgrade part of the cluster online, and to upgrade the rest of the cluster
in a smaller downtime window.

## User Experience
```
kismatic info [-f planfile]
```
Prints the version of the cluster and all nodes.

```
kismatic upgrade online [-f plan]
```
Computes the upgrade plan for a cluster, detecting nodes NOT at the target version.
Checks node readiness for upgrade. If no unsafe or unavailable workloads are detected,
the plan is immediately executed. If any nodes are unready, unsafe, or unavailable,
the command prints them and quits.

```
kismatic upgrade online [-f plan] --partial-ok
```
Computes the upgrade plan for a cluster, detecting nodes NOT at the target version.
Checks node readiness for upgrade. All unready, unsafe, or unavailable nodes are pruned
from the plan, which is then immediately executed. The pruned nodes are printed before
the command quits.
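
A minimal sketch of what this pruning step could look like, assuming the plan is represented
as an ordered list of nodes annotated with any problems found during the checks. The types
and helper below are hypothetical, not taken from the Kismatic codebase:

```go
package main

import "fmt"

// nodePlan is a hypothetical representation of one node in an upgrade plan.
type nodePlan struct {
	Name   string
	Role   string   // "etcd", "master", or "worker"
	Issues []string // readiness/safety/availability problems found during checks
}

// prunePlan models the --partial-ok behavior described above: nodes with
// detected issues are dropped from the plan; the remaining nodes keep their
// original order (etcd first, then masters, then workers).
func prunePlan(plan []nodePlan) (upgrade, pruned []nodePlan) {
	for _, n := range plan {
		if len(n.Issues) > 0 {
			pruned = append(pruned, n)
			continue
		}
		upgrade = append(upgrade, n)
	}
	return upgrade, pruned
}

func main() {
	plan := []nodePlan{
		{Name: "etcd-1", Role: "etcd"},
		{Name: "master-1", Role: "master", Issues: []string{"only master in the cluster"}},
		{Name: "worker-1", Role: "worker", Issues: []string{"hosts a pod with an emptyDir volume"}},
		{Name: "worker-2", Role: "worker"},
	}
	upgrade, pruned := prunePlan(plan)
	for _, n := range pruned {
		fmt.Printf("pruned %s: %v\n", n.Name, n.Issues)
	}
	for _, n := range upgrade {
		fmt.Printf("will upgrade %s (%s)\n", n.Name, n.Role)
	}
}
```
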
```
kismatic upgrade offline [-f plan]
```
Computes the upgrade plan for a cluster, detecting nodes NOT at the target version.
Checks node readiness for upgrade. If no unready nodes are detected, the plan is
immediately executed. If any nodes are unready, the command prints them and quits.

```
kismatic upgrade offline [-f plan] --partial-ok
```
Computes the upgrade plan for a cluster, detecting nodes NOT at the target version.
Checks node readiness for upgrade. All unready nodes are pruned from the plan, which is
then immediately executed. The pruned nodes are printed before the command quits.
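
The readiness referenced by these commands is the one defined in the Readiness section.
As an illustration, the sketch below shows a disk-space check of the kind listed there,
assuming Linux nodes; the path and threshold are invented for the example and are not the
values Kismatic actually uses:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// enoughDisk reports whether the filesystem containing path has at least
// minBytes available, mirroring the "disk space" readiness check described
// in the Readiness section. Linux-only (uses statfs).
func enoughDisk(path string, minBytes uint64) (bool, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(path, &st); err != nil {
		return false, err
	}
	free := st.Bavail * uint64(st.Bsize) // bytes available to unprivileged users
	return free >= minBytes, nil
}

func main() {
	// Hypothetical values: require 1 GiB free under /var.
	ok, err := enoughDisk("/var", 1<<30)
	if err != nil {
		panic(err)
	}
	fmt.Println("node ready (disk space):", ok)
}
```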