---
title: Enable mounting etcd on a data disk
authors:
  - "@CecileRobertMichon"
reviewers:
  - "@bagnaram"
  - "@vincepri"
  - "@detiber"
  - "@fabrizio.pandini"
creation-date: 2020-04-23
last-updated: 2020-05-11
status: implementable
---

# Enable mounting etcd on a data disk

## Table of Contents

<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->

- [Glossary](#glossary)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals/Future Work](#non-goalsfuture-work)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
  - [Changes required in the bootstrap provider (i.e. CABPK)](#changes-required-in-the-bootstrap-provider-ie-cabpk)
  - [Changes required in the infrastructure provider (here Azure is used as an example to illustrate the required changes).](#changes-required-in-the-infrastructure-provider-here-azure-is-used-as-an-example-to-illustrate-the-required-changes)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Alternatives](#alternatives)
  - [Use script to do the etcd mount and append that script to preKubeadmCommands](#use-script-to-do-the-etcd-mount-and-append-that-script-to-prekubeadmcommands)
  - [Use Cloud init to mount the data dir but modify bootstrap data in the infrastructure provider before passing to the instance user data](#use-cloud-init-to-mount-the-data-dir-but-modify-bootstrap-data-in-the-infrastructure-provider-before-passing-to-the-instance-user-data)
  - [Instrument the OS image with image-builder to perform this customization automatically through custom UDEV rules, scripts, etc.](#instrument-the-os-image-with-image-builder-to-perform-this-customization-automatically-through-custom-udev-rules-scripts-etc)
- [Upgrade Strategy](#upgrade-strategy)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

## Glossary

Refer to the [Cluster API Book Glossary](https://cluster-api.sigs.k8s.io/reference/glossary.html).

## Summary

CAPZ issue that motivated this proposal: [Re-evaluate putting etcd in a data disk](https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/448).

Currently, we deploy etcd on the OS disk of the VM instance. To avoid potential issues related to the cached bandwidth/IOPS available on the VM, we should consider moving etcd to its own storage device. This might seem like an infrastructure-only change at first glance; however, in order to implement this proposal, we would need to mount the etcd disk during bootstrapping.

This is one of the cases where infrastructure and bootstrapping overlap, so we need to ensure we set the right precedent for future cases. In addition, most infrastructure providers will face a similar challenge, so a common solution would be beneficial. One possible workaround to achieve this without any modifications to CABPK would be to leverage `preKubeadmCommands` to insert a script that does the mounting, or to append a cloud-init section to the bootstrap data to mount the disk right before setting the instance user data in the infrastructure provider. This is not a desired outcome, however, because it would mean either 1) using `preKubeadmCommands` for something infrastructure-specific that is required for all clusters, whereas `preKubeadmCommands` is meant for user customization, or 2) making an assumption in the infrastructure provider about the content and format of the bootstrap data.

## Motivation

The main motivation of this proposal is to allow running etcd on a data disk to improve performance and reduce throttling.
A "side effect" motivation is to open up the option of putting other directories, such as `/var/lib/containerd` and `/var/log`, onto separate partitions as well. This would also make upgrading the OS and/or repairing a broken OS install much easier, since all the important data would be located on a data disk.

References:
- https://docs.microsoft.com/en-us/archive/blogs/xiangwu/azure-vm-storage-performance-and-throttling-demystify
- https://github.com/mesosphere/dcos-docs-site/blob/f15c4d9cd8c36c8a406a22b76b825b49d9fe577d/pages/mesosphere/dcos/1.13/installing/production/system-requirements/azure-recommendations/index.md#disk-configurations

### Goals

- Allow running etcd on its own disk
- Provide a solution that is reusable across infrastructure providers
- Avoid tying infrastructure providers to cloud-init or a specific OS
- Maintain backwards compatibility and cause no impact for users who don't intend to make use of this capability
- Provide a generic solution that can be used to put other data directories on data disks as well

### Non-Goals/Future Work

- Putting docker or any other component's data on its own disk
- External etcd

## Proposal

The proposal is to modify CABPK to enable creating partitions and mounts as part of cloud-init. This would follow a pattern similar to the already available user-configurable NTP and Files options. The main benefit of this is that the disk setup and mount point options would be generic and reusable for purposes other than mounting the etcd data directory.

### User Stories

#### Story 1

As an operator of a Management Cluster, I want to avoid potential issues related to cached bandwidth/IOPS of the etcd disk on my workload clusters.

#### Story 2

As a user of a Workload Cluster, I want to provision and mount additional data storage devices for my application data.
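To make Story 2 concrete, a user-facing KubeadmConfig fragment might look like the sketch below. The `diskSetup` and `mounts` fields follow the API proposed in the next section; the device path, label, and mount target are illustrative only and would depend on the infrastructure provider and workload:

```yaml
# Hypothetical sketch: format and mount a second data disk for application
# data using the proposed diskSetup/mounts fields. The device path, label,
# and mount point are illustrative, not prescribed by this proposal.
diskSetup:
  partitions:
    - device: /dev/disk/azure/scsi1/lun1
      tableType: gpt
      layout: true
      overwrite: false
  filesystems:
    - label: app_disk
      filesystem: ext4
      device: /dev/disk/azure/scsi1/lun1
mounts:
  - - app_disk
    - /var/lib/app-data
```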
### Implementation Details/Notes/Constraints

### Changes required in the bootstrap provider (i.e. CABPK)

1. Add two new fields to KubeadmConfig for disk setup and mount points:

```go
// DiskSetup specifies options for the creation of partition tables and file systems on devices.
// +optional
DiskSetup *DiskSetup `json:"diskSetup,omitempty"`

// Mounts specifies a list of mount points to be setup.
// +optional
Mounts []MountPoints `json:"mounts,omitempty"`

// DiskSetup defines input for generated disk_setup and fs_setup in cloud-init.
type DiskSetup struct {
	// Partitions specifies the list of the partitions to setup.
	Partitions []Partition `json:"partitions,omitempty"`
	// Filesystems specifies the list of file systems to setup.
	Filesystems []Filesystem `json:"filesystems,omitempty"`
}

// Partition defines how to create and layout a partition.
type Partition struct {
	// Device is the name of the device on which to create the partition table.
	Device string `json:"device"`
	// Layout specifies whether to create a partition spanning the device.
	Layout bool `json:"layout"`
	// Overwrite specifies whether to skip checks and create the partition even if a partition or file system is found on the device.
	// +optional
	Overwrite *bool `json:"overwrite,omitempty"`
	// TableType specifies the type of partition table, e.g. 'gpt'.
	// +optional
	TableType *string `json:"tableType,omitempty"`
}

// Filesystem defines the file systems to be created.
type Filesystem struct {
	// Device is the device on which to create the file system.
	Device string `json:"device"`
	// Filesystem is the file system type, e.g. 'ext4'.
	Filesystem string `json:"filesystem"`
	// Label is the file system label to apply.
	// +optional
	Label *string `json:"label,omitempty"`
	// Partition specifies which partition to use.
	// +optional
	Partition *string `json:"partition,omitempty"`
	// Overwrite specifies whether to overwrite any existing file system.
	// +optional
	Overwrite *bool `json:"overwrite,omitempty"`
	// ReplaceFS instructs cloud-init to replace an existing file system of this type.
	// +optional
	ReplaceFS *string `json:"replaceFS,omitempty"`
	// ExtraOpts defines extra options to pass to the mkfs command.
	// +optional
	ExtraOpts []string `json:"extraOpts,omitempty"`
}

// MountPoints defines input for generated mounts in cloud-init.
type MountPoints []string
```

2. Add templates for disk_setup, fs_setup, and mounts, and add those to controlPlaneCloudInit and controlPlaneJoinCloudInit.

References:
- https://cloudinit.readthedocs.io/en/latest/topics/examples.html#disk-setup
- https://cloudinit.readthedocs.io/en/latest/topics/examples.html#adjust-mount-points-mounted

### Changes required in the infrastructure provider (here Azure is used as an example to illustrate the required changes).

These changes are required to run etcd on a data device, but it should be noted that the KubeadmConfig changes described above can be used without making any changes to infrastructure.

1. Add an optional EtcdDisk field to AzureMachineSpec. For now, this will only have DiskSizeGB to specify the disk size, but we can envision expanding this struct to allow further customization in the future.

```go
// DataDisk specifies the parameters that are used to add one or more data disks to the machine.
type DataDisk struct {
	// NameSuffix is the suffix to be appended to the machine name to generate the disk name.
	// Each disk name will be in format <machineName>_<nameSuffix>.
	NameSuffix string `json:"nameSuffix"`
	// DiskSizeGB is the size in GB to assign to the data disk.
	DiskSizeGB int32 `json:"diskSizeGB"`
	// Lun specifies the logical unit number of the data disk. This value is used to identify data disks within the VM and therefore must be unique for each data disk attached to a VM.
	Lun int32 `json:"lun"`
}
```

2. Provision a data disk for each control plane machine:

```go
dataDisks := []compute.DataDisk{}
for _, disk := range vmSpec.DataDisks {
	dataDisks = append(dataDisks, compute.DataDisk{
		CreateOption: compute.DiskCreateOptionTypesEmpty,
		DiskSizeGB:   to.Int32Ptr(disk.DiskSizeGB),
		Lun:          to.Int32Ptr(disk.Lun),
		Name:         to.StringPtr(azure.GenerateDataDiskName(vmSpec.Name, disk.NameSuffix)),
	})
}
storageProfile.DataDisks = &dataDisks
```

3. Modify the base cluster-template to specify the new KubeadmConfigSpec options above:

```yaml
diskSetup:
  partitions:
    - device: /dev/disk/azure/scsi1/lun0
      tableType: gpt
      layout: true
      overwrite: false
  filesystems:
    - label: etcd_disk
      filesystem: ext4
      device: /dev/disk/azure/scsi1/lun0
      extraOpts:
        - "-F"
        - "-E"
        - "lazy_itable_init=1,lazy_journal_init=1"
mounts:
  - - etcd_disk
    - /var/lib/etcd
```

4. Prepend `rm -rf /var/lib/etcd/*` to preKubeadmCommands to remove `lost+found` from `/var/lib/etcd`; otherwise kubeadm will complain that the etcd data dir is not empty and fail (see https://github.com/kubernetes/kubeadm/issues/2127).

### Risks and Mitigations

- Is there anything Azure-specific in the cloud-init implementation and script below that won't work for other providers?
  A: No, see:
  https://cloudinit.readthedocs.io/en/latest/topics/examples.html#disk-setup
  https://cloudinit.readthedocs.io/en/latest/topics/examples.html#adjust-mount-points-mounted

- If the data dir already exists, will kubeadm complain?
  A: No, but it will fail if the data dir is not empty (observed while testing a prototype). For now, a workaround is to prepend `rm -rf /var/lib/etcd/*` to preKubeadmCommands in CAPZ (see changes required above).

- How much data will this add to UserData? In many infra providers, user data size is very restricted. See https://github.com/kubernetes-sigs/cluster-api/pull/2763#discussion_r397306055 for previous discussion on the topic.
  A: Theoretically, only the small snippets for disk layout, format, and mount.

- CAPZ: what should the default etcd disk size be? Right now I'm thinking 256GB, but ideally it should depend on the number of nodes. https://etcd.io/docs/v3.3.12/op-guide/hardware/

- The default behaviour for etcd will need to be on the OS disk rather than its own etcd data disk in order to maintain backwards compatibility (until fully automated). For example, if the current UX requires the user to modify the cluster template, this will break current and previous workflows for CAPI. How can etcd data disk fields in the cluster template be made optional?
  A: All the fields added are 100% optional, so a template without etcd data disks specified will still work, with etcd on the OS disk. It's up to each provider to decide what they want to put in their "default" template. A provider can also leverage clusterctl flavors to offer both an OS disk flavor and a data disk flavor.

- Adding the disk configuration to the cluster template as part of the KubeadmConfig spec has the downside that the user could remove it. What would happen then?
  A: The data disk resource would still be created, but the etcd data dir wouldn't get mounted, which means etcd data would live on the OS disk. The cluster would be functional, but performance would be reduced (equivalent to what it is now). Do we want users to be able to change the configuration and risk shooting themselves in the foot? One thing to consider is that we already allow this: there are some things in the cluster template that are "required" in the sense that cluster creation will fail if the user removes them, for example "JoinConfiguration" in KubeadmConfigSpec.
- In Azure (and possibly other providers), the device name may not be persistent if there is more than one disk: https://docs.microsoft.com/en-us/azure/virtual-machines/troubleshooting/troubleshoot-device-names-problems#identify-disk-luns.
  To work around that problem, the infra provider has to find a way to refer to the device that is persistent. In Azure, we can use `/dev/disk/azure/scsi1/lun0` as the device in diskSetup and the file system label (`etcd_disk`) in mounts. The caveat is that the LUN needs to match the data disk LUN in infrastructure. This means there is a dependency between how the data disks are specified and what the device should be in the bootstrap config. This goes back to the overall challenge of separating bootstrapping from infrastructure. In this case, there is a circular dependency: bootstrapping depends on infrastructure to determine which device should be used for etcd in cloud-init, but infrastructure depends on the bootstrap provider to generate userData for VMs. The way we solve this for now is by putting the burden on the user (via templates) to reconcile the two. However, we should think about this problem (probably out of scope for this particular proposal) as an overall Cluster API design challenge: what happens when the bootstrap provider and infrastructure provider need to talk? Other examples of this are how to communicate bootstrap success/failure, how to deal with custom VM images during KCP upgrade, etc.

## Alternatives

These alternatives do not require any changes to CABPK.

### Use script to do the etcd mount and append that script to preKubeadmCommands

Example script:

```bash
set -x
DISK=/dev/sdc
PARTITION=${DISK}1
MOUNTPOINT=/var/lib/etcd
udevadm settle
mkdir -p $MOUNTPOINT
if mount | grep $MOUNTPOINT; then
  echo "disk is already mounted"
  exit 0
fi
if ! grep "/dev/sdc1" /etc/fstab; then
  echo "$PARTITION $MOUNTPOINT auto defaults,nofail 0 2" >>/etc/fstab
fi
if ! ls $PARTITION; then
  /sbin/sgdisk --new 1 $DISK
  /sbin/mkfs.ext4 $PARTITION -L etcd_disk -F -E lazy_itable_init=1,lazy_journal_init=1
fi
mount $MOUNTPOINT
/bin/chown -R etcd:etcd $MOUNTPOINT
```

### Use Cloud init to mount the data dir but modify bootstrap data in the infrastructure provider before passing to the instance user data

Add this to the control plane init and join cloud-init templates:

```yaml
disk_setup:
  /dev/sdc:
    table_type: gpt
    layout: true
    overwrite: false

fs_setup:
  - label: etcd_disk
    filesystem: ext4
    device: /dev/sdc1
    extra_opts:
      - "-F"
      - "-E"
      - "lazy_itable_init=1,lazy_journal_init=1"

mounts:
  - - /dev/sdc1
    - /var/lib/etcd
```

### Instrument the OS image with image-builder to perform this customization automatically through custom UDEV rules, scripts, etc.

## Upgrade Strategy

N/A, as this proposal only adds new types.