sigs.k8s.io/cluster-api@v1.7.1/docs/proposals/20200423-etcd-data-disk.md

---
title: Enable mounting etcd on a data disk
authors:
  - "@CecileRobertMichon"
reviewers:
  - "@bagnaram"
  - "@vincepri"
  - "@detiber"
  - "@fabriziopandini"
creation-date: 2020-04-23
last-updated: 2020-05-11
status: implementable
---

# Enable mounting etcd on a data disk

## Table of Contents

<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->

- [Glossary](#glossary)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals/Future Work](#non-goalsfuture-work)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
  - [Changes required in the bootstrap provider (i.e. CABPK)](#changes-required-in-the-bootstrap-provider-ie-cabpk)
  - [Changes required in the infrastructure provider (here Azure is used as an example to illustrate the required changes).](#changes-required-in-the-infrastructure-provider-here-azure-is-used-as-an-example-to-illustrate-the-required-changes)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Alternatives](#alternatives)
  - [Use script to do the etcd mount and append that script to preKubeadmCommands](#use-script-to-do-the-etcd-mount-and-append-that-script-to-prekubeadmcommands)
  - [Use Cloud init to mount the data dir but modify bootstrap data in the infrastructure provider before passing to the instance user data](#use-cloud-init-to-mount-the-data-dir-but-modify-bootstrap-data-in-the-infrastructure-provider-before-passing-to-the-instance-user-data)
  - [Instrument the OS image with image-builder to perform this customization automatically through custom UDEV rules, scripts, etc.](#instrument-the-os-image-with-image-builder-to-perform-this-customization-automatically-through-custom-udev-rules-scripts-etc)
- [Upgrade Strategy](#upgrade-strategy)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

## Glossary

Refer to the [Cluster API Book Glossary](https://cluster-api.sigs.k8s.io/reference/glossary.html).

## Summary

CAPZ issue that motivated this proposal: [Re-evaluate putting etcd in a data disk](https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/448).

Currently we deploy etcd on the OS disk of the VM instance. To avoid potential issues related to cached bandwidth/IOPS available on the VM, we should consider moving etcd to its own storage device. This might seem like an infrastructure-only change at first glance; however, implementing this proposal requires mounting the etcd disk during bootstrapping.

This is one of the cases where infrastructure and bootstrapping overlap, so we need to ensure we set the right precedent for future cases. In addition, most infrastructure providers will face a similar challenge, so a common solution would be beneficial. One possible workaround that avoids any modifications to CABPK would be to leverage `preKubeadmCommands` to insert a script that does the mounting, or to append a cloud-init section to the bootstrap data right before setting the instance user data in the infrastructure provider. Neither is a desired outcome, because it would mean either 1) using `preKubeadmCommands` for something infrastructure specific that is required for all clusters, whereas `preKubeadmCommands` is meant for user customization, or 2) making an assumption in the infrastructure provider about the content and format of the bootstrap data.

## Motivation

The main motivation of this proposal is to allow running etcd on a data disk to improve performance and reduce throttling.

A "side effect" motivation is to open up the option of putting other directories, such as /var/lib/containerd and /var/log, onto separate partitions as well. This would also make upgrades to the OS and/or repairing a broken OS install much easier if all the important data is located on a data disk.

References:
- https://docs.microsoft.com/en-us/archive/blogs/xiangwu/azure-vm-storage-performance-and-throttling-demystify
- https://github.com/mesosphere/dcos-docs-site/blob/f15c4d9cd8c36c8a406a22b76b825b49d9fe577d/pages/mesosphere/dcos/1.13/installing/production/system-requirements/azure-recommendations/index.md#disk-configurations

### Goals

- Allow running etcd on its own disk
- Provide a solution that is reusable across infra providers
- Avoid tying infrastructure providers to cloud-init or a specific OS
- Maintain backwards compatibility and cause no impact for users who don't intend to make use of this capability
- Provide a generic solution that can be used to put other data directories on data disks as well

### Non-Goals/Future Work

- Putting docker or any other component data on its own disk
- External etcd

## Proposal

The proposal is to modify CABPK to enable creating partitions and mounts as part of cloud-init. This would follow a similar pattern to the already available user-configurable NTP and Files options. The main benefit of this is that the disk setup and mount point options would be generic and reusable for other purposes besides mounting the etcd data directory.

### User Stories

#### Story 1

As an operator of a Management Cluster, I want to avoid potential issues related to cached bandwidth/IOPS of the etcd disk on my workload clusters.

#### Story 2

As a user of a Workload Cluster, I want to provision and mount additional data storage devices for my application data.

### Implementation Details/Notes/Constraints

### Changes required in the bootstrap provider (i.e. CABPK)

1. Add two new fields to KubeadmConfig for disk setup and mount points

```go
// DiskSetup specifies options for the creation of partition tables and file systems on devices.
// +optional
DiskSetup *DiskSetup `json:"diskSetup,omitempty"`

// Mounts specifies a list of mount points to be setup.
// +optional
Mounts []MountPoints `json:"mounts,omitempty"`

// DiskSetup defines input for generated disk_setup and fs_setup in cloud-init.
type DiskSetup struct {
	// Partitions specifies the list of the partitions to setup.
	Partitions  []Partition  `json:"partitions,omitempty"`
	Filesystems []Filesystem `json:"filesystems,omitempty"`
}

// Partition defines how to create and layout a partition.
type Partition struct {
	Device string `json:"device"`
	Layout bool   `json:"layout"`
	// +optional
	Overwrite *bool `json:"overwrite,omitempty"`
	// +optional
	TableType *string `json:"tableType,omitempty"`
}

// Filesystem defines the file systems to be created.
type Filesystem struct {
	Device     string `json:"device"`
	Filesystem string `json:"filesystem"`
	// +optional
	Label *string `json:"label,omitempty"`
	// +optional
	Partition *string `json:"partition,omitempty"`
	// +optional
	Overwrite *bool `json:"overwrite,omitempty"`
	// +optional
	ReplaceFS *string `json:"replaceFS,omitempty"`
	// +optional
	ExtraOpts []string `json:"extraOpts,omitempty"`
}

// MountPoints defines input for generated mounts in cloud-init.
type MountPoints []string
```

2. Add templates for disk_setup, fs_setup, and mounts, and add those to controlPlaneCloudInit and controlPlaneJoinCloudInit

References:
- https://cloudinit.readthedocs.io/en/latest/topics/examples.html#disk-setup
- https://cloudinit.readthedocs.io/en/latest/topics/examples.html#adjust-mount-points-mounted

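The rendering could be done via `text/template` or in code. A minimal standalone sketch (illustrative only — `renderMounts` is a hypothetical helper, not CABPK's actual template code) that turns the proposed `Mounts` field into the cloud-init `mounts:` fragment shown later in this proposal:

```go
package main

import (
	"fmt"
	"strings"
)

// MountPoints mirrors the proposed CABPK type: one cloud-init mount entry,
// e.g. []string{"etcd_disk", "/var/lib/etcd"}.
type MountPoints []string

// renderMounts renders the mounts as a cloud-init "list of lists" YAML
// fragment: the first field of each entry starts a new sequence item, and
// subsequent fields are indented continuation items.
func renderMounts(mounts []MountPoints) string {
	if len(mounts) == 0 {
		return ""
	}
	var b strings.Builder
	b.WriteString("mounts:\n")
	for _, m := range mounts {
		for i, field := range m {
			if i == 0 {
				fmt.Fprintf(&b, "- - %s\n", field)
			} else {
				fmt.Fprintf(&b, "  - %s\n", field)
			}
		}
	}
	return b.String()
}

func main() {
	fmt.Print(renderMounts([]MountPoints{{"etcd_disk", "/var/lib/etcd"}}))
}
```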
### Changes required in the infrastructure provider (here Azure is used as an example to illustrate the required changes).

These changes are required to run etcd on a data device, but note that the KubeadmConfig changes described above can be used without making any changes to infrastructure.

1. Add an optional etcd data disk field in AzureMachineSpec, using the `DataDisk` type below. For now, this only allows specifying a name suffix, disk size, and LUN, but we can envision expanding this struct to allow further customization in the future.

```go
// DataDisk specifies the parameters that are used to add one or more data disks to the machine.
type DataDisk struct {
	// NameSuffix is the suffix to be appended to the machine name to generate the disk name.
	// Each disk name will be in format <machineName>_<nameSuffix>.
	NameSuffix string `json:"nameSuffix"`
	// DiskSizeGB is the size in GB to assign to the data disk.
	DiskSizeGB int32 `json:"diskSizeGB"`
	// Lun specifies the logical unit number of the data disk. This value is used to identify data
	// disks within the VM and therefore must be unique for each data disk attached to a VM.
	Lun int32 `json:"lun"`
}
```

2. Provision a data disk for each control plane machine

```go
dataDisks := []compute.DataDisk{}
for _, disk := range vmSpec.DataDisks {
	dataDisks = append(dataDisks, compute.DataDisk{
		CreateOption: compute.DiskCreateOptionTypesEmpty,
		DiskSizeGB:   to.Int32Ptr(disk.DiskSizeGB),
		Lun:          to.Int32Ptr(disk.Lun),
		Name:         to.StringPtr(azure.GenerateDataDiskName(vmSpec.Name, disk.NameSuffix)),
	})
}
storageProfile.DataDisks = &dataDisks
```

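The `azure.GenerateDataDiskName` helper used above is assumed to follow the `<machineName>_<nameSuffix>` format stated in the `DataDisk` comment; a minimal sketch under that assumption:

```go
package main

import "fmt"

// GenerateDataDiskName sketches the helper referenced above. The proposal
// states each disk name has the format <machineName>_<nameSuffix>, so the
// real CAPZ implementation is assumed to be equivalent to this.
func GenerateDataDiskName(machineName, nameSuffix string) string {
	return fmt.Sprintf("%s_%s", machineName, nameSuffix)
}

func main() {
	fmt.Println(GenerateDataDiskName("control-plane-0", "etcddisk")) // control-plane-0_etcddisk
}
```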
3. Modify the base cluster-template to specify the new KubeadmConfigSpec options above

```yaml
diskSetup:
  partitions:
    - device: /dev/disk/azure/scsi1/lun0
      tableType: gpt
      layout: true
      overwrite: false
  filesystems:
    - label: etcd_disk
      filesystem: ext4
      device: /dev/disk/azure/scsi1/lun0
      extraOpts:
        - "-F"
        - "-E"
        - "lazy_itable_init=1,lazy_journal_init=1"
mounts:
  - - etcd_disk
    - /var/lib/etcd
```

4. Prepend `rm -rf /var/lib/etcd/*` to preKubeadmCommands to remove `lost+found` from `/var/lib/etcd`; otherwise kubeadm will complain that the etcd data dir is not empty and fail (see https://github.com/kubernetes/kubeadm/issues/2127).
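
In the cluster template this step could look like the following fragment (placement within the KubeadmConfigSpec is assumed):

```yaml
preKubeadmCommands:
  - "rm -rf /var/lib/etcd/*"
```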

### Risks and Mitigations

- Is there anything Azure specific in the cloud-init implementation and script below that won't work for other providers?
A: No, see:
https://cloudinit.readthedocs.io/en/latest/topics/examples.html#disk-setup
https://cloudinit.readthedocs.io/en/latest/topics/examples.html#adjust-mount-points-mounted

- If the data dir already exists, will kubeadm complain? While testing a prototype, kubeadm init failed if the data dir was not empty.
A: No, but it will fail if that data dir is not empty. For now, a workaround is to prepend `rm -rf /var/lib/etcd/*` to preKubeadmCommands in CAPZ (see changes required above).

- How much data will this add to UserData? In many infra providers, user data size is very restricted. See https://github.com/kubernetes-sigs/cluster-api/pull/2763#discussion_r397306055 for previous discussion on the topic.
A: Alternatives 2.1 would theoretically include snippets for disk layout, format, and mount.

- CAPZ: what should the default etcd disk size be? Right now I'm thinking 256GB, but ideally it should depend on the number of nodes. https://etcd.io/docs/v3.3.12/op-guide/hardware/

- The default behaviour for etcd will need to be on the OS disk rather than its own etcd data disk in order to maintain backwards compatibility (until fully automated). For example, if the current UX requires the user to modify the cluster template, this will break current and previous workflows for CAPI. How can etcd data disk fields in the cluster template be made optional?
A: All the fields added are 100% optional, so a template without etcd data disks specified will still work with etcd on the OS disk. It's up to each provider to decide what they want to put in their "default" template. The provider can also leverage clusterctl flavors to have an OS disk flavor and a data disk flavor.

- Adding the disk configuration to the cluster template as part of the KubeadmConfig spec has the downside that the user could remove it. What would happen then?
A: The data disk resource would still be created, but the etcd data dir wouldn't get mounted, which means etcd data would live on the OS disk. The cluster would be functional but the performance would be reduced (equivalent to what it is now). Do we want users to be able to change the configuration and risk shooting themselves in the foot? One thing to consider is that we already do that: there are some things in the cluster template that are "required" in the sense that cluster creation will fail if the user removes them, for example "JoinConfiguration" in KubeadmConfigSpec.

- In Azure (and possibly other providers), the device name may not be persistent if there is more than one disk: https://docs.microsoft.com/en-us/azure/virtual-machines/troubleshooting/troubleshoot-device-names-problems#identify-disk-luns.
To work around that problem, the infra provider has to find a way to refer to the device that is persistent. In Azure, we can use `/dev/disk/azure/scsi1/lun0` as the device in diskSetup and the file system label (`etcd_disk`) in mounts. The caveat is that the LUN needs to match the data disk LUN in infrastructure. This means that there is a dependency between how the data disks are specified and what the device should be in the bootstrap config. This goes back to the overall challenge of separating bootstrapping from infrastructure. In this case, there is a circular dependency: bootstrapping depends on infrastructure to determine which device should be used for etcd in cloud-init, but infrastructure depends on the bootstrap provider to generate userData for VMs. The way we solve this for now is by putting the burden on the user (via templates) to reconcile the two. However, we should think about this problem overall (probably out of scope for this particular proposal) as a Cluster API design challenge: what happens when the bootstrap provider and the infrastructure provider need to talk? Other examples of this are how to communicate bootstrap success/failure, how to deal with custom VM images during KCP upgrade, etc.
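
Concretely, the user must keep the LUN consistent across the two objects. An illustrative pairing (field layout taken from the structs above; values are examples):

```yaml
# AzureMachine (infrastructure): attach the disk at LUN 0
dataDisks:
  - nameSuffix: etcddisk
    diskSizeGB: 256
    lun: 0

# KubeadmConfig (bootstrap): reference the same LUN via the stable device path
diskSetup:
  partitions:
    - device: /dev/disk/azure/scsi1/lun0
      tableType: gpt
      layout: true
```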

## Alternatives

These alternatives do not require any changes to CABPK.

### Use script to do the etcd mount and append that script to preKubeadmCommands

Example script:

```bash
set -x
DISK="/dev/sdc"
PARTITION="${DISK}1"
MOUNTPOINT="/var/lib/etcd"
udevadm settle
mkdir -p "${MOUNTPOINT}"
if mount | grep -q "${MOUNTPOINT}"; then
  echo "disk is already mounted"
  exit 0
fi
if ! grep -q "${PARTITION}" /etc/fstab; then
  echo "${PARTITION}       ${MOUNTPOINT}       auto    defaults,nofail       0       2" >>/etc/fstab
fi
if ! ls "${PARTITION}"; then
  /sbin/sgdisk --new 1 "${DISK}"
  /sbin/mkfs.ext4 "${PARTITION}" -L etcd_disk -F -E lazy_itable_init=1,lazy_journal_init=1
fi
mount "${MOUNTPOINT}"
/bin/chown -R etcd:etcd "${MOUNTPOINT}"
```

### Use Cloud init to mount the data dir but modify bootstrap data in the infrastructure provider before passing to the instance user data

Add this to the control plane init and join cloud-init templates:

```yaml
disk_setup:
  /dev/sdc:
    table_type: gpt
    layout: true
    overwrite: false

fs_setup:
  - label: etcd_disk
    filesystem: ext4
    device: /dev/sdc1
    extra_opts:
      - "-F"
      - "-E"
      - "lazy_itable_init=1,lazy_journal_init=1"

mounts:
  - - /dev/sdc1
    - /var/lib/etcd
```

### Instrument the OS image with image-builder to perform this customization automatically through custom UDEV rules, scripts, etc.

## Upgrade Strategy

N/A as this proposal only adds new types.