# Backup and Restore

Pachyderm provides the `pachctl extract` and `pachctl restore` commands to back up and restore the state of a Pachyderm cluster.

The `pachctl extract` command requires that all pipeline and data loading activity into Pachyderm stop before the extract occurs. This enables Pachyderm to create a consistent, point-in-time backup. In this document, we'll talk about how to create such a backup and restore it to another Pachyderm instance.

Extract and restore commands are currently used to migrate between minor and major releases of Pachyderm, so it's important to understand how to perform them properly. In addition, there are a few design points and operational techniques that data engineers should take into consideration when creating complex Pachyderm deployments to minimize disruptions to production pipelines.

In this document, we'll take you through the steps to back up and restore a cluster, migrate an existing cluster to a newer minor or major release, and elaborate on some of those design and operations considerations.

## Backup and restore concepts

Backing up Pachyderm involves the persistent volume (PV) that `etcd` uses for administrative data and the object store bucket that holds Pachyderm's actual data. Restoring involves populating that PV and object store with data to recreate a Pachyderm cluster.

## General backup procedure

### 1. Pause all pipeline and data loading/unloading operations

Before you begin, you need to pause all pipelines and data operations.

#### Pausing pipelines

From the directed acyclic graphs (DAGs) that define your Pachyderm cluster, stop each pipeline. You can either run the multiline shell command shown below, or manually run the `pachctl stop pipeline` command for each pipeline.

`pachctl stop pipeline <pipeline-name>`

You can confirm that each pipeline is paused using the `pachctl list pipeline` command.

`pachctl list pipeline`

Alternatively, a useful shell script for running `stop pipeline` on all pipelines is included below. It may be necessary to install the utilities used in the script, like `jq` and `xargs`, on your system.

```
pachctl list pipeline --raw \
| jq -r '.pipeline.name' \
| xargs -P3 -n1 -I{} pachctl stop pipeline {}
```

It's also a useful practice, for simple to moderately complex deployments, to keep a terminal window up showing the state of all running Kubernetes pods.

`watch -n 5 kubectl get pods`

You may need to install the `watch` and `kubectl` commands on your system, and configure `kubectl` to point at the cluster that Pachyderm is running in.

#### Pausing data loading operations

**Input repositories** or **input repos** in Pachyderm are repositories created with the `pachctl create repo` command. They're designed to be the repos at the top of a directed acyclic graph of pipelines. Pipelines have their own output repos associated with them, and those output repos are not considered input repos. If there are any processes external to Pachyderm that put data into input repos using any method (the Pachyderm APIs, `pachctl put file`, etc.), they need to be paused. See [Loading data from other sources into pachyderm](#loading-data-from-other-sources-into-pachyderm) below for design considerations for those processes that will minimize downtime during a restore or migration.
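If you are not sure which of your repos are input repos, one way to list them is to remove the pipelines' output repos from the set of all repos. This is a minimal sketch, assuming `jq` is installed and relying on the fact that a pipeline's output repo shares the pipeline's name:

```
# List all repos, then drop those that share a name with a pipeline;
# the remainder are input repos that outside processes may be loading into.
comm -23 \
    <(pachctl list repo --raw | jq -r '.repo.name' | sort) \
    <(pachctl list pipeline --raw | jq -r '.pipeline.name' | sort)
```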
Alternatively, rather than pausing each external process, you can use the following commands to stop all data loading into Pachyderm from outside processes.

```
# Once you have stopped all running pachyderm pipelines, such as with this command,
# $ pachctl list pipeline --raw \
#   | jq -r '.pipeline.name' \
#   | xargs -P3 -n1 -I{} pachctl stop pipeline {}
#
# all pipelines in your cluster should be suspended. To stop all
# data loading processes, we're going to modify the pachd Kubernetes service so that
# it only accepts traffic on port 30649 (instead of the usual 30650). This way,
# any background users and services that send requests to your Pachyderm cluster
# while 'extract' is running will not interfere with the process.
#
# Back up the Pachyderm service spec, in case you need to restore it quickly
$ kubectl get svc/pachd -o json >pach_service_backup_30650.json

# Modify the service to accept traffic on port 30649
# Note that you'll likely also need to modify your cloud provider's firewall
# rules to allow traffic on this port
$ kubectl get svc/pachd -o json | sed 's/30650/30649/g' | kubectl apply -f -

# Modify your environment so that *you* can talk to pachd on this new port
$ pachctl config update context `pachctl config get active-context` --pachd-address=<cluster ip>:30649

# Make sure you can talk to pachd (if not, firewall rules are a common culprit)
$ pachctl version
COMPONENT           VERSION
pachctl             1.9.5
pachd               1.9.5
```

### 2. Extract a Pachyderm backup

You can use `pachctl extract` alone or in combination with cloning/snapshotting services.

#### Using `pachctl extract`

Using the `pachctl extract` command, create the backup you need.

`pachctl extract > path/to/your/backup/file`

You can also use the `-u` or `--url` flag to put the backup directly into an object store.

`pachctl extract --url s3://...`

If you are planning on backing up the object store using its own built-in clone operation, be sure to add the `--no-objects` flag to the `pachctl extract` command.

#### Using your cloud provider's clone and snapshot services

You should follow your cloud provider's recommendations for backing up these resources. Here are some pointers to the relevant documentation.

##### Snapshotting persistent volumes

For example, here are official guides on creating snapshots of persistent volumes on Google Cloud Platform, Amazon Web Services (AWS) and Microsoft Azure, respectively:

- [Creating snapshots of GCE persistent volumes](https://cloud.google.com/compute/docs/disks/create-snapshots)
- [Creating snapshots of Elastic Block Store (EBS) volumes](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-creating-snapshot.html)
- [Creating snapshots of Azure Virtual Hard Disk volumes](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/snapshot-copy-managed-disk)

For on-premises Kubernetes deployments, check the vendor documentation for your PV implementation on backing up and restoring.

##### Cloning object stores

Below, we give an example using the Amazon Web Services CLI to clone one bucket to another, [taken from the documentation for that command](https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html). Similar commands are available for [Google Cloud](https://cloud.google.com/storage/docs/gsutil/commands/cp) and [Azure blob storage](https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-linux?toc=%2fazure%2fstorage%2ffiles%2ftoc.json).

`aws s3 sync s3://mybucket s3://mybucket2`

For on-premises Kubernetes deployments, check the vendor documentation for your on-premises object store for details on backing up and restoring a bucket.

#### Combining cloning, snapshots and extract/restore

You can use the `pachctl extract` command with the `--no-objects` flag to exclude the object store, and use an object store snapshot or clone command to back up the object store. You can run the two commands at the same time. For example, on Amazon Web Services, the following commands can be run simultaneously.

`aws s3 sync s3://mybucket s3://mybucket2`

`pachctl extract --no-objects --url s3://anotherbucket`
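To actually run the two operations concurrently from a single shell session, you can background one and wait for both to finish. This is a minimal sketch using the same hypothetical bucket names as above:

```
# Clone the object store and extract the metadata in parallel,
# then block until both commands have finished
aws s3 sync s3://mybucket s3://mybucket2 &
SYNC_PID=$!

pachctl extract --no-objects --url s3://anotherbucket &
EXTRACT_PID=$!

wait "$SYNC_PID" "$EXTRACT_PID" && echo "backup complete"
```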
#### Use case: minimizing downtime during a migration

The above cloning/snapshotting technique is recommended when doing a migration where minimizing downtime is desirable, as it allows the duplicated object store to be the basis of the new, upgraded cluster instead of requiring Pachyderm to extract the data from the object store.

### 3. Restart all pipeline and data loading operations

Once the backup is complete, restart all paused pipelines and data loading operations.

From the directed acyclic graphs (DAGs) that define your Pachyderm cluster, start each pipeline. You can either run the multiline shell command shown below, or manually run the `pachctl start pipeline` command for each pipeline.

`pachctl start pipeline <pipeline-name>`

You can confirm that each pipeline is started using the `list pipeline` command.

`pachctl list pipeline`

A useful shell script for running `start pipeline` on all pipelines is included below. It may be necessary to install the utilities used in the script, like `jq` and `xargs`, on your system.

```
pachctl list pipeline --raw \
| jq -r '.pipeline.name' \
| xargs -P3 -n1 -I{} pachctl start pipeline {}
```

If you used the port-changing technique, [above](#pausing-data-loading-operations), to stop all data loading into Pachyderm from outside processes, you should change the ports back.

```
# Once you have restarted all running pachyderm pipelines, such as with this command,
# $ pachctl list pipeline --raw \
#   | jq -r '.pipeline.name' \
#   | xargs -P3 -n1 -I{} pachctl start pipeline {}
#
# all pipelines in your cluster should be restarted. To restart all data loading
# processes, we're going to change the pachd Kubernetes service so that
# it only accepts traffic on port 30650 again (instead of 30649).
#
# Back up the Pachyderm service spec, in case you need to restore it quickly
$ kubectl get svc/pachd -o json >pach_service_backup_30649.json

# Modify the service to accept traffic on port 30650, again
$ kubectl get svc/pachd -o json | sed 's/30649/30650/g' | kubectl apply -f -

# Modify your environment so that *you* can talk to pachd on the old port
$ pachctl config update context `pachctl config get active-context` --pachd-address=<cluster ip>:30650

# Make sure you can talk to pachd (if not, firewall rules are a common culprit)
$ pachctl version
COMPONENT           VERSION
pachctl             1.9.5
pachd               1.9.5
```

## General restore procedure

### Restore your backup to a Pachyderm cluster, same version

Spin up a Pachyderm cluster and run `pachctl restore` with the backup you created earlier.

`pachctl restore < path/to/your/backup/file`

You can also use the `-u` or `--url` flag to get the backup directly from the object store you placed it in.

`pachctl restore --url s3://...`
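Putting the restore steps together, a same-version restore from an object store might look like the following sketch. The bucket path is hypothetical, and we assume your active `pachctl` context already points at the new, empty cluster:

```
# Confirm the new cluster is reachable and empty
pachctl version
pachctl list repo

# Restore metadata and data from the bucket written by 'pachctl extract --url ...'
pachctl restore --url s3://my-pachyderm-backups/backup-2019-11-01

# Spot-check the restored state
pachctl list repo
pachctl list pipeline
```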
### Loading data from other sources into Pachyderm

When writing systems that place data into Pachyderm input repos (see [above](#pausing-data-loading-operations) for a definition of 'input repo'), it is important to provide a way of 'pausing' the data load, queueing incoming data so that it can be loaded once the system is 'resumed'. This allows all Pachyderm processing to be stopped while the extract takes place.

In addition, it is desirable for systems that load data into Pachyderm to have a mechanism for replaying the queue from any checkpoint in time. This is useful when doing migrations from one release to another, where you would like to minimize downtime of a production Pachyderm system. After an extract, the old system is kept running with the checkpoint established while a duplicate, upgraded Pachyderm cluster is migrated with the duplicated data. Transactions that occur while the migrated, upgraded cluster is being brought up are not lost; they can be replayed from the checkpoint once the new cluster is up.
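As an illustration of this pattern, below is a minimal sketch of a shell-based loader that honors a pause flag and records a replayable checkpoint log. The repo name, directory layout, and flag file are all hypothetical:

```
#!/bin/bash
# Hypothetical loader: files to load accumulate in ./queue. Each upload is
# logged before it happens, so the queue can be replayed from any checkpoint.
REPO="raw-data"                      # hypothetical input repo
PAUSE_FLAG="/tmp/pach-loader.paused" # touch this file to pause loading

for f in queue/*; do
    # While paused (e.g., during 'pachctl extract'), leave files queued
    while [ -f "$PAUSE_FLAG" ]; do
        sleep 10
    done

    # Checkpoint record: timestamp and file name
    echo "$(date -u +%FT%TZ) $f" >> loaded.log

    # Load into the input repo; remove from the queue only on success
    pachctl put file "$REPO@master:/$(basename "$f")" -f "$f" && rm "$f"
done
```

Pausing is then just `touch /tmp/pach-loader.paused` before the extract and `rm /tmp/pach-loader.paused` afterward, and replaying from a checkpoint means re-enqueueing the files logged after that point.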