github.com/pachyderm/pachyderm@v1.13.4/doc/docs/master/deploy-manage/manage/backup_restore.md (about) 1 # Backup Your Cluster 2 3 Pachyderm provides the `pachctl extract` and `pachctl restore` commands to 4 back up and restore the state of a Pachyderm cluster. 5 6 The `pachctl extract` command requires that all pipeline and data loading 7 activity into Pachyderm stop before the extract occurs. This enables 8 Pachyderm to create a consistent, point-in-time backup. 9 10 Extract and restore commands are used to migrate between minor 11 and major releases of Pachyderm. In addition, there are a few design 12 points and operational techniques that data engineers should take 13 into consideration when creating complex pachyderm deployments to 14 minimize disruptions to production pipelines. 15 16 Backing up Pachyderm involves the persistent volume (PV) that 17 `etcd` uses for administrative data and the object store bucket that 18 holds Pachyderm's actual data. 19 Restoring involves populating that PV and object store with data to 20 recreate a Pachyderm cluster. 21 22 ## Before You Begin 23 24 Before you begin, you need to pause all the pipelines and data operations 25 that run in your cluster. You can do so either by running a multi-line 26 shell script or by running the `pachctl stop pipeline` command for each 27 pipeline individually. 28 29 If you decide to use a shell script below, you need to have `jq` and 30 `xargs` installed on your system. Also, you might need to install 31 the `watch` and `kubectl` commands on your system, and configure 32 `kubectl` to point at the cluster that Pachyderm is running in. 33 34 To stop a running pipeline, complete the following steps: 35 36 1. Pause each pipeline individually by repeatedly running the single 37 `pachctl` command or by running a script: 38 39 === "Command" 40 ```shell 41 pachctl stop pipeline <pipeline-name> 42 ``` 43 44 === "Script" 45 ```shell 46 pachctl list pipeline --raw \ 47 | jq -r '.pipeline.name' \ 48 | xargs -P3 -n1 -I{} pachctl stop pipeline {} 49 ``` 50 51 1. Optionally, run the `watch` command to monitor the pods 52 terminating: 53 54 ```shell 55 watch -n 5 kubectl get pods 56 ``` 57 58 1. Confirm that pipelines are paused: 59 60 ```shell 61 pachctl list pipeline 62 ``` 63 64 ### Pause External Data Loading Operations 65 66 **Input repositories** or **input repos** in pachyderm are 67 repositories created with the `pachctl create repo` command. 68 They are designed to be the repos at the top of a directed 69 acyclic graph of pipelines. Pipelines have their own output 70 repos associated with them. These repos are different from 71 input repos. 72 73 If you have any processes external to Pachyderm 74 that put data into input repos using any supported method, 75 such as the Pachyderm APIs, `pachctl put file`, or other, 76 you need to pause those processes. 77 78 When an external system writes data into Pachyderm 79 input repos, you need to provide ways of *pausing* 80 output while queueing any data output 81 requests to be output when the systems are *resumed*. 82 This allows all Pachyderm processing to be stopped while 83 the extract takes place. 84 85 In addition, it is desirable for systems that load data 86 into Pachyderm have a mechanism for replaying a queue 87 from any checkpoint in time. 88 This is useful when doing migrations from one release 89 to another, where you want to minimize downtime 90 of a production Pachyderm system. After an extract, 91 the old system is kept running with the checkpoint 92 established while a duplicate, upgraded Pachyderm 93 cluster is being migrated with duplicated data. 94 Transactions that occur while the migrated, 95 upgraded cluster is being brought up are not lost. 96 97 If you are not using any external way of pausing input 98 from internal systems, you can use the following commands to stop 99 all data loading into Pachyderm from outside processes. 100 To stop all data loading processes, you need to modify 101 the `pachd` Kubernetes service so that it only accepts 102 traffic on port 30649 instead of the usual 30650. This way, 103 any background users and services that send requests to 104 your Pachyderm cluster while `pachctl extract` is 105 running will not interfere with the process. Use this port switching 106 technique to minimize downtime during the migration. 107 108 To pause external data loading operations, complete the 109 following steps: 110 111 1. Verify that all Pachyderm pipelines are paused: 112 113 ```shell 114 pachctl list pipeline 115 ``` 116 117 1. For safery, save the Pachyderm service spec in a `json`: 118 119 ```shell 120 kubectl get svc/pachd -o json >pach_service_backup_30650.json 121 ``` 122 123 1. Modify the `pachd` service to accept traffic on port 30649: 124 125 ```shell 126 kubectl get svc/pachd -o json | sed 's/30650/30649/g' | kubectl apply -f - 127 ``` 128 129 Most likely, you will need to modify your cloud provider's firewall 130 rules to allow traffic on this port. 131 132 Depending on your deployment, you might need to switch 133 additional ports: 134 135 1. Back up the `etcd` and dashboard manifests: 136 137 ```shell 138 kubectl get svc/etcd -o json >etcd_svc_backup_32379.json 139 kubectl get svc/dash -o json >dash_svc_backup_30080.json 140 ``` 141 142 1. Switch the `etcd` and dashboard manifests: 143 144 ```shell 145 kubectl get svc/pachd -o json | sed 's/30651/30648/g' | kubectl apply -f - 146 kubectl get svc/pachd -o json | sed 's/30652/30647/g' | kubectl apply -f - 147 kubectl get svc/pachd -o json | sed 's/30654/30646/g' | kubectl apply -f - 148 kubectl get svc/pachd -o json | sed 's/30655/30644/g' | kubectl apply -f - 149 kubectl get svc/etcd -o json | sed 's/32379/32378/g' | kubectl apply -f - 150 kubectl get svc/dash -o json | sed 's/30080/30079/g' | kubectl apply -f - 151 kubectl get svc/dash -o json | sed 's/30081/30078/g' | kubectl apply -f - 152 kubectl get svc/pachd -o json | sed 's/30600/30611/g' | kubectl apply -f - 153 ``` 154 155 1. Modify your environment so that you can access `pachd` on this new 156 port 157 158 ```shell 159 pachctl config update context `pachctl config get active-context` --pachd-address=<cluster ip>:30649 160 ``` 161 162 1. Verify that you can talk to `pachd`: (if not, firewall rules are a common culprit) 163 164 ```shell 165 pachctl version 166 ``` 167 168 **System Response:** 169 170 ``` 171 COMPONENT VERSION 172 pachctl {{ config.pach_latest_version }} 173 pachd {{ config.pach_latest_version }} 174 ``` 175 176 ??? note "pause-pipelines.sh" 177 Alternatively, you can run **Steps 1 - 3** by using the following script: 178 179 ```shell 180 #!/bin/bash 181 # Stop all pipelines: 182 pachctl list pipeline --raw \ 183 | jq -r '.pipeline.name' \ 184 | xargs -P3 -n1 -I{} pachctl stop pipeline {} 185 186 # Backup the Pachyderm services specs, in case you need to restore them: 187 kubectl get svc/pachd -o json >pach_service_backup_30650.json 188 kubectl get svc/etcd -o json >etcd_svc_backup_32379.json 189 kubectl get svc/dash -o json >dash_svc_backup_30080.json 190 191 # Modify all ports of all the Pachyderm service to avoid collissions 192 # with the migration cluster: 193 # Modify the pachd API endpoint to run on 30649: 194 kubectl get svc/pachd -o json | sed 's/30650/30649/g' | kubectl apply -f - 195 # Modify the pachd trace port to run on 30648: 196 kubectl get svc/pachd -o json | sed 's/30651/30648/g' | kubectl apply -f - 197 # Modify the pachd api-over-http port to run on 30647: 198 kubectl get svc/pachd -o json | sed 's/30652/30647/g' | kubectl apply -f - 199 # Modify the pachd saml authentication port to run on 30646: 200 kubectl get svc/pachd -o json | sed 's/30654/30646/g' | kubectl apply -f - 201 # Modify the pachd git api callback port to run on 30644: 202 kubectl get svc/pachd -o json | sed 's/30655/30644/g' | kubectl apply -f - 203 # Modify the etcd client port to run on 32378: 204 kubectl get svc/etcd -o json | sed 's/32379/32378/g' | kubectl apply -f - 205 # Modify the dashboard ports to run on 30079 and 30078: 206 kubectl get svc/dash -o json | sed 's/30080/30079/g' | kubectl apply -f - 207 kubectl get svc/dash -o json | sed 's/30081/30078/g' | kubectl apply -f - 208 # Modify the pachd s3 port to run on 30611: 209 kubectl get svc/pachd -o json | sed 's/30600/30611/g' | kubectl apply -f - 210 ``` 211 212 ## Back up Your Pachyderm Cluster 213 214 After you pause all pipelines and external data operations, 215 you can use the `pachctl extract` command to back up your data. 216 You can use `pachctl extract` alone or in combination with 217 cloning or snapshotting services offered by your cloud provider. 218 219 The backup includes the following: 220 221 * Your data that is typically stored in an object store 222 * Information about Pachyderm primitives, such as pipelines, repositories, 223 commits, provenance and so on. This information is stored in etcd. 224 225 You can back up everything to one local file or you can back up 226 Pachyderm primitives to a local file and use object store's 227 capabilities to clone the data stored in object store buckets. 228 The latter is preferred for large volumes of data and minimizing 229 the downtime during the upgrade. Use the 230 `--no-objects` flag to separate backups. 231 232 In addition, you can extract your partial or full backup into a 233 separate S3 bucket. The bucket must have the same permissions policy as 234 the one you have configured when you originally deployed Pachyderm. 235 236 To back up your Pachyderm cluster, run one of the following commands: 237 238 * To create a partial back up of metadata-only, run: 239 240 ```shell 241 pachctl extract --no-objects > path/to/your/backup/file 242 ``` 243 244 * If you want to save this partial backup in an object store by using the 245 `--url` flag, run: 246 247 ```shell 248 pachctl extract --no-objects --url s3://... 249 ``` 250 251 * To back up everything in one local file: 252 253 ```shell 254 pachctl extract > path/to/your/backup/file 255 ``` 256 257 Similarly, this backup can be saved in an object store with the `--url` 258 flag. 259 260 ## Using your Cloud Provider's Clone and Snapshot Services 261 262 Follow your cloud provider's recommendation 263 for backing up persistent volumes and object stores. Here are some pointers to the relevant documentation: 264 265 * Creating a snapshot of persistent volumes: 266 267 - [Creating snapshots of GCE persistent volumes](https://cloud.google.com/compute/docs/disks/create-snapshots) 268 - [Creating snapshots of Elastic Block Store (EBS) volumes](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-creating-snapshot.html) 269 - [Creating snapshots of Azure Virtual Hard Disk volumes](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/snapshot-copy-managed-disk) 270 271 For on-premises Kubernetes deployments, check the vendor documentation for 272 your PV implementation on backing up and restoring. 273 274 * Cloning object stores: 275 276 - [Using AWS CLI](https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html) 277 - [Using gsutil](https://cloud.google.com/storage/docs/gsutil/commands/cp) 278 - [Using azcopy](https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-linux?toc=%2fazure%2fstorage%2ffiles%2ftoc.json). 279 280 For on-premises Kubernetes deployments, check the vendor documentation 281 for your on-premises object store for details on backing up and 282 restoring a bucket. 283 284 # Restore your Cluster from a Backup: 285 286 After you backup your cluster, you can restore it by using the 287 `pachctl restore` command. Typically, you would deploy a new Pachyderm cluster 288 either in another Kubernetes namespace or in a completely separate Kubernetes cluster. 289 290 To restore your Cluster from a Backup, run the following command: 291 292 * If you have backed up your cluster to a local file:, run: 293 294 ```shell 295 pachctl restore < path/to/your/backup/file 296 ``` 297 298 * If you have backed up your cluster to an object store, run: 299 300 ```shell 301 pachctl restore --url s3://<path-to-backup>> 302 ``` 303 304 !!! note "See Also:" 305 - [Migrate Your Cluster](../migrations/)