# Backup Your Cluster

Pachyderm provides the `pachctl extract` and `pachctl restore` commands to
back up and restore the state of a Pachyderm cluster.

The `pachctl extract` command requires that all pipeline and data loading
activity into Pachyderm stop before the extract occurs. This enables
Pachyderm to create a consistent, point-in-time backup.

Extract and restore commands are used to migrate between minor
and major releases of Pachyderm. In addition, there are a few design
points and operational techniques that data engineers should take
into consideration when creating complex Pachyderm deployments to
minimize disruptions to production pipelines.

Backing up Pachyderm involves backing up the persistent volume (PV) that
`etcd` uses for administrative data and the object store bucket that
holds Pachyderm's actual data.
Restoring involves populating that PV and object store with data to
recreate a Pachyderm cluster.

## Before You Begin

Before you begin, you need to pause all the pipelines and data operations
that run in your cluster. You can do so either by running a multi-line
shell script or by running the `pachctl stop pipeline` command for each
pipeline individually.

If you decide to use the shell script below, you need to have `jq` and
`xargs` installed on your system. Also, you might need to install
the `watch` and `kubectl` commands on your system and configure
`kubectl` to point at the cluster in which Pachyderm is running.
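Before running any of the scripts in this section, you can quickly confirm that these tools are available. This is a convenience sketch, not part of the official procedure; trim the tool list to match the steps you actually run:

```shell
# Report any required command-line tool that is missing from PATH.
for tool in pachctl kubectl jq xargs watch; do
  if ! command -v "$tool" >/dev/null 2>&1; then
    echo "missing: $tool" >&2
  fi
done
```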
To stop a running pipeline, complete the following steps:

1. Pause each pipeline individually by running the single
`pachctl` command repeatedly, or by running a script:

   ```pachctl tab="Command"
   pachctl stop pipeline <pipeline-name>
   ```

   ```shell tab="Script"
   pachctl list pipeline --raw \
   | jq -r '.pipeline.name' \
   | xargs -P3 -n1 -I{} pachctl stop pipeline {}
   ```

1. Optionally, run the `watch` command to monitor the pods
   terminating:

   ```shell
   watch -n 5 kubectl get pods
   ```

1. Confirm that pipelines are paused:

   ```shell
   pachctl list pipeline
   ```

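To see the state of every pipeline at a glance, the raw listing can be filtered with `jq`. The `.state` field name and the paused-state value are assumptions based on the 1.11.x `--raw` output; verify them against your cluster before relying on this:

```shell
# Print each pipeline name with its current state; after pausing,
# every line should report a paused state.
pachctl list pipeline --raw | jq -r '"\(.pipeline.name)\t\(.state)"'
```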
### Pause External Data Loading Operations

**Input repositories** or **input repos** in Pachyderm are
repositories created with the `pachctl create repo` command.
They are designed to be the repos at the top of a directed
acyclic graph of pipelines. Pipelines have their own output
repos associated with them. These repos are different from
input repos.

If you have any processes external to Pachyderm
that put data into input repos using any supported method,
such as the Pachyderm APIs, `pachctl put file`, or others,
you need to pause those processes.

When an external system writes data into Pachyderm
input repos, you need a way to *pause* that system while
queueing its write requests, so that they can be replayed
when the system is *resumed*.
This allows all Pachyderm processing to be stopped while
the extract takes place.

In addition, it is desirable for systems that load data
into Pachyderm to have a mechanism for replaying a queue
from any checkpoint in time.
This is useful when doing migrations from one release
to another, where you want to minimize downtime
of a production Pachyderm system. After an extract,
the old system is kept running with the checkpoint
established while a duplicate, upgraded Pachyderm
cluster is being migrated with duplicated data.
Transactions that occur while the migrated,
upgraded cluster is being brought up are not lost.

If you do not have an external way to pause input
from upstream systems, you can use the following commands to stop
all data loading into Pachyderm from outside processes.
To stop all data loading processes, you need to modify
the `pachd` Kubernetes service so that it only accepts
traffic on port 30649 instead of the usual 30650. This way,
any background users and services that send requests to
your Pachyderm cluster while `pachctl extract` is
running will not interfere with the process. Use this port-switching
technique to minimize downtime during the migration.

To pause external data loading operations, complete the
following steps:

1. Verify that all Pachyderm pipelines are paused:

   ```shell
   pachctl list pipeline
   ```

1. For safety, save the Pachyderm service spec to a JSON file:

   ```shell
   kubectl get svc/pachd -o json >pach_service_backup_30650.json
   ```

1. Modify the `pachd` service to accept traffic on port 30649:

   ```shell
   kubectl get svc/pachd -o json | sed 's/30650/30649/g' | kubectl apply -f -
   ```

   Most likely, you will need to modify your cloud provider's firewall
   rules to allow traffic on this port.

   Depending on your deployment, you might need to switch
   additional ports:

   1. Back up the `etcd` and dashboard manifests:

      ```shell
      kubectl get svc/etcd -o json >etcd_svc_backup_32379.json
      kubectl get svc/dash -o json >dash_svc_backup_30080.json
      ```

   1. Switch the remaining `pachd`, `etcd`, and dashboard ports:

      ```shell
      kubectl get svc/pachd -o json | sed 's/30651/30648/g' | kubectl apply -f -
      kubectl get svc/pachd -o json | sed 's/30652/30647/g' | kubectl apply -f -
      kubectl get svc/pachd -o json | sed 's/30654/30646/g' | kubectl apply -f -
      kubectl get svc/pachd -o json | sed 's/30655/30644/g' | kubectl apply -f -
      kubectl get svc/etcd -o json | sed 's/32379/32378/g' | kubectl apply -f -
      kubectl get svc/dash -o json | sed 's/30080/30079/g' | kubectl apply -f -
      kubectl get svc/dash -o json | sed 's/30081/30078/g' | kubectl apply -f -
      kubectl get svc/pachd -o json | sed 's/30600/30611/g' | kubectl apply -f -
      ```

1. Modify your environment so that you can access `pachd` on this new
port:

   ```shell
   pachctl config update context `pachctl config get active-context` --pachd-address=<cluster ip>:30649
   ```

1. Verify that you can talk to `pachd` (if not, firewall rules are a common culprit):

   ```shell
   pachctl version
   ```

   **System Response:**

   ```
   COMPONENT           VERSION
   pachctl             {{ config.pach_latest_version }}
   pachd               {{ config.pach_latest_version }}
   ```

??? note "pause-pipelines.sh"
    Alternatively, you can run **Steps 1 - 3** by using the following script:

    ```shell
    #!/bin/bash
    # Stop all pipelines:
    pachctl list pipeline --raw \
    | jq -r '.pipeline.name' \
    | xargs -P3 -n1 -I{} pachctl stop pipeline {}

    # Back up the Pachyderm service specs, in case you need to restore them:
    kubectl get svc/pachd -o json >pach_service_backup_30650.json
    kubectl get svc/etcd -o json >etcd_svc_backup_32379.json
    kubectl get svc/dash -o json >dash_svc_backup_30080.json

    # Modify all ports of all the Pachyderm services to avoid collisions
    # with the migration cluster:
    # Modify the pachd API endpoint to run on 30649:
    kubectl get svc/pachd -o json | sed 's/30650/30649/g' | kubectl apply -f -
    # Modify the pachd trace port to run on 30648:
    kubectl get svc/pachd -o json | sed 's/30651/30648/g' | kubectl apply -f -
    # Modify the pachd api-over-http port to run on 30647:
    kubectl get svc/pachd -o json | sed 's/30652/30647/g' | kubectl apply -f -
    # Modify the pachd SAML authentication port to run on 30646:
    kubectl get svc/pachd -o json | sed 's/30654/30646/g' | kubectl apply -f -
    # Modify the pachd git API callback port to run on 30644:
    kubectl get svc/pachd -o json | sed 's/30655/30644/g' | kubectl apply -f -
    # Modify the etcd client port to run on 32378:
    kubectl get svc/etcd -o json | sed 's/32379/32378/g' | kubectl apply -f -
    # Modify the dashboard ports to run on 30079 and 30078:
    kubectl get svc/dash -o json | sed 's/30080/30079/g' | kubectl apply -f -
    kubectl get svc/dash -o json | sed 's/30081/30078/g' | kubectl apply -f -
    # Modify the pachd s3 port to run on 30611:
    kubectl get svc/pachd -o json | sed 's/30600/30611/g' | kubectl apply -f -
    ```

## Back up Your Pachyderm Cluster

After you pause all pipelines and external data operations,
you can use the `pachctl extract` command to back up your data.
You can use `pachctl extract` alone or in combination with
cloning or snapshotting services offered by your cloud provider.

The backup includes the following:

* Your data, which is typically stored in an object store
* Information about Pachyderm primitives, such as pipelines, repositories,
commits, provenance, and so on. This information is stored in `etcd`.

You can back up everything to one local file, or you can back up
Pachyderm primitives to a local file and use your object store's
capabilities to clone the data stored in object store buckets.
The latter is preferred for large volumes of data and for minimizing
downtime during the upgrade. Use the
`--no-objects` flag to separate the backups.

In addition, you can extract your partial or full backup into a
separate S3 bucket. The bucket must have the same permissions policy as
the one you configured when you originally deployed Pachyderm.

To back up your Pachyderm cluster, run one of the following commands:

* To create a partial, metadata-only backup, run:

  ```shell
  pachctl extract --no-objects > path/to/your/backup/file
  ```

  * If you want to save this partial backup in an object store by using the
  `--url` flag, run:

    ```shell
    pachctl extract --no-objects --url s3://...
    ```

* To back up everything in one local file, run:

  ```shell
  pachctl extract > path/to/your/backup/file
  ```

  Similarly, this backup can be saved in an object store with the `--url`
  flag.

## Using Your Cloud Provider's Clone and Snapshot Services

Follow your cloud provider's recommendations
for backing up persistent volumes and object stores. Here are some pointers to the relevant documentation:

* Creating a snapshot of persistent volumes:

  - [Creating snapshots of GCE persistent volumes](https://cloud.google.com/compute/docs/disks/create-snapshots)
  - [Creating snapshots of Elastic Block Store (EBS) volumes](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-creating-snapshot.html)
  - [Creating snapshots of Azure Virtual Hard Disk volumes](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/snapshot-copy-managed-disk)

    For on-premises Kubernetes deployments, check the vendor documentation for
    your PV implementation on backing up and restoring.

* Cloning object stores:

  - [Using AWS CLI](https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html)
  - [Using gsutil](https://cloud.google.com/storage/docs/gsutil/commands/cp)
  - [Using azcopy](https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-linux?toc=%2fazure%2fstorage%2ffiles%2ftoc.json)

    For on-premises Kubernetes deployments, check the vendor documentation
    for your on-premises object store for details on backing up and
    restoring a bucket.

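As a concrete sketch of the object-store cloning approach, the AWS CLI can copy a bucket's contents into a backup bucket with `aws s3 sync`. The bucket names below are placeholders, and the command assumes your AWS credentials grant read access to the source bucket and write access to the destination:

```shell
# Copy every object from the live Pachyderm bucket into a backup bucket.
# Both bucket names are placeholders; substitute your own.
aws s3 sync s3://<live-pachyderm-bucket> s3://<backup-bucket>
```

The other providers' tools (`gsutil cp -r`, `azcopy`) serve the same role; see the links above for their exact invocations.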
## Restore Your Cluster from a Backup

After you back up your cluster, you can restore it by using the
`pachctl restore` command. Typically, you would deploy a new Pachyderm cluster
either in another Kubernetes namespace or in a completely separate Kubernetes cluster.

To restore your cluster from a backup, run one of the following commands:

* If you have backed up your cluster to a local file, run:

  ```shell
  pachctl restore < path/to/your/backup/file
  ```

* If you have backed up your cluster to an object store, run:

  ```shell
  pachctl restore --url s3://<path-to-backup>
  ```

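After the restore completes, a quick spot check against the restored cluster's context can confirm that the metadata came back. This is a convenience sketch, not part of the restore itself; it assumes your active `pachctl` context already points at the new cluster:

```shell
# List the restored repos and pipelines; the output should match
# what the original cluster reported before the backup.
pachctl list repo
pachctl list pipeline
```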
!!! note "See Also:"
    - [Migrate Your Cluster](../migrations/)