github.com/pachyderm/pachyderm@v1.13.4/doc/docs/master/deploy-manage/manage/backup_restore.md

# Backup Your Cluster

Pachyderm provides the `pachctl extract` and `pachctl restore` commands to
back up and restore the state of a Pachyderm cluster.

The `pachctl extract` command requires that all pipeline and data-loading
activity into Pachyderm stop before the extract occurs. This enables
Pachyderm to create a consistent, point-in-time backup.

Extract and restore commands are used to migrate between minor
and major releases of Pachyderm. In addition, there are a few design
points and operational techniques that data engineers should take
into consideration when creating complex Pachyderm deployments to
minimize disruptions to production pipelines.

Backing up Pachyderm involves backing up the persistent volume (PV) that
`etcd` uses for administrative data and the object store bucket that
holds Pachyderm's actual data.
Restoring involves populating that PV and object store with data to
recreate a Pachyderm cluster.

## Before You Begin

Before you begin, you need to pause all the pipelines and data operations
that run in your cluster. You can do so either by running a multi-line
shell script or by running the `pachctl stop pipeline` command for each
pipeline individually.

If you decide to use the shell script below, you need to have `jq` and
`xargs` installed on your system. Also, you might need to install
the `watch` and `kubectl` commands on your system and configure
`kubectl` to point at the cluster in which Pachyderm is running.

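As a quick sanity check before you start, you can verify that the required tools are available. The `check_tools` helper below is a hypothetical sketch, not part of Pachyderm:

```shell
#!/bin/bash
# Report any required tool that is missing from PATH.
# check_tools is an illustrative helper, not a Pachyderm command.
check_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
  done
}

check_tools pachctl kubectl jq xargs watch
```

If the script prints nothing, all of the listed tools are on your `PATH`.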
To stop a running pipeline, complete the following steps:

1. Pause each pipeline individually by running the single
`pachctl stop pipeline` command for each one, or pause all of them
with a script:

=== "Command"
    ```shell
    pachctl stop pipeline <pipeline-name>
    ```

=== "Script"
    ```shell
    pachctl list pipeline --raw \
    | jq -r '.pipeline.name' \
    | xargs -P3 -n1 -I{} pachctl stop pipeline {}
    ```

1. Optionally, run the `watch` command to monitor the pods
   terminating:

   ```shell
   watch -n 5 kubectl get pods
   ```

1. Confirm that pipelines are paused:

   ```shell
   pachctl list pipeline
   ```

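To double-check, you can filter the raw pipeline listing for anything that is not yet paused. The sketch below runs against sample JSON so it is self-contained; on a real cluster, replace the `echo` with `pachctl list pipeline --raw`. It assumes `jq` is installed, and the `PIPELINE_PAUSED` state name is an assumption you should verify against your own cluster's output:

```shell
# Sample of the one-JSON-object-per-pipeline stream that
# `pachctl list pipeline --raw` emits; replace the echo with
# the real command on your cluster.
sample='{"pipeline":{"name":"edges"},"state":"PIPELINE_PAUSED"}
{"pipeline":{"name":"montage"},"state":"PIPELINE_RUNNING"}'

# Print the names of pipelines that are not yet paused.
echo "$sample" \
| jq -r 'select(.state != "PIPELINE_PAUSED") | .pipeline.name'
# Prints: montage
```

An empty result means every pipeline reports the paused state.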
### Pause External Data Loading Operations

**Input repositories**, or **input repos**, in Pachyderm are
repositories created with the `pachctl create repo` command.
They are designed to be the repos at the top of a directed
acyclic graph of pipelines. Pipelines have their own output
repos associated with them; these repos are different from
input repos.

If you have any processes external to Pachyderm
that put data into input repos using any supported method,
such as the Pachyderm APIs, `pachctl put file`, or others,
you need to pause those processes.

When an external system writes data into Pachyderm
input repos, you need a way of *pausing* those writes
while queueing the incoming data so that it can be
written when the systems are *resumed*.
This allows all Pachyderm processing to be stopped while
the extract takes place.

In addition, it is desirable for systems that load data
into Pachyderm to have a mechanism for replaying a queue
from any checkpoint in time.
This is useful when migrating from one release
to another, where you want to minimize downtime
of a production Pachyderm system. After an extract,
the old system is kept running with the checkpoint
established while a duplicate, upgraded Pachyderm
cluster is being migrated with duplicated data.
Transactions that occur while the migrated,
upgraded cluster is being brought up are not lost.

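As a rough illustration of the pause-and-queue idea, an external loader could gate each write on a flag file, recording paths while paused and replaying them on resume. Everything in this sketch is hypothetical — the flag and queue paths, the repo name, and the helper functions are illustrative, not part of Pachyderm:

```shell
#!/bin/bash
# Hypothetical pause-and-queue wrapper for an external data loader.
PAUSE_FLAG=/tmp/pachyderm-loading.paused
QUEUE=/tmp/pachyderm-load.queue

load_file() {
  if [ -e "$PAUSE_FLAG" ]; then
    # Paused: queue the request instead of writing it now.
    echo "$1" >>"$QUEUE"
  else
    # Illustrative repo/branch; substitute your own.
    pachctl put file "images@master:$(basename "$1")" -f "$1"
  fi
}

# On resume, replay the queue from the checkpoint:
replay_queue() {
  rm -f "$PAUSE_FLAG"
  while IFS= read -r path; do
    load_file "$path"
  done <"$QUEUE" && : >"$QUEUE"
}
```

Touching the flag file before running `pachctl extract`, and removing it (via `replay_queue`) afterward, gives the loader the pause/resume behavior described above.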
If you do not have an external way of pausing input
from other systems, you can use the following commands to stop
all data loading into Pachyderm from outside processes.
To stop all data loading processes, you need to modify
the `pachd` Kubernetes service so that it only accepts
traffic on port 30649 instead of the usual 30650. This way,
any background users and services that send requests to
your Pachyderm cluster while `pachctl extract` is
running will not interfere with the process. Use this port-switching
technique to minimize downtime during the migration.

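The effect of the `sed` substitutions used below can be seen on a trimmed-down service fragment. The JSON here is an illustrative excerpt, not a full service spec:

```shell
# Illustrative excerpt of a pachd service spec; sed rewrites every
# literal 30650 to 30649 before the JSON is re-applied with kubectl.
svc='{"spec":{"ports":[{"name":"api-grpc-port","nodePort":30650,"port":650}]}}'
echo "$svc" | sed 's/30650/30649/g'
# Prints: {"spec":{"ports":[{"name":"api-grpc-port","nodePort":30649,"port":650}]}}
```

Because the substitution is a blind text replacement, it only works safely when the port number does not appear anywhere else in the spec — which is why each port gets its own dedicated `sed` expression below.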
To pause external data loading operations, complete the
following steps:

1. Verify that all Pachyderm pipelines are paused:

   ```shell
   pachctl list pipeline
   ```

1. For safety, save the Pachyderm service spec in a JSON file:

   ```shell
   kubectl get svc/pachd -o json >pach_service_backup_30650.json
   ```

1. Modify the `pachd` service to accept traffic on port 30649:

   ```shell
   kubectl get svc/pachd -o json | sed 's/30650/30649/g' | kubectl apply -f -
   ```

   Most likely, you will need to modify your cloud provider's firewall
   rules to allow traffic on this port.

   Depending on your deployment, you might need to switch
   additional ports:

   1. Back up the `etcd` and dashboard manifests:

      ```shell
      kubectl get svc/etcd -o json >etcd_svc_backup_32379.json
      kubectl get svc/dash -o json >dash_svc_backup_30080.json
      ```

   1. Switch the remaining `pachd` ports and the `etcd` and dashboard
      ports:

      ```shell
      kubectl get svc/pachd -o json | sed 's/30651/30648/g' | kubectl apply -f -
      kubectl get svc/pachd -o json | sed 's/30652/30647/g' | kubectl apply -f -
      kubectl get svc/pachd -o json | sed 's/30654/30646/g' | kubectl apply -f -
      kubectl get svc/pachd -o json | sed 's/30655/30644/g' | kubectl apply -f -
      kubectl get svc/etcd -o json | sed 's/32379/32378/g' | kubectl apply -f -
      kubectl get svc/dash -o json | sed 's/30080/30079/g' | kubectl apply -f -
      kubectl get svc/dash -o json | sed 's/30081/30078/g' | kubectl apply -f -
      kubectl get svc/pachd -o json | sed 's/30600/30611/g' | kubectl apply -f -
      ```

1. Modify your environment so that you can access `pachd` on the new
   port:

   ```shell
   pachctl config update context $(pachctl config get active-context) --pachd-address=<cluster ip>:30649
   ```

1. Verify that you can talk to `pachd` (if you cannot, firewall rules
   are a common culprit):

   ```shell
   pachctl version
   ```

   **System Response:**

   ```
   COMPONENT           VERSION
   pachctl             {{ config.pach_latest_version }}
   pachd               {{ config.pach_latest_version }}
   ```

??? note "pause-pipelines.sh"
    Alternatively, you can run **Steps 1 - 3** by using the following script:

    ```shell
    #!/bin/bash
    # Stop all pipelines:
    pachctl list pipeline --raw \
    | jq -r '.pipeline.name' \
    | xargs -P3 -n1 -I{} pachctl stop pipeline {}

    # Back up the Pachyderm service specs, in case you need to restore them:
    kubectl get svc/pachd -o json >pach_service_backup_30650.json
    kubectl get svc/etcd -o json >etcd_svc_backup_32379.json
    kubectl get svc/dash -o json >dash_svc_backup_30080.json

    # Modify all ports of all the Pachyderm services to avoid collisions
    # with the migration cluster:
    # Modify the pachd API endpoint to run on 30649:
    kubectl get svc/pachd -o json | sed 's/30650/30649/g' | kubectl apply -f -
    # Modify the pachd trace port to run on 30648:
    kubectl get svc/pachd -o json | sed 's/30651/30648/g' | kubectl apply -f -
    # Modify the pachd api-over-http port to run on 30647:
    kubectl get svc/pachd -o json | sed 's/30652/30647/g' | kubectl apply -f -
    # Modify the pachd SAML authentication port to run on 30646:
    kubectl get svc/pachd -o json | sed 's/30654/30646/g' | kubectl apply -f -
    # Modify the pachd git API callback port to run on 30644:
    kubectl get svc/pachd -o json | sed 's/30655/30644/g' | kubectl apply -f -
    # Modify the etcd client port to run on 32378:
    kubectl get svc/etcd -o json | sed 's/32379/32378/g' | kubectl apply -f -
    # Modify the dashboard ports to run on 30079 and 30078:
    kubectl get svc/dash -o json | sed 's/30080/30079/g' | kubectl apply -f -
    kubectl get svc/dash -o json | sed 's/30081/30078/g' | kubectl apply -f -
    # Modify the pachd s3 port to run on 30611:
    kubectl get svc/pachd -o json | sed 's/30600/30611/g' | kubectl apply -f -
    ```

## Back up Your Pachyderm Cluster

After you pause all pipelines and external data operations,
you can use the `pachctl extract` command to back up your data.
You can use `pachctl extract` alone or in combination with
cloning or snapshotting services offered by your cloud provider.

The backup includes the following:

* Your data that is typically stored in an object store
* Information about Pachyderm primitives, such as pipelines, repositories,
commits, provenance, and so on. This information is stored in `etcd`.

You can back up everything to one local file, or you can back up
Pachyderm primitives to a local file and use your object store's
capabilities to clone the data stored in object store buckets.
The latter is preferred for large volumes of data and for minimizing
downtime during the upgrade. Use the
`--no-objects` flag to separate the backups.

In addition, you can extract your partial or full backup into a
separate S3 bucket. The bucket must have the same permissions policy as
the one you configured when you originally deployed Pachyderm.

To back up your Pachyderm cluster, run one of the following commands:

* To create a partial, metadata-only backup, run:

  ```shell
  pachctl extract --no-objects > path/to/your/backup/file
  ```

  * If you want to save this partial backup in an object store by using the
  `--url` flag, run:

    ```shell
    pachctl extract --no-objects --url s3://...
    ```

* To back up everything in one local file, run:

  ```shell
  pachctl extract > path/to/your/backup/file
  ```

  Similarly, this backup can be saved in an object store with the `--url`
  flag.

## Using Your Cloud Provider's Clone and Snapshot Services

Follow your cloud provider's recommendations
for backing up persistent volumes and object stores. Here are some pointers to the relevant documentation:

* Creating a snapshot of persistent volumes:

  - [Creating snapshots of GCE persistent volumes](https://cloud.google.com/compute/docs/disks/create-snapshots)
  - [Creating snapshots of Elastic Block Store (EBS) volumes](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-creating-snapshot.html)
  - [Creating snapshots of Azure Virtual Hard Disk volumes](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/snapshot-copy-managed-disk)

    For on-premises Kubernetes deployments, check the vendor documentation for
    your PV implementation for details on backing up and restoring.

* Cloning object stores:

  - [Using AWS CLI](https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html)
  - [Using gsutil](https://cloud.google.com/storage/docs/gsutil/commands/cp)
  - [Using azcopy](https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-linux?toc=%2fazure%2fstorage%2ffiles%2ftoc.json)

    For on-premises Kubernetes deployments, check the vendor documentation
    for your on-premises object store for details on backing up and
    restoring a bucket.

## Restore Your Cluster from a Backup

After you back up your cluster, you can restore it by using the
`pachctl restore` command. Typically, you would deploy a new Pachyderm cluster
either in another Kubernetes namespace or in a completely separate Kubernetes cluster.

To restore your cluster from a backup, run one of the following commands:

* If you have backed up your cluster to a local file, run:

  ```shell
  pachctl restore < path/to/your/backup/file
  ```

* If you have backed up your cluster to an object store, run:

  ```shell
  pachctl restore --url s3://<path-to-backup>
  ```

!!! note "See Also:"
    - [Migrate Your Cluster](../migrations/)