# Backup and Restore

Pachyderm provides the `pachctl extract` and `pachctl restore` commands to back up and restore the state of a Pachyderm cluster.

The `pachctl extract` command requires that all pipeline and data loading activity into Pachyderm stop before the extract occurs. This enables Pachyderm to create a consistent, point-in-time backup. In this document, we'll talk about how to create such a backup and restore it to another Pachyderm instance.

The extract and restore commands are currently used to migrate between minor and major releases of Pachyderm, so it's important to understand how to perform them properly. In addition, there are a few design points and operational techniques that data engineers should take into consideration when creating complex Pachyderm deployments to minimize disruptions to production pipelines.

In this document, we'll take you through the steps to back up and restore a cluster, migrate an existing cluster to a newer minor or major release, and elaborate on some of those design and operations considerations.

## Backup and restore concepts

Backing up Pachyderm involves backing up the persistent volume (PV) that `etcd` uses for administrative data and the object store bucket that holds Pachyderm's actual data.
Restoring involves populating that PV and object store with data to recreate a Pachyderm cluster.

## General backup procedure

### 1. Pause all pipeline and data loading/unloading operations

Before you begin, you need to pause all pipelines and data operations.

#### Pausing pipelines

Stop each pipeline in the directed acyclic graphs (DAGs) that define your Pachyderm cluster. You can either run the multiline shell command shown below or manually run the `pachctl stop pipeline` command for each pipeline.

`pachctl stop pipeline <pipeline-name>`

You can confirm that each pipeline is paused using the `pachctl list pipeline` command:

`pachctl list pipeline`

Alternatively, a useful shell script for running `stop pipeline` on all pipelines is included below. It may be necessary to install the utilities used in the script, such as `jq` and `xargs`, on your system.

```
pachctl list pipeline --raw \
  | jq -r '.pipeline.name' \
  | xargs -P3 -n1 -I{} pachctl stop pipeline {}
```
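
To confirm from a script that nothing is still running, a check like the following may help; it assumes the pipeline's raw JSON includes a `stopped` field, as it does in 1.9.x, and prints nothing when every pipeline is stopped.

```
# List any pipelines that are not yet stopped (empty output = all paused)
pachctl list pipeline --raw \
  | jq -r 'select(.stopped != true) | .pipeline.name'
```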

It's also a useful practice, for simple to moderately complex deployments, to keep a terminal window open showing the state of all running Kubernetes pods:

`watch -n 5 kubectl get pods`

You may need to install the `watch` and `kubectl` commands on your system and configure `kubectl` to point at the cluster in which Pachyderm is running.

#### Pausing data loading operations

**Input repositories** or **input repos** in Pachyderm are repositories created with the `pachctl create repo` command. They're designed to be the repos at the top of a directed acyclic graph of pipelines. Pipelines have their own output repos associated with them; those output repos are not considered input repos. If any processes external to Pachyderm put data into input repos using any method (the Pachyderm APIs, `pachctl put file`, and so on), they need to be paused. See [Loading data from other sources into Pachyderm](#loading-data-from-other-sources-into-pachyderm) below for design considerations for those processes that will minimize downtime during a restore or migration.

Alternatively, you can use the following commands to stop all data loading into Pachyderm from outside processes.

```
# Once you have stopped all running Pachyderm pipelines, such as with this command,
# $ pachctl list pipeline --raw \
#   | jq -r '.pipeline.name' \
#   | xargs -P3 -n1 -I{} pachctl stop pipeline {}

# all pipelines in your cluster should be suspended. To stop all
# data loading processes, we're going to modify the pachd Kubernetes service so that
# it only accepts traffic on port 30649 (instead of the usual 30650). This way,
# any background users and services that send requests to your Pachyderm cluster
# while 'extract' is running will not interfere with the process
#
# Back up the Pachyderm service spec, in case you need to restore it quickly
$ kubectl get svc/pachd -o json >pach_service_backup_30650.json

# Modify the service to accept traffic on port 30649
# Note that you'll likely also need to modify your cloud provider's firewall
# rules to allow traffic on this port
$ kubectl get svc/pachd -o json | sed 's/30650/30649/g' | kubectl apply -f -

# Modify your environment so that *you* can talk to pachd on this new port
$ pachctl config update context `pachctl config get active-context` --pachd-address=<cluster ip>:30649

# Make sure you can talk to pachd (if not, firewall rules are a common culprit)
$ pachctl version
COMPONENT           VERSION
pachctl             1.9.5
pachd               1.9.5
```

### 2. Extract a Pachyderm backup

You can use `pachctl extract` alone or in combination with cloning/snapshotting services.

#### Using `pachctl extract`

Create the backup you need using the `pachctl extract` command:

`pachctl extract > path/to/your/backup/file`

You can also use the `-u` or `--url` flag to put the backup directly into an object store:

`pachctl extract --url s3://...`

If you are planning to back up the object store using its own built-in clone operation, be sure to add the `--no-objects` flag to the `pachctl extract` command.

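For example, with this flag a metadata-only backup can still be written to a local file (the path below is illustrative):

```
# Extract cluster metadata only; object data is backed up separately
pachctl extract --no-objects > path/to/your/backup/file
```
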
#### Using your cloud provider's clone and snapshot services

You should follow your cloud provider's recommendations for backing up these resources. Here are some pointers to the relevant documentation.

##### Snapshotting persistent volumes

For example, here are the official guides on creating snapshots of persistent volumes on Google Cloud Platform, Amazon Web Services (AWS), and Microsoft Azure, respectively:

- [Creating snapshots of GCE persistent volumes](https://cloud.google.com/compute/docs/disks/create-snapshots)
- [Creating snapshots of Elastic Block Store (EBS) volumes](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-creating-snapshot.html)
- [Creating snapshots of Azure Virtual Hard Disk volumes](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/snapshot-copy-managed-disk)

For on-premises Kubernetes deployments, check the vendor documentation for your PV implementation on backing up and restoring.

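To find which volume to snapshot, you can look up the PV bound to etcd's claim. A minimal sketch, assuming the claim is named `etcd-storage` as in a default `pachctl deploy` manifest (adjust the claim name and namespace to your deployment):

```
# Print the name of the PersistentVolume backing the etcd claim
kubectl get pvc etcd-storage -o jsonpath='{.spec.volumeName}'
```
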
##### Cloning object stores

Below, we give an example that uses the Amazon Web Services CLI to clone one bucket to another, [taken from the documentation for that command](https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html). Similar commands are available for [Google Cloud](https://cloud.google.com/storage/docs/gsutil/commands/cp) and [Azure blob storage](https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-linux?toc=%2fazure%2fstorage%2ffiles%2ftoc.json).

`aws s3 sync s3://mybucket s3://mybucket2`

For on-premises Kubernetes deployments, check the vendor documentation for your on-premises object store for details on backing up and restoring a bucket.

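For instance, if your on-premises deployment uses a MinIO-compatible object store, the MinIO client's `mc mirror` command plays a similar role to `aws s3 sync`; the alias and bucket names below are illustrative:

```
# Mirror the Pachyderm bucket to a backup bucket on the same MinIO host
mc mirror myminio/pachyderm-data myminio/pachyderm-data-backup
```
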
#### Combining cloning, snapshots, and extract/restore

You can use the `pachctl extract` command with the `--no-objects` flag to exclude the object store and use an object store snapshot or clone command to back up the object store. You can run the two commands at the same time. For example, on Amazon Web Services, the following commands can be run simultaneously:

`aws s3 sync s3://mybucket s3://mybucket2`

`pachctl extract --no-objects --url s3://anotherbucket`

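If you script this, a minimal sketch that runs both steps in parallel and waits for them to finish (bucket names are illustrative):

```
aws s3 sync s3://mybucket s3://mybucket2 &       # clone the object store in the background
pachctl extract --no-objects --url s3://anotherbucket
wait                                             # block until the s3 sync also completes
```
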
#### Use case: minimizing downtime during a migration

The cloning/snapshotting technique above is recommended for migrations where minimizing downtime is desirable, as it allows the duplicated object store to be the basis of the new, upgraded cluster instead of requiring Pachyderm to extract the data from the object store.

### 3. Restart all pipeline and data loading operations

Once the backup is complete, restart all paused pipelines and data loading operations.

Start each pipeline in the directed acyclic graphs (DAGs) that define your Pachyderm cluster. You can either run the multiline shell command shown below or manually run the `pachctl start pipeline` command for each pipeline.

`pachctl start pipeline <pipeline-name>`

You can confirm that each pipeline is started using the `pachctl list pipeline` command:

`pachctl list pipeline`

A useful shell script for running `start pipeline` on all pipelines is included below. It may be necessary to install the utilities used in the script, such as `jq` and `xargs`, on your system.

```
pachctl list pipeline --raw \
  | jq -r '.pipeline.name' \
  | xargs -P3 -n1 -I{} pachctl start pipeline {}
```
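
To verify from a script that no pipeline is still stopped, the mirror-image check may help; it again assumes a `stopped` field in the raw pipeline JSON, and prints nothing when every pipeline is running.

```
# List any pipelines that are still stopped (empty output = all running)
pachctl list pipeline --raw \
  | jq -r 'select(.stopped == true) | .pipeline.name'
```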

If you used the port-changing technique [above](#pausing-data-loading-operations) to stop all data loading into Pachyderm from outside processes, you should change the ports back.

```
# Once you have restarted all paused Pachyderm pipelines, such as with this command,
# $ pachctl list pipeline --raw \
#   | jq -r '.pipeline.name' \
#   | xargs -P3 -n1 -I{} pachctl start pipeline {}

# all pipelines in your cluster should be restarted. To restart all data loading
# processes, we're going to change the pachd Kubernetes service so that
# it only accepts traffic on port 30650 again (from 30649).
#
# Back up the Pachyderm service spec, in case you need to restore it quickly
$ kubectl get svc/pachd -o json >pach_service_backup_30649.json

# Modify the service to accept traffic on port 30650, again
$ kubectl get svc/pachd -o json | sed 's/30649/30650/g' | kubectl apply -f -

# Modify your environment so that *you* can talk to pachd on the old port
$ pachctl config update context `pachctl config get active-context` --pachd-address=<cluster ip>:30650

# Make sure you can talk to pachd (if not, firewall rules are a common culprit)
$ pachctl version
COMPONENT           VERSION
pachctl             1.9.5
pachd               1.9.5
```

## General restore procedure

### Restore your backup to a Pachyderm cluster, same version

Spin up a Pachyderm cluster and run `pachctl restore` with the backup you created earlier:

`pachctl restore < path/to/your/backup/file`

You can also use the `-u` or `--url` flag to get the backup directly from the object store you placed it in:

`pachctl restore --url s3://...`

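Once the restore finishes, a quick sanity check such as the following can confirm that your repos and pipelines came back as expected:

```
# Both lists should match the cluster you extracted from
pachctl list repo
pachctl list pipeline
```
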
### Loading data from other sources into Pachyderm

When writing systems that place data into Pachyderm input repos (see [above](#pausing-data-loading-operations) for a definition of 'input repo'), it is important to provide a way of 'pausing' the loading, queueing any pending load requests to be sent when the system is 'resumed'. This allows all Pachyderm processing to be stopped while the extract takes place.

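A minimal sketch of such a loader, assuming a hypothetical pause flag file at `/tmp/pachyderm-loading-paused`, an input repo named `my-input-repo`, and files arriving in `/data/incoming`:

```
# Upload each incoming file, but wait while the pause flag exists
for f in /data/incoming/*; do
  while [ -f /tmp/pachyderm-loading-paused ]; do
    sleep 10   # paused for a backup or migration; try again shortly
  done
  pachctl put file my-input-repo@master:/"$(basename "$f")" -f "$f"
done
```
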
In addition, it is desirable for systems that load data into Pachyderm to have a mechanism for replaying their queue from any checkpoint in time. This is useful when doing migrations from one release to another, where you would like to minimize the downtime of a production Pachyderm system. After an extract, the old system is kept running from the established checkpoint while a duplicate, upgraded Pachyderm cluster is migrated with the duplicated data. Transactions that occur while the migrated, upgraded cluster is being brought up are not lost, because they can be replayed from the checkpoint into the new cluster once the migration is complete.
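
A minimal sketch of such a replay, assuming the loader appends an ISO-8601 timestamp and file path to a hypothetical log at `/var/log/loader-checkpoint.log` for every successful load:

```
# Re-send every load recorded after the checkpoint to the new cluster.
# The timestamp, log format, and repo name are illustrative.
awk -v ckpt="2019-11-01T00:00:00Z" '$1 > ckpt { print $2 }' /var/log/loader-checkpoint.log \
  | xargs -n1 -I{} pachctl put file my-input-repo@master -f {}
```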