
# Export Your Data From Pachyderm

After you build a pipeline, you probably want to see the
results that the pipeline has produced. Every commit into an
input repository results in a corresponding commit into an
output repository.

To access the results of
a pipeline, you can use one of the following methods:

* By running the `pachctl get file` command. This
command returns the contents of the specified file.<br>
To get the list of files in a repo, first
run the `pachctl list file` command.
See [Export Your Data with `pachctl`](#export-your-data-with-pachctl).<br>

* By configuring the pipeline. A pipeline can push or expose
output data to external sources. You can configure the following
data exporting methods in a Pachyderm pipeline:

  * An `egress` property. This property enables you to export your data to
  an external datastore, such as Amazon S3,
  Google Cloud Storage, and others.<br>
  See [Export Your Data with `egress`](#export-your-data-with-egress).<br>

  * A service. A Pachyderm service exposes the results of the
  pipeline processing on a specific port in the form of a dashboard
  or similar endpoint.<br>
  See [Service](../concepts/pipeline-concepts/pipeline/service.md).<br>

  * Your own code. Because a pipeline is a Docker container that runs
  your code, you can egress your data to any data source, even one that
  the `egress` field does not support, by connecting to that source from
  within your code.

* By using the S3 gateway. Pachyderm Enterprise users can reuse
  their existing tools and libraries that work with object stores
  to export their data with the S3 gateway.<br>
  See [Using the S3 Gateway](./s3gateway.md).

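As a sketch of the last approach above, your pipeline code can push its results to an external system itself. The following minimal Python example is illustrative only: the HTTP endpoint is hypothetical, and the only Pachyderm-specific assumption is that pipeline output lives under `/pfs/out`.

```python
# Hypothetical sketch of egressing results from inside pipeline code.
# Pachyderm writes pipeline output under /pfs/out; your code can also
# push any of those files to an external system of your choice.
import os
import urllib.request


def collect_results(output_dir="/pfs/out"):
    """Return the paths of all files the pipeline produced."""
    paths = []
    for root, _, files in os.walk(output_dir):
        for name in files:
            paths.append(os.path.join(root, name))
    return sorted(paths)


def push_file(path, endpoint):
    """POST one result file to an external HTTP endpoint (illustrative)."""
    with open(path, "rb") as f:
        request = urllib.request.Request(endpoint, data=f.read(), method="POST")
    with urllib.request.urlopen(request) as response:
        return response.status
```

You would call `push_file` for each path returned by `collect_results` at the end of your pipeline's main script, substituting your datastore's client library for the plain HTTP call.
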
## Export Your Data with `pachctl`

The `pachctl get file` command enables you to get the contents
of a file in a Pachyderm repository. You need to know the file
path to specify it in the command.

To export your data with `pachctl`:

1. Get the list of files in the repository:

   ```shell
   $ pachctl list file <repo>@<branch>
   ```

   **Example:**

   ```shell
   $ pachctl list file data@master
   NAME           TYPE SIZE
   /user_data.csv file 750B
   ```

1. Get the contents of a specific file:

   ```shell
   $ pachctl get file <repo>@<branch>:<path/to/file>
   ```

   **Example:**

   ```shell
   $ pachctl get file data@master:user_data.csv
   1,cyukhtin0@stumbleupon.com,144.155.176.12
   2,csisneros1@over-blog.com,26.119.26.5
   3,jeye2@instagram.com,13.165.230.106
   4,rnollet3@hexun.com,58.52.147.83
   5,bposkitt4@irs.gov,51.247.120.167
   6,vvenmore5@hubpages.com,161.189.245.212
   7,lcoyte6@ask.com,56.13.147.134
   8,atuke7@psu.edu,78.178.247.163
   9,nmorrell8@howstuffworks.com,28.172.10.170
   10,afynn9@google.com.au,166.14.112.65
   ```

   Also, you can view the parent, grandparent, or any earlier
   revision of a file by appending the caret (`^`) symbol to the
   branch or commit, optionally followed by a number that states
   how many ancestors to go back:

   * To view a file in the parent of a commit:

     1. List files in the parent commit:

        ```shell
        $ pachctl list file <repo>@<branch-or-commit>^
        ```

     1. Get the contents of a file:

        ```shell
        $ pachctl get file <repo>@<branch-or-commit>^:<path/to/file>
        ```

   * To view a file in the `<n>`th ancestor of a commit:

     1. List files in that ancestor commit:

        ```shell
        $ pachctl list file <repo>@<branch-or-commit>^<n>
        ```

        **Example:**

        ```shell
        $ pachctl list file data@master^2
        NAME           TYPE SIZE
        /user_data.csv file 375B
        ```

     1. Get the contents of a file:

        ```shell
        $ pachctl get file <repo>@<branch-or-commit>^<n>:<path/to/file>
        ```

        **Example:**

        ```shell
        $ pachctl get file data@master^4:user_data.csv
        ```

     You can specify any number in the `^<n>` notation. If the file
     exists in that commit, Pachyderm returns it. If the file
     does not exist in that revision, Pachyderm displays the following
     message:

     ```shell
     $ pachctl get file <repo>@<branch-or-commit>^<n>:<path/to/file>
     file "<path/to/file>" not found
     ```

## Export Your Data with `egress`

The `egress` field in the Pachyderm [pipeline specification](../reference/pipeline_spec.md)
enables you to push the results of a pipeline to an
external datastore such as Amazon S3, Google Cloud Storage, or
Azure Blob Storage. After the user code has finished running, but
before the job is marked as successful, Pachyderm pushes the data
to the specified destination.

You can specify the following `egress` protocols for the
corresponding storage:

| Cloud Platform | Protocol | Description |
| -------------- | -------- | ----------- |
| Google Cloud <br>Storage | `gs://` | GCP provides the `gsutil` utility to access GCP storage resources <br>from a CLI. This utility uses the `gs://` prefix to access these resources. <br>**Example:**<br> `gs://gs-bucket/gs-dir` |
| Amazon S3 | `s3://` | The Amazon S3 storage protocol requires you to specify an `s3://`<br>prefix before the address of an Amazon resource. A valid address must <br>include an endpoint and a bucket, and, optionally, a directory in your <br>Amazon storage. <br>**Example:**<br> `s3://s3-endpoint/s3-bucket/s3-dir` |
| Azure Blob <br>Storage | `wasb://` | Microsoft Windows Azure Storage Blob (WASB) is the default Azure <br>filesystem that outputs your data through `HDInsight`. To output your <br>data to Azure Blob Storage, use the `wasb://` prefix, the container name, <br>and your storage account in the path to your directory. <br>**Example:**<br>`wasb://default-container@storage-account/az-dir` |

!!! example
    ```json
    "egress": {
       "URL": "s3://bucket/dir"
    },
    ```