github.com/pachyderm/pachyderm@v1.13.4/doc/docs/1.10.x/how-tos/export-data-out-pachyderm.md

# Export Your Data From Pachyderm

After you build a pipeline, you probably want to see the
results that the pipeline has produced. Every commit into an
input repository results in a corresponding commit into an
output repository.

To access the results of
a pipeline, you can use one of the following methods:

* By running the `pachctl get file` command. This
command returns the contents of the specified file.<br>
To get the list of files in a repo, first
run the `pachctl list file` command.<br>
See [Export Your Data with `pachctl`](#export-your-data-with-pachctl).

* By configuring the pipeline. A pipeline can push or expose
output data to external sources. You can configure the following
data exporting methods in a Pachyderm pipeline:

  * The `egress` property, which exports your data to
  an external datastore, such as Amazon S3,
  Google Cloud Storage, and others.<br>
  See [Export Your Data with `egress`](#export-your-data-with-egress).

  * A service. A Pachyderm service exposes the results of the
  pipeline processing on a specific port in the form of a dashboard
  or similar endpoint.<br>
  See [Service](../concepts/pipeline-concepts/pipeline/service.md).

  * Custom code that connects to an external data source.
  Because a pipeline runs your code in a Docker container,
  you can egress your data to any data source, even to those that the
  `egress` field does not support, by connecting to that source from
  within your code.

* By using the S3 gateway. Pachyderm Enterprise users can reuse
  their existing tools and libraries that work with object stores
  to export their data through the S3 gateway.<br>
  See [Using the S3 Gateway](../../deploy-manage/manage/s3gateway/).

## Export Your Data with `pachctl`

The `pachctl get file` command enables you to get the contents
of a file in a Pachyderm repository. You need to know the file
path to specify it in the command.

To export your data with `pachctl`:

1. Get the list of files in the repository:

   ```shell
   pachctl list file <repo>@<branch>
   ```

   If you do not know which commit you need, you can list the
   commits on the branch with `pachctl list commit`:

   **Example:**

   ```shell
   pachctl list commit data@master
   ```

   **System Response:**

   ```shell
   REPO   BRANCH COMMIT                           PARENT                           STARTED           DURATION           SIZE
   data master 230103d3c6bd45b483ab6d0b7ae858d5 f82b76f463ca4799817717a49ab74fac 2 seconds ago  Less than a second 750B
   data master f82b76f463ca4799817717a49ab74fac <none>                           40 seconds ago Less than a second 375B
   ```

1. Get the contents of a specific file:

   ```shell
   pachctl get file <repo>@<branch>:<path/to/file>
   ```

   **Example:**

   ```shell
   pachctl get file data@master:user_data.csv
   ```

   **System Response:**

   ```shell
   1,cyukhtin0@stumbleupon.com,144.155.176.12
   2,csisneros1@over-blog.com,26.119.26.5
   3,jeye2@instagram.com,13.165.230.106
   4,rnollet3@hexun.com,58.52.147.83
   5,bposkitt4@irs.gov,51.247.120.167
   6,vvenmore5@hubpages.com,161.189.245.212
   7,lcoyte6@ask.com,56.13.147.134
   8,atuke7@psu.edu,78.178.247.163
   9,nmorrell8@howstuffworks.com,28.172.10.170
   10,afynn9@google.com.au,166.14.112.65
   ```

   You can also view the parent, grandparent, or any earlier
   revision by appending the caret (`^`) symbol, optionally followed
   by a number that identifies the ancestor in sequence:

   * To view the parent of a commit:

     1. List files in the parent commit:

        ```shell
        pachctl list file <repo>@<branch-or-commit>^:<path/to/file>
        ```

     1. Get the contents of a file:

        ```shell
        pachctl get file <repo>@<branch-or-commit>^:<path/to/file>
        ```

   * To view the `<n>`th ancestor of a commit:

     1. List files in the ancestor commit:

        ```shell
        pachctl list file <repo>@<branch-or-commit>^<n>:<path/to/file>
        ```

        **System Response:**

        ```shell
        NAME           TYPE SIZE
        /user_data.csv file 375B
        ```

     1. Get the contents of a file:

        ```shell
        pachctl get file <repo>@<branch-or-commit>^<n>:<path/to/file>
        ```

        **Example:**

        ```shell
        pachctl get file data@master^4:user_data.csv
        ```

     You can specify any number in the `^<n>` notation. If the file
     exists in that commit, Pachyderm returns it. If the file
     does not exist in that revision, Pachyderm returns an error
     message instead:

     ```shell
     pachctl get file <repo>@<branch-or-commit>^<n>:<path/to/file>
     ```

     **System Response:**

     ```shell
     file "<path/to/file>" not found
     ```

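The steps above can be combined into a short script that saves a file and a few of its earlier revisions to local disk. This is a minimal sketch, not an official workflow: the `pfs_ref` helper is hypothetical (it is not part of `pachctl`), the `data` repo and `user_data.csv` file are taken from the examples above, and the local output filenames are assumptions. The `pachctl` calls run only if `pachctl` is found on your `PATH`.

```shell
# pfs_ref is a hypothetical helper (not part of pachctl). It builds a
# reference of the form <repo>@<branch-or-commit>^<n>:<path>.
# Arguments: $1=repo  $2=branch or commit  $3=ancestor count  $4=file path
pfs_ref() {
  if [ "$3" -eq 0 ]; then
    # No caret suffix: address the commit itself.
    printf '%s@%s:%s\n' "$1" "$2" "$4"
  else
    printf '%s@%s^%s:%s\n' "$1" "$2" "$3" "$4"
  fi
}

# Save the head revision (0), its parent (1), and its grandparent (2).
if command -v pachctl >/dev/null 2>&1; then
  for n in 0 1 2; do
    ref="$(pfs_ref data master "$n" user_data.csv)"
    out="user_data.rev${n}.csv"
    # Redirect the file contents to a local file. Stop at the first
    # revision in which the file does not exist and remove the empty file.
    if ! pachctl get file "$ref" > "$out"; then
      rm -f "$out"
      break
    fi
  done
fi
```

For example, `pfs_ref data master 2 user_data.csv` prints `data@master^2:user_data.csv`, which is the reference that `pachctl get file` expects for the grandparent commit.
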
## Export Your Data with `egress`

The `egress` field in the Pachyderm [pipeline specification](../reference/pipeline_spec.md)
enables you to push the results of a pipeline to an
external datastore such as Amazon S3, Google Cloud Storage, or
Azure Blob Storage. After the user code has finished running, but
before the job is marked as successful, Pachyderm pushes the data
to the specified destination.

You can specify the following `egress` protocols for the
corresponding storage:

| Cloud Platform | Protocol | Description |
| -------------- | -------- | ----------- |
| Google Cloud <br>Storage | `gs://` | GCP uses the utility called `gsutil` to access GCP storage resources <br> from a CLI. This utility uses the `gs://` prefix to access these resources. <br>**Example:**<br> `gs://gs-bucket/gs-dir` |
| Amazon S3 | `s3://` | The Amazon S3 storage protocol requires you to specify an `s3://`<br>prefix before the address of an Amazon resource. A valid address must <br>include an endpoint and a bucket, and, optionally, a directory in your <br>Amazon storage. <br>**Example:**<br> `s3://s3-endpoint/s3-bucket/s3-dir` |
| Azure Blob <br>Storage | `wasb://` | Microsoft Windows Azure Storage Blob (WASB) is the default Azure <br>filesystem that outputs your data through `HDInsight`. To output your <br>data to Azure Blob Storage, use the `wasb://` prefix, the container name, <br>and your storage account in the path to your directory. <br>**Example:**<br>`wasb://default-container@storage-account/az-dir` |

!!! example
    ```json
    "egress": {
       "URL": "s3://bucket/dir"
    },
    ```
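
To show where the `egress` field sits within a complete pipeline specification, the following sketch writes a spec file and, if `pachctl` is installed, creates the pipeline from it. Everything specific here is a placeholder assumption rather than part of the original example: the pipeline name `export-example`, the input repo `data`, the `ubuntu:18.04` image, the copy command, and the `s3://bucket/dir` destination. Adjust them to match your own deployment.

```shell
# Write a hypothetical pipeline spec. The pipeline copies its input to
# /pfs/out, and the egress field pushes that output to the given URL.
cat > export-pipeline.json <<'EOF'
{
  "pipeline": {
    "name": "export-example"
  },
  "input": {
    "pfs": {
      "repo": "data",
      "glob": "/*"
    }
  },
  "transform": {
    "image": "ubuntu:18.04",
    "cmd": ["sh", "-c", "cp /pfs/data/* /pfs/out/"]
  },
  "egress": {
    "URL": "s3://bucket/dir"
  }
}
EOF

# Create the pipeline only when pachctl is available on this machine.
if command -v pachctl >/dev/null 2>&1; then
  pachctl create pipeline -f export-pipeline.json
fi
```

Keeping the spec in a file makes it straightforward to switch the destination by editing only the `URL` value, for example to a `gs://` or `wasb://` address from the table above.
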