# Load Your Data Into Pachyderm

!!! info
    Before you read this section, make sure that you are familiar with
    the [Data Concepts](../concepts/data-concepts/index.md) and
    [Pipeline Concepts](../concepts/pipeline-concepts/index.md).

The data that you commit to Pachyderm is stored in an object store of your
choice, such as Amazon S3, MinIO, or Google Cloud Storage. Pachyderm
records the cryptographic hash (`SHA`) of each portion of your data and stores
it as a commit with a unique identifier (ID). Although the data is
stored as an unstructured blob, Pachyderm enables you to interact
with versioned data as you typically do in a standard file system.

Pachyderm stores versioned data in repositories, which can contain one or
more files as well as files arranged in directories. Regardless of the
repository structure, Pachyderm versions the state of each data repository
as the data changes over time.

To put data into Pachyderm, a commit must be *started*, or *opened*.
Data can then be put into Pachyderm as part of that open commit and is
available once the commit is *finished*, or *closed*.

Pachyderm provides the following options to load data:

* By using the `pachctl put file` command. This option is great for testing,
development, integration with CI/CD, and for users who prefer scripting.
See [Load Your Data by Using `pachctl`](#load-your-data-by-using-pachctl).

* By creating a special type of pipeline that pulls data from an
outside source.
Because Pachyderm pipelines can be any arbitrary code that runs
in a Docker container, you can call out to external APIs or data
sources and pull in data from there. Your pipeline code can be
triggered on demand or run
continuously with the following special types of pipelines:

  * **Spout:** A spout enables you to continuously load streaming data
  into Pachyderm from a source such as a messaging system or message
  queue.
  See [Spout](../concepts/pipeline-concepts/pipeline/spout.md).

  * **Cron:** A cron triggers your pipeline periodically based on the
  interval that you configure in your pipeline spec. A minimal cron
  pipeline is sketched below, after this list of options.
  See [Cron](../concepts/pipeline-concepts/pipeline/cron.md).

  **Note:** Pipelines enable you to do much more than just ingesting
  data into Pachyderm. Pipelines can run all kinds of data transformations
  on your input data sources, such as a Pachyderm repository, and be
  configured to run your code automatically as new data is committed.
  For more information, see
  [Pipeline](../concepts/pipeline-concepts/pipeline/index.md).

* By using a Pachyderm language client. This option is ideal
for Go or Python users who want to push data into Pachyderm from
services or applications written in those languages. If your
favorite language is not in the list of supported language clients,
you can use Pachyderm's protobuf API, which supports many other languages.
See [Pachyderm Language Clients](../reference/clients.md).

If you are using the Pachyderm Enterprise version, you can use these
additional options:

* By using the S3 gateway. This option works well with existing tools
and libraries that interact with S3-compatible object stores.
See [Using the S3 Gateway](../../deploy-manage/manage/s3gateway/).

* By using the Pachyderm dashboard. The Pachyderm Enterprise dashboard
provides a convenient way to upload data right from the UI.
<!--TBA link to the PachHub tutorial-->

!!! note
    In the Pachyderm UI, you can only specify an S3 data source.
    Uploading data from your local device is not supported.

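As an illustration of the cron option described above, the following is a
minimal sketch of a pipeline that ingests data on a schedule. The
`periodic-ingest` name, the `my-ingest-image` image, and its `/ingest.sh`
script are hypothetical placeholders for your own ingestion code:

```shell
# A minimal sketch of a cron pipeline that pulls data in on a schedule.
# The image "my-ingest-image" and its /ingest.sh script are hypothetical
# placeholders; the ingestion code writes its results to /pfs/out.
pachctl create pipeline -f - <<EOF
{
  "pipeline": {"name": "periodic-ingest"},
  "input": {
    "cron": {"name": "tick", "spec": "@every 1h"}
  },
  "transform": {
    "image": "my-ingest-image",
    "cmd": ["sh"],
    "stdin": ["/ingest.sh /pfs/out"]
  }
}
EOF
```
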
## Load Your Data by Using `pachctl`

The `pachctl put file` command enables you to do everything from
loading local files into Pachyderm to pulling data from an existing object
store bucket or extracting data from a website. With
`pachctl put file`, you can append new data to existing data or
overwrite it. You configure these behaviors with the flags
available for this command. Run `pachctl put file --help` to
view the complete list of flags that you can specify.

To load your data into Pachyderm by using `pachctl`, you first need to create
one or more data repositories. Then, you can use the `pachctl put file`
command to put your data into the created repository.

In Pachyderm, you can *start* and *finish* commits. If you just
run `pachctl put file` and no open commit exists, Pachyderm starts a new
commit, adds the data at the path that you specified in your command, and
finishes the commit. This is called an atomic commit.

Alternatively, you can run `pachctl start commit` to start a new commit.
Then, add your data in multiple `put file` calls, and finally, when ready,
close the commit by running `pachctl finish commit`.

To load your data into a repository, complete the following steps:

1. Create a Pachyderm repository:

   ```shell
   pachctl create repo <repo name>
   ```

1. Select from the following options:

   * To start and finish an atomic commit, run:

     ```shell
     pachctl put file <repo>@<branch>:</path/to/file1> -f <file1>
     ```

   * To start a commit and add data in iterations:

     1. Start a commit:

        ```shell
        pachctl start commit <repo>@<branch>
        ```

     1. Put your data:

        ```shell
        pachctl put file <repo>@<branch>:</path/to/file1> -f <file1>
        ```

     1. Work on your changes, and when ready, put more data:

        ```shell
        pachctl put file <repo>@<branch>:</path/to/file2> -f <file2>
        ```

     1. Close the commit:

        ```shell
        pachctl finish commit <repo>@<branch>
        ```

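In either case, after the commit is finished, you can confirm that the data
is in the repository with the standard `pachctl` inspection commands:

```shell
# List the commits on the branch and the files in its head commit.
pachctl list commit <repo>@<branch>
pachctl list file <repo>@<branch>
```
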
## Filepath Format

In Pachyderm, you specify the path to a file by using the `-f` option. The
path can be a local path or a URL to an external resource. You can add
multiple files or directories by using the `-i` option. To add the contents
of a directory, use the `-r` flag.

The following examples show `pachctl put file` commands with
various filepaths and data sources:

* Put data from a URL:

  ```shell
  pachctl put file <repo>@<branch>:</path/to/file> -f http://url_path
  ```

* Put data from an object store. You can use `s3://`, `gcs://`, or `as://`
in your filepath:

  ```shell
  pachctl put file <repo>@<branch>:</path/to/file> -f s3://object_store_url
  ```

!!! note
    If you are configuring a local cluster to access an S3 bucket,
    you first need to deploy a Kubernetes `Secret` for the selected object
    store.

* Add multiple files at once by using the `-i` option or multiple `-f` flags.
In the case of `-i`, the target file must be a list of files, paths, or URLs
that you want to input all at once (a sketch that uses multiple `-f` flags
appears after this list):

  ```shell
  pachctl put file <repo>@<branch> -i <file containing list of files, paths, or URLs>
  ```

* Input data from stdin into a data repository by using a pipe:

  ```shell
  echo "data" | pachctl put file <repo>@<branch> -f </path/to/file>
  ```

* Add an entire directory or all of the contents at a particular URL, either
an HTTP(S) URL or an object store URL (`s3://`, `gcs://`, or `as://`), by
using the recursive flag, `-r`:

  ```shell
  pachctl put file <repo>@<branch> -r -f <dir>
  ```

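As mentioned in the list above, you can also pass the `-f` flag several times
in a single call; the file names below are hypothetical:

```shell
# Put several local files into one commit with repeated -f flags.
pachctl put file <repo>@<branch> -f data/file1.csv -f data/file2.csv
```
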
## Loading Your Data Partially

Depending on your use case, you might decide not to import all of your
data into Pachyderm but to store and apply version control to only some
of it. For example, if you have a 10 PB dataset, loading the
whole dataset into Pachyderm is a costly operation that takes
a lot of time and resources. To optimize performance and the
use of resources, you might decide to load some of this data into
Pachyderm, leaving the rest of it in its original source.

One possible way of doing this is to add a metadata file, which contains a
URL to the specific file or directory in your dataset, to a Pachyderm
repository and refer to that file in your pipeline.
Your pipeline code then reads the URL or path in the external data
source and retrieves that data as needed for processing instead of
needing to preload it all into a Pachyderm repo. This method works
particularly well for mostly immutable data because, in this case,
Pachyderm does not keep versions of the source file, but it does keep
track of the provenance of the resulting output commits in its
version-control system.
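
For example, a minimal sketch of this approach might look like the following.
The `raw-data` repository and the bucket path are hypothetical, and the
pipeline code that reads the pointer file would live in your own Docker image:

```shell
# Version only a small pointer file instead of the data itself.
echo "s3://my-bucket/videos/batch-2020-06/" > pointer.txt
pachctl put file raw-data@master:/pointer.txt -f pointer.txt

# A pipeline that takes the raw-data repo as input would read
# /pfs/raw-data/pointer.txt and fetch the referenced objects directly
# from the object store at processing time.
```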