# Load Your Data Into Pachyderm

!!! info
    Before you read this section, make sure that you are familiar with
    the [Data Concepts](../concepts/data-concepts/index.md) and
    [Pipeline Concepts](../concepts/pipeline-concepts/index.md).

The data that you commit to Pachyderm is stored in an object store of your
choice, such as Amazon S3, MinIO, or Google Cloud Storage. Pachyderm
records the cryptographic hash (`SHA`) of each portion of your data and stores
it as a commit with a unique identifier (ID). Although the data is
stored as an unstructured blob, Pachyderm enables you to interact
with versioned data as you typically do in a standard file system.

Pachyderm stores versioned data in repositories, which can contain one or
more files as well as files arranged in directories. Regardless of the
repository structure, Pachyderm versions the state of each data repository
as the data changes over time.

To put data into Pachyderm, a commit must be *started*, or *opened*.
Data can then be put into Pachyderm as part of that open commit and is
available once the commit is *finished*, or *closed*.
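
A minimal sketch of this lifecycle with the `pachctl` CLI, which is covered in
detail below, assuming a hypothetical repository named `images` already exists:

```shell
# Open a commit on the master branch, add a file, then close the commit.
$ pachctl start commit images@master
$ pachctl put file images@master:/cat.png -f cat.png
$ pachctl finish commit images@master
```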

Pachyderm provides the following options to load data:

* By using the `pachctl put file` command. This option is great for testing,
development, integration with CI/CD, and for users who prefer scripting.
See [Load Your Data by Using `pachctl`](#load-your-data-by-using-pachctl).

* By creating a pipeline to pull data from an outside source.
Because Pachyderm pipelines can be any arbitrary code that runs
in a Docker container, you can call out to external APIs or data
sources and pull in data from there. Your pipeline code can be
triggered on demand or
continuously with the following special types of pipelines:

  * **Spout:** A spout enables you to continuously load
  streaming data from a source, such as a messaging system
  or message queue, into Pachyderm.
  See [Spout](../concepts/pipeline-concepts/pipeline/spout.md).

  * **Cron:** A cron triggers your pipeline periodically based on the
  interval that you configure in your pipeline spec.
  See [Cron](../concepts/pipeline-concepts/pipeline/cron.md).

  **Note:** Pipelines enable you to do much more than just ingest
  data into Pachyderm. Pipelines can run all kinds of data transformations
  on your input data sources, such as a Pachyderm repository, and be
  configured to run your code automatically as new data is committed.
  For more information, see
  [Pipeline](../concepts/pipeline-concepts/pipeline/index.md).

* By using a Pachyderm language client. This option is ideal
for Go or Python users who want to push data into Pachyderm from
services or applications written in those languages. If your
favorite language is not in the list of supported language clients,
you can still use the protobuf API, which supports many other languages.
See [Pachyderm Language Clients](../reference/clients.md).

If you are using the Pachyderm Enterprise version, you can use these
additional options:

* By using the S3 gateway. This option works well with existing tools
and libraries that interact with S3-compatible object stores.
See [Using the S3 Gateway](./s3gateway.md) and the sketch below this list.

* By using the Pachyderm dashboard. The Pachyderm Enterprise dashboard
provides a convenient way to upload data right from the UI.
<!--TBA link to the PachHub tutorial-->

!!! note
    In the Pachyderm UI, you can only specify an S3 data source.
    Uploading data from your local device is not supported.
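
For the S3 gateway option, any S3-compatible client can load data into a
repository. The following sketch uses the AWS CLI and assumes a hypothetical
deployment where the gateway is reachable at `localhost:30600` and exposes
each branch of a repository as a bucket named `<branch>.<repo>`; see
[Using the S3 Gateway](./s3gateway.md) for the details of your deployment:

```shell
# Hypothetical endpoint and repository; the gateway address and port
# depend on your deployment.
$ aws --endpoint-url http://localhost:30600 s3 cp ./sales.csv s3://master.raw-data/sales.csv
```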

## Load Your Data by Using `pachctl`

The `pachctl put file` command enables you to do everything from
loading local files into Pachyderm to pulling data from an existing object
store bucket and extracting data from a website. With
`pachctl put file`, you can append new data to existing data or
overwrite the existing data. You configure this behavior with the flags
available for this command. Run `pachctl put file --help` to
view the complete list of flags that you can specify.
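
As a sketch of the append and overwrite behavior, assuming a hypothetical
repository named `logs` and the `--overwrite` flag listed by
`pachctl put file --help`:

```shell
# By default, putting a file at an existing path appends to its content.
$ pachctl put file logs@master:/app.log -f app.log

# The --overwrite flag replaces the previous content of the file instead.
$ pachctl put file logs@master:/app.log -f app.log --overwrite
```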

To load your data into Pachyderm by using `pachctl`, you first need to create
one or more data repositories. Then, you can use the `pachctl put file`
command to put your data into the created repository.

In Pachyderm, you can *start* and *finish* commits. If you just
run `pachctl put file` and no open commit exists, Pachyderm starts a new
commit, adds the data at the path that you specified in your command, and
finishes the commit. This is called an atomic commit.

Alternatively, you can run `pachctl start commit` to start a new commit.
Then, add your data in multiple `put file` calls, and finally, when ready,
close the commit by running `pachctl finish commit`.

To load your data into a repository, complete the following steps:

1. Create a Pachyderm repository:

   ```shell
   $ pachctl create repo <repo name>
   ```

1. Select from the following options:

   * To start and finish an atomic commit, run:

     ```shell
     $ pachctl put file <repo>@<branch>:</path/to/file1> -f <file1>
     ```

   * To start a commit and add data in iterations:

     1. Start a commit:

        ```shell
        $ pachctl start commit <repo>@<branch>
        ```

     1. Put your data:

        ```shell
        $ pachctl put file <repo>@<branch>:</path/to/file1> -f <file1>
        ```

     1. Work on your changes, and when ready, put more data:

        ```shell
        $ pachctl put file <repo>@<branch>:</path/to/file2> -f <file2>
        ```

     1. Close the commit:

        ```shell
        $ pachctl finish commit <repo>@<branch>
        ```
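
After you finish the commit, you can verify the result. The following sketch
assumes a hypothetical repository named `images`:

```shell
# List the commits in the repository; the finished commit appears at the top.
$ pachctl list commit images

# List the files stored at the head of the master branch.
$ pachctl list file images@master
```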

## Filepath Format

In Pachyderm, you specify the path to a file by using the `-f` option. A
filepath can be a local path or a URL to an external resource. You can add
multiple files or directories by using the `-i` option. To add the contents
of a directory, use the `-r` flag.

The following examples show `pachctl put file` commands with
various filepaths and data sources:

* Put data from a URL:

  ```shell
  $ pachctl put file <repo>@<branch>:</path/to/file> -f http://url_path
  ```

* Put data from an object store. You can use `s3://`, `gcs://`, or `as://`
in your filepath:

  ```shell
  $ pachctl put file <repo>@<branch>:</path/to/file> -f s3://object_store_url
  ```

!!! note
    If you are configuring a local cluster to access an S3 bucket,
    you first need to deploy a Kubernetes `Secret` for the selected object
    store.

* Add multiple files at once by using the `-i` option or multiple `-f` flags.
In the case of `-i`, the target file must be a list of files, paths, or URLs
that you want to input all at once:

  ```shell
  $ pachctl put file <repo>@<branch> -i <file containing list of files, paths, or URLs>
  ```

* Input data from stdin into a data repository by using a pipe:

  ```shell
  $ echo "data" | pachctl put file <repo>@<branch> -f </path/to/file>
  ```

* Add an entire directory or all of the contents at a particular URL, whether
an HTTP(S) URL or an object store URL (`s3://`, `gcs://`, or `as://`), by
using the recursive flag `-r`:

  ```shell
  $ pachctl put file <repo>@<branch> -r -f <dir>
  ```
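
As a concrete sketch of the `-i` option, assuming a hypothetical repository
named `images` and a manifest file that lists one path or URL per line:

```shell
# files.txt lists the objects to load, one per line (hypothetical locations).
$ cat files.txt
s3://my-bucket/images/cat.png
https://example.com/images/dog.png

# Load everything listed in the manifest in a single call.
$ pachctl put file images@master -i files.txt
```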

## Loading Your Data Partially

Depending on your use case, you might decide not to import all of your
data into Pachyderm but only store and apply version control to some
of it. For example, if you have a 10 PB dataset, loading the
whole dataset into Pachyderm is a costly operation that takes
a lot of time and resources. To optimize performance and the
use of resources, you might decide to load some of this data into
Pachyderm, leaving the rest of it in its original source.

One possible way of doing this is to add a metadata file that contains a
URL to the specific file or directory in your dataset to a Pachyderm
repository and refer to that file in your pipeline.
Your pipeline code reads the URL or path and retrieves the data from the
external source as needed for processing instead of
preloading it all into a Pachyderm repo. This method works
particularly well for mostly immutable data because, in this case,
Pachyderm does not keep versions of the source files, but it does keep
track of the provenance of the resulting output commits in its
version-control system.
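
A minimal sketch of this pattern, assuming a hypothetical bucket and
repository: only a small manifest file is versioned in Pachyderm, while the
large objects stay in the object store and are fetched by your pipeline code
when it processes the manifest.

```shell
# manifest.txt records where the raw data lives (hypothetical locations);
# the data itself is not copied into Pachyderm.
$ cat manifest.txt
s3://my-bucket/big-dataset/part-0001.parquet
s3://my-bucket/big-dataset/part-0002.parquet

# Version only the manifest. A pipeline that takes this repo as input
# reads the URLs and downloads the objects it needs at processing time.
$ pachctl create repo dataset-metadata
$ pachctl put file dataset-metadata@master:/manifest.txt -f manifest.txt
```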