# Load Your Data Into Pachyderm

!!! info
    Before you read this section, make sure that you are familiar with
    the [Data Concepts](../concepts/data-concepts/index.md) and
    [Pipeline Concepts](../concepts/pipeline-concepts/index.md).

The data that you commit to Pachyderm is stored in an object store of your
choice, such as Amazon S3, MinIO, or Google Cloud Storage. Pachyderm
records the cryptographic hash (`SHA`) of each portion of your data and stores
it as a commit with a unique identifier (ID). Although the data is
stored as an unstructured blob, Pachyderm enables you to interact
with versioned data as you typically do in a standard file system.

Pachyderm stores versioned data in repositories, which can contain one or
more files, as well as files arranged in directories. Regardless of the
repository structure, Pachyderm versions the state of each data repository
as the data changes over time.

To put data into Pachyderm, a commit must be *started*, or *opened*.
Data can then be put into Pachyderm as part of that open commit and is
available once the commit is *finished*, or *closed*.

Pachyderm provides the following options to load data:

* By using the `pachctl put file` command. This option is great for testing,
  development, integration with CI/CD, and for users who prefer scripting.
  See [Load Your Data by Using `pachctl`](#load-your-data-by-using-pachctl).

* By creating a pipeline that pulls data from an outside source.
  Because a Pachyderm pipeline can run any arbitrary code
  in a Docker container, you can call out to external APIs or data
  sources and pull in data from there.
  Your pipeline code can be triggered on-demand or
  continuously with the following special types of pipelines:

    * **Spout:** A spout enables you to continuously load
      streaming data from a streaming data source, such as a messaging system
      or message queue, into Pachyderm.
      See [Spout](../concepts/pipeline-concepts/pipeline/spout.md).

    * **Cron:** A cron input triggers your pipeline periodically based on the
      interval that you configure in your pipeline spec.
      See [Cron](../concepts/pipeline-concepts/pipeline/cron.md).

  **Note:** Pipelines enable you to do much more than just ingest
  data into Pachyderm. Pipelines can run all kinds of data transformations
  on your input data sources, such as a Pachyderm repository, and be
  configured to run your code automatically as new data is committed.
  For more information, see
  [Pipeline](../concepts/pipeline-concepts/pipeline/index.md).

* By using a Pachyderm language client. This option is ideal
  for Go or Python users who want to push data into Pachyderm from
  services or applications written in those languages. If you do not find your
  favorite language in the list of supported language clients, note that
  Pachyderm uses a protobuf API that supports many other languages.
  See [Pachyderm Language Clients](../reference/clients.md).

If you are using the Pachyderm Enterprise version, you can use these
additional options:

* By using the S3 gateway. This option works well with existing tools
  and libraries that interact with S3-compatible object stores.
  See [Using the S3 Gateway](./s3gateway.md).

* By using the Pachyderm dashboard. The Pachyderm Enterprise dashboard
  provides a convenient way to upload data right from the UI.
  <!--TBA link to the PachHub tutorial-->

!!! note
    In the Pachyderm UI, you can only specify an S3 data source.
    Uploading data from your local device is not supported.
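As a concrete illustration of the cron option above, the following sketch writes a minimal cron pipeline spec and shows how it would be submitted. The pipeline name, Docker image, and download URL are hypothetical placeholders, not part of the official documentation:

```shell
# A minimal sketch of a cron pipeline spec (all names and URLs are
# hypothetical). On the configured interval, Pachyderm triggers the
# pipeline, which downloads a file into the pipeline's output repo.
cat > periodic-ingest.json <<'EOF'
{
  "pipeline": {
    "name": "periodic-ingest"
  },
  "input": {
    "cron": {
      "name": "tick",
      "spec": "@every 1h"
    }
  },
  "transform": {
    "image": "curlimages/curl",
    "cmd": ["sh", "-c", "curl -sSf -o /pfs/out/export.csv https://example.com/export.csv"]
  }
}
EOF

# Submit the spec to a running cluster:
# pachctl create pipeline -f periodic-ingest.json
```

Each time the cron input fires, the transform runs and whatever it writes to `/pfs/out` is committed to the pipeline's output repository.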
## Load Your Data by Using `pachctl`

The `pachctl put file` command enables you to do everything from
loading local files into Pachyderm to pulling data from an existing object
store bucket and extracting data from a website. With
`pachctl put file`, you can append new data to existing data or
overwrite existing data. All these options can be configured by using
the flags available with this command. Run `pachctl put file --help` to
view the complete list of flags that you can specify.

To load your data into Pachyderm by using `pachctl`, you first need to create
one or more data repositories. Then, you can use the `pachctl put file`
command to put your data into the created repository.

In Pachyderm, you can *start* and *finish* commits. If you just
run `pachctl put file` and no open commit exists, Pachyderm starts a new
commit, adds the data at the path that you specified in your command, and
finishes the commit. This is called an atomic commit.

Alternatively, you can run `pachctl start commit` to start a new commit.
Then, add your data in multiple `put file` calls, and finally, when ready,
close the commit by running `pachctl finish commit`.

To load your data into a repository, complete the following steps:

1. Create a Pachyderm repository:

    ```shell
    $ pachctl create repo <repo name>
    ```

1. Select from the following options:

    * To start and finish an atomic commit, run:

        ```shell
        $ pachctl put file <repo>@<branch>:</path/to/file1> -f <file1>
        ```

    * To start a commit and add data in iterations:

        1. Start a commit:

            ```shell
            $ pachctl start commit <repo>@<branch>
            ```

        1. Put your data:

            ```shell
            $ pachctl put file <repo>@<branch>:</path/to/file1> -f <file1>
            ```

        1. Work on your changes and, when ready, put more data:

            ```shell
            $ pachctl put file <repo>@<branch>:</path/to/file2> -f <file2>
            ```

        1. Close the commit:

            ```shell
            $ pachctl finish commit <repo>@<branch>
            ```

## Filepath Format

In Pachyderm, you specify the path to a file by using the `-f` option. A path
to a file can be a local path or a URL to an external resource. You can add
multiple files or directories by using the `-i` option. To add the contents
of a directory, use the `-r` flag.

The following examples show `pachctl put file` commands with
various filepaths and data sources:

* Put data from a URL:

    ```shell
    $ pachctl put file <repo>@<branch>:</path/to/file> -f http://url_path
    ```

* Put data from an object store. You can use `s3://`, `gcs://`, or `as://`
  in your filepath:

    ```shell
    $ pachctl put file <repo>@<branch>:</path/to/file> -f s3://object_store_url
    ```

    !!! note
        If you are configuring a local cluster to access an S3 bucket,
        you need to first deploy a Kubernetes `Secret` for the selected object
        store.

* Add multiple files at once by using the `-i` option or multiple `-f` flags.
  In the case of `-i`, the target file must be a list of files, paths, or URLs
  that you want to input all at once:

    ```shell
    $ pachctl put file <repo>@<branch> -i <file containing list of files, paths, or URLs>
    ```

* Input data from stdin into a data repository by using a pipe:

    ```shell
    $ echo "data" | pachctl put file <repo>@<branch> -f </path/to/file>
    ```

* Add an entire directory or all of the contents at a particular URL, either
  an HTTP(S) URL or an object store URL (`s3://`, `gcs://`, or `as://`), by using
  the recursive flag, `-r`:

    ```shell
    $ pachctl put file <repo>@<branch> -r -f <dir>
    ```

## Loading Your Data Partially

Depending on your use case, you might decide not to import all of your
data into Pachyderm but only store and apply version control to some
of it. For example, if you have a 10 PB dataset, loading the
whole dataset into Pachyderm is a costly operation that takes
a lot of time and resources. To optimize performance and the
use of resources, you might decide to load some of this data into
Pachyderm, leaving the rest of it in its original source.

One possible way of doing this is to add a metadata file, with a
URL to the specific file or directory in your dataset, to a Pachyderm
repository and refer to that file in your pipeline.
Your pipeline code would read the URL or path in the external data
source and retrieve that data as needed for processing, instead of
preloading it all into a Pachyderm repo. This method works
particularly well for mostly immutable data because, in this case,
Pachyderm does not keep versions of the source file, but it does keep
track of the provenance of the resulting output commits in its
version-control system.
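The partial-loading pattern above can be sketched as follows. The bucket name, paths, and repository name are hypothetical placeholders, assuming a `metadata` repo already exists:

```shell
# Hypothetical example: instead of committing a large dataset, commit a
# small metadata file that points at the data's location in its original
# object store. The bucket and paths below are placeholders.
cat > dataset-pointer.json <<'EOF'
{
  "source": "s3://my-company-datasets/genomics/run-2019-07",
  "format": "fastq",
  "notes": "Raw reads stay in S3; Pachyderm versions only this pointer."
}
EOF

# Commit the pointer file; pipeline code then reads the "source" URL and
# fetches only the pieces it needs (requires a running cluster):
# pachctl put file metadata@master:/run-2019-07.json -f dataset-pointer.json
```

Only the small pointer file is versioned in Pachyderm, while the pipeline's output commits still carry full provenance back to the commit that introduced the pointer.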