# Load Your Data Into Pachyderm

!!! info
    Before you read this section, make sure that you are familiar with
    the [Data Concepts](../concepts/data-concepts/index.md) and
    [Pipeline Concepts](../concepts/pipeline-concepts/index.md).

The data that you commit to Pachyderm is stored in an object store of your
choice, such as Amazon S3, MinIO, or Google Cloud Storage. Pachyderm
records the cryptographic hash (`SHA`) of each portion of your data and stores
it as a commit with a unique identifier (ID). Although the data is
stored as an unstructured blob, Pachyderm enables you to interact
with versioned data as you typically do in a standard file system.

Pachyderm stores versioned data in repositories, which can contain one or
more files, as well as files arranged in directories. Regardless of the
repository structure, Pachyderm versions the state of each data repository
as the data changes over time.

To put data into Pachyderm, a commit must be *started*, or *opened*.
Data can then be put into Pachyderm as part of that open commit and is
available once the commit is *finished*, or *closed*.

Pachyderm provides the following options to load data:

* By using the `pachctl put file` command. This option is great for testing,
  development, integration with CI/CD, and for users who prefer scripting.
  See [Load Your Data by Using `pachctl`](#load-your-data-by-using-pachctl).

* By creating a special type of pipeline that pulls data from an
  outside source. Because Pachyderm pipelines can run arbitrary code
  in a Docker container, you can call out to external APIs or data
  sources and pull in data from there. Your pipeline code can be
  triggered on demand or continuously with the following special types
  of pipelines:

    * **Spout:** A spout enables you to continuously load
      streaming data from a streaming data source, such as a messaging
      system or message queue, into Pachyderm.
      See [Spout](../concepts/pipeline-concepts/pipeline/spout.md).

    * **Cron:** A cron triggers your pipeline periodically based on the
      interval that you configure in your pipeline spec.
      See [Cron](../concepts/pipeline-concepts/pipeline/cron.md).

    **Note:** Pipelines enable you to do much more than just ingest
    data into Pachyderm. Pipelines can run all kinds of data transformations
    on your input data sources, such as a Pachyderm repository, and can be
    configured to run your code automatically as new data is committed.
    For more information, see
    [Pipeline](../concepts/pipeline-concepts/pipeline/index.md).

* By using a Pachyderm language client. This option is ideal
  for Go or Python users who want to push data into Pachyderm from
  services or applications written in those languages. If you do not find
  your favorite language in the list of supported language clients,
  note that Pachyderm exposes a protobuf API that supports many other
  languages. See [Pachyderm Language Clients](../reference/clients.md).

If you are using the Pachyderm Enterprise version, you can use these
additional options:

* By using the S3 gateway. This option is great for use with existing tools
  and libraries that interact with S3-compatible object stores.
  See [Using the S3 Gateway](../../deploy-manage/manage/s3gateway/).

* By using the Pachyderm dashboard. The Pachyderm Enterprise dashboard
  provides a convenient way to upload data right from the UI.
  <!--TBA link to the PachHub tutorial-->

    !!! note
        In the Pachyderm UI, you can only specify an S3 data source.
        Uploading data from your local device is not supported.
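As a sketch of the cron option above, a periodic ingress pipeline spec might look like the following. The pipeline name, Docker image, download URL, and `@every 60s` interval are all illustrative placeholders, not values prescribed by Pachyderm:

```json
{
  "pipeline": {
    "name": "fetch-data"
  },
  "input": {
    "cron": {
      "name": "tick",
      "spec": "@every 60s"
    }
  },
  "transform": {
    "image": "curlimages/curl",
    "cmd": ["sh"],
    "stdin": [
      "curl -fsSL https://example.com/dataset.csv -o /pfs/out/dataset.csv"
    ]
  }
}
```

You would create such a pipeline with `pachctl create pipeline -f <spec file>`; every time the cron input ticks, the transform runs and whatever it writes to `/pfs/out` is committed to the pipeline's output repository.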
## Load Your Data by Using `pachctl`

The `pachctl put file` command enables you to do everything from
loading local files into Pachyderm to pulling data from an existing object
store bucket and extracting data from a website. With
`pachctl put file`, you can append new data to existing data or
overwrite existing data. All these options can be configured by using
the flags available with this command. Run `pachctl put file --help` to
view the complete list of flags that you can specify.

To load your data into Pachyderm by using `pachctl`, you first need to create
one or more data repositories. Then, you can use the `pachctl put file`
command to put your data into the created repository.

In Pachyderm, you can *start* and *finish* commits. If you just
run `pachctl put file` and no open commit exists, Pachyderm starts a new
commit, adds the data at the path that you specified in your command, and
finishes the commit. This is called an atomic commit.

Alternatively, you can run `pachctl start commit` to start a new commit.
Then, add your data in multiple `put file` calls, and finally, when ready,
close the commit by running `pachctl finish commit`.

To load your data into a repository, complete the following steps:

1. Create a Pachyderm repository:

    ```shell
    pachctl create repo <repo name>
    ```

1. Select from the following options:

    * To start and finish an atomic commit, run:

        ```shell
        pachctl put file <repo>@<branch>:</path/to/file1> -f <file1>
        ```

    * To start a commit and add data in iterations:

        1. Start a commit:

            ```shell
            pachctl start commit <repo>@<branch>
            ```

        1. Put your data:

            ```shell
            pachctl put file <repo>@<branch>:</path/to/file1> -f <file1>
            ```

        1. Work on your changes, and when ready, put more data:

            ```shell
            pachctl put file <repo>@<branch>:</path/to/file2> -f <file2>
            ```

        1. Close the commit:

            ```shell
            pachctl finish commit <repo>@<branch>
            ```

## Filepath Format

In Pachyderm, you specify the path to a file by using the `-f` option. A path
to a file can be a local path or a URL to an external resource. You can add
multiple files or directories by using the `-i` option. To add the contents
of a directory, use the `-r` flag.

The following examples show `pachctl put file` commands with
various filepaths and data sources:

* Put data from a URL:

    ```shell
    pachctl put file <repo>@<branch>:</path/to/file> -f http://url_path
    ```

* Put data from an object store. You can use `s3://`, `gcs://`, or `as://`
  in your filepath:

    ```shell
    pachctl put file <repo>@<branch>:</path/to/file> -f s3://object_store_url
    ```

    !!! note
        If you are configuring a local cluster to access an S3 bucket,
        you need to first deploy a Kubernetes `Secret` for the selected
        object store.

* Add multiple files at once by using the `-i` option or multiple `-f` flags.
  In the case of `-i`, the target file must be a list of files, paths, or URLs
  that you want to input all at once:

    ```shell
    pachctl put file <repo>@<branch> -i <file containing list of files, paths, or URLs>
    ```

* Pipe data from stdin into a data repository:

    ```shell
    echo "data" | pachctl put file <repo>@<branch>:</path/to/file>
    ```

* Add an entire directory or all of the contents at a particular URL, either
  an HTTP(S) URL or an object store URL (`s3://`, `gcs://`, or `as://`), by
  using the recursive flag, `-r`:

    ```shell
    pachctl put file <repo>@<branch> -r -f <dir>
    ```

## Loading Your Data Partially

Depending on your use case, you might decide not to import all of your
data into Pachyderm but only store and apply version control to some
of it. For example, if you have a 10 PB dataset, loading the
whole dataset into Pachyderm is a costly operation that takes
a lot of time and resources. To optimize performance and the
use of resources, you might decide to load some of this data into
Pachyderm, leaving the rest of it in its original source.

One possible way of doing this is to add to a Pachyderm repository a small
metadata file that contains the URL of the specific file or directory in
your dataset, and to refer to that file in your pipeline.
Your pipeline code would read the URL or path in the external data
source and retrieve that data as needed for processing, instead of
needing to preload it all into a Pachyderm repo. This method works
particularly well for mostly immutable data because, in this case,
Pachyderm does not keep versions of the source file, but it does keep
track of the provenance of the resulting output commits in its
version-control system.
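The metadata-file pattern above can be sketched in a few lines of shell. In a real pipeline, the metadata file would be read from an input mount such as `/pfs/<repo>/` and the results written to `/pfs/out/`; the temp-directory layout, the `manifest.txt` name, and the use of `cp` for local paths are illustrative stand-ins so that the sketch runs anywhere:

```shell
#!/bin/sh
# Sketch: Pachyderm versions only a small manifest listing where the real
# data lives; the pipeline code retrieves the data on demand.
set -eu

work=$(mktemp -d)
mkdir -p "$work/out"

# Stand-in for a large external file that stays in its original source.
echo "external dataset contents" > "$work/big-file.txt"

# The only object committed to the Pachyderm repo: a list of paths or URLs.
printf '%s/big-file.txt\n' "$work" > "$work/manifest.txt"

# Pipeline code: read each entry and retrieve the data as needed.
# For http(s) or s3 URLs you would fetch with curl, wget, or an SDK
# instead of cp.
while read -r src; do
  [ -n "$src" ] || continue        # skip blank lines
  cp "$src" "$work/out/"
done < "$work/manifest.txt"

cat "$work/out/big-file.txt"       # prints: external dataset contents
```

Because only the manifest is committed, Pachyderm versions and tracks provenance for the manifest and the pipeline's output commits, while the bulk data itself never enters the object store backing Pachyderm.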