# Load Your Data Into Pachyderm

!!! info
    Before you read this section, make sure that you are familiar with
    the [Data Concepts](../concepts/data-concepts/index.md) and
    [Pipeline Concepts](../concepts/pipeline-concepts/index.md).

The data that you commit to Pachyderm is stored in an object store of your
choice, such as Amazon S3, MinIO, or Google Cloud Storage. Pachyderm
records the cryptographic hash (`SHA`) of each portion of your data and stores
it as a commit with a unique identifier (ID). Although the data is
stored as an unstructured blob, Pachyderm enables you to interact
with versioned data as you typically do in a standard file system.

Pachyderm stores versioned data in repositories, which can contain one or
more files, as well as files arranged in directories. Regardless of the
repository structure, Pachyderm versions the state of each data repository
as the data changes over time.

To put data into Pachyderm, a commit must be *started*, or *opened*.
Data can then be put into Pachyderm as part of that open commit and is
available once the commit is *finished*, or *closed*.

Pachyderm provides the following options to load data:

* By using the `pachctl put file` command. This option is great for testing,
  development, integration with CI/CD, and for users who prefer scripting.
  See [Load Your Data by Using `pachctl`](#load-your-data-by-using-pachctl).

* By creating a special type of pipeline that pulls data from an
  outside source. Because Pachyderm pipelines can run arbitrary code
  in a Docker container, you can call out to external APIs or data
  sources and pull in data from there. Your pipeline code can be
  triggered on demand or continuously with the following special types
  of pipelines:

    * **Spout:** A spout enables you to continuously load
      streaming data from a streaming data source, such as a messaging
      system or message queue, into Pachyderm.
      See [Spout](../concepts/pipeline-concepts/pipeline/spout.md).

    * **Cron:** A cron triggers your pipeline periodically based on the
      interval that you configure in your pipeline spec.
      See [Cron](../concepts/pipeline-concepts/pipeline/cron.md).

    **Note:** Pipelines enable you to do much more than just ingest
    data into Pachyderm. Pipelines can run all kinds of data transformations
    on your input data sources, such as a Pachyderm repository, and can be
    configured to run your code automatically as new data is committed.
    For more information, see
    [Pipeline](../concepts/pipeline-concepts/pipeline/index.md).

* By using a Pachyderm language client. This option is ideal
  for Go or Python users who want to push data into Pachyderm from
  services or applications written in those languages. If you do not find
  your favorite language in the list of supported language clients,
  note that Pachyderm exposes a protobuf API that supports many other
  languages. See [Pachyderm Language Clients](../reference/clients.md).

If you are using the Pachyderm Enterprise version, you can use these
additional options:

* By using the S3 gateway. This option is great for use with existing tools
  and libraries that interact with S3-compatible object stores.
  See [Using the S3 Gateway](../../deploy-manage/manage/s3gateway/).

* By using the Pachyderm dashboard. The Pachyderm Enterprise dashboard
  provides a convenient way to upload data right from the UI.
  <!--TBA link to the PachHub tutorial-->

    !!! note
        In the Pachyderm UI, you can only specify an S3 data source.
        Uploading data from your local device is not supported.
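As a sketch of the cron option above, a periodic ingress pipeline spec might look like the following. The pipeline name, Docker image, download URL, and `@every 60s` interval are all illustrative placeholders, not values prescribed by Pachyderm:

```json
{
  "pipeline": {
    "name": "fetch-data"
  },
  "input": {
    "cron": {
      "name": "tick",
      "spec": "@every 60s"
    }
  },
  "transform": {
    "image": "curlimages/curl",
    "cmd": ["sh"],
    "stdin": [
      "curl -fsSL https://example.com/dataset.csv -o /pfs/out/dataset.csv"
    ]
  }
}
```

You would create such a pipeline with `pachctl create pipeline -f <spec file>`; every time the cron input ticks, the transform runs and whatever it writes to `/pfs/out` is committed to the pipeline's output repository.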
## Load Your Data by Using `pachctl`

The `pachctl put file` command enables you to do everything from
loading local files into Pachyderm to pulling data from an existing object
store bucket and extracting data from a website. With
`pachctl put file`, you can append new data to existing data or
overwrite existing data. All these options can be configured by using
the flags available with this command. Run `pachctl put file --help` to
view the complete list of flags that you can specify.

To load your data into Pachyderm by using `pachctl`, you first need to create
one or more data repositories. Then, you can use the `pachctl put file`
command to put your data into the created repository.

In Pachyderm, you can *start* and *finish* commits. If you just
run `pachctl put file` and no open commit exists, Pachyderm starts a new
commit, adds the data at the path that you specified in your command, and
finishes the commit. This is called an atomic commit.

Alternatively, you can run `pachctl start commit` to start a new commit.
Then, add your data in multiple `put file` calls, and finally, when ready,
close the commit by running `pachctl finish commit`.

To load your data into a repository, complete the following steps:

1. Create a Pachyderm repository:

    ```shell
    pachctl create repo <repo name>
    ```

1. Select from the following options:

    * To start and finish an atomic commit, run:

        ```shell
        pachctl put file <repo>@<branch>:</path/to/file1> -f <file1>
        ```

    * To start a commit and add data in iterations:

        1. Start a commit:

            ```shell
            pachctl start commit <repo>@<branch>
            ```

        1. Put your data:

            ```shell
            pachctl put file <repo>@<branch>:</path/to/file1> -f <file1>
            ```

        1. Work on your changes, and when ready, put more data:

            ```shell
            pachctl put file <repo>@<branch>:</path/to/file2> -f <file2>
            ```

        1. Close the commit:

            ```shell
            pachctl finish commit <repo>@<branch>
            ```

## Filepath Format

In Pachyderm, you specify the path to a file by using the `-f` option. A path
to a file can be a local path or a URL to an external resource. You can add
multiple files or directories by using the `-i` option. To add the contents
of a directory, use the `-r` flag.

The following examples show `pachctl put file` commands with
various filepaths and data sources:

* Put data from a URL:

    ```shell
    pachctl put file <repo>@<branch>:</path/to/file> -f http://url_path
    ```

* Put data from an object store. You can use `s3://`, `gcs://`, or `as://`
  in your filepath:

    ```shell
    pachctl put file <repo>@<branch>:</path/to/file> -f s3://object_store_url
    ```

    !!! note
        If you are configuring a local cluster to access an S3 bucket,
        you need to first deploy a Kubernetes `Secret` for the selected
        object store.

* Add multiple files at once by using the `-i` option or multiple `-f` flags.
  In the case of `-i`, the target file must be a list of files, paths, or URLs
  that you want to input all at once:

    ```shell
    pachctl put file <repo>@<branch> -i <file containing list of files, paths, or URLs>
    ```

* Pipe data from stdin into a data repository:

    ```shell
    echo "data" | pachctl put file <repo>@<branch>:</path/to/file>
    ```

* Add an entire directory or all of the contents at a particular URL, either
  an HTTP(S) URL or an object store URL (`s3://`, `gcs://`, or `as://`), by
  using the recursive flag, `-r`:

    ```shell
    pachctl put file <repo>@<branch> -r -f <dir>
    ```

## Loading Your Data Partially

Depending on your use case, you might decide not to import all of your
data into Pachyderm but only store and apply version control to some
of it. For example, if you have a 10 PB dataset, loading the
whole dataset into Pachyderm is a costly operation that takes
a lot of time and resources. To optimize performance and the
use of resources, you might decide to load some of this data into
Pachyderm, leaving the rest of it in its original source.

One possible way of doing this is to add to a Pachyderm repository a small
metadata file that contains the URL of the specific file or directory in
your dataset, and to refer to that file in your pipeline.
Your pipeline code would read the URL or path in the external data
source and retrieve that data as needed for processing, instead of
needing to preload it all into a Pachyderm repo. This method works
particularly well for mostly immutable data because, in this case,
Pachyderm does not keep versions of the source file, but it does keep
track of the provenance of the resulting output commits in its
version-control system.
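The metadata-file pattern above can be sketched in a few lines of shell. In a real pipeline, the metadata file would be read from an input mount such as `/pfs/<repo>/` and the results written to `/pfs/out/`; the temp-directory layout, the `manifest.txt` name, and the use of `cp` for local paths are illustrative stand-ins so that the sketch runs anywhere:

```shell
#!/bin/sh
# Sketch: Pachyderm versions only a small manifest listing where the real
# data lives; the pipeline code retrieves the data on demand.
set -eu

work=$(mktemp -d)
mkdir -p "$work/out"

# Stand-in for a large external file that stays in its original source.
echo "external dataset contents" > "$work/big-file.txt"

# The only object committed to the Pachyderm repo: a list of paths or URLs.
printf '%s/big-file.txt\n' "$work" > "$work/manifest.txt"

# Pipeline code: read each entry and retrieve the data as needed.
# For http(s) or s3 URLs you would fetch with curl, wget, or an SDK
# instead of cp.
while read -r src; do
  [ -n "$src" ] || continue        # skip blank lines
  cp "$src" "$work/out/"
done < "$work/manifest.txt"

cat "$work/out/big-file.txt"       # prints: external dataset contents
```

Because only the manifest is committed, Pachyderm versions and tracks provenance for the manifest and the pipeline's output commits, while the bulk data itself never enters the object store backing Pachyderm.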