# Load Your Data Into Pachyderm

!!! info
    Before you read this section, make sure that you are familiar with
    the [Data Concepts](../concepts/data-concepts/index.md) and
    [Pipeline Concepts](../concepts/pipeline-concepts/index.md).

The data that you commit to Pachyderm is stored in an object store of your
choice, such as Amazon S3, MinIO, or Google Cloud Storage. Pachyderm
records the cryptographic hash (`SHA`) of each portion of your data and stores
it as a commit with a unique identifier (ID). Although the data is
stored as an unstructured blob, Pachyderm enables you to interact
with versioned data as you typically do in a standard file system.

Pachyderm stores versioned data in repositories, which can contain one or
more files as well as files arranged in directories. Regardless of the
repository structure, Pachyderm versions the state of each data repository
as the data changes over time.

To put data into Pachyderm, a commit must be *started*, or *opened*.
Data can then be put into Pachyderm as part of that open commit and is
available once the commit is *finished*, or *closed*.
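
A minimal sketch of this lifecycle with the `pachctl` CLI, which is covered in
detail below, assuming a hypothetical repository named `images` already exists:

```shell
# Open a commit on the master branch, add a file, then close the commit.
$ pachctl start commit images@master
$ pachctl put file images@master:/cat.png -f cat.png
$ pachctl finish commit images@master
```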

Pachyderm provides the following options to load data:

* By using the `pachctl put file` command. This option is great for testing,
development, integration with CI/CD, and for users who prefer scripting.
See [Load Your Data by Using `pachctl`](#load-your-data-by-using-pachctl).

* By creating a pipeline to pull data from an outside source.
Because Pachyderm pipelines can be any arbitrary code that runs
in a Docker container, you can call out to external APIs or data
sources and pull in data from there. Your pipeline code can be
triggered on demand or
continuously with the following special types of pipelines:

  * **Spout:** A spout enables you to continuously load
  streaming data from a source, such as a messaging system
  or message queue, into Pachyderm.
  See [Spout](../concepts/pipeline-concepts/pipeline/spout.md).

  * **Cron:** A cron triggers your pipeline periodically based on the
  interval that you configure in your pipeline spec.
  See [Cron](../concepts/pipeline-concepts/pipeline/cron.md).

  **Note:** Pipelines enable you to do much more than just ingest
  data into Pachyderm. Pipelines can run all kinds of data transformations
  on your input data sources, such as a Pachyderm repository, and be
  configured to run your code automatically as new data is committed.
  For more information, see
  [Pipeline](../concepts/pipeline-concepts/pipeline/index.md).

* By using a Pachyderm language client. This option is ideal
for Go or Python users who want to push data into Pachyderm from
services or applications written in those languages. If your
favorite language is not in the list of supported language clients,
you can still use the protobuf API, which supports many other languages.
See [Pachyderm Language Clients](../reference/clients.md).

If you are using the Pachyderm Enterprise version, you can use these
additional options:

* By using the S3 gateway. This option works well with existing tools
and libraries that interact with S3-compatible object stores.
See [Using the S3 Gateway](./s3gateway.md) and the sketch below this list.

* By using the Pachyderm dashboard. The Pachyderm Enterprise dashboard
provides a convenient way to upload data right from the UI.
<!--TBA link to the PachHub tutorial-->

!!! note
    In the Pachyderm UI, you can only specify an S3 data source.
    Uploading data from your local device is not supported.
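
For the S3 gateway option, any S3-compatible client can load data into a
repository. The following sketch uses the AWS CLI and assumes a hypothetical
deployment where the gateway is reachable at `localhost:30600` and exposes
each branch of a repository as a bucket named `<branch>.<repo>`; see
[Using the S3 Gateway](./s3gateway.md) for the details of your deployment:

```shell
# Hypothetical endpoint and repository; the gateway address and port
# depend on your deployment.
$ aws --endpoint-url http://localhost:30600 s3 cp ./sales.csv s3://master.raw-data/sales.csv
```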

## Load Your Data by Using `pachctl`

The `pachctl put file` command enables you to do everything from
loading local files into Pachyderm to pulling data from an existing object
store bucket and extracting data from a website. With
`pachctl put file`, you can append new data to existing data or
overwrite the existing data. You configure this behavior with the flags
available for this command. Run `pachctl put file --help` to
view the complete list of flags that you can specify.
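
As a sketch of the append and overwrite behavior, assuming a hypothetical
repository named `logs` and the `--overwrite` flag listed by
`pachctl put file --help`:

```shell
# By default, putting a file at an existing path appends to its content.
$ pachctl put file logs@master:/app.log -f app.log

# The --overwrite flag replaces the previous content of the file instead.
$ pachctl put file logs@master:/app.log -f app.log --overwrite
```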

To load your data into Pachyderm by using `pachctl`, you first need to create
one or more data repositories. Then, you can use the `pachctl put file`
command to put your data into the created repository.

In Pachyderm, you can *start* and *finish* commits. If you just
run `pachctl put file` and no open commit exists, Pachyderm starts a new
commit, adds the data at the path that you specified in your command, and
finishes the commit. This is called an atomic commit.

Alternatively, you can run `pachctl start commit` to start a new commit.
Then, add your data in multiple `put file` calls, and finally, when ready,
close the commit by running `pachctl finish commit`.

To load your data into a repository, complete the following steps:

1. Create a Pachyderm repository:

   ```shell
   $ pachctl create repo <repo name>
   ```

1. Select from the following options:

   * To start and finish an atomic commit, run:

     ```shell
     $ pachctl put file <repo>@<branch>:</path/to/file1> -f <file1>
     ```

   * To start a commit and add data in iterations:

     1. Start a commit:

        ```shell
        $ pachctl start commit <repo>@<branch>
        ```

     1. Put your data:

        ```shell
        $ pachctl put file <repo>@<branch>:</path/to/file1> -f <file1>
        ```

     1. Work on your changes, and when ready, put more data:

        ```shell
        $ pachctl put file <repo>@<branch>:</path/to/file2> -f <file2>
        ```

     1. Close the commit:

        ```shell
        $ pachctl finish commit <repo>@<branch>
        ```
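
After you finish the commit, you can verify the result. The following sketch
assumes a hypothetical repository named `images`:

```shell
# List the commits in the repository; the finished commit appears at the top.
$ pachctl list commit images

# List the files stored at the head of the master branch.
$ pachctl list file images@master
```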

## Filepath Format

In Pachyderm, you specify the path to a file by using the `-f` option. A
filepath can be a local path or a URL to an external resource. You can add
multiple files or directories by using the `-i` option. To add the contents
of a directory, use the `-r` flag.

The following examples show `pachctl put file` commands with
various filepaths and data sources:

* Put data from a URL:

  ```shell
  $ pachctl put file <repo>@<branch>:</path/to/file> -f http://url_path
  ```

* Put data from an object store. You can use `s3://`, `gcs://`, or `as://`
in your filepath:

  ```shell
  $ pachctl put file <repo>@<branch>:</path/to/file> -f s3://object_store_url
  ```

!!! note
    If you are configuring a local cluster to access an S3 bucket,
    you first need to deploy a Kubernetes `Secret` for the selected object
    store.

* Add multiple files at once by using the `-i` option or multiple `-f` flags.
In the case of `-i`, the target file must be a list of files, paths, or URLs
that you want to input all at once:

  ```shell
  $ pachctl put file <repo>@<branch> -i <file containing list of files, paths, or URLs>
  ```

* Input data from stdin into a data repository by using a pipe:

  ```shell
  $ echo "data" | pachctl put file <repo>@<branch> -f </path/to/file>
  ```

* Add an entire directory or all of the contents at a particular URL, whether
an HTTP(S) URL or an object store URL (`s3://`, `gcs://`, or `as://`), by
using the recursive flag `-r`:

  ```shell
  $ pachctl put file <repo>@<branch> -r -f <dir>
  ```
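
As a concrete sketch of the `-i` option, assuming a hypothetical repository
named `images` and a manifest file that lists one path or URL per line:

```shell
# files.txt lists the objects to load, one per line (hypothetical locations).
$ cat files.txt
s3://my-bucket/images/cat.png
https://example.com/images/dog.png

# Load everything listed in the manifest in a single call.
$ pachctl put file images@master -i files.txt
```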

## Loading Your Data Partially

Depending on your use case, you might decide not to import all of your
data into Pachyderm but only store and apply version control to some
of it. For example, if you have a 10 PB dataset, loading the
whole dataset into Pachyderm is a costly operation that takes
a lot of time and resources. To optimize performance and the
use of resources, you might decide to load some of this data into
Pachyderm, leaving the rest of it in its original source.

One possible way of doing this is to add a metadata file that contains a
URL to the specific file or directory in your dataset to a Pachyderm
repository and refer to that file in your pipeline.
Your pipeline code reads the URL or path and retrieves the data from the
external source as needed for processing instead of
preloading it all into a Pachyderm repo. This method works
particularly well for mostly immutable data because, in this case,
Pachyderm does not keep versions of the source files, but it does keep
track of the provenance of the resulting output commits in its
version-control system.
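
A minimal sketch of this pattern, assuming a hypothetical bucket and
repository: only a small manifest file is versioned in Pachyderm, while the
large objects stay in the object store and are fetched by your pipeline code
when it processes the manifest.

```shell
# manifest.txt records where the raw data lives (hypothetical locations);
# the data itself is not copied into Pachyderm.
$ cat manifest.txt
s3://my-bucket/big-dataset/part-0001.parquet
s3://my-bucket/big-dataset/part-0002.parquet

# Version only the manifest. A pipeline that takes this repo as input
# reads the URLs and downloads the objects it needs at processing time.
$ pachctl create repo dataset-metadata
$ pachctl put file dataset-metadata@master:/manifest.txt -f manifest.txt
```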