# Splitting Data for Distributed Processing

Before you read this section, make sure that you understand
the concepts described in
[Distributed Computing](../../concepts/advanced-concepts/distributed_computing.md).

Pachyderm enables you to parallelize computations over data as long as
that data can be split up into multiple *datums*. However, in many
cases, you might have a dataset that you want or need to commit
into Pachyderm as a single file rather than as many smaller
files that are easily mapped to datums, such as one file per record.
For such cases, Pachyderm provides an easy way to prepare your dataset
for subsequent distributed computing by splitting it upon upload
to a Pachyderm repository.

In this example, you have a dataset that consists of information about your
users and a repository called `users`.
This data is in CSV format in a single file called `user_data.csv`
with one record per line:

```
head user_data.csv
1,cyukhtin0@stumbleupon.com,144.155.176.12
2,csisneros1@over-blog.com,26.119.26.5
3,jeye2@instagram.com,13.165.230.106
4,rnollet3@hexun.com,58.52.147.83
5,bposkitt4@irs.gov,51.247.120.167
6,vvenmore5@hubpages.com,161.189.245.212
7,lcoyte6@ask.com,56.13.147.134
8,atuke7@psu.edu,78.178.247.163
9,nmorrell8@howstuffworks.com,28.172.10.170
10,afynn9@google.com.au,166.14.112.65
```

If you put this data into Pachyderm as a single
file, Pachyderm processes it as a single datum.
It cannot process each of
these user records in parallel as separate datums.
You could manually separate
these user records into standalone files before you
commit them into the `users` repository, or split them in
a pipeline stage dedicated to this task.
However, Pachyderm provides an optimized way of completing
this task.
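Conceptually, this splitting turns one uploaded file into a directory of small chunk files, one per record. The following Python sketch mimics that behavior for line-delimited data; the function name and the zero-padded chunk names imitate the shape of Pachyderm's output, but the code is only an illustration, not Pachyderm's implementation:

```python
# Illustration only: mimic `--split line --target-file-datums 1` by
# splitting a file's contents into one chunk per line, named with
# zero-padded indexes the way Pachyderm names split chunks.

def split_lines(data: str, target_file_datums: int = 1) -> dict:
    """Return a mapping of chunk-file names to chunk contents."""
    lines = data.splitlines(keepends=True)
    chunks = {}
    for i in range(0, len(lines), target_file_datums):
        name = format(i // target_file_datums, "016x")
        chunks[name] = "".join(lines[i:i + target_file_datums])
    return chunks

records = (
    "1,cyukhtin0@stumbleupon.com,144.155.176.12\n"
    "2,csisneros1@over-blog.com,26.119.26.5\n"
)
chunks = split_lines(records)
print(sorted(chunks))  # ['0000000000000000', '0000000000000001']
```

In practice, Pachyderm performs the equivalent transformation for you server-side when you use `put file --split`, so no client-side script is needed.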
The `put file` API includes an option for splitting
the file into separate datums automatically. You can use
the `--split` flag with the `put file` command.

To complete this example, follow the steps below:

1. Create a `users` repository by running:

    ```shell
    pachctl create repo users
    ```

1. Create a file called `user_data.csv` with the
    contents listed above.

1. Put your `user_data.csv` file into Pachyderm and
    automatically split it into a separate datum for each line:

    ```shell
    pachctl put file users@master -f user_data.csv --split line --target-file-datums 1
    ```

    The `--split line` argument specifies that Pachyderm
    splits this file into lines, and the `--target-file-datums 1`
    argument specifies that each resulting file must include
    at most one datum, that is, one line.

1. View the list of files in the master branch of the `users`
    repository:

    ```shell
    pachctl list file users@master
    ```

    **System Response:**

    ```shell
    NAME            TYPE SIZE
    user_data.csv   dir  5.346 KiB
    ```

    Although `pachctl list file` still shows `user_data.csv`
    as a single entity in the repository, this entity is now a
    directory that contains all of the split records.

1. To view detailed information about
    the `user_data.csv` file, run the command with the file name
    specified after a colon:

    ```shell
    pachctl list file users@master:user_data.csv
    ```

    **System Response:**

    ```shell
    NAME                             TYPE SIZE
    user_data.csv/0000000000000000   file 43 B
    user_data.csv/0000000000000001   file 39 B
    user_data.csv/0000000000000002   file 37 B
    user_data.csv/0000000000000003   file 34 B
    user_data.csv/0000000000000004   file 35 B
    user_data.csv/0000000000000005   file 41 B
    user_data.csv/0000000000000006   file 32 B
    etc...
    ```

Then, a pipeline that takes the `users` repository as input
with a glob pattern of `/user_data.csv/*` processes each
user record, that is, each line in the CSV file, in parallel.

### JSON and Text File Splitting Examples

Pachyderm supports this type of splitting for lines and
JSON blobs as well. See the examples below.

* Split a JSON file on JSON blobs, putting each JSON
  blob into a separate file:

    ```shell
    pachctl put file users@master -f user_data.json --split json --target-file-datums 1
    ```

* Split a JSON file on JSON blobs, putting three JSON
  blobs into each split file:

    ```shell
    pachctl put file users@master -f user_data.json --split json --target-file-datums 3
    ```

* Split a file on lines, putting each 100-byte chunk into
  a split file:

    ```shell
    pachctl put file users@master -f user_data.txt --split line --target-file-bytes 100
    ```

## Specifying a Header

If your data has a common header, you can specify it
manually by using `pachctl put file` with the `--header-records` flag.
You can use this functionality with JSON and CSV data.

To specify a header, complete the following steps:

1. Create a new data file or use an existing one. For example, use the
    `user_data.csv` file from the section above with the following header:

    ```shell
    NUMBER,EMAIL,IP_ADDRESS
    ```

1. Create a new repository or use an existing one:

    ```shell
    pachctl create repo users
    ```

1. Put your file into the repository, separating the header from
    the other lines:

    ```shell
    pachctl put file users@master -f user_data.csv --split=csv --header-records=1 --target-file-datums=1
    ```

1. Verify that the file was added and split:

    ```shell
    pachctl list file users@master:/user_data.csv
    ```

    **Example:**

    ```shell
    NAME                              TYPE SIZE
    /user_data.csv/0000000000000000   file 70B
    /user_data.csv/0000000000000001   file 66B
    /user_data.csv/0000000000000002   file 64B
    /user_data.csv/0000000000000003   file 61B
    /user_data.csv/0000000000000004   file 62B
    /user_data.csv/0000000000000005   file 68B
    /user_data.csv/0000000000000006   file 59B
    /user_data.csv/0000000000000007   file 59B
    /user_data.csv/0000000000000008   file 71B
    /user_data.csv/0000000000000009   file 65B
    ```

1. Get the first file from the repository:

    ```shell
    pachctl get file users@master:/user_data.csv/0000000000000000
    ```

    **System Response:**

    ```shell
    NUMBER,EMAIL,IP_ADDRESS
    1,cyukhtin0@stumbleupon.com,144.155.176.12
    ```

1. Get all files:

    ```shell
    pachctl get file users@master:/user_data.csv/*
    ```

    **System Response:**

    ```csv
    NUMBER,EMAIL,IP_ADDRESS
    1,cyukhtin0@stumbleupon.com,144.155.176.12
    2,csisneros1@over-blog.com,26.119.26.5
    3,jeye2@instagram.com,13.165.230.106
    4,rnollet3@hexun.com,58.52.147.83
    5,bposkitt4@irs.gov,51.247.120.167
    6,vvenmore5@hubpages.com,161.189.245.212
    7,lcoyte6@ask.com,56.13.147.134
    8,atuke7@psu.edu,78.178.247.163
    9,nmorrell8@howstuffworks.com,28.172.10.170
    10,afynn9@google.com.au,166.14.112.65
    ```

For more information, type `pachctl put file --help`.

## Ingesting PostgreSQL data

Pachyderm supports direct data ingestion from PostgreSQL.
You first need to extract your database into a script file
by using `pg_dump` and then add the data from that file
into Pachyderm by running `pachctl put file` with the
`--split` flag.
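To make the decomposition concrete before walking through the steps, here is a rough Python sketch of how a `pg_dump` file can be separated into a schema header, individual data rows, and a footer. The function name and parsing rules are illustrative assumptions, not Pachyderm's actual implementation:

```python
# Illustration only: decompose a pg_dump file the way `--split sql`
# conceptually does. Schema statements up to and including the
# `COPY ... FROM stdin;` line form the header, the data lines are the
# rows, and everything from the terminating `\.` onward is the footer.

def split_pgdump(dump: str):
    lines = dump.splitlines(keepends=True)
    header, rows, footer = [], [], []
    section = "header"
    for line in lines:
        if section == "header":
            header.append(line)
            if line.startswith("COPY ") and line.rstrip().endswith("FROM stdin;"):
                section = "rows"
        elif section == "rows":
            if line.rstrip("\n") == "\\.":  # end-of-data marker
                footer.append(line)
                section = "footer"
            else:
                rows.append(line)
        else:
            footer.append(line)
    return "".join(header), rows, "".join(footer)

dump = (
    "CREATE TABLE public.users (id integer, name text);\n"
    "COPY public.users (id, name) FROM stdin;\n"
    "0\twile E Coyote\n"
    "1\troad runner\n"
    "\\.\n"
    "-- PostgreSQL database dump complete\n"
)
header, rows, footer = split_pgdump(dump)
# Each split chunk, when read individually, is header + row(s) + footer,
# so it is itself a loadable pg_dump fragment.
chunk = header + rows[0] + footer
```

Because each chunk reconstructed this way is valid `pg_dump` input, any subset of chunks can be piped into `psql`, which is what makes loading partial datasets possible.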
When you use `pachctl put file --split sql ...`, Pachyderm
splits your `pgdump` file into three parts: the header, the rows,
and the footer. The header contains all the SQL statements
in the `pgdump` file that set up the schema and tables.
The rows are split into individual files or, if you specify
`--target-file-datums` or `--target-file-bytes`, into files with
multiple rows each. The footer contains the remaining
SQL statements for setting up the tables.

The header and footer are stored in the directory that contains
the rows. If you request a `get file` on that directory, you
get just the header and footer. If you request an individual
file, you see the header, the row or rows, and then the footer.
If you request all the files with a glob pattern, for example
`/directoryname/*`, you receive the header, all the rows, and
the footer, recreating the full `pgdump`. Therefore, you can
construct full or partial `pgdump` files so that you can
load full or partial datasets.

To put your PostgreSQL data into Pachyderm, complete the following
steps:

1. Generate a `pgdump` file:

    **Example:**

    ```shell
    pg_dump -t users -f users.pgdump
    ```

1. View the `pgdump` file:

    ```shell
    cat users.pgdump
    ```

    **System Response:**

    ```shell
    --
    -- PostgreSQL database dump
    --

    -- Dumped from database version 9.5.12
    -- Dumped by pg_dump version 9.5.12

    SET statement_timeout = 0;
    SET lock_timeout = 0;
    SET client_encoding = 'UTF8';
    SET standard_conforming_strings = on;
    SELECT pg_catalog.set_config('search_path', '', false);
    SET check_function_bodies = false;
    SET client_min_messages = warning;
    SET row_security = off;

    SET default_tablespace = '';

    SET default_with_oids = false;

    --
    -- Name: users; Type: TABLE; Schema: public; Owner: postgres
    --

    CREATE TABLE public.users (
        id integer NOT NULL,
        name text NOT NULL,
        saying text NOT NULL
    );


    ALTER TABLE public.users OWNER TO postgres;

    --
    -- Data for Name: users; Type: TABLE DATA; Schema: public; Owner: postgres
    --

    COPY public.users (id, name, saying) FROM stdin;
    0	wile E Coyote	...
    1	road runner	\\.
    \.


    --
    -- PostgreSQL database dump complete
    --
    ```

1. Ingest the SQL data by using the `pachctl put file` command
    with the `--split` flag:

    ```shell
    pachctl put file data@master -f users.pgdump --split sql
    ```

    To write the split data under a specific path in the repository,
    specify that path after a colon:

    ```shell
    pachctl put file data@master:users --split sql -f users.pgdump
    ```

1. View the information about your repository:

    ```shell
    pachctl list file data@master
    ```

    **System Response:**

    ```shell
    NAME   TYPE SIZE
    users  dir  914B
    ```

    The split `users.pgdump` data is stored in the `users` directory
    on the master branch of the `data` repository.

1. View the information about the split `users.pgdump` file:

    ```shell
    pachctl list file data@master:users
    ```

    **System Response:**

    ```shell
    NAME                      TYPE SIZE
    /users/0000000000000000   file 20B
    /users/0000000000000001   file 18B
    ```

1. In your pipeline, where you have started and forked PostgreSQL,
    you can load the data by running the following or a similar script:

    ```shell
    cat /pfs/data/users/* | sudo -u postgres psql
    ```

    By using the glob pattern `/*`, this code loads each raw PostgreSQL chunk
    into your PostgreSQL instance for processing by your pipeline.


!!! tip
    For this use case, you might want to use `--target-file-datums` or
    `--target-file-bytes` because these flags enable your queries to run
    against many rows at a time.
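The tip above can be made concrete with a sketch: batching rows with `--target-file-datums` means each chunk file repeats the shared header and footer once per group of rows rather than once per row, so each downstream `psql` invocation loads more rows. The chunking function below is a hypothetical illustration, not Pachyderm's code:

```python
# Illustration only: group pg_dump rows into chunks of N datums, as
# `--split sql --target-file-datums N` would, so that each chunk is a
# self-contained SQL fragment covering N rows at once.

def chunk_rows(header: str, rows: list, footer: str, target_file_datums: int):
    """Yield (name, contents) pairs, one per chunk of rows."""
    for i in range(0, len(rows), target_file_datums):
        name = format(i // target_file_datums, "016x")
        yield name, header + "".join(rows[i:i + target_file_datums]) + footer

header = "COPY public.users (id, name) FROM stdin;\n"
footer = "\\.\n"
rows = [f"{i}\tuser{i}\n" for i in range(10)]

# One row per file: ten chunks, each repeating the header and footer.
assert len(list(chunk_rows(header, rows, footer, 1))) == 10
# Three rows per file: four chunks; each psql run now loads three rows.
assert len(list(chunk_rows(header, rows, footer, 3))) == 4
```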