# Splitting Data for Distributed Processing

Before you read this section, make sure that you understand
the concepts described in
[Distributed Computing](../distributed_computing.md).

Pachyderm enables you to parallelize computations over data as long as
that data can be split up into multiple *datums*. However, in many
cases, you might have a dataset that you want or need to commit
into Pachyderm as a single file rather than a bunch of smaller
files that are easily mapped to datums, such as one file per record.
For such cases, Pachyderm provides an easy way to prepare your dataset
for subsequent distributed computing by splitting it upon uploading
to a Pachyderm repository.

In this example, you have a dataset that consists of information about your
users and a repository called `users`.
This data is in `CSV` format in a single file called `user_data.csv`
with one record per line:

```shell
$ head user_data.csv
1,cyukhtin0@stumbleupon.com,144.155.176.12
2,csisneros1@over-blog.com,26.119.26.5
3,jeye2@instagram.com,13.165.230.106
4,rnollet3@hexun.com,58.52.147.83
5,bposkitt4@irs.gov,51.247.120.167
6,vvenmore5@hubpages.com,161.189.245.212
7,lcoyte6@ask.com,56.13.147.134
8,atuke7@psu.edu,78.178.247.163
9,nmorrell8@howstuffworks.com,28.172.10.170
10,afynn9@google.com.au,166.14.112.65
```

If you put this data into Pachyderm as a single
file, Pachyderm processes it as a single datum.
It cannot process each of
these user records in parallel as separate datums.
You could manually separate these user records into
standalone files before you commit them into the `users`
repository, or add a pipeline stage dedicated to this
splitting task.
However, Pachyderm provides an optimized way of completing
this task.
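For comparison, the manual pre-splitting alternative can be sketched with standard shell tools. This is only an illustration, assuming GNU coreutils `split`; the commented-out `pachctl` upload at the end is the hypothetical follow-up step:

```shell
# Recreate a small sample of user_data.csv.
printf '1,cyukhtin0@stumbleupon.com,144.155.176.12\n' >  user_data.csv
printf '2,csisneros1@over-blog.com,26.119.26.5\n'     >> user_data.csv

# Split into one file per record; the 16-digit numeric suffixes mimic
# the file names that Pachyderm's --split option generates.
mkdir -p records
split -l 1 --numeric-suffixes --suffix-length=16 user_data.csv records/

ls records
# 0000000000000000  0000000000000001

# The split files could then be uploaded recursively, for example:
# pachctl put file users@master:user_data.csv -r -f records
```

As the rest of this page shows, the `--split` flag makes this manual step unnecessary.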
The `put file` API includes an option for splitting
the file into separate datums automatically. You can use
the `--split` flag with the `put file` command.

To complete this example, follow the steps below:

1. Create a `users` repository by running:

    ```shell
    $ pachctl create repo users
    ```

1. Create a file called `user_data.csv` with the
   contents listed above.

1. Put your `user_data.csv` file into Pachyderm and
   automatically split it into separate datums for each line:

    ```shell
    $ pachctl put file users@master -f user_data.csv --split line --target-file-datums 1
    ```

    The `--split line` argument specifies that Pachyderm
    splits this file into lines, and the `--target-file-datums 1`
    argument specifies that each resulting file must include
    at most one datum, that is, one line.

1. View the list of files in the master branch of the `users`
   repository:

    ```shell
    $ pachctl list file users@master
    NAME                 TYPE                SIZE
    user_data.csv        dir                 5.346 KiB
    ```

    If you run the `pachctl list file` command for the master branch
    in the `users` repository, Pachyderm
    still shows `user_data.csv` as a single
    entity in the repo.
    However, this entity is now a directory that contains all
    of the split records.

1. To view detailed information about
   the `user_data.csv` file, run the command with the file name
   specified after a colon:

    ```shell
    $ pachctl list file users@master:user_data.csv
    NAME                              TYPE                SIZE
    user_data.csv/0000000000000000    file                43 B
    user_data.csv/0000000000000001    file                39 B
    user_data.csv/0000000000000002    file                37 B
    user_data.csv/0000000000000003    file                34 B
    user_data.csv/0000000000000004    file                35 B
    user_data.csv/0000000000000005    file                41 B
    user_data.csv/0000000000000006    file                32 B
    etc...
    ```
Then, a pipeline that takes the repo `users` as input
with a glob pattern of `/user_data.csv/*` processes each
user record, that is, each line in the CSV file, in parallel.

### JSON and Text File Splitting Examples

Pachyderm supports this type of splitting for lines or
JSON blobs as well. See the examples below.

* Split a `json` file on `json` blobs by putting each `json`
  blob into a separate file:

    ```shell
    $ pachctl put file users@master -f user_data.json --split json --target-file-datums 1
    ```

* Split a `json` file on `json` blobs by putting three `json`
  blobs into each split file:

    ```shell
    $ pachctl put file users@master -f user_data.json --split json --target-file-datums 3
    ```

* Split a file on lines by putting chunks of approximately
  100 bytes into each split file:

    ```shell
    $ pachctl put file users@master -f user_data.txt --split line --target-file-bytes 100
    ```

## Specifying a Header

If your data has a common header, you can specify it
manually by using `pachctl put file` with the `--header-records` flag.
You can use this functionality with JSON and CSV data.

To specify a header, complete the following steps:

1. Create a new data file or use an existing one. For example, the `user_data.csv`
   from the section above with the following header:

    ```shell
    NUMBER,EMAIL,IP_ADDRESS
    ```

1. Create a new repository or use an existing one:

    ```shell
    $ pachctl create repo users
    ```

1. Put your file into the repository by separating the header from
   the other lines:

    ```shell
    $ pachctl put file users@master -f user_data.csv --split=csv --header-records=1 --target-file-datums=1
    ```
1. Verify that the file was added and split:

    ```shell
    $ pachctl list file users@master:/user_data.csv
    ```

    **Example:**

    ```shell
    NAME                               TYPE                SIZE
    /user_data.csv/0000000000000000    file                70B
    /user_data.csv/0000000000000001    file                66B
    /user_data.csv/0000000000000002    file                64B
    /user_data.csv/0000000000000003    file                61B
    /user_data.csv/0000000000000004    file                62B
    /user_data.csv/0000000000000005    file                68B
    /user_data.csv/0000000000000006    file                59B
    /user_data.csv/0000000000000007    file                59B
    /user_data.csv/0000000000000008    file                71B
    /user_data.csv/0000000000000009    file                65B
    ```

1. Get the first file from the repository:

    ```shell
    $ pachctl get file users@master:/user_data.csv/0000000000000000
    NUMBER,EMAIL,IP_ADDRESS
    1,cyukhtin0@stumbleupon.com,144.155.176.12
    ```

1. Get all files:

    ```shell
    $ pachctl get file users@master:/user_data.csv/*
    NUMBER,EMAIL,IP_ADDRESS
    1,cyukhtin0@stumbleupon.com,144.155.176.12
    2,csisneros1@over-blog.com,26.119.26.5
    3,jeye2@instagram.com,13.165.230.106
    4,rnollet3@hexun.com,58.52.147.83
    5,bposkitt4@irs.gov,51.247.120.167
    6,vvenmore5@hubpages.com,161.189.245.212
    7,lcoyte6@ask.com,56.13.147.134
    8,atuke7@psu.edu,78.178.247.163
    9,nmorrell8@howstuffworks.com,28.172.10.170
    10,afynn9@google.com.au,166.14.112.65
    ```

For more information, type `pachctl put file --help`.

## Ingesting PostgreSQL Data

Pachyderm supports direct data ingestion from PostgreSQL.
You first need to extract your database into a script file
by using `pg_dump` and then add the data from that file
into Pachyderm by running `pachctl put file` with the
`--split` flag.

When you use `pachctl put file --split sql ...`, Pachyderm
splits your `pgdump` file into three parts: the header, the rows,
and the footer.
The header contains all the SQL statements
in the `pgdump` file that set up the schema and tables.
The rows are split into individual files, or, if you specify
the `--target-file-datums` or `--target-file-bytes` flag, into
multiple rows per file. The footer contains the remaining
SQL statements for setting up the tables.

The header and footer are stored in the directory that contains
the rows. If you request a `get file` on that directory, you
get just the header and footer. If you request an individual
file, you see the header, the row or rows, and the footer.
If you request all the files with a glob pattern, for example,
`/directoryname/*`, you receive the header, all the rows, and
the footer, recreating the full `pgdump`. Therefore, you can
construct full or partial `pgdump` files so that you can
load full or partial datasets.

To put your PostgreSQL data into Pachyderm, complete the following
steps:

1. Generate a `pgdump` file:

    **Example:**

    ```shell
    $ pg_dump -t users -f users.pgdump
    ```
2. View the `pgdump` file:

    ???+ note "Example"

        ```shell
        $ cat users.pgdump
        --
        -- PostgreSQL database dump
        --

        -- Dumped from database version 9.5.12
        -- Dumped by pg_dump version 9.5.12

        SET statement_timeout = 0;
        SET lock_timeout = 0;
        SET client_encoding = 'UTF8';
        SET standard_conforming_strings = on;
        SELECT pg_catalog.set_config('search_path', '', false);
        SET check_function_bodies = false;
        SET client_min_messages = warning;
        SET row_security = off;

        SET default_tablespace = '';

        SET default_with_oids = false;

        --
        -- Name: users; Type: TABLE; Schema: public; Owner: postgres
        --

        CREATE TABLE public.users (
            id integer NOT NULL,
            name text NOT NULL,
            saying text NOT NULL
        );


        ALTER TABLE public.users OWNER TO postgres;

        --
        -- Data for Name: users; Type: TABLE DATA; Schema: public; Owner: postgres
        --

        COPY public.users (id, name, saying) FROM stdin;
        0	wile E Coyote	...
        1	road runner	\\.
        \.


        --
        -- PostgreSQL database dump complete
        --
        ```

3. Ingest the SQL data by using the `pachctl put file` command
   with the `--split` flag:

    ```shell
    $ pachctl put file data@master:users --split sql -f users.pgdump
    ```

4. View the information about your repository:

    ```shell
    $ pachctl list file data@master
    NAME                 TYPE                SIZE
    users                dir                 914B
    ```

    The `users.pgdump` file is added to the master branch in the `data`
    repository.

5. View the information about the `users.pgdump` file:

    ```shell
    $ pachctl list file data@master:users
    NAME                         TYPE                SIZE
    /users/0000000000000000      file                20B
    /users/0000000000000001      file                18B
    ```
6. In your pipeline, where you have started and forked PostgreSQL,
   you can load the data by running the following or a similar script:

    ```shell
    $ cat /pfs/data/users/* | sudo -u postgres psql
    ```

    By using the glob pattern `/*`, this code loads each raw PostgreSQL chunk
    into your PostgreSQL instance for processing by your pipeline.

!!! tip
    For this use case, you might want to use `--target-file-datums` or
    `--target-file-bytes` because these flags enable your queries to run
    against many rows at a time.
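As noted earlier, a pipeline that takes the `users` repo as input with a glob pattern of `/user_data.csv/*` processes each split record in parallel. A minimal pipeline specification along those lines might look like the following sketch; the pipeline name, image, and command here are illustrative assumptions, not part of the examples above:

```json
{
  "pipeline": {
    "name": "process-users"
  },
  "input": {
    "pfs": {
      "repo": "users",
      "glob": "/user_data.csv/*"
    }
  },
  "transform": {
    "image": "alpine:3.9",
    "cmd": ["/bin/sh"],
    "stdin": [
      "cp /pfs/users/user_data.csv/* /pfs/out/"
    ]
  }
}
```

With this glob pattern, each worker sees a single record file under `/pfs/users/user_data.csv/`, so the records are processed as independent datums.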