# Adjusting Data Processing by Splitting Data

!!! info
    Before you read this section, make sure that you understand
    the concepts described in [File](../../concepts/data-concepts/file.md),
    [Glob Pattern](../../concepts/pipeline-concepts/datum/glob-pattern.md),
    [Pipeline Specification](../../reference/pipeline_spec.md), and
    [Individual Developer Workflow](../individual-developer-workflow.md).

Unlike source code version-control systems, such as Git, that mostly
store and version text files, Pachyderm does not perform intra-file
diffing. This means that if you have a 10,000-line CSV file and
change a single line in that file, a pipeline that is subscribed
to the repository where that file is located processes the whole file.
You can adjust this behavior by splitting your file into chunks
when you load it.

Pachyderm applies diffing at the per-file level.
Therefore, if one bit of a file changes,
Pachyderm identifies that change as a new file.
Similarly, Pachyderm can only distribute computation
at the level of a single file. If your data is stored in
one large file, it can only be processed by a single worker, which
might affect performance.

To optimize performance and processing time, you might want to
break up large files into smaller chunks while Pachyderm uploads
data to an input repository. For simple data types, you can
run the `pachctl put file` command with the `--split` flag. For
more complex splitting patterns, such as when you work with `.avro`
or other binary formats, you need to manually split your data
either at ingest or by configuring splitting in a Pachyderm
pipeline.
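
The effect of per-file diffing can be illustrated locally, without a
Pachyderm cluster. The following sketch is an approximation only: it does
not call `pachctl`, and `split_by_line` is a hypothetical helper that
imitates one-line split-files with zero-padded hexadecimal names. It
splits two versions of a ten-line file that differ in one line and
counts how many chunks changed.

```shell
# Local illustration only: no pachctl commands are run here.
workdir="$(mktemp -d)"
mkdir "$workdir/v1" "$workdir/v2"

# split_by_line FILE DIR is a hypothetical helper (not part of Pachyderm).
# It writes one line per file, named with a 16-digit, zero-padded
# hexadecimal counter.
split_by_line() {
  awk -v dir="$2" '{ f = sprintf("%s/%016x", dir, NR - 1); print > f; close(f) }' "$1"
}

# Version 1: ten lines, one number per line.
seq 1 10 > "$workdir/data.txt"
split_by_line "$workdir/data.txt" "$workdir/v1"

# Version 2: the same file with only line 5 changed.
sed '5s/.*/changed/' "$workdir/data.txt" > "$workdir/data-v2.txt"
split_by_line "$workdir/data-v2.txt" "$workdir/v2"

# Count the chunks whose content differs between the two versions.
changed=0
for f in "$workdir"/v1/*; do
  cmp -s "$f" "$workdir/v2/$(basename "$f")" || changed=$((changed + 1))
done
echo "changed split-files: $changed of 10"
```

As a single file, `data.txt` differs between the two versions and would
be reprocessed as a whole. After splitting, only one of the ten chunks
differs.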

## Using Split and Target-File Flags

For common file types that are often used in data science, such as CSV,
line-delimited text files, and JavaScript Object Notation (JSON) files,
Pachyderm includes the `--split`, `--target-file-bytes`, and
`--target-file-datums` flags.

!!! note
    In this section, a chunk of data is called a *split-file*.

| Flag              | Description                                           |
| ----------------- | ----------------------------------------------------- |
| `--split` | Divides a file into chunks based on a *record*, such as newlines <br> in line-delimited files or JSON objects in JSON files. <br> The `--split` flag takes one of the following arguments: `line`, <br> `json`, or `sql`. For example, `--split line` ensures that Pachyderm <br> only breaks up a file on newline boundaries and not in the <br> middle of a line. |
| `--target-file-bytes` | This flag must be used with the `--split` <br> flag. The `--target-file-bytes` flag fills each of the split-files with data up to <br> the specified number of bytes, splitting on the nearest <br>record boundary. For example, you have a line-delimited file <br>of 50 lines, with each line having about 20 bytes. If you run <br>`pachctl put file` with `--split line --target-file-bytes 100`, the input <br>file is split into about 10 files and each file has about 5 lines. Each <br>split-file's size hovers above the target value of 100 bytes, <br>not going below 100 bytes until the last split-file, which might <br>be less than 100 bytes. |
| `--target-file-datums` | This flag must be used with the `--split` <br> flag. The `--target-file-datums` flag attempts to fill each split-file <br>with the specified number of datums. If you run `pachctl put file` with <br>`--split line --target-file-datums 2` on the line-delimited 50-line file <br>mentioned above, the file is split into 25 split-files and each file has 2 lines. |

If you specify both the `--target-file-datums` and `--target-file-bytes` flags,
Pachyderm creates split-files until it hits one of the
constraints.

!!! note "See also:"
    [Splitting Data for Distributed Processing](../splitting/#ingesting-postgressql-data)

### Example: Splitting a File

In this example, you have a 50-line file called `my-data.txt`.
You create a repository named `line-data` and load
`my-data.txt` into that repository. Then, you can analyze
how the data is split in the repository.

To complete this example, perform the following steps:

1. Create a file with fifty lines named `my-data.txt`. You can
   add random lines, such as numbers from one to fifty, or US states,
   or anything else.

   **Example:**

   ```shell
   One
   Two
   Three
   ...
   Fifty
   ```

1. Create a repository called `line-data` and load `my-data.txt`
   into it, splitting the file by line:

   ```shell
   pachctl create repo line-data
   pachctl put file line-data@master -f my-data.txt --split line
   ```

1. List the filesystem objects in the repository:

   ```shell
   pachctl list file line-data@master
   ```

   **System Response:**

   ```shell
   NAME         TYPE SIZE
   /my-data.txt dir  1.071KiB
   ```

   The `pachctl list file` command shows
   that the line-oriented file `my-data.txt`
   that was uploaded has been transformed into a
   directory that includes the chunks of the original
   `my-data.txt` file. Each chunk is put into a split-file
   and given a 16-character filename, left-padded with 0.
   Pachyderm numbers each filename sequentially in hexadecimal.
   To see the naming structure, list the contents of the
   `my-data.txt` directory.

1. List the files in the `my-data.txt` directory:

   ```shell
   pachctl list file line-data@master:my-data.txt
   ```

   **System Response:**

   ```shell
   NAME                          TYPE SIZE
   /my-data.txt/0000000000000000 file 21B
   /my-data.txt/0000000000000001 file 22B
   /my-data.txt/0000000000000002 file 24B
   /my-data.txt/0000000000000003 file 21B
   ...
   NAME                          TYPE SIZE
   /my-data.txt/0000000000000031 file 22B
   ```

## Example: Appending to Files with `--split`

Combining `--split` with the default *append* behavior of
`pachctl put file` enables flexible and scalable processing of
record-oriented file data from external, legacy systems.

When you append to an existing file by using `--split`, Pachyderm
ensures that only the newly added data gets processed. Pachyderm
also optimizes storage utilization by automatically deduplicating each
split-file. As a result, a large file with many duplicate lines,
or with objects that have identical hashes, might use less space
in PFS after it is split than it does as
a single file outside of PFS.

To complete this example, follow these steps:

1. Create a file `count.txt` with the following lines:

   ```shell
   Zero
   One
   Two
   Three
   Four
   ```

1. Put the `count.txt` file into a Pachyderm repository called `raw_data`:

   ```shell
   pachctl put file -f count.txt raw_data@master --split line
   ```

   This command splits the `count.txt` file by line and creates
   a separate file with one line in each file. Also, this operation
   creates five datums that are processed by the
   pipelines that use this repository as input.

1. View the repository contents:

   ```shell
   pachctl list file raw_data@master
   ```

   **System Response:**

   ```shell
   NAME       TYPE SIZE
   /count.txt dir  24B
   ```

   Pachyderm created a directory called `count.txt`.

1. View the contents of the `count.txt` directory:

   ```shell
   pachctl list file raw_data@master:count.txt
   ```

   **System Response:**

   ```shell
   NAME                        TYPE SIZE
   /count.txt/0000000000000000 file 5B
   /count.txt/0000000000000001 file 4B
   /count.txt/0000000000000002 file 4B
   /count.txt/0000000000000003 file 6B
   /count.txt/0000000000000004 file 5B
   ```

   In the output above, you can see that Pachyderm created five split-files
   from the original `count.txt` file. Each file has one line from the
   original `count.txt`. You can check the contents of each file by
   running the `pachctl get file` command. For example, to get
   the contents of `count.txt/0000000000000000`, run the following
   command:

   ```shell
   pachctl get file raw_data@master:count.txt/0000000000000000
   ```

   **System Response:**

   ```shell
   Zero
   ```

1. Create a one-line file called `more-count.txt` with the
   following content:

   ```shell
   Five
   ```

1. Load this file into Pachyderm by appending it to the `count.txt` file:

   ```shell
   pachctl put file raw_data@master:count.txt -f more-count.txt --split line
   ```

   * If you do not specify the `--split` flag while appending to
     a file that was previously split, Pachyderm displays the following
     error message:

     ```shell
     could not put file at "/count.txt"; a file of type directory is already there
     ```

1. Verify that another file was added:

   ```shell
   pachctl list file raw_data@master:count.txt
   ```

   **System Response:**

   ```shell
   NAME                        TYPE SIZE
   /count.txt/0000000000000000 file 5B
   /count.txt/0000000000000001 file 4B
   /count.txt/0000000000000002 file 4B
   /count.txt/0000000000000003 file 6B
   /count.txt/0000000000000004 file 5B
   /count.txt/0000000000000005 file 5B
   ```

   The `/count.txt/0000000000000005` file was added to the input
   repository. Pachyderm considers
   this new file a separate datum. Therefore, pipelines process
   only that datum instead of all the chunks of `count.txt`.

1. Get the contents of the `/count.txt/0000000000000005` file:

   ```shell
   pachctl get file raw_data@master:count.txt/0000000000000005
   ```

   **System Response:**

   ```shell
   Five
   ```

## Example: Overwriting Files with `--split`

The behavior of Pachyderm when a file loaded with `--split` is
overwritten is simple to explain but subtle in its implications.
Most importantly, it matters when new rows
are inserted within the file as opposed to just being appended to the end.
The loaded file is split into sequentially named files,
as shown above. If any of the resulting
split-files hashes differently than the one it replaces,
the Pachyderm Pipeline System processes that data again.
This can have significant consequences for downstream processing.

To complete this example, follow these steps:

1. Create a file `count.txt` with the following lines:

   ```shell
   One
   Two
   Three
   Four
   Five
   ```

1. Put the file into a Pachyderm repository called `raw_data`:

   ```shell
   pachctl put file -f count.txt raw_data@master --split line
   ```

   This command splits the `count.txt` file by line and creates
   a separate file with one line in each file.
   Also, this operation
   creates five datums that are processed by the
   pipelines that use this repository as input.

1. View the repository contents:

   ```shell
   pachctl list file raw_data@master
   ```

   **System Response:**

   ```shell
   NAME       TYPE SIZE
   /count.txt dir  24B
   ```

   Pachyderm created a directory called `count.txt`.

1. View the contents of the `count.txt` directory:

   ```shell
   pachctl list file raw_data@master:count.txt
   ```

   **System Response:**

   ```shell
   NAME                        TYPE SIZE
   /count.txt/0000000000000000 file 4B
   /count.txt/0000000000000001 file 4B
   /count.txt/0000000000000002 file 6B
   /count.txt/0000000000000003 file 5B
   /count.txt/0000000000000004 file 5B
   ```

   In the output above, you can see that Pachyderm created five split-files
   from the original `count.txt` file. Each file has one line from the
   original `count.txt` file. You can check the contents of each file by
   running the `pachctl get file` command. For example, to get
   the contents of `count.txt/0000000000000000`, run the following
   command:

   ```shell
   pachctl get file raw_data@master:count.txt/0000000000000000
   ```

   **System Response:**

   ```shell
   One
   ```

1. In your local directory, modify the original `count.txt` file by
   inserting the word *Zero* on the first line:

   ```shell
   Zero
   One
   Two
   Three
   Four
   Five
   ```

1. Upload the updated `count.txt` file into the `raw_data` repository
   by using the `--split` and `--overwrite` flags:

   ```shell
   pachctl put file -f count.txt raw_data@master:count.txt --split line --overwrite
   ```

   Inserting a line at the top shifts every following line into the
   next sequentially named split-file.
   Because Pachyderm takes the file name into account when hashing
   data for a pipeline, it considers every single split-file as new,
   and the pipelines that use this repository as input process all
   six datums.

1. List the files in the directory:

   ```shell
   pachctl list file raw_data@master:count.txt
   ```

   **System Response:**

   ```shell
   NAME                        TYPE SIZE
   /count.txt/0000000000000000 file 5B
   /count.txt/0000000000000001 file 4B
   /count.txt/0000000000000002 file 4B
   /count.txt/0000000000000003 file 6B
   /count.txt/0000000000000004 file 5B
   /count.txt/0000000000000005 file 5B
   ```

   The `/count.txt/0000000000000000` file now has the newly added `Zero` line.
   To verify the contents of the file, run:

   ```shell
   pachctl get file raw_data@master:count.txt/0000000000000000
   ```

   **System Response:**

   ```shell
   Zero
   ```

!!! note "See Also:"
    [Splitting Data](splitting.md)
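
The shift described in the overwriting example can be reproduced
locally. The following sketch makes some assumptions: it does not call
`pachctl`, `split_by_line` is a hypothetical helper that imitates the
sequential hexadecimal naming, and comparing name plus content is only a
stand-in for how Pachyderm decides that a split-file is new.

```shell
# Local illustration only: no pachctl commands are run here.
workdir="$(mktemp -d)"
mkdir "$workdir/before" "$workdir/after"

# split_by_line FILE DIR is a hypothetical helper (not part of Pachyderm).
# It writes one line per file, named with a 16-digit, zero-padded
# hexadecimal counter.
split_by_line() {
  awk -v dir="$2" '{ f = sprintf("%s/%016x", dir, NR - 1); print > f; close(f) }' "$1"
}

# Original file: One through Five.
printf '%s\n' One Two Three Four Five > "$workdir/count.txt"
split_by_line "$workdir/count.txt" "$workdir/before"

# Updated file: Zero inserted on the first line.
{ echo Zero; cat "$workdir/count.txt"; } > "$workdir/count-v2.txt"
split_by_line "$workdir/count-v2.txt" "$workdir/after"

# A split-file counts as new if its name did not exist before or if the
# content under that name changed.
new_or_changed=0
for f in "$workdir"/after/*; do
  old="$workdir/before/$(basename "$f")"
  if [ ! -f "$old" ] || ! cmp -s "$f" "$old"; then
    new_or_changed=$((new_or_changed + 1))
  fi
done
echo "new or changed split-files: $new_or_changed of 6"
```

Every name now holds different content, so all six split-files count as
new. Appending `Zero` at the end instead would leave the first five
unchanged.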