# Adjusting Data Processing by Splitting Data

!!! info
    Before you read this section, make sure that you understand
    the concepts described in [File](../../concepts/data-concepts/file.md),
    [Glob Pattern](../../concepts/pipeline-concepts/datum/glob-pattern.md),
    [Pipeline Specification](../../reference/pipeline_spec.md), and
    [Individual Developer Workflow](../individual-developer-workflow.md).

Unlike source code version-control systems, such as Git, that mostly
store and version text files, Pachyderm does not perform intra-file
diffing. This means that if you have a 10,000-line CSV file and
change a single line in that file, a pipeline that is subscribed
to the repository where that file is located processes the whole file.
You can adjust this behavior by splitting your file into chunks
when you load it.

Pachyderm applies diffing at the file level.
Therefore, if one bit of a file changes,
Pachyderm identifies that change as a new file.
Similarly, Pachyderm can only distribute computation
at the level of a single file. If your data is stored in
one large file, it can only be processed by a single worker, which
might affect performance.

To optimize performance and processing time, you might want to
break up large files into smaller chunks while Pachyderm uploads
data to an input repository. For simple data types, you can
run the `pachctl put file` command with the `--split` flag. For
more complex splitting patterns, such as when you work with `.avro`
or other binary formats, you need to manually split your data
either at ingest or by configuring splitting in a Pachyderm
pipeline.
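
The effect of file-level diffing can be illustrated locally, without a
Pachyderm cluster. The following is a minimal sketch that uses standard
tools (`seq`, `sed`, `split`, `cmp`) as a stand-in for Pachyderm. The file
names, the 100-line file, and the 10-line chunk size are assumptions made
for illustration. The sketch changes one line of a file and counts how many
chunks differ afterward.

```shell
#!/usr/bin/env bash
# Local illustration only: coreutils stand in for Pachyderm's splitting.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Build a 100-line file, then a second version with line 42 changed.
seq 1 100 > data-v1.csv
sed '42s/.*/changed/' data-v1.csv > data-v2.csv

# Split both versions into 10-line chunks (hypothetical chunk size).
mkdir v1 v2
split -l 10 data-v1.csv v1/chunk-
split -l 10 data-v2.csv v2/chunk-

# Count the chunks whose contents differ between the two versions.
changed=0
total=0
for f in v1/chunk-*; do
  total=$((total + 1))
  if ! cmp -s "$f" "v2/$(basename "$f")"; then
    changed=$((changed + 1))
  fi
done
echo "changed chunks: $changed of $total"
```

The script prints `changed chunks: 1 of 10`. As a single file, the whole
of `data-v2.csv` counts as changed. When split, only the chunk that
contains line 42 differs.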

## Using Split and Target-File Flags

For common file types that are often used in data science, such as CSV,
line-delimited text files, and JavaScript Object Notation (JSON) files,
Pachyderm includes the `--split`, `--target-file-bytes`, and
`--target-file-datums` flags.

!!! note
    In this section, a chunk of data is called a *split-file*.

| Flag                   | Description |
| ---------------------- | ----------- |
| `--split`              | Divides a file into chunks based on a *record*, such as newlines in line-delimited files or JSON objects in JSON files. The `--split` flag takes one of the following arguments: `line`, `json`, or `sql`. For example, `--split line` ensures that Pachyderm only breaks up a file on newline boundaries and not in the middle of a line. |
| `--target-file-bytes`  | This flag must be used with the `--split` flag. The `--target-file-bytes` flag fills each split-file with data up to the specified number of bytes, splitting on the nearest record boundary. For example, you have a line-delimited file of 50 lines, with each line having about 20 bytes. If you run `pachctl put file` with `--split line --target-file-bytes 100`, the input file is split into about 10 files of about 5 lines each. The size of each split-file hovers above the target value of 100 bytes and does not go below 100 bytes until the last split-file, which might be smaller than 100 bytes. |
| `--target-file-datums` | This flag must be used with the `--split` flag. The `--target-file-datums` flag attempts to fill each split-file with the specified number of datums. If you run `pachctl put file` with `--split line --target-file-datums 2` on a line-delimited file of 100 lines, the file is split into 50 split-files of 2 lines each. |

If you specify both the `--target-file-datums` and `--target-file-bytes` flags,
Pachyderm creates split-files until it hits one of the
constraints.

!!! note "See also:"
    [Splitting Data for Distributed Processing](../splitting/#ingesting-postgressql-data)

### Example: Splitting a File

In this example, you have a 50-line file called `my-data.txt`.
You create a repository named `line-data` and load
`my-data.txt` into that repository. Then, you can analyze
how the data is split in the repository.

To complete this example, perform the following steps:

1. Create a file with fifty lines named `my-data.txt`. You can
   add any lines, such as the numbers from one to fifty, US states,
   or anything else.

   **Example:**

   ```shell
   One
   Two
   Three
   ...
   Fifty
   ```

1. Create a repository called `line-data` and load `my-data.txt`
   into it with the `--split line` flag:

   ```shell
   $ pachctl create repo line-data
   $ pachctl put file line-data@master -f my-data.txt --split line
   ```

1. List the filesystem objects in the repository:

   ```shell
   $ pachctl list file line-data@master
   NAME         TYPE SIZE
   /my-data.txt dir  1.071KiB
   ```

   The `pachctl list file` command shows
   that the line-oriented file `my-data.txt`
   that was uploaded has been transformed into a
   directory that includes the chunks of the original
   `my-data.txt` file. Each chunk is put into a split-file
   and given a 16-character filename, left-padded with 0.
   Pachyderm numbers each filename sequentially in hexadecimal.
   To see the naming structure, list the contents of the
   `my-data.txt` directory.

1. List the files in the `my-data.txt` directory:

   ```shell
   $ pachctl list file line-data@master:my-data.txt
   NAME                          TYPE SIZE
   /my-data.txt/0000000000000000 file 21B
   /my-data.txt/0000000000000001 file 22B
   /my-data.txt/0000000000000002 file 24B
   /my-data.txt/0000000000000003 file 21B
   ...
   NAME                          TYPE SIZE
   /my-data.txt/0000000000000031 file 22B
   ```

   The last split-file, `0000000000000031`, is number 49 in decimal,
   so the directory contains 50 split-files, one for each line.

## Example: Appending to Files with `--split`

Combining `--split` with the default *append* behavior of
`pachctl put file` enables flexible and scalable processing of
record-oriented file data from external, legacy systems.

When you append to an existing file by using `--split`, Pachyderm
ensures that only the newly added data gets processed. Pachyderm also
optimizes storage utilization by automatically deduplicating each
split-file. As a result, a large file with many duplicate lines, or
objects with identical hashes, might use less space in PFS when split
than it does as a single file outside of PFS.

To complete this example, follow these steps:

1. Create a file `count.txt` with the following lines:

   ```shell
   Zero
   One
   Two
   Three
   Four
   ```

1. Put the `count.txt` file into a Pachyderm repository called `raw_data`:

   ```shell
   $ pachctl put file -f count.txt raw_data@master --split line
   ```

   This command splits the `count.txt` file by line and creates
   a separate file for each line. This operation also
   creates five datums that are processed by the
   pipelines that use this repository as input.

1. View the repository contents:

   ```shell
   $ pachctl list file raw_data@master
   NAME       TYPE SIZE
   /count.txt dir  24B
   ```

   Pachyderm created a directory called `count.txt`.

1. View the contents of the `count.txt` directory:

   ```shell
   $ pachctl list file raw_data@master:count.txt
   NAME                        TYPE SIZE
   /count.txt/0000000000000000 file 5B
   /count.txt/0000000000000001 file 4B
   /count.txt/0000000000000002 file 4B
   /count.txt/0000000000000003 file 6B
   /count.txt/0000000000000004 file 5B
   ```

   In the output above, you can see that Pachyderm created five split-files
   from the original `count.txt` file. Each file has one line from the
   original `count.txt`. You can check the contents of each file by
   running the `pachctl get file` command. For example, to get
   the contents of `count.txt/0000000000000000`, run the following
   command:

   ```shell
   $ pachctl get file raw_data@master:count.txt/0000000000000000
   Zero
   ```

1. Create a one-line file called `more-count.txt` with the
   following content:

   ```shell
   Five
   ```

1. Load this file into Pachyderm by appending it to the `count.txt` file:

   ```shell
   $ pachctl put file raw_data@master:count.txt -f more-count.txt --split line
   ```

   * If you do not specify the `--split` flag while appending to
     a file that was previously split, Pachyderm displays the following
     error message:

     ```shell
     could not put file at "/count.txt"; a file of type directory is already there
     ```

1. Verify that another file was added:

   ```shell
   $ pachctl list file raw_data@master:count.txt
   NAME                        TYPE SIZE
   /count.txt/0000000000000000 file 5B
   /count.txt/0000000000000001 file 4B
   /count.txt/0000000000000002 file 4B
   /count.txt/0000000000000003 file 6B
   /count.txt/0000000000000004 file 5B
   /count.txt/0000000000000005 file 5B
   ```

   The `/count.txt/0000000000000005` file was added to the input
   repository. Pachyderm considers
   this new file a separate datum. Therefore, pipelines process
   only that datum instead of all the chunks of `count.txt`.

1. Get the contents of the `/count.txt/0000000000000005` file:

   ```shell
   $ pachctl get file raw_data@master:count.txt/0000000000000005
   Five
   ```

## Example: Overwriting Files with `--split`

Overwriting a file that was loaded with `--split` is simple to
describe but has subtle implications, especially when new rows
are inserted within the file rather than appended to the end.
Pachyderm splits the loaded file into sequentially named split-files,
as shown above. If any resulting split-file hashes differently
than the one it replaces, the Pachyderm Pipeline System processes
that data again. Because an inserted row shifts every row after it
into a different split-file, a single insertion can cause much of the
file to be reprocessed downstream.

To complete this example, follow these steps:

1. Create a file `count.txt` with the following lines:

   ```shell
   One
   Two
   Three
   Four
   Five
   ```

1. Put the file into a Pachyderm repository called `raw_data`:

   ```shell
   $ pachctl put file -f count.txt raw_data@master --split line
   ```

   This command splits the `count.txt` file by line and creates
   a separate file for each line. This operation also
   creates five datums that are processed by the
   pipelines that use this repository as input.

1. View the repository contents:

   ```shell
   $ pachctl list file raw_data@master
   NAME       TYPE SIZE
   /count.txt dir  24B
   ```

   Pachyderm created a directory called `count.txt`.

1. View the contents of the `count.txt` directory:

   ```shell
   $ pachctl list file raw_data@master:count.txt
   NAME                        TYPE SIZE
   /count.txt/0000000000000000 file 4B
   /count.txt/0000000000000001 file 4B
   /count.txt/0000000000000002 file 6B
   /count.txt/0000000000000003 file 5B
   /count.txt/0000000000000004 file 5B
   ```

   In the output above, you can see that Pachyderm created five split-files
   from the original `count.txt` file. Each file has one line from the
   original `count.txt` file. You can check the contents of each file by
   running the `pachctl get file` command. For example, to get
   the contents of `count.txt/0000000000000000`, run the following
   command:

   ```shell
   $ pachctl get file raw_data@master:count.txt/0000000000000000
   One
   ```

1. In your local directory, modify the original `count.txt` file by
   inserting the word *Zero* on the first line:

   ```shell
   Zero
   One
   Two
   Three
   Four
   Five
   ```

1. Upload the updated `count.txt` file into the `raw_data` repository
   by using the `--split` and `--overwrite` flags:

   ```shell
   $ pachctl put file -f count.txt raw_data@master:count.txt --split line --overwrite
   ```

   Because Pachyderm takes the file name into account when hashing
   data for a pipeline, it considers every single split-file as new,
   and the pipelines that use this repository as input process all
   six datums.

1. List the files in the directory:

   ```shell
   $ pachctl list file raw_data@master:count.txt
   NAME                        TYPE SIZE
   /count.txt/0000000000000000 file 5B
   /count.txt/0000000000000001 file 4B
   /count.txt/0000000000000002 file 4B
   /count.txt/0000000000000003 file 6B
   /count.txt/0000000000000004 file 5B
   /count.txt/0000000000000005 file 5B
   ```

   The `/count.txt/0000000000000000` file now has the newly added `Zero` line.
   To verify the contents of the file, run:

   ```shell
   $ pachctl get file raw_data@master:count.txt/0000000000000000
   Zero
   ```

!!! note "See Also:"
    [Splitting Data](splitting.md)
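
The difference between appending a line and inserting one at the top of a
split file can be reproduced locally. The following is a minimal sketch
that emulates the split-file naming described on this page (a 16-character,
zero-padded hexadecimal index) with plain shell tools. It does not call
Pachyderm, and the helper functions `split_lines` and `count_changed` are
hypothetical names introduced for illustration. A split-file counts as
changed when it is new or when its contents differ from the split-file of
the same name in the original version.

```shell
#!/usr/bin/env bash
# Local illustration only: emulates per-line split-file naming without Pachyderm.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# split_lines FILE DIR: write each line of FILE to DIR/<16-digit hex index>.
split_lines() {
  local i=0 line
  mkdir -p "$2"
  while IFS= read -r line; do
    printf '%s\n' "$line" > "$2/$(printf '%016x' "$i")"
    i=$((i + 1))
  done < "$1"
}

# count_changed OLD NEW: count split-files in NEW that are new or differ from OLD.
count_changed() {
  local n=0 f
  for f in "$2"/*; do
    if ! cmp -s "$f" "$1/$(basename "$f")" 2>/dev/null; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

printf '%s\n' One Two Three Four Five > original.txt
printf '%s\n' One Two Three Four Five Six > appended.txt
printf '%s\n' Zero One Two Three Four Five > prepended.txt

split_lines original.txt original
split_lines appended.txt appended
split_lines prepended.txt prepended

appended_changed=$(count_changed original appended)
prepended_changed=$(count_changed original prepended)
echo "append:  $appended_changed changed split-file(s)"
echo "prepend: $prepended_changed changed split-file(s)"
```

The sketch reports one changed split-file for the append and six for the
insertion at the top. This matches the examples above: an append adds one
datum, and an insertion shifts every later line into a different split-file.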