> INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples

# Creating a shuffle pipeline

This example demonstrates how to create a shuffle pipeline, i.e. a pipeline that shuffles and combines files without downloading or uploading their contents.

## Create fruits input repo
```shell
pachctl create repo fruits
pachctl put file fruits@master -f mango.jpeg
pachctl put file fruits@master -f apple.jpeg
```

## Create pricing input repo
```shell
pachctl create repo pricing
pachctl put file pricing@master -f mango.json
pachctl put file pricing@master -f apple.json
```

## Create shuffle pipeline
```shell
pachctl create pipeline -f shuffle.json
```

Let's take a closer look at that pipeline:

```json
{
  "input": {
    "union": [
      {
        "pfs": {
          "glob": "/*.jpeg",
          "repo": "fruits",
          "empty_files": true
        }
      },
      {
        "pfs": {
          "glob": "/*.json",
          "repo": "pricing",
          "empty_files": true
        }
      }
    ]
  },
  "pipeline": {
    "name": "shuffle"
  },
  "transform": {
    "image": "ubuntu",
    "cmd": ["/bin/bash"],
    "stdin": [
      "echo 'process fruits if any'",
      "fn=$(find -L /pfs -not -path \"*/\\.*\" -type f \\( -path '*/fruits/*' \\))",
      "for f in $fn; do fruit_name=$(basename $f .jpeg); mkdir -p /pfs/out/$fruit_name/; ln -s $f /pfs/out/$fruit_name/img.jpeg; done",
      "echo 'process pricing if any'",
      "fn=$(find -L /pfs -not -path \"*/\\.*\" -type f \\( -path '*/pricing/*' \\))",
      "for f in $fn; do fruit_name=$(basename $f .json); mkdir -p /pfs/out/$fruit_name/; ln -s $f /pfs/out/$fruit_name/cost.json; done"
    ]
  }
}
```

Notice that both of our inputs have the `"empty_files"` field set to `true`.
This means that the pipeline sees files with the correct names but no content.
If your shuffle can be done by looking only at the names of the files, without
considering their content, specifying `"empty_files"` will massively improve its
performance, because no file data has to be downloaded.

## Results

### List job
`pachctl list job` indicates that no data was downloaded or uploaded:

| ID                               | OUTPUT COMMIT                            | STARTED        | DURATION  | RESTART | PROGRESS  | DL | UL | STATE   |
|----------------------------------|------------------------------------------|----------------|-----------|---------|-----------|----|----|---------|
| 60617fd06155451d8358cc714bf9b670 | shuffle/f56e97fa9e234eb6ad902640d4fba2ac | 10 seconds ago | 4 seconds | 0       | 4 + 0 / 4 | 0B | 0B | success |

### Output files
`pachctl list file "shuffle@master:*"` shows the shuffled files:

| NAME             | TYPE | SIZE     |
|------------------|------|----------|
| /mango/cost.json | file | 22B      |
| /mango/img.jpeg  | file | 7.029KiB |
| /apple/cost.json | file | 23B      |
| /apple/img.jpeg  | file | 4.978KiB |
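
## Simulating the transform locally

The symlink logic in the pipeline's `stdin` can be tried without a Pachyderm cluster. The following is a minimal sketch, not part of the original example: it assumes a temporary directory standing in for `/pfs`, uses empty placeholder files to mimic `"empty_files": true`, and wraps the two loops in a hypothetical `shuffle` helper function.

```shell
#!/usr/bin/env bash
# Local simulation of the shuffle transform; no Pachyderm cluster is involved.
set -euo pipefail

# Assumption: a temporary directory stands in for the /pfs mount.
root=$(mktemp -d)
mkdir -p "$root/fruits" "$root/pricing" "$root/out"

# Empty files mimic inputs exposed with "empty_files": true.
touch "$root/fruits/mango.jpeg" "$root/fruits/apple.jpeg"
touch "$root/pricing/mango.json" "$root/pricing/apple.json"

# shuffle REPO EXT TARGET (hypothetical helper):
# link every REPO/*.EXT to out/<basename>/TARGET.
shuffle() {
  local repo=$1 ext=$2 target=$3 f name
  for f in "$root/$repo"/*."$ext"; do
    [ -e "$f" ] || continue   # skip when the glob matched nothing
    name=$(basename "$f" ".$ext")
    mkdir -p "$root/out/$name"
    ln -s "$f" "$root/out/$name/$target"
  done
}

shuffle fruits jpeg img.jpeg
shuffle pricing json cost.json

# Print the resulting layout relative to the output directory.
(cd "$root/out" && find . -type l | sort)
```

Running the script should list four symlinks, grouped per fruit as `img.jpeg` and `cost.json`, which mirrors the layout shown in the output table above. Only file names are inspected, which is why the real pipeline can run with empty input files.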