github.com/pachyderm/pachyderm@v1.13.4/examples/shuffle/README.md (about)

>![pach_logo](../img/pach_logo.svg) INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples
# Creating a shuffle pipeline

This example demonstrates how to create a shuffle pipeline, that is, a pipeline that shuffles and combines files without downloading or uploading their contents.

## Create fruits input repo
```shell
pachctl create repo fruits
pachctl put file fruits@master -f mango.jpeg
pachctl put file fruits@master -f apple.jpeg
```

## Create pricing input repo
```shell
pachctl create repo pricing
pachctl put file pricing@master -f mango.json
pachctl put file pricing@master -f apple.json
```

## Create shuffle pipeline
```shell
pachctl create pipeline -f shuffle.json
```

Let's take a closer look at that pipeline:

```json
{
  "input": {
    "union": [
      {
        "pfs": {
          "glob": "/*.jpeg",
          "repo": "fruits",
          "empty_files": true
        }
      },
      {
        "pfs": {
          "glob": "/*.json",
          "repo": "pricing",
          "empty_files": true
        }
      }
    ]
  },
  "pipeline": {
    "name": "shuffle"
  },
  "transform": {
    "image": "ubuntu",
    "cmd": ["/bin/bash"],
    "stdin": [
      "echo 'process fruits if any'",
      "fn=$(find  -L /pfs -not -path \"*/\\.*\"  -type f \\( -path '*/fruits/*' \\))",
      "for f in $fn; do fruit_name=$(basename $f .jpeg); mkdir -p /pfs/out/$fruit_name/; ln -s $f /pfs/out/$fruit_name/img.jpeg; done",
      "echo 'process pricing if any'",
      "fn=$(find  -L /pfs -not -path \"*/\\.*\"  -type f \\( -path '*/pricing/*' \\))",
      "for f in $fn; do fruit_name=$(basename $f .json); mkdir -p /pfs/out/$fruit_name/; ln -s $f /pfs/out/$fruit_name/cost.json; done"
    ]
  }
}
```

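The `find` filter used by the transform can be tried outside Pachyderm. The following is a minimal sketch, assuming a temporary directory as a stand-in for `/pfs`, which only exists inside a pipeline worker. The directory layout, the `.hidden` folder, and the `pfs` and `selected` variables are illustrative and not part of the example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for /pfs, populated with empty placeholder files.
pfs=$(mktemp -d)
mkdir -p "$pfs/fruits/.hidden" "$pfs/pricing"
: > "$pfs/fruits/apple.jpeg"
: > "$pfs/fruits/.hidden/skip.jpeg"
: > "$pfs/pricing/apple.json"

cd "$pfs"

# -L follows symlinks, -not -path "*/\.*" skips any path with a
# dot-prefixed component, -type f keeps regular files only, and the
# final -path test restricts the matches to the fruits input.
selected=$(find -L . -not -path "*/\.*" -type f \( -path '*/fruits/*' \))

echo "$selected"
# prints: ./fruits/apple.jpeg
```

The pricing file and the file under `.hidden` are both filtered out, so only the fruit image reaches the loop that creates the symlinks.
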
Notice that both of our inputs have the `"empty_files"` field set to `true`.
This means that the pipeline sees files with the correct names but no content. If your
shuffle can be done by looking only at the names of the files, without considering their
content, specifying `"empty_files"` will massively improve its performance, because no file content has to be downloaded.

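To see why file names alone are enough for this shuffle, the renaming step can be reproduced locally with empty placeholder files. This is a sketch under assumptions: a temporary directory stands in for `/pfs`, the `pfs` and `name` variables are illustrative, and shell globs replace the transform's `find` call for brevity.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for /pfs with one input directory per repo.
pfs=$(mktemp -d)
mkdir -p "$pfs/fruits" "$pfs/pricing" "$pfs/out"

# Zero-byte files: correct names, no content.
: > "$pfs/fruits/mango.jpeg"
: > "$pfs/pricing/mango.json"

# <name>.jpeg becomes <name>/img.jpeg
for f in "$pfs"/fruits/*.jpeg; do
  name=$(basename "$f" .jpeg)
  mkdir -p "$pfs/out/$name"
  ln -s "$f" "$pfs/out/$name/img.jpeg"
done

# <name>.json becomes <name>/cost.json
for f in "$pfs"/pricing/*.json; do
  name=$(basename "$f" .json)
  mkdir -p "$pfs/out/$name"
  ln -s "$f" "$pfs/out/$name/cost.json"
done

# Show the resulting layout relative to the output directory.
(cd "$pfs/out" && find . -type l | sort)
# prints:
# ./mango/cost.json
# ./mango/img.jpeg
```

The output layout is derived entirely from `basename`, so the script never reads a byte of the input files.
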
## Results

### List job
`pachctl list job` shows that no data was downloaded or uploaded (the `DL` and `UL` columns both read `0B`):

| ID                               | OUTPUT COMMIT                            | STARTED        | DURATION  | RESTART | PROGRESS  | DL | UL | STATE   |
|----------------------------------|------------------------------------------|----------------|-----------|---------|-----------|----|----|---------|
| 60617fd06155451d8358cc714bf9b670 | shuffle/f56e97fa9e234eb6ad902640d4fba2ac | 10 seconds ago | 4 seconds | 0       | 4 + 0 / 4 | 0B | 0B | success |

### Output files
`pachctl list file "shuffle@master:*"` shows the shuffled files:

| NAME             | TYPE | SIZE     |
|------------------|------|----------|
| /mango/cost.json | file | 22B      |
| /mango/img.jpeg  | file | 7.029KiB |
| /apple/cost.json | file | 23B      |
| /apple/img.jpeg  | file | 4.978KiB |