github.com/pachyderm/pachyderm@v1.13.4/doc/docs/1.11.x/concepts/pipeline-concepts/datum/cross-union.md

# Cross and Union Inputs

<!---This section needs to be made more clear. There is a lot of information
that I would say describes the things you can do with a cross or union pipeline
but does not really have a good and clear explanation of what they are -->

Pachyderm enables you to combine multiple
input repositories in a single pipeline by using the `union` and
`cross` operators in the pipeline specification.

If you are familiar with [set theory](https://en.wikipedia.org/wiki/Set_theory),
you can think of union as a *disjoint union binary operator* and cross as a
*Cartesian product binary operator*. However, if you are unfamiliar with these
concepts, it is still easy to understand how cross and union work in Pachyderm.

This section describes how to use `cross` and `union` in your pipelines and how you
can optimize your code when you work with them.

## Union Input

The union input combines the datums from all of the input repos into one
set of datums.
The number of datums that are processed is the sum of the
datums in each repo.

For example, you have two input repos, `A` and `B`. Each of these
repositories contains three files with the following names.

Repository `A` has the following structure:

```shell
A
├── 1.txt
├── 2.txt
└── 3.txt
```

Repository `B` has the following structure:

```shell
B
├── 4.txt
├── 5.txt
└── 6.txt
```

If you want your pipeline to process each file independently as a
separate datum, use a glob pattern of `/*`. The glob pattern of each
input is applied to that input independently of the others. The input section
in the pipeline spec might have the following structure:

```json
"input": {
    "union": [
        {
            "pfs": {
                "glob": "/*",
                "repo": "A"
            }
        },
        {
            "pfs": {
                "glob": "/*",
                "repo": "B"
            }
        }
    ]
}
```

In this example, each repository has three files in its root
directory, so each input contributes three datums. Therefore, the union of `A` and `B`
has six datums in total.
Your pipeline processes the following datums in no particular order:

```shell
/pfs/A/1.txt
/pfs/A/2.txt
/pfs/A/3.txt
/pfs/B/4.txt
/pfs/B/5.txt
/pfs/B/6.txt
```
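
The count above can be reproduced with a short Python sketch. It is not Pachyderm code: the `repos` dictionary and the `union_datums` helper are hypothetical names that only model how a `/*` glob over each repo yields one datum per top-level file.

```python
# Illustrative model only; Pachyderm computes the datums itself.
repos = {
    "A": ["1.txt", "2.txt", "3.txt"],
    "B": ["4.txt", "5.txt", "6.txt"],
}


def union_datums(repos):
    """Return one datum (a single path) per top-level file in each repo."""
    return [
        f"/pfs/{repo}/{name}"
        for repo, files in repos.items()
        for name in files
    ]


datums = union_datums(repos)
print(len(datums))  # 3 + 3 = 6
```

Each entry in `datums` stands for one separate run of your code, which is why the total is a sum rather than a product.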

!!! note
    Each datum in a pipeline is processed independently by a single
    execution of your code. In this example, your code runs six times, and
    each datum is available to it one at a time. For example, your code
    processes `/pfs/A/1.txt` in one of the runs and `/pfs/B/5.txt` in a
    different run, and so on. In a union, two or more datums are never
    available to your code at the same time. You can simplify
    your union code by using the `name` property as described below.

### Simplifying the Union Pipeline Code

In the example above, your code needs to read from either the `/pfs/A`
_or_ the `/pfs/B` directory because only one of them is present in any given datum.
To simplify your code, you can add the `name` field to the `pfs` object and
give the same name to each of the input repos. For example, you can add the
`name` field with the value `C` to the input repositories `A` and `B`:

```json
"input": {
    "union": [
        {
            "pfs": {
                "name": "C",
                "glob": "/*",
                "repo": "A"
            }
        },
        {
            "pfs": {
                "name": "C",
                "glob": "/*",
                "repo": "B"
            }
        }
    ]
}
```

Then, in the pipeline, all datums appear in the same directory, `/pfs/C`,
although your code still sees only one datum in each run:

```shell
/pfs/C/1.txt  # from A
/pfs/C/2.txt  # from A
/pfs/C/3.txt  # from A
/pfs/C/4.txt  # from B
/pfs/C/5.txt  # from B
/pfs/C/6.txt  # from B
```
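
As a sketch of how the shared name simplifies user code, the following Python function processes whatever is mounted for the current datum. The function name `process_datum` and its `root` parameter are hypothetical and exist only so that the example can be run outside a pipeline; inside a pipeline that uses the spec above, `root` would be `/pfs/C`.

```python
import os


def process_datum(root="/pfs/C"):
    """Return (file name, size in bytes) for each file visible in this run.

    With a `/*` glob, each run is expected to see a single file here,
    regardless of whether it came from repo A or repo B.
    """
    results = []
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if os.path.isfile(path):
            results.append((name, os.path.getsize(path)))
    return results
```

Without the shared name, the same code would first have to determine whether `/pfs/A` or `/pfs/B` is the directory present in the current run.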

## Cross Input

In a cross input, Pachyderm creates every combination of datums,
or a cross-product, from your input repositories and exposes each
combination to your code in a single run.
In other words, a cross input pairs every datum in one repository with
each datum in another, creating sets of datums. Your transformation
code is provided with one of these sets at a time to process.

For example, you have repositories `A` and `B`, each with three datums
and the following structure:

!!! note
    For this example, the glob pattern is set to `/*`.

Repository `A` has three files at the top level:

```shell
A
├── 1.txt
├── 2.txt
└── 3.txt
```

Repository `B` has three files at the top level:

```shell
B
├── 4.txt
├── 5.txt
└── 6.txt
```

Because you have three datums in each repo, Pachyderm exposes
a total of nine combinations of datums to your code.

!!! important
    In cross pipelines, both the `/pfs/A` and `/pfs/B`
    directories are visible during each code run.

```shell
Run 1: /pfs/A/1.txt
       /pfs/B/4.txt

Run 2: /pfs/A/1.txt
       /pfs/B/5.txt
...

Run 9: /pfs/A/3.txt
       /pfs/B/6.txt
```
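
The nine combinations can be enumerated with `itertools.product` from the Python standard library. This is only a model of the pairing: the variable names are hypothetical, and the run numbering follows the listing above without implying the order in which Pachyderm actually schedules the runs.

```python
from itertools import product

a_datums = ["/pfs/A/1.txt", "/pfs/A/2.txt", "/pfs/A/3.txt"]
b_datums = ["/pfs/B/4.txt", "/pfs/B/5.txt", "/pfs/B/6.txt"]

# Each (A, B) pair corresponds to one run of the user code.
runs = list(product(a_datums, b_datums))

for number, (a_path, b_path) in enumerate(runs, start=1):
    print(f"Run {number}: {a_path} {b_path}")
```

The number of runs is the product of the datum counts of the inputs (3 × 3 here), in contrast to the union input, where it is their sum.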

!!! note
    In cross inputs, if you use the `name` field, your two
    inputs must not have the same name, because that could cause file system collisions.
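
The cross example above does not show its input section. The Python sketch below builds one as a dictionary that mirrors the union spec with the `cross` key, and checks the naming rule from the note. The helper `duplicate_input_names` is hypothetical, and the fallback to the repo name when `name` is omitted is an assumption about the default, not something this page states.

```python
import json
from collections import Counter

cross_input = {
    "cross": [
        {"pfs": {"glob": "/*", "repo": "A"}},
        {"pfs": {"glob": "/*", "repo": "B"}},
    ]
}


def duplicate_input_names(input_spec):
    """Return the names used by more than one `pfs` input of a cross."""
    names = [
        # Assumption: an input without `name` is mounted under its repo name.
        entry["pfs"].get("name", entry["pfs"]["repo"])
        for entry in input_spec["cross"]
    ]
    return sorted(name for name, count in Counter(names).items() if count > 1)


print(json.dumps({"input": cross_input}, indent=4))
print(duplicate_input_names(cross_input))  # no duplicates: A and B differ
```

Giving both inputs the name `C`, as in the union example, would make the helper report `C` as a duplicate.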

!!! note "See Also:"
    - [Cross Input](../../../../reference/pipeline_spec/#cross-input)
    - [Union Input](../../../../reference/pipeline_spec/#union-input)
    - [Distributed hyperparameter tuning](https://github.com/pachyderm/pachyderm/tree/master/examples/ml/hyperparameter)

<!-- Add a link to an interactive tutorial when it's ready-->