github.com/pachyderm/pachyderm@v1.13.4/doc/docs/archived/combining.md

# Combine, Merge, and Join Data

!!! info
    Before you read this section, make sure that you understand the concepts
    described in [Distributed Processing](distributed_computing.md).

In some of your projects, you might need to match datums from
multiple data repositories to process, join, or aggregate data. For
example, you might need to jointly process all of the records that
correspond to a certain user, experiment, or device.

In these cases, you can create two pipelines that perform the following
steps:

![Steps](../assets/images/d_steps_combine_pipelines.svg)

More specifically, you need to create the following pipelines:

1. A pipeline that groups all of the records for a specific
key and index.

2. A pipeline that takes that grouped output and performs
the merging, joining, or other processing for each group.

You can use these two data-combining pipelines for
merging or grouped processing of data from various experiments,
devices, and so on. You can also apply the same pattern to
perform distributed joins of tabular data or data from database
tables. For example, you can join user email records
with user IP records on the key and index of a user ID.

You can parallelize each of the stages across workers to
scale with the size of your data and the number of data
sources that you want to merge.

!!! tip
    If your data is not split into separate files for
    each record, you can split it automatically as described in
    [Splitting Data for Distributed Processing](splitting-data/splitting.md).

## Group Matching Records

The first pipeline that you create groups the records that
need to be processed together.

In this example, you have two repositories, `A` and `B`,
with JSON records.
These repositories might correspond to two experiments, two geographic
regions, two different devices that generate data, or other sources.

The following diagram displays the first pipeline:

![alt tag](../assets/images/d_join1.svg)

The repository `A` has the following structure:

```shell
$ pachctl list file A@master
NAME                TYPE                SIZE
1.json              file                39 B
2.json              file                39 B
3.json              file                39 B
```

The repository `B` has the following structure:

```shell
$ pachctl list file B@master
NAME                TYPE                SIZE
1.json              file                39 B
2.json              file                39 B
3.json              file                39 B
```

If you want to process `A/1.json` with `B/1.json` to merge
their contents or otherwise process them together, you need to
group each set of matching JSON records into its own datum that
the pipeline described in
[Process Grouped Records](#process-grouped-records)
can process as a unit.

The grouping pipeline takes a union of `A` and `B` as inputs,
each with the glob pattern `/*`. When the pipeline processes a JSON file,
it copies the data to a folder in the output that corresponds to the
key and index for that record. In this example, the key is just the
number in the file name. The pipeline also renames the files to
unique names that correspond to the source repository:

```shell
/1
  A.json
  B.json
/2
  A.json
  B.json
/3
  A.json
  B.json
```
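
The grouping step described above could be sketched as the following shell function. This is a minimal illustration, not the actual code of any pipeline: the function name `group_records` and its two directory arguments are hypothetical, and the sketch assumes that the key is the file name without the `.json` extension. Inside a pipeline, the two roots would typically be `/pfs` and `/pfs/out`.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: place every record under <out_root>/<key>/<repo>.json.
# Both arguments are assumed to be absolute paths so that the links resolve.
group_records() {
  local pfs_root="$1" out_root="$2"
  local repo file key
  for repo in A B; do
    # With a union input, only one repository is populated per datum.
    [ -d "$pfs_root/$repo" ] || continue
    for file in "$pfs_root/$repo"/*.json; do
      [ -e "$file" ] || continue          # the glob matched nothing
      key="$(basename "$file" .json)"     # "1.json" becomes key "1"
      mkdir -p "$out_root/$key"
      # Link instead of copying, and name the link after the source repository.
      ln -sf "$file" "$out_root/$key/$repo.json"
    done
  done
}

# Inside a pipeline container this might be called as:
# group_records /pfs /pfs/out
```

Passing the roots as arguments keeps the function easy to try on an ordinary directory tree before it is placed in a container image.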

When you group your data, configure the pipeline as follows:

- In the `pfs` section of the pipeline specification, set
`"empty_files": true` to avoid unnecessary downloads of data.

- In the pipeline code, use symlinks to avoid unnecessary uploads of data
and unnecessary data duplication.
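
A pipeline specification along the following lines might express the `empty_files` setting together with the union input. Treat it as a sketch under assumptions: the image name `example/group:latest` and the script path `/group.sh` are placeholders, not part of Pachyderm, and the script is assumed to create symlinks under `/pfs/out`.

```shell
# Write a hypothetical specification for the grouping pipeline.
cat > group.json <<'EOF'
{
  "pipeline": { "name": "group" },
  "transform": {
    "image": "example/group:latest",
    "cmd": ["sh", "/group.sh"]
  },
  "input": {
    "union": [
      { "pfs": { "repo": "A", "glob": "/*", "empty_files": true } },
      { "pfs": { "repo": "B", "glob": "/*", "empty_files": true } }
    ]
  }
}
EOF

# With a running cluster, the pipeline could then be created with:
# pachctl create pipeline -f group.json
```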

## Process Grouped Records

After you group the records together by using the grouping pipeline, you
can use a merging pipeline on the `group` repository with a glob
pattern of `/*`. With this glob pattern, each top-level group directory
becomes its own datum, so
the pipeline can process each grouping of records in parallel.
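
The merging pipeline might be specified roughly as follows. This is only a sketch: the image name `example/merge:latest` and the script path `/merge.sh` are placeholders, and the `parallelism_spec` value of 3 is an arbitrary choice to show how the work could be spread across workers.

```shell
# Write a hypothetical specification for the merging pipeline.
cat > merge.json <<'EOF'
{
  "pipeline": { "name": "merge" },
  "transform": {
    "image": "example/merge:latest",
    "cmd": ["sh", "/merge.sh"]
  },
  "parallelism_spec": { "constant": 3 },
  "input": {
    "pfs": { "repo": "group", "glob": "/*" }
  }
}
EOF

# With a running cluster, the pipeline could then be created with:
# pachctl create pipeline -f merge.json
```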

The following diagram displays the second pipeline:

![alt tag](../assets/images/d_join2.svg)

The second pipeline performs merging, aggregation, or other
processing on each grouping of records. It can also
write each result to the root of the output directory:

```shell
$ pachctl list file merge@master
NAME                TYPE          SIZE
result_1.json       file          39 B
result_2.json       file          39 B
result_3.json       file          39 B
```
   132  ```