# Combine, Merge, and Join Data

!!! info
    Before you read this section, make sure that you understand the concepts
    described in [Distributed Processing](distributed_computing.md).

In some of your projects, you might need to match datums from
multiple data repositories to process, join, or aggregate data. For
example, you might need to process together multiple records that
correspond to a certain user, experiment, or device.

In these cases, you can create the following two pipelines:

1. A pipeline that groups all of the records for a specific
   key and index.

2. A second pipeline that takes that grouped output and performs
   the merging, joining, or other processing for the group.

You can use these two data-combining pipelines for
merging or grouped processing of data from various experiments,
devices, and so on. You can also apply the same pattern to
perform distributed joins of tabular data or data from database
tables. For example, you can join user email records
with user IP records on the key and index of a user ID.

You can parallelize each of the stages across workers to
scale with the size of your data and the number of data
sources that you want to merge.

!!! tip
    If your data is not split into separate files for
    each record, you can split it automatically as described in
    [Splitting Data for Distributed Processing](splitting-data/splitting.md).

## Group Matching Records

The first pipeline that you create groups the records that
need to be processed together.

In this example, you have two repositories `A` and `B`
with JSON records.
These repositories might correspond to two experiments, two geographic
regions, two different devices that generate data, or other sources.

The following diagram displays the first pipeline:

The repository `A` has the following structure:

```shell
$ pachctl list file A@master
NAME     TYPE  SIZE
1.json   file  39 B
2.json   file  39 B
3.json   file  39 B
```

The repository `B` has the following structure:

```shell
$ pachctl list file B@master
NAME     TYPE  SIZE
1.json   file  39 B
2.json   file  39 B
3.json   file  39 B
```

If you want to process `A/1.json` with `B/1.json` to merge
their contents or otherwise process them together, you need to
group each set of JSON records into respective datums that
the pipeline that you create in
[Process Grouped Records](#process-grouped-records)
can process together.

The grouping pipeline takes a union of `A` and `B` as inputs,
each with glob pattern `/*`. While the pipeline processes a JSON file,
the data is copied to a folder in the output that corresponds to the
key and index for that record. In this example, it is just the
number in the file name. The pipeline also renames the files to
unique names that correspond to the source:

```shell
/1
  A.json
  B.json
/2
  A.json
  B.json
/3
  A.json
  B.json
```

When you group your data, set the following parameters in the pipeline
specification:

- In the `pfs` section, set `"empty_files": true` to avoid
  unnecessary downloads of data.

- Use symlinks to avoid unnecessary uploads of data and unnecessary data
  duplication.

## Process Grouped Records

After you group the records by using the grouping pipeline, you
can run a merging pipeline on the `group` repository with a glob
pattern of `/*`. With this glob pattern,
the pipeline can process each grouping of records in parallel.
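
Before moving on to the second pipeline, it may help to see how the grouping transform that fills the `group` repository could look. The following is a minimal sketch, not the code that Pachyderm ships: the function name `group_records` is hypothetical, and it assumes that the inputs are mounted at `/pfs/A` and `/pfs/B` and that output is written to `/pfs/out`. It only looks at file names, which is consistent with `"empty_files": true`, and it writes symlinks instead of copies.

```python
import os
import tempfile
from pathlib import Path


def group_records(pfs_root: Path, out_root: Path, repos=("A", "B")) -> None:
    """Link every record into a folder named after its key.

    Hypothetical helper: the key is assumed to be the file name without
    its extension, and the link is named after the source repository.
    """
    for repo in repos:
        repo_dir = pfs_root / repo
        if not repo_dir.is_dir():
            # With a union input, a datum holds files from only one repo.
            continue
        for record in sorted(repo_dir.glob("*.json")):
            key = record.stem  # "1.json" -> "1"
            group_dir = out_root / key
            group_dir.mkdir(parents=True, exist_ok=True)
            # Symlink rather than copy, so the file content is never read.
            os.symlink(record.resolve(), group_dir / f"{repo}.json")


if __name__ == "__main__":
    # Local dry run: a temporary directory stands in for /pfs.
    with tempfile.TemporaryDirectory() as tmp:
        pfs = Path(tmp) / "pfs"
        for repo in ("A", "B"):
            (pfs / repo).mkdir(parents=True)
            for n in (1, 2, 3):
                (pfs / repo / f"{n}.json").write_text("{}")
        out = pfs / "out"
        group_records(pfs, out)
        for path in sorted(out.rglob("*.json")):
            print(path.relative_to(out).as_posix())
```

In a real pipeline, the call would be `group_records(Path("/pfs"), Path("/pfs/out"))`. How the output links are resolved when the datum is committed depends on the Pachyderm version, so verify that behavior against the pipeline specification for your release.
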

The following diagram displays the second pipeline:

The second pipeline performs merging, aggregation, or other
processing on the respective grouping of records. It can also
output each respective result to the root of the output directory:

```shell
$ pachctl list file merge@master
NAME            TYPE  SIZE
result_1.json   file  39 B
result_2.json   file  39 B
result_3.json   file  39 B
```
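
The merging step that produces a listing like the one above can be sketched in a few lines. This is an illustration under stated assumptions rather than the actual pipeline code: the helper names `merge_group` and `merge_all` are hypothetical, each record is assumed to be a flat JSON object, and "merging" is taken to mean combining the fields of all records in a group. The record fields in the dry run (`user_id`, `email`, `ip`) are made up for the example.

```python
import json
import tempfile
from pathlib import Path


def merge_group(group_dir: Path, out_root: Path) -> Path:
    """Merge all JSON records of one group into result_<key>.json."""
    merged = {}
    for record in sorted(group_dir.glob("*.json")):
        # Files are read in name order, so on duplicate field names
        # a later source (B.json) overwrites an earlier one (A.json).
        merged.update(json.loads(record.read_text()))
    out_root.mkdir(parents=True, exist_ok=True)
    result = out_root / f"result_{group_dir.name}.json"
    result.write_text(json.dumps(merged, sort_keys=True))
    return result


def merge_all(group_root: Path, out_root: Path) -> list:
    """Merge every group folder that is visible to this worker."""
    groups = sorted(d for d in group_root.iterdir() if d.is_dir())
    return [merge_group(d, out_root) for d in groups]


if __name__ == "__main__":
    # Local dry run: a temporary directory stands in for /pfs.
    with tempfile.TemporaryDirectory() as tmp:
        group = Path(tmp) / "pfs" / "group"
        for key in ("1", "2", "3"):
            (group / key).mkdir(parents=True)
            (group / key / "A.json").write_text(
                json.dumps({"user_id": key, "email": f"user{key}@example.com"}))
            (group / key / "B.json").write_text(
                json.dumps({"user_id": key, "ip": f"10.0.0.{key}"}))
        for result in merge_all(group, Path(tmp) / "pfs" / "out"):
            print(result.name, result.read_text())
```

With a glob pattern of `/*` on the `group` repository, each worker would see only the group folders that belong to its datum, so `merge_all(Path("/pfs/group"), Path("/pfs/out"))` would handle one group per datum while other workers process the remaining groups.
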