# Processing Time-Windowed Data

!!! info
    Before you read this section, make sure that you understand the concepts
    described in the following sections:

    - [Datum](../concepts/pipeline-concepts/datum/index.md)
    - [Distributed Computing](../concepts/advanced-concepts/distributed_computing.md)
    - [Individual Developer Workflow](individual-developer-workflow.md)

If you are analyzing data that changes over time, you might
need to analyze historical data. For example, you might need to
examine *the last two weeks of data*, *January's data*, or some
other moving or static time window of data.

Pachyderm provides the following approaches to this task:

1. [Fixed time windows](#fixed-time-windows) - for rigid, fixed
   time windows, such as months (Jan, Feb, and so on) or days (01-01-17,
   01-02-17, and so on).

2. [Moving time windows](#moving-time-windows) - for rolling time
   windows of data, such as three-day windows or two-week windows.

## Fixed Time Windows

A [datum](../concepts/pipeline-concepts/datum/index.md) is the basic
unit of data partitioning in Pachyderm. The glob pattern property
in the pipeline specification defines a datum. When you analyze data
within fixed time windows, such as the data that corresponds to
fixed calendar dates, Pachyderm recommends that you organize your
data repositories so that each of the time windows that you plan
to analyze corresponds to a separate file or directory in your
repository. Pachyderm then processes each time window as a separate
datum.

Organizing your repository as described above enables you to do the
following:

- Analyze each time window in parallel.
- Re-process data within a time window only when that data, or a
  corresponding data pipeline, changes.

For example, if you have monthly time windows of sales data stored
in JSON format that needs to be analyzed, you can create a `sales`
data repository with the following data:

```
sales
├── January
|   ├── 01-01-17.json
|   ├── 01-02-17.json
|   └── ...
├── February
|   ├── 02-01-17.json
|   ├── 02-02-17.json
|   └── ...
└── March
    ├── 03-01-17.json
    ├── 03-02-17.json
    └── ...
```

When you run a pipeline with `sales` as an input repository and a glob
pattern of `/*`, Pachyderm processes each month's worth of sales data
in parallel if workers are available. When you add new data into a
subset of the months or add data into a new month, for example, May,
Pachyderm processes only these updated datums.

More generally, this structure enables you to create the following
types of pipelines:

- Pipelines that aggregate or otherwise process daily data on a
  monthly basis by using the `/*` glob pattern.
- Pipelines that only analyze a particular month's data by using a `/subdir/*`
  or `/subdir/` glob pattern. For example, `/January/*` or `/January/`.
- Pipelines that process data daily by using the `/*/*` glob
  pattern.
- Any combination of the above.
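To make the monthly example above concrete, the following is a minimal
sketch of what such a pipeline specification might look like. The pipeline
name, image, and command are placeholders for your own analysis code; the
important part is that the `sales` input uses the `/*` glob pattern, so
Pachyderm treats each month's directory as a separate datum.

```json
{
  "pipeline": {
    "name": "monthly-sales-analysis"
  },
  "transform": {
    "image": "your-registry/sales-analysis:latest",
    "cmd": ["python3", "/analyze_month.py"]
  },
  "input": {
    "pfs": {
      "repo": "sales",
      "glob": "/*"
    }
  }
}
```

Changing the `glob` value to, for example, `/January/` or `/*/*` gives you
the other pipeline types in the list above without changing your analysis
code.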
## Moving Time Windows

In some cases, you need to run analyses for moving or rolling time
windows that do not correspond to particular calendar months or days.
For example, you might need to analyze the last three days of data,
the three days of data before that, and so on.
In other words, you need to run an analysis for every rolling length
of time.

For rolling or moving time windows, there are two recommended
patterns:

1. Bin your data in repository folders for each of the moving time windows.

2. Maintain a time-windowed set of data that corresponds to the latest of the
   moving time windows.

### Bin Data into Moving Time Windows

In this method of processing rolling time windows, you create the following
two-pipeline DAG to analyze time windows efficiently:

| Pipeline | Description |
| -------- | ----------- |
| Pipeline 1 | Reads in data, determines to which bins the data <br>corresponds, and writes the data into those bins. |
| Pipeline 2 | Reads in and analyzes the binned data. |

By splitting this analysis into two pipelines, you can benefit from
parallelism at the file level. In other words, *Pipeline 1* can be easily
parallelized for each file, and *Pipeline 2* can be parallelized per bin.
This structure enables easy pipeline scaling as the number of
files increases.

For example, suppose that you want to analyze three-day moving windows of
sales data. In the first repo, called `sales`, you commit data for the
first day of sales:

```shell
sales
└── 01-01-17.json
```

In the first pipeline, you specify to bin this data into a directory that
corresponds to the first rolling time window from 01-01-17 to 01-03-17:

```shell
binned_sales
└── 01-01-17_to_01-03-17
    └── 01-01-17.json
```

When the next day's worth of sales is committed, that data lands
in the `sales` repository:

```shell
sales
├── 01-01-17.json
└── 01-02-17.json
```

Then, the first pipeline executes again to bin the `01-02-17` data into
the relevant bins. In this case, the data is placed in the previously
created bin named `01-01-17_to_01-03-17`. However, the data also
goes into the bin that stores the data received starting
on `01-02-17`:

```shell
binned_sales
├── 01-01-17_to_01-03-17
|   ├── 01-01-17.json
|   └── 01-02-17.json
└── 01-02-17_to_01-04-17
    └── 01-02-17.json
```

As more and more daily data is added, your repository structure
starts to look as follows:

```shell
binned_sales
├── 01-01-17_to_01-03-17
|   ├── 01-01-17.json
|   ├── 01-02-17.json
|   └── 01-03-17.json
├── 01-02-17_to_01-04-17
|   ├── 01-02-17.json
|   ├── 01-03-17.json
|   └── 01-04-17.json
├── 01-03-17_to_01-05-17
|   ├── 01-03-17.json
|   ├── 01-04-17.json
|   └── 01-05-17.json
└── ...
```

As this structure shows, data accumulates in the repository over time, and
each daily file appears in every three-day bin that covers it.

Your second pipeline can then process these bins in parallel by using a
glob pattern of `/*`, or process them in any other way that suits your
analysis. Both pipelines can be easily parallelized.

In the above directory structure, it might seem that data is
duplicated. However, under the hood, Pachyderm deduplicates all of these
files and maintains a space-efficient representation of your data.
The binning of the data is merely a structural rearrangement to enable
you to process these types of moving time windows.

It might also seem as if Pachyderm performs unnecessary data transfers
over the network to bin files. However, Pachyderm ensures that these data
operations do not require transferring data over the network.
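To make the binning step concrete, the following is a minimal sketch of
what *Pipeline 1*'s user code might look like. It assumes the
`MM-DD-YY.json` file names from this example, three-day windows, and a
glob pattern of `/*` on the `sales` input so that each daily file is its
own datum; `/pfs/sales` and `/pfs/out` are where Pachyderm exposes a
pipeline's input and output inside the worker.

```python
import os
import shutil
from datetime import datetime, timedelta

WINDOW_DAYS = 3
INPUT_DIR = "/pfs/sales"   # input repo, mounted into the worker by Pachyderm
OUTPUT_DIR = "/pfs/out"    # files written here become the output commit
DATE_FMT = "%m-%d-%y"      # matches the MM-DD-YY.json names in this example


def bin_name(start):
    """Directory name for the three-day window that starts on `start`."""
    end = start + timedelta(days=WINDOW_DAYS - 1)
    return f"{start.strftime(DATE_FMT)}_to_{end.strftime(DATE_FMT)}"


for fname in os.listdir(INPUT_DIR):
    day = datetime.strptime(fname.replace(".json", ""), DATE_FMT)
    # A daily file belongs to every window that starts on its own day or on
    # one of the WINDOW_DAYS - 1 days before it. In practice, you might skip
    # windows that start before your first day of data.
    for offset in range(WINDOW_DAYS):
        start = day - timedelta(days=offset)
        dest = os.path.join(OUTPUT_DIR, bin_name(start))
        os.makedirs(dest, exist_ok=True)
        shutil.copy(os.path.join(INPUT_DIR, fname), dest)
```

Because each datum is a single daily file, this step parallelizes naturally
across files, and, as noted above, the per-bin copies are deduplicated by
Pachyderm rather than stored again.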
### Maintaining a Single Time-Windowed Data Set

The advantage of the binning pattern above is that all of the moving
time windows are available for processing. They can be compared,
aggregated, and combined in any way, and any results or
aggregations are kept in sync with updates to the bins. However, you
do need to create a process to maintain the binning directory structure.

There is another pattern for moving time windows that avoids the
binning of the above approach and maintains an up-to-date version of a
moving time-windowed data set. This approach
involves the creation of the following pipelines:

| Pipeline | Description |
| ------------ | ----------- |
| Pipeline 1 | Reads in data, determines which files belong in your moving <br>time window, and writes the relevant files into an updated <br>version of the moving time-windowed data set. |
| Pipeline 2 | Reads in and analyzes the moving time-windowed data set. |

For example, suppose that you want to analyze three-day moving windows of
sales data. The input data is stored in the `sales` repository:

```shell
sales
├── 01-01-17.json
├── 01-02-17.json
├── 01-03-17.json
└── 01-04-17.json
```

When the January 4th file, `01-04-17.json`, is committed, the first
pipeline pulls out the last three days of data and arranges it as
follows:

```shell
last_three_days
├── 01-02-17.json
├── 01-03-17.json
└── 01-04-17.json
```

When the January 5th file, `01-05-17.json`, is committed into the
`sales` repository:

```shell
sales
├── 01-01-17.json
├── 01-02-17.json
├── 01-03-17.json
├── 01-04-17.json
└── 01-05-17.json
```

the first pipeline updates the moving window:

```shell
last_three_days
├── 01-03-17.json
├── 01-04-17.json
└── 01-05-17.json
```

The analysis that you need to run on the moving time-windowed data set
in `last_three_days` can use the `/` or `/*` glob pattern, depending
on whether you need to process all of the time-windowed files together
or whether they can be processed in parallel.

!!! warning
    When you create this type of moving time-windowed data set,
    the concept of *now* or *today* is relative. You must define the time
    based on your use case, for example, by using `UTC`. Do not use
    functions such as `time.now()` to determine the current time, because
    the actual time when this pipeline runs might vary.
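As an illustration of that point, the following is a minimal sketch of what
the first pipeline's user code might look like if it derives the window
from the newest file name in the input rather than from the wall clock. It
assumes the `MM-DD-YY.json` names from this example and a glob pattern of
`/` so that the whole `sales` repository is a single datum; the output
repository takes the pipeline's name, for example `last_three_days`.

```python
import os
import shutil
from datetime import datetime

WINDOW_DAYS = 3
INPUT_DIR = "/pfs/sales"   # whole repo visible as one datum with the "/" glob
OUTPUT_DIR = "/pfs/out"    # becomes the pipeline's output repo
DATE_FMT = "%m-%d-%y"      # matches the MM-DD-YY.json names in this example


def file_date(fname):
    """Parse the date encoded in a daily file name such as 01-04-17.json."""
    return datetime.strptime(fname.replace(".json", ""), DATE_FMT)


# "Today" is defined by the newest file in the input, not by the wall clock.
daily_files = sorted(os.listdir(INPUT_DIR), key=file_date)
for fname in daily_files[-WINDOW_DAYS:]:
    shutil.copy(os.path.join(INPUT_DIR, fname), OUTPUT_DIR)
```

Because the window is derived from the data itself, re-running the pipeline
on the same commit always produces the same three files, no matter when it
runs.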