# Processing Time-Windowed Data

!!! info
    Before you read this section, make sure that you understand the concepts
    described in the following sections:

    - [Datum](../concepts/pipeline-concepts/datum/index.md)
    - [Distributed Computing](../concepts/advanced-concepts/distributed_computing.md)
    - [Working with Pipelines](../how-tos/developer-workflow/working-with-pipelines.md)

If you are analyzing data that is changing over time, you might
need to analyze historical data. For example, you might need to
examine *the last two weeks of data*, *January's data*, or some
other moving or static time window of data.

Pachyderm provides the following approaches to this task:

1. [Fixed time windows](#fixed-time-windows) - for rigid, fixed
time windows, such as months (Jan, Feb, and so on) or days
(01-01-17, 01-02-17, and so on).

2. [Moving time windows](#moving-time-windows) - for rolling
time windows of data, such as three-day windows or two-week
windows.

## Fixed Time Windows

[Datum](../concepts/pipeline-concepts/datum/index.md) is the basic
unit of data partitioning in Pachyderm. The glob pattern property
in the pipeline specification defines a datum. When you analyze data
within fixed time windows, such as the data that corresponds to
fixed calendar dates, Pachyderm recommends that you organize your
data repositories so that each of the time windows that you plan
to analyze corresponds to a separate file or directory in your
repository. Pachyderm then processes each time window as a separate
datum.

Organizing your repository as described above enables you to do the
following:

- Analyze each time window in parallel.
- Only re-process data within a time window when that data, or a
  corresponding data pipeline, changes.

For example, if you have monthly time windows of sales data stored
in JSON format that you need to analyze, you can create a `sales`
data repository with the following data:

```shell
sales
├── January
|   ├── 01-01-17.json
|   ├── 01-02-17.json
|   └── ...
├── February
|   ├── 02-01-17.json
|   ├── 02-02-17.json
|   └── ...
└── March
    ├── 03-01-17.json
    ├── 03-02-17.json
    └── ...
```
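
For reference, a minimal sketch of how such a layout might be created with
`pachctl`, assuming the daily JSON files exist in your local working
directory. The file names here are only illustrative:

```shell
# Create the input repository.
pachctl create repo sales

# Commit each day's sales file under the directory for its month.
pachctl put file sales@master:/January/01-01-17.json -f 01-01-17.json
pachctl put file sales@master:/January/01-02-17.json -f 01-02-17.json
pachctl put file sales@master:/February/02-01-17.json -f 02-01-17.json
```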

When you run a pipeline with `sales` as an input repository and a glob
pattern of `/*`, Pachyderm processes each month's worth of sales data
in parallel if workers are available. When you add new data into a
subset of the months or add data into a new month, for example, May,
Pachyderm processes only these updated datums.
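
For example, a pipeline specification for this monthly processing might
look roughly like the following sketch. The pipeline name, Docker image,
and analysis command are placeholders for illustration only:

```json
{
  "pipeline": {
    "name": "monthly_sales"
  },
  "input": {
    "pfs": {
      "repo": "sales",
      "glob": "/*"
    }
  },
  "transform": {
    "image": "my-sales-analysis:latest",
    "cmd": ["bash"],
    "stdin": [
      "/analyze_month.sh /pfs/sales/* /pfs/out/"
    ]
  }
}
```

With the `/*` glob pattern, each top-level directory (`/January`,
`/February`, and so on) becomes its own datum, so each month can be
processed by a separate worker.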

More generally, this structure enables you to create the following
types of pipelines:

- Pipelines that aggregate or otherwise process daily data on a
  monthly basis by using the `/*` glob pattern.
- Pipelines that only analyze a particular month's data by using a `/subdir/*`
  or `/subdir/` glob pattern. For example, `/January/*` or `/January/`.
- Pipelines that process data on a daily basis by using the `/*/*` glob
  pattern, as shown in the sketch after this list.
- Any combination of the above.
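
Switching between these modes only requires changing the `glob` field in
the pipeline's input. For example, to treat every daily file as its own
datum, the input section of the hypothetical specification above would
change to:

```json
{
  "input": {
    "pfs": {
      "repo": "sales",
      "glob": "/*/*"
    }
  }
}
```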

## Moving Time Windows

In some cases, you need to run analyses for moving or rolling time
windows that do not correspond to fixed calendar months or days.
For example, you might need to analyze the last three days of data,
the three days of data before that, and so on.
In other words, you need to run an analysis for every such rolling
window of time.

For rolling or moving time windows, there are a couple of recommended
patterns:

1. Bin your data in repository folders for each of the moving time windows.

2. Maintain a time-windowed set of data that corresponds to the latest of the
   moving time windows.

### Bin Data into Moving Time Windows

In this method of processing rolling time windows, you create the following
two-pipeline DAG to analyze time windows efficiently:

| Pipeline | Description |
| -------- | ----------- |
| Pipeline 1 | Reads in data, determines to which bins the data <br>corresponds, and writes the data into those bins. |
| Pipeline 2 | Reads in and analyzes the binned data. |

By splitting this analysis into two pipelines, you can benefit from
parallelism at the file level. In other words, *Pipeline 1* can be easily
parallelized for each file, and *Pipeline 2* can be parallelized per bin.
This structure enables easy pipeline scaling as the number of
files increases.

For example, suppose that you want to analyze three-day moving windows
of sales data. In the first repo, called `sales`, you commit data for
the first day of sales:

```shell
sales
└── 01-01-17.json
```

The first pipeline bins this data into a directory that corresponds to
the first rolling time window, from 01-01-17 to 01-03-17:

```shell
binned_sales
└── 01-01-17_to_01-03-17
    └── 01-01-17.json
```

When the next day's worth of sales is committed, that data lands
in the `sales` repository:

```shell
sales
├── 01-01-17.json
└── 01-02-17.json
```

Then, the first pipeline executes again to bin the `01-02-17` data into
the relevant bins. In this case, the data is placed in the previously
created bin named `01-01-17_to_01-03-17`. The data is also placed in a
new bin, `01-02-17_to_01-04-17`, which stores the data received starting
on `01-02-17`:

```shell
binned_sales
├── 01-01-17_to_01-03-17
|   ├── 01-01-17.json
|   └── 01-02-17.json
└── 01-02-17_to_01-04-17
    └── 01-02-17.json
```

As more and more daily data is added, your repository structure
starts to look as follows:

```shell
binned_sales
├── 01-01-17_to_01-03-17
|   ├── 01-01-17.json
|   ├── 01-02-17.json
|   └── 01-03-17.json
├── 01-02-17_to_01-04-17
|   ├── 01-02-17.json
|   ├── 01-03-17.json
|   └── 01-04-17.json
├── 01-03-17_to_01-05-17
|   ├── 01-03-17.json
|   ├── 01-04-17.json
|   └── 01-05-17.json
└── ...
```

The following diagram describes how data accumulates in the repository
over time:

![Data Accumulation](../assets/images/d_time_window.svg)
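
The first (binning) pipeline's transform can be as simple as a short
script. The following is a minimal sketch, assuming a `/*` glob pattern
on the `sales` input, GNU `date`, and daily files named `MM-DD-YY.json`;
it is an illustration, not the exact implementation:

```shell
# For each daily file in this datum, copy it into the three
# three-day bins that contain its date.
for f in /pfs/sales/*.json; do
    day=$(basename "$f" .json)                  # for example, 01-02-17
    iso="20${day:6:2}-${day:0:2}-${day:3:2}"    # convert to 2017-01-02
    for offset in 2 1 0; do
        start=$(date -d "$iso - $offset day" +%m-%d-%y)
        end=$(date -d "$iso + $((2 - offset)) day" +%m-%d-%y)
        # Skip bins that start before your first day of data if needed.
        mkdir -p "/pfs/out/${start}_to_${end}"
        cp "$f" "/pfs/out/${start}_to_${end}/"
    done
done
```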

Your second pipeline can then process these bins in parallel by using a
glob pattern of `/*`, or with any of the other glob patterns described
above. Both pipelines can be easily parallelized.

In the above directory structure, it might seem that data is
duplicated. However, under the hood, Pachyderm deduplicates all of these
files and maintains a space-efficient representation of your data.
The binning of the data is merely a structural re-arrangement to enable
you to process these types of moving time windows.

It might also seem as if Pachyderm performs unnecessary data transfers
over the network to bin files. However, Pachyderm ensures that these data
operations do not require transferring data over the network.

### Maintaining a Single Time-Windowed Data Set

The advantage of the binning pattern above is that all of the moving
time windows are available for processing. They can be compared,
aggregated, and combined in any way, and any results or
aggregations are kept in sync with updates to the bins. However, you
do need to create a process to maintain the binning directory structure.

There is another pattern for moving time windows that avoids the
binning of the above approach and maintains an up-to-date version of a
moving time-windowed data set. This approach
involves the creation of the following pipelines:

| Pipeline     | Description |
| ------------ | ----------- |
| Pipeline 1 | Reads in data, determines which files belong in your moving <br>time window, and writes the relevant files into an updated <br>version of the moving time-windowed data set. |
| Pipeline 2 | Reads in and analyzes the moving time-windowed data set. |

For example, suppose again that you want to analyze three-day moving
windows of sales data. The input data is stored in the `sales`
repository:

```shell
sales
├── 01-01-17.json
├── 01-02-17.json
├── 01-03-17.json
└── 01-04-17.json
```

When the January 4th file, `01-04-17.json`, is committed, the first
pipeline pulls out the last three days of data and arranges it in the
following structure:

```shell
last_three_days
├── 01-02-17.json
├── 01-03-17.json
└── 01-04-17.json
```

When the January 5th file, `01-05-17.json`, is committed into the
`sales` repository:

```shell
sales
├── 01-01-17.json
├── 01-02-17.json
├── 01-03-17.json
├── 01-04-17.json
└── 01-05-17.json
```

the first pipeline updates the moving window:

```shell
last_three_days
├── 01-03-17.json
├── 01-04-17.json
└── 01-05-17.json
```
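
A minimal sketch of what this first pipeline's transform might do,
assuming a glob pattern of `/` on the `sales` input (so the pipeline
sees the whole repository as one datum) and daily files named
`MM-DD-YY.json`; it is illustrative only:

```shell
# Sort the daily files by year, month, and day, keep the three most
# recent ones, and write them to the output repository.
latest=$(ls /pfs/sales/ | sort -t '-' -k3,3 -k1,1 -k2,2 | tail -n 3)
for f in $latest; do
    cp "/pfs/sales/$f" /pfs/out/
done
```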

The analysis that you need to run on the moving windowed dataset
in `last_three_days` can use the `/` or `/*` glob pattern, depending
on whether you need to process all of the time-windowed files together
or whether they can be processed in parallel.

!!! warning
    When you create this type of moving time-windowed data set,
    the concept of *now* or *today* is relative. You must define the time
    based on your use case, for example, by standardizing on `UTC`. Do not
    use functions such as `time.now()` to determine the current time,
    because the actual time when this pipeline runs might vary.