# Provenance

Data versioning enables Pachyderm users to go back in time and see the state
of a dataset or repository at a particular moment. Data provenance
(from the French *provenir*, which means *to come from*),
also known as data lineage, tracks the dependencies and relationships
between datasets. Provenance answers not only the question of
where the data comes from, but also how the data was transformed along
the way. Data scientists use provenance in root cause analysis to improve
their code, workflows, and understanding of the data and its effect
on final results. Data scientists need
to have confidence in the information with which they operate. They need
to be able to reproduce the results and sometimes go through the whole
data transformation process from scratch multiple times, which makes data
provenance one of the most critical aspects of data analysis. If your
computations result in unexpected numbers, the first place to look
is the historical data that gives insights into possible flaws in the
transformation chain or the data itself.

For example, when a bank makes a decision about a mortgage
application, many factors are taken into consideration, including
credit history, annual income, and loan size. This data goes through multiple
automated steps of analysis with numerous dependencies and decisions made
along the way. If the final decision does not satisfy the applicant,
the historical data is the first place to look for proof of authenticity,
as well as for possible prejudice or model bias against the applicant.
Data provenance creates a complete audit trail that enables data scientists
to track the data from its origin to the final decision and make
appropriate changes that address issues. With the adoption of the General Data
Protection Regulation (GDPR) compliance requirements, monitoring data lineage
is becoming a necessity for many organizations that work with sensitive data.

Pachyderm implements provenance for both commits and repositories.
You can track revisions of the data and
understand the connection between the data stored in one repository
and the results in another repository.

Collaboration takes data provenance even further. Provenance enables teams
of data scientists across the globe to build on each other's work and to share,
transform, and update datasets while automatically maintaining a
complete audit trail so that all results are reproducible.

The following diagram demonstrates how provenance works:

![Provenance example](../../assets/images/provenance.svg)

In the diagram above, you can see two input repositories called `parameters`
and `training-data`. The `training-data` repository continuously collects
data from an outside source. The training model pipeline combines the
data from these two repositories, trains many models, and runs tests to
select the best one.

Provenance helps you to understand how and why the best model was
selected and enables you to track the origin of the best model.
In the diagram above, the best model is represented with a purple
circle. By using provenance, you can find that the best model was
created from commit **2** in the `training-data` repository
and commit **1** in the `parameters` repository.

## Tracking Provenance in Pachyderm

Pachyderm provides the `pachctl inspect commit` command that enables you to
track the provenance of your commits and learn where the data in the
repository originates.

!!! example
    ```shell
    $ pachctl inspect commit split@master
    Commit: split@f71e42704b734598a89c02026c8f7d13
    Original Branch: master
    Started: 4 minutes ago
    Finished: 3 minutes ago
    Size: 0B
    Provenance:  __spec__@8c6440f52a2d4aa3980163e25557b4a1 (split)  raw_data@ccf82debb4b94ca3bfe165aca8d517c3 (master)
    ```

In the example above, you can see that the latest commit in the `master`
branch of the `split` repository tracks back to the
`ccf82debb4b94ca3bfe165aca8d517c3` commit in the `master` branch of the
`raw_data` repository.

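If you want to use the upstream commits in a script, you can split the `Provenance:` line into one row per commit. The following is a minimal sketch, assuming the single-line `repo@commit (branch)` layout shown above. It works on text copied from that output instead of calling `pachctl`, and `list_provenance` is a hypothetical helper name, not part of Pachyderm.

```shell
# Sample `Provenance:` line copied from the `pachctl inspect commit` output above.
# Assumption: in a real script, this text would come from that command.
provenance_line='Provenance:  __spec__@8c6440f52a2d4aa3980163e25557b4a1 (split)  raw_data@ccf82debb4b94ca3bfe165aca8d517c3 (master)'

# list_provenance (hypothetical helper) prints one "repo commit branch" row
# for every "repo@commit (branch)" entry found in the line it is given.
list_provenance() {
  printf '%s\n' "$1" |
    sed -e 's/^Provenance:[[:space:]]*//' |
    grep -oE '[^[:space:]@]+@[0-9a-f]+ \([^)]*\)' |
    sed -E 's/^([^@]+)@([0-9a-f]+) \(([^)]*)\)$/\1 \2 \3/'
}

upstream=$(list_provenance "$provenance_line")
printf '%s\n' "$upstream"
```

Each row can then be read with `while read -r repo commit branch`, for example to inspect every upstream commit in turn.
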
## Tracking Provenance Downstream

Pachyderm provides the `pachctl flush commit` command that enables you
to track provenance downstream. Tracking downstream means that instead of
tracking the origin of a commit, you can learn which output repositories
and commits a given input produced.

For example, suppose that you have the `ccf82debb4b94ca3bfe165aca8d517c3`
commit in the `raw_data` repository. If you run the `pachctl flush commit`
command for this commit, you can see the repositories and commits that
resulted from that data.

!!! example
    ```shell
    $ pachctl flush commit raw_data@ccf82debb4b94ca3bfe165aca8d517c3
    REPO        BRANCH COMMIT                           PARENT STARTED        DURATION       SIZE
    split       master f71e42704b734598a89c02026c8f7d13 <none> 52 minutes ago About a minute 0B
    split       stats  9b46d7abf9a74bf7bf66c77f2a0da4b1 <none> 52 minutes ago About a minute 15.39MiB
    pre_process master a99ab362dc944b108fb33544b2b24a8c <none> 48 minutes ago About a minute 0B
    ```
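
To pass the downstream commits to other commands, you can reduce the table to `repo@commit` references. The following is a minimal sketch, assuming the column layout shown above (`REPO`, `BRANCH`, and `COMMIT` as the first three whitespace-separated fields). It parses text copied from that output instead of calling `pachctl`.

```shell
# Sample table copied from the `pachctl flush commit` output above.
# Assumption: in a real script, this text would come from that command.
flush_output='REPO        BRANCH COMMIT                           PARENT STARTED        DURATION       SIZE
split       master f71e42704b734598a89c02026c8f7d13 <none> 52 minutes ago About a minute 0B
split       stats  9b46d7abf9a74bf7bf66c77f2a0da4b1 <none> 52 minutes ago About a minute 15.39MiB
pre_process master a99ab362dc944b108fb33544b2b24a8c <none> 48 minutes ago About a minute 0B'

# Skip the header row and print each downstream commit as "repo@commit (branch)".
downstream=$(printf '%s\n' "$flush_output" |
  awk 'NR > 1 { printf "%s@%s (%s)\n", $1, $3, $2 }')
printf '%s\n' "$downstream"
```

The resulting references use the same `repo@commit` form that `pachctl inspect commit` accepts, so you can follow a single input commit through every repository that it affected.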