# Provenance

Data versioning enables Pachyderm users to go back in time and see the state
of a dataset or repository at a particular moment. Data provenance
(from the French *provenir*, which means *to come from*),
also known as data lineage, tracks the dependencies and relationships
between datasets. Provenance answers not only the question of
where the data comes from, but also how the data was transformed along
the way. Data scientists use provenance in root cause analysis to improve
their code, workflows, and understanding of the data and its implications
for final results. Data scientists need
to have confidence in the information with which they operate. They need
to be able to reproduce the results and sometimes go through the whole
data transformation process from scratch multiple times, which makes data
provenance one of the most critical aspects of data analysis. If your
computations result in unexpected numbers, the first place to look
is the historical data that gives insights into possible flaws in the
transformation chain or the data itself.

For example, when a bank makes a decision about a mortgage
application, many factors are taken into consideration, including the
credit history, annual income, and loan size. This data goes through multiple
automated steps of analysis with numerous dependencies and decisions made
along the way. If the final decision does not satisfy the applicant,
the historical data is the first place to look for proof of authenticity,
as well as for possible prejudice or model bias against the applicant.
Data provenance creates a complete audit trail that enables data scientists
to track the data from its origin to the final decision and make
appropriate changes that address issues.
With the adoption of the General Data
Protection Regulation (GDPR) compliance requirements, monitoring data lineage
is becoming a necessity for many organizations that work with sensitive data.

Pachyderm implements provenance for both commits and repositories.
You can track revisions of the data and
understand the connection between the data stored in one repository
and the results in another repository.

Collaboration takes data provenance even further. Provenance enables teams
of data scientists across the globe to build on each other's work, share,
transform, and update datasets while automatically maintaining a
complete audit trail so that all results are reproducible.

The following diagram demonstrates how provenance works:



In the diagram above, you can see two input repositories called `parameters`
and `training-data`. The `training-data` repository continuously collects
data from an outside source. The training model pipeline combines the
data from these two repositories, trains many models, and runs tests to
select the best one.

Provenance helps you to understand how and why the best model was
selected and enables you to track the origin of the best model.
In the diagram above, the best model is represented with a purple
circle. By using provenance, you can find that the best model was
created from commit **2** in the `training-data` repository
and commit **1** in the `parameters` repository.

## Tracking Provenance in Pachyderm

Pachyderm provides the `pachctl inspect` command that enables you to track
the provenance of your commits and learn where the data in the repository
originates.

!!! example
    ```shell
    pachctl inspect commit split@master
    ```

    **System Response:**

    ```shell
    Commit: split@f71e42704b734598a89c02026c8f7d13
    Original Branch: master
    Started: 4 minutes ago
    Finished: 3 minutes ago
    Size: 0B
    Provenance:  __spec__@8c6440f52a2d4aa3980163e25557b4a1 (split)  raw_data@ccf82debb4b94ca3bfe165aca8d517c3 (master)
    ```

In the example above, you can see that the latest commit in the `master`
branch of the `split` repository tracks back to the `master` branch in the
`raw_data` repository.

## Tracking Provenance Downstream

Pachyderm provides the `pachctl flush commit` command that enables you
to track provenance downstream. Tracking downstream means that instead of
tracking the origin of a commit, you learn which output repositories
and commits a given input produced.

For example, you have the `ccf82debb4b94ca3bfe165aca8d517c3` commit in
the `raw_data` repository. If you run the `pachctl flush commit` command
for this commit, you can see in which repositories and commits that data
resulted.

!!! example
    ```shell
    pachctl flush commit raw_data@ccf82debb4b94ca3bfe165aca8d517c3
    ```

    **System Response:**

    ```shell
    REPO        BRANCH COMMIT                           PARENT STARTED        DURATION       SIZE
    split       master f71e42704b734598a89c02026c8f7d13 <none> 52 minutes ago About a minute 0B
    split       stats  9b46d7abf9a74bf7bf66c77f2a0da4b1 <none> 52 minutes ago About a minute 15.39MiB
    pre_process master a99ab362dc944b108fb33544b2b24a8c <none> 48 minutes ago About a minute 0B
    ```
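
The `Provenance` line that `pachctl inspect commit` prints earlier on this page packs several `repo@commit (branch)` entries into one line, which is awkward to use in scripts. The following is a minimal sketch, not part of Pachyderm, of one way to split that line into one entry per row. The helper name `list_provenance` is hypothetical, and the sketch assumes the output layout matches the System Response shown in the `pachctl inspect commit` example; other `pachctl` versions may format it differently.

```shell
# Hypothetical helper (not a pachctl feature): reads `pachctl inspect commit`
# output on stdin and prints one "repo commit branch" row per provenance entry.
# Assumes the line looks like:
#   Provenance: repo@commit (branch) repo@commit (branch) ...
list_provenance() {
  awk '/^Provenance:/ {
    # Field 1 is the "Provenance:" label; entries follow in pairs.
    for (i = 2; i < NF; i += 2) {
      split($i, parts, "@")        # parts[1] = repo, parts[2] = commit ID
      branch = $(i + 1)
      gsub(/[()]/, "", branch)     # strip the parentheses around the branch
      print parts[1], parts[2], branch
    }
  }'
}

# Demonstration using the sample output from this page instead of a live cluster.
list_provenance <<'EOF'
Commit: split@f71e42704b734598a89c02026c8f7d13
Original Branch: master
Provenance:  __spec__@8c6440f52a2d4aa3980163e25557b4a1 (split)  raw_data@ccf82debb4b94ca3bfe165aca8d517c3 (master)
EOF
```

Against a running cluster, the same function could be fed directly, for example `pachctl inspect commit split@master | list_provenance`. Each resulting `repo` and `commit` pair can then be joined as `repo@commit` and passed to `pachctl flush commit` to follow that input downstream.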