# Create a Machine Learning Workflow

Because Pachyderm is a language- and framework-agnostic
platform, and because it easily distributes analysis over
large data sets, data scientists can use any tooling for
creating machine learning workflows. Even if that tooling
is not familiar to the rest of an engineering organization,
data scientists can autonomously develop and deploy scalable
solutions by using containers. Moreover, Pachyderm's
pipeline logic paired with data versioning makes any results
reproducible for debugging purposes or during the development of
improvements to a model.

To make the most of Pachyderm's built-in functionality, Pachyderm
recommends that you combine model training processes, persisted models,
and model utilization processes, such as making inferences or
generating results, into a single Pachyderm pipeline Directed Acyclic Graph
(DAG).

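As a rough illustration of such a DAG, the following Python sketch builds two pipeline specifications and prints them as JSON. The repository names (`training`, `attributes`), the pipeline names, the container images, and the script paths are hypothetical placeholders. Only the general layout (`pipeline`, `transform`, and `input` with `pfs` and `cross`) follows the Pachyderm pipeline specification.

```python
import json


def pfs_input(repo, glob):
    """Return a PFS input that reads `repo`, split into datums by `glob`."""
    return {"pfs": {"repo": repo, "glob": glob}}


def build_specs():
    """Build a two-step DAG: training data -> model -> inference."""
    # Training step: reads the hypothetical `training` repo and writes a
    # persisted model to /pfs/out, which becomes the `model` output repo.
    model = {
        "pipeline": {"name": "model"},
        "transform": {
            "image": "example/train:latest",  # hypothetical image
            "cmd": ["python3", "/code/train.py", "/pfs/training", "/pfs/out"],
        },
        "input": pfs_input("training", "/"),
    }
    # Inference step: pairs the whole `model` repo with each top-level
    # file in the hypothetical `attributes` repo.
    inference = {
        "pipeline": {"name": "inference"},
        "transform": {
            "image": "example/infer:latest",  # hypothetical image
            "cmd": [
                "python3", "/code/infer.py",
                "/pfs/model", "/pfs/attributes", "/pfs/out",
            ],
        },
        "input": {
            "cross": [pfs_input("attributes", "/*"), pfs_input("model", "/")]
        },
    }
    return [model, inference]


if __name__ == "__main__":
    for spec in build_specs():
        print(json.dumps(spec, indent=2))
```

Each printed document could be saved to a file and submitted with `pachctl create pipeline -f <file>`. Because the inference step takes the `model` repository as an input, a new model commit would trigger the inference step again.
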
Such a pipeline enables you to achieve the following goals:

- Keep a rigorous historical record of which models were used
  on what data to produce which results.
- Automatically update online ML models when training data or
  parameterization changes.
- Easily revert to other versions of an ML model when a new model
  does not produce an expected result or when *bad data* is
  introduced into a training data set.

The following diagram demonstrates an ML pipeline:

![Example of a machine learning workflow](../assets/images/d_ml_workflow.svg)

You can update the training dataset at any time
to automatically train a new persisted model. Also, you can use
any language or framework, including Apache Spark™, TensorFlow™,
scikit-learn™, or others, and output the persisted model in any format,
such as pickle, XML, or POJO. Regardless of the framework,
Pachyderm versions the model so that you can track the data that
was used to train each model.

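A training step of this kind might look like the following minimal Python sketch. It assumes a hypothetical input layout (`*.txt` files with one number per line) and a toy "model" that is only the mean of the training values. The file name `model.pkl` and the default `/pfs/training` input path are illustrative choices, while `/pfs/out` is the directory where a Pachyderm pipeline writes its output.

```python
import pickle
import sys
from pathlib import Path


def train(input_dir, output_dir):
    """Fit a toy model and persist it as a pickle in `output_dir`."""
    values = []
    # Assumed input format: *.txt files holding one number per line.
    for path in sorted(Path(input_dir).glob("*.txt")):
        values += [float(tok) for tok in path.read_text().split()]
    if not values:
        raise ValueError(f"no training values found in {input_dir}")
    # The "model" is only the mean of the training values (illustrative).
    model = {"mean": sum(values) / len(values), "n": len(values)}
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "model.pkl", "wb") as f:
        pickle.dump(model, f)
    return model


if __name__ == "__main__":
    # Inside a pipeline, inputs are mounted under /pfs/<repo> and
    # results are written to /pfs/out.
    if len(sys.argv) >= 3:
        in_dir, out_dir = sys.argv[1], sys.argv[2]
    else:
        in_dir, out_dir = "/pfs/training", "/pfs/out"
    train(in_dir, out_dir)
```
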
Pachyderm processes new data coming into the input repository with the
updated model. Also, you can recompute old predictions with the updated model,
or test new models on previously input and versioned data. As a result,
you do not need to manually update historical results or swap
ML models in production.

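The inference side can be sketched in the same spirit. The following Python function is a minimal illustration, assuming a hypothetical persisted model stored as `model.pkl` (a pickled dictionary with a `mean` key) and input files with one number per line. It writes one result file per input file, so every input is scored with whichever model version is mounted at run time.

```python
import pickle
from pathlib import Path


def infer(model_dir, input_dir, output_dir):
    """Score every input file with the persisted model."""
    # Only load pickles from a trusted source, such as your own pipeline.
    with open(Path(model_dir) / "model.pkl", "rb") as f:
        model = pickle.load(f)
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(Path(input_dir).glob("*.txt")):
        values = [float(tok) for tok in path.read_text().split()]
        # Toy "prediction": deviation of each value from the model mean.
        lines = [f"{v - model['mean']:.3f}" for v in values]
        (out / path.name).write_text("\n".join(lines) + "\n")
        written.append(path.name)
    return written
```

In a pipeline, the three arguments would typically be the mounted model repository, the mounted input repository, and `/pfs/out`.
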
For examples of ML workflows in Pachyderm, see
[Machine Learning Examples](../examples/examples.md#machine-learning).