# Create a Machine Learning Workflow

Because Pachyderm is a language- and framework-agnostic
platform, and because it easily distributes analysis over
large data sets, data scientists can use any tooling to
create machine learning workflows. Even if that tooling
is not familiar to the rest of an engineering organization,
data scientists can autonomously develop and deploy scalable
solutions by using containers. Moreover, Pachyderm's
pipeline logic paired with data versioning makes any results
reproducible, whether for debugging or while developing
improvements to a model.

To get the most out of Pachyderm's built-in functionality, Pachyderm
recommends that you combine model training processes, persisted models,
and model utilization processes, such as making inferences or
generating results, into a single Pachyderm pipeline Directed Acyclic Graph
(DAG).

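One way to picture such a DAG is as two pipeline specifications that share a model repository. The Python sketch below builds both specs as dictionaries and prints them as JSON. The repository names (`training`, `data`), pipeline names (`model`, `inference`), image name, and script paths are hypothetical placeholders. The field layout is meant to follow the Pachyderm 1.x pipeline specification, so check it against the reference for your version before using it.

```python
import json

# Hypothetical names: the repos, pipelines, image, and scripts below are
# placeholders for illustration, not part of any real deployment.
IMAGE = "example/ml-image:latest"

# Training pipeline: reads the whole `training` repo and writes a
# persisted model to its output repo, which is named after the pipeline.
model_pipeline = {
    "pipeline": {"name": "model"},
    "transform": {"image": IMAGE, "cmd": ["python3", "/code/train.py"]},
    "input": {"pfs": {"repo": "training", "glob": "/"}},
}

# Inference pipeline: crosses the persisted model with each data file,
# so either a new model commit or a new data commit triggers processing.
inference_pipeline = {
    "pipeline": {"name": "inference"},
    "transform": {"image": IMAGE, "cmd": ["python3", "/code/infer.py"]},
    "input": {
        "cross": [
            {"pfs": {"repo": "model", "glob": "/"}},
            {"pfs": {"repo": "data", "glob": "/*"}},
        ]
    },
}


def input_repos(spec):
    """Return the names of the repos that a pipeline spec reads from."""
    node = spec["input"]
    branches = node.get("cross", [node])
    return [branch["pfs"]["repo"] for branch in branches]


# Edges of the DAG as (upstream repo, downstream pipeline) pairs.
dag_edges = [
    (repo, spec["pipeline"]["name"])
    for spec in (model_pipeline, inference_pipeline)
    for repo in input_repos(spec)
]

if __name__ == "__main__":
    for spec in (model_pipeline, inference_pipeline):
        print(json.dumps(spec, indent=2))
    print(dag_edges)
```
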
Such a pipeline enables you to achieve the following goals:

- Keep a rigorous historical record of which models were used
  on what data to produce which results.
- Automatically update online ML models when training data or
  parameterization changes.
- Easily revert to other versions of an ML model when a new model
  does not produce an expected result or when *bad data* is
  introduced into a training data set.

The following diagram demonstrates an ML pipeline:

![Example of a machine learning workflow](../assets/images/d_ml_workflow.svg)

You can update the training dataset at any time
to automatically train a new persisted model. Also, you can use
any language or framework, such as Apache Spark™, TensorFlow™,
or scikit-learn™, and output a persisted model in any format,
such as pickle, XML, or POJO. Regardless of the framework,
Pachyderm versions the model so that you can track the data that
was used to train each model.

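As a rough illustration of the training step, the following sketch shows what a containerized training script might look like. It assumes a hypothetical input repository named `training` whose files each hold a JSON list of numbers, and it fits a toy "model" (the mean of all values) rather than using a real ML framework. It relies on Pachyderm mounting the input repository under `/pfs/<repo>` and collecting anything written to `/pfs/out` as the pipeline's versioned output. The file name `model.pkl` is also an assumption.

```python
import json
import os
import pickle


def train(input_dir="/pfs/training", output_dir="/pfs/out"):
    """Fit a toy model (the mean of all values) and persist it with pickle."""
    values = []
    for name in sorted(os.listdir(input_dir)):
        with open(os.path.join(input_dir, name)) as f:
            # Assumption: each input file contains a JSON list of numbers.
            values.extend(json.load(f))
    if not values:
        raise ValueError("no training data found in " + input_dir)
    model = {"mean": sum(values) / len(values)}
    # Whatever is written to the output directory becomes a new commit in
    # the pipeline's output repo, tied to the input commit that produced it.
    with open(os.path.join(output_dir, "model.pkl"), "wb") as f:
        pickle.dump(model, f)
    return model


# Run only where the input repo is actually mounted.
if __name__ == "__main__" and os.path.isdir("/pfs/training"):
    train()
```
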
Pachyderm processes new data coming into the input repository with the
updated model. Also, you can recompute old predictions with the updated model,
or test new models on previously input and versioned data. This feature
means that you do not have to manually update historical results or swap
ML models in production.

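The inference step can be sketched in the same spirit. The script below assumes two hypothetical inputs, a `model` repository that contains a pickled dictionary in `model.pkl` and a `data` repository of text files with one number per line, mounted at `/pfs/model` and `/pfs/data`. The "prediction" is a toy calculation (each value's deviation from the stored mean) that stands in for a real model call. Because the script reads whichever model version is mounted, rerunning it against a new model commit recomputes the results without manual changes.

```python
import os
import pickle


def infer(model_dir="/pfs/model", data_dir="/pfs/data", output_dir="/pfs/out"):
    """Apply the persisted model to every data file and write one result file each."""
    with open(os.path.join(model_dir, "model.pkl"), "rb") as f:
        model = pickle.load(f)
    written = []
    for name in sorted(os.listdir(data_dir)):
        with open(os.path.join(data_dir, name)) as f:
            values = [float(line) for line in f if line.strip()]
        out_path = os.path.join(output_dir, name)
        with open(out_path, "w") as out:
            for value in values:
                # Toy prediction: deviation of the value from the model's mean.
                out.write(f"{value - model['mean']}\n")
        written.append(out_path)
    return written


# Run only where the input repos are actually mounted.
if __name__ == "__main__" and os.path.isdir("/pfs/model"):
    infer()
```
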
For examples of ML workflows in Pachyderm, see
[Machine Learning Examples](../examples/examples.md#machine-learning).