
.. _pipeline-concepts:

Pipeline Concepts
=================

Pachyderm Pipeline System (PPS) is the computational
component of the Pachyderm platform that enables you to
perform various transformations on your data. Pachyderm
pipelines have the following main concepts:

Pipeline
 A pipeline is a job spawner that waits for certain
 conditions to be met. Most commonly, this means
 watching one or more Pachyderm repositories for new
 commits of data. When new data arrives, the pipeline
 executes a user-defined piece of code that processes
 the data. Each of these executions is called a job.
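
 For a concrete idea of the moving parts, the following is a
 minimal pipeline specification sketch. The repository name
 ``images``, the container image, and the ``edges.py`` script
 are hypothetical placeholders:

 .. code-block:: json

    {
      "pipeline": {
        "name": "edges"
      },
      "transform": {
        "cmd": [ "python3", "/edges.py" ],
        "image": "example/edges:1.0"
      },
      "input": {
        "pfs": {
          "repo": "images",
          "glob": "/*"
        }
      }
    }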

 Pachyderm has the following special types of pipelines:

 Cron
  A cron input enables you to trigger the pipeline code
  at a specific interval. This type of pipeline is useful
  for such tasks as web scraping, querying a database, and
  other similar operations where you do not want to wait
  for new data, but instead trigger the pipeline periodically.
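
  A cron input is declared in the pipeline specification with a
  name and an interval. A sketch, in which the input name
  ``tick`` and the interval are chosen arbitrarily:

  .. code-block:: json

     {
       "input": {
         "cron": {
           "name": "tick",
           "spec": "@every 60s"
         }
       }
     }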

 Join
  A join pipeline enables you to join files that are stored in
  different Pachyderm repositories and match a particular
  file path pattern. Conceptually, joins are similar to
  a database inner join, although they match only on file
  paths, not on the actual file content.
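
  In the pipeline specification, each joined input carries a
  capture group in its glob pattern and a ``join_on`` key that
  references it; files whose captured values match are paired
  into one datum. A sketch, with hypothetical repositories
  ``readings`` and ``parameters``:

  .. code-block:: json

     {
       "input": {
         "join": [
           {
             "pfs": {
               "repo": "readings",
               "glob": "/(*).txt",
               "join_on": "$1"
             }
           },
           {
             "pfs": {
               "repo": "parameters",
               "glob": "/(*).txt",
               "join_on": "$1"
             }
           }
         ]
       }
     }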

 Service
  A service is a special type of pipeline that, instead of
  executing jobs and then waiting, runs continuously and
  serves data through an endpoint. For example, a service
  can serve an ML model or a REST API that can be queried.
  A service reads data from Pachyderm but does not have an
  output repo.
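
  A pipeline is turned into a service by adding a ``service``
  stanza with the ports to expose. A sketch, with port values
  chosen for illustration:

  .. code-block:: json

     {
       "service": {
         "internal_port": 8888,
         "external_port": 30888
       }
     }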

 Spout
  A spout is a special type of pipeline for ingesting data
  from a data stream. A spout can subscribe to a message
  stream, such as Kafka or Amazon SQS, and ingest data when
  it receives a message. A spout does not have an input repo.
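
  A pipeline is marked as a spout with a ``spout`` stanza, and
  the user code writes the ingested data to its output. A
  minimal sketch:

  .. code-block:: json

     {
       "spout": {
         "overwrite": false
       }
     }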

Job
 A job is an individual execution of a pipeline. A job
 can succeed or fail. Within a job, data and processing
 can be broken up into individual units of work called datums.

Datum
 A datum is the smallest indivisible unit of work within
 a job. Different datums can be processed in parallel
 within a job.
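
 The glob pattern in the pipeline's input determines how the
 input data is split into datums, and ``parallelism_spec``
 controls how many workers process those datums. A sketch,
 reusing the hypothetical ``images`` repository:

 .. code-block:: json

    {
      "input": {
        "pfs": {
          "repo": "images",
          "glob": "/*"
        }
      },
      "parallelism_spec": {
        "constant": 4
      }
    }

 With the ``/*`` glob, each top-level file or directory in
 ``images`` becomes its own datum.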

Read the sections below to learn more about these concepts:

.. toctree::
   :maxdepth: 2

   pipeline/index.rst
   job.md
   datum/index.rst