
.. _pipeline-concepts:

Pipeline Concepts
=================

Pachyderm Pipeline System (PPS) is the computational
component of the Pachyderm platform that enables you to
perform various transformations on your data. Pachyderm
pipelines have the following main concepts:

Pipeline
 A pipeline is a job spawner that waits for certain
 conditions to be met. Most commonly, this means
 watching one or more Pachyderm repositories for new
 commits of data. When new data arrives, the pipeline
 executes a user-defined piece of code that processes
 the data. Each of these executions is called a job.
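
 For a concrete idea of the moving parts, the following is a
 minimal pipeline specification sketch. The repository name
 ``images``, the container image, and the ``edges.py`` script
 are hypothetical placeholders:

 .. code-block:: json

    {
      "pipeline": {
        "name": "edges"
      },
      "transform": {
        "cmd": [ "python3", "/edges.py" ],
        "image": "example/edges:1.0"
      },
      "input": {
        "pfs": {
          "repo": "images",
          "glob": "/*"
        }
      }
    }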

 Pachyderm has the following special types of pipelines:

 Cron
  A cron input enables you to trigger the pipeline code
  at a specific interval. This type of pipeline is useful
  for such tasks as web scraping, querying a database, and
  other similar operations where you do not want to wait
  for new data, but instead trigger the pipeline periodically.
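
  A cron input is declared in the pipeline specification with a
  name and an interval. A sketch, in which the input name
  ``tick`` and the interval are chosen arbitrarily:

  .. code-block:: json

     {
       "input": {
         "cron": {
           "name": "tick",
           "spec": "@every 60s"
         }
       }
     }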

 Join
  A join pipeline enables you to join files that are stored in
  different Pachyderm repositories and match a particular
  file path pattern. Conceptually, joins are similar to
  a database inner join, although they match only on file
  paths, not on the actual file content.
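
  In the pipeline specification, each joined input carries a
  capture group in its glob pattern and a ``join_on`` key that
  references it; files whose captured values match are paired
  into one datum. A sketch, with hypothetical repositories
  ``readings`` and ``parameters``:

  .. code-block:: json

     {
       "input": {
         "join": [
           {
             "pfs": {
               "repo": "readings",
               "glob": "/(*).txt",
               "join_on": "$1"
             }
           },
           {
             "pfs": {
               "repo": "parameters",
               "glob": "/(*).txt",
               "join_on": "$1"
             }
           }
         ]
       }
     }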

 Service
  A service is a special type of pipeline that, instead of
  executing jobs and then waiting, runs continuously and
  serves data through an endpoint. For example, a service
  can serve an ML model or a REST API that can be queried.
  A service reads data from Pachyderm but does not have an
  output repo.
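
  A pipeline is turned into a service by adding a ``service``
  stanza with the ports to expose. A sketch, with port values
  chosen for illustration:

  .. code-block:: json

     {
       "service": {
         "internal_port": 8888,
         "external_port": 30888
       }
     }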

 Spout
  A spout is a special type of pipeline for ingesting data
  from a data stream. A spout can subscribe to a message
  stream, such as Kafka or Amazon SQS, and ingest data when
  it receives a message. A spout does not have an input repo.
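
  A pipeline is marked as a spout with a ``spout`` stanza, and
  the user code writes the ingested data to its output. A
  minimal sketch:

  .. code-block:: json

     {
       "spout": {
         "overwrite": false
       }
     }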

Job
 A job is an individual execution of a pipeline. A job
 can succeed or fail. Within a job, data and processing
 can be broken up into individual units of work called datums.

Datum
 A datum is the smallest indivisible unit of work within
 a job. Different datums can be processed in parallel
 within a job.
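
 The glob pattern in the pipeline's input determines how the
 input data is split into datums, and ``parallelism_spec``
 controls how many workers process those datums. A sketch,
 reusing the hypothetical ``images`` repository:

 .. code-block:: json

    {
      "input": {
        "pfs": {
          "repo": "images",
          "glob": "/*"
        }
      },
      "parallelism_spec": {
        "constant": 4
      }
    }

 With the ``/*`` glob, each top-level file or directory in
 ``images`` becomes its own datum.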

Read the sections below to learn more about these concepts:

.. toctree::
   :maxdepth: 2

   pipeline/index.rst
   job.md
   datum/index.rst