github.com/pachyderm/pachyderm@v1.13.4/doc/docs/master/concepts/pipeline-concepts/pipeline/index.md

# Pipeline

A pipeline is a Pachyderm primitive that reads data
from a specified source, such as a Pachyderm repo, transforms it
according to the pipeline configuration, and writes the result
to an output repo.
A pipeline subscribes to a branch in one or more input repositories.
Every time the branch has a new commit, the pipeline executes a job
that runs your code to completion and writes the results to a commit
in the output repository. Every pipeline automatically creates
an output repository with the same name as the pipeline. For example,
a pipeline named `model` writes all results to the
`model` output repo.

In Pachyderm, a pipeline is an individual execution step. You can
chain multiple pipelines together to create a directed acyclic
graph (DAG).

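As a minimal sketch of chaining, a downstream pipeline can name an upstream pipeline's output repo as its input. The spec below assumes that an upstream pipeline called `model` already exists. The pipeline name `report`, the image `report-image`, and the script path are hypothetical placeholders.

```json
{
  "pipeline": {
    "name": "report"
  },
  "transform": {
    "image": "report-image",
    "cmd": ["python3", "/report.py"]
  },
  "input": {
    "pfs": {
      "repo": "model",
      "glob": "/"
    }
  }
}
```

With a spec like this, a new commit on the subscribed branch of `model` triggers a `report` job, so the two pipelines form a two-step DAG.
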
Pachyderm has the following special types of pipelines:

**Cron**
:   A cron input enables you to trigger the pipeline code at
    a specific interval. This type of pipeline is useful for
    tasks such as web scraping, querying a database, and other
    similar operations where you do not want to wait for new
    data, but instead trigger the pipeline periodically.

**Service**
:   A service is a special type of pipeline that, instead of
    executing jobs and then waiting, runs continuously and serves
    data through an endpoint. For example, you can serve
    an ML model or a REST API that can be queried. A service
    reads data from Pachyderm but does not have an output repo.

**Spout**
:   A spout is a special type of pipeline for ingesting data from
    a data stream. A spout can subscribe to a message stream, such
    as Kafka or Amazon SQS, and ingest data when it receives a
    message. A spout does not have an input repo.

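The following is a rough sketch of a cron-triggered pipeline, assuming the `cron` input accepts a `name` and a `spec` (interval) field as described in the pipeline specification reference. The pipeline name `scraper`, the image `scraper-image`, the script path, and the 60-second interval are hypothetical placeholders.

```json
{
  "pipeline": {
    "name": "scraper"
  },
  "transform": {
    "image": "scraper-image",
    "cmd": ["python3", "/scrape.py"]
  },
  "input": {
    "cron": {
      "name": "tick",
      "spec": "@every 60s"
    }
  }
}
```

Service and spout pipelines are configured through their own sections of the pipeline specification. Check the reference for the exact fields before relying on them.
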
A minimal pipeline specification must include the following parameters:

- `name`: The name of your data pipeline. Set a meaningful name for
  your pipeline, such as the name of the transformation that the
  pipeline performs. For example, `split` or `edges`. Pachyderm
  automatically creates an output repository with the same name.
  A pipeline name must be an alphanumeric string that is less than
  63 characters long and can include dashes and underscores.
  No other special characters are allowed.

- `input`: The location of the data that you want to process, such as a
  Pachyderm repository. You can specify multiple input
  repositories and set up the data to be combined in various ways.
  For more information, see [Cross and Union](../datum/cross-union.md).

  One very important property that is defined in the `input` field
  is the `glob` pattern, which specifies how Pachyderm breaks the data into
  individual processing units, called datums. For more information, see
  [Datum](../datum/index.md).

- `transform`: Specifies the code that you want to run against your
  data. The `transform` section must include an `image` field that
  defines the Docker image that you want to
  run, as well as a `cmd` field for the specific code within the
  container that you want to execute, such as a Python script.

!!! example

    ```json
    {
      "pipeline": {
        "name": "wordcount"
      },
      "transform": {
        "image": "wordcount-image",
        "cmd": ["python3", "/my_python_code.py"]
      },
      "input": {
        "pfs": {
          "repo": "data",
          "glob": "/*"
        }
      }
    }
    ```

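The `wordcount` example reads from a single repo. To illustrate the multiple-input case mentioned under `input`, here is a hedged sketch that combines two repos with a `cross` input, following the form documented on the Cross and Union page. The pipeline name `wordcount-cross` and the second repo `stopwords` are hypothetical, and the transform is reused from the example only for illustration.

```json
{
  "pipeline": {
    "name": "wordcount-cross"
  },
  "transform": {
    "image": "wordcount-image",
    "cmd": ["python3", "/my_python_code.py"]
  },
  "input": {
    "cross": [
      {
        "pfs": {
          "repo": "data",
          "glob": "/*"
        }
      },
      {
        "pfs": {
          "repo": "stopwords",
          "glob": "/"
        }
      }
    ]
  }
}
```

In this sketch, the `/*` glob splits `data` into one datum per top-level file or directory, while the `/` glob treats all of `stopwords` as a single datum, so each job datum pairs one item from `data` with the entire `stopwords` repo.
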
!!! note "See Also:"
    [Pipeline Specification](../../../reference/pipeline_spec.md)