.. _pipeline-concepts:

Pipeline Concepts
=================

Pachyderm Pipeline System (PPS) is the computational
component of the Pachyderm platform that enables you to
perform various transformations on your data. Pachyderm
pipelines have the following main concepts:

Pipeline
   A pipeline is a job spawner that waits for certain
   conditions to be met. Most commonly, this means watching
   one or more Pachyderm repositories for new commits of
   data. When new data arrives, the pipeline executes a
   user-defined piece of code that processes the data. Each
   of these executions is called a job.

   Pachyderm has the following special types of pipelines:

   Cron
      A cron input enables you to trigger the pipeline code
      at a specific interval. This type of pipeline is useful
      for tasks such as web scraping, querying a database, and
      other operations where you do not want to wait for new
      data, but instead trigger the pipeline periodically.

   Join
      A join pipeline enables you to join files that are stored
      in different Pachyderm repositories and match a particular
      file path pattern. Conceptually, joins are similar to a
      database's inner join operations, although they match only
      on file paths, not on the actual file content.

   Service
      A service is a special type of pipeline that, instead of
      executing jobs and then waiting, runs continuously and
      serves data through an endpoint. For example, a service
      can expose an ML model or a REST API that can be queried.
      A service reads data from Pachyderm but does not have an
      output repo.

   Spout
      A spout is a special type of pipeline for ingesting data
      from a data stream. A spout can subscribe to a message
      stream, such as Kafka or Amazon SQS, and ingest data when
      it receives a message.
      A spout does not have an input repo.

Job
   A job is an individual execution of a pipeline. A job
   can succeed or fail. Within a job, data and processing
   can be broken up into individual units of work called
   datums.

Datum
   A datum is the smallest indivisible unit of work within
   a job. Different datums can be processed in parallel
   within a job.

Read the sections below to learn more about these concepts:

.. toctree::
   :maxdepth: 2

   pipeline/index.rst
   job.md
   datum/index.rst
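As a concrete illustration of the cron input described above, a
minimal pipeline specification might look like the following
sketch. The pipeline name, interval, image, and command are
hypothetical placeholders, not part of this document:

.. code-block:: json

   {
     "pipeline": {
       "name": "periodic-scrape"
     },
     "input": {
       "cron": {
         "name": "tick",
         "spec": "@every 60s"
       }
     },
     "transform": {
       "cmd": ["python3", "/scrape.py"],
       "image": "my-registry/scraper:1.0"
     }
   }

The ``spec`` field accepts a cron expression or an interval
shorthand; each tick triggers a new job regardless of whether new
data has arrived.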
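Similarly, a join input matches files across repositories by
capture groups in their glob patterns. The sketch below is an
assumed example: the repository names, globs, image, and command
are hypothetical. Both inputs capture a key with ``(*)`` and join
on it via ``join_on``:

.. code-block:: json

   {
     "pipeline": {
       "name": "join-readings"
     },
     "input": {
       "join": [
         {
           "pfs": {
             "repo": "readings",
             "glob": "/*/(*).txt",
             "join_on": "$1"
           }
         },
         {
           "pfs": {
             "repo": "parameters",
             "glob": "/(*).txt",
             "join_on": "$1"
           }
         }
       ]
     },
     "transform": {
       "cmd": ["python3", "/process.py"],
       "image": "my-registry/processor:1.0"
     }
   }

Files whose captured keys are equal form one datum together,
mirroring an inner join: files with no match on the other side
produce no datum.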