# Spout

A spout is a type of pipeline that ingests streaming data. Generally, you use spouts in situations when the interval between new data generation is large or sporadic, but the latency requirement to start processing is short. In such cases, a regular pipeline with a cron input that polls for new data might not be an optimal solution.

Examples of streaming data include a message queue, a database transaction log, event notifications, and others. In spouts, your code runs continuously and writes the results to the pipeline's output location, `pfs/out`. Every time you complete a `.tar` archive, Pachyderm creates a new commit and triggers the pipeline to process it.

One main difference from regular pipelines is that spouts ingest their data from outside sources. Therefore, they do not take an input.

Another important aspect is that in spouts, `pfs/out` is a *named pipe*, or *First In, First Out* (FIFO), and is not a directory as in standard pipelines. Unlike the traditional pipe that is familiar to most Linux users, a *named pipe* enables two system processes to access the pipe simultaneously, giving one process read-only and the other process write-only access. Therefore, the two processes can read from and write to the same pipe at the same time.

To create a spout pipeline, you need the following items:

* A source of streaming data
* A Docker container with your spout code that reads from the data source
* A spout pipeline specification file that uses the container

Your spout code performs the following actions:

1. Connects to the specified streaming data source.
1. Opens `/pfs/out` as a named pipe.
1. Reads the data from the streaming data source.
1. Packages the data into a `tar` stream.
1. Writes the `tar` stream into the `/pfs/out` pipe. In case of transient errors produced by closing a previous write to the pipe, retries the write operation.
1. Closes the `tar` stream and the connection to `/pfs/out`, which produces the commit.

A minimum spout specification must include the following parameters:

| Parameter | Description |
| ----------- | ----------- |
| `name` | The name of your data pipeline and the output repository. You can set an arbitrary name that is meaningful to the code you want to run. |
| `transform` | Specifies the code that you want to run against your data, such as a Python or Go script. Also, it specifies a Docker image that you want to use to run that script. |
| `overwrite` | (Optional) Specifies whether to overwrite the existing content of the file from previous commits or previous calls to the `put file` command within this commit. The default value is `false`. |

The following text is an example of a minimum specification:

!!! note
    The `env` property is an optional argument. You can define your
    data stream source from within the container in which you run your
    script. For simplicity, in this example, `env` specifies the source
    of the Kafka host.

```
{
  "pipeline": {
    "name": "my-spout"
  },
  "transform": {
    "cmd": [ "go", "run", "./main.go" ],
    "image": "myaccount/myimage:0.1",
    "env": {
      "HOST": "kafkahost",
      "TOPIC": "mytopic",
      "PORT": "9092"
    }
  },
  "spout": {
    "overwrite": false
  }
}
```