# Spout

A spout is a type of pipeline that ingests
streaming data. Generally, you use spouts in situations
where the interval between new data arrivals is large or
sporadic, but the latency requirement to start
processing is short. In such cases, a regular pipeline
with a cron input that polls for new data
might not be an optimal solution.

Examples of streaming data include message queues,
database transaction logs, event notifications,
and others. In spouts, your code runs continuously and writes its
results to the pipeline's output location, `pfs/out`.
Every time your code writes a complete `.tar` archive,
Pachyderm creates a new commit and triggers downstream
pipelines to process it.

One main difference from regular pipelines is that
spouts ingest their data from outside sources. Therefore, they
do not take an input.

Another important aspect is that in spouts, `pfs/out` is
a *named pipe*, or *First In, First Out* (FIFO) file, and not
a directory as in standard pipelines. Unlike
the traditional anonymous pipe that is familiar to most Linux users,
a *named pipe* enables two system processes to access
the pipe simultaneously, giving one process read-only and the other
process write-only access. Therefore, one process can write
to the pipe while the other reads from it.
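
To see these semantics in action outside of Pachyderm, you can experiment with a named pipe directly. The following Go sketch is purely illustrative and is not spout code: it creates its own pipe with `syscall.Mkfifo` (available on Linux and macOS), whereas in a spout Pachyderm creates `pfs/out` for you and holds the read side.

```
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"os"
	"path/filepath"
	"syscall"
)

func main() {
	// Create a scratch directory to hold the demo pipe.
	dir, err := ioutil.TempDir("", "fifo-demo")
	if err != nil {
		log.Fatal(err)
	}
	defer os.RemoveAll(dir)

	pipe := filepath.Join(dir, "out")
	// Create a named pipe, analogous to the pfs/out that Pachyderm mounts.
	if err := syscall.Mkfifo(pipe, 0600); err != nil {
		log.Fatal(err)
	}

	// Writer: opens the pipe write-only, as spout code does.
	go func() {
		w, err := os.OpenFile(pipe, os.O_WRONLY, os.ModeNamedPipe)
		if err != nil {
			log.Fatal(err)
		}
		defer w.Close()
		fmt.Fprintln(w, "hello through the pipe")
	}()

	// Reader: opens the pipe read-only, as Pachyderm does, and blocks
	// until the writer side opens and then closes.
	data, err := ioutil.ReadFile(pipe)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("read: %s", data)
}
```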

To create a spout pipeline, you need the following items:

* A source of streaming data
* A Docker container with your spout code that reads from the data source
* A spout pipeline specification file that uses the container

Your spout code performs the following actions, as shown in the
sketch after this list:

1. Connects to the specified streaming data source.
1. Opens `/pfs/out` as a named pipe.
1. Reads the data from the streaming data source.
1. Packages the data into a `tar` stream.
1. Writes the `tar` stream into the `/pfs/out` pipe, retrying the write
operation on transient errors caused by the close of a previous write
to the pipe.
1. Closes the `tar` stream and the connection to `/pfs/out`, which produces
the commit.
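
The following Go sketch outlines these steps. It is a minimal illustration under stated assumptions, not a production spout: the `message` type and `readFromSource` are hypothetical stand-ins for a real streaming client, such as a Kafka consumer. The only Pachyderm-specific part is the open-write-close cycle on `/pfs/out`.

```
package main

import (
	"archive/tar"
	"log"
	"os"
)

// message is a placeholder record type for the hypothetical source.
type message struct {
	Name string
	Data []byte
}

// readFromSource stands in for a real streaming client; here it just
// emits a single example record and closes the channel.
func readFromSource() <-chan message {
	ch := make(chan message)
	go func() {
		defer close(ch)
		ch <- message{Name: "example.txt", Data: []byte("hello\n")}
	}()
	return ch
}

// writeRecord packages one record into a tar stream on /pfs/out.
// Each open-write-close cycle on the pipe produces one commit.
func writeRecord(m message) error {
	// /pfs/out is a named pipe; opening it write-only blocks until
	// Pachyderm opens the read side.
	out, err := os.OpenFile("/pfs/out", os.O_WRONLY, 0644)
	if err != nil {
		return err
	}
	defer out.Close()

	tw := tar.NewWriter(out)
	defer tw.Close()

	hdr := &tar.Header{Name: m.Name, Mode: 0600, Size: int64(len(m.Data))}
	if err := tw.WriteHeader(hdr); err != nil {
		return err
	}
	_, err = tw.Write(m.Data)
	return err
}

func main() {
	for m := range readFromSource() {
		// Retry transient errors caused by the close of a previous write.
		for {
			if err := writeRecord(m); err != nil {
				log.Printf("writing %s failed, retrying: %v", m.Name, err)
				continue
			}
			break
		}
	}
}
```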

A minimal spout specification must include the following
parameters:

| Parameter   | Description |
| ----------- | ----------- |
| `name`      | The name of your data pipeline and the output repository. You can set an <br> arbitrary name that is meaningful to the code you want to run. |
| `transform` | Specifies the code that you want to run against your data, such as a Python <br> or Go script. Also, it specifies a Docker image that you want to use to run that script. |
| `overwrite` | (Optional) Specifies whether to overwrite the existing content <br> of the file from previous commits or previous calls to the <br> `put file` command within this commit. The default value is `false`. |

The following text is an example of a minimal specification:

!!! note
    The `env` property is an optional argument. Alternatively, you can
    define your data stream source from within the container in which
    you run your script. For simplicity, in this example, `env` specifies
    the source of the Kafka host.

```
{
  "pipeline": {
    "name": "my-spout"
  },
  "transform": {
    "cmd": [ "go", "run", "./main.go" ],
    "image": "myaccount/myimage:0.1",
    "env": {
        "HOST": "kafkahost",
        "TOPIC": "mytopic",
        "PORT": "9092"
    }
  },
  "spout": {
    "overwrite": false
  }
}
```
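
Assuming you save this specification as `spout.json`, you can create the pipeline with `pachctl create pipeline -f spout.json`.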

## Resuming Spout Progress

When a spout container crashes, all incomplete operations
that were in progress before the crash are lost, and the spout
must restart the interrupted data operation from scratch.
To keep a record of its progress, so that the spout can
continue where it left off after a restart, you can
configure a record-tracking `marker` file for your spout.

When you specify the `marker` parameter as a subfield in the
`spout` section of your pipeline, Pachyderm creates
the `marker` file or directory. The file or directory is named
according to the provided value. For example, if you specify
`"marker": "offset"`, Pachyderm stores the current marker
in `pfs/out/offset` and the previous marker in `pfs/offset`.
If a spout container crashes and then starts
again, it can read the `marker` file and resume where it left
off instead of starting over.
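
The following Go sketch shows one way spout code might use a marker, assuming `"marker": "offset"` in the specification. The numeric-offset format is only an example; the essential pattern is reading the previous marker from `/pfs/offset` on startup and writing the new marker into the `tar` stream under the marker name.

```
package main

import (
	"archive/tar"
	"io/ioutil"
	"log"
	"os"
	"strconv"
	"strings"
)

// readMarker returns the last recorded offset, or 0 on the first run.
// With "marker": "offset" in the specification, Pachyderm exposes the
// previous marker to the container at /pfs/offset.
func readMarker() int64 {
	data, err := ioutil.ReadFile("/pfs/offset")
	if err != nil {
		return 0 // No marker yet: this is the first run.
	}
	offset, _ := strconv.ParseInt(strings.TrimSpace(string(data)), 10, 64)
	return offset
}

// writeMarker records the new offset by adding a file named "offset"
// to the tar stream, next to the data files for the current commit.
func writeMarker(tw *tar.Writer, offset int64) error {
	data := []byte(strconv.FormatInt(offset, 10))
	hdr := &tar.Header{Name: "offset", Mode: 0600, Size: int64(len(data))}
	if err := tw.WriteHeader(hdr); err != nil {
		return err
	}
	_, err := tw.Write(data)
	return err
}

func main() {
	offset := readMarker()

	out, err := os.OpenFile("/pfs/out", os.O_WRONLY, 0644)
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	tw := tar.NewWriter(out)
	defer tw.Close()

	// ... write the data files for records after `offset` here ...

	// Store the new marker so the next run can resume from this point.
	if err := writeMarker(tw, offset+1); err != nil {
		log.Fatal(err)
	}
}
```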

Markers are useful if you want to leverage the record-tracking
functionality of an external messaging system, such as
Apache® Kafka offset management or similar.

To see how a marker works in Pachyderm, check
the [Resuming a Spout Pipeline example](https://github.com/pachyderm/pachyderm/tree/master/examples/spouts/spout-marker).