> INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples

# Resuming a Spout Pipeline

This example is based on the **spouts 1.0** implementation
prior to Pachyderm 1.12.
The implementation in spouts 2.0 is significantly different.
We recommend upgrading
to the latest version
of Pachyderm
and using the **spouts 2.0** implementation.

Pachyderm enables you to create a special pipeline
called a *spout* that ingests streaming
data from an external source into Pachyderm. Examples
of such sources include a message queue or a transaction
log.

Some of these streaming platforms can keep track of
the current record so that, in case of a network failure,
processing can resume from where it left off.
For example, Apache® Kafka tracks the messages that
will be sent to a Kafka consumer by recording the position of
a pointer to the last record. This pointer is called an offset.
If a Kafka consumer fails, it can then read the offset and resume
from the position of the last processed message.

When you use such a system in conjunction with Pachyderm,
this progress needs to be tracked within Pachyderm as well.
A spout has an option to specify a file or directory in
which Pachyderm can keep track of Kafka offsets or
of similar record position trackers.
This file is called a *spout marker* or just *marker*.
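
To make the resume-from-marker idea concrete, here is a minimal Python sketch.
It is not the code used in this example and does not call any Pachyderm or
Kafka API: the `load_offset`, `save_offset`, and `consume` helpers, and the
plain-text marker file that holds a single integer, are assumptions made
purely for illustration.

```python
import os
import tempfile


def load_offset(marker_path):
    """Return the last saved offset, or 0 if no marker exists yet."""
    try:
        with open(marker_path) as marker:
            return int(marker.read().strip() or 0)
    except FileNotFoundError:
        return 0


def save_offset(marker_path, offset):
    """Persist the position of the next record to process."""
    with open(marker_path, "w") as marker:
        marker.write(str(offset))


def consume(records, marker_path, limit):
    """Process up to `limit` records, starting at the saved offset."""
    start = load_offset(marker_path)
    processed = records[start:start + limit]
    save_offset(marker_path, start + len(processed))
    return processed


# A hypothetical stream of five records and a marker in a scratch directory.
records = ["r0", "r1", "r2", "r3", "r4"]
marker_path = os.path.join(tempfile.mkdtemp(), "mymark")

first_run = consume(records, marker_path, limit=2)   # fresh start
second_run = consume(records, marker_path, limit=2)  # simulated restart
print(first_run, second_run)
```

The second call starts at `r2` because the offset survived in the marker
file. A spout marker plays the same role for a spout that is restarted.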

A spout marker records the progress of a spout pipeline
so that, after an error, modification, or interruption,
the pipeline can resume where it left off.

In this example, we will create a spout pipeline with a
marker file that tracks the progress of the pipeline.
Then, we will modify the pipeline and observe
how the spout continues to update records without interruption.

## Prerequisites

Before you begin, verify that you have the following components
installed on your machine:

* Pachyderm 1.9.12 or later
* Terminal

## Pipeline Overview

In this example, we will use a simple spout pipeline that
adds dots to a spout marker file. Here is what the
marker file will look like:

```
.
..
...
....
.....
......
.......
```

The script runs every thirty seconds and appends a dot (`.`) and a new line
to the marker file, creating a half pyramid pattern.

After running the pipeline for some time, we will modify the pipeline
so that the script adds the star (`*`) symbol instead of a dot. The
resulting file should look like this:

```
.
..
...
....
.....
......
.......
.......*
.......**
.......***
.......****
.......*****
```

## Step 1: Build the Docker Image

Pachyderm uses Docker images that you specify in your
pipeline to create Kubernetes pods that run your code.
For this example, we will use a very simple [Dockerfile](./Dockerfile)
that pulls a basic Python image and adds your code to
the container.

To build a Docker image, complete the following steps:

1. Clone this repository:

   ```shell
   git clone git@github.com:pachyderm/pachyderm.git
   ```

1. Change the directory to `examples/spouts/spout-marker/`.

1. Build and tag a Docker image from the Dockerfile in this directory:

   ```shell
   docker build --tag spout-marker:v1 .
   ```

   !!! note
       **Note:** Do not forget the dot at the end!

1. Push the Docker image to an image registry.

   * If you are using `minikube`, for testing you can just
     transfer your local image to a `minikube` VM:

     ```shell
     docker save spout-marker:v1 | (\
       eval $(minikube docker-env)
       docker load
     )
     ```

1. Proceed to [Step 2](#step-2-create-the-pipeline).

## Step 2: Create the Pipeline

Because spouts do not have an input and consume data from an
outside source, you do not need to create a Pachyderm
repository for this example. We also do not need
to set up any messaging system, because the Python script
generates the data for us.

However, you still need to create
the spout pipeline with a marker file.
The pipeline specification for this example is stored in
[spout-marker-pipeline.json](./spout-marker-pipeline.json).

The Python script that we will use for this example is stored in
[spout-marker-example.py](./spout-marker-example.py).

When you create a spout pipeline with a marker file, Pachyderm
creates a separate branch for the spout marker and stores the
marker file in that branch.

This example includes a test spout pipeline that demonstrates markers.
To use it, complete the following steps:

1. Create a spout pipeline using this JSON specification:

   ```json
   {
     "pipeline": {
       "name": "spoutmarker"
     },
     "transform": {
       "cmd": [ "python3", "/spout-marker-example.py" ],
       "image": "spout-marker:v1",
       "env" : {
         "OUTPUT_CHARACTER": "."
       }
     },
     "spout": {
       "marker": "mymark",
       "overwrite": true
     }
   }
   ```

   !!! note
       **Note:** In the `spout` section, you have a key-value pair
       `"marker": "mymark"`. `mymark` is the name of your marker file.
       If you use multiple marker files, `mymark` will be a
       prefix of all marker files, which might be named `mymark01`,
       `mymark02`, and so on.

1. View the list of pipelines:

   ```shell
   pachctl list pipeline
   ```

   **System response:**

   ```shell
   NAME        VERSION INPUT CREATED       STATE / LAST JOB   DESCRIPTION
   spoutmarker 1       none  2 minutes ago running / starting
   ```

   The pipeline also creates an output repository by the same name.

1. View the list of branches created for this pipeline:

   ```shell
   pachctl list branch spoutmarker
   ```

   **System response:**

   ```
   BRANCH HEAD
   marker fb7df194725f4d2c8786e466282a7cde
   master 77935404f3ce48f09f4fd27147948e75
   ```

   Pachyderm created a `marker` branch for the
   `spoutmarker` pipeline. According to our Python code, a dot
   should be added to the `marker` file every 10 seconds. Each of these
   transactions creates a commit in both the `master` and `marker` branches
   of the `spoutmarker` output repository.

   ```shell
   pachctl list commit spoutmarker@master
   ```

   **System response:**

   ```
   REPO        BRANCH COMMIT                           FINISHED      SIZE PROGRESS DESCRIPTION
   spoutmarker master f91d27382b8a40408504865783b717e9 3 minutes ago 0B   -
   spoutmarker master 333ab0ed77a24210a5ec3d613ea0c8e4 2 minutes ago 0B   -
   ```

   ```shell
   pachctl list commit spoutmarker@marker
   ```

   **System response:**

   ```
   REPO        BRANCH COMMIT                           FINISHED      SIZE PROGRESS DESCRIPTION
   spoutmarker marker dda511ef0e5c4238bc368869574125ac 3 minutes ago 4B
   spoutmarker marker e4c5f71b40e74372bff7cf6fd9dcfb89 2 minutes ago 1B
   ```

   !!! note
       **Note:** Because the script appends to the marker file, each new commit
       is larger than the previous one.

1. View the marker file:

   ```shell
   pachctl get file spoutmarker@marker:/mymark
   ```

   **System response:**

   ```shell
   .
   ..
   ...
   ....
   .....
   ```

   Run this command a few times to see that a new dot is appended every
   10 seconds.

1. (Optional) View the output.

   ```shell
   pachctl get file spoutmarker@master:/output
   ```

   **System response:**

   ```
   ......
   ```

1. Proceed to [Step 3](#step-3-modify-the-pipeline-code).

## Step 3: Modify the Pipeline Code

Now that the pipeline is running correctly, let's modify it and
check that the new symbol is appended to the existing marker file.

To modify the pipeline code, complete the following steps:

1. Edit the pipeline in place,
   changing the value of the `OUTPUT_CHARACTER` environment variable
   from `.` to `*`:

   ```shell
   pachctl edit pipeline spoutmarker
   ```

   !!! note
       **Note:** You can set the environment variable `EDITOR` to use
       your preferred text editor.

   The new pipeline definition will look something like this in your text editor:

   ```json
   {
     "pipeline": {
       "name": "spoutmarker"
     },
     "transform": {
       "image": "spout-marker:v1",
       "cmd": [
         "python3",
         "/spout-marker-example.py"
       ],
       "env": {
         "OUTPUT_CHARACTER": "*"
       }
     },
     "output_branch": "master",
     "cache_size": "64M",
     "max_queue_size": "1",
     "spout": {
       "overwrite": true,
       "marker": "mymark"
     },
     "salt": "ea04c48e993c45a781b5ba315b230674",
     "datum_tries": "3"
   }
   ```

   !!! note
       **Note:** You can also edit the pipeline spec in
       the original `spout-marker-pipeline.json` file and use
       `pachctl update pipeline -f spout-marker-pipeline.json`
       to accomplish the same task.

1. Once you save this file and leave the editor,
   you'll see the pipeline restart.
   View the list of pipelines:

   ```shell
   pachctl list pipeline
   ```

   **System response:**

   ```shell
   NAME        VERSION INPUT CREATED        STATE / LAST JOB   DESCRIPTION
   spoutmarker 2       none  10 minutes ago running / starting
   ```

   Your pipeline was updated to version `2` and is
   running your updated code. You might need to wait for some time, but
   eventually, your `marker` file will look like this:

   ```shell
   pachctl get file spoutmarker@marker:/mymark
   ```

   **System response:**

   ```
   .
   ..
   ...
   ....
   .....
   ......
   ......*
   ......**
   ```

1. (Optional) View the output.

   ```shell
   pachctl get file spoutmarker@master:/output
   ```

   **System response:**

   ```
   ......**
   ```

## Summary

This example demonstrates that spout pipelines can be configured
to use a special `marker` file or directory that can keep track of
Kafka offsets or of similar record position trackers.
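
The half-pyramid behavior observed in this example can be sketched in a few
lines of Python. This is an illustrative assumption about the logic, not the
contents of `spout-marker-example.py`: the `next_marker` function is
hypothetical, and the sketch leaves out how the real script reads the marker
from Pachyderm and writes it back.

```python
def next_marker(contents, output_character):
    """Append one line that extends the last line by one character."""
    lines = contents.splitlines()
    last_line = lines[-1] if lines else ""
    return contents + last_line + output_character + "\n"


marker = ""
for _ in range(3):  # three iterations with the original setting
    marker = next_marker(marker, ".")
for _ in range(2):  # two more after switching the character to "*"
    marker = next_marker(marker, "*")
print(marker, end="")
```

Because each new line is built from the last line already stored in the
marker, changing the character does not reset the pattern: the stars
continue from the existing dots.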