github.com/pachyderm/pachyderm@v1.13.4/examples/spouts/spout-marker/README.md

>![pach_logo](../../img/pach_logo.svg) INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples

# Resuming a Spout Pipeline

![pach_logo](./img/pach_logo.svg)
   This example is based on the **spouts 1.0** implementation
   used prior to Pachyderm 1.12.
   The implementation in spouts 2.0 is significantly different.
   We recommend upgrading to the latest version of Pachyderm
   and using the **spouts 2.0** implementation.

Pachyderm enables you to create a special pipeline
called a *spout* that ingests streaming
data from an external source into Pachyderm. Examples
of such sources include a message queue or a transaction
log.

Some of these streaming platforms can keep track of
the current record so that, in case of a network failure,
processing can resume from where it left off.
For example, Apache® Kafka tracks the messages that
will be sent to a Kafka consumer by recording the position of
a pointer to the last record. This pointer is called an offset.
If a Kafka consumer fails, it can then read the offset and resume
from the position of the last processed message.

When you use such a system in conjunction with Pachyderm,
this progress needs to be tracked within Pachyderm as well.
A spout has an option to specify a file or directory in
which Pachyderm can keep track of the Kafka offsets or
of similar record position trackers.
This file is called a *spout marker* or just *marker*.

A spout marker records the progress of a spout pipeline
so that, after an error, modification, or interruption,
the pipeline can resume where it left off.
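
The resume behavior can be illustrated with a small, self-contained sketch. It is not Pachyderm code: `read_marker`, `write_marker`, and `consume` are hypothetical helpers, the marker is a plain local file in a temporary directory, and the "stream" is an in-memory list. It only shows the idea of persisting a position and picking up from it after a restart.

```python
import os
import tempfile

def read_marker(path):
    """Return the saved position, or 0 if no marker exists yet."""
    try:
        with open(path) as f:
            return int(f.read().strip() or 0)
    except FileNotFoundError:
        return 0

def write_marker(path, position):
    """Persist the position of the next record to process."""
    with open(path, "w") as f:
        f.write(str(position))

def consume(records, marker_path, limit):
    """Process up to `limit` records, starting at the saved position."""
    start = read_marker(marker_path)
    processed = records[start:start + limit]
    write_marker(marker_path, start + len(processed))
    return processed

# Hypothetical record stream and marker location, for illustration only.
records = ["r0", "r1", "r2", "r3", "r4"]
marker_path = os.path.join(tempfile.mkdtemp(), "mymark")

first_run = consume(records, marker_path, limit=2)    # a run that stops early
second_run = consume(records, marker_path, limit=10)  # a "restarted" run
print(first_run, second_run)
```

Because the second call reads the position that the first call saved, it continues with the third record instead of starting over.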

In this example, we will create a spout pipeline with a
marker file that will track the progress of a pipeline.
Then, we will modify the pipeline and observe
how the spout continues to update records without interruption.

## Prerequisites

Before you begin, verify that you have the following components
installed on your machine:

* Pachyderm 1.9.12 or later
* Terminal

## Pipeline Overview

In this example, we will use a simple spout pipeline that
will add dots into a spout marker file. Here is what the
marker file will look like:

```
.
..
...
....
.....
......
.......
```

The script runs every 10 seconds and appends a dot (`.`) and a new line
to the marker file, creating a half pyramid pattern.

After running the pipeline for some time, we will modify the pipeline
so that the Python script adds the star (`*`) symbol instead of a dot. The
resulting file should look like this:

```

.
..
...
....
.....
......
.......*
.......**
.......***
.......****
.......*****
```
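
The pattern above can be reproduced with a few lines of Python. This is only a sketch of the append logic, not the example script itself: `append_line` is a hypothetical helper, and the marker is kept in a string rather than in a Pachyderm branch.

```python
def append_line(marker_text, character):
    """Append a new line made of the previous last line plus one character."""
    lines = marker_text.splitlines()
    last_line = lines[-1] if lines else ""
    return marker_text + last_line + character + "\n"

marker = ""
for _ in range(7):   # seven ticks with the original character
    marker = append_line(marker, ".")
for _ in range(5):   # five ticks after switching to the new character
    marker = append_line(marker, "*")
print(marker, end="")
```

Each new line is built from the last line already in the marker, which is why the star lines continue from the longest row of dots instead of starting a new pyramid.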

## Step 1: Build the Docker Image

Pachyderm uses Docker images that you specify in your
pipeline to create Kubernetes pods that run your code.
For this example, we will use a very simple [Dockerfile](./Dockerfile)
that pulls a basic Python image and adds your code to
the container.

To build a Docker image, complete the following steps:

1. Clone this repository:

   ```shell
   git clone git@github.com:pachyderm/pachyderm.git
   ```

1. Change the directory to `examples/spouts/spout-marker/`.

1. Build and tag a Docker image from the Dockerfile in this directory:

   ```shell
   docker build --tag spout-marker:v1 .
   ```

   !!! note
   **Note:** Do not forget the dot at the end!

1. Push the Docker image to an image registry.

   * If you are using `minikube`, for testing you can just
     transfer your local image to a `minikube` VM:

     ```shell
     docker save spout-marker:v1 | (\
     eval $(minikube docker-env)
     docker load
     )
     ```

1. Proceed to [Step 2](#step-2-create-the-pipeline).

## Step 2: Create the Pipeline

Because spouts do not have an input and consume data from an
outside source, you do not need to create a Pachyderm
repository for this example. We also do not need
to set up any messaging system, because the Python script will
generate the data for us.

However, you still need to create
the spout pipeline with a marker file.
The pipeline specification for this example is stored in
[spout-marker-pipeline.json](./spout-marker-pipeline.json).

The Python script that we will use for this example is stored in
[spout-marker-example.py](./spout-marker-example.py).

When you create a spout pipeline with a marker file, Pachyderm
creates a separate branch for the spout marker and stores the
marker file in that branch.
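
The script itself is not reproduced in this README. As a rough sketch, and assuming the spouts 1.0 convention in which a spout sends its files as a tar stream (in a real pipeline, to the named pipe `/pfs/out`), the data file and the marker file could be packed together as shown below. The helper `build_tar_stream` and the file contents are hypothetical, and the sketch writes to memory instead of the pipe so that it can run anywhere.

```python
import io
import tarfile

def build_tar_stream(files):
    """Pack a {name: text} mapping into an uncompressed tar stream."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w|") as tar:
        for name, text in files.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()

# Assumption: "output" is the data file and "mymark" matches the marker
# name from the pipeline spec, so it would land in the marker branch.
payload = build_tar_stream({"output": "...\n", "mymark": ".\n..\n...\n"})

# Read the stream back to confirm what a receiver would see.
with tarfile.open(fileobj=io.BytesIO(payload), mode="r|") as tar:
    names = [member.name for member in tar]
print(names)
```

Sending both files in one stream keeps the data and the recorded progress in step with each other, which is the property that makes resuming reliable.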

This example includes a test spout pipeline that demonstrates markers.
To use it, complete the following steps:

1. Create a spout pipeline using this JSON, for example by running
   `pachctl create pipeline -f spout-marker-pipeline.json`:

   ```json
   {
       "pipeline": {
           "name": "spoutmarker"
       },
       "transform": {
           "cmd": [ "python3", "/spout-marker-example.py" ],
           "image": "spout-marker:v1",
           "env" : {
               "OUTPUT_CHARACTER": "."
           }
       },
       "spout": {
           "marker": "mymark",
           "overwrite": true
       }
   }
   ```

   !!! note
   **Note:** In the `spout` section, you have a key-value pair
   `"marker": "mymark"`. `mymark` is the name of your marker file.
   If you use multiple marker files, `mymark` will be a
   prefix of all marker files that might be named as `mymark01`,
   `mymark02`, and so on.

1. View the list of pipelines:

   ```shell
   pachctl list pipeline
   ```

   **System response:**

   ```shell
   NAME        VERSION INPUT CREATED        STATE / LAST JOB   DESCRIPTION
   spoutmarker 1       none  2 minutes ago  running / starting
   ```

   The pipeline also creates an output repository with the same name.

1. View the list of branches created for this pipeline:

   ```shell
   pachctl list branch spoutmarker
   ```

   **System response:**

   ```
   BRANCH HEAD
   marker fb7df194725f4d2c8786e466282a7cde
   master 77935404f3ce48f09f4fd27147948e75
   ```

   Pachyderm created a `marker` branch for the
   `spoutmarker` pipeline. According to our Python code, a dot
   should be added to the `marker` file every 10 seconds. Each of these
   transactions creates a commit in both `master` and `marker` branches
   in the `spoutmarker` output repository.

   ```shell
   pachctl list commit spoutmarker@master
   ```

   **System response:**

   ```
   REPO        BRANCH COMMIT                           FINISHED      SIZE PROGRESS DESCRIPTION
   spoutmarker master f91d27382b8a40408504865783b717e9 3 minutes ago 0B   -
   spoutmarker master 333ab0ed77a24210a5ec3d613ea0c8e4 2 minutes ago 0B   -
   ```

   ```shell
   pachctl list commit spoutmarker@marker
   ```

   **System response:**

   ```
   REPO        BRANCH COMMIT                           FINISHED      SIZE PROGRESS DESCRIPTION
   spoutmarker marker dda511ef0e5c4238bc368869574125ac 3 minutes ago 4B
   spoutmarker marker e4c5f71b40e74372bff7cf6fd9dcfb89 2 minutes ago 1B
   ```

   !!! note
   **Note:** Because the script appends to the marker file, each new commit
   is larger than the previous one.

1. View the marker file:

   ```shell
   pachctl get file spoutmarker@marker:/mymark
   ```

   **System response:**

   ```shell
   .
   ..
   ...
   ....
   .....
   ```

   Run this command a few times to see that a new dot is appended every
   10 seconds.

1. (Optional) View the output.

   ```shell
   pachctl get file spoutmarker@master:/output
   ```

   **System response:**

   ```
   ......
   ```

1. Proceed to [Step 3](#step-3-modify-the-pipeline-code).

## Step 3: Modify the Pipeline Code

Now that our pipeline is running correctly, let's modify it and
see whether it continues to append to the marker file using the new symbol.

To modify the pipeline code, complete the following steps:

1. Edit the pipeline in place,
   changing the value of the `OUTPUT_CHARACTER` environment variable
   from `.` to `*`:

   ```shell
   pachctl edit pipeline spoutmarker
   ```

   !!! note
   **Note:** You can set the environment variable `EDITOR` to use
   your preferred text editor.

   The new pipeline definition will look something like this in your text editor:

   ```json
   {
       "pipeline": {
           "name": "spoutmarker"
       },
       "transform": {
           "image": "spout-marker:v1",
           "cmd": [
               "python3",
               "/spout-marker-example.py"
           ],
           "env": {
               "OUTPUT_CHARACTER": "*"
           }
       },
       "output_branch": "master",
       "cache_size": "64M",
       "max_queue_size": "1",
       "spout": {
           "overwrite": true,
           "marker": "mymark"
       },
       "salt": "ea04c48e993c45a781b5ba315b230674",
       "datum_tries": "3"
   }
   ```

   !!! note
   **Note:** You can also edit the pipeline spec in
   the original `spout-marker-pipeline.json` file and use
   `pachctl update pipeline -f spout-marker-pipeline.json`
   to accomplish the same task.

1. Once you save this file and leave the editor,
   you'll see the pipeline restart.
   View the list of pipelines:

   ```shell
   pachctl list pipeline
   ```

   **System response:**

   ```shell
   NAME        VERSION INPUT CREATED        STATE / LAST JOB   DESCRIPTION
   spoutmarker 2       none  10 minutes ago running / starting
   ```

   Your pipeline was updated to version `2` and is
   running your updated code. You might need to wait for some time, but
   eventually, your `marker` file will look like this:

   ```shell
   pachctl get file spoutmarker@marker:/mymark
   ```

   **System response:**

   ```
   .
   ..
   ...
   ....
   .....
   ......
   ......*
   ......**
   ```

1. (Optional) View the output.

   ```shell
   pachctl get file spoutmarker@master:/output
   ```

   **System response:**

   ```
   ......**
   ```
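
If you prefer to change the spec file rather than edit the pipeline interactively, the change to `OUTPUT_CHARACTER` can also be scripted. The following is a minimal sketch under stated assumptions: `with_output_character` is a hypothetical helper, and the spec is embedded as a string (copied from Step 2) so that the example runs without the repository checked out.

```python
import copy
import json

# Pipeline spec from Step 2, embedded here for a self-contained example.
SPEC_JSON = """
{
    "pipeline": {"name": "spoutmarker"},
    "transform": {
        "cmd": ["python3", "/spout-marker-example.py"],
        "image": "spout-marker:v1",
        "env": {"OUTPUT_CHARACTER": "."}
    },
    "spout": {"marker": "mymark", "overwrite": true}
}
"""

def with_output_character(spec, character):
    """Return a copy of the spec with OUTPUT_CHARACTER replaced."""
    updated = copy.deepcopy(spec)
    updated["transform"].setdefault("env", {})["OUTPUT_CHARACTER"] = character
    return updated

original = json.loads(SPEC_JSON)
updated = with_output_character(original, "*")
print(json.dumps(updated["transform"]["env"]))
```

Writing `updated` back to `spout-marker-pipeline.json` with `json.dump` and then running `pachctl update pipeline -f spout-marker-pipeline.json` would apply the change. The `spout` section, including the marker name, is left untouched, so the updated pipeline keeps using the same marker file.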

## Summary

This example demonstrates that spout pipelines can be configured
to use a special `marker` file or directory that can keep track of
Kafka offsets or of similar record position trackers.