>![pach_logo](../../img/pach_logo.svg) INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre- and post-2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples
# Commit messages from a Kafka queue

![pach_logo](./img/pach_logo.svg)
This example is based on the **spouts 1.0** implementation
used prior to Pachyderm 1.12.
The **spouts 2.0** implementation is significantly different.
We recommend upgrading to the latest version of Pachyderm
and using the **spouts 2.0** implementation.

This is a simple example of using spouts with [Kafka](https://kafka.apache.org) to process messages and write them to files.

This example spout connects to a Kafka queue and reads a topic.
The spout then writes each message in the topic to a file named for the topic and offset.
It uses Kafka group IDs to maintain a cursor into the offset in the topic,
making it resilient to restarts.

## Prerequisites

If you would like to run the Kafka cluster included with this example,
using the `make kafka` target,
you must deploy an Amazon EKS cluster with at least three (3) m5.xlarge machines.

To deploy an EKS cluster,
follow the instructions in the [Amazon EKS documentation](https://docs.aws.amazon.com/eks/latest/userguide/create-cluster.html).

The Kafka cluster can also be deployed on other cloud providers.
See [setup](#setup) below for more information.

The Pachyderm code in this example requires a Pachyderm cluster version 1.9.8 or later.

## Introduction

Apache® Kafka is a distributed streaming platform
that is used in a variety of applications to provide communications between microservices.
Many Pachyderm users use Kafka to ingest data from legacy data sources using Pachyderm spouts.

Pachyderm spouts are a way to ingest data into Pachyderm
by having your code get the data from inside a Pachyderm pipeline.

This is the simplest possible implementation of a Pachyderm spout using Kafka to ingest data.
The data ingested is simply the message posted to Kafka.
The filename is derived from the Kafka topic name and the message's offset in the topic.
You should be able to easily adapt it to your needs.

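To make the mechanics concrete, here is a minimal, hypothetical sketch of the spouts 1.0 pattern this example follows,
using the `github.com/segmentio/kafka-go` client that the example vendors:
read a message with a `GroupID` so that the broker tracks the offset cursor,
then write the message through the `/pfs/out` named pipe as a tar entry named for the topic and offset.
See `source/main.go` for the real implementation; the broker address, topic, and error handling here are simplified assumptions.

```go
package main

import (
	"archive/tar"
	"context"
	"fmt"
	"log"
	"os"

	kafka "github.com/segmentio/kafka-go"
)

func main() {
	// The GroupID makes the broker remember this consumer's offset,
	// so the spout resumes where it left off after a restart.
	reader := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"kafka.kafka:9092"},
		Topic:   "test_topic",
		GroupID: "test_group",
	})
	defer reader.Close()

	for {
		msg, err := reader.ReadMessage(context.Background())
		if err != nil {
			log.Fatal(err)
		}

		// In spouts 1.0, each commit is a tarstream written to the
		// /pfs/out named pipe: open the pipe, write one tar entry
		// named <topic>-<offset>, and close the pipe to commit.
		out, err := os.OpenFile("/pfs/out", os.O_WRONLY, 0644)
		if err != nil {
			log.Fatal(err)
		}
		tw := tar.NewWriter(out)
		hdr := &tar.Header{
			Name: fmt.Sprintf("%s-%d", msg.Topic, msg.Offset),
			Size: int64(len(msg.Value)),
			Mode: 0600,
		}
		if err := tw.WriteHeader(hdr); err != nil {
			log.Fatal(err)
		}
		if _, err := tw.Write(msg.Value); err != nil {
			log.Fatal(err)
		}
		tw.Close()
		out.Close()
	}
}
```
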
## Setup

This example includes a pre-configured Kafka cluster that you can deploy on Amazon EKS,
adapted from [Craig Johnston's blog post](https://imti.co/kafka-kubernetes/).
If you'd like to adapt it for your own cluster on GCP, Azure, or an on-premises Kubernetes deployment,
the `001-storage-class.yaml` file is probably the only thing you'd need to change.
You can replace the parameters and provisioner with the appropriate ones for your environment.

If you already have a Kafka cluster set up, you may skip step 1 of the Kafka setup.

To correctly build the Docker container from source,
you must have the Pachyderm source repo structure around this example.
It depends on the directory `../../../vendor/github.com/segmentio/kafka-go`,
relative to this one,
containing the correct code.
You can, of course, set up your Go development environment to achieve the same result.

### Kafka setup

1. In the directory `additional_manifests`,
you'll find a numbered sequence of Kubernetes manifests for creating a fully-functioning Kafka deployment.
You can use the makefile target `make kafka`,
which deploys a Kafka cluster in a `kafka` namespace created in the first step.
This Kafka deployment is intended for testing only
and is not recommended for production use.
If you'd like to see the order in which the manifests will be loaded into Kubernetes,
run the command

```shell
make -n kafka
```

!!! note
    If you are redeploying a Kafka deployment, run `make clean` before running `make kafka`.

You can confirm that the Kafka cluster is running properly by verifying that the pods are running.

!!! note
    Before deploying Kafka,
    verify that you are using the correct Kubernetes context by running `kubectl config get-contexts`.
    For example, when you are deploying on EKS, your active context should end with `eksctl.io`.

```shell
$ kubectl get pods -n kafka
NAME                READY   STATUS    RESTARTS   AGE
kafka-0             1/1     Running   0          3d19h
kafka-1             1/1     Running   0          3d19h
kafka-2             1/1     Running   0          3d19h
kafka-test-client   1/1     Running   0          3d19h
kafka-zookeeper-0   1/1     Running   0          3d19h
kafka-zookeeper-1   1/1     Running   0          3d19h
kafka-zookeeper-2   1/1     Running   0          3d19h
```

2. Once the Kafka cluster is running, create the topic you'd like to consume messages from.
The example is configured to look for a topic called `test_topic`.
You may modify the Makefile to use another topic name, of course.
To use the example's Kafka environment,
you may use the following command to create the topic:

```shell
$ kubectl -n kafka exec kafka-test-client -- /usr/bin/kafka-topics --zookeeper \
      kafka-zookeeper.kafka:2181 --topic test_topic --create \
      --partitions 1 --replication-factor 1
Created topic "test_topic".
```

!!! note
    The command uses Kubernetes DNS names to specify the Kafka zookeeper service,
    `kafka-zookeeper.kafka`.

You can confirm that your topic was created with the following command:

```shell
$ kubectl -n kafka exec kafka-test-client -- /usr/bin/kafka-topics --zookeeper \
       kafka-zookeeper.kafka:2181 --list
```

It should return the topic you created and the topic `__confluent.support.metrics`.

3. You can start populating the topic with data using the `kafka-console-producer` command.
It provides you with a `>` prompt for entering data,
delimited by lines for each offset into the topic.
In the example below, the message at offset 0 is `yo`,
at offset 1, `man`,
and so on.
Data entry is completed with an end-of-file character,
`Control-d` in most shells.

```shell
$ kubectl -n kafka exec -ti kafka-test-client --  /usr/bin/kafka-console-producer \
   --broker-list kafka.kafka:9092 --topic test_topic
>yo
>man
>this
>is so
>cool!!
>
```

4. You can see if the data has been added to the topic with the `kafka-console-consumer` command.

```shell
$ kubectl -n kafka exec -ti kafka-test-client -- /usr/bin/kafka-console-consumer \
   --bootstrap-server kafka:9092 --topic test_topic --from-beginning
yo
man
this
is so
cool!!
```

Terminate the command to see the following message:

```
^CProcessed a total of 5 messages
command terminated with exit code 130
```

### Pachyderm setup

1. If you would simply like to use the prebuilt spout image,
you can create the spout with the pachctl command,
using the pipeline definition available in the `pipelines` directory:

```shell
$ pachctl create pipeline -f pipelines/kafka_spout.pipeline
```

!!! note
    The Makefile included with this example has a target for customizing that pipeline.

2. To create your own version of the spout,
you may modify the Makefile to use your own Docker Hub account, container name, and version
by changing these variables accordingly:

```
CONTAINER_VERSION := $(shell pachctl version --client-only)
DOCKER_ACCOUNT := pachyderm
CONTAINER_NAME := kafka_spout
```

The Makefile has targets for `create-dag` and `update-dag`,
or you may simply build the image with the `docker-image` target.

3. Once the spout is running,
if the `VERBOSE_LOGGING` variable is set to anything other than `false`,
you will see verbose logging in the `kafka_spout` pipeline logs.

```shell
$ pachctl logs -p kafka_spout -f
creating new kafka reader for kafka.kafka:9092 with topic 'test_topic' and group 'test_group'
reading kafka queue.
opening named pipe /pfs/out.
opening tarstream
processing header for topic test_topic @ offset 0
processing data for topic  test_topic @ offset 0
closing tarstream.
closing named pipe /pfs/out.
cleaning up context.
reading kafka queue.
opening named pipe /pfs/out.
opening tarstream
processing header for topic test_topic @ offset 1
processing data for topic  test_topic @ offset 1
closing tarstream.
closing named pipe /pfs/out.
cleaning up context.
reading kafka queue.
opening named pipe /pfs/out.
opening tarstream
processing header for topic test_topic @ offset 2
processing data for topic  test_topic @ offset 2
closing tarstream.
closing named pipe /pfs/out.
cleaning up context.
reading kafka queue.
opening named pipe /pfs/out.
opening tarstream
processing header for topic test_topic @ offset 3
processing data for topic  test_topic @ offset 3
closing tarstream.
closing named pipe /pfs/out.
cleaning up context.
reading kafka queue.
opening named pipe /pfs/out.
opening tarstream
processing header for topic test_topic @ offset 4
processing data for topic  test_topic @ offset 4
closing tarstream.
closing named pipe /pfs/out.
...
```

You will also see the message files in the `kafka_spout` repo:

```shell
$ pachctl list file kafka_spout@master
NAME          TYPE SIZE
/test_topic-0 file 2B
/test_topic-1 file 3B
/test_topic-2 file 4B
/test_topic-3 file 5B
/test_topic-4 file 6B
```

## Pipelines

This section describes the pipeline used in this example.

### kafka_spout

The file `source/main.go` contains a simple Pachyderm spout that processes messages from Kafka,
saving each message to a file named for the topic and message offset in a Pachyderm repo.

It is configurable via environment variables and command-line flags.
Flags override environment variable settings.
If your Go development environment is set up correctly,
you can see the settings by running the command:

```shell
$ go run source/main.go --help
Usage of /var/folders/xl/xtvj4xtx0tv1llxcnbvlmwc40000gq/T/go-build997659573/b001/exe/main:
  -kafka_group_id string
    	the Kafka group for maintaining offset state (default "test")
  -kafka_host string
    	the hostname of the Kafka broker (default "kafka.kafka")
  -kafka_port string
    	the port of the Kafka broker (default "9092")
  -kafka_timeout int
    	the timeout in seconds for reading messages from the Kafka queue (default 5)
  -kafka_topic string
    	the Kafka topic for messages (default "test")
  -named_pipe string
    	the named pipe for the spout (default "/pfs/out")
  -v	verbose logging
exit status 2
```

The environment variables are as shown
in this excerpt from the `pipelines/kafka_spout.pipeline` file:

```shell
            "KAFKA_HOST": "kafka.kafka",
            "KAFKA_PORT": "9092",
            "KAFKA_TOPIC": "test_topic",
            "KAFKA_GROUP_ID": "test_group",
            "KAFKA_TIMEOUT": "5",
            "NAMED_PIPE": "/pfs/out",
            "VERBOSE_LOGGING": "false"
```
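
Since flags override the environment variables, one common Go pattern is to use the environment value as the flag's default,
so that an explicit command-line flag always wins.
Below is a minimal sketch of that idea with two of the settings;
it is an illustration of the pattern, not necessarily how `source/main.go` implements it.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// envOr returns the value of the environment variable if it is set,
// and the fallback otherwise.
func envOr(key, fallback string) string {
	if v, ok := os.LookupEnv(key); ok {
		return v
	}
	return fallback
}

func main() {
	// The environment variable seeds the flag's default;
	// passing the flag on the command line overrides it.
	kafkaHost := flag.String("kafka_host", envOr("KAFKA_HOST", "kafka.kafka"),
		"the hostname of the Kafka broker")
	kafkaTopic := flag.String("kafka_topic", envOr("KAFKA_TOPIC", "test"),
		"the Kafka topic for messages")
	flag.Parse()

	fmt.Printf("connecting to %s, topic %s\n", *kafkaHost, *kafkaTopic)
}
```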