> INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples

# Commit messages from a Kafka queue

This example is based on the **spouts 1.0** implementation
used prior to Pachyderm 1.12.
The implementation in spouts 2.0 is significantly different.
We recommend upgrading to the latest version of Pachyderm
and using the **spouts 2.0** implementation.

This is a simple example of using spouts with [Kafka](https://kafka.apache.org) to process messages and write them to files.

This example spout connects to a Kafka queue and reads a topic.
The spout then writes each message in the topic to a file named for the topic and offset.
It uses Kafka group IDs to maintain a cursor into the offset in the topic,
making it resilient to restarts.

## Prerequisites

If you would like to run the Kafka cluster included with this example,
using the `make kafka` target,
you must deploy an Amazon EKS cluster with at least three (3) m5.xlarge machines.

To deploy an EKS cluster,
follow the instructions in the [Amazon EKS documentation](https://docs.aws.amazon.com/eks/latest/userguide/create-cluster.html).

That Kafka cluster can also be deployed on other cloud providers.
See [setup](#setup) below for more information.

The Pachyderm code in this example requires a Pachyderm cluster version 1.9.8 or later.

## Introduction

Apache® Kafka is a distributed streaming platform
that is used in a variety of applications to provide communications between microservices.
Many Pachyderm users use Kafka to ingest data from legacy data sources using Pachyderm spouts.

Pachyderm spouts are a way to ingest data into Pachyderm
by having your code fetch the data from inside a Pachyderm pipeline.

This is the simplest possible implementation of a Pachyderm spout using Kafka to ingest data.
The data ingested is simply the message posted to Kafka.
The filename is derived from the Kafka topic name and the message's offset in the topic.
You should be able to easily adapt it to your needs.

## Setup

This example includes a pre-configured Kafka cluster that you can deploy on Amazon EKS,
adapted from [Craig Johnston's blog post](https://imti.co/kafka-kubernetes/).
If you'd like to adapt it for your own cluster on GCP, Azure, or an on-premises Kubernetes deployment,
the `001-storage-class.yaml` file is probably the only thing you'd need to change.
You can replace the parameters and provisioner with the appropriate ones for your environment.

If you already have a Kafka cluster set up, you may skip step 1 of the Kafka setup.

To correctly build the Docker container from source,
you must have the Pachyderm source repo structure around this example.
It depends on the directory `../../../vendor/github.com/segmentio/kafka-go`,
relative to this one,
containing the correct code.
You can, of course, set up your Go development environment to achieve the same result.
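That vendored `kafka-go` client is the key external dependency,
and the consume side of the spout reduces to a reader with a group ID,
which is what gives the spout its restart resilience.
The following is a minimal sketch of that pattern, not the actual `source/main.go`;
the broker, topic, and group names match this example's defaults,
and the cluster and topic are created in the steps below:

```go
package main

import (
	"context"
	"fmt"
	"time"

	kafka "github.com/segmentio/kafka-go"
)

func main() {
	// The group ID makes Kafka track this consumer's offset server-side,
	// which is what lets a restarted spout resume where it left off.
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"kafka.kafka:9092"},
		GroupID: "test_group",
		Topic:   "test_topic",
	})
	defer r.Close()

	for {
		// Mirror the example's five-second KAFKA_TIMEOUT per read.
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		m, err := r.ReadMessage(ctx)
		cancel()
		if err != nil {
			continue // timeout or transient error; try again
		}
		fmt.Printf("%s-%d: %s\n", m.Topic, m.Offset, m.Value)
	}
}
```

Because the reader joins a consumer group, Kafka itself remembers the last committed offset,
so a restarted spout picks up where it left off instead of re-reading the topic.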
### Kafka setup

1. In the directory `additional_manifests`,
   you'll find a numbered sequence of Kubernetes manifests for creating a fully functioning Kafka deployment.
   You can use the makefile target `make kafka`,
   which will deploy a Kafka cluster in a `kafka` namespace created in the first step.
   This Kafka deployment is intended for testing only;
   it is not recommended for production use.
   If you'd like to see the order in which the manifests will be loaded into Kubernetes,
   run the following command:

    ```shell
    make -n kafka
    ```

    !!! note
        If you are redeploying a Kafka deployment, run `make clean` before running `make kafka`.

    You can confirm that the Kafka cluster is running properly by verifying that the pods are running.

    !!! note
        Before deploying Kafka,
        verify that you are using the correct Kubernetes context by running `kubectl config get-contexts`.
        For example, when you are deploying on EKS, your active context should end with `eksctl.io`.

    ```shell
    $ kubectl get pods -n kafka
    NAME                READY   STATUS    RESTARTS   AGE
    kafka-0             1/1     Running   0          3d19h
    kafka-1             1/1     Running   0          3d19h
    kafka-2             1/1     Running   0          3d19h
    kafka-test-client   1/1     Running   0          3d19h
    kafka-zookeeper-0   1/1     Running   0          3d19h
    kafka-zookeeper-1   1/1     Running   0          3d19h
    kafka-zookeeper-2   1/1     Running   0          3d19h
    ```

2. Once the Kafka cluster is running, create the topic you'd like to consume messages from.
   The example is configured to look for a topic called `test_topic`.
   You may modify the Makefile to use another topic name, of course.
   To use the example's Kafka environment,
   you may use the following command to create the topic:

    ```shell
    $ kubectl -n kafka exec kafka-test-client -- /usr/bin/kafka-topics --zookeeper \
        kafka-zookeeper.kafka:2181 --topic test_topic --create \
        --partitions 1 --replication-factor 1
    Created topic "test_topic".
    ```

    !!! note
        The command uses Kubernetes DNS names to specify the Kafka ZooKeeper service,
        `kafka-zookeeper.kafka`.

    You can confirm that your topic was created with the following command:

    ```shell
    $ kubectl -n kafka exec kafka-test-client -- /usr/bin/kafka-topics --zookeeper \
        kafka-zookeeper.kafka:2181 --list
    ```

    It should return the topic you created and the topic `__confluent.support.metrics`.

3. You can start populating the topic with data using the `kafka-console-producer` command.
   It provides you with a `>` prompt for entering data,
   one line per offset into the topic.
   In the example below, the message at offset 0 is `yo`,
   at offset 1, `man`,
   and so on.
   Data entry is completed with an end-of-file character,
   `Control-d` in most shells.

    ```shell
    $ kubectl -n kafka exec -ti kafka-test-client -- /usr/bin/kafka-console-producer \
        --broker-list kafka.kafka:9092 --topic test_topic
    >yo
    >man
    >this
    >is so
    >cool!!
    >
    ```

4. You can verify that the data has been added to the topic with the `kafka-console-consumer` command.

    ```shell
    $ kubectl -n kafka exec -ti kafka-test-client -- /usr/bin/kafka-console-consumer \
        --bootstrap-server kafka:9092 --topic test_topic --from-beginning
    yo
    man
    this
    is so
    cool!!
    ```

    Terminate the command with `Control-c` to see the following message:

    ```
    ^CProcessed a total of 5 messages
    command terminated with exit code 130
    ```
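If you'd rather script the test data than type it at the console producer's prompt,
the same messages can be published with the `kafka-go` client the example already vendors.
This is a minimal sketch under the assumption that it runs somewhere that can resolve
the in-cluster broker address `kafka.kafka:9092`, such as a pod in the cluster:

```go
package main

import (
	"context"
	"log"

	kafka "github.com/segmentio/kafka-go"
)

func main() {
	// Connect to the example's in-cluster broker and topic.
	w := kafka.NewWriter(kafka.WriterConfig{
		Brokers: []string{"kafka.kafka:9092"},
		Topic:   "test_topic",
	})
	defer w.Close()

	// Publish the same five messages the console-producer step enters by hand.
	for _, m := range []string{"yo", "man", "this", "is so", "cool!!"} {
		if err := w.WriteMessages(context.Background(),
			kafka.Message{Value: []byte(m)}); err != nil {
			log.Fatalf("writing message %q: %v", m, err)
		}
	}
}
```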
### Pachyderm setup

1. If you would like to use the prebuilt spout image,
   you can simply create the spout with the `pachctl` command,
   using the pipeline definition available in the `pipelines` directory:

    ```shell
    $ pachctl create pipeline -f pipelines/kafka_spout.pipeline
    ```

    !!! note
        The Makefile included with this example has a target for customizing that pipeline.

2. To create your own version of the spout,
   you may modify the Makefile to use your own Docker Hub account, tag, and version
   by changing these variables accordingly:

    ```
    CONTAINER_VERSION := $(shell pachctl version --client-only)
    DOCKER_ACCOUNT := pachyderm
    CONTAINER_NAME := kafka_spout
    ```

    The Makefile has targets for `create-dag` and `update-dag`,
    or you may simply build the image with the `docker-image` target.

3. Once the spout is running,
   if the `VERBOSE_LOGGING` variable is set to anything other than `false`,
   you will see verbose logging in the `kafka_spout` pipeline logs.

    ```shell
    $ pachctl logs -p kafka_spout -f
    creating new kafka reader for kafka.kafka:9092 with topic 'test_topic' and group 'test_group'
    reading kafka queue.
    opening named pipe /pfs/out.
    opening tarstream
    processing header for topic test_topic @ offset 0
    processing data for topic test_topic @ offset 0
    closing tarstream.
    closing named pipe /pfs/out.
    cleaning up context.
    reading kafka queue.
    opening named pipe /pfs/out.
    opening tarstream
    processing header for topic test_topic @ offset 1
    processing data for topic test_topic @ offset 1
    closing tarstream.
    closing named pipe /pfs/out.
    cleaning up context.
    reading kafka queue.
    opening named pipe /pfs/out.
    opening tarstream
    processing header for topic test_topic @ offset 2
    processing data for topic test_topic @ offset 2
    closing tarstream.
    closing named pipe /pfs/out.
    cleaning up context.
    reading kafka queue.
    opening named pipe /pfs/out.
    opening tarstream
    processing header for topic test_topic @ offset 3
    processing data for topic test_topic @ offset 3
    closing tarstream.
    closing named pipe /pfs/out.
    cleaning up context.
    reading kafka queue.
    opening named pipe /pfs/out.
    opening tarstream
    processing header for topic test_topic @ offset 4
    processing data for topic test_topic @ offset 4
    closing tarstream.
    closing named pipe /pfs/out.
    ...
    ```

    You will also see the message files in the `kafka_spout` repo:

    ```shell
    $ pachctl list file kafka_spout@master
    NAME            TYPE   SIZE
    /test_topic-0   file   2B
    /test_topic-1   file   3B
    /test_topic-2   file   4B
    /test_topic-3   file   5B
    /test_topic-4   file   6B
    ```

## Pipelines

This section describes the pipeline that you use in this example.

### kafka_spout

The file `source/main.go` contains a simple Pachyderm spout that processes messages from Kafka,
saving each message to a file, named for the topic and message offset, in a Pachyderm repo.

It is configurable via environment variables and command-line flags.
Flags override environment variable settings.
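One common way to get that precedence is to seed each flag's default from the matching
environment variable, so an explicit flag always wins.
The sketch below illustrates the pattern with two of the settings;
it is not the actual `source/main.go`:

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// envOr returns the named environment variable's value if it is set,
// and def otherwise.
func envOr(key, def string) string {
	if v, ok := os.LookupEnv(key); ok {
		return v
	}
	return def
}

func main() {
	// Seed each flag's default from the matching environment variable.
	// A flag given on the command line then always wins over the environment.
	kafkaHost := flag.String("kafka_host",
		envOr("KAFKA_HOST", "kafka.kafka"), "the hostname of the Kafka broker")
	kafkaTopic := flag.String("kafka_topic",
		envOr("KAFKA_TOPIC", "test"), "the Kafka topic for messages")
	flag.Parse()

	fmt.Println(*kafkaHost, *kafkaTopic)
}
```

With this pattern, `KAFKA_HOST=broker.example go run source/main.go -kafka_host localhost`
would use `localhost`: the explicit flag overrides the environment-derived default.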
If your Go development environment is set up correctly,
you can see the settings by running the following command:

```shell
$ go run source/main.go --help
Usage of /var/folders/xl/xtvj4xtx0tv1llxcnbvlmwc40000gq/T/go-build997659573/b001/exe/main:
  -kafka_group_id string
        the Kafka group for maintaining offset state (default "test")
  -kafka_host string
        the hostname of the Kafka broker (default "kafka.kafka")
  -kafka_port string
        the port of the Kafka broker (default "9092")
  -kafka_timeout int
        the timeout in seconds for reading messages from the Kafka queue (default 5)
  -kafka_topic string
        the Kafka topic for messages (default "test")
  -named_pipe string
        the named pipe for the spout (default "/pfs/out")
  -v    verbose logging
exit status 2
```

The environment variables are as shown
in this excerpt from the `pipelines/kafka_spout.pipeline` file:

```json
"KAFKA_HOST": "kafka.kafka",
"KAFKA_PORT": "9092",
"KAFKA_TOPIC": "test_topic",
"KAFKA_GROUP_ID": "test_group",
"KAFKA_TIMEOUT": "5",
"NAMED_PIPE": "/pfs/out",
"VERBOSE_LOGGING": "false"
```
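For reference, the spouts 1.0 contract that the pipeline logs above hint at is:
open the named pipe `/pfs/out`, write files into it as a tar stream, and close it to commit.
The following is a condensed sketch of that write path under those assumptions,
not the actual `source/main.go`; the `writeMessage` helper name is invented for illustration:

```go
package main

import (
	"archive/tar"
	"fmt"
	"log"
	"os"
)

// writeMessage emits one Kafka message as a file named <topic>-<offset>
// through the spout's named pipe, as a single-entry tar stream.
func writeMessage(pipe, topic string, offset int64, value []byte) error {
	// Opening the named pipe blocks until Pachyderm reads the other end.
	out, err := os.OpenFile(pipe, os.O_WRONLY, 0644)
	if err != nil {
		return err
	}
	defer out.Close()

	tw := tar.NewWriter(out)
	defer tw.Close() // runs before out.Close, flushing the tar stream first

	hdr := &tar.Header{
		Name: fmt.Sprintf("%s-%d", topic, offset), // e.g. test_topic-0
		Size: int64(len(value)),
		Mode: 0600,
	}
	if err := tw.WriteHeader(hdr); err != nil {
		return err
	}
	_, err = tw.Write(value)
	return err
}

func main() {
	if err := writeMessage("/pfs/out", "test_topic", 0, []byte("yo\n")); err != nil {
		log.Fatal(err)
	}
}
```

Each open/close cycle of the pipe ends one tar stream,
which appears to be why the logs above show the pipe and tarstream
being reopened for every message offset.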