>![pach_logo](../../img/pach_logo.svg) INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre- and post-2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples
# Commit messages from a Kafka queue

![pach_logo](./img/pach_logo.svg)
This example is based on the **spouts 1.0** implementation
used prior to Pachyderm 1.12.
The **spouts 2.0** implementation is significantly different.
We recommend upgrading to the latest version of Pachyderm
and using the **spouts 2.0** implementation.

This is a simple example of using spouts with [Kafka](https://kafka.apache.org) to process messages and write them to files.

This example spout connects to a Kafka queue and reads a topic.
The spout then writes each message in the topic to a file named for the topic and offset.
It uses Kafka group IDs to maintain a cursor into the offset in the topic,
making it resilient to restarts.

## Prerequisites

If you would like to run the Kafka cluster included with this example,
using the `make kafka` target,
you must deploy an Amazon EKS cluster with at least three (3) m5.xlarge machines.

To deploy an EKS cluster,
follow the instructions in the [Amazon EKS documentation](https://docs.aws.amazon.com/eks/latest/userguide/create-cluster.html).

The Kafka cluster can also be deployed on other cloud providers.
See [setup](#setup) below for more information.

The Pachyderm code in this example requires a Pachyderm cluster version 1.9.8 or later.

## Introduction

Apache® Kafka is a distributed streaming platform
that is used in a variety of applications to provide communications between microservices.
Many Pachyderm users use Kafka to ingest data from legacy data sources using Pachyderm spouts.

Pachyderm spouts are a way to ingest data into Pachyderm
by having your code get the data from inside a Pachyderm pipeline.

This is the simplest possible implementation of a Pachyderm spout using Kafka to ingest data.
The data ingested is simply the message posted to Kafka.
The filename is derived from the Kafka topic name and the message's offset in the topic.
You should be able to easily adapt it to your needs.

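To make the mechanics concrete, here is a minimal, hypothetical sketch of the spouts 1.0 pattern this example follows,
using the `github.com/segmentio/kafka-go` client that the example vendors:
read a message with a `GroupID` so that the broker tracks the offset cursor,
then write the message through the `/pfs/out` named pipe as a tar entry named for the topic and offset.
See `source/main.go` for the real implementation; the broker address, topic, and error handling here are simplified assumptions.

```go
package main

import (
	"archive/tar"
	"context"
	"fmt"
	"log"
	"os"

	kafka "github.com/segmentio/kafka-go"
)

func main() {
	// The GroupID makes the broker remember this consumer's offset,
	// so the spout resumes where it left off after a restart.
	reader := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"kafka.kafka:9092"},
		Topic:   "test_topic",
		GroupID: "test_group",
	})
	defer reader.Close()

	for {
		msg, err := reader.ReadMessage(context.Background())
		if err != nil {
			log.Fatal(err)
		}

		// In spouts 1.0, each commit is a tarstream written to the
		// /pfs/out named pipe: open the pipe, write one tar entry
		// named <topic>-<offset>, and close the pipe to commit.
		out, err := os.OpenFile("/pfs/out", os.O_WRONLY, 0644)
		if err != nil {
			log.Fatal(err)
		}
		tw := tar.NewWriter(out)
		hdr := &tar.Header{
			Name: fmt.Sprintf("%s-%d", msg.Topic, msg.Offset),
			Size: int64(len(msg.Value)),
			Mode: 0600,
		}
		if err := tw.WriteHeader(hdr); err != nil {
			log.Fatal(err)
		}
		if _, err := tw.Write(msg.Value); err != nil {
			log.Fatal(err)
		}
		tw.Close()
		out.Close()
	}
}
```
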
## Setup

This example includes a pre-configured Kafka cluster that you can deploy on Amazon EKS,
adapted from [Craig Johnston's blog post](https://imti.co/kafka-kubernetes/).
If you'd like to adapt it for your own cluster on GCP, Azure, or an on-premises Kubernetes deployment,
the `001-storage-class.yaml` file is probably the only thing you'd need to change.
You can replace the parameters and provisioner with the appropriate ones for your environment.

If you already have a Kafka cluster set up, you may skip step 1 of the Kafka setup.

To correctly build the Docker container from source,
you must have the Pachyderm source repo structure around this example.
It depends on the directory `../../../vendor/github.com/segmentio/kafka-go`,
relative to this one,
containing the correct code.
You can, of course, set up your Go development environment to achieve the same result.

### Kafka setup

1. In the directory `additional_manifests`,
you'll find a numbered sequence of Kubernetes manifests for creating a fully-functioning Kafka deployment.
You can use the makefile target `make kafka`,
which deploys a Kafka cluster in a `kafka` namespace created in the first step.
This Kafka deployment is intended for testing only
and is not recommended for production use.
If you'd like to see the order in which the manifests will be loaded into Kubernetes,
run the command

```shell
make -n kafka
```

!!! note
    If you are redeploying a Kafka deployment, run `make clean` before running `make kafka`.

You can confirm that the Kafka cluster is running properly by verifying that the pods are running.

!!! note
    Before deploying Kafka,
    verify that you are using the correct Kubernetes context by running `kubectl config get-contexts`.
    For example, when you are deploying on EKS, your active context should end with `eksctl.io`.

```shell
$ kubectl get pods -n kafka
NAME                READY   STATUS    RESTARTS   AGE
kafka-0             1/1     Running   0          3d19h
kafka-1             1/1     Running   0          3d19h
kafka-2             1/1     Running   0          3d19h
kafka-test-client   1/1     Running   0          3d19h
kafka-zookeeper-0   1/1     Running   0          3d19h
kafka-zookeeper-1   1/1     Running   0          3d19h
kafka-zookeeper-2   1/1     Running   0          3d19h
```

2. Once the Kafka cluster is running, create the topic you'd like to consume messages from.
The example is configured to look for a topic called `test_topic`.
You may modify the Makefile to use another topic name, of course.
To use the example's Kafka environment,
you may use the following command to create the topic:

```shell
$ kubectl -n kafka exec kafka-test-client -- /usr/bin/kafka-topics --zookeeper \
      kafka-zookeeper.kafka:2181 --topic test_topic --create \
      --partitions 1 --replication-factor 1
Created topic "test_topic".
```

!!! note
    The command uses Kubernetes DNS names to specify the Kafka zookeeper service,
    `kafka-zookeeper.kafka`.

You can confirm that your topic was created with the following command:

```shell
$ kubectl -n kafka exec kafka-test-client -- /usr/bin/kafka-topics --zookeeper \
       kafka-zookeeper.kafka:2181 --list
```

It should return the topic you created and the topic `__confluent.support.metrics`.

3. You can start populating the topic with data using the `kafka-console-producer` command.
It provides you with a `>` prompt for entering data,
delimited by lines for each offset into the topic.
In the example below, the message at offset 0 is `yo`,
at offset 1, `man`,
and so on.
Data entry is completed with an end-of-file character,
`Control-d` in most shells.

```shell
$ kubectl -n kafka exec -ti kafka-test-client --  /usr/bin/kafka-console-producer \
   --broker-list kafka.kafka:9092 --topic test_topic
>yo
>man
>this
>is so
>cool!!
>
```

4. You can see if the data has been added to the topic with the `kafka-console-consumer` command.

```shell
$ kubectl -n kafka exec -ti kafka-test-client -- /usr/bin/kafka-console-consumer \
   --bootstrap-server kafka:9092 --topic test_topic --from-beginning
yo
man
this
is so
cool!!
```

Terminate the command to see the following message:

```
^CProcessed a total of 5 messages
command terminated with exit code 130
```

### Pachyderm setup

1. If you would simply like to use the prebuilt spout image,
you can create the spout with the pachctl command,
using the pipeline definition available in the `pipelines` directory:

```shell
$ pachctl create pipeline -f pipelines/kafka_spout.pipeline
```

!!! note
    The Makefile included with this example has a target for customizing that pipeline.

2. To create your own version of the spout,
you may modify the Makefile to use your own Docker Hub account, container name, and version
by changing these variables accordingly:

```
CONTAINER_VERSION := $(shell pachctl version --client-only)
DOCKER_ACCOUNT := pachyderm
CONTAINER_NAME := kafka_spout
```

The Makefile has targets for `create-dag` and `update-dag`,
or you may simply build the image with the `docker-image` target.

3. Once the spout is running,
if the `VERBOSE_LOGGING` variable is set to anything other than `false`,
you will see verbose logging in the `kafka_spout` pipeline logs.

```shell
$ pachctl logs -p kafka_spout -f
creating new kafka reader for kafka.kafka:9092 with topic 'test_topic' and group 'test_group'
reading kafka queue.
opening named pipe /pfs/out.
opening tarstream
processing header for topic test_topic @ offset 0
processing data for topic  test_topic @ offset 0
closing tarstream.
closing named pipe /pfs/out.
cleaning up context.
reading kafka queue.
opening named pipe /pfs/out.
opening tarstream
processing header for topic test_topic @ offset 1
processing data for topic  test_topic @ offset 1
closing tarstream.
closing named pipe /pfs/out.
cleaning up context.
reading kafka queue.
opening named pipe /pfs/out.
opening tarstream
processing header for topic test_topic @ offset 2
processing data for topic  test_topic @ offset 2
closing tarstream.
closing named pipe /pfs/out.
cleaning up context.
reading kafka queue.
opening named pipe /pfs/out.
opening tarstream
processing header for topic test_topic @ offset 3
processing data for topic  test_topic @ offset 3
closing tarstream.
closing named pipe /pfs/out.
cleaning up context.
reading kafka queue.
opening named pipe /pfs/out.
opening tarstream
processing header for topic test_topic @ offset 4
processing data for topic  test_topic @ offset 4
closing tarstream.
closing named pipe /pfs/out.
...
```

You will also see the message files in the `kafka_spout` repo:

```shell
$ pachctl list file kafka_spout@master
NAME          TYPE SIZE
/test_topic-0 file 2B
/test_topic-1 file 3B
/test_topic-2 file 4B
/test_topic-3 file 5B
/test_topic-4 file 6B
```

## Pipelines

This section describes the pipeline used in this example.

### kafka_spout

The file `source/main.go` contains a simple Pachyderm spout that processes messages from Kafka,
saving each message to a file named for the topic and message offset in a Pachyderm repo.

It is configurable via environment variables and command-line flags.
Flags override environment variable settings.
If your Go development environment is set up correctly,
you can see the settings by running the command:

```shell
$ go run source/main.go --help
Usage of /var/folders/xl/xtvj4xtx0tv1llxcnbvlmwc40000gq/T/go-build997659573/b001/exe/main:
  -kafka_group_id string
    	the Kafka group for maintaining offset state (default "test")
  -kafka_host string
    	the hostname of the Kafka broker (default "kafka.kafka")
  -kafka_port string
    	the port of the Kafka broker (default "9092")
  -kafka_timeout int
    	the timeout in seconds for reading messages from the Kafka queue (default 5)
  -kafka_topic string
    	the Kafka topic for messages (default "test")
  -named_pipe string
    	the named pipe for the spout (default "/pfs/out")
  -v	verbose logging
exit status 2
```

The environment variables are as shown
in this excerpt from the `pipelines/kafka_spout.pipeline` file:

```shell
            "KAFKA_HOST": "kafka.kafka",
            "KAFKA_PORT": "9092",
            "KAFKA_TOPIC": "test_topic",
            "KAFKA_GROUP_ID": "test_group",
            "KAFKA_TIMEOUT": "5",
            "NAMED_PIPE": "/pfs/out",
            "VERBOSE_LOGGING": "false"
```
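
Since flags override the environment variables, one common Go pattern is to use the environment value as the flag's default,
so that an explicit command-line flag always wins.
Below is a minimal sketch of that idea with two of the settings;
it is an illustration of the pattern, not necessarily how `source/main.go` implements it.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// envOr returns the value of the environment variable if it is set,
// and the fallback otherwise.
func envOr(key, fallback string) string {
	if v, ok := os.LookupEnv(key); ok {
		return v
	}
	return fallback
}

func main() {
	// The environment variable seeds the flag's default;
	// passing the flag on the command line overrides it.
	kafkaHost := flag.String("kafka_host", envOr("KAFKA_HOST", "kafka.kafka"),
		"the hostname of the Kafka broker")
	kafkaTopic := flag.String("kafka_topic", envOr("KAFKA_TOPIC", "test"),
		"the Kafka topic for messages")
	flag.Parse()

	fmt.Printf("connecting to %s, topic %s\n", *kafkaHost, *kafkaTopic)
}
```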