<!--
    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.
-->

# Python KafkaIO Example

This example reads from the Google Cloud Pub/Sub NYC Taxi stream described
[here](https://github.com/googlecodelabs/cloud-dataflow-nyc-taxi-tycoon), writes
to a given Kafka topic, and reads back from the same Kafka topic. This example
uses cross-language transforms available in
[kafka.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/kafka.py).
The transforms are implemented in Java and are available
[here](https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java).

## Prerequisites

Install [Java Development Kit (JDK) version 8](https://www.oracle.com/java/technologies/javase-downloads.html)
on your system and make sure that the `JAVA_HOME` environment variable points to
your JDK installation. Make sure that the `java` command is available in
the environment.

```sh
java -version
# Should print information about the installed Java version.
```

## Set up the Kafka cluster

This example requires users to set up a Kafka cluster that the Beam runner
executing the pipeline has access to.

See [here](https://kafka.apache.org/quickstart) for general instructions on
setting up a Kafka cluster. One option is to set up the Kafka cluster in
[GCE](https://cloud.google.com/compute). See
[here](https://github.com/GoogleCloudPlatform/java-docs-samples/tree/master/dataflow/flex-templates/kafka_to_bigquery)
for step-by-step instructions on setting up a single-node Kafka cluster in GCE.
When using Dataflow, consider starting the Kafka cluster in the region where
the Dataflow pipeline will be running. See
[here](https://cloud.google.com/dataflow/docs/concepts/regional-endpoints)
for more details on selecting a GCP region for Dataflow.

Let's assume that the IP address of one of the [bootstrap servers](https://kafka.apache.org/quickstart)
of the Kafka cluster is `123.45.67.89` and the port is `9092`.

```sh
export BOOTSTRAP_SERVER="123.45.67.89:9092"
```

## Running the example on latest released Beam version

Perform any Beam runner-specific setup.

ℹ️ Note that cross-language transforms require
portable implementations of the Spark/Flink/Direct runners. Dataflow requires
[runner V2](https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#dataflow-runner-v2).
See [here](https://beam.apache.org/documentation/runners/dataflow/) for
instructions on setting up Dataflow.

Set up a virtual environment for running Beam Python programs. See
[here](https://beam.apache.org/get-started/quickstart-py/) for prerequisites.
Dataflow requires the `gcp` extra when installing Beam.

```sh
python -m venv env
source env/bin/activate
pip install 'apache-beam[gcp]'
```

Run the Beam pipeline.
You can either use the default Kafka topic name or
specify a Kafka topic name. The following command assumes Dataflow. See
[here](https://beam.apache.org/get-started/quickstart-py/) for instructions on
running Beam Python programs on other runners.

ℹ️ Note that this example is not available in Beam versions before 2.24.0. With
an older version, you'll have to either get the example program from Beam or
follow the steps provided in the section *Running the Example from a Beam Git Clone*.

```sh
export PROJECT="$(gcloud config get-value project)"
export TEMP_LOCATION="gs://MY-BUCKET/temp"
export REGION="us-central1"
export JOB_NAME="kafka-taxi-$(date +%Y%m%d-%H%M%S)"
export NUM_WORKERS="5"

python -m apache_beam.examples.kafkataxi.kafka_taxi \
  --runner DataflowRunner \
  --temp_location $TEMP_LOCATION \
  --project $PROJECT \
  --region $REGION \
  --num_workers $NUM_WORKERS \
  --job_name $JOB_NAME \
  --bootstrap_servers $BOOTSTRAP_SERVER
```

## *(Optional)* Running the Example from a Beam Git Clone

Running this example from a Beam Git clone requires some additional steps.

Check out a clone of the Beam Git repo. See
[here](https://beam.apache.org/contribute/) for prerequisites.

Assume your GitHub username is `GITHUB_USERNAME`.

```sh
git clone git@github.com:${GITHUB_USERNAME}/beam
cd beam
```

Build the IO expansion service jar.

```sh
./gradlew :sdks:java:io:expansion-service:build
```

Push a Java SDK harness container to [Docker](https://www.docker.com/get-started)
Hub. See
[here](https://beam.apache.org/documentation/runtime/environments/) for
prerequisites and additional information.

```sh
export DOCKER_ROOT="Your Docker Repository Root"
./gradlew :sdks:java:container:java8:docker -Pdocker-repository-root=$DOCKER_ROOT -Pdocker-tag=latest
docker push $DOCKER_ROOT/beam_java8_sdk:latest
```

For portable Flink/Spark in local mode, instead of the above commands, just build
the Java SDK harness container locally using the default values for the
repository root and the Docker tag.

Activate your Python virtual environment. This example uses `venv`. See
[here](https://cwiki.apache.org/confluence/display/BEAM/Python+Tips) for
instructions on setting up other types of Python virtual environments.

```sh
cd ..  # Create the virtual environment in the top-level work directory.
python3 -m venv env
source ./env/bin/activate
pip install --upgrade pip setuptools wheel
```

Install Beam and its dependencies, and build a Beam distribution.

```sh
cd beam/sdks/python
pip install -r build-requirements.txt
pip install -e '.[gcp]'
python setup.py sdist
```

Run the Beam pipeline. You can either use the default Kafka topic name or specify
a Kafka topic name. The following command assumes Dataflow. See
[here](https://beam.apache.org/get-started/quickstart-py/) for instructions on
running Beam Python programs on other runners.

```sh
export PROJECT="$(gcloud config get-value project)"
export TEMP_LOCATION="gs://MY-BUCKET/temp"
export REGION="us-central1"
export JOB_NAME="kafka-taxi-$(date +%Y%m%d-%H%M%S)"
export NUM_WORKERS="5"
export PYTHON_DISTRIBUTION="dist/'Name of Python distribution'"

python -m apache_beam.examples.kafkataxi.kafka_taxi \
  --runner DataflowRunner \
  --temp_location $TEMP_LOCATION \
  --project $PROJECT \
  --region $REGION \
  --sdk_location $PYTHON_DISTRIBUTION \
  --num_workers $NUM_WORKERS \
  --job_name $JOB_NAME \
  --bootstrap_servers $BOOTSTRAP_SERVER \
  --sdk_harness_container_image_overrides ".*java.*,${DOCKER_ROOT}/beam_java8_sdk:latest"
```
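
Kafka clients expect bootstrap servers as a comma-separated list of `host:port`
entries, so a malformed `BOOTSTRAP_SERVER` value typically only surfaces after
the job has been submitted. The following is a minimal sketch of a pre-flight
check that could be run before either of the pipeline commands above. The
function name `check_bootstrap_servers` is hypothetical and not part of Beam or
Kafka, and the check is deliberately simple: it assumes plain hostnames or IPv4
addresses and would reject bracketed IPv6 literals.

```sh
# Hypothetical helper (not part of Beam or Kafka): succeeds only if every entry
# in a comma-separated bootstrap server list looks like host:port.
check_bootstrap_servers() (
  # The function body runs in a subshell, so the IFS and globbing changes
  # below do not leak into the calling shell.
  [ -n "$1" ] || { echo "error: bootstrap server list is empty" >&2; exit 1; }
  IFS=','
  set -f
  for entry in $1; do
    case "$entry" in
      *:*) ;;
      *) echo "error: '$entry' has no port" >&2; exit 1 ;;
    esac
    host="${entry%:*}"   # everything before the last colon
    port="${entry##*:}"  # everything after the last colon
    case "$host" in
      ''|*:*) echo "error: '$entry' is not of the form host:port" >&2; exit 1 ;;
    esac
    case "$port" in
      ''|*[!0-9]*) echo "error: '$entry' has a non-numeric port" >&2; exit 1 ;;
    esac
  done
  exit 0
)
```

One way to use it is to run `check_bootstrap_servers "$BOOTSTRAP_SERVER"` right
after exporting the variable and to launch the pipeline only if it succeeds. The
check validates the format only; it does not verify that the brokers are
reachable from the runner.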