<!--
    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.
-->

# Python KafkaIO Example

This example reads from the Google Cloud Pub/Sub NYC Taxi stream described
[here](https://github.com/googlecodelabs/cloud-dataflow-nyc-taxi-tycoon), writes
to a given Kafka topic, and reads back from the same Kafka topic. This example
uses cross-language transforms available in
[kafka.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/kafka.py).
Transforms are implemented in Java and are available
[here](https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java).

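At a high level, the pipeline reads taxi ride messages from Pub/Sub, writes them
to a Kafka topic, and reads the same topic back. The snippet below is a simplified
sketch of that flow (topic names, windowing, and logging in the real example live
in `kafka_taxi.py`); it only illustrates how `ReadFromPubSub`, `WriteToKafka`, and
`ReadFromKafka` fit together.

```python
# Simplified sketch; see kafka_taxi.py for the actual example pipeline.
import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.io.kafka import ReadFromKafka, WriteToKafka
from apache_beam.options.pipeline_options import PipelineOptions


def run(bootstrap_servers, topic, pipeline_args):
  options = PipelineOptions(pipeline_args, streaming=True, save_main_session=True)
  with beam.Pipeline(options=options) as pipeline:
    # Write the public NYC taxi ride stream to the given Kafka topic.
    _ = (
        pipeline
        | 'ReadTaxiRides' >> ReadFromPubSub(
            topic='projects/pubsub-public-data/topics/taxirides-realtime')
        | 'ToKafkaRecord' >> beam.Map(lambda msg: (b'', msg))  # Kafka records are key/value pairs.
        | 'WriteToKafka' >> WriteToKafka(
            producer_config={'bootstrap.servers': bootstrap_servers},
            topic=topic))

    # Read the same topic back and print what was written.
    _ = (
        pipeline
        | 'ReadFromKafka' >> ReadFromKafka(
            consumer_config={'bootstrap.servers': bootstrap_servers},
            topics=[topic])
        | 'Print' >> beam.Map(print))
```
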
## Prerequisites

Install [Java Development Kit (JDK) version 8](https://www.oracle.com/java/technologies/javase-downloads.html)
on your system and make sure that the `JAVA_HOME` environment variable points to
your JDK installation. Also make sure that the `java` command is available in
the environment.

```sh
java -version
<Should print information regarding the installed Java version>
```

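If `java` is not already on your path or `JAVA_HOME` is not set, export them along
these lines (the path below is only an example; point it at your actual JDK 8
installation):

```sh
# Example path only; adjust to wherever your JDK 8 is installed.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH="$JAVA_HOME/bin:$PATH"
```
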
## Set up the Kafka cluster

This example requires users to set up a Kafka cluster that the Beam runner
executing the pipeline has access to.

See [here](https://kafka.apache.org/quickstart) for general instructions on
setting up a Kafka cluster. One option is to set up the Kafka cluster in
[GCE](https://cloud.google.com/compute). See
[here](https://github.com/GoogleCloudPlatform/java-docs-samples/tree/master/dataflow/flex-templates/kafka_to_bigquery)
for step-by-step instructions on setting up a single-node Kafka cluster in GCE.
When using Dataflow, consider starting the Kafka cluster in the region where the
Dataflow pipeline will be running. See
[here](https://cloud.google.com/dataflow/docs/concepts/regional-endpoints)
for more details on selecting a GCP region for Dataflow.

Let's assume that the IP address of one of the [bootstrap servers](https://kafka.apache.org/quickstart)
of the Kafka cluster is `123.45.67.89` and the port is `9092`.

```sh
export BOOTSTRAP_SERVER="123.45.67.89:9092"
```

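If the cluster is not configured to auto-create topics, create the topic before
running the pipeline. With recent Kafka releases this looks roughly as follows
(the topic name is only a placeholder; older Kafka versions use `--zookeeper`
instead of `--bootstrap-server`):

```sh
# Run from the Kafka installation directory; placeholder topic name.
bin/kafka-topics.sh --create \
  --bootstrap-server $BOOTSTRAP_SERVER \
  --replication-factor 1 \
  --partitions 3 \
  --topic beam-kafkataxi-example
```
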
## Running the example on the latest released Beam version

Perform Beam runner-specific setup.

ℹ️ Note that cross-language transforms require
portable implementations of the Spark/Flink/Direct runners. Dataflow requires
[Runner v2](https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#dataflow-runner-v2).
See [here](https://beam.apache.org/documentation/runners/dataflow/) for
instructions on setting up Dataflow.

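If the Beam and Dataflow versions you are using do not enable Runner v2 by
default, it can typically be requested explicitly by appending an experiment flag
to the run commands shown later in this document:

```sh
# Optional; only needed when Runner v2 is not already the default for your version.
  --experiments=use_runner_v2
```
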
Set up a virtual environment for running Beam Python programs. See
[here](https://beam.apache.org/get-started/quickstart-py/) for prerequisites.
Dataflow requires the `gcp` extra when installing Beam.

```sh
python -m venv env
source env/bin/activate
pip install 'apache-beam[gcp]'
```

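As a quick sanity check, confirm that Beam is importable from the active
environment:

```sh
python -c 'import apache_beam; print(apache_beam.__version__)'
```
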
Run the Beam pipeline. You can either use the default Kafka topic name or
specify a Kafka topic name. The following command assumes Dataflow. See
[here](https://beam.apache.org/get-started/quickstart-py/) for instructions on
running Beam Python programs on other runners.

ℹ️ Note that this example is not available in Beam versions before 2.24.0, hence
you'll have to either get the example program from Beam or follow the steps
provided in the section *Running the Example from a Beam Git Clone*.

```sh
export PROJECT="$(gcloud config get-value project)"
export TEMP_LOCATION="gs://MY-BUCKET/temp"
export REGION="us-central1"
export JOB_NAME="kafka-taxi-`date +%Y%m%d-%H%M%S`"
export NUM_WORKERS="5"

python -m apache_beam.examples.kafkataxi.kafka_taxi \
  --runner DataflowRunner \
  --temp_location $TEMP_LOCATION \
  --project $PROJECT \
  --region $REGION \
  --num_workers $NUM_WORKERS \
  --job_name $JOB_NAME \
  --bootstrap_servers $BOOTSTRAP_SERVER
```

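Once the job is running, one way to confirm that taxi ride records are arriving in
Kafka is the console consumer shipped with Kafka (the topic name below is a
placeholder; use the topic the pipeline writes to):

```sh
# Run on a machine that can reach the Kafka cluster; placeholder topic name.
bin/kafka-console-consumer.sh \
  --bootstrap-server $BOOTSTRAP_SERVER \
  --topic beam-kafkataxi-example \
  --from-beginning
```
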
## *(Optional)* Running the Example from a Beam Git Clone

Running this example from a Beam Git clone requires some additional steps.

Check out a clone of the Beam Git repo. See
[here](https://beam.apache.org/contribute/) for prerequisites.

Assume your GitHub username is `GITHUB_USERNAME`.

```sh
git clone git@github.com:${GITHUB_USERNAME}/beam
cd beam
```

Build the IO expansion service jar.

```sh
./gradlew :sdks:java:io:expansion-service:build
```

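The Gradle build typically places the jar under the module's `build/libs`
directory; you can list it to confirm the build succeeded (the exact file name
depends on your Beam version):

```sh
ls sdks/java/io/expansion-service/build/libs/
```
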
Push a Java SDK harness container to [Docker](https://www.docker.com/get-started)
Hub. See
[here](https://beam.apache.org/documentation/runtime/environments/) for
prerequisites and additional information.

```sh
export DOCKER_ROOT="Your Docker Repository Root"
./gradlew :sdks:java:container:java8:docker -Pdocker-repository-root=$DOCKER_ROOT -Pdocker-tag=latest
docker push $DOCKER_ROOT/beam_java8_sdk:latest
```

For portable Flink/Spark in local mode, instead of the above commands just build
the Java SDK harness container locally using the default values for the
repository root and the Docker tag.

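In that local-mode case the command reduces to roughly the following; no push is
needed since the image only has to be visible to the local Docker daemon:

```sh
# Builds the Java 8 SDK harness image with the default repository root and tag.
./gradlew :sdks:java:container:java8:docker
```
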
Create and activate a Python virtual environment. This example uses `venv`. See
[here](https://cwiki.apache.org/confluence/display/BEAM/Python+Tips) for
instructions on setting up other types of Python virtual environments.

```sh
cd ..  # Create the virtual environment in the top-level work directory.
python3 -m venv env
source ./env/bin/activate
pip install --upgrade pip setuptools wheel
```

Install Beam and its dependencies and build a Beam source distribution.

```sh
cd beam/sdks/python
pip install -r build-requirements.txt
pip install -e '.[gcp]'
python setup.py sdist
```

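The source distribution is written to the `dist/` directory; list it to find the
file name to use for `PYTHON_DISTRIBUTION` in the next step:

```sh
ls dist/
```
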
Run the Beam pipeline. You can either use the default Kafka topic name or specify
a Kafka topic name. The following command assumes Dataflow. See
[here](https://beam.apache.org/get-started/quickstart-py/) for instructions on
running Beam Python programs on other runners.

```sh
export PROJECT="$(gcloud config get-value project)"
export TEMP_LOCATION="gs://MY-BUCKET/temp"
export REGION="us-central1"
export JOB_NAME="kafka-taxi-`date +%Y%m%d-%H%M%S`"
export NUM_WORKERS="5"
export PYTHON_DISTRIBUTION="dist/'Name of Python distribution'"

python -m apache_beam.examples.kafkataxi.kafka_taxi \
  --runner DataflowRunner \
  --temp_location $TEMP_LOCATION \
  --project $PROJECT \
  --region $REGION \
  --sdk_location $PYTHON_DISTRIBUTION \
  --num_workers $NUM_WORKERS \
  --job_name $JOB_NAME \
  --bootstrap_servers $BOOTSTRAP_SERVER \
  --sdk_harness_container_image_overrides ".*java.*,${DOCKER_ROOT}/beam_java8_sdk:latest"
```