<!--
    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.
-->

# Example RunInference API pipelines

This module contains example pipelines that use the Beam RunInference
API. <!---TODO: Add link to full documentation on Beam website when it's published.-->

Some examples are also used in [our benchmarks](http://s.apache.org/beam-community-metrics/d/ZpS8Uf44z/python-ml-runinference-benchmarks?orgId=1).

## Prerequisites

You must have the latest (possibly unreleased) version of `apache-beam` installed from the Beam repo in order to run these pipelines,
because some examples rely on features that are still in active development. To install Beam, run the following from the `sdks/python` directory:
```
pip install -r build-requirements.txt
pip install -e .[gcp]
```

### Tensorflow dependencies

The following installation requirement is for the Tensorflow model handler examples.

The RunInference API supports the Tensorflow framework. To use Tensorflow locally, first install `tensorflow`.
```
pip install tensorflow==2.12.0
```

### PyTorch dependencies

The following installation requirements are for the files used in these examples.

The RunInference API supports the PyTorch framework. To use PyTorch locally, first install `torch`.
```
pip install torch==1.10.0
```

If you are using pretrained models from PyTorch's `torchvision.models` [subpackage](https://pytorch.org/vision/0.12/models.html#models-and-pre-trained-weights), you also need to install `torchvision`.
```
pip install torchvision
```

If you are using pretrained models from Hugging Face's `transformers` [package](https://huggingface.co/docs/transformers/index), you also need to install `transformers`.
```
pip install transformers
```

For installation of the `torch` dependency on a distributed runner such as Dataflow, refer to the
[PyPI dependency instructions](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#pypi-dependencies).


### TensorRT dependencies

The RunInference API supports the TensorRT SDK for high-performance deep learning inference with NVIDIA GPUs.
To use TensorRT locally, we suggest an environment with TensorRT >= 8.0.1. Install TensorRT as described in the
[TensorRT Install Guide](https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html). Make sure the Python bindings for TensorRT are also installed correctly; they are available by installing the `python3-libnvinfer` and `python3-libnvinfer-dev` packages from your TensorRT download.
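If the bindings are installed correctly, you should be able to import and exercise them from Python. A minimal, hedged sanity check (assuming TensorRT >= 8.0.1 and a working CUDA setup):
```
import tensorrt as trt

print(trt.__version__)  # expect 8.0.1 or newer

# Creating a runtime confirms that the Python bindings can reach the
# installed TensorRT libraries.
runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
```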
If you would like to use Docker, you can pull an NGC image such as:
```
docker pull nvcr.io/nvidia/tensorrt:22.04-py3
```
and use it as a base image to [build a custom Apache Beam container](https://beam.apache.org/documentation/runtime/environments/#modify-existing-base-image).


### ONNX dependencies

The RunInference API supports the ONNX runtime for accelerated inference.
To use ONNX, we suggest installing the following dependencies:
```
pip install onnxruntime==1.13.1
```
The onnxruntime dependency is sufficient if you already have a model in ONNX format. This library also supports conversion from PyTorch models to ONNX.
If you need to convert TensorFlow models into ONNX, please install:
```
pip install tf2onnx==1.13.0
```
If you need to convert sklearn models into ONNX, please install:
```
pip install skl2onnx
```

### Additional resources
For more information, see the
[Machine Learning](/documentation/sdks/python-machine-learning) and the
[RunInference transform](/documentation/transforms/python/elementwise/runinference) documentation.

---
## Image classification

[`pytorch_image_classification.py`](./pytorch_image_classification.py) contains an implementation for a RunInference pipeline that performs image classification using the `mobilenet_v2` architecture.

The pipeline reads the images, performs basic preprocessing, passes the images to the PyTorch implementation of RunInference, and then writes the predictions to a text file.

### Dataset and model for image classification

To use this transform, you need a dataset and model for image classification.

1. Create a directory named `IMAGES_DIR`. Create or download images and put them in this directory. The directory is not required if the image paths in the input file `IMAGE_FILE_NAMES.txt` that you create in step 2 are absolute.
One popular dataset is from [ImageNet](https://www.image-net.org/). Follow their instructions to download the images.
2. Create a file named `IMAGE_FILE_NAMES.txt` that contains the absolute paths of each of the images in `IMAGES_DIR` that you want to use to run image classification. The path to the file can be different types of URIs such as your local file system, an AWS S3 bucket, or a GCP Cloud Storage bucket. For example:
```
/absolute/path/to/image1.jpg
/absolute/path/to/image2.jpg
```
3. Download the [mobilenet_v2](https://pytorch.org/vision/stable/_modules/torchvision/models/mobilenetv2.html) model from PyTorch's repository of pretrained models. This model requires the torchvision library. To download this model, run the following commands from a Python shell:
```
import torch
from torchvision.models import mobilenet_v2
model = mobilenet_v2(pretrained=True)
torch.save(model.state_dict(), 'mobilenet_v2.pth') # You can replace mobilenet_v2.pth with your preferred file name for your model state dictionary.
```

### Running `pytorch_image_classification.py`

To run the image classification pipeline locally, use the following command:
```sh
python -m apache_beam.examples.inference.pytorch_image_classification \
  --input IMAGE_FILE_NAMES \
  --images_dir IMAGES_DIR \
  --output OUTPUT \
  --model_state_dict_path MODEL_STATE_DICT
```
`images_dir` is only needed if your `IMAGE_FILE_NAMES.txt` file contains relative paths (they will be resolved relative to `IMAGES_DIR`).
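Internally, the example wraps the model in a PyTorch model handler and hands it to RunInference. A minimal, hedged sketch of a comparable handler (simplified; the exact construction, including preprocessing and keying, lives in `pytorch_image_classification.py`):
```
from apache_beam.ml.inference.pytorch_inference import PytorchModelHandlerTensor
from torchvision import models

# Sketch only: loads the state dict saved in the previous step.
model_handler = PytorchModelHandlerTensor(
    state_dict_path='mobilenet_v2.pth',
    model_class=models.mobilenet_v2,
    model_params={'num_classes': 1000})

# In the pipeline, preprocessed image tensors then flow through
# RunInference(model_handler).
```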
For example, if you've followed the naming conventions recommended above:
```sh
python -m apache_beam.examples.inference.pytorch_image_classification \
  --input IMAGE_FILE_NAMES.txt \
  --output predictions.csv \
  --model_state_dict_path mobilenet_v2.pth
```
This writes the output to `predictions.csv` with contents like:
```
/absolute/path/to/image1.jpg;1
/absolute/path/to/image2.jpg;333
...
```

Each image path is paired with a value representing the ImageNet class that returned the highest confidence score out of ImageNet's 1000 classes.

---
## Image segmentation

[`pytorch_image_segmentation.py`](./pytorch_image_segmentation.py) contains an implementation for a RunInference pipeline that performs image segmentation using the `maskrcnn_resnet50_fpn` architecture.

The pipeline reads images, performs basic preprocessing, passes the images to the PyTorch implementation of RunInference, and then writes predictions to a text file.

### Dataset and model for image segmentation

To use this transform, you need a dataset and model for image segmentation.

1. Create a directory named `IMAGES_DIR`. Create or download images and put them in this directory. The directory is not required if the image paths in the input file `IMAGE_FILE_NAMES.txt` that you create in step 2 are absolute.
A popular dataset is from [Coco](https://cocodataset.org/#home). Follow their instructions to download the images.
2. Create a file named `IMAGE_FILE_NAMES.txt` that contains the absolute paths of each of the images in `IMAGES_DIR` that you want to use to run image segmentation. The path to the file can be different types of URIs such as your local file system, an AWS S3 bucket, or a GCP Cloud Storage bucket. For example:
```
/absolute/path/to/image1.jpg
/absolute/path/to/image2.jpg
```
3. Download the [maskrcnn_resnet50_fpn](https://pytorch.org/vision/0.12/models.html#id70) model from PyTorch's repository of pretrained models. This model requires the torchvision library. To download this model, run the following commands from a Python shell:
```
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn
model = maskrcnn_resnet50_fpn(pretrained=True)
torch.save(model.state_dict(), 'maskrcnn_resnet50_fpn.pth') # You can replace maskrcnn_resnet50_fpn.pth with your preferred file name for your model state dictionary.
```
4. Note the path to the `OUTPUT` file. This file is used by the pipeline to write the predictions.

### Running `pytorch_image_segmentation.py`

To run the image segmentation pipeline locally, use the following command:
```sh
python -m apache_beam.examples.inference.pytorch_image_segmentation \
  --input IMAGE_FILE_NAMES \
  --images_dir IMAGES_DIR \
  --output OUTPUT \
  --model_state_dict_path MODEL_STATE_DICT
```
`images_dir` is only needed if your `IMAGE_FILE_NAMES.txt` file contains relative paths (they will be resolved relative to `IMAGES_DIR`).
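Because each prediction in the output is keyed by its file name, the PyTorch handler is typically wrapped in a `KeyedModelHandler`. A hedged sketch under that assumption (the real construction is in `pytorch_image_segmentation.py`):
```
from apache_beam.ml.inference.base import KeyedModelHandler
from apache_beam.ml.inference.pytorch_inference import PytorchModelHandlerTensor
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Sketch only: 91 classes is the COCO label count used by the pretrained model.
model_handler = KeyedModelHandler(
    PytorchModelHandlerTensor(
        state_dict_path='maskrcnn_resnet50_fpn.pth',
        model_class=maskrcnn_resnet50_fpn,
        model_params={'num_classes': 91}))
```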
For example, if you've followed the naming conventions recommended above:
```sh
python -m apache_beam.examples.inference.pytorch_image_segmentation \
  --input IMAGE_FILE_NAMES.txt \
  --output predictions.csv \
  --model_state_dict_path maskrcnn_resnet50_fpn.pth
```
This writes the output to `predictions.csv` with contents like:
```
/absolute/path/to/image1.jpg;['parking meter', 'bottle', 'person', 'traffic light', 'traffic light', 'traffic light']
/absolute/path/to/image2.jpg;['bottle', 'person', 'person']
...
```
Each line has data separated by a semicolon ";". The first item is the file name. The second item is a list of predicted instances.

---
## Object Detection

[`tensorrt_object_detection.py`](./tensorrt_object_detection.py) contains an implementation for a RunInference pipeline that performs object detection using [Tensorflow Object Detection's](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md) SSD MobileNet v2 320x320 architecture.

The pipeline reads the images, performs basic preprocessing, passes them to the TensorRT implementation of RunInference, and then writes the predictions to a text file.

### Dataset and model for object detection

You will need to create or download images, and place them into your `IMAGES_DIR` directory. A popular dataset for this task is the [COCO dataset](https://cocodataset.org/#home); the COCO validation dataset can be obtained [here](http://images.cocodataset.org/zips/val2017.zip).
- **Required**: A path to a file called `IMAGE_FILE_NAMES` that contains the absolute paths of each of the images in `IMAGES_DIR` on which you want to run object detection. Paths can be different types of URIs such as your local file system, an AWS S3 bucket, or a GCP Cloud Storage bucket. For example:
```
/absolute/path/to/000000000139.jpg
/absolute/path/to/000000289594.jpg
```
- **Required**: A path to a file called `TRT_ENGINE` that contains a pre-built TensorRT engine created from the SSD MobileNet v2 320x320 model. [Follow these instructions](https://github.com/NVIDIA/TensorRT/tree/main/samples/python/tensorflow_object_detection_api) to download this SSD model and convert it into a TensorRT engine. At the [Create ONNX Graph](https://github.com/NVIDIA/TensorRT/tree/main/samples/python/tensorflow_object_detection_api#create-onnx-graph) step, keep the batch size at 1. Once you have finished the [Build TensorRT Engine](https://github.com/NVIDIA/TensorRT/tree/main/samples/python/tensorflow_object_detection_api#build-tensorrt-engine) step, you can use the resulting engine as the `TRT_ENGINE` input. In addition, make sure that the environment you use to create the TensorRT engine is the same environment you use to run TensorRT inference. This applies not only to the TensorRT version, but also to the specific GPU used. Read more about it [here](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#compatibility-serialized-engines). A hedged sketch of the model handler that consumes this engine follows this list.

- **Required**: A path to a file called `OUTPUT`, to which the pipeline will write the predictions.
- **Optional**: `IMAGES_DIR`, which is the path to the directory where images are stored. Not required if image names in the input file `IMAGE_FILE_NAMES` have absolute paths.
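The pipeline loads the engine through Beam's TensorRT model handler. A minimal, hedged sketch (the exact parameters are in `tensorrt_object_detection.py`; the engine file name below assumes the conversion instructions above):
```
from apache_beam.ml.inference.tensorrt_inference import TensorRTEngineHandlerNumPy

# Batch size 1 matches the "Create ONNX Graph" step above.
engine_handler = TensorRTEngineHandlerNumPy(
    min_batch_size=1,
    max_batch_size=1,
    engine_path='ssd_mobilenet_v2_320x320_coco17_tpu-8.trt')
```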
### Running `tensorrt_object_detection.py`

To run the object detection pipeline locally, use the following command:
```sh
python -m apache_beam.examples.inference.tensorrt_object_detection \
  --input IMAGE_FILE_NAMES \
  --images_dir IMAGES_DIR \
  --output OUTPUT \
  --engine_path TRT_ENGINE
```
For example:
```sh
python -m apache_beam.examples.inference.tensorrt_object_detection \
  --input image_file_names.txt \
  --output predictions.csv \
  --engine_path ssd_mobilenet_v2_320x320_coco17_tpu-8.trt
```
This writes the output to `predictions.csv` with contents like:
```
/absolute/path/to/000000000139.jpg;[{'ymin': '217.31875205039978' 'xmin': '295.93122482299805' 'ymax': '315.90323209762573' 'xmax': '357.8959655761719' 'score': '0.72342616' 'class': 'chair'} {'ymin': '166.81788557767868'.....

/absolute/path/to/000000289594.jpg;[{'ymin': '227.25109100341797' 'xmin': '331.7402381300926' 'ymax': '476.88533782958984' 'xmax': '402.2928895354271' 'score': '0.77217317' 'class': 'person'} {'ymin': '231.8712615966797' 'xmin': '292.8590789437294'.....
...
```
Each line has data separated by a semicolon ";". The first item is the file name. The second item is a list of dictionaries, where each dictionary corresponds to a single detection. A detection contains box coordinates (ymin, xmin, ymax, xmax), a score, and a class.

---
## Language modeling

[`pytorch_language_modeling.py`](./pytorch_language_modeling.py) contains an implementation for a RunInference pipeline that performs masked language modeling (that is, decoding a masked token in a sentence) using the `BertForMaskedLM` architecture from Hugging Face.

The pipeline reads sentences, performs basic preprocessing to convert the last word into a `[MASK]` token, passes the masked sentence to the PyTorch implementation of RunInference, and then writes the predictions to a text file.

### Dataset and model for language modeling

To use this transform, you need a dataset and model for language modeling.

1. Download the [BertForMaskedLM](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertForMaskedLM) model from Hugging Face's repository of pretrained models. You must already have `transformers` installed; then, from a Python shell, run:
```
import torch
from transformers import BertForMaskedLM
model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict=True)
torch.save(model.state_dict(), 'BertForMaskedLM.pth') # You can replace BertForMaskedLM.pth with your preferred file name for your model state dictionary.
```
2. (Optional) Create a file named `SENTENCES.txt` that contains sentences to feed into the model. The content of the file should be similar to the following example:
```
The capital of France is Paris .
He looked up and saw the sun and stars .
...
```

### Running `pytorch_language_modeling.py`

To run the language modeling pipeline locally, use the following command:
```sh
python -m apache_beam.examples.inference.pytorch_language_modeling \
  --input SENTENCES \
  --output OUTPUT \
  --model_state_dict_path MODEL_STATE_DICT
```
The `input` argument is optional. If none is provided, the pipeline runs with some
example sentences.
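Under the hood, each prediction is keyed by its input sentence, so the model is wrapped in a keyed PyTorch handler. A hedged sketch of a comparable handler (simplified; the exact configuration is in `pytorch_language_modeling.py`):
```
from apache_beam.ml.inference.pytorch_inference import PytorchModelHandlerKeyedTensor
from transformers import BertConfig, BertForMaskedLM

# Sketch only: passing a config avoids downloading weights at construction time.
model_handler = PytorchModelHandlerKeyedTensor(
    state_dict_path='BertForMaskedLM.pth',
    model_class=BertForMaskedLM,
    model_params={'config': BertConfig.from_pretrained('bert-base-uncased')})
```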
For example, if you've followed the naming conventions recommended above:
```sh
python -m apache_beam.examples.inference.pytorch_language_modeling \
  --input SENTENCES.txt \
  --output predictions.csv \
  --model_state_dict_path BertForMaskedLM.pth
```
Or, using the default example sentences:
```sh
python -m apache_beam.examples.inference.pytorch_language_modeling \
  --output predictions.csv \
  --model_state_dict_path BertForMaskedLM.pth
```

This writes the output to `predictions.csv` with contents like:
```
The capital of France is Paris .;paris
He looked up and saw the sun and stars .;moon
...
```
Each line has data separated by a semicolon ";".
The first item is the input sentence. The model masks the last word and tries to predict it;
the second item is the word that the model predicts for the mask.

---
## MNIST digit classification
[`sklearn_mnist_classification.py`](./sklearn_mnist_classification.py) contains an implementation for a RunInference pipeline that performs image classification on handwritten digits from the [MNIST](https://en.wikipedia.org/wiki/MNIST_database) database.

The pipeline reads rows of pixels corresponding to a digit, performs basic preprocessing, passes the pixels to the Scikit-learn implementation of RunInference, and then writes the predictions to a text file.

### Dataset and model for MNIST digit classification

To use this transform, you need a dataset and model for MNIST digit classification.

1. Create a file named `INPUT.csv` that contains labels and pixels to feed into the model. Each row should have comma-separated elements. The first element is the label. All other elements are pixel values. The CSV should not have column headers. The content of the file should be similar to the following example:
```
1,0,0,0...
0,0,0,0...
1,0,0,0...
4,0,0,0...
...
```
2. Create a file named `MODEL_PATH` that contains the pickled file of a scikit-learn model trained on MNIST data. Refer to the scikit-learn [model persistence documentation](https://scikit-learn.org/stable/model_persistence.html) for how to serialize models.
3. Update `sklearn_examples_requirements.txt` to match the version of sklearn used to train the model. Sklearn doesn't guarantee model compatibility between versions.


### Running `sklearn_mnist_classification.py`

To run the MNIST classification pipeline locally, use the following command:
```sh
python -m apache_beam.examples.inference.sklearn_mnist_classification \
  --input INPUT \
  --output OUTPUT \
  --model_path MODEL_PATH
```
For example:
```sh
python -m apache_beam.examples.inference.sklearn_mnist_classification \
  --input INPUT.csv \
  --output predictions.txt \
  --model_path mnist_model_svm.pickle
```

This writes the output to `predictions.txt` with contents like:
```
1,1
4,9
7,1
0,0
...
```
Each line has data separated by a comma ",". The first item is the actual label of the digit. The second item is the predicted label of the digit.
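The pickled model is loaded through Beam's scikit-learn model handler. A hedged sketch of a comparable construction (the exact code is in `sklearn_mnist_classification.py`):
```
from apache_beam.ml.inference.sklearn_inference import ModelFileType
from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy

# Sketch only: loads the pickled model referenced by --model_path above.
model_handler = SklearnModelHandlerNumpy(
    model_uri='mnist_model_svm.pickle',
    model_file_type=ModelFileType.PICKLE)
```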
### Running `sklearn_japanese_housing_regression.py`

#### Getting the data
Data for this example can be found at:
https://www.kaggle.com/datasets/nishiodens/japan-real-estate-transaction-prices

#### Models
Prebuilt sklearn pipelines are hosted at:
https://storage.cloud.google.com/apache-beam-ml/models/japanese_housing/

Note: This example uses more than one model. Because not all features in a sample are populated, a different model is chosen based on the available data.

For example, a sample without the distance to the nearest station will use a model that doesn't rely on that feature.

#### Running the Pipeline
To run locally, use the following command:
```sh
python -m apache_beam.examples.inference.sklearn_japanese_housing_regression \
  --input_file INPUT \
  --output OUTPUT \
  --model_path MODEL_PATH
```
For example:
```sh
python -m apache_beam.examples.inference.sklearn_japanese_housing_regression \
  --input_file housing_examples.csv \
  --output predictions.txt \
  --model_path https://storage.cloud.google.com/apache-beam-ml/models/japanese_housing/
```

This writes the output to `predictions.txt` with contents like:
```
True Price 40000000.0, Predicted Price 34645912.039208
True Price 34000000.0, Predicted Price 28648634.135857
True Price 31000000.0, Predicted Price 25654277.256461
...
```

---
## Sentiment classification using ONNX version of RoBERTa
[`onnx_sentiment_classification.py`](./onnx_sentiment_classification.py) contains an implementation for a RunInference pipeline that performs sentiment classification on movie reviews.

The pipeline reads lines of text corresponding to movie reviews, performs basic preprocessing, passes the tokenized reviews to the ONNX version of RoBERTa via RunInference, and then writes the predictions (0 for negative, 1 for positive) to a text file.

### Dataset and model for sentiment classification
We assume you already have a trained model in ONNX format. In our example, we use RoBERTa from https://github.com/SeldonIO/seldon-models/blob/master/pytorch/moviesentiment_roberta/pytorch-roberta-onnx.ipynb.

For input data, you can generate your own movie reviews (separated by line breaks) or use IMDB reviews online (https://ai.stanford.edu/~amaas/data/sentiment/).

The output will be a text file, with a binary label (0 for negative, 1 for positive) appended to the review, separated by a semicolon.

### Running the pipeline
To run locally, you can use the following command:
```sh
python -m apache_beam.examples.inference.onnx_sentiment_classification \
  --input_file [input file path] \
  --output [output file path] \
  --model_uri [path to onnx model]
```

This writes the output to the output file path with contents like:
```
A comedy-drama of nearly epic proportions rooted in a sincere performance by the title character undergoing midlife crisis .;1
```

---
## MNIST digit classification with Tensorflow
[`tensorflow_mnist_classification.py`](./tensorflow_mnist_classification.py) contains an implementation for a RunInference pipeline that performs image classification on handwritten digits from the [MNIST](https://en.wikipedia.org/wiki/MNIST_database) database.
The pipeline reads rows of pixels corresponding to a digit, performs basic preprocessing (converting the input shape to 28x28), passes the pixels to the trained Tensorflow model with RunInference, and then writes the predictions to a text file.

### Dataset and model for MNIST digit classification

To use this transform, you need a dataset and model for MNIST digit classification.

1. Create a file named [`INPUT.csv`](gs://apache-beam-ml/testing/inputs/it_mnist_data.csv) that contains labels and pixels to feed into the model. Each row should have comma-separated elements. The first element is the label. All other elements are pixel values. The CSV should not have column headers. The content of the file should be similar to the following example:
```
1,0,0,0...
0,0,0,0...
1,0,0,0...
4,0,0,0...
...
```
2. Save the trained Tensorflow model to a directory named `MODEL_DIR`.


### Running `tensorflow_mnist_classification.py`

To run the MNIST classification pipeline locally, use the following command:
```sh
python -m apache_beam.examples.inference.tensorflow_mnist_classification \
  --input INPUT \
  --output OUTPUT \
  --model_path MODEL_DIR
```
For example:
```sh
python -m apache_beam.examples.inference.tensorflow_mnist_classification \
  --input INPUT.csv \
  --output predictions.txt \
  --model_path MODEL_DIR
```

This writes the output to `predictions.txt` with contents like:
```
1,1
4,4
0,0
7,7
3,3
5,5
...
```
Each line has data separated by a comma ",". The first item is the actual label of the digit. The second item is the predicted label of the digit.

---
## Image segmentation with Tensorflow and TensorflowHub

[`tensorflow_imagenet_segmentation.py`](./tensorflow_imagenet_segmentation.py) contains an implementation for a RunInference pipeline that performs image segmentation using the [`mobilenet_v2`](https://tfhub.dev/google/tf2-preview/mobilenet_v2/classification/4) architecture from TensorFlow Hub.

The pipeline reads images, performs basic preprocessing, passes the images to the Tensorflow implementation of RunInference, and then writes predictions to a text file.

### Dataset and model for image segmentation

To use this transform, you need a dataset and model for image segmentation.

1. Create a directory named `IMAGE_DIR`. Create or download images and put them in this directory. We
will use the [example images](https://storage.googleapis.com/download.tensorflow.org/example_images/) from Tensorflow.
2. Create a file named `IMAGE_FILE_NAMES.txt` that contains the names of each of the images in `IMAGE_DIR` that you want to use to run image segmentation. For example:
```
grace_hopper.jpg
```
3. Choose a Tensorflow `MODEL_PATH`; we will use the [mobilenet](https://tfhub.dev/google/tf2-preview/mobilenet_v2/classification/4) model.
4. Note the path to the `OUTPUT` file. This file is used by the pipeline to write the predictions.
5. Install TensorflowHub: `pip install tensorflow_hub`

### Running `tensorflow_imagenet_segmentation.py`

To run the image segmentation pipeline locally, use the following command:
```sh
python -m apache_beam.examples.inference.tensorflow_imagenet_segmentation \
  --input IMAGE_FILE_NAMES \
  --image_dir IMAGE_DIR \
  --output OUTPUT \
  --model_path MODEL_PATH
```

For example, if you've followed the naming conventions recommended above:
```sh
python -m apache_beam.examples.inference.tensorflow_imagenet_segmentation \
  --input IMAGE_FILE_NAMES.txt \
  --image_dir "https://storage.googleapis.com/download.tensorflow.org/example_images/" \
  --output predictions.txt \
  --model_path "https://tfhub.dev/google/tf2-preview/mobilenet_v2/classification/4"
```
This writes the output to `predictions.txt` with contents like:
```
background
...
```
Each line contains a predicted label.

---
## MNIST digit classification with Tensorflow using Saved Model Weights
[`tensorflow_mnist_with_weights.py`](./tensorflow_mnist_with_weights.py) contains an implementation for a RunInference pipeline that performs image classification on handwritten digits from the [MNIST](https://en.wikipedia.org/wiki/MNIST_database) database.

The pipeline reads rows of pixels corresponding to a digit, performs basic preprocessing (converting the input shape to 28x28), passes the pixels to the trained Tensorflow model with RunInference, and then writes the predictions to a text file.

The model is loaded from its saved weights. To do this, pass a function that creates the model and set the model type to
`ModelType.SAVED_WEIGHTS` on the `TFModelHandler`. The path to the weights saved using `model.save_weights(path)` should be passed to the `model_path` argument.

### Dataset and model for MNIST digit classification

To use this transform, you need a dataset and model for MNIST digit classification.

1. Create a file named [`INPUT.csv`](gs://apache-beam-ml/testing/inputs/it_mnist_data.csv) that contains labels and pixels to feed into the model. Each row should have comma-separated elements. The first element is the label. All other elements are pixel values. The CSV should not have column headers. The content of the file should be similar to the following example:
```
1,0,0,0...
0,0,0,0...
1,0,0,0...
4,0,0,0...
...
```
2. Save the weights of the trained Tensorflow model to a directory named `SAVED_WEIGHTS_DIR`.


### Running `tensorflow_mnist_with_weights.py`

To run the MNIST classification pipeline locally, use the following command:
```sh
python -m apache_beam.examples.inference.tensorflow_mnist_with_weights \
  --input INPUT \
  --output OUTPUT \
  --model_path SAVED_WEIGHTS_DIR
```
For example:
```sh
python -m apache_beam.examples.inference.tensorflow_mnist_with_weights \
  --input INPUT.csv \
  --output predictions.txt \
  --model_path SAVED_WEIGHTS_DIR
```

This writes the output to `predictions.txt` with contents like:
```
1,1
4,4
0,0
7,7
3,3
5,5
...
```
Each line has data separated by a comma ",". The first item is the actual label of the digit. The second item is the predicted label of the digit.
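A hedged sketch of wiring saved weights into the handler; the model-building function below is hypothetical and must match the architecture whose weights you saved to `SAVED_WEIGHTS_DIR`:
```
import tensorflow as tf
from apache_beam.ml.inference.tensorflow_inference import ModelType
from apache_beam.ml.inference.tensorflow_inference import TFModelHandlerNumpy

def build_mnist_model():
  # Hypothetical architecture; it must match the saved weights exactly.
  return tf.keras.Sequential([
      tf.keras.layers.Flatten(input_shape=(28, 28)),
      tf.keras.layers.Dense(128, activation='relu'),
      tf.keras.layers.Dense(10, activation='softmax'),
  ])

model_handler = TFModelHandlerNumpy(
    model_uri='SAVED_WEIGHTS_DIR',
    model_type=ModelType.SAVED_WEIGHTS,
    create_model_fn=build_mnist_model)
```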
---
## Iris Classification

[`xgboost_iris_classification.py`](./xgboost_iris_classification.py) contains an implementation for a RunInference pipeline that performs classification on tabular data from the [Iris Dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html).

The pipeline reads rows that contain the features of a given iris. The features are Sepal Length, Sepal Width, Petal Length and Petal Width. The pipeline passes those features to the XGBoost implementation of RunInference, which writes the iris type predictions to a text file.

### Dataset and model for iris classification

To use this transform, you need to have sklearn installed. The dataset is loaded using sklearn. The `_train_model` function can be used to train a simple classifier. The function outputs its configuration in a file that can be loaded by the `XGBoostModelHandler`.

### Training a simple classifier

The following function allows you to train a simple classifier using the sklearn Iris dataset. The trained model will be saved in the location passed as a parameter and can then later be loaded in a pipeline using the `XGBoostModelHandler`.
```
import xgboost

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def _train_model(model_state_output_path: str = '/tmp/model.json', seed=999):
  """Function to train an XGBoost Classifier using the sklearn Iris dataset"""
  dataset = load_iris()
  x_train, _, y_train, _ = train_test_split(
      dataset['data'], dataset['target'], test_size=.2, random_state=seed)
  booster = xgboost.XGBClassifier(
      n_estimators=2, max_depth=2, learning_rate=1, objective='binary:logistic')
  booster.fit(x_train, y_train)
  booster.save_model(model_state_output_path)
  return booster
```

#### Running the Pipeline
To run locally, use the following command:

```
python -m apache_beam.examples.inference.xgboost_iris_classification \
  --input_type INPUT_TYPE \
  --output OUTPUT_FILE \
  --model_state MODEL_STATE_JSON \
  [--no_split|--split]
```

For example:

```
python -m apache_beam.examples.inference.xgboost_iris_classification \
  --input_type numpy \
  --output predictions.txt \
  --model_state model_state.json \
  --split
```

This writes the output to `predictions.txt`. Each line contains the batch number and a list with all outputted class labels. There are 3 possible values for class labels: `0`, `1`, and `2`. When each batch contains a single element, the output looks like this:
```
0,[1]
1,[2]
2,[1]
3,[0]
...
```

When all elements are in a single batch, the output looks like this:
```
0,[1 1 1 0 0 0 0 1 2 0 0 2 0 2 1 2 2 2 2 0 0 0 0 2 2 0 2 2 2 1]
```

---
## Milk Quality Prediction Windowing Example

`milk_quality_prediction_windowing.py` contains an implementation of a windowing pipeline making use of the RunInference transform. An XGBoost classifier predicts the quality of milk based on measurements of pH, temperature, taste, odor, fat, turbidity and color. The model labels a measurement as `bad`, `medium` or `good`. The model is trained on the [Kaggle Milk Quality Prediction dataset](https://www.kaggle.com/datasets/cpluzshrijayan/milkquality).
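Both this pipeline and the iris example above load the saved booster through an XGBoost model handler. A hedged sketch of a comparable construction (the exact code lives in the example sources):
```
import xgboost
from apache_beam.ml.inference.xgboost_inference import XGBoostModelHandlerNumpy

# Sketch only: model_state points at the JSON file written by save_model().
model_handler = XGBoostModelHandlerNumpy(
    model_class=xgboost.XGBClassifier,
    model_state='model_state.json')
```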
#### Loading and preprocessing the dataset

The `preprocess_data` function loads the Kaggle dataset from a CSV file and splits it into a training set with accompanying labels, as well as a test set. In a typical machine learning setting, we would use the training set and the labels to train the model, and the test set to calculate metrics such as recall and precision. Here, we will use the test set in a test streaming pipeline to showcase the windowing capabilities.

#### Training an XGBoost classifier

The `train_model` function allows you to train a simple XGBoost classifier using the Kaggle Milk Quality Prediction dataset. The trained model will be saved in JSON format at the location passed as a parameter and can later be used for inference by loading it via the `XGBoostModelHandler`.

#### Running the pipeline

```
python -m apache_beam.examples.inference.milk_quality_prediction_windowing \
  --dataset <DATASET> \
  --pipeline_input_data <INPUT_DATA> \
  --training_set <TRAINING_SET> \
  --labels <LABELS> \
  --model_state <MODEL_STATE>
```

Where `<DATASET>` is the path to a CSV file containing the Kaggle Milk Quality Prediction dataset, `<INPUT_DATA>` is a filepath to save the data that will be used as input for the streaming pipeline (the test set), `<TRAINING_SET>` is a filepath to store the training set in CSV format, `<LABELS>` is a filepath to store the CSV containing the labels used to train the model, and `<MODEL_STATE>` is the path to the JSON file containing the trained model.
`<INPUT_DATA>`, `<TRAINING_SET>`, and `<LABELS>` will all be parsed from `<DATASET>` and saved before pipeline execution.

Using the test set, we simulate a streaming pipeline that receives a new measurement of the milk quality parameters every minute. A sliding window keeps track of the measurements from the last 30 minutes, and a new window starts every 5 minutes. The model predicts the quality of each measurement. After 30 minutes, the results are aggregated in a tuple containing the number of measurements that were predicted as bad, medium, and high quality samples. The output of each window looks as follows:
```
MilkQualityAggregation(bad_quality_measurements=10, medium_quality_measurements=13, high_quality_measurements=6)
MilkQualityAggregation(bad_quality_measurements=9, medium_quality_measurements=11, high_quality_measurements=4)
MilkQualityAggregation(bad_quality_measurements=8, medium_quality_measurements=7, high_quality_measurements=4)
MilkQualityAggregation(bad_quality_measurements=6, medium_quality_measurements=4, high_quality_measurements=4)
MilkQualityAggregation(bad_quality_measurements=3, medium_quality_measurements=3, high_quality_measurements=3)
MilkQualityAggregation(bad_quality_measurements=1, medium_quality_measurements=2, high_quality_measurements=1)
```
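The 30-minute window with a 5-minute period described above corresponds to a standard Beam sliding window. A hedged sketch of the relevant transform (the element source is hypothetical; the actual pipeline code is in `milk_quality_prediction_windowing.py`):
```
import apache_beam as beam
from apache_beam.transforms import window

# Hypothetical timestamped measurements; the real pipeline reads the test set.
measurements = [...]

with beam.Pipeline() as pipeline:
  windowed = (
      pipeline
      | beam.Create(measurements)
      # 30-minute windows, with a new window starting every 5 minutes.
      | beam.WindowInto(window.SlidingWindows(size=30 * 60, period=5 * 60)))
```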