<!--
    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.
-->

# Example RunInference API pipelines

This module contains example pipelines that use the Beam RunInference
API. <!---TODO: Add link to full documentation on Beam website when it's published.-->

Some examples are also used in [our benchmarks](http://s.apache.org/beam-community-metrics/d/ZpS8Uf44z/python-ml-runinference-benchmarks?orgId=1).

## Prerequisites

You must have the latest (possibly unreleased) version of `apache-beam` installed from the Beam repo in order to run these pipelines,
because some examples rely on features that are still in active development. To install Beam, run the following from the `sdks/python` directory:
```
pip install -r build-requirements.txt
pip install -e .[gcp]
```

### Tensorflow dependencies

The following installation requirement is for the Tensorflow model handler examples.

The RunInference API supports the Tensorflow framework. To use Tensorflow locally, first install `tensorflow`.
```
pip install tensorflow==2.12.0
```

### PyTorch dependencies

The following installation requirements are for the files used in these examples.

The RunInference API supports the PyTorch framework. To use PyTorch locally, first install `torch`.
```
pip install torch==1.10.0
```

If you are using pretrained models from PyTorch's `torchvision.models` [subpackage](https://pytorch.org/vision/0.12/models.html#models-and-pre-trained-weights), you also need to install `torchvision`.
```
pip install torchvision
```

If you are using pretrained models from Hugging Face's `transformers` [package](https://huggingface.co/docs/transformers/index), you also need to install `transformers`.
```
pip install transformers
```

For installation of the `torch` dependency on a distributed runner such as Dataflow, refer to the
[PyPI dependency instructions](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#pypi-dependencies).


### TensorRT dependencies

The RunInference API supports the TensorRT SDK for high-performance deep learning inference with NVIDIA GPUs.
To use TensorRT locally, we suggest an environment with TensorRT >= 8.0.1. Install TensorRT as described in the
[TensorRT Install Guide](https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html). Make sure the Python bindings for TensorRT are also installed correctly; they are available by installing the `python3-libnvinfer` and `python3-libnvinfer-dev` packages from your TensorRT download.
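If the bindings are installed correctly, you should be able to import and exercise them from Python. A minimal, hedged sanity check (assuming TensorRT >= 8.0.1 and a working CUDA setup):
```
import tensorrt as trt

print(trt.__version__)  # expect 8.0.1 or newer

# Creating a runtime confirms that the Python bindings can reach the
# installed TensorRT libraries.
runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
```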
If you would like to use Docker, you can pull an NGC image such as:
```
docker pull nvcr.io/nvidia/tensorrt:22.04-py3
```
and use it as a base image to [build a custom Apache Beam container](https://beam.apache.org/documentation/runtime/environments/#modify-existing-base-image).


### ONNX dependencies

The RunInference API supports the ONNX runtime for accelerated inference.
To use ONNX, we suggest installing the following dependencies:
```
pip install onnxruntime==1.13.1
```
The onnxruntime dependency is sufficient if you already have a model in ONNX format. This library also supports conversion from PyTorch models to ONNX.
If you need to convert TensorFlow models into ONNX, please install:
```
pip install tf2onnx==1.13.0
```
If you need to convert sklearn models into ONNX, please install:
```
pip install skl2onnx
```

### Additional resources
For more information, see the
[Machine Learning](/documentation/sdks/python-machine-learning) and the
[RunInference transform](/documentation/transforms/python/elementwise/runinference) documentation.

---
## Image classification

[`pytorch_image_classification.py`](./pytorch_image_classification.py) contains an implementation for a RunInference pipeline that performs image classification using the `mobilenet_v2` architecture.

The pipeline reads the images, performs basic preprocessing, passes the images to the PyTorch implementation of RunInference, and then writes the predictions to a text file.

### Dataset and model for image classification

To use this transform, you need a dataset and model for image classification.

1. Create a directory named `IMAGES_DIR`. Create or download images and put them in this directory. The directory is not required if the image paths in the input file `IMAGE_FILE_NAMES.txt` that you create in step 2 are absolute.
One popular dataset is from [ImageNet](https://www.image-net.org/). Follow their instructions to download the images.
2. Create a file named `IMAGE_FILE_NAMES.txt` that contains the absolute paths of each of the images in `IMAGES_DIR` that you want to use to run image classification. The path to the file can be different types of URIs such as your local file system, an AWS S3 bucket, or a GCP Cloud Storage bucket. For example:
```
/absolute/path/to/image1.jpg
/absolute/path/to/image2.jpg
```
3. Download the [mobilenet_v2](https://pytorch.org/vision/stable/_modules/torchvision/models/mobilenetv2.html) model from PyTorch's repository of pretrained models. This model requires the torchvision library. To download this model, run the following commands from a Python shell:
```
import torch
from torchvision.models import mobilenet_v2
model = mobilenet_v2(pretrained=True)
torch.save(model.state_dict(), 'mobilenet_v2.pth') # You can replace mobilenet_v2.pth with your preferred file name for your model state dictionary.
```

### Running `pytorch_image_classification.py`

To run the image classification pipeline locally, use the following command:
```sh
python -m apache_beam.examples.inference.pytorch_image_classification \
  --input IMAGE_FILE_NAMES \
  --images_dir IMAGES_DIR \
  --output OUTPUT \
  --model_state_dict_path MODEL_STATE_DICT
```
`images_dir` is only needed if your `IMAGE_FILE_NAMES.txt` file contains relative paths (they will be resolved relative to `IMAGES_DIR`).
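Internally, the example wraps the model in a PyTorch model handler and hands it to RunInference. A minimal, hedged sketch of a comparable handler (simplified; the exact construction, including preprocessing and keying, lives in `pytorch_image_classification.py`):
```
from apache_beam.ml.inference.pytorch_inference import PytorchModelHandlerTensor
from torchvision import models

# Sketch only: loads the state dict saved in the previous step.
model_handler = PytorchModelHandlerTensor(
    state_dict_path='mobilenet_v2.pth',
    model_class=models.mobilenet_v2,
    model_params={'num_classes': 1000})

# In the pipeline, preprocessed image tensors then flow through
# RunInference(model_handler).
```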
For example, if you've followed the naming conventions recommended above:
```sh
python -m apache_beam.examples.inference.pytorch_image_classification \
  --input IMAGE_FILE_NAMES.txt \
  --output predictions.csv \
  --model_state_dict_path mobilenet_v2.pth
```
This writes the output to `predictions.csv` with contents like:
```
/absolute/path/to/image1.jpg;1
/absolute/path/to/image2.jpg;333
...
```

Each image path is paired with a value representing the ImageNet class that returned the highest confidence score out of ImageNet's 1000 classes.

---
## Image segmentation

[`pytorch_image_segmentation.py`](./pytorch_image_segmentation.py) contains an implementation for a RunInference pipeline that performs image segmentation using the `maskrcnn_resnet50_fpn` architecture.

The pipeline reads images, performs basic preprocessing, passes the images to the PyTorch implementation of RunInference, and then writes predictions to a text file.

### Dataset and model for image segmentation

To use this transform, you need a dataset and model for image segmentation.

1. Create a directory named `IMAGES_DIR`. Create or download images and put them in this directory. The directory is not required if the image paths in the input file `IMAGE_FILE_NAMES.txt` that you create in step 2 are absolute.
A popular dataset is from [Coco](https://cocodataset.org/#home). Follow their instructions to download the images.
2. Create a file named `IMAGE_FILE_NAMES.txt` that contains the absolute paths of each of the images in `IMAGES_DIR` that you want to use to run image segmentation. The path to the file can be different types of URIs such as your local file system, an AWS S3 bucket, or a GCP Cloud Storage bucket. For example:
```
/absolute/path/to/image1.jpg
/absolute/path/to/image2.jpg
```
3. Download the [maskrcnn_resnet50_fpn](https://pytorch.org/vision/0.12/models.html#id70) model from PyTorch's repository of pretrained models. This model requires the torchvision library. To download this model, run the following commands from a Python shell:
```
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn
model = maskrcnn_resnet50_fpn(pretrained=True)
torch.save(model.state_dict(), 'maskrcnn_resnet50_fpn.pth') # You can replace maskrcnn_resnet50_fpn.pth with your preferred file name for your model state dictionary.
```
4. Note the path to the `OUTPUT` file. This file is used by the pipeline to write the predictions.

### Running `pytorch_image_segmentation.py`

To run the image segmentation pipeline locally, use the following command:
```sh
python -m apache_beam.examples.inference.pytorch_image_segmentation \
  --input IMAGE_FILE_NAMES \
  --images_dir IMAGES_DIR \
  --output OUTPUT \
  --model_state_dict_path MODEL_STATE_DICT
```
`images_dir` is only needed if your `IMAGE_FILE_NAMES.txt` file contains relative paths (they will be resolved relative to `IMAGES_DIR`).
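Because each prediction in the output is keyed by its file name, the PyTorch handler is typically wrapped in a `KeyedModelHandler`. A hedged sketch under that assumption (the real construction is in `pytorch_image_segmentation.py`):
```
from apache_beam.ml.inference.base import KeyedModelHandler
from apache_beam.ml.inference.pytorch_inference import PytorchModelHandlerTensor
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Sketch only: 91 classes is the COCO label count used by the pretrained model.
model_handler = KeyedModelHandler(
    PytorchModelHandlerTensor(
        state_dict_path='maskrcnn_resnet50_fpn.pth',
        model_class=maskrcnn_resnet50_fpn,
        model_params={'num_classes': 91}))
```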
For example, if you've followed the naming conventions recommended above:
```sh
python -m apache_beam.examples.inference.pytorch_image_segmentation \
  --input IMAGE_FILE_NAMES.txt \
  --output predictions.csv \
  --model_state_dict_path maskrcnn_resnet50_fpn.pth
```
This writes the output to `predictions.csv` with contents like:
```
/absolute/path/to/image1.jpg;['parking meter', 'bottle', 'person', 'traffic light', 'traffic light', 'traffic light']
/absolute/path/to/image2.jpg;['bottle', 'person', 'person']
...
```
Each line has data separated by a semicolon ";". The first item is the file name. The second item is a list of predicted instances.

---
## Object Detection

[`tensorrt_object_detection.py`](./tensorrt_object_detection.py) contains an implementation for a RunInference pipeline that performs object detection using [Tensorflow Object Detection's](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md) SSD MobileNet v2 320x320 architecture.

The pipeline reads the images, performs basic preprocessing, passes them to the TensorRT implementation of RunInference, and then writes the predictions to a text file.

### Dataset and model for object detection

You will need to create or download images, and place them into your `IMAGES_DIR` directory. A popular dataset for this task is the [COCO dataset](https://cocodataset.org/#home); the COCO validation dataset can be obtained [here](http://images.cocodataset.org/zips/val2017.zip).
- **Required**: A path to a file called `IMAGE_FILE_NAMES` that contains the absolute paths of each of the images in `IMAGES_DIR` on which you want to run object detection. Paths can be different types of URIs such as your local file system, an AWS S3 bucket, or a GCP Cloud Storage bucket. For example:
```
/absolute/path/to/000000000139.jpg
/absolute/path/to/000000289594.jpg
```
- **Required**: A path to a file called `TRT_ENGINE` that contains a pre-built TensorRT engine created from the SSD MobileNet v2 320x320 model. [Follow these instructions](https://github.com/NVIDIA/TensorRT/tree/main/samples/python/tensorflow_object_detection_api) to download this SSD model and convert it into a TensorRT engine. At the [Create ONNX Graph](https://github.com/NVIDIA/TensorRT/tree/main/samples/python/tensorflow_object_detection_api#create-onnx-graph) step, keep the batch size at 1. Once you have finished the [Build TensorRT Engine](https://github.com/NVIDIA/TensorRT/tree/main/samples/python/tensorflow_object_detection_api#build-tensorrt-engine) step, you can use the resulting engine as the `TRT_ENGINE` input. In addition, make sure that the environment you use to create the TensorRT engine is the same environment you use to run TensorRT inference. This applies not only to the TensorRT version, but also to the specific GPU used. Read more about it [here](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#compatibility-serialized-engines). A hedged sketch of the model handler that consumes this engine follows this list.

- **Required**: A path to a file called `OUTPUT`, to which the pipeline will write the predictions.
- **Optional**: `IMAGES_DIR`, which is the path to the directory where images are stored. Not required if image names in the input file `IMAGE_FILE_NAMES` have absolute paths.
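The pipeline loads the engine through Beam's TensorRT model handler. A minimal, hedged sketch (the exact parameters are in `tensorrt_object_detection.py`; the engine file name below assumes the conversion instructions above):
```
from apache_beam.ml.inference.tensorrt_inference import TensorRTEngineHandlerNumPy

# Batch size 1 matches the "Create ONNX Graph" step above.
engine_handler = TensorRTEngineHandlerNumPy(
    min_batch_size=1,
    max_batch_size=1,
    engine_path='ssd_mobilenet_v2_320x320_coco17_tpu-8.trt')
```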
### Running `tensorrt_object_detection.py`

To run the object detection pipeline locally, use the following command:
```sh
python -m apache_beam.examples.inference.tensorrt_object_detection \
  --input IMAGE_FILE_NAMES \
  --images_dir IMAGES_DIR \
  --output OUTPUT \
  --engine_path TRT_ENGINE
```
For example:
```sh
python -m apache_beam.examples.inference.tensorrt_object_detection \
  --input image_file_names.txt \
  --output predictions.csv \
  --engine_path ssd_mobilenet_v2_320x320_coco17_tpu-8.trt
```
This writes the output to `predictions.csv` with contents like:
```
/absolute/path/to/000000000139.jpg;[{'ymin': '217.31875205039978' 'xmin': '295.93122482299805' 'ymax': '315.90323209762573' 'xmax': '357.8959655761719' 'score': '0.72342616' 'class': 'chair'} {'ymin': '166.81788557767868'.....

/absolute/path/to/000000289594.jpg;[{'ymin': '227.25109100341797' 'xmin': '331.7402381300926' 'ymax': '476.88533782958984' 'xmax': '402.2928895354271' 'score': '0.77217317' 'class': 'person'} {'ymin': '231.8712615966797' 'xmin': '292.8590789437294'.....
...
```
Each line has data separated by a semicolon ";". The first item is the file name. The second item is a list of dictionaries, where each dictionary corresponds to a single detection. A detection contains box coordinates (ymin, xmin, ymax, xmax), a score, and a class.

---
## Language modeling

[`pytorch_language_modeling.py`](./pytorch_language_modeling.py) contains an implementation for a RunInference pipeline that performs masked language modeling (that is, decoding a masked token in a sentence) using the `BertForMaskedLM` architecture from Hugging Face.

The pipeline reads sentences, performs basic preprocessing to convert the last word into a `[MASK]` token, passes the masked sentence to the PyTorch implementation of RunInference, and then writes the predictions to a text file.

### Dataset and model for language modeling

To use this transform, you need a dataset and model for language modeling.

1. Download the [BertForMaskedLM](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertForMaskedLM) model from Hugging Face's repository of pretrained models. You must already have `transformers` installed; then, from a Python shell, run:
```
import torch
from transformers import BertForMaskedLM
model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict=True)
torch.save(model.state_dict(), 'BertForMaskedLM.pth') # You can replace BertForMaskedLM.pth with your preferred file name for your model state dictionary.
```
2. (Optional) Create a file named `SENTENCES.txt` that contains sentences to feed into the model. The content of the file should be similar to the following example:
```
The capital of France is Paris .
He looked up and saw the sun and stars .
...
```

### Running `pytorch_language_modeling.py`

To run the language modeling pipeline locally, use the following command:
```sh
python -m apache_beam.examples.inference.pytorch_language_modeling \
  --input SENTENCES \
  --output OUTPUT \
  --model_state_dict_path MODEL_STATE_DICT
```
The `input` argument is optional. If none is provided, the pipeline runs with some
example sentences.
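Under the hood, each prediction is keyed by its input sentence, so the model is wrapped in a keyed PyTorch handler. A hedged sketch of a comparable handler (simplified; the exact configuration is in `pytorch_language_modeling.py`):
```
from apache_beam.ml.inference.pytorch_inference import PytorchModelHandlerKeyedTensor
from transformers import BertConfig, BertForMaskedLM

# Sketch only: passing a config avoids downloading weights at construction time.
model_handler = PytorchModelHandlerKeyedTensor(
    state_dict_path='BertForMaskedLM.pth',
    model_class=BertForMaskedLM,
    model_params={'config': BertConfig.from_pretrained('bert-base-uncased')})
```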
For example, if you've followed the naming conventions recommended above:
```sh
python -m apache_beam.examples.inference.pytorch_language_modeling \
  --input SENTENCES.txt \
  --output predictions.csv \
  --model_state_dict_path BertForMaskedLM.pth
```
Or, using the default example sentences:
```sh
python -m apache_beam.examples.inference.pytorch_language_modeling \
  --output predictions.csv \
  --model_state_dict_path BertForMaskedLM.pth
```

This writes the output to `predictions.csv` with contents like:
```
The capital of France is Paris .;paris
He looked up and saw the sun and stars .;moon
...
```
Each line has data separated by a semicolon ";".
The first item is the input sentence. The model masks the last word and tries to predict it;
the second item is the word that the model predicts for the mask.

---
## MNIST digit classification
[`sklearn_mnist_classification.py`](./sklearn_mnist_classification.py) contains an implementation for a RunInference pipeline that performs image classification on handwritten digits from the [MNIST](https://en.wikipedia.org/wiki/MNIST_database) database.

The pipeline reads rows of pixels corresponding to a digit, performs basic preprocessing, passes the pixels to the Scikit-learn implementation of RunInference, and then writes the predictions to a text file.

### Dataset and model for MNIST digit classification

To use this transform, you need a dataset and model for MNIST digit classification.

1. Create a file named `INPUT.csv` that contains labels and pixels to feed into the model. Each row should have comma-separated elements. The first element is the label. All other elements are pixel values. The CSV should not have column headers. The content of the file should be similar to the following example:
```
1,0,0,0...
0,0,0,0...
1,0,0,0...
4,0,0,0...
...
```
2. Create a file named `MODEL_PATH` that contains the pickled file of a scikit-learn model trained on MNIST data. Refer to the scikit-learn [model persistence documentation](https://scikit-learn.org/stable/model_persistence.html) for how to serialize models.
3. Update `sklearn_examples_requirements.txt` to match the version of sklearn used to train the model. Sklearn doesn't guarantee model compatibility between versions.


### Running `sklearn_mnist_classification.py`

To run the MNIST classification pipeline locally, use the following command:
```sh
python -m apache_beam.examples.inference.sklearn_mnist_classification \
  --input INPUT \
  --output OUTPUT \
  --model_path MODEL_PATH
```
For example:
```sh
python -m apache_beam.examples.inference.sklearn_mnist_classification \
  --input INPUT.csv \
  --output predictions.txt \
  --model_path mnist_model_svm.pickle
```

This writes the output to `predictions.txt` with contents like:
```
1,1
4,9
7,1
0,0
...
```
Each line has data separated by a comma ",". The first item is the actual label of the digit. The second item is the predicted label of the digit.
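The pickled model is loaded through Beam's scikit-learn model handler. A hedged sketch of a comparable construction (the exact code is in `sklearn_mnist_classification.py`):
```
from apache_beam.ml.inference.sklearn_inference import ModelFileType
from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy

# Sketch only: loads the pickled model referenced by --model_path above.
model_handler = SklearnModelHandlerNumpy(
    model_uri='mnist_model_svm.pickle',
    model_file_type=ModelFileType.PICKLE)
```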
### Running `sklearn_japanese_housing_regression.py`

#### Getting the data
Data for this example can be found at:
https://www.kaggle.com/datasets/nishiodens/japan-real-estate-transaction-prices

#### Models
Prebuilt sklearn pipelines are hosted at:
https://storage.cloud.google.com/apache-beam-ml/models/japanese_housing/

Note: This example uses more than one model. Because not all features in a sample are populated, a different model is chosen based on the available data.

For example, a sample without the distance to the nearest station will use a model that doesn't rely on that feature.

#### Running the Pipeline
To run locally, use the following command:
```sh
python -m apache_beam.examples.inference.sklearn_japanese_housing_regression \
  --input_file INPUT \
  --output OUTPUT \
  --model_path MODEL_PATH
```
For example:
```sh
python -m apache_beam.examples.inference.sklearn_japanese_housing_regression \
  --input_file housing_examples.csv \
  --output predictions.txt \
  --model_path https://storage.cloud.google.com/apache-beam-ml/models/japanese_housing/
```

This writes the output to `predictions.txt` with contents like:
```
True Price 40000000.0, Predicted Price 34645912.039208
True Price 34000000.0, Predicted Price 28648634.135857
True Price 31000000.0, Predicted Price 25654277.256461
...
```

---
## Sentiment classification using ONNX version of RoBERTa
[`onnx_sentiment_classification.py`](./onnx_sentiment_classification.py) contains an implementation for a RunInference pipeline that performs sentiment classification on movie reviews.

The pipeline reads lines of text corresponding to movie reviews, performs basic preprocessing, passes the tokenized reviews to the ONNX version of RoBERTa via RunInference, and then writes the predictions (0 for negative, 1 for positive) to a text file.

### Dataset and model for sentiment classification
We assume you already have a trained model in ONNX format. In our example, we use RoBERTa from https://github.com/SeldonIO/seldon-models/blob/master/pytorch/moviesentiment_roberta/pytorch-roberta-onnx.ipynb.

For input data, you can generate your own movie reviews (separated by line breaks) or use IMDB reviews online (https://ai.stanford.edu/~amaas/data/sentiment/).

The output will be a text file, with a binary label (0 for negative, 1 for positive) appended to the review, separated by a semicolon.

### Running the pipeline
To run locally, you can use the following command:
```sh
python -m apache_beam.examples.inference.onnx_sentiment_classification \
  --input_file [input file path] \
  --output [output file path] \
  --model_uri [path to onnx model]
```

This writes the output to the output file path with contents like:
```
A comedy-drama of nearly epic proportions rooted in a sincere performance by the title character undergoing midlife crisis .;1
```

---
## MNIST digit classification with Tensorflow
[`tensorflow_mnist_classification.py`](./tensorflow_mnist_classification.py) contains an implementation for a RunInference pipeline that performs image classification on handwritten digits from the [MNIST](https://en.wikipedia.org/wiki/MNIST_database) database.
The pipeline reads rows of pixels corresponding to a digit, performs basic preprocessing (converting the input shape to 28x28), passes the pixels to the trained Tensorflow model with RunInference, and then writes the predictions to a text file.

### Dataset and model for MNIST digit classification

To use this transform, you need a dataset and model for MNIST digit classification.

1. Create a file named [`INPUT.csv`](gs://apache-beam-ml/testing/inputs/it_mnist_data.csv) that contains labels and pixels to feed into the model. Each row should have comma-separated elements. The first element is the label. All other elements are pixel values. The CSV should not have column headers. The content of the file should be similar to the following example:
```
1,0,0,0...
0,0,0,0...
1,0,0,0...
4,0,0,0...
...
```
2. Save the trained Tensorflow model to a directory named `MODEL_DIR`.


### Running `tensorflow_mnist_classification.py`

To run the MNIST classification pipeline locally, use the following command:
```sh
python -m apache_beam.examples.inference.tensorflow_mnist_classification \
  --input INPUT \
  --output OUTPUT \
  --model_path MODEL_DIR
```
For example:
```sh
python -m apache_beam.examples.inference.tensorflow_mnist_classification \
  --input INPUT.csv \
  --output predictions.txt \
  --model_path MODEL_DIR
```

This writes the output to `predictions.txt` with contents like:
```
1,1
4,4
0,0
7,7
3,3
5,5
...
```
Each line has data separated by a comma ",". The first item is the actual label of the digit. The second item is the predicted label of the digit.

---
## Image segmentation with Tensorflow and TensorflowHub

[`tensorflow_imagenet_segmentation.py`](./tensorflow_imagenet_segmentation.py) contains an implementation for a RunInference pipeline that performs image segmentation using the [`mobilenet_v2`](https://tfhub.dev/google/tf2-preview/mobilenet_v2/classification/4) architecture from TensorFlow Hub.

The pipeline reads images, performs basic preprocessing, passes the images to the Tensorflow implementation of RunInference, and then writes predictions to a text file.

### Dataset and model for image segmentation

To use this transform, you need a dataset and model for image segmentation.

1. Create a directory named `IMAGE_DIR`. Create or download images and put them in this directory. We
will use the [example images](https://storage.googleapis.com/download.tensorflow.org/example_images/) from Tensorflow.
2. Create a file named `IMAGE_FILE_NAMES.txt` that contains the names of each of the images in `IMAGE_DIR` that you want to use to run image segmentation. For example:
```
grace_hopper.jpg
```
3. Choose a Tensorflow `MODEL_PATH`; we will use the [mobilenet](https://tfhub.dev/google/tf2-preview/mobilenet_v2/classification/4) model.
4. Note the path to the `OUTPUT` file. This file is used by the pipeline to write the predictions.
5. Install TensorflowHub: `pip install tensorflow_hub`

### Running `tensorflow_imagenet_segmentation.py`

To run the image segmentation pipeline locally, use the following command:
```sh
python -m apache_beam.examples.inference.tensorflow_imagenet_segmentation \
  --input IMAGE_FILE_NAMES \
  --image_dir IMAGE_DIR \
  --output OUTPUT \
  --model_path MODEL_PATH
```

For example, if you've followed the naming conventions recommended above:
```sh
python -m apache_beam.examples.inference.tensorflow_imagenet_segmentation \
  --input IMAGE_FILE_NAMES.txt \
  --image_dir "https://storage.googleapis.com/download.tensorflow.org/example_images/" \
  --output predictions.txt \
  --model_path "https://tfhub.dev/google/tf2-preview/mobilenet_v2/classification/4"
```
This writes the output to `predictions.txt` with contents like:
```
background
...
```
Each line contains a predicted label.

---
## MNIST digit classification with Tensorflow using Saved Model Weights
[`tensorflow_mnist_with_weights.py`](./tensorflow_mnist_with_weights.py) contains an implementation for a RunInference pipeline that performs image classification on handwritten digits from the [MNIST](https://en.wikipedia.org/wiki/MNIST_database) database.

The pipeline reads rows of pixels corresponding to a digit, performs basic preprocessing (converting the input shape to 28x28), passes the pixels to the trained Tensorflow model with RunInference, and then writes the predictions to a text file.

The model is loaded from its saved weights. To do this, pass a function that creates the model and set the model type to
`ModelType.SAVED_WEIGHTS` on the `TFModelHandler`. The path to the weights saved using `model.save_weights(path)` should be passed to the `model_path` argument.

### Dataset and model for MNIST digit classification

To use this transform, you need a dataset and model for MNIST digit classification.

1. Create a file named [`INPUT.csv`](gs://apache-beam-ml/testing/inputs/it_mnist_data.csv) that contains labels and pixels to feed into the model. Each row should have comma-separated elements. The first element is the label. All other elements are pixel values. The CSV should not have column headers. The content of the file should be similar to the following example:
```
1,0,0,0...
0,0,0,0...
1,0,0,0...
4,0,0,0...
...
```
2. Save the weights of the trained Tensorflow model to a directory named `SAVED_WEIGHTS_DIR`.


### Running `tensorflow_mnist_with_weights.py`

To run the MNIST classification pipeline locally, use the following command:
```sh
python -m apache_beam.examples.inference.tensorflow_mnist_with_weights \
  --input INPUT \
  --output OUTPUT \
  --model_path SAVED_WEIGHTS_DIR
```
For example:
```sh
python -m apache_beam.examples.inference.tensorflow_mnist_with_weights \
  --input INPUT.csv \
  --output predictions.txt \
  --model_path SAVED_WEIGHTS_DIR
```

This writes the output to `predictions.txt` with contents like:
```
1,1
4,4
0,0
7,7
3,3
5,5
...
```
Each line has data separated by a comma ",". The first item is the actual label of the digit. The second item is the predicted label of the digit.
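A hedged sketch of wiring saved weights into the handler; the model-building function below is hypothetical and must match the architecture whose weights you saved to `SAVED_WEIGHTS_DIR`:
```
import tensorflow as tf
from apache_beam.ml.inference.tensorflow_inference import ModelType
from apache_beam.ml.inference.tensorflow_inference import TFModelHandlerNumpy

def build_mnist_model():
  # Hypothetical architecture; it must match the saved weights exactly.
  return tf.keras.Sequential([
      tf.keras.layers.Flatten(input_shape=(28, 28)),
      tf.keras.layers.Dense(128, activation='relu'),
      tf.keras.layers.Dense(10, activation='softmax'),
  ])

model_handler = TFModelHandlerNumpy(
    model_uri='SAVED_WEIGHTS_DIR',
    model_type=ModelType.SAVED_WEIGHTS,
    create_model_fn=build_mnist_model)
```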
---
## Iris Classification

[`xgboost_iris_classification.py`](./xgboost_iris_classification.py) contains an implementation for a RunInference pipeline that performs classification on tabular data from the [Iris Dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html).

The pipeline reads rows that contain the features of a given iris. The features are Sepal Length, Sepal Width, Petal Length and Petal Width. The pipeline passes those features to the XGBoost implementation of RunInference, which writes the iris type predictions to a text file.

### Dataset and model for iris classification

To use this transform, you need to have sklearn installed. The dataset is loaded using sklearn. The `_train_model` function can be used to train a simple classifier. The function outputs its configuration in a file that can be loaded by the `XGBoostModelHandler`.

### Training a simple classifier

The following function allows you to train a simple classifier using the sklearn Iris dataset. The trained model will be saved in the location passed as a parameter and can then later be loaded in a pipeline using the `XGBoostModelHandler`.
```
import xgboost

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def _train_model(model_state_output_path: str = '/tmp/model.json', seed=999):
  """Function to train an XGBoost Classifier using the sklearn Iris dataset"""
  dataset = load_iris()
  x_train, _, y_train, _ = train_test_split(
      dataset['data'], dataset['target'], test_size=.2, random_state=seed)
  booster = xgboost.XGBClassifier(
      n_estimators=2, max_depth=2, learning_rate=1, objective='binary:logistic')
  booster.fit(x_train, y_train)
  booster.save_model(model_state_output_path)
  return booster
```

#### Running the Pipeline
To run locally, use the following command:

```
python -m apache_beam.examples.inference.xgboost_iris_classification \
  --input_type INPUT_TYPE \
  --output OUTPUT_FILE \
  --model_state MODEL_STATE_JSON \
  [--no_split|--split]
```

For example:

```
python -m apache_beam.examples.inference.xgboost_iris_classification \
  --input_type numpy \
  --output predictions.txt \
  --model_state model_state.json \
  --split
```

This writes the output to `predictions.txt`. Each line contains the batch number and a list with all outputted class labels. There are 3 possible values for class labels: `0`, `1`, and `2`. When each batch contains a single element, the output looks like this:
```
0,[1]
1,[2]
2,[1]
3,[0]
...
```

When all elements are in a single batch, the output looks like this:
```
0,[1 1 1 0 0 0 0 1 2 0 0 2 0 2 1 2 2 2 2 0 0 0 0 2 2 0 2 2 2 1]
```

---
## Milk Quality Prediction Windowing Example

`milk_quality_prediction_windowing.py` contains an implementation of a windowing pipeline making use of the RunInference transform. An XGBoost classifier predicts the quality of milk based on measurements of pH, temperature, taste, odor, fat, turbidity and color. The model labels a measurement as `bad`, `medium` or `good`. The model is trained on the [Kaggle Milk Quality Prediction dataset](https://www.kaggle.com/datasets/cpluzshrijayan/milkquality).
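Both this pipeline and the iris example above load the saved booster through an XGBoost model handler. A hedged sketch of a comparable construction (the exact code lives in the example sources):
```
import xgboost
from apache_beam.ml.inference.xgboost_inference import XGBoostModelHandlerNumpy

# Sketch only: model_state points at the JSON file written by save_model().
model_handler = XGBoostModelHandlerNumpy(
    model_class=xgboost.XGBClassifier,
    model_state='model_state.json')
```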
#### Loading and preprocessing the dataset

The `preprocess_data` function loads the Kaggle dataset from a CSV file and splits it into a training set with accompanying labels, as well as a test set. In a typical machine learning setting, we would use the training set and the labels to train the model, and the test set to calculate metrics such as recall and precision. Here, we will use the test set in a test streaming pipeline to showcase the windowing capabilities.

#### Training an XGBoost classifier

The `train_model` function allows you to train a simple XGBoost classifier using the Kaggle Milk Quality Prediction dataset. The trained model will be saved in JSON format at the location passed as a parameter and can later be used for inference by loading it via the `XGBoostModelHandler`.

#### Running the pipeline

```
python -m apache_beam.examples.inference.milk_quality_prediction_windowing \
  --dataset <DATASET> \
  --pipeline_input_data <INPUT_DATA> \
  --training_set <TRAINING_SET> \
  --labels <LABELS> \
  --model_state <MODEL_STATE>
```

Where `<DATASET>` is the path to a CSV file containing the Kaggle Milk Quality Prediction dataset, `<INPUT_DATA>` is a filepath to save the data that will be used as input for the streaming pipeline (the test set), `<TRAINING_SET>` is a filepath to store the training set in CSV format, `<LABELS>` is a filepath to store the CSV containing the labels used to train the model, and `<MODEL_STATE>` is the path to the JSON file containing the trained model.
`<INPUT_DATA>`, `<TRAINING_SET>`, and `<LABELS>` will all be parsed from `<DATASET>` and saved before pipeline execution.

Using the test set, we simulate a streaming pipeline that receives a new measurement of the milk quality parameters every minute. A sliding window keeps track of the measurements from the last 30 minutes, and a new window starts every 5 minutes. The model predicts the quality of each measurement. After 30 minutes, the results are aggregated in a tuple containing the number of measurements that were predicted as bad, medium, and high quality samples. The output of each window looks as follows:
```
MilkQualityAggregation(bad_quality_measurements=10, medium_quality_measurements=13, high_quality_measurements=6)
MilkQualityAggregation(bad_quality_measurements=9, medium_quality_measurements=11, high_quality_measurements=4)
MilkQualityAggregation(bad_quality_measurements=8, medium_quality_measurements=7, high_quality_measurements=4)
MilkQualityAggregation(bad_quality_measurements=6, medium_quality_measurements=4, high_quality_measurements=4)
MilkQualityAggregation(bad_quality_measurements=3, medium_quality_measurements=3, high_quality_measurements=3)
MilkQualityAggregation(bad_quality_measurements=1, medium_quality_measurements=2, high_quality_measurements=1)
```
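The 30-minute window with a 5-minute period described above corresponds to a standard Beam sliding window. A hedged sketch of the relevant transform (the element source is hypothetical; the actual pipeline code is in `milk_quality_prediction_windowing.py`):
```
import apache_beam as beam
from apache_beam.transforms import window

# Hypothetical timestamped measurements; the real pipeline reads the test set.
measurements = [...]

with beam.Pipeline() as pipeline:
  windowed = (
      pipeline
      | beam.Create(measurements)
      # 30-minute windows, with a new window starting every 5 minutes.
      | beam.WindowInto(window.SlidingWindows(size=30 * 60, period=5 * 60)))
```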