---
layout: post
title: COMPUTE MD5
permalink: /tutorials/etl/compute-md5
redirect_from:
 - /tutorials/etl/compute_md5.md/
 - /docs/tutorials/etl/compute_md5.md/
---

# Compute MD5

In this example, we will see how ETL can be used to do something as simple as computing the MD5 checksum of an object.
We will go over two ways of starting the ETL to achieve our goal: initializing it from code and initializing it from a full Pod spec.
Get ready!

> **Note:** ETL is still in development, so some steps may not work exactly as written below.

## Prerequisites

* AIStore cluster deployed on Kubernetes. We recommend following one of the guides below.
  * [Deploy AIStore on a local Kubernetes cluster](https://github.com/NVIDIA/ais-k8s/blob/master/operator/README.md)
  * [Deploy AIStore on the cloud](https://github.com/NVIDIA/ais-k8s/blob/master/terraform/README.md)

## Prepare ETL

To showcase ETL's capabilities, we will go over a simple ETL container that computes the MD5 checksum of an object.
There are three ways of approaching this problem:

1. **Simplified flow**

    In this example, we will be using the `python3.11v2` runtime.
    In the simplified flow, we only need to write a simple `transform` function, which can look like this (`code.py`):

    ```python
    import hashlib

    def transform(input_bytes):
        md5 = hashlib.md5()
        md5.update(input_bytes)
        return md5.hexdigest().encode()
    ```

    The `transform` function must take bytes as an argument (the object's content) and return the output bytes that will be saved in the transformed object.

    Once we have the `transform` function defined, we can use the CLI to build and initialize the ETL:

    ```console
    $ ais etl init code --from-file=code.py --runtime=python3.11v2 --name=transformer-md5 --comm-type hpull
    transformer-md5
    ```

2. **Simplified flow with input/output**

    Similar to the above example, we will be using the `python3.11v2` runtime.
    However, the Python code in this case reads the data from standard input and writes the output bytes to standard output, as shown in the following `code.py`:

    ```python
    import hashlib
    import sys

    md5 = hashlib.md5()
    # read the object from standard input in chunks
    for chunk in iter(lambda: sys.stdin.buffer.read(65536), b""):
        md5.update(chunk)
    sys.stdout.buffer.write(md5.hexdigest().encode())
    ```

    (A quick way to test this code locally is sketched right after this list.)

    We can now use the CLI to build and initialize the ETL with the `io://` communicator type:

    ```console
    $ ais etl init code --from-file=code.py --runtime=python3.11v2 --comm-type="io://" --name="compute-md5"
    compute-md5
    ```

3. **Regular flow**

    First, we need to write a server.
    In this case, we will write a Python 3 HTTP server.
    The code for it can look like this (`server.py`):

    ```python
    #!/usr/bin/env python

    import argparse
    import hashlib
    from http.server import HTTPServer, BaseHTTPRequestHandler


    class S(BaseHTTPRequestHandler):
        def _set_headers(self):
            self.send_response(200)
            self.send_header("Content-type", "text/plain")
            self.end_headers()

        def do_POST(self):
            content_length = int(self.headers["Content-Length"])
            post_data = self.rfile.read(content_length)
            md5 = hashlib.md5()
            md5.update(post_data)

            self._set_headers()
            self.wfile.write(md5.hexdigest().encode())


    def run(server_class=HTTPServer, handler_class=S, addr="localhost", port=8000):
        server_address = (addr, port)
        httpd = server_class(server_address, handler_class)

        print(f"Starting httpd server on {addr}:{port}")
        httpd.serve_forever()


    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="Run a simple HTTP server")
        parser.add_argument(
            "-l",
            "--listen",
            default="localhost",
            help="Specify the IP address on which the server listens",
        )
        parser.add_argument(
            "-p",
            "--port",
            type=int,
            default=8000,
            help="Specify the port on which the server listens",
        )
        args = parser.parse_args()
        run(addr=args.listen, port=args.port)
    ```

    Once we have a server that computes the MD5, we need to create an image out of it.
    For that, we need to write a `Dockerfile`, which can look like this:

    ```dockerfile
    FROM python:3.8.5-alpine3.11

    RUN mkdir /code
    WORKDIR /code
    COPY server.py server.py

    EXPOSE 80

    ENTRYPOINT [ "/code/server.py", "--listen", "0.0.0.0", "--port", "80" ]
    ```

    Once we have the `Dockerfile`, we must build the image and push it to some [Docker Registry](https://docs.docker.com/registry/) so that our Kubernetes cluster can pull this image later.
    In this example, we will use the [docker.io](https://hub.docker.com/) Docker Registry.

    ```console
    $ docker build -t docker.io/aistore/md5_server:v1 .
    $ docker push docker.io/aistore/md5_server:v1
    ```

    The next step is to create the spec of a Pod that will run on Kubernetes (`spec.yaml`):

    ```yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: transformer-md5
    spec:
      containers:
        - name: server
          image: docker.io/aistore/md5_server:v1
          ports:
            - name: default
              containerPort: 80
          command: ['/code/server.py', '--listen', '0.0.0.0', '--port', '80']
    ```

    **Important**: the server must listen on the same port as specified in `ports.containerPort`.
    This is required because a target needs to know the precise socket address of the ETL container.

    Once we have our `spec.yaml`, we can initialize the ETL with the CLI:

    ```console
    $ ais etl init spec --from-file=spec.yaml --name=transformer-md5 --comm-type="hpush://"
    transformer-md5
    ```
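Regardless of which of the three flows you pick, it can save a round of container debugging to sanity-check the transformation code locally before initializing the ETL. The sketch below is not part of the tutorial's files: it assumes the option-2 (`io://`) `code.py` shown above is saved in the current directory, feeds it an arbitrary sample payload the same way the `io://` communicator would (bytes in on stdin, bytes out on stdout), and compares the result against a reference digest from `hashlib`.

```python
# sanity_check.py - hypothetical local test, not part of the AIStore repository
import hashlib
import subprocess
import sys

payload = b"some text :)\n"  # arbitrary sample bytes

# Run the option-2 code.py the way the io:// communicator would:
# object bytes on stdin, transformed bytes on stdout.
result = subprocess.run(
    [sys.executable, "code.py"],
    input=payload,
    capture_output=True,
    check=True,
)

expected = hashlib.md5(payload).hexdigest().encode()
assert result.stdout == expected, f"got {result.stdout!r}, expected {expected!r}"
print("transform output matches the hashlib reference:", result.stdout.decode())
```

The same idea works for the other two flows: import and call the option-1 `transform` function directly, or run `server.py` locally and POST a few bytes to it.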
Just before we started the ETL containers, our Pods looked like this:

```console
$ kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
demo-ais-admin-99p8r    1/1     Running   0          31m
demo-ais-proxy-5vqb8    1/1     Running   0          31m
demo-ais-proxy-g7jf7    1/1     Running   0          31m
demo-ais-target-0       1/1     Running   0          31m
demo-ais-target-1       1/1     Running   0          29m
```

We can see that the cluster is running with two proxies and two targets.
After we initialized the ETL, we expect two more Pods to be started (`#targets == #etl_containers`).

```console
$ kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
demo-ais-admin-99p8r    1/1     Running   0          41m
demo-ais-proxy-5vqb8    1/1     Running   0          41m
demo-ais-proxy-g7jf7    1/1     Running   0          41m
demo-ais-target-0       1/1     Running   0          41m
demo-ais-target-1       1/1     Running   0          39m
transformer-md5-fgjk3   1/1     Running   0          1m
transformer-md5-vspra   1/1     Running   0          1m
```

As expected, two more Pods are up and running, one for each target.

> ETL containers will be run on the same node as the targets that started them.
> In other words, each ETL container runs close to data and does not generate any extract-transform-load related network traffic.
> Given that there are as many ETL containers as storage nodes (one container per target) and that all ETL containers run in parallel, the cumulative "transformation" bandwidth scales proportionally to the number of storage nodes and disks.

Finally, we can use the newly created Pods to transform the objects on the fly for us:

```console
$ ais create transform
$ echo "some text :)" | ais put - transform/shard.in
$ ais etl object transformer-md5 transform/shard.in -
393c6706efb128fbc442d3f7d084a426
```

Voilà! The ETL container successfully computed the `md5` of the `transform/shard.in` object.
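Note that `echo` (without `-n`) appends a trailing newline, so the bytes the transformer hashed are the echoed line plus `\n`. If you want to cross-check the digest returned above, a minimal local sketch:

```python
# Recompute the digest locally to cross-check the ETL's answer.
import hashlib

payload = b"some text :)\n"  # `echo` adds the trailing newline
print(hashlib.md5(payload).hexdigest())  # should match the output of `ais etl object`
```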
Alternatively, one can use the offline ETL feature to transform the whole bucket:

```console
$ ais create transform
$ echo "some text :)" | ais put - transform/shard.in
$ ais etl bucket transformer-md5 ais://transform ais://transform-md5 --wait
```

Once the ETL isn't needed anymore, the Pods can be stopped with:

```console
$ ais etl stop transformer-md5
ETL containers stopped successfully.
$ kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
demo-ais-admin-99p8r    1/1     Running   0          50m
demo-ais-proxy-5vqb8    1/1     Running   0          50m
demo-ais-proxy-g7jf7    1/1     Running   0          49m
demo-ais-target-0       1/1     Running   0          50m
demo-ais-target-1       1/1     Running   0          49m
```