---
layout: post
title: COMPUTE MD5
permalink: /tutorials/etl/compute-md5
redirect_from:
 - /tutorials/etl/compute_md5.md/
 - /docs/tutorials/etl/compute_md5.md/
---

# Compute MD5

In this example, we will see how ETL can be used to do something as simple as computing the MD5 checksum of an object.
We will go over two ways of starting the ETL to achieve our goal: initializing it from code and initializing it from a full Pod spec.
Get ready!

> **Note:** ETL is still in development, so some steps may not work exactly as written below.

## Prerequisites

* AIStore cluster deployed on Kubernetes. We recommend following one of the guides below.
  * [Deploy AIStore on a local Kubernetes cluster](https://github.com/NVIDIA/ais-k8s/blob/master/operator/README.md)
  * [Deploy AIStore on the cloud](https://github.com/NVIDIA/ais-k8s/blob/master/terraform/README.md)

## Prepare ETL

To showcase ETL's capabilities, we will go over a simple ETL container that computes the MD5 checksum of an object.
There are three ways of approaching this problem:

1. **Simplified flow**

    In this example, we will be using the `python3.11v2` runtime.
    In the simplified flow, we only need to write a simple `transform` function, which can look like this (`code.py`):

    ```python
    import hashlib

    def transform(input_bytes):
        md5 = hashlib.md5()
        md5.update(input_bytes)
        return md5.hexdigest().encode()
    ```

    The `transform` function must take bytes as an argument (the object's content) and return the output bytes that will be saved in the transformed object.

    Once we have the `transform` function defined, we can use the CLI to build and initialize the ETL:

    ```console
    $ ais etl init code --from-file=code.py --runtime=python3.11v2 --name=transformer-md5 --comm-type hpull
    transformer-md5
    ```

2. **Simplified flow with input/output**

    Similar to the above example, we will be using the `python3.11v2` runtime.
    However, the Python code in this case reads the data from standard input and writes the output bytes to standard output, as shown in the following `code.py`:

    ```python
    import hashlib
    import sys

    md5 = hashlib.md5()
    # read the object from standard input in chunks
    for chunk in iter(lambda: sys.stdin.buffer.read(65536), b""):
        md5.update(chunk)
    sys.stdout.buffer.write(md5.hexdigest().encode())
    ```

    (A quick way to test this code locally is sketched right after this list.)

    We can now use the CLI to build and initialize the ETL with the `io://` communicator type:

    ```console
    $ ais etl init code --from-file=code.py --runtime=python3.11v2 --comm-type="io://" --name="compute-md5"
    compute-md5
    ```

3. **Regular flow**

    First, we need to write a server.
    In this case, we will write a Python 3 HTTP server.
    The code for it can look like this (`server.py`):

    ```python
    #!/usr/bin/env python

    import argparse
    import hashlib
    from http.server import HTTPServer, BaseHTTPRequestHandler


    class S(BaseHTTPRequestHandler):
        def _set_headers(self):
            self.send_response(200)
            self.send_header("Content-type", "text/plain")
            self.end_headers()

        def do_POST(self):
            content_length = int(self.headers["Content-Length"])
            post_data = self.rfile.read(content_length)
            md5 = hashlib.md5()
            md5.update(post_data)

            self._set_headers()
            self.wfile.write(md5.hexdigest().encode())


    def run(server_class=HTTPServer, handler_class=S, addr="localhost", port=8000):
        server_address = (addr, port)
        httpd = server_class(server_address, handler_class)

        print(f"Starting httpd server on {addr}:{port}")
        httpd.serve_forever()


    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="Run a simple HTTP server")
        parser.add_argument(
            "-l",
            "--listen",
            default="localhost",
            help="Specify the IP address on which the server listens",
        )
        parser.add_argument(
            "-p",
            "--port",
            type=int,
            default=8000,
            help="Specify the port on which the server listens",
        )
        args = parser.parse_args()
        run(addr=args.listen, port=args.port)
    ```

    Once we have a server that computes the MD5, we need to create an image out of it.
    For that, we need to write a `Dockerfile`, which can look like this:

    ```dockerfile
    FROM python:3.8.5-alpine3.11

    RUN mkdir /code
    WORKDIR /code
    COPY server.py server.py

    EXPOSE 80

    ENTRYPOINT [ "/code/server.py", "--listen", "0.0.0.0", "--port", "80" ]
    ```

    Once we have the `Dockerfile`, we must build the image and push it to some [Docker Registry](https://docs.docker.com/registry/) so that our Kubernetes cluster can pull this image later.
    In this example, we will use the [docker.io](https://hub.docker.com/) Docker Registry.

    ```console
    $ docker build -t docker.io/aistore/md5_server:v1 .
    $ docker push docker.io/aistore/md5_server:v1
    ```

    The next step is to create the spec of a Pod that will run on Kubernetes (`spec.yaml`):

    ```yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: transformer-md5
    spec:
      containers:
        - name: server
          image: docker.io/aistore/md5_server:v1
          ports:
            - name: default
              containerPort: 80
          command: ['/code/server.py', '--listen', '0.0.0.0', '--port', '80']
    ```

    **Important**: the server must listen on the same port as specified in `ports.containerPort`.
    This is required because a target needs to know the precise socket address of the ETL container.

    Once we have our `spec.yaml`, we can initialize the ETL with the CLI:

    ```console
    $ ais etl init spec --from-file=spec.yaml --name=transformer-md5 --comm-type="hpush://"
    transformer-md5
    ```
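Regardless of which of the three flows you pick, it can save a round of container debugging to sanity-check the transformation code locally before initializing the ETL. The sketch below is not part of the tutorial's files: it assumes the option-2 (`io://`) `code.py` shown above is saved in the current directory, feeds it an arbitrary sample payload the same way the `io://` communicator would (bytes in on stdin, bytes out on stdout), and compares the result against a reference digest from `hashlib`.

```python
# sanity_check.py - hypothetical local test, not part of the AIStore repository
import hashlib
import subprocess
import sys

payload = b"some text :)\n"  # arbitrary sample bytes

# Run the option-2 code.py the way the io:// communicator would:
# object bytes on stdin, transformed bytes on stdout.
result = subprocess.run(
    [sys.executable, "code.py"],
    input=payload,
    capture_output=True,
    check=True,
)

expected = hashlib.md5(payload).hexdigest().encode()
assert result.stdout == expected, f"got {result.stdout!r}, expected {expected!r}"
print("transform output matches the hashlib reference:", result.stdout.decode())
```

The same idea works for the other two flows: import and call the option-1 `transform` function directly, or run `server.py` locally and POST a few bytes to it.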
Just before we started the ETL containers, our Pods looked like this:

```console
$ kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
demo-ais-admin-99p8r    1/1     Running   0          31m
demo-ais-proxy-5vqb8    1/1     Running   0          31m
demo-ais-proxy-g7jf7    1/1     Running   0          31m
demo-ais-target-0       1/1     Running   0          31m
demo-ais-target-1       1/1     Running   0          29m
```

We can see that the cluster is running with two proxies and two targets.
After we initialized the ETL, we expect two more Pods to be started (`#targets == #etl_containers`).

```console
$ kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
demo-ais-admin-99p8r    1/1     Running   0          41m
demo-ais-proxy-5vqb8    1/1     Running   0          41m
demo-ais-proxy-g7jf7    1/1     Running   0          41m
demo-ais-target-0       1/1     Running   0          41m
demo-ais-target-1       1/1     Running   0          39m
transformer-md5-fgjk3   1/1     Running   0          1m
transformer-md5-vspra   1/1     Running   0          1m
```

As expected, two more Pods are up and running, one for each target.

> ETL containers will be run on the same node as the targets that started them.
> In other words, each ETL container runs close to data and does not generate any extract-transform-load related network traffic.
> Given that there are as many ETL containers as storage nodes (one container per target) and that all ETL containers run in parallel, the cumulative "transformation" bandwidth scales proportionally to the number of storage nodes and disks.

Finally, we can use the newly created Pods to transform the objects on the fly for us:

```console
$ ais create transform
$ echo "some text :)" | ais put - transform/shard.in
$ ais etl object transformer-md5 transform/shard.in -
393c6706efb128fbc442d3f7d084a426
```

Voilà! The ETL container successfully computed the `md5` of the `transform/shard.in` object.
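Note that `echo` (without `-n`) appends a trailing newline, so the bytes the transformer hashed are the echoed line plus `\n`. If you want to cross-check the digest returned above, a minimal local sketch:

```python
# Recompute the digest locally to cross-check the ETL's answer.
import hashlib

payload = b"some text :)\n"  # `echo` adds the trailing newline
print(hashlib.md5(payload).hexdigest())  # should match the output of `ais etl object`
```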
Alternatively, one can use the offline ETL feature to transform the whole bucket:

```console
$ ais create transform
$ echo "some text :)" | ais put - transform/shard.in
$ ais etl bucket transformer-md5 ais://transform ais://transform-md5 --wait
```

Once the ETL isn't needed anymore, the Pods can be stopped with:

```console
$ ais etl stop transformer-md5
ETL containers stopped successfully.
$ kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
demo-ais-admin-99p8r    1/1     Running   0          50m
demo-ais-proxy-5vqb8    1/1     Running   0          50m
demo-ais-proxy-g7jf7    1/1     Running   0          49m
demo-ais-target-0       1/1     Running   0          50m
demo-ais-target-1       1/1     Running   0          49m
```