github.com/matrixorigin/matrixone@v1.2.0/pkg/udf/pythonservice/demo/README.md (about) 1 # MO Python UDF Demo 2 3 This document briefly describes the use of python UDFs in MO, showing the flexibility and convenience of python UDFs. 4 5 ## Standalone Deployment 6 7 1. Before the deployment, we need to make sure we have [python3](https://www.python.org/downloads/) locally and install the grpc package in python: 8 9 ``` 10 pip install protobuf 11 pip install grpcio 12 ``` 13 14 2. Follow the steps in [Getting Started](https://docs.matrixorigin.cn/en/1.0.0-rc1/MatrixOne/Get-Started/install-standalone-matrixone/) to complete the installation, and modify the configuration of mo_ctl by running the following command before starting MO: 15 16 ```shell 17 mo_ctl set_conf MO_CONF_FILE="${MO_PATH}/matrixone/etc/launch-with-python-udf-server/launch.toml" 18 ``` 19 20 3. Launch and connect to MO. 21 22 ## "Hello world" 23 24 Let's start with a simple function: a python UDF that adds two numbers together. 25 26 Input the following SQL in the client to create the function: 27 28 ```sql 29 create or replace function py_add(a int, b int) returns int language python as 30 $$ 31 def add(a, b): 32 return a + b 33 $$ 34 handler 'add'; 35 ``` 36 37 > Note: currently, matrixone does not perform syntax checking on the python script in UDF body. 38 39 This SQL defines a function called py_add that takes two parameters of type `int` and returns the sum of the two parameters. Python code comes after `as`. `handler` represents the name of the python function that will be called. 40 41 Now we can use the function: 42 43 ```sql 44 select py_add(1,2); 45 ``` 46 47 Output: 48 49 ``` 50 +--------------+ 51 | py_add(1, 2) | 52 +--------------+ 53 | 3 | 54 +--------------+ 55 ``` 56 57 We can remove the function when we no longer need it: 58 59 ```sql 60 drop function py_add(int, int); 61 ``` 62 63 We now have a basic understanding of how to use python UDFs. 64 65 ## Advancement 66 67 In this section, we'll look at some more advanced uses of python UDFs. 68 69 ### Imported Python UDF 70 71 In the previous section, the py_add function is written directly in SQL, and the python code is written after the keyword `as`. We call this form **embedded UDF**. However, if the function logic is very complex, writing it directly in SQL is not a good choice. It bloats the SQL statements and makes the code less maintainable. In this case, we can choose another way to create python UDFs: importing python code from an external file, known as **imported UDF**. There are two types of files that are supported: (1) a single python file, and (2) the wheel package. Let's continue with the example of the py_add function from the previous section, and write the python code to a file. 72 73 The code for ./add func.py is as follows: 74 75 ```python 76 def add(a, b): 77 return a + b 78 ``` 79 80 Input the following SQL in the client to create the function(if you connect to matrixone via `mysql` cli, `--local-infile` flag is required to allow the cli reading local python files): 81 82 ```sql 83 create or replace function py_add(a int, b int) returns int language python import 84 './add_func.py' 85 handler 'add'; 86 ``` 87 88 We use the keyword `import` to import the flie add_func.py from the current working directory. 89 90 Call the function: 91 92 ```sql 93 select py_add(1,2); 94 ``` 95 96 We still get the same output as in the previous section: 97 98 ``` 99 +--------------+ 100 | py_add(1, 2) | 101 +--------------+ 102 | 3 | 103 +--------------+ 104 ``` 105 106 Importing a wheel package and a single python file share the same syntax, so we won't go into detail here. In the final example, we will create a python UDF by importing a wheel package. 107 108 ### Vector Option 109 110 In some scenarios, we may want a python function to receive multiple tuples at once to improve the efficiency. For example, in model inference, we typically process data in batches, where each batch is a vector of tuples. The vector option is designed to handle this situation. We'll continue to use the py_add function as an example to demonstrate the usage of the vector option. 111 112 Input the following SQL in the client to create the function: 113 114 ```sql 115 create or replace function py_add(a int, b int) returns int language python as 116 $$ 117 def add(a, b): # a, b are list 118 return [a[i] + b[i] for i in range(len(a))] 119 add.vector = True 120 $$ 121 handler 'add'; 122 ``` 123 124 We use `add.vector = True` to indicate that the python function add should receive two integer lists (vectors) as input, rather than individual integer values. 125 126 Call the function: 127 128 ```sql 129 select py_add(1,2); 130 ``` 131 132 We still get the same output as in the previous section: 133 134 ``` 135 +--------------+ 136 | py_add(1, 2) | 137 +--------------+ 138 | 3 | 139 +--------------+ 140 ``` 141 142 With the vector option, we have the freedom to choose how the function processes its input, whether it's one tuple at a time or multiple tuples at once. 143 144 ### Type Mapping 145 146 The correspondence between MO types and python types is as follows: 147 148 | MO Types | Python Types | 149 | -------------------------------------------------------- | --------------------------- | 150 | bool | bool | 151 | int8, int16, int32, int64, uint8, uint16, uint32, uint64 | int | 152 | float32, float64 | float | 153 | char, varchar, text, uuid | str | 154 | json | str, int, float, list, dict | 155 | time | datetime.timedelta | 156 | date | datetime.date | 157 | datetime, timestamp | datetime.datetime | 158 | decimal64, decimal128 | decimal.Decimal | 159 | binary, varbinary, blob | bytes | 160 161 ### Function Overloading and Matching 162 163 Overloading is not supported in MO UDF yet, all function names must be unique across an MO cluster. 164 165 ## Example: Credit Card Fraud Detection 166 167 In this section, we use the example "Credit card fraud detection" to illustrate how python UDFs can be used in a machine learning inference pipeline. (Detailed code in [github](https://github.com/matrixorigin/matrixone/tree/main/pkg/udf/pythonservice/demo)) 168 169 We need to ensure that the local python environment has numpy and scikit-learn installed. 170 171 ### Background 172 173 It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase. (Details in kaggle [Credit Card Fraud Detection](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud)) 174 175 ### Data 176 177 The dataset contains transactions made by credit cards in September 2013 by European cardholders. The data format is as follows: 178 179 | Column Name | Type | Explanation | 180 | ----------- | ------ | ------------------------------------------------------------ | 181 | Time | int | Number of seconds elapsed between this transaction and the first transaction in the dataset | 182 | V1 ~ V28 | double | May be result of a PCA Dimensionality reduction to protect user identities and sensitive features | 183 | Amount | double | Transaction amount | 184 | Class | int | 1 for fraudulent transactions, 0 otherwise | 185 186 We split the data into training, validation, and test sets in an 8:1:1 ratio. Since the training process is not the focus. we won't go into much detail here. We treat the test set as new data appearing in the production process, and store it in MO. The DDL is as follows: 187 188 ```sql 189 create table credit_card_transaction( 190 id int auto_increment primary key, 191 time int, 192 features json, -- store V1 ~ V28, using json list 193 amount decimal(20,2) 194 ); 195 ``` 196 197 ### Inference 198 199 After the model is trained, we can write the inference function. The core code is as follows: 200 201 ```python 202 import decimal 203 import os 204 from typing import List 205 206 import joblib 207 import numpy as np 208 209 # model path 210 model_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'model_with_scaler') 211 212 213 def detect(featuresList: List[List[int]], amountList: List[decimal.Decimal]) -> List[bool]: 214 # load model 215 model_with_scaler = joblib.load(model_path) 216 217 columns_features = np.array(featuresList) 218 column_amount = np.array(amountList, dtype='float').reshape(-1, 1) 219 # normalize the amount 220 column_amount = model_with_scaler['amount_scaler'].transform(column_amount) 221 data = np.concatenate((columns_features, column_amount), axis=1) 222 # model inference 223 predictions = model_with_scaler['model'].predict(data) 224 return [pred == 1 for pred in predictions.tolist()] 225 226 # enable vector option 227 detect.vector = True 228 ``` 229 230 Then write setup.py to build the wheel package: 231 232 ```python 233 from setuptools import setup, find_packages 234 235 setup( 236 name="detect", 237 version="1.0.0", 238 packages=find_packages(), 239 package_data={ 240 'credit': ['model_with_scaler'] # 模型 241 }, 242 ) 243 ``` 244 245 The project directory structure is as follows: 246 247 ``` 248 |-- demo/ 249 |-- credit/ 250 |-- __init__.py 251 |-- detection.py # inference function 252 |-- model_with_scaler # model 253 |-- setup.py 254 ``` 255 256 Build the wheel package detect-1.0.0-py3-none-any.whl with the command `python setup.py bdist_wheel`. 257 258 Create the function: 259 260 ```sql 261 create or replace function py_detect(features json, amount decimal) returns bool language python import 262 -- replace with the directory where the wheel package is located 263 'your_code_path/detect-1.0.0-py3-none-any.whl' 264 -- the function detect in the module credit 265 handler 'credit.detect'; 266 ``` 267 268 Call the function: 269 270 ```sql 271 select id, py_detect(features, amount) as is_fraud from credit_card_transaction limit 10; 272 ``` 273 274 Output: 275 276 ``` 277 +---------+----------+ 278 | id | is_fraud | 279 +---------+----------+ 280 | 1 | false | 281 | 2 | false | 282 | 3 | true | 283 | 4 | false | 284 | 5 | false | 285 | 6 | false | 286 | 7 | false | 287 | 8 | true | 288 | 9 | false | 289 | 10 | false | 290 +---------+----------+ 291 ``` 292 293 Now, we have completed the inference of task credit card fraud detection in MO. 294 295 From this example, it's clear that we can easily use Python UDFs to tackle tasks that SQL alone cannot handle. Python UDFs not only extend the semantics of SQL but also spare us from writing data movement and transformation programs manually, greatly enhancing the development efficiency.