github.com/matrixorigin/matrixone@v1.2.0/pkg/udf/pythonservice/demo/README.md

github.com/matrixorigin/matrixone@v1.2.0/pkg/udf/pythonservice/demo/README.md (about)

     1  # MO Python UDF Demo
     2  
     3  This document briefly describes the use of python UDFs in MO, showing the flexibility and convenience of python UDFs.
     4  
     5  ## Standalone Deployment
     6  
     7  1. Before the deployment, we need to make sure we have [python3](https://www.python.org/downloads/) locally and install the grpc package in python:
     8  
     9  ```
    10  pip install protobuf
    11  pip install grpcio
    12  ```
    13  
    14  2. Follow the steps in [Getting Started](https://docs.matrixorigin.cn/en/1.0.0-rc1/MatrixOne/Get-Started/install-standalone-matrixone/) to complete the installation, and modify the configuration of mo_ctl by running the following command before starting MO:
    15  
    16  ```shell
    17  mo_ctl set_conf MO_CONF_FILE="${MO_PATH}/matrixone/etc/launch-with-python-udf-server/launch.toml"
    18  ```
    19  
    20  3. Launch and connect to MO.
    21  
    22  ## "Hello world"
    23  
    24  Let's start with a simple function: a python UDF that adds two numbers together.
    25  
    26  Input the following SQL in the client to create the function:
    27  
    28  ```sql
    29  create or replace function py_add(a int, b int) returns int language python as 
    30  $$
    31  def add(a, b):
    32    return a + b
    33  $$
    34  handler 'add';
    35  ```
    36  
    37  > Note: currently, matrixone does not perform syntax checking on the python script in UDF body.
    38  
    39  This SQL defines a function called py_add that takes two parameters of type `int` and returns the sum of the two parameters. Python code comes after `as`. `handler` represents the name of the python function that will be called.
    40  
    41  Now we can use the function:
    42  
    43  ```sql
    44  select py_add(1,2);
    45  ```
    46  
    47  Output:
    48  
    49  ```
    50  +--------------+
    51  | py_add(1, 2) |
    52  +--------------+
    53  |            3 |
    54  +--------------+
    55  ```
    56  
    57  We can remove the function when we no longer need it:
    58  
    59  ```sql
    60  drop function py_add(int, int);
    61  ```
    62  
    63  We now have a basic understanding of how to use python UDFs.
    64  
    65  ## Advancement
    66  
    67  In this section, we'll look at some more advanced uses of python UDFs.
    68  
    69  ### Imported Python UDF
    70  
    71  In the previous section, the py_add function is written directly in SQL, and the python code is written after the keyword `as`. We call this form **embedded UDF**. However, if the function logic is very complex, writing it directly in SQL is not a good choice. It bloats the SQL statements and makes the code less maintainable. In this case, we can choose another way to create python UDFs: importing python code from an external file, known as **imported UDF**. There are two types of files that are supported: (1) a single python file, and (2) the wheel package. Let's continue with the example of the py_add function from the previous section, and write the python code to a file.
    72  
    73  The code for ./add func.py is as follows:
    74  
    75  ```python
    76  def add(a, b):
    77    return a + b
    78  ```
    79  
    80  Input the following SQL in the client to create the function(if you connect to matrixone via `mysql` cli, `--local-infile` flag is required to allow the cli reading local python files）:
    81  
    82  ```sql
    83  create or replace function py_add(a int, b int) returns int language python import 
    84  './add_func.py' 
    85  handler 'add';
    86  ```
    87  
    88  We use the keyword `import` to import the flie add_func.py from the current working directory.
    89  
    90  Call the function:
    91  
    92  ```sql
    93  select py_add(1,2);
    94  ```
    95  
    96  We still get the same output as in the previous section:
    97  
    98  ```
    99  +--------------+
   100  | py_add(1, 2) |
   101  +--------------+
   102  |            3 |
   103  +--------------+ 
   104  ```
   105  
   106  Importing a wheel package and a single python file share the same syntax, so we won't go into detail here. In the final example, we will create a python UDF by importing a wheel package.
   107  
   108  ### Vector Option
   109  
   110  In some scenarios, we may want a python function to receive multiple tuples at once to improve the efficiency. For example, in model inference, we typically process data in batches, where each batch is a vector of tuples. The vector option is designed to handle this situation. We'll continue to use the py_add function as an example to demonstrate the usage of the vector option.
   111  
   112  Input the following SQL in the client to create the function:
   113  
   114  ```sql
   115  create or replace function py_add(a int, b int) returns int language python as 
   116  $$
   117  def add(a, b):  # a, b are list
   118    return [a[i] + b[i] for i in range(len(a))]
   119  add.vector = True
   120  $$
   121  handler 'add';
   122  ```
   123  
   124  We use `add.vector = True` to indicate that the python function add should receive two integer lists (vectors) as input, rather than individual integer values.
   125  
   126  Call the function:
   127  
   128  ```sql
   129  select py_add(1,2);
   130  ```
   131  
   132  We still get the same output as in the previous section:
   133  
   134  ```
   135  +--------------+
   136  | py_add(1, 2) |
   137  +--------------+
   138  |            3 |
   139  +--------------+ 
   140  ```
   141  
   142  With the vector option, we have the freedom to choose how the function processes its input, whether it's one tuple at a time or multiple tuples at once.
   143  
   144  ### Type Mapping
   145  
   146  The correspondence between MO types and python types is as follows:
   147  
   148  | MO Types                                                 | Python Types                |
   149  | -------------------------------------------------------- | --------------------------- |
   150  | bool                                                     | bool                        |
   151  | int8, int16, int32, int64, uint8, uint16, uint32, uint64 | int                         |
   152  | float32, float64                                         | float                       |
   153  | char, varchar, text, uuid                                | str                         |
   154  | json                                                     | str, int, float, list, dict |
   155  | time                                                     | datetime.timedelta          |
   156  | date                                                     | datetime.date               |
   157  | datetime, timestamp                                      | datetime.datetime           |
   158  | decimal64, decimal128                                    | decimal.Decimal             |
   159  | binary, varbinary, blob                                  | bytes                       |
   160  
   161  ### Function Overloading and Matching
   162  
   163  Overloading is not supported in MO UDF yet, all function names must be unique across an MO cluster.
   164  
   165  ## Example: Credit Card Fraud Detection
   166  
   167  In this section, we use the example "Credit card fraud detection" to illustrate how python UDFs can be used in a machine learning inference pipeline. (Detailed code in [github](https://github.com/matrixorigin/matrixone/tree/main/pkg/udf/pythonservice/demo))
   168  
   169  We need to ensure that the local python environment has numpy and scikit-learn installed.
   170  
   171  ### Background
   172  
   173  It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase. (Details in kaggle [Credit Card Fraud Detection](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud))
   174  
   175  ### Data
   176  
   177  The dataset contains transactions made by credit cards in September 2013 by European cardholders. The data format is as follows:
   178  
   179  | Column Name | Type   | Explanation                                                  |
   180  | ----------- | ------ | ------------------------------------------------------------ |
   181  | Time        | int    | Number of seconds elapsed between this transaction and the first transaction in the dataset |
   182  | V1 ~ V28    | double | May be result of a PCA Dimensionality reduction to protect user identities and sensitive features |
   183  | Amount      | double | Transaction amount                                           |
   184  | Class       | int    | 1 for fraudulent transactions, 0 otherwise                   |
   185  
   186  We split the data into training, validation, and test sets in an 8:1:1 ratio. Since the training process is not the focus. we won't go into much detail here. We treat the test set as new data appearing in the production process, and store it in MO. The DDL is as follows:
   187  
   188  ```sql
   189  create table credit_card_transaction(
   190      id int auto_increment primary key,
   191      time int,
   192      features json, -- store V1 ~ V28, using json list
   193      amount decimal(20,2)
   194  );
   195  ```
   196  
   197  ### Inference
   198  
   199  After the model is trained, we can write the inference function. The core code is as follows:
   200  
   201  ```python
   202  import decimal
   203  import os
   204  from typing import List
   205  
   206  import joblib
   207  import numpy as np
   208  
   209  # model path
   210  model_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'model_with_scaler')
   211  
   212  
   213  def detect(featuresList: List[List[int]], amountList: List[decimal.Decimal]) -> List[bool]:
   214      # load model
   215      model_with_scaler = joblib.load(model_path)
   216  
   217      columns_features = np.array(featuresList)
   218      column_amount = np.array(amountList, dtype='float').reshape(-1, 1)
   219      # normalize the amount
   220      column_amount = model_with_scaler['amount_scaler'].transform(column_amount)
   221      data = np.concatenate((columns_features, column_amount), axis=1)
   222      # model inference
   223      predictions = model_with_scaler['model'].predict(data)
   224      return [pred == 1 for pred in predictions.tolist()]
   225  
   226  # enable vector option
   227  detect.vector = True
   228  ```
   229  
   230  Then write setup.py to build the wheel package:
   231  
   232  ```python
   233  from setuptools import setup, find_packages
   234  
   235  setup(
   236      name="detect",
   237      version="1.0.0",
   238      packages=find_packages(),
   239      package_data={
   240          'credit': ['model_with_scaler']  # 模型
   241      },
   242  )
   243  ```
   244  
   245  The project directory structure is as follows:
   246  
   247  ```
   248  |-- demo/
   249  	|-- credit/
   250  		|-- __init__.py
   251  		|-- detection.py		# inference function
   252  		|-- model_with_scaler	# model
   253  	|-- setup.py
   254  ```
   255  
   256  Build the wheel package detect-1.0.0-py3-none-any.whl with the command `python setup.py bdist_wheel`.
   257  
   258  Create the function:
   259  
   260  ```sql
   261  create or replace function py_detect(features json, amount decimal) returns bool language python import 
   262  -- replace with the directory where the wheel package is located
   263  'your_code_path/detect-1.0.0-py3-none-any.whl' 
   264  -- the function detect in the module credit
   265  handler 'credit.detect'; 
   266  ```
   267  
   268  Call the function:
   269  
   270  ```sql
   271  select id, py_detect(features, amount) as is_fraud from credit_card_transaction limit 10;
   272  ```
   273  
   274  Output：
   275  
   276  ```
   277  +---------+----------+
   278  | id      | is_fraud |
   279  +---------+----------+
   280  |       1 | false    |
   281  |       2 | false    |
   282  |       3 | true     |
   283  |       4 | false    |
   284  |       5 | false    |
   285  |       6 | false    |
   286  |       7 | false    |
   287  |       8 | true     |
   288  |       9 | false    |
   289  |      10 | false    |
   290  +---------+----------+
   291  ```
   292  
   293  Now, we have completed the inference of task credit card fraud detection in MO.
   294  
   295  From this example, it's clear that we can easily use Python UDFs to tackle tasks that SQL alone cannot handle. Python UDFs not only extend the semantics of SQL but also spare us from writing data movement and transformation programs manually, greatly enhancing the development efficiency.