github.com/minio/minio@v0.0.0-20240328213742-3f72439b8a27/docs/select/README.md (about) 1 # Select API Quickstart Guide [![Slack](https://slack.min.io/slack?type=svg)](https://slack.min.io) 2 3 Traditional retrieval of objects is always as whole entities, i.e GetObject for a 5 GiB object, will always return 5 GiB of data. S3 Select API allows us to retrieve a subset of data by using simple SQL expressions. By using Select API to retrieve only the data needed by the application, drastic performance improvements can be achieved. 4 5 You can use the Select API to query objects with following features: 6 7 - Objects must be in CSV, JSON, or Parquet(*) format. 8 - UTF-8 is the only encoding type the Select API supports. 9 - GZIP or BZIP2 - CSV and JSON files can be compressed using GZIP, BZIP2, [ZSTD](https://facebook.github.io/zstd/), and streaming formats of [LZ4](https://lz4.github.io/lz4/), [S2](https://github.com/klauspost/compress/tree/master/s2#s2-compression) and [SNAPPY](http://google.github.io/snappy/). 10 - Parquet API supports columnar compression for using GZIP, Snappy, LZ4. Whole object compression is not supported for Parquet objects. 11 - Server-side encryption - The Select API supports querying objects that are protected with server-side encryption. 12 13 Type inference and automatic conversion of values is performed based on the context when the value is un-typed (such as when reading CSV data). If present, the CAST function overrides automatic conversion. 14 15 The [mc sql](https://min.io/docs/minio/linux/reference/minio-mc/mc-sql.html) command can be used for executing queries using the command line. 16 17 (*) Parquet is disabled on the MinIO server by default. See below how to enable it. 18 19 ## Enabling Parquet Format 20 21 Parquet is DISABLED by default since hostile crafted input can easily crash the server. 22 23 If you are in a controlled environment where it is safe to assume no hostile content can be uploaded to your cluster you can safely enable Parquet. 24 To enable Parquet set the environment variable `MINIO_API_SELECT_PARQUET=on`. 25 26 ## Example using Python API 27 28 ### 1. Prerequisites 29 30 - Install MinIO Server from [here](https://min.io/docs/minio/linux/index.html#procedure). 31 - Familiarity with AWS S3 API. 32 - Familiarity with Python and installing dependencies. 33 34 ### 2. Install boto3 35 36 Install `aws-sdk-python` from AWS SDK for Python official docs [here](https://aws.amazon.com/sdk-for-python/) 37 38 ### 3. Example 39 40 As an example, let us take a gzip compressed CSV file. Without S3 Select, we would need to download, decompress and process the entire CSV to get the data you needed. With Select API, can use a simple SQL expression to return only the data from the CSV you’re interested in, instead of retrieving the entire object. Following Python example shows how to retrieve the first column `Location` from an object containing data in CSV format. 41 42 Please replace ``endpoint_url``,``aws_access_key_id``, ``aws_secret_access_key``, ``Bucket`` and ``Key`` with your local setup in this ``select.py`` file. 43 44 ```py 45 #!/usr/bin/env/env python3 46 import boto3 47 48 s3 = boto3.client('s3', 49 endpoint_url='http://localhost:9000', 50 aws_access_key_id='minio', 51 aws_secret_access_key='minio123', 52 region_name='us-east-1') 53 54 r = s3.select_object_content( 55 Bucket='mycsvbucket', 56 Key='sampledata/TotalPopulation.csv.gz', 57 ExpressionType='SQL', 58 Expression="select * from s3object s where s.Location like '%United States%'", 59 InputSerialization={ 60 'CSV': { 61 "FileHeaderInfo": "USE", 62 }, 63 'CompressionType': 'GZIP', 64 }, 65 OutputSerialization={'CSV': {}}, 66 ) 67 68 for event in r['Payload']: 69 if 'Records' in event: 70 records = event['Records']['Payload'].decode('utf-8') 71 print(records) 72 elif 'Stats' in event: 73 statsDetails = event['Stats']['Details'] 74 print("Stats details bytesScanned: ") 75 print(statsDetails['BytesScanned']) 76 print("Stats details bytesProcessed: ") 77 print(statsDetails['BytesProcessed']) 78 ``` 79 80 ## 4. Run the Program 81 82 Upload a sample dataset to MinIO using the following commands. 83 84 ```sh 85 curl "https://population.un.org/wpp/Download/Files/1_Indicators%20(Standard)/CSV_FILES/WPP2019_TotalPopulationBySex.csv" > TotalPopulation.csv 86 mc mb myminio/mycsvbucket 87 gzip TotalPopulation.csv 88 mc cp TotalPopulation.csv.gz myminio/mycsvbucket/sampledata/ 89 ``` 90 91 Now let us proceed to run our select example to query for `Location` which matches `United States`. 92 93 ```sh 94 $ python3 select.py 95 840,United States of America,2,Medium,1950,1950.5,79233.218,79571.179,158804.395 96 97 840,United States of America,2,Medium,1951,1951.5,80178.933,80726.116,160905.035 98 99 840,United States of America,2,Medium,1952,1952.5,81305.206,82019.632,163324.851 100 101 840,United States of America,2,Medium,1953,1953.5,82565.875,83422.307,165988.190 102 .... 103 .... 104 .... 105 106 Stats details bytesScanned: 107 6758866 108 Stats details bytesProcessed: 109 25786743 110 ``` 111 112 For a more detailed SELECT SQL reference, please see [here](https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-select.html) 113 114 ## 5. Explore Further 115 116 - [Use `mc` with MinIO Server](https://min.io/docs/minio/linux/reference/minio-mc.html) 117 - [Use `mc sql` with MinIO Server](https://min.io/docs/minio/linux/reference/minio-mc/mc-sql.html#command-mc.sql) 118 - [Use `minio-go` SDK with MinIO Server](https://min.io/docs/minio/linux/developers/go/minio-go.html) 119 - [Use `aws-cli` with MinIO Server](https://min.io/docs/minio/linux/integrations/aws-cli-with-minio.html) 120 - [The MinIO documentation website](https://min.io/docs/minio/linux/index.html) 121 122 ## 6. Implementation Status 123 124 - Full AWS S3 [SELECT SQL](https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-select.html) syntax is supported. 125 - All [operators](https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-operators.html) are supported. 126 - All aggregation, conditional, type-conversion and string functions are supported. 127 - JSON path expressions such as `FROM S3Object[*].path` are not yet evaluated. 128 - Large numbers (outside of the signed 64-bit range) are not yet supported. 129 - The Date [functions](https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-date.html) `DATE_ADD`, `DATE_DIFF`, `EXTRACT` and `UTCNOW` along with type conversion using `CAST` to the `TIMESTAMP` data type are currently supported. 130 - AWS S3's [reserved keywords](https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-keyword-list.html) list is not yet respected. 131 - CSV input fields (even quoted) cannot contain newlines even if `RecordDelimiter` is something else.