github.com/minio/minio@v0.0.0-20240328213742-3f72439b8a27/docs/select/README.md (about)

     1  # Select API Quickstart Guide [![Slack](https://slack.min.io/slack?type=svg)](https://slack.min.io)
     2  
     3  Traditional retrieval of objects is always as whole entities, i.e GetObject for a 5 GiB object, will always return 5 GiB of data. S3 Select API allows us to retrieve a subset of data by using simple SQL expressions. By using Select API to retrieve only the data needed by the application, drastic performance improvements can be achieved.
     4  
     5  You can use the Select API to query objects with following features:
     6  
     7  - Objects must be in CSV, JSON, or Parquet(*) format.
     8  - UTF-8 is the only encoding type the Select API supports.
     9  - GZIP or BZIP2 - CSV and JSON files can be compressed using GZIP, BZIP2, [ZSTD](https://facebook.github.io/zstd/), and streaming formats of [LZ4](https://lz4.github.io/lz4/), [S2](https://github.com/klauspost/compress/tree/master/s2#s2-compression) and [SNAPPY](http://google.github.io/snappy/).
    10  - Parquet API supports columnar compression for  using GZIP, Snappy, LZ4. Whole object compression is not supported for Parquet objects.
    11  - Server-side encryption - The Select API supports querying objects that are protected with server-side encryption.
    12  
    13  Type inference and automatic conversion of values is performed based on the context when the value is un-typed (such as when reading CSV data). If present, the CAST function overrides automatic conversion.
    14  
    15  The [mc sql](https://min.io/docs/minio/linux/reference/minio-mc/mc-sql.html) command can be used for executing queries using the command line.
    16  
    17  (*) Parquet is disabled on the MinIO server by default. See below how to enable it.
    18  
    19  ## Enabling Parquet Format
    20  
    21  Parquet is DISABLED by default since hostile crafted input can easily crash the server.
    22  
    23  If you are in a controlled environment where it is safe to assume no hostile content can be uploaded to your cluster you can safely enable Parquet.
    24  To enable Parquet set the environment variable `MINIO_API_SELECT_PARQUET=on`.
    25  
    26  ## Example using Python API
    27  
    28  ### 1. Prerequisites
    29  
    30  - Install MinIO Server from [here](https://min.io/docs/minio/linux/index.html#procedure).
    31  - Familiarity with AWS S3 API.
    32  - Familiarity with Python and installing dependencies.
    33  
    34  ### 2. Install boto3
    35  
    36  Install `aws-sdk-python` from AWS SDK for Python official docs [here](https://aws.amazon.com/sdk-for-python/)
    37  
    38  ### 3. Example
    39  
    40  As an example, let us take a gzip compressed CSV file. Without S3 Select, we would need to download, decompress and process the entire CSV to get the data you needed. With Select API, can use a simple SQL expression to return only the data from the CSV you’re interested in, instead of retrieving the entire object. Following Python example shows how to retrieve the first column `Location` from an object containing data in CSV format.
    41  
    42  Please replace ``endpoint_url``,``aws_access_key_id``, ``aws_secret_access_key``, ``Bucket`` and ``Key`` with your local setup in this ``select.py`` file.
    43  
    44  ```py
    45  #!/usr/bin/env/env python3
    46  import boto3
    47  
    48  s3 = boto3.client('s3',
    49                    endpoint_url='http://localhost:9000',
    50                    aws_access_key_id='minio',
    51                    aws_secret_access_key='minio123',
    52                    region_name='us-east-1')
    53  
    54  r = s3.select_object_content(
    55      Bucket='mycsvbucket',
    56      Key='sampledata/TotalPopulation.csv.gz',
    57      ExpressionType='SQL',
    58      Expression="select * from s3object s where s.Location like '%United States%'",
    59      InputSerialization={
    60          'CSV': {
    61              "FileHeaderInfo": "USE",
    62          },
    63          'CompressionType': 'GZIP',
    64      },
    65      OutputSerialization={'CSV': {}},
    66  )
    67  
    68  for event in r['Payload']:
    69      if 'Records' in event:
    70          records = event['Records']['Payload'].decode('utf-8')
    71          print(records)
    72      elif 'Stats' in event:
    73          statsDetails = event['Stats']['Details']
    74          print("Stats details bytesScanned: ")
    75          print(statsDetails['BytesScanned'])
    76          print("Stats details bytesProcessed: ")
    77          print(statsDetails['BytesProcessed'])
    78  ```
    79  
    80  ## 4. Run the Program
    81  
    82  Upload a sample dataset to MinIO using the following commands.
    83  
    84  ```sh
    85  curl "https://population.un.org/wpp/Download/Files/1_Indicators%20(Standard)/CSV_FILES/WPP2019_TotalPopulationBySex.csv" > TotalPopulation.csv
    86  mc mb myminio/mycsvbucket
    87  gzip TotalPopulation.csv
    88  mc cp TotalPopulation.csv.gz myminio/mycsvbucket/sampledata/
    89  ```
    90  
    91  Now let us proceed to run our select example to query for `Location` which matches `United States`.
    92  
    93  ```sh
    94  $ python3 select.py
    95  840,United States of America,2,Medium,1950,1950.5,79233.218,79571.179,158804.395
    96  
    97  840,United States of America,2,Medium,1951,1951.5,80178.933,80726.116,160905.035
    98  
    99  840,United States of America,2,Medium,1952,1952.5,81305.206,82019.632,163324.851
   100  
   101  840,United States of America,2,Medium,1953,1953.5,82565.875,83422.307,165988.190
   102  ....
   103  ....
   104  ....
   105  
   106  Stats details bytesScanned:
   107  6758866
   108  Stats details bytesProcessed:
   109  25786743
   110  ```
   111  
   112  For a more detailed SELECT SQL reference, please see [here](https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-select.html)
   113  
   114  ## 5. Explore Further
   115  
   116  - [Use `mc` with MinIO Server](https://min.io/docs/minio/linux/reference/minio-mc.html)
   117  - [Use `mc sql` with MinIO Server](https://min.io/docs/minio/linux/reference/minio-mc/mc-sql.html#command-mc.sql)
   118  - [Use `minio-go` SDK with MinIO Server](https://min.io/docs/minio/linux/developers/go/minio-go.html)
   119  - [Use `aws-cli` with MinIO Server](https://min.io/docs/minio/linux/integrations/aws-cli-with-minio.html)
   120  - [The MinIO documentation website](https://min.io/docs/minio/linux/index.html)
   121  
   122  ## 6. Implementation Status
   123  
   124  - Full AWS S3 [SELECT SQL](https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-select.html) syntax is supported.
   125  - All [operators](https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-operators.html) are supported.
   126  - All aggregation, conditional, type-conversion and string functions are supported.
   127  - JSON path expressions such as `FROM S3Object[*].path` are not yet evaluated.
   128  - Large numbers (outside of the signed 64-bit range) are not yet supported.
   129  - The Date [functions](https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-date.html) `DATE_ADD`, `DATE_DIFF`, `EXTRACT` and `UTCNOW` along with type conversion using `CAST` to the `TIMESTAMP` data type are currently supported.
   130  - AWS S3's [reserved keywords](https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-keyword-list.html) list is not yet respected.
   131  - CSV input fields (even quoted) cannot contain newlines even if `RecordDelimiter` is something else.