storj.io/minio@v0.0.0-20230509071714-0cbc90f649b1/docs/select/README.md (about)

     1  # Select API Quickstart Guide [![Slack](https://slack.min.io/slack?type=svg)](https://slack.min.io)
     2  Traditional retrieval of objects is always as whole entities, i.e GetObject for a 5 GiB object, will always return 5 GiB of data. S3 Select API allows us to retrieve a subset of data by using simple SQL expressions. By using Select API to retrieve only the data needed by the application, drastic performance improvements can be achieved.
     3  
     4  You can use the Select API to query objects with following features:
     5  
     6  - Objects must be in CSV, JSON, or Parquet(*) format. 
     7  - UTF-8 is the only encoding type the Select API supports.
     8  - GZIP or BZIP2 - CSV and JSON files can be compressed using GZIP or BZIP2. The Select API supports columnar compression for Parquet using GZIP, Snappy, LZ4. Whole object compression is not supported for Parquet objects.
     9  - Server-side encryption - The Select API supports querying objects that are protected with server-side encryption.
    10  
    11  Type inference and automatic conversion of values is performed based on the context when the value is un-typed (such as when reading CSV data). If present, the CAST function overrides automatic conversion.
    12  
    13  The [mc sql](https://docs.min.io/docs/minio-client-complete-guide.html#sql) command can be used for executing queries using the command line. 
    14  
    15  (*) Parquet is disabled on the MinIO server by default. See below how to enable it.
    16  
    17  ## Enabling Parquet Format
    18  
    19  Parquet is DISABLED by default since hostile crafted input can easily crash the server.
    20  
    21  If you are in a controlled environment where it is safe to assume no hostile content can be uploaded to your cluster you can safely enable Parquet.
    22  To enable Parquet set the environment variable `MINIO_API_SELECT_PARQUET=on`.
    23  
    24  # Example using Python API 
    25  
    26  ## 1. Prerequisites
    27  - Install MinIO Server from [here](http://docs.min.io/docs/minio-quickstart-guide).
    28  - Familiarity with AWS S3 API.
    29  - Familiarity with Python and installing dependencies.
    30  
    31  ## 2. Install boto3
    32  Install `aws-sdk-python` from AWS SDK for Python official docs [here](https://aws.amazon.com/sdk-for-python/)
    33  
    34  ## 3. Example
    35  As an example, let us take a gzip compressed CSV file. Without S3 Select, we would need to download, decompress and process the entire CSV to get the data you needed. With Select API, can use a simple SQL expression to return only the data from the CSV you’re interested in, instead of retrieving the entire object. Following Python example shows how to retrieve the first column `Location` from an object containing data in CSV format.
    36  
    37  Please replace ``endpoint_url``,``aws_access_key_id``, ``aws_secret_access_key``, ``Bucket`` and ``Key`` with your local setup in this ``select.py`` file.
    38  
    39  ```py
    40  #!/usr/bin/env/env python3
    41  import boto3
    42  
    43  s3 = boto3.client('s3',
    44                    endpoint_url='http://localhost:9000',
    45                    aws_access_key_id='minio',
    46                    aws_secret_access_key='minio123',
    47                    region_name='us-east-1')
    48  
    49  r = s3.select_object_content(
    50      Bucket='mycsvbucket',
    51      Key='sampledata/TotalPopulation.csv.gz',
    52      ExpressionType='SQL',
    53      Expression="select * from s3object s where s.Location like '%United States%'",
    54      InputSerialization={
    55          'CSV': {
    56              "FileHeaderInfo": "USE",
    57          },
    58          'CompressionType': 'GZIP',
    59      },
    60      OutputSerialization={'CSV': {}},
    61  )
    62  
    63  for event in r['Payload']:
    64      if 'Records' in event:
    65          records = event['Records']['Payload'].decode('utf-8')
    66          print(records)
    67      elif 'Stats' in event:
    68          statsDetails = event['Stats']['Details']
    69          print("Stats details bytesScanned: ")
    70          print(statsDetails['BytesScanned'])
    71          print("Stats details bytesProcessed: ")
    72          print(statsDetails['BytesProcessed'])
    73  ```
    74  
    75  ## 4. Run the Program
    76  Upload a sample dataset to MinIO using the following commands.
    77  ```sh
    78  $ curl "https://population.un.org/wpp/Download/Files/1_Indicators%20(Standard)/CSV_FILES/WPP2019_TotalPopulationBySex.csv" > TotalPopulation.csv
    79  $ mc mb myminio/mycsvbucket
    80  $ gzip TotalPopulation.csv
    81  $ mc cp TotalPopulation.csv.gz myminio/mycsvbucket/sampledata/
    82  ```
    83  
    84  Now let us proceed to run our select example to query for `Location` which matches `United States`.
    85  ```sh
    86  $ python3 select.py
    87  840,United States of America,2,Medium,1950,1950.5,79233.218,79571.179,158804.395
    88  
    89  840,United States of America,2,Medium,1951,1951.5,80178.933,80726.116,160905.035
    90  
    91  840,United States of America,2,Medium,1952,1952.5,81305.206,82019.632,163324.851
    92  
    93  840,United States of America,2,Medium,1953,1953.5,82565.875,83422.307,165988.190
    94  ....
    95  ....
    96  ....
    97  
    98  Stats details bytesScanned:
    99  6758866
   100  Stats details bytesProcessed:
   101  25786743
   102  ```
   103  
   104  For a more detailed SELECT SQL reference, please see [here](https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-select.html)
   105  
   106  ## 5. Explore Further
   107  - [Use `mc` with MinIO Server](https://docs.min.io/docs/minio-client-quickstart-guide)
   108  - [Use `mc sql` with MinIO Server](https://docs.min.io/docs/minio-client-complete-guide.html#sql)
   109  - [Use `minio-go` SDK with MinIO Server](https://docs.min.io/docs/golang-client-quickstart-guide)
   110  - [Use `aws-cli` with MinIO Server](https://docs.min.io/docs/aws-cli-with-minio)
   111  - [Use `s3cmd` with MinIO Server](https://docs.min.io/docs/s3cmd-with-minio)
   112  - [The MinIO documentation website](https://docs.min.io)
   113  
   114  ## 6. Implementation Status
   115  - Full AWS S3 [SELECT SQL](https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-select.html) syntax is supported.
   116  - All [operators](https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-operators.html) are supported.
   117  - All aggregation, conditional, type-conversion and string functions are supported.
   118  - JSON path expressions such as `FROM S3Object[*].path` are not yet evaluated.
   119  - Large numbers (outside of the signed 64-bit range) are not yet supported.
   120  - The Date [functions](https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-date.html) `DATE_ADD`, `DATE_DIFF`, `EXTRACT` and `UTCNOW` along with type conversion using `CAST` to the `TIMESTAMP` data type are currently supported.
   121  - AWS S3's [reserved keywords](https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-keyword-list.html) list is not yet respected.
   122  - CSV input fields (even quoted) cannot contain newlines even if `RecordDelimiter` is something else.