---
title: Auditing
parent: Reference
description: Auditing is a lakeFS Cloud capability that tracks events and activities performed within the system. Audit logs capture who accessed the system, what actions were taken, and when they occurred.
redirect_from:
  - /audit.html
  - /reference/audit.html
  - /cloud/auditing.html
---

# Auditing
{: .d-inline-block }
lakeFS Cloud
{: .label .label-green }

{: .note}
> Auditing is only available for [lakeFS Cloud]({% link understand/lakefs-cloud.md %}).

{: .warning }
> Please note that as of January 2024, the queryable interface in the lakeFS Cloud UI has been removed in favor of direct access to the lakeFS audit logs. This document describes how to set up and query this information using [AWS Glue](https://aws.amazon.com/glue/) as a reference.

The lakeFS audit log lets you view all relevant user actions in a clear, organized table: when each action was performed, by whom, and what was done.

This can be useful for several purposes, including:

1. Compliance - Audit logs can show what data users accessed, as well as any changes they made to user management.

2. Troubleshooting - If something unexpected changes in your underlying object store, such as a big file suddenly breaking into thousands of smaller files, you can use the audit log to find out what action led to the change.

## Setting up access to Audit Logs on AWS S3

Access to the audit logs is provided through an [AWS S3 Access Point](https://aws.amazon.com/s3/features/access-points/).

There are different ways to interact with an access point (see [Using access points in AWS](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-access-points.html)).

The initial setup:

1. Take note of the IAM Role ARN that will be used to access the data. This should be the user or role used by, for example, Athena.
1. [Reach out to customer success](mailto:support@treeverse.io?subject=ARN to use for audit logs) and provide this ARN. Once the role ARN is received, an access point will be created and you should get back the following details:
   1. S3 bucket (e.g. `arn:aws:s3:::lakefs-audit-logs-us-east-1-production`)
   2. S3 URI to an access point (e.g. `s3://arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>`)
   3. Access Point alias. You can use this alias instead of the bucket name or Access Point ARN to access data through the Access Point (e.g. `lakefs-logs-<generated>-s3alias`).
1. Update your IAM Role policy and trust policy if required.

A minimal example of an IAM policy for 2 lakeFS installations in 2 regions (`us-east-1`, `us-west-2`):

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::lakefs-audit-logs-us-east-1-production",
                "arn:aws:s3:::lakefs-audit-logs-us-east-1-production/*",
                "arn:aws:s3:::lakefs-logs-<generated>-s3alias/*",
                "arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>",
                "arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>/*"
            ],
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "etl/v1/data/region=<region_a>/organization=org-<organization>/*",
                        "etl/v1/data/region=<region_b>/organization=org-<organization>/*"
                    ]
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": [
                "arn:aws:s3:::lakefs-audit-logs-us-east-1-production",
                "arn:aws:s3:::lakefs-audit-logs-us-east-1-production/etl/v1/data/region=<region_a>/organization=org-<organization>/*",
                "arn:aws:s3:::lakefs-audit-logs-us-east-1-production/etl/v1/data/region=<region_b>/organization=org-<organization>/*",
                "arn:aws:s3:::lakefs-logs-<generated>-s3alias/*",
                "arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>/object/etl/v1/data/region=<region_a>/organization=org-<organization>/*",
                "arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>/object/etl/v1/data/region=<region_b>/organization=org-<organization>/*"
            ]
        },
        {
            "Action": [
                "kms:Decrypt"
            ],
            "Resource": [
                "arn:aws:kms:us-east-1:<treeverse-id>:key/<encryption-key-id>"
            ],
            "Effect": "Allow"
        }
    ]
}
```

A trust policy example that allows any principal in your account to assume the role above:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<YOUR_ACCOUNT_ID>:root"
            },
            "Action": "sts:AssumeRole",
            "Condition": {}
        }
    ]
}
```
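
If you prefer to script this step, the role can be created and the policies above attached programmatically. Below is a minimal sketch using boto3; the role name `lakefs-audit-logs-reader` and the local policy file names are placeholders chosen for illustration, not values provided by lakeFS:

```python
import boto3

iam = boto3.client("iam")

# Create the reader role using the trust policy above (saved locally as trust-policy.json)
with open("trust-policy.json") as f:
    iam.create_role(
        RoleName="lakefs-audit-logs-reader",
        AssumeRolePolicyDocument=f.read(),
    )

# Attach the audit-logs read policy above as an inline policy (saved locally as audit-logs-policy.json)
with open("audit-logs-policy.json") as f:
    iam.put_role_policy(
        RoleName="lakefs-audit-logs-reader",
        PolicyName="lakefs-audit-logs-read",
        PolicyDocument=f.read(),
    )
```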


Authentication is done by assuming an IAM Role:

```bash
# Assume the role; the returned AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN are used for the commands below:
aws sts assume-role --role-arn arn:aws:iam::<your-aws-account>:role/<reader-role> --role-session-name <name>

# Verify the role was assumed
aws sts get-caller-identity

# List objects (can be used with --recursive) using the access point ARN
aws s3 ls arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>/etl/v1/data/region=<region>/organization=org-<organization>/

# Download an object locally via the S3 access point alias
aws s3api get-object --bucket lakefs-logs-<generated>-s3alias --key etl/v1/data/region=<region>/organization=org-<organization>/year=<YY>/month=<MM>/day=<DD>/hour=<HH>/<file>-snappy.parquet sample.parquet
```
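
The same flow can be scripted in Python. Below is a minimal sketch using boto3; the role ARN, alias, and prefix values are placeholders you would replace with the details received from customer success:

```python
import boto3

# Assume the reader role (placeholder ARN)
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::<your-aws-account>:role/<reader-role>",
    RoleSessionName="audit-logs-reader",
)["Credentials"]

# Create an S3 client with the temporary credentials
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# List audit log objects through the access point alias
resp = s3.list_objects_v2(
    Bucket="lakefs-logs-<generated>-s3alias",
    Prefix="etl/v1/data/region=<region>/organization=org-<organization>/year=2024/month=02/day=12/",
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```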

## Data layout

{: .note }
> The bucket name is used when creating the IAM policy, but the Access Point ARN and alias are what you actually use to access the data (e.g. from the AWS CLI, Spark, etc).

**Bucket Name:** `lakefs-audit-logs-us-east-1-production`

**Root prefix:** `etl/v1/data/region=<region>/organization=org-<organization-name>/`

**Files Path pattern:** All audit log files are in Parquet format and follow this pattern: `etl/v1/data/region=<region>/organization=org-<organization-name>/year=<YY>/month=<MM>/day=<DD>/hour=<HH>/*-snappy.parquet`

### Path Values

**region:** the lakeFS installation region (e.g. the region in the lakeFS URL: `https://<organization-name>.<region>.lakefscloud.io/`)

**organization:** found in the lakeFS URL `https://<organization-name>.<region>.lakefscloud.io/`. The value in the S3 path must be prefixed with `org-`, i.e. `org-<organization-name>`.

### Partitions

- year
- month
- day
- hour

### Example

As an example, here are paths for the "Acme" organization with 2 lakeFS installations:

```text
# ACME in us-east-1
etl/v1/data/region=us-east-1/organization=org-acme/year=2024/month=02/day=12/hour=13/log_abc-snappy.parquet

# ACME in us-west-2
etl/v1/data/region=us-west-2/organization=org-acme/year=2024/month=02/day=12/hour=13/log_xyz-snappy.parquet
```

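If you need to construct these prefixes programmatically, for example to download a specific hour of logs, a small helper like the following can be used. This is an illustrative sketch; `audit_log_prefix` is not part of any lakeFS SDK:

```python
from datetime import datetime, timezone

def audit_log_prefix(region: str, organization: str, ts: datetime) -> str:
    """Build the hourly audit-log prefix for a lakeFS Cloud installation."""
    return (
        f"etl/v1/data/region={region}/organization=org-{organization}/"
        f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/hour={ts.hour:02d}/"
    )

# Example: the prefix for Acme's us-east-1 installation on 2024-02-12, 13:00 UTC
print(audit_log_prefix("us-east-1", "acme", datetime(2024, 2, 12, 13, tzinfo=timezone.utc)))
# etl/v1/data/region=us-east-1/organization=org-acme/year=2024/month=02/day=12/hour=13/
```
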
## Schema

The files are in Parquet format and can be accessed directly from Spark or any client that can read Parquet files.
Using Spark's [`printSchema()`](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.printSchema.html) you can inspect the schema. The table below describes the latest schema, with comments on the important columns:

| column              | type   | description |
|---------------------|--------|-------------|
| `data_user`         | string | The internal user ID of the user making the request. If using an external IdP (e.g. SSO, Microsoft Entra, etc.), it will be the UID assigned by the IdP (see below for an example of how to map external IDs to emails in Python). |
| `data_repository`   | string | The repository ID relevant for this request. Currently only returned for s3_gateway requests. |
| `data_ref`          | string | The reference ID (tag, branch, ...) relevant for this request. Currently only returned for s3_gateway requests. |
| `data_status_code`  | int    | HTTP status code returned for this request. |
| `data_service_name` | string | Service name for the request. Either "rest_api" or "s3_gateway". |
| `data_request_id`   | string | Unique ID representing this request. |
| `data_path`         | string | HTTP path used for this request. |
| `data_operation_id` | string | Logical operation ID for this request, e.g. `list_objects`, `delete_repository`, ... |
| `data_method`       | string | HTTP method of the request. |
| `data_time`         | string | Datetime representing the start time of this request, in [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) format. |

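Because the logs are plain Parquet, they can be analyzed from any Spark environment, not only Glue. Below is a minimal PySpark sketch, assuming the S3A connector is configured with the temporary credentials obtained above; the placeholder path follows the layout described earlier:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lakefs-audit-logs").getOrCreate()

# Read one day of audit logs through the access point alias (placeholders as above)
logs = spark.read.parquet(
    "s3a://lakefs-logs-<generated>-s3alias/etl/v1/data/region=<region>/organization=org-<organization>/year=2024/month=02/day=12/"
)

logs.printSchema()

# Example: count requests per user and operation, highlighting failed requests
(
    logs.groupBy("data_user", "data_operation_id")
        .agg(
            F.count("*").alias("requests"),
            F.sum((F.col("data_status_code") >= 400).cast("int")).alias("failed"),
        )
        .orderBy(F.desc("requests"))
        .show(20, truncate=False)
)
```
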
## IdP users: map user IDs from audit logs to an email in lakeFS

The `data_user` column in each log record is the ID of the user who performed the request.

* It might be empty in cases where authentication is not required (e.g. a login attempt).
* If the user is an API user created internally in lakeFS, that ID is also the name it was given.
* `data_user` might contain an ID from an external IdP (i.e. an SSO system). These IDs are usually not human friendly, but they can be correlated with the user's email in lakeFS; see the following example using the [Python lakefs-sdk](../integrations/python.md#using-the-lakefs-sdk).


```python
import lakefs_sdk

# Configure HTTP basic authorization: basic_auth
configuration = lakefs_sdk.Configuration(
    host = "https://<org>.<region>.lakefscloud.io/api/v1",
    username = 'AKIA...',
    password = '...'
)

# Print every user's email and uid in lakeFS.
# The uid is equal to the user ID in the audit logs.
with lakefs_sdk.ApiClient(configuration) as api_client:
    auth_api = lakefs_sdk.AuthApi(api_client)
    has_more = True
    next_offset = ''
    page_size = 100
    while has_more:
        resp = auth_api.list_users(prefix='', after=next_offset, amount=page_size)
        for u in resp.results:
            email = u.email
            uid = u.id
            print(f'Email: {email}, UID: {uid}')

        has_more = resp.pagination.has_more
        next_offset = resp.pagination.next_offset
```

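Building on the snippet above, one way to make the audit logs human readable is to turn the UID-to-email mapping into a small DataFrame and join it onto the logs. This is a rough sketch that assumes a `logs` DataFrame loaded as shown earlier and an active `spark` session:

```python
# Collect the UID-to-email mapping (here from the last page only; accumulate across all pages in practice)
uid_to_email = [(u.id, u.email) for u in resp.results]

users_df = spark.createDataFrame(uid_to_email, schema="data_user string, email string")

# Enrich the audit logs (the `logs` DataFrame loaded earlier) with each user's email
logs_with_email = logs.join(users_df, on="data_user", how="left")
logs_with_email.select("data_time", "email", "data_operation_id", "data_status_code") \
    .show(20, truncate=False)
```
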
## Example: Glue Notebook with Spark

```python
from awsglue.transforms import *
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

# connect to the S3 access point
alias = 's3://<bucket-alias-name>'
s3_dyf = glueContext.create_dynamic_frame.from_options(
    format_options={},
    connection_type="s3",
    format="parquet",
    connection_options={
        "paths": [alias + "/etl/v1/data/region=<region>/organization=org-<org>/year=<YY>/month=<MM>/day=<DD>/hour=<HH>/"],
        "recurse": True,
    },
    transformation_ctx="sample-ctx",
)

s3_dyf.show()
s3_dyf.printSchema()
```
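
From here, the DynamicFrame can be converted to a regular Spark DataFrame for ad-hoc querying. A short illustrative continuation (column names follow the schema table above):

```python
# Convert to a Spark DataFrame and run SQL over the audit logs
df = s3_dyf.toDF()
df.createOrReplaceTempView("audit_logs")

spark.sql("""
    SELECT data_user, data_operation_id, COUNT(*) AS requests
    FROM audit_logs
    GROUP BY data_user, data_operation_id
    ORDER BY requests DESC
""").show(20, truncate=False)
```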