---
title: Auditing
parent: Reference
description: Auditing is a solution for lakeFS Cloud which enables tracking of events and activities performed within the solution. These logs capture information such as who accessed the solution, what actions were taken, and when they occurred.
redirect_from:
  - /audit.html
  - /reference/audit.html
  - /cloud/auditing.html
---

# Auditing
{: .d-inline-block }
lakeFS Cloud
{: .label .label-green }

{: .note}
> Auditing is only available for [lakeFS Cloud]({% link understand/lakefs-cloud.md %}).

{: .warning }
> Please note that as of January 2024, the queryable interface within the lakeFS Cloud UI has been removed in favor of direct access to the lakeFS audit logs. This document describes how to set up and query this information using [AWS Glue](https://aws.amazon.com/glue/) as a reference.

The lakeFS audit log allows you to view all relevant user actions in a clear and organized table, including when each action was performed, by whom, and what it was.

This can be useful for several purposes, including:

1. Compliance - Audit logs can be used to show what data users accessed, as well as any changes they made to user management.

2. Troubleshooting - If something changes on your underlying object store that you weren't expecting, such as a big file suddenly breaking into thousands of smaller files, you can use the audit log to find out what action led to this change.

## Setting up access to Audit Logs on AWS S3

Access to the audit logs is provided via an [AWS S3 Access Point](https://aws.amazon.com/s3/features/access-points/).

There are different ways to interact with an access point (see [Using access points in AWS](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-access-points.html)).

The initial setup:

1. Take note of the IAM Role ARN that will be used to access the data. This should be the user or role used by e.g. Athena or AWS Glue (see the sketch after this list).
2. [Reach out to customer success](mailto:support@treeverse.io?subject=ARN to use for audit logs) and provide this ARN. Once the role ARN is received, an access point will be created for you, and you should receive the following details in response:
   1. S3 bucket (e.g. `arn:aws:s3:::lakefs-audit-logs-us-east-1-production`)
   2. S3 URI to an access point (e.g. `s3://arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>`)
   3. Access Point alias. You can use this alias instead of the bucket name or Access Point ARN to access the data through the access point (e.g. `lakefs-logs-<generated>-s3alias`).
3. Update your IAM role policy and trust policy if required.
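If you are unsure of the exact ARN to share in step 1, it can be looked up programmatically. The sketch below uses boto3 and a hypothetical role name (`audit-logs-reader`); replace it with whichever role your query engine actually assumes:

```python
import boto3

iam = boto3.client("iam")

# "audit-logs-reader" is a placeholder; use the role your query engine (e.g. Athena or Glue) runs as
role_arn = iam.get_role(RoleName="audit-logs-reader")["Role"]["Arn"]
print(role_arn)
```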
A minimal example of an IAM policy for two lakeFS installations in two regions (`us-east-1`, `us-west-2`):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::lakefs-audit-logs-us-east-1-production",
        "arn:aws:s3:::lakefs-audit-logs-us-east-1-production/*",
        "arn:aws:s3:::lakefs-logs-<generated>-s3alias/*",
        "arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>",
        "arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>/*"
      ],
      "Condition": {
        "StringLike": {
          "s3:prefix": [
            "etl/v1/data/region=<region_a>/organization=org-<organization>/*",
            "etl/v1/data/region=<region_b>/organization=org-<organization>/*"
          ]
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion"
      ],
      "Resource": [
        "arn:aws:s3:::lakefs-audit-logs-us-east-1-production",
        "arn:aws:s3:::lakefs-audit-logs-us-east-1-production/etl/v1/data/region=<region_a>/organization=org-<organization>/*",
        "arn:aws:s3:::lakefs-audit-logs-us-east-1-production/etl/v1/data/region=<region_b>/organization=org-<organization>/*",
        "arn:aws:s3:::lakefs-logs-<generated>-s3alias/*",
        "arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>/object/etl/v1/data/region=<region_a>/organization=org-<organization>/*",
        "arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>/object/etl/v1/data/region=<region_b>/organization=org-<organization>/*"
      ]
    },
    {
      "Action": [
        "kms:Decrypt"
      ],
      "Resource": [
        "arn:aws:kms:us-east-1:<treeverse-id>:key/<encryption-key-id>"
      ],
      "Effect": "Allow"
    }
  ]
}
```

An example trust policy that allows any principal in your AWS account to assume the role above:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<YOUR_ACCOUNT_ID>:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {}
    }
  ]
}
```

Authentication is done by assuming the IAM role:

```bash
# Assume the role; use the returned AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN for the commands below
aws sts assume-role --role-arn arn:aws:iam::<your-aws-account>:role/<reader-role> --role-session-name <name>

# Verify that the role was assumed
aws sts get-caller-identity

# List objects (can be used with --recursive) using the access point ARN
aws s3 ls arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>/etl/v1/data/region=<region>/organization=org-<organization>/

# Download an object locally using the S3 access point alias
aws s3api get-object --bucket lakefs-logs-<generated>-s3alias --key etl/v1/data/region=<region>/organization=org-<organization>/year=<YY>/month=<MM>/day=<DD>/hour=<HH>/<file>-snappy.parquet sample.parquet
```
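The same flow can be scripted. The following is a minimal boto3 sketch under the same assumptions: the placeholders (`<your-aws-account>`, `<reader-role>`, `<treeverse-id>`, `<organization>`, `<region>`) must be replaced with the values provided during setup, and the session name is arbitrary:

```python
import boto3

# Assume the reader role and obtain temporary credentials
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::<your-aws-account>:role/<reader-role>",
    RoleSessionName="audit-logs-reader",
)["Credentials"]

# Create an S3 client that uses the temporary credentials
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# The access point ARN can be used wherever a bucket name is expected
resp = s3.list_objects_v2(
    Bucket="arn:aws:s3:us-east-1:<treeverse-id>:accesspoint/lakefs-logs-<organization>",
    Prefix="etl/v1/data/region=<region>/organization=org-<organization>/",
)
for obj in resp.get("Contents", []):
    print(obj["Key"])
```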
## Data layout

{: .note }
> The bucket name is used when creating the IAM policy, but the Access Point ARN and alias are what you actually use to access the data (e.g. from the AWS CLI, Spark, etc.).

**Bucket Name:** `lakefs-audit-logs-us-east-1-production`

**Root prefix:** `etl/v1/data/region=<region>/organization=org-<organization-name>/`

**File path pattern:** All audit log files are in Parquet format, and their paths follow the pattern `etl/v1/data/region=<region>/organization=org-<organization-name>/year=<YY>/month=<MM>/day=<DD>/hour=<HH>/*-snappy.parquet`

### Path Values

**region:** The lakeFS installation region (i.e. the region in the lakeFS URL: `https://<organization-name>.<region>.lakefscloud.io/`)

**organization:** Found in the lakeFS URL `https://<organization-name>.<region>.lakefscloud.io/`. The value in the S3 path must be prefixed with `org-` (i.e. `organization=org-<organization-name>`)

### Partitions

- year
- month
- day
- hour

### Example

Example paths for an "Acme" organization with two lakeFS installations:

```text
# ACME in us-east-1
etl/v1/data/region=us-east-1/organization=org-acme/year=2024/month=02/day=12/hour=13/log_abc-snappy.parquet

# ACME in us-west-2
etl/v1/data/region=us-west-2/organization=org-acme/year=2024/month=02/day=12/hour=13/log_xyz-snappy.parquet
```

## Schema

The files are in Parquet format and can be read directly from Spark or any other client that can read Parquet files.
Using Spark's [`printSchema()`](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.printSchema.html) you can inspect the columns yourself; the table below describes the latest schema, with notes on the important columns:

| Column              | Type   | Description |
|---------------------|--------|-------------|
| `data_user`         | string | The internal user ID of the user making the request. If an external IdP is used (e.g. SSO, Microsoft Entra), this is the UID provided by the IdP (see below for an example of mapping external IDs in Python) |
| `data_repository`   | string | The repository ID relevant to this request. Currently only returned for s3_gateway requests |
| `data_ref`          | string | The reference ID (tag, branch, ...) relevant to this request. Currently only returned for s3_gateway requests |
| `data_status_code`  | int    | HTTP status code returned for this request |
| `data_service_name` | string | Service name for the request. Either "rest_api" or "s3_gateway" |
| `data_request_id`   | string | Unique ID representing this request |
| `data_path`         | string | HTTP path used for this request |
| `data_operation_id` | string | Logical operation ID for this request, e.g. `list_objects`, `delete_repository`, ... |
| `data_method`       | string | HTTP method of the request |
| `data_time`         | string | Datetime representing the start time of this request, in [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) format |
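For reference, here is a rough PySpark sketch that reads a single day of logs and aggregates by operation. It assumes your Spark environment's S3 filesystem (e.g. Hadoop's `s3a://`) is already configured with the assumed-role credentials and that the access point alias is used like a bucket name; the date and the other placeholders are examples only:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lakefs-audit-logs").getOrCreate()

# One day of audit logs for a single installation; all placeholders must be replaced
path = (
    "s3a://lakefs-logs-<generated>-s3alias/etl/v1/data"
    "/region=<region>/organization=org-<organization>"
    "/year=2024/month=02/day=12/"
)
logs = spark.read.parquet(path)

# Requests per logical operation and HTTP status code for that day
(
    logs.groupBy("data_operation_id", "data_status_code")
    .count()
    .orderBy(F.desc("count"))
    .show(50, truncate=False)
)
```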
## IdP users: map user IDs from audit logs to an email in lakeFS

The `data_user` column in each log record holds the ID of the user who performed the action.

* It might be empty when authentication is not required (e.g. a login attempt).
* If the user is an API user created internally in lakeFS, the ID is also the name it was given.
* `data_user` might contain an ID from an external IdP (i.e. an SSO system). This ID is usually not human friendly, but it can be correlated with the user's email in lakeFS, as in the following example using the [Python lakefs-sdk](../integrations/python.md#using-the-lakefs-sdk):

```python
import lakefs_sdk

# Configure HTTP basic authorization: basic_auth
configuration = lakefs_sdk.Configuration(
    host="https://<org>.<region>.lakefscloud.io/api/v1",
    username="AKIA...",
    password="...",
)

# Print the email and UID of every user in lakeFS.
# The UID is equal to the user ID in the audit logs.
with lakefs_sdk.ApiClient(configuration) as api_client:
    auth_api = lakefs_sdk.AuthApi(api_client)
    has_more = True
    next_offset = ''
    page_size = 100
    while has_more:
        resp = auth_api.list_users(prefix='', after=next_offset, amount=page_size)
        for u in resp.results:
            email = u.email
            uid = u.id
            print(f'Email: {email}, UID: {uid}')

        has_more = resp.pagination.has_more
        next_offset = resp.pagination.next_offset
```

## Example: Glue Notebook with Spark

```python
from awsglue.transforms import *
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

# Read the audit logs through the S3 access point alias
alias = 's3://<bucket-alias-name>'
s3_dyf = glueContext.create_dynamic_frame.from_options(
    format_options={},
    connection_type="s3",
    format="parquet",
    connection_options={
        "paths": [alias + "/etl/v1/data/region=<region>/organization=org-<org>/year=<YY>/month=<MM>/day=<DD>/hour=<HH>/"],
        "recurse": True,
    },
    transformation_ctx="sample-ctx",
)

s3_dyf.show()
s3_dyf.printSchema()
```
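Once loaded, the DynamicFrame can be converted to a regular Spark DataFrame for ad-hoc queries against the columns described in the schema above. A short continuation of the notebook, with a placeholder user ID:

```python
# Convert the DynamicFrame to a Spark DataFrame and register it as a temporary view
df = s3_dyf.toDF()
df.createOrReplaceTempView("audit_logs")

# Example query: all actions performed by a specific user (placeholder ID), oldest first
spark.sql("""
    SELECT data_time, data_user, data_operation_id, data_status_code, data_path
    FROM audit_logs
    WHERE data_user = '<user-id>'
    ORDER BY data_time
""").show(truncate=False)
```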