---
title: R
description: How to use lakeFS from R including creating branches, committing changes, and merging.
parent: Integrations
---

# Using R with lakeFS

R is a powerful language used widely in data science. lakeFS interfaces with R in two ways:

* To **read and write data in lakeFS** use standard S3 tools such as the `aws.s3` library. lakeFS has an [S3 gateway](https://docs.lakefs.io/understand/architecture.html#s3-gateway) which presents a lakeFS repository as an S3 bucket.
* For working with **lakeFS operations such as branches and commits** use the [API](https://docs.lakefs.io/reference/api.html), which can be accessed from R using the `httr` library.

_To see examples of R in action with lakeFS please visit the [lakeFS-samples](https://github.com/treeverse/lakeFS-samples/) repository and the [sample](https://github.com/treeverse/lakeFS-samples/blob/main/00_notebooks/R.ipynb) [notebooks](https://github.com/treeverse/lakeFS-samples/blob/main/00_notebooks/R-weather.ipynb)_.

{% include toc.html %}

## Reading and Writing from lakeFS with R

Working with data stored in lakeFS from R is the same as working with an S3 bucket, via the [S3 Gateway that lakeFS provides](https://docs.lakefs.io/understand/architecture.html#s3-gateway).

You can use any library that interfaces with S3. In this example we'll use the [aws.s3](https://github.com/cloudyr/aws.s3) library.

```r
install.packages(c("aws.s3"))
library(aws.s3)
```

### Configuration

The [R S3 client documentation](https://cloud.r-project.org/web/packages/aws.s3/aws.s3.pdf) includes full details of the configuration options available.
A good approach for using it with lakeFS is to set the endpoint and authentication details as environment variables:

```r
Sys.setenv("AWS_ACCESS_KEY_ID" = "AKIAIOSFODNN7EXAMPLE",
           "AWS_SECRET_ACCESS_KEY" = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
           "AWS_S3_ENDPOINT" = "lakefs.mycorp.com:8000")
```

_Note: it is generally best practice to set these environment variables outside of the R script; they are set here for the convenience of the example._

In conjunction with this you must also specify `region` and `use_https` _in each call of an `aws.s3` function_ as these cannot be set globally. For example:

```r
bucketlist(
    region = "",
    use_https = FALSE
)
```

* `region` should always be empty
* `use_https` should be set to `TRUE` or `FALSE` depending on whether your lakeFS endpoint uses HTTPS.

### Listing repositories

The S3 gateway exposes a repository as a bucket, so the `aws.s3` function `bucketlist` returns a list of the repositories available on lakeFS:

```r
bucketlist(
    region = "",
    use_https = FALSE
)
```

### Writing to lakeFS from R

Assuming you're using the `aws.s3` library, there are various functions available including `s3save`, `s3saveRDS`, and `put_object`.
Here's an example of writing an R object to lakeFS:

```r
repo_name <- "example"
branch <- "development"

s3saveRDS(x = my_df,
          bucket = repo_name,
          object = paste0(branch, "/my_df.R"),
          region = "",
          use_https = FALSE)
```

You can also upload local files to lakeFS using R and the `put_object` function:

```r
repo_name <- "example"
branch <- "development"
local_file <- "/tmp/never.gonna"

put_object(file = local_file,
           bucket = repo_name,
           object = paste0(branch, "/give/you/up"),
           region = "",
           use_https = FALSE)
```

### Reading from lakeFS with R

As with writing data from R to lakeFS, there is a similar set of functions for reading data. These include `s3load`, `s3readRDS`, and `get_object`. Here's an example of reading an R object from lakeFS:

```r
repo_name <- "example"
branch <- "development"

my_df <- s3readRDS(bucket = repo_name,
                   object = paste0(branch, "/my_data.R"),
                   region = "",
                   use_https = FALSE)
```

### Listing Objects

In general you should always specify a branch prefix when listing objects. Here's an example to list the `main` branch in the `quickstart` repository:

```r
get_bucket_df(bucket = "quickstart",
              prefix = "main/",
              region = "",
              use_https = FALSE)
```

When listing objects in lakeFS there is a special case: the repository/bucket level. When you list at this level, the branches are returned as folders. These are not listed recursively, unless you list something under the branch.
To understand more about this, please refer to [#5441](https://github.com/treeverse/lakeFS/issues/5441).

### Working with Arrow

Arrow's [R library](https://arrow.apache.org/docs/r/index.html) includes [powerful support](https://arrow.apache.org/docs/r/index.html#what-can-the-arrow-package-do) for data analysis, such as reading and writing multiple file formats including Parquet, Arrow, CSV, and JSON. It has functionality for [connecting to S3](https://arrow.apache.org/docs/r/articles/fs.html), and thus integrates well with lakeFS.

To start with, install and load the library:

```r
install.packages("arrow")
library(arrow)
```

Then create an S3FileSystem object to connect to your lakeFS instance:

```r
lakefs <- S3FileSystem$create(
    endpoint_override = "lakefs.mycorp.com:8000",
    scheme = "http",
    access_key = "AKIAIOSFODNN7EXAMPLE",
    secret_key = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    region = ""
)
```

From here you can list the contents of a particular lakeFS repository and branch:

```r
lakefs$ls(path = "quickstart/main")
```

To read a Parquet file from lakeFS with R, use the `read_parquet` function:

```r
lakes <- read_parquet(lakefs$path("quickstart/main/lakes.parquet"))
```

Writing a file follows a similar pattern. Here the same data as above is written back in Arrow format:

```r
write_feather(x = lakes,
              sink = lakefs$path("quickstart/main/lakes.arrow"))
```

## Performing lakeFS Operations using the lakeFS API from R

As well as reading and writing data, you will also want to carry out lakeFS operations from R including creating branches, committing data, and more.

To do this, call the lakeFS [API](https://docs.lakefs.io/reference/api.html) using the `httr` library. You should refer to the API documentation for full details of the endpoints and their behaviour.
Below are a few examples to illustrate the usage.

### Check the lakeFS Server Version

This is a useful API call to establish connectivity and test authentication.

```r
library(httr)
lakefs_api_url <- "lakefs.mycorp.com:8000/api/v1"
lakefsAccessKey <- "AKIAIOSFODNN7EXAMPLE"
lakefsSecretKey <- "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"

r <- GET(url = paste0(lakefs_api_url, "/config/version"),
         authenticate(lakefsAccessKey, lakefsSecretKey))
```

The returned object `r` can be inspected to determine the outcome of the operation by comparing its status code to those specified in the API. Here is some example R code to demonstrate the idea:

```r
if (r$status_code == 200) {
    print(paste0("✅lakeFS credentials and connectivity verified. ℹ️lakeFS version ", content(r)$version))
} else {
    print("🛑 failed to get lakeFS version")
    print(content(r)$message)
}
```

### Create a Repository

```r
library(httr)
lakefs_api_url <- "lakefs.mycorp.com:8000/api/v1"
lakefsAccessKey <- "AKIAIOSFODNN7EXAMPLE"
lakefsSecretKey <- "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
repo_name <- "my_new_repo"

# Define the payload
body <- list(name = repo_name,
             storage_namespace = "s3://example-bucket/foo")

# Call the API
r <- POST(url = paste0(lakefs_api_url, "/repositories"),
          authenticate(lakefsAccessKey, lakefsSecretKey),
          body = body, encode = "json")
```

### Commit Data

```r
library(httr)
lakefs_api_url <- "lakefs.mycorp.com:8000/api/v1"
lakefsAccessKey <- "AKIAIOSFODNN7EXAMPLE"
lakefsSecretKey <- "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
repo_name <- "my_new_repo"
branch <- "example"

# Define the payload
body <- list(message = "add some data and charts",
             metadata = list(
                 client = "httr",
                 author = "rmoff"))

# Call the API
r <- POST(url = paste0(lakefs_api_url, "/repositories/", repo_name, "/branches/", branch, "/commits"),
          authenticate(lakefsAccessKey, lakefsSecretKey),
          body = body, encode = "json")
```
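
The same request pattern should extend to the other operations mentioned on this page, such as creating a branch and merging. The following is a minimal sketch rather than a definitive implementation: the helper names (`branches_url`, `merge_url`, `create_branch`, `merge_branch`) are hypothetical, and the endpoint paths and payload fields reflect our reading of the API reference, so verify them against the API documentation for your lakeFS version. It assumes the `httr` package is installed.

```r
lakefs_api_url <- "lakefs.mycorp.com:8000/api/v1"
lakefsAccessKey <- "AKIAIOSFODNN7EXAMPLE"
lakefsSecretKey <- "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"

# Hypothetical helpers (not part of httr or lakeFS): build endpoint URLs.
# The paths are assumptions to check against the API reference.
branches_url <- function(repo) {
    paste0(lakefs_api_url, "/repositories/", repo, "/branches")
}

merge_url <- function(repo, source_ref, destination_branch) {
    paste0(lakefs_api_url, "/repositories/", repo,
           "/refs/", source_ref, "/merge/", destination_branch)
}

# Create branch `name` starting from the ref `source`
create_branch <- function(repo, name, source) {
    httr::POST(url = branches_url(repo),
               httr::authenticate(lakefsAccessKey, lakefsSecretKey),
               body = list(name = name, source = source), encode = "json")
}

# Merge `source_ref` into `destination_branch` with a merge message
merge_branch <- function(repo, source_ref, destination_branch, message) {
    httr::POST(url = merge_url(repo, source_ref, destination_branch),
               httr::authenticate(lakefsAccessKey, lakefsSecretKey),
               body = list(message = message), encode = "json")
}

# Example usage (requires a reachable lakeFS server):
# r <- create_branch("my_new_repo", name = "example", source = "main")
# r <- merge_branch("my_new_repo", "example", "main", "merge example into main")
```

As with the version check, inspect `r$status_code` and `content(r)` after each call and compare them to the responses documented for that endpoint.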