---
title: R
description: How to use lakeFS from R including creating branches, committing changes, and merging.
parent: Integrations
---

# Using R with lakeFS

R is a powerful language used widely in data science. lakeFS interfaces with R in two ways:

* To **read and write data in lakeFS**, use standard S3 tools such as the `aws.s3` library. lakeFS has an [S3 gateway](https://docs.lakefs.io/understand/architecture.html#s3-gateway) which presents a lakeFS repository as an S3 bucket.
* To work with **lakeFS operations such as branches and commits**, use the [API](https://docs.lakefs.io/reference/api.html), which can be accessed from R using the `httr` library.

_To see examples of R in action with lakeFS please visit the [lakeFS-samples](https://github.com/treeverse/lakeFS-samples/) repository and the [sample](https://github.com/treeverse/lakeFS-samples/blob/main/00_notebooks/R.ipynb) [notebooks](https://github.com/treeverse/lakeFS-samples/blob/main/00_notebooks/R-weather.ipynb)_.

{% include toc.html %}

## Reading and Writing from lakeFS with R

Working with data stored in lakeFS from R is the same as working with an S3 bucket, thanks to the [S3 gateway that lakeFS provides](https://docs.lakefs.io/understand/architecture.html#s3-gateway).

You can use any library that interfaces with S3. In this example we'll use the [aws.s3](https://github.com/cloudyr/aws.s3) library.

```r
install.packages(c("aws.s3"))
library(aws.s3)
```

### Configuration

The [R S3 client documentation](https://cloud.r-project.org/web/packages/aws.s3/aws.s3.pdf) includes full details of the configuration options available. A good approach for using it with lakeFS is to set the endpoint and authentication details as environment variables:

```r
Sys.setenv("AWS_ACCESS_KEY_ID" = "AKIAIOSFODNN7EXAMPLE",
           "AWS_SECRET_ACCESS_KEY" = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
           "AWS_S3_ENDPOINT" = "lakefs.mycorp.com:8000")
```

_Note: it is generally best practice to set these environment variables outside of the R script; they are set here only to keep the example self-contained._
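
As a minimal sketch of that practice, a script can check for the variables instead of defining them, and stop early if any are missing. The helper names `missing_env` and `require_env` are hypothetical and not part of `aws.s3`; the sketch assumes the variables were set in the shell or in `~/.Renviron` before R started.

```r
# Hypothetical helper (not part of aws.s3): return the names of any
# required environment variables that are unset or empty.
missing_env <- function(vars) {
  values <- Sys.getenv(vars, unset = "")
  vars[values == ""]
}

# Hypothetical helper: stop with a clear message if configuration is missing.
require_env <- function(vars) {
  missing <- missing_env(vars)
  if (length(missing) > 0) {
    stop("Missing environment variables: ", paste(missing, collapse = ", "))
  }
  invisible(TRUE)
}

# The three variables used in the configuration example above.
lakefs_env_vars <- c("AWS_ACCESS_KEY_ID",
                     "AWS_SECRET_ACCESS_KEY",
                     "AWS_S3_ENDPOINT")
```

Calling `require_env(lakefs_env_vars)` at the top of a script then fails with a readable error before any S3 request is attempted.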

In conjunction with this you must also specify `region` and `use_https` _in each call of an `aws.s3` function_, as these cannot be set globally. For example:

```r
bucketlist(
    region = "",
    use_https = FALSE
)
```

* `region` should always be empty.
* `use_https` should be set to `TRUE` or `FALSE` depending on whether your lakeFS endpoint uses HTTPS.
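
Because these two arguments have to be repeated in every call, one option is a small wrapper that fills them in. This is only a sketch: `with_lakefs` is a hypothetical name, not an `aws.s3` function, and it simply forwards its arguments to whichever function it is given.

```r
# Hypothetical wrapper (not part of aws.s3): call any aws.s3 function with the
# lakeFS-specific arguments filled in, so they are written down only once.
with_lakefs <- function(fn, ..., use_https = FALSE) {
  fn(..., region = "", use_https = use_https)
}

# Usage (needs a reachable lakeFS server, so it is shown as comments):
# with_lakefs(aws.s3::bucketlist)
# with_lakefs(aws.s3::get_bucket_df, bucket = "quickstart", prefix = "main/")
```

Pass `use_https = TRUE` to the wrapper if your endpoint is served over HTTPS.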

### Listing repositories

The S3 gateway exposes a repository as a bucket, so the `aws.s3` function `bucketlist` returns a list of the repositories available on lakeFS:

```r
bucketlist(
    region = "",
    use_https = FALSE
)
```

### Writing to lakeFS from R

Assuming you're using the `aws.s3` library, there are various functions available for writing, including `s3save`, `s3saveRDS`, and `put_object`. Here's an example of writing an R object to lakeFS:

```r
repo_name <- "example"
branch <- "development"

# my_df is any R object already in your session, such as a data frame
s3saveRDS(x = my_df,
          bucket = repo_name,
          object = paste0(branch, "/my_df.R"),
          region = "",
          use_https = FALSE)
```

You can also upload local files to lakeFS using R and the `put_object` function:

```r
repo_name <- "example"
branch <- "development"
local_file <- "/tmp/never.gonna"

put_object(file = local_file,
           bucket = repo_name,
           object = paste0(branch, "/give/you/up"),
           region = "",
           use_https = FALSE)
```

### Reading from lakeFS with R

As with writing data from R to lakeFS, there is a similar set of functions for reading data. These include `s3load`, `s3readRDS`, and `get_object`. Here's an example of reading an R object from lakeFS:

```r
repo_name <- "example"
branch <- "development"

my_df <- s3readRDS(bucket = repo_name,
                   object = paste0(branch, "/my_data.R"),
                   region = "",
                   use_https = FALSE)
```
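
In the read and write examples above, the object key is always the branch name followed by the path within the branch. A hypothetical helper (the name `lakefs_key` is an assumption, not part of any library) can build that key consistently and tolerate stray slashes:

```r
# Hypothetical helper: build the object key "<branch>/<path>" used with the
# S3 gateway, tolerating stray slashes at either end of the inputs.
lakefs_key <- function(branch, path) {
  branch <- gsub("^/+|/+$", "", branch)  # drop leading and trailing slashes
  path <- sub("^/+", "", path)           # drop leading slashes only
  paste0(branch, "/", path)
}

lakefs_key("development", "my_df.R")     # "development/my_df.R"
lakefs_key("main/", "/give/you/up")      # "main/give/you/up"
```

The result can be passed as the `object` argument of functions such as `s3saveRDS`, `s3readRDS`, and `put_object`.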

### Listing Objects

In general you should always specify a branch prefix when listing objects. Here's an example to list the `main` branch in the `quickstart` repository:

```r
get_bucket_df(bucket = "quickstart",
              prefix = "main/",
              region = "",
              use_https = FALSE)
```

Listing at the repository (bucket) level is a special case: the branches are returned as folders, and their contents are not listed recursively unless you list under a branch. For more detail, refer to [#5441](https://github.com/treeverse/lakeFS/issues/5441).

### Working with Arrow

Arrow's [R library](https://arrow.apache.org/docs/r/index.html) includes [powerful support](https://arrow.apache.org/docs/r/index.html#what-can-the-arrow-package-do) for data analysis, including reading and writing file formats such as Parquet, Arrow, CSV, and JSON. It has functionality for [connecting to S3](https://arrow.apache.org/docs/r/articles/fs.html), and so it works with lakeFS through the S3 gateway.

To start, install and load the library:

```r
install.packages("arrow")
library(arrow)
```

Then create an `S3FileSystem` object to connect to your lakeFS instance:

```r
lakefs <- S3FileSystem$create(
    endpoint_override = "lakefs.mycorp.com:8000",
    scheme = "http",
    access_key = "AKIAIOSFODNN7EXAMPLE",
    secret_key = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    region = ""
)
```

From here you can list the contents of a particular lakeFS repository and branch:

```r
lakefs$ls(path = "quickstart/main")
```

To read a Parquet file from lakeFS with R, use the `read_parquet` function:

```r
lakes <- read_parquet(lakefs$path("quickstart/main/lakes.parquet"))
```

Writing a file follows a similar pattern. Here the data read above is written back to lakeFS in Arrow format:

```r
write_feather(x = lakes,
              sink = lakefs$path("quickstart/main/lakes.arrow"))
```

## Performing lakeFS Operations using the lakeFS API from R

As well as reading and writing data, you will also want to carry out lakeFS operations from R, including creating branches, committing data, and more.

To do this, call the lakeFS [API](https://docs.lakefs.io/reference/api.html) using the `httr` library. Refer to the API documentation for full details of the endpoints and their behaviour. Below are a few examples to illustrate the usage.

### Check the lakeFS Server Version

This is a useful API call to establish connectivity and test authentication.

```r
library(httr)
# Include the scheme (http or https) that your lakeFS endpoint uses
lakefs_api_url <- "http://lakefs.mycorp.com:8000/api/v1"
lakefsAccessKey <- "AKIAIOSFODNN7EXAMPLE"
lakefsSecretKey <- "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"

r <- GET(url = paste0(lakefs_api_url, "/config/version"),
         authenticate(lakefsAccessKey, lakefsSecretKey))
```

Inspect the returned object `r` to determine the outcome of the operation by comparing its status code with those specified in the API documentation. Here is some example R code to demonstrate the idea:

```r
if (r$status_code == 200) {
    print(paste0("✅lakeFS credentials and connectivity verified. ℹ️lakeFS version ", content(r)$version))
} else {
    print("🛑 failed to get lakeFS version")
    print(content(r)$message)
}
```
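
The same check applies to every API call, so it can be factored into a function. The sketch below is one way to do that; `describe_response` is a hypothetical name, and the `expected` argument is assumed to be whatever success code the API documentation lists for the endpoint being called.

```r
# Hypothetical helper: turn a status code and parsed response body into a
# single message. `expected` is the success code documented for the endpoint.
describe_response <- function(status, body = list(), expected = 200) {
  if (status == expected) {
    return(paste("OK:", status))
  }
  # Error bodies are assumed to be parsed JSON with a `message` field,
  # as read by content(r)$message in the example above.
  has_message <- is.list(body) && !is.null(body$message)
  detail <- if (has_message) body$message else "no message returned"
  paste0("Failed with status ", status, ": ", detail)
}

# Usage with an httr response `r`:
# print(describe_response(httr::status_code(r), httr::content(r)))
```

Keeping the helper free of network calls means it can be exercised with plain values, for example `describe_response(401, list(message = "unauthorized"))`.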

### Create a Repository

```r
library(httr)
# Include the scheme (http or https) that your lakeFS endpoint uses
lakefs_api_url <- "http://lakefs.mycorp.com:8000/api/v1"
lakefsAccessKey <- "AKIAIOSFODNN7EXAMPLE"
lakefsSecretKey <- "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
repo_name <- "my_new_repo"

# Define the payload
body <- list(name = repo_name,
             storage_namespace = "s3://example-bucket/foo")

# Call the API
r <- POST(url = paste0(lakefs_api_url, "/repositories"),
          authenticate(lakefsAccessKey, lakefsSecretKey),
          body = body, encode = "json")
```

### Commit Data

```r
library(httr)
# Include the scheme (http or https) that your lakeFS endpoint uses
lakefs_api_url <- "http://lakefs.mycorp.com:8000/api/v1"
lakefsAccessKey <- "AKIAIOSFODNN7EXAMPLE"
lakefsSecretKey <- "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
repo_name <- "my_new_repo"
branch <- "example"

# Define the payload
body <- list(message = "add some data and charts",
             metadata = list(
                 client = "httr",
                 author = "rmoff"))

# Call the API
r <- POST(url = paste0(lakefs_api_url, "/repositories/", repo_name, "/branches/", branch, "/commits"),
          authenticate(lakefsAccessKey, lakefsSecretKey),
          body = body, encode = "json")
```
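
### Create and Merge a Branch

Creating and merging branches follows the same pattern as the calls above. The sketch below is an illustration under stated assumptions rather than a definitive client: the helper names `lakefs_url`, `create_branch`, and `merge_branch` are hypothetical, and the endpoint paths and payload fields should be checked against the [API](https://docs.lakefs.io/reference/api.html) documentation for your lakeFS version.

```r
# Hypothetical helper: join an API base URL and path segments with single
# slashes. Segments are used as-is, so they must already be URL-safe.
lakefs_url <- function(base, ...) {
  paste(c(sub("/+$", "", base), ...), collapse = "/")
}

# Sketch: create `branch` from `source`. Nothing is sent until it is called.
create_branch <- function(base, key, secret, repo, branch, source = "main") {
  httr::POST(url = lakefs_url(base, "repositories", repo, "branches"),
             httr::authenticate(key, secret),
             body = list(name = branch, source = source),
             encode = "json")
}

# Sketch: merge `source` into `destination`.
merge_branch <- function(base, key, secret, repo, source, destination) {
  httr::POST(url = lakefs_url(base, "repositories", repo,
                              "refs", source, "merge", destination),
             httr::authenticate(key, secret),
             body = list(message = paste("Merge", source, "into", destination)),
             encode = "json")
}
```

With the variables from the examples above, `create_branch(lakefs_api_url, lakefsAccessKey, lakefsSecretKey, repo_name, "example")` would create the `example` branch from `main`, and `merge_branch(lakefs_api_url, lakefsAccessKey, lakefsSecretKey, repo_name, "example", "main")` would merge it back. Inspect the returned response in the same way as for the version check.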