---
title: "Reading Zarr arrays with Rarr" 
date: "`r BiocStyle::doc_date()`" 
author:
  - name: Mike L. Smith
affiliation:
  - EMBL Heidelberg
email: mike.smith@embl.de 
package: "`r BiocStyle::pkg_ver('Rarr')`" 
vignette: >
  %\VignetteIndexEntry{"Working with Zarr arrays in R"}
  %\VignetteEngine{knitr::rmarkdown} 
  %\VignetteEncoding{UTF-8} 
output:
  BiocStyle::html_document
---

```{css, echo=FALSE}
.main-container {
  max-width: 1600px;
  margin-left: auto;
  margin-right: auto;
}
```

# Introduction

The Zarr specification defines a format for chunked, compressed, N-dimensional
arrays.  It's design allows efficient access to subsets of the stored array, and
supports both local and cloud storage systems. Zarr is experiencing increasing
adoption in a number of scientific fields, where multi-dimensional data are
prevalent. In particular as a back-end to the The Open Microscopy Environment's
[OME-NGFF format](https://ngff.openmicroscopy.org/latest/) for storing
bioimaging data in the cloud.

**Rarr** is intended to be a simple interface to reading and writing individual 
Zarr arrays. It is developed
in R and C with no reliance on external libraries or APIs for interfacing with
the Zarr arrays. Additional compression libraries (e.g. blosc) are bundled with
**Rarr** to provide support for datasets compressed using these tools.

## Limitations with **Rarr**

If you know about Zarr arrays already, you'll probably be aware they can be
stored in hierarchical groups, where additional meta data can explain the
relationship between the arrays.  Currently, **Rarr** is not designed to be
aware of these hierarchical Zarr array collections. However, the component
arrays can be read individually by providing the path to them directly.

Currently, there are also limitations on the Zarr datatypes that can be accessed
using **Rarr**.  For now most numeric types can be read into R, although in some
instances e.g. 64-bit integers there is potential for loss of information.
Writing is more limited with support only for datatypes that are supported
natively in R and only using the column-first representation.

## Example data

The are some example Zarr arrays included with the package.  These were created
using the Zarr Python implementation and are primarily intended for testing the
functionality of **Rarr**.  You can use the code below to list the complete set
on your system, however it's a long list so we don't show the output here.

```{r, local-examples, echo = TRUE, results = "hide"}
list.dirs(
  system.file("extdata", "zarr_examples", package = "Rarr"),
  recursive = TRUE
) |>
  grep(pattern = "zarr$", value = TRUE)
```

# Quick start guide

```{r, child=normalizePath(file.path("..", "inst", "rmd", "quick_start.Rmd"))}
```

# Additional details

## Working with Zarr metadata
By default the `zarr_overview()` function prints a summary of the array to
screen, so you can get a quick idea of the array properties.  However, there are
times when it might be useful to compute on that metadata, in which case
printing to screen isn't very helpful.  If his is the case the function also has
the argument `as_data_frame` which toggles whether to print the output to
screen, as seen before above, or to return a `data.frame` containing the array
details.

```{r, overview-2}
zarr_overview(zarr_example, as_data_frame = TRUE)
```

## Using credentials to access S3 buckets {#s3-credentials}

If you're accessing data in a private S3 bucket, you can set the environment
variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` to store your
credentials.  For example, lets try reading a file in a private S3 bucket:

```{r, s3-forbidden,  error=TRUE}
zarr_overview("https://s3.embl.de/rarr-testing/bzip2.zarr")
```

We can see the "Access Denied" message in our output, indicating that we don't
have permission to access this resource as an anonymous user.  However, if we use the key pair
below, which gives read-only access to the objects in the `rarr-testing` bucket,
we're now able to interrogate the files with functions in *Rarr*.

```{r}
Sys.setenv(
  "AWS_ACCESS_KEY_ID" = "V352gmotks4ZyhfU",
  "AWS_SECRET_ACCESS_KEY" = "2jveausI91c8P3c7OCRIOrxdLbp3LNW8"
)
zarr_overview("https://s3.embl.de/rarr-testing/bzip2.zarr")
```

Behind the scenes **Rarr** makes use of the **paws** suite of packages
(https://paws-r.github.io/) to interact with S3 storage.  A comprehensive
overview of the multiple ways credentials can be set and used by **paws** can be
found at https://github.com/paws-r/paws/blob/main/docs/credentials.md.  If 
setting environment variables as above doesn't work or is inappropriate for 
your use case please refer to that document for other options.

## Creating an S3 client {#s3-client}

Although **Rarr** will try its best to find appropriate credentials and settings
to access a bucket, it is not always successful. Once such example is when you
have AWS credentials set somewhere and you try to access a public bucket.  We
can see an example of this below, where we access the same public bucket used in
\@ref(read-s3), but it now fails because we have set the `AWS_ACCESS_KEY_ID` 
environment variable in the previous section.


```{r, bad-s3, error = TRUE}
s3_address <- "https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0076A/10501752.zarr/0"
zarr_overview(s3_address)
```

You might encounter similar problems if you're trying to access multiple buckets
each of which require different credentials.  The solution here is to create an
"s3_client" using `paws.storage::s3()`, which contains all the required details
for accessing a particular bucket.  Doing so will prevent **Rarr** from trying
to determine things on its own, and gives you complete control over the settings
used to communicate with the S3 bucket. Here's an example that will let us
access the failing bucket by creating a client with anonymous credentials.

```{r, paws-s3-client, eval = TRUE, echo = TRUE}
s3_client <- paws.storage::s3(
  config = list(
    credentials = list(anonymous = TRUE), 
    region = "auto",
    endpoint = "https://uk1s3.embassy.ebi.ac.uk")
  )
```

If you're accessing a public bucket, the most important step is to provide a
`credentials` list with `anonymous = TRUE`.  Doing so ensures that no attempts
to find other credentials are made, and prevents the problems seen above.  If
you're using files on Amazon AWS storage you'll need to set the `region` to
whatever is appropriate for your data e.g. `"us-east-2"`, `"eu-west-3"`, etc.
For other S3 providers that don't have regions use the value `"auto"` as in the
example below.  Finally the `endpoint` argument is the full hostname of the
server where your files can be found.  For more information on creating an S3
client see the [**paws.storage**
documentation](https://paws-r.github.io/docs/s3/).

We can then pass our s3_client to `zarr_overview()` and it now works
successfully.  

```{r, works-again, error = TRUE}
zarr_overview(s3_address, s3_client = s3_client)
```

Most functions in **Rarr** have the `s3_client` argument and it
can be applied in the same way.

## Writing subsets of data

One of the key features of the Zarr specification is that the arrays are
chunked, allowing rapid access to the required data without needed to read or
write everything else.  If you want to modify a subset of a Zarr array, it is
very inefficient to write all chunks to disk, which is what `write_zarr_array()`
does.  Instead, **Rarr** provides two functions for reducing the amount of
writing required if the circumstances allow: `create_empty_zarr_array()` and
`update_zarr_array()`.

### Creating an "empty" array

Despite the name, you can actually think of `create_empty_zarr_array()` as
creating an array where all the values are the same.  The Zarr specification
allows for "uninitialized" chunks, which are not actually present on disk.  In
this case, any reading application assumes the entirety of the chunk is filled
with a single value, which is found in the array metadata.  This allows for very
efficient creation of the new array, since only a small metadata file is
actually written.  However it is necessary to provide some additional details,
such as the shape of the array, since there's no R array to infer these from.
Let's look at an example:

```{r, create-empty-array}
path <- tempfile()
create_empty_zarr_array(
  zarr_array_path = path,
  dim = c(50, 20),
  chunk_dim = c(5, 10),
  data_type = "integer",
  fill_value = 7L
)
```

First we have to provide a location for the array to be created using the
`zarr_array_path` argument.  Then we provide the dimensions of the new array,
and the shape of the chunks it should be split into.  These two arguments must
be compatible with one another i.e. have the same number of dimensions and no
value in `chunk_dim` should exceed the corresponding value in `dim`.  The
`data_type` argument defines what type of values will be stored in the array.
This is currently limited to: `"integer"`, `"double"`, and `"string"`^[**Rarr**
is currently limited to writing Zarr arrays using data types native to R, rather
than the full range provided by other implementations.].  Finally we use the
`fill_value` argument to provide our default value for the uninitialized chunks.
The next few lines check what's actually been created on our file system.
First, we use `list.files()` to confirm that that only file that's been created
is the `.zarray` metadata; there are no chunk files.  Then we use `table()` to
check the contents of the array, and confirm that when it's read the resulting
array in R is full of 7s, our fill value.

```{r, check-empty-array}
list.files(path, all.files = TRUE, no.. = TRUE)
table(read_zarr_array(path))
```

### Updating a subset of an existing array

Lets assume we want to update the first row of our array to contain the sequence
of integers from 1 to 20.  In the code below we create an example vector
containing the new data.  We then use `update_zarr_array()`, passing the
location of the Zarr and the new values to be inserted.  Finally, we provide the
`index` argument which defines which elements in the Zarr array should be
updated.  It's important that the shape and number of values in `x` corresponds
to the total count of points in the Zarr array we want to update e.g. in this
case we're updating a single row of 20 values.

```{r, update-array}
x <- 1:20
update_zarr_array(
  zarr_array_path = path,
  x = x,
  index = list(1, 1:20)
)
```

As before, we can take a look at what's happened on disk and confirm the values
are present in the array if we read it into R.

```{r, check-update}
list.files(path, all.files = TRUE, no.. = TRUE)
read_zarr_array(path, index = list(1:2, 1:5))
table(read_zarr_array(path))
```

Here `list.files()` confirms that there's now two chunk files that have been
created.  When we first created the Zarr we specified that the chunks should be
10 columns wide, so by modifying 20 columns we'd expect at least two chunks to
be realized on disk.  We use `read_zarr_array()` to confirm visually that the
first row contains our sequence of values, whilst the second row is still all 7.
We use `table()` to confirm that the total contents is as expected.

# Appendix

## Session info {.unnumbered}
```{r, session-info, echo = FALSE}
sessionInfo()
```