
library(scATAC.Explorer)
#> Loading required package: SingleCellExperiment
#> Loading required package: SummarizedExperiment
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#> 
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#> 
#>     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#>     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#>     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#>     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#>     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#>     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#>     colWeightedMeans, colWeightedMedians, colWeightedSds,
#>     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#>     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#>     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#>     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#>     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#>     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#>     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#>     rowWeightedSds, rowWeightedVars
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: generics
#> 
#> Attaching package: 'generics'
#> The following objects are masked from 'package:base':
#> 
#>     as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
#>     setequal, union
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:generics':
#> 
#>     intersect, setdiff, setequal, union
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, rank, rbind, rownames, sapply, saveRDS, setdiff,
#>     setequal, table, tapply, union, unique, unsplit, which.max,
#>     which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#> 
#>     findMatches
#> The following objects are masked from 'package:base':
#> 
#>     I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> 
#> Attaching package: 'Biobase'
#> The following object is masked from 'package:MatrixGenerics':
#> 
#>     rowMedians
#> The following objects are masked from 'package:matrixStats':
#> 
#>     anyMissing, rowMedians
#> Warning: replacing previous import 'S4Arrays::read_block' by
#> 'DelayedArray::read_block' when loading 'SummarizedExperiment'
#> Loading required package: BiocFileCache
#> Loading required package: dbplyr
#> Loading required package: data.table
#> 
#> Attaching package: 'data.table'
#> The following object is masked from 'package:SummarizedExperiment':
#> 
#>     shift
#> The following object is masked from 'package:GenomicRanges':
#> 
#>     shift
#> The following object is masked from 'package:IRanges':
#> 
#>     shift
#> The following objects are masked from 'package:S4Vectors':
#> 
#>     first, second
#> Loading required package: zellkonverter
#> Registered S3 method overwritten by 'zellkonverter':
#>   method                                             from      
#>   py_to_r.pandas.core.arrays.categorical.Categorical reticulate

1 Introduction

scATAC.Explorer (Single Cell ATAC-seq Explorer) is a curated collection of publicly available scATAC-seq datasets. It aims to provide a single point of entry for users looking to investigate epigenetics and chromatin accessibility at single-cell resolution across many available datasets.

Users can quickly search available datasets using the metadata table, and then download any datasets they have discovered relevant to their research in a standard and easily accessible format. Optionally, users can save the datasets for use in applications other than R.

This package makes it easier to study the epigenome across a variety of organisms, cell types, and diseases. Developers may use this package to obtain data for validation of new algorithms, or to study differences between scATAC-seq datasets.

2 Exploring available datasets

Start by exploring the available datasets through metadata.

res = queryATAC(metadata_only = TRUE)
X Reference Accession Author Journal
1 Satpathy_NatureBiotech_2019 GSE129785 Satpathy Nature Biotech
2 Satpathy_NatureBiotech_2019 GSE129785 Satpathy Nature Biotech
3 Satpathy_NatureBiotech_2019 GSE129785 Satpathy Nature Biotech
4 Satpathy_NatureBiotech_2019 GSE129785 Satpathy Nature Biotech
5 Buenrostro_Cell_2018 GSE96769 Buenrostro Cell
6 Corces_NatureGenetics_2016 GSE74310 Corces Nature Genetics

This will return a list containing a single dataframe of metadata for all available datasets. View the metadata with View(res[[1]]) and then check ?queryATAC for a description of searchable fields.

Note: in order to keep the function’s interface consistent, queryATAC always returns a list of objects, even if there is only one object. You may prefer running res = queryATAC(metadata_only = TRUE)[[1]] in order to save the dataframe directly.
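
Because the metadata is an ordinary data frame, base R is enough to explore it once it has been unwrapped from the list. The following is a minimal sketch that uses a small stand-in data frame, copied from the rows printed above, instead of a live queryATAC() call; the real table has more rows and columns.

```r
# Stand-in for queryATAC(metadata_only = TRUE): a list holding one data frame.
# The rows are copied from the output shown above; the real table is larger.
res <- list(data.frame(
  Reference = c(rep("Satpathy_NatureBiotech_2019", 4),
                "Buenrostro_Cell_2018",
                "Corces_NatureGenetics_2016"),
  Accession = c(rep("GSE129785", 4), "GSE96769", "GSE74310"),
  Author    = c(rep("Satpathy", 4), "Buenrostro", "Corces"),
  Journal   = c(rep("Nature Biotech", 4), "Cell", "Nature Genetics"),
  stringsAsFactors = FALSE
))

# Unwrap the single data frame from the list.
meta <- res[[1]]

# Count metadata entries per first author.
entries_per_author <- table(meta$Author)
print(entries_per_author)

# Keep only the rows that belong to one accession.
satpathy <- meta[meta$Accession == "GSE129785", ]
print(nrow(satpathy))
```

The same subsetting pattern should apply to any column listed in ?queryATAC, although the exact column names in the real metadata table may differ from this stand-in.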

The metadata_only argument can be applied alongside any other argument in order to examine only datasets that have certain qualities. You can, for instance, view only leukemia datasets by using

res = queryATAC(disease = 'leukemia', metadata_only = TRUE)[[1]]
X Reference Accession Author Journal
6 6 Corces_NatureGenetics_2016 GSE74310 Corces Nature Genetics
45 45 Satpathy_NatureMedicine_2018 GSE107817 Satpathy Nature Medicine

Table 1: Search parameters for queryATAC alongside example values.
Search Parameter | Description | Examples
accession | Search by unique accession number or ID | GSE129785, GSE89362
has_cell_types | Filter by presence of cell-type annotations | TRUE, FALSE
has_clusters | Filter by presence of cluster results | TRUE, FALSE
disease | Search by disease | Carcinoma, Leukemia
broad_cell_category | Search by broad cell categories present in datasets | Neuronal, Immune
tissue_cell_type | Search by tissue or cell type when available | PBMC, glia, cerebral cortex
author | Search by first author | Satpathy, Cusanovich
journal | Search by publication journal | Science, Nature, Cell
year | Search by year of publication | <2015, >2015, 2013-2015
pmid | Search by PubMed ID | 27526324, 32494068
sequence_tech | Search by sequencing technology | 10x Genomics Chromium
organism | Search by source organism | Mus musculus
genome_build | Search by genome build | hg19, hg38, mm10
sparse | Return counts in sparse matrices | TRUE, FALSE

2.1 Searching by year

To search by a single year or a range of years, the package looks for specific patterns. ‘2013-2015’ will search for datasets published between 2013 and 2015, inclusive. ‘<2015’ or ‘2015>’ will search for datasets published in or before 2015. ‘>2015’ or ‘2015<’ will search for datasets published in or after 2015.
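
The pattern rules above can be illustrated with a small base R helper. This is only a sketch of the described semantics: year_matches() is a hypothetical function written for this example, not part of scATAC.Explorer, and the package's own parsing may differ in detail.

```r
# Hypothetical helper (not part of scATAC.Explorer): returns TRUE for each
# year that satisfies a query string following the patterns described above.
year_matches <- function(pattern, years) {
  pattern <- gsub("[[:space:]]", "", pattern)
  if (grepl("^[0-9]{4}-[0-9]{4}$", pattern)) {
    # 'YYYY-YYYY': inclusive range
    bounds <- as.integer(strsplit(pattern, "-", fixed = TRUE)[[1]])
    years >= bounds[1] & years <= bounds[2]
  } else if (grepl("^(<[0-9]{4}|[0-9]{4}>)$", pattern)) {
    # '<YYYY' or 'YYYY>': in or before the given year
    years <= as.integer(gsub("[<>]", "", pattern))
  } else if (grepl("^(>[0-9]{4}|[0-9]{4}<)$", pattern)) {
    # '>YYYY' or 'YYYY<': in or after the given year
    years >= as.integer(gsub("[<>]", "", pattern))
  } else if (grepl("^[0-9]{4}$", pattern)) {
    # 'YYYY': a single year
    years == as.integer(pattern)
  } else {
    stop("Unrecognised year pattern: ", pattern)
  }
}

years <- 2012:2017
years[year_matches("2013-2015", years)]  # 2013 2014 2015
years[year_matches("<2015", years)]      # 2012 2013 2014 2015
years[year_matches("2015<", years)]      # 2015 2016 2017
```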

3 Getting datasets

Once you’ve found a field to search on, you can get your data. For this example, we’re pulling a specific dataset by its GEO accession ID.

res = queryATAC(accession = "GSE89362")

This will return a list containing dataset GSE89362. The dataset is stored as a SingleCellExperiment object, which has the following metadata attached to the object:


Table 2: Metadata attributes in the SingleCellExperiment object.
Attribute | Description
cells | A list of cells included in the study
regions | A list of genomic regions (peaks) included in the study
pmid | The PubMed ID of the study
technology | The sequencing technology used
genome_build | The genome build used for data generation
score_type | The type of scoring or normalization used on the counts data
organism | The type of organism from which cells were sequenced
author | The first author of the paper presenting the data
disease | The diseases from which the cells were sampled
summary | A broad summary of the study conditions under which the samples were assayed
accession | The GEO accession ID for the dataset

To access the chromatin accessibility counts data for a result, use

View(counts(res[[1]]))
Ik0h.r1.A1 Ik0h.r1.A2 Ik0h.r1.A3 Ik0h.r1.A4 Ik0h.r1.A5 Ik0h.r1.A6 Ik0h.r1.B10 Ik0h.r1.B12
chr1-63176687-63176922 0 0 0 0 0 0 0 0
chr1-125435762-125435907 1 1 1 0 0 0 1 0
chr1-139067353-139067473 0 2 0 0 0 0 0 0
chr1-152305577-152305725 0 0 0 0 0 0 0 0
chr1-9748401-9748588 1 2 0 0 0 0 1 2
chr1-51478424-51478572 0 1 0 0 1 0 1 0
chr1-53296958-53297153 0 0 0 0 0 0 0 0
chr1-53313523-53313648 0 0 0 0 0 0 2 0
chr1-75251656-75251790 0 0 0 0 0 0 0 0
chr1-105971701-105971845 1 0 0 0 0 0 2 1
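
The counts matrix has one row per genomic region and one column per cell, so standard matrix summaries apply. The sketch below rebuilds the first five rows of the output above as a plain matrix and computes a few summaries. It assumes that region names always follow the chromosome-start-end pattern seen in that output; with a real dataset, counts(res[[1]]) would take the place of the hand-built matrix.

```r
# First five regions and all eight cells from the output shown above.
cells <- c("Ik0h.r1.A1", "Ik0h.r1.A2", "Ik0h.r1.A3", "Ik0h.r1.A4",
           "Ik0h.r1.A5", "Ik0h.r1.A6", "Ik0h.r1.B10", "Ik0h.r1.B12")
regions <- c("chr1-63176687-63176922", "chr1-125435762-125435907",
             "chr1-139067353-139067473", "chr1-152305577-152305725",
             "chr1-9748401-9748588")
mat <- matrix(c(0, 0, 0, 0, 0, 0, 0, 0,
                1, 1, 1, 0, 0, 0, 1, 0,
                0, 2, 0, 0, 0, 0, 0, 0,
                0, 0, 0, 0, 0, 0, 0, 0,
                1, 2, 0, 0, 0, 0, 1, 2),
              nrow = 5, byrow = TRUE, dimnames = list(regions, cells))

# Total counts per cell (column sums).
counts_per_cell <- colSums(mat)

# Number of cells in which each region has at least one count.
cells_per_region <- rowSums(mat > 0)

# Split region names into chromosome, start, and end
# (assumes the "chr-start-end" naming seen above).
parts <- do.call(rbind, strsplit(rownames(mat), "-", fixed = TRUE))
peaks <- data.frame(chrom = parts[, 1],
                    start = as.integer(parts[, 2]),
                    end   = as.integer(parts[, 3]),
                    stringsAsFactors = FALSE)
peaks$width <- peaks$end - peaks$start

print(counts_per_cell)
print(peaks)
```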

When available, cell type labels and cluster assignments are stored under colData(res[[1]]).

Metadata is stored in a named list accessible by metadata(res[[1]]). Specific entries can be accessed by attribute name.

metadata(res[[1]])$pmid
#> [1] "31682608"

3.1 Example: Returning all datasets with cell-type labels

Say you want to compare chromatin accessibility between known cell types. To do this, you need datasets that have cell-type annotations available. Be aware that returning a large number of datasets like this will require a large amount of memory (more than 16 GB).

res = queryATAC(has_cell_type = TRUE)

This will return a list of all datasets that have cell-type annotations available. You can see the cell types for the first dataset using the following command:

View(colData(res[[1]]))
Ik0h.r1.A1
Ik0h.r1.A2
Ik0h.r1.A3
Ik0h.r1.A4
Ik0h.r1.A5
Ik0h.r1.A6

The first column of this dataframe contains the cell cluster assignment (if available), and the second contains the cell type assignment (if available). The row names of the dataframe specify the cell ID/barcode the annotation belongs to.
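
Given that layout, the annotations can be tabulated or used to select cells by type. The sketch below uses a plain data frame as a stand-in for as.data.frame(colData(res[[1]])). The column names and the cluster and cell-type values are hypothetical; only the cell IDs are taken from the output above.

```r
# Stand-in for as.data.frame(colData(res[[1]])).
# Column names and labels are hypothetical, chosen for illustration only.
annotations <- data.frame(
  cluster   = c(1, 1, 2, 2, 2, 3),
  cell_type = c("typeA", "typeA", "typeB", "typeB", "typeB", "typeC"),
  row.names = paste0("Ik0h.r1.A", 1:6),
  stringsAsFactors = FALSE
)

# Number of cells per cell type (second column).
cells_per_type <- table(annotations[[2]])
print(cells_per_type)

# IDs of all cells annotated with one cell type.
type_b_cells <- rownames(annotations)[annotations[[2]] == "typeB"]
print(type_b_cells)
```

Selecting by column position rather than by name follows the description above and avoids relying on column names that may vary between datasets.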

4 Saving Data

To facilitate the use of any or all datasets outside of R, you can use saveATAC(). saveATAC takes two parameters. The first parameter is the data object to be saved (i.e. a SingleCellExperiment object from queryATAC()). The second parameter is a string specifying the directory you would like data to be saved in. Note that the output directory should not already exist.

To save the data from the earlier example to disk, use the following commands.

res = queryATAC(accession = "GSE89362")[[1]]
saveATAC(res, '~/Downloads/GSE89362')

The result is three files that store the scATAC-seq dataset in the Matrix Market format, which can be read by other programs. A fourth file, in CSV format, will be saved if cell type annotations or cluster assignments are available in the dataset.
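
For readers unfamiliar with the Matrix Market format, the round trip below shows how a sparse matrix is written to and read back from such a file using the Matrix package. It is a sketch that does not call saveATAC(): the matrix is a toy example and the file name example_counts.mtx is a placeholder, since the names saveATAC() gives its output files are not listed here.

```r
library(Matrix)

# Toy sparse matrix standing in for a region-by-cell count matrix.
counts <- sparseMatrix(
  i = c(1, 2, 2, 3),   # row indices of the non-zero entries
  j = c(1, 1, 3, 2),   # column indices of the non-zero entries
  x = c(1, 2, 1, 3),   # the non-zero values
  dims = c(3, 3)
)

# Placeholder file name in a temporary directory.
mtx_path <- file.path(tempdir(), "example_counts.mtx")
writeMM(counts, mtx_path)

# Read the file back and confirm the values survived the round trip.
restored <- readMM(mtx_path)
same_values <- all(as.matrix(restored) == as.matrix(counts))
print(same_values)
```

A Matrix Market file stores only the matrix entries, so row and column names have to travel in separate files.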

5 Session Information

sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] scATAC.Explorer_1.13.0      zellkonverter_1.17.0       
#>  [3] data.table_1.16.2           BiocFileCache_2.15.0       
#>  [5] dbplyr_2.5.0                SingleCellExperiment_1.29.0
#>  [7] SummarizedExperiment_1.37.0 Biobase_2.67.0             
#>  [9] GenomicRanges_1.59.0        GenomeInfoDb_1.43.0        
#> [11] IRanges_2.41.0              S4Vectors_0.45.0           
#> [13] BiocGenerics_0.53.1         generics_0.1.3             
#> [15] MatrixGenerics_1.19.0       matrixStats_1.4.1          
#> [17] BiocStyle_2.35.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] dir.expiry_1.15.0       xfun_0.48               bslib_0.8.0            
#>  [4] lattice_0.22-6          vctrs_0.6.5             tools_4.5.0            
#>  [7] parallel_4.5.0          curl_5.2.3              tibble_3.2.1           
#> [10] fansi_1.0.6             RSQLite_2.3.7           blob_1.2.4             
#> [13] pkgconfig_2.0.3         Matrix_1.7-1            lifecycle_1.0.4        
#> [16] GenomeInfoDbData_1.2.13 compiler_4.5.0          htmltools_0.5.8.1      
#> [19] sass_0.4.9              yaml_2.3.10             pillar_1.9.0           
#> [22] crayon_1.5.3            jquerylib_0.1.4         DelayedArray_0.33.1    
#> [25] cachem_1.1.0            abind_1.4-8             basilisk_1.19.0        
#> [28] tidyselect_1.2.1        digest_0.6.37           purrr_1.0.2            
#> [31] dplyr_1.1.4             bookdown_0.41           fastmap_1.2.0          
#> [34] grid_4.5.0              cli_3.6.3               SparseArray_1.7.0      
#> [37] magrittr_2.0.3          S4Arrays_1.7.0          utf8_1.2.4             
#> [40] filelock_1.0.3          UCSC.utils_1.3.0        bit64_4.5.2            
#> [43] rmarkdown_2.28          XVector_0.47.0          httr_1.4.7             
#> [46] bit_4.5.0               reticulate_1.39.0       png_0.1-8              
#> [49] memoise_2.0.1           evaluate_1.0.1          knitr_1.48             
#> [52] basilisk.utils_1.19.0   rlang_1.1.4             Rcpp_1.0.13            
#> [55] glue_1.8.0              DBI_1.2.3               BiocManager_1.30.25    
#> [58] jsonlite_1.8.9          R6_2.5.1                zlibbioc_1.53.0