scATAC.Explorer 1.13.0
library(scATAC.Explorer)
#> Loading required package: SingleCellExperiment
#> Loading required package: SummarizedExperiment
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#>
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#>
#> colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#> colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#> colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#> colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#> colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#> colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#> colWeightedMeans, colWeightedMedians, colWeightedSds,
#> colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#> rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#> rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#> rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#> rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#> rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#> rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#> rowWeightedSds, rowWeightedVars
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: generics
#>
#> Attaching package: 'generics'
#> The following objects are masked from 'package:base':
#>
#> as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
#> setequal, union
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:generics':
#>
#> intersect, setdiff, setequal, union
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#> lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#> pmin.int, rank, rbind, rownames, sapply, saveRDS, setdiff,
#> setequal, table, tapply, union, unique, unsplit, which.max,
#> which.min
#> Loading required package: S4Vectors
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#>
#> findMatches
#> The following objects are masked from 'package:base':
#>
#> I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#>
#> Attaching package: 'Biobase'
#> The following object is masked from 'package:MatrixGenerics':
#>
#> rowMedians
#> The following objects are masked from 'package:matrixStats':
#>
#> anyMissing, rowMedians
#> Warning: replacing previous import 'S4Arrays::read_block' by
#> 'DelayedArray::read_block' when loading 'SummarizedExperiment'
#> Loading required package: BiocFileCache
#> Loading required package: dbplyr
#> Loading required package: data.table
#>
#> Attaching package: 'data.table'
#> The following object is masked from 'package:SummarizedExperiment':
#>
#> shift
#> The following object is masked from 'package:GenomicRanges':
#>
#> shift
#> The following object is masked from 'package:IRanges':
#>
#> shift
#> The following objects are masked from 'package:S4Vectors':
#>
#> first, second
#> Loading required package: zellkonverter
#> Registered S3 method overwritten by 'zellkonverter':
#> method from
#> py_to_r.pandas.core.arrays.categorical.Categorical reticulate
scATAC.Explorer (Single Cell ATAC-seq Explorer) is a curated collection of publicly available scATAC-seq datasets. It aims to provide a single point of entry for users looking to investigate epigenetics and chromatin accessibility at single-cell resolution across many available datasets.
Users can quickly search the available datasets using the metadata table, and then download any datasets relevant to their research in a standard and easily accessible format. Optionally, users can save the datasets for use in applications other than R.
This package makes it easier to study the epigenome across a variety of organisms, cell types, and diseases. Developers may use this package to obtain data for validation of new algorithms, or to study differences between scATAC-seq datasets.
Start by exploring the available datasets through metadata.
res = queryATAC(metadata_only = TRUE)
X | Reference | Accession | Author | Journal |
---|---|---|---|---|
1 | Satpathy_NatureBiotech_2019 | GSE129785 | Satpathy | Nature Biotech |
2 | Satpathy_NatureBiotech_2019 | GSE129785 | Satpathy | Nature Biotech |
3 | Satpathy_NatureBiotech_2019 | GSE129785 | Satpathy | Nature Biotech |
4 | Satpathy_NatureBiotech_2019 | GSE129785 | Satpathy | Nature Biotech |
5 | Buenrostro_Cell_2018 | GSE96769 | Buenrostro | Cell |
6 | Corces_NatureGenetics_2016 | GSE74310 | Corces | Nature Genetics |
This will return a list containing a single dataframe of metadata for all available datasets.
View the metadata with View(res[[1]]) and then check ?queryATAC for a description of searchable fields.
Note: in order to keep the function’s interface consistent, queryATAC always returns a list of objects, even if there is only one object. You may prefer running res = queryATAC(metadata_only = TRUE)[[1]] in order to save the dataframe directly.
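As a minimal offline sketch of this list-of-one pattern, the following builds a stand-in for the return value by hand, using a few rows and only the columns shown in the metadata preview above. The `res` object here is constructed locally and is not the output of `queryATAC`; the real metadata table has more rows and columns.

```r
# Stand-in for queryATAC(metadata_only = TRUE): a list holding one data frame.
# Rows are copied from the metadata preview above (illustration only).
res <- list(data.frame(
  Reference = c("Satpathy_NatureBiotech_2019", "Buenrostro_Cell_2018",
                "Corces_NatureGenetics_2016"),
  Accession = c("GSE129785", "GSE96769", "GSE74310"),
  Author    = c("Satpathy", "Buenrostro", "Corces"),
  Journal   = c("Nature Biotech", "Cell", "Nature Genetics"),
  stringsAsFactors = FALSE
))

# Because the result is wrapped in a list, extract the data frame with [[1]].
meta <- res[[1]]

# Ordinary data-frame operations then apply.
unique(meta$Accession)
corces <- meta[meta$Author == "Corces", ]
corces
```

Extracting the data frame once and working with `meta` avoids repeating `res[[1]]` in every later expression.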
The metadata_only argument can be applied alongside any other argument in order to examine only datasets that have certain qualities.
You can, for instance, view only leukemia datasets by using
res = queryATAC(disease = 'leukemia', metadata_only = TRUE)[[1]]
Row | X | Reference | Accession | Author | Journal |
---|---|---|---|---|---|
6 | 6 | Corces_NatureGenetics_2016 | GSE74310 | Corces | Nature Genetics |
45 | 45 | Satpathy_NatureMedicine_2018 | GSE107817 | Satpathy | Nature Medicine |
Search Parameter | Description | Examples |
---|---|---|
accession | Search by unique accession number or ID | GSE129785, GSE89362 |
has_cell_types | Filter by presence of cell-type annotations | TRUE, FALSE |
has_clusters | Filter by presence of cluster results | TRUE, FALSE |
disease | Search by disease | Carcinoma, Leukemia |
broad_cell_category | Search by broad cell categories present in datasets | Neuronal, Immune |
tissue_cell_type | Search by tissue or cell type when available | PBMC, glia, cerebral cortex |
author | Search by first author | Satpathy, Cusanovich |
journal | Search by publication journal | Science, Nature, Cell |
year | Search by year of publication | <2015, >2015, 2013-2015 |
pmid | Search by PubMed ID | 27526324, 32494068 |
sequence_tech | Search by sequencing technology | 10x Genomics Chromium |
organism | Search by source organism | Mus musculus |
genome_build | Search by genome build | hg19, hg38, mm10 |
sparse | Return counts as sparse matrices | TRUE, FALSE |
To search by a single year or a range of years, the package looks for specific patterns. ‘2013-2015’ will search for datasets published between 2013 and 2015, inclusive. ‘<2015’ or ‘2015>’ will search for datasets published in or before 2015. ‘>2015’ or ‘2015<’ will search for datasets published in or after 2015.
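The year patterns can be made concrete with a small base R sketch. The helpers `parse_year_query` and `year_matches` are hypothetical and written only for illustration; they are not part of scATAC.Explorer and may differ from how the package parses the `year` argument internally. Treating a bare four-digit year as an exact match is an assumption.

```r
# Hypothetical helper (not from scATAC.Explorer): convert a year query string
# into an inclusive c(from, to) range following the patterns described above.
parse_year_query <- function(query) {
  query <- gsub("\\s", "", as.character(query))
  if (grepl("^[0-9]{4}-[0-9]{4}$", query)) {
    # 'YYYY-YYYY': inclusive range
    bounds <- as.integer(strsplit(query, "-", fixed = TRUE)[[1]])
    return(c(from = bounds[1], to = bounds[2]))
  }
  if (grepl("^<[0-9]{4}$|^[0-9]{4}>$", query)) {
    # '<YYYY' or 'YYYY>': published in or before YYYY
    year <- as.integer(gsub("[<>]", "", query))
    return(c(from = -Inf, to = year))
  }
  if (grepl("^>[0-9]{4}$|^[0-9]{4}<$", query)) {
    # '>YYYY' or 'YYYY<': published in or after YYYY
    year <- as.integer(gsub("[<>]", "", query))
    return(c(from = year, to = Inf))
  }
  if (grepl("^[0-9]{4}$", query)) {
    # Bare 'YYYY': assumed to mean an exact match
    year <- as.integer(query)
    return(c(from = year, to = year))
  }
  stop("Unrecognised year query: ", query)
}

# Hypothetical helper: which publication years satisfy a query?
year_matches <- function(years, query) {
  rng <- parse_year_query(query)
  years >= rng[["from"]] & years <= rng[["to"]]
}

years <- c(2013, 2015, 2019)
year_matches(years, "2013-2015")
year_matches(years, "<2015")
year_matches(years, "2015<")
```

Note that both open-ended forms include the boundary year, so a dataset published in 2015 matches ‘<2015’ as well as ‘>2015’.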
Once you’ve found a field to search on, you can get your data. For this example, we’re pulling a specific dataset by its GEO accession ID.
res = queryATAC(accession = "GSE89362")
This will return a list containing dataset GSE89362.
The dataset is stored as a SingleCellExperiment object, which has the following metadata attached:
Attribute | Description |
---|---|
cells | A list of cells included in the study |
regions | A list of genomic regions (peaks) included in the study |
pmid | The PubMed ID of the study |
technology | The sequencing technology used |
genome_build | The genome build used for data generation |
score_type | The type of scoring or normalization used on the counts data |
organism | The type of organism from which cells were sequenced |
author | The first author of the paper presenting the data |
disease | The diseases from which cells were sampled |
summary | A broad summary of the study conditions under which the sample was assayed |
accession | The GEO accession ID for the dataset |
To access the chromatin accessibility counts data for a result, use
View(counts(res[[1]]))
Region | Ik0h.r1.A1 | Ik0h.r1.A2 | Ik0h.r1.A3 | Ik0h.r1.A4 | Ik0h.r1.A5 | Ik0h.r1.A6 | Ik0h.r1.B10 | Ik0h.r1.B12 |
---|---|---|---|---|---|---|---|---|
chr1-63176687-63176922 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
chr1-125435762-125435907 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
chr1-139067353-139067473 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
chr1-152305577-152305725 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
chr1-9748401-9748588 | 1 | 2 | 0 | 0 | 0 | 0 | 1 | 2 |
chr1-51478424-51478572 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
chr1-53296958-53297153 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
chr1-53313523-53313648 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 |
chr1-75251656-75251790 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
chr1-105971701-105971845 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 1 |
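To illustrate how a regions-by-cells counts matrix like this can be summarized, the sketch below rebuilds the first five rows of the preview above as a plain base R matrix, so it runs without downloading anything. With real data, `m` would instead come from `counts(res[[1]])`. The step that splits row names into chromosome, start, and end assumes the `chr-start-end` naming seen in the preview.

```r
# First five regions and all eight cells from the counts preview above.
cells <- c("Ik0h.r1.A1", "Ik0h.r1.A2", "Ik0h.r1.A3", "Ik0h.r1.A4",
           "Ik0h.r1.A5", "Ik0h.r1.A6", "Ik0h.r1.B10", "Ik0h.r1.B12")
regions <- c("chr1-63176687-63176922", "chr1-125435762-125435907",
             "chr1-139067353-139067473", "chr1-152305577-152305725",
             "chr1-9748401-9748588")
m <- matrix(c(0, 0, 0, 0, 0, 0, 0, 0,
              1, 1, 1, 0, 0, 0, 1, 0,
              0, 2, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0,
              1, 2, 0, 0, 0, 0, 1, 2),
            nrow = length(regions), byrow = TRUE,
            dimnames = list(regions, cells))

# Total counts per cell (a rough per-cell depth within these regions).
per_cell <- colSums(m)

# Number of cells in which each region has at least one count.
cells_open <- rowSums(m > 0)

# Split 'chr-start-end' row names into genomic coordinates.
parts <- do.call(rbind, strsplit(rownames(m), "-", fixed = TRUE))
peaks <- data.frame(chr   = parts[, 1],
                    start = as.integer(parts[, 2]),
                    end   = as.integer(parts[, 3]))
peaks$width <- peaks$end - peaks$start

per_cell
cells_open
peaks
```

For full datasets, requesting sparse matrices via the `sparse` argument may keep summaries like these within memory limits.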
Cell type labels and/or cluster assignments are stored under colData(res[[1]]) for datasets in which they are available.
Metadata is stored in a named list accessible by metadata(res[[1]]).
Specific entries can be accessed by attribute name.
metadata(res[[1]])$pmid
#> [1] "31682608"
Say you want to compare chromatin accessibility between known cell types. To do this, you need datasets that have cell-type annotations available. Be aware that returning a large number of datasets like this will require a large amount of memory (more than 16 GB).
res = queryATAC(has_cell_type = TRUE)
This will return a list of all datasets that have cell-type annotations available. You can see the cell types for the first dataset using the following command:
View(colData(res[[1]]))
Ik0h.r1.A1 |
Ik0h.r1.A2 |
Ik0h.r1.A3 |
Ik0h.r1.A4 |
Ik0h.r1.A5 |
Ik0h.r1.A6 |
The first column of this dataframe contains the cell cluster assignment (if available), and the second contains the cell type assignment (if available). The row names of the dataframe specify the cell ID/barcode the annotation belongs to.
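As a rough sketch of how such an annotation table can be summarized, the example below uses a small made-up data frame in place of `as.data.frame(colData(res[[1]]))`. The column names, cell IDs, and labels are invented for illustration and do not come from any dataset in the package. Columns are accessed by position to follow the layout described above; if a dataset provides only one of the two annotations, the positions may differ.

```r
# Made-up annotation table (illustration only): row names are cell IDs,
# column 1 is the cluster assignment, column 2 is the cell type.
annot <- data.frame(
  cluster   = c(1, 1, 2, 2, 2),
  cell_type = c("type_A", "type_A", "type_B", "type_B", "type_B"),
  row.names = paste0("cell_", 1:5),
  stringsAsFactors = FALSE
)

# Positional access, following the layout described above.
clusters   <- annot[[1]]
cell_types <- annot[[2]]

# Number of cells per cell type.
type_counts <- table(cell_types)
type_counts

# Cross-tabulate clusters against cell types.
table(cluster = clusters, cell_type = cell_types)

# Cell IDs belonging to one cell type, e.g. for subsetting the counts matrix.
type_b_cells <- rownames(annot)[cell_types == "type_B"]
type_b_cells
```

The resulting vector of cell IDs can be used to select the matching columns of the counts matrix when comparing accessibility between cell types.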
To facilitate the use of any or all datasets outside of R, you can use saveATAC(). saveATAC takes two parameters. The first parameter is the data object to be saved (i.e., a SingleCellExperiment object from queryATAC()). The second parameter is a string specifying the directory you would like data to be saved in.
Note that the output directory should not already exist.
To save the data from the earlier example to disk, use the following commands.
res = queryATAC(accession = "GSE89362")[[1]]
saveATAC(res, '~/Downloads/GSE89362')
The result is three files storing the scATAC-seq dataset in the Matrix Market format, which can be read by other programs. A fourth CSV file will be saved if cell type annotations or cluster assignments are available in the dataset.
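To show what a Matrix Market round trip looks like, here is a self-contained sketch using the Matrix package. It writes a toy sparse matrix plus row and column labels to a temporary directory and reads them back. The file names (`counts.mtx`, `regions.tsv`, `cells.tsv`), the toy values, and the three-file layout are assumptions made for this example; check the directory produced by saveATAC() for the actual names.

```r
library(Matrix)

# Toy sparse counts matrix (regions x cells); values are illustrative only.
counts <- sparseMatrix(
  i = c(1, 2, 2, 3), j = c(1, 1, 2, 3), x = c(1, 2, 1, 3),
  dims = c(3, 3)
)
regions <- c("chr1-100-200", "chr1-300-400", "chr2-100-200")
cells   <- c("cell_1", "cell_2", "cell_3")

# Write to a fresh directory. File names here are hypothetical.
out_dir <- file.path(tempdir(), "atac_mm_demo")
unlink(out_dir, recursive = TRUE)
dir.create(out_dir)
invisible(writeMM(counts, file.path(out_dir, "counts.mtx")))
writeLines(regions, file.path(out_dir, "regions.tsv"))
writeLines(cells, file.path(out_dir, "cells.tsv"))

# Read the matrix back and reattach the labels.
restored <- readMM(file.path(out_dir, "counts.mtx"))
restored <- as(restored, "CsparseMatrix")
dimnames(restored) <- list(
  readLines(file.path(out_dir, "regions.tsv")),
  readLines(file.path(out_dir, "cells.tsv"))
)
restored
```

Matrix Market stores only the nonzero entries, so the region and cell labels have to travel in separate files and be reattached after loading.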
sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] scATAC.Explorer_1.13.0 zellkonverter_1.17.0
#> [3] data.table_1.16.2 BiocFileCache_2.15.0
#> [5] dbplyr_2.5.0 SingleCellExperiment_1.29.0
#> [7] SummarizedExperiment_1.37.0 Biobase_2.67.0
#> [9] GenomicRanges_1.59.0 GenomeInfoDb_1.43.0
#> [11] IRanges_2.41.0 S4Vectors_0.45.0
#> [13] BiocGenerics_0.53.1 generics_0.1.3
#> [15] MatrixGenerics_1.19.0 matrixStats_1.4.1
#> [17] BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] dir.expiry_1.15.0 xfun_0.48 bslib_0.8.0
#> [4] lattice_0.22-6 vctrs_0.6.5 tools_4.5.0
#> [7] parallel_4.5.0 curl_5.2.3 tibble_3.2.1
#> [10] fansi_1.0.6 RSQLite_2.3.7 blob_1.2.4
#> [13] pkgconfig_2.0.3 Matrix_1.7-1 lifecycle_1.0.4
#> [16] GenomeInfoDbData_1.2.13 compiler_4.5.0 htmltools_0.5.8.1
#> [19] sass_0.4.9 yaml_2.3.10 pillar_1.9.0
#> [22] crayon_1.5.3 jquerylib_0.1.4 DelayedArray_0.33.1
#> [25] cachem_1.1.0 abind_1.4-8 basilisk_1.19.0
#> [28] tidyselect_1.2.1 digest_0.6.37 purrr_1.0.2
#> [31] dplyr_1.1.4 bookdown_0.41 fastmap_1.2.0
#> [34] grid_4.5.0 cli_3.6.3 SparseArray_1.7.0
#> [37] magrittr_2.0.3 S4Arrays_1.7.0 utf8_1.2.4
#> [40] filelock_1.0.3 UCSC.utils_1.3.0 bit64_4.5.2
#> [43] rmarkdown_2.28 XVector_0.47.0 httr_1.4.7
#> [46] bit_4.5.0 reticulate_1.39.0 png_0.1-8
#> [49] memoise_2.0.1 evaluate_1.0.1 knitr_1.48
#> [52] basilisk.utils_1.19.0 rlang_1.1.4 Rcpp_1.0.13
#> [55] glue_1.8.0 DBI_1.2.3 BiocManager_1.30.25
#> [58] jsonlite_1.8.9 R6_2.5.1 zlibbioc_1.53.0