R is an open-source statistical environment which can be easily modified to enhance its functionality via packages. recount is a R package available via the Bioconductor repository for packages. R can be installed on any operating system from CRAN after which you can install recount by using the following commands in your R session:
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("recount")
## Check that you have a valid Bioconductor installation
biocValid()recount is based on many other packages and in particular in those that have implemented the infrastructure needed for dealing with RNA-seq data. That is, packages like GenomicFeatures and rtracklayer that allow you to import the data. A recount user is not expected to deal with those packages directly but will need to be familiar with SummarizedExperiment to understand the results recount generates. It might also prove to be highly beneficial to check the
If you are asking yourself the question “Where do I start using Bioconductor?” you might be interested in this blog post.
As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But R and Bioconductor have a steep learning curve so it is critical to learn where to ask for help. The blog post quoted above mentions some but we would like to highlight the Bioconductor support site as the main resource for getting help: remember to use the recount tag and check the older posts. Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the posting guidelines. It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.
We hope that recount will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!
## Citation info
citation('recount')## 
## Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen
## KD, Jaffe AE, Langmead B and Leek JT (2017). "Reproducible RNA-seq
## analysis using recount2." _Nature Biotechnology_. doi:
## 10.1038/nbt.3838 (URL: http://doi.org/10.1038/nbt.3838), <URL:
## http://www.nature.com/nbt/journal/v35/n4/full/nbt.3838.html>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {Reproducible RNA-seq analysis using recount2},
##     author = {Leonardo Collado-Torres and Abhinav Nellore and Kai Kammers and Shannon E. Ellis and Margaret A. Taub and Kasper D. Hansen and Andrew E. Jaffe and Ben Langmead and Jeffrey T. Leek},
##     year = {2017},
##     journal = {Nature Biotechnology},
##     doi = {10.1038/nbt.3838},
##     url = {http://www.nature.com/nbt/journal/v35/n4/full/nbt.3838.html},
##   }As of January 30, 2017 the annotation used for the exon and gene counts is Gencode v25.
Here is a very quick example of how to download a RangedSummarizedExperiment object with the gene counts for a 2 groups project (12 samples) with SRA study id SRP009615 using the recount package (Morgan, Obenchain, Lang, and Thompson, 2017). The RangedSummarizedExperiment object is defined in the SummarizedExperiment (Winter, 2017) package and can be used for differential expression analysis with different packages. Here we show how to use DESeq2 (Carlson, 2017) to perform the differential expresion analysis.
This quick analysis is explained in more detail later on in this document. Further information about the recount project can be found in the publication pre-print.
## Load library
library('recount')
## Find a project of interest
project_info <- abstract_search('GSE32465')
## Download the gene-level RangedSummarizedExperiment data
download_study(project_info$project)
## Load the data
load(file.path(project_info$project, 'rse_gene.Rdata'))
## Browse the project at SRA
browse_study(project_info$project)
## View GEO ids
colData(rse_gene)$geo_accession
## Extract the sample characteristics
geochar <- lapply(split(colData(rse_gene), seq_len(nrow(colData(rse_gene)))), geo_characteristics)
## Note that the information for this study is a little inconsistent, so we
## have to fix it.
geochar <- do.call(rbind, lapply(geochar, function(x) {
    if('cells' %in% colnames(x)) {
        colnames(x)[colnames(x) == 'cells'] <- 'cell.line'
        return(x)
    } else {
        return(x)
    }
}))
## We can now define some sample information to use
sample_info <- data.frame(
    run = colData(rse_gene)$run,
    group = ifelse(grepl('uninduced', colData(rse_gene)$title), 'uninduced', 'induced'),
    gene_target = sapply(colData(rse_gene)$title, function(x) { strsplit(strsplit(x,
        'targeting ')[[1]][2], ' gene')[[1]][1] }),
    cell.line = geochar$cell.line
)
## Scale counts by taking into account the total coverage per sample
rse <- scale_counts(rse_gene)
## Add sample information for DE analysis
colData(rse)$group <- sample_info$group
colData(rse)$gene_target <- sample_info$gene_target
## Perform differential expression analysis with DESeq2
library('DESeq2')
## Specify design and switch to DESeq2 format
dds <- DESeqDataSet(rse, ~ gene_target + group)
## Perform DE analysis
dds <- DESeq(dds, test = 'LRT', reduced = ~ gene_target, fitType = 'local')
res <- results(dds)
## Explore results
plotMA(res, main="DESeq2 results for SRP009615")
## Make a report with the results
library('regionReport')
DESeq2Report(dds, res = res, project = 'SRP009615',
    intgroup = c('group', 'gene_target'), outdir = '.',
    output = 'SRP009615-results')The recount project also hosts the necessary data to perform annotation-agnostic differential expression analyses with derfinder (Boettiger, 2017). An example analysis would like this:
## Define expressed regions for study SRP009615, only for chromosome Y
regions <- expressed_regions('SRP009615', 'chrY', cutoff = 5L, 
    maxClusterGap = 3000L)
## Compute coverage matrix for study SRP009615, only for chromosome Y
system.time( rse_ER <- coverage_matrix('SRP009615', 'chrY', regions) )
## Round the coverage matrix to integers
covMat <- round(assays(rse_ER)$counts, 0)
## Get phenotype data for study SRP009615
pheno <- colData(rse_ER)
## Complete the phenotype table with the data we got from GEO
m <- match(pheno$run, sample_info$run)
pheno <- cbind(pheno, sample_info[m, 2:3])
## Build a DESeqDataSet
dds_ers <- DESeqDataSetFromMatrix(countData = covMat, colData = pheno,
    design =  ~ gene_target + group)
## Perform differential expression analysis with DESeq2 at the ER-level
dds_ers <- DESeq(dds_ers, test = 'LRT', reduced = ~ gene_target,
    fitType = 'local')
res_ers <- results(dds_ers)
## Explore results
plotMA(res_ers, main="DESeq2 results for SRP009615 (ER-level, chrY)")
## Create a more extensive exploratory report
DESeq2Report(dds_ers, res = res_ers,
    project = 'SRP009615 (ER-level, chrY)',
    intgroup = c('group', 'gene_target'), outdir = '.',
    output = 'SRP009615-results-ER-level-chrY')recount is an R package that provides an interface to the recount project website. This package allows you to download the files from the recount project and has helper functions for getting you started with differential expression analyses. This vignette will walk you through an example.
This is a brief overview of what you can do with recount. In this particular example we will download data from the SRP009615 study which sequenced 12 samples as described in the previous link.
If you don’t have recount installed, please do so with:
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("recount")Next we load the required packages. Loading recount will load the required dependencies.
## Load recount R package
library('recount')Lets say that we don’t know the actual SRA accession number for this study but we do know a particular term which will help us identify it. If that’s the case, we can use the abstract_search() function to identify the study of interest as shown below.
## Find a project of interest
project_info <- abstract_search('GSE32465')
## Explore info
project_info##     number_samples species
## 340             12   human
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               abstract
## 340 Summary: K562-shX cells are made in an effort to validate TFBS data and ChIP-seq antibodies in Myers lab (GSE32465). K562 cells are transduced with lentiviral vector having Tet-inducible shRNA targeting a transcription factor gene. Cells with stable integration of shRNA constructs are selected using puromycin in growth media. Doxycycline is added to the growth media to induce the expression of shRNA and a red fluorescent protein marker.  A successful shRNA cell line shows at least a 70% reduction in expression of the target transcription factor as measured by qPCR. For identification, we designated these cell lines as K562-shX, where X is the transcription factor targeted by shRNA and K562 denotes the parent cell line.  For example, K562-shATF3 cells are K562 derived cells selected for stable integration of shRNA targeting the transcription factor ATF3 gene and showed at least a 70% reduction in the expression of ATF3 gene when measured by qPCR. Cells growing without doxycycline (uninduced) are used as a control to measure the change in expression of target transcription factor gene after induction of shRNA using doxycycline. For detailed growth and culturing protocols for these cells please refer to http://hudsonalpha.org/myers-lab/protocols . To identify the potential downstream targets of the candidate transcription factor, analyze the mRNA expression profile of the uninduced and induced K562-shX using RNA-seq.    For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf    Overall Design: Make K562-shX cells as described in the http://hudsonalpha.org/myers-lab/protocols . Measure the mRNA expression levels in uninduced K562-shX and induced K562-shX cells in two biological replicates using RNA-seq. Identify the potential downstream targets of the candidate transcription factor.
##       project
## 340 SRP009615Now that we have a study that we are interested in, we can download the RangedSummarizedExperiment object (see SummarizedExperiment) with the data summarized at the gene level. The function download_study() helps us do this. If you are interested on how the annotation was defined, check reproduce_ranges() described in the Annotation section further down.
## Download the gene-level RangedSummarizedExperiment data
download_study(project_info$project)## 2017-08-11 20:34:28 downloading file rse_gene.Rdata to SRP009615## Load the data
load(file.path(project_info$project, 'rse_gene.Rdata'))We can explore a bit this RangedSummarizedExperiment as shown below.
rse_gene## class: RangedSummarizedExperiment 
## dim: 58037 12 
## metadata(0):
## assays(1): counts
## rownames(58037): ENSG00000000003.14 ENSG00000000005.5 ...
##   ENSG00000283698.1 ENSG00000283699.1
## rowData names(3): gene_id bp_length symbol
## colnames(12): SRR387777 SRR387778 ... SRR389083 SRR389084
## colData names(21): project sample ... title characteristics## This is the sample phenotype data provided by the recount project
colData(rse_gene)## DataFrame with 12 rows and 21 columns
##               project      sample  experiment         run
##           <character> <character> <character> <character>
## SRR387777   SRP009615   SRS281685   SRX110461   SRR387777
## SRR387778   SRP009615   SRS281686   SRX110462   SRR387778
## SRR387779   SRP009615   SRS281687   SRX110463   SRR387779
## SRR387780   SRP009615   SRS281688   SRX110464   SRR387780
## SRR389077   SRP009615   SRS282369   SRX111299   SRR389077
## ...               ...         ...         ...         ...
## SRR389080   SRP009615   SRS282372   SRX111302   SRR389080
## SRR389081   SRP009615   SRS282373   SRX111303   SRR389081
## SRR389082   SRP009615   SRS282374   SRX111304   SRR389082
## SRR389083   SRP009615   SRS282375   SRX111305   SRR389083
## SRR389084   SRP009615   SRS282376   SRX111306   SRR389084
##           read_count_as_reported_by_sra reads_downloaded
##                               <integer>        <integer>
## SRR387777                      30631853         30631853
## SRR387778                      37001306         37001306
## SRR387779                      40552001         40552001
## SRR387780                      32466352         32466352
## SRR389077                      27819603         27819603
## ...                                 ...              ...
## SRR389080                      34856203         34856203
## SRR389081                      23351679         23351679
## SRR389082                      18144828         18144828
## SRR389083                      24417368         24417368
## SRR389084                      23060084         23060084
##           proportion_of_reads_reported_by_sra_downloaded paired_end
##                                                <numeric>  <logical>
## SRR387777                                              1      FALSE
## SRR387778                                              1      FALSE
## SRR387779                                              1      FALSE
## SRR387780                                              1      FALSE
## SRR389077                                              1      FALSE
## ...                                                  ...        ...
## SRR389080                                              1      FALSE
## SRR389081                                              1      FALSE
## SRR389082                                              1      FALSE
## SRR389083                                              1      FALSE
## SRR389084                                              1      FALSE
##           sra_misreported_paired_end mapped_read_count        auc
##                            <logical>         <integer>  <numeric>
## SRR387777                      FALSE          28798572 1029494445
## SRR387778                      FALSE          33170281 1184877985
## SRR387779                      FALSE          37322762 1336528969
## SRR387780                      FALSE          29970735 1073178116
## SRR389077                      FALSE          24966859  893978355
## ...                              ...               ...        ...
## SRR389080                      FALSE          32469994 1163527939
## SRR389081                      FALSE          21904197  781685955
## SRR389082                      FALSE          17199795  616048853
## SRR389083                      FALSE          22499386  806323346
## SRR389084                      FALSE          21957003  787795710
##           sharq_beta_tissue sharq_beta_cell_type biosample_submission_date
##                 <character>          <character>               <character>
## SRR387777             blood                 k562   2011-12-05T15:40:03.870
## SRR387778             blood                 k562   2011-12-05T15:40:03.897
## SRR387779             blood                 k562   2011-12-05T15:40:03.910
## SRR387780             blood                 k562   2011-12-05T15:40:03.923
## SRR389077             blood                 k562   2011-12-13T11:26:05.720
## ...                     ...                  ...                       ...
## SRR389080             blood                 k562   2011-12-13T11:26:05.787
## SRR389081             blood                 k562   2011-12-13T11:26:05.800
## SRR389082             blood                 k562   2011-12-13T11:26:05.817
## SRR389083             blood                 k562   2011-12-13T11:26:05.830
## SRR389084             blood                 k562   2011-12-13T11:26:05.847
##           biosample_publication_date   biosample_update_date
##                          <character>             <character>
## SRR387777    2011-12-07T09:29:59.890 2014-08-27T04:18:20.530
## SRR387778    2011-12-07T09:29:59.890 2014-08-27T04:18:21.053
## SRR387779    2011-12-07T09:29:59.890 2014-08-27T04:18:21.800
## SRR387780    2011-12-07T09:29:59.890 2014-08-27T04:18:22.320
## SRR389077    2011-12-13T11:26:06.663 2014-08-27T04:22:14.857
## ...                              ...                     ...
## SRR389080    2011-12-13T11:26:06.663 2014-08-27T04:22:15.933
## SRR389081    2011-12-13T11:26:06.663 2014-08-27T04:22:16.290
## SRR389082    2011-12-13T11:26:06.663 2014-08-27T04:22:16.647
## SRR389083    2011-12-13T11:26:06.663 2014-08-27T04:22:17.037
## SRR389084    2011-12-13T11:26:06.663 2014-08-27T04:22:17.473
##           avg_read_length geo_accession  bigwig_file
##                 <integer>   <character>  <character>
## SRR387777              36     GSM836270 SRR387777.bw
## SRR387778              36     GSM836271 SRR387778.bw
## SRR387779              36     GSM836272 SRR387779.bw
## SRR387780              36     GSM836273 SRR387780.bw
## SRR389077              36     GSM847561 SRR389077.bw
## ...                   ...           ...          ...
## SRR389080              36     GSM847564 SRR389080.bw
## SRR389081              36     GSM847565 SRR389081.bw
## SRR389082              36     GSM847566 SRR389082.bw
## SRR389083              36     GSM847567 SRR389083.bw
## SRR389084              36     GSM847568 SRR389084.bw
##                                                                                                     title
##                                                                                               <character>
## SRR387777   K562 cells with shRNA targeting SRF gene cultured with no doxycycline (uninduced - UI), rep1.
## SRR387778  K562 cells with shRNA targeting SRF gene cultured with doxycycline for 48 hours (48 hr), rep1.
## SRR387779   K562 cells with shRNA targeting SRF gene cultured with no doxycycline (uninduced - UI), rep2.
## SRR387780  K562 cells with shRNA targeting SRF gene cultured with doxycycline for 48 hours (48 hr), rep2.
## SRR389077  K562 cells with shRNA targeting EGR1 gene cultured with no doxycycline (uninduced - UI), rep1.
## ...                                                                                                   ...
## SRR389080 K562 cells with shRNA targeting EGR1 gene cultured with doxycycline for 72 hours (72 hr), rep2.
## SRR389081  K562 cells with shRNA targeting ATF3 gene cultured with no doxycycline (uninduced - UI), rep1.
## SRR389082 K562 cells with shRNA targeting ATF3 gene cultured with doxycycline for 24 hours (24 hr), rep1.
## SRR389083  K562 cells with shRNA targeting ATF3 gene cultured with no doxycycline (uninduced - UI), rep2.
## SRR389084 K562 cells with shRNA targeting ATF3 gene cultured with doxycycline for 24 hours (24 hr), rep2.
##                                                                                               characteristics
##                                                                                               <CharacterList>
## SRR387777                                               cells: K562,shRNA expression: no,treatment: Puromycin
## SRR387778                  cells: K562,shRNA expression: yes, targeting SRF,treatment: Puromycin, doxycycline
## SRR387779                                               cells: K562,shRNA expression: no,treatment: Puromycin
## SRR387780                   cells: K562,shRNA expression: yes targeting SRF,treatment: Puromycin, doxycycline
## SRR389077                          cell line: K562,shRNA expression: no shRNA expression,treatment: Puromycin
## ...                                                                                                       ...
## SRR389080 cell line: K562,shRNA expression: expressing shRNA targeting EGR1,treatment: Puromycin, doxycycline
## SRR389081                          cell line: K562,shRNA expression: no shRNA expression,treatment: Puromycin
## SRR389082 cell line: K562,shRNA expression: expressing shRNA targeting ATF3,treatment: Puromycin, doxycycline
## SRR389083                          cell line: K562,shRNA expression: no shRNA expression,treatment: Puromycin
## SRR389084 cell line: K562,shRNA expression: expressing shRNA targeting ATF3,treatment: Puromycin, doxycycline## At the gene level, the row data includes the gene Gencode ids, the gene
## symbols and the sum of the reduced exons widths, which can be used for 
## taking into account the gene length.
rowData(rse_gene)## DataFrame with 58037 rows and 3 columns
##                  gene_id bp_length          symbol
##              <character> <integer> <CharacterList>
## 1     ENSG00000000003.14      4535          TSPAN6
## 2      ENSG00000000005.5      1610            TNMD
## 3     ENSG00000000419.12      1207            DPM1
## 4     ENSG00000000457.13      6883           SCYL3
## 5     ENSG00000000460.16      5967        C1orf112
## ...                  ...       ...             ...
## 58033  ENSG00000283695.1        61              NA
## 58034  ENSG00000283696.1       997              NA
## 58035  ENSG00000283697.1      1184    LOC101928917
## 58036  ENSG00000283698.1       940              NA
## 58037  ENSG00000283699.1        60         MIR4481## At the exon level, you can get the gene Gencode ids from the names of:
# rowRanges(rse_exon)Once we have identified the study of interest, we can use the browse_study() function to browse the study at the SRA website.
## Browse the project at SRA
browse_study(project_info$project)The SRA website includes an Experiments link which further describes each of the samples. From the information available for SRP009615 at NCBI we have some further sample information that we can save for use in our differential expression analysis. We can get some of this information from GEO. The function find_geo() finds the GEO accession id for a given SRA run accession id, which we can then use with geo_info() and geo_characteristics() to parse this information. The rse_gene object already has some of this information.
## View GEO ids
colData(rse_gene)$geo_accession##  [1] "GSM836270" "GSM836271" "GSM836272" "GSM836273" "GSM847561"
##  [6] "GSM847562" "GSM847563" "GSM847564" "GSM847565" "GSM847566"
## [11] "GSM847567" "GSM847568"## Extract the sample characteristics
geochar <- lapply(split(colData(rse_gene), seq_len(nrow(colData(rse_gene)))), geo_characteristics)
## Note that the information for this study is a little inconsistent, so we
## have to fix it.
geochar <- do.call(rbind, lapply(geochar, function(x) {
    if('cells' %in% colnames(x)) {
        colnames(x)[colnames(x) == 'cells'] <- 'cell.line'
        return(x)
    } else {
        return(x)
    }
}))
## We can now define some sample information to use
sample_info <- data.frame(
    run = colData(rse_gene)$run,
    group = ifelse(grepl('uninduced', colData(rse_gene)$title), 'uninduced', 'induced'),
    gene_target = sapply(colData(rse_gene)$title, function(x) { strsplit(strsplit(x,
        'targeting ')[[1]][2], ' gene')[[1]][1] }),
    cell.line = geochar$cell.line
)The recount project records the sum of the base level coverage for each gene (or exon). These raw counts have to be scaled and there are several ways in which you can choose to do so. The function scale_counts() helps you scale them in a way that is tailored to Rail-RNA output.
## Scale counts by taking into account the total coverage per sample
rse <- scale_counts(rse_gene)We are almost ready to perform our differential expression analysis. Lets just add the information we recovered GEO about these samples.
## Add sample information for DE analysis
colData(rse)$group <- sample_info$group
colData(rse)$gene_target <- sample_info$gene_targetNow that the RangedSummarizedExperiment is complete, we can use DESeq2 or another package to perform the differential expression test. Note that you can use DEFormats for switching between formats if you want to use another package, like edgeR.
In this particular analysis, we’ll test whether there is a group difference adjusting for the gene target.
## Perform differential expression analysis with DESeq2
library('DESeq2')
## Specify design and switch to DESeq2 format
dds <- DESeqDataSet(rse, ~ gene_target + group)## converting counts to integer mode## Perform DE analysis
dds <- DESeq(dds, test = 'LRT', reduced = ~ gene_target, fitType = 'local')## estimating size factors## estimating dispersions## gene-wise dispersion estimates## mean-dispersion relationship## final dispersion estimates## fitting model and testingres <- results(dds)We can now use functions from DESeq2 to explore the results. For more details check the DESeq2 vignette. For example, we can make a MA plot as shown below.
## Explore results
plotMA(res, main="DESeq2 results for SRP009615")We can also use the regionReport package to generate interactive HTML reports exploring the DESeq2 results (or edgeR results if you used that package).
## Make a report with the results
library('regionReport')
report <- DESeq2Report(dds, res = res, project = 'SRP009615',
    intgroup = c('group', 'gene_target'), outdir = '.',
    output = 'SRP009615-results', nBest = 10, nBestFeatures = 2)## Warning in c.bibentry(knitcitations = citation("knitcitations"),
## regionReport = citation("regionReport")[1], : method is only applicable to
## 'bibentry' objects## Writing 9 Bibtex entries ... OK
## Results written to file './SRP009615-results.bib'
## processing file: SRP009615-results.Rmd
## output file: SRP009615-results.knit.md
## 
## Output created: SRP009615-results.htmlYou can view the final report here.
The gene Gencode ids are included in several objects in recount. With these ids you can find the gene symbols, gene names and other information. The org.Hs.eg.db is useful for finding this information and the following code shows how to get the gene names and symbols.
## Load required library
library('org.Hs.eg.db')## Loading required package: AnnotationDbi## ## Extract Gencode gene ids
gencode <- gsub('\\..*', '', names(recount_genes))
## Find the gene information we are interested in
gene_info <- select(org.Hs.eg.db, gencode, c('ENTREZID', 'GENENAME', 'SYMBOL',
    'ENSEMBL'), 'ENSEMBL')## 'select()' returned many:many mapping between keys and columns## Explore part of the results
dim(gene_info)## [1] 58732     4head(gene_info)##           ENSEMBL ENTREZID
## 1 ENSG00000000003     7105
## 2 ENSG00000000005    64102
## 3 ENSG00000000419     8813
## 4 ENSG00000000457    57147
## 5 ENSG00000000460    55732
## 6 ENSG00000000938     2268
##                                                      GENENAME   SYMBOL
## 1                                               tetraspanin 6   TSPAN6
## 2                                                 tenomodulin     TNMD
## 3 dolichyl-phosphate mannosyltransferase subunit 1, catalytic     DPM1
## 4                                    SCY1 like pseudokinase 3    SCYL3
## 5                         chromosome 1 open reading frame 112 C1orf112
## 6              FGR proto-oncogene, Src family tyrosine kinase      FGRThe recount project also hosts for each project sample coverage bigWig files created by Rail-RNA and a mean coverage bigWig file. For the mean coverage bigWig file, all samples were normalized to libraries of 40 million reads, each a 100 base-pairs long. recount can be used along with derfinder (Boettiger, 2017) to identify expressed regions from the data. This type of analysis is annotation-agnostic which can be advantageous. The following subsections illustrate this type of analysis.
For an annotation-agnostic differential expression analysis, we first need to define the regions of interest. With recount we can do so using the expressed_regions() function as shown below for the same study we studied earlier.
## Define expressed regions for study SRP009615, only for chromosome Y
regions <- expressed_regions('SRP009615', 'chrY', cutoff = 5L, 
    maxClusterGap = 3000L)## 2017-08-11 20:36:18 loadCoverage: loading BigWig file http://duffel.rail.bio/recount/SRP009615/bw/mean_SRP009615.bw## 2017-08-11 20:36:21 loadCoverage: applying the cutoff to the merged data## 2017-08-11 20:36:21 filterData: originally there were 57227415 rows, now there are 57227415 rows. Meaning that 0 percent was filtered.## 2017-08-11 20:36:21 findRegions: identifying potential segments## 2017-08-11 20:36:21 findRegions: segmenting information## 2017-08-11 20:36:21 .getSegmentsRle: segmenting with cutoff(s) 5## 2017-08-11 20:36:21 findRegions: identifying candidate regions## 2017-08-11 20:36:22 findRegions: identifying region clusters## Briefly explore the resulting regions
regions## GRanges object with 808 ranges and 6 metadata columns:
##       seqnames               ranges strand |            value
##          <Rle>            <IRanges>  <Rle> |        <numeric>
##     1     chrY   [2929794, 2929829]      * | 14.7265009482702
##     2     chrY   [2956678, 2956701]      * | 12.8106340567271
##     3     chrY   [2977203, 2977227]      * | 5.34908433914185
##     4     chrY   [2977957, 2977994]      * | 6.46976616508082
##     5     chrY   [2978850, 2978871]      * |  5.7976552139629
##   ...      ...                  ...    ... .              ...
##   804     chrY [26614511, 26614546]      * | 7.28189378314548
##   805     chrY [26614548, 26614553]      * | 5.48768202463786
##   806     chrY [26614779, 26614808]      * | 6.64339276949565
##   807     chrY [26626808, 26626848]      * | 12.6038152648181
##   808     chrY [26626971, 26627028]      * | 14.1673366941255
##                   area indexStart  indexEnd cluster clusterL
##              <numeric>  <integer> <integer>   <Rle>    <Rle>
##     1 530.154034137726    2929794   2929829       1       36
##     2  307.45521736145    2956678   2956701       2       24
##     3 133.727108478546    2977203   2977227       3     2750
##     4 245.851114273071    2977957   2977994       3     2750
##     5 127.548414707184    2978850   2978871       3     2750
##   ...              ...        ...       ...     ...      ...
##   804 262.148176193237   26614511  26614546     224      298
##   805 32.9260921478271   26614548  26614553     224      298
##   806 199.301783084869   26614779  26614808     224      298
##   807 516.756425857544   26626808  26626848     225      221
##   808 821.705528259277   26626971  26627028     225      221
##   -------
##   seqinfo: 1 sequence from an unspecified genomeOnce the regions have been defined, you can export them into a BED file using rtracklayer or other file formats.
Having defined the expressed regions, we can now compute a coverage matrix for these regions. We can do so using the function coverage_matrix() from recount as shown below. It returns a RangedSummarizedExperiment object.
## Compute coverage matrix for study SRP009615, only for chromosome Y
system.time( rse_ER <- coverage_matrix('SRP009615', 'chrY', regions) )## 2017-08-11 20:36:25 railMatrix: processing regions 1 to 808## 2017-08-11 20:36:25 railMatrix: processing file http://duffel.rail.bio/recount/SRP009615/bw/SRR387777.bw## 2017-08-11 20:36:28 railMatrix: processing file http://duffel.rail.bio/recount/SRP009615/bw/SRR387778.bw## 2017-08-11 20:36:32 railMatrix: processing file http://duffel.rail.bio/recount/SRP009615/bw/SRR387779.bw## 2017-08-11 20:36:35 railMatrix: processing file http://duffel.rail.bio/recount/SRP009615/bw/SRR387780.bw## 2017-08-11 20:36:38 railMatrix: processing file http://duffel.rail.bio/recount/SRP009615/bw/SRR389077.bw## 2017-08-11 20:36:41 railMatrix: processing file http://duffel.rail.bio/recount/SRP009615/bw/SRR389078.bw## 2017-08-11 20:36:44 railMatrix: processing file http://duffel.rail.bio/recount/SRP009615/bw/SRR389079.bw## 2017-08-11 20:36:47 railMatrix: processing file http://duffel.rail.bio/recount/SRP009615/bw/SRR389080.bw## 2017-08-11 20:36:50 railMatrix: processing file http://duffel.rail.bio/recount/SRP009615/bw/SRR389081.bw## 2017-08-11 20:36:53 railMatrix: processing file http://duffel.rail.bio/recount/SRP009615/bw/SRR389082.bw## 2017-08-11 20:36:56 railMatrix: processing file http://duffel.rail.bio/recount/SRP009615/bw/SRR389083.bw## 2017-08-11 20:36:59 railMatrix: processing file http://duffel.rail.bio/recount/SRP009615/bw/SRR389084.bw##    user  system elapsed 
##   5.864   0.240  38.785## Explore the RSE a bit
dim(rse_ER)## [1] 808  12rse_ER## class: RangedSummarizedExperiment 
## dim: 808 12 
## metadata(0):
## assays(1): counts
## rownames(808): 1 2 ... 807 808
## rowData names(6): value area ... cluster clusterL
## colnames(12): SRR387777 SRR387778 ... SRR389083 SRR389084
## colData names(21): project sample ... title characteristicsThe resulting count matrix has one row per region and one column per sample. The counts correspond to the number (or fraction) of reads overlapping the regions and are scaled by default to a library size of 40 million reads. For some differential expression methods, you might have to round this matrix into integers. We’ll use DESeq2 to identify which expressed regions are differentially expressed.
## Round the coverage matrix to integers
covMat <- round(assays(rse_ER)$counts, 0)You can use scale = FALSE if you want the raw base-pair coverage counts and then scale them with scale_counts(). If you want integer counts, use round = TRUE.
DESeqDataSet objectWe first need to get some phenotype information for these samples similar to the first analysis we did. We can get this data using download_study().
## Get phenotype data for study SRP009615
pheno <- colData(rse_ER)
## Complete the phenotype table with the data we got from GEO
m <- match(pheno$run, sample_info$run)
pheno <- cbind(pheno, sample_info[m, 2:3])
## Explore the phenotype data a little bit
head(pheno)## DataFrame with 6 rows and 23 columns
##               project      sample  experiment         run
##           <character> <character> <character> <character>
## SRR387777   SRP009615   SRS281685   SRX110461   SRR387777
## SRR387778   SRP009615   SRS281686   SRX110462   SRR387778
## SRR387779   SRP009615   SRS281687   SRX110463   SRR387779
## SRR387780   SRP009615   SRS281688   SRX110464   SRR387780
## SRR389077   SRP009615   SRS282369   SRX111299   SRR389077
## SRR389078   SRP009615   SRS282370   SRX111300   SRR389078
##           read_count_as_reported_by_sra reads_downloaded
##                               <integer>        <integer>
## SRR387777                      30631853         30631853
## SRR387778                      37001306         37001306
## SRR387779                      40552001         40552001
## SRR387780                      32466352         32466352
## SRR389077                      27819603         27819603
## SRR389078                      31758658         31758658
##           proportion_of_reads_reported_by_sra_downloaded paired_end
##                                                <integer>  <logical>
## SRR387777                                              1      FALSE
## SRR387778                                              1      FALSE
## SRR387779                                              1      FALSE
## SRR387780                                              1      FALSE
## SRR389077                                              1      FALSE
## SRR389078                                              1      FALSE
##           sra_misreported_paired_end mapped_read_count        auc
##                            <logical>         <integer>  <integer>
## SRR387777                      FALSE          28798572 1029494445
## SRR387778                      FALSE          33170281 1184877985
## SRR387779                      FALSE          37322762 1336528969
## SRR387780                      FALSE          29970735 1073178116
## SRR389077                      FALSE          24966859  893978355
## SRR389078                      FALSE          28190059 1009216922
##           sharq_beta_tissue sharq_beta_cell_type biosample_submission_date
##                 <character>          <character>               <character>
## SRR387777             blood                 k562   2011-12-05T15:40:03.870
## SRR387778             blood                 k562   2011-12-05T15:40:03.897
## SRR387779             blood                 k562   2011-12-05T15:40:03.910
## SRR387780             blood                 k562   2011-12-05T15:40:03.923
## SRR389077             blood                 k562   2011-12-13T11:26:05.720
## SRR389078             blood                 k562   2011-12-13T11:26:05.760
##           biosample_publication_date   biosample_update_date
##                          <character>             <character>
## SRR387777    2011-12-07T09:29:59.890 2014-08-27T04:18:20.530
## SRR387778    2011-12-07T09:29:59.890 2014-08-27T04:18:21.053
## SRR387779    2011-12-07T09:29:59.890 2014-08-27T04:18:21.800
## SRR387780    2011-12-07T09:29:59.890 2014-08-27T04:18:22.320
## SRR389077    2011-12-13T11:26:06.663 2014-08-27T04:22:14.857
## SRR389078    2011-12-13T11:26:06.663 2014-08-27T04:22:15.227
##           avg_read_length geo_accession  bigwig_file
##                 <integer>   <character>  <character>
## SRR387777              36     GSM836270 SRR387777.bw
## SRR387778              36     GSM836271 SRR387778.bw
## SRR387779              36     GSM836272 SRR387779.bw
## SRR387780              36     GSM836273 SRR387780.bw
## SRR389077              36     GSM847561 SRR389077.bw
## SRR389078              36     GSM847562 SRR389078.bw
##                                                                                                     title
##                                                                                               <character>
## SRR387777   K562 cells with shRNA targeting SRF gene cultured with no doxycycline (uninduced - UI), rep1.
## SRR387778  K562 cells with shRNA targeting SRF gene cultured with doxycycline for 48 hours (48 hr), rep1.
## SRR387779   K562 cells with shRNA targeting SRF gene cultured with no doxycycline (uninduced - UI), rep2.
## SRR387780  K562 cells with shRNA targeting SRF gene cultured with doxycycline for 48 hours (48 hr), rep2.
## SRR389077  K562 cells with shRNA targeting EGR1 gene cultured with no doxycycline (uninduced - UI), rep1.
## SRR389078 K562 cells with shRNA targeting EGR1 gene cultured with doxycycline for 72 hours (72 hr), rep1.
##                                                                                                          characteristics
##                                                                                                              <character>
## SRR387777                                               c("cells: K562", "shRNA expression: no", "treatment: Puromycin")
## SRR387778                  c("cells: K562", "shRNA expression: yes, targeting SRF", "treatment: Puromycin, doxycycline")
## SRR387779                                               c("cells: K562", "shRNA expression: no", "treatment: Puromycin")
## SRR387780                   c("cells: K562", "shRNA expression: yes targeting SRF", "treatment: Puromycin, doxycycline")
## SRR389077                          c("cell line: K562", "shRNA expression: no shRNA expression", "treatment: Puromycin")
## SRR389078 c("cell line: K562", "shRNA expression: expressing shRNA targeting EGR1", "treatment: Puromycin, doxycycline")
##               group gene_target
##            <factor>    <factor>
## SRR387777 uninduced         SRF
## SRR387778   induced         SRF
## SRR387779 uninduced         SRF
## SRR387780   induced         SRF
## SRR389077 uninduced        EGR1
## SRR389078   induced        EGR1Now that we have the necessary data, we can construct a DESeqDataSet object using the function DESeqDataSetFromMatrix() from DESeq2.
## Build a DESeqDataSet
dds_ers <- DESeqDataSetFromMatrix(countData = covMat, colData = pheno,
    design =  ~ gene_target + group)## converting counts to integer modeWith the DESeqDataSet object in place, we can then use the function DESeq() from DESeq2 to perform the differential expression analysis (between groups) as shown below.
## Perform differential expression analysis with DESeq2 at the ER-level
dds_ers <- DESeq(dds_ers, test = 'LRT', reduced = ~ gene_target,
    fitType = 'local')## estimating size factors## estimating dispersions## gene-wise dispersion estimates## mean-dispersion relationship## final dispersion estimates## fitting model and testingres_ers <- results(dds_ers)We can then visually explore the results, like we did before.
## Explore results
plotMA(res_ers, main="DESeq2 results for SRP009615 (ER-level, chrY)")We can also use regionReport to create a more extensive exploratory report.
## Create the report
report2 <- DESeq2Report(dds_ers, res = res_ers,
    project = 'SRP009615 (ER-level, chrY)',
    intgroup = c('group', 'gene_target'), outdir = '.',
    output = 'SRP009615-results-ER-level-chrY')This section describes the annotation used in recount.
We used the comprehensive gene annotation for (CHR regions) from Gencode v25 (GRCh38.p7) to get the list of genes. Specifically this GFF3 file. We then used the org.Hs.eg.db package to get the gene symbols. For each gene, we also counted the total length of exonic base pairs for that given gene and stored this information in the bp_length column. You can see below this information for the rse_gene object included in recount. The rownames() are the Gencode gene_id.
## Gene annotation information
rowRanges(rse_gene_SRP009615)## GRanges object with 58037 ranges and 3 metadata columns:
##                      seqnames                 ranges strand |
##                         <Rle>              <IRanges>  <Rle> |
##   ENSG00000000003.14     chrX [100627109, 100639991]      - |
##    ENSG00000000005.5     chrX [100584802, 100599885]      + |
##   ENSG00000000419.12    chr20 [ 50934867,  50958555]      - |
##   ENSG00000000457.13     chr1 [169849631, 169894267]      - |
##   ENSG00000000460.16     chr1 [169662007, 169854080]      + |
##                  ...      ...                    ...    ... .
##    ENSG00000283695.1    chr19 [ 52865369,  52865429]      - |
##    ENSG00000283696.1     chr1 [161399409, 161422424]      + |
##    ENSG00000283697.1     chrX [149548210, 149549852]      - |
##    ENSG00000283698.1     chr2 [112439312, 112469687]      - |
##    ENSG00000283699.1    chr10 [ 12653138,  12653197]      - |
##                                 gene_id bp_length          symbol
##                             <character> <integer> <CharacterList>
##   ENSG00000000003.14 ENSG00000000003.14      4535          TSPAN6
##    ENSG00000000005.5  ENSG00000000005.5      1610            TNMD
##   ENSG00000000419.12 ENSG00000000419.12      1207            DPM1
##   ENSG00000000457.13 ENSG00000000457.13      6883           SCYL3
##   ENSG00000000460.16 ENSG00000000460.16      5967        C1orf112
##                  ...                ...       ...             ...
##    ENSG00000283695.1  ENSG00000283695.1        61              NA
##    ENSG00000283696.1  ENSG00000283696.1       997              NA
##    ENSG00000283697.1  ENSG00000283697.1      1184    LOC101928917
##    ENSG00000283698.1  ENSG00000283698.1       940              NA
##    ENSG00000283699.1  ENSG00000283699.1        60         MIR4481
##   -------
##   seqinfo: 25 sequences (1 circular) from an unspecified genome; no seqlengths## Also accessible via
rowData(rse_gene_SRP009615)## DataFrame with 58037 rows and 3 columns
##                  gene_id bp_length          symbol
##              <character> <integer> <CharacterList>
## 1     ENSG00000000003.14      4535          TSPAN6
## 2      ENSG00000000005.5      1610            TNMD
## 3     ENSG00000000419.12      1207            DPM1
## 4     ENSG00000000457.13      6883           SCYL3
## 5     ENSG00000000460.16      5967        C1orf112
## ...                  ...       ...             ...
## 58033  ENSG00000283695.1        61              NA
## 58034  ENSG00000283696.1       997              NA
## 58035  ENSG00000283697.1      1184    LOC101928917
## 58036  ENSG00000283698.1       940              NA
## 58037  ENSG00000283699.1        60         MIR4481For an rse_exon object, the rownames() correspond to the Gencode gene_id. You can add a gene_id column if you are interested in subsetting the whole rse_exon object.
## Get the rse_exon object for study SRP009615
download_study('SRP009615', type = 'rse-exon')## 2017-08-11 20:37:05 downloading file rse_exon.Rdata to SRP009615load(file.path('SRP009615', 'rse_exon.Rdata'))
## Annotation information
rowRanges(rse_exon)## GRanges object with 329092 ranges and 0 metadata columns:
##                      seqnames                 ranges strand
##                         <Rle>              <IRanges>  <Rle>
##   ENSG00000000003.14     chrX [100627109, 100629986]      -
##   ENSG00000000003.14     chrX [100630759, 100630866]      -
##   ENSG00000000003.14     chrX [100632063, 100632068]      -
##   ENSG00000000003.14     chrX [100632485, 100632568]      -
##   ENSG00000000003.14     chrX [100633405, 100633539]      -
##                  ...      ...                    ...    ...
##    ENSG00000283698.1     chr2 [112439312, 112439849]      -
##    ENSG00000283698.1     chr2 [112440396, 112440611]      -
##    ENSG00000283698.1     chr2 [112463991, 112464036]      -
##    ENSG00000283698.1     chr2 [112469548, 112469687]      -
##    ENSG00000283699.1    chr10 [ 12653138,  12653197]      -
##   -------
##   seqinfo: 25 sequences (1 circular) from an unspecified genome; no seqlengths## Add a gene_id column
rowRanges(rse_exon)$gene_id <- rownames(rse_exon)
## Example subset
rse_exon_subset <- subset(rse_exon, subset = gene_id == 'ENSG00000000003.14')
rowRanges(rse_exon_subset)## GRanges object with 10 ranges and 1 metadata column:
##                      seqnames                 ranges strand |
##                         <Rle>              <IRanges>  <Rle> |
##   ENSG00000000003.14     chrX [100627109, 100629986]      - |
##   ENSG00000000003.14     chrX [100630759, 100630866]      - |
##   ENSG00000000003.14     chrX [100632063, 100632068]      - |
##   ENSG00000000003.14     chrX [100632485, 100632568]      - |
##   ENSG00000000003.14     chrX [100633405, 100633539]      - |
##   ENSG00000000003.14     chrX [100633931, 100634029]      - |
##   ENSG00000000003.14     chrX [100635178, 100635252]      - |
##   ENSG00000000003.14     chrX [100635558, 100635746]      - |
##   ENSG00000000003.14     chrX [100636191, 100637104]      - |
##   ENSG00000000003.14     chrX [100639945, 100639991]      - |
##                                 gene_id
##                             <character>
##   ENSG00000000003.14 ENSG00000000003.14
##   ENSG00000000003.14 ENSG00000000003.14
##   ENSG00000000003.14 ENSG00000000003.14
##   ENSG00000000003.14 ENSG00000000003.14
##   ENSG00000000003.14 ENSG00000000003.14
##   ENSG00000000003.14 ENSG00000000003.14
##   ENSG00000000003.14 ENSG00000000003.14
##   ENSG00000000003.14 ENSG00000000003.14
##   ENSG00000000003.14 ENSG00000000003.14
##   ENSG00000000003.14 ENSG00000000003.14
##   -------
##   seqinfo: 25 sequences (1 circular) from an unspecified genome; no seqlengthsThe exons in a rse_exon object are those that have a Gencode gene_id and that were reduced as described in the GenomicRanges package:
?`inter-range-methods`You can reproduce the gene and exon information using the function reproduce_ranges().
If you are interested in using another annotation based on hg38 coordinates you can do so using the function coverage_matrix() or by running bwtool thanks to the BigWig files in recount, in which case recount.bwtool will be helpful.
If you are re-processing a small set of samples, it simply might be easier to use coverage_matrix() as shown below using the annotation information provided by the EnsDb.Hsapiens.v79 package.
## Get the reduced exons based on EnsDb.Hsapiens.v79 which matches hg38
exons <- reproduce_ranges('exon', db = 'EnsDb.Hsapiens.v79')
## Change the chromosome names to match those used in the BigWig files
library('GenomeInfoDb')
seqlevelsStyle(exons) <- 'UCSC'
## Get the count matrix at the exon level for chrY
exons_chrY <- keepSeqlevels(exons, 'chrY', pruning.mode = 'coarse')
exonRSE <- coverage_matrix('SRP009615', 'chrY', unlist(exons_chrY),
    chunksize = 3000)## 2017-08-11 20:37:28 railMatrix: processing regions 1 to 2157## 2017-08-11 20:37:28 railMatrix: processing file http://duffel.rail.bio/recount/SRP009615/bw/SRR387777.bw## 2017-08-11 20:37:31 railMatrix: processing file http://duffel.rail.bio/recount/SRP009615/bw/SRR387778.bw## 2017-08-11 20:37:35 railMatrix: processing file http://duffel.rail.bio/recount/SRP009615/bw/SRR387779.bw## 2017-08-11 20:37:38 railMatrix: processing file http://duffel.rail.bio/recount/SRP009615/bw/SRR387780.bw## 2017-08-11 20:37:42 railMatrix: processing file http://duffel.rail.bio/recount/SRP009615/bw/SRR389077.bw## 2017-08-11 20:37:45 railMatrix: processing file http://duffel.rail.bio/recount/SRP009615/bw/SRR389078.bw## 2017-08-11 20:37:48 railMatrix: processing file http://duffel.rail.bio/recount/SRP009615/bw/SRR389079.bw## 2017-08-11 20:37:51 railMatrix: processing file http://duffel.rail.bio/recount/SRP009615/bw/SRR389080.bw## 2017-08-11 20:37:55 railMatrix: processing file http://duffel.rail.bio/recount/SRP009615/bw/SRR389081.bw## 2017-08-11 20:37:59 railMatrix: processing file http://duffel.rail.bio/recount/SRP009615/bw/SRR389082.bw## 2017-08-11 20:38:02 railMatrix: processing file http://duffel.rail.bio/recount/SRP009615/bw/SRR389083.bw## 2017-08-11 20:38:05 railMatrix: processing file http://duffel.rail.bio/recount/SRP009615/bw/SRR389084.bwexonMatrix <- assays(exonRSE)$counts
dim(exonMatrix)## [1] 2157   12head(exonMatrix)##   SRR387777 SRR387778 SRR387779 SRR387780 SRR389077 SRR389078  SRR389079
## 1  0.000000  0.000000   0.00000  0.000000  0.000000  0.000000  0.3256991
## 2  0.000000  0.000000   0.00000  0.000000  0.000000  0.000000  0.8375120
## 3  9.791214 16.980651  17.23868  5.367236  4.832332  2.417716 10.7713344
## 4  0.000000  0.675175   0.00000  0.000000  0.000000  0.000000  0.0000000
## 5  0.000000  0.000000   0.00000  0.000000  0.000000  0.000000  0.0000000
## 6  0.000000  0.000000   0.00000  0.000000  0.000000  0.000000  0.0000000
##   SRR389080 SRR389081 SRR389082 SRR389083 SRR389084
## 1 0.6875641  0.000000   0.00000  0.000000  0.000000
## 2 0.0000000  1.023429   0.00000  0.000000  0.000000
## 3 7.4256919 15.965491  28.04972  7.143536  9.139425
## 4 0.0000000  0.000000   0.00000  0.000000  0.000000
## 5 0.0000000  0.000000   0.00000  0.000000  0.000000
## 6 0.0000000  0.000000   0.00000  0.000000  0.000000## Summary the information at the gene level for chrY
exon_gene <- rep(names(exons_chrY), elementNROWS(exons_chrY))
geneMatrix <- do.call(rbind, lapply(split(as.data.frame(exonMatrix),
    exon_gene), colSums))
dim(geneMatrix)## [1] 544  12head(geneMatrix)##                 SRR387777 SRR387778  SRR387779  SRR387780 SRR389077
## ENSG00000012817  9.791214 17.655826 17.8671773  6.7090447  4.832332
## ENSG00000067048 77.902314 69.981889 41.6900803 83.0430650 52.797699
## ENSG00000067646 44.371293 52.123510 22.6257722 57.2505152 44.251631
## ENSG00000092377 10.568294  9.722520 13.8867173 16.0271624 11.364928
## ENSG00000099715  3.613424  1.789214  0.6284937  2.6836179  2.997835
## ENSG00000099721  0.000000  0.000000  0.0000000  0.7827219  1.342314
##                 SRR389078  SRR389079  SRR389080 SRR389081  SRR389082
## ENSG00000012817  2.893332 12.7255290  8.1132560 16.988920  28.049724
## ENSG00000067048 52.000713 52.3677621 53.9737791 80.441512 106.485062
## ENSG00000067646 44.232314 31.5928125 56.0708495 38.480927  45.905450
## ENSG00000092377  1.426849 17.0061457 24.2022551 33.773154   9.349908
## ENSG00000099715  0.000000  0.8375120  0.0000000  1.023429   2.337477
## ENSG00000099721  0.000000  0.4652844  0.7563205  0.000000   0.000000
##                 SRR389083 SRR389084
## ENSG00000012817  8.830204  9.139425
## ENSG00000067048 65.978494 34.679041
## ENSG00000067646 35.469641 28.992288
## ENSG00000092377  3.571768 16.450965
## ENSG00000099715  0.000000  0.000000
## ENSG00000099721  0.000000  0.000000The above solution works well for a small set of samples. However, bwtool is faster for processing large sets. You can see how we used bwtool at leekgroup/recount-website. It’s much faster than R and the small package recount.bwtool will be helpful for such scenarios. Note that you will need to run scale_counts() after using coverage_matrix_bwtool() and that will have to download the BigWig files first unless you are working at JHPCE or SciServer. coverage_matrix_bwtool() will download the files for you, but that till take time.
Finally, the BigWig files hosted by recount have the following chromosome names and sizes as specified in hg38.sizes. You’ll have to make sure that you match these names when using alternative annotations.
If you are interested in finding possible gene fusion events in a particular study, you can download the RangedSummarizedExperiment object at the exon-exon junctions level for that study. The objects we provide classify exon-exon junction events as shown below.
library('recount')
## Download and load RSE object for SRP009615 study
download_study('SRP009615', type = 'rse-jx')## 2017-08-11 20:38:09 downloading file rse_jx.Rdata to SRP009615load(file.path('SRP009615', 'rse_jx.Rdata'))
## Exon-exon junctions by class
table(rowRanges(rse_jx)$class)## 
## alternative_end       annotated       exon_skip          fusion 
##           35601          140251            9513             489 
##           novel 
##           26475## Potential gene fusions by chromosome
fusions <- rowRanges(rse_jx)[rowRanges(rse_jx)$class == 'fusion']
fusions_by_chr <- sort(table(seqnames(fusions)), decreasing = TRUE)
fusions_by_chr[fusions_by_chr > 0]## 
##  chr1 chr19  chr6 chr12 chr22 chr17  chr5 chr16 chr11  chr3 chr10  chr2 
##    61    54    37    31    27    26    26    24    22    22    21    20 
##  chr7  chrX chr14  chr4  chr9  chr8 chr15 chr20 chr13 chr21 chr18 
##    20    16    13    13    12    10     9     9     6     6     4## Genes with the most fusions
head(sort(table(unlist(fusions$symbol_proposed)), decreasing = TRUE))## 
##      TMEM14B       SYCP2L LOC100129924        MATR3        ACSM3 
##            6            5            4            4            3 
##         BMI1 
##            3If you are interested in checking the frequency of a particular exon-exon junction then check the snaptron_query() function described in the following section.
Our collaborators built SnaptronUI to query the Intropolis database of the exon-exon junctions found in SRA. SnaptronUI allows you to do several types of queries manually and perform analyses. recount has the function snaptron_query() which uses Snaptron to check if particular exon-exon junctions are present in Intropolis. For example, here we use snaptron_query() to get in R-friendly format the information for 3 exon-exon junctions using Intropolis version 1 which uses hg19 coordinates. Version 2 uses hg38 coordinates and the exon-exon junctions were derived from a larger data set: about 50,000 samples vs 21,000 in version 1. snaptron_query() can also be used to access the 30 million and 36 million exon-exon junctions derived from the GTEx and TCGA consortiums respectively (about 10 and 11 thousand samples respectively).
library('GenomicRanges')
junctions <- GRanges(seqnames = 'chr2', IRanges(
    start = c(28971711, 29555082, 29754983),
    end = c(29462418, 29923339, 29917715)))
snaptron_query(junctions)## 2017-08-11 20:38:11 querying Snaptron## 2017-08-11 20:38:12 processing results## 2017-08-11 20:38:12 extracting information## GRanges object with 3 ranges and 14 metadata columns:
##       seqnames               ranges strand |     type snaptron_id
##          <Rle>            <IRanges>  <Rle> | <factor>   <integer>
##   [1]     chr2 [28971711, 29462418]      - |  SRAv1:I    22314441
##   [2]     chr2 [29555082, 29923339]      + |  SRAv1:I    22328842
##   [3]     chr2 [29754983, 29917715]      - |  SRAv1:I    22329293
##             annotated  left_motif right_motif     left_annotated
##       <CharacterList> <character> <character>    <CharacterList>
##   [1]              NA          AT          AC                 NA
##   [2]              NA          GC          AG                 NA
##   [3]               1          GT          AG aC19,cG19,cG38,...
##          right_annotated            samples read_coverage_by_sample
##          <CharacterList>      <IntegerList>           <IntegerList>
##   [1]                 NA               3481                       1
##   [2]                 NA 2643,4867,4945,...               1,1,1,...
##   [3] aC19,cG19,cG38,...         3,5,13,...               1,1,1,...
##       samples_count coverage_sum coverage_avg coverage_median
##           <integer>    <integer>    <numeric>       <numeric>
##   [1]             1            1        1.000               1
##   [2]            13           13        1.000               1
##   [3]          1625         5380        3.311               1
##       source_dataset_id
##               <integer>
##   [1]                 0
##   [2]                 0
##   [3]                 0
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengthsIf you use snaptron_query() please cite snaptron.cs.jhu.edu/snaptron/docs/. Snaptron is separate from recount. Thank you!
If you are interested in downloading all the data you can do so using the recount_url data.frame included in the package. That is:
## Get data.frame with all the URLs
library('recount')
dim(recount_url)## [1] 93043     4## Explore URLs
head(recount_url$url)## [1] "http://duffel.rail.bio/recount/DRP000366/counts_exon.tsv.gz"
## [2] "http://duffel.rail.bio/recount/DRP000366/counts_gene.tsv.gz"
## [3] "http://duffel.rail.bio/recount/DRP000366/rse_exon.Rdata"    
## [4] "http://duffel.rail.bio/recount/DRP000366/rse_gene.Rdata"    
## [5] "http://duffel.rail.bio/recount/DRP000366/counts_jx.tsv.gz"  
## [6] "http://duffel.rail.bio/recount/DRP000366/rse_jx.Rdata"You can then download all the data using the download_study() function or however you prefer to do so with recount_url.
## Download all files (will take a while! It's over 8 TB of data)
sapply(unique(recount_url$project), download_study, type = 'all')You can find the code for the website at leekgroup/recount-website if you want to deploy your own local mirror. Please let us know if you choose to do deploy a mirror.
Not everyone has over 8 terabytes of disk space available to download all the data from the recount project. However, thanks to SciServer you can access it locally via a Jupyter Notebook. If you do so and want to share your work, please let the SciServer maintainers know via Twitter at IDIESJHU.
To use SciServer, you first have to go to http://www.sciserver.org/ then click on “Login to SciServer”. If you are a first time user, click on “Register New Account” and enter the information they request as shown on the next image.
Once you’ve registered, open the SciServer compute tool.
Then click on “Create container”, choose a name of your preference and make sure that you choose to load R 3.3.x and the recount public volume. This will enable you to access all of the recount data as if it was on your computer. Click “Create” once you are ready.
Once you have a container, create a new R Notebook as shown in the image below.
In this R Notebook you can insert new code cells where you can type R code. For example, you can install recount and DESeq2 to run the code example from this vignette. You first have to install the packages as shown below.
Then, the main part when using SciServer is to use to remember that all the recount data is available locally. So instead of using download_study(), you can simply use load() with the correct path as shown below.
Several functions in the recount package have the outdir argument, which you can specify to use the data locally. Try using expressed_regions() in SciServer to get started with the following code:
## Expressed-regions SciServer example
regions <- expressed_regions('SRP009615', 'chrY', cutoff = 5L,
    maxClusterGap = 3000L, outdir = file.path(scipath, 'SRP009615'))
regionsYou can find the R code for a full SciServer demo here. You can copy and paste it into a cell and run it all to see how SciServer compute works.
If you encounter problems when using SciServer please check their support page and the SciServer compute help section.
If you use SciServer for your published work, please cite it accordingly. SciServer is administered by the Institute for Data Intensive Engineering and Science at Johns Hopkins University and is funded by National Science Foundation award ACI-1261715.
The recount package (Morgan, Obenchain, Lang, and Thompson, 2017) was made possible thanks to:
Code for creating the vignette
## Create the vignette
library('rmarkdown')
system.time(render('recount-quickstart.Rmd', 'BiocStyle::html_document'))
## Extract the R code
library('knitr')
knit('recount-quickstart.Rmd', tangle = TRUE)## Clean up
file.remove('quickstartRef.bib')## [1] TRUEDate the vignette was generated.
## [1] "2017-08-11 20:38:14 EDT"Wallclock time spent generating the vignette.
## Time difference of 4.007 minsR session information.
## Session info ----------------------------------------------------------------------------------------------------------##  setting  value                       
##  version  R version 3.4.1 (2017-06-30)
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  C                           
##  tz       posixrules                  
##  date     2017-08-11## Packages --------------------------------------------------------------------------------------------------------------##  package                * version  date       source        
##  acepack                  1.4.1    2016-10-29 CRAN (R 3.4.1)
##  annotate                 1.54.0   2017-08-11 Bioconductor  
##  AnnotationDbi          * 1.38.2   2017-08-11 Bioconductor  
##  AnnotationFilter       * 1.0.0    2017-08-11 Bioconductor  
##  AnnotationHub            2.8.2    2017-08-11 Bioconductor  
##  backports                1.1.0    2017-05-22 CRAN (R 3.4.1)
##  base                   * 3.4.1    2017-07-06 local         
##  base64enc                0.1-3    2015-07-28 CRAN (R 3.4.1)
##  bibtex                   0.4.2    2017-06-30 CRAN (R 3.4.1)
##  Biobase                * 2.36.2   2017-08-11 Bioconductor  
##  BiocGenerics           * 0.22.0   2017-08-11 Bioconductor  
##  BiocInstaller            1.26.0   2017-08-11 Bioconductor  
##  BiocParallel             1.10.1   2017-08-11 Bioconductor  
##  BiocStyle              * 2.4.1    2017-08-11 Bioconductor  
##  biomaRt                  2.32.1   2017-08-11 Bioconductor  
##  Biostrings               2.44.2   2017-08-11 Bioconductor  
##  bit                      1.1-12   2014-04-09 CRAN (R 3.4.1)
##  bit64                    0.9-7    2017-05-08 CRAN (R 3.4.1)
##  bitops                   1.0-6    2013-08-17 CRAN (R 3.4.1)
##  blob                     1.1.0    2017-06-17 CRAN (R 3.4.1)
##  bookdown                 0.4      2017-05-20 CRAN (R 3.4.1)
##  BSgenome                 1.44.0   2017-08-11 Bioconductor  
##  bumphunter               1.16.0   2017-08-11 Bioconductor  
##  checkmate                1.8.3    2017-07-03 CRAN (R 3.4.1)
##  cluster                  2.0.6    2017-03-10 CRAN (R 3.4.1)
##  codetools                0.2-15   2016-10-05 CRAN (R 3.4.1)
##  colorspace               1.3-2    2016-12-14 CRAN (R 3.4.1)
##  compiler                 3.4.1    2017-07-06 local         
##  curl                     2.8.1    2017-07-21 CRAN (R 3.4.1)
##  data.table               1.10.4   2017-02-01 CRAN (R 3.4.1)
##  datasets               * 3.4.1    2017-07-06 local         
##  DBI                      0.7      2017-06-18 CRAN (R 3.4.1)
##  DEFormats                1.4.0    2017-08-11 Bioconductor  
##  DelayedArray           * 0.2.7    2017-08-11 Bioconductor  
##  derfinder                1.10.5   2017-08-11 Bioconductor  
##  derfinderHelper          1.10.0   2017-08-11 Bioconductor  
##  DESeq2                 * 1.16.1   2017-08-11 Bioconductor  
##  devtools               * 1.13.3   2017-08-02 CRAN (R 3.4.1)
##  digest                   0.6.12   2017-01-27 CRAN (R 3.4.1)
##  doRNG                    1.6.6    2017-04-10 CRAN (R 3.4.1)
##  downloader               0.4      2015-07-09 CRAN (R 3.4.1)
##  DT                     * 0.2      2016-08-09 CRAN (R 3.4.1)
##  edgeR                    3.18.1   2017-08-11 Bioconductor  
##  EnsDb.Hsapiens.v79     * 2.1.0    2017-07-06 Bioconductor  
##  ensembldb              * 2.0.4    2017-08-11 Bioconductor  
##  evaluate                 0.10.1   2017-06-24 CRAN (R 3.4.1)
##  foreach                  1.4.3    2015-10-13 CRAN (R 3.4.1)
##  foreign                  0.8-69   2017-06-22 CRAN (R 3.4.1)
##  Formula                  1.2-2    2017-07-10 CRAN (R 3.4.1)
##  genefilter               1.58.1   2017-08-11 Bioconductor  
##  geneplotter              1.54.0   2017-08-11 Bioconductor  
##  GenomeInfoDb           * 1.12.2   2017-08-11 Bioconductor  
##  GenomeInfoDbData         0.99.0   2017-07-06 Bioconductor  
##  GenomicAlignments        1.12.1   2017-08-11 Bioconductor  
##  GenomicFeatures        * 1.28.4   2017-08-11 Bioconductor  
##  GenomicFiles             1.12.0   2017-08-11 Bioconductor  
##  GenomicRanges          * 1.28.4   2017-08-11 Bioconductor  
##  GEOquery                 2.42.0   2017-08-11 Bioconductor  
##  ggplot2                * 2.2.1    2016-12-30 CRAN (R 3.4.1)
##  graphics               * 3.4.1    2017-07-06 local         
##  grDevices              * 3.4.1    2017-07-06 local         
##  grid                     3.4.1    2017-07-06 local         
##  gridExtra                2.2.1    2016-02-29 CRAN (R 3.4.1)
##  gtable                   0.2.0    2016-02-26 CRAN (R 3.4.1)
##  highr                    0.6      2016-05-09 CRAN (R 3.4.1)
##  Hmisc                    4.0-3    2017-05-02 CRAN (R 3.4.1)
##  htmlTable                1.9      2017-01-26 CRAN (R 3.4.1)
##  htmltools                0.3.6    2017-04-28 CRAN (R 3.4.1)
##  htmlwidgets              0.9      2017-07-10 CRAN (R 3.4.1)
##  httpuv                   1.3.5    2017-07-04 CRAN (R 3.4.1)
##  httr                     1.2.1    2016-07-03 CRAN (R 3.4.1)
##  interactiveDisplayBase   1.14.0   2017-08-11 Bioconductor  
##  IRanges                * 2.10.2   2017-08-11 Bioconductor  
##  iterators                1.0.8    2015-10-13 CRAN (R 3.4.1)
##  jsonlite                 1.5      2017-06-01 CRAN (R 3.4.1)
##  knitcitations          * 1.0.8    2017-07-04 CRAN (R 3.4.1)
##  knitr                  * 1.17     2017-08-10 CRAN (R 3.4.1)
##  knitrBootstrap           1.0.1    2017-07-19 CRAN (R 3.4.1)
##  labeling                 0.3      2014-08-23 CRAN (R 3.4.1)
##  lattice                  0.20-35  2017-03-25 CRAN (R 3.4.1)
##  latticeExtra             0.6-28   2016-02-09 CRAN (R 3.4.1)
##  lazyeval                 0.2.0    2016-06-12 CRAN (R 3.4.1)
##  limma                    3.32.5   2017-08-11 Bioconductor  
##  locfit                   1.5-9.1  2013-04-20 CRAN (R 3.4.1)
##  lubridate                1.6.0    2016-09-13 CRAN (R 3.4.1)
##  magrittr                 1.5      2014-11-22 CRAN (R 3.4.1)
##  markdown                 0.8      2017-04-20 CRAN (R 3.4.1)
##  Matrix                   1.2-10   2017-05-03 CRAN (R 3.4.1)
##  matrixStats            * 0.52.2   2017-04-14 CRAN (R 3.4.1)
##  memoise                  1.1.0    2017-04-21 CRAN (R 3.4.1)
##  methods                * 3.4.1    2017-07-06 local         
##  mime                     0.5      2016-07-07 CRAN (R 3.4.1)
##  munsell                  0.4.3    2016-02-13 CRAN (R 3.4.1)
##  nnet                     7.3-12   2016-02-02 CRAN (R 3.4.1)
##  org.Hs.eg.db           * 3.4.1    2017-07-06 Bioconductor  
##  parallel               * 3.4.1    2017-07-06 local         
##  pheatmap               * 1.0.8    2015-12-11 CRAN (R 3.4.1)
##  pkgconfig                2.0.1    2017-03-21 CRAN (R 3.4.1)
##  pkgmaker                 0.22     2014-05-14 CRAN (R 3.4.1)
##  plyr                     1.8.4    2016-06-08 CRAN (R 3.4.1)
##  ProtGenerics             1.8.0    2017-08-11 Bioconductor  
##  qvalue                   2.8.0    2017-08-11 Bioconductor  
##  R6                       2.2.2    2017-06-17 CRAN (R 3.4.1)
##  RColorBrewer           * 1.1-2    2014-12-07 CRAN (R 3.4.1)
##  Rcpp                     0.12.12  2017-07-15 CRAN (R 3.4.1)
##  RCurl                    1.95-4.8 2016-03-01 CRAN (R 3.4.1)
##  recount                * 1.2.3    2017-08-12 Bioconductor  
##  RefManageR               0.14.12  2017-07-04 CRAN (R 3.4.1)
##  regionReport           * 1.10.2   2017-08-11 Bioconductor  
##  registry                 0.3      2015-07-08 CRAN (R 3.4.1)
##  rentrez                  1.1.0    2017-06-01 CRAN (R 3.4.1)
##  reshape2                 1.4.2    2016-10-22 CRAN (R 3.4.1)
##  rlang                    0.1.2    2017-08-09 CRAN (R 3.4.1)
##  rmarkdown                1.6      2017-06-15 CRAN (R 3.4.1)
##  rngtools                 1.2.4    2014-03-06 CRAN (R 3.4.1)
##  rpart                    4.1-11   2017-03-13 CRAN (R 3.4.1)
##  rprojroot                1.2      2017-01-16 CRAN (R 3.4.1)
##  Rsamtools                1.28.0   2017-08-11 Bioconductor  
##  RSQLite                  2.0      2017-06-19 CRAN (R 3.4.1)
##  rstudioapi               0.6      2016-06-27 CRAN (R 3.4.1)
##  rtracklayer              1.36.4   2017-08-11 Bioconductor  
##  S4Vectors              * 0.14.3   2017-08-11 Bioconductor  
##  scales                   0.4.1    2016-11-09 CRAN (R 3.4.1)
##  shiny                    1.0.3    2017-04-26 CRAN (R 3.4.1)
##  splines                  3.4.1    2017-07-06 local         
##  stats                  * 3.4.1    2017-07-06 local         
##  stats4                 * 3.4.1    2017-07-06 local         
##  stringi                  1.1.5    2017-04-07 CRAN (R 3.4.1)
##  stringr                  1.2.0    2017-02-18 CRAN (R 3.4.1)
##  SummarizedExperiment   * 1.6.3    2017-08-11 Bioconductor  
##  survival                 2.41-3   2017-04-04 CRAN (R 3.4.1)
##  tibble                   1.3.3    2017-05-28 CRAN (R 3.4.1)
##  tools                    3.4.1    2017-07-06 local         
##  utils                  * 3.4.1    2017-07-06 local         
##  VariantAnnotation        1.22.3   2017-08-11 Bioconductor  
##  withr                    2.0.0    2017-07-28 CRAN (R 3.4.1)
##  XML                      3.98-1.9 2017-06-19 CRAN (R 3.4.1)
##  xml2                     1.1.1    2017-01-24 CRAN (R 3.4.1)
##  xtable                   1.8-2    2016-02-05 CRAN (R 3.4.1)
##  XVector                  0.16.0   2017-08-11 Bioconductor  
##  yaml                     2.1.14   2016-11-12 CRAN (R 3.4.1)
##  zlibbioc                 1.22.0   2017-08-11 BioconductorThis vignette was generated using BiocStyle (Arora, Morgan, Carlson, and Pagès, 2017) with knitr (Lawrence, Huber, Pagès, Aboyoun, et al., 2013) and rmarkdown (Rainer, 2016) running behind the scenes.
Citations made with knitcitations (Lawrence, Gentleman, and Carey, 2009).
[1] J. Allaire, J. Cheng, Y. Xie, J. McPherson, et al. rmarkdown: Dynamic Documents for R. R package version 1.6. 2017. URL: https://CRAN.R-project.org/package=rmarkdown.
[1] J. Allaire, J. Cheng, Y. Xie, J. McPherson, et al. rmarkdown: Dynamic Documents for R. R package version 1.6. 2017. URL: https://CRAN.R-project.org/package=rmarkdown.
[1] S. Arora, M. Morgan, M. Carlson and H. Pagès. GenomeInfoDb: Utilities for manipulating chromosome and other ‘seqname’ identifiers. R package version 1.12.2. 2017.
[1] C. Boettiger. knitcitations: Citations for ‘Knitr’ Markdown Files. R package version 1.0.8. 2017. URL: https://CRAN.R-project.org/package=knitcitations.
[1] C. Boettiger. knitcitations: Citations for ‘Knitr’ Markdown Files. R package version 1.0.8. 2017. URL: https://CRAN.R-project.org/package=knitcitations.
[1] W. Chang. downloader: Download Files over HTTP and HTTPS. R package version 0.4. 2015. URL: https://CRAN.R-project.org/package=downloader.
[1] L. Collado-Torres, A. E. Jaffe and J. T. Leek. regionReport: Generate HTML or PDF reports for a set of genomic regions or DESeq2/edgeR results. https://github.com/leekgroup/regionReport - R package version 1.10.2. 2016. URL: http://www.bioconductor.org/packages/regionReport.
## No encoding supplied: defaulting to UTF-8.
## No encoding supplied: defaulting to UTF-8.
## No encoding supplied: defaulting to UTF-8.
## No encoding supplied: defaulting to UTF-8.[1] S. Davis and P. Meltzer. “GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor”. In: Bioinformatics 14 (2007), pp. 1846–1847.
[1] R. Kolde. pheatmap: Pretty Heatmaps. R package version 1.0.8. 2015. URL: https://CRAN.R-project.org/package=pheatmap.
## No encoding supplied: defaulting to UTF-8.
## No encoding supplied: defaulting to UTF-8.
## No encoding supplied: defaulting to UTF-8.
## No encoding supplied: defaulting to UTF-8.[1] M. Morgan, V. Obenchain, M. Lang and R. Thompson. BiocParallel: Bioconductor facilities for parallel evaluation. R package version 1.10.1. 2017. URL: https://github.com/Bioconductor/BiocParallel.
[1] H. Pagès, M. Lawrence and P. Aboyoun. S4Vectors: S4 implementation of vectors and lists. R package version 0.14.3. 2017.
[1] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2017. URL: https://www.R-project.org/.
[1] J. Rainer. EnsDb.Hsapiens.v79: Ensembl based annotation package. R package version 2.1.0. 2016.
[1] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009. ISBN: 978-0-387-98140-6. URL: http://ggplot2.org.
[1] H. Wickham. “testthat: Get Started with Testing”. In: The R Journal 3 (2011), pp. 5–10. URL: http://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.
[1] H. Wickham and W. Chang. devtools: Tools to Make Developing R Packages Easier. R package version 1.13.3. 2017. URL: https://CRAN.R-project.org/package=devtools.
[1] D. Winter. rentrez: Entrez in R. R package version 1.1.0. 2017. URL: https://CRAN.R-project.org/package=rentrez.
[1] Y. Xie. DT: A Wrapper of the JavaScript Library ‘DataTables’. R package version 0.2. 2016. URL: https://CRAN.R-project.org/package=DT.
[1] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014. URL: http://www.crcpress.com/product/isbn/9781466561595.
[1] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014. URL: http://www.crcpress.com/product/isbn/9781466561595.