Common Sequence Analysis Work Flows

Martin Morgan, Sonali Arora
October 28, 2014

RNA-seq

RNA-seq differential expression of known genes

Simplest scenario
Experimental design: simple, replicated; track covariates and be aware of batch effects
Sequencing: moderate length and number of reads; single or paired-end (though probably paired-end).
Alignment: basic splice-aware aligner, e.g., TopHat2 (built on Bowtie2) or STAR. Viable Bioconductor approaches: Rsubread, Rbowtie (especially via the QuasR package).
Reduction: GenomicAlignments::summarizeOverlaps() or external tools, using gene models from a TxDb.* package or GFF / GTF files. End result: matrix of counts (genes x samples).
Analysis: DESeq2, edgeR, and additional software.
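
The reduction and analysis steps above can be sketched in R as follows. This is a minimal illustration, not a full pipeline: the gene models and alignments are hand-made toy objects standing in for a TxDb.* package and BAM files, and the differential expression step runs on counts simulated by DESeq2 rather than on real data.

```r
library(GenomicAlignments)     # summarizeOverlaps(); also loads GenomicRanges
library(SummarizedExperiment)  # assay()
library(DESeq2)

## Reduction: count reads per gene.
## Toy gene models (hypothetical; normally from a TxDb.* package or GTF file).
genes <- GRangesList(
    geneA = GRanges("chr1", IRanges(c(100, 300), c(200, 400)), strand = "+"),
    geneB = GRanges("chr1", IRanges(1000, 1200), strand = "-")
)
## Toy alignments (hypothetical; normally read from BAM files).
reads <- GAlignments(
    seqnames = Rle(factor(rep("chr1", 4))),
    pos      = c(120L, 150L, 310L, 1050L),
    cigar    = rep("50M", 4),
    strand   = strand(c("+", "+", "+", "-"))
)
se <- summarizeOverlaps(genes, reads, mode = "Union")
toy_counts <- assay(se)        # genes x samples matrix of counts

## Analysis: differential expression on simulated counts
## (two conditions, three samples each; design is ~ condition).
set.seed(123)
dds <- makeExampleDESeqDataSet(n = 1000, m = 6, betaSD = 1)
dds <- DESeq(dds)
res <- results(dds)
head(res[order(res$padj), ])   # genes with the smallest adjusted p-values
```

In a real analysis the SummarizedExperiment returned by summarizeOverlaps() would be given sample annotations and passed on to DESeq2 or edgeR, rather than replaced by simulated data as here.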

RNA-seq differential expression of known transcripts

Popular non-R work flow: Bowtie2, TopHat, Cufflinks, Cuffdiff.
Bioconductor options
- DEXSeq: differential exon use.
- Rsubread::subjunc() for aligning without requiring known gene models.
- cummeRbund: working with cufflinks output.

Single-cell expression

ChIP-seq

See my recent slides outlining ChIP-seq and relevant Bioconductor software.

Experimental design / wet lab: important to effectively enrich genomic DNA via ChIP, otherwise hard to distinguish signal peaks from background
Sequencing: moderate length and number of single-end reads are usually adequate.
Alignment: basic aligners are sufficient.
Reduction
- External software; many tools depending on application, e.g., MACS.
- Product: BED and / or WIG files of called peaks
Analysis & Comprehension
- ChIPQC for quality control.
- rtracklayer to input BED and WIG files to standard Bioconductor data structures.
- ChIPpeakAnno, ChIPXpress for annotating peaks in relation to genes.
- DiffBind to assess differential representation of peaks in a designed experiment.
- AnnotationHub for accessing (some) consortium-level summary data.
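
As a rough illustration of the input and annotation steps above, the following sketch reads called peaks from a BED file with rtracklayer and relates each peak to its nearest gene. The BED records and gene coordinates are made up for the example, and plain GenomicRanges operations are used as a simple stand-in for the richer annotation offered by ChIPpeakAnno.

```r
library(rtracklayer)
library(GenomicRanges)

## A tiny, hypothetical BED file of called peaks (e.g., as written by MACS).
bed <- tempfile(fileext = ".bed")
writeLines(c("chr1\t99\t200\tpeak1\t50",
             "chr1\t4999\t5100\tpeak2\t80"), bed)

## import() returns a GRanges; BED's 0-based starts become 1-based.
peaks <- import(bed)

## Hypothetical gene models; normally taken from a TxDb.* package.
genes <- GRanges("chr1", IRanges(c(1000, 6000), c(2000, 7000)),
                 strand = c("+", "-"), gene_id = c("g1", "g2"))

## Annotate each peak with its nearest gene and the distance to it.
hits <- distanceToNearest(peaks, genes)
peaks$nearest_gene <- genes$gene_id[subjectHits(hits)]
peaks$distance     <- mcols(hits)$distance
peaks
```

Because the result is an ordinary GRanges, the same object can be passed on to other Bioconductor tools for overlap, summary, or visualization.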

Variants

See Michael Lawrence's variant calling with VariantTools, and Val Obenchain's manipulation and annotation of called variants with VariantAnnotation.

Sequencing: requires high-quality reads with high per-nucleotide depth of coverage – longer, paired-end sequencing.
Alignment: requires effective aligners; BWA, GMAP, …
- gmapR wraps the GMAP aligner in R.
Reduction: typically to VCF files summarizing variants and / or population-level variation. GATK and other non-R tools commonly used.
- VariantTools includes facilities for calling variants.
- h5vc targets a different intermediate step: summarize base counts at each position in the genome; use this as a starting point for calling variants, and to evaluate false positives, etc.
Analysis & comprehension
- VariantAnnotation, ensemblVEP for querying / inputting VCF files, and for annotation of variants (“is this a coding variant?”, etc.).
- SomaticSignatures for working with somatic signatures of single-nucleotide variants.
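
A minimal sketch of inputting and querying a VCF file with VariantAnnotation, using the small chr22 example file distributed with the package (so the calls here are illustrative, not an analysis of real study data):

```r
library(VariantAnnotation)

## Example VCF shipped with the VariantAnnotation package.
fl  <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")
vcf <- readVcf(fl, genome = "hg19")

vcf                            # overview of header, info and genotype fields
head(rowRanges(vcf), 3)        # variant positions as a GRanges

## Which records are single-nucleotide variants?
snv <- isSNV(vcf)
table(snv)

## Genotype calls for the first few variants and samples.
head(geno(vcf)$GT[, 1:3], 3)
```

A usual next step for questions such as "is this a coding variant?" is locateVariants() together with a TxDb.* package, after making sure the chromosome naming of the VCF and the annotation agree.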

Methylation

See the short introduction and lab centered around Illumina 450k methylation arrays and the minfi package.

Analysis & comprehension: bsseq, BiSeq for processing and analysis; bumphunter as basic tool for identifying CpG features.

Microbiome

Experimental design: typically population-level surveys with moderate numbers (10s to 100s) of samples.
Wet lab & sequencing: often target phylogenetically-informative genes, requiring longer (overlapping) paired-end reads. Many existing studies used 454 technology, which has a different sequencing error model than Illumina (e.g., homopolymers are a common error, instead of trailing nucleotide quality deterioration).
Reduction: pre-processing (e.g., knitting together overlapping paired-end reads) and taxonomic classification / placement in third-party software, e.g., QIIME, pplacer. End result: count table summarizing representation of distinct taxa in each sample.
- rRDP provides an R / Bioconductor interface to the RDP classifier.
Analysis: R / Bioconductor and many insights from microarray / RNA-seq analysis are well suited to the count table, but common pipelines have re- or dis-invented the wheel.
- phyloseq provides very nice tools for general analysis.
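
To make the count table concrete, here is a small sketch of wrapping a taxa-by-sample matrix in a phyloseq object and computing per-sample richness. The counts, taxon names, and the "site" covariate are invented for illustration; in practice the table would come from QIIME or a similar tool.

```r
library(phyloseq)

## Hypothetical count table: rows are taxa, columns are samples.
counts <- matrix(
    c(10, 0, 5, 3,
       0, 8, 2, 1,
       4, 4, 0, 7),
    nrow = 3, byrow = TRUE,
    dimnames = list(paste0("taxon", 1:3), paste0("sample", 1:4))
)

## Hypothetical sample covariates; row names must match the sample names.
samples <- data.frame(
    site = c("gut", "gut", "skin", "skin"),
    row.names = colnames(counts)
)

ps <- phyloseq(otu_table(counts, taxa_are_rows = TRUE),
               sample_data(samples))

## Observed richness and Shannon diversity for each sample.
rich <- estimate_richness(ps, measures = c("Observed", "Shannon"))
rich
```

The same count matrix could equally be handed to count-based tools from RNA-seq analysis, which is the point made above about reusing existing Bioconductor methods.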