ORFik 1.20.2
Welcome to the introduction of data management with ORFik experiment. This vignette will walk you through how to work with large amounts of sequencing data effectively in ORFik.
ORFik
is an R package containing various functions for analysis of RiboSeq, RNASeq, RCP-seq, TCP-seq, Chip-seq and Cage data, we advice you to read ORFikOverview vignette, before starting this one.
NGS libraries are becoming more and more numerous. As a bioinformatician / biologist you often work on multi-library experiments, like 6 libraries of RNA-seq and 6 Ribo-seq libraries, split on 3 conditions with 2 replicates each. Then make some plots or statistics. A lot of things can go wrong when you scale up from just 1 library to many, or even to multiple experiments.
Another problem is also that annotations like gff and fasta files combined with the NGS data, must be separately loaded. Making it possible to use wrong annotation for the NGS data.
So to summarize, the ORFik experiment API abstracts what could be done with 1 NGS library and a corresponding organism annotation to the level of multiple libraries and the comparison between them, standardizing ploting, comparisons, loading libraries and much more.
It is an object that simplify and error correct your NGS workflow, creating a single R object that stores and controls all results relevant to a specific experiment. It contains following important parts:
Let’s say we have a human experiment, containing annotation files (.gtf and .fasta genome) + Next generation sequencing libraries (NGS-data); RNA-seq, ribo-seq and CAGE.
An example of how to make the experiment will now be shown:
First load ORFik
library(ORFik)
In a normal experiment, you would usually have only bam files from alignment of your experiment to start with (and split this into 3 experiments, 1 for RNA-seq, 1 for Ribo-seq and 1 for CAGE), but to simplify this for you to replicate we use the ORFik example data.
The minimal amount of information you need to make an ORFik experiment is:
# 1. Pick directory (normally a folder with your aligned bam files)
NGS.dir <- system.file("extdata/Homo_sapiens_sample", "", package = "ORFik")
# 2. .gff/.gtf location
txdb <- system.file("extdata/Homo_sapiens_sample", "Homo_sapiens_dummy.gtf.db", package = "ORFik")
# 3. fasta genome location
fasta <- system.file("extdata/Homo_sapiens_sample", "Homo_sapiens_dummy.fasta", package = "ORFik")
# 4. Pick an experiment name
exper.name <- "ORFik_example_human"
list.files(NGS.dir)
## [1] "CAGE_Mutant_rep1.ofst" "CAGE_Mutant_rep2.ofst"
## [3] "CAGE_WT_rep1.ofst" "CAGE_WT_rep2.ofst"
## [5] "Homo_sapiens_dummy.fasta" "Homo_sapiens_dummy.fasta.fai"
## [7] "Homo_sapiens_dummy.gtf" "Homo_sapiens_dummy.gtf.db"
## [9] "PAS_Mutant_rep1.ofst" "PAS_Mutant_rep2.ofst"
## [11] "PAS_WT_rep1.ofst" "PAS_WT_rep2.ofst"
## [13] "QC_STATS" "RFP_Mutant_rep1.ofst"
## [15] "RFP_Mutant_rep2.ofst" "RFP_WT_rep1.ofst"
## [17] "RFP_WT_rep2.ofst" "RNA_Mutant_rep1.ofst"
## [19] "RNA_Mutant_rep2.ofst" "RNA_WT_rep1.ofst"
## [21] "RNA_WT_rep2.ofst" "ofst"
Experiments are created by all accepted files from a folder (file extension given by type argument, default: bam, bed, wig, ofst), so remember to keep your experiment folder clean of other NGS libraries of these types not related to the experiment.
# This experiment is intentionally malformed, so we first make only a template:
template <- create.experiment(dir = NGS.dir, # directory of the NGS files for the experiment
exper.name, # Experiment name
txdb = txdb, # gtf / gff / gff.db annotation
fa = fasta, # Fasta genome
organism = "Homo sapiens", # Scientific naming
saveDir = NULL, # Create template instead of ready experiment
)
# The experiment contains 3 main parts:
# 1. Annotation, organism, general info:
data.frame(template)[1:3, ]
## X1
## 1 name
## 2 gff
## 3 fasta
## X2
## 1 ORFik_example_human
## 2 /tmp/Rtmpf8NXAt/Rinstb73935747b7e3/ORFik/extdata/Homo_sapiens_sample/Homo_sapiens_dummy.gtf.db
## 3 /tmp/Rtmpf8NXAt/Rinstb73935747b7e3/ORFik/extdata/Homo_sapiens_sample/Homo_sapiens_dummy.fasta
## X3 X4 X5 X6
## 1
## 2 organism Homo sapiens
## 3
# 2. NGS data set-up info:
data.frame(template)[4:8, 1:5]
## X1 X2 X3 X4 X5
## 4 libtype stage rep condition fraction
## 5 CAGE 1 Mutant
## 6 CAGE 2 Mutant
## 7 CAGE 1 WT
## 8 CAGE 2 WT
# 3. NGS File paths:
data.frame(template)[4:8, 6]
## [1] "filepath"
## [2] "/tmp/Rtmpf8NXAt/Rinstb73935747b7e3/ORFik/extdata/Homo_sapiens_sample/CAGE_Mutant_rep1.ofst"
## [3] "/tmp/Rtmpf8NXAt/Rinstb73935747b7e3/ORFik/extdata/Homo_sapiens_sample/CAGE_Mutant_rep2.ofst"
## [4] "/tmp/Rtmpf8NXAt/Rinstb73935747b7e3/ORFik/extdata/Homo_sapiens_sample/CAGE_WT_rep1.ofst"
## [5] "/tmp/Rtmpf8NXAt/Rinstb73935747b7e3/ORFik/extdata/Homo_sapiens_sample/CAGE_WT_rep2.ofst"
You see from the template, it excludes files with .bai or .fai, .rdata etc, and only using data of NGS libraries, defined by argument (type).
You can also see it tries to guess library types, stages, replicates, condition etc. It will also try to auto-detect paired end bam files. ## Fixing or updating an experiment Since every NGS file in a experiment must be a unique set of information columns ( there can not be 2 RNA-seq libraries from wildtype that are replicate1 etc), the create.experiment function will intentionally abort if it can not distinguish all the libraries in some way. (Example: It might find 2 files that are categorized as RNA-seq replicate 1, but the condtion: Wild type vs crispr mutant was not auto-detected), so the files would be non-unique.
To fix the things it did not find (a condition not specified, etc), there are 3 ways:
Let’s update the template to have correct tissue-fraction in one of the samples.
template$X5[5:6] <- "heart_valve" # <- fix non unique row (tissue fraction is heart valve)
df <- read.experiment(template)# read experiment from template
Normally you read experiments saved to disc, if you made only a template, save it by doing:
save.experiment(df, file = "path/to/save/experiment.csv")
You can then load the experiment whenever you need it.
ORFik comes with a example experiment, you can load with:
ORFik.template.experiment()
To see the object, just show it like this:
df
## experiment: ORFik_example_human with 4 library types and 16 runs
## libtype rep condition fraction
## 1: CAGE 1 Mutant heart_valve
## 2: CAGE 2 Mutant heart_valve
## 3: CAGE 1 WT
## 4: CAGE 2 WT
## 5: PAS 1 Mutant
## 6: PAS 2 Mutant
## 7: PAS 1 WT
## 8: PAS 2 WT
## 9: RFP 1 Mutant
## 10: RFP 2 Mutant
## 11: RFP 1 WT
## 12: RFP 2 WT
## 13: RNA 1 Mutant
## 14: RNA 2 Mutant
## 15: RNA 1 WT
## 16: RNA 2 WT
When you print the experiment object, you see here that file paths are hidden, you can access them like this:
filepath(df, type = "default")
## [1] "/tmp/Rtmpf8NXAt/Rinstb73935747b7e3/ORFik/extdata/Homo_sapiens_sample/CAGE_Mutant_rep1.ofst"
## [2] "/tmp/Rtmpf8NXAt/Rinstb73935747b7e3/ORFik/extdata/Homo_sapiens_sample/CAGE_Mutant_rep2.ofst"
## [3] "/tmp/Rtmpf8NXAt/Rinstb73935747b7e3/ORFik/extdata/Homo_sapiens_sample/CAGE_WT_rep1.ofst"
## [4] "/tmp/Rtmpf8NXAt/Rinstb73935747b7e3/ORFik/extdata/Homo_sapiens_sample/CAGE_WT_rep2.ofst"
## [5] "/tmp/Rtmpf8NXAt/Rinstb73935747b7e3/ORFik/extdata/Homo_sapiens_sample/PAS_Mutant_rep1.ofst"
## [6] "/tmp/Rtmpf8NXAt/Rinstb73935747b7e3/ORFik/extdata/Homo_sapiens_sample/PAS_Mutant_rep2.ofst"
## [7] "/tmp/Rtmpf8NXAt/Rinstb73935747b7e3/ORFik/extdata/Homo_sapiens_sample/PAS_WT_rep1.ofst"
## [8] "/tmp/Rtmpf8NXAt/Rinstb73935747b7e3/ORFik/extdata/Homo_sapiens_sample/PAS_WT_rep2.ofst"
## [9] "/tmp/Rtmpf8NXAt/Rinstb73935747b7e3/ORFik/extdata/Homo_sapiens_sample/RFP_Mutant_rep1.ofst"
## [10] "/tmp/Rtmpf8NXAt/Rinstb73935747b7e3/ORFik/extdata/Homo_sapiens_sample/RFP_Mutant_rep2.ofst"
## [11] "/tmp/Rtmpf8NXAt/Rinstb73935747b7e3/ORFik/extdata/Homo_sapiens_sample/RFP_WT_rep1.ofst"
## [12] "/tmp/Rtmpf8NXAt/Rinstb73935747b7e3/ORFik/extdata/Homo_sapiens_sample/RFP_WT_rep2.ofst"
## [13] "/tmp/Rtmpf8NXAt/Rinstb73935747b7e3/ORFik/extdata/Homo_sapiens_sample/RNA_Mutant_rep1.ofst"
## [14] "/tmp/Rtmpf8NXAt/Rinstb73935747b7e3/ORFik/extdata/Homo_sapiens_sample/RNA_Mutant_rep2.ofst"
## [15] "/tmp/Rtmpf8NXAt/Rinstb73935747b7e3/ORFik/extdata/Homo_sapiens_sample/RNA_WT_rep1.ofst"
## [16] "/tmp/Rtmpf8NXAt/Rinstb73935747b7e3/ORFik/extdata/Homo_sapiens_sample/RNA_WT_rep2.ofst"
ORFik has an extensive syntax for file type variants for your libraries: example is you have both bam, bigwig and ofst files of the same library, used for different purposes.
If you have varying version of libraries, like p-shifted, bam, simplified wig files, you can get file paths to different version with this function, like this:
filepath(df[df$libtype == "RFP", ], type = "pshifted")[2] # RFP = Ribo-seq, Default location for pshifted reads
## [1] "/tmp/Rtmpf8NXAt/Rinstb73935747b7e3/ORFik/extdata/Homo_sapiens_sample/RFP_Mutant_rep2.ofst"
The filepath function uses a reductive search, so that if you specify type = “bigwig”, and you do not have those files, it will point you to the lower level file “ofst”. If you don’t have those either, it goes to the default file (usually bam format). This ensure you will at least load something, it just depends how fast those files are. It also makes it easy to scale up and generalize you scripts to new experiments.
There are 3 ways to load NGS data, the first one is to load data into an environment. By default all libraries are loaded into .GlobalEnv (global environment), you can check what environment it is output to, by running:
envExp(df) #This will be the environment
## <environment: R_GlobalEnv>
The library names are decided by the columns in experiment, to see what the names will be, do:
bamVarName(df) #This will be the names:
## [1] "CAGE_Mutant_fheart_valve_r1" "CAGE_Mutant_fheart_valve_r2"
## [3] "CAGE_WT_r1" "CAGE_WT_r2"
## [5] "PAS_Mutant_r1" "PAS_Mutant_r2"
## [7] "PAS_WT_r1" "PAS_WT_r2"
## [9] "RFP_Mutant_r1" "RFP_Mutant_r2"
## [11] "RFP_WT_r1" "RFP_WT_r2"
## [13] "RNA_Mutant_r1" "RNA_Mutant_r2"
## [15] "RNA_WT_r1" "RNA_WT_r2"
Now let’s auto-load the libraries to the global environment
outputLibs(df) # With default output.mode = "envir".
## Outputting libraries from: ORFik_example_human
To remove the outputted libraries:
# remove.experiments(df)
The second way gives you a list, where the elements are the NGS libraries. There are also two ways of loading the list:
outputLibs(df, output.mode = "envirlist")[1:2] # Save NGS to envir, then return as list
## Outputting libraries from: ORFik_example_human
## $CAGE_Mutant_fheart_valve_r1
## GRanges object with 102 ranges and 2 metadata columns:
## seqnames ranges strand | score size
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr1 306 + | 25 1
## [2] chr1 308 + | 3 1
## [3] chr1 309 + | 8 1
## [4] chr1 310 + | 2 1
## [5] chr1 313 + | 1 1
## ... ... ... ... . ... ...
## [98] chr6 333 + | 1 1
## [99] chr6 339 + | 1 1
## [100] chr6 340 + | 1 1
## [101] chr6 355 + | 1 1
## [102] chr6 363 + | 1 1
## -------
## seqinfo: 6 sequences from an unspecified genome; no seqlengths
##
## $CAGE_Mutant_fheart_valve_r2
## GRanges object with 106 ranges and 2 metadata columns:
## seqnames ranges strand | score size
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr1 306 + | 44 1
## [2] chr1 307 + | 11 1
## [3] chr1 309 + | 1 1
## [4] chr1 310 + | 2 1
## [5] chr1 312 + | 3 1
## ... ... ... ... . ... ...
## [102] chr6 326 + | 1 1
## [103] chr6 327 + | 1 1
## [104] chr6 329 + | 2 1
## [105] chr6 334 + | 2 1
## [106] chr6 336 + | 1 1
## -------
## seqinfo: 6 sequences from an unspecified genome; no seqlengths
# Check envir, if it exist, list them and return, if not, only return list
outputLibs(df, output.mode = "list")[1:2]
## Outputting libraries from: ORFik_example_human
## $CAGE_Mutant_fheart_valve_r1
## GRanges object with 102 ranges and 2 metadata columns:
## seqnames ranges strand | score size
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr1 306 + | 25 1
## [2] chr1 308 + | 3 1
## [3] chr1 309 + | 8 1
## [4] chr1 310 + | 2 1
## [5] chr1 313 + | 1 1
## ... ... ... ... . ... ...
## [98] chr6 333 + | 1 1
## [99] chr6 339 + | 1 1
## [100] chr6 340 + | 1 1
## [101] chr6 355 + | 1 1
## [102] chr6 363 + | 1 1
## -------
## seqinfo: 6 sequences from an unspecified genome; no seqlengths
##
## $CAGE_Mutant_fheart_valve_r2
## GRanges object with 106 ranges and 2 metadata columns:
## seqnames ranges strand | score size
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr1 306 + | 44 1
## [2] chr1 307 + | 11 1
## [3] chr1 309 + | 1 1
## [4] chr1 310 + | 2 1
## [5] chr1 312 + | 3 1
## ... ... ... ... . ... ...
## [102] chr6 326 + | 1 1
## [103] chr6 327 + | 1 1
## [104] chr6 329 + | 2 1
## [105] chr6 334 + | 2 1
## [106] chr6 336 + | 1 1
## -------
## seqinfo: 6 sequences from an unspecified genome; no seqlengths
The third way is to load manually, more secure, but also more cumbersome.
files <- filepath(df, type = "default")
CAGE_loaded_manually <- fimport(files[1])
If you use the auto-loading to environment and you have multiple experiments, it might be a chance of non-unique naming, 2 experiments might have a library called cage. To be sure names are unique, we add the experiment name in the variable name:
df@expInVarName <- TRUE
bamVarName(df) #This will be the names:
## [1] "ORFik_example_human_CAGE_Mutant_fheart_valve_r1"
## [2] "ORFik_example_human_CAGE_Mutant_fheart_valve_r2"
## [3] "ORFik_example_human_CAGE_WT_r1"
## [4] "ORFik_example_human_CAGE_WT_r2"
## [5] "ORFik_example_human_PAS_Mutant_r1"
## [6] "ORFik_example_human_PAS_Mutant_r2"
## [7] "ORFik_example_human_PAS_WT_r1"
## [8] "ORFik_example_human_PAS_WT_r2"
## [9] "ORFik_example_human_RFP_Mutant_r1"
## [10] "ORFik_example_human_RFP_Mutant_r2"
## [11] "ORFik_example_human_RFP_WT_r1"
## [12] "ORFik_example_human_RFP_WT_r2"
## [13] "ORFik_example_human_RNA_Mutant_r1"
## [14] "ORFik_example_human_RNA_Mutant_r2"
## [15] "ORFik_example_human_RNA_WT_r1"
## [16] "ORFik_example_human_RNA_WT_r2"
You see here that the experiment name, “ORFik” is in the variable name If you are only working on one experiment, you do not need to include the name, since there is no possibility of duplicate naming (the experiment class validates all names are unique).
Since we want NGS data names without “ORFik”, let’s remove the loaded libraries and load them again.
df@expInVarName <- FALSE
remove.experiments(df)
## Removed loaded libraries from experiment:ORFik_example_human
outputLibs(df)
## Outputting libraries from: ORFik_example_human
There is also many function to load specific parts of the annotation:
txdb <- loadTxdb(df) # transcript annotation
Let’s say we want to load all leaders, cds and 3’ UTRs that are longer than 30. With ORFik experiment this is easy:
txNames <- filterTranscripts(txdb, minFiveUTR = 30, minCDS = 30, minThreeUTR = 30)
loadRegions(txdb, parts = c("leaders", "cds", "trailers"), names.keep = txNames)
The regions are now loaded into .GlobalEnv, only keeping transcripts from txNames.
ORFik supports a myriad of plots for experiments. Lets make a plot with coverage over mrna, seperated by 5’ UTR, CDS and 3’ UTR in one of the ribo-seq libraries from the experiment
transcriptWindow(leaders, cds, trailers, df[9,], BPPARAM = BiocParallel::SerialParam())
## Outputting libraries from: ORFik_example_human
## RFP
## sum
## transcriptNormalized
## [[1]]
##
## [[2]]
If your experiment consists of Ribo-seq, you want to do p-site shifting.
shiftFootprintsByExperiment(df[df$libtype == "RFP",])
P-shifted ribo-seq will automaticly be stored as .ofst (ORFik serialized for R) and .wig (track files for IGV/UCSC) files in a ./pshifted folder, relative to original libraries.
To validate p-shifting, use shiftPlots. Here is an example from Bazzini et al. 2014 I made.
df.baz <- read.experiment("zf_bazzini14_RFP") # <- this exp. does not exist for you
shiftPlots(df.baz, title = "Ribo-seq, zebrafish, Bazzini et al. 2014", type = "heatmap")