--- title: "The Expression Hunter Suite" author: "James R. Perkins" date: "`r format(Sys.Date(), '%m/%d/%Y')`" output: rmarkdown::html_document: highlight: pygments toc: true fig_width: 5 vignette: > %\VignetteIndexEntry{The Expression Hunter Suite} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Version Info **R version**: `r R.version.string` **Bioconductor version**: `r BiocManager::version()` **Package version**: `r packageVersion("ExpHunterSuite")` # Introduction ExpHunterSuite implements a comprehensive protocol for the analysis of transcriptional data using established *R* packages and combining their results. It covers all key steps in DEG detection, CEG detection and functional analysis for RNA-seq data. It has been implemented as an R package containing functions that can be run interactively. In addition, it also contains scripts that wrap the functions and can be run directly from the command line. # Standard Package Usage In this section we will describe how the functions in ExpHunterSuite can be used interactively or joined together in user-written scripts. We will also describe how the output reports can be generated from this data. ## Differential Expression Analysis The most basic use of the package is to perform differential expression (DE) gene analysis. ExpHunterSuite will, following some initial preprocessing, run the different methods, combine the results, and produce an output report, as well as a single output table containing the results of all of the methods used, and their combined scores. The combined scores consist of the mean logFC value and the combined adjusted p-value (FDR) values, calculated by Fishers method. To use ExpHunterSuite with only a single DE package, one can run the following command: ```{r DEA_one_pack, echo=TRUE, results="hide", message=FALSE, warning=FALSE} library(ExpHunterSuite) data(toc) data(target) degh_out_one_pack <- main_degenes_Hunter(raw=toc, target=target, modules="D") # D for DESeq2 ``` Where toc is a data.frame of aligned reads per samples and target is a data.frame relating each sample to its sample meta data. Here we include a minimal example of the target file whcih includes the samples (CTL/epm2a), the samples condition (Ctrl or Treat): ```{r input_data, echo=TRUE, results="as.is", message=FALSE, warning=FALSE} head(toc) head(target) ``` The files containing this data are contained within the extData package directory and can be accessed in the following manner. We will come back to them in the section on command line usage. ```{r input_files, echo=TRUE, results="as.is"} system.file("extData", "table_of_counts.txt", package = "ExpHunterSuite") system.file("extData", "target.txt", package = "ExpHunterSuite") ``` To use it with multiple packages, one can run the following command: ```{r DEA_multi_pack, echo=TRUE, results="hide", message=FALSE, warning=FALSE} degh_out_multi_pack <- main_degenes_Hunter(raw=toc, target=target, modules="DEL") # D:DESeq2 E:EdgeR, L:limma ``` The output is a list, which includes, in slot *DE_all_genes* a data.frame containing, for each gene, logFC/p-values/adjusted p-values for the different DE methods implemented: ```{r standard_DEA_results, echo=TRUE} head(degh_out_multi_pack$DE_all_genes) ``` It also contains information on whether the genes are considered to be DE, in the column *genes_tag* The tag *PREVALENT_DEGS* refers to those genes that are considered significant in at least n of the DE methods used *POSSIBLE_DEGS* are those considered significant by at least one method. As such, *PREVALENT_DEGS* and *POSSIBLE_DEGS* will be the same when n = 1. N is controlled by the argument *minlibraries*. To be considered significant for a given method, a gene must have an adjusted-pvalue < 0.05 and |logFC| > 1; these values are adjustable using the arguments *p_val_cutoff* and *lfc*. The *genes_tag* columns includes the labels *NOT_DEGS* and *FILTERED_OUT* to refer to those genes not detected as DE by at least one DE method and those that do not pass the initial low-count filtering step, controlled by parameters *reads* and *minlibraries*. There is another column, *combined_FDR* – which is POS/NEG depending on whether the combined adjusted p-value as described above is less than or equal to 0.05 (or whatever the value of the argument *p_val_cutoff* is). ### More complex model designs. In order to control for specific variables (such as individuals in paired designs, potential confounding factors such as age, etc.), For example, if we consider our previous experiment, but add an extra column to the target, indicating different age groupings for the samples we obtain the following: ```{r standard_DEA_model_target, echo=TRUE, results="as.is"} target_multi <- data.frame(target, age_group = c("adult", "child", "adult", "child", "adult", "adult", "child")) target_multi ``` We may wish to control for the effects of age_group on the experiment. This can be achieved using the argument *model_variables*. The variables given to this argument will be used in the model when calculating differential expression between the Treat and Ctrl samples: ```{r standard_DEA_model_execute, results="hide", message=FALSE, warning=FALSE} degh_out_model <- main_degenes_Hunter(raw=toc, target=target_multi, modules="D", model_variables="age_group") ``` This works by using the variable age_group to create a linear model formula to be passed to the different DE methods (with the exception of NOISeq). The output has the same structure as the original analysis. Custom model designs can also be specified in the *model_variables* argument, based on the R model syntax, see *help("formula")* for more details. If a custom formula is used, the *custom_model* argument must be set to true. ## Co-expression Analysis Co-expression analysis is included via the R package Weighted correlation network analysis (WGCNA). The idea is to look for groups (modules) of genes showing correlated expression. The groups can then be correlated with experimental factors, such as treatment vs. non treatment, as well as other groupings such as the age grouping mentioned earlier, or numeric factors such as known values of metabolites related to the experiment. WGCNA is activated using by adding "W" to the *modules* argument. The traits to be correlated with the modules are specified using the *string_factors* and *numeric_factors* options: ```{r standard_CEA, results="hide", message=FALSE, warning=FALSE} degh_out_coexp <- main_degenes_Hunter(raw=toc, target=target_multi, modules="DW", string_factors="age_group") ``` Please note that WGCNA requires a normalized expression matrix as input, as such it cannot be run alone, it must be run alongside at least one DE method, which is specified with the argument *WGCNA_norm_method*. ## Functional Analysis - Expression Data Functional analysis can be performed on the results of the DEG/WGCNA analysis to look for over representation (enrichement) for groups of genes amongst the DEGs and/or modules of genes obtained. For differentially expressed genes, the following code, which takes as input the output object of running degenes hunter for DE analysis, can be used. Currently only overrepresentation analysis using the clusterProfiler package is implemented. ```{r standard_FA, results="as.is", eval=FALSE} fh_out_one_pack <- main_functional_hunter( #Perform enrichment analysis degh_out_one_pack, model_organism = 'Mouse', # Use specified organism database enrich_dbs = c("MF", "BP", "CC", "Kegg", "Reactome"), # Enrichment analysis for GO, KEGG and Reactome enrich_methods = "o" # Use overepresentation analysis only ) ``` This will produce a list object as output. The members of this list are named: final_main_params - this contains a list with full details of the parameters used ORA - this list contains the results of the analysis. This is a named list of enrichResult objects, one for each gene annotation database used (e.g. GO BP, Reactome, KEGG, or in the case of a custom gmt annotation file, the file name will be used). DEGH_results_annot - this is a dataframe containing the results of the DEG/coexpression analysis. When coexpression analysis has also been run using the main_degenes_Hunter function the following code will ```{r coexp_FA, results="as.is", eval=FALSE} fh_out_coexp <- main_functional_hunter( # Perform enrichment analisys degh_out_coexp, model_organism = 'Mouse', # Use specified organism database enrich_dbs = c("MF", "BP", "CC", "Kegg", "Reactome"), # Enrichment analysis for GO, KEGG and Reactome enrich_methods = "o" # Use overepresentation analysi only ) ``` ## Functional Analysis - general There is also a function to implement functional enrichment more generally, using as input a lists of genes instead of the output of the main_degenes_Hunter function. Each line in the file should contain a list of genes, with the format: identifiergene1,gene2,... ```{r c2e, echo=TRUE, eval=FALSE, results="hide"} input_file <- system.file("extData", "cluster_genes.txt", package = "ExpHunterSuite") print(readLines(input_file,n=2)) organisms_table <- get_organism_table() current_organism_info <- organisms_table[rownames(organisms_table) %in% "Mouse",] org_db <- get_org_db(current_organism_info) enr_lists <- main_clusters_to_enrichment(input_file, org_db=org_db, current_organism_info=current_organism_info, gene_keytype="ENSEMBL") ``` ## Obtaining Reports To obtain highly detailed html reports including multiple plots to visualize the data and the results of the different analysis methods, the following commands can be used for main_degenes_Hunter and main_functional_hun: ```{r write_reports, echo=TRUE, eval=FALSE, results="hide"} write_expression_report(exp_results=degh_out_coexp) write_enrich_files(func_results=fh_out_one_pack) write_functional_report(hunter_results=degh_out_coexp, func_results=fh_out_coexp) ``` In all cases, the output folder for each report can be specified with the *output_files* option. ## Functional Analysis - special options for reports There are several options when using GO with functional analysis whereby you can exploit the hierarchical nature of this ontology to alter the results. simplify - simplify the merged enrichments by removing redundant terms, based on semantic similarity, using the simplify() function from clusterProfiler. Uses the Wang method for semantic similarity, with a cutoff of 0.7. clean_parentals - remove parental GO terms from merged enrichments for each cluster/module. I.e. when the terms in a cluster or module have a parent-child relationship, the child term is kept. Note that different clusters in the merged enrichment might have GOs terms that have a parent-child relationship. There is also a summary method, which is accessed by adding "S" to the write_clusters_to_enrichment function or by adding a non-null "sim_thr" value to the main_functional_hunter funtion. This function is used to reduce the number of enriched GO terms by clustering similar GO terms and then, for each cluster, choosing either the most signficant term or the most ancestral term, specified using the summary_common_name option. group_results - The option can be used to group categories with similar names in the emap plots produced in the reports (the graphs that show the connections between terms) and in the plots produced by write_functional_report with mode "P". max_genes - the maximum number of genes to plot in the cnet plots - the plots in the reports that shows categories connected to each other via shared genes. # Command-line Package Usage The package also includes a number of scripts, in the folder *inst/scripts*, which can be used to run the above functions from the command line. We recommend the user first creates a folder in which to install the ExpHunterSuite command line scripts, then copies the scripts there and make them command line accesible using these commands: ```bash mkdir install_folder Rscript -e "ExpHunterSuite::install_DEgenes_hunter('install_folder')" export PATH=path_to_install_folder:$PATH ``` This export PATH can also be added to the .bashrc or .bash_profile files. The user can then run the protocol from the command line with scripts such as the following, which will implement the functions and create the output reports, all from a single script. ```bash degenes_Hunter.R -t $TARGET_FILE -i $TOC -o $EXP_RESULTS functional_Hunter.R -i $EXP_RESULTS -m Organism -o FUNC_RESULTS ``` Full details of the arguments to give the the script can be found by running *degenes_Hunter.R -h* or *functional_Hunter.R -h*. More examples are given in the README file for this packet