--- title: "lncRna: A Comprehensive Pipeline for lncRNA Identification and Functional Analysis" author: - Jan Pawel Jastrzebski - Damian Czopek - Mariusz Jankowski - Stefano Pascarella - Wiktor Babis - Monika Gawronska date: "`r Sys.Date()`" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{lncRna: Identification and Functional Analysis Pipeline} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=TRUE} # Global knitr configuration knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, message = FALSE, warning = FALSE) # Load libraries required for the pipeline workflow library(lncRna) library(rtracklayer) library(GenomicRanges) library(IRanges) library(gprofiler2) library(plotly) library(fmsb) library(patchwork) ``` # Introduction The `lncRna` package provides a comprehensive toolkit for the identification, analysis, and functional annotation of long non-coding RNAs (lncRNAs) from RNA-Seq data. The workflow covers several key stages: * **Initial Processing & Filtering:** Feature extraction from GTF files and filtering of lncRNA candidates based on structural features. * **Coding Potential Analysis:** Integration of multiple tools (CPC2, PLEK, CPAT, etc.) and consensus building. * **Performance Evaluation:** Systematic assessment of tools and their combinations using the `calculateCM` framework. * **Functional Analysis:** Identification of potential *cis* and *trans* interactions with protein-coding genes. * **Visualization:** Generation of high-quality plots (Radar, Clock, Sankey, Venn) for publication-ready results. # Data Preparation In a typical analysis, you would load your own data. For this demonstration, we create mock data objects that mimic real-world transcriptomic outputs. ## Create mock GTF and Test Sets We start by creating a representative genomic structure and partitioning our sequences into independent training and testing subsets using the built-in split utility. ```{r create-mock-data} # 1. Sample GRanges object (mimicking an imported GTF file) mockGtf <- GRanges( seqnames = "chr1", ranges = IRanges(start = c(100, 200, 400, 500), width = c(50, 60, 100, 20)), strand = "+", type = c("exon", "exon", "transcript", "exon"), gene_id = c("G1", "G1", "G2", "G2"), gene_biotype = c("protein_coding", "protein_coding", "lncRNA", "lncRNA"), transcript_id = c("tx1", "tx1", "tx2", "tx2"), transcript_biotype = c("protein_coding", "protein_coding", "lncRNA", "lncRNA"), exon_number = c("1", "2", NA, "1") ) # 2. Partitioning sequences using createTrainTestSets # This ensures we have ground truth sets for performance evaluation all_ids <- c("tx1", "tx2", "tx3", "tx4", "tx5", "tx6") set.seed(123) sets <- createTrainTestSets(sequences = all_ids, percentTrain = 0.5, prefix = "demo") # Define independent test sets (nc = non-coding, cds = coding) ncTestSet <- sets$demo.test cdsTestSet <- sets$demo.train ``` # Initial Filtering and Feature Extraction We extract key structural features to distinguish lncRNAs from other biotypes based on the input GTF. ```{r extract-stats} # Calculate transcript length and exon count transcriptStats <- getGtfStats(gtfObject = mockGtf) # Extract biotype information at the gene level geneBiotypes <- getBiotypes(refGtf = mockGtf, level = "gene") print(head(transcriptStats)) ``` # Coding Potential Analysis This stage integrates results from various coding potential predictors and evaluates their reliability. ## Aggregation and Venn Visualization We aggregate results from external tools and visualize the agreement (consensus) between them. ```{r aggregate-and-venn} # Mocking aggregated results from CPC2, PLEK, and CPAT mockCodPotList <- list( seqIDs = c("tx1", "tx2", "tx3", "tx4", "tx5", "tx6"), tools = list( CPC2 = c(0, 1, 1, 0, 1, 0), PLEK = c(0, 1, 0, 0, 1, 1), CPAT = c(0, 1, 1, 1, 0, 0) ) ) # Visualize tool agreement (Venn) if (requireNamespace("venn", quietly = TRUE)) { plotVennCodPot(codPot = mockCodPotList) } ``` ## Performance Evaluation We evaluate all possible tool combinations to identify the most accurate identification strategy for the current dataset. ```{r evaluation-pipeline} # 1. Prepare evaluation sets by annotating predictions with ground truth evaluationSummary <- prepareEvaluationSets( codPotList = mockCodPotList, ncTest = ncTestSet, cdsTest = cdsTestSet ) # 2. Evaluate performance for each tool combination (e.g., "CPC2+PLEK") combSummary <- evaluateToolCombinations(summaryList = evaluationSummary) # 3. Calculate summary metrics combStats <- bestToolCombination(combinationSummaryList = combSummary) print(combStats) ``` # Performance Visualization with calculateCM The `calculateCM` function serves as the central hub for the visualization pipeline. It converts raw predictions into Confusion Matrix objects and allows for threshold-based filtering. ```{r calculate-cm} # Filter for methods with Accuracy >= 0.5 (low threshold for demo purposes) # Use metricsData to provide values for filtering allCms <- calculateCM( combinationSummaryList = combSummary, metricsData = combStats, threshold = 0.5, returnOnlyHigh = TRUE, metricToExtract = "Accuracy", printMetricThresholds = TRUE ) ``` ## Radar and Clock Plots Radar and Clock plots provide a holistic view of tool performance across multiple metrics. ```{r viz-performance, fig.width=10, fig.height=6} # Comparison of top combinations (Radar Plot) methodsToPlot <- names(allCms)[1:min(3, length(allCms))] plotRadarMetrics(cmList = allCms, methods = methodsToPlot, plotTitle = "Top Methods Comparison") # Detailed metrics view (Clock Plot) in 'multiple' layout plotClockMetrics(cmList = allCms, plotTitle = "Performance Metrics Breakdown", layout = "multiple") ``` # Functional Analysis and Interactions The final stage of the pipeline identifies where and how the identified lncRNAs act in the cell. ## Cis and Trans Interactions We identify *cis*-interactions based on genomic proximity and *trans*-interactions based on expression correlation. ```{r interactions} # Identify Cis-interactions (genomic proximity) feelncFile <- tempfile() write.table(data.frame(isBest=1, lncRNA_gene="LNC1", partnerRNA_gene="GENE_A", distance=500), feelncFile, sep="\t", row.names=FALSE) cisInteractions <- findCisInteractions(FEELncClassesFile = feelncFile, maxDist = 1000) # Identify Trans-interactions via expression correlation mockExpr <- matrix(rnorm(50), nrow = 5, dimnames = list(c("LNC1", "LNC2", "GENE_A", "GENE_B", "GENE_C"), paste0("S", 1:10))) mockExpr["LNC1",] <- mockExpr["GENE_A",] + rnorm(10, 0, 0.1) transInteractions <- findTransInteractions(exprMatrix = mockExpr, rval = 0.8, pval = 0.05) ``` ## Annotation and Sankey Diagram We link interactions to functional enrichment terms and visualize the flow using an interactive Sankey diagram. ```{r sankey-viz} # Sample enrichment result (mimicking gprofiler2::gost) mockGost <- data.frame(term_id="GO:01", term_name="stress response", intersection="GENE_A") # Annotate interactions with functional terms interactionsProcessed <- annotateInteractions(gostResult = mockGost, interactionTable = transInteractions, type = "trans") # Generate interactive Sankey diagram plotSankeyInteractions(interactionData = interactionsProcessed, groupBy = "term", selectIds = "stress response", title = "Functional Interaction Flow") ``` # Session Information ```{r session-info} sessionInfo()