---
title: "lncRna: A Comprehensive Pipeline for lncRNA Identification and Functional Analysis"
author:
  - Jan Pawel Jastrzebski
  - Damian Czopek
  - Mariusz Jankowski
  - Stefano Pascarella
  - Wiktor Babis
  - Monika Gawronska
date: "`r Sys.Date()`"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{lncRna: Identification and Functional Analysis Pipeline}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=TRUE}
# Global knitr configuration
knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, message = FALSE, warning = FALSE)

# Load libraries required for the pipeline workflow
library(lncRna)
library(rtracklayer)
library(GenomicRanges)
library(IRanges)
library(gprofiler2)
library(plotly)
library(fmsb)
library(patchwork)
```

#  Introduction

The `lncRna` package provides a comprehensive toolkit for the identification, analysis, and functional annotation of long non-coding RNAs (lncRNAs) from RNA-Seq data. The workflow covers several key stages:

* **Initial Processing & Filtering:** Feature extraction from GTF files and filtering of lncRNA candidates based on structural features.
* **Coding Potential Analysis:** Integration of multiple tools (CPC2, PLEK, CPAT, etc.) and consensus building.
* **Performance Evaluation:** Systematic assessment of tools and their combinations using the `calculateCM` framework.
* **Functional Analysis:** Identification of potential *cis* and *trans* interactions with protein-coding genes.
* **Visualization:** Generation of high-quality plots (Radar, Clock, Sankey, Venn) for publication-ready results.

#  Data Preparation

In a typical analysis, you would load your own data. For this demonstration, we create mock data objects that mimic real-world transcriptomic outputs.

##  Create mock GTF and Test Sets

We start by creating a representative genomic structure and partitioning our sequences into independent training and testing subsets using the built-in split utility.

```{r create-mock-data}
# 1. Sample GRanges object (mimicking an imported GTF file)
mockGtf <- GRanges(
  seqnames = "chr1",
  ranges = IRanges(start = c(100, 200, 400, 500), width = c(50, 60, 100, 20)),
  strand = "+",
  type = c("exon", "exon", "transcript", "exon"),
  gene_id = c("G1", "G1", "G2", "G2"),
  gene_biotype = c("protein_coding", "protein_coding", "lncRNA", "lncRNA"),
  transcript_id = c("tx1", "tx1", "tx2", "tx2"),
  transcript_biotype = c("protein_coding", "protein_coding", "lncRNA", "lncRNA"),
  exon_number = c("1", "2", NA, "1")
)

# 2. Partitioning sequences using createTrainTestSets
# This ensures we have ground truth sets for performance evaluation
all_ids <- c("tx1", "tx2", "tx3", "tx4", "tx5", "tx6")
set.seed(123)
sets <- createTrainTestSets(sequences = all_ids, percentTrain = 0.5, prefix = "demo")

# Define independent test sets (nc = non-coding, cds = coding)
ncTestSet <- sets$demo.test
cdsTestSet <- sets$demo.train
```

#  Initial Filtering and Feature Extraction

We extract key structural features to distinguish lncRNAs from other biotypes based on the input GTF.

```{r extract-stats}
# Calculate transcript length and exon count
transcriptStats <- getGtfStats(gtfObject = mockGtf)

# Extract biotype information at the gene level
geneBiotypes <- getBiotypes(refGtf = mockGtf, level = "gene")

print(head(transcriptStats))
```

#  Coding Potential Analysis

This stage integrates results from various coding potential predictors and evaluates their reliability.

##  Aggregation and Venn Visualization

We aggregate results from external tools and visualize the agreement (consensus) between them.

```{r aggregate-and-venn}
# Mocking aggregated results from CPC2, PLEK, and CPAT
mockCodPotList <- list(
  seqIDs = c("tx1", "tx2", "tx3", "tx4", "tx5", "tx6"),
  tools = list(
    CPC2 = c(0, 1, 1, 0, 1, 0),
    PLEK = c(0, 1, 0, 0, 1, 1),
    CPAT = c(0, 1, 1, 1, 0, 0)
  )
)

# Visualize tool agreement (Venn)
if (requireNamespace("venn", quietly = TRUE)) {
    plotVennCodPot(codPot = mockCodPotList)
}
```

##  Performance Evaluation

We evaluate all possible tool combinations to identify the most accurate identification strategy for the current dataset.

```{r evaluation-pipeline}
# 1. Prepare evaluation sets by annotating predictions with ground truth
evaluationSummary <- prepareEvaluationSets(
  codPotList = mockCodPotList,
  ncTest = ncTestSet,
  cdsTest = cdsTestSet
)

# 2. Evaluate performance for each tool combination (e.g., "CPC2+PLEK")
combSummary <- evaluateToolCombinations(summaryList = evaluationSummary)

# 3. Calculate summary metrics
combStats <- bestToolCombination(combinationSummaryList = combSummary)
print(combStats)
```

#  Performance Visualization with calculateCM

The `calculateCM` function serves as the central hub for the visualization pipeline. It converts raw predictions into Confusion Matrix objects and allows for threshold-based filtering.

```{r calculate-cm}
# Filter for methods with Accuracy >= 0.5 (low threshold for demo purposes)
# Use metricsData to provide values for filtering
allCms <- calculateCM(
  combinationSummaryList = combSummary,
  metricsData = combStats, 
  threshold = 0.5,
  returnOnlyHigh = TRUE,
  metricToExtract = "Accuracy",
  printMetricThresholds = TRUE
)
```

##  Radar and Clock Plots

Radar and Clock plots provide a holistic view of tool performance across multiple metrics.

```{r viz-performance, fig.width=10, fig.height=6}
# Comparison of top combinations (Radar Plot)
methodsToPlot <- names(allCms)[1:min(3, length(allCms))]
plotRadarMetrics(cmList = allCms, methods = methodsToPlot, plotTitle = "Top Methods Comparison")

# Detailed metrics view (Clock Plot) in 'multiple' layout
plotClockMetrics(cmList = allCms, plotTitle = "Performance Metrics Breakdown", layout = "multiple")
```

#  Functional Analysis and Interactions

The final stage of the pipeline identifies where and how the identified lncRNAs act in the cell.

##  Cis and Trans Interactions

We identify *cis*-interactions based on genomic proximity and *trans*-interactions based on expression correlation.

```{r interactions}
# Identify Cis-interactions (genomic proximity)
feelncFile <- tempfile()
write.table(data.frame(isBest=1, lncRNA_gene="LNC1", partnerRNA_gene="GENE_A", distance=500), 
            feelncFile, sep="\t", row.names=FALSE)
cisInteractions <- findCisInteractions(FEELncClassesFile = feelncFile, maxDist = 1000)

# Identify Trans-interactions via expression correlation
mockExpr <- matrix(rnorm(50), nrow = 5, 
                   dimnames = list(c("LNC1", "LNC2", "GENE_A", "GENE_B", "GENE_C"), paste0("S", 1:10)))
mockExpr["LNC1",] <- mockExpr["GENE_A",] + rnorm(10, 0, 0.1)

transInteractions <- findTransInteractions(exprMatrix = mockExpr, rval = 0.8, pval = 0.05)
```

##  Annotation and Sankey Diagram

We link interactions to functional enrichment terms and visualize the flow using an interactive Sankey diagram.

```{r sankey-viz}
# Sample enrichment result (mimicking gprofiler2::gost)
mockGost <- data.frame(term_id="GO:01", term_name="stress response", intersection="GENE_A")

# Annotate interactions with functional terms
interactionsProcessed <- annotateInteractions(gostResult = mockGost, 
                                               interactionTable = transInteractions, type = "trans")

# Generate interactive Sankey diagram
plotSankeyInteractions(interactionData = interactionsProcessed, groupBy = "term", 
                       selectIds = "stress response", title = "Functional Interaction Flow")
```

#  Session Information

```{r session-info}
sessionInfo()