---
title: "RNAshapeQC: Quick Start Tutorial"
author: "Miyeon Yeon, Won-Young Choi, and Hyo Young Choi"
date: "`r Sys.Date()`"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{RNAshapeQC: Quick Start Tutorial}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse=TRUE, comment="#>", fig.align="center")

suppressPackageStartupMessages({
    library(BiocStyle)
    library(knitr)
    library(RNAshapeQC)
    library(SummarizedExperiment)
    library(ConsensusClusterPlus)
})

## Test R functions
# devtools::document()
# devtools::load_all() # load development version
```


# Introduction

_RNAshapeQC_ provides protocol-specific, **RNA coverage-shape-based quality control (QC) metrics**. This package operates at the gene-level coverage shape rather than isoform-level quantification. Effective transcript length may vary across samples due to alternative splicing. However, we define gene coordinates using the union of exon regions across annotated transcripts to ensure consistent comparability of coverage profiles across samples.

The package is designed to:

- Construct per-gene base-level coverage pileups from BAM files;
- Internally derive exon-only coverage from pileups for downstream QC metrics;
- Compute shape-based QC metrics using:
  - decay rate and gene weight for mRNA-seq,
  - mean coverage depth and window coefficient of variation for total RNA-seq;
- Derive sample-level quality indices:
  - degraded/intact index,
  - suboptimal/optimal index.


## Overview of the workflow

_RNAshapeQC_ provides two complementary workflows: one for mRNA-seq and one for total RNA-seq.

### mRNA-seq

1. **Decay rate (DR)**[^1]: extent of degradation, measured by the mean-corrected slope of log-transformed RNA-seq coverage (gene-by-sample level)
2. **Degradation score (DS)**: weighted average of detectable degradation signals (sample-level)
3. **Projection depth (PD)**: robust distance of each sample from the center of the DS distribution (sample-level)
4. **Degraded/Intact index (DII)**: binary classification of samples as _Degraded_ or _Intact_ (sample-level)

### Total RNA-seq

1. **Mean coverage depth (MCD)**: measure of coverage magnitude (gene-by-sample level)
2. **Window coefficient of variation (wCV)**: measure of coverage nonuniformity (gene-by-sample level)
3. **Area under the curve (AUC)**: smoothed relationship between MCD and wCV (sample-level)
4. **Projection depth (PD)**: robust distance computed using the AUC distribution (sample-level)
5. **Suboptimal/Optimal index (SOI)**: binary classification of samples as _Suboptimal_ or _Optimal_ (sample-level)

### Recommended analysis pipeline

- mRNA-seq:  
  BAM → (BAM slicing) → pileup → DR → DS → PD → DII → downstream applications

- Total RNA-seq:  
  BAM → (BAM slicing) → pileup → MCD & wCV → AUC → PD → SOI → downstream applications

In this vignette, we illustrate a **minimal, fully runnable workflow** using small synthetic toy datasets included in the package. These toy objects mimic typical patterns of RNA degradation and coverage nonuniformity but do not contain any real cohort data.

For detailed method descriptions and extended examples, please refer to the **external user manual** ([_RNAshapeQCdocs_](https://miyeonyeon.github.io/bioc-vignettes/RNAshapeQCdocs_book/)).


# Installation

```{r LoadPkg, eval=FALSE}
## Installation and loading
if (!requireNamespace("BiocManager", quietly=TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("RNAshapeQC")

library(RNAshapeQC)
```


# Data Processing {#DataPrep}

## Base-level coverage

_RNAshapeQC_ provides the function `construct_pileup()`, to generate per-gene pileups as a single object from BAMs, optionally across multiple studies. For a single study, `construct_pileup()` saves `pileup`, `regions`, and `geneRanges`. For multiple studies, it saves `pileupList`, `regions`, and `geneRanges`.

For example, one might run:

```{r construct_pileup, eval=FALSE}
construct_pileup(
    Gene          = "Gene001",
    studylist     = "TOY",   
    regionsFile   = "/path/to/hg38.gene.regions.txt",
    regionsFormat = "gencode.regions",
    regionsCol    = 2,
    bamFilesList  = list(TOY=toy_bam_paths),
    caseIDList    = list(TOY=toy_case_ids),
    outFile       = "pergene/TOY_Gene001_pileup.RData"  
)
```


## Processed matrices

Once pileups are available, `gen_DR()`, `gen_MCD()`, and `gen_wCV()` functions help to generate processed matrices (e.g., DR for mRNA-seq or MCD & wCV for total RNA-seq). In HPC environment, each row vector of the processed matrix is saved by submitting jobs:

```{bash, eval=FALSE}
sbatch /inst/bash/sbatch_pergene.sh \  # script to submit a higher-level job
  genelist.txt \                       # list of gene symbol (rownames of a processed matrix)
  TOY \                                # study name
  Calculation.R \                      # R code including `gen_*` function
  /inst/bash/submit_pergene.sh         # script to submit per-gene jobs to Slurm
```

For large cohorts or high-throughput analyses on HPC systems, additional execution considerations are discussed in the following note.


<details>

<summary>**HPC and Slurm execution strategies**</summary>

Full HPC/Slurm workflows (e.g., per-gene batch submission) are highly dataset- and cluster-specific. Therefore, _RNAshapeQC_ only provides simple example bash scripts under `inst/bash/`, and users can write their own R wrappers following the same pattern as shown in this vignette.

_RNAshapeQC_ distinguishes between two types of functions for computing QC metrics:

- `get_*` functions (e.g., `get_DR()`, `get_MCD()`, `get_wCV()`)  
  Designed to compute a **gene × sample matrix** for a given gene list in a single study. They use multicore parallelization within a single R session (via parallel / doParallel). In practice, they are typically run on a single HPC node for an entire gene list. For very large lists, users may split the gene list into chunks, run multiple `get_*` calls in separate jobs, and combine the results.

- `gen_*` functions (e.g., `gen_DR()`, `gen_MCD()`, `gen_wCV()`)  
  Designed for **per-gene jobs**, suitable for HPC/Slurm submission. 
  They operate on a single gene at a time and can be applied to: 
  - single-study pileup files 
  - multi-study `pileupList` objects (e.g., TCGA-LUAD + TCGA-HNSC)
  
  Typical usage is one gene per job in a per-gene Slurm array, followed by combining outputs using helper functions (e.g., `combine_vecObj()`).
  
</details>


# Data Analysis

Users who do not yet have processed matrices can generate them by following the workflow described in [Section 3](#DataPrep). This section demonstrates the QC analysis using the toy datasets included in the package.

## Toy datasets

Two primary toy objects are shipped with the package:

- `TOY_mrna_se`: synthetic mRNA-seq-like data (`DR` and `TPM` are stored as assays, and `gene_length` is stored in `rowData()`.)
- `TOY_total_se`: synthetic total RNA-seq-like data (`MCD` and `wCV` are stored as assays.)

_RNAshapeQC_ functions support direct input of `SummarizedExperiment` objects, which is the recommended Bioconductor container format. Optionally, examples in [Section 4.2](#mRNAex) and [Section 4.3](#totalRNAex) also demonstrate usage with matrix inputs (`TOY_mrna_mat` and `TOY_total_mat`). The resulting `SummarizedExperiment` objects from _RNAshapeQC_ outputs can be directly integrated into standard Bioconductor workflows as examples in [Section 5](#interoper).

```{r LoadData}
data("TOY_mrna_se")
data("TOY_total_se")
```

We can quickly inspect the structure as follows:

```{r Str}
TOY_mrna_se
TOY_total_se
```


## mRNA-seq example {#mRNAex}

For mRNA-seq data, _RNAshapeQC_ uses the PD of DS as a sample-level degradation metric.

We can derive the DII using gene weights via `get_DIIwt()` with the toy objects:

```{r get_DIIwt}
## Using SE input directly
res_dii_se <- get_DIIwt(TOY_mrna_se)

names(res_dii_se)
head(res_dii_se$ds.vec)

## (Optional) Using matrix/vector input
data(TOY_mrna_mat)

res_dii_mat <- get_DIIwt(
    DR         = TOY_mrna_mat$DR, 
    TPM        = TOY_mrna_mat$TPM, 
    genelength = TOY_mrna_mat$gene_length
)

## Compare SE and matrix inputs
all.equal(res_dii_se$DR2, res_dii_mat$DR2)
all.equal(res_dii_se$ds.vec, res_dii_mat$ds.vec)
all.equal(res_dii_se$gene.df, res_dii_mat$gene.df)
all.equal(res_dii_se$s, res_dii_mat$s)
```

When a `SummarizedExperiment` object is supplied, `get_DIIwt()` extracts the required assays and row-level metadata internally. As shown above, results obtained from `SummarizedExperiment` input and matrix/vector input are identical.

The returned list typically includes:

- `DR2`: matrix of decay rates with filtered genes
- `ds.vec`: data frame containing
  - `Sample`: sample ID
  - `DS`: degradation score
  - `PD`: projection depth of DS
  - `DII`: _Degraded_ or _Intact_ label based on a cutoff
- `gene.df`: data frame with gene information used to compute gene weights
- `s`: scale factor used to adjust gene-length and TPM medians

We can plot the distributions of DS and PD, as well as the heatmap of DR with gene weights:

```{r plot_DIIwt, message=FALSE, echo=TRUE}
## Path to a temporary file
outfile <- file.path(tempdir(), "Fig_DIIwt.png")

plot_DIIwt(
    DR        = res_dii_se$DR2,
    DIIresult = res_dii_se,
    outFile   = outfile
)

## Saving DIIwt plot to: /.../Fig_DIIwt.png
```

The function prints the file path of the saved figure in the console, and we can open the file directly from that location.

```{r Fig_DIIwt, out.width="100%",out.height="100%", fig.align="center", echo=FALSE}
knitr::include_graphics(outfile)
```


## Total RNA-seq example {#totalRNAex}

For total RNA-seq data, _RNAshapeQC_ focuses on the relationship between MCD and wCV. 

We can combine these metrics to measure coverage-shape optimality using `get_SOI()`:

```{r get_SOI}
## Using SE input directly
## The cutoff value can be chosen by inspecting the PD distribution.
res_soi_se <- get_SOI(TOY_total_se, cutoff=1)

names(res_soi_se)
head(res_soi_se$auc.vec)

## (Optional) Using matrix input
data(TOY_total_mat)

res_soi_mat <- get_SOI(
    MCD    = TOY_total_mat$MCD,
    wCV    = TOY_total_mat$wCV,
    cutoff = 1
)

## Compare SE and matrix inputs
all.equal(res_soi_se$auc.vec, res_soi_mat$auc.vec)
all.equal(res_soi_se$auc.coord, res_soi_mat$auc.coord)
all.equal(res_soi_se$rangeMCD, res_soi_mat$rangeMCD)
```

The results obtained from `SummarizedExperiment` input and matrix input are identical, and it confirms consistent behavior of `get_SOI()` across supported input types.

Here, `auc.vec` contains:

- `Sample`: sample ID
- `AUC`: smoothed area under the curve in the MCD-wCV space
- `PD`: projection depth of the AUC
- `SOI`: _Suboptimal_ or _Optimal_ label based on a cutoff

This provides a simple way to flag suboptimal total RNA-seq samples based on their coverage shape.

We can visualize the smoothed relationship using `plot_SOI()`:

```{r plot_SOI, message=FALSE, echo=TRUE}
outfile <- file.path(tempdir(), "Fig_SOI.png")

## The cutoff value can be chosen by inspecting the PD distribution.
plot_SOI(
    SOIresult = res_soi_se,
    cutoff    = 1,
    outFile   = outfile 
)

## Saving SOI plot to: /.../Fig_SOI.png
```

This figure illustrates how wCV behaves as a function of MCD, and which samples are classified as suboptimal versus optimal.

```{r Fig_SOI, out.width="100%",out.height="100%", fig.align="center", echo=FALSE}
knitr::include_graphics(outfile)
```


# Interoperability with Bioconductor {#interoper}

_RNAshapeQC_ outputs can be organized using the [_SummarizedExperiment_](https://bioconductor.org/packages/devel/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html) class, which ensures consistent handling of assay and associated meta-data during subsetting. In the container, matrices of DR, TPM, MCD, and wCV can be stored as `assays()`, where rows denote genes and columns represent samples. The gene or sample information tables (annotations) are stored using the `rowData()` or `colData()` functions, respectively. `metadata()` can contain additional QC-related information such as scale factors, MCD ranges, and cutoff values. 

This is an example of constructing `SummarizedExperiment` objects and applying sample-level QCs to filter out low-quality samples before downstream analysis. This coordinated subsetting operation is simultaneously applied to both assays and column data, allowing proper filtering and preventing mismatches between QC metrics and downstream analysis inputs, as shown in `se_mrna_hq` and `se_total_hq`.

```{r se_mrna, message=FALSE, echo=TRUE}
library(SummarizedExperiment)

## Store QC results in mRNA-seq
se_mrna <- SummarizedExperiment(
    assays   = list(DR=assay(TOY_mrna_se, "DR"), TPM=assay(TOY_mrna_se, "TPM")),
    colData  = DataFrame(res_dii_se$ds.vec),
    metadata = list(s=res_dii_se$s)
)
se_mrna

## Subsetting to keep high-quality samples in mRNA-seq
se_mrna_hq <- se_mrna[, se_mrna$DII == "Intact"]
head(assay(se_mrna_hq, "TPM"))
```

This QC filtering can be applied prior to class discovery to reduce clustering instability caused by degraded samples.

```{r CC, message=FALSE, echo=TRUE}
library(ConsensusClusterPlus)

## TPM log transform
tpm_hq  <- log2(assay(se_mrna_hq, "TPM") + 1)

## Set a random seed for reproducibility
set.seed(1)

cc <- ConsensusClusterPlus(
    tpm_hq,
    maxK       = 6,
    reps       = 100,
    pItem      = 0.8,
    pFeature   = 0.8,
    clusterAlg = "hc",
    distance   = "pearson",
    plot       = "png",
    title      = tempdir()
)
cc[[3]]$consensusClass
```

The protocol-specific QC metrics stored in `colData()` can be included as covariates in a differential expression analysis ([_DESeq2_](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html)) (e.g., `design = ~ batch + condition + SOI`).

```{r se_total, message=FALSE, echo=TRUE}
## Store QC results in total RNA-seq
se_total <- SummarizedExperiment(
    assays   = list(MCD=assay(TOY_total_se, "MCD"), wCV=assay(TOY_total_se, "wCV")),
    rowData  = DataFrame(Gene=rownames(assay(TOY_total_se, "MCD"))),
    colData  = DataFrame(res_soi_se$auc.vec),
    metadata = list(auc.coord=res_soi_se$auc.coord, rangeMCD=res_soi_se$rangeMCD)
)
se_total

## Subsetting to keep high-quality samples in total RNA-seq
se_total_hq <- se_total[, se_total$SOI == "Optimal"]
head(assay(se_total_hq, "wCV"))
```


# Summary

This vignette demonstrated a minimal, runnable example of how to:

- understand how `construct_pileup()` fits into a real workflow,
- load toy `SummarizedExperiment` datasets (`TOY_mrna_se` and `TOY_total_se`),
- use `get_DIIwt()` to derive a degraded/intact index from DR,
- use `get_SOI()` to derive a suboptimal/optimal index from MCD and wCV.

The examples here are intentionally small and synthetic so that the vignette builds quickly and reproducibly. More detailed workflows, methodological explanations, and real-data case studies are documented separately in the full-length user manual.


# Session Information

```{r session}
sessionInfo()
```


[^1]: Choi, H.Y., Jo, H., Zhao, X. et al. SCISSOR: a framework for identifying structural changes in RNA transcripts. _Nat Commun_ 12, 286 (2021). https://doi.org/10.1038/s41467-020-20593-3