--- title: "fRagmentomics: A Per-Fragment Analysis Workflow" author: - name: "Killian Maudet" affiliation: "Université Paris-Saclay, Gustave Roussy, Inserm UMR 1361 Cancer Data Science, IHU PRISM National PRecISion Medicine Center in Oncology, Villejuif, F-94805, France" email: "killian.maudet@gustaveroussy.fr" - name: "Yoann Pradat" affiliation: "Université Paris-Saclay, Gustave Roussy, Inserm UMR 1361 Cancer Data Science, IHU PRISM National PRecISion Medicine Center in Oncology, Villejuif, F-94805, France" - name: "Juliette Samaniego" affiliation: "Université Paris-Saclay, Gustave Roussy, Inserm UMR 1361 Cancer Data Science, IHU PRISM National PRecISion Medicine Center in Oncology, Villejuif, F-94805, France" - name: "Elsa Bernard" affiliation: "Université Paris-Saclay, Gustave Roussy, Inserm UMR 1361 Cancer Data Science, IHU PRISM National PRecISion Medicine Center in Oncology, Villejuif, F-94805, France" date: "`r Sys.Date()`" package: fRagmentomics output: BiocStyle::html_document: toc: true toc_float: true vignette: > %\VignetteIndexEntry{A Per-Fragment Analysis Workflow} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\VignettePackage{fRagmentomics} --- ```{r setup, include=FALSE} use_ragg <- requireNamespace("ragg", quietly = TRUE) knitr::opts_chunk$set( fig.width = 10, fig.height = 6, out.width = "100%", fig.align = "center", fig.retina = 2, dev = if (use_ragg) "ragg_png" else "png", dpi = 120 ) ``` # Introduction

Plasma circulating cell-free DNA (cfDNA) analysis has transformed cancer care. While recent work has shown that ctDNA fragments have distinct features (size, end motifs) compared to healthy cfDNA, few tools offer a standardized framework for an in-depth analysis that links these features to the specific mutational status of each DNA fragment. **fRagmentomics** addresses this need by providing a robust and flexible R package to integrate cfDNA fragment features with their mutational status. By standardizing the analysis of both SNVs and indels at the single-fragment level, fRagmentomics supports the interpretation of liquid biopsies to distinguish tumoral origin. This vignette provides a step-by-step walkthrough of the main analysis workflow using the `run_fRagmentomics()` function and demonstrates the package's core visualization capabilities thank to the visualisation functions: `plot_size_distribution()`, `plot_ggseqlogo_meme()`, `plot_freq_barplot()`, `plot_motif_barplot()`. # Installation ## Prerequisites fRagmentomics is built and tested under **R version 4.4.3**. ## System Dependencies fRagmentomics optionally makes use **`bcftools`** (tested with version 1.21) for variant normalization. If using `apply_bcftools_norm=TRUE` argument, please, ensure it is installed and accessible in your system's PATH. We recommend using a conda environment with bcftools installed but you can also use any other environment manager or rely on tools already installed in your system. ```sh # conda install -c bioconda mamba mamba install -c bioconda bcftools=1.21 ``` ## R Package Installation You may install the package from github or from conda. **1. Install from conda** ```r # To use bcftools normalisation (RECOMMENDED for indels) mamba env create -n fRagmentomics-env -c elsab-lab -c conda-forge -c bioconda r-fragmentomics bcftools=1.21 ``` **2. Install from github** When installing from GitHub, pre-installing heavy Bioconductor deps via Conda can avoid slow source compilation. ```bash Rscript -e 'if (!requireNamespace("remotes", quietly=TRUE)) install.packages("remotes", repos="https://mirror.ibcp.fr/pub/CRAN/"); remotes::install_github("ElsaB-Lab/fRagmentomics", build_vignettes=FALSE, upgrade="never")' ``` NOTE: If you hit compilation issues (e.g. with Rsamtools / BiocParallel), try one of the two approaches below. - **With Conda/Mamba** (preinstall Rsamtools) ```bash mamba env create -n fRagmentomics-env -c conda-forge -c bioconda bioconductor-rsamtools bioconductor-variantannotation bcftools=1.21 # To use bcftools normalisation (RECOMMENDED for indels) mamba activate fRagmentomics-env Rscript -e 'if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager", repos="https://mirror.ibcp.fr/pub/CRAN/"); if (!requireNamespace("remotes", quietly=TRUE)) install.packages("remotes", repos="https://mirror.ibcp.fr/pub/CRAN/"); remotes::install_github("ElsaB-Lab/fRagmentomics", build_vignettes=FALSE, upgrade="never")' ``` - **Without Conda/Mamba** (use BiocManager to get Bioconductor deps) ```bash Rscript -e 'if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager", repos="https://mirror.ibcp.fr/pub/CRAN/"); if (!requireNamespace("remotes", quietly=TRUE)) install.packages("remotes", repos="https://mirror.ibcp.fr/pub/CRAN/"); BiocManager::install(c("Rsamtools","VariantAnnotation","GenomicAlignments","BiocParallel"), ask=FALSE, update=FALSE); remotes::install_github("ElsaB-Lab/fRagmentomics", build_vignettes=FALSE, upgrade="never")' ``` After these steps are complete, you can load the package into your R session with `library(fRagmentomics)`. # A Complete Analysis Workflow To make it user-friendly, the entire analysis is performed by a single function: `run_fRagmentomics()`. We will use the example data included with the package to demonstrate a typical workflow. For a deeper dive into the package's methodology, please refer to the `README.md` file on the project's [GitHub repository](https://github.com/ElsaB-Lab/fRagmentomics). The README provides detailed explanations of: - The normalization for converting indel representations into a standardized VCF format. - The context-aware algorithm for determining read status. - The formulas used for precise fragment size calculation. - The logic for classifying fragments into different mutational groups. ## 1. Loading the Package and Example Data We begin by loading the fRagmentomics library. To facilitate reproducibility, the package ships with a fully anonymized test dataset comprising a small BAM file, its corresponding FASTA reference, and a mutation file. This test sample originates from a **lung cancer** case harboring an **EGFR exon 19 15 bp in-frame deletion** — a well-characterized oncogenic driver associated with sensitivity to tyrosine kinase inhibitors. The scripts used to generate, subset, and anonymize the test BAM and FASTA files are available in the package directory: `inst/scripts/` ```{r Loading the Package and Example Data, message=TRUE, warning=TRUE} suppressPackageStartupMessages({ library(fRagmentomics) }) # Locate the example files bundled with the package mut_file <- system.file( "extdata/mutation", "cfdna-egfr-del_chr7_55241864_55243064_10k.mutations.tsv", package = "fRagmentomics" ) bam_file <- system.file( "extdata/bam", "cfdna-egfr-del_chr7_55241864_55243064_10k.bam", package = "fRagmentomics" ) fasta_file <- system.file( "extdata/fasta", "hg19_chr7_55231864_55253064.fa", package = "fRagmentomics" ) # Print the file paths to confirm they were found mut_file bam_file fasta_file # Detect whether bcftools is available on the system PATH has_bcftools <- nzchar(Sys.which("bcftools")) if (!has_bcftools) { message( "bcftools not found on PATH; normalization examples will run without it." ) } ``` To handle 0-based mutation coordinates, set the `one_based = FALSE` parameter. This will ensure the positions are correctly converted. Furthermore, `fRagmentomics` automatically standardizes indel representations to resolve ambiguities, as the same variant can often be described in multiple ways (e.g., left-aligned vs. right-aligned). The following table summarizes the accepted formats: **Simple Format** | To describe a... | `REF` Column | `ALT` Column | `POS` Column | |:-------------------------|:----------------------------------|:----------------------------------|:-------------------------------------------| | Deletion of **"AT"** | `AT` | `""`, `-`, `.`, `_`, `NA` | Position of the first deleted base (`A`) | | Insertion of **"CT"** | `""`, `-`, `.`, `_`, `NA` | `CT` | Position of the base *before* the insertion | **VCF-Style Padded Format** | To describe a... | `REF` Column | `ALT` Column | `POS` Column | |:-----------------------------------|:-------------|:-------------|:--------------------------------------------| | Deletion of **"AT"** from "G**AT**" | `GAT` | `G` | Position of the anchor base (`G`) | | Insertion of **"CT**" after "A" | `A` | `ACT` | Position of the anchor base (`A`) | ## 2. Running the Analysis ### Basic Workflow First, we will run `run_fRagmentomics()` with the default arguments. We provide the paths to our three input files, a `sample_id` for annotation, and specify the number of cores for parallel processing. When `apply_bcftools_norm = TRUE`, the pipeline calls `bcftools norm` to left-align and normalize indels, removing representation/positional ambiguity. To use this option you must have `bcftools` installed and discoverable on your system (see the *Installation* section of the README). If `bcftools` is unavailable, you may set `apply_bcftools_norm = FALSE`, but indel positions will not be left-aligned or normalized, which may lead to discrepancies across callers and reduce comparability. In this vignette, we detect bcftools availability via `has_bcftools` and toggle normalization accordingly so the examples run on systems without it. ```{r run_analysis_basic, message=TRUE, warning=TRUE} if (!has_bcftools) { message("bcftools not detected; running without external normalization.") } # Run the full analysis pipeline with default settings df_frag_basic <- run_fRagmentomics( mut = mut_file, bam = bam_file, fasta = fasta_file, sample_id = "cfdna-egfr-del", apply_bcftools_norm = has_bcftools, verbose = FALSE, n_cores = 1 ) # View the dimensions and the first few rows cat("Dimensions of basic results:", dim(df_frag_basic), "\n") head(df_frag_basic) ``` ### Advanced Workflow: Customizing the Analysis The `run_fRagmentomics()` function can be customized for more specific use cases. In the example below, we will perform a more stringent analysis and request additional information in the output table. Specifically, we will: - **Narrow the search window** to 1000 bp around the variant using `neg_offset_mate_search` and `pos_offset_mate_search`. - **Apply stricter BAM filters** by requiring reads to be in a "proper pair" using `flag_bam_list`. - **Request additional output columns** for bam information (`report_bam_info`), soft-clip counts (`report_softclip`), and the 3-base motifs at the fragment ends (`report_5p_3p_bases_fragment`). - **Keep fragments that do not pass quality control using the parameter retain_fail_qc**: e.g., reads with incorrect orientation, mapped to different chromosomes, unmapped mates. ```{r run_analysis_advanced, message=TRUE, warning=TRUE} if (!has_bcftools) { message( "bcftools not detected; running advanced workflow without external normalization." ) } # Run the analysis with custom parameters df_frag_advanced <- run_fRagmentomics( mut = mut_file, bam = bam_file, fasta = fasta_file, sample_id = "cfdna-egfr-del", neg_offset_mate_search = -500, pos_offset_mate_search = 500, flag_bam_list = list( isProperPair = TRUE, isUnmappedQuery = FALSE, hasUnmappedMate = FALSE, isSecondaryAlignment = FALSE ), report_bam_info = TRUE, report_softclip = TRUE, retain_fail_qc = TRUE, report_5p_3p_bases_fragment = 3, apply_bcftools_norm = has_bcftools, verbose = FALSE, n_cores = 1 ) # Observe that stricter filtering may result in fewer fragments cat("Dimensions of advanced results:", dim(df_frag_advanced), "\n") # View the newly added columns head(df_frag_advanced[, c( "Fragment_Id", "TLEN", "Nb_Fragment_Bases_Softclip_5p", "Fragment_Bases_5p" )]) ``` ## 3. Exploring the Output **Note**: The following outputs and visualizations are generated from a small, simulated test dataset. They are intended to demonstrate the package's functionality and are not representative of real biological results. The `run_fRagmentomics()` function returns a `DataFrame` where each row represents the analysis of a single fragment. Let's explore the key columns to understand the results. ### A First Look at the Results Table ```{r dataframe_output} # Dataframe output df_frag_basic ``` ### Understanding the Fragment Status The most important columns for interpreting the results are `Fragment_Status_Simple` and `Fragment_Status_Detail`. The **`Fragment_Status_Simple`** column provides a high-level classification for each fragment. The table below explains each possible label: | `Fragment_Status_Simple` | Represents... | Typical Read Evidence (`Read_5p_Status` / `Read_3p_Status`) | |:-------------------------|:---------------------------------------------------------------------------|:-------------------------------------------------------------------| | **`MUT`** | The fragment confidently supports the **mutant allele**. | `MUT`/`MUT`, `MUT`/`AMB` or `MUT`/`NA`| | **`WT`** | The fragment confidently supports an allele that is **not the target**. | `WT`/`WT`, `WT`/`AMB` or `WT`/`NA`| | **`OTH`** | The two reads give **conflicting** high-confidence information. | `OTH`/`OTH`, `OTH`/`AMB` or `OTH`/`NA`| | **`N/I`** | The fragment is non-informative: either **discordant** or **ambiguous**. | `MUT`/`WT`, `MUT`/`OTH`, `WT`/`OTH`, `AMB`/`AMB` or `AMB`/`NA` | You can quickly summarize the results for a given mutation by counting the occurrences of each status: ```{r count_status_simple} table(df_frag_advanced$Fragment_Status_Simple) ``` The **`Fragment_Status_Detail`** column concatenates the statuses of the two reads (e.g., `"WT & AMB"`). ```{r count_status_detail} table(df_frag_advanced$Fragment_Status_Detail) ``` ### Exploring Other Key Features You can also quickly summarize other key fragmentomic features like the calculated **fragment size**: ```{r summary_size} summary(df_frag_advanced$Fragment_Size) ``` And you can easily extract the final **VAF** calculated for each mutation: ```{r extract_vaf} unique(df_frag_advanced$VAF) ``` For a complete description of every column in the output, please refer to the `README.md` file on the package's [GitHub repository](https://github.com/ElsaB-Lab/fRagmentomics). # Visualizing Fragmentomic Features The `df_frag_advanced` object, produced by `run_fRagmentomics()`, can be directly used with the package's suite of plotting functions to visualize the results. ## Fragment Size Distribution One of the most common analyses is to compare the size distribution of fragments that carry the mutation (`MUT`) against those that do not (`WT`). The `plot_size_distribution()` function is designed for this purpose. ### **1. Comparing Groups with a Histogram** Let's start by creating a histogram to compare our two main groups. We'll set `show_histogram = TRUE` and `show_density = FALSE`. Using `histo_args` to set an `alpha` value adds transparency, which is helpful when bars overlap. ```{r plot_size_histogram} plot_size_distribution( df_fragments = df_frag_advanced, vals_z = c("MUT", "WT"), show_histogram = TRUE, show_density = FALSE, x_limits = c(100, 420), histo_args = list(alpha = 0.5) ) ``` ### **2. Using Density Curves** Alternatively, we can represent the same data using smooth density curves, which can make it easier to see the shape and peaks of the distributions. This is the default behavior (`show_density = TRUE`). Here, we'll also show all available groups in the data by leaving `vals_z` as the default `NULL`. ```{r plot_size_density} plot_size_distribution( df_fragments = df_frag_advanced, show_histogram = FALSE, show_density = TRUE, x_limits = c(100, 420) ) ``` ### **3. Visualizing the Overall Distribution** Finally, to see the size distribution of all fragments combined into a single group, we can set `col_z = NULL`. This example also demonstrates how to overlay a histogram and a density curve in the same plot. ```{r plot_size_ungrouped} plot_size_distribution( df_fragments = df_frag_advanced, col_z = NULL, show_histogram = TRUE, show_density = TRUE, x_limits = c(100, 420), histo_args = list(alpha = 0.5) ) ``` ## Detailed 3-Base End Motif Proportions For a deeper dive into the 3-base motifs at fragment ends, the package provides `plot_motif_barplot()`. This function has three visualization modes, controlled by the `representation` argument. ### 1. Hierarchical View (`representation = "split_by_base"`) This is the default view. It creates a detailed plot with nested facets for each base position, providing a complete landscape of all 64 possible motif proportions. ```{r plot_motif_hierarchical} plot_motif_barplot( df_fragments = df_frag_basic, representation = "split_by_base", vals_z = c("MUT", "WT") ) ``` We can also customize this plot, for instance by filtering to show only motifs that start with 'C' or 'G' using `motif_start`, and passing additional arguments (like `alpha` for transparency) to the underlying `geom_bar()` via the `...` parameter. ```{r plot_motif_hierarchical_custom} plot_motif_barplot( df_fragments = df_frag_basic, representation = "split_by_base", motif_start = c("C", "G"), vals_z = c("MUT", "WT"), alpha = 0.7 # Pass 'alpha' to geom_bar() ) ``` ### 2. Differential Analysis (`representation = "differential"`) This mode is designed to directly compare **exactly two groups**. It calculates the log2 Fold Change (log2FC) of the proportion of each motif between the groups, making it easy to spot motifs that are enriched or depleted in one group relative to the other. ```{r plot_motif_differential} plot_motif_barplot( df_fragments = df_frag_advanced, representation = "differential", vals_z = c("MUT", "WT") ) ``` ### 3. Side-by-side Comparison (`representation = "split_by_motif"`) This mode creates a more traditional bar chart. Each 3-base motif is placed on the x-axis, with side-by-side bars showing the proportion for each group. This is useful for directly comparing the frequency of specific, known motifs. ```{r plot_motif_side_by_side} plot_motif_barplot( df_fragments = df_frag_basic, representation = "split_by_motif", vals_z = c("MUT", "WT") ) ``` ### End Motif Sequence Logos To visualize the nucleotide frequency at each position of the fragment ends, the package provides `plot_ggseqlogo_meme()`. This function uses the popular `ggseqlogo` package to create sequence logo plots, which are excellent for identifying conserved patterns or sequence preferences. #### 1. Comparing Motifs Between Groups A common use case is to compare the end-motif patterns of different fragment groups. Here, we'll look at the first 6 bases (`motif_size = 6`) of the 5' end (`motif_type = "Start"`) for `MUT` versus `WT` fragments. ```{r plot_logo_comparison} plot_ggseqlogo_meme( df_fragments = df_frag_basic, motif_type = "Start", motif_size = 6, vals_z = c("MUT", "WT"), colors_z = c("#F6BD60", "#84A59D", "#FD96A9", "#083D77") ) ``` #### 2. Visualizing Both Ends of a Single Group You can also analyze a single group in more detail. By setting `motif_type = "Both"`, the function will display the 5'-end motif and the 3'-end motif side-by-side. Let's look at the 4-base motifs for just the `MUT` fragments. ```{r plot_logo_single_group} plot_ggseqlogo_meme( df_fragments = df_frag_basic, motif_type = "Both", motif_size = 4, vals_z = "MUT", colors_z = c("#F6BD60", "#84A59D", "#FD96A9", "#083D77") ) ``` ## Overall Nucleotide Frequency The `plot_freq_barplot()` function provides a summary of the nucleotide composition at fragment ends. Instead of looking at specific motifs, it calculates the overall proportion of each of the four bases (A, C, G, T) across the motif length. This is useful for spotting broad compositional biases between groups and includes a global Chi-squared test for statistical significance. In this example, we will analyze the overall nucleotide frequency in the first 5 bases (`motif_size = 5`) of the 5' end (`motif_type = "Start"`) of the fragments. We will also pass the `alpha` argument via `...` to make the bars transparent. ```{r plot_freq_barplot} plot_freq_barplot( df_fragments = df_frag_basic, motif_size = 5, motif_type = "Start", alpha = 0.7 ) ``` # Session Information This section provides details about the R session and package versions used to generate this vignette, ensuring reproducibility. ```{r session_info} sessionInfo() ```