---
title: "Import and representation of ProteinGym data"
author: 
    - name: Tram Nguyen
      affiliation: Department of Biomedical Informatics, Harvard Medical School
      email: Tram_Nguyen@hms.harvard.edu
    - name: Pascal Notin
      affiliation: Department of Systems Biology, Harvard Medical School
    - name: Aaron W Kollasch
      affiliation: Department of Systems Biology, Harvard Medical School
    - name: Debora Marks
      affiliation: Department of Systems Biology, Harvard Medical School
    - name: Ludwig Geistlinger
      affiliation: Department of Biomedical Informatics, Harvard Medical School
package: ProteinGymR
output:
    BiocStyle::html_document:
      self_contained: yes 
      toc: true
      toc_float: true
      toc_depth: 2
      code_folding: show
date: "June 18, 2025"
bibliography: references.bib
vignette: >
    %\VignetteIndexEntry{Data access and representation}
    %\VignetteEncoding{UTF-8}
    %\VignetteEngine{knitr::rmarkdown}
editor_options: 
    markdown: 
      wrap: 80
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL,
    message = FALSE
)
```

# Installation

Install the package using Bioconductor. Start R and enter:

```{r, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ProteinGymR")
```

# Setup

Now, load the package and dependencies used in the vignette.

```{r, message = FALSE}
library(ProteinGymR)
library(tidyr)
library(dplyr)
library(stringr)
library(ggplot2)
```

# Introduction

Predicting the effects of mutations in proteins is critical to many
applications, from understanding genetic disease to designing novel proteins to
address our most pressing challenges in climate, agriculture and healthcare.
Despite an increase in machine learning-based protein modeling methods,
assessing the effectiveness of these models is problematic due to the use of
distinct, often contrived, experimental datasets and variable performance across
different protein families.

[ProteinGym](https://proteingym.org/) is a large-scale and holistic set of
benchmarks specifically designed for protein fitness prediction and design
curated by @Notin2023.
It encompasses both a broad collection of over 250 standardized deep mutational
scanning (DMS) assays, spanning millions of mutated sequences, as well as
curated clinical datasets providing high-quality expert annotations about
mutation effects. Furthermore, ProteinGym reports the performance of a diverse
set of over 70 high-performing models from various subfields (e.g., mutation
effects, inverse folding) in a unified benchmark.

ProteinGym datasets are publicly available as a community resource both on
[Zenodo](https://doi.org/10.5281/zenodo.13932632) and the official [ProteinGym
website](https://proteingym.org/) under the MIT license.

# Available datasets

The `ProteinGymR` package provides the following analysis-ready datasets from
ProteinGym:

1.  DMS assay scores from 217 assays measuring the impact of all possible amino
    acid substitutions across 186 proteins. The dataset can be obtained using
    the `dms_substitutions()` function.

2.  AlphaMissense pathogenicity scores for \~1.6 M substitutions in the
    ProteinGym DMS data. The data is provided with `am_scores()`.

3.  Reference file containing metadata associated with the 217 DMS assays, such
    as taxon, protein sequence length, UniProt ID, etc.

4.  Five model performance metrics ("AUC", "MCC", "NDCG", "Spearman",
    "Top_recall") for 79 models across 217 assays calculated on DMS
    substitutions in a zero-shot setting. The data can be obtained with
    `zeroshot_DMS_metrics()`.

5.  Model scores on the DMS substitutions for 79 models in the zero-shot
    setting. Load with `zeroshot_substitutions()`.

6.  Two model performance metrics ("Spearman" and "MSE") for 12 models across
    217 assays (as of Bioc 3.21) calculated on DMS substitutions in a
    semi-supervised setting. Load this data with `supervised_metrics()`.

7.  Model scores on the DMS substitutions for 11 semi-supervised models with 3
    folding schemes: contiguous, modulo, and random. Load with
    `supervised_substitutions()`, selecting the folding scheme through the
    "fold_scheme" argument.

8.  PDB files for 197 protein structures, to be used in the `plot_structure()`
    3D visualization function.

# Data import

ProteinGym data can be obtained through
[ExperimentHub](https://bioconductor.org/packages/release/bioc/html/ExperimentHub.html).

## DMS data

Deep mutational scanning is an experimental technique that measures the
fitness effects of all possible single mutations in a protein [@Fowler2014].
For each position in a protein, the amino acid residue is mutated and the
fitness effects are recorded. While most mutations tend to be deleterious, some
can enhance protein activity. In addition to analyzing single mutations, this
method can also examine the effects of multiple mutations, yielding insights
into protein structure and function. Overall, DMS scores provide a detailed map
of how changes in a protein's sequence affect its function.

Datasets in `ProteinGymR` can be easily loaded with built-in functions.

```{r import dms}
dms_data <- dms_substitutions()
```

View the DMS study names for the first 6 assays.

```{r view studies}
head(names(dms_data))
```

View an example of one DMS assay.

```{r view assay}
head(dms_data[[1]])
```

For each DMS assay, the columns show the UniProt protein identifier, the DMS
experiment assay identifier, the amino acid substitution at a given sequence 
position, the mutated protein sequence, the recorded DMS score, and a binary 
DMS score bin categorizing whether the mutation has an effect on fitness 
(1) or not (0). For details, see `?dms_substitutions` and the reference 
publication from @Notin2023.
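
As a minimal sketch of how this list-of-tables layout can be summarized, the
chunk below builds a small toy list that mimics the structure described above
and works with it using base R only. The assay names, the column names
(`mutant`, `DMS_score`, `DMS_score_bin`), and all values are assumptions made
for illustration, and the hypothetical helper `parse_mutant()` is not part of
`ProteinGymR`; it also assumes single substitutions written as reference
residue, position, alternate residue (e.g., "A1P").

```{r toy dms summary}
# Toy stand-in for the list returned by dms_substitutions(); assay names,
# column names, and values are invented for illustration.
toy_dms <- list(
    assay_A = data.frame(
        mutant = c("A1P", "D2N", "K3R", "L4F"),
        DMS_score = c(-1.8, 0.2, 0.9, -0.4),
        DMS_score_bin = c(0, 1, 1, 0)
    ),
    assay_B = data.frame(
        mutant = c("M1V", "G2A", "S3T"),
        DMS_score = c(0.1, -2.3, 0.6),
        DMS_score_bin = c(1, 0, 1)
    )
)

# Number of variants and fraction of variants in bin 1 for each assay
n_variants <- vapply(toy_dms, nrow, integer(1))
frac_bin1 <- vapply(toy_dms, function(x) mean(x$DMS_score_bin == 1),
    numeric(1))
data.frame(n_variants, frac_bin1)

# Hypothetical helper: split a single substitution such as "A1P" into
# reference residue, sequence position, and alternate residue
parse_mutant <- function(mutant) {
    data.frame(
        ref = substr(mutant, 1, 1),
        pos = as.integer(substr(mutant, 2, nchar(mutant) - 1)),
        alt = substr(mutant, nchar(mutant), nchar(mutant))
    )
}
parse_mutant(toy_dms$assay_A$mutant)
```

The same `vapply()` pattern should carry over to the real `dms_data` list,
provided the column names match those shown by `head(dms_data[[1]])`.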

Here, we obtain the metadata table that provides additional information for the DMS experiments.

```{r queryEH}
eh <- ExperimentHub::ExperimentHub()
AnnotationHub::query(eh, "ProteinGymR")

dms_metadata <- eh[["EH9607"]]
names(dms_metadata)
```

There are 45 columns representing metadata for DMS assays. See @Notin2023 for details on the individual metadata columns.

# Model benchmarking

The function `benchmark_models()` can be used to compare
performance across several variant effect prediction models when using the 
DMS data as ground truth. This function takes one of the five
available metrics and compares the performance of up to 5 of the 
79 available models.

In the zero-shot setting, the effects of mutations on fitness are predicted 
without relying on ground-truth labels for the protein of interest. 
Robust zero-shot performance is particularly informative when labels are
subject to several biases or scarcely available (e.g., labels for rare genetic
pathologies).

Model performance was evaluated across 5 metrics:

1.  Spearman's rank correlation coefficient (default metric)
2.  Area Under the ROC Curve (AUC)
3.  Matthews Correlation Coefficient (MCC), most suitable for bimodal DMS
    measurements
4.  Normalized Discounted Cumulative Gains (NDCG)
5.  Top K Recall (top 10% of DMS values)

To account for the fact that certain protein functions are overrepresented 
in the list of proteins assayed with DMS (e.g., thermostability), these metrics 
were first calculated within groups of proteins with similar functions. The 
final value of the metric is then the average of these averages, giving each 
functional group equal weight. The final values are referred to as the 
‘corrected average’.
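
The two-step averaging described above can be illustrated with a small worked
example. The functional groups and Spearman values below are invented; the
chunk only shows the arithmetic of averaging within groups before averaging
across them, and is not the code used to produce the ProteinGym benchmark.

```{r toy corrected average}
# Toy per-assay Spearman values; functional groups and numbers are made up.
toy_metrics <- data.frame(
    assay = paste0("assay_", 1:6),
    function_group = c("Stability", "Stability", "Stability", "Stability",
        "Binding", "Activity"),
    spearman = c(0.50, 0.54, 0.46, 0.50, 0.30, 0.40)
)

# Naive average: every assay counts equally, so "Stability" dominates
naive_avg <- mean(toy_metrics$spearman)

# Average within each functional group first, then average the group means
group_means <- tapply(toy_metrics$spearman, toy_metrics$function_group, mean)
corrected_avg <- mean(group_means)

round(c(naive = naive_avg, corrected = corrected_avg), 3)
```

In this toy case the overrepresented group pulls the naive average upward,
while the group-wise average weights the three functional groups equally.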

Due to the often non-linear relationship between protein function and organism
fitness [@Boucher2016], the
Spearman’s rank correlation coefficient is typically an appropriate choice for 
evaluating model performance against experimental measurements. However, in 
situations where DMS measurements exhibit a bimodal profile, rank correlations 
may not be the optimal choice. Therefore, additional metrics are also provided, 
such as the Area Under the ROC Curve (AUC) and the Matthews Correlation 
Coefficient (MCC), which compare binarized model scores and experimental 
measurements. Furthermore, for certain goals (e.g., optimizing functional 
properties of designed proteins), it is more important that a model is able to 
correctly identify the most functional protein variants, rather than properly 
capture the overall distribution of all assayed variants. For such scenarios, 
it is beneficial to use the Normalized Discounted Cumulative Gains (NDCG), which 
prioritizes models that return high scores for sequences with high DMS values 
(corresponding to strong gain in fitness). Alternatively, the Top K Recall 
(with K being set to the top 10% of DMS values) can also be informative for such 
scenarios.
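
To make three of these metrics concrete, the chunk below computes a Spearman
correlation, a Top K Recall, and an MCC on ten invented variants using base R.
It is a simplified sketch: the scores are made up, the binarization cutoff of
0 for both experimental and model scores is an arbitrary assumption, and the
exact definitions used in the ProteinGym benchmark may differ in detail.

```{r toy metrics}
# Toy experimental and model scores for 10 variants (values are invented)
dms_score <- c(-2.0, -1.5, -1.0, -0.5, -0.2, 0.1, 0.4, 0.8, 1.2, 1.9)
model_score <- c(-3.0, -1.0, -2.0, -0.5, 0.3, -0.1, 0.2, 1.5, 0.9, 2.5)

# Spearman's rank correlation between model scores and DMS scores
rho <- cor(dms_score, model_score, method = "spearman")

# Top K Recall: share of the top 10% of variants by DMS score that are also
# among the top 10% of variants by model score
k <- ceiling(0.1 * length(dms_score))
true_top <- order(dms_score, decreasing = TRUE)[seq_len(k)]
pred_top <- order(model_score, decreasing = TRUE)[seq_len(k)]
top_recall <- length(intersect(true_top, pred_top)) / k

# MCC on binarized scores; the cutoff of 0 is an arbitrary choice for this toy
truth <- dms_score > 0
pred <- model_score > 0
tp <- as.numeric(sum(truth & pred))
tn <- as.numeric(sum(!truth & !pred))
fp <- as.numeric(sum(!truth & pred))
fn <- as.numeric(sum(truth & !pred))
mcc <- (tp * tn - fp * fn) /
    sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

round(c(spearman = rho, top_recall = top_recall, mcc = mcc), 3)
```

The model ranks the variants almost perfectly, yet its MCC is noticeably lower
because two variants near the cutoff fall on the wrong side after binarization,
which shows why the metrics can disagree on the same predictions.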

To view all available zero-shot models, use the `available_models()` function.

```{r, available_models}
available_models()
```

Plot the AUC metric for 5 models.

```{r, warning=FALSE, fig.wide = TRUE}
benchmark_models(metric = "AUC", 
    models = c("GEMME", "CARP_600K", "ESM1b", "VESPA", "ProtGPT2"))
```

Here, GEMME performed the best, achieving the highest AUC of the 5 selected 
models. If not specified by the user, Spearman correlation is used as the 
default metric. For more information about the models and metrics, see the
function documentation with `?benchmark_models`.

# Session Info

```{r, sesh info}
sessionInfo()
```

# References