---
title: "Import and representation of ProteinGym data"
author:
  - name: Tram Nguyen
    affiliation: Department of Biomedical Informatics, Harvard Medical School
    email: Tram_Nguyen@hms.harvard.edu
  - name: Pascal Notin
    affiliation: Department of Systems Biology, Harvard Medical School
  - name: Aaron W Kollasch
    affiliation: Department of Systems Biology, Harvard Medical School
  - name: Debora Marks
    affiliation: Department of Systems Biology, Harvard Medical School
  - name: Ludwig Geistlinger
    affiliation: Department of Biomedical Informatics, Harvard Medical School
package: ProteinGymR
output:
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
date: "June 18, 2025"
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{Data access and representation}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  markdown:
    wrap: 80
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL,
    message = FALSE
)
```

# Installation

Install the package using Bioconductor. Start R and enter:

```{r, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ProteinGymR")
```

# Setup

Now, load the package and dependencies used in the vignette.

```{r, message = FALSE}
library(ProteinGymR)
library(tidyr)
library(dplyr)
library(stringr)
library(ggplot2)
```

# Introduction

Predicting the effects of mutations in proteins is critical to many
applications, from understanding genetic disease to designing novel proteins to
address our most pressing challenges in climate, agriculture and healthcare.
Despite an increase in machine learning-based protein modeling methods,
assessing the effectiveness of these models is problematic due to the use of
distinct, often contrived, experimental datasets and variable performance across
different protein families.
[ProteinGym](https://proteingym.org/) is a large-scale and holistic set of
benchmarks specifically designed for protein fitness prediction and design,
curated by @Notin2023. It encompasses both a broad collection of over 250
standardized deep mutational scanning (DMS) assays, spanning millions of mutated
sequences, and curated clinical datasets providing high-quality expert
annotations about mutation effects. Furthermore, ProteinGym reports the
performance of a diverse set of over 70 high-performing models from various
subfields (e.g., mutation effects, inverse folding) in a unified benchmark.
ProteinGym datasets are publicly available as a community resource both on
[Zenodo](https://doi.org/10.5281/zenodo.13932632) and the official
[ProteinGym website](https://proteingym.org/) under the MIT license.

# Available datasets

The `ProteinGymR` package provides the following analysis-ready datasets from
ProteinGym:

1.  DMS assay scores from 217 assays measuring the impact of all possible amino
    acid substitutions across 186 proteins. The dataset can be obtained using
    the `dms_substitutions()` function.

2.  AlphaMissense pathogenicity scores for \~1.6 M substitutions in the
    ProteinGym DMS data. The data is provided with `am_scores()`.

3.  Reference file containing metadata associated with the 217 DMS assays, such
    as taxon, protein sequence length, UniProt ID, etc.

4.  Five model performance metrics ("AUC", "MCC", "NDCG", "Spearman",
    "Top_recall") for 79 models across 217 assays calculated on DMS
    substitutions in a zero-shot setting. The data can be obtained with
    `zeroshot_DMS_metrics()`.

5.  Model scores on the DMS substitutions for 79 models in the zero-shot
    setting. Load with `zeroshot_substitutions()`.

6.  Two model performance metrics ("Spearman" and "MSE") for 12 models across
    217 assays (as of Bioc 3.21) calculated on DMS substitutions in a
    semi-supervised setting. Load this data with `supervised_metrics()`.

7.
    Model scores on the DMS substitutions for 11 semi-supervised models with 3
    folding schemes: contiguous, modulo, and random. Load with
    `supervised_substitutions()`, selecting the folding scheme through the
    "fold_scheme" argument.

8.  PDB files for 197 protein structures, to be used in the `plot_structure()`
    3D visualization function.

# Data import

ProteinGym data can be obtained through
[ExperimentHub](https://bioconductor.org/packages/release/bioc/html/ExperimentHub.html).

## DMS data

Deep mutational scanning (DMS) is an experimental technique that measures the
fitness effects of all possible single mutations in a protein [@Fowler2014]. For
each position in a protein, the amino acid residue is mutated and the fitness
effects are recorded. While most mutations tend to be deleterious, some can
enhance protein activity. In addition to analyzing single mutations, this method
can also examine the effects of multiple mutations, yielding insights into
protein structure and function. Overall, DMS scores provide a detailed map of
how changes in a protein's sequence affect its function.

Datasets in `ProteinGymR` can be easily loaded with built-in functions.

```{r import dms}
dms_data <- dms_substitutions()
```

View the DMS study names for the first 6 assays.

```{r view studies}
head(names(dms_data))
```

View an example of one DMS assay.

```{r view assay}
head(dms_data[[1]])
```

For each DMS assay, the columns show the UniProt protein identifier, the DMS
experiment assay identifier, the amino acid substitution at a given sequence
position, the mutated protein sequence, the recorded DMS score, and a binary DMS
score bin categorizing whether the mutation has an effect on fitness (1) or not
(0). For details, see `?dms_substitutions` and the reference publication from
@Notin2023.

Here, we obtain the metadata table that provides additional information for the
DMS experiments.

```{r queryEH}
# List the ProteinGymR resources available on ExperimentHub
eh <- ExperimentHub::ExperimentHub()
AnnotationHub::query(eh, "ProteinGymR")
# Retrieve the DMS metadata table by its ExperimentHub ID
dms_metadata <- eh[["EH9607"]]
names(dms_metadata)
```

There are 45 columns representing metadata for DMS assays. See @Notin2023 for
details on the individual metadata columns.

# Model benchmarking

The function `benchmark_models()` can be used to compare performance across
several variant effect prediction models when using the DMS data as ground
truth. This function takes one of the five available metrics and compares the
performance of up to 5 of the 79 available models.

In the zero-shot setting, the effects of mutations on fitness are predicted
without relying on ground-truth labels for the protein of interest. Robust
zero-shot performance is particularly informative when labels are subject to
several biases or scarcely available (e.g., labels for rare genetic
pathologies).

Model performance was evaluated across 5 metrics:

1.  Spearman's rank correlation coefficient (default metric)
2.  Area Under the ROC Curve (AUC)
3.  Matthews Correlation Coefficient (MCC), most suitable for bimodal DMS
    measurements
4.  Normalized Discounted Cumulative Gains (NDCG)
5.  Top K Recall (top 10% of DMS values)

To account for the fact that certain protein functions are overrepresented in
the list of proteins assayed with DMS (e.g., thermostability), these metrics
were first calculated within groups of proteins with similar functions. The
final value of the metric is then the average of these averages, giving each
functional group equal weight. The final values are referred to as the
‘corrected average’.

Due to the often non-linear relationship between protein function and organism
fitness [@Boucher2016], the Spearman’s rank correlation coefficient is typically
an appropriate choice for evaluating model performance against experimental
measurements. However, in situations where DMS measurements exhibit a bimodal
profile, rank correlations may not be the optimal choice.
Therefore, additional metrics are also provided, such as the Area Under the ROC
Curve (AUC) and the Matthews Correlation Coefficient (MCC), which compare
binarized model scores and experimental measurements. Furthermore, for certain
goals (e.g., optimizing functional properties of designed proteins), it is more
important that a model is able to correctly identify the most functional protein
variants, rather than properly capture the overall distribution of all assayed
variants. For such scenarios, it is beneficial to use the Normalized Discounted
Cumulative Gains (NDCG), which prioritizes models that return high scores for
sequences with high DMS values (corresponding to a strong gain in fitness).
Alternatively, the Top K Recall (with K being set to the top 10% of DMS values)
can also be informative for such scenarios.

To view all available zero-shot models, use the function `available_models()`.

```{r, available_models}
available_models()
```

Plot the AUC metric for 5 models.

```{r, warning=FALSE, fig.wide = TRUE}
benchmark_models(
    metric = "AUC",
    models = c("GEMME", "CARP_600K", "ESM1b", "VESPA", "ProtGPT2")
)
```

Here, GEMME performed the best, achieving the highest AUC of the 5 selected
models. If not specified by the user, Spearman correlation is used as the
default metric.

For more information about the models and metrics, see the function
documentation with `?benchmark_models`.

# Session Info

```{r, sesh info}
sessionInfo()
```

# References