---
title: "Import and visualization of ProteinGym data"
author:
  - name: Tram Nguyen
    affiliation: Department of Biomedical Informatics, Harvard Medical School
    email: Tram_Nguyen@hms.harvard.edu
  - name: Pascal Notin
    affiliation: Department of Systems Biology, Harvard Medical School
  - name: Aaron W Kollasch
    affiliation: Department of Systems Biology, Harvard Medical School
  - name: Debora Marks
    affiliation: Department of Systems Biology, Harvard Medical School
  - name: Ludwig Geistlinger
    affiliation: Department of Biomedical Informatics, Harvard Medical School
package: ProteinGymR
output:
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
date: "`r doc_date()`"
vignette: >
  %\VignetteIndexEntry{Data access and visualization}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  markdown:
    wrap: 80
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL,
    message = FALSE
)
```

# Installation

Install the package using Bioconductor. Start R and enter:

```{r, eval = FALSE}
if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ProteinGymR")
```

# Setup

Now, load the package and dependencies used in the vignette.

```{r, message = FALSE}
library(ProteinGymR)
library(tidyr)
library(dplyr)
library(stringr)
library(ggplot2)
library(ComplexHeatmap)
library(AnnotationHub)
```

[ProteinGym]: https://proteingym.org/
[pg_publication]: https://proceedings.neurips.cc/paper_files/paper/2023/hash/cac723e5ff29f65e3fcbb0739ae91bee-Abstract-Datasets_and_Benchmarks.html
[zenodo]: https://zenodo.org/records/13936340
[dms_paper]: https://www.nature.com/articles/nmeth.3027
[physiochem]: https://biology.stackexchange.com/questions/105321/arrangement-of-amino-acids-in-the-protein-alphabet
[boucher]: https://pubmed.ncbi.nlm.nih.gov/27010590/

# Introduction

Predicting the effects of mutations in proteins is critical to many
applications, from understanding genetic disease to designing novel proteins to
address our most pressing challenges in climate, agriculture and healthcare.
Despite an increase in machine learning-based protein modeling methods,
assessing the effectiveness of these models is problematic due to the use of
distinct, often contrived, experimental datasets and variable performance
across different protein families.

[ProteinGym v1.1][ProteinGym] is a large-scale and holistic set of benchmarks
specifically designed for protein fitness prediction and design
([Notin et al. 2023][pg_publication]). It encompasses both a broad collection
of over 250 standardized deep mutational scanning (DMS) assays, spanning
millions of mutated sequences, and curated clinical datasets providing
high-quality expert annotations about mutation effects. Furthermore, ProteinGym
reports the performance of a diverse set of over 60 high-performing models from
various subfields (e.g., mutation effects, inverse folding) in a unified
benchmark. ProteinGym v1.1 datasets are openly available as a community
resource both on [Zenodo][zenodo] and the official
[ProteinGym website][ProteinGym].

# Data

The `ProteinGymR` package provides the following analysis-ready datasets from
ProteinGym v1.1:

1.  DMS assay scores from 217 assays measuring the impact of all possible amino
    acid substitutions across 186 proteins. The data is provided with
    `dms_substitutions()`.

2.  AlphaMissense pathogenicity scores for ~1.6 M substitutions in the
    ProteinGym DMS data. The data is provided with `am_scores()`.

3.  Five model performance metrics ("AUC", "MCC", "NDCG", "Spearman",
    "Top_recall") for 62 models across 217 assays calculated on DMS
    substitutions in a zero-shot setting. The data is provided with
    `zeroshot_DMS_metrics()`.

4.  Reference file containing metadata associated with the 217 DMS assays.

# Explore and visualize data

This vignette explores and visualizes the first dataset of DMS scores. Deep
mutational scanning is an experimental technique that provides comprehensive
data on the functional effects of all possible single mutations in a protein
([Fowler & Fields 2014][dms_paper]). For each position in a protein, the amino
acid residue is mutated and the fitness effects are recorded. While most
mutations tend to be deleterious, some can enhance protein activity. In
addition to analyzing single mutations, this method can also examine the
effects of multiple mutations, yielding insights into protein structure and
function. Overall, DMS scores provide a detailed map of how changes in a
protein's sequence affect its function, offering valuable yet complex insights
for researchers studying protein biology.

## Load and explore the DMS data from ExperimentHub

Datasets in `ProteinGymR` can be easily loaded with built-in functions.

```{r import dms}
dms_data <- dms_substitutions()
```

View the DMS study names for the first 6 assays.

```{r view studies}
head(names(dms_data))
```

View an example of one DMS assay.

```{r view assay}
head(dms_data[[1]])
```

For each DMS assay, the columns show the UniProt protein identifier, the DMS
experiment assay identifier, the mutant at a given protein position, the
mutated protein sequence, the recorded DMS score, and a binary DMS score bin
categorizing whether the mutation has an effect on fitness (1) or not (0). For
more details, access the function documentation with `?dms_substitutions()` and
the reference publication from [Notin et al. 2023][pg_publication].

To access the metadata associated with each DMS assay, we can load the
reference table. Do this by querying all datasets on ExperimentHub affiliated
with "ProteinGymR".

```{r queryEH}
eh <- ExperimentHub::ExperimentHub()
AnnotationHub::query(eh, "ProteinGymR")

dms_metadata <- eh[["EH9607"]]
names(dms_metadata)
```

There are 45 columns representing metadata for DMS assays. For more information
about these columns, see the [ProteinGym publication][pg_publication].

## Visualization of DMS data with ComplexHeatmap

Explore an assay and create a heatmap of the DMS scores.

```{r ACE2}
ACE2 <- dms_data[["ACE2_HUMAN_Chan_2020"]]
```

We want to extract the reference amino acid, protein position, and mutant
residue from the "mutant" column of the dataset.

```{r heatmap, warning = FALSE, fig.wide = TRUE}
## Split "mutant" into reference residue, position, and alternate residue
ACE2 <- ACE2 |>
    dplyr::mutate(
        ref = str_sub(mutant, 1, 1),
        pos = as.integer(str_extract(mutant, "[0-9]+")),
        alt = str_sub(mutant, -1)
    ) |>
    select("ref", "pos", "alt", "DMS_score")

head(ACE2)

## Reshape the data to wide format: one row per position, one column per
## alternate amino acid
ACE2_wide <- ACE2 |>
    select(-ref) |>
    pivot_wider(names_from = alt, values_from = DMS_score) |>
    arrange(pos)

## Subset to the first 100 positions
ACE2_wide <- ACE2_wide |>
    filter(pos <= 100)

head(ACE2_wide)

## Convert to matrix, dropping the position column
heatmap_matrix <- ACE2_wide |>
    select(-pos) |>
    as.matrix()

## Set amino acid position as rownames of matrix
rownames(heatmap_matrix) <- ACE2_wide$pos

## Transpose so position is on the x-axis
heatmap_matrix <- t(heatmap_matrix)

## Reorder rows based on physicochemical properties
physiochem_order <- "DEKRHNQSTPGAVILMCFYW"
physiochem_order <- unlist(strsplit(physiochem_order, split = ""))

reordered_matrix <- heatmap_matrix[
    match(physiochem_order, rownames(heatmap_matrix)), ]

## Create the heatmap
ComplexHeatmap::Heatmap(reordered_matrix,
    name = "DMS Score",
    cluster_rows = FALSE,
    cluster_columns = FALSE,
    show_row_names = TRUE,
    show_column_names = TRUE)
```

The heatmap shows the DMS score for each substitution, with the protein
position on the x-axis and the alternate amino acid on the y-axis. For this
demonstration, we subset to the first 100 positions and grouped the amino acids
by their physicochemical properties (DE,KRH,NQ,ST,PGAVIL,MC,FYW). See
[here][physiochem] for more information. Note that not all positions along the
protein sequence are necessarily mutated in every DMS assay. This can result
from the specific research objectives, prioritization choices of the
investigators, or technical constraints inherent to the experimental design. A
low DMS score indicates low fitness, while a higher DMS score indicates high
fitness.
Based on the "ACE2_HUMAN_Chan_2020" assay, we can see that at positions 90 and
92, fitness remained high across amino acid changes, possibly suggesting a
mutation-tolerant region of the protein. However, several mutations at position
48 resulted in low fitness. This could represent an important region for
protein function where any perturbation would likely be deleterious.

# Benchmarking across models

We will now use the built-in function `benchmark_models()` to compare
performance across several variant effect prediction models calculated on the
217 DMS assays in the zero-shot setting. This function takes one of the five
available metrics and compares up to 5 of the 62 available models.

In the zero-shot setting, experimental phenotypical measurements from a given
assay are predicted without having access to any ground-truth labels at
training time. Robust zero-shot performance is particularly informative when
labels are subject to several biases or scarcely available (e.g., labels for
rare genetic pathologies).

Model performance was evaluated across 5 metrics:

1.  Spearman's rank correlation coefficient (primary metric)
2.  Area Under the ROC Curve (AUC)
3.  Matthews Correlation Coefficient (MCC) for bimodal DMS measurements
4.  Normalized Discounted Cumulative Gains (NDCG) for identifying the most
    functional protein variants
5.  Top K Recall (top 10% of DMS values)

To avoid placing too much weight on properties with many assays (e.g.,
thermostability), these metrics were first calculated within groups of assays
that measure similar functions. The final value of the metric is then the
average of these averages, giving each functional group equal weight. The final
values are referred to as the ‘corrected average’.

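As a minimal sketch of this two-step averaging, the following base R chunk uses
invented per-assay Spearman values for a single hypothetical model. The data
frame `toy_scores`, its column names, the assay names, and the functional group
labels are assumptions made for illustration only; they are not part of the
`ProteinGymR` API and are not taken from the ProteinGym data.

```{r corrected-average-sketch}
## Toy per-assay scores for one hypothetical model (all values are invented)
toy_scores <- data.frame(
    assay = c("assay_A", "assay_B", "assay_C", "assay_D", "assay_E"),
    function_group = c("Stability", "Stability", "Stability",
                       "Binding", "Activity"),
    spearman = c(0.60, 0.50, 0.40, 0.20, 0.30)
)

## Naive average: every assay counts equally, so the functional group with
## the most assays dominates the result
naive_average <- mean(toy_scores$spearman)

## Step 1: average the metric within each functional group
group_means <- tapply(toy_scores$spearman, toy_scores$function_group, mean)

## Step 2: average the group means, so each group has equal weight
corrected_average <- mean(group_means)

naive_average
corrected_average
```

In this toy example, the group with three assays pulls the naive average toward
its own scores, whereas the corrected average weights each of the three groups
equally.
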
Due to the often non-linear relationship between protein function and organism
fitness ([Boucher et al., 2016][boucher]), the Spearman’s rank correlation
coefficient is the most generally appropriate metric for model performance on
experimental measurements. However, in situations where DMS measurements
exhibit a bimodal profile, rank correlations may not be the optimal choice.
Therefore, additional metrics are also provided, such as the Area Under the ROC
Curve (AUC) and the Matthews Correlation Coefficient (MCC), which compare model
scores with binarized experimental measurements. Furthermore, for certain goals
(e.g., optimizing functional properties of designed proteins), it is more
important that a model is able to correctly identify the most functional
protein variants, rather than properly capture the overall distribution of all
assayed variants. Thus, we also calculate the Normalized Discounted Cumulative
Gains (NDCG), which up-weights a model if it gives its highest scores to
sequences with the highest DMS value. Finally, we also calculate Top K Recall,
where we select K to be the top 10% of DMS values.

To view all available models, use the function `available_models()`.

```{r, available_models}
available_models()
```

Plot the AUC metric for 5 models.

```{r, warning=FALSE, fig.wide = TRUE}
benchmark_models(metric = "AUC",
    models = c("GEMME", "CARP_600K", "ESM_1b", "EVmutation", "ProtGPT2"))
```

Based on the AUC metric, GEMME performed best of the 5 selected models.

If the `metric` argument is not specified, the Spearman correlation is used by
default. For more information about the models and metrics, see the function
documentation `?benchmark_models()`.

# Reference

Notin, P., Kollasch, A., Ritter, D., van Niekerk, L., Paul, S., Spinner, H.,
Rollins, N., Shaw, A., Orenbuch, R., Weitzman, R., Frazer, J., Dias, M.,
Franceschi, D., Gal, Y., & Marks, D. (2023). ProteinGym: Large-Scale Benchmarks
for Protein Fitness Prediction and Design.
In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.),
Advances in Neural Information Processing Systems (Vol. 36, pp. 64331-64379).
Curran Associates, Inc.

Fowler, D., & Fields, S. (2014). Deep mutational scanning: a new style of
protein science. Nature Methods, 11, 801–807. doi: 10.1038/nmeth.3027.

Boucher, J. I., Bolon, D. N., & Tawfik, D. S. (2016). Quantifying and
understanding the fitness effects of protein mutations: Laboratory versus
nature. Protein Science, 25(7), 1219-1226. doi: 10.1002/pro.2928.

# Session Info

```{r, sesh info}
sessionInfo()
```