# Standard setup chunk
knitr::opts_chunk$set(echo = TRUE, collapse = TRUE)
# Load libraries required for the vignette to build
library(PMScanR)
library(ggseqlogo)
library(seqinr)
library(plotly)
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
The PMScanR package provides large-scale identification, analysis, and visualization of protein motifs. The package integrates various methods to facilitate motif identification, characterization, and visualization. It includes functions for running PS-Scan, a PROSITE database tool. Additionally, PMScanR supports format conversion to GFF, enhancing downstream analyses such as graphical representation and database integration. The library offers multiple visualization tools, including heatmaps, sequence logos, and pie charts, enabling a deeper understanding of motif distribution and conservation. Through its integration with PROSITE, PMScanR provides access to up-to-date motif data.
Proteins play a crucial role in biological processes, with their functions closely related to structure. Protein functions are often associated with the presence of specific motifs, which are short, sometimes repetitive amino acid sequences essential for distinctive molecular interactions or modifications. Most of the existing bioinformatics tools focus mainly on the identification of known motifs and often do not provide interactive analysis and visualization tools during motif extraction. Moreover, they do not take into account the effect of single variations on an entire domain or protein motif. These limitations highlight the need for a tool that can automate and scale the analysis. To address this, we have developed PMScanR, an R-based package designed to facilitate and automate the prediction and evaluation of the effect of single amino acid substitutions on the occurrence of protein motifs on a large scale. However, existing tools lacked the capability to perform comparative analysis of multiple motifs across multiple sequences, a gap that PMScanR was particularly developed to fill.
To install this package, start R (version “4.4” or higher) and enter:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("PMScanR")
library(PMScanR)
Here we show the most basic pipeline for protein motif analysis: scanning a sequence file, processing the results, and generating a occurrence plot visualization. This code chunk assumes you have a FASTA file ready for analysis.
fasta_file <- system.file("extdata", "hemoglobins.fasta", package = "PMScanR")
runPsScan(in_file = fasta_file, out_file = "results.gff", out_format = "gff")
gff_data <- as.data.frame(rtracklayer::import.gff("results.gff"))
motif_matrix <- gff2matrix(gff_data)
matrix2OP(motif_matrix)
If the user prefers to perform the analysis using a graphical user interface (GUI), they can simply run the function runPMScanRShiny(). This will launch a Shiny app that opens an interactive window. The window can be used both within R and in a web browser, providing a clickable, user-friendly interface that allows the entire analysis, including visualizations, to be carried out without needing to write code.
# This command launches the interactive Shiny app
runPMScanRShiny()
Alternatively, if the user wishes to work directly with the code, the library provides a set of functions to perform the full analysis, including protein motif identification and visualization. This can be done through an R script, where users can execute and customize the analysis programmatically. Each function included in the package is described below, along with an explanation of its purpose and functionality.
The first step of the analyses is to establish the working environment.
# Setting working directory is user-specific, e.g.:
# setwd("/path/to/your/working/directory")
For the purpose of this vignette, we will use sample files included with the PMScanR package. These files represent the input (FASTA) and various outputs (GFF, PSA, TXT) generated by the PROSITE analysis.
# 1. Load example FASTA file (Input for runPsScan)
fasta_file <- system.file("extdata", "hemoglobins.fasta", package = "PMScanR")
# 2. Load example GFF output
gff_file <- system.file("extdata", "out_Hb_gff.txt", package = "PMScanR")
# 3. Load example PSA output
psa_file <- system.file("extdata", "out_Hb_psa.txt", package = "PMScanR")
# 4. Load example PROSITE text output (Scan format)
prosite_txt_file <- system.file("extdata", "out_Hb_PROSITE.txt", package = "PMScanR")
Once the data is accessible, you can move on to the analysis.
runPsScan()This function runs the ps_scan tool to identify protein motifs. It automatically detects the operating system and handles the downloading and caching of required PROSITE databases (using BiocFileCache) upon the first run.
The runPsScan function allows you to specify the output format via the out_format argument. Available formats include: scan (default), fasta, psa, msa, gff, pff, epff, sequence, matchlist, ipro.
Regardless of the chosen output format (gff, psa, or scan), PMScanR allows you to import the data into R and perform the complete analysis.
# This command is not evaluated in the vignette as it requires an external
# dependency (Perl) and can be time-consuming during the first run.
# Example: Generate GFF output
runPsScan(in_file = fasta_file, out_format = 'gff', out_file = "results.gff")
The PMScanR package is designed to work flexibly with different PROSITE output formats. Whether you have a GFF, PSA, or a standard text file, you can load it into a standardized data frame in R.
Option A: Loading GFF files
If you generated a GFF file, use rtracklayer::import.gff and convert it to a data frame.
gff_data <- as.data.frame(rtracklayer::import.gff(gff_file))
# The data frame now contains all necessary columns (including Sequence)
head(gff_data)
## seqnames start end width strand source type score phase
## 1 NP_000508.1 39 41 3 * ps_scan|v1.89 PS00005 NA NA
## 2 NP_000508.1 138 140 3 * ps_scan|v1.89 PS00005 NA NA
## 3 NP_000508.1 4 7 4 * ps_scan|v1.89 PS00006 NA NA
## 4 NP_000508.1 19 24 6 * ps_scan|v1.89 PS00008 NA NA
## 5 NP_000508.1 59 62 4 * ps_scan|v1.89 PS00009 NA NA
## 6 NP_000508.1 2 142 141 * ps_scan|v1.89 PS01033 42.058 NA
## Name
## 1 PKC_PHOSPHO_SITE
## 2 PKC_PHOSPHO_SITE
## 3 CK2_PHOSPHO_SITE
## 4 MYRISTYL
## 5 AMIDATION
## 6 GLOBIN
## Sequence
## 1 TtK
## 2 TsK
## 3 SpaD
## 4 GAhaGE
## 5 hGKK
## 6 VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF---DLSHGSAQVKGHGKKVADALTNAVAHV---DDMPNALSALSDLHAhKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
## SequenceDescription SkipFlag Level RawScore
## 1 NP_000508.1 hemoglobin subunit alpha [Homo sapiens] 1 <NA> <NA>
## 2 NP_000508.1 hemoglobin subunit alpha [Homo sapiens] 1 <NA> <NA>
## 3 NP_000508.1 hemoglobin subunit alpha [Homo sapiens] 1 <NA> <NA>
## 4 NP_000508.1 hemoglobin subunit alpha [Homo sapiens] 1 <NA> <NA>
## 5 NP_000508.1 hemoglobin subunit alpha [Homo sapiens] 1 <NA> <NA>
## 6 NP_000508.1 hemoglobin subunit alpha [Homo sapiens] <NA> 0 4843
## FeatureFrom FeatureTo
## 1 <NA> <NA>
## 2 <NA> <NA>
## 3 <NA> <NA>
## 4 <NA> <NA>
## 5 <NA> <NA>
## 6 1 -1
Option B: Loading PSA files
If you generated a PSA file, use the readPsa() function.
psa_data <- readPsa(psa_file)
head(psa_data)
## seqnames start end width strand source type score phase
## 1 NP_000508.1 39 41 3 * PSA PS00005 NA NA
## 2 NP_000508.1 138 140 3 * PSA PS00005 NA NA
## 3 NP_000508.1 4 7 4 * PSA PS00006 NA NA
## 4 NP_000508.1 19 24 6 * PSA PS00008 NA NA
## 5 NP_000508.1 59 62 4 * PSA PS00009 NA NA
## 6 NP_000508.1 2 142 141 * PSA PS01033 NA NA
## Name
## 1 PKC_PHOSPHO_SITE
## 2 PKC_PHOSPHO_SITE
## 3 CK2_PHOSPHO_SITE
## 4 MYRISTYL
## 5 AMIDATION
## 6 GLOBIN
## Sequence
## 1 TtK
## 2 TsK
## 3 SpaD
## 4 GAhaGE
## 5 hGKK
## 6 VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF---DLSHGSAQVKGHGKKVADALTNAVAHV---DDMPNALSALSDLHAhKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
## SequenceDescription SkipFlag Level RawScore FeatureFrom FeatureTo
## 1 <NA> <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA> <NA>
## 6 <NA> <NA> 0 <NA> <NA> <NA>
Option C: Loading standard PROSITE text files
If you have a standard output file (format ‘scan’), use the readProsite() function.
prosite_data <- readProsite(prosite_txt_file)
head(prosite_data)
## seqnames start end width strand source type score phase
## 1 NP_000508.1 39 41 3 * PROSITE PS00005 NA NA
## 2 NP_000508.1 138 140 3 * PROSITE PS00005 NA NA
## 3 NP_000508.1 4 7 4 * PROSITE PS00006 NA NA
## 4 NP_000508.1 19 24 6 * PROSITE PS00008 NA NA
## 5 NP_000508.1 59 62 4 * PROSITE PS00009 NA NA
## 6 NP_000508.1 2 142 141 * PROSITE PS01033 NA NA
## Name
## 1 PKC_PHOSPHO_SITE
## 2 PKC_PHOSPHO_SITE
## 3 CK2_PHOSPHO_SITE
## 4 MYRISTYL
## 5 AMIDATION
## 6 GLOBIN
## Sequence
## 1 TtK
## 2 TsK
## 3 SpaD
## 4 GAhaGE
## 5 hGKK
## 6 VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF---DLSHGSAQVKGL=0HGKKVADALTNAVAHV---DDMPNALSALSDLHAhKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
## SequenceDescription SkipFlag Level RawScore FeatureFrom
## 1 Protein kinase C phosphorylation site. <NA> <NA> <NA> <NA>
## 2 Protein kinase C phosphorylation site. <NA> <NA> <NA> <NA>
## 3 Casein kinase II phosphorylation site. <NA> <NA> <NA> <NA>
## 4 N-myristoylation site. <NA> <NA> <NA> <NA>
## 5 Amidation site. <NA> <NA> <NA> <NA>
## 6 Globin domain profile. <NA> <NA> <NA> <NA>
## FeatureTo
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
Once the data is loaded into a data frame (using any of the options above), the downstream analysis is identical. You can generate heatmaps, pie charts, and sequence logos from the same object.
gff2matrix()This function converts the data frame into a binary motif-occurrence matrix. This matrix is the required input for the heatmap visualization functions. Each row represents a unique motif, each column represents a sequence, and a 1 indicates the presence of that motif.
Note: This function works with data loaded from GFF, PSA, or TXT files, as long as they are parsed into the standard format shown above.
# We can use the data loaded from Option A, B or C.
# Here we use 'gff_data' as an example.
motif_matrix <- gff2matrix(gff_data)
# Display the first few rows of the resulting matrix
head(motif_matrix)
## NP_000508.1 NP_000549.1 NP_000550.2 NP_000175.1 NP_001305150.1
## PS00001:125-128 0 0 0 0 0
## PS00001:148-151 0 0 0 0 0
## PS00001:152-155 0 0 0 0 0
## PS00001:177-180 0 0 0 0 0
## PS00001:182-185 0 0 0 0 0
## PS00001:189-192 0 0 0 0 0
## NP_001350615.1 NP_001138679.1 NP_599030.1 NP_071441.1
## PS00001:125-128 0 0 0 0
## PS00001:148-151 0 0 0 0
## PS00001:152-155 0 0 0 0
## PS00001:177-180 0 0 1 0
## PS00001:182-185 0 0 0 0
## PS00001:189-192 0 1 0 0
## NP_001305067.1
## PS00001:125-128 1
## PS00001:148-151 1
## PS00001:152-155 1
## PS00001:177-180 0
## PS00001:182-185 1
## PS00001:189-192 0
After using this function a occurrence plot can be generated by using:
matrix2OP() and matrixOP()These functions generate interactive occurrence plots from the motif-occurrence matrix. matrix2OP creates a standard rectangular occurrence plot, while matrix2SquareOP creates one with a square aspect ratio.
# Generate a standard occurrence plot from the motif_matrix
occurrencePlot <- matrix2OP(input = motif_matrix)
occurrencePlot
# Generate a square occurrence plot from the motif_matrix
squareOccurrencePlot <- matrix2SquareOP(input = motif_matrix)
squareOccurrencePlot
freqPie()This function generates a pie chart to visualize the frequency distribution of each motif type found in the analysis.
pie_chart <- freqPie(gff_data)
print(pie_chart)
extractProteinMotifs()To generate sequence logos, you can use the extractProteinMotifs() function. This function parses the output file generated by runPsScan() (supporting GFF, PSA, and TXT formats) and extracts all instances of each identified motif into a list, where the keys are PROSITE IDs.
# This reads the PROSITE analysis output file from disk and extracts motifs.
# The format is detected automatically, but can also be specified explicitly
# (e.g., format = "gff").
protein_motifs <- extractProteinMotifs(psa_file)
# Check the PROSITE IDs (keys) found in the file
head(names(protein_motifs))
## [1] "PS00001" "PS00004" "PS00005" "PS00006" "PS00007" "PS00008"
# Generate sequence logo for the first motif found in the list
ggseqlogo::ggseqlogo(protein_motifs[[1]], seq_type='aa')
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggseqlogo package.
## Please report the issue at <https://github.com/omarwagih/ggseqlogo/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
extractSegments()If you want to analyze raw sequences instead of identified motifs (e.g., to look at a specific region across all proteins), use extractSegments().
# Read the FASTA file into a list of sequences
sequences <- seqinr::read.fasta(file = fasta_file, seqtype = "AA")
# Extract segments from position 10 to 20 from all sequences
segments <- extractSegments(sequences, from = 10, to = 20)
# Generate the sequence logo from the extracted segments
ggseqlogo::ggseqlogo(unlist(segments), seq_type = "aa")
Sigrist C.J.A., de Castro E., Cerutti L., Cuche B.A., Hulo N., Bridge A., Bougueleret L., Xenarios I. (2012). New and continuing developments at PROSITE. Nucleic Acids Res.
sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] plotly_4.11.0 ggplot2_4.0.1 seqinr_4.2-36 ggseqlogo_0.2
## [5] PMScanR_1.0.1 BiocStyle_2.38.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 bitops_1.0-9
## [3] httr2_1.2.1 rlang_1.1.6
## [5] magrittr_2.0.4 ade4_1.7-23
## [7] otel_0.2.0 matrixStats_1.5.0
## [9] compiler_4.5.2 RSQLite_2.4.5
## [11] vctrs_0.6.5 reshape2_1.4.5
## [13] stringr_1.6.0 pkgconfig_2.0.3
## [15] crayon_1.5.3 fastmap_1.2.0
## [17] magick_2.9.0 dbplyr_2.5.1
## [19] XVector_0.50.0 labeling_0.4.3
## [21] Rsamtools_2.26.0 promises_1.5.0
## [23] rmarkdown_2.30 tinytex_0.58
## [25] purrr_1.2.0 bit_4.6.0
## [27] xfun_0.54 cachem_1.1.0
## [29] cigarillo_1.0.0 jsonlite_2.0.0
## [31] shinyFiles_0.9.3 blob_1.2.4
## [33] later_1.4.4 DelayedArray_0.36.0
## [35] BiocParallel_1.44.0 parallel_4.5.2
## [37] R6_2.6.1 bslib_0.9.0
## [39] stringi_1.8.7 RColorBrewer_1.1-3
## [41] rtracklayer_1.70.0 GenomicRanges_1.62.0
## [43] jquerylib_0.1.4 Rcpp_1.1.0
## [45] Seqinfo_1.0.0 bookdown_0.45
## [47] SummarizedExperiment_1.40.0 knitr_1.50
## [49] IRanges_2.44.0 httpuv_1.6.16
## [51] Matrix_1.7-4 tidyselect_1.2.1
## [53] dichromat_2.0-0.1 abind_1.4-8
## [55] yaml_2.3.11 codetools_0.2-20
## [57] curl_7.0.0 lattice_0.22-7
## [59] tibble_3.3.0 plyr_1.8.9
## [61] Biobase_2.70.0 shiny_1.11.1
## [63] withr_3.0.2 S7_0.2.1
## [65] evaluate_1.0.5 BiocFileCache_3.0.0
## [67] Biostrings_2.78.0 pillar_1.11.1
## [69] BiocManager_1.30.27 filelock_1.0.3
## [71] MatrixGenerics_1.22.0 stats4_4.5.2
## [73] generics_0.1.4 RCurl_1.98-1.17
## [75] S4Vectors_0.48.0 scales_1.4.0
## [77] xtable_1.8-4 glue_1.8.0
## [79] lazyeval_0.2.2 tools_4.5.2
## [81] BiocIO_1.20.0 data.table_1.17.8
## [83] GenomicAlignments_1.46.0 fs_1.6.6
## [85] XML_3.99-0.20 grid_4.5.2
## [87] tidyr_1.3.1 crosstalk_1.2.2
## [89] restfulr_0.0.16 cli_3.6.5
## [91] rappdirs_0.3.3 S4Arrays_1.10.1
## [93] viridisLite_0.4.2 dplyr_1.1.4
## [95] gtable_0.3.6 sass_0.4.10
## [97] digest_0.6.39 BiocGenerics_0.56.0
## [99] SparseArray_1.10.4 rjson_0.2.23
## [101] htmlwidgets_1.6.4 farver_2.1.2
## [103] memoise_2.0.1 htmltools_0.5.8.1
## [105] lifecycle_1.0.4 httr_1.4.7
## [107] mime_0.13 bit64_4.6.0-1
## [109] MASS_7.3-65