Contents

# Standard setup chunk
knitr::opts_chunk$set(echo = TRUE, collapse = TRUE)
# Load libraries required for the vignette to build
library(PMScanR)
library(ggseqlogo)
library(seqinr)
library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

1 1 Introduction

The PMScanR package provides large-scale identification, analysis, and visualization of protein motifs. The package integrates various methods to facilitate motif identification, characterization, and visualization. It includes functions for running PS-Scan, a PROSITE database tool. Additionally, PMScanR supports format conversion to GFF, enhancing downstream analyses such as graphical representation and database integration. The library offers multiple visualization tools, including heatmaps, sequence logos, and pie charts, enabling a deeper understanding of motif distribution and conservation. Through its integration with PROSITE, PMScanR provides access to up-to-date motif data.

Proteins play a crucial role in biological processes, with their functions closely related to structure. Protein functions are often associated with the presence of specific motifs, which are short, sometimes repetitive amino acid sequences essential for distinctive molecular interactions or modifications. Most of the existing bioinformatics tools focus mainly on the identification of known motifs and often do not provide interactive analysis and visualization tools during motif extraction. Moreover, they do not take into account the effect of single variations on an entire domain or protein motif. These limitations highlight the need for a tool that can automate and scale the analysis. To address this, we have developed PMScanR, an R-based package designed to facilitate and automate the prediction and evaluation of the effect of single amino acid substitutions on the occurrence of protein motifs on a large scale. However, existing tools lacked the capability to perform comparative analysis of multiple motifs across multiple sequences, a gap that PMScanR was particularly developed to fill.

1.1 1.1 Installation and loading

To install this package, start R (version “4.4” or higher) and enter:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("PMScanR")

library(PMScanR)

2 Quick start

Here we show the most basic pipeline for protein motif analysis: scanning a sequence file, processing the results, and generating a occurrence plot visualization. This code chunk assumes you have a FASTA file ready for analysis.

fasta_file <- system.file("extdata", "hemoglobins.fasta", package = "PMScanR")

runPsScan(in_file = fasta_file, out_file = "results.gff", out_format = "gff")

gff_data <- as.data.frame(rtracklayer::import.gff("results.gff"))
motif_matrix <- gff2matrix(gff_data)

matrix2OP(motif_matrix)

3 2 Data manipulation and overall usage

3.1 2.1 GUI

If the user prefers to perform the analysis using a graphical user interface (GUI), they can simply run the function runPMScanRShiny(). This will launch a Shiny app that opens an interactive window. The window can be used both within R and in a web browser, providing a clickable, user-friendly interface that allows the entire analysis, including visualizations, to be carried out without needing to write code.

# This command launches the interactive Shiny app
runPMScanRShiny()

3.2 2.2 Command Line

Alternatively, if the user wishes to work directly with the code, the library provides a set of functions to perform the full analysis, including protein motif identification and visualization. This can be done through an R script, where users can execute and customize the analysis programmatically. Each function included in the package is described below, along with an explanation of its purpose and functionality.

3.2.1 2.2.1 Loading example data

The first step of the analyses is to establish the working environment.

# Setting working directory is user-specific, e.g.:
# setwd("/path/to/your/working/directory")

For the purpose of this vignette, we will use sample files included with the PMScanR package. These files represent the input (FASTA) and various outputs (GFF, PSA, TXT) generated by the PROSITE analysis.

# 1. Load example FASTA file (Input for runPsScan)
fasta_file <- system.file("extdata", "hemoglobins.fasta", package = "PMScanR")

# 2. Load example GFF output
gff_file <- system.file("extdata", "out_Hb_gff.txt", package = "PMScanR")

# 3. Load example PSA output
psa_file <- system.file("extdata", "out_Hb_psa.txt", package = "PMScanR")

# 4. Load example PROSITE text output (Scan format)
prosite_txt_file <- system.file("extdata", "out_Hb_PROSITE.txt", package = "PMScanR")

Once the data is accessible, you can move on to the analysis.

3.2.1.1 runPsScan()

This function runs the ps_scan tool to identify protein motifs. It automatically detects the operating system and handles the downloading and caching of required PROSITE databases (using BiocFileCache) upon the first run.

The runPsScan function allows you to specify the output format via the out_format argument. Available formats include: scan (default), fasta, psa, msa, gff, pff, epff, sequence, matchlist, ipro.

Regardless of the chosen output format (gff, psa, or scan), PMScanR allows you to import the data into R and perform the complete analysis.

# This command is not evaluated in the vignette as it requires an external
# dependency (Perl) and can be time-consuming during the first run.

# Example: Generate GFF output
runPsScan(in_file = fasta_file, out_format = 'gff', out_file = "results.gff")

3.2.2 2.2.2 Parsing Results

The PMScanR package is designed to work flexibly with different PROSITE output formats. Whether you have a GFF, PSA, or a standard text file, you can load it into a standardized data frame in R.

Option A: Loading GFF files If you generated a GFF file, use rtracklayer::import.gff and convert it to a data frame.

gff_data <- as.data.frame(rtracklayer::import.gff(gff_file))
# The data frame now contains all necessary columns (including Sequence)
head(gff_data)
##      seqnames start end width strand        source    type  score phase
## 1 NP_000508.1    39  41     3      * ps_scan|v1.89 PS00005     NA    NA
## 2 NP_000508.1   138 140     3      * ps_scan|v1.89 PS00005     NA    NA
## 3 NP_000508.1     4   7     4      * ps_scan|v1.89 PS00006     NA    NA
## 4 NP_000508.1    19  24     6      * ps_scan|v1.89 PS00008     NA    NA
## 5 NP_000508.1    59  62     4      * ps_scan|v1.89 PS00009     NA    NA
## 6 NP_000508.1     2 142   141      * ps_scan|v1.89 PS01033 42.058    NA
##               Name
## 1 PKC_PHOSPHO_SITE
## 2 PKC_PHOSPHO_SITE
## 3 CK2_PHOSPHO_SITE
## 4         MYRISTYL
## 5        AMIDATION
## 6           GLOBIN
##                                                                                                                                              Sequence
## 1                                                                                                                                                 TtK
## 2                                                                                                                                                 TsK
## 3                                                                                                                                                SpaD
## 4                                                                                                                                              GAhaGE
## 5                                                                                                                                                hGKK
## 6 VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF---DLSHGSAQVKGHGKKVADALTNAVAHV---DDMPNALSALSDLHAhKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
##                                   SequenceDescription SkipFlag Level RawScore
## 1 NP_000508.1 hemoglobin subunit alpha [Homo sapiens]        1  <NA>     <NA>
## 2 NP_000508.1 hemoglobin subunit alpha [Homo sapiens]        1  <NA>     <NA>
## 3 NP_000508.1 hemoglobin subunit alpha [Homo sapiens]        1  <NA>     <NA>
## 4 NP_000508.1 hemoglobin subunit alpha [Homo sapiens]        1  <NA>     <NA>
## 5 NP_000508.1 hemoglobin subunit alpha [Homo sapiens]        1  <NA>     <NA>
## 6 NP_000508.1 hemoglobin subunit alpha [Homo sapiens]     <NA>     0     4843
##   FeatureFrom FeatureTo
## 1        <NA>      <NA>
## 2        <NA>      <NA>
## 3        <NA>      <NA>
## 4        <NA>      <NA>
## 5        <NA>      <NA>
## 6           1        -1

Option B: Loading PSA files If you generated a PSA file, use the readPsa() function.

psa_data <- readPsa(psa_file)
head(psa_data)
##      seqnames start end width strand source    type score phase
## 1 NP_000508.1    39  41     3      *    PSA PS00005    NA    NA
## 2 NP_000508.1   138 140     3      *    PSA PS00005    NA    NA
## 3 NP_000508.1     4   7     4      *    PSA PS00006    NA    NA
## 4 NP_000508.1    19  24     6      *    PSA PS00008    NA    NA
## 5 NP_000508.1    59  62     4      *    PSA PS00009    NA    NA
## 6 NP_000508.1     2 142   141      *    PSA PS01033    NA    NA
##               Name
## 1 PKC_PHOSPHO_SITE
## 2 PKC_PHOSPHO_SITE
## 3 CK2_PHOSPHO_SITE
## 4         MYRISTYL
## 5        AMIDATION
## 6           GLOBIN
##                                                                                                                                              Sequence
## 1                                                                                                                                                 TtK
## 2                                                                                                                                                 TsK
## 3                                                                                                                                                SpaD
## 4                                                                                                                                              GAhaGE
## 5                                                                                                                                                hGKK
## 6 VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF---DLSHGSAQVKGHGKKVADALTNAVAHV---DDMPNALSALSDLHAhKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
##   SequenceDescription SkipFlag Level RawScore FeatureFrom FeatureTo
## 1                <NA>     <NA>  <NA>     <NA>        <NA>      <NA>
## 2                <NA>     <NA>  <NA>     <NA>        <NA>      <NA>
## 3                <NA>     <NA>  <NA>     <NA>        <NA>      <NA>
## 4                <NA>     <NA>  <NA>     <NA>        <NA>      <NA>
## 5                <NA>     <NA>  <NA>     <NA>        <NA>      <NA>
## 6                <NA>     <NA>     0     <NA>        <NA>      <NA>

Option C: Loading standard PROSITE text files If you have a standard output file (format ‘scan’), use the readProsite() function.

prosite_data <- readProsite(prosite_txt_file)
head(prosite_data)
##      seqnames start end width strand  source    type score phase
## 1 NP_000508.1    39  41     3      * PROSITE PS00005    NA    NA
## 2 NP_000508.1   138 140     3      * PROSITE PS00005    NA    NA
## 3 NP_000508.1     4   7     4      * PROSITE PS00006    NA    NA
## 4 NP_000508.1    19  24     6      * PROSITE PS00008    NA    NA
## 5 NP_000508.1    59  62     4      * PROSITE PS00009    NA    NA
## 6 NP_000508.1     2 142   141      * PROSITE PS01033    NA    NA
##               Name
## 1 PKC_PHOSPHO_SITE
## 2 PKC_PHOSPHO_SITE
## 3 CK2_PHOSPHO_SITE
## 4         MYRISTYL
## 5        AMIDATION
## 6           GLOBIN
##                                                                                                                                                 Sequence
## 1                                                                                                                                                    TtK
## 2                                                                                                                                                    TsK
## 3                                                                                                                                                   SpaD
## 4                                                                                                                                                 GAhaGE
## 5                                                                                                                                                   hGKK
## 6 VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF---DLSHGSAQVKGL=0HGKKVADALTNAVAHV---DDMPNALSALSDLHAhKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
##                      SequenceDescription SkipFlag Level RawScore FeatureFrom
## 1 Protein kinase C phosphorylation site.     <NA>  <NA>     <NA>        <NA>
## 2 Protein kinase C phosphorylation site.     <NA>  <NA>     <NA>        <NA>
## 3 Casein kinase II phosphorylation site.     <NA>  <NA>     <NA>        <NA>
## 4                 N-myristoylation site.     <NA>  <NA>     <NA>        <NA>
## 5                        Amidation site.     <NA>  <NA>     <NA>        <NA>
## 6                 Globin domain profile.     <NA>  <NA>     <NA>        <NA>
##   FeatureTo
## 1      <NA>
## 2      <NA>
## 3      <NA>
## 4      <NA>
## 5      <NA>
## 6      <NA>

3.2.3 2.2.3 Visualization and Analysis

Once the data is loaded into a data frame (using any of the options above), the downstream analysis is identical. You can generate heatmaps, pie charts, and sequence logos from the same object.

3.2.3.1 gff2matrix()

This function converts the data frame into a binary motif-occurrence matrix. This matrix is the required input for the heatmap visualization functions. Each row represents a unique motif, each column represents a sequence, and a 1 indicates the presence of that motif.

Note: This function works with data loaded from GFF, PSA, or TXT files, as long as they are parsed into the standard format shown above.

# We can use the data loaded from Option A, B or C. 
# Here we use 'gff_data' as an example.
motif_matrix <- gff2matrix(gff_data)

# Display the first few rows of the resulting matrix
head(motif_matrix)
##                 NP_000508.1 NP_000549.1 NP_000550.2 NP_000175.1 NP_001305150.1
## PS00001:125-128           0           0           0           0              0
## PS00001:148-151           0           0           0           0              0
## PS00001:152-155           0           0           0           0              0
## PS00001:177-180           0           0           0           0              0
## PS00001:182-185           0           0           0           0              0
## PS00001:189-192           0           0           0           0              0
##                 NP_001350615.1 NP_001138679.1 NP_599030.1 NP_071441.1
## PS00001:125-128              0              0           0           0
## PS00001:148-151              0              0           0           0
## PS00001:152-155              0              0           0           0
## PS00001:177-180              0              0           1           0
## PS00001:182-185              0              0           0           0
## PS00001:189-192              0              1           0           0
##                 NP_001305067.1
## PS00001:125-128              1
## PS00001:148-151              1
## PS00001:152-155              1
## PS00001:177-180              0
## PS00001:182-185              1
## PS00001:189-192              0

After using this function a occurrence plot can be generated by using:

3.2.3.2 matrix2OP() and matrixOP()

These functions generate interactive occurrence plots from the motif-occurrence matrix. matrix2OP creates a standard rectangular occurrence plot, while matrix2SquareOP creates one with a square aspect ratio.

# Generate a standard occurrence plot from the motif_matrix
occurrencePlot <- matrix2OP(input = motif_matrix)
occurrencePlot
# Generate a square occurrence plot from the motif_matrix
squareOccurrencePlot <- matrix2SquareOP(input = motif_matrix)
squareOccurrencePlot

3.2.3.3 freqPie()

This function generates a pie chart to visualize the frequency distribution of each motif type found in the analysis.

pie_chart <- freqPie(gff_data)
print(pie_chart)

4 References

Appendix

Sigrist C.J.A., de Castro E., Cerutti L., Cuche B.A., Hulo N., Bridge A., Bougueleret L., Xenarios I. (2012). New and continuing developments at PROSITE. Nucleic Acids Res.

A Session Information

sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] plotly_4.11.0    ggplot2_4.0.1    seqinr_4.2-36    ggseqlogo_0.2   
## [5] PMScanR_1.0.1    BiocStyle_2.38.0
## 
## loaded via a namespace (and not attached):
##   [1] DBI_1.2.3                   bitops_1.0-9               
##   [3] httr2_1.2.1                 rlang_1.1.6                
##   [5] magrittr_2.0.4              ade4_1.7-23                
##   [7] otel_0.2.0                  matrixStats_1.5.0          
##   [9] compiler_4.5.2              RSQLite_2.4.5              
##  [11] vctrs_0.6.5                 reshape2_1.4.5             
##  [13] stringr_1.6.0               pkgconfig_2.0.3            
##  [15] crayon_1.5.3                fastmap_1.2.0              
##  [17] magick_2.9.0                dbplyr_2.5.1               
##  [19] XVector_0.50.0              labeling_0.4.3             
##  [21] Rsamtools_2.26.0            promises_1.5.0             
##  [23] rmarkdown_2.30              tinytex_0.58               
##  [25] purrr_1.2.0                 bit_4.6.0                  
##  [27] xfun_0.54                   cachem_1.1.0               
##  [29] cigarillo_1.0.0             jsonlite_2.0.0             
##  [31] shinyFiles_0.9.3            blob_1.2.4                 
##  [33] later_1.4.4                 DelayedArray_0.36.0        
##  [35] BiocParallel_1.44.0         parallel_4.5.2             
##  [37] R6_2.6.1                    bslib_0.9.0                
##  [39] stringi_1.8.7               RColorBrewer_1.1-3         
##  [41] rtracklayer_1.70.0          GenomicRanges_1.62.0       
##  [43] jquerylib_0.1.4             Rcpp_1.1.0                 
##  [45] Seqinfo_1.0.0               bookdown_0.45              
##  [47] SummarizedExperiment_1.40.0 knitr_1.50                 
##  [49] IRanges_2.44.0              httpuv_1.6.16              
##  [51] Matrix_1.7-4                tidyselect_1.2.1           
##  [53] dichromat_2.0-0.1           abind_1.4-8                
##  [55] yaml_2.3.11                 codetools_0.2-20           
##  [57] curl_7.0.0                  lattice_0.22-7             
##  [59] tibble_3.3.0                plyr_1.8.9                 
##  [61] Biobase_2.70.0              shiny_1.11.1               
##  [63] withr_3.0.2                 S7_0.2.1                   
##  [65] evaluate_1.0.5              BiocFileCache_3.0.0        
##  [67] Biostrings_2.78.0           pillar_1.11.1              
##  [69] BiocManager_1.30.27         filelock_1.0.3             
##  [71] MatrixGenerics_1.22.0       stats4_4.5.2               
##  [73] generics_0.1.4              RCurl_1.98-1.17            
##  [75] S4Vectors_0.48.0            scales_1.4.0               
##  [77] xtable_1.8-4                glue_1.8.0                 
##  [79] lazyeval_0.2.2              tools_4.5.2                
##  [81] BiocIO_1.20.0               data.table_1.17.8          
##  [83] GenomicAlignments_1.46.0    fs_1.6.6                   
##  [85] XML_3.99-0.20               grid_4.5.2                 
##  [87] tidyr_1.3.1                 crosstalk_1.2.2            
##  [89] restfulr_0.0.16             cli_3.6.5                  
##  [91] rappdirs_0.3.3              S4Arrays_1.10.1            
##  [93] viridisLite_0.4.2           dplyr_1.1.4                
##  [95] gtable_0.3.6                sass_0.4.10                
##  [97] digest_0.6.39               BiocGenerics_0.56.0        
##  [99] SparseArray_1.10.4          rjson_0.2.23               
## [101] htmlwidgets_1.6.4           farver_2.1.2               
## [103] memoise_2.0.1               htmltools_0.5.8.1          
## [105] lifecycle_1.0.4             httr_1.4.7                 
## [107] mime_0.13                   bit64_4.6.0-1              
## [109] MASS_7.3-65