1 Introduction

Most proteomics experiments require protein (peptide) separation and cleavage before the molecules can be analyzed or identified by mass spectrometry or other analytical tools.

cleaver performs in-silico cleavage of polypeptide sequences, for example to create theoretical mass spectrometry data.

The cleavage rules are taken from the ExPASy PeptideCutter tool (Gasteiger et al. 2005).

2 Simple Usage

Loading the cleaver package:

library("cleaver")

Getting help and listing all available cleavage rules:

help("cleave")

Cleaving of Gastric juice peptide 1 (P01358) using Trypsin:

## cleave it
cleave("LAAGKVEDSD", enzym="trypsin")
## [1] "LAAGK" "VEDSD"
## get the cleavage ranges
cleavageRanges("LAAGKVEDSD", enzym="trypsin")
##      start end
## [1,]     1   5
## [2,]     6  10
## get only cleavage sites
cleavageSites("LAAGKVEDSD", enzym="trypsin")
## [1] 5
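
The three outputs above are different views of the same information: the sites determine the ranges, and the ranges determine the fragments. The following base-R sketch illustrates this relation. It does not reproduce cleaver's internals; `simpleTrypsinSites` and `sitesToRanges` are hypothetical helpers, and the regular expression is a simplified trypsin rule (cleave after K or R unless followed by P) that ignores the exceptions listed by PeptideCutter.

```r
## simplified trypsin rule (assumption): a site is a K or R that is
## followed by any residue other than P
simpleTrypsinSites <- function(x) {
  pos <- gregexpr("[KR](?=[^P])", x, perl=TRUE)[[1]]
  ## gregexpr returns -1 if there is no match
  as.integer(pos[pos > 0])
}

## turn cleavage sites into start/end ranges for a sequence of length n
sitesToRanges <- function(sites, n) {
  cbind(start=c(1L, sites + 1L), end=c(sites, n))
}

x <- "LAAGKVEDSD"
sites <- simpleTrypsinSites(x)
ranges <- sitesToRanges(sites, nchar(x))
fragments <- substring(x, ranges[, "start"], ranges[, "end"])
```

For this sequence the sketch yields the same site, ranges and fragments as the cleaver calls above; for other sequences the simplified rule may differ from cleaver's result.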

Sometimes cleavage is not perfect and the enzyme misses some cleavage positions:

## miss one cleavage position
cleave("LAAGKVEDSD", enzym="trypsin", missedCleavages=1)
## [1] "LAAGKVEDSD"
cleavageRanges("LAAGKVEDSD", enzym="trypsin", missedCleavages=1)
##      start end
## [1,]     1  10
## miss zero or one cleavage positions
cleave("LAAGKVEDSD", enzym="trypsin", missedCleavages=0:1)
## [1] "LAAGK"      "VEDSD"      "LAAGKVEDSD"
cleavageRanges("LAAGKVEDSD", enzym="trypsin", missedCleavages=0:1)
##      start end
## [1,]     1   5
## [2,]     6  10
## [3,]     1  10
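
One way to picture missed cleavages is as merging adjacent fragments: with m missed positions, each resulting range spans m + 1 neighbouring fragments. The base-R sketch below shows this idea. `missedRanges` is a hypothetical helper and not part of cleaver, and returning an empty matrix when m exceeds the number of available sites is an assumption of this sketch.

```r
## ranges obtained when exactly `missed` consecutive sites are skipped
missedRanges <- function(sites, n, missed) {
  start <- c(1L, sites + 1L)
  end <- c(sites, n)
  k <- length(start)
  ## assumption: nothing is returned if more sites are missed than exist
  if (missed >= k) {
    return(cbind(start=integer(), end=integer()))
  }
  i <- seq_len(k - missed)
  cbind(start=start[i], end=end[i + missed])
}

## zero or one missed cleavage for a 10-residue sequence with a site at 5
r <- do.call(rbind, lapply(0:1, function(m) missedRanges(5L, 10L, m)))
```

Here `r` holds the ranges 1-5, 6-10 and 1-10, in the same order as the `missedCleavages=0:1` output above.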

Combining cleaver with Biostrings (Pages et al., n.d.):

## create AAStringSet object
p <- AAStringSet(c(gaju="LAAGKVEDSD", pnm="AGEPKLDAGV"))

## cleave it
cleave(p, enzym="trypsin")
## AAStringSetList of length 2
## [["gaju"]] LAAGK VEDSD
## [["pnm"]] AGEPK LDAGV
cleavageRanges(p, enzym="trypsin")
## IRangesList object of length 2:
## $gaju
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         1         5         5
##   [2]         6        10         5
## $pnm
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         1         5         5
##   [2]         6        10         5
cleavageSites(p, enzym="trypsin")
## $gaju
## [1] 5
## $pnm
## [1] 5

3 Insulin & Somatostatin Example

Downloading Insulin (P01308) and Somatostatin (P61278) sequences from the UniProt (The UniProt Consortium 2012) database using UniProt.ws (Carlson, n.d.).

## load UniProt.ws library
library("UniProt.ws")

## select species Homo sapiens
up <- UniProt.ws(taxId=9606)

## download sequences of Insulin/Somatostatin
s <- select(up,
    keys=c("P01308", "P61278"),
    columns="sequence",
    keytype="UniProtKB")

## fetch only sequences
sequences <- setNames(s$Sequence, s$Entry)

## remove whitespace
sequences <- gsub(pattern="[[:space:]]", replacement="", x=sequences)

Cleaving using Pepsin:

cleave(sequences, enzym="pepsin")
## $P01308
##  [1] "MA"              "L"               "W"               "MRLLP"          
##  [5] "LL"              "A"               "WGPDPAAA"        "F"              
##  [9] "VNQH"            "CGSH"            "VEA"             "Y"              
## [13] "VCGERG"          "FF"              "YTPKTRREAED"     "QVGQVE"         
## [17] "GGGPGAGS"        "LQP"             "LA"              "EGS"            
## [21] "QKRGIVEQCCTSICS" "Q"               "EN"              "CN"             
## $P61278
##  [1] "ML"                    "SCRL"                  "QCA"                  
##  [4] "L"                     "AA"                    "SIV"                  
##  [7] "A"                     "GCVTGAPSDPRL"          "RQ"                   
## [10] "FL"                    "QKS"                   "LAAAAGKQEL"           
## [13] "AK"                    "Y"                     "AE"                   
## [16] "SEPNQTENDA"            "LEPED"                 "SQAAEQDEMRL"          
## [19] "EL"                    "QRSANSNPAMAPRERKAGCKN" "FF"                   
## [22] "W"                     "KT"                    "FTSC"

4 Isotopic Distribution Of Tryptically Digested Insulin

A common use case of in-silico cleavage is the calculation of the isotopic distribution of peptides that were enzymatically digested in the in-vitro experimental workflow. Here BRAIN (Claesen et al. 2012; Dittwald et al. 2013) is used to calculate the isotopic distribution of cleaver’s output. Please note that this is only a toy example; for instance, the relative intensities between peptides are not correct.

## load BRAIN library
library("BRAIN")

## cleave insulin
cleavedInsulin <- cleave(sequences[1], enzym="trypsin")[[1]]

## create empty plot area
plot(NA, xlim=c(150, 4300), ylim=c(0, 1),
     xlab="mass", ylab="relative intensity",
     main="tryptic digested insulin - isotopic distribution")

## loop through peptides
for (i in seq_along(cleavedInsulin)) {
  ## count C, H, N, O, S atoms in current peptide
  atoms <- BRAIN::getAtomsFromSeq(cleavedInsulin[[i]])
  ## calculate isotopic distribution
  d <- useBRAIN(atoms)
  ## draw peaks
  lines(d$masses, d$isoDistr, type="h", col=2)
}
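
When only a single theoretical mass per peptide is needed, rather than the full isotopic distribution, the monoisotopic mass can be approximated in base R by summing residue masses and adding one water. The sketch below is an illustration and not part of cleaver or BRAIN: `residueMass`, `waterMass` and `monoisotopicMass` are hypothetical names, the values are standard monoisotopic residue masses rounded to five decimals, and modifications and disulfide bonds are ignored.

```r
## monoisotopic residue masses (Da) of the 20 standard amino acids
residueMass <- c(
  G=57.02146, A=71.03711, S=87.03203, P=97.05276, V=99.06841,
  T=101.04768, C=103.00919, L=113.08406, I=113.08406, N=114.04293,
  D=115.02694, Q=128.05858, K=128.09496, E=129.04259, M=131.04049,
  H=137.05891, F=147.06841, R=156.10111, Y=163.06333, W=186.07931)
## one water is added for the free N- and C-terminus
waterMass <- 18.01056

## sum of residue masses plus water for each peptide;
## unknown residue letters result in NA
monoisotopicMass <- function(peptides) {
  vapply(strsplit(peptides, "", fixed=TRUE),
         function(aa) sum(residueMass[aa]) + waterMass,
         numeric(1))
}

masses <- monoisotopicMass(c("LAAGK", "VEDSD"))
```

Applied to the output of `cleave`, for example `monoisotopicMass(as.character(cleavedInsulin))`, this would give one mass per tryptic peptide, assuming the sequences contain only the 20 standard residues.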

5 Session Information

## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## Matrix products: default
## BLAS:   /media/volume/teran2_disk/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## time zone: America/New_York
## tzcode source: system (glibc)
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## other attached packages:
##  [1] BRAIN_1.51.0        lattice_0.22-6      PolynomF_2.0-8     
##  [4] UniProt.ws_2.45.1   RSQLite_2.3.7       cleaver_1.43.0     
##  [7] Biostrings_2.73.2   GenomeInfoDb_1.41.2 XVector_0.45.0     
## [10] IRanges_2.39.2      S4Vectors_0.43.2    BiocGenerics_0.51.3
## [13] BiocStyle_2.33.1   
## loaded via a namespace (and not attached):
##  [1] KEGGREST_1.45.1         xfun_0.48               bslib_0.8.0            
##  [4] Biobase_2.65.1          rjsoncons_1.3.1         vctrs_0.6.5            
##  [7] tools_4.4.1             generics_0.1.3          curl_5.2.3             
## [10] tibble_3.2.1            fansi_1.0.6             AnnotationDbi_1.67.0   
## [13] highr_0.11              blob_1.2.4              pkgconfig_2.0.3        
## [16] BiocBaseUtils_1.7.3     dbplyr_2.5.0            lifecycle_1.0.4        
## [19] GenomeInfoDbData_1.2.13 compiler_4.4.1          progress_1.2.3         
## [22] tinytex_0.53            htmltools_0.5.8.1       sass_0.4.9             
## [25] yaml_2.3.10             pillar_1.9.0            crayon_1.5.3           
## [28] jquerylib_0.1.4         cachem_1.1.0            magick_2.8.5           
## [31] tidyselect_1.2.1        digest_0.6.37           dplyr_1.1.4            
## [34] bookdown_0.41           grid_4.4.1              fastmap_1.2.0          
## [37] cli_3.6.3               magrittr_2.0.3          utf8_1.2.4             
## [40] httpcache_1.2.0         prettyunits_1.2.0       filelock_1.0.3         
## [43] UCSC.utils_1.1.0        bit64_4.5.2             rmarkdown_2.28         
## [46] httr_1.4.7              bit_4.5.0               png_0.1-8              
## [49] hms_1.1.3               memoise_2.0.1           evaluate_1.0.1         
## [52] knitr_1.48              BiocFileCache_2.13.2    rlang_1.1.4            
## [55] Rcpp_1.0.13             glue_1.8.0              DBI_1.2.3              
## [58] BiocManager_1.30.25     jsonlite_1.8.9          R6_2.5.1               
## [61] zlibbioc_1.51.2


