make-data.R
The package contains only a subset of the most important data generated over a period of five years. To give an overview, all annotated samples (S) and workunits (W) in the B-Fabric system, Türker et al. (2010), are graphed in the timeline plots.
the NGS data p1644
the mass spec data p1875
make-data.R
NL42_100K.fastq.gz
The sample NGS data contain 100K merged MiSeq reads in FASTQ format that demonstrate the linkage between nanobodies (NB) and flycodes (FC).
NL42_100K <- NestLink:::.getReadsFromFastq("inst/extdata/NL42_100K.fastq.gz")
save(NL42_100K, file="inst/extdata/NestLink_NL42_100K.RData")
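For readers unfamiliar with the FASTQ layout, the following base R sketch reads a gzipped FASTQ file and extracts the read sequences. It is only an illustration: the toy file and the helper `read_fastq_sequences` are hypothetical, and the package itself relies on the internal `NestLink:::.getReadsFromFastq`, whose return type may differ.

```r
# Hypothetical helper: return the sequence line of every 4-line FASTQ record.
read_fastq_sequences <- function(path) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  lines <- readLines(con)
  if (length(lines) < 4) return(character(0))
  # Each record is: @id, sequence, '+', qualities; keep line 2 of each record.
  lines[seq(2, length(lines), by = 4)]
}

# Toy input (made-up reads, not taken from NL42_100K.fastq.gz).
toy <- tempfile(fileext = ".fastq.gz")
con <- gzfile(toy, open = "wt")
writeLines(c("@read1", "ACGT", "+", "IIII",
             "@read2", "TTGCA", "+", "IIIII"), con)
close(con)

toy_reads <- read_fastq_sequences(toy)
print(toy_reads)
```

The sketch ignores quality strings and multi-line records, which is sufficient for inspecting a file but not for a real analysis.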
knownNB.txt
An optional part of the NestLink workflow is the use of known nanobodies in the sequencing experiment to estimate sensitivity and specificity levels. This example file contains nucleotide sequences of nanobodies that should be detectable in this experiment. Later in the workflow, these nanobodies are highlighted and labeled as known NB.
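As a rough illustration of how known nanobodies can serve as a control, the sketch below computes the fraction of known sequences that were recovered. This is an assumption about one simple reading of "sensitivity", not the procedure used by the package; all sequences and variable names are hypothetical.

```r
# Hypothetical amino-acid sequences; not taken from knownNB.txt.
known_nb    <- c(NB1 = "QVQLVESGG", NB2 = "EVQLQASGG", NB3 = "QLQLVETGG")
detected_nb <- c("QVQLVESGG", "QLQLVETGG", "AVQLVDSGG")

# Which of the known nanobodies were recovered by the analysis?
recovered <- known_nb %in% detected_nb
names(recovered) <- names(known_nb)

# Sensitivity in this sketch: fraction of known nanobodies that were detected.
sensitivity <- mean(recovered)
print(recovered)
print(sensitivity)
```

Estimating specificity would additionally require sequences known to be absent from the sample, which this toy example does not model.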
nanobodyFlycodeLinkage.RData
NGS ground truth derived by applying the function runNGSAnalysis to the two previous files.
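library(ExperimentHub)
library(NestLink)
library(testthat)  # provides expect_true()
eh <- ExperimentHub()
expFile <- query(eh, c("NestLink", "NL42_100K.fastq.gz"))[[1]]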
expect_true(file.exists(expFile))
scratchFolder <- tempdir()
setwd(scratchFolder)
knownNB_File <- query(eh, c("NestLink", "knownNB.txt"))[[1]]
knownNB_data <- read.table(knownNB_File,
    sep = '\t',
    header = TRUE,
    row.names = 1,
    stringsAsFactors = FALSE)
knownNB <- Biostrings::translate(DNAStringSet(knownNB_data$Sequence))
names(knownNB) <- rownames(knownNB_data)
knownNB <- sapply(knownNB, toString)
param <- list()
param[['NB_Linker1']] <- "GGCCggcggGGCC"
param[['NB_Linker2']] <- "GCAGGAGGA"
param[['ProteaseSite']] <- "TTAGTCCCAAGA"
param[['FC_Linker']] <- "GGCCaaggaggcCGG"
param[['knownNB']] <- knownNB
param[['nReads']] <- 100
param[['minRelBestHitFreq']] <- 0.8
param[['minConsensusScore']] <- 0.9
param[['maxMismatch']] <- 1
param[['minNanobodyLength']] <- 348
param[['minFlycodeLength']] <- 33
param[['FCminFreq']] <- 1
nanobodyFlycodeLinkage.RData <- runNGSAnalysis(file = expFile[1], param)
NB.tryptic and FC.tryptic
Both files are the output of the previous NGS step generating the linkage between NBs and FCs.
The files are used to demonstrate the detectability of the AA sequences.
The wrapper functions extend the tables with the SSRC prediction and the parent ion mass (pim), both determined using protViz.
The column ESP_Prediction was generated using the service at https://genepattern.broadinstitute.org; see also Fusaro et al. (2009).
library(NestLink)
NB <- getNB()
FC <- getFC()
The first rows of each table are shown below:
peptide | ESP_Prediction | cond | pim | ssrc | peptideLength |
---|---|---|---|---|---|
AAAGITYYADSVK | 0.82378 | NB | 1329.6685 | 21.93845 | 13 |
AACCPVAR | 0.39342 | NB | 904.4127 | 5.56465 | 8 |
AADPGSWGQGTPVTVSSELK | 0.64844 | NB | 1986.9767 | 26.10345 | 20 |
AADYYYGMNHWGK | 0.15954 | NB | 1575.6685 | 24.80345 | 13 |
AANPFGLVQGFGSWGK | 0.44514 | NB | 1635.8278 | 40.19691 | 16 |
AAPDYWGQGTPVTVSSELK | 0.39622 | NB | 2005.9865 | 31.76845 | 19 |
row | peptide | ESP_Prediction | cond | pim | ssrc | peptideLength |
---|---|---|---|---|---|---|
120 | GSAAAAADSWLTVR | 0.75450 | FC | 1375.696 | 27.80445 | 14 |
121 | GSAAAAATDWLTVR | 0.76422 | FC | 1389.712 | 29.00445 | 14 |
122 | GSAAAAATGWLTVR | 0.65522 | FC | 1331.707 | 28.60445 | 14 |
123 | GSAAAAATVWLR | 0.65496 | FC | 1173.637 | 29.10445 | 12 |
124 | GSAAAAAYEWLTVR | 0.72754 | FC | 1465.743 | 33.10445 | 14 |
125 | GSAAAADAAWQEGGR | 0.53588 | FC | 1417.645 | 11.70445 | 15 |
F255744.RData and WU160118.RData
The mass spec files below are available through ProteomeXchange PXD009301.
The mass spectra were assigned to peptide sequences using Matrix Science’s Mascot Server, Perkins et al. (1999), version 2.5, with the most important parameters listed in the table below.
Parameter | Value |
---|---|
COM | 170819_MS1708116_NL5idx4to5_Competition2BG_db8_db10_swissprot_d_merge |
FASTA 1 | p1875_db8_20160704.fasta |
FASTA 2 | p1875_db10_20170817.fasta |
TOL | 10 |
TOLU | ppm |
ITOL | 0.6 |
ITOLU | Da |
USERNAME | egloffp |
CHARGE | 2+ |
IT_MODS | Deamidated (NQ),Oxidation (M) |
INSTRUMENT | ESI-TRAP |
release | fgcz_swissprot_d_20140403.fasta |
The results were exported as XML.
The XML was parsed and exported as a data.frame using the protViz, Panse and Grossmann (2019), function protViz:::as.data.frame.mascot.
The above-described results and workflows are available for registered users in B-Fabric. However, it is not necessary to access B-Fabric in order to use this package.
Go to http://fgcz-bfabric.uzh.ch
Search for workunit id 160118
Download the resource with id 444589
The following code snippet was executed to generate the data set shipped with the NestLink package.
Here only the metadata were extracted (no MS2).
load("~/Downloads/444589.RData")
library(protViz)
library(NestLink)
# Convert each Mascot result object to a data.frame and tag it with its file name.
WU160118 <- do.call('rbind', lapply(list("F255737", "F255744", "F255747",
    "F255749", "F255751", "F255760", "F255761", "F255762"),
    function(datfilename){
        # called via ::: because the method may not be exported by protViz
        df <- protViz:::as.data.frame.mascot(get(datfilename))
        df$datfilename <- datfilename
        df
    }
))
save(WU160118, file = "../inst/extdata/WU160118.RData",
    compress = TRUE, compression_level = 9)
The data ships with the NestLink package and can be browsed using the following code snippet:
library(ExperimentHub)
eh <- ExperimentHub();
load(query(eh, c("NestLink", "WU160118.RData"))[[1]])
class(WU160118)
## [1] "data.frame"
PATTERN <- "^GS[ASTNQDEFVLYWGP]{7}(WR|WLTVR|WQEGGR|WLR|WQSR)$"
idx <- grepl(PATTERN, WU160118$pep_seq)
WU <- WU160118[idx & WU160118$pep_score > 25,]
x |
---|
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_02_IMACelution.raw” |
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_03_IMACelution.raw” |
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_05_HiLoadElution.raw” |
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_04_HiLoadElution.raw” |
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_08_MaxBindingBG.raw” |
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_07_MaxBindingBG.raw” |
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_09_MaxBinding.raw” |
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_10_MaxBinding.raw” |
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_12_Competition1.raw” |
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_13_Competition1.raw” |
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_14_Competition1BG.raw” |
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_15_Competition1BG.raw” |
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_17_Competition2.raw” |
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_18_Competition2.raw” |
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_19_Competition2BG.raw” |
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_20_Competition2BG.raw” |
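The flycode filter from the snippet above can be tried on a small table without downloading any data. In this sketch the regular expression is copied from the code above and the peptide sequences come from the NB and FC tables; the `pep_score` values and the names `psm`, `is_flycode`, and `kept` are made up for illustration.

```r
# Same regular expression as in the filtering snippet above.
PATTERN <- "^GS[ASTNQDEFVLYWGP]{7}(WR|WLTVR|WQEGGR|WLR|WQSR)$"

# Toy PSM table; the scores are invented for this example.
psm <- data.frame(
  pep_seq   = c("GSAAAAADSWLTVR", "GSAAAAATVWLR", "GSAAAADAAWQEGGR", "AACCPVAR"),
  pep_score = c(40, 12, 31, 55),
  stringsAsFactors = FALSE
)

# A flycode starts with GS, has seven variable residues, and ends in one of
# the five allowed C-terminal tags.
is_flycode <- grepl(PATTERN, psm$pep_seq)

# Keep flycode matches with a score above 25, as in the original filter.
kept <- psm[is_flycode & psm$pep_score > 25, ]
print(is_flycode)
print(kept$pep_seq)
```

The nanobody peptide is rejected by the pattern regardless of its score, and the low-scoring flycode is rejected by the score threshold.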
PGexport2_normalizedAgainstSBstandards_Peptides.csv contains mass spectrometry based label free quantitative (LFQ) results of nanobodies expressed in SMEG and COLI species.
Workunit : 158716 - QEXACTIVEHF_1
20170919_16_62465_nl5idx1-3_6titratecoli.raw
20170919_05_62465_nl5idx1-3_6titratecoli.raw
Workunit : 158717 - QEXACTIVEHF_1
20170919_14_62466_nl5idx1-3_7titratesmeg.raw
20170919_09_62466_nl5idx1-3_7titratesmeg.raw
Two LC-MS/MS runs were aligned in Progenesis QI (Nonlinear Dynamics) with an alignment score of 93.1 %, followed by peak picking with an allowed ion charge of +2 to +5.
#!/bin/bash
aws --profile AnnotationContributor s3 cp NestLink/F255744.RData s3://annotation-contributor/NestLink/F255744.RData --acl public-read
aws --profile AnnotationContributor s3 cp NestLink/WU160118.RData s3://annotation-contributor/NestLink/WU160118.RData --acl public-read
aws --profile AnnotationContributor s3 cp NestLink s3://annotation-contributor/NestLink --recursive --acl public-read
load metadata
fl <- system.file("extdata", "metadata.csv", package='NestLink')
kable(metadata <- read.csv(fl, stringsAsFactors=FALSE))
Title | Description | BiocVersion | Genome | SourceType | SourceUrl | SourceVersion | Species | TaxonomyId | Coordinate_1_based | DataProvider | Maintainer | RDataClass | DispatchClass | RDataPath | Tags | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Sample NGS NB FC linkage data | Sample NGS demonstratig the linkage between nanobodies (NB) and flycodes (FC). data in FASTQ | 3.9 | NA | FASTQ | https://fgcz-bfabric.uzh.ch/bfabric/userlab/show-project.html?id=1644 | Nov 28 2018 | NA | NA | NA | Functional Genomics Center Zurich (FGCZ) | Markus Seeger m.seeger@imm.uzh.ch, Pascal Egloff p.egloff@imm.uzh.ch, Lennart Opitz lopitz@fgcz.ethz.ch | DNAStringSet | FilePath | NestLink/NL42_100K.fastq.gz | NA | md5=4a13c5c61a5b29f4fd8830c1c15419b6; |
Flycodes tryptic digested | Flycodes tryptic digested amino acid sequences with ESP_Prediction score. | 3.9 | NA | TXT | https://fgcz-bfabric.uzh.ch/bfabric/userlab/show-project.html?id=1875 | Nov 28 2018 | NA | NA | NA | Functional Genomics Center Zurich (FGCZ) | Markus Seeger m.seeger@imm.uzh.ch, Pascal Egloff p.egloff@imm.uzh.ch, Christian Panse cp@fgcz.ethz.ch | data.frame | FilePath | NestLink/FC.tryptic | NA | md5=f6faa7458350ce1805bec30e9ffdeaae; |
Nanobodies tryptic digested | Nanobodies tryptic digested amino acid sequences with ESP_Prediction score. | 3.9 | NA | TXT | https://fgcz-bfabric.uzh.ch/bfabric/userlab/show-project.html?id=1875 | Nov 28 2018 | NA | NA | NA | Functional Genomics Center Zurich (FGCZ) | Markus Seeger m.seeger@imm.uzh.ch, Pascal Egloff p.egloff@imm.uzh.ch, Christian Panse cp@fgcz.ethz.ch | data.frame | FilePath | NestLink/NB.tryptic | NA | md5=db85a806c5151113536b710d566d9cf3; |
FASTA as ground-truth for unit testing | FASTA data as ground-truth for unit testing. | 3.9 | NA | RData | https://fgcz-bfabric.uzh.ch/bfabric/userlab/show-project.html?id=1644 | Nov 28 2018 | NA | NA | NA | Functional Genomics Center Zurich (FGCZ) | Markus Seeger m.seeger@imm.uzh.ch, Pascal Egloff p.egloff@imm.uzh.ch, Lennart Opitz lopitz@fgcz.ethz.ch | data.frame | FilePath | NestLink/nanobodyFlycodeLinkage.RData | NA | md5=57b2756fb0ebcf73d4036846580cb5b2; |
Known nanobodies | Known nanobodies as nucleic acid sequences. | 3.9 | NA | TXT | https://fgcz-bfabric.uzh.ch/bfabric/userlab/show-project.html?id=1644 | Nov 28 2018 | NA | NA | NA | Functional Genomics Center Zurich (FGCZ) | Markus Seeger m.seeger@imm.uzh.ch, Pascal Egloff p.egloff@imm.uzh.ch, Lennart Opitz lopitz@fgcz.ethz.ch | data.frame | FilePath | NestLink/knownNB.txt | NA | md5=003bf82c58f0a96a2bd945d171dc907c; |
Quantitaive results for SMEG and COLI | Mass spectrometry based label free quantitative results of nanobodies expressed in SMEG and COLI species. | 3.9 | NA | CSV | https://fgcz-bfabric.uzh.ch/bfabric/userlab/show-project.html?id=1875 | Nov 28 2018 | NA | NA | NA | Functional Genomics Center Zurich (FGCZ) | Markus Seeger m.seeger@imm.uzh.ch, Pascal Egloff p.egloff@imm.uzh.ch, Christian Panse cp@fgcz.ethz.ch | data.frame | FilePath | NestLink/PGexport2_normalizedAgainstSBstandards_Peptides.csv | NA | md5=0ca525d0a65d4938f0cbc785b7e0d2d3; bfabric WU158716, WU158717 |
F255744 Mascot Search result | F255744 peptide spectrum matches (PSMs) of Flycodes. | 3.9 | NA | TXT | https://fgcz-bfabric.uzh.ch/bfabric/userlab/show-resource.html?id=409912 | Dec 13 2018 | NA | NA | NA | Functional Genomics Center Zurich (FGCZ) | Markus Seeger m.seeger@imm.uzh.ch, Pascal Egloff p.egloff@imm.uzh.ch, Christian Panse cp@fgcz.ethz.ch | data.frame | FilePath | NestLink/F255744.RData | NA | md5=d5e4d13e9ecba4231d1808c6bb0bb454; R409912 |
WU160118 Mascot Search results | WU160118 peptide spectrum matches (PSMs) Flycodes. | 3.9 | NA | TXT | https://fgcz-bfabric.uzh.ch/bfabric/userlab/show-workunit.html?id=160118 | Dec 13 2018 | NA | NA | NA | Functional Genomics Center Zurich (FGCZ) | Markus Seeger m.seeger@imm.uzh.ch, Pascal Egloff p.egloff@imm.uzh.ch, Christian Panse cp@fgcz.ethz.ch | data.frame | FilePath | NestLink/WU160118.RData | NA | md5=a17f4505e322d440bc0e9edf8e5277bb; bfabric WU160118 |
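The Notes column of the metadata table carries an MD5 digest for each resource. A minimal sketch of how such a digest could be extracted and compared against a downloaded file is given below. The helpers `extract_md5` and `verify_md5` are hypothetical and not part of NestLink; the demonstration uses a temporary file because the real data files are not assumed to be present.

```r
# Hypothetical helper: pull the 32-character hex digest out of a Notes string.
extract_md5 <- function(notes) {
  m <- regmatches(notes, regexpr("md5=[0-9a-f]{32}", notes))
  sub("^md5=", "", m)
}

# Hypothetical helper: compare a file's checksum with the expected digest.
verify_md5 <- function(path, expected) {
  identical(unname(tools::md5sum(path)), expected)
}

# Notes entry copied from the metadata table above.
notes <- "md5=0ca525d0a65d4938f0cbc785b7e0d2d3; bfabric WU158716, WU158717"
expected <- extract_md5(notes)
print(expected)

# Demonstration on a temporary file with arbitrary content.
tmp <- tempfile()
writeLines("toy content", tmp)
ok  <- verify_md5(tmp, unname(tools::md5sum(tmp)))  # digest matches itself
bad <- verify_md5(tmp, expected)                    # unrelated digest
print(c(ok = ok, bad = bad))
```

For a real check, `path` would be the local file returned by ExperimentHub and `expected` the digest taken from the matching metadata row.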
query and load NestLink package data from aws s3
library(ExperimentHub)
eh <- ExperimentHub();
query(eh, "NestLink")
## ExperimentHub with 8 records
## # snapshotDate(): 2021-05-05
## # $dataprovider: Functional Genomics Center Zurich (FGCZ)
## # $species: NA
## # $rdataclass: data.frame, DNAStringSet
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## # rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["EH2063"]]'
##
## title
## EH2063 | Sample NGS NB FC linkage data
## EH2064 | Flycodes tryptic digested
## EH2065 | Nanobodies tryptic digested
## EH2066 | FASTA as ground-truth for unit testing
## EH2067 | Known nanobodies
## EH2068 | Quantitaive results for SMEG and COLI
## EH2069 | F255744 Mascot Search result
## EH2070 | WU160118 Mascot Search results
load(query(eh, c("NestLink", "F255744.RData"))[[1]])
dim(F255744)
## [1] 15655 21
load(query(eh, c("NestLink", "WU160118.RData"))[[1]])
dim(WU160118)
## [1] 128390 22
Here is the compiled output of sessionInfo():
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] knitr_1.33 scales_1.1.1
## [3] ggplot2_3.3.3 NestLink_1.8.0
## [5] ShortRead_1.50.0 GenomicAlignments_1.28.0
## [7] SummarizedExperiment_1.22.0 Biobase_2.52.0
## [9] MatrixGenerics_1.4.0 matrixStats_0.58.0
## [11] Rsamtools_2.8.0 GenomicRanges_1.44.0
## [13] BiocParallel_1.26.0 protViz_0.6.8
## [15] gplots_3.1.1 Biostrings_2.60.0
## [17] GenomeInfoDb_1.28.0 XVector_0.32.0
## [19] IRanges_2.26.0 S4Vectors_0.30.0
## [21] ExperimentHub_2.0.0 AnnotationHub_3.0.0
## [23] BiocFileCache_2.0.0 dbplyr_2.1.1
## [25] BiocGenerics_0.38.0 BiocStyle_2.20.0
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-152 bitops_1.0-7
## [3] bit64_4.0.5 RColorBrewer_1.1-2
## [5] filelock_1.0.2 httr_1.4.2
## [7] tools_4.1.0 bslib_0.2.5.1
## [9] utf8_1.2.1 R6_2.5.0
## [11] KernSmooth_2.23-20 mgcv_1.8-35
## [13] colorspace_2.0-1 DBI_1.1.1
## [15] withr_2.4.2 tidyselect_1.1.1
## [17] bit_4.0.4 curl_4.3.1
## [19] compiler_4.1.0 DelayedArray_0.18.0
## [21] labeling_0.4.2 bookdown_0.22
## [23] sass_0.4.0 caTools_1.18.2
## [25] rappdirs_0.3.3 stringr_1.4.0
## [27] digest_0.6.27 rmarkdown_2.8
## [29] jpeg_0.1-8.1 pkgconfig_2.0.3
## [31] htmltools_0.5.1.1 highr_0.9
## [33] fastmap_1.1.0 rlang_0.4.11
## [35] RSQLite_2.2.7 shiny_1.6.0
## [37] farver_2.1.0 jquerylib_0.1.4
## [39] generics_0.1.0 hwriter_1.3.2
## [41] jsonlite_1.7.2 gtools_3.8.2
## [43] dplyr_1.0.6 RCurl_1.98-1.3
## [45] magrittr_2.0.1 GenomeInfoDbData_1.2.6
## [47] Matrix_1.3-3 munsell_0.5.0
## [49] Rcpp_1.0.6 fansi_0.4.2
## [51] lifecycle_1.0.0 stringi_1.6.2
## [53] yaml_2.2.1 zlibbioc_1.38.0
## [55] grid_4.1.0 blob_1.2.1
## [57] promises_1.2.0.1 crayon_1.4.1
## [59] lattice_0.20-44 splines_4.1.0
## [61] KEGGREST_1.32.0 magick_2.7.2
## [63] pillar_1.6.1 codetools_0.2-18
## [65] glue_1.4.2 BiocVersion_3.13.1
## [67] evaluate_0.14 latticeExtra_0.6-29
## [69] BiocManager_1.30.15 png_0.1-7
## [71] vctrs_0.3.8 httpuv_1.6.1
## [73] gtable_0.3.0 purrr_0.3.4
## [75] assertthat_0.2.1 cachem_1.0.5
## [77] xfun_0.23 mime_0.10
## [79] xtable_1.8-4 later_1.2.0
## [81] tibble_3.1.2 AnnotationDbi_1.54.0
## [83] memoise_2.0.0 ellipsis_0.3.2
## [85] interactiveDisplayBase_1.30.0
Egloff, Pascal, Iwan Zimmermann, Fabian M. Arnold, Cedric A.J. Hutter, Damien Morger, Lennart Opitz, Lucy Poveda, et al. 2018. “Engineered Peptide Barcodes for In-Depth Analyses of Binding Protein Ensembles.” bioRxiv. https://doi.org/10.1101/287813.
Fusaro, V. A., D. R. Mani, J. P. Mesirov, and S. A. Carr. 2009. “Prediction of high-responding peptides for targeted protein assays by mass spectrometry.” Nat. Biotechnol. 27 (2): 190–98.
Panse, Christian, and Jonas Grossmann. 2019. protViz: Visualizing and Analyzing Mass Spectrometry Related Data in Proteomics. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org.
Perkins, David N., Darryl J. C. Pappin, David M. Creasy, and John S. Cottrell. 1999. “Probability-Based Protein Identification by Searching Sequence Databases Using Mass Spectrometry Data.” Electrophoresis 20 (18): 3551–67. https://doi.org/10.1002/(sici)1522-2683(19991201)20:18<3551::aid-elps3551>3.0.co;2-2.
Türker, Can, Fuat Akal, Dieter Joho, Christian Panse, Simon Barkow-Oesterreicher, Hubert Rehrauer, and Ralph Schlapbach. 2010. “B-Fabric: The Swiss Army Knife for Life Sciences.” In Proceedings of the 13th International Conference on Extending Database Technology - EDBT 10. ACM Press. https://doi.org/10.1145/1739041.1739135.