Reproducing results of a publication. Systems biology of vaccination for seasonal influenza in humans. Nakaya et al (2011).
The paper uses system biology approach to study immune response to vaccination against influenza in three seasons.
Two cohorts vaccinated with TIV (2007 & 2008). And one with LAIV (2008).
The data was made available on GEO after publication. The SuperSeries is referenced in the paper.
library(GEOquery)
gse <- getGEO("GSE29619")
The returned object is a list
of 3 ExpressionSet
s. One per SubSeries/cohort.
gse
## $`GSE29619-GPL13158_series_matrix.txt.gz`
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 54715 features, 163 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM733843 GSM733844 ... GSM734021 (163 total)
## varLabels: title geo_accession ... data_row_count (43 total)
## varMetadata: labelDescription
## featureData
## featureNames: 1007_PM_s_at 1053_PM_at ... AFFX-TrpnX-M_at (54715
## total)
## fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16
## total)
## fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: GPL13158
##
## $`GSE29619-GPL3921_series_matrix.txt.gz`
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22277 features, 84 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM734022 GSM734023 ... GSM734105 (84 total)
## varLabels: title geo_accession ... data_row_count (38 total)
## varMetadata: labelDescription
## featureData
## featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (22277
## total)
## fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16
## total)
## fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: GPL3921
##
## $`GSE29619-GPL570_series_matrix.txt.gz`
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 54675 features, 27 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM733816 GSM733817 ... GSM733842 (27 total)
## varLabels: title geo_accession ... data_row_count (43 total)
## varMetadata: labelDescription
## featureData
## featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (54675
## total)
## fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16
## total)
## fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: GPL570
The seasons cannot be identified without looking in the objects.
names(gse)
## [1] "GSE29619-GPL13158_series_matrix.txt.gz"
## [2] "GSE29619-GPL3921_series_matrix.txt.gz"
## [3] "GSE29619-GPL570_series_matrix.txt.gz"
Let’s look at the metadata available on GEO
library(Biobase)
es_LAIV <- gse[[1]]
head(pData(es_LAIV), 3)
## title geo_accession
## GSM733843 2008 LAIV subject ID 1 at D0 post-vaccination GSM733843
## GSM733844 2008 LAIV subject ID 1 at D3 post-vaccination GSM733844
## GSM733845 2008 LAIV subject ID 1 at D7 post-vaccination GSM733845
## status submission_date last_update_date type
## GSM733843 Public on Jul 10 2011 May 29 2011 Jul 10 2011 RNA
## GSM733844 Public on Jul 10 2011 May 29 2011 Jul 10 2011 RNA
## GSM733845 Public on Jul 10 2011 May 29 2011 Jul 10 2011 RNA
## channel_count
## GSM733843 1
## GSM733844 1
## GSM733845 1
## source_name_ch1
## GSM733843 Peripheral blood mononuclear cells of subjects vaccinated with Influenza vaccine at day 0 (before vaccination), and days 3 and 7 post-vaccination.
## GSM733844 Peripheral blood mononuclear cells of subjects vaccinated with Influenza vaccine at day 0 (before vaccination), and days 3 and 7 post-vaccination.
## GSM733845 Peripheral blood mononuclear cells of subjects vaccinated with Influenza vaccine at day 0 (before vaccination), and days 3 and 7 post-vaccination.
## organism_ch1 characteristics_ch1 characteristics_ch1.1
## GSM733843 Homo sapiens cell type: PBMC subject id: 1
## GSM733844 Homo sapiens cell type: PBMC subject id: 1
## GSM733845 Homo sapiens cell type: PBMC subject id: 1
## characteristics_ch1.2 characteristics_ch1.3
## GSM733843 time point: D0 vaccine: LAIV (FluMist, MedImmune)
## GSM733844 time point: D3 vaccine: LAIV (FluMist, MedImmune)
## GSM733845 time point: D7 vaccine: LAIV (FluMist, MedImmune)
## characteristics_ch1.4
## GSM733843 hai titer (day 0) - a/south dakota/6/2007 (h1n1): 40
## GSM733844 hai titer (day 0) - a/south dakota/6/2007 (h1n1): 40
## GSM733845 hai titer (day 0) - a/south dakota/6/2007 (h1n1): 40
## characteristics_ch1.5
## GSM733843 hai titer (day 28) - a/south dakota/6/2007 (h1n1): 40
## GSM733844 hai titer (day 28) - a/south dakota/6/2007 (h1n1): 40
## GSM733845 hai titer (day 28) - a/south dakota/6/2007 (h1n1): 40
## characteristics_ch1.6
## GSM733843 hai titer (day 0) - a/uruguay/716/2007 nymc x-175c (h3n2): 40
## GSM733844 hai titer (day 0) - a/uruguay/716/2007 nymc x-175c (h3n2): 40
## GSM733845 hai titer (day 0) - a/uruguay/716/2007 nymc x-175c (h3n2): 40
## characteristics_ch1.7
## GSM733843 hai titer (day 28) - a/uruguay/716/2007 nymc x-175c (h3n2): 40
## GSM733844 hai titer (day 28) - a/uruguay/716/2007 nymc x-175c (h3n2): 40
## GSM733845 hai titer (day 28) - a/uruguay/716/2007 nymc x-175c (h3n2): 40
## characteristics_ch1.8
## GSM733843 hai titer (day 0) - b/florida 4/2006: 20
## GSM733844 hai titer (day 0) - b/florida 4/2006: 20
## GSM733845 hai titer (day 0) - b/florida 4/2006: 20
## characteristics_ch1.9 molecule_ch1
## GSM733843 hai titer (day 28) - b/florida 4/2006: 40 total RNA
## GSM733844 hai titer (day 28) - b/florida 4/2006: 40 total RNA
## GSM733845 hai titer (day 28) - b/florida 4/2006: 40 total RNA
## extract_protocol_ch1
## GSM733843 Following PMBC isolation from CPT, 2 x 10^6 cells were lysed in 1 ml of TRIzol and stored at -80C (Cat# 15596-026; Invitrogen Life Technologies). After all time points were collected for a subject, the samples were thawed, and the RNA isolation proceeded according to the manufacturer’s protocol.
## GSM733844 Following PMBC isolation from CPT, 2 x 10^6 cells were lysed in 1 ml of TRIzol and stored at -80C (Cat# 15596-026; Invitrogen Life Technologies). After all time points were collected for a subject, the samples were thawed, and the RNA isolation proceeded according to the manufacturer’s protocol.
## GSM733845 Following PMBC isolation from CPT, 2 x 10^6 cells were lysed in 1 ml of TRIzol and stored at -80C (Cat# 15596-026; Invitrogen Life Technologies). After all time points were collected for a subject, the samples were thawed, and the RNA isolation proceeded according to the manufacturer’s protocol.
## label_ch1
## GSM733843 biotin
## GSM733844 biotin
## GSM733845 biotin
## label_protocol_ch1
## GSM733843 Total RNA sample quality was evaluated by spectrophotometer to determine quantity, protein contamination and organic solvent contamination, and an Agilent 2100 Bioanalyzer was used to check for RNA degradation. Two-round in vitro transcription amplification and labeling was performed starting with 50 ng intact, total RNA per sample, following the Affymetrix protocol.
## GSM733844 Total RNA sample quality was evaluated by spectrophotometer to determine quantity, protein contamination and organic solvent contamination, and an Agilent 2100 Bioanalyzer was used to check for RNA degradation. Two-round in vitro transcription amplification and labeling was performed starting with 50 ng intact, total RNA per sample, following the Affymetrix protocol.
## GSM733845 Total RNA sample quality was evaluated by spectrophotometer to determine quantity, protein contamination and organic solvent contamination, and an Agilent 2100 Bioanalyzer was used to check for RNA degradation. Two-round in vitro transcription amplification and labeling was performed starting with 50 ng intact, total RNA per sample, following the Affymetrix protocol.
## taxid_ch1
## GSM733843 9606
## GSM733844 9606
## GSM733845 9606
## hyb_protocol
## GSM733843 Hybridization was performed on Human U133 Plus 2.0 Arrays (using GeneTitan platform, Affymetrix, or individual cartridges) for 16 h at 45 oC, and 60 r.p.m. in a Hybridization Oven 640 (Affymetrix), slides were washed and stained with a Fluidics Station 450 (Affymetrix).
## GSM733844 Hybridization was performed on Human U133 Plus 2.0 Arrays (using GeneTitan platform, Affymetrix, or individual cartridges) for 16 h at 45 oC, and 60 r.p.m. in a Hybridization Oven 640 (Affymetrix), slides were washed and stained with a Fluidics Station 450 (Affymetrix).
## GSM733845 Hybridization was performed on Human U133 Plus 2.0 Arrays (using GeneTitan platform, Affymetrix, or individual cartridges) for 16 h at 45 oC, and 60 r.p.m. in a Hybridization Oven 640 (Affymetrix), slides were washed and stained with a Fluidics Station 450 (Affymetrix).
## scan_protocol
## GSM733843 Scanning was performed on a 7th generation GeneChip Scanner 3000, and the Affymetrix GCOS software was used to perform image analysis and generate raw intensity data.
## GSM733844 Scanning was performed on a 7th generation GeneChip Scanner 3000, and the Affymetrix GCOS software was used to perform image analysis and generate raw intensity data.
## GSM733845 Scanning was performed on a 7th generation GeneChip Scanner 3000, and the Affymetrix GCOS software was used to perform image analysis and generate raw intensity data.
## description
## GSM733843 2008-LAIV-1-D0
## GSM733844 2008-LAIV-1-D3
## GSM733845 2008-LAIV-1-D7
## data_processing
## GSM733843 RMA normalization was performed using Expression Console software (Affymetrix Inc, version 1.1).
## GSM733844 RMA normalization was performed using Expression Console software (Affymetrix Inc, version 1.1).
## GSM733845 RMA normalization was performed using Expression Console software (Affymetrix Inc, version 1.1).
## platform_id contact_name contact_email contact_phone
## GSM733843 GPL13158 Helder,I,Nakaya hnakaya@emory.edu 1-404-712-2594
## GSM733844 GPL13158 Helder,I,Nakaya hnakaya@emory.edu 1-404-712-2594
## GSM733845 GPL13158 Helder,I,Nakaya hnakaya@emory.edu 1-404-712-2594
## contact_laboratory contact_department contact_institute
## GSM733843 Bali Pulendran Emory Vaccine Center Emory University
## GSM733844 Bali Pulendran Emory Vaccine Center Emory University
## GSM733845 Bali Pulendran Emory Vaccine Center Emory University
## contact_address contact_city contact_state
## GSM733843 954 Gatewood Road, room 2040 Atlanta GA
## GSM733844 954 Gatewood Road, room 2040 Atlanta GA
## GSM733845 954 Gatewood Road, room 2040 Atlanta GA
## contact_zip/postal_code contact_country
## GSM733843 30329 USA
## GSM733844 30329 USA
## GSM733845 30329 USA
## supplementary_file
## GSM733843 ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/supplementary/samples/GSM733nnn/GSM733843/GSM733843.CEL.gz
## GSM733844 ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/supplementary/samples/GSM733nnn/GSM733844/GSM733844.CEL.gz
## GSM733845 ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/supplementary/samples/GSM733nnn/GSM733845/GSM733845.CEL.gz
## supplementary_file.1
## GSM733843 ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/supplementary/samples/GSM733nnn/GSM733843/GSM733843.chp.gz
## GSM733844 ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/supplementary/samples/GSM733nnn/GSM733844/GSM733844.chp.gz
## GSM733845 ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/supplementary/samples/GSM733nnn/GSM733845/GSM733845.chp.gz
## data_row_count
## GSM733843 54715
## GSM733844 54715
## GSM733845 54715
Without any additional information the various seasons cannot be combined.
combine(gse[[1]], gse[[2]])
## Error in combine(gse[[1]], gse[[2]]): objects have different annotations: GPL13158, GPL3921
The platforms are different accross seasons.
In order to do something simple like differential gene expression analyses:
ExpressionSet
for visit and patient IDannotate
to map probes to genescombine
into one expression matrixCombining this information with another dataset would require additional data cleaning (assuming the IDs are consistent accros experiment).
This study was funded by HIPC. As a result, all of its data has been curated and standardized by ImmPort and is now publicly available on ImmuneSpace. We will use ImmuneSpaceR to retrieve the gene-expression data and associated metadata.
First, we instantiate a connection to the study of interest.
library(ImmuneSpaceR)
sdy269 <- CreateConnection("SDY269")
sdy269
## Immunespace Connection to study SDY269
## URL: https://www.immunespace.org/Studies/SDY269
## User: unknown_user at not_a_domain.com
## Available datasets
## demographics
## pcr
## cohort_membership
## elisa
## elispot
## hai
## gene_expression_files
## fcs_analyzed_result
## fcs_sample_files
## Expression Matrices
## TIV_2008
## LAIV_2008
The returned ImmuneSpaceConnection
object shows available datasets and expression matrices available for the selected study. ImmuneSpace uses Reference classes or R5
. Functions are members of the object, thus the $ semantics to access member functions in the following samples.
TIV2008 <- sdy269$getGEMatrix("TIV_2008")
TIV2008
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 54715 features, 80 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: BS586131 BS586187 ... BS586267 (80 total)
## varLabels: biosample_accession participant_id ...
## study_time_collected_unit (5 total)
## varMetadata: labelDescription
## featureData
## featureNames: 1007_PM_s_at 1053_PM_at ... AFFX-r2-TagQ-5_at
## (54715 total)
## fvarLabels: FeatureId gene_symbol
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
head(pData(TIV2008))
## biosample_accession participant_id cohort
## BS586131 BS586131 SUB112832.269 TIV Group 2008
## BS586187 BS586187 SUB112832.269 TIV Group 2008
## BS586243 BS586243 SUB112832.269 TIV Group 2008
## BS586128 BS586128 SUB112834.269 TIV Group 2008
## BS586240 BS586240 SUB112834.269 TIV Group 2008
## BS586132 BS586132 SUB112841.269 TIV Group 2008
## study_time_collected study_time_collected_unit
## BS586131 0 Days
## BS586187 3 Days
## BS586243 7 Days
## BS586128 0 Days
## BS586240 7 Days
## BS586132 0 Days
Only essential information is displayed in the phenoData,
getGEMatrix
accepts multiple IDs as input. Additionally setting summary = TRUE
returns the matrix with expression values averaged by gene. Allowing an easy merge of the two ExpressionSets.
es269 <- sdy269$getGEMatrix(c("TIV_2008", "LAIV_2008"), summary = TRUE)
es269
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 16910 features, 163 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: BS586131 BS586187 ... BS586111 (163 total)
## varLabels: biosample_accession participant_id ...
## study_time_collected_unit (5 total)
## varMetadata: labelDescription
## featureData
## featureNames: 1 2 ... 16910 (16910 total)
## fvarLabels: FeatureId gene_symbol
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
For convenience, cohorts can also be downloaded by names.
es269 <- sdy269$getGEMatrix(cohort = c("TIV Group 2008", "LAIV group 2008"), summary = TRUE)
SDY269 only contains the two cohorts asociated with the 2008 season but the GEO SuperSeries had three cohorts. The data has been split between multiple studies. SDY61 contains the data for the 2007 season. We have to combine the gene-expression results accross these two studies to get the data that was used in the publication.
In order to query multiple studies, we can instantiate a connection to all studies at once.
all <- CreateConnection("") #All studies
es <- all$getGEMatrix(c("TIV_2007", "TIV_2008", "LAIV_2008"), summary = TRUE)
head(pData(es))
## biosample_accession participant_id cohort
## BS586015 BS586015 SUB112830.61 TIV Group 2007
## BS586024 BS586024 SUB112830.61 TIV Group 2007
## BS586033 BS586033 SUB112830.61 TIV Group 2007
## BS586016 BS586016 SUB112832.61 TIV Group 2007
## BS586025 BS586025 SUB112832.61 TIV Group 2007
## BS586034 BS586034 SUB112832.61 TIV Group 2007
## study_time_collected study_time_collected_unit
## BS586015 0 Days
## BS586024 3 Days
## BS586033 7 Days
## BS586016 0 Days
## BS586025 3 Days
## BS586034 7 Days
With a single command, we can download the three relevant normalized and transformed expression matrices, summarize them by gene and combine into a unique ExpressionSet with relevant phenoData.
It is also a lot faster than GEO as each matrix is available as a single file.
Unlike expresssion matrices, the other datasets are stored as tables in the database.
They can be accessed using the getDataset
function.
library(data.table)
sdy269$listDatasets()
## datasets
## demographics
## pcr
## cohort_membership
## elisa
## elispot
## hai
## gene_expression_files
## fcs_analyzed_result
## fcs_sample_files
## Expression Matrices
## TIV_2008
## LAIV_2008
hai <- sdy269$getDataset("hai")
hai
## participant_id age_reported gender race cohort
## 1: SUB112847.269 24 Male White TIV Group 2008
## 2: SUB112860.269 32 Female White LAIV group 2008
## 3: SUB112842.269 28 Male White TIV Group 2008
## 4: SUB112859.269 32 Male White LAIV group 2008
## 5: SUB112878.269 28 Female White TIV Group 2008
## ---
## 332: SUB112866.269 34 Male White LAIV group 2008
## 333: SUB112867.269 37 Male White LAIV group 2008
## 334: SUB112883.269 23 Female Asian LAIV group 2008
## 335: SUB112848.269 39 Female White TIV Group 2008
## 336: SUB112850.269 41 Male White LAIV group 2008
## study_time_collected study_time_collected_unit
## 1: 28 Days
## 2: 0 Days
## 3: 28 Days
## 4: 0 Days
## 5: 28 Days
## ---
## 332: 0 Days
## 333: 0 Days
## 334: 28 Days
## 335: 0 Days
## 336: 28 Days
## virus value_reported
## 1: A/Uruguay/716/2007 NYMC X-175C (H3N2) 320
## 2: A/South Dakota/6/2007 (H1N1) 40
## 3: B/Brisbane/3/2007 80
## 4: A/South Dakota/6/2007 (H1N1) 5
## 5: A/Uruguay/716/2007 NYMC X-175C (H3N2) 160
## ---
## 332: A/South Dakota/6/2007 (H1N1) 40
## 333: B/Florida/4/2006 5
## 334: A/South Dakota/6/2007 (H1N1) 20
## 335: A/Brisbane/59/2007 (H1N1) 640
## 336: B/Florida/4/2006 5
For performance, the function returns data.table
objects. It is especially interesting when querying accross all studies.
ahai <- all$getDataset("hai")
Because getDataset
is a wrapper around Rlabkey
’s labkey.selectRows
function, we use makeFilter
to create filters that can be interpreted by LabKey.
library(Rlabkey)
virus_filter <- makeFilter(c("virus", "CONTAINS", "H1N1"))
hai_f <- sdy269$getDataset("hai", colFilter = virus_filter)
virus_filter2 <- makeFilter(c("virus", "EQUAL", "A/Brisbane/59/2007 (H1N1)"))
hai_f <- sdy269$getDataset("hai", colFilter = virus_filter2)
#multiple filters can be specified
analyte_filter <- makeFilter(c("Analyte", "EQUAL", "IFNg"), c("Study time collected", "IN", "0;7"))
elisa <- sdy269$getDataset("elisa", colFilter = analyte_filter)
With the same information accross datasets, we can merge results from various assay types.
Here we use the elispot and flow cytometry results to reproduce Figure 1d of Nakaya et al.
# Elispot
analyte_filter2 <- makeFilter(c("Analyte", "EQUAL", "IgG"), c("Study time collected", "EQUAL", "7"))
elispot <- sdy269$getDataset("elispot", colFilter = analyte_filter2, reload = TRUE)
elispot <- elispot[, elispot_response := spot_number_reported + 1]
elispot <- elispot[, list(participant_id, elispot_response)]
# Flow
fcs <- sdy269$getDataset("fcs_analyzed_result")
fcs <- fcs[, fcs_response := (as.double(population_cell_number) + 1) / as.double(base_parent_population)][study_time_collected == 7]
res <- merge(elispot, fcs, by = "participant_id")
library(ggplot2)
ggplot(res, aes(x = as.double(fcs_response), y = elispot_response, color = cohort)) +
geom_point() + scale_y_log10() + scale_x_log10() + geom_smooth(method = "lm") +
xlab("Total plasmablasts (%)") + ylab("Influenza specific cells\n (per 10^6 PBMCs)") +
theme_IS()
Only a few lines were needed to calculate fold-change and a sinle merge
operation is enough to combine data accross assays and bind metadata.
For the exploration of most datasets, quick_plot
is a function that takes advantage of the standardized datasets to plot the data in a relevant fashion. It has a limited number of options and should be used for having a first look at the data.
More advanced plots can be achieved by downloading the data in the R session, as seen in the use case above.
sdy269$quick_plot("hai", normalize = FALSE)
sdy269$quick_plot("hai", filter = virus_filter2, normalize = FALSE, color = "Age", shape = "Gender")
This is the same function that is used by the interactive Data Explorer Module.
virus_filter3 <- makeFilter(c("cohort", "contains", "TIV"), c("study_time_collected", "IN", "0;21;28;30;180"))
all$quick_plot("hai", filter = virus_filter3, normalize = TRUE, color = "Age")
The reports available on ImmuneSpace are written using ImmuneSpaceR. The code is available and will work both through LabKey’s report on the web portal and locally.
RCurl
data.table
for handling downloaded datasets