CoreGx 1.4.2
The current implementation for the @sensitivity slot in a PharmacoSet has
some limitations.
Firstly, it does not natively support dose-response experiments with multiple drugs and/or cancer cell lines. As a result we have not been able to include this data into a PharmacoSet thus far.
Secondly, drug combination data has the potential to scale to high dimensionality. As a result we need an object that is highly performant to ensure computations on such data can be completed in a timely manner.
The current use case is supporting drug and cell-line combinations in
PharmacoGx, but we wanted to create something flexible enough to fit
other use cases. As such, the current class makes no mention of drugs or
cell-lines, nor anything specifically related to Bioinformatics or Computational
Biology. Rather, we tried to design a general purpose data structure which
could support high dimensional data for any use case.
Our design takes the best aspects
of the SummarizedExperiment and MultiAssayExperiment classes and implements
them using the data.table package, which provides an R API to a rich set of
tools for high performance data processing implemented in C.
We have borrowed directly from the SummarizedExperiment class
for the rowData, colData, metadata and assays slot names.
We also implemented the SummarizedExperiment accessor generics for the
LongTable.
There are, however, some important differences which make this object more flexible when dealing with high dimensional data.
Unlike a SummarizedExperiment, there are three distinct
classes of columns in rowData and colData.
The first is the rowKey or colKey; these are implemented internally to keep
mappings between each assay and the associated samples or drugs, and are not
returned by the accessors by default. The second is the rowIDs and
colIDs; these hold all of the information necessary to uniquely identify a
row or column and are used to generate the rowKey and colKey. Finally, there
are the rowMeta and colMeta columns, which store any additional data about
samples or drugs not required to uniquely identify a row in either table.
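As a rough illustration of the three column classes, the sketch below builds a small row annotation table with data.table. The object and function names (`rowAnnot`, `rowDataSketch`, `getRowData`) are hypothetical and the values are toy data; this is not the internal CoreGx implementation, only one way the ID columns could be turned into an integer key that is hidden by default.

```r
library(data.table)

# Toy row annotations: two ID columns plus one metadata column
rowAnnot <- data.table(
    drug1id = c("5-FU", "5-FU", "5-FU"),
    drug2id = c("Bortezomib", "Bortezomib", "L778123"),
    combination_name = c("5-FU & Bortezomib", "5-FU & Bortezomib",
        "5-FU & L778123")
)
rowIDs <- c("drug1id", "drug2id")               # uniquely identify a row
rowMeta <- setdiff(colnames(rowAnnot), rowIDs)  # everything else is metadata

# Keep one row per unique ID combination, then derive an integer key from the IDs
rowDataSketch <- unique(rowAnnot, by = rowIDs)
rowDataSketch[, rowKey := .GRP, by = rowIDs]

# Hypothetical accessor: the key is only returned when explicitly requested
getRowData <- function(dt, key = FALSE) {
    if (key) copy(dt) else dt[, setdiff(colnames(dt), "rowKey"), with = FALSE]
}

getRowData(rowDataSketch)
getRowData(rowDataSketch, key = TRUE)
```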
Within the assays the rowKey and colKey are combined to form a primary key
for each assay row. This is required because each assay is stored in ‘long’
format, instead of wide format as in the assay matrices within a
SummarizedExperiment. Thanks to the fast implementation of binary search
within the data.table package, assay tables can scale up to tens or even
hundreds of millions of rows while still being relatively performant.
Also worth noting is the cardinality between rowData and colData for a given
assay within the assays list. As indicated by the lower connection between these
tables and an assay, for each row or column key there may be zero or more rows in
the assay table. Conversely, for each row in the assay there may be zero or one key
in colData or rowData. When combined, the rowKey and colKey for a given
row in an assay become a composite key which uniquely identifies an observation.
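The relationship between the keys and a long-format assay can be sketched with plain data.table objects. Everything below is a toy example with hypothetical names and made-up values; it only illustrates the composite key, the keyed lookup, and the "zero or more rows per key" cardinality described above.

```r
library(data.table)

rowDataSketch <- data.table(rowKey = 1:2, drug1id = "5-FU",
    drug2id = c("Bortezomib", "L778123"))
colDataSketch <- data.table(colKey = 1:3,
    cellid = c("A2058", "A2780", "A375"))

# Long format: one row per measured (rowKey, colKey) pair.
# colKey 3 has no measurements, so it contributes zero assay rows.
assaySketch <- data.table(
    rowKey = c(1L, 1L, 2L),
    colKey = c(1L, 2L, 1L),
    viability = c(0.8, 0.2, 0.7)
)
setkeyv(assaySketch, c("rowKey", "colKey"))

# The composite key must identify each observation uniquely
stopifnot(anyDuplicated(assaySketch, by = key(assaySketch)) == 0L)

# Keyed lookup (binary search) of the observation for rowKey 1, colKey 2
obs <- assaySketch[.(1L, 2L)]

# Attach the identifiers by joining each annotation table on its key
annotated <- merge(merge(assaySketch, rowDataSketch, by = "rowKey"),
    colDataSketch, by = "colKey")

# Anti-join: column keys that have no rows in this assay
unusedCols <- colDataSketch[!assaySketch, on = "colKey"]
```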
To deal with the complex kinds of experimental designs which can be stored
in a LongTable, we have engineered a new object to help document and validate
the way data is mapped from raw data files, as a single large data.frame or
data.table, to the various slots of a LongTable object.
The DataMapper is an abstract class, which means it cannot be instantiated.
Its purpose is to provide a description of the concept of a DataMapper and
define a basic interface for any classes inheriting from it. A DataMapper is
simply a way to map columns from some raw data file to the slots of an S4 class.
It is similar to a schema in SQL in that it defines the valid parts of an
object (analogously a SQL table), but differs in that no types are specified or
enforced at this time.
This object is not important for general users, but may be useful for other
developers who want to map from some raw data to some S4 class. In this case,
any derived data mapper should inherit from the DataMapper abstract class.
Only one slot is defined by default, a list or List in the @rawdata slot.
An accessor method, rawdata(DataMapper), is defined to assign and retrieve
the raw data from your mapper object.
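For developers, the general shape of such an abstract class can be sketched with the S4 system. The class and generic names below are prefixed with `Sketch`/`sketch` and are hypothetical, chosen so they do not clash with the real CoreGx definitions; the actual DataMapper may differ in its slots and validity checks.

```r
library(methods)

# Virtual (abstract) class with a single slot holding the raw data
setClass("SketchDataMapper", representation("VIRTUAL", rawdata = "list"))

# Getter and setter generics for the raw data
setGeneric("sketchRawdata", function(object) standardGeneric("sketchRawdata"))
setMethod("sketchRawdata", "SketchDataMapper", function(object) object@rawdata)

setGeneric("sketchRawdata<-",
    function(object, value) standardGeneric("sketchRawdata<-"))
setReplaceMethod("sketchRawdata", "SketchDataMapper",
    function(object, value) {
        object@rawdata <- value
        validObject(object)
        object
    })

# A concrete subclass inherits the slot and both accessor methods
setClass("SketchTableMapper", contains = "SketchDataMapper",
    representation(rowDataMap = "list"))

mapper <- new("SketchTableMapper", rawdata = list(a = 1:3))
sketchRawdata(mapper) <- list(a = 4:6)

# Trying to instantiate the virtual class itself raises an error
abstractError <- tryCatch(new("SketchDataMapper"), error = function(e) e)
```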
The LongTableDataMapper class is the first concrete sub-class of a
DataMapper. It is the object which defines how to go from a single
data.frame or data.table of raw experimental data to a properly formatted
and valid LongTable object. This is accomplished by defining various mappings,
which let the user decide which columns from rawdata should go into which
slots of the object. Each slot mapping is implemented as a list of character
vectors specifying the column names from rawdata to assign to each slot.
Additionally, a helper method, guessMapping, has been included that will
try to determine which columns of a LongTableDataMapper’s rawdata should be
assigned to which slots, and therefore which maps.
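To give some intuition for how such a guess could be made, the sketch below treats a column as mapped to a group when its values are fully determined by that group's candidate ID columns. This heuristic and the helper name `guessGroupSketch` are assumptions for illustration; they are not taken from the CoreGx source, and the data are toy values.

```r
library(data.table)

# Hypothetical helper: a column maps to a group if adding it to the ID
# columns does not increase the number of unique combinations
guessGroupSketch <- function(dt, idCols) {
    nGroups <- uniqueN(dt, by = idCols)
    others <- setdiff(colnames(dt), idCols)
    isMapped <- vapply(others, function(col) {
        uniqueN(dt, by = c(idCols, col)) == nGroups
    }, logical(1))
    list(id_columns = idCols, mapped_columns = others[isMapped])
}

# Toy raw data: the combination name depends only on the two drug IDs
raw <- data.table(
    drug1id = c("5-FU", "5-FU", "5-FU", "5-FU"),
    drug2id = c("Bortezomib", "Bortezomib", "L778123", "L778123"),
    combination_name = c("5-FU & Bortezomib", "5-FU & Bortezomib",
        "5-FU & L778123", "5-FU & L778123"),
    cellid = c("A2058", "A2780", "A2058", "A2780"),
    viability1 = c(0.8, 0.2, 0.7, 0.3)
)

drugGuess <- guessGroupSketch(raw, c("drug1id", "drug2id"))
drugGuess$mapped_columns
```

In this toy table only `combination_name` is determined by the drug IDs, while `cellid` and `viability1` vary within a drug pair and so stay unmapped for that group.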
To get started making a LongTable, let’s have a look at some rawdata which is
a subset of the data from O’Neil et al., 2016. The full set of rawdata is
available for exploration and download from
SynergxDB.ca, a free and open source web-app and
database of publicly available drug combination sensitivity experiments which we
created and released (Seo et al., 2020).
The data was generated as part of the commercial activities of the pharmaceutical company Merck, and is thus named accordingly.
library(CoreGx)
library(data.table)
filePath <- '../data/merckLongTable.csv'
merckDT <- fread(filePath, na.strings=c('NULL'))
colnames(merckDT)
## [1] "drug1id" "drug2id" "drug1dose"
## [4] "drug2dose" "combination_name" "cellid"
## [7] "batchid" "viability1" "viability2"
## [10] "viability3" "viability4" "mu/muMax_published"
## [13] "X/X0_published"
| drug1id | drug2id | drug1dose | drug2dose | combination_name |
|---|---|---|---|---|
| 5-FU | Bortezomib | 0.35 | 0.00045 | 5-FU & Bortezomib |
| 5-FU | Bortezomib | 0.35 | 0.00200 | 5-FU & Bortezomib |
| 5-FU | Bortezomib | 0.35 | 0.00900 | 5-FU & Bortezomib |
| 5-FU | Bortezomib | 0.35 | 0.04000 | 5-FU & Bortezomib |
| 5-FU | L778123 | 0.35 | 0.32500 | 5-FU & L778123 |
| 5-FU | L778123 | 0.35 | 0.80000 | 5-FU & L778123 |
| combination_name | cellid | batchid | viability1 | viability2 | viability3 | viability4 | mu/muMax_published | X/X0_published |
|---|---|---|---|---|---|---|---|---|
| 5-FU & Bortezomib | A2058 | 1 | 0.814 | 0.754 | 0.765 | 0.849 | 0.880 | 0.847 |
| 5-FU & Bortezomib | A2058 | 1 | 0.792 | 0.788 | 0.840 | 0.852 | 0.897 | 0.867 |
| 5-FU & Bortezomib | A2058 | 1 | 0.696 | 0.831 | 0.690 | 0.806 | 0.854 | 0.817 |
| 5-FU & Bortezomib | A2058 | 1 | 0.637 | 0.678 | 0.625 | 0.627 | 0.767 | 0.724 |
| 5-FU & L778123 | A2058 | 1 | 0.679 | 0.795 | 0.731 | 0.700 | 0.830 | 0.790 |
| 5-FU & L778123 | A2058 | 1 | 0.667 | 0.734 | 0.596 | 0.613 | 0.773 | 0.730 |
We can see that all the data related to the treatment response experiment is contained within this table.
To get an idea of where in a LongTable this data should go, let’s come up
with some guesses for mappings.
# Our guesses of how we may identify rows, columns and assays
groups <- list(
justDrugs=c('drug1id', 'drug2id'),
drugsAndDoses=c('drug1id', 'drug2id', 'drug1dose', 'drug2dose'),
justCells=c('cellid'),
cellsAndBatches=c('cellid', 'batchid'),
assays1=c('drug1id', 'drug2id', 'cellid'),
assays2=c('drug1id', 'drug2id', 'drug1dose', 'drug2dose', 'cellid', 'batchid')
)
# Decide if we want to subset out mapped columns after each group
subsets <- c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
# First we put our data in the `LongTableDataMapper`
LTdataMapper <- LongTableDataMapper(rawdata=merckDT)
# Then we can test our hypotheses, subset=FALSE means we don't remove mapped
# columns after each group is mapped
guess <- guessMapping(LTdataMapper, groups=groups, subset=subsets)
## [CoreGx::guessMapping,LongTableDataMapper-method]
## Mapping for group justDrugs: drug1id, drug2id
## [CoreGx::guessMapping,LongTableDataMapper-method]
## Mapping for group drugsAndDoses: drug1id, drug2id, drug1dose, drug2dose
## [CoreGx::guessMapping,LongTableDataMapper-method]
## Mapping for group justCells: cellid
## [CoreGx::guessMapping,LongTableDataMapper-method]
## Mapping for group cellsAndBatches: cellid, batchid
## [CoreGx::guessMapping,LongTableDataMapper-method]
## Mapping for group assays1: drug1id, drug2id, cellid
## [CoreGx::guessMapping,LongTableDataMapper-method]
## Mapping for group assays2: drug1id, drug2id, drug1dose, drug2dose, cellid, batchid
guess
## $metadata
## $metadata$id_columns
## [1] NA
##
## $metadata$mapped_columns
## character(0)
##
##
## $justDrugs
## $justDrugs$id_columns
## [1] "drug1id" "drug2id"
##
## $justDrugs$mapped_columns
## [1] "combination_name"
##
##
## $drugsAndDoses
## $drugsAndDoses$id_columns
## [1] "drug1id" "drug2id" "drug1dose" "drug2dose"
##
## $drugsAndDoses$mapped_columns
## [1] "combination_name"
##
##
## $justCells
## $justCells$id_columns
## [1] "cellid"
##
## $justCells$mapped_columns
## character(0)
##
##
## $cellsAndBatches
## $cellsAndBatches$id_columns
## [1] "cellid" "batchid"
##
## $cellsAndBatches$mapped_columns
## character(0)
##
##
## $assays1
## $assays1$id_columns
## [1] "drug1id" "drug2id" "cellid"
##
## $assays1$mapped_columns
## character(0)
##
##
## $assays2
## $assays2$id_columns
## [1] "drug1id" "drug2id" "drug1dose" "drug2dose" "cellid" "batchid"
##
## $assays2$mapped_columns
## [1] "viability1" "viability2" "viability3"
## [4] "viability4" "mu/muMax_published" "X/X0_published"
##
##
## $unmapped
## character(0)
Since we want our LongTable to have drugs as rows and samples as columns,
we see that both justDrugs and drugsAndDoses yield the same result. So we
do not yet prefer one over the other. Looking at justCells and
cellsAndBatches, we see that no additional columns map to either of them and
therefore still have no preference. For assays1, however, we see that no
columns mapped, while assays2 maps many of the raw data columns.
Since assays will be subset based on the rowKey and colKey, we know that
the rowIDs must be drugsAndDoses and the colIDs must be cellsAndBatches.
Therefore, to uniquely identify an observation in any given assay we need
all of these columns. We can use this information to assign maps to our
LongTableDataMapper.
rowDataMap(LTdataMapper) <- guess$drugsAndDoses
colDataMap(LTdataMapper) <- guess$cellsAndBatches
Looking at our mapped columns for assays2, we must decide if we want these
to go into more than one assay. If we do, we should name each item of our
assayMap for the LongTableDataMapper and specify it in a list of
character vectors, one for each assay. Since viability is the raw experimental
measurement and the final two columns are summaries of it, we will assign them
to two assays: sensitivity and profiles.
assays <- list(
sensitivity=guess$assays2$mapped_columns[seq_len(4)],
profiles=guess$assays2$mapped_columns[c(5, 6)]
)
assays
## $sensitivity
## [1] "viability1" "viability2" "viability3" "viability4"
##
## $profiles
## [1] "mu/muMax_published" "X/X0_published"
assayMap(LTdataMapper) <- assays
The metaConstruct method accepts a DataMapper object as its only argument,
and uses the information in that DataMapper to preprocess all rawdata and
map them to the appropriate slots of an S4 object. In our case, we are mapping
from the merckDT data.table to a LongTable.
At minimum, a LongTableDataMapper must specify the rowDataMap, colDataMap,
and assayMap. Additional maps are available; see ?LongTableDataMapper-class
and ?LongTableDataMapper-accessors for more details.
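Conceptually, the three required maps are enough to split one raw table into row annotations, column annotations and keyed long-format assays. The function below, `buildTablesSketch`, is a hypothetical stand-in written with data.table to show that idea on toy data. It handles only ID columns and assay columns, ignores metadata columns, and should not be read as the actual metaConstruct implementation.

```r
library(data.table)

# Hypothetical sketch: derive keys from the ID columns, then split the table
buildTablesSketch <- function(raw, rowIDs, colIDs, assayMap) {
    raw <- copy(raw)                      # avoid modifying the caller's table
    raw[, rowKey := .GRP, by = rowIDs]    # integer key per unique row ID set
    raw[, colKey := .GRP, by = colIDs]    # integer key per unique column ID set
    rowData <- unique(raw[, c("rowKey", rowIDs), with = FALSE])
    colData <- unique(raw[, c("colKey", colIDs), with = FALSE])
    # One long-format table per named entry of the assay map
    assays <- lapply(assayMap, function(cols) {
        raw[, c("rowKey", "colKey", cols), with = FALSE]
    })
    list(rowData = rowData, colData = colData, assays = assays)
}

# Toy raw data with made-up measurements
raw <- data.table(
    drug1id = "5-FU",
    drug2id = c("Bortezomib", "Bortezomib", "L778123", "L778123"),
    cellid = c("A2058", "A2780", "A2058", "A2780"),
    viability1 = c(0.8, 0.2, 0.7, 0.3),
    summary1 = c(0.9, 0.4, 0.8, 0.5)
)

tables <- buildTablesSketch(raw,
    rowIDs = c("drug1id", "drug2id"),
    colIDs = "cellid",
    assayMap = list(sensitivity = "viability1", profiles = "summary1"))
```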
After configuration, creating the object is very straightforward.
longTable <- metaConstruct(LTdataMapper)
As mentioned previously, a LongTable has both list- and table-like behaviours.
For table-like operations, a given LongTable can be thought of as a rowKey
by colKey rectangular object.
To support data.frame like sub-setting for this object, the constructor makes
pseudo row and column names, which are the ID columns for each row of
rowData or colData pasted together with a ‘:’. The ordering of these
columns is preserved in the pseudo-dim names, so be sure to arrange them
as desired before creating the LongTable.
head(rownames(longTable))
## [1] "5-FU:Bortezomib:0.35:0.00045" "5-FU:Bortezomib:0.35:0.002"
## [3] "5-FU:Bortezomib:0.35:0.009" "5-FU:Bortezomib:0.35:0.04"
## [5] "5-FU:L778123:0.35:0.325" "5-FU:L778123:0.35:0.8"
We see that the rownames for the Merck LongTable are the rowID columns pasted
together in the form ‘drug1:drug2:drug1dose:drug2dose’.
head(colnames(longTable))
## [1] "A2058:1" "A2058:3" "A2780:1" "A2780:2" "A375:1" "A375:2"
For the column names, a similar pattern is followed by pasting the cell-line name to the batch id.
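The pseudo-dimnames can be reproduced outside of a LongTable by pasting the ID columns of each row together with ‘:’. The sketch below uses a toy table and hypothetical object names; it only mirrors the naming pattern and says nothing about how the constructor builds the names internally.

```r
library(data.table)

rowDataSketch <- data.table(
    drug1id = c("5-FU", "5-FU"),
    drug2id = c("Bortezomib", "L778123"),
    drug1dose = c(0.35, 0.35),
    drug2dose = c(0.00045, 0.325)
)
# The order of the ID columns determines the order within each name
idCols <- c("drug1id", "drug2id", "drug1dose", "drug2dose")

pseudoNames <- rowDataSketch[,
    do.call(paste, c(.SD, sep = ":")),
    .SDcols = idCols]
pseudoNames
```

Reordering `idCols` changes the resulting names, which is why the ID columns should be arranged before the object is created.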
data.frame Subsetting
We can subset a LongTable using the same row and column name syntax as
with a data.frame or matrix.
row <- rownames(longTable)[1]
columns <- colnames(longTable)[1:2]
longTable[row, columns]
## < LongTable >
## dim: 1 1
## assays(2): sensitivity profiles
## rownames(1): 5-FU:Bortezomib:0.35:0.00045
## rowData(5): drug1id drug2id drug1dose drug2dose combination_name
## colnames(1): A2058:1
## colData(2): cellid batchid
## metadata(0): none
However, unlike a data.frame or matrix this subsetting also accepts partial
row and column names as well as regex queries.
head(rowData(longTable), 3)
## drug1id drug2id drug1dose drug2dose combination_name
## 1: 5-FU Bortezomib 0.35 0.00045 5-FU & Bortezomib
## 2: 5-FU Bortezomib 0.35 0.00200 5-FU & Bortezomib
## 3: 5-FU Bortezomib 0.35 0.00900 5-FU & Bortezomib
head(colData(longTable), 3)
## cellid batchid
## 1: A2058 1
## 2: A2058 3
## 3: A2780 1
For example, if we want to get all instances where ‘5-FU’ is the drug:
longTable['5-FU', ]
## < LongTable >
## dim: 21 5
## assays(2): sensitivity profiles
## rownames(21): 5-FU:Bortezomib:0.35:0.00045 5-FU:Bortezomib:0.35:0.002 ... 5-FU:geldanamycin:0.35:2 MK-4541:5-FU:0.045:10
## rowData(5): drug1id drug2id drug1dose drug2dose combination_name
## colnames(5): A2058:1 A2780:1 A375:1 A427:1 CAOV3:1
## colData(2): cellid batchid
## metadata(0): none
This has matched all rownames where 5-FU was in either drug1 or drug2. If we only want to match drug1, we have several options:
all.equal(longTable['5-FU:*:*:*', ], longTable['^5-FU', ])
## [1] TRUE
As a technical note, ‘*’ is replaced with ‘.*’ internally for regex queries. This was implemented to mimic the Linux shell-style pattern matching that most command-line users are familiar with.
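The wildcard translation can be illustrated with base R alone. The helper `globToRegexSketch` is hypothetical and only performs the ‘*’ to ‘.*’ substitution described above; other regex metacharacters in a query, such as the ‘.’ in a dose, are left untouched in this sketch.

```r
# Hypothetical helper: translate shell-style '*' wildcards into regex '.*'
globToRegexSketch <- function(query) gsub("*", ".*", query, fixed = TRUE)

rowNames <- c("5-FU:Bortezomib:0.35:0.00045", "5-FU:L778123:0.35:0.325",
    "MK-4541:5-FU:0.045:10")

pattern <- globToRegexSketch("5-FU:*:*:*")   # becomes "5-FU:.*:.*:.*"

# Requires three ':' separators after 5-FU, so only drug1 matches remain
drug1Matches <- grepl(pattern, rowNames)

# A bare '5-FU' query matches the drug in either position
anyMatches <- grepl("5-FU", rowNames)
```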
data.table Subsetting
In addition to regex queries, a LongTable object supports arbitrarily complex
subset queries using the data.table API. To access this API, you will need to
use the . function, which allows you to pass raw R expressions to be evaluated
inside the i and j arguments for dataTable[i, j].
For example, if we want to subset to rows where drug1 is Temozolomide and drug2 is either Lapatinib or Bortezomib, and to columns where the cell line is CAOV3:
longTable[
# row query
.(drug1id == 'Temozolomide' & drug2id %in% c('Lapatinib', 'Bortezomib')),
.(cellid == 'CAOV3') # column query
]
## < LongTable >
## dim: 8 1
## assays(2): sensitivity profiles
## rownames(8): Temozolomide:Bortezomib:2.75:0.00045 Temozolomide:Bortezomib:2.75:0.002 ... Temozolomide:Lapatinib:2.75:1.1 Temozolomide:Lapatinib:2.75:5
## rowData(5): drug1id drug2id drug1dose drug2dose combination_name
## colnames(1): CAOV3:1
## colData(2): cellid batchid
## metadata(0): none
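The idea of passing an unevaluated expression and evaluating it later against a table can be sketched as follows. `captureSketch` is a hypothetical stand-in for the `.` function and the table holds toy data; the way a LongTable actually dispatches these queries to rowData and colData is assumed rather than shown here.

```r
library(data.table)

# Hypothetical stand-in for `.`: capture the expression without evaluating it
captureSketch <- function(expr) substitute(expr)

rowDataSketch <- data.table(
    rowKey = 1:3,
    drug1id = c("Temozolomide", "Temozolomide", "5-FU"),
    drug2id = c("Lapatinib", "Dasatinib", "Bortezomib")
)

rowQuery <- captureSketch(
    drug1id == "Temozolomide" & drug2id %in% c("Lapatinib", "Bortezomib")
)

# Evaluate the captured call with the table's columns in scope
keep <- eval(rowQuery, envir = rowDataSketch, enclos = parent.frame())

# The matching keys could then be used to subset every assay
matchedKeys <- rowDataSketch[keep, rowKey]
```

Deferring evaluation this way lets the same query syntax be applied to whichever table owns the referenced columns.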
We can also invert matches or subset on other columns in rowData or colData:
subLongTable <-
longTable[.(drug1id == 'Temozolomide' & drug2id != 'Lapatinib'),
.(batchid != 2)]
To show that it works as expected:
print(paste0('drug2id: ', paste0(unique(rowData(subLongTable)$drug2id),
collapse=', ')))
## [1] "drug2id: ABT-888, BEZ-235, Bortezomib, Dasatinib, Erlotinib, MK-2206, MK-5108, MK-8669, MK-8776, PD325901, SN-38, Sorafenib, geldanamycin"
print(paste0('batchid: ', paste0(unique(colData(subLongTable)$batchid),
collapse=', ')))
## [1] "batchid: 1"
head(rowData(longTable), 3)
## drug1id drug2id drug1dose drug2dose combination_name
## 1: 5-FU Bortezomib 0.35 0.00045 5-FU & Bortezomib
## 2: 5-FU Bortezomib 0.35 0.00200 5-FU & Bortezomib
## 3: 5-FU Bortezomib 0.35 0.00900 5-FU & Bortezomib
head(rowData(longTable, key=TRUE), 3)
## drug1id drug2id drug1dose drug2dose combination_name rowKey
## 1: 5-FU Bortezomib 0.35 0.00045 5-FU & Bortezomib 1
## 2: 5-FU Bortezomib 0.35 0.00200 5-FU & Bortezomib 2
## 3: 5-FU Bortezomib 0.35 0.00900 5-FU & Bortezomib 3
head(colData(longTable), 3)
## cellid batchid
## 1: A2058 1
## 2: A2058 3
## 3: A2780 1
head(colData(longTable, key=TRUE), 3)
## cellid batchid colKey
## 1: A2058 1 1
## 2: A2058 3 2
## 3: A2780 1 3
assays <- assays(longTable)
assays[[1]]
## viability1 viability2 viability3 viability4 rowKey colKey
## 1: 0.814 0.754 0.765 0.849 1 1
## 2: 0.214 0.195 0.186 0.223 1 3
## 3: 1.064 1.080 1.082 1.009 1 5
## 4: 0.675 0.582 0.482 0.516 1 8
## 5: 0.845 0.799 0.799 0.759 1 10
## ---
## 3796: 0.090 0.043 0.112 0.103 744 1
## 3797: 0.025 0.022 0.029 0.023 744 3
## 3798: 0.151 0.146 0.144 0.171 744 5
## 3799: 0.142 0.166 0.124 0.175 744 8
## 3800: 0.091 0.084 0.134 0.119 744 10
assays[[2]]
## mu/muMax_published X/X0_published rowKey colKey
## 1: 0.880 0.847 1 1
## 2: 0.384 0.426 1 3
## 3: 1.033 1.047 1 5
## 4: 0.676 0.638 1 8
## 5: 0.708 0.667 1 10
## ---
## 3796: -0.187 0.193 744 1
## 3797: -0.445 0.135 744 3
## 3798: 0.090 0.283 744 5
## 3799: -0.012 0.246 744 8
## 3800: -1.935 0.017 744 10
assays <- assays(longTable, withDimnames=TRUE)
colnames(assays[[1]])
## [1] "cellid" "batchid" "drug1id" "drug2id"
## [5] "drug1dose" "drug2dose" "combination_name" "viability1"
## [9] "viability2" "viability3" "viability4"
assays <- assays(longTable, withDimnames=TRUE, metadata=TRUE)
colnames(assays[[2]])
## [1] "cellid" "batchid" "drug1id"
## [4] "drug2id" "drug1dose" "drug2dose"
## [7] "combination_name" "mu/muMax_published" "X/X0_published"
assayNames(longTable)
## [1] "sensitivity" "profiles"
Using these names we can access specific assays within a LongTable.
colnames(assay(longTable, 'sensitivity'))
## [1] "viability1" "viability2" "viability3" "viability4" "rowKey"
## [6] "colKey"
assay(longTable, 'sensitivity')
## viability1 viability2 viability3 viability4 rowKey colKey
## 1: 0.814 0.754 0.765 0.849 1 1
## 2: 0.214 0.195 0.186 0.223 1 3
## 3: 1.064 1.080 1.082 1.009 1 5
## 4: 0.675 0.582 0.482 0.516 1 8
## 5: 0.845 0.799 0.799 0.759 1 10
## ---
## 3796: 0.090 0.043 0.112 0.103 744 1
## 3797: 0.025 0.022 0.029 0.023 744 3
## 3798: 0.151 0.146 0.144 0.171 744 5
## 3799: 0.142 0.166 0.124 0.175 744 8
## 3800: 0.091 0.084 0.134 0.119 744 10
colnames(assay(longTable, 'sensitivity', withDimnames=TRUE))
## [1] "cellid" "batchid" "drug1id" "drug2id"
## [5] "drug1dose" "drug2dose" "combination_name" "viability1"
## [9] "viability2" "viability3" "viability4"
assay(longTable, 'sensitivity', withDimnames=TRUE)
## cellid batchid drug1id drug2id drug1dose drug2dose
## 1: A2058 1 5-FU Bortezomib 0.3500 0.00045
## 2: A2780 1 5-FU Bortezomib 0.3500 0.00045
## 3: A375 1 5-FU Bortezomib 0.3500 0.00045
## 4: A427 1 5-FU Bortezomib 0.3500 0.00045
## 5: CAOV3 1 5-FU Bortezomib 0.3500 0.00045
## ---
## 3796: A2058 1 geldanamycin Topotecan 0.0223 0.07750
## 3797: A2780 1 geldanamycin Topotecan 0.0223 0.07750
## 3798: A375 1 geldanamycin Topotecan 0.0223 0.07750
## 3799: A427 1 geldanamycin Topotecan 0.0223 0.07750
## 3800: CAOV3 1 geldanamycin Topotecan 0.0223 0.07750
## combination_name viability1 viability2 viability3 viability4
## 1: 5-FU & Bortezomib 0.814 0.754 0.765 0.849
## 2: 5-FU & Bortezomib 0.214 0.195 0.186 0.223
## 3: 5-FU & Bortezomib 1.064 1.080 1.082 1.009
## 4: 5-FU & Bortezomib 0.675 0.582 0.482 0.516
## 5: 5-FU & Bortezomib 0.845 0.799 0.799 0.759
## ---
## 3796: geldanamycin & Topotecan 0.090 0.043 0.112 0.103
## 3797: geldanamycin & Topotecan 0.025 0.022 0.029 0.023
## 3798: geldanamycin & Topotecan 0.151 0.146 0.144 0.171
## 3799: geldanamycin & Topotecan 0.142 0.166 0.124 0.175
## 3800: geldanamycin & Topotecan 0.091 0.084 0.134 0.119
O’Neil J, Benita Y, Feldman I, Chenard M, Roberts B, Liu Y, Li J, Kral A, Lejnine S, Loboda A, Arthur W, Cristescu R, Haines BB, Winter C, Zhang T, Bloecher A, Shumway SD. An Unbiased Oncology Compound Screen to Identify Novel Combination Strategies. Mol Cancer Ther. 2016 Jun;15(6):1155-62. doi: 10.1158/1535-7163.MCT-15-0843. Epub 2016 Mar 16. PMID: 26983881.
Heewon Seo, Denis Tkachuk, Chantal Ho, Anthony Mammoliti, Aria Rezaie, Seyed Ali Madani Tonekaboni, Benjamin Haibe-Kains, SYNERGxDB: an integrative pharmacogenomic portal to identify synergistic drug combinations for precision oncology, Nucleic Acids Research, Volume 48, Issue W1, 02 July 2020, Pages W494–W501, https://doi.org/10.1093/nar/gkaa421