Many kinds of data can be expressed as a simple table. So why would you not just use a data.frame to hold all your data?
Data as a single table quickly becomes hard to represent efficiently if the following things happen:
Any of these things will cause your table to become very sparse and inneficient at representing the contents. So when that happens, it makes sense to break the data down into multiple tables. Relational databases like SQLite offer both an efficient means for storing data of this kind as well as a simple language for easily and quickly extracting it. SQL also allows you to recombine the results into single tables as needed.
1st we need a database file. Lets steal one from an installed package
dbFile <- system.file("extdata",
"TxDb.Hsapiens.UCSC.hg19.knownGene.sqlite",
package="TxDb.Hsapiens.UCSC.hg19.knownGene")
dbFile
## [1] "/Library/Frameworks/R.framework.devel/Versions/3.0/Resources/library/TxDb.Hsapiens.UCSC.hg19.knownGene/extdata/TxDb.Hsapiens.UCSC.hg19.knownGene.sqlite"
Now let's get a connection
library(RSQLite)
## Loading required package: DBI
txcon <- dbConnect(SQLite(), dbname=dbFile)
There is a handy helper function that lists tables:
dbListTables(txcon)
## [1] "cds" "chrominfo" "exon" "gene" "metadata" "splicing" "transcript"
Aand another one that lists fields for a table
dbListFields(txcon, name="metadata")
## [1] "name" "value"
And now: let's actually look at some useful SQL queries..
First: let's look at a basic select statement
dbGetQuery(txcon, "SELECT * FROM metadata")
## name value
## 1 Db type TranscriptDb
## 2 Supporting package GenomicFeatures
## 3 Data source UCSC
## 4 Genome hg19
## 5 Organism Homo sapiens
## 6 UCSC Table knownGene
## 7 Resource URL http://genome.ucsc.edu/
## 8 Type of Gene ID Entrez Gene ID
## 9 Full dataset yes
## 10 miRBase build ID GRCh37
## 11 transcript_nrow 82960
## 12 exon_nrow 289969
## 13 cds_nrow 237533
## 14 Db created by GenomicFeatures package from Bioconductor
## 15 Creation time 2013-09-19 00:15:15 -0700 (Thu, 19 Sep 2013)
## 16 GenomicFeatures version at creation time 1.13.40
## 17 RSQLite version at creation time 0.11.4
## 18 DBSCHEMAVERSION 1.0
Now just get the 1st column
dbGetQuery(txcon, "SELECT name FROM metadata")
## name
## 1 Db type
## 2 Supporting package
## 3 Data source
## 4 Genome
## 5 Organism
## 6 UCSC Table
## 7 Resource URL
## 8 Type of Gene ID
## 9 Full dataset
## 10 miRBase build ID
## 11 transcript_nrow
## 12 exon_nrow
## 13 cds_nrow
## 14 Db created by
## 15 Creation time
## 16 GenomicFeatures version at creation time
## 17 RSQLite version at creation time
## 18 DBSCHEMAVERSION
So using that, we can explore the chrominfo table like this
dbListFields(txcon, name="chrominfo")
## [1] "_chrom_id" "chrom" "length" "is_circular"
dbGetQuery(txcon, "SELECT chrom,length FROM chrominfo LIMIT 4")
## chrom length
## 1 chr1 249250621
## 2 chr2 243199373
## 3 chr3 198022430
## 4 chr4 191154276
And we can restrict our results to only get those chromosomes that are longer than 100000000 using a “WHERE” clause
dbGetQuery(txcon, "SELECT chrom,length FROM chrominfo WHERE length > 100000000")
## chrom length
## 1 chr1 249250621
## 2 chr2 243199373
## 3 chr3 198022430
## 4 chr4 191154276
## 5 chr5 180915260
## 6 chr6 171115067
## 7 chr7 159138663
## 8 chr8 146364022
## 9 chr9 141213431
## 10 chr10 135534747
## 11 chr11 135006516
## 12 chr12 133851895
## 13 chr13 115169878
## 14 chr14 107349540
## 15 chr15 102531392
## 16 chrX 155270560
Or we can restrict it to where the chromsome name is 'chrX'
dbGetQuery(txcon, "SELECT chrom,length FROM chrominfo WHERE chrom='chrX'")
## chrom length
## 1 chrX 155270560
So far so good. But what if we want to get data from a couple tables at once?
lets look at a couple of different tables. 1st the gene table
dbGetQuery(txcon, "SELECT * FROM gene LIMIT 4")
## gene_id _tx_id
## 1 1 70455
## 2 1 70456
## 3 10 31944
## 4 100 72132
And now the transcript table
dbGetQuery(txcon, "SELECT * FROM transcript LIMIT 4")
## _tx_id tx_name tx_chrom tx_strand tx_start tx_end
## 1 1 uc001aaa.3 chr1 + 11874 14409
## 2 2 uc010nxq.1 chr1 + 11874 14409
## 3 3 uc010nxr.1 chr1 + 11874 14409
## 4 4 uc001aal.1 chr1 + 69091 70008
Now for these two tables, we can do a simple inner join like this
sql <- "SELECT gene_id, tx_name
FROM gene,transcript
WHERE gene._tx_id=transcript._tx_id
LIMIT 4"
dbGetQuery(txcon, sql)
## gene_id tx_name
## 1 1 uc002qsd.4
## 2 1 uc002qsf.2
## 3 10 uc003wyw.1
## 4 100 uc002xmj.3
This same join can also be written in another fashion using the AS keyword to generate an alias.
sql <- "SELECT gene_id, tx_name, tx_strand
FROM gene AS g, transcript AS t
WHERE g._tx_id=t._tx_id
LIMIT 4"
dbGetQuery(txcon, sql)
## gene_id tx_name tx_strand
## 1 1 uc002qsd.4 -
## 2 1 uc002qsf.2 -
## 3 10 uc003wyw.1 +
## 4 100 uc002xmj.3 -
At 1st it may seem like the alias is just a way to make queries shorter, but it is actually useful for subqueries. A subquery is when you need to make a query inside of another query. So for example you may want to combine a couple of queries like this:
sql <- "SELECT * FROM gene AS g,
(SELECT * FROM transcript WHERE tx_strand='+') AS t
WHERE g._tx_id=t._tx_id
LIMIT 4"
dbGetQuery(txcon, sql)
## gene_id _tx_id _tx_id tx_name tx_chrom tx_strand tx_start tx_end
## 1 10 31944 31944 uc003wyw.1 chr8 + 18248755 18258723
## 2 100008586 75890 75890 uc004doa.3 chrX + 49217763 49233491
## 3 100009676 14200 14200 uc003dvg.3 chr3 + 101395274 101398057
## 4 10002 54546 54546 uc002ath.1 chr15 + 72102894 72107270
In the above example, the query “SELECT * FROM transcript WHERE tx_strand='+' is run and then the results of that are joined with the contents of the genes table by the outer query.”
If this all seems like it can get complicated in a hurry that's because it can. But fortunately, there exists another way for joining multiple tables that involves a lot less typing. When you have several tables that all contain the same foreign key, you can join them with the “USING” keyword like this:
sql <- "SELECT * FROM (SELECT * FROM gene, splicing USING(_tx_id)), transcript USING (_tx_id) LIMIT 4"
dbGetQuery(txcon, sql)
## gene_id _tx_id exon_rank _exon_id _cds_id tx_name tx_chrom tx_strand tx_start tx_end
## 1 1 70455 1 250818 206062 uc002qsd.4 chr19 - 58858172 58864865
## 2 1 70455 2 250817 206061 uc002qsd.4 chr19 - 58858172 58864865
## 3 1 70455 3 250816 206060 uc002qsd.4 chr19 - 58858172 58864865
## 4 1 70455 4 250815 206059 uc002qsd.4 chr19 - 58858172 58864865
Another query for joining these three tables would look like this:
sql <- "SELECT * FROM gene AS g, transcript AS t, splicing AS s WHERE g._tx_id=s._tx_id AND s._tx_id=t._tx_id LIMIT 4"
dbGetQuery(txcon, sql)
## gene_id _tx_id _tx_id tx_name tx_chrom tx_strand tx_start tx_end _tx_id exon_rank _exon_id _cds_id
## 1 1 70455 70455 uc002qsd.4 chr19 - 58858172 58864865 70455 1 250818 206062
## 2 1 70455 70455 uc002qsd.4 chr19 - 58858172 58864865 70455 2 250817 206061
## 3 1 70455 70455 uc002qsd.4 chr19 - 58858172 58864865 70455 3 250816 206060
## 4 1 70455 70455 uc002qsd.4 chr19 - 58858172 58864865 70455 4 250815 206059
Of course there are other many other keywords that you can learn about as well, but this should be enough to get you started.
Exercise 1: Use what you have learned to extract out the names and chromosomes from the transcript table.
Exercise 2: Now look at the splicing and exon tables. Notice how the splicing table has the same field “_exon_id” as the exon table. Use that field to join the contents of the two tables.
Exercise 3: Now look at the transcript table again. Notice that it too has a key that can be used to join to the splicing table. Now use the ALIAS feature described above along with using () to group a subequery so that you can merge the exon, splicing AND transcript table into a single result.
[ Back to top ]
1st lets look at some of the data we want to database.
gfl <- system.file("extdata","gene_names.txt", package="UnderstandingRBioc")
gene_names <- read.table(gfl, header=TRUE)
head(gene_names)
## gene_id symbol name
## 1 3979178 ND4L NADH dehydrogenase subunit 4L
## 2 3979179 ND4 NADH dehydrogenase subunit 4
## 3 3979180 ND5 NADH dehydrogenase subunit 5
## 4 3979181 ND6 NADH dehydrogenase subunit 6
## 5 3979182 CYTB cytochrome b
## 6 3979183 ND1 NADH dehydrogenase subunit 1
cfl <- system.file("extdata","chroms.txt", package="UnderstandingRBioc")
chroms <- read.table(cfl, header=TRUE)
head(chroms)
## gene_id chromosome
## 1 3979178 MT
## 2 3979179 MT
## 3 3979180 MT
## 4 3979181 MT
## 5 3979182 MT
## 6 3979183 MT
pfl <- system.file("extdata","pmids.txt", package="UnderstandingRBioc")
pmids <- read.table(pfl, header=TRUE)
head(pmids)
## gene_id pmid
## 1 3979178 17654009
## 2 3979179 17654009
## 3 3979180 17654009
## 4 3979181 17654009
## 5 3979182 17654009
## 6 3979183 17654009
Notice that for each of these data.frames, there is a common column that we want to use as a foreign key…
Next we need a new database. One that we can write into. The following code will spawn a DB in memory.
library(RSQLite)
dbFile <- sprintf("%s.sqlite", tempfile())
dbFile
## [1] "/var/folders/mq/fq9btv5n4wjclr32gn155fn80000gp/T//RtmpKZVyvm/file2722ebc6c88.sqlite"
con <- dbConnect(SQLite(), dbname=dbFile)
And once you create a table as below: your DB will exist on the disk as well.
Create table statements allow us to create a table.
sql <- "CREATE TABLE gene_info (
gene_id TEXT,
gene_symbol TEXT,
gene_name TEXT
)"
dbGetQuery(con, sql)
## NULL
You can easily use the helpers from before to verify that table is there
dbListTables(con)
## [1] "gene_info"
And look at its fields as well
dbListFields(con, name="gene_info")
## [1] "gene_id" "gene_symbol" "gene_name"
Exercise 4: Now that you know how to make a table, make one for the chrom and pmid data.frames above. Be sure to include all the fields that you want to populate.
[ Back to top ]
Lets start by paying close attention to the names of our data.frame. Notice how each column is named.
head(gene_names)
## gene_id symbol name
## 1 3979178 ND4L NADH dehydrogenase subunit 4L
## 2 3979179 ND4 NADH dehydrogenase subunit 4
## 3 3979180 ND5 NADH dehydrogenase subunit 5
## 4 3979181 ND6 NADH dehydrogenase subunit 6
## 5 3979182 CYTB cytochrome b
## 6 3979183 ND1 NADH dehydrogenase subunit 1
Now we are going to insert that entire data.frame into the table we created above. But we are going to carefully name the columns of the data.frame that we want to insert by using their names like this:
sql <- "INSERT INTO gene_info
VALUES ($gene_id, $symbol, $name)"
dbBeginTransaction(con)
## [1] TRUE
dbGetPreparedQuery(con, sql, bind.data = gene_names)
## NULL
dbCommit(con)
## [1] TRUE
Exercise 5: Now take the next step and put the data from the pmid and chrom data.frames into your database. If you have not already done so, you should also make the gene_info table described above. Once you have done this. take a minute to make sure that the data you have inserted is present by writing some queries to extract it.
[ Back to top ]
Now we are going to talk about how we can formally make our database into an actual R object that people can use with select() etc.
You always have to put the “Db type” and the “Supporting package” into a metadata table for your package. These fields are required by the loadDb supporting method so that it can spawn an AnnotationDb based object for you from your database. The 1st of these tells loadDb what kind of object it is supposed to make and the 2nd tells it where the methods for that object will be found. In this case we know what the “Db type” will be since it will be the new object type we want to define. But it is not yet clear what the “Supporting package” will be since we haven't made a package to hold our new methods yet. So just put AnnotationDbi in as a placeholder for now.
dbGetQuery(con, "CREATE Table metadata (name TEXT, value TEXT)")
## NULL
dbGetQuery(con, "INSERT INTO metadata VALUES ('Db type','myHamsterDb')")
## NULL
dbGetQuery(con, "INSERT INTO metadata VALUES ('Supporting package',
'AnnotationDbi')")
## NULL
Test for success:
dbGetQuery(con, "SELECT * FROM metadata")
## name value
## 1 Db type myHamsterDb
## 2 Supporting package AnnotationDbi
Then declare the class and then use loadDb to make an instance of your object.
library(AnnotationDbi)
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
##
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply, parCapply,
## parLapply, parLapplyLB, parRapply, parSapply, parSapplyLB
##
## The following object is masked from 'package:stats':
##
## xtabs
##
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, append, as.data.frame, as.vector, cbind, colnames,
## duplicated, eval, get, intersect, lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## rank, rbind, rep.int, rownames, sapply, setdiff, sort, table, tapply, union, unique, unlist
##
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
setRefClass("myHamsterDb", contains="AnnotationDb")
myHamster.db <- loadDb(dbFile)
Exercise 6: All databases that intend to eventually use the select method need to also have a metadata table with a name and value field. Run the code above to make sure you have one too. But then look at the original transcriptDb (txcon) database that we started with and examine the kind of data that is found in it's metadata table. What other kinds of data would be good to include here?
[ Back to top ]
So once you have an object, you will want to make convenient accessors. There are four that most annotation packages support: 'collumns', 'keys', 'keytypes' and 'select'. The columns method normally just returns potential fields that can concievably be returned from the AnnotationDb object when using select. Here is how we can implement that method. First of all we should define a function that knows how to access the connection to our AnnotationDb object and then extract the fields from it.
.cols <- function(x)
{
con <- AnnotationDbi:::dbConn(x)
list <- dbListTables(con)
unwanted <- c("metadata")
list <- list[!list %in% unwanted]
fields <- as.vector(sapply(list, dbListFields, con=con))
toupper(fields)
}
Then simply wrap this into a method
setMethod("columns", "myHamsterDb", function(x){.cols(x)})
## [1] "columns"
Then we can call it like this
columns(myHamster.db)
## [1] "GENE_ID" "GENE_SYMBOL" "GENE_NAME"
Exercise 7: The keytypes method often returns the same results as the columns method. Assuming that this is the case this time create a keytypes method.
[ Back to top ]
dbGetQuery(txcon, "SELECT tx_name,tx_chrom FROM transcript LIMIT 4")
## tx_name tx_chrom
## 1 uc001aaa.3 chr1
## 2 uc010nxq.1 chr1
## 3 uc010nxr.1 chr1
## 4 uc001aal.1 chr1
dbGetQuery(txcon,
"SELECT * FROM splicing, exon WHERE splicing._exon_id=exon._exon_id LIMIT 4")
## _tx_id exon_rank _exon_id _cds_id _exon_id exon_name exon_chrom exon_strand exon_start exon_end
## 1 1 1 1 NA 1 <NA> chr1 + 11874 12227
## 2 1 2 3 NA 3 <NA> chr1 + 12613 12721
## 3 1 3 5 NA 5 <NA> chr1 + 13221 14409
## 4 3 1 1 NA 1 <NA> chr1 + 11874 12227
dbGetQuery(txcon,
"SELECT * FROM transcript AS t,
(SELECT * FROM splicing,exon WHERE splicing._exon_id=exon._exon_id) AS e
WHERE e._tx_id=t._tx_id LIMIT 4")
## _tx_id tx_name tx_chrom tx_strand tx_start tx_end _tx_id exon_rank _exon_id _cds_id _exon_id:1 exon_name exon_chrom
## 1 1 uc001aaa.3 chr1 + 11874 14409 1 1 1 NA 1 <NA> chr1
## 2 1 uc001aaa.3 chr1 + 11874 14409 1 2 3 NA 3 <NA> chr1
## 3 1 uc001aaa.3 chr1 + 11874 14409 1 3 5 NA 5 <NA> chr1
## 4 3 uc010nxr.1 chr1 + 11874 14409 3 1 1 NA 1 <NA> chr1
## exon_strand exon_start exon_end
## 1 + 11874 12227
## 2 + 12613 12721
## 3 + 13221 14409
## 4 + 11874 12227
sql <- "CREATE TABLE chrom (
gene_id TEXT,
chromosome TEXT
)"
dbGetQuery(con, sql)
## NULL
sql <- "CREATE TABLE pmid (
gene_id TEXT,
pmid TEXT
)"
dbGetQuery(con, sql)
## NULL
sql <- "INSERT INTO chrom
VALUES ($gene_id, $chromosome)"
dbBeginTransaction(con)
## [1] TRUE
dbGetPreparedQuery(con, sql, bind.data = chroms)
## NULL
dbCommit(con)
## [1] TRUE
sql <- "INSERT INTO pmid
VALUES ($gene_id, $pmid)"
dbBeginTransaction(con)
## [1] TRUE
dbGetPreparedQuery(con, sql, bind.data = pmids)
## NULL
dbCommit(con)
## [1] TRUE
dbGetQuery(con, "SELECT * FROM chrom LIMIT 4")
## gene_id chromosome
## 1 3979178 MT
## 2 3979179 MT
## 3 3979180 MT
## 4 3979181 MT
dbGetQuery(con, "SELECT * FROM pmid LIMIT 4")
## gene_id pmid
## 1 3979178 17654009
## 2 3979179 17654009
## 3 3979180 17654009
## 4 3979181 17654009
This is the code that will create a metadata table for you:
dbGetQuery(con, "CREATE Table metadata (name TEXT, value TEXT)")
dbGetQuery(con, "INSERT INTO metadata VALUES ('Db type','myHamsterDb')")
dbGetQuery(con, "INSERT INTO metadata VALUES ('Supporting package',
'AnnotationDbi')")
And looking at the other metadata table we see all sorts of interesting stuff:
dbGetQuery(txcon, "SELECT * FROM metadata")
## name value
## 1 Db type TranscriptDb
## 2 Supporting package GenomicFeatures
## 3 Data source UCSC
## 4 Genome hg19
## 5 Organism Homo sapiens
## 6 UCSC Table knownGene
## 7 Resource URL http://genome.ucsc.edu/
## 8 Type of Gene ID Entrez Gene ID
## 9 Full dataset yes
## 10 miRBase build ID GRCh37
## 11 transcript_nrow 82960
## 12 exon_nrow 289969
## 13 cds_nrow 237533
## 14 Db created by GenomicFeatures package from Bioconductor
## 15 Creation time 2013-09-19 00:15:15 -0700 (Thu, 19 Sep 2013)
## 16 GenomicFeatures version at creation time 1.13.40
## 17 RSQLite version at creation time 0.11.4
## 18 DBSCHEMAVERSION 1.0
Since the keytypes method will be the same as columns, we can just use the same underlying function here.
setMethod("keytypes", "myHamsterDb", function(x){.cols(x)})
## [1] "keytypes"
Then we can call it like this
keytypes(myHamster.db)
## [1] "C(\"GENE_ID\", \"CHROMOSOME\")" "C(\"GENE_ID\", \"GENE_SYMBOL\", \"GENE_NAME\")"
## [3] "C(\"GENE_ID\", \"PMID\")"