# Airway smooth muscle cells Here we provide the code which was used to contruct the *SummarizedExperiment* object of the *airway* experiment data package. The experiment citation is: Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, Whitaker RM, Duan Q, Lasky-Su J, Nikolos C, Jester W, Johnson M, Panettieri R Jr, Tantisira KG, Weiss ST, Lu Q. "RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells." PLoS One. 2014 Jun 13;9(6):e99625. PMID: [24926665](http://www.ncbi.nlm.nih.gov/pubmed/24926665). GEO: [GSE52778](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778). From the abstract, a brief description of the RNA-Seq experiment on airway smooth muscle (ASM) cell lines: "Using RNA-Seq, a high-throughput sequencing method, we characterized transcriptomic changes in four primary human ASM cell lines that were treated with dexamethasone - a potent synthetic glucocorticoid (1 micromolar for 18 hours)." # Obtaining sample information from GEO The following code chunk obtains the sample information from the series matrix file downloaded from GEO. The columns are then parsed and new columns with shorter names and factor levels are added. ```{r} suppressPackageStartupMessages( library( "GEOquery" ) ) suppressPackageStartupMessages( library( "airway" ) ) dir <- system.file("extdata",package="airway") geofile <- file.path(dir, "GSE52778_series_matrix.txt") gse <- getGEO(filename=geofile) pdata <- pData(gse)[,grepl("characteristics",names(pData(gse)))] names(pdata) <- c("treatment","tissue","ercc_mix","cell","celltype") pdataclean <- data.frame(treatment=sub("treatment: (.*)","\\1",pdata$treatment), cell=sub("cell line: (.*)","\\1",pdata$cell), row.names=rownames(pdata)) pdataclean$dex <- ifelse(grepl("Dex",pdataclean$treatment),"trt","untrt") pdataclean$albut <- ifelse(grepl("Albut",pdataclean$treatment),"trt","untrt") pdataclean$SampleName <- rownames(pdataclean) pdataclean$treatment <- NULL ``` The information which connects the sample information from GEO with the SRA run id is downloaded from [SRA](http://www.ncbi.nlm.nih.gov/sra/?term=SRP033351) using the **Send to: File** button. ```{r} srafile <- file.path(dir, "SraRunInfo_SRP033351.csv") srp <- read.csv(srafile) srpsmall <- srp[,c("Run","avgLength","Experiment","Sample","BioSample","SampleName")] ``` These two *data.frames* are merged and then we subset to only the samples not treated with albuterol (these samples were not included in the analysis of the publication). ```{r} coldata <- merge(pdataclean, srpsmall, by="SampleName") rownames(coldata) <- coldata$Run coldata <- coldata[coldata$albut == "untrt",] coldata$albut <- NULL coldata ``` Finally, the sample table was saved to a CSV file for future reference. This file is included in the `inst/extdata` directory of this package. ```{r eval=FALSE} write.csv(coldata, file="sample_table.csv") ``` # Downloading FASTQ files from SRA A file containing the SRA run numbers was created: `files`. This file was used to download the sequenced reads from the SRA using `wget`. The following command was used to extract the FASTQ file from the `.sra` files, using the [SRA Toolkit](http://www.ncbi.nlm.nih.gov/books/NBK47540/) ``` cat files | parallel -j 7 fastq-dump --split-files {}.sra ``` # Aligning reads The reads were aligned using the [STAR read aligner](https://code.google.com/p/rna-star/) to GRCh37 using the annotations from Ensembl release 75. ``` for f in `cat files`; do STAR --genomeDir ../STAR/ENSEMBL.homo_sapiens.release-75 \ --readFilesIn fastq/$f\_1.fastq fastq/$f\_2.fastq \ --runThreadN 12 --outFileNamePrefix aligned/$f.; done ``` [SAMtools](http://samtools.sourceforge.net/) was used to generate BAM files. ``` cat files | parallel -j 7 samtools view -bS aligned/{}.Aligned.out.sam -o aligned/{}.bam ``` # Counting reads A transcript database for the homo sapiens Ensembl genes was obtained from Biomart. ```{r eval=FALSE} library( "GenomicFeatures" ) txdb <- makeTranscriptDbFromBiomart( biomart="ensembl", dataset="hsapiens_gene_ensembl") exonsByGene <- exonsBy( txdb, by="gene" ) ``` The BAM files were specified using the `SRR` id from the SRA. A yield size of 2 million reads was used to cap the memory used during read counting. ```{r eval=FALSE} sampleTable <- read.csv( "sample_table.csv", row.names=1 ) fls <- file.path("aligned",rownames(sampleTable), ".bam") library( "Rsamtools" ) bamLst <- BamFileList( fls, yieldSize=2000000 ) ``` The following `summarizeOverlaps` call distributed the 8 paired-end BAM files to 8 workers. This used a maximum of 16 Gb per worker and the time elapsed was 50 minutes. ```{r eval=FALSE} library( "BiocParallel" ) register( MulticoreParam( workers=8 ) ) library( "GenomicAlignments" ) airway <- summarizeOverlaps( features=exonsByGene, reads=bamLst, mode="Union", singleEnd=FALSE, ignore.strand=TRUE, fragments=TRUE ) ``` The sample information was then added as column data. ```{r eval=FALSE} colData(airway) <- DataFrame( sampleTable ) ``` Finally, we attached the `MIAME` information using the Pubmed ID. ```{r eval=FALSE} library( "annotate" ) miame <- SimpleList(pmid2MIAME("24926665")) miame[[1]]@url <- "http://www.ncbi.nlm.nih.gov/pubmed/24926665" # because R's CHECK doesn't like non-ASCII characters in data objects # or in vignettes. the actual char was used in the first argument miame[[1]]@abstract <- gsub("micro","micro",abstract(miame[[1]])) miame[[1]]@abstract <- gsub("beta","beta",abstract(miame[[1]])) exptData(airway) <- miame save(airway, file="airway.RData") ``` # Information on the SummarizedExperiment Below we print out some basic summary statistics on the `airway` object which is provided by this experiment data package. ```{r} library("airway") data(airway) airway as.data.frame(colData(airway)) summary(colSums(assay(airway))/1e6) metadata(rowData(airway)) ``` # Session information ```{r} sessionInfo() ```