---
title: "1. Introduction to R "
author: "Sonali Arora"
output:
BiocStyle::html_document:
toc: true
toc_depth: 2
vignette: >
% \VignetteIndexEntry{1. Introduction to Bioconductor}
% \VignetteEngine{knitr::rmarkdown}
---
```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
options(width=100, max.print=1000)
options(useFancyQuotes=FALSE)
knitr::opts_chunk$set(
eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")),
error=FALSE)
```
Author: Sonali Arora (sarora@fredhutch.org)
Date: 20-22 July, 2015
The material in this course requires R version 3.2.1 and Bioconductor
version 3.2
## R
- http://r-project.org
- Open-source, statistical programming language; widely used in academia, finance, pharma, . . .
- Core language, 'base' and > 4000 contributed packages
- Interactive sessions, scripts, packages
## Useful Functions in base R
- `dir`, `list.files`
- List files
- `read.table`, `scan`
- Read Data into R
- `c`, `factor`, `data.frame`, `matrix`
- Create vectors , data.frame and matrices to store data
- `summary`, `table`, `xtabs`
- Summarize or cross-tabulate data.
- `plot`
- Plot data to visualize it
- `match`, `%in%`, `which`
- find elements of one vector in another.
- `split`, `cut`
- Split or cut vectors.
- `strsplit`, `grep`, `sub`
- Operate on character vectors.
- `lapply`, `sapply`, `mapply`
- Apply function to elements of lists.
- `t.test`, `lm`, `anova`
- Compare two or several groups.
- `dist` , `hclust`
- Cluster Data
- `biocLite`, `install.packages`
- Install packages in R from online repository
- `traceback`, `debug`, `browser`
- debug errors
## Getting help in R
- ?data.frame
- methods(lm), methods(class=class(fit))
- ?"plot"
- help(package="Biostrings")
- vignette(package="GenomicRanges")
- StackOverflow; R-help mailing list
## Data types in R
- Vectors
- logical, integer, numeric, character, . . .
- list() - contains other vectors (recursive)
- factor(), NA - statistical concepts
- Can be named - c(Seattle=1, Portland=2)
- matrix(), array() - a vector with a 'dim' attribute.
- data.frame()
- like spreadsheets; list of equal length vectors.
- Homogenous types within a column, heterogenous types across columns.
- Other classes
- more complicated arrangement of vectors.
- Examples
- the value returned by lm();
- the DNAStringSet class used to hold DNA sequences.
- plain, 'accessor', 'generic', and 'method' functions
- Packages
- base, recommended, contributed.
## R programming concepts
- Functions - built-in (e.g., rnorm()); user-defined
```{r}
mean(1:10)
rnorm(1:10)
summary(rnorm(1:10))
```
- Subsetting - logical, numeric, character;
```{r}
data(iris)
# find those rows where petal.width is exactly 0.2
iris[iris$Petal.Width==0.2,]
# find those rows where sepal.length is less than 4.5
iris[iris$Sepal.Length < 4.5,]
# find all rows belonging to setosa
setosa_iris = iris[iris$Species=="setosa",]
dim(setosa_iris)
head(setosa_iris)
```
- Iteration - over vector elements, lapply(), mapply(), apply()
```{r}
# drop the column containing characters i.e., Species
iris <- iris[,!( names(iris) %in% "Species")]
dim(iris)
# find the mean of the first 4 numerical columns
lapply(iris, mean) # simpler: colMeans(iris)
# simplify the result
sapply(iris, mean)
# find the mean for each row.
apply(iris, 1 , mean) #simpler : rowMeans(iris)
```
## R as a Statistical Computing Environment
```{r}
# define a vector
x <- rnorm(1000)
# vectorized calculation
y <- x + rnorm(1000, sd=.8)
# object construction
df <- data.frame(x=x, y=y)
# linear model
fit <- lm(y ~ x, df)
```
## Visualizing Data in R
```{r}
par(mfrow=c(1,2))
plot(y ~ x, df, cex.lab=2)
abline(fit, col="red", lwd=2)
library(ggplot2)
ggplot(df, aes(x, y)) +
geom_point() +
stat_smooth(method="lm")
```
## `sessionInfo()`
```{r sessionInfo}
sessionInfo()
```