2 R: First Impressions

Type values and mathematical formulas into R’s command prompt

1 + 1

## [1] 2

Assign values to symbols (variables)

x = 1
x + x

## [1] 2

Invoke functions such as c(), which takes any number of values and returns a single vector

x = c(1, 2, 3)
x

## [1] 1 2 3

R functions, such as sqrt(), often operate efficiently on vectors

y = sqrt(x)
y

## [1] 1.000000 1.414214 1.732051
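
As a small sketch of what 'operating on vectors' means, the lines below compare a single vectorized call with an explicit loop over the elements. The variable z and the loop are introduced only for illustration; they are not part of the original example.

```r
x <- c(1, 2, 3)

# Vectorized: one call handles every element of x
y <- sqrt(x)

# Element-by-element loop, shown only for comparison
z <- numeric(length(x))
for (i in seq_along(x)) {
    z[i] <- sqrt(x[i])
}

identical(y, z)   # both approaches produce the same values
```

The vectorized form is shorter and is the style usually preferred in R.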

There are often several ways to accomplish a task in R

x = c(1, 2, 3)
x

## [1] 1 2 3

x <- c(4, 5, 6)
x

## [1] 4 5 6

x <- 7:9
x

## [1] 7 8 9

10:12 -> x
x

## [1] 10 11 12

Sometimes R does ‘surprising’ things that can be fun to figure out

x <- c(1, 2, 3) -> y
x

## [1] 1 2 3

y

## [1] 1 2 3
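
One way to figure out the 'surprise': R's parser rewrites rightward assignment as leftward assignment, and -> binds more tightly than <- (see ?Syntax). The sketch below checks both points; the symbol a is hypothetical and is never actually assigned.

```r
# quote() captures an expression without evaluating it;
# the parser has already turned '1 -> a' into 'a <- 1'
identical(quote(1 -> a), quote(a <- 1))

# So this line is read as x <- (y <- c(1, 2, 3)),
# and both x and y end up with the same value
x <- c(1, 2, 3) -> y
identical(x, y)
```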

2.1 R Data types: vector and list

‘Atomic’ vectors

Types include integer, numeric (floating-point; real), complex, logical, character, and raw (bytes)

people <- c("Lori", "Nitesh", "Valerie", "Herve")
people

## [1] "Lori"    "Nitesh"  "Valerie" "Herve"
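
A quick way to see these types is typeof(). The values below are made up for illustration; note that typeof() reports numeric (floating-point) vectors as "double".

```r
typeof(1:3)              # "integer"
typeof(c(1, 2.5))        # "double" (numeric, floating-point)
typeof(1 + 2i)           # "complex"
typeof(c(TRUE, FALSE))   # "logical"
typeof("Lori")           # "character"
typeof(as.raw(255))      # "raw"

# An atomic vector holds a single type, so c() coerces mixed
# input to the most general type present
mixed <- c(1, "two", TRUE)
typeof(mixed)            # "character"
```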

Atomic vectors can be named

population <- c(Buffalo=259000, Rochester=210000, `New York`=8400000)
population

##   Buffalo Rochester  New York 
##    259000    210000   8400000

log10(population)

##   Buffalo Rochester  New York 
##  5.413300  5.322219  6.924279
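
Names can also be used to look up elements. A minimal sketch, reusing the population vector from above:

```r
population <- c(Buffalo=259000, Rochester=210000, `New York`=8400000)

names(population)                        # the names, as a character vector
population["Buffalo"]                    # subset by a single name
population[c("New York", "Rochester")]   # several names, in the order requested
```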

Statistical concepts like NA (“not available”)

truthiness <- c(TRUE, FALSE, NA)
truthiness

## [1]  TRUE FALSE    NA

Logical concepts like ‘and’ (&), ‘or’ (|), and ‘not’ (!)

!truthiness

## [1] FALSE  TRUE    NA

truthiness | !truthiness

## [1] TRUE TRUE   NA

truthiness & !truthiness

## [1] FALSE FALSE    NA

Numerical concepts like infinity (Inf) or not-a-number (NaN, e.g., 0 / 0)

undefined_numeric_values <- c(NA, 0/0, NaN, Inf, -Inf)
undefined_numeric_values

## [1]   NA  NaN  NaN  Inf -Inf

sqrt(undefined_numeric_values)

## Warning in sqrt(undefined_numeric_values): NaNs produced

## [1]  NA NaN NaN Inf NaN
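
These special values can be detected with is.na(), is.nan(), and is.finite(), and many summary functions accept na.rm=TRUE to drop missing values. A short sketch with a made-up vector v:

```r
v <- c(1, NA, NaN, Inf)

is.na(v)       # FALSE  TRUE  TRUE FALSE  (NaN also counts as missing)
is.nan(v)      # FALSE FALSE  TRUE FALSE  (only NaN)
is.finite(v)   #  TRUE FALSE FALSE FALSE

mean(c(1, 2, NA))                 # NA: the missing value propagates
mean(c(1, 2, NA), na.rm = TRUE)   # 1.5: missing values removed first
```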

Common string manipulations

toupper(people)

## [1] "LORI"    "NITESH"  "VALERIE" "HERVE"

substr(people, 1, 3)

## [1] "Lor" "Nit" "Val" "Her"

R is a green consumer – recycling short vectors to align with long vectors

x <- 1:3
x * 2            # '2' (vector of length 1) recycled to c(2, 2, 2)

## [1] 2 4 6

truthiness | NA

## [1] TRUE   NA   NA

truthiness & NA

## [1]    NA FALSE    NA
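
Recycling is not limited to length-1 vectors. The sketch below (values chosen only for illustration) recycles a length-2 vector against a length-6 vector, and captures the warning R issues when the longer length is not a multiple of the shorter one.

```r
# c(10, 20) is recycled three times to match the length of 1:6
recycled <- 1:6 + c(10, 20)
recycled   # 11 22 13 24 15 26

# Lengths 3 and 2 do not divide evenly; R still recycles but warns.
# tryCatch() is used here only to capture the warning text.
res <- tryCatch(1:3 + 1:2, warning = function(w) conditionMessage(w))
res
```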

It’s very common to nest operations, which can be simultaneously compact, confusing, and expressive ([: subset; <: less than)

substr(tolower(people), 1, 3)

## [1] "lor" "nit" "val" "her"

population[population < 1000000]

##   Buffalo Rochester 
##    259000    210000
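
When a nested expression is confusing, it can help to unpack it one step at a time. The intermediate names (lower, first3, is_small, small) are hypothetical and used only for this sketch.

```r
people <- c("Lori", "Nitesh", "Valerie", "Herve")
population <- c(Buffalo=259000, Rochester=210000, `New York`=8400000)

# substr(tolower(people), 1, 3), inner call first
lower <- tolower(people)
first3 <- substr(lower, 1, 3)

# population[population < 1000000], comparison first
is_small <- population < 1000000   # named logical vector
small <- population[is_small]      # keep elements where is_small is TRUE
```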

Lists

The list type can contain other vectors, including other lists

frenemies = list(
    friends=c("Larry", "Richard", "Vivian"),
    enemies=c("Dick", "Mike")
)
frenemies

## $friends
## [1] "Larry"   "Richard" "Vivian" 
## 
## $enemies
## [1] "Dick" "Mike"

[ subsets one list to create another list, [[ extracts a list element

frenemies[1]

## $friends
## [1] "Larry"   "Richard" "Vivian"

frenemies[c("enemies", "friends")]

## $enemies
## [1] "Dick" "Mike"
## 
## $friends
## [1] "Larry"   "Richard" "Vivian"

frenemies[["enemies"]]

## [1] "Dick" "Mike"
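
The difference between [ and [[ shows up in the class of the result. A brief sketch using the same list; $ with a full element name extracts the same element as [[.

```r
frenemies <- list(
    friends=c("Larry", "Richard", "Vivian"),
    enemies=c("Dick", "Mike")
)

class(frenemies["enemies"])     # "list": still a list, of length 1
class(frenemies[["enemies"]])   # "character": the element itself
identical(frenemies$enemies, frenemies[["enemies"]])

length(frenemies)    # number of elements in the list
lengths(frenemies)   # length of each element
```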

Factors

Character-like vectors, but with values restricted to specific levels

sex = factor(c("Male", "Male", "Female"),
             levels=c("Female", "Male", "Hermaphrodite"))
sex

## [1] Male   Male   Female
## Levels: Female Male Hermaphrodite

sex == "Female"

## [1] FALSE FALSE  TRUE

table(sex)

## sex
##        Female          Male Hermaphrodite 
##             1             2             0

sex[sex == "Female"]

## [1] Female
## Levels: Female Male Hermaphrodite
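
Under the hood, a factor stores integer codes that index into its levels. The sketch below inspects both and uses droplevels() to discard levels that do not occur in the data.

```r
sex <- factor(c("Male", "Male", "Female"),
              levels=c("Female", "Male", "Hermaphrodite"))

levels(sex)       # all permitted values, including unused ones
nlevels(sex)      # 3
as.integer(sex)   # integer codes into levels(sex): 2 2 1

# droplevels() removes levels that are not present in the data
levels(droplevels(sex))   # "Female" "Male"
```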

2.2 Classes: data.frame and beyond

Variables are often related to one another in a highly structured way, e.g., two ‘columns’ of data in a spreadsheet

x = rnorm(1000)       # 1000 random normal deviates
y = x + rnorm(1000)   # another 1000 deviates, as a function of x
plot(y ~ x)           # relationship between x and y

Convenient to manipulate them together

data.frame(): like columns in a spreadsheet

df = data.frame(X=x, Y=y)
head(df)           # first 6 rows

##             X           Y
## 1 -1.03278893 -3.68339332
## 2  1.52890241 -0.03821038
## 3  0.09607513  0.19225389
## 4  0.25224108  0.67252467
## 5 -0.31291377  0.57568412
## 6  1.76355837  0.66167142

plot(Y ~ X, df)    # same as above

See all data with View(df). Summarize data with summary(df)

summary(df)

##        X                  Y           
##  Min.   :-3.44631   Min.   :-4.18143  
##  1st Qu.:-0.63470   1st Qu.:-0.87538  
##  Median : 0.11961   Median : 0.06751  
##  Mean   : 0.05908   Mean   : 0.07693  
##  3rd Qu.: 0.75172   3rd Qu.: 1.01160  
##  Max.   : 3.08440   Max.   : 4.62605

Easy to manipulate data in a coordinated way, e.g., access column X with $ and subset for just those values greater than 0

positiveX = df[df$X > 0,]
head(positiveX)

##             X           Y
## 2  1.52890241 -0.03821038
## 3  0.09607513  0.19225389
## 4  0.25224108  0.67252467
## 6  1.76355837  0.66167142
## 9  0.52269290  2.14571411
## 10 0.07547879 -0.11688563

plot(Y ~ X, positiveX)
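
The subsetting step can be broken down on a tiny data.frame. The values and the names small_df and keep are made up for illustration; subset() is a base R helper that selects the same rows.

```r
small_df <- data.frame(X=c(-1, 2, 3), Y=c(10, 20, 30))

keep <- small_df$X > 0     # logical vector, one value per row
small_df[keep, ]           # rows where keep is TRUE, all columns
small_df[keep, "Y"]        # same rows, only column Y (returned as a vector)
subset(small_df, X > 0)    # same rows as small_df[keep, ]
```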

R is introspective – ask it about itself

class(df)

## [1] "data.frame"

dim(df)

## [1] 1000    2

colnames(df)

## [1] "X" "Y"

matrix() is a related class, where all elements have the same type (a data.frame() requires elements within a column to be the same type, but elements between columns can be different types).
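
A minimal sketch of that difference, using a made-up data.frame with hypothetical columns id and name. Converting it to a matrix forces every element to a single type.

```r
mixed_df <- data.frame(id=1:3, name=c("a", "b", "c"),
                       stringsAsFactors=FALSE)
sapply(mixed_df, class)   # id is "integer", name is "character"

m <- as.matrix(mixed_df)  # one type for the whole matrix
typeof(m)                 # "character"
dim(m)                    # 3 rows, 2 columns
```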

A scatterplot makes one want to fit a linear model (do a regression analysis)

Use a formula to describe the relationship between variables
Variables named in the formula are looked up in the second argument (here, the data.frame df)
```
fit <- lm(Y ~ X, df)
```
Visualize the points, and add the regression line
```
plot(Y ~ X, df)
abline(fit, col="red", lwd=3)
```
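
The fitted coefficients can be extracted with coef(), and predict() evaluates the fitted line at new values of X. The sketch below re-creates similar simulated data so that it runs on its own; the seed (123) and the new X values are arbitrary choices, so the numbers will differ from the output shown elsewhere in this section.

```r
set.seed(123)                  # arbitrary seed, for reproducibility only
x <- rnorm(1000)
y <- x + rnorm(1000)
df <- data.frame(X=x, Y=y)

fit <- lm(Y ~ X, df)
coef(fit)                      # intercept and slope; expect values near 0 and 1

# Predicted Y at three new values of X
predict(fit, newdata=data.frame(X=c(-1, 0, 1)))
```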

Summarize the fit as an ANOVA table

anova(fit)

## Analysis of Variance Table
## 
## Response: Y
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## X           1 1077.91 1077.91  1127.3 < 2.2e-16 ***
## Residuals 998  954.31    0.96                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

N.B. – ‘Type I’ sums-of-squares, so order of independent variables matters; use drop1() for ‘Type III’. See DataCamp Quick-R

Introspection – what class is fit? What methods can I apply to an object of that class?

class(fit)

## [1] "lm"

methods(class=class(fit))

##  [1] add1           alias          anova          case.names    
##  [5] coerce         confint        cooks.distance deviance      
##  [9] dfbeta         dfbetas        drop1          dummy.coef    
## [13] effects        extractAIC     family         formula       
## [17] hatvalues      influence      initialize     kappa         
## [21] labels         logLik         model.frame    model.matrix  
## [25] nobs           plot           predict        print         
## [29] proj           qr             residuals      rstandard     
## [33] rstudent       show           simulate       slotsFromS3   
## [37] summary        variable.names vcov          
## see '?methods' for accessing help and source code
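
Several of the methods listed above can be called directly on a fitted model. A small sketch on freshly simulated data (seed and sample size are arbitrary, chosen only so the example is self-contained):

```r
set.seed(123)
sim <- data.frame(X=rnorm(100))
sim$Y <- sim$X + rnorm(100)
sim_fit <- lm(Y ~ X, sim)

confint(sim_fit)           # 95% confidence intervals for the coefficients
head(residuals(sim_fit))   # observed minus fitted values, first 6
nobs(sim_fit)              # number of observations used in the fit
```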

2.3 Help!

Help available in RStudio or interactively

Check out the help page for rnorm()
```
?rnorm
```
‘Usage’ section describes how the function can be used
```
rnorm(n, mean = 0, sd = 1)
```
Arguments, some with default values. Arguments matched first by name, then position
‘Arguments’ section describes what the arguments are supposed to be
‘Value’ section describes return value
‘Examples’ section illustrates use
Help pages often include citations to relevant technical documentation, references to related functions, and obscure details
Can be intimidating, but in the end actually very useful
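
The argument-matching rule (first by name, then by position) can be checked with rnorm() itself. In this sketch the seed is reset before each call so the three results are directly comparable; the specific values (n=3, mean=10, sd=2) are arbitrary.

```r
set.seed(1); a <- rnorm(3, 10, 2)               # all by position: n, mean, sd
set.seed(1); b <- rnorm(3, sd = 2, mean = 10)   # named arguments, in any order
set.seed(1); d <- rnorm(sd = 2, 3, 10)          # sd by name; 3 and 10 then fill n and mean

identical(a, b)
identical(a, d)

args(rnorm)   # shows argument names and default values
```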

A.1 – Using R

11 - 12 September 2017
