1 The tidyverse

We’ll adopt a particular approach to data input, wrangling, and basic analysis known as the ‘tidyverse’. Start by loading the tidyverse package.

library(tidyverse)

We’ll cover the following functions:

Data input
- read_csv(): data input from a comma-separate value file, as a (data.frame-like) tibble.
Pipes
- %>%: ‘pipe’ data from a source to a function.
- %$%: extract a column in a pipe.
- .: refer to the incoming data.
Data manipulation
- group_by(): define groups of rows based on column values
- summarize(): apply functions to groups of data to produce a summary of the data.
- filter(): filter rows to match criteria in columns
- select(): select columns for subsequent use
- mutate(): update or add columns of data.
Other functions and concepts
- %in%: identify elements of the left-hand vector that are elements of the set defined by the right-hand vector.
- t.test(): perform a t-test.
- boxplot(), hist(): basic visualization.
- ~: specify a formula describing the relationship between a dependent (left-hand side) variable and independent (right-hand side) variable(s).

2 Exercise 1: BRFSS Survey Data

We will explore a subset of data collected by the CDC through its extensive Behavioral Risk Factor Surveillance System (BRFSS) telephone survey. Check out the link for more information. We’ll look at a subset of the data.

Use file.choose() to find the path to the file ‘BRFSS-subset.csv’
```
path <- file.choose()
```

Input the data using read_csv(), assigning to a variable brfss and visualizing the first few rows.

brfss <- read_csv(path)

## Parsed with column specification:
## cols(
##   Age = col_integer(),
##   Weight = col_double(),
##   Sex = col_character(),
##   Height = col_double(),
##   Year = col_integer()
## )

brfss

## # A tibble: 20,000 x 5
##      Age   Weight    Sex Height  Year
##    <int>    <dbl>  <chr>  <dbl> <int>
##  1    31 48.98798 Female 157.48  1990
##  2    57 81.64663 Female 157.48  1990
##  3    43 80.28585   Male 177.80  1990
##  4    72 70.30682   Male 170.18  1990
##  5    31 49.89516 Female 154.94  1990
##  6    58 54.43108 Female 154.94  1990
##  7    45 69.85323   Male 172.72  1990
##  8    37 68.03886   Male 180.34  1990
##  9    33 65.77089 Female 170.18  1990
## 10    75 70.76041 Female 152.40  1990
## # ... with 19,990 more rows

From looking at the data…
- How many individuals aer in the sample
- What variables have been measured?
- Can you guess at the units used for, e.g., Weight and Height?
The tidyverse uses a ‘pipe’, %>% to send data from one command to another. There are small number of key functions for manipulating data. We’ll use group_by() to group the data by Sex, and then summarize(n=n()) to count the number of observations in each group.
```
brfss %>% group_by(Sex) %>% summarize(N = n())
```
```
## # A tibble: 2 x 2
##      Sex     N
##    <chr> <int>
## 1 Female 12039
## 2   Male  7961
```

Use group_by(Year, Sex) and summarize(N = n()) to summarize the number of individuals from each year and sex.

brfss %>% group_by(Year, Sex) %>% summarize(N = n())

## # A tibble: 4 x 3
## # Groups:   Year [?]
##    Year    Sex     N
##   <int>  <chr> <int>
## 1  1990 Female  5718
## 2  1990   Male  4282
## 3  2010 Female  6321
## 4  2010   Male  3679

Calculate the average age in each year and sex by adding the argument Age = mean(Age, na.rm=TRUE) to summarize()

brfss %>% group_by(Year, Sex) %>% 
    summarize(N = n(), Age = mean(Age, na.rm=TRUE))

## # A tibble: 4 x 4
## # Groups:   Year [?]
##    Year    Sex     N      Age
##   <int>  <chr> <int>    <dbl>
## 1  1990 Female  5718 46.23327
## 2  1990   Male  4282 43.90552
## 3  2010 Female  6321 57.08824
## 4  2010   Male  3679 56.24993

Year is input as an integer vector, and Sex as a character vector. Actually, though, these are both factors. Use mutate() and factor() to update the type of these columns. Re-assign the updated tibble to brfss

brfss %>% mutate(Year = factor(Year), Sex = factor(Sex))

## # A tibble: 20,000 x 5
##      Age   Weight    Sex Height   Year
##    <int>    <dbl> <fctr>  <dbl> <fctr>
##  1    31 48.98798 Female 157.48   1990
##  2    57 81.64663 Female 157.48   1990
##  3    43 80.28585   Male 177.80   1990
##  4    72 70.30682   Male 170.18   1990
##  5    31 49.89516 Female 154.94   1990
##  6    58 54.43108 Female 154.94   1990
##  7    45 69.85323   Male 172.72   1990
##  8    37 68.03886   Male 180.34   1990
##  9    33 65.77089 Female 170.18   1990
## 10    75 70.76041 Female 152.40   1990
## # ... with 19,990 more rows

brfss <- brfss %>% mutate(Year = factor(Year), Sex = factor(Sex))

There are several other pipes available (see also the magrittr package). %$% extracts a column. Here we look at the levels() of the factor that we created.
```
brfss %$% Sex %>% levels()
```
```
## [1] "Female" "Male"
```
```
brfss %$% Year%>% levels()
```
```
## [1] "1990" "2010"
```

It’s usually better to ‘clean’ data as soon as possible. Visit the help page ?read_csv, look at the col_types = argument, and the help pages ?cols and ?col_factor. Input the data in it’s correct format, with Sex and Year as factors

col_types <- cols(
    Age = col_integer(),
    Weight = col_double(),
    Sex = col_factor(c("Female", "Male")),
    Height = col_double(),
    Year = col_factor(c("1990", "2010"))
)
brfss <- read_csv(path, col_types = col_types)
brfss

## # A tibble: 20,000 x 5
##      Age   Weight    Sex Height   Year
##    <int>    <dbl> <fctr>  <dbl> <fctr>
##  1    31 48.98798 Female 157.48   1990
##  2    57 81.64663 Female 157.48   1990
##  3    43 80.28585   Male 177.80   1990
##  4    72 70.30682   Male 170.18   1990
##  5    31 49.89516 Female 154.94   1990
##  6    58 54.43108 Female 154.94   1990
##  7    45 69.85323   Male 172.72   1990
##  8    37 68.03886   Male 180.34   1990
##  9    33 65.77089 Female 170.18   1990
## 10    75 70.76041 Female 152.40   1990
## # ... with 19,990 more rows

brfss %>% summary()

##       Age            Weight           Sex            Height     
##  Min.   :18.00   Min.   : 34.93   Female:12039   Min.   :105.0  
##  1st Qu.:36.00   1st Qu.: 61.69   Male  : 7961   1st Qu.:162.6  
##  Median :51.00   Median : 72.57                  Median :168.0  
##  Mean   :50.99   Mean   : 75.42                  Mean   :169.2  
##  3rd Qu.:65.00   3rd Qu.: 86.18                  3rd Qu.:177.8  
##  Max.   :99.00   Max.   :278.96                  Max.   :218.0  
##  NA's   :139     NA's   :649                     NA's   :184    
##    Year      
##  1990:10000  
##  2010:10000  
##              
##              
##              
##              
##

Use filter() to create a subset of the data consisting of only the 1990 observations (Year in the set that consists of the single element 1990, Year %in% 1990). Optionally, save this to a new variable brfss_1990.

brfss %>% filter(Year %in% 1990)

## # A tibble: 10,000 x 5
##      Age   Weight    Sex Height   Year
##    <int>    <dbl> <fctr>  <dbl> <fctr>
##  1    31 48.98798 Female 157.48   1990
##  2    57 81.64663 Female 157.48   1990
##  3    43 80.28585   Male 177.80   1990
##  4    72 70.30682   Male 170.18   1990
##  5    31 49.89516 Female 154.94   1990
##  6    58 54.43108 Female 154.94   1990
##  7    45 69.85323   Male 172.72   1990
##  8    37 68.03886   Male 180.34   1990
##  9    33 65.77089 Female 170.18   1990
## 10    75 70.76041 Female 152.40   1990
## # ... with 9,990 more rows

brfss_1990 <- brfss %>% filter(Year %in% 1990)

Pipe this subset to t.test() to ask whether Weight depends on Sex. The first argument to t.test is a ‘formula’ describing the relation between dependent and independent variables; we use the formula Weight ~ Sex. The second argument to t.test is the data set to use – indicate the data from the pipe with data = .

brfss %>% filter(Year %in% 1990) %>% t.test(Weight ~ Sex, data = .)

## 
##  Welch Two Sample t-test
## 
## data:  Weight by Sex
## t = -58.734, df = 9214, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -16.90767 -15.81554
## sample estimates:
## mean in group Female   mean in group Male 
##             64.81838             81.17999

What about differences between weights of males (or females) in 1990 versus 2010?

Use boxplot() to plot the weights of the Male individuals. Can you transform weight, e.g., taking the square root, before plotting? Interpret the results. Do similar boxplots for the t-tests of the previous question.
```
brfss %>% filter(Sex %in% "Male") %>% boxplot(Weight ~ Year, data = .)
```
```
brfss %>% filter(Sex %in% "Male") %>% mutate(SqrtWeight = sqrt(Weight)) %>%
    boxplot(SqrtWeight ~ Year, data = .)
```
Use hist() to plot a histogram of weights of the 1990 Female individuals. From ?hist, the function is expecting a vector of values, so use %$% to select the Weight column and pipe to hist().
```
brfss %>% filter(Year %in% "1990", Sex %in% "Female") %$% Weight %>% 
    hist(main="1990 Female Weight")
```

3 Exercise 2: ALL Phenotypic Data

This data comes from an (old) Acute Lymphoid Leukemia microarray data set.

Choose the file that contains ALL (acute lymphoblastic leukemia) patient information and input the date using read.csv(); for read.csv(), use row.names=1 to indicate that the first column contains row names.

path <- file.choose()    # look for ALL-phenoData.csv

stopifnot(file.exists(path))
pdata <- read_csv(path)

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   age = col_integer(),
##   `t(4;11)` = col_logical(),
##   `t(9;22)` = col_logical(),
##   cyto.normal = col_logical(),
##   ccr = col_logical(),
##   relapse = col_logical(),
##   transplant = col_logical()
## )

## See spec(...) for full column specifications.

pdata

## # A tibble: 128 x 22
##       X1   cod  diagnosis   sex   age    BT remission    CR   date.cr
##    <chr> <chr>      <chr> <chr> <int> <chr>     <chr> <chr>     <chr>
##  1 01005  1005  5/21/1997     M    53    B2        CR    CR  8/6/1997
##  2 01010  1010  3/29/2000     M    19    B2        CR    CR 6/27/2000
##  3 03002  3002  6/24/1998     F    52    B4        CR    CR 8/17/1998
##  4 04006  4006  7/17/1997     M    38    B1        CR    CR  9/8/1997
##  5 04007  4007  7/22/1997     M    57    B2        CR    CR 9/17/1997
##  6 04008  4008  7/30/1997     M    17    B1        CR    CR 9/27/1997
##  7 04010  4010 10/30/1997     F    18    B1        CR    CR  1/7/1998
##  8 04016  4016  2/10/2000     M    16    B1        CR    CR 4/17/2000
##  9 06002  6002  3/19/1997     M    15    B2        CR    CR  6/9/1997
## 10 08001  8001  1/15/1997     M    40    B2        CR    CR 3/26/1997
## # ... with 118 more rows, and 13 more variables: `t(4;11)` <lgl>,
## #   `t(9;22)` <lgl>, cyto.normal <lgl>, citog <chr>, mol.biol <chr>, `fusion
## #   protein` <chr>, mdr <chr>, kinet <chr>, ccr <lgl>, relapse <lgl>,
## #   transplant <lgl>, f.u <chr>, `date last seen` <chr>

Use select() to select some columns, e.g., mol.biol and BT. Use filter() to filter to include only females over 40.

pdata %>% select(mol.biol, BT)

## # A tibble: 128 x 2
##    mol.biol    BT
##       <chr> <chr>
##  1  BCR/ABL    B2
##  2      NEG    B2
##  3  BCR/ABL    B4
##  4 ALL1/AF4    B1
##  5      NEG    B2
##  6      NEG    B1
##  7      NEG    B1
##  8      NEG    B1
##  9      NEG    B2
## 10  BCR/ABL    B2
## # ... with 118 more rows

pdata %>% filter(sex %in% "F", age > 40)

## # A tibble: 17 x 22
##       X1   cod  diagnosis   sex   age    BT remission                 CR
##    <chr> <chr>      <chr> <chr> <int> <chr>     <chr>              <chr>
##  1 03002  3002  6/24/1998     F    52    B4        CR                 CR
##  2 16004 16004  4/19/1997     F    58    B1        CR                 CR
##  3 16009 16009  7/11/2000     F    43    B2      <NA>               <NA>
##  4 19005 19005 11/15/1997     F    48    B1        CR                 CR
##  5 20002 20002   5/9/1997     F    58    B2        CR                 CR
##  6 24005 24005   1/3/1997     F    45    B1        CR                 CR
##  7 24011 24011   8/5/1997     F    51    B2      <NA> DEATH IN INDUCTION
##  8 26003 26003  2/18/1998     F    49    B4        CR                 CR
##  9 27004 27004 10/20/1998     F    48    B2       REF                REF
## 10 28007 28007  2/21/1997     F    47    B3        CR                 CR
## 11 28021 28021  3/18/1998     F    54    B3        CR        DEATH IN CR
## 12 28032 28032  9/26/1998     F    52    B1        CR                 CR
## 13 30001 30001  1/16/1997     F    54    B3      <NA> DEATH IN INDUCTION
## 14 49006 49006  8/12/1998     F    43    B2        CR                 CR
## 15 57001 57001  1/29/1997     F    53    B3      <NA> DEATH IN INDUCTION
## 16 62001 62001 11/11/1997     F    50    B4       REF                REF
## 17 02020  2020  3/23/2000     F    48    T2      <NA> DEATH IN INDUCTION
## # ... with 14 more variables: date.cr <chr>, `t(4;11)` <lgl>,
## #   `t(9;22)` <lgl>, cyto.normal <lgl>, citog <chr>, mol.biol <chr>, `fusion
## #   protein` <chr>, mdr <chr>, kinet <chr>, ccr <lgl>, relapse <lgl>,
## #   transplant <lgl>, f.u <chr>, `date last seen` <chr>

Use the mol.biol column to filter the data to contain individuals in the set c("BCR/ABL", "NEG") (i.e., they have mol.biol equal to BCR/ABL or NEG))

bcrabl <- pdata %>% filter(mol.biol %in% c("BCR/ABL", "NEG"))

We’d like to tidy the data by mutating mol.biol to be a factor. We’d also like to mutate the BT column (B- or T-cell subtypes) to be just B or T, using substr(BT, 1, 1) (i.e., for each element of BT, taking the substring that starts at letter 1 and goes to letter 1 – the first letter)

bcrabl <- bcrabl %>% mutate(
    mol.biol = factor(mol.biol), 
    B_or_T = factor(substr(BT, 1, 1))
)

How many bcrabl samples have B- and T-cell types in each of the BCR/ABL and NEG groups?

bcrabl %>% group_by(B_or_T, mol.biol) %>% summarize(N = n())

## # A tibble: 3 x 3
## # Groups:   B_or_T [?]
##   B_or_T mol.biol     N
##   <fctr>   <fctr> <int>
## 1      B  BCR/ABL    37
## 2      B      NEG    42
## 3      T      NEG    32

Calculate the average age of males and females in the BCR/ABL and NEG treatment groups.

bcrabl %>% group_by(sex, mol.biol) %>% summarize(age = mean(age, na.rm=TRUE))

## # A tibble: 5 x 3
## # Groups:   sex [?]
##     sex mol.biol      age
##   <chr>   <fctr>    <dbl>
## 1     F  BCR/ABL 39.93750
## 2     F      NEG 30.42105
## 3     M  BCR/ABL 40.50000
## 4     M      NEG 27.21154
## 5  <NA>      NEG      NaN

Use t.test() to compare the age of individuals in the BCR/ABL versus NEG groups; visualize the results using boxplot(). In both cases, use the formula interface and . to refer to the incoming data set. Consult the help page ?t.test and re-do the test assuming that variance of ages in the two groups is identical. What parts of the test output change?

bcrabl %>% t.test(age ~ mol.biol, .)

## 
##  Welch Two Sample t-test
## 
## data:  age by mol.biol
## t = 4.8172, df = 68.529, p-value = 8.401e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.13507 17.22408
## sample estimates:
## mean in group BCR/ABL     mean in group NEG 
##              40.25000              28.07042

bcrabl %>% boxplot(age ~ mol.biol, .)

A.2 – Data Input and Manipulation

11 - 12 September 2017

Contents

1 The tidyverse

2 Exercise 1: BRFSS Survey Data

3 Exercise 2: ALL Phenotypic Data