We’ll adopt a particular approach to data input, wrangling, and basic analysis known as the ‘tidyverse’. Start by loading the tidyverse package.
library(tidyverse)
We’ll cover the following functions:
read_csv()
: data input from a comma-separate value file, as a (data.frame
-like) tibble
.%>%
: ‘pipe’ data from a source to a function.%$%
: extract a column in a pipe..
: refer to the incoming data.group_by()
: define groups of rows based on column valuessummarize()
: apply functions to groups of data to produce a summary of the data.filter()
: filter rows to match criteria in columnsselect()
: select columns for subsequent usemutate()
: update or add columns of data.%in%
: identify elements of the left-hand vector that are elements of the set defined by the right-hand vector.t.test()
: perform a t-test.boxplot()
, hist()
: basic visualization.~
: specify a formula describing the relationship between a dependent (left-hand side) variable and independent (right-hand side) variable(s).We will explore a subset of data collected by the CDC through its extensive Behavioral Risk Factor Surveillance System (BRFSS) telephone survey. Check out the link for more information. We’ll look at a subset of the data.
Use file.choose()
to find the path to the file ‘BRFSS-subset.csv’
path <- file.choose()
Input the data using read_csv()
, assigning to a variable brfss
and visualizing the first few rows.
brfss <- read_csv(path)
## Parsed with column specification:
## cols(
## Age = col_integer(),
## Weight = col_double(),
## Sex = col_character(),
## Height = col_double(),
## Year = col_integer()
## )
brfss
## # A tibble: 20,000 x 5
## Age Weight Sex Height Year
## <int> <dbl> <chr> <dbl> <int>
## 1 31 48.98798 Female 157.48 1990
## 2 57 81.64663 Female 157.48 1990
## 3 43 80.28585 Male 177.80 1990
## 4 72 70.30682 Male 170.18 1990
## 5 31 49.89516 Female 154.94 1990
## 6 58 54.43108 Female 154.94 1990
## 7 45 69.85323 Male 172.72 1990
## 8 37 68.03886 Male 180.34 1990
## 9 33 65.77089 Female 170.18 1990
## 10 75 70.76041 Female 152.40 1990
## # ... with 19,990 more rows
From looking at the data…
How many individuals aer in the sample
What variables have been measured?
Can you guess at the units used for, e.g., Weight and Height?
The tidyverse uses a ‘pipe’, %>%
to send data from one command to another. There are small number of key functions for manipulating data. We’ll use group_by()
to group the data by Sex
, and then summarize(n=n())
to count the number of observations in each group.
brfss %>% group_by(Sex) %>% summarize(N = n())
## # A tibble: 2 x 2
## Sex N
## <chr> <int>
## 1 Female 12039
## 2 Male 7961
Use group_by(Year, Sex)
and summarize(N = n())
to summarize the number of individuals from each year and sex.
brfss %>% group_by(Year, Sex) %>% summarize(N = n())
## # A tibble: 4 x 3
## # Groups: Year [?]
## Year Sex N
## <int> <chr> <int>
## 1 1990 Female 5718
## 2 1990 Male 4282
## 3 2010 Female 6321
## 4 2010 Male 3679
Calculate the average age in each year and sex by adding the argument Age = mean(Age, na.rm=TRUE)
to summarize()
brfss %>% group_by(Year, Sex) %>%
summarize(N = n(), Age = mean(Age, na.rm=TRUE))
## # A tibble: 4 x 4
## # Groups: Year [?]
## Year Sex N Age
## <int> <chr> <int> <dbl>
## 1 1990 Female 5718 46.23327
## 2 1990 Male 4282 43.90552
## 3 2010 Female 6321 57.08824
## 4 2010 Male 3679 56.24993
Year
is input as an integer vector, and Sex as a character vector. Actually, though, these are both factors. Use mutate()
and factor()
to update the type of these columns. Re-assign the updated tibble to brfss
brfss %>% mutate(Year = factor(Year), Sex = factor(Sex))
## # A tibble: 20,000 x 5
## Age Weight Sex Height Year
## <int> <dbl> <fctr> <dbl> <fctr>
## 1 31 48.98798 Female 157.48 1990
## 2 57 81.64663 Female 157.48 1990
## 3 43 80.28585 Male 177.80 1990
## 4 72 70.30682 Male 170.18 1990
## 5 31 49.89516 Female 154.94 1990
## 6 58 54.43108 Female 154.94 1990
## 7 45 69.85323 Male 172.72 1990
## 8 37 68.03886 Male 180.34 1990
## 9 33 65.77089 Female 170.18 1990
## 10 75 70.76041 Female 152.40 1990
## # ... with 19,990 more rows
brfss <- brfss %>% mutate(Year = factor(Year), Sex = factor(Sex))
There are several other pipes available (see also the magrittr package). %$%
extracts a column. Here we look at the levels()
of the factor that we created.
brfss %$% Sex %>% levels()
## [1] "Female" "Male"
brfss %$% Year%>% levels()
## [1] "1990" "2010"
It’s usually better to ‘clean’ data as soon as possible. Visit the help page ?read_csv
, look at the col_types =
argument, and the help pages ?cols
and ?col_factor
. Input the data in it’s correct format, with Sex and Year as factors
col_types <- cols(
Age = col_integer(),
Weight = col_double(),
Sex = col_factor(c("Female", "Male")),
Height = col_double(),
Year = col_factor(c("1990", "2010"))
)
brfss <- read_csv(path, col_types = col_types)
brfss
## # A tibble: 20,000 x 5
## Age Weight Sex Height Year
## <int> <dbl> <fctr> <dbl> <fctr>
## 1 31 48.98798 Female 157.48 1990
## 2 57 81.64663 Female 157.48 1990
## 3 43 80.28585 Male 177.80 1990
## 4 72 70.30682 Male 170.18 1990
## 5 31 49.89516 Female 154.94 1990
## 6 58 54.43108 Female 154.94 1990
## 7 45 69.85323 Male 172.72 1990
## 8 37 68.03886 Male 180.34 1990
## 9 33 65.77089 Female 170.18 1990
## 10 75 70.76041 Female 152.40 1990
## # ... with 19,990 more rows
brfss %>% summary()
## Age Weight Sex Height
## Min. :18.00 Min. : 34.93 Female:12039 Min. :105.0
## 1st Qu.:36.00 1st Qu.: 61.69 Male : 7961 1st Qu.:162.6
## Median :51.00 Median : 72.57 Median :168.0
## Mean :50.99 Mean : 75.42 Mean :169.2
## 3rd Qu.:65.00 3rd Qu.: 86.18 3rd Qu.:177.8
## Max. :99.00 Max. :278.96 Max. :218.0
## NA's :139 NA's :649 NA's :184
## Year
## 1990:10000
## 2010:10000
##
##
##
##
##
Use filter()
to create a subset of the data consisting of only the 1990 observations (Year
in the set that consists of the single element 1990
, Year %in% 1990
). Optionally, save this to a new variable brfss_1990
.
brfss %>% filter(Year %in% 1990)
## # A tibble: 10,000 x 5
## Age Weight Sex Height Year
## <int> <dbl> <fctr> <dbl> <fctr>
## 1 31 48.98798 Female 157.48 1990
## 2 57 81.64663 Female 157.48 1990
## 3 43 80.28585 Male 177.80 1990
## 4 72 70.30682 Male 170.18 1990
## 5 31 49.89516 Female 154.94 1990
## 6 58 54.43108 Female 154.94 1990
## 7 45 69.85323 Male 172.72 1990
## 8 37 68.03886 Male 180.34 1990
## 9 33 65.77089 Female 170.18 1990
## 10 75 70.76041 Female 152.40 1990
## # ... with 9,990 more rows
brfss_1990 <- brfss %>% filter(Year %in% 1990)
Pipe this subset to t.test()
to ask whether Weight depends on Sex. The first argument to t.test
is a ‘formula’ describing the relation between dependent and independent variables; we use the formula Weight ~ Sex
. The second argument to t.test
is the data set to use – indicate the data from the pipe with data = .
brfss %>% filter(Year %in% 1990) %>% t.test(Weight ~ Sex, data = .)
##
## Welch Two Sample t-test
##
## data: Weight by Sex
## t = -58.734, df = 9214, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -16.90767 -15.81554
## sample estimates:
## mean in group Female mean in group Male
## 64.81838 81.17999
What about differences between weights of males (or females) in 1990 versus 2010?
Use boxplot()
to plot the weights of the Male individuals. Can you transform weight, e.g., taking the square root, before plotting? Interpret the results. Do similar boxplots for the t-tests of the previous question.
brfss %>% filter(Sex %in% "Male") %>% boxplot(Weight ~ Year, data = .)
brfss %>% filter(Sex %in% "Male") %>% mutate(SqrtWeight = sqrt(Weight)) %>%
boxplot(SqrtWeight ~ Year, data = .)
Use hist()
to plot a histogram of weights of the 1990 Female individuals. From ?hist
, the function is expecting a vector of values, so use %$%
to select the Weight
column and pipe to hist()
.
brfss %>% filter(Year %in% "1990", Sex %in% "Female") %$% Weight %>%
hist(main="1990 Female Weight")
This data comes from an (old) Acute Lymphoid Leukemia microarray data set.
Choose the file that contains ALL (acute lymphoblastic leukemia) patient information and input the date using read.csv()
; for read.csv()
, use row.names=1
to indicate that the first column contains row names.
path <- file.choose() # look for ALL-phenoData.csv
stopifnot(file.exists(path))
pdata <- read_csv(path)
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_character(),
## age = col_integer(),
## `t(4;11)` = col_logical(),
## `t(9;22)` = col_logical(),
## cyto.normal = col_logical(),
## ccr = col_logical(),
## relapse = col_logical(),
## transplant = col_logical()
## )
## See spec(...) for full column specifications.
pdata
## # A tibble: 128 x 22
## X1 cod diagnosis sex age BT remission CR date.cr
## <chr> <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr>
## 1 01005 1005 5/21/1997 M 53 B2 CR CR 8/6/1997
## 2 01010 1010 3/29/2000 M 19 B2 CR CR 6/27/2000
## 3 03002 3002 6/24/1998 F 52 B4 CR CR 8/17/1998
## 4 04006 4006 7/17/1997 M 38 B1 CR CR 9/8/1997
## 5 04007 4007 7/22/1997 M 57 B2 CR CR 9/17/1997
## 6 04008 4008 7/30/1997 M 17 B1 CR CR 9/27/1997
## 7 04010 4010 10/30/1997 F 18 B1 CR CR 1/7/1998
## 8 04016 4016 2/10/2000 M 16 B1 CR CR 4/17/2000
## 9 06002 6002 3/19/1997 M 15 B2 CR CR 6/9/1997
## 10 08001 8001 1/15/1997 M 40 B2 CR CR 3/26/1997
## # ... with 118 more rows, and 13 more variables: `t(4;11)` <lgl>,
## # `t(9;22)` <lgl>, cyto.normal <lgl>, citog <chr>, mol.biol <chr>, `fusion
## # protein` <chr>, mdr <chr>, kinet <chr>, ccr <lgl>, relapse <lgl>,
## # transplant <lgl>, f.u <chr>, `date last seen` <chr>
Use select()
to select some columns, e.g., mol.biol
and BT
. Use filter()
to filter to include only females over 40.
pdata %>% select(mol.biol, BT)
## # A tibble: 128 x 2
## mol.biol BT
## <chr> <chr>
## 1 BCR/ABL B2
## 2 NEG B2
## 3 BCR/ABL B4
## 4 ALL1/AF4 B1
## 5 NEG B2
## 6 NEG B1
## 7 NEG B1
## 8 NEG B1
## 9 NEG B2
## 10 BCR/ABL B2
## # ... with 118 more rows
pdata %>% filter(sex %in% "F", age > 40)
## # A tibble: 17 x 22
## X1 cod diagnosis sex age BT remission CR
## <chr> <chr> <chr> <chr> <int> <chr> <chr> <chr>
## 1 03002 3002 6/24/1998 F 52 B4 CR CR
## 2 16004 16004 4/19/1997 F 58 B1 CR CR
## 3 16009 16009 7/11/2000 F 43 B2 <NA> <NA>
## 4 19005 19005 11/15/1997 F 48 B1 CR CR
## 5 20002 20002 5/9/1997 F 58 B2 CR CR
## 6 24005 24005 1/3/1997 F 45 B1 CR CR
## 7 24011 24011 8/5/1997 F 51 B2 <NA> DEATH IN INDUCTION
## 8 26003 26003 2/18/1998 F 49 B4 CR CR
## 9 27004 27004 10/20/1998 F 48 B2 REF REF
## 10 28007 28007 2/21/1997 F 47 B3 CR CR
## 11 28021 28021 3/18/1998 F 54 B3 CR DEATH IN CR
## 12 28032 28032 9/26/1998 F 52 B1 CR CR
## 13 30001 30001 1/16/1997 F 54 B3 <NA> DEATH IN INDUCTION
## 14 49006 49006 8/12/1998 F 43 B2 CR CR
## 15 57001 57001 1/29/1997 F 53 B3 <NA> DEATH IN INDUCTION
## 16 62001 62001 11/11/1997 F 50 B4 REF REF
## 17 02020 2020 3/23/2000 F 48 T2 <NA> DEATH IN INDUCTION
## # ... with 14 more variables: date.cr <chr>, `t(4;11)` <lgl>,
## # `t(9;22)` <lgl>, cyto.normal <lgl>, citog <chr>, mol.biol <chr>, `fusion
## # protein` <chr>, mdr <chr>, kinet <chr>, ccr <lgl>, relapse <lgl>,
## # transplant <lgl>, f.u <chr>, `date last seen` <chr>
Use the mol.biol
column to filter the data to contain individuals in the set c("BCR/ABL", "NEG")
(i.e., they have mol.biol
equal to BCR/ABL
or NEG
))
bcrabl <- pdata %>% filter(mol.biol %in% c("BCR/ABL", "NEG"))
We’d like to tidy the data by mutating mol.biol
to be a factor. We’d also like to mutate the BT
column (B- or T-cell subtypes) to be just B
or T
, using substr(BT, 1, 1)
(i.e., for each element of BT
, taking the substring that starts at letter 1 and goes to letter 1 – the first letter)
bcrabl <- bcrabl %>% mutate(
mol.biol = factor(mol.biol),
B_or_T = factor(substr(BT, 1, 1))
)
How many bcrabl samples have B- and T-cell types in each of the BCR/ABL and NEG groups?
bcrabl %>% group_by(B_or_T, mol.biol) %>% summarize(N = n())
## # A tibble: 3 x 3
## # Groups: B_or_T [?]
## B_or_T mol.biol N
## <fctr> <fctr> <int>
## 1 B BCR/ABL 37
## 2 B NEG 42
## 3 T NEG 32
Calculate the average age of males and females in the BCR/ABL and NEG treatment groups.
bcrabl %>% group_by(sex, mol.biol) %>% summarize(age = mean(age, na.rm=TRUE))
## # A tibble: 5 x 3
## # Groups: sex [?]
## sex mol.biol age
## <chr> <fctr> <dbl>
## 1 F BCR/ABL 39.93750
## 2 F NEG 30.42105
## 3 M BCR/ABL 40.50000
## 4 M NEG 27.21154
## 5 <NA> NEG NaN
Use t.test()
to compare the age of individuals in the BCR/ABL versus NEG groups; visualize the results using boxplot()
. In both cases, use the formula
interface and .
to refer to the incoming data set. Consult the help page ?t.test
and re-do the test assuming that variance of ages in the two groups is identical. What parts of the test output change?
bcrabl %>% t.test(age ~ mol.biol, .)
##
## Welch Two Sample t-test
##
## data: age by mol.biol
## t = 4.8172, df = 68.529, p-value = 8.401e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 7.13507 17.22408
## sample estimates:
## mean in group BCR/ABL mean in group NEG
## 40.25000 28.07042
bcrabl %>% boxplot(age ~ mol.biol, .)