Type: Package
Title: Example Data Sets for Causal Inference Textbooks
Version: 0.1.4
Description: Example data sets to run the example problems from causal inference textbooks. Currently contains data sets for Huntington-Klein, Nick (2021 and 2025) "The Effect" https://theeffectbook.net, first and second edition; Cunningham, Scott (2021 and 2025, ISBN-13: 978-0-300-25168-5) "Causal Inference: The Mixtape"; and Hernán, Miguel and James Robins (2020) "Causal Inference: What If" https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/.
License: MIT + file LICENSE
Depends: R (>= 2.10)
Imports: tibble
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.3.2
URL: https://github.com/NickCH-K/causaldata
BugReports: https://github.com/NickCH-K/causaldata/issues
NeedsCompilation: no
Packaged: 2024-10-24 20:21:34 UTC; nickc
Author: Nick Huntington-Klein
Maintainer: Nick Huntington-Klein <nhuntington-klein@seattleu.edu>
Repository: CRAN
Date/Publication: 2024-10-24 20:40:02 UTC
causaldata: Example Data Sets for Causal Inference Textbooks
Description
Example data sets to run the example problems from causal inference textbooks. Currently contains data sets for Huntington-Klein, Nick (2021 and 2025) "The Effect" https://theeffectbook.net, first and second edition; Cunningham, Scott (2021 and 2025, ISBN-13: 978-0-300-25168-5) "Causal Inference: The Mixtape"; and Hernán, Miguel and James Robins (2020) "Causal Inference: What If" https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/.
Author(s)
Maintainer: Nick Huntington-Klein nhuntington-klein@seattleu.edu (ORCID)
Authors:
Malcolm Barrett malcolmbarrett@gmail.com (ORCID)
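See Also
Useful links:
- https://github.com/NickCH-K/causaldata
- Report bugs at https://github.com/NickCH-K/causaldata/issues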
U.S. Women's Labor-Force Participation
Description
The Mroz data frame has 753 rows and 8 columns. The observations, from the Panel Study of Income Dynamics (PSID), are married women.
Usage
Mroz
Format
A data frame with 753 rows and 8 variables
- lfp
Labor-force participation
- k5
Number of children 5 years old or younger
- k618
Number of children 6 to 17 years old
- age
Age in years
- wc
Wife attended college
- hc
Husband attended college
- lwg
Log expected wage rate. For women in the labor force, the actual wage rate; for women not in the labor force, an imputed value based on the regression of lwg on the other variables.
- inc
Family income exclusive of wife's income
Details
This data set is a lightly edited version of the one found in the carData package in R. It is used in the Describing Relationships chapter of The Effect.
Source
Mroz, T. A. (1987) The sensitivity of an empirical model of married women's hours of work to economic and statistical assumptions. *Econometrica* 55, 765–799.
John Fox, Sanford Weisberg and Brad Price (2020). carData: Companion to Applied Regression Data Sets. R package version 3.0-4. https://CRAN.R-project.org/package=carData
References
Fox, J. (2016) *Applied Regression Analysis and Generalized Linear Models,* Third Edition. Sage.
Fox, J. (2000) *Multiple and Generalized Nonparametric Regression.* Sage.
Fox, J. and Weisberg, S. (2019) *An R Companion to Applied Regression.* Third Edition, Sage.
Long, J. S. (1997) *Regression Models for Categorical and Limited Dependent Variables.* Sage.
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
Data on abortion legalization and sexually transmitted infections
Description
This data looks at the effect of abortion legalization on the incidence of gonorrhea among 15- to 19-year-olds, as a measure of risky behavior. Treatment is whether abortion was legal at the time that the eventual 15- to 19-year-olds were born.
Usage
abortion
Format
A data frame with 19584 rows and 22 variables
- fip
State FIPS code
- age
Age in years
- race
Race - 1 = white, 2 = black
- year
Year
- t
Year but counted on a different scale
- sex
Sex: 1 = male, 2 = female
- totpop
Total population
- ir
Incarcerated Males per 100,000
- crack
Crack index
- alcohol
Alcohol consumption per capita
- income
Real income per capita
- ur
State unemployment rate
- poverty
Poverty rate
- repeal
In a state with an early repeal of abortion prohibition
- acc
AIDS mortality per 100,000 cumulative in t, t-1, t-2, t-3
- wht
White Indicator
- male
Male Indicator
- lnr
Logged gonorrhea cases per 100,000 in 15-19 year olds
- younger
From the younger group
- fa
State-younger interaction
- pi
Parental involvement law in effect
- bf15
Is a black female in the 15-19 age group
Details
This data is used in the Difference-in-Differences chapter of Causal Inference: The Mixtape by Cunningham.
Source
Cunningham, Scott, and Christopher Cornwell. 2013. “The Long-Run Effect of Abortion on Sexually Transmitted Infections.” American Law and Economics Review 15 (1): 381–407.
References
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
Data from a survey of internet-mediated sex workers
Description
This data comes from a survey of 700 internet-mediated sex workers in 2008 and 2009, asking the same sex workers standard labor market information over several time periods.
Usage
adult_services
Format
A data frame with 1787 rows and 31 variables
- id
Provider identifier
- session
Client session identifier
- age
Age of provider
- age_cl
Age of Client
- appearance_cl
Client Attractiveness (Scale of 1 to 10)
- bmi
Body Mass Index
- schooling
Imputed Years of Schooling
- asq_cl
Age of Client Squared
- provider_second
Second Provider Involved
- asian_cl
Asian Client
- black_cl
Black Client
- hispanic_cl
Hispanic Client
- othrace_cl
Other Ethnicity Client
- reg
Client was a Regular
- hot
Met Client in Hotel
- massage_cl
Gave Client a Massage
- lnw
Log of Hourly Wage
- llength
Ln(Length)
- unsafe
Unprotected sex with client of any kind
- asian
race==1. Asian
- black
race==2. Black
- hispanic
race==3. Hispanic
- other
race==4. Other
- white
race==5. White
- asq
Age of provider squared
- cohab
ms==Cohabitating (living with a partner) but unmarried
- married
ms==Currently married and living with your spouse
- divorced
ms==Divorced and not remarried
- separated
ms==Married but not currently living with your spouse
- nevermarried
ms==Single and never married
- widowed
ms==Widowed and not remarried
Details
This data is used in the Panel Data chapter of Causal Inference: The Mixtape by Cunningham.
Source
Cunningham, Scott, and Todd D. Kendall. 2011. “Prostitution 2.0: The Changing Face of Sex Work.” Journal of Urban Economics 69: 273–87.
Cunningham, Scott, and Todd D. Kendall. 2014. “Examining the Role of Client Reviews and Reputation Within Online Prostitution.” In Handbook on the Economics of Prostitution, edited by Scott Cunningham and Manisha Shah. Oxford University Press.
Cunningham, Scott, and Todd D. Kendall. 2016. “Prostitution Labor Supply and Education.” Review of Economics of the Household. Forthcoming.
References
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
Automobile data from Stata
Description
This data, which comes standard in Stata, originally came from the April 1979 issue of Consumer Reports and from the United States Government EPA statistics on fuel consumption; they were compiled and published by Chambers et al. (1983).
Usage
auto
Format
A data frame with 74 rows and 12 variables
- make
Make and Model
- price
Price
- mpg
Mileage (mpg)
- rep78
Repair Record 1978
- headroom
Headroom (in.)
- trunk
Trunk space (cu. ft.)
- weight
Weight (lbs.)
- length
Length (in.)
- turn
Turn Circle (ft.)
- displacement
Displacement (cu. in.)
- gear_ratio
Gear Ratio
- foreign
Car type; 0 = Domestic, 1 = Foreign
Details
This data is used in the Probability and Regression Review chapter of Causal Inference: The Mixtape.
Source
Chambers, J. M., W. S. Cleveland, B. Kleiner, and P. A. Tukey. 1983. Graphical Methods for Data Analysis. Belmont, CA: Wadsworth.
References
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
Data on avocado sales
Description
This data set includes information on the average price and total amount of avocados sold across 169 weeks from 2015 to 2018. This data covers only sales of 'conventional' avocados that take place in California.
Usage
avocado
Format
A data frame with 169 rows and 3 variables:
- Date
Date of observation
- AveragePrice
Average avocado price
- TotalVolume
Total volume of avocados sold
Details
This data was used in the Identification chapter of The Effect by Huntington-Klein.
Source
Kiggins, Justin. 2018. https://www.kaggle.com/neuromusic/avocado-prices/
References
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
Data from "Black Politicians are More Intrinsically Motivated to Advance Blacks' Interests"
Description
The black_politicians data contains data from Broockman (2013) on a field experiment in which the author sent fictional emails, purportedly from Black people, to legislators in the United States. The experiment sought to determine whether an email being from "out-of-district" (someone who can't vote for you and so provides no extrinsic motivation to reply) would reduce response rates less for Black legislators than for non-Black ones, which would provide evidence of additional intrinsic motivation on the part of Black legislators to help Black people.
Usage
black_politicians
Format
A data frame with 5593 rows and 14 variables
- leg_black
Legislator receiving email is Black
- treat_out
Email is from out-of-district
- responded
Legislator responded to email
- totalpop
District population
- medianhhincom
District median household income
- black_medianhh
District median household income among Black people
- white_medianhh
District median household income among White people
- blackpercent
Percentage of district that is Black
- statessquireindex
State's Squire index
- nonblacknonwhite
Legislator receiving email is neither Black nor White
- urbanpercent
Percentage of district that is urban
- leg_senator
Legislator receiving email is a senator
- leg_democrat
Legislator receiving email is in the Democratic party
- south
Legislator receiving email is in the Southern United States
Details
This data is used in the Matching chapter of The Effect.
Source
Broockman, D.E., 2013. Black politicians are more intrinsically motivated to advance blacks’ interests: A field experiment manipulating political incentives. American Journal of Political Science, 57(3), pp.521-536.
References
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
Data on castle-doctrine statutes and violent crime
Description
This data looks at the impact of castle-doctrine statutes on violent crime. Data from the FBI Uniform Crime Reports Summary files are combined with information on castle-doctrine/stand-your-ground law implementation in different states.
Usage
castle
Format
A data frame with 19584 rows and 22 variables
- year
Year
- post
After-treatment
- sid
state id
- robbery_gun_r
Region-quarter fixed effects
- jhcitizen_c
justifiable homicide by private citizen count
- jhpolice_c
justifiable homicide by police count
- homicide
homicide count per 100,000 state population
- robbery
robbery count per 100,000 state population
- assault
aggravated assault count per 100,000 state population
- burglary
burglary count per 100,000 state population
- larceny
larceny count per 100,000 state population
- motor
motor vehicle theft count per 100,000 state population
- murder
murder count per 100,000 state population
- unemployrt
unemployment rate
- blackm_15_24
% of black male aged 15-24
- whitem_15_24
% of white male aged 15-24
- blackm_25_44
% of black male aged 25-44
- whitem_25_44
% of white male aged 25-44
- poverty
poverty rate
- l_homicide
Logged crime rate
- l_larceny
Logged crime rate
- l_motor
Logged crime rate
- l_police
Logged police presence
- l_income
Logged income
- l_prisoner
Logged number of prisoners
- l_lagprisoner
Lagged log prisoners
- l_exp_subsidy
Logged subsidy spending
- l_exp_pubwelfare
Logged public welfare spending
- lead1,lead2,lead3,lead4,lead5,lead6,lead7,lead8,lead9,lag0,lag1,lag2,lag3,lag4,lag5
Indicators of how many time periods until/since treatment
- popwt
Population weight
- r20001,r20002,r20003,r20004,r20011,r20012,r20013,r20014,r20021,r20022,r20023,r20024,r20031,r20032,r20033,r20034,r20041,r20042,r20043,r20044,r20051,r20052,r20053,r20054,r20061,r20062,r20063,r20064,r20071,r20072,r20073,r20074,r20081,r20082,r20083,r20084,r20091,r20092,r20093,r20094,r20101,r20102,r20103,r20104
Region-quarter fixed effects
- trend_1,trend_10,trend_11,trend_12,trend_13,trend_14,trend_15,trend_16,trend_17,trend_18,trend_19,trend_2,trend_20,trend_21,trend_22,trend_23,trend_24,trend_25,trend_26,trend_27,trend_28,trend_29,trend_3,trend_30,trend_31,trend_32,trend_33,trend_34,trend_35,trend_36,trend_37,trend_38,trend_39,trend_4,trend_40,trend_41,trend_42,trend_43,trend_44,trend_45,trend_46,trend_47,trend_48,trend_49,trend_5,trend_50,trend_51,trend_6,trend_7,trend_8,trend_9
State linear time trends
Details
This data is used in the Difference-in-Differences chapter of Causal Inference: The Mixtape by Cunningham.
Source
Cheng, Cheng, and Mark Hoekstra. 2013. “Does Strengthening Self-Defense Law Deter Crime or Escalate Violence? Evidence from Expansions to Castle Doctrine.” Journal of Human Resources 48 (3): 821–54.
References
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
Data on Drug Arrests from the Crown Court Sentencing Survey
Description
The ccdrug data contains data on drug arrests from the Crown Court Sentencing Survey between 2012 and 2015 in England and Wales, allowing for a look at differential sentencing rates for men and women, with a set of controls for features that should impact sentencing.
Usage
ccdrug
Format
A data frame with 16973 rows and 45 variables
- custody
Taken into custody
- male
Is a male
- first_offense
This is the first offense
- age
Age in ten-year bins
- offense
Offense type
- prev_convictions
Previous convictions, in bins of None, 1-3, 4-9, or 10+
- drg_class
Type of drug
- drg_culpability
Level of culpability for crime
- drg_increasing_ser_other_1, drg_increasing_ser_other_3, drg_increasing_ser_other_4, drg_increasing_ser_other_5, drg_increasing_ser_other_6, drg_increasing_ser_other_7, drg_increasing_ser_other_8, drg_increasing_ser_other_9, drg_increasing_ser_other_10, drg_increasing_ser_other_11, drg_increasing_ser_other_12, drg_increasing_ser_other_13, drg_increasing_ser_other_14, drg_increasing_ser_other_15, drg_increasing_ser_other_17, drg_increasing_ser_other_18, drg_increasing_ser_other_19, drg_increasing_ser_other_20, drg_increasing_ser_other_21, drg_reducing_ser_1, drg_reducing_ser_2, drg_reducing_ser_3, drg_reducing_ser_4, drg_reducing_ser_5, drg_reducing_ser_6, drg_reducing_ser_7, drg_reducing_ser_8, drg_reducing_ser_9, drg_reducing_ser_10, drg_reducing_ser_11, drg_reducing_ser_12, drg_reducing_ser_13, drg_reducing_ser_14, drg_reducing_ser_15, drg_reducing_ser_16, drg_increasing_ser_stat_2, drg_increasing_ser_stat_3
A set of indicators that should increase or reduce the likelihood of being taken into custody. See variable labels for specific definitions.
Details
This data set is used in the Partial Identification chapter of The Effect.
Source
Pina Sanchez, J., & Harris, L., 2020. Sentencing gender? Investigating the presence of gender disparities in Crown Court sentences. Criminal Law Review, 2020(1), pp. 3-28.
References
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
Data from Card (1995) to estimate the effect of college education on earnings
Description
Data from the National Longitudinal Survey Young Men Cohort. This data is used to estimate the effect of college education on earnings, using the presence of a nearby (in-county) college as an instrument for college attendance.
Usage
close_college
Format
A data frame with 3010 rows and 8 variables
- lwage
Log wages
- educ
Years of education
- exper
Years of work experience
- black
Race: Black
- south
In the southern United States
- married
Is married
- smsa
In a Standard Metropolitan Statistical Area (urban)
- nearc4
There is a four-year college in the county
Details
This data is used in the Instrumental Variables chapter of Causal Inference: The Mixtape by Cunningham.
Source
Card, David. 1995. “Aspects of Labour Economics: Essays in Honour of John Vanderkamp.” University of Toronto Press.
References
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
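The instrumental-variables idea described above can be sketched in base R. This is only an illustration on simulated data: the column names lwage, educ, and nearc4 are borrowed from the format list, but every value below is made up, and the coefficients used to generate the data are assumptions rather than estimates from close_college. With one instrument and no controls, manual two-stage least squares and the Wald ratio give the same point estimate.

```r
# Simulated stand-in for close_college; all values are hypothetical.
set.seed(1)
n <- 1000
nearc4  <- rbinom(n, 1, 0.5)   # instrument: four-year college in the county
ability <- rnorm(n)            # unobserved confounder (not in the real data)
educ    <- 12 + 1.0 * nearc4 + 0.8 * ability + rnorm(n)
lwage   <- 1 + 0.10 * educ + 0.30 * ability + rnorm(n, sd = 0.2)
sim <- data.frame(lwage, educ, nearc4)

# First stage: predict education from the instrument.
first_stage <- lm(educ ~ nearc4, data = sim)
sim$educ_hat <- fitted(first_stage)

# Second stage: regress log wages on predicted education.
# Note: standard errors from this manual second stage are not valid.
second_stage <- lm(lwage ~ educ_hat, data = sim)
tsls_estimate <- unname(coef(second_stage)["educ_hat"])

# Wald ratio: reduced-form covariance over first-stage covariance.
wald_estimate <- cov(sim$lwage, sim$nearc4) / cov(sim$educ, sim$nearc4)

print(c(tsls = tsls_estimate, wald = wald_estimate))
```

For real analyses a dedicated IV routine is preferable, since it also reports correct standard errors.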
A close-elections regression discontinuity study from Lee, Moretti, and Butler (2004)
Description
This data comes from a close-elections regression discontinuity study from Lee, Moretti, and Butler (2004). The design is intended to test convergence and divergence in policy. Major effects of electing someone from a particular party on policy outcomes *in a close race* indicate that the victor does what they want. Small or null effects indicate that the electee moderates their position toward their nearly split electorate.
Usage
close_elections_lmb
Format
A data frame with 13588 rows and 9 variables
- state
ICPSR state code
- district
district code
- id
Election ID
- score
ADA voting score (higher = more liberal)
- year
Year of election
- demvoteshare
Democratic share of the vote
- democrat
Democratic victory
- lagdemocrat
Lagged Democratic victory
- lagdemvoteshare
Lagged Democratic share of the vote
Details
This data is used in the Regression Discontinuity chapter of Causal Inference: The Mixtape by Cunningham.
Source
Lee, David S., Enrico Moretti, and Matthew J. Butler. 2004. “Do Voters Affect or Elect Policies: Evidence from the U.S. House.” Quarterly Journal of Economics 119 (3): 807–59.
References
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
Observational counterpart to nsw_mixtape data
Description
Data from the Current Population Survey on participation in the National Supported Work Demonstration (NSW) job-training program experiment. This is used as an observational comparison to the NSW experimental data from the nsw_mixtape data.
Usage
cps_mixtape
Format
A data frame with 15992 rows and 11 variables
- data_id
Individual ID
- treat
In the National Supported Work Demonstration Job Training Program
- age
Age in years
- educ
Years of education
- black
Race: Black
- hisp
Ethnicity: Hispanic
- marr
Married
- nodegree
Has no degree
- re74
Real earnings 1974
- re75
Real earnings 1975
- re78
Real earnings 1978
Details
This data is used in the Matching and Subclassification chapter of Causal Inference: The Mixtape by Cunningham.
Source
Dehejia, Rajeev H., and Sadek Wahba. 1999. “Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs.” Journal of the American Statistical Association 94 (448): 1053–62.
References
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
Data on Taiwanese Credit Card Holders
Description
Data from the UCI Machine Learning Repository on Taiwanese credit card holders, the amount of their credit card bill, and whether their payment was late.
Usage
credit_cards
Format
A data frame with 30000 rows and 4 variables
- LateSept
Credit card payment is late in Sept 2005
- LateApril
Credit card payment is late in April 2005
- BillApril
Total bill in April 2005 in thousands of New Taiwan Dollars
- AGE
Age of card-holder
Details
This data is used in the Matching chapter of The Effect by Huntington-Klein.
Source
Lichman, Moshe. 2013. UCI Machine Learning Repository.
References
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
Gapminder data
Description
The gapminder data contains data on life expectancy and GDP per capita by country and year.
Usage
gapminder
Format
A data frame with 1704 rows and 6 variables
- country
The country
- continent
The continent the country is in
- year
The year data was collected. Ranges from 1952 to 2007 in increments of 5 years
- lifeExp
Life expectancy at birth, in years
- pop
Population
- gdpPercap
GDP per capita (US$, inflation-adjusted)
Details
This data set is the same one found in the gapminder package in R as of 2020. This data set is used in the Fixed Effects chapter of The Effect.
Source
https://www.gapminder.org/data/
Jennifer Bryan (2017). gapminder: Data from Gapminder. R package version 0.3.0. https://CRAN.R-project.org/package=gapminder
References
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
Google Stock Data
Description
The google_stock data contains data on daily stock returns for Google and the S&P 500 for May through August 2015, centering around the August 10, 2015 announcement that Google would reorganize under parent company Alphabet.
Usage
google_stock
Format
A data frame with 84 rows and 3 variables
- Date
The date
- Google_Return
Daily GOOG Stock Return (1 = 100 percent daily return)
- SP500_Return
Daily S&P 500 Index Return (1 = 100 percent daily return)
Details
This data was downloaded using the tidyquant package, and is used in the Event Studies chapter of The Effect.
Source
Matt Dancho and Davis Vaughan (2021). tidyquant: Tidy Quantitative Financial Analysis. R package version 1.0.3. https://CRAN.R-project.org/package=tidyquant
References
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
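As a rough sketch of how returns data like this can be used in an event study, the following base R code fits a market model on a pre-event window and computes abnormal returns. The data are simulated and noise-free (the 0.05 event-day jump, the 1.2 market beta, the calendar-day dates, and the five-day buffer are all assumptions chosen for illustration); only the column names Date, Google_Return, and SP500_Return and the August 10, 2015 event date come from the description above. This is one common approach and not necessarily the one used in the book.

```r
# Simulated stand-in for google_stock; all return values are hypothetical.
event_date <- as.Date("2015-08-10")
sim <- data.frame(
  Date = seq(as.Date("2015-07-01"), as.Date("2015-08-20"), by = "day")
)
sim$SP500_Return  <- 0.01 * sin(seq_len(nrow(sim)))
sim$Google_Return <- 0.001 + 1.2 * sim$SP500_Return

# Add an assumed 5-percentage-point jump on the event day.
is_event <- sim$Date == event_date
sim$Google_Return[is_event] <- sim$Google_Return[is_event] + 0.05

# Estimation window: everything more than five days before the event.
estimation <- sim[sim$Date < event_date - 5, ]
market_model <- lm(Google_Return ~ SP500_Return, data = estimation)

# Abnormal return = actual return minus the market-model prediction.
sim$Abnormal_Return <- sim$Google_Return - predict(market_model, newdata = sim)
event_ar <- sim$Abnormal_Return[is_event]
print(event_ar)
```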
Data from "Government Transfers and Political Support"
Description
The gov_transfers data contains data from Manacorda, Miguel, and Vigorito (2011) on a government transfer program that was administered based on an income cutoff. Data is pre-limited to households that were just around the income cutoff.
Usage
gov_transfers
Format
A data frame with 1948 rows and 5 variables
- Income_Centered
Income measure, centered around program cutoff (negative value = eligible)
- Education
Household average years of education among those 16+
- Age
Household average age
- Participation
Participation in transfers
- Support
Measure of support for the government
Details
This data is used in the Regression Discontinuity chapter of The Effect.
Source
Manacorda, M., Miguel, E. and Vigorito, A., 2011. Government transfers and political support. American Economic Journal: Applied Economics, 3(3), pp.1-28.
References
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
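A minimal sketch of a regression discontinuity estimate on data shaped like gov_transfers, in base R. The values are simulated without noise, and the 0.1 jump in Support, the slope, and the bandwidth of the running variable are assumptions made for the example; only the column names and the rule that negative Income_Centered means eligible come from the format list above.

```r
# Simulated stand-in for gov_transfers; all values are hypothetical.
sim <- data.frame(
  # An even number of grid points keeps the running variable off exactly 0.
  Income_Centered = seq(-0.02, 0.02, length.out = 200)
)
sim$Eligible <- as.integer(sim$Income_Centered < 0)  # negative value = eligible
sim$Support  <- 0.6 + 2 * sim$Income_Centered + 0.1 * sim$Eligible

# Linear regression with separate slopes on each side of the cutoff.
# The coefficient on Eligible is the estimated jump at the cutoff.
fit <- lm(Support ~ Eligible * Income_Centered, data = sim)
rd_jump <- unname(coef(fit)["Eligible"])
print(rd_jump)
```

On real data the choice of bandwidth and functional form matters, and purpose-built RD packages handle both more carefully than this sketch does.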
Data from "Government Transfers and Political Support" for Density Tests
Description
The gov_transfers_density data contains data from Manacorda, Miguel, and Vigorito (2011) on a government transfer program that was administered based on an income cutoff. As opposed to the gov_transfers data set, this data set only contains income information, but has a wider range of it, for use with density discontinuity tests.
Usage
gov_transfers_density
Format
A data frame with 52549 rows and 1 variable:
- Income_Centered
Income measure, centered around program cutoff (negative value = eligible)
Details
This data is used in the Regression Discontinuity chapter of The Effect.
Source
Manacorda, M., Miguel, E. and Vigorito, A., 2011. Government transfers and political support. American Economic Journal: Applied Economics, 3(3), pp.1-28.
References
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
Data from a fictional randomized heart transplant study
Description
greek_data is a fictional data set from Table 2.2 in Chapter 2 of Causal Inference. From the book: "Table 2.2 shows the data from our heart transplant randomized study. Besides data on treatment A (1 if the individual received a transplant, 0 otherwise) and outcome Y (1 if the individual died, 0 otherwise), Table 2.2 also contains data on the prognostic factor L (1 if the individual was in critical condition, 0 otherwise), which we measured before treatment was assigned."
Usage
greek_data
Format
A data frame with 20 rows and 4 variables:
- name
The name of a Greek god
- l
A prognostic factor
- a
The treatment, a heart transplant
- y
The outcome, death
Source
Hernán and Robins. Causal Inference. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
Data from "How do Mortgage Subsidies Affect Home Ownership? Evidence from the Mid-Century GI Bills"
Description
The mortgages data contains data from Fetter (2013) on home ownership rates by men, focusing on whether they were born at the right time to be eligible for mortgage subsidies based on their military service.
Usage
mortgages
Format
A data frame with 214144 rows and 6 variables
- bpl
Birth State
- qob
Quarter of birth
- nonwhite
White/nonwhite race indicator. 1 = Nonwhite
- vet_wwko
Veteran of either the Korean war or World War II
- home_ownership
Owns a home
- qob_minus_kw
Quarter of birth centered on eligibility for mortgage subsidy (0+ = eligible)
Details
This data is used in the Regression Discontinuity chapter of The Effect.
Source
Fetter, D.K., 2013. How do mortgage subsidies affect home ownership? Evidence from the mid-century GI bills. American Economic Journal: Economic Policy, 5(2), pp.111-47.
References
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
National Health and Nutrition Examination Survey Data I Epidemiologic Follow-up Study
Description
nhefs is a cleaned data set of the data used in Causal Inference by Hernán and Robins. nhefs is a data set containing data from the National Health and Nutrition Examination Survey Data I Epidemiologic Follow-up Study (NHEFS). The NHEFS was jointly initiated by the National Center for Health Statistics and the National Institute on Aging in collaboration with other agencies of the United States Public Health Service. A detailed description of the NHEFS, together with publicly available data sets and documentation, can be found at https://wwwn.cdc.gov/nchs/nhanes/nhefs/.
Usage
nhefs
Format
A data frame with 1629 rows and 67 variables. The codebook is available as nhefs_codebook.
Source
https://wwwn.cdc.gov/nchs/nhanes/nhefs/
References
Hernán and Robins. Causal Inference. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
NHEFS Codebook
Description
nhefs_codebook is the codebook for nhefs and nhefs_complete.
Usage
nhefs_codebook
Format
A data frame with 64 rows and 2 variables.
- variable
The variable being described
- description
The variable description
Source
https://wwwn.cdc.gov/nchs/nhanes/nhefs/
References
Hernán and Robins. Causal Inference. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
Complete-Data National Health and Nutrition Examination Survey Data I Epidemiologic Follow-up Study
Description
nhefs_complete is the same as nhefs, but only participants with complete data are included. The variables that need to be complete to be included are: qsmk, sex, race, age, school, smokeintensity, smokeyrs, exercise, active, wt71, wt82, and wt82_71.
Usage
nhefs_complete
Format
A data frame with 1556 rows and 67 variables. The codebook is available as nhefs_codebook.
Source
https://wwwn.cdc.gov/nchs/nhanes/nhefs/
References
Hernán and Robins. Causal Inference. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
Data from the National Supported Work Demonstration (NSW) job-training program
Description
Data from the National Supported Work Demonstration (NSW) job-training program experiment, where those treated were guaranteed a job for 9-18 months.
Usage
nsw_mixtape
Format
A data frame with 445 rows and 11 variables
- data_id
Individual ID
- treat
In the National Supported Work Demonstration Job Training Program
- age
Age in years
- educ
Years of education
- black
Race: Black
- hisp
Ethnicity: Hispanic
- marr
Married
- nodegree
Has no degree
- re74
Real earnings 1974
- re75
Real earnings 1975
- re78
Real earnings 1978
Details
This data is used in the Matching and Subclassification chapter of Causal Inference: The Mixtape by Cunningham.
Source
Lalonde, Robert. 1986. “Evaluating the Econometric Evaluations of Training Programs with Experimental Data.” American Economic Review 76 (4): 604–20.
Dehejia, Rajeev H., and Sadek Wahba. 1999. “Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs.” Journal of the American Statistical Association 94 (448): 1053–62.
References
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
Organ Donation Data
Description
The organ_donations data contains data from Kessler and Roth (2014) on organ donation rates by state and quarter. The state of California enacted an active-choice phrasing for its organ donation sign-up question in Q3 2011. The only states included in the data are California and those that can serve as valid controls; see Kessler and Roth (2014).
Usage
organ_donations
Format
A data frame with 162 rows and 4 variables
- State
The state, where California is the Treated group
- Quarter
Quarter of observation, in "Q"QYYYY format
- Rate
Organ donation rate
- Quarter_Num
Quarter of observation in numerical format. 1 = Quarter 4, 2010
Details
This data is used in the Difference-in-Differences chapter of The Effect.
Source
Kessler, J.B. and Roth, A.E., 2014. Don't take 'no' for an answer: An experiment with actual organ donor registrations. National Bureau of Economic Research working paper No. 20378. https://www.nber.org/papers/w20378
References
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
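The relationship between the Quarter string and Quarter_Num can be sketched as follows. The helper name quarter_to_num is hypothetical and not part of the package; it assumes the "Q"QYYYY format described above and that Quarter_Num counts consecutive quarters starting from 1 = Quarter 4, 2010.

```r
# Hypothetical helper: convert "Q"QYYYY strings to a running quarter index,
# assuming 1 corresponds to Quarter 4, 2010 and quarters are consecutive.
quarter_to_num <- function(quarter) {
  stopifnot(all(grepl("^Q[1-4][0-9]{4}$", quarter)))
  q    <- as.integer(substr(quarter, 2, 2))   # quarter within the year
  year <- as.integer(substr(quarter, 3, 6))   # four-digit year
  (year - 2010L) * 4L + q - 3L
}

# Quarter 4, 2010 maps to 1; Quarter 3, 2011 (the California change) maps to 4.
print(quarter_to_num(c("Q42010", "Q32011")))
```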
Data on Restaurant Inspections
Description
The restaurant_inspections data contains data on restaurant health inspections performed in Anchorage, Alaska.
Usage
restaurant_inspections
Format
A data frame with 27178 rows and 5 variables
- business_name
Name of restaurant/chain
- inspection_score
Health Inspection Score
- Year
Year of inspection
- NumberofLocations
Number of locations in restaurant chain
- Weekend
Was the inspection performed on a weekend?
Details
This data set is used in the Regression chapter of The Effect.
Source
Camus, Louis-Ashley. 2020. https://www.kaggle.com/loulouashley/inspection-score-restaurant-inspection
References
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
A simple simulated data set for calculating p-values
Description
This simulated data allows for a quick and easy calculation of a p-value using randomization inference.
Usage
ri
Format
A data frame with 8 rows and 5 variables
- name
Fictional Name
- d
Treatment
- y
Outcome
- y0
Outcome if untreated
- y1
Outcome if treated
Details
This data is used in the Potential Outcomes Causal Model chapter of Causal Inference: The Mixtape by Cunningham.
Source
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
References
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
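A minimal sketch of an exact randomization-inference p-value in base R, for data shaped like ri (a treatment indicator d and an outcome y). The function name ri_pvalue and the toy values are hypothetical and are not the values in the package data. The test statistic here is the absolute difference in means under the sharp null of no effect for anyone; other statistics are possible.

```r
# Hypothetical helper: exact randomization-inference p-value.
# Enumerates every way to assign the same number of treated units and
# compares each placebo difference in means with the observed one.
ri_pvalue <- function(y, d) {
  stopifnot(length(y) == length(d), all(d %in% c(0, 1)))
  n_treated <- sum(d)
  stopifnot(n_treated > 0, n_treated < length(y))

  diff_in_means <- function(treated_idx) {
    mean(y[treated_idx]) - mean(y[-treated_idx])
  }

  observed <- diff_in_means(which(d == 1))
  assignments <- utils::combn(length(y), n_treated)  # one column per assignment
  placebo <- apply(assignments, 2, diff_in_means)

  # Share of assignments at least as extreme as the observed one
  # (small tolerance guards against floating-point ties).
  mean(abs(placebo) >= abs(observed) - 1e-12)
}

# Toy data with 8 units, 4 treated; values are made up for illustration.
toy <- data.frame(
  d = c(1, 1, 1, 1, 0, 0, 0, 0),
  y = c(10, 12, 9, 11, 3, 5, 4, 6)
)
p_value <- ri_pvalue(toy$y, toy$d)
print(p_value)
```

With 8 units and 4 treated there are 70 possible assignments, so full enumeration is cheap; larger samples would usually call for random draws of assignments instead.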
Earnings and Loan Repayment in US Four-Year Colleges
Description
From the College Scorecard, this data set contains by-college-by-year data on how students who attended those colleges are doing.
Usage
scorecard
Format
A data frame with 48,445 rows and 8 variables:
- unitid
College identifiers
- inst_name
Name of the college or university
- state_abbr
Two-letter abbreviation for the state the college is in
- pred_degree_awarded_ipeds
Predominant degree awarded. 1 = less-than-two-year, 2 = two-year, 3 = four-year+
- year
Year in which outcomes are measured
- earnings_med
Median earnings among students (a) who received federal financial aid, (b) who began as undergraduates at the institution ten years prior, (c) with positive yearly earnings
- count_not_working
Number of students who are (a) not working (not necessarily unemployed), (b) received federal financial aid, and (c) who began as undergraduates at the institution ten years prior
- count_working
Number of students who are (a) working, (b) who received federal financial aid, and (c) who began as undergraduates at the institution ten years prior
Details
This data is not just limited to four-year colleges and includes a very wide variety of institutions.
Note that the labor market (earnings, working) and repayment rate data do not refer to the same cohort of students, but rather are matched on the year in which outcomes are recorded. Labor market data refers to cohorts that began college as undergraduates ten years prior; repayment rate data refers to cohorts that entered repayment seven years prior.
Data was downloaded using the Urban Institute's educationdata package.
This data was used in the Describing Variables chapter of The Effect by Huntington-Klein.
Source
Education Data Portal (Version 0.4.0 - Beta), Urban Institute, Center on Education Data and Policy, accessed June 28, 2019. https://educationdata.urban.org/documentation/, Scorecard.
References
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
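Examples
A small sketch of how the two count columns might be combined into a share of students working. The data frame scorecard_demo and its values are invented for illustration (they are not rows from the package), and share_working is a hypothetical column name.

```r
# Hypothetical stand-in using a few of the `scorecard` column names.
# Values are made up for illustration, NOT taken from the package data.
scorecard_demo <- data.frame(
  unitid            = c(1001, 1001, 1002),
  year              = c(2016, 2017, 2016),
  count_working     = c(800, 850, 300),
  count_not_working = c(200, 150, 100)
)

# Share working = working / (working + not working), by college and year
scorecard_demo$share_working <- with(
  scorecard_demo,
  count_working / (count_working + count_not_working)
)

scorecard_demo
```

Because count_not_working covers students who are not working for any reason, one minus this share should not be read as an unemployment rate.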
Data from John Snow's 1855 study of the cause of cholera
Description
A subset of the aggregated death rate data from Snow's legendary study of the source of the London Cholera outbreak.
Usage
snow
Format
A data frame with 4 rows and 4 variables
- year
Year
- supplier
Water pump supplier
- treatment
Status of water pump
- deathrate
Deaths per 10k 1851 population
Details
This data is used in the Difference-in-Differences chapter of The Effect by Huntington-Klein.
Source
Snow, John. 1855. 'On the Mode of Communication of Cholera'. John Churchill.
Coleman, Thomas. 2019. 'Causality in the time of cholera: John Snow as a prototype for causal inference.' SSRN 3262234.
References
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
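Examples
A rough sketch of a two-by-two difference-in-differences calculation on data shaped like snow (two suppliers, two years). The data frame snow_demo, its supplier labels, and its death rates are invented for illustration and are not the values in the package; the helper rate is hypothetical.

```r
# Hypothetical stand-in shaped like `snow`: 2 suppliers x 2 years.
# Labels and death rates are made up, NOT the package data.
snow_demo <- data.frame(
  year      = c(1849, 1849, 1854, 1854),
  supplier  = c("Comparison", "Switched", "Comparison", "Switched"),
  deathrate = c(130, 120, 140, 80)
)

# Look up the death rate for one supplier in one year
rate <- function(s, yr) {
  with(snow_demo, deathrate[supplier == s & year == yr])
}

# Change over time for the supplier whose water source changed,
# minus the change over time for the comparison supplier
change_switched   <- rate("Switched", 1854) - rate("Switched", 1849)
change_comparison <- rate("Comparison", 1854) - rate("Comparison", 1849)
did <- change_switched - change_comparison
did
```

With the packaged data, the supplier and treatment columns identify which group changed water source; check their actual labels before adapting the lookups above.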
Data from "Social Networks and the Decision to Insure"
Description
The social_insure data comes from Cai, De Janvry, and Sadoulet (2015), a two-round social-network-based experiment on encouraging farmers to buy insurance. See the paper for more details.
Usage
social_insure
Format
A data frame with 1410 rows and 13 variables
- address
Natural village
- village
Administrative village
- takeup_survey
Whether farmer ended up purchasing insurance. (1 = yes)
- age
Household Characteristics - Age
- agpop
Household Characteristics - Household Size
- ricearea_2010
Area of Rice Production
- disaster_prob
Perceived Probability of Disasters Next Year
- male
Household Characteristics - Gender of Household Head (1 = male)
- default
The "default option" the farmer was assigned in the experimental format (1 = default is to buy, 0 = default is to not buy)
- intensive
Whether the farmer was assigned to the "intensive" experimental session (1 = yes)
- risk_averse
Risk aversion measurement
- literacy
1 = literate, 0 = illiterate
- pre_takeup_rate
Takeup rate prior to experiment
Details
This data is used in the Instrumental Variables chapter of The Effect.
Source
Cai, J., De Janvry, A. and Sadoulet, E., 2015. Social networks and the decision to insure. American Economic Journal: Applied Economics, 7(2), pp.81-108.
References
Huntington-Klein. 2021. The Effect: An Introduction to Research Design and Causality. https://theeffectbook.net.
Data on prison capacity expansion in Texas
Description
This data looks at the massive expansion in prison capacity in Texas that occurred in 1993 under Governor Ann Richards, and the effect of that expansion on the number of Black men in prison.
Usage
texas
Format
A data frame with 816 rows and 12 variables
- statefip
State FIPS code
- year
Year
- bmprison
Number of Black men in prison
- wmprison
Number of White men in prison
- alcohol
Alcohol consumption per capita
- income
Median income
- ur
Unemployment rate
- poverty
Poverty rate
- black
Percentage of the population that is Black
- perc1519
Percentage of the population that is age 15-19
- aidscapita
AIDS mortality per 100,000 in t
- state
State name
Details
This data is used in the Synthetic Control chapter of Causal Inference: The Mixtape by Cunningham.
Source
Cunningham and Kang. 2019. “Studying the Effect of Incarceration Shocks to Drug Markets.” Unpublished manuscript. http://www.scunning.com/files/mass_incarceration_and_drug_abuse.pdf
References
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
Data from HIV information experiment in Thornton (2008)
Description
thornton_hiv comes from an experiment in Malawi looking at whether cash incentives could encourage people to learn the results of their HIV tests.
Usage
thornton_hiv
Format
A data frame with 4820 rows and 7 variables
- villnum
Village ID
- got
Got HIV results
- distvct
Distance in kilometers
- tinc
Total incentive
- any
Received any incentive
- age
Age
- hiv2004
HIV results
Details
This data is used in the Potential Outcomes Causal Model chapter of Causal Inference: The Mixtape by Cunningham.
Source
Thornton, Rebecca L. 2008. 'The Demand for, and Impact of, Learning HIV Status.' American Economic Review 98 (5): 1829–63.
References
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
Data from the sinking of the Titanic
Description
titanic comes from the sinking of the Titanic, and can be used to look at survival by different demographic characteristics.
Usage
titanic
Format
A data frame with 2201 rows and 4 variables
- class
class (ticket)
- age
Age (Child vs. Adult)
- sex
Gender
- survived
Survived
Details
This data is used in the Matching and Subclassification chapter of Causal Inference: The Mixtape by Cunningham.
Source
British Board of Trade (1990), Report on the Loss of the ‘Titanic’ (S.S.). British Board of Trade Inquiry Report (reprint). Gloucester, UK: Allan Sutton Publishing.
References
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
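Examples
A minimal sketch of comparing survival rates across strata, the kind of tabulation that precedes a subclassification estimate. The data frame titanic_demo is invented for illustration and is not the package data; the sketch also assumes survived is coded 0/1 and that class and sex are stored as labels, which may differ from the coding in titanic.

```r
# Hypothetical stand-in with the `titanic` column names.
# Rows and codings are made up for illustration, NOT the package data.
titanic_demo <- data.frame(
  class    = c("1st", "1st", "3rd", "3rd", "1st", "3rd", "3rd", "1st"),
  sex      = c("women", "men", "women", "men", "women", "men", "men", "men"),
  survived = c(1, 0, 1, 0, 1, 0, 1, 1)
)

# Survival rate within each class-by-sex stratum
rates <- aggregate(survived ~ class + sex, data = titanic_demo, FUN = mean)
rates

# Number of passengers in each stratum, useful as subclassification weights
counts <- aggregate(survived ~ class + sex, data = titanic_demo, FUN = length)
names(counts)[names(counts) == "survived"] <- "n"
counts
```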
Simulated data from a job training program for a bias reduction method
Description
This simulated data is used to demonstrate the bias-reduction method in matching as per Abadie and Imbens (2011).
Usage
training_bias_reduction
Format
A data frame with 8 rows and 4 variables
- Unit
Unit ID
- Y
Outcome
- D
Treatment
- X
Matching variable
Details
This data is used in the Matching and Subclassification chapter of Causal Inference: The Mixtape by Cunningham.
Source
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
References
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
Simulated data from a job training program
Description
This simulated data, which is presented in the form of a full results table, is used to demonstrate a matching procedure.
Usage
training_example
Format
A data frame with 25 rows and 9 variables
- unit_treat
Unit ID for treated observations
- age_treat
age for treated observations
- earnings_treat
earnings for treated observations
- unit_control
Unit ID for control observations
- age_control
age for control observations
- earnings_control
earnings for control observations
- unit_matched
Unit ID for matched controls
- age_matched
age for matched controls
- earnings_matched
earnings for matched controls
Details
This data is used in the Matching and Subclassification chapter of Causal Inference: The Mixtape by Cunningham.
Source
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
References
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
Data on 19th century English Poverty from Yule (1899)
Description
yule allows for a look at the correlation between poverty relief and poverty rates in England in the 19th century.
Usage
yule
Format
A data frame with 32 rows and 5 variables
- location
Location in England
- paup
Pauperism Growth
- outrelief
Poverty Relief Growth
- old
Annual growth in aged population
- pop
Annual growth in population
Details
This data is used in the Potential Outcomes Causal Model chapter of Causal Inference: The Mixtape by Cunningham.
Source
Yule, G. Udny. 1899. 'An Investigation into the Causes of Changes in Pauperism in England, Chiefly During the Last Two Intercensal Decades.' Journal of the Royal Statistical Society 62: 249–95.
References
Cunningham. 2021. Causal Inference: The Mixtape. Yale Press. https://mixtape.scunning.com/index.html.
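Examples
A short sketch of looking at the relationship between relief growth and pauperism growth, first as a raw correlation and then as a regression that adjusts for the two population-growth variables. The data frame yule_demo and its values are invented for illustration and are not the package data; the regression specification is an assumption about how the variables might be used, not a statement of the book's exact model.

```r
# Hypothetical stand-in with the `yule` column names.
# Values are made up for illustration, NOT the package data.
yule_demo <- data.frame(
  location  = paste("Area", 1:6),
  paup      = c(27, 47, 31, 64, 46, 52),
  outrelief = c(5, 12, 21, 21, 18, 27),
  old       = c(104, 115, 85, 81, 113, 105),
  pop       = c(136, 111, 174, 124, 96, 91)
)

# Raw correlation between relief growth and pauperism growth
raw_cor <- cor(yule_demo$outrelief, yule_demo$paup)
raw_cor

# Linear regression adjusting for growth in the aged and total population
fit <- lm(paup ~ outrelief + old + pop, data = yule_demo)
coef(fit)
```

Neither number supports a causal reading on its own; the comparison only shows how the association shifts once the other variables are held fixed.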