Chapter 9: Econometrics Foundations

Introduction

Economics makes causal claims — minimum wages affect employment, education raises earnings, institutions determine growth. Testing these claims requires data and a method for distinguishing causation from correlation. Econometrics is that method.

This chapter is not a statistics course. We assume familiarity with basic probability and regression. Instead, we focus on the central problem of empirical economics: identification — finding credible sources of exogenous variation that allow us to estimate causal effects. Every tool in this chapter — OLS, instrumental variables, difference-in-differences, regression discontinuity — is a strategy for solving the identification problem.

By the end of this chapter, you will be able to:
  1. State the identification problem and explain why correlation does not imply causation
  2. Derive and interpret OLS estimates and diagnose omitted variable bias
  3. Explain the logic of instrumental variables and evaluate instrument validity
  4. Set up and interpret a difference-in-differences design
  5. Explain the logic of regression discontinuity designs
  6. Evaluate threats to validity in empirical research

Prerequisites: Chapters 2 and 5 (economic context for examples). Mathematical prerequisites: linear algebra, probability and statistics.

9.1 The Identification Problem

The identification problem. The difficulty of establishing that a relationship between two variables is causal rather than merely correlative.

Consider the question: does an additional year of education increase earnings? We observe that more-educated people earn more. But is this because:

  1. Education itself raises productivity and therefore earnings (causation)?
  2. People who would have earned more anyway, because of ability or family background, tend to get more education (selection)?

Endogeneity. A regressor $X$ is endogenous when it is correlated with the error term: $Cov(X, \varepsilon) \neq 0$. This arises from omitted variables, simultaneity, or measurement error, and causes OLS to produce biased estimates.
Counterfactual. The outcome that would have been observed for a treated unit had it not received treatment. Since only one state is ever observed for each unit, the counterfactual is always hypothetical. All causal inference methods are strategies for constructing plausible counterfactuals.

Both are consistent with the observed correlation. The identification problem is that we cannot directly compare the same person with and without education — the counterfactual is unobserved.

The fundamental equation:

$$Y_i = \alpha + \beta X_i + \varepsilon_i$$ (Eq. 9.1)

where $Y_i$ is the outcome (earnings), $X_i$ is the treatment (years of education), $\beta$ is the causal parameter of interest, and $\varepsilon_i$ captures everything else affecting $Y_i$ — ability, family background, motivation, luck, health, and thousands of other factors.

The identification problem arises when $X_i$ is correlated with $\varepsilon_i$ — when the "treatment" is not randomly assigned. In statistics, this is called endogeneity. In economics, it is the norm, not the exception: people choose their education (and the choice is correlated with ability), countries choose their policies (and the choice is correlated with their economic conditions), firms choose their prices (and the choice is correlated with demand conditions).

In a randomized experiment, the treatment $X_i$ is assigned by a coin flip — it is independent of $\varepsilon_i$ by construction. But economists rarely have the luxury of randomization for the big questions. The methods in this chapter — OLS, IV, DiD, RD — are strategies for finding "natural experiments" that approximate randomization in observational data.
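The contrast between self-selected and randomized treatment can be seen in a small simulation. This is a sketch with made-up parameters, not an example from the text: the true effect is 2, and under self-selection high-ability people (who earn more regardless) are more likely to take the treatment.

```python
# Sketch: why randomization identifies the causal effect (illustrative parameters).
# True treatment effect is 2. Under self-selection, high-ability individuals
# (who earn more anyway) take the treatment more often; under randomization they do not.
import random

random.seed(0)
true_effect = 2.0

def simulate(randomized, n=20_000):
    treated, control = [], []
    for _ in range(n):
        ability = random.gauss(0, 1)
        if randomized:
            d = random.random() < 0.5          # coin flip: independent of ability
        else:
            d = ability + random.gauss(0, 1) > 0   # self-selection on ability
        y = true_effect * d + ability + random.gauss(0, 1)
        (treated if d else control).append(y)
    return sum(treated) / len(treated) - sum(control) / len(control)

naive = simulate(randomized=False)   # confounded comparison: overstates the effect
rct = simulate(randomized=True)      # randomized comparison: close to 2
print(f"self-selected gap: {naive:.2f}, randomized gap: {rct:.2f}")
```

The naive gap mixes the treatment effect with the ability difference between groups; randomization removes the ability difference by construction.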

9.2 Ordinary Least Squares (OLS)

OLS. Minimizes the sum of squared residuals to estimate the linear relationship between $Y$ and $X$.

For the multivariate model $Y = X\beta + \varepsilon$ (matrix notation):

$$\hat{\beta}_{OLS} = (X'X)^{-1}X'Y$$ (Eq. 9.2)
Gauss-Markov assumptions. The set of conditions under which OLS is the best linear unbiased estimator: (1) linearity, (2) random sampling, (3) no perfect multicollinearity, (4) zero conditional mean ($E[\varepsilon|X] = 0$), and (5) homoskedasticity ($Var(\varepsilon|X) = \sigma^2$).

The Gauss-Markov assumptions are:

  1. Linearity: The true model is linear in parameters
  2. Random sampling: Observations are independently drawn
  3. No perfect multicollinearity: No regressor is an exact linear function of others
  4. Zero conditional mean: $E[\varepsilon|X] = 0$ — the error has no systematic relationship with the regressors
  5. Homoskedasticity: $Var(\varepsilon|X) = \sigma^2$ — the error variance is constant
Zero conditional mean. The assumption $E[\varepsilon|X] = 0$: the error term has no systematic relationship with the regressors. This is the critical assumption for OLS unbiasedness. When it fails (due to omitted variables, simultaneity, or measurement error), OLS is biased.
BLUE (Best Linear Unbiased Estimator). Under the Gauss-Markov assumptions, OLS has the lowest variance among all linear unbiased estimators. "Best" = minimum variance; "Linear" = a linear function of $Y$; "Unbiased" = $E[\hat{\beta}] = \beta$.

Under these assumptions, OLS is BLUE — the Best Linear Unbiased Estimator. "Best" means lowest variance among all linear unbiased estimators. "Unbiased" means $E[\hat{\beta}] = \beta$.

The critical assumption is #4: $E[\varepsilon|X] = 0$. When this fails — due to omitted variables, simultaneity, or measurement error in $X$ — OLS is biased. The estimate $\hat{\beta}$ no longer converges to the true $\beta$ even with infinite data. This is not a small-sample problem — it is a fundamental design flaw that more data cannot fix.
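Eq. 9.2 is easy to compute directly. A minimal sketch, using simulated data in which the zero conditional mean assumption holds by construction (the parameter values are illustrative):

```python
# Sketch: computing the OLS estimator of Eq. 9.2, beta_hat = (X'X)^{-1} X'Y,
# on simulated data where E[eps | X] = 0 holds, so OLS is unbiased.
import numpy as np

rng = np.random.default_rng(42)
n = 1000
x = rng.normal(size=n)
eps = rng.normal(size=n)                # independent of x: zero conditional mean
y = 1.0 + 0.5 * x + eps                 # true alpha = 1.0, true beta = 0.5

X = np.column_stack([np.ones(n), x])    # design matrix with intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solves the normal equations X'X b = X'Y
print("alpha_hat, beta_hat:", beta_hat.round(3))
```

Solving the normal equations with `np.linalg.solve` is numerically preferable to forming the inverse $(X'X)^{-1}$ explicitly, though the result is the same.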

Figure 9.1 — OLS Regression Explorer

A scatter plot with a fitted OLS regression line. Drag the slider to add an outlier at different vertical positions and watch the regression line tilt. Observe how a single high-leverage point can dramatically change the slope, $R^2$, and coefficients.


Figure 9.1. OLS regression with an adjustable outlier. The outlier is placed at $X=14$ (high leverage). Drag the slider above "No outlier" to introduce it and watch the line tilt. Hover for values.

Omitted Variable Bias

Omitted variable bias. Bias in the OLS estimator caused by excluding a relevant variable that is correlated with both the dependent variable and an included regressor. The direction and magnitude of the bias depend on the sign of the omitted variable's effect and its correlation with the included regressor.

Suppose the true model is $Y = \beta_0 + \beta_1 X + \beta_2 Z + u$, but we omit $Z$ and run $Y = \alpha_0 + \alpha_1 X + e$. Then:

$$E[\hat{\alpha}_1] = \beta_1 + \beta_2 \cdot \frac{Cov(X, Z)}{Var(X)}$$ (Eq. 9.3)

The bias equals the effect of the omitted variable ($\beta_2$) times the association between the omitted variable and the included regressor.

Sign of bias:

|               | $Cov(X, Z) > 0$                      | $Cov(X, Z) < 0$                         |
|---------------|--------------------------------------|-----------------------------------------|
| $\beta_2 > 0$ | Upward bias (overestimate $\beta_1$) | Downward bias (underestimate $\beta_1$) |
| $\beta_2 < 0$ | Downward bias                        | Upward bias                             |

Example 9.1 — Return to Education

Suppose ability ($Z$) is positively correlated with both education ($X$) and earnings ($Y$). Then $\beta_2 > 0$ (ability raises earnings) and $Cov(X,Z) > 0$ (more able people get more education). The OLS estimate of the return to education is biased upward — it attributes some of the ability effect to education.
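The bias formula (Eq. 9.3) can be verified by simulation. In this sketch the parameter values (effect sizes, confounding strength) are illustrative assumptions:

```python
# Sketch: checking the omitted variable bias formula (Eq. 9.3) by simulation.
# True model: Y = b1*X + b2*Z + u, with ability Z correlated with education X.
import numpy as np

rng = np.random.default_rng(7)
n = 50_000
b1, b2 = 0.5, 1.0
z = rng.normal(size=n)                      # ability (omitted from the regression)
x = 0.8 * z + rng.normal(size=n)            # education, correlated with ability
y = b1 * x + b2 * z + rng.normal(size=n)

# Naive slope from regressing Y on X alone
naive = np.cov(x, y)[0, 1] / np.var(x)
# Eq. 9.3: true effect plus b2 times Cov(X, Z) / Var(X)
predicted = b1 + b2 * np.cov(x, z)[0, 1] / np.var(x)
print(f"naive OLS: {naive:.3f}, formula prediction: {predicted:.3f}")
```

With positive confounding the naive estimate lands well above the true $\beta_1 = 0.5$, and the formula predicts the biased value almost exactly.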

Figure 9.2 — Omitted Variable Bias

Two panels show the same data. Left: the true relationship with the confounder (ability) shown as point color. Right: the naive OLS regression that omits ability. Drag the slider to change confounding strength and watch the bias grow.


Left: True model with confounder (ability) shown as color. Darker = higher ability.

Right: Naive OLS ignoring ability. The biased line (red dashed) is steeper than the true causal effect (blue).

9.3 Instrumental Variables (IV)

When OLS is biased because $X$ is endogenous ($Cov(X, \varepsilon) \neq 0$), an instrumental variable can rescue the estimation.

Instrument ($Z$). A variable that: (1) Relevance: $Z$ is correlated with $X$ ($Cov(Z, X) \neq 0$); (2) Exclusion restriction: $Z$ affects $Y$ only through $X$ ($Cov(Z, \varepsilon) = 0$).
Relevance condition. The requirement that the instrument $Z$ is sufficiently correlated with the endogenous regressor $X$. A weak instrument (low correlation) produces unreliable IV estimates with large standard errors and bias toward OLS. The first-stage F-statistic should exceed 10.
Exclusion restriction. The assumption that the instrument $Z$ affects the outcome $Y$ only through its effect on the endogenous regressor $X$, not through any other channel: $Cov(Z, \varepsilon) = 0$. This assumption is not directly testable and must be argued on theoretical grounds.
Two-stage least squares (2SLS). An IV estimation procedure: (1) regress $X$ on $Z$ to get fitted values $\hat{X}$; (2) regress $Y$ on $\hat{X}$. The first stage isolates the exogenous variation in $X$; the second stage uses only that variation to estimate the causal effect.

Two-Stage Least Squares (2SLS):

First stage: Regress $X$ on $Z$ (and any control variables):

$$X_i = \pi_0 + \pi_1 Z_i + \nu_i$$ (First stage)

This isolates the part of $X$ driven by the instrument — the exogenous part. The fitted values $\hat{X}_i$ represent the "clean" variation in $X$.

Second stage: Regress $Y$ on $\hat{X}$. In matrix form:

$$\hat{\beta}_{IV} = (Z'X)^{-1}Z'Y$$ (Eq. 9.4)

In the simple case with one instrument and one endogenous regressor:

$$\hat{\beta}_{IV} = \frac{Cov(Z, Y)}{Cov(Z, X)}$$ (Eq. 9.5)

The IV estimate is the ratio of the reduced form (effect of $Z$ on $Y$) to the first stage (effect of $Z$ on $X$). The intuition: $Z$ affects $Y$ only through $X$ (exclusion restriction), so dividing out the first stage isolates the causal effect of $X$ on $Y$.

What IV estimates. With heterogeneous treatment effects, IV identifies the Local Average Treatment Effect (LATE) — the causal effect for the subpopulation whose behavior is changed by the instrument (the "compliers").
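The simple IV estimator of Eq. 9.5 can be checked on simulated data with a known confounder. All parameter values below are illustrative assumptions:

```python
# Sketch: the simple IV estimator of Eq. 9.5 on simulated data with an
# endogenous regressor. OLS is biased; IV recovers the true effect.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
beta = 0.5                                    # true causal effect
u = rng.normal(size=n)                        # unobserved confounder
z = rng.normal(size=n)                        # instrument: independent of u
x = 1.0 * z + u + rng.normal(size=n)          # relevance: z moves x
y = beta * x + u + rng.normal(size=n)         # u also moves y, so x is endogenous

ols = np.cov(x, y)[0, 1] / np.var(x)          # biased upward by Cov(x, u)
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]  # Eq. 9.5: reduced form / first stage
print(f"OLS: {ols:.3f}, IV: {iv:.3f}, truth: {beta}")
```

The ratio form makes the intuition concrete: the instrument's effect on $Y$ runs entirely through $X$, so dividing by the first stage rescales it into the effect of $X$ itself.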

Weak Instruments

Weak instruments. Instruments with low correlation with the endogenous regressor (first-stage F-statistic below 10). Weak instruments cause the IV estimator to be biased toward OLS, have non-normal sampling distributions, and produce misleading confidence intervals.

If $Z$ is weakly correlated with $X$, the first stage is weak, and the IV estimate is unreliable (biased toward OLS, wide confidence intervals). Rule of thumb: first-stage F-statistic > 10.
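The weak-instrument problem shows up clearly in a small Monte Carlo. This is a sketch: the tiny first-stage coefficient is an assumption chosen to make the point.

```python
# Sketch: Monte Carlo showing weak-instrument bias toward OLS (illustrative setup).
# The first-stage coefficient pi is tiny, so the first-stage F-statistic is low.
import numpy as np

rng = np.random.default_rng(11)
beta, pi, n, reps = 0.5, 0.05, 200, 2000
iv_draws, f_stats = [], []
for _ in range(reps):
    u = rng.normal(size=n)
    z = rng.normal(size=n)
    x = pi * z + u + rng.normal(size=n)
    y = beta * x + u + rng.normal(size=n)
    # First-stage slope and F-statistic for H0: pi = 0
    sxx = ((z - z.mean()) ** 2).sum()
    pi_hat = ((z - z.mean()) * (x - x.mean())).sum() / sxx
    resid = (x - x.mean()) - pi_hat * (z - z.mean())
    se = np.sqrt(resid @ resid / (n - 2) / sxx)
    f_stats.append((pi_hat / se) ** 2)
    iv_draws.append(np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1])

print(f"median first-stage F: {np.median(f_stats):.1f}")  # far below 10
print(f"median IV estimate:   {np.median(iv_draws):.2f}") # pulled toward OLS, away from 0.5
```

With an F-statistic this low, the IV estimates cluster near the biased OLS value rather than the true $\beta = 0.5$, which is exactly the failure mode the F > 10 rule of thumb guards against.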

Example 9.2 — Quarter of Birth (Angrist & Krueger 1991)

Quarter of birth was used as an instrument for years of schooling. Compulsory schooling laws mean students born earlier in the year can drop out with slightly less education. Quarter of birth is plausibly: (a) correlated with schooling (relevance), and (b) not directly related to earnings (exclusion). The IV estimate of the return to schooling was approximately 7–8% per year.

Interactive: Instrumental Variables DAG

This directed acyclic graph shows the causal structure of an IV design. Toggle between views to see how an instrument Z breaks the confounding path.

DAG for the instrumental variables design. Z is the instrument, X is the endogenous regressor, Y is the outcome, and U is the unobserved confounder. The IV strategy uses only the variation in X that is driven by Z, bypassing the confounding path through U.

9.4 Difference-in-Differences (DiD)

Difference-in-differences. A method that compares changes over time between a treatment group and a control group to estimate the causal effect of a treatment.
$$\hat{\tau}_{DiD} = (\bar{Y}_{T,post} - \bar{Y}_{T,pre}) - (\bar{Y}_{C,post} - \bar{Y}_{C,pre})$$ (Eq. 9.6)

The first difference removes time-invariant group characteristics. The second difference removes common time trends.

Parallel trends assumption. The assumption that, in the absence of treatment, the treatment and control groups would have experienced the same change in outcomes over time. Parallel trends cannot be directly tested for the post-treatment period but can be assessed by checking whether pre-treatment trends are similar.

Key assumption: Parallel trends. In the absence of treatment, the treatment and control groups would have followed the same trend. This is untestable for the post-treatment period but assessable for the pre-treatment period.

Example 9.3 — Card & Krueger (1994)

New Jersey raised its minimum wage from \$4.25 to \$5.05 in April 1992; Pennsylvania did not. The DiD estimate of the employment effect was positive (+2.7 FTE workers), contradicting the simple competitive model prediction. This study spurred a revolution in empirical labor economics.

Regression formulation:

$$Y_{it} = \alpha + \beta_1 \cdot Treat_i + \beta_2 \cdot Post_t + \tau \cdot (Treat_i \times Post_t) + \varepsilon_{it}$$ (Eq. 9.7)
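The four-means formula (Eq. 9.6) and the interaction coefficient $\tau$ in Eq. 9.7 give the same number. This can be verified with a few lines of arithmetic; the group means below are illustrative:

```python
# Sketch: the DiD estimate of Eq. 9.6, and its equivalence to the interaction
# coefficient of Eq. 9.7, using four illustrative group means.
means = {
    ("treat", "pre"): 55.0, ("treat", "post"): 63.0,
    ("control", "pre"): 52.0, ("control", "post"): 56.0,
}

did = (means[("treat", "post")] - means[("treat", "pre")]) - (
    means[("control", "post")] - means[("control", "pre")]
)
print("DiD estimate:", did)   # (63 - 55) - (56 - 52) = 4.0

# With fully saturated group/period dummies, the Eq. 9.7 coefficients are:
alpha = means[("control", "pre")]            # control, pre-period level
beta1 = means[("treat", "pre")] - alpha      # time-invariant group gap
beta2 = means[("control", "post")] - alpha   # common time trend
tau = means[("treat", "post")] - (alpha + beta1 + beta2)
assert tau == did
```

The first difference removes the group gap ($\beta_1$), the second removes the common trend ($\beta_2$), leaving only $\tau$.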

Figure 9.3 — Difference-in-Differences

Two time series show a treatment group and a control group. The treatment occurs at $t = 5$. Drag the slider to change the treatment effect size and see how the DiD estimate updates. Pre-treatment parallel trends are visible.


Figure 9.3. Difference-in-differences design. The dashed line shows the counterfactual — what would have happened to the treatment group without treatment (parallel to control). The gap between the actual and counterfactual outcomes at the end is the treatment effect.

9.5 Regression Discontinuity (RD)

Regression discontinuity. A method that exploits a sharp cutoff in a "running variable" that determines treatment assignment. Observations just above and just below the cutoff are similar in all respects except treatment — creating a local quasi-experiment.
Running variable. The continuous variable that determines treatment assignment in an RD design. Treatment is assigned when the running variable crosses a cutoff (e.g., a test score threshold, an age cutoff, an election margin). The running variable must not be precisely manipulable by agents.
Continuity assumption. The assumption that all factors affecting the outcome (other than treatment) vary continuously at the cutoff. If this holds, the discontinuity in the outcome at the cutoff is attributable solely to the treatment. Violated when agents can precisely sort around the threshold.
$$\hat{\tau}_{RD} = \lim_{x \downarrow c} E[Y|X = x] - \lim_{x \uparrow c} E[Y|X = x]$$ (Eq. 9.8)

Key assumption: Continuity. All factors affecting $Y$ (other than treatment) vary continuously at the cutoff — no sorting or manipulation around the threshold.

Example 9.4 — Scholarship at Score = 80

A scholarship is awarded to students scoring above 80 on an exam. Students scoring 79 and 81 are similar in ability but one gets the scholarship and the other does not. The discontinuity in outcomes (e.g., college completion rates) at the 80-point threshold estimates the causal effect of the scholarship.
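A sharp RD estimate can be sketched as two local linear fits, one on each side of the cutoff, compared at the threshold. The data-generating process below is an illustrative assumption, loosely following the scholarship example:

```python
# Sketch: a sharp RD estimate via separate linear fits on each side of the cutoff,
# restricted to a bandwidth around it (all data-generating values are illustrative).
import numpy as np

rng = np.random.default_rng(5)
n, cutoff, bandwidth, tau = 5000, 80.0, 10.0, 0.15
score = rng.uniform(50, 100, size=n)          # running variable (test score)
treated = score >= cutoff                     # scholarship above the cutoff
# Outcome (e.g., completion rate): smooth in score, plus a jump of tau at the cutoff
complete = 0.2 + 0.005 * score + tau * treated + rng.normal(0, 0.05, size=n)

def fit_at_cutoff(side):
    m = side & (np.abs(score - cutoff) <= bandwidth)      # keep points near the cutoff
    slope, intercept = np.polyfit(score[m] - cutoff, complete[m], 1)
    return intercept                          # fitted value exactly at the cutoff

tau_hat = fit_at_cutoff(treated) - fit_at_cutoff(~treated)
print(f"RD estimate: {tau_hat:.3f} (true jump {tau})")
```

Centering the running variable at the cutoff makes each fit's intercept the limit in Eq. 9.8, so their difference is the estimated jump.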

Figure 9.4 — Regression Discontinuity

A scatter plot with a running variable (test score). Students above the cutoff receive treatment (scholarship). Polynomial fits on each side reveal the jump at the cutoff. Adjust the cutoff position and the bandwidth to see how the estimated treatment effect changes.


Figure 9.4. Regression discontinuity. The vertical dashed line marks the cutoff. Points left of the cutoff are untreated (gray); right are treated (green). The jump at the cutoff is the treatment effect estimate. Adjust the bandwidth to focus on observations near the cutoff.

9.6 Randomized Controlled Trials (RCTs)

Randomized controlled trial. Random assignment of treatment ensures that treatment and control groups are identical in expectation — eliminating confounding by construction.
$$\hat{\tau}_{RCT} = \bar{Y}_{treatment} - \bar{Y}_{control}$$ (Eq. 9.9)
Internal validity. The degree to which a study accurately estimates the causal effect within its specific context and sample. An internally valid study correctly identifies causation for the population studied. Threats include confounding, selection bias, attrition, and measurement error.
External validity. The degree to which a study's findings generalize to other populations, settings, or time periods. An RCT conducted in rural Kenya may not apply to urban India. Scaling up a program often changes the context (general equilibrium effects, different populations of compliers).

RCTs are the "gold standard" for internal validity because randomization guarantees $E[\varepsilon|X] = 0$ by construction. Banerjee, Duflo, and Kremer received the 2019 Nobel Prize for their experimental approach to alleviating global poverty.

Limitations of RCTs

Intent-to-treat (ITT). The average treatment effect of being assigned to treatment, regardless of whether the subject actually complied. ITT is always well-identified in an RCT because it compares groups as randomized. With partial compliance, ITT underestimates the effect of actually receiving treatment.
Treatment-on-treated (TOT). The average causal effect of actually receiving treatment (among compliers). Estimated as $TOT = ITT / \text{compliance rate}$. TOT answers: "What is the effect for people who actually took the treatment?" but requires stronger assumptions than ITT.
Statistical power. The probability that a study correctly rejects a false null hypothesis (i.e., detects a true treatment effect). Power depends on effect size, sample size, and variance. Underpowered studies risk failing to detect real effects (Type II error). Standard target: 80% power.
Example 9.5 — RCT with Partial Compliance

A job training program randomly assigns 500 individuals to treatment and 500 to control. Only 60% of those assigned to treatment actually attend the program (compliance rate = 0.6).

Results: Average earnings: treatment group = \$15,000, control group = \$13,000.

ITT: $\hat{\tau}_{ITT} = 15{,}000 - 13{,}000 = \$2{,}000$. This is the effect of being offered the program.

TOT: $\hat{\tau}_{TOT} = 2{,}000 / 0.6 \approx \$3{,}333$. This estimates the effect of actually attending the program (for compliers). The TOT is larger because the ITT is diluted by non-compliers.

Power check: With $n = 500$ per group, an earnings standard deviation of $\sigma = \$11{,}000$, and a true effect of $\$2{,}000$ (the ITT), power $\approx 0.8$. The study is adequately powered to detect the ITT.
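The ITT, TOT, and power arithmetic can be reproduced in a few lines. The power formula assumes a two-sided 5% test comparing two independent means, and the noise standard deviation is an assumption chosen so that power lands near 0.8:

```python
# Sketch: ITT, TOT, and power calculations in the style of Example 9.5.
# Power formula: two-sided 5% test for a difference in two independent means.
from math import sqrt, erf

def phi(z):                              # standard normal CDF
    return 0.5 * (1 + erf(z / sqrt(2)))

# ITT and TOT
y_treat, y_control, compliance = 15_000.0, 13_000.0, 0.6
itt = y_treat - y_control                # effect of being offered the program
tot = itt / compliance                   # effect of attending, for compliers
print(f"ITT = {itt:.0f}, TOT = {tot:.0f}")

# Power to detect the ITT: compare effect size to the SE of the difference in means
n = 500                                  # per group
sigma = 11_000.0                         # assumed earnings standard deviation
se = sigma * sqrt(2 / n)
power = phi(itt / se - 1.96)
print(f"power = {power:.2f}")
```

The same formula, solved the other way, gives the minimum detectable effect: $MDE \approx (1.96 + 0.84) \cdot SE$ for 80% power.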

Figure 9.5 — RCT Power Calculator

Statistical power is the probability of detecting a true treatment effect. Use the sliders to explore how effect size, sample size, and variance affect power. The power curve updates in real time, and the minimum detectable effect (MDE) at 80% power is highlighted.


Figure 9.5. Power curve: probability of detecting the effect as a function of effect size. The red dashed line marks 80% power. The green diamond marks the current parameter combination. The MDE is the smallest effect detectable at 80% power given sample size and variance.

9.7 Standard Errors and Inference

A point estimate without a measure of uncertainty is nearly useless.

$$Var(\hat{\beta}) = \sigma^2(X'X)^{-1}$$ (Eq. 9.10)

Standard errors (SE) are the square roots of the diagonal elements. A 95% confidence interval is approximately $\hat{\beta} \pm 1.96 \cdot SE(\hat{\beta})$.

Statistical significance: We reject $H_0: \beta = 0$ at the 5% level if $|t| = |\hat{\beta}/SE(\hat{\beta})| > 1.96$.

Economic significance vs statistical significance: A coefficient can be statistically significant but economically trivial. Conversely, an imprecise estimate can be economically large but statistically insignificant. Good empirical work discusses both.
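The variance formula (Eq. 9.10), the standard error, the t-statistic, and a 95% confidence interval can all be computed directly. A sketch on simulated data, assuming homoskedastic errors:

```python
# Sketch: standard errors, t-statistic, and 95% CI from Eq. 9.10,
# Var(beta_hat) = sigma^2 (X'X)^{-1}, on simulated homoskedastic data.
import numpy as np

rng = np.random.default_rng(1)
n = 400
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)   # true slope 0.5

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2 = resid @ resid / (n - 2)                 # estimate of error variance
var_beta = sigma2 * np.linalg.inv(X.T @ X)       # Eq. 9.10 with sigma^2 estimated
se = np.sqrt(np.diag(var_beta))                  # standard errors

t = beta_hat[1] / se[1]                          # test of H0: beta = 0
ci = (beta_hat[1] - 1.96 * se[1], beta_hat[1] + 1.96 * se[1])
print(f"beta_hat = {beta_hat[1]:.3f}, SE = {se[1]:.3f}, t = {t:.2f}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```

This is the conventional homoskedastic variance; the robust and clustered variants discussed below replace $\sigma^2(X'X)^{-1}$ with a sandwich-form estimator.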

Threats to Valid Inference

Two failures are especially common. Heteroskedasticity (failure of assumption 5) leaves $\hat{\beta}$ unbiased but makes the conventional standard errors wrong. Correlation of errors within groups (students in the same school, workers in the same state) can make conventional standard errors far too small. Heteroskedasticity-robust and cluster-robust standard errors address these problems.

A practical rule: In modern applied economics, always use robust or clustered standard errors.

9.8 Threats to Validity

Every empirical strategy has assumptions that can fail:

| Strategy | Key Assumption | Threat | Diagnostic |
|----------|----------------|--------|------------|
| OLS | No omitted variables ($E[\varepsilon \mid X]=0$) | Confounding | Theory + sensitivity analysis |
| IV | Exclusion restriction | Direct effect of $Z$ on $Y$ | Cannot test directly; argue theoretically |
| IV | Relevance | Weak instruments | First-stage F > 10 |
| DiD | Parallel trends | Differential pre-trends | Plot pre-treatment trends |
| RD | No manipulation at cutoff | Sorting around threshold | McCrary density test |
| RCT | No attrition, no spillovers | Differential dropout; contamination | Balance checks, attrition analysis |

Thread Example: The Kaelani Republic

An economist wants to estimate the effect of Kaelani's new education policy (free textbooks for grades 1–6) on test scores. The policy was implemented in the eastern provinces in 2024 but not the western provinces.

Design: Difference-in-differences.

|                     | Pre-policy (2023) | Post-policy (2025) | Change |
|---------------------|-------------------|--------------------|--------|
| Eastern (treatment) | 55                | 63                 | +8     |
| Western (control)   | 52                | 56                 | +4     |
| DiD estimate        |                   |                    | +4     |

The DiD estimate is +4 points: free textbooks raised test scores by 4 points once the common upward trend is netted out.

Threats: (1) Parallel trends: Were eastern provinces already improving faster? (2) Spillovers: Did families near the border send children to eastern schools? (3) Composition changes: Did free textbooks change enrollment?

A complementary approach: regression discontinuity at the provincial border, comparing villages just on either side.

Summary

Key Equations

| Label | Equation | Description |
|-------|----------|-------------|
| Eq. 9.1 | $Y_i = \alpha + \beta X_i + \varepsilon_i$ | Structural equation |
| Eq. 9.2 | $\hat{\beta}_{OLS} = (X'X)^{-1}X'Y$ | OLS estimator |
| Eq. 9.3 | $E[\hat{\alpha}_1] = \beta_1 + \beta_2 \cdot Cov(X,Z)/Var(X)$ | Omitted variable bias formula |
| Eq. 9.4 | $\hat{\beta}_{IV} = (Z'X)^{-1}Z'Y$ | IV estimator (matrix form) |
| Eq. 9.5 | $\hat{\beta}_{IV} = Cov(Z,Y)/Cov(Z,X)$ | IV estimator (simple) |
| Eq. 9.6 | $\hat{\tau}_{DiD} = (\text{treat change}) - (\text{control change})$ | DiD estimator |
| Eq. 9.7 | $Y_{it} = \alpha + \beta_1 Treat + \beta_2 Post + \tau(Treat \times Post) + \varepsilon$ | DiD regression |
| Eq. 9.8 | $\hat{\tau}_{RD} = \lim_{x \downarrow c} E[Y \mid X=x] - \lim_{x \uparrow c} E[Y \mid X=x]$ | RD estimator |
| Eq. 9.9 | $\hat{\tau}_{RCT} = \bar{Y}_{treat} - \bar{Y}_{control}$ | RCT estimator |
| Eq. 9.10 | $Var(\hat{\beta}) = \sigma^2(X'X)^{-1}$ | OLS variance |

Exercises

Practice

  1. Suppose you regress wages on years of education using OLS and estimate a coefficient of 0.10 (each year of education is associated with 10% higher wages). List two omitted variables that could bias this estimate and predict the direction of bias for each.
  2. An IV study uses "distance to nearest college" as an instrument for years of schooling. (a) Argue for relevance. (b) What is the exclusion restriction, and what might violate it?
  3. Two cities are compared before and after City A enacts a soda tax. Pre-tax, soda consumption in City A was 100 cans/person and in City B was 90. Post-tax, consumption is 80 in A and 85 in B. Compute the DiD estimate. What is the parallel trends assumption here?
  4. A scholarship program admits students with GPA ≥ 3.5. You have data on students with GPA from 3.0 to 4.0. (a) Describe the RD design. (b) What is the running variable? (c) What assumption must hold about student behavior near the cutoff?

Apply

  1. A government randomizes access to a job training program. 60% of those offered the program actually attend. The intent-to-treat estimate is a \$100 increase in earnings. What is the treatment-on-treated estimate? What assumption do you need, and how does this relate to IV?
  2. An economist claims that democracy causes economic growth, citing cross-country correlations. Critique this claim using the framework of this chapter. What specific identification strategy would you propose?
  3. A DiD study estimates the effect of an environmental regulation. Pre-treatment trends show the treatment group's pollution was already declining faster than the control group's. How does this violate parallel trends? In which direction is the DiD estimate biased?

Challenge

  1. Derive the OLS estimator $\hat{\beta} = (X'X)^{-1}X'Y$ by minimizing $S(\beta) = (Y - X\beta)'(Y - X\beta)$. Show that the first-order condition gives the normal equations $X'X\hat{\beta} = X'Y$.
  2. Show algebraically that when the instrument $Z$ is binary, the IV estimator reduces to the Wald estimator: $\hat{\beta}_{IV} = (\bar{Y}_1 - \bar{Y}_0)/(\bar{X}_1 - \bar{X}_0)$.
  3. Discuss the "credibility revolution" in economics (Angrist and Pischke, 2010). What changed between structural econometrics and design-based empirical work? What are the strengths and limitations of each approach?