Chapter 9: Econometrics Foundations

Introduction

Economics makes causal claims — minimum wages affect employment, education raises earnings, institutions determine growth. Testing these claims requires data and a method for distinguishing causation from correlation. Econometrics is that method.

This chapter is not a statistics course. We assume familiarity with basic probability and regression. Instead, we focus on the central problem of empirical economics: identification — finding credible sources of exogenous variation that allow us to estimate causal effects. Every tool in this chapter — OLS, instrumental variables, difference-in-differences, regression discontinuity — is a strategy for solving the identification problem.

By the end of this chapter, you will be able to:
  1. State the identification problem and explain why correlation does not imply causation
  2. Derive and interpret OLS estimates and diagnose omitted variable bias
  3. Explain the logic of instrumental variables and evaluate instrument validity
  4. Set up and interpret a difference-in-differences design
  5. Explain the logic of regression discontinuity designs
  6. Evaluate threats to validity in empirical research

Prerequisites: Chapters 2 and 5 (economic context for examples). Mathematical prerequisites: linear algebra, probability and statistics.

9.1 The Identification Problem

The identification problem. The difficulty of establishing that a relationship between two variables is causal rather than merely correlative.

Consider the question: does an additional year of education increase earnings? We observe that more-educated people earn more. But is this because:

  1. Education itself raises productivity and therefore earnings (causation)?
  2. People who would have earned more anyway, because of ability or family background, tend to get more education (selection)?

Endogeneity. A regressor $X$ is endogenous when it is correlated with the error term: $Cov(X, \varepsilon) \neq 0$. This arises from omitted variables, simultaneity, or measurement error, and causes OLS to produce biased estimates.
Counterfactual. The outcome that would have been observed for a treated unit had it not received treatment. Since only one state is ever observed for each unit, the counterfactual is always hypothetical. All causal inference methods are strategies for constructing plausible counterfactuals.

Both are consistent with the observed correlation. The identification problem is that we cannot directly compare the same person with and without education — the counterfactual is unobserved.

The fundamental equation:

$$Y_i = \alpha + \beta X_i + \varepsilon_i$$ (Eq. 9.1)

where $Y_i$ is the outcome (earnings), $X_i$ is the treatment (years of education), $\beta$ is the causal parameter of interest, and $\varepsilon_i$ captures everything else affecting $Y_i$ — ability, family background, motivation, luck, health, and thousands of other factors.

The identification problem arises when $X_i$ is correlated with $\varepsilon_i$ — when the "treatment" is not randomly assigned. In statistics, this is called endogeneity. In economics, it is the norm, not the exception: people choose their education (and the choice is correlated with ability), countries choose their policies (and the choice is correlated with their economic conditions), firms choose their prices (and the choice is correlated with demand conditions).

In a randomized experiment, the treatment $X_i$ is assigned by a coin flip — it is independent of $\varepsilon_i$ by construction. But economists rarely have the luxury of randomization for the big questions. The methods in this chapter — OLS, IV, DiD, RD — are strategies for finding "natural experiments" that approximate randomization in observational data.
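The contrast between self-selected and randomized treatment can be seen in a small simulation. This is a sketch with made-up parameters, not an example from the text: the true effect is 2, and under self-selection high-ability people (who earn more regardless) are more likely to take the treatment.

```python
# Sketch: why randomization identifies the causal effect (illustrative parameters).
# True treatment effect is 2. Under self-selection, high-ability individuals
# (who earn more anyway) take the treatment more often; under randomization they do not.
import random

random.seed(0)
true_effect = 2.0

def simulate(randomized, n=20_000):
    treated, control = [], []
    for _ in range(n):
        ability = random.gauss(0, 1)
        if randomized:
            d = random.random() < 0.5          # coin flip: independent of ability
        else:
            d = ability + random.gauss(0, 1) > 0   # self-selection on ability
        y = true_effect * d + ability + random.gauss(0, 1)
        (treated if d else control).append(y)
    return sum(treated) / len(treated) - sum(control) / len(control)

naive = simulate(randomized=False)   # confounded comparison: overstates the effect
rct = simulate(randomized=True)      # randomized comparison: close to 2
print(f"self-selected gap: {naive:.2f}, randomized gap: {rct:.2f}")
```

The naive gap mixes the treatment effect with the ability difference between groups; randomization removes the ability difference by construction.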

9.2 Ordinary Least Squares (OLS)

OLS. Minimizes the sum of squared residuals to estimate the linear relationship between $Y$ and $X$.

For the multivariate model $Y = X\beta + \varepsilon$ (matrix notation):

$$\hat{\beta}_{OLS} = (X'X)^{-1}X'Y$$ (Eq. 9.2)
Gauss-Markov assumptions. The set of conditions under which OLS is the best linear unbiased estimator: (1) linearity, (2) random sampling, (3) no perfect multicollinearity, (4) zero conditional mean ($E[\varepsilon|X] = 0$), and (5) homoskedasticity ($Var(\varepsilon|X) = \sigma^2$).

The Gauss-Markov assumptions are:

  1. Linearity: The true model is linear in parameters
  2. Random sampling: Observations are independently drawn
  3. No perfect multicollinearity: No regressor is an exact linear function of others
  4. Zero conditional mean: $E[\varepsilon|X] = 0$ — the error has no systematic relationship with the regressors
  5. Homoskedasticity: $Var(\varepsilon|X) = \sigma^2$ — the error variance is constant
Zero conditional mean. The assumption $E[\varepsilon|X] = 0$: the error term has no systematic relationship with the regressors. This is the critical assumption for OLS unbiasedness. When it fails (due to omitted variables, simultaneity, or measurement error), OLS is biased.
BLUE (Best Linear Unbiased Estimator). Under the Gauss-Markov assumptions, OLS has the lowest variance among all linear unbiased estimators. "Best" = minimum variance; "Linear" = a linear function of $Y$; "Unbiased" = $E[\hat{\beta}] = \beta$.

Under these assumptions, OLS is BLUE — the Best Linear Unbiased Estimator. "Best" means lowest variance among all linear unbiased estimators. "Unbiased" means $E[\hat{\beta}] = \beta$.

The critical assumption is #4: $E[\varepsilon|X] = 0$. When this fails — due to omitted variables, simultaneity, or measurement error in $X$ — OLS is biased. The estimate $\hat{\beta}$ no longer converges to the true $\beta$ even with infinite data. This is not a small-sample problem — it is a fundamental design flaw that more data cannot fix.
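Eq. 9.2 is easy to compute directly. A minimal sketch, using simulated data in which the zero conditional mean assumption holds by construction (the parameter values are illustrative):

```python
# Sketch: computing the OLS estimator of Eq. 9.2, beta_hat = (X'X)^{-1} X'Y,
# on simulated data where E[eps | X] = 0 holds, so OLS is unbiased.
import numpy as np

rng = np.random.default_rng(42)
n = 1000
x = rng.normal(size=n)
eps = rng.normal(size=n)                # independent of x: zero conditional mean
y = 1.0 + 0.5 * x + eps                 # true alpha = 1.0, true beta = 0.5

X = np.column_stack([np.ones(n), x])    # design matrix with intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solves the normal equations X'X b = X'Y
print("alpha_hat, beta_hat:", beta_hat.round(3))
```

Solving the normal equations with `np.linalg.solve` is numerically preferable to forming the inverse $(X'X)^{-1}$ explicitly, though the result is the same.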

Figure 9.1 — OLS Regression Explorer

A scatter plot with a fitted OLS regression line. Drag the slider to add an outlier at different vertical positions and watch the regression line tilt. Observe how a single high-leverage point can dramatically change the slope, $R^2$, and coefficients.


Figure 9.1. OLS regression with an adjustable outlier. The outlier is placed at $X=14$ (high leverage). Drag the slider above "No outlier" to introduce it and watch the line tilt. Hover for values.

Omitted Variable Bias

Omitted variable bias. Bias in the OLS estimator caused by excluding a relevant variable that is correlated with both the dependent variable and an included regressor. The direction and magnitude of the bias depend on the sign of the omitted variable's effect and its correlation with the included regressor.

Suppose the true model is $Y = \beta_0 + \beta_1 X + \beta_2 Z + u$, but we omit $Z$ and run $Y = \alpha_0 + \alpha_1 X + e$. Then:

$$E[\hat{\alpha}_1] = \beta_1 + \beta_2 \cdot \frac{Cov(X, Z)}{Var(X)}$$ (Eq. 9.3)

The bias equals the effect of the omitted variable ($\beta_2$) times the association between the omitted variable and the included regressor.

Sign of bias:

|               | $Cov(X, Z) > 0$                      | $Cov(X, Z) < 0$                         |
|---------------|--------------------------------------|-----------------------------------------|
| $\beta_2 > 0$ | Upward bias (overestimate $\beta_1$) | Downward bias (underestimate $\beta_1$) |
| $\beta_2 < 0$ | Downward bias                        | Upward bias                             |

Example 9.1 — Return to Education

Suppose ability ($Z$) is positively correlated with both education ($X$) and earnings ($Y$). Then $\beta_2 > 0$ (ability raises earnings) and $Cov(X,Z) > 0$ (more able people get more education). The OLS estimate of the return to education is biased upward — it attributes some of the ability effect to education.
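The bias formula (Eq. 9.3) can be verified by simulation. In this sketch the parameter values (effect sizes, confounding strength) are illustrative assumptions:

```python
# Sketch: checking the omitted variable bias formula (Eq. 9.3) by simulation.
# True model: Y = b1*X + b2*Z + u, with ability Z correlated with education X.
import numpy as np

rng = np.random.default_rng(7)
n = 50_000
b1, b2 = 0.5, 1.0
z = rng.normal(size=n)                      # ability (omitted from the regression)
x = 0.8 * z + rng.normal(size=n)            # education, correlated with ability
y = b1 * x + b2 * z + rng.normal(size=n)

# Naive slope from regressing Y on X alone
naive = np.cov(x, y)[0, 1] / np.var(x)
# Eq. 9.3: true effect plus b2 times Cov(X, Z) / Var(X)
predicted = b1 + b2 * np.cov(x, z)[0, 1] / np.var(x)
print(f"naive OLS: {naive:.3f}, formula prediction: {predicted:.3f}")
```

With positive confounding the naive estimate lands well above the true $\beta_1 = 0.5$, and the formula predicts the biased value almost exactly.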

Figure 9.2 — Omitted Variable Bias

Two panels show the same data. Left: the true relationship with the confounder (ability) shown as point color. Right: the naive OLS regression that omits ability. Drag the slider to change confounding strength and watch the bias grow.


Left: True model with confounder (ability) shown as color. Darker = higher ability.

Right: Naive OLS ignoring ability. The biased line (red dashed) is steeper than the true causal effect (blue).

9.3 Instrumental Variables (IV)

When OLS is biased because $X$ is endogenous ($Cov(X, \varepsilon) \neq 0$), an instrumental variable can rescue the estimation.

Instrument ($Z$). A variable that: (1) Relevance: $Z$ is correlated with $X$ ($Cov(Z, X) \neq 0$); (2) Exclusion restriction: $Z$ affects $Y$ only through $X$ ($Cov(Z, \varepsilon) = 0$).
Relevance condition. The requirement that the instrument $Z$ is sufficiently correlated with the endogenous regressor $X$. A weak instrument (low correlation) produces unreliable IV estimates with large standard errors and bias toward OLS. The first-stage F-statistic should exceed 10.
Exclusion restriction. The assumption that the instrument $Z$ affects the outcome $Y$ only through its effect on the endogenous regressor $X$, not through any other channel: $Cov(Z, \varepsilon) = 0$. This assumption is not directly testable and must be argued on theoretical grounds.
Two-stage least squares (2SLS). An IV estimation procedure: (1) regress $X$ on $Z$ to get fitted values $\hat{X}$; (2) regress $Y$ on $\hat{X}$. The first stage isolates the exogenous variation in $X$; the second stage uses only that variation to estimate the causal effect.

Two-Stage Least Squares (2SLS):

First stage: Regress $X$ on $Z$ (and any control variables):

$$X_i = \pi_0 + \pi_1 Z_i + \nu_i$$ (First stage)

This isolates the part of $X$ driven by the instrument — the exogenous part. The fitted values $\hat{X}_i$ represent the "clean" variation in $X$.

Second stage: Regress $Y$ on $\hat{X}$. In matrix form:

$$\hat{\beta}_{IV} = (Z'X)^{-1}Z'Y$$ (Eq. 9.4)

In the simple case with one instrument and one endogenous regressor:

$$\hat{\beta}_{IV} = \frac{Cov(Z, Y)}{Cov(Z, X)}$$ (Eq. 9.5)

The IV estimate is the ratio of the reduced form (effect of $Z$ on $Y$) to the first stage (effect of $Z$ on $X$). The intuition: $Z$ affects $Y$ only through $X$ (exclusion restriction), so dividing out the first stage isolates the causal effect of $X$ on $Y$.

What IV estimates. With heterogeneous treatment effects, IV identifies the Local Average Treatment Effect (LATE) — the causal effect for the subpopulation whose behavior is changed by the instrument (the "compliers").
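The simple IV estimator of Eq. 9.5 can be checked on simulated data with a known confounder. All parameter values below are illustrative assumptions:

```python
# Sketch: the simple IV estimator of Eq. 9.5 on simulated data with an
# endogenous regressor. OLS is biased; IV recovers the true effect.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
beta = 0.5                                    # true causal effect
u = rng.normal(size=n)                        # unobserved confounder
z = rng.normal(size=n)                        # instrument: independent of u
x = 1.0 * z + u + rng.normal(size=n)          # relevance: z moves x
y = beta * x + u + rng.normal(size=n)         # u also moves y, so x is endogenous

ols = np.cov(x, y)[0, 1] / np.var(x)          # biased upward by Cov(x, u)
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]  # Eq. 9.5: reduced form / first stage
print(f"OLS: {ols:.3f}, IV: {iv:.3f}, truth: {beta}")
```

The ratio form makes the intuition concrete: the instrument's effect on $Y$ runs entirely through $X$, so dividing by the first stage rescales it into the effect of $X$ itself.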

Weak Instruments

Weak instruments. Instruments with low correlation with the endogenous regressor (first-stage F-statistic below 10). Weak instruments cause the IV estimator to be biased toward OLS, have non-normal sampling distributions, and produce misleading confidence intervals.

If $Z$ is weakly correlated with $X$, the first stage is weak, and the IV estimate is unreliable (biased toward OLS, wide confidence intervals). Rule of thumb: first-stage F-statistic > 10.
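The weak-instrument problem shows up clearly in a small Monte Carlo. This is a sketch: the tiny first-stage coefficient is an assumption chosen to make the point.

```python
# Sketch: Monte Carlo showing weak-instrument bias toward OLS (illustrative setup).
# The first-stage coefficient pi is tiny, so the first-stage F-statistic is low.
import numpy as np

rng = np.random.default_rng(11)
beta, pi, n, reps = 0.5, 0.05, 200, 2000
iv_draws, f_stats = [], []
for _ in range(reps):
    u = rng.normal(size=n)
    z = rng.normal(size=n)
    x = pi * z + u + rng.normal(size=n)
    y = beta * x + u + rng.normal(size=n)
    # First-stage slope and F-statistic for H0: pi = 0
    sxx = ((z - z.mean()) ** 2).sum()
    pi_hat = ((z - z.mean()) * (x - x.mean())).sum() / sxx
    resid = (x - x.mean()) - pi_hat * (z - z.mean())
    se = np.sqrt(resid @ resid / (n - 2) / sxx)
    f_stats.append((pi_hat / se) ** 2)
    iv_draws.append(np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1])

print(f"median first-stage F: {np.median(f_stats):.1f}")  # far below 10
print(f"median IV estimate:   {np.median(iv_draws):.2f}") # pulled toward OLS, away from 0.5
```

With an F-statistic this low, the IV estimates cluster near the biased OLS value rather than the true $\beta = 0.5$, which is exactly the failure mode the F > 10 rule of thumb guards against.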

Example 9.2 — Quarter of Birth (Angrist & Krueger 1991)

Quarter of birth was used as an instrument for years of schooling. Compulsory schooling laws mean students born earlier in the year can drop out with slightly less education. Quarter of birth is plausibly: (a) correlated with schooling (relevance), and (b) not directly related to earnings (exclusion). The IV estimate of the return to schooling was approximately 7–8% per year.

Interactive: Instrumental Variables DAG

This directed acyclic graph shows the causal structure of an IV design. Toggle between views to see how an instrument Z breaks the confounding path.

DAG for the instrumental variables design. Z is the instrument, X is the endogenous regressor, Y is the outcome, and U is the unobserved confounder. The IV strategy uses only the variation in X that is driven by Z, bypassing the confounding path through U.

9.4 Difference-in-Differences (DiD)

Difference-in-differences. A method that compares changes over time between a treatment group and a control group to estimate the causal effect of a treatment.
$$\hat{\tau}_{DiD} = (\bar{Y}_{T,post} - \bar{Y}_{T,pre}) - (\bar{Y}_{C,post} - \bar{Y}_{C,pre})$$ (Eq. 9.6)

The first difference removes time-invariant group characteristics. The second difference removes common time trends.

Parallel trends assumption. The assumption that, in the absence of treatment, the treatment and control groups would have experienced the same change in outcomes over time. Parallel trends cannot be directly tested for the post-treatment period but can be assessed by checking whether pre-treatment trends are similar.

Key assumption: Parallel trends. In the absence of treatment, the treatment and control groups would have followed the same trend. This is untestable for the post-treatment period but assessable for the pre-treatment period.

Example 9.3 — Card & Krueger (1994)

New Jersey raised its minimum wage from \$4.25 to \$5.05 in April 1992; Pennsylvania did not. The DiD estimate of the employment effect was positive (+2.7 FTE workers), contradicting the simple competitive model prediction. This study spurred a revolution in empirical labor economics.

Regression formulation:

$$Y_{it} = \alpha + \beta_1 \cdot Treat_i + \beta_2 \cdot Post_t + \tau \cdot (Treat_i \times Post_t) + \varepsilon_{it}$$ (Eq. 9.7)
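The four-means formula (Eq. 9.6) and the interaction coefficient $\tau$ in Eq. 9.7 give the same number. This can be verified with a few lines of arithmetic; the group means below are illustrative:

```python
# Sketch: the DiD estimate of Eq. 9.6, and its equivalence to the interaction
# coefficient of Eq. 9.7, using four illustrative group means.
means = {
    ("treat", "pre"): 55.0, ("treat", "post"): 63.0,
    ("control", "pre"): 52.0, ("control", "post"): 56.0,
}

did = (means[("treat", "post")] - means[("treat", "pre")]) - (
    means[("control", "post")] - means[("control", "pre")]
)
print("DiD estimate:", did)   # (63 - 55) - (56 - 52) = 4.0

# With fully saturated group/period dummies, the Eq. 9.7 coefficients are:
alpha = means[("control", "pre")]            # control, pre-period level
beta1 = means[("treat", "pre")] - alpha      # time-invariant group gap
beta2 = means[("control", "post")] - alpha   # common time trend
tau = means[("treat", "post")] - (alpha + beta1 + beta2)
assert tau == did
```

The first difference removes the group gap ($\beta_1$), the second removes the common trend ($\beta_2$), leaving only $\tau$.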

Figure 9.3 — Difference-in-Differences

Two time series show a treatment group and a control group. The treatment occurs at $t = 5$. Drag the slider to change the treatment effect size and see how the DiD estimate updates. Pre-treatment parallel trends are visible.


Figure 9.3. Difference-in-differences design. The dashed line shows the counterfactual — what would have happened to the treatment group without treatment (parallel to control). The gap between the actual and counterfactual outcomes at the end is the treatment effect.

9.5 Regression Discontinuity (RD)

Regression discontinuity. A method that exploits a sharp cutoff in a "running variable" that determines treatment assignment. Observations just above and just below the cutoff are similar in all respects except treatment — creating a local quasi-experiment.
Running variable. The continuous variable that determines treatment assignment in an RD design. Treatment is assigned when the running variable crosses a cutoff (e.g., a test score threshold, an age cutoff, an election margin). The running variable must not be precisely manipulable by agents.
Continuity assumption. The assumption that all factors affecting the outcome (other than treatment) vary continuously at the cutoff. If this holds, the discontinuity in the outcome at the cutoff is attributable solely to the treatment. Violated when agents can precisely sort around the threshold.
$$\hat{\tau}_{RD} = \lim_{x \downarrow c} E[Y|X = x] - \lim_{x \uparrow c} E[Y|X = x]$$ (Eq. 9.8)

Key assumption: Continuity. All factors affecting $Y$ (other than treatment) vary continuously at the cutoff — no sorting or manipulation around the threshold.

Example 9.4 — Scholarship at Score = 80

A scholarship is awarded to students scoring above 80 on an exam. Students scoring 79 and 81 are similar in ability but one gets the scholarship and the other does not. The discontinuity in outcomes (e.g., college completion rates) at the 80-point threshold estimates the causal effect of the scholarship.
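A sharp RD estimate can be sketched as two local linear fits, one on each side of the cutoff, compared at the threshold. The data-generating process below is an illustrative assumption, loosely following the scholarship example:

```python
# Sketch: a sharp RD estimate via separate linear fits on each side of the cutoff,
# restricted to a bandwidth around it (all data-generating values are illustrative).
import numpy as np

rng = np.random.default_rng(5)
n, cutoff, bandwidth, tau = 5000, 80.0, 10.0, 0.15
score = rng.uniform(50, 100, size=n)          # running variable (test score)
treated = score >= cutoff                     # scholarship above the cutoff
# Outcome (e.g., completion rate): smooth in score, plus a jump of tau at the cutoff
complete = 0.2 + 0.005 * score + tau * treated + rng.normal(0, 0.05, size=n)

def fit_at_cutoff(side):
    m = side & (np.abs(score - cutoff) <= bandwidth)      # keep points near the cutoff
    slope, intercept = np.polyfit(score[m] - cutoff, complete[m], 1)
    return intercept                          # fitted value exactly at the cutoff

tau_hat = fit_at_cutoff(treated) - fit_at_cutoff(~treated)
print(f"RD estimate: {tau_hat:.3f} (true jump {tau})")
```

Centering the running variable at the cutoff makes each fit's intercept the limit in Eq. 9.8, so their difference is the estimated jump.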

Figure 9.4 — Regression Discontinuity

A scatter plot with a running variable (test score). Students above the cutoff receive treatment (scholarship). Polynomial fits on each side reveal the jump at the cutoff. Adjust the cutoff position and the bandwidth to see how the estimated treatment effect changes.


Figure 9.4. Regression discontinuity. The vertical dashed line marks the cutoff. Points left of the cutoff are untreated (gray); right are treated (green). The jump at the cutoff is the treatment effect estimate. Adjust the bandwidth to focus on observations near the cutoff.

9.6 Randomized Controlled Trials (RCTs)

Randomized controlled trial. Random assignment of treatment ensures that treatment and control groups are identical in expectation — eliminating confounding by construction.
$$\hat{\tau}_{RCT} = \bar{Y}_{treatment} - \bar{Y}_{control}$$ (Eq. 9.9)
Internal validity. The degree to which a study accurately estimates the causal effect within its specific context and sample. An internally valid study correctly identifies causation for the population studied. Threats include confounding, selection bias, attrition, and measurement error.
External validity. The degree to which a study's findings generalize to other populations, settings, or time periods. An RCT conducted in rural Kenya may not apply to urban India. Scaling up a program often changes the context (general equilibrium effects, different populations of compliers).

RCTs are the "gold standard" for internal validity because randomization guarantees $E[\varepsilon|X] = 0$ by construction. Banerjee, Duflo, and Kremer received the 2019 Nobel Prize for their experimental approach to alleviating global poverty.

Limitations of RCTs

Intent-to-treat (ITT). The average treatment effect of being assigned to treatment, regardless of whether the subject actually complied. ITT is always well-identified in an RCT because it compares groups as randomized. With partial compliance, ITT underestimates the effect of actually receiving treatment.
Treatment-on-treated (TOT). The average causal effect of actually receiving treatment (among compliers). Estimated as $TOT = ITT / \text{compliance rate}$. TOT answers: "What is the effect for people who actually took the treatment?" but requires stronger assumptions than ITT.
Statistical power. The probability that a study correctly rejects a false null hypothesis (i.e., detects a true treatment effect). Power depends on effect size, sample size, and variance. Underpowered studies risk failing to detect real effects (Type II error). Standard target: 80% power.
Example 9.5 — RCT with Partial Compliance

A job training program randomly assigns 500 individuals to treatment and 500 to control. Only 60% of those assigned to treatment actually attend the program (compliance rate = 0.6).

Results: Average earnings: treatment group = \$15,000, control group = \$13,000.

ITT: $\hat{\tau}_{ITT} = 15{,}000 - 13{,}000 = \$2{,}000$. This is the effect of being offered the program.

TOT: $\hat{\tau}_{TOT} = 2{,}000 / 0.6 \approx \$3{,}333$. This estimates the effect of actually attending the program (for compliers). The TOT is larger because the ITT is diluted by non-compliers.

Power check: With $n = 500$ per group, an earnings standard deviation of $\sigma = \$11{,}000$, and a true effect of $\$2{,}000$ (the ITT), power $\approx 0.8$. The study is adequately powered to detect the ITT.
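The ITT, TOT, and power arithmetic can be reproduced in a few lines. The power formula assumes a two-sided 5% test comparing two independent means, and the noise standard deviation is an assumption chosen so that power lands near 0.8:

```python
# Sketch: ITT, TOT, and power calculations in the style of Example 9.5.
# Power formula: two-sided 5% test for a difference in two independent means.
from math import sqrt, erf

def phi(z):                              # standard normal CDF
    return 0.5 * (1 + erf(z / sqrt(2)))

# ITT and TOT
y_treat, y_control, compliance = 15_000.0, 13_000.0, 0.6
itt = y_treat - y_control                # effect of being offered the program
tot = itt / compliance                   # effect of attending, for compliers
print(f"ITT = {itt:.0f}, TOT = {tot:.0f}")

# Power to detect the ITT: compare effect size to the SE of the difference in means
n = 500                                  # per group
sigma = 11_000.0                         # assumed earnings standard deviation
se = sigma * sqrt(2 / n)
power = phi(itt / se - 1.96)
print(f"power = {power:.2f}")
```

The same formula, solved the other way, gives the minimum detectable effect: $MDE \approx (1.96 + 0.84) \cdot SE$ for 80% power.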

Figure 9.5 — RCT Power Calculator

Statistical power is the probability of detecting a true treatment effect. Use the sliders to explore how effect size, sample size, and variance affect power. The power curve updates in real time, and the minimum detectable effect (MDE) at 80% power is highlighted.


Figure 9.5. Power curve: probability of detecting the effect as a function of effect size. The red dashed line marks 80% power. The green diamond marks the current parameter combination. The MDE is the smallest effect detectable at 80% power given sample size and variance.

9.7 Standard Errors and Inference

A point estimate without a measure of uncertainty is nearly useless.

$$Var(\hat{\beta}) = \sigma^2(X'X)^{-1}$$ (Eq. 9.10)

Standard errors (SE) are the square roots of the diagonal elements. A 95% confidence interval is approximately $\hat{\beta} \pm 1.96 \cdot SE(\hat{\beta})$.

Statistical significance: We reject $H_0: \beta = 0$ at the 5% level if $|t| = |\hat{\beta}/SE(\hat{\beta})| > 1.96$.

Economic significance vs statistical significance: A coefficient can be statistically significant but economically trivial. Conversely, an imprecise estimate can be economically large but statistically insignificant. Good empirical work discusses both.
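The variance formula (Eq. 9.10), the standard error, the t-statistic, and a 95% confidence interval can all be computed directly. A sketch on simulated data, assuming homoskedastic errors:

```python
# Sketch: standard errors, t-statistic, and 95% CI from Eq. 9.10,
# Var(beta_hat) = sigma^2 (X'X)^{-1}, on simulated homoskedastic data.
import numpy as np

rng = np.random.default_rng(1)
n = 400
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)   # true slope 0.5

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2 = resid @ resid / (n - 2)                 # estimate of error variance
var_beta = sigma2 * np.linalg.inv(X.T @ X)       # Eq. 9.10 with sigma^2 estimated
se = np.sqrt(np.diag(var_beta))                  # standard errors

t = beta_hat[1] / se[1]                          # test of H0: beta = 0
ci = (beta_hat[1] - 1.96 * se[1], beta_hat[1] + 1.96 * se[1])
print(f"beta_hat = {beta_hat[1]:.3f}, SE = {se[1]:.3f}, t = {t:.2f}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```

This is the conventional homoskedastic variance; the robust and clustered variants discussed below replace $\sigma^2(X'X)^{-1}$ with a sandwich-form estimator.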

Threats to Valid Inference

Two failures are especially common. Heteroskedasticity (failure of assumption 5) leaves $\hat{\beta}$ unbiased but makes the conventional standard errors wrong. Correlation of errors within groups (students in the same school, workers in the same state) can make conventional standard errors far too small. Heteroskedasticity-robust and cluster-robust standard errors address these problems.

A practical rule: In modern applied economics, always use robust or clustered standard errors.

9.8 Threats to Validity

Every empirical strategy has assumptions that can fail:

| Strategy | Key Assumption | Threat | Diagnostic |
|----------|----------------|--------|------------|
| OLS | No omitted variables ($E[\varepsilon \mid X]=0$) | Confounding | Theory + sensitivity analysis |
| IV | Exclusion restriction | Direct effect of $Z$ on $Y$ | Cannot test directly; argue theoretically |
| IV | Relevance | Weak instruments | First-stage F > 10 |
| DiD | Parallel trends | Differential pre-trends | Plot pre-treatment trends |
| RD | No manipulation at cutoff | Sorting around threshold | McCrary density test |
| RCT | No attrition, no spillovers | Differential dropout; contamination | Balance checks, attrition analysis |

Thread Example: The Kaelani Republic

An economist wants to estimate the effect of Kaelani's new education policy (free textbooks for grades 1–6) on test scores. The policy was implemented in the eastern provinces in 2024 but not the western provinces.

Design: Difference-in-differences.

|                     | Pre-policy (2023) | Post-policy (2025) | Change |
|---------------------|-------------------|--------------------|--------|
| Eastern (treatment) | 55                | 63                 | +8     |
| Western (control)   | 52                | 56                 | +4     |
| DiD estimate        |                   |                    | +4     |

The DiD estimate is +4 points: free textbooks raised test scores by 4 points once the common upward trend is netted out.

Threats: (1) Parallel trends: Were eastern provinces already improving faster? (2) Spillovers: Did families near the border send children to eastern schools? (3) Composition changes: Did free textbooks change enrollment?

A complementary approach: regression discontinuity at the provincial border, comparing villages just on either side.

Summary

Key Equations

| Label | Equation | Description |
|-------|----------|-------------|
| Eq. 9.1 | $Y_i = \alpha + \beta X_i + \varepsilon_i$ | Structural equation |
| Eq. 9.2 | $\hat{\beta}_{OLS} = (X'X)^{-1}X'Y$ | OLS estimator |
| Eq. 9.3 | $E[\hat{\alpha}_1] = \beta_1 + \beta_2 \cdot Cov(X,Z)/Var(X)$ | Omitted variable bias formula |
| Eq. 9.4 | $\hat{\beta}_{IV} = (Z'X)^{-1}Z'Y$ | IV estimator (matrix form) |
| Eq. 9.5 | $\hat{\beta}_{IV} = Cov(Z,Y)/Cov(Z,X)$ | IV estimator (simple) |
| Eq. 9.6 | $\hat{\tau}_{DiD} = (\text{treat change}) - (\text{control change})$ | DiD estimator |
| Eq. 9.7 | $Y_{it} = \alpha + \beta_1 Treat + \beta_2 Post + \tau(Treat \times Post) + \varepsilon$ | DiD regression |
| Eq. 9.8 | $\hat{\tau}_{RD} = \lim_{x \downarrow c} E[Y \mid X=x] - \lim_{x \uparrow c} E[Y \mid X=x]$ | RD estimator |
| Eq. 9.9 | $\hat{\tau}_{RCT} = \bar{Y}_{treat} - \bar{Y}_{control}$ | RCT estimator |
| Eq. 9.10 | $Var(\hat{\beta}) = \sigma^2(X'X)^{-1}$ | OLS variance |

Exercises

Practice

  1. Suppose you regress wages on years of education using OLS and estimate a coefficient of 0.10 (each year of education is associated with 10% higher wages). List two omitted variables that could bias this estimate and predict the direction of bias for each.
  2. An IV study uses "distance to nearest college" as an instrument for years of schooling. (a) Argue for relevance. (b) What is the exclusion restriction, and what might violate it?
  3. Two cities are compared before and after City A enacts a soda tax. Pre-tax, soda consumption in City A was 100 cans/person and in City B was 90. Post-tax, consumption is 80 in A and 85 in B. Compute the DiD estimate. What is the parallel trends assumption here?
  4. A scholarship program admits students with GPA ≥ 3.5. You have data on students with GPA from 3.0 to 4.0. (a) Describe the RD design. (b) What is the running variable? (c) What assumption must hold about student behavior near the cutoff?

Apply

  1. A government randomizes access to a job training program. 60% of those offered the program actually attend. The intent-to-treat estimate is a \$100 increase in earnings. What is the treatment-on-treated estimate? What assumption do you need, and how does this relate to IV?
  2. An economist claims that democracy causes economic growth, citing cross-country correlations. Critique this claim using the framework of this chapter. What specific identification strategy would you propose?
  3. A DiD study estimates the effect of an environmental regulation. Pre-treatment trends show the treatment group's pollution was already declining faster than the control group's. How does this violate parallel trends? In which direction is the DiD estimate biased?

Challenge

  1. Derive the OLS estimator $\hat{\beta} = (X'X)^{-1}X'Y$ by minimizing $S(\beta) = (Y - X\beta)'(Y - X\beta)$. Show that the first-order condition gives the normal equations $X'X\hat{\beta} = X'Y$.
  2. Show algebraically that when the instrument $Z$ is binary, the IV estimator reduces to the Wald estimator: $\hat{\beta}_{IV} = (\bar{Y}_1 - \bar{Y}_0)/(\bar{X}_1 - \bar{X}_0)$.
  3. Discuss the "credibility revolution" in economics (Angrist and Pischke, 2010). What changed between structural econometrics and design-based empirical work? What are the strengths and limitations of each approach?