The econometrics of Chapter 10 was built on a single quiet assumption: the observations are independent draws from a fixed population. Randomize a treatment, find an instrument, exploit a discontinuity, and the i.i.d. sampling that justifies the asymptotics takes care of the rest. Most of the data a macroeconomist or a finance researcher actually works with does not arrive that way. It arrives in order. Quarterly GDP, monthly inflation, the overnight policy rate, daily equity returns: each observation sits next to its neighbors in time, and what happened last quarter is part of what determines this one.
That ordering changes everything downstream. With the positive serial dependence typical of economic data, consecutive observations carry overlapping information, so the effective sample is smaller than the count of data points suggests and the standard error formulas of cross-sectional OLS understate uncertainty. Worse, many economic series trend: they wander upward over decades with no fixed mean to return to. Run an ordinary regression on two such series and you can get a high $R^2$ and a decisive $t$-statistic linking quantities that have nothing to do with each other. The cross-sectional toolkit does not merely lose efficiency on ordered data; on trending data it manufactures relationships that are not there.
A different apparatus is needed, organized around one question (what kind of process generated this series?) and a second, multivariate question once several series are in play: which shocks moved the system, and can we even say? The answer to the first builds from stationarity through ARMA models to unit-root tests. The answer to the second builds from the vector autoregression to its structural interpretation, where the chapter meets its sharpest problem: the data describe how variables move together but never say which one moved first.
Why this matters: Time-ordered data remember their past, and the methods that ignored the ordering will mislead you. A regression that treats this year's GDP as an independent draw is reading the same slow drift over and over and mistaking it for evidence. The whole chapter is about taking the order seriously: figuring out whether a series settles back toward a center or wanders off forever, and figuring out, when several series move together, which one is doing the pushing.
Prerequisites: the identification frame, OLS, and the serial-correlation note of Chapter 10 (Econometrics Foundations); linear algebra (the VAR is matrix-valued); basic probability. The stochastic-process intuition is built here, not assumed.
The methods below were not always part of economics. Through the 1970s and 1980s the discipline learned to take the time dimension seriously, largely through the work of Clive Granger and Robert Engle on long-run relationships and volatility and Christopher Sims on multivariate dynamics; their program reshaped how macroeconomists handle data. That intellectual lineage, who arrived at these ideas and against what, belongs to the history-of-economic-thought volume rather than here, and is traced in its chapter on the information-economics and game-theory era.
Before any model can be fit, the series has to be the kind of object a model can describe. The organizing idea is stationarity. A stationary process is one whose probabilistic character is stable over time: the same mechanism is generating the data at the start of the sample as at the end. If that is true, the past is informative about the future in a way that does not itself shift; if it is not, the parameters being estimated are aimed at a moving target.
The lag operator turns dynamics into algebra. A polynomial $\phi(L)=1-\phi_1 L-\cdots-\phi_p L^p$ acting on $y_t$ encodes a whole autoregression in one symbol, and the process is stationary exactly when the roots of $\phi(z)=0$ lie outside the unit circle. When a root sits on the unit circle, stationarity fails; that is the unit root of §23.3.
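The root condition can be checked numerically. The sketch below does so for an AR(2) only, where the quadratic formula suffices; the function names `ar2_roots` and `is_stationary` and the three example coefficient pairs are illustrative assumptions, not part of the chapter's apparatus.

```python
import cmath

def ar2_roots(phi1, phi2):
    """Roots of phi(z) = 1 - phi1*z - phi2*z**2, assuming phi2 != 0."""
    # Rearranged as (-phi2) z^2 + (-phi1) z + 1 = 0, then the quadratic formula.
    a, b, c = -phi2, -phi1, 1.0
    disc = cmath.sqrt(b * b - 4 * a * c)
    return (-b + disc) / (2 * a), (-b - disc) / (2 * a)

def is_stationary(phi1, phi2):
    """Stationary exactly when every root lies strictly outside the unit circle."""
    return all(abs(z) > 1.0 for z in ar2_roots(phi1, phi2))

# Hypothetical coefficient pairs: one stationary, one with a unit root,
# one with a root inside the unit circle.
for phi1, phi2 in [(0.5, 0.3), (0.5, 0.5), (1.5, -0.4)]:
    moduli = [round(abs(z), 3) for z in ar2_roots(phi1, phi2)]
    print(phi1, phi2, moduli, is_stationary(phi1, phi2))
```

The middle case has $\phi_1+\phi_2=1$, which places a root exactly at $z=1$: the unit-root boundary.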
Here $\{\varepsilon_t\}$ is white noise, the $\psi_j$ are square-summable weights with $\psi_0=1$, and $\kappa_t$ is the part of $y_t$ that is perfectly predictable from its own past (a deterministic trend or seasonal). The square-summability of the weights is what keeps the variance finite, and it is the formal content of "the influence of an old shock fades."
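As a worked instance of the decomposition, consider a zero-mean stationary AR(1) with no deterministic part ($\kappa_t = 0$). The innovation variance is written $\sigma_\varepsilon^2$ here, a symbol introduced only for this illustration to keep it distinct from $\sigma^2=\mathrm{Var}(y_t)$ in Eq. 23.1. Substituting backward gives the Wold weights directly:

$$
y_t = \phi y_{t-1} + \varepsilon_t = \sum_{j=0}^{\infty}\phi^{j}\varepsilon_{t-j},
\qquad \psi_j = \phi^{j},
\qquad \sum_{j=0}^{\infty}\psi_j^{2} = \frac{1}{1-\phi^{2}},
\qquad \mathrm{Var}(y_t) = \frac{\sigma_\varepsilon^{2}}{1-\phi^{2}}
\quad (|\phi|<1).
$$

The weights are square-summable only when $|\phi|<1$. As $|\phi|\to 1$ the sum diverges and the variance is no longer finite, which is the algebraic face of a shock that never fades.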
Why this matters: A stationary process is one whose statistical character doesn't drift. Slide a window along the series and the picture inside it looks the same. The mean you would estimate from the first decade matches the mean from the last; the size of the typical wiggle is the same throughout; how strongly today relates to last month doesn't depend on which month. White noise is the purest case: pure surprise, no memory. And Wold's result says something reassuring about everything else, because every well-behaved series is just a stream of those surprises, with old ones fading as new ones arrive. The figure below lets you watch a series stay tethered to its mean, and lets you push it toward the edge where the tether snaps.
Figure 23.1. Stationary-vs-non-stationary process explorer. A simulated AR(1) path $y_t = \delta + \phi y_{t-1} + \varepsilon_t$ with its sample mean. At low $\phi$ the series hugs its mean (stationary). As $\phi \to 1$ it wanders farther and returns more slowly. At $\phi = 1$ it becomes a random walk, with no mean reversion and permanent shocks (the §23.3 boundary case). Drag the slider across $\phi = 1$; toggle drift; reseed to confirm the behavior is generic.
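For readers without the interactive figure, the same experiment can be sketched in a few lines. This is a minimal simulation under assumptions of my own choosing (standard-normal shocks, a path started at the unconditional mean when one exists, and the hypothetical helper names `simulate_ar1` and `spread`), not the code behind the figure.

```python
import random

def simulate_ar1(phi, delta=0.0, n=500, seed=0):
    """Simulate y_t = delta + phi*y_{t-1} + eps_t with standard-normal shocks."""
    rng = random.Random(seed)
    # Start at the unconditional mean when it exists, else at zero.
    y = delta / (1 - phi) if abs(phi) < 1 else 0.0
    path = []
    for _ in range(n):
        y = delta + phi * y + rng.gauss(0.0, 1.0)
        path.append(y)
    return path

def spread(path):
    """Range of the path: a crude measure of how far the series wanders."""
    return max(path) - min(path)

# Same shocks (same seed), three values of phi.
for phi in (0.2, 0.9, 1.0):
    print(phi, round(spread(simulate_ar1(phi)), 2))
```

Because the seed is held fixed, the three paths are driven by identical shocks; only the persistence $\phi$ differs, so any difference in how far they roam is due to $\phi$ alone.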
Once a series is recognized as stationary, the next question is how to model it. Two primitive ways a series can remember its past combine into a single family.
A stationary series remembers, and there are two clean ways to write down what it remembers. It can carry forward its own past values, so that high output last quarter raises expected output this quarter. Or it can carry forward the echoes of past surprises, a shock to the system that takes a few periods to work through. The first is autoregression; the second is moving average. Combine them and you have the workhorse model of univariate time series.
The ACF and PACF are the diagnostic instruments of the Box-Jenkins model-building approach, and the rule that makes them useful is a contrast in how they fall off. A pure AR process has a PACF that cuts off sharply after lag $p$ (once you control for the first $p$ lags, nothing further adds predictive content) while its ACF decays gradually. A pure MA process is the mirror image: its ACF cuts off after lag $q$, while its PACF decays. The cut-off-versus-decay pattern is how a practitioner reads the order of a model off the data.
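The cut-off-versus-decay rule can be seen numerically. The sketch below computes a sample ACF and obtains the PACF from it by the Durbin-Levinson recursion, one standard way to do so; the function names and the simulated AR(1) with coefficient 0.6 are assumptions made for illustration.

```python
import random

def sample_acf(y, max_lag):
    """Sample autocorrelations at lags 0..max_lag."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    c0 = sum(d * d for d in dev) / n
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / n / c0
            for k in range(max_lag + 1)]

def pacf_from_acf(rho):
    """Durbin-Levinson: autocorrelations rho[0..K] -> partial autocorrelations at lags 1..K."""
    pacf, prev = [], []            # prev[j] holds phi_{k-1, j+1}
    for k in range(1, len(rho)):
        num = rho[k] - sum(prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf

# Theoretical AR(1) autocorrelations 0.6**k: the ACF decays, the PACF cuts off after lag 1.
rho = [0.6 ** k for k in range(5)]
print([round(v, 3) for v in pacf_from_acf(rho)])

# The same pattern, approximately, in a simulated AR(1) (illustrative seed and length).
rng = random.Random(0)
y, level = [], 0.0
for _ in range(2000):
    level = 0.6 * level + rng.gauss(0.0, 1.0)
    y.append(level)
acf = sample_acf(y, 4)
print([round(v, 2) for v in acf], [round(v, 2) for v in pacf_from_acf(acf)])
```

In the sample version the higher-lag partial autocorrelations are small rather than exactly zero, which is why practitioners read them against confidence bands instead of expecting a literal cut-off.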
Forecasting from an ARMA model follows from the lag-operator algebra. The one-step-ahead forecast sets future innovations to their expected value of zero and rolls the estimated coefficients forward; the multi-step forecast iterates that recursion, and because the process is stationary the forecast converges to the unconditional mean $\mu$ as the horizon grows, with the forecast-error variance rising to the unconditional variance.
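The recursion is easiest to see for an AR(1). The following sketch iterates the point forecast and the forecast-error variance forward; the function name `ar1_forecast`, the argument `sigma2_eps` (the innovation variance, not the variance of $y_t$), and the numerical inputs are all illustrative assumptions.

```python
def ar1_forecast(y_last, c, phi, sigma2_eps, horizon):
    """Iterate the AR(1) forecast y_t = c + phi*y_{t-1} + eps_t forward from y_last."""
    mu = c / (1.0 - phi)                  # unconditional mean (needs |phi| < 1)
    points, variances = [], []
    f, v = y_last, 0.0
    for _ in range(horizon):
        f = c + phi * f                   # future innovations set to their mean of zero
        v = sigma2_eps + phi * phi * v    # error variance accumulates, discounted by phi^2
        points.append(f)
        variances.append(v)
    return mu, points, variances

# Hypothetical inputs: last observation 4.0, c = 1, phi = 0.5, innovation variance 1.
mu, points, variances = ar1_forecast(4.0, c=1.0, phi=0.5, sigma2_eps=1.0, horizon=50)
print(mu, points[:3], variances[:3])
print(points[-1], variances[-1], 1.0 / (1.0 - 0.5 ** 2))
```

With these inputs the point forecast decays from 3.0 toward the mean of 2.0, and the error variance climbs from 1.0 toward $1/(1-\phi^2)\approx 1.333$, the unconditional variance.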
Why this matters: An AR series carries its own past forward: today's level is a faded copy of yesterday's. An MA series carries the echoes of past surprises, a shock that rings for a few periods and then is gone. They leave different fingerprints, and the two diagnostic plots read those fingerprints: for an AR process the partial-autocorrelation plot drops to zero abruptly while the plain autocorrelation tails off; for an MA process it is the other way around. You are not learning a formula so much as learning to recognize two shapes. Forecasting then needs no new idea. With no fresh surprise to expect, the best guess for the far future is just the long-run average, and the model tells you how fast you get there. Set the sliders below to a pure AR and watch one plot snap to zero; switch to a pure MA and watch the other one do it.
Figure 23.2. ARMA(1,1) simulator with sample ACF and PACF. Set a pure AR(1) ($\theta_1=0$): the PACF spikes at lag 1 then cuts off, while the ACF decays geometrically. Set a pure MA(1) ($\phi_1=0$): the ACF spikes at lag 1 then cuts off, while the PACF decays. The cut-off-versus-decay contrast is the Box-Jenkins identification rule. Drag the sliders.
All of this assumed the series was stationary. The diagnostic plots, the forecasts, the very meaning of the coefficients: each rests on a mean to return to and a finite variance. What happens when the series has neither?
Push the AR(1) coefficient all the way to one and the machinery of §23.2 breaks. The series no longer has a mean to return to. Figure 23.1, dragged across $\phi = 1$, showed it directly: below one the series is tethered and a shock decays; at exactly one the tether snaps and a shock becomes permanent. That boundary case is the random walk, the most important non-stationary process in economics because so many macro series behave like it.
The reason a unit root cannot be ignored is the trap it sets for ordinary regression. Take two random walks generated completely independently of each other, with no causal link, no common driver, nothing. Regress one on the other and, far more often than chance should allow, you will find a large $R^2$ and a $t$-statistic that comfortably clears any conventional threshold. The regression announces a strong relationship between series that share nothing. This is the spurious-regression result that Granger and Newbold warned of in 1974, and it overturns the instinct that high $R^2$ plus a significant $t$ means something real.
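A small Monte Carlo makes the claim concrete. The sketch below is an illustration under stated assumptions (sample length 100, 200 replications, Gaussian shocks, a nominal 5% two-sided test with cutoff 1.96), not a replication of Granger and Newbold's own design; all function names are hypothetical.

```python
import random

def random_walk(n, rng):
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path

def slope_tstat(x, y):
    """t-statistic on the slope from OLS of y on a constant and x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    beta = sxy / sxx
    alpha = my - beta * mx
    ssr = sum((b - alpha - beta * a) ** 2 for a, b in zip(x, y))
    return beta / (ssr / (n - 2) / sxx) ** 0.5

def rejection_rate(n=100, reps=200, differenced=False, seed=0):
    """Share of replications in which two independent walks look 'significantly' related."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x, y = random_walk(n, rng), random_walk(n, rng)
        if differenced:
            x = [b - a for a, b in zip(x, x[1:])]
            y = [b - a for a, b in zip(y, y[1:])]
        hits += abs(slope_tstat(x, y)) > 1.96
    return hits / reps

print("levels:", rejection_rate(differenced=False))
print("first differences:", rejection_rate(differenced=True))
```

If the test behaved as advertised, both rates would sit near 0.05. In levels the rate comes out far higher; in first differences it returns to roughly the nominal size.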
Under $H_0:\gamma=0$ the level $y_{t-1}$ drops out and $\Delta y_t$ is driven only by its own lagged differences and noise: a unit root. Rejecting $H_0$ in favor of $\gamma<0$ means a deviation from the level is partly pulled back, i.e. the series is stationary. The lagged differences are the "augmentation" that absorbs serial correlation in $\varepsilon_t$ so the test is valid; the number of lags $k$ is chosen by an information criterion.
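The test regression can be written out by hand to show what the statistic is. This is a minimal sketch assuming a constant-only specification and a fixed lag count (no information-criterion search); the helpers `ols`, `adf_tstat`, and `simulate` are hypothetical names. The statistic must be compared with Dickey-Fuller critical values (approximately $-2.86$ at the 5% level for the constant-only case in large samples), not with the usual normal ones.

```python
import random

def ols(X, y):
    """OLS via the normal equations; returns (coefficients, standard errors)."""
    n, k = len(X), len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    # Gauss-Jordan on [X'X | I] to obtain (X'X)^-1.
    aug = [xtx[i] + [1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    inv = [row[k:] for row in aug]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(b * x for b, x in zip(beta, r)) for r, yi in zip(X, y)]
    s2 = sum(e * e for e in resid) / (n - k)
    return beta, [(s2 * inv[i][i]) ** 0.5 for i in range(k)]

def adf_tstat(y, lags=1):
    """t-statistic on gamma in: dy_t = alpha + gamma*y_{t-1} + sum_i delta_i*dy_{t-i} + e_t."""
    dy = [b - a for a, b in zip(y, y[1:])]          # dy[t] = y[t+1] - y[t]
    X, target = [], []
    for t in range(lags, len(dy)):
        # constant, lagged level y[t], then the lagged differences
        X.append([1.0, y[t]] + [dy[t - i] for i in range(1, lags + 1)])
        target.append(dy[t])
    beta, se = ols(X, target)
    return beta[1] / se[1]

def simulate(phi, n, seed):
    rng = random.Random(seed)
    level, path = 0.0, []
    for _ in range(n):
        level = phi * level + rng.gauss(0.0, 1.0)
        path.append(level)
    return path

print("stationary AR(1):", round(adf_tstat(simulate(0.5, 500, seed=1)), 2))
print("random walk:     ", round(adf_tstat(simulate(1.0, 500, seed=1)), 2))
```

In applied work a tested library routine would replace this hand-rolled version; the point here is only that the ADF statistic is an ordinary $t$-ratio read against a non-standard distribution.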
Why this matters: A random walk has no gravity. A stationary series is held on a tether, so pull it away from its center and it springs back; a random walk has nothing pulling it home, so a shock today never washes out and the series just wanders wherever the accumulated shocks take it. That is what a unit root means: permanence instead of decay. And here is the trap. Two wanderers, set loose independently, will both drift somewhere over a long sample, and any two things that drift will look correlated, because a line through the cloud always slopes one way or the other. Your regression will report a strong relationship between two series that have never met. The fix is to ask the right question first, whether this series is tethered or wandering, using a formal unit-root test, and to study wanderers in their changes (differences) rather than their levels. The figure below makes the trap visceral: regenerate the two independent walks and watch a "significant" relationship keep reappearing out of nothing.
Figure 23.3. Spurious-regression demonstrator. Two random walks generated with independent shocks, plotted together; the readout reports the OLS regression of one on the other with its $R^2$ and $t$-statistic. Reseed and watch high $R^2$ and "significant" $t$ recur across independent draws; the trap is systematic, not a fluke. Switch the regression to first differences and the apparent relationship collapses. Reseed; toggle levels vs. differences.
A standing example. Is US real GDP a random walk? Fit an AR(1) to its logarithm and the estimated coefficient comes back very close to one; run an ADF test and it typically fails to reject the unit-root null. The practical upshot is that output is better modeled in growth rates (first differences) than in levels. The historical episode this draws on is the province of the economic-history volume, but the apparatus is the point here.
So far one series at a time. Macroeconomic questions are usually about several series at once, such as output and inflation, or the interest rate and the exchange rate, moving together. The single-equation tools generalize into a system.
Most interesting macroeconomic questions involve more than one variable. Inflation and the policy rate respond to each other; output, prices, and money move as a system. The vector autoregression is the natural generalization of the AR model to this setting: stack the variables into a vector and let each variable depend on the recent past of all of them.
Because every equation has the same regressors, the lags of all variables, OLS applied equation by equation is efficient, and no system estimator is required for the reduced form. The estimated $A_i$ matrices and the residual covariance $\Sigma$ summarize the dynamics.
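Equation-by-equation estimation can be sketched for the smallest case, a two-variable VAR(1). The example below simulates a system with a hypothetical coefficient matrix and, for simplicity, uncorrelated standard-normal innovations (an assumption of the sketch, not a feature of real macro data), then recovers each row of $A_1$ with a separate OLS regression on the same two lagged regressors.

```python
import random

def simulate_var1(A, n, seed=0):
    """Simulate y_t = A y_{t-1} + u_t for a 2-variable system with N(0, I) innovations."""
    rng = random.Random(seed)
    y = [[0.0, 0.0]]
    for _ in range(n):
        prev = y[-1]
        y.append([A[i][0] * prev[0] + A[i][1] * prev[1] + rng.gauss(0.0, 1.0)
                  for i in range(2)])
    return y

def ols_two_regressors(x1, x2, y):
    """Slopes from OLS of y on a constant, x1 and x2 (demean, then Cramer's rule)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(a * a for a in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * b for a, b in zip(d1, dy))
    s2y = sum(a * b for a, b in zip(d2, dy))
    det = s11 * s22 - s12 * s12
    return [(s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det]

def estimate_var1(y):
    """One OLS regression per equation; every equation shares the same lagged regressors."""
    lag1 = [row[0] for row in y[:-1]]
    lag2 = [row[1] for row in y[:-1]]
    return [ols_two_regressors(lag1, lag2, [row[i] for row in y[1:]]) for i in range(2)]

A_true = [[0.5, 0.1], [0.2, 0.3]]          # hypothetical, stationary
A_hat = estimate_var1(simulate_var1(A_true, 5000))
for row in A_hat:
    print([round(v, 3) for v in row])
```

The two regressions never need to know about each other, which is the practical meaning of "no system estimator is required."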
A reduced-form VAR delivers two things almost for free. It forecasts the whole system jointly, often better than a structural model because it imposes few restrictions; and it tests Granger causality, telling you which series carry predictive content for which others. The impulse response and the variance decomposition seem to promise a third thing: a story about what happens after a shock. But there is a catch. The innovations $\mathbf{u}_t$ are correlated across equations. A movement in the inflation residual tends to come alongside a movement in the rate residual. So when the system "responds to a shock," whose shock was it? The reduced form cannot say.
Why this matters: A VAR lets every variable depend on the recent past of all the others. That buys two things immediately: a joint forecast of the whole system, and a test of which series helps predict which, called "Granger causality." But notice the careful word: helps predict is not causes. Knowing that ice-cream sales help predict drownings does not mean ice cream causes drowning; summer drives both. A VAR can tell you the policy rate helps predict inflation, and that is genuinely useful, but it stops short of saying the rate moved inflation. The reason it stops short is that the surprises in the different equations arrive together, tangled with one another, so "the system's response to a shock" is ambiguous until you untangle which shock you mean. That untangling is the next section, and it is the hardest problem in the chapter.
Figure 23.4. VAR / SVAR impulse-response explorer (the chapter's signature figure). A two-variable system: inflation and the policy rate, the monetary VAR of the standing example. Pick which variable is shocked and the panel shows both variables' dynamic responses. The Granger-causality and variance-decomposition readouts update. In §23.5, flip the Cholesky ordering and watch the impulse responses change: same data, a different identifying assumption, a different economic story. Choose the shock; set the horizon; toggle the ordering.
The impulse responses you just generated assumed an ordering, a choice about which variable can move the other within the period. That choice was made quietly. Bringing it into the open is the structural step, and it is where the data stop being able to settle the argument.
The reduced-form residuals are correlated, and that correlation is the whole problem. An economic shock, a surprise tightening of monetary policy, say, should be a single disturbance with a clear interpretation. But the reduced-form innovation in the rate equation is contaminated by whatever moved inflation at the same instant, and vice versa. The residuals are mixtures. To recover the underlying structural shocks, you have to specify how the mixtures were formed, and the data do not contain that information.
For a two-variable system, $\Sigma=BB'$ supplies three equations (two variances and one covariance) for the four elements of $B$. One restriction is missing. Imposing it is identification, and the impulse responses at horizon $h$, given by $\Theta_h = A^h B$ (written here for a VAR(1) with coefficient matrix $A$), inherit whatever was imposed.
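The missing restriction and its consequences can be made concrete with numbers. The sketch below uses a hypothetical residual covariance $\Sigma$ and VAR(1) matrix $A$, imposes the zero restriction by a Cholesky factorization under each of the two possible orderings, and computes $\Theta_h = A^h B$. The helper names and numerical values are assumptions for illustration; variable 0 stands for inflation and variable 1 for the policy rate.

```python
def cholesky_2x2(S):
    """Lower-triangular B with B B' = S: the second shock has no impact effect on variable 0."""
    b11 = S[0][0] ** 0.5
    b21 = S[1][0] / b11
    b22 = (S[1][1] - b21 * b21) ** 0.5
    return [[b11, 0.0], [b21, b22]]

def swap_order(M):
    """Exchange the two variables (rows and columns) of a 2x2 matrix."""
    return [[M[1][1], M[1][0]], [M[0][1], M[0][0]]]

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def irf(A, B, horizon):
    """Theta_h = A^h B for h = 0..horizon (VAR(1) case)."""
    out, power = [], [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(horizon + 1):
        out.append(matmul(power, B))
        power = matmul(A, power)
    return out

Sigma = [[1.0, 0.5], [0.5, 1.0]]        # hypothetical residual covariance
A = [[0.5, 0.1], [0.2, 0.3]]            # hypothetical VAR(1) coefficients

B_infl_first = cholesky_2x2(Sigma)                          # inflation ordered first
B_rate_first = swap_order(cholesky_2x2(swap_order(Sigma)))  # rate ordered first

for name, B in [("inflation first", B_infl_first), ("rate first", B_rate_first)]:
    theta = irf(A, B, 2)
    # Response of inflation (row 0) to the rate shock (column 1) at h = 0, 1, 2.
    print(name, [round(theta[h][0][1], 3) for h in range(3)])
```

Both choices of $B$ reproduce the same $\Sigma$ exactly, so the data cannot distinguish them; yet the impact response of inflation to a rate shock is zero under one ordering and nonzero under the other.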
Three families of restriction are in common use, and Figure 23.4 lets you feel the consequence of the most common one. Toggle the Cholesky ordering between "rate first" and "inflation first" and the impulse responses visibly change shape. Nothing in the data changed: the same residuals, the same estimated dynamics. What changed is the assumption about which variable can move the other within the period, and that assumption rewrote the economic story. Short-run zero restrictions like Cholesky are one option; long-run restrictions, which assume a shock has no permanent effect on some variable (Blanchard and Quah's decomposition of demand and supply shocks is the standard example), are another, named here but not derived; sign restrictions are the modern alternative that reports the identified set rather than a single line.
This is where the chapter's framing tension lives. Christopher Sims, who introduced the VAR to macroeconomics, argued that the elaborate identifying restrictions of older structural models were "incredible," assumptions imposed for tractability rather than belief, and that an honest empirical macro should let the data speak with as few restrictions as possible. The opposing view holds that without some economic structure the impulse responses are uninterpretable, so the right move is to impose restrictions you can defend and be explicit about them. Both positions are serious, and the toggle you just used is exactly what they disagree about: how much identification the data can honestly support, and what to do about the gap.
Which side has the better of the argument is not settled here. The apparatus (what identification is, why the choice matters, what each scheme buys and gives up) is the chapter's job, and it is now in hand. The verdict over whether atheoretical VARs or structural identification won the field is argued at length in the walkthrough on econometric-methodology credibility, where the methodological stakes get the space they need.
Why this matters: The data hand you correlated shocks: the surprises in inflation and in the interest rate arrive tangled together. To tell an economic story you have to say which way the arrow points within the same period: did the rate move first and inflation follow, or the reverse? Economics has to supply that arrow, because the data cannot. They only show the two moving together, not the order. And here is the uncomfortable part: that single added assumption, the one the data can't check, decides the whole answer. In the figure, flipping the ordering kept every number in the dataset identical and still flipped the story. Sims's worry was that economists were smuggling in arrows they couldn't justify and calling the result a finding. The other camp says you can't avoid choosing an arrow, so choose one you can defend and say so out loud. The honest takeaway is the tension itself, which is why the verdict is argued in a walkthrough, not pronounced here.
Apparatus stop. The VAR/SVAR identification problem you just met is the technical core of the methodological-credibility debate; this is where the walkthrough gets its machinery.
The Cholesky-ordering toggle in Figure 23.4 is the whole methodological argument in one gesture: identical data, a different identifying assumption, a different impulse response. Sims's atheoretical-VAR program was a bid to minimize such untestable assumptions; the structural-identification reply is that some assumption is unavoidable and the discipline should make defensible ones explicitly. The reduced-form VAR (§23.4), the $\mathbf{u}_t = B\boldsymbol{\varepsilon}_t$ mapping (§23.5), and the menu of restrictions (Cholesky, long-run, sign) are the apparatus on which that argument is conducted.
This chapter teaches what identification is and shows that the choice changes the answer; it deliberately stops short of declaring who won. The walkthrough takes the apparatus and argues the verdict: whether the credibility revolution vindicated the design-based skeptics, whether structural macro earned its restrictions back, and where time-series identification sits in that story.
The systems so far have been built from stationary or differenced series. But differencing can throw away something real: when two series wander together, taking their changes erases the relationship that ties them. The next section recovers it.
Recall the spurious-regression warning: two independent wanderers look related, and the fix is to study them in differences. But sometimes two wanderers are genuinely tethered, not to a fixed mean each, but to each other. Short-term and long-term interest rates drift over the decades, yet the spread between them stays within a band. Consumption and income each trend upward, yet the ratio is stable. Differencing such a pair would difference away exactly the long-run relationship that matters. Cointegration is the apparatus for the case where the relationship is real.
The error-correction coefficient is the most interpretable number in this section. If $\lambda = -0.2$, then a fifth of any deviation from the long-run relationship is undone each period; if $\lambda = 0$, there is no pull and the series are not cointegrated at all, so the "relationship" is the spurious kind from §23.3. The Granger representation theorem is what makes this respectable: it guarantees that whenever genuine cointegration is present, an error-correction model is the right way to write the dynamics, so estimating the ECM is not an ad hoc add-on but the implied form. The Johansen procedure extends the idea to systems with possibly several equilibrium relations, and its rank test answers how many; the mechanics of the eigenvalue computation are beyond the scope here, but the meaning of the rank, the count of long-run ties, is the part to carry forward.
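The arithmetic of the pull can be checked by simulation. In the sketch below, $x_t$ is a pure random walk and $y_t$ error-corrects toward $\beta x_t$; under that assumed design the spread follows $z_t=(1+\lambda)z_{t-1}+\text{noise}$, so deviations decay geometrically at rate $1+\lambda$. The cointegrating coefficient is treated as known ($\beta=1$) rather than estimated in a first step, and all function names are hypothetical.

```python
import math
import random

def simulate_pair(lam, n=5000, beta=1.0, seed=0):
    """x is a random walk; y adjusts by lam times last period's gap y - beta*x."""
    rng = random.Random(seed)
    x = y = 0.0
    xs, ys = [], []
    for _ in range(n):
        gap = y - beta * x                    # z_{t-1}, computed before either series moves
        x += rng.gauss(0.0, 1.0)
        y += lam * gap + rng.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(y)
    return xs, ys

def estimate_lambda(xs, ys, beta=1.0):
    """Regress dy_t on z_{t-1} = y_{t-1} - beta*x_{t-1} (no intercept, beta taken as known)."""
    z_lag = [y - beta * x for x, y in zip(xs[:-1], ys[:-1])]
    dy = [b - a for a, b in zip(ys, ys[1:])]
    return sum(z * d for z, d in zip(z_lag, dy)) / sum(z * z for z in z_lag)

def half_life(lam):
    """Periods until half of a deviation is undone, given decay at rate (1 + lam)."""
    return math.log(0.5) / math.log(1.0 + lam)

print("estimated lambda:", round(estimate_lambda(*simulate_pair(-0.2)), 3))
print("half-life at lambda = -0.2:", round(half_life(-0.2), 2), "periods")
```

With $\lambda=-0.2$ a fifth of the gap closes each period, which works out to a half-life of a little over three periods.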
Why this matters: Two series can each wander forever and yet never drift apart from each other, like two dogs off the leash but tied together by a short rope. Each goes where it likes; the rope keeps the distance between them from growing. That shared distance is the long-run equilibrium, and the speed at which the rope yanks them back when they stray is the error-correction term. This is the exact opposite of the spurious-regression trap: there, two wanderers only looked related; here, they truly are, and the rope is real. Earlier the advice was to difference wanderers before studying them, but if you difference a tethered pair you cut the rope and lose the very thing you wanted to see. The figure below has the rope's tightness as a slider: at zero the two series float apart freely, and as you tighten it the gap between them settles into a stationary band even while each series keeps wandering.
Figure 23.5. Cointegration / error-correction explorer. Two I(1) series share a common stochastic trend; the second panel shows their spread $z_t = y_t - x_t$ (the cointegrating relation with $\beta = 1$). At $|\lambda|=0$ the spread itself wanders (not cointegrated). As $|\lambda|$ rises, deviations from the long-run relationship are pulled back faster and the spread becomes visibly stationary. Drag the error-correction speed; reseed.
Everything so far has modeled the conditional mean, where the series is expected to go. A final movement turns to the conditional variance, how uncertain that expectation is, and finds that uncertainty itself has a predictable rhythm.
Look at a long series of daily stock returns and one feature jumps out before any model is fit: the turbulence comes in clumps. Quiet stretches of small moves are interrupted by stormy stretches of large moves, and the storms persist for days or weeks before calm returns. The returns themselves are close to unpredictable, as efficient markets would lead you to expect, but their size is not. Large moves cluster with large moves. This is the stylized fact that the conditional-variance models were built to capture, and it sits orthogonal to everything in §23.1 through §23.6, which modeled the conditional mean.
In the GARCH(1,1), the sum $\alpha_1+\beta_1$ measures persistence: it governs how slowly a volatility shock decays, and the unconditional variance $\omega/(1-\alpha_1-\beta_1)$ is finite only when $\alpha_1+\beta_1<1$. Estimated persistence on daily equity data is routinely above $0.95$, meaning a spike in volatility takes a long time to subside: the empirical content of "turbulence lingers."
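A short simulation shows persistence at work. The sketch below assumes Gaussian standardized shocks and illustrative parameter values ($\omega=0.1$, $\alpha_1=0.1$, $\beta_1=0.8$); the function names, including the half-life helper that treats $\alpha_1+\beta_1$ as the per-period decay rate of a volatility shock, are hypothetical.

```python
import math
import random

def unconditional_variance(omega, alpha, beta):
    """omega / (1 - alpha - beta); finite only when alpha + beta < 1."""
    if alpha + beta >= 1.0:
        raise ValueError("requires alpha + beta < 1")
    return omega / (1.0 - alpha - beta)

def volatility_half_life(persistence):
    """Periods for a variance shock to decay by half at rate alpha + beta."""
    return math.log(0.5) / math.log(persistence)

def simulate_garch11(omega, alpha, beta, n=20000, seed=0):
    """Simulate eps_t = sigma_t * z_t with sigma_t^2 = omega + alpha*eps_{t-1}^2 + beta*sigma_{t-1}^2."""
    rng = random.Random(seed)
    sigma2 = unconditional_variance(omega, alpha, beta)   # start at the long-run level
    returns, variances = [], []
    for _ in range(n):
        eps = sigma2 ** 0.5 * rng.gauss(0.0, 1.0)
        returns.append(eps)
        variances.append(sigma2)
        sigma2 = omega + alpha * eps * eps + beta * sigma2
    return returns, variances

returns, variances = simulate_garch11(0.1, 0.1, 0.8)
sample_var = sum(r * r for r in returns) / len(returns)
print("long-run variance:", round(unconditional_variance(0.1, 0.1, 0.8), 3))
print("sample variance:  ", round(sample_var, 3))
print("half-life at persistence 0.95:", round(volatility_half_life(0.95), 1), "periods")
```

At a persistence of 0.95 the half-life of a volatility shock is about 13.5 periods, which on daily data is close to three trading weeks.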
Why this matters: You cannot predict whether tomorrow's return is up or down; if you could, the trade would already be made and the prediction erased. But you can predict whether tomorrow will be a calm day or a wild one, because calm and wild cluster: a stormy market today tells you the next few days are likely to be stormy too, even though it tells you nothing about which direction the moves will go. That split, where the direction of a surprise is unforecastable but the size of a surprise is forecastable, is the entire idea behind these models. A big move today feeds into a bigger expected swing tomorrow; when that feedback is strong, storms last a long time once they start. The figure below has a persistence dial: turn it low and the series looks like featureless noise; turn it toward one and watch calm and turbulent stretches organize themselves into long swells, no equation required to see it happen.
Figure 23.6. GARCH(1,1) volatility-clustering explorer. The top panel shows a simulated return series; the bottom panel shows the conditional volatility $\sigma_t$ that generated it. At low persistence ($\alpha_1+\beta_1$ small) the returns look like homoskedastic noise. As persistence approaches 1, calm and turbulent periods cluster sharply and the volatility line shows long swells. The slider guards $\alpha_1+\beta_1<1$. Drag the persistence dials; reseed.
The basic GARCH(1,1) has spawned a family of refinements, each fixing a limitation. EGARCH models the logarithm of the variance, so it allows the "leverage effect," where bad news raises volatility more than equally large good news, and needs no non-negativity constraints. GARCH-in-mean (GARCH-M) lets the conditional variance enter the return equation directly, formalizing the idea that investors demand higher expected returns when risk is high. These extensions are named here as the working vocabulary of empirical finance, not derived. Beyond them lies a frontier (Bayesian estimation of high-dimensional VARs, machine-learning approaches to volatility and prediction) that is outside the scope of a first course but worth knowing exists.
| Label | Equation | Description |
|---|---|---|
| Eq. 23.1 | $E[y_t]=\mu,\ \mathrm{Var}(y_t)=\sigma^2,\ \mathrm{Cov}(y_t,y_{t-k})=\gamma_k$ | Covariance-stationarity conditions |
| Eq. 23.2 | $y_t = \sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j} + \kappa_t$ | Wold decomposition |
| Eq. 23.3 | $y_t = c + \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + \varepsilon_t$ | AR(p) |
| Eq. 23.4 | $y_t = \mu + \varepsilon_t + \theta_1\varepsilon_{t-1} + \cdots + \theta_q\varepsilon_{t-q}$ | MA(q) |
| Eq. 23.5 | $\phi(L)y_t = \theta(L)\varepsilon_t$ | ARMA in lag-operator form |
| Eq. 23.6 | $y_t = y_{t-1} + \varepsilon_t$ (with drift $+\,\delta$) | Random walk |
| Eq. 23.7 | $\Delta y_t = \alpha + \gamma y_{t-1} + \sum_{i=1}^{k}\delta_i\Delta y_{t-i} + \varepsilon_t;\ H_0:\gamma=0$ | Augmented Dickey-Fuller regression |
| Eq. 23.8 | $\mathbf{y}_t = \mathbf{c} + A_1\mathbf{y}_{t-1} + \cdots + A_p\mathbf{y}_{t-p} + \mathbf{u}_t$ | Reduced-form VAR(p) |
| Eq. 23.9 | $\mathbf{u}_t = B\boldsymbol{\varepsilon}_t,\ E[\boldsymbol{\varepsilon}_t\boldsymbol{\varepsilon}_t']=I$ | SVAR mapping (residuals to structural shocks) |
| Eq. 23.10 | $\Theta_h = A^h B$ | Structural impulse response at horizon $h$ (VAR(1) case) |
| Eq. 23.11 | $z_t = y_t - \beta x_t \sim I(0)$ when $y_t, x_t \sim I(1)$ | Cointegrating relation |
| Eq. 23.12 | $\Delta y_t = \lambda(y_{t-1}-\beta x_{t-1}) + \cdots + \varepsilon_t$ | Error-correction model |
| Eq. 23.13 | $\sigma_t^2 = \omega + \sum_{i=1}^{q}\alpha_i\varepsilon_{t-i}^2$ | ARCH(q) conditional variance |
| Eq. 23.14 | $\sigma_t^2 = \omega + \sum_{i=1}^{q}\alpha_i\varepsilon_{t-i}^2 + \sum_{j=1}^{p}\beta_j\sigma_{t-j}^2$ | GARCH(p,q) conditional variance |
Wold (1938); Box & Jenkins (1970); Granger & Newbold (1974); Dickey & Fuller (1979); Sims (1980); Engle (1982); Bollerslev (1986); Engle & Granger (1987); Blanchard & Quah (1989); Johansen (1991); Hamilton (1994); Stock & Watson (2001); Enders (2014).