
In a Metrics Monday entry under that title (the original title is “2SLS–Chronicle of a Death Foretold?”), Marc F. Bellemare takes up a paper by Alwyn Young (LSE) that has reportedly been drawing attention lately.

The paper is titled “Consistency without Inference: Instrumental Variables in Practical Application.” In it, Young uses the bootstrap to conduct a meta-analysis of 1,400 2SLS coefficients across 32 papers published in the AEA journals, and to essentially ask: “Is 2SLS all that it is cracked up to be?”
For a while now, I have been thinking that with the Credibility Revolution having brought the focus of applied micro back to getting causal (unbiased) estimates, the next logical step–the Second Credibility Revolution, so to speak–should be for the literature to focus on getting the standard errors right. Young’s paper–along with the Abadie et al. (2017) paper on clustering I discussed a few weeks ago–is a step in that direction.


In this paper I show that two stage least squares (hereafter, 2SLS or IV) methods produce estimates that, in practice, rarely identify parameters of interest more accurately or substantively differently than is achieved by biased ordinary least squares (OLS).
...I maintain, throughout, the exact specification used by authors and their identifying assumption that the excluded variables are orthogonal to the second stage residuals. When I bootstrap, I draw samples in a fashion consistent with the error dependence within groups of observations and independence across observations implied by authors’ standard error calculations. Thus, this paper is not about point estimates or the validity of fundamental assumptions, but rather concerns itself with the quality of inference within the framework established by authors themselves. ...
I find that, depending upon the bootstrap method used, 2SLS point estimates are falsely declared significant between ⅓ and ½ of the time, while their bootstrapped 99 percent confidence interval includes the OLS point estimate between 92 and 94 percent of the time and the entirety of the bootstrapped OLS 99 percent confidence interval between 75 and 83 percent of the time. The extraordinary sampling variability of IV estimates is reflected in their sensitivity to outliers. With the removal of only one or two clusters or observations 45 and 63 percent, respectively, of reported .01 significant 2SLS results can be rendered insignificant at the same level. I find that only 8 to 14 percent of regressions can reject the null that the OLS estimates are in fact unbiased at the .01 level. This is important because the ln mean squared error of 2SLS around its own population moment is on average 4.77 greater than the ln mean squared error of OLS around its population moment, so if OLS is unbiased the use of 2SLS is, from a quadratic loss point of view, a regrettable choice. Surprisingly, I find that the ln mean squared error of 2SLS around its population moment is on average 1.52 greater than that of OLS around the same moment, i.e. in applied work biased OLS is on average more accurate in estimating the IV population moment than 2SLS itself! Moreover, the bias of 2SLS methods is greater than the bias of OLS (from the 2SLS moment) in about 1/6 of coefficients. I find that the null that all first stage coefficients are zero can only be rejected at the .01 level between 52 to 70 percent of the time, i.e. in ⅓ to ½ of published regressions one cannot reject the null that the instruments are totally irrelevant and the observed correlation between the endogenous variables and the excluded instruments, despite the exogeneity of the latter in the population, is due to a wholly undesirable finite sample correlation between the instruments and the endogenous errors.
Only one in ten to twelve instrumented coefficients resides in a regression that rejects the instrument irrelevance and the no-OLS bias nulls at the .01 level. Only 5 to 6 percent of instrumented coefficients meet these standards of credibility while producing a confidence interval that does not contain the OLS point estimate.
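To make one of the checks in the abstract concrete, here is a minimal Python sketch. It is not Young's procedure: the data come from a made-up process (`make_sample`, with hypothetical parameters `pi` for first-stage strength and `rho` for endogeneity), the model is just-identified with a single instrument, and the bootstrap is a plain pairs cluster bootstrap with a percentile interval, chosen here only for simplicity. The sketch asks whether the bootstrapped 99 percent confidence interval of the 2SLS estimate contains the OLS point estimate.

```python
import numpy as np

def ols_slope(y, x):
    """OLS slope of y on x (with an intercept)."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))

def iv_slope(y, x, z):
    """Just-identified 2SLS slope: cov(z, y) / cov(z, x)."""
    zc = z - z.mean()
    return float(zc @ (y - y.mean()) / (zc @ (x - x.mean())))

def make_sample(seed, n_clusters=30, cluster_size=10, beta=1.0, pi=0.3, rho=0.5):
    """Hypothetical data: cluster-level instrument z, endogenous x, outcome y."""
    rng = np.random.default_rng(seed)
    cluster = np.repeat(np.arange(n_clusters), cluster_size)
    z = rng.normal(size=n_clusters)[cluster]   # instrument, constant within a cluster
    u = rng.normal(size=cluster.size)          # second-stage error
    # First-stage error with corr(u, v) = rho, which is what makes OLS biased.
    v = rho * u + np.sqrt(1 - rho**2) * rng.normal(size=cluster.size)
    x = pi * z + v                             # first stage
    y = beta * x + u                           # second stage
    return y, x, z, cluster

def cluster_bootstrap_iv(y, x, z, cluster, draws=500, seed=0):
    """Resample whole clusters with replacement and re-estimate the IV slope."""
    rng = np.random.default_rng(seed)
    rows = [np.flatnonzero(cluster == g) for g in np.unique(cluster)]
    out = np.empty(draws)
    for b in range(draws):
        pick = rng.integers(0, len(rows), size=len(rows))
        idx = np.concatenate([rows[i] for i in pick])
        out[b] = iv_slope(y[idx], x[idx], z[idx])
    return out

def percentile_ci(draws, level=0.99):
    """Percentile interval from bootstrap draws."""
    a = 100 * (1 - level) / 2
    lo, hi = np.percentile(draws, [a, 100 - a])
    return float(lo), float(hi)

if __name__ == "__main__":
    y, x, z, cluster = make_sample(seed=1)
    b_ols, b_iv = ols_slope(y, x), iv_slope(y, x, z)
    lo, hi = percentile_ci(cluster_bootstrap_iv(y, x, z, cluster))
    print(f"OLS {b_ols:.3f}  2SLS {b_iv:.3f}  99% bootstrap CI [{lo:.3f}, {hi:.3f}]")
    print("CI contains OLS point estimate:", lo <= b_ols <= hi)
```

Lowering `pi` weakens the instrument and should widen the 2SLS interval. The output depends entirely on the assumed parameters and says nothing about the papers in Young's sample.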


As regards the title of this post, the question mark at the end signals that I don’t think applied econometricians will stop using 2SLS. I do think, however, that the religious reverence in which 2SLS results using plausibly exogenous IV are held might weaken in the near future given the inference issues highlighted by Young.

*1: cf. here