So goes an NBER paper that has been posted (H/T Tyler Cowen). The original title is "How Credible is the Credibility Revolution?", and the author is Kevin Lang (Boston University).

Suppose you test a null hypothesis, and the t turns out to be 1.96. Assume the model is correctly specified and the t-statistic is really distributed as t. What is the probability that the null hypothesis is actually true?
...if we don’t stop to think, most of us trained in the frequentist tradition will respond “5 percent.” As Colquhoun (2014) points out, this is obviously incorrect. The probability that the null is false depends on the likelihood of getting a t of 1.96 if the null is false and, thus, indirectly, on the power of the test. The probability also depends on the ex-ante probability that the null was true, your prior if you are a Bayesian. If we are almost sure the null hypothesis is false, we should continue believing that the null is false even when we fail to reject. This is the message of DeLong and Lang (1992), who find that at least two-thirds of published unrejected nulls are false and cannot reject that 100% of the unrejected nulls are false when the unrejected hypothesis is central to the paper’s message. They conclude that journals publish unrejected nulls only when failing to reject them is very surprising.
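The Bayes-rule arithmetic behind this point can be sketched as follows. The prior, significance level, and power values used here are illustrative assumptions for the sketch, not figures from the paper:

```python
def prob_null_given_rejection(prior_null, alpha, power):
    """P(null true | test rejects), by Bayes' rule.

    A rejection occurs with probability `alpha` when the null is true
    (a false positive) and with probability `power` when it is false
    (a true positive).
    """
    false_positive = alpha * prior_null
    true_positive = power * (1.0 - prior_null)
    return false_positive / (false_positive + true_positive)

# With a 50-50 prior and 80% power, a rejection at the 5% level still
# leaves roughly a 6% chance that the null is true -- not 5%.
print(prob_null_given_rejection(0.5, 0.05, 0.8))

# With mostly-true nulls (prior 0.8) and weak power (0.2), fully half
# of all rejections are false rejections.
print(prob_null_given_rejection(0.8, 0.05, 0.2))
```

The second case illustrates why the answer depends on both the prior and the power: holding the significance level fixed at 5%, the share of rejections that are false can range anywhere from near zero to well above one half.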
My approach addresses the counterpart to the question in DeLong and Lang (1992): what proportion of rejected nulls are true?
I limit the sample to articles that measure causal effects using techniques associated with the credibility revolution (instrumental variables, randomized controlled trials, difference-in-differences, matching). This is not intended to disparage the contribution of the credibility revolution. Although I have been critical of some of the abuses of the techniques it promotes (Kahn-Lang and Lang 2020), these techniques have greatly influenced the profession, including me, in generally positive ways. However, studies drawing on credibility revolution techniques often claim “convincing evidence” of a causal effect such that we may draw a strong policy conclusion from a single study. My goal is to help us think more clearly about hypothesis testing in policy research. I focus on credibility revolution techniques because, as I have noted elsewhere (Lang and Palacios 2018), structural labor economists rarely put standard errors on their policy estimates. Moreover, most structural papers do not test a clearly stated hypothesis.
Using the model, I ask what proportion of rejected nulls are, in fact, true. Under my preferred specification, I estimate that 41% of published rejected nulls are false rejections. Almost two-thirds of narrow rejections, those with t just above 1.96, are false rejections. To get to the conventional .05 level requires a |t| greater than 5.48. Only 18% of rejected nulls, including those with |t| > 10, satisfy this requirement. In a policy context, unless the level of statistical significance dramatically exceeds current conventional levels, this will generally require us to be cautious about applying the findings of a single study, even one conducted honestly and carefully. Of course, in a decision-theoretic context, how certain we need to be depends on the costs of type 1 and type 2 errors.

In "Lower the p-value threshold for the statistical significance of new findings from 5% to 0.5% - himaginary's diary," I covered the proposal to lower the significance level from 5% to 0.5%; but taking |t| of 5.48 or 10 as the benchmark would demand a far stricter standard than 0.5%.
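To see how much stricter a |t| threshold of 5.48 is than a 0.5% significance level, one can compute the implied two-sided p-value. This sketch uses the standard normal as a large-sample approximation to the t distribution, which is an assumption on my part:

```python
import math

def two_sided_p(t):
    """Two-sided p-value for a test statistic t under a standard normal
    reference distribution: P(|Z| > |t|) = erfc(|t| / sqrt(2))."""
    return math.erfc(abs(t) / math.sqrt(2.0))

print(two_sided_p(1.96))  # roughly 0.05, the conventional threshold
print(two_sided_p(5.48))  # on the order of 1e-8, far below 0.005
```

A |t| of 5.48 corresponds to a p-value around 4×10⁻⁸, several orders of magnitude below the proposed 0.5% threshold.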