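Revisiting the recurring controversies about p-values and confidence intervals

The abstract and conclusions of a paper with this title are featured on Error Statistics blog, the blog of Virginia Tech statistician Deborah G. Mayo (H/T Dave Giles). The original title is "Recurring Controversies About P Values and Confidence Intervals Revisited", and the author is Mayo's colleague Aris Spanos.
The abstract follows.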

The use, abuse, interpretations and reinterpretations of the notion of a P value has been a hot topic of controversy since the 1950s in statistics and several applied fields, including psychology, sociology, ecology, medicine, and economics.
The initial controversy between Fisher’s significance testing and the Neyman and Pearson (N-P; 1933) hypothesis testing concerned the extent to which the pre-data Type I error probability α can address the arbitrariness and potential abuse of Fisher’s post-data threshold for the P value. Fisher adopted a falsificationist stance and viewed the P value as an indicator of disagreement (inconsistency, contradiction) between data x0 and the null hypothesis (H0). Indeed, Fisher (1925: 80) went as far as to claim that “The actual value of p…indicates the strength of evidence against the hypothesis.” Neyman’s behavioristic interpretation of the pre-data Type I and II error probabilities precluded any evidential interpretation for the accept/reject the null (H0) rules, insisting that accept (reject) H0 does not connote the truth (falsity) of H0. The last exchange between these protagonists (Fisher 1955, Pearson 1955, Neyman 1956) did nothing to shed light on these issues. By the early 1960s, it was clear that neither account of frequentist testing provided an adequate answer to the question (Mayo 1996): When do data x0 provide evidence for or against a hypothesis H?
The primary aim of this paper is to revisit several charges, interpretations, and comparisons of the P value with other procedures as they relate to their primary aims and objectives, the nature of the questions posed to the data, and the nature of their underlying reasoning and the ensuing inferences. The idea is to shed light on some of these issues using the error-statistical perspective; see Mayo and Spanos (2011).
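(My translation)
The use, misuse, interpretation, and reinterpretation of the concept of the p-value have been a major subject of debate since the 1950s in statistics and in its applied fields, such as psychology, sociology, ecology, medicine, and economics.
The controversy between Fisher's significance testing and Neyman-Pearson (N-P; 1933) hypothesis testing turned on how far the pre-data probability of a Type I error, α, can resolve the arbitrariness and potential for misuse in Fisher's post-data threshold for the p-value. Fisher took a falsificationist position and regarded the p-value as an indicator of disagreement (inconsistency, contradiction) between the data x₀ and the null hypothesis (H₀). Indeed, Fisher (1925: 80) goes so far as to claim that "the actual value of p ... indicates the strength of the evidence against the hypothesis." Neyman's behavioristic interpretation of the pre-data Type I and Type II error probabilities ruled out any evidential interpretation of the rules for accepting or rejecting the null hypothesis (H₀). On Neyman's interpretation, accepting (rejecting) H₀ does not mean that H₀ is true (false). The last exchange between the protagonists of the two camps (Fisher 1955, Pearson 1955, Neyman 1956) brought no progress at all on this issue. By the early 1960s it had become clear that neither account of frequentist testing gave an adequate answer to the question of when data x₀ provide evidence for or against a hypothesis H (Mayo 1996).
The main aim of this paper is to revisit the criticisms and interpretations of the p-value, and its comparisons with other procedures, in terms of those procedures' primary aims and objectives, the nature of the questions posed to the data, and the nature of the underlying reasoning and the resulting inferences. The issues are clarified here from the error-statistical perspective; see Mayo and Spanos (2011).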

The conclusions follow.

The paper focused primarily on certain charges, claims, and interpretations of the P value as they relate to CIs and the AIC. It was argued that some of these comparisons and claims are misleading because they ignore key differences in the procedures being compared, such as (1) their primary aims and objectives, (2) the nature of the questions posed to the data, as well as (3) the nature of their underlying reasoning and the ensuing inferences.
In the case of the P value, the crucial issue is whether Fisher’s evidential interpretation of the P value as “indicating the strength of evidence against H0” is appropriate. It is argued that, despite Fisher’s maligning of the Type II error, a principled way to provide an adequate evidential account, in the form of post-data severity evaluation, calls for taking into account the power of the test.
The error-statistical perspective brings out a key weakness of the P value and addresses several foundational issues raised in frequentist testing, including the fallacies of acceptance and rejection as well as misinterpretations of observed CIs; see Mayo and Spanos (2011). The paper also uncovers the connection between model selection procedures and hypothesis testing, revealing the inherent unreliability of the former. Hence, the choice between different procedures should not be “stylistic” (Murtaugh 2013), but should depend on the questions of interest, the answers sought, and the reliability of the procedures.
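(My translation)
This paper focused mainly on several criticisms and claims concerning the p-value, and on its interpretation in relation to confidence intervals and the Akaike information criterion. It argued that some of these comparisons and claims are misleading, because they ignore important differences between the procedures being compared: (1) their primary aims and objectives, (2) the nature of the questions posed to the data, and (3) the nature of the underlying reasoning and the resulting inferences.
As for the p-value, the crucial question is whether Fisher's evidential interpretation of it as "indicating the strength of the evidence against the hypothesis" is appropriate. Fisher made light of the Type II error, but the paper argued that a principled way of providing an adequate evidential account, in the form of a post-data severity evaluation, has to take the power of the test into account.
The error-statistical perspective brings out an important weakness of the p-value and sheds light on several foundational problems that arise in frequentist testing, such as the fallacies of acceptance and rejection and the misinterpretation of observed confidence intervals; see Mayo and Spanos (2011). The paper also clarifies the relationship between model selection procedures and hypothesis testing, showing that the former are inherently unreliable. Hence the choice among different procedures is not "a matter of style" (Murtaugh 2013*1); it depends on the questions asked, the answers sought, and the reliability of the procedures.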
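To make the point about post-data severity and power a little more concrete, here is a minimal sketch. It assumes the simplest textbook setting (a one-sided test of H0: μ ≤ μ0 against μ > μ0 for a normal mean with known σ) and the form the severity evaluation is usually given in that setting after a rejection, SEV(μ > μ1) = P(X̄ ≤ x̄0; μ = μ1). The setting, the function names, and the numbers are my own illustrative assumptions; none of them come from the quoted abstract or conclusions.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, computed from the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def z_statistic(xbar: float, mu0: float, sigma: float, n: int) -> float:
    """Standardized distance between the observed mean and the null value mu0."""
    return math.sqrt(n) * (xbar - mu0) / sigma


def p_value(xbar: float, mu0: float, sigma: float, n: int) -> float:
    """One-sided p-value for H0: mu <= mu0 against mu > mu0 (sigma known)."""
    return 1.0 - normal_cdf(z_statistic(xbar, mu0, sigma, n))


def severity_mu_greater(xbar: float, mu1: float, sigma: float, n: int) -> float:
    """Assumed post-rejection severity for the claim mu > mu1:
    P(Xbar <= observed xbar; mu = mu1)."""
    return normal_cdf(math.sqrt(n) * (xbar - mu1) / sigma)


# Two hypothetical studies with the same z statistic (hence the same p-value)
# but very different sample sizes.
small = dict(xbar=0.2, sigma=1.0, n=100)
large = dict(xbar=0.02, sigma=1.0, n=10_000)

for name, study in (("n=100", small), ("n=10000", large)):
    p = p_value(study["xbar"], 0.0, study["sigma"], study["n"])
    sev = severity_mu_greater(study["xbar"], 0.2, study["sigma"], study["n"])
    print(f"{name}: p = {p:.4f}, SEV(mu > 0.2) = {sev:.4f}")
```

In this toy setup both studies reject H0 with the same p-value, yet the claim μ > 0.2 passes with severity 0.5 in the small study and with severity essentially zero in the large one. That is one way to read the argument that the p-value alone does not settle how much of a discrepancy from H0 the data warrant.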

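Incidentally, Paul A. Murtaugh (Oregon State University), who is mentioned at the end of the conclusions above, contributed a rejoinder to the same journal (Ecology). Murtaugh's original paper, "In defense of P values", can be read here (as with the Spanos paper above and Murtaugh's rejoinder to it, Mayo has made the pdf available on her blog). Its abstract follows.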

Statistical hypothesis testing has been widely criticized by ecologists in recent years. I review some of the more persistent criticisms of P values and argue that most stem from misunderstandings or incorrect interpretations, rather than from intrinsic shortcomings of the P value. I show that P values are intimately linked to confidence intervals and to differences in Akaike’s information criterion (ΔAIC), two metrics that have been advocated as replacements for the P value. The choice of a threshold value of ΔAIC that breaks ties among competing models is as arbitrary as the choice of the probability of a Type I error in hypothesis testing, and several other criticisms of the P value apply equally to ΔAIC. Since P values, confidence intervals, and ΔAIC are based on the same statistical information, all have their places in modern statistical practice. The choice of which to use should be stylistic, dictated by details of the application rather than by dogmatic, a priori considerations.
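(My translation)
Statistical hypothesis testing has been criticized by many ecologists in recent years. Here I take up some of the most persistent criticisms of the p-value and argue that most of them rest on misunderstandings or incorrect interpretations rather than on intrinsic shortcomings of the p-value. I also show that the p-value is closely related to confidence intervals and to differences in the Akaike information criterion (ΔAIC), two metrics that have been proposed as replacements for it. The choice of a ΔAIC threshold that decides among competing models is just as arbitrary as the choice of the Type I error probability in hypothesis testing, and other criticisms of the p-value apply equally to ΔAIC. Because p-values, confidence intervals, and ΔAIC are all based on the same statistical information, each has its uses in modern statistical practice. Which one to use should be treated as a matter of style determined by the details of the application, not chosen a priori on dogmatic grounds.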
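The claimed link between ΔAIC and the p-value can be sketched numerically. The sketch assumes two nested models that differ by exactly one parameter, with ΔAIC defined as AIC(reduced) minus AIC(full); in that case ΔAIC equals the likelihood-ratio statistic minus 2, and the statistic is referred to a chi-square distribution with 1 degree of freedom. The function names and the ΔAIC values tried are illustrative choices of mine, not anything taken from Murtaugh's abstract.

```python
import math


def chi2_sf_1df(x: float) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with 1 df.

    For 1 df this equals P(|Z| > sqrt(x)) for a standard normal Z.
    """
    return math.erfc(math.sqrt(x / 2.0))


def p_value_from_delta_aic(delta_aic: float) -> float:
    """p-value of the likelihood-ratio test implied by a given delta AIC.

    Assumes nested models differing by one parameter and
    delta_aic = AIC(reduced) - AIC(full) = LR - 2.
    """
    lr = delta_aic + 2.0
    if lr < 0.0:
        raise ValueError("delta_aic below -2 is impossible for nested models")
    return chi2_sf_1df(lr)


# Commonly quoted delta AIC cut-offs and the p-values they correspond to
# under the assumptions above.
for delta in (0.0, 2.0, 4.0, 7.0):
    print(f"delta AIC = {delta:>3}: p = {p_value_from_delta_aic(delta):.4f}")
```

Under these assumptions, preferring the larger model whenever ΔAIC > 0 amounts to testing at roughly the 0.157 level, and requiring ΔAIC > 2 amounts to testing at roughly 0.046. Any ΔAIC cut-off therefore maps one-to-one onto a significance level, which is the sense in which the two thresholds are equally arbitrary.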

Regarding these, Mayo wrote in the comment section of her blog entry: "I just quickly scanned Murtaugh’s paper. I credit him for at least trying to defend P-values against criticisms based on mere misuses. He doesn’t deal with Spanos’s points in his reply." Andrew Gelman, who argued with Mayo last year over whether the p-value is a conditional probability (cf. here), also turned up in the same comment section; he mentioned that he had discussed Murtaugh's paper on his own blog and said that he agrees with Murtaugh on some points and not on others. Mayo commented that she had not known of Murtaugh in the first place, to which another commenter, who said he had been a university classmate of Murtaugh's, replied that "Paul Murtaugh is indeed a sensible statistician."

*1: In the paper's reference list, the year is given as 2014 rather than 2013.