The Value of p-Values - himaginary’s diary

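Earlier this month, the American Statistical Association (ASA) released six principles on the use of p-values. Ronald L. Wasserstein, the ASA's Executive Director, who led the effort, gave an interview to Retraction Watch, a blog that monitors retractions of papers*1, and explained that the recent reproducibility crisis is the background to the statement (H/T Mostly Economics). The six principles have also been widely covered in Japan, and a Naver Matome page gives a good overview of that coverage.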

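On its website, the ASA has published, alongside the statement setting out the six principles, the responses of 21 statisticians to the statement's discussion of p-values. In one of them, a short essay whose title this post borrows ("The Value of p-Values"), UC Berkeley professor Philip B. Stark says that he supports the spirit of the statement but has some reservations about its content, and raises the following points.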

The informal definition of a p-value at the beginning of the document is vague and unhelpful.

The statement draws a distinction between “the null hypothesis” and “the underlying assumptions” under which the p-value is calculated. But the null hypothesis is the complete set of assumptions under which the p-value is calculated.

The “other approaches” section ignores the fact that the assumptions of some of those methods are identical to those of p-values. Indeed, some of the methods use p-values as input (e.g., the False Discovery Rate).

The statement ignores the fact that hypothesis tests apply in many situations in which there is no parameter or notion of an “effect,” and hence nothing to estimate or to calculate an uncertainty for.

The statement ignores the crucial distinction between frequentist and Bayesian inference.

[Additional points made in a footnote] The document has other problems, among them: It characterizes a p-value of 0.05 as “weak” evidence against the null hypothesis, but strength of evidence depends crucially on context. It categorically recommends using multiple numerical and graphical summaries of data, but there are situations in which these would be gratuitous distractions—if not an invitation to p-hacking!

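(My translation)

The informal definition of a p-value given at the start of the document is vague and unhelpful*2.

The statement distinguishes between “the null hypothesis” and “the underlying assumptions” on which the p-value is calculated. But the null hypothesis is the full set of assumptions on which the p-value is calculated*3.

The “other approaches” section ignores the fact that some of those methods rest on assumptions identical to those of p-values. In fact, some of the methods take p-values as input (for example, the false discovery rate).

The statement ignores the fact that hypothesis tests are applied in many situations where there is no parameter or notion of an “effect.” In such cases there is nothing to estimate and nothing to compute an uncertainty for.

The statement ignores the crucial distinction between frequentist and Bayesian inference.

[Additional points made in a footnote] The statement has other problems as well, including the following:

It characterizes a p-value of 0.05 as “weak” evidence against the null hypothesis, but the strength of evidence depends crucially on context.

It categorically recommends using multiple numerical and graphical summaries of data, but there are situations in which these would be needless distractions, or even an invitation to p-hacking.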
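Stark's remark that some of the "other approaches" take p-values as input can be illustrated with the Benjamini-Hochberg procedure, a standard way of controlling the false discovery rate. The following is only a minimal sketch: the function name `benjamini_hochberg`, the level `q`, and the eight example p-values are made up for illustration and come from neither the ASA statement nor Stark's essay.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    # Indices of the p-values, ordered from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= k * q / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    # Reject every hypothesis whose p-value is ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from eight separate tests (illustrative only).
example_p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
decisions = benjamini_hochberg(example_p_values, q=0.05)
print(decisions)
```

The only inputs are the p-values themselves and the target level `q`, so whatever assumptions were needed to compute those p-values carry over to the procedure.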

Stark then offers his own, simpler explanation as an alternative to the statement.

Science progresses in part by ruling out potential explanations of data. p-values help assess whether a given explanation is adequate. The explanation being assessed is often called “the null hypothesis.”
If the p-value is small, either the explanation is wrong, or the explanation is right but something unlikely happened—something that had a probability equal to the p-value. Small p-values are stronger evidence that the explanation is wrong: the data cast doubt on that explanation.
If the p-value is large, the explanation accounts for the data adequately—although the explanation might still be wrong. Large p-values are not evidence that the explanation is right: lack of evidence that an explanation is wrong is not evidence that the explanation is right. If the data are few or low quality, they might not provide much evidence, period.
There is no bright line for whether an explanation is adequate: scientific context matters.
A p-value is computed by assuming that the explanation is right. The p-value is not the probability that the explanation is right.
p-values do not measure the size or importance of an effect, but they help distinguish real effects from artifacts. In this way, they complement estimates of effect size and confidence intervals.
Moreover, p-values can be used in some contexts in which the notion of “effect size” does not make sense. Hence, p-values may be useful in situations in which estimates of effect size and confidence intervals are not.
Like all tools, p-values can be misused. One common misuse is to hunt for explanations that have small p-values, and report only those, without taking into account or reporting the hunting. Such “p-hacking,” “significance hunting,” selective reporting, and failing to account for the fact that more than one explanation was examined (“multiplicity”) can make the reported p-values misleading.
Another misuse involves testing “straw man” explanations that have no hope of explaining the data: null hypotheses that have little connection to how the data were collected or generated. If the explanation is unrealistic, a small p-value is not surprising. Nor is it illuminating.
Many fields and many journals consider a result to be scientifically established if and only if a p-value is below some threshold, such as 0.05. This is poor science and poor statistics, and creates incentives for researchers to “game” their analyses by p-hacking, selective reporting, ignoring multiplicity, and using inappropriate or contrived null hypotheses.
Such misuses can result in scientific “discoveries” that turn out to be false or that cannot be replicated. This has contributed to the current “crisis of reproducibility” in science.
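(My translation)
Science progresses in part by ruling out candidate explanations of data. p-values help assess whether a given explanation is adequate. The explanation being assessed is often called “the null hypothesis.”
If the p-value is small, either the explanation is wrong, or the explanation is right but something unlikely happened. The probability of that unlikely event equals the p-value. The smaller the p-value, the stronger the evidence that the explanation is wrong; that is, the data cast doubt on the explanation.
If the p-value is large, the explanation accounts for the data adequately, although it may still be wrong. A large p-value is not evidence that the explanation is right: the absence of evidence that an explanation is wrong is not evidence that it is right. If the data are few or of low quality, they may simply not provide much evidence.
There is no bright line for whether an explanation is adequate; the scientific context matters.
A p-value is computed on the assumption that the explanation is right. It is not the probability that the explanation is right.
p-values do not measure the size or importance of an effect, but they help distinguish real effects from artifacts. In that respect they complement estimates of effect size and confidence intervals.
Moreover, p-values can be used in some settings where the notion of “effect size” makes no sense. They may therefore be useful in situations where estimates of effect size and confidence intervals are not.
Like any tool, p-values can be misused. One common misuse is to hunt for explanations with small p-values and report only those, without accounting for or reporting the hunt. Such “p-hacking” or “significance hunting,” selective reporting, and failure to account for the fact that more than one explanation was examined (“multiplicity”) can make the reported p-values misleading.
Another misuse is to test “straw man” explanations that have no hope of explaining the data, that is, null hypotheses with little connection to how the data were collected or generated. If the explanation is unrealistic, a small p-value is not surprising, nor is it illuminating.
Many fields and many journals treat a result as scientifically established if and only if the p-value falls below some threshold, such as 0.05. This is poor science and poor statistics, and it gives researchers an incentive to “game” their analyses through p-hacking, selective reporting, ignoring multiplicity, and using inappropriate or contrived null hypotheses.
Such misuses can lead to scientific “discoveries” that later turn out to be false or cannot be replicated. This has contributed to the current “crisis of reproducibility” in science.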
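The point about hunting and multiplicity can be checked with a small simulation. The sketch below is an illustration under assumptions of my own, not something taken from Stark's essay: every dataset is pure noise, so the null hypothesis (mean zero, known standard deviation 1) is true in every test, each p-value comes from a two-sided z-test, and the function names, sample sizes, and seed are all hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def z_test_p_value(sample, sigma=1.0):
    """Two-sided p-value for H0: mean = 0, assuming a known standard deviation."""
    z = fmean(sample) * math.sqrt(len(sample)) / sigma
    return 2.0 * (1.0 - STANDARD_NORMAL.cdf(abs(z)))


def smallest_p_after_hunting(rng, n_tests=20, n_obs=30):
    """Run n_tests independent tests on pure noise and keep only the smallest p-value."""
    return min(
        z_test_p_value([rng.gauss(0.0, 1.0) for _ in range(n_obs)])
        for _ in range(n_tests)
    )


def false_discovery_share(n_studies=2000, n_tests=20, alpha=0.05, seed=0):
    """Share of simulated studies whose reported (smallest) p-value falls below alpha."""
    rng = random.Random(seed)
    hits = sum(
        smallest_p_after_hunting(rng, n_tests) < alpha for _ in range(n_studies)
    )
    return hits / n_studies


honest = false_discovery_share(n_tests=1)   # one pre-specified test per study
hunted = false_discovery_share(n_tests=20)  # best of 20 tests, hunt unreported
expected_hunted = 1 - (1 - 0.05) ** 20      # about 0.64 for independent tests
print(honest, hunted, expected_hunted)
```

With a single pre-specified test, roughly 5% of the noise-only studies should fall below 0.05. When only the best of 20 independent tests is reported, the share should be close to 1 - 0.95^20, or about 64%, even though there is nothing to find.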

*1: cf. Retraction Watch - Wikipedia, related Japanese article 1, related Japanese article 2.

*2: The passage in the statement that this presumably refers to:
What is a p-value?
Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.
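To make the quoted informal definition concrete, here is a minimal sketch of an exact permutation test for the difference in sample means between two groups. The "specified statistical model" is my own choice for illustration and is not prescribed by the statement: it assumes the group labels are exchangeable, so every split of the pooled data into the two groups is equally likely. The function name and the toy data are hypothetical.

```python
from itertools import combinations
from statistics import fmean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for the difference in sample means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(fmean(group_a) - fmean(group_b))

    as_extreme = 0
    n_splits = 0
    # Enumerate every way of assigning n_a of the pooled values to group A.
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        n_splits += 1
        # Small tolerance so exact ties are not lost to floating-point rounding.
        if diff >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / n_splits


# Toy data (hypothetical): two groups of three observations each.
p = permutation_p_value([1, 2, 3], [4, 5, 6])
print(p)
```

For this toy data, only 2 of the 20 possible splits give a mean difference at least as large as the observed one, so the p-value is 0.1. Full enumeration is feasible only for small samples; larger ones would need random sampling of splits instead.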

*3: In the statement, Principle 1, “P-values can indicate how incompatible the data are with a specified statistical model,” is followed by this passage:
A p-value provides one approach to summarizing the incompatibility between a particular set of data and a proposed model for the data. The most common context is a model, constructed under a set of assumptions, together with a so-called “null hypothesis.” Often the null hypothesis postulates the absence of an effect, such as no difference between two groups, or the absence of a relationship between a factor and an outcome. The smaller the p-value, the greater the statistical incompatibility of the data with the null hypothesis, if the underlying assumptions used to calculate the p-value hold. This incompatibility can be interpreted as casting doubt on or providing evidence against the null hypothesis or the underlying assumptions.