# Statistical Significance in Big Data

At the end of last year, Dave Giles noted that an entry with this title had been posted on a blog called Big Data Econometrics:

It has been recognized for some time that when using large data it becomes “too easy” to reject the null hypothesis of no statistical significance, since confidence intervals are $O(N^{-1})$ (Granger, 1998). The problem with a standard t-test in large samples is that it is replaced by its asymptotic form and the critical values are drawn from the Normal distribution. As a result, for large sample sizes the critical value for testing at the 95% significance level does not increase with the sample size. One possibility for addressing this problem is to let the critical value be a function of the sample size.
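To see the mechanics, here is a minimal simulation sketch (my own illustration, not part of the quoted post; the tiny slope of 0.01 and the regression-through-the-origin setup are arbitrary assumptions) of how a fixed 1.96 cutoff is eventually cleared by a practically negligible effect once $N$ is large:

```python
import numpy as np

rng = np.random.default_rng(0)
true_beta = 0.01  # economically negligible, but not exactly zero

for n in [1_000, 100_000, 10_000_000]:
    x = rng.normal(size=n)
    y = true_beta * x + rng.normal(size=n)
    # OLS slope and standard error for a regression through the origin
    beta_hat = (x @ y) / (x @ x)
    resid = y - beta_hat * x
    se = np.sqrt((resid @ resid) / (n - 1) / (x @ x))
    t_stat = beta_hat / se
    # With the fixed asymptotic cutoff of 1.96, the null of "no effect"
    # is eventually rejected, however small the true slope is.
    print(f"N = {n:>10,}  t = {t_stat:6.2f}  reject at 1.96: {abs(t_stat) > 1.96}")
```

The point is only that the t-statistic grows roughly like $\sqrt{N}$ times the (fixed) effect size, so any nonzero effect eventually clears a cutoff that does not move with $N$.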

My colleague, Carlos Lamarche, at the University of Kentucky, pointed out this week that one can think about this as a testing problem for nested models. Cameron and Trivedi (2005) suggest using the Bayesian Information Criterion (BIC) for which the penalty increases with the sample size. Using the BIC for testing the significance of one variable is identical to using a two-sided t-test critical value of $\sqrt{\ln(N)}$.
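That equivalence can be recovered in one line (a standard argument; this is my reconstruction rather than something spelled out in the quoted post). For two nested models whose log-likelihoods are $\ell_0 \le \ell_1$ and which differ by a single parameter, $\mathrm{BIC} = -2\ell + k \ln N$, so

$$
\mathrm{BIC}_0 - \mathrm{BIC}_1 = 2(\ell_1 - \ell_0) - \ln N = \mathrm{LR} - \ln N ,
$$

and the BIC favors the larger model exactly when $\mathrm{LR} > \ln N$. Since the likelihood-ratio statistic for one restriction is asymptotically the square of the t-statistic, this is the same as rejecting when $|t| > \sqrt{\ln N}$.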

The plot shows how the critical value increases with the scale of the data and how this compares with the standard critical values for the t-test at different levels of significance. Using the BIC suggests using critical values greater than 2 for sample sizes larger than 1000. When using Big Data with over 1M observations, a critical value equivalent to a t-test at the 99% or even 99.9% seems advisable.
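As a rough numerical check of those figures (my own calculation; the particular sample sizes are arbitrary), the BIC-implied cutoff $\sqrt{\ln N}$ can be compared with the fixed two-sided normal critical values:

```python
import math
from scipy.stats import norm

# BIC-implied two-sided cutoff |t| > sqrt(ln N) for a range of sample sizes
for n in [1_000, 100_000, 1_000_000, 100_000_000]:
    print(f"N = {n:>11,}:  sqrt(ln N) = {math.sqrt(math.log(n)):.2f}")

# Fixed two-sided normal critical values for comparison
for conf in [0.95, 0.99, 0.999]:
    print(f"{conf:.1%} two-sided critical value = {norm.ppf(1 - (1 - conf) / 2):.2f}")
```

This reproduces the quoted ballpark: $\sqrt{\ln 1000} \approx 2.63$ already exceeds 1.96, and $\sqrt{\ln 10^6} \approx 3.72$ sits above even the 99.9% cutoff of roughly 3.29.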

Giles points out that he himself took up this issue two years ago, and in the comments on that earlier post it is noted that Andrew Gelman had written about the problem another two years before that.

*1: This one. The working paper (WP) can be read here.

*2: This one.