Doctors make mistakes, and so treatments can be improved

W. Bentley MacLeod of Columbia University has written an NBER paper titled "Viewpoint: The Human Capital Approach to Inference" (ungated version) (H/T Francis Diebold). Its abstract follows.

The purpose of this essay is to discuss the “human capital” approach to inference. Observed decisions by experts can be used to organize data on their decisions using simple machine learning techniques. The fact that the human capital of these experts is heterogeneous implies that errors in decision making are inevitable, which in turn allows us to identify the conditional average treatment effect for a wider class of situations than would be possible with randomized control trials. This point is illustrated with some data from medical decision making in the context of treating depression, heart disease, and adverse childbirth events.

Diebold sums up the paper as "...using economic theory in combination with machine learning to estimate conditional average treatment effects better than can be done with randomized control trials."

The following is quoted from the introduction of the ungated version.

The human capital approach to inference can be used in situations where we have a large number of persons to be treated by skilled decision makers. If we were to do a randomized control trial (RCT), then individuals would be randomly allocated to treatment and control, and then we would compare the outcomes. The problem is that in many cases, particularly in medical decision making, the optimal treatment varies with the characteristics of the patient. For example, some individuals face adverse reactions to drugs, and others have a natural immunity to disease, leading to heterogeneous responses to both treatment and placebo. The potential variation is substantial, which is why physicians spend years studying different possible conditions, and associating them with the appropriate treatment.
Let us now suppose that in addition to having a large set of patients, information on their characteristics, treatment, and outcomes, we also have them matched to physicians, with a large number of patients for each physician. We then proceed by using the fact that these physicians are experts, and hence on average their treatments are helpful. Assuming that there are only two treatment choices, A or B, we can use the decisions by the physicians to organize the data by the probability that a physician chooses A for patient i with characteristics x_i. This yields a propensity score η(x_i). This is a straightforward machine learning exercise – given features x_i, what is the likelihood that choice A will be made.
One approach to machine learning would be to stop at this point. Namely, use the data to build a model of how expert physicians make choices. There is a huge literature studying this problem. For example, we can view the recent work to produce self-driving cars as one in which the machine is learning to be as good as a human at such a task. However, we can do a bit more. Once we have the propensity score, then we can proceed, as in Rosenbaum and Rubin (1983), to estimate the effect of choice conditional upon the propensity score. We differ from the standard propensity score approach in two ways. The first, is that we are concerned with the conditional average treatment effect (CATE) - the effect of treatment conditional upon characteristics x_i. As individual characteristics change, the optimal choice may change. The hope is that if we make a choice conditional upon the score x_i, this can result in better outcomes on average for individuals with this score.
Second, the goal of the propensity score estimator is to provide better control for observable characteristics, and the endogenous selection of individuals based upon their characteristics into treatment. In our case, since we have information on who treats, we can use the fact that human capital is limited, and hence physicians not only make errors, [but also] vary in the frequency with which mistakes are made. This allows [us] to measure the effect of treatment conditional upon patient characteristics, or CATE, and physician identity. We can ask which physicians get better performance, and what are the characteristics of their decisions that achieve better outcomes.
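
The procedure described in the introduction can be sketched on simulated data. This is only a minimal illustration and not MacLeod's actual estimator: the data-generating process, the single patient characteristic x, the hypothetical physician "skill" parameter, and every function name are assumptions made for this example. The sketch fits a pooled logistic model of the choice of A on x to obtain a propensity score, sorts patients into propensity strata, and compares mean outcomes under A and B within each stratum as a rough stand-in for the CATE.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate(n_patients=4000, n_physicians=30, seed=0):
    """Simulated (physician, x, choice, outcome) rows; every number is hypothetical."""
    rng = random.Random(seed)
    # Assumed "skill": higher values mean the choice tracks x more sharply,
    # so less skilled physicians pick the wrong treatment more often.
    skill = [rng.uniform(1.0, 4.0) for _ in range(n_physicians)]
    rows = []
    for _ in range(n_patients):
        doc = rng.randrange(n_physicians)
        x = rng.random()                      # patient characteristic in [0, 1]
        a = 1 if rng.random() < sigmoid(skill[doc] * (x - 0.5)) else 0
        true_effect = 2.0 * x - 1.0           # A helps when x > 0.5, hurts otherwise
        y = a * true_effect + rng.gauss(0.0, 1.0)
        rows.append((doc, x, a, y))
    return rows


def fit_propensity(rows, steps=200, lr=2.0):
    """Logistic regression of the choice of A on x, fitted by gradient descent."""
    w, b = 0.0, 0.0
    n = len(rows)
    for _ in range(steps):
        gw = gb = 0.0
        for _, x, a, _ in rows:
            err = sigmoid(w * x + b) - a
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return lambda x: sigmoid(w * x + b)


def cate_by_stratum(rows, propensity, n_strata=5):
    """Difference in mean outcomes (A minus B) within equal-sized propensity strata."""
    scored = sorted(rows, key=lambda r: propensity(r[1]))
    size = len(scored) // n_strata
    estimates = []
    for k in range(n_strata):
        chunk = scored[k * size:(k + 1) * size]
        y_a = [y for _, _, a, y in chunk if a == 1]
        y_b = [y for _, _, a, y in chunk if a == 0]
        estimates.append(sum(y_a) / len(y_a) - sum(y_b) / len(y_b))
    return estimates


if __name__ == "__main__":
    data = simulate()
    eta = fit_propensity(data)
    for k, est in enumerate(cate_by_stratum(data, eta), start=1):
        print(f"stratum {k}: estimated effect of A vs B = {est:+.2f}")
```

In this simulation the variation in physician skill is what makes the comparison possible: because physicians sometimes choose differently for similar patients, every stratum contains both A and B patients. The estimated effect should come out negative in the lowest propensity stratum and positive in the highest, matching the assumed true effect 2x - 1.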

The following is quoted from the conclusion.

The human capital approach begins with the hypothesis that we can use the decisions of experts to organize individuals into treatment groups that have similar characteristics, and hence the treatment effect within these groups is more homogeneous. Here machine learning techniques can be very useful because of their potential to categorize large amounts of data efficiently.
Second, even though experts are skilled, they necessarily make mistakes. Without mistakes there can be no learning - a randomized control trial is an extreme case of learning by forced randomization over possible treatments.

On the medical decision-making data, the conclusion offers the following summary.

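From a medical standpoint, invasive procedures are regarded as always preferable for heart attacks, yet the better the hospital, the less its doctors apply them to patients for whom they are not appropriate, namely elderly patients. This is consistent with the hypothesis that doctors take factors other than simple medical need into account when making decisions.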

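The high rate of C-sections in the United States is often attributed to financial incentives. That was the case for low-risk births, but for high-risk births C-sections were if anything too few. Averaging over both groups, weighted by the number of women at risk, the average C-section rate in New Jersey was well below the medically optimal rate. This illustrates the danger of looking only at the average treatment effect: it misses significant heterogeneity in the optimal choice of treatment.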
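
The arithmetic behind this point can be illustrated with a toy calculation. The shares and rates below are hypothetical numbers chosen only for illustration and are not taken from the paper; the function names are likewise assumptions. The sketch shows how a population-weighted average can indicate "too few" procedures overall while one group is actually receiving too many.

```python
# Hypothetical numbers for illustration only; they are NOT taken from the paper.
groups = {
    "low_risk":  {"share": 0.7, "actual_rate": 0.12, "optimal_rate": 0.10},
    "high_risk": {"share": 0.3, "actual_rate": 0.50, "optimal_rate": 0.80},
}


def weighted_rate(groups, key):
    """Average of a rate over groups, weighted by each group's population share."""
    return sum(g["share"] * g[key] for g in groups.values())


def gaps_by_group(groups):
    """Actual minus optimal rate per group: positive means overuse, negative underuse."""
    return {name: g["actual_rate"] - g["optimal_rate"] for name, g in groups.items()}


if __name__ == "__main__":
    print("average actual rate: ", round(weighted_rate(groups, "actual_rate"), 3))
    print("average optimal rate:", round(weighted_rate(groups, "optimal_rate"), 3))
    for name, gap in gaps_by_group(groups).items():
        print(f"{name}: actual minus optimal = {gap:+.2f}")
```

With these made-up inputs the averaged actual rate falls below the averaged optimal rate, yet the gap has opposite signs in the two groups, which is the kind of heterogeneity an average alone would hide.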

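It is well known that randomized control trials are very ill-suited to antidepressants. For the treatment of mental illness there is not yet enough data to apply the human capital approach, but a large-scale data collection effort might move things forward.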