6 Hypothesis Testing
The hypothesis testing problem is as follows. Based on a sample of data, \(y\), generated from \(p\left( y \mid \theta\right)\) for \(\theta\in\Theta\), the goal is to determine if \(\theta\) lies in \(\Theta_{0}\) or in \(\Theta_{1}\), two disjoint subsets of \(\Theta\). In general, the hypothesis testing problem involves an action: accepting or rejecting a hypothesis. The problem is described in terms of a null, \(\mathcal{H}_{0}\), and alternative hypothesis, \(\mathcal{H}_{1}\), which are defined as \[ \mathcal{H}_{0}:\theta\in\Theta_{0}\;\;\mathrm{and}\;\;\mathcal{H}_{1}% :\theta\in\Theta_{1}\text{.}% \]
Different types of regions generate different types of hypothesis tests. If the null hypothesis specifies a single point, \(\Theta_{0}=\left\{ \theta_{0}\right\}\), it is known as a simple or “sharp” null hypothesis. If the region consists of multiple points, then the hypothesis is called composite, which occurs if the space is unconstrained or an interval of the real line. In the case of a single parameter, one-sided tests are of the form \(\mathcal{H}_{0}:\theta<\theta_{0}\) and \(\mathcal{H}_{1}:\theta>\theta_{0}\).
There are correct decisions and two types of possible errors. The correct decisions are accepting a null or alternative that is true. A Type I error incorrectly rejects a true null, and a Type II error incorrectly accepts a false null.
|  | \(\theta\in\Theta_{0}\) | \(\theta\in\Theta_{1}\) |
|---|---|---|
| Accept \(\mathcal{H}_{0}\) | Correct decision | Type II error |
| Accept \(\mathcal{H}_{1}\) | Type I error | Correct decision |
Formally, the probabilities of Type I (\(\alpha\)) and Type II (\(\beta\)) errors are defined as: \[ \alpha=P \left[ \text{reject }\mathcal{H}_{0} \mid \mathcal{H}_{0}\text{ is true }\right] \text{ and }\beta=P \left[ \text{accept }\mathcal{H}_{0} \mid \mathcal{H}_{1}\text{ is true }\right] \text{.}% \]
It is useful to think of the decision to accept or reject as a decision rule, \(d\left( y\right)\). In many cases, the decision rules form a critical region \(R\), such that \(d\left( y\right) =d_{1}\) if \(y\in R\). These regions often take the form of simple inequalities. Defining the decision to accept the null as \(d\left( y\right) =d_{0}\) and the decision to accept the alternative as \(d_{1}\), the error probabilities are \[\begin{align*} \alpha_{\theta}\left( d\right) & =P \left[ d\left( y\right) =d_{1} \mid \theta\right] \text{ if }\theta\in\Theta_{0}\text{ }(\mathcal{H}_{0}\text{ is true})\\ \beta_{\theta}\left( d\right) & =P \left[ d\left( y\right) =d_{0} \mid \theta\right] \text{ if }\theta\in\Theta_{1}\text{ }(\mathcal{H}_{1}\text{ is true})\text{,}% \end{align*}\] where both types of errors explicitly depend on the decision rule and the true parameter value. Notice that both of these quantities are determined by the population properties of the data. The size of the test is defined as \(\underset{\theta\in\Theta_{0}}{\sup}\alpha_{\theta}\left( d\right)\) and the power is defined as \(1-\beta_{\theta}\left( d\right)\). It is always possible to set either \(\alpha_{\theta}\left( d\right)\) or \(\beta_{\theta }\left( d\right)\) equal to zero by using a test that always rejects the alternative or always rejects the null, respectively.
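To make these definitions concrete, the following sketch (with illustrative values, not from the text) computes the size and Type II error probability of a threshold decision rule for a normal mean with known variance, where the rule rejects whenever the sample mean exceeds a cutoff \(c\).

```python
# Sketch (assumed setup): size and power of the decision rule
# d(y) = d_1 (reject H0) when the sample mean exceeds a cutoff c,
# for y_t ~ N(theta, sigma^2), H0: theta = theta0 vs H1: theta = theta1.
import numpy as np
from scipy.stats import norm

theta0, theta1 = 0.0, 1.0   # hypothesized means (illustrative values)
sigma, T = 2.0, 25          # known standard deviation and sample size
se = sigma / np.sqrt(T)     # standard error of the sample mean

def size_and_power(c):
    """alpha = P(reject | theta0), power = P(reject | theta1)."""
    alpha = norm.sf(c, loc=theta0, scale=se)   # P(ybar > c | H0)
    power = norm.sf(c, loc=theta1, scale=se)   # P(ybar > c | H1)
    return alpha, power

for c in [0.5, 0.66, 0.8]:
    a, p = size_and_power(c)
    print(f"c={c:4.2f}: size={a:.4f}, power={p:.4f}, beta={1 - p:.4f}")
```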
The total probability of making an error is \(\alpha_{\theta}\left(d\right) +\beta_{\theta}\left(d\right)\), and ideally one would seek to minimize the total error probability, absent additional information. In thinking about these tradeoffs, it is important to note that the easiest way to reduce the error probability is to gather more data, as the additional evidence should lead to more accurate decisions. In some cases, it is easy to characterize optimal tests, those that minimize the sum of the errors. Simple hypothesis tests of the form \(\mathcal{H}_{0}:\theta=\theta_{0}\) versus \(\mathcal{H}_{1}:\theta=\theta_{1}\) are one such case admitting optimal tests. Define \(d^{\ast}\) as a test accepting \(\mathcal{H}_{0}\) if \(a_{0}f\left( y \mid \theta_{0}\right) >a_{1}f\left( y \mid \theta_{1}\right)\) and accepting \(\mathcal{H}_{1}\) if \(a_{0}f\left( y \mid \theta_{0}\right) <a_{1}f\left( y \mid \theta _{1}\right)\), for some weights \(a_{0}\) and \(a_{1}\); either \(\mathcal{H}_{0}\) or \(\mathcal{H}_{1}\) can be accepted if \(a_{0}f\left(y \mid \theta_{0}\right) =a_{1}f\left( y \mid \theta_{1}\right)\). Then, for any other test \(d\), it is not hard to show that \[ a_{0}\alpha\left( d^{\ast}\right) +a_{1}\beta\left( d^{\ast}\right) \leq a_{0}\alpha\left( d\right) +a_{1}\beta\left( d\right) , \]
where \(\alpha\left( d\right) =\alpha_{\theta_{0}}\left( d\right)\) and \(\beta\left( d\right) =\beta_{\theta_{1}}\left( d\right)\). This result highlights the optimality of tests defining rejection regions in terms of the likelihood ratio statistic, \(f\left( y \mid \theta_{0}\right)/f\left( y \mid \theta_{1}\right)\). The results are in fact stronger: in terms of decision-theoretic properties, tests that define rejection regions based on likelihood ratios are not only admissible decisions, but form a minimal complete class, the strongest property possible.
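The optimality of likelihood-ratio regions can be checked numerically. The sketch below, again with assumed illustrative values, compares the weighted error \(a_{0}\alpha+a_{1}\beta\) of the likelihood-ratio cutoff with other cutoffs for the simple normal test with \(a_{0}=a_{1}=1\); in this case the likelihood-ratio rule rejects when \(\overline{y}\) exceeds the midpoint of the two hypothesized means.

```python
# Sketch (assumed values): for H0: theta = theta0 vs H1: theta = theta1 with
# y_t ~ N(theta, sigma^2), every test "reject when ybar > c" is a
# likelihood-ratio test for some weights. With a0 = a1 = 1 the optimal
# cutoff is the midpoint (theta0 + theta1)/2; other cutoffs do worse.
import numpy as np
from scipy.stats import norm

theta0, theta1, sigma, T = 0.0, 1.0, 2.0, 25
se = sigma / np.sqrt(T)

def weighted_error(c, a0=1.0, a1=1.0):
    alpha = norm.sf(c, loc=theta0, scale=se)   # P(reject | H0)
    beta = norm.cdf(c, loc=theta1, scale=se)   # P(accept | H1)
    return a0 * alpha + a1 * beta

c_star = 0.5 * (theta0 + theta1)               # likelihood-ratio cutoff
for c in [0.3, c_star, 0.7]:
    print(f"c={c:4.2f}: a0*alpha + a1*beta = {weighted_error(c):.4f}")
# The midpoint c_star attains the smallest weighted error.
```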
One of the main problems in hypothesis testing is that there is often a tradeoff between the two goals of reducing Type I and Type II errors: decreasing \(\alpha\) leads to an increase in \(\beta\), and vice versa. Because of this, it is common to fix \(\alpha_{\theta}\left( d\right)\), or \(\sup\alpha_{\theta}\left( d\right)\), and then find a test to minimize \(\beta_{\theta}\left( d\right)\). This leads to “most powerful” tests. In thinking about these tests, there is an important result from decision theory: test procedures that use the same significance level \(\alpha\) in problems with different sample sizes are inadmissible. This is nevertheless commonly done in practice, where significance is indicated by a fixed level, say 5%. The implications of this will be clearer in the examples below.
6.1 The Bayesian Approach
Formally, the Bayesian approach to hypothesis testing is a special case of the model comparison results discussed earlier. The Bayesian approach simply computes the posterior probability of each hypothesis. By Bayes’ rule, for \(i=0,1\), \[ P \left( \mathcal{H}_{i} \mid y\right) =\frac{p\left( y \mid \mathcal{H}_{i}\right) P \left( \mathcal{H}_{i}\right) }{p\left( y\right) }\text{,}% \] where \(P \left( \mathcal{H}_{i}\right)\) is the prior probability of \(\mathcal{H}_{i}\), \(p\left( y \mid \mathcal{H}_{i}\right) =\int p\left( y \mid \theta,\mathcal{H}_{i}\right) p\left( \theta \mid \mathcal{H}_{i}\right) d\theta\) is the marginal likelihood under \(\mathcal{H}_{i}\), \(p\left( \theta \mid \mathcal{H}_{i}\right)\) is the parameter prior under \(\mathcal{H}_{i}\), and \[ p\left( y\right) =p\left( y \mid \mathcal{H}_{0}\right) P \left( \mathcal{H}_{0}\right) +p\left( y \mid \mathcal{H}_{1}\right) P \left( \mathcal{H}_{1}\right) . \]
If the hypotheses are mutually exclusive and exhaustive, \(P \left( \mathcal{H}_{0}\right) =1-P \left( \mathcal{H}_{1}\right)\).
The posterior odds of the null to the alternative is \[ \text{Odds}_{0,1}=\frac{P \left( \mathcal{H}_{0} \mid y\right) }{P % \left( \mathcal{H}_{1} \mid y\right) }=\frac{p\left( y \mid \mathcal{H}_{0}\right) }{p\left( y \mid \mathcal{H}_{1}\right) }\frac{P \left( \mathcal{H}_{0}\right) }{P \left( \mathcal{H}_{1}\right) }\text{.}% \]
The posterior odds update the prior odds, \(P \left( \mathcal{H}_{0}\right) /P \left( \mathcal{H}_{1}\right)\), using the Bayes factor, \(\mathcal{BF}_{0,1}=p\left(y \mid \mathcal{H}_{0}\right) /p\left( y \mid \mathcal{H}_{1}\right)\). With exhaustive competing hypotheses, \(P \left( \mathcal{H}_{0} \mid y\right)\) simplifies to \[ P \left( \mathcal{H}_{0} \mid y\right) =\left( 1+\left( \mathcal{BF}_{0,1}\right) ^{-1}\frac{ 1-P \left( \mathcal{H}_{0}\right) }{P \left( \mathcal{H}_{0}\right) }\right) ^{-1}\text{,}% \] and with equal prior probabilities, \(P\left( \mathcal{H}_{0} \mid y\right) =\left( 1+\left( \mathcal{BF}_{0,1}\right) ^{-1}\right) ^{-1}\). Both Bayes factors and posterior probabilities can be used for comparing hypotheses. Jeffreys (1961) advocated using Bayes factors and provided a scale for measuring the strength of evidence, given earlier. A Bayes factor \(\mathcal{BF}_{0,1}>1\) merely indicates that the null hypothesis is more likely than the alternative, since \(p\left( y \mid \mathcal{H}_{0}\right) >p\left( y \mid \mathcal{H}_{1}\right)\). Mechanically, the Bayesian approach compares the density ordinates \(p\left( y \mid \mathcal{H}_{0}\right)\) and \(p\left( y \mid \mathcal{H}_{1}\right)\), which involves plugging the observed data into the functional form of the marginal likelihood.
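As a minimal sketch of these formulas, the function below maps a Bayes factor and a prior null probability into the posterior null probability, assuming exhaustive hypotheses; the numerical inputs are illustrative.

```python
# Minimal sketch of the formulas above: posterior probability of H0 from the
# Bayes factor BF_{0,1} and the prior probability P(H0), assuming the two
# hypotheses are exhaustive.
def posterior_prob_null(bf_01, prior_h0=0.5):
    prior_odds = prior_h0 / (1.0 - prior_h0)
    posterior_odds = bf_01 * prior_odds      # Odds_{0,1} = BF_{0,1} * prior odds
    return posterior_odds / (1.0 + posterior_odds)

print(posterior_prob_null(bf_01=3.0))                  # equal priors -> 0.75
print(posterior_prob_null(bf_01=3.0, prior_h0=0.2))    # skeptical prior on H0 -> 0.4286
```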
For a point null, \(\mathcal{H}_{0}:\theta=\theta_{0}\), the parameter prior is \(p\left( \theta \mid \mathcal{H}_{0}\right) =\delta_{\theta_{0}}\left( \theta\right)\) (a Dirac mass at \(\theta_{0}\)), which implies that \(p\left( y \mid \mathcal{H}_{0}\right) =\int p\left( y \mid \theta_{0}\right) p\left( \theta \mid \mathcal{H}_{0}\right) d\theta=p\left( y \mid \theta_{0}\right)\). With a general alternative, \(\mathcal{H}_{1}:\theta\neq\theta_{0}\), the probability of the null is \[ P \left( \theta=\theta_{0} \mid y\right) =\frac{p\left( y \mid \theta _{0}\right) P \left( \mathcal{H}_{0}\right) }{p\left( y \mid \theta _{0}\right) P \left( \mathcal{H}_{0}\right) +\left( 1-p\left( \mathcal{H}_{0}\right) \right) \int_{\Theta}p\left( y \mid \theta,\mathcal{H}_{1}\right) p\left( \theta \mid \mathcal{H}_{1}\right) d\theta}, \] where \(p\left( \theta \mid \mathcal{H}_{1}\right)\) is the parameter prior under the alternative. This formula will be used below.
Bayes factors and posterior null probabilities measure the relative weight of evidence for the hypotheses. Traditional hypothesis testing involves an additional decision or action: to accept or reject the null hypothesis. For a Bayesian, this typically requires a statement of the utility/loss that codifies the benefits/costs of making a correct or incorrect decision. The simplest situation occurs if one assumes zero loss for making a correct decision. The losses incurred when accepting the null when the alternative is true and when accepting the alternative when the null is true are \(L\left( d_{0} \mid \mathcal{H}_{1}\right)\) and \(L\left( d_{1} \mid \mathcal{H}_{0}\right)\), respectively.
The Bayesian will accept or reject based on the posterior expected loss. If the expected loss of accepting the null is less than that of accepting the alternative, the rational decision maker will accept the null. The posterior loss of accepting the null is \[ \mathbb{E}\left[ \text{Loss} \mid d_{0},y\right] =L\left( d_{0} \mid \mathcal{H}_{0}\right) P \left( \mathcal{H}_{0} \mid y\right) +L\left( d_{0} \mid \mathcal{H}_{1}\right) P \left( \mathcal{H}_{1} \mid y\right) =L\left( d_{0} \mid \mathcal{H}_{1}\right) P \left( \mathcal{H}_{1} \mid y\right) , \] since the loss of making a correct decision, \(L\left( d_{0} \mid \mathcal{H}_{0}\right)\), is zero. Similarly, \[ \mathbb{E}\left[ \text{Loss} \mid d_{1},y\right] =L\left( d_{1} \mid \mathcal{H}_{0}\right) P \left( \mathcal{H}_{0} \mid y\right) +L\left( d_{1} \mid \mathcal{H}_{1}\right) P \left( \mathcal{H}_{1} \mid y\right) =L\left( d_{1} \mid \mathcal{H}_{0}\right) P \left( \mathcal{H}_{0} \mid y\right) . \] Thus, the null is accepted if \[ \mathbb{E}\left[ \text{Loss} \mid d_{0},y\right] <\mathbb{E}\left[ \text{Loss} \mid d_{1},y\right] \Longleftrightarrow L\left( d_{0} \mid \mathcal{H}_{1}\right) P \left( \mathcal{H}_{1} \mid y\right) <L\left( d_{1} \mid \mathcal{H}_{0}\right) P \left( \mathcal{H}_{0} \mid y\right) , \] which further simplifies to \[ \frac{L\left( d_{0} \mid \mathcal{H}_{1}\right) }{L\left( d_{1} \mid \mathcal{H}_{0}\right) }<\frac{P \left( \mathcal{H}_{0} \mid y\right) }% {P \left( \mathcal{H}_{1} \mid y\right) }\text{.}% \] In the case of equal losses, this simplifies to accepting the null if \(P \left( \mathcal{H}_{1} \mid y\right) <P \left( \mathcal{H}_{0} \mid y\right)\). One advantage of Bayes procedures is that the resulting estimators and decisions are always admissible.
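The accept/reject rule can be expressed as a short function. The sketch below assumes, as above, zero loss for correct decisions; the posterior probabilities and loss values are illustrative.

```python
# Sketch of the posterior expected-loss rule derived above: accept H0 when
# L(d0|H1) * P(H1|y) < L(d1|H0) * P(H0|y).  Losses for correct decisions
# are zero, as in the text; the numerical values here are illustrative.
def bayes_decision(post_h0, loss_accept_when_h1=1.0, loss_reject_when_h0=1.0):
    post_h1 = 1.0 - post_h0
    exp_loss_accept = loss_accept_when_h1 * post_h1   # E[Loss | d0, y]
    exp_loss_reject = loss_reject_when_h0 * post_h0   # E[Loss | d1, y]
    return "accept H0" if exp_loss_accept < exp_loss_reject else "reject H0"

print(bayes_decision(post_h0=0.30))                            # -> reject H0
print(bayes_decision(post_h0=0.30, loss_reject_when_h0=5.0))   # false rejection costly -> accept H0
```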
Example 6.1 (Enigma machine: Code-breaking) Consider an alphabet of \(A=26\) letters. Let \(x\) and \(y\) be two coded sequences of length \(T\). We will count how many letters match (\(M\)) and how many do not match (\(N\)) in these sequences. Even though the sequences encode different sentences, if the same code is being used, then whenever the underlying letters are the same the coded sequences will have a match. To compute the Bayes factor we need the joint probabilities \[ P( x,y\mid \mathcal{H}_0 ) \; \; \mathrm{ and} \; \; P( x,y\mid \mathcal{H}_1 ), \] where under \(\mathcal{H}_0\) the sequences use different codes, in which case the joint probability is \(( 1 / A )^{2T}\). For \(\mathcal{H}_1\) we first need the probability of two letters matching. If \(p_i\) denotes the frequency of use of the \(i\)th English letter, then the match probability is \(m = \sum_{i} p_i^2\), which is about \(2/26\). Hence, for a particular pair of letters, \[ P( x_i , y_i \mid \mathcal{H}_1 ) = \frac{m}{A} \; \mathrm{ if} \; x_i =y_i \; \; \mathrm{ and} \; \; P( x_i , y_i \mid \mathcal{H}_1 ) = \frac{1-m}{A(A-1)} \; \mathrm{ if} \; x_i \neq y_i. \] Hence the log Bayes factor is \[\begin{align*} \ln \frac{P( x,y\mid \mathcal{H}_1 )}{P( x,y\mid \mathcal{H}_0 )} & = M \ln \frac{ m/A}{1/A^2} +N \ln \frac{ ( 1-m ) / A(A-1) }{ 1/ A^2} \\ & = M \ln mA + N \ln \frac{ ( 1-m )A }{A-1 }. \end{align*}\] Each match increases the weight of evidence by a large amount, about \(3.1\) decibans (ten times the base-10 logarithm of the Bayes factor), while each non-match decreases it by about \(0.18\) decibans.
For example, \(M=4\) matches and \(N=47\) non-matches out of \(T=51\) letters gives evidence of about \(2.5\) to \(1\) in favor of \(\mathcal{H}_1\).
How long a sequence does one need to look at? Calculate the expected log odds per letter: Turing and Good figured that sequences of about length \(400\) were needed. One can also look at doubles and triples of letters.
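The scoring rule in the example is easy to evaluate. The sketch below assumes \(A=26\) and the approximate match probability \(m=2/26\) from the text, and reports scores in decibans (ten times the base-10 log of the Bayes factor), the unit used by Turing and Good.

```python
# Sketch of the match/non-match scoring in Example 6.1, assuming an alphabet
# of A = 26 letters and the approximate match probability m = 2/26 from the
# text.  Scores are in decibans: 10 * log10 of the Bayes factor.
import math

A = 26
m = 2 / 26                                    # approx. sum of squared letter frequencies

match_db = 10 * math.log10(m * A)                       # evidence per matching letter
nonmatch_db = 10 * math.log10((1 - m) * A / (A - 1))    # evidence per non-match

def log_bayes_factor_db(M, N):
    """Total evidence (decibans) for 'same code' from M matches and N non-matches."""
    return M * match_db + N * nonmatch_db

print(f"per match: {match_db:+.2f} db, per non-match: {nonmatch_db:+.2f} db")
db = log_bayes_factor_db(M=4, N=47)
print(f"M=4, N=47: {db:.2f} db, Bayes factor ~ {10**(db / 10):.1f} to 1")
```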
6.2 Alternative Approaches
The two main alternatives to the Bayesian approach are significance testing using \(p-\)values, developed by Ronald Fisher, and the Neyman-Pearson approach.
6.2.1 Significance testing using p-values
Fisher’s approach posits a test statistic, \(T\left( y\right)\), based on the observed data. In Fisher’s view, if the value of the statistic was highly unlikely to have occurred under \(\mathcal{H}_{0}\), then \(\mathcal{H}_{0}\) should be rejected. Formally, the \(p-\)value is defined as \[ p=P \left[ T\left( Y\right) >T\left( y\right) \mid \mathcal{H}_{0}\right] , \] where \(y\) is the observed sample and \(Y=\left( Y_{1}, \ldots ,Y_{T}\right)\) is a random sample generated from the model \(p\left( Y \mid \mathcal{H}_{0}\right)\); that is, the \(p-\)value is computed from the null distribution of the test statistic in repeated samples. Thus, the \(p-\)value is the probability that a data set would generate a more extreme statistic under the null hypothesis, and not the probability of the null, conditional on the data.
The testing procedure is simple. Fisher (1946, p. 80) argues that: “If P (the \(p-\)value) is between \(0.1\) and \(0.9\), there is certainly no reason to suspect the hypothesis tested. If it is below \(0.02\), it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not be astray if we draw a line at 0.05 and consider that higher values of \(\chi^{2}\) indicate a real discrepancy.” Defining \(\alpha\) as the significance level, the test rejects \(\mathcal{H}_{0}\) if \(p<\alpha\). Fisher advocated a fixed significance level of \(5\%\), based largely on the fact that \(5\%\) is roughly the probability that a mean-zero normal random variable falls more than two standard deviations from \(0\), indicating a statistically significant departure. In practice, testing with \(p-\)values involves identifying a critical value, \(t_{\alpha}\), and rejecting the null if the observed statistic \(t\left( y\right)\) is more extreme than \(t_{\alpha}\). For example, for a significance test of the sample mean, \(t\left( y\right) =\left( \overline{y}-\theta_{0}\right) /se\left( \overline{y}\right)\), where \(se\left( \overline{y}\right)\) is the standard error of \(\overline{y}\); the \(5\%\) critical value is 1.96; and Fisher would reject the null if \(\left\vert t\left( y\right) \right\vert >t_{\alpha}\).
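A minimal sketch of the mechanics, assuming a one-sided test of a normal mean with known variance and illustrative values:

```python
# Sketch: Fisher-style significance test for a normal mean, assuming the
# null distribution of t(y) = (ybar - theta0)/se(ybar) is standard normal.
# All numerical values are illustrative.
import numpy as np
from scipy.stats import norm

ybar, theta0, sigma, T = 0.21, 0.0, 1.0, 100
se = sigma / np.sqrt(T)
t_obs = (ybar - theta0) / se                 # observed test statistic

p_value = norm.sf(t_obs)                     # P(T(Y) > T(y) | H0), one-sided
alpha = 0.05
t_crit = norm.ppf(1 - alpha)                 # one-sided critical value (1.645)

print(f"t = {t_obs:.2f}, p-value = {p_value:.4f}, critical value = {t_crit:.3f}")
print("reject H0" if p_value < alpha else "do not reject H0")
```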
Fisher interpreted the \(p-\)value as a weight or measure of evidence against the null hypothesis. The alternative hypothesis is notable by its absence in Fisher’s approach. Fisher largely rejected the consideration of alternatives, believing that researchers should weigh the evidence or draw conclusions about the observed data rather than making decisions such as accepting or rejecting hypotheses based on it.
There are a number of issues with Fisher’s approach. The first and most obvious criticism is that it is possible to reject the null when the alternative hypothesis is even less likely. This is an inherent problem with using population tail probabilities, which are essentially probabilities of rare events. Just because a rare event has occurred does not mean the null is incorrect, unless there is a more likely alternative. This situation often arises in court cases, where a rare event like a murder has occurred. Decisions based on \(p-\)values generate a problem called the prosecutor’s fallacy, which is discussed below. Second, Fisher’s approach relies on population properties (the distribution of the statistic under the null) that would only be revealed in repeated samples or asymptotically. Thus, the testing procedure relies on data that were never observed, a violation of what is known as the likelihood principle. As noted by Jeffreys (1939, pp. 315-316): “What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable data that have not occurred. This seems a remarkable procedure.”
Third, Fisher is agnostic regarding the source of the test statistic, providing no discussion of how the researcher decides to focus on one test statistic over another. In some simple models, the distribution of properly scaled sufficient statistics provides natural test statistics (e.g., the \(t-\)test). In more complicated models, Fisher is silent on the source. In many cases, there are numerous candidate test statistics (e.g., testing for normality), and the choice among them is clearly subjective. For example, in GMM tests, the choice of test moments is clearly subjective. Finally, from a practical perspective, \(p-\)values have a serious deficiency: tests using \(p\)-values often appear to give the wrong answer, in the sense that they provide a highly misleading impression of the weight of evidence in many samples. A number of examples of this will be given below; in all cases, Fisher’s approach tends to over-reject the null hypothesis.
6.2.2 Neyman-Pearson
The motivation for the Neyman-Pearson (NP) approach came from W.S. Gosset, the famous ‘Student’ who invented the \(t-\)test. In analyzing a hypothesis, Student argued that a hypothesis should not be rejected unless an alternative is available that provides a more plausible explanation of the data. Mathematically, this suggests analyzing the likelihood ratio, \[ \mathcal{LR}_{0,1}=\frac{p\left( y \mid \mathcal{H}_{0}\right) }{p\left( y \mid \mathcal{H}_{1}\right) }\text{,}% \] and rejecting the null in favor of the alternative when the likelihood ratio is small enough, \(\mathcal{LR}_{0,1}<k\). This procedure conforms in spirit with the Bayesian approach.
The main problem was finding a value of the cutoff parameter \(k\). From the discussion above, by varying \(k\), one varies the probabilities of Type I and Type II errors in the testing procedure. Originally, NP argued this tradeoff should be subjectively specified: “how the balance (between the Type I and II errors) should be struck must be left to the investigator” (Neyman and Pearson, 1933a, p. 296) and “we attempt to adjust the balance between the risks \(P_{1}\) and \(P_{2}\) to meet the type of problem before us” (1933b, p. 497). This approach, however, was not “objective,” and they subsequently advocated fixing \(\alpha\), the probability of a Type I error, in order to determine \(k\). This led to their famous lemma:
Lemma 6.1 (Neyman-Pearson Lemma) Consider the simple hypothesis test of \(\mathcal{H}_{0}:\theta=\theta_{0}\) versus \(\mathcal{H}_{1}:\theta =\theta_{1}\) and suppose that the null is rejected if \(\mathcal{LR}_{0,1}<k_{\alpha}\), where \(k_{\alpha}\) is chosen to fix the probability of a Type I error at \(\alpha\): \[ \alpha=P \left[ y:\mathcal{LR}_{0,1}<k_{\alpha} \mid \mathcal{H}_{0}\right] \text{.}% \] Then this test is the most powerful test of size \(\alpha\), in the sense that any other test with greater power must have a larger size.
In the case of composite hypothesis tests, parameter estimation is required under the alternative, which can be done via maximum likelihood, leading to the likelihood ratio \[ \mathcal{LR}_{0,1}=\frac{p\left( y \mid \mathcal{H}_{0}\right) }{\underset {\theta\in\Theta}{\sup}p\left( y \mid \theta\right) }=\frac{p\left( y \mid \mathcal{H}_{0}\right) }{p\left( y \mid \widehat{\theta}\right) }\text{,}% \] where \(\widehat{\theta}\) is the MLE. Because of this, \(0\leq\mathcal{LR}_{0,1}\leq 1\) for composite hypotheses. In multi-parameter cases, finding the distribution of the likelihood ratio is more difficult, requiring asymptotic approximations to calibrate \(k_{\alpha}\).
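As a concrete (assumed) special case, for a normal mean with known variance the MLE under the alternative is \(\overline{y}\) and \(-2\ln\mathcal{LR}_{0,1}=T\left( \overline{y}-\theta_{0}\right) ^{2}/\sigma^{2}\), which has a \(\chi_{1}^{2}\) null distribution. A minimal sketch:

```python
# Sketch (assumed normal model, known variance): composite likelihood ratio
# test of H0: theta = theta0 vs H1: theta != theta0.  The MLE under the
# alternative is ybar, and -2*ln(LR) = T*(ybar - theta0)^2 / sigma^2,
# which is chi-squared with 1 degree of freedom under the null.
import numpy as np
from scipy.stats import chi2

theta0, sigma = 0.0, 1.0
rng = np.random.default_rng(0)
y = rng.normal(loc=0.2, scale=sigma, size=50)    # illustrative data

T, ybar = len(y), y.mean()
lr_stat = T * (ybar - theta0) ** 2 / sigma**2    # -2 * log likelihood ratio
k_alpha = chi2.ppf(0.95, df=1)                   # 5% critical value, about 3.84

print(f"-2 ln LR = {lr_stat:.3f}, critical value = {k_alpha:.3f}")
print("reject H0" if lr_stat > k_alpha else "do not reject H0")
```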
At first glance, the NP approach appears similar to the Bayesian approach, as it takes into account the likelihood ratio. However, like the \(p-\)value, the NP approach has a critical flaw. Neyman and Pearson fix the Type I error and then minimize the Type II error. In many practical cases, \(\alpha\) is set at \(5\%\) while the resulting \(\beta\) is very small, close to 0. Why is this a reasonable procedure? Given the previous discussion, this amounts to a very strong prior over the relative benefits/costs of the different types of errors. While these assumptions may be warranted in certain settings, it is difficult to understand a priori why this procedure would generically make sense. The next section highlights how the \(p-\)value and NP approaches can generate counterintuitive and even absurd results in standard settings.
6.3 Examples and Paradoxes
This section provides a number of paradoxes arising when using different hypothesis testing procedures. The common strands of the examples will be discussed at the end of the section.
Example 6.2 (Neyman-Pearson tests) Consider testing \(\mathcal{H}_{0}:\mu=\mu_{0}\) versus \(\mathcal{H}_{1}:\mu=\mu_{1}\), where \(y_{t}\sim\mathcal{N}\left( \mu,\sigma^{2}\right)\) and \(\mu_{1}>\mu_{0}\). For this simple test, the likelihood ratio is given by \[ \mathcal{LR}_{0,1}=\frac{\exp\left( -\frac{1}{2\sigma^{2}}\sum_{t=1}^{T}\left( y_{t}-\mu_{0}\right) ^{2}\right) }{\exp\left( -\frac{1}{2\sigma^{2}}\sum_{t=1}^{T}\left( y_{t}-\mu_{1}\right) ^{2}\right) }=\exp\left( -\frac{T}{\sigma^{2}}\left( \mu_{1}-\mu_{0}\right) \left( \overline{y}-\frac{1}{2}\left( \mu_{0}+\mu_{1}\right) \right) \right) \text{.}% \] Since \(\mathcal{BF}_{0,1}=\mathcal{LR}_{0,1}\), assuming equal prior probabilities and symmetric losses, the Bayesian accepts \(\mathcal{H}_{0}\) if \(\mathcal{BF}_{0,1}>1\). Thus, the Bayes procedure rejects \(\mathcal{H}_{0}\) if \(\overline{y}>\frac{1}{2}\left( \mu_{0}+\mu_{1}\right)\) for any \(T\) and \(\sigma^{2}\), with \(\mu_{0}\), \(\mu_{1}\), \(T\), and \(\sigma^{2}\) determining the strength of the rejection. If \(\mathcal{BF}_{0,1}=1\), there is equal evidence for the two hypotheses.
The NP procedure proceeds by first setting \(\alpha=0.05\) and rejecting when \(\mathcal{LR}_{0,1}\) is small, which is equivalent to rejecting when \(\overline{y}\) is large, generating an ‘optimal’ rejection region of the form \(\overline{y}>c\). The cutoff value \(c\) is calibrated via the size of the test, \[ P \left[ \text{reject }\mathcal{H}_{0} \mid \mathcal{H}_{0}\right] =P \left[ \overline{y}>c \mid \mu_{0}\right] =P \left[ \frac{\overline{y}-\mu_{0}}{\sigma/\sqrt{T}}>\frac{c-\mu_{0}}{\sigma/\sqrt{T}} \mid \mathcal{H}_{0}\right] . \] The size equals \(\alpha\) if \(\sqrt{T}\left( c-\mu_{0}\right) /\sigma =z_{\alpha}\). Thus, the NP test rejects if \(\overline{y}>\mu _{0}+\sigma z_{\alpha}/\sqrt{T}\). Notice that the test rejects regardless of the value of \(\mu_{1}\), which is rather odd, since \(\mu_{1}\) does not enter into the size of the test, only the power. The probability of a Type II error is \[ \beta=P \left[ \text{accept }\mathcal{H}_{0} \mid \mathcal{H}_{1}\right] =P \left[ \overline{y}\leq\mu_{0}+\frac{\sigma}{\sqrt{T}}z_{\alpha } \mid \mathcal{H}_{1}\right] =\int_{-\infty}^{\mu_{0}+\frac{\sigma}{\sqrt{T}% }z_{\alpha}}p\left( \overline{y} \mid \mu_{1}\right) d\overline{y}\text{,}% \] where \(p\left( \overline{y} \mid \mu_{1}\right) \sim\mathcal{N}\left( \mu _{1},\sigma^{2}/T\right)\).
These tests can generate strikingly different conclusions. Consider a test of \(\mathcal{H}_{0}:\mu=0\) versus \(\mathcal{H}_{1}:\mu=5\), based on \(T=100\) observations drawn from \(y_{t}\sim\mathcal{N}\left( \mu,10^{2}\right)\) with \(\overline{y}=2\). For NP, since \(\sigma/\sqrt{T}=1\), \(\overline{y}\) is two standard errors away from \(0\); thus \(\mathcal{H}_{0}\) is rejected at the 5% level (the same conclusion holds for \(p-\)values). Since \(p(\overline {y}=2 \mid \mathcal{H}_{0})=0.054\) and \(p(\overline{y}=2 \mid \mathcal{H}_{1})=0.0044\), the Bayes factor is \(\mathcal{BF}_{0,1}=12.18\) and \(P \left( \mathcal{H}_{0} \mid y\right) =92.41\%\). Thus, the Bayesian is quite sure the null is true, while Neyman-Pearson rejects the null.
The paradox can be seen in two different ways. First, although \(\overline{y}\) is actually closer to \(\mu_{0}\) than to \(\mu_{1}\), the NP test rejects \(\mathcal{H}_{0}\). This is counterintuitive and makes little sense. The problem is one of calibration. The classical approach develops a test such that 5% of the time a correct null would be rejected. The power of the test is easy to compute and implies that \(\beta=0.0012\). Thus, this testing procedure will virtually never accept the null if the alternative is correct. For the Bayesian procedure, assuming prior odds of \(1\) and \(L_{0}=L_{1}\), \(\alpha=\beta=0.0062\). Notice that the overall probability of making an error is 1.24% for the Bayesian procedure compared to 5.12% for the classical procedure. It should seem clear that the Bayesian approach is more reasonable, absent a specific motivation for inflating \(\alpha\). Second, suppose the null and alternative were reversed, testing \(\mathcal{H}_{0}:\mu=\mu_{1}\) versus \(\mathcal{H}_{1}:\mu=\mu_{0}\). In the previous example, the Bayes approach gives the same answer, while NP once again rejects the null hypothesis! Again, this result is counterintuitive and nonsensical, but is common when arbitrarily fixing \(\alpha\), which essentially hardwires the test to over-reject the null.
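The numbers in Example 6.2 can be reproduced by evaluating the normal densities and tail probabilities described above; the sketch below simply codes those calculations.

```python
# Sketch reproducing the calculations in Example 6.2: H0: mu = 0 vs
# H1: mu = 5, sigma = 10, T = 100, observed ybar = 2, so se = 1.
import numpy as np
from scipy.stats import norm

mu0, mu1, sigma, T, ybar = 0.0, 5.0, 10.0, 100, 2.0
se = sigma / np.sqrt(T)

# Bayesian comparison: Bayes factor and posterior probability (equal priors).
bf_01 = norm.pdf(ybar, mu0, se) / norm.pdf(ybar, mu1, se)
post_h0 = bf_01 / (1 + bf_01)
print(f"BF_01 = {bf_01:.2f}, P(H0|y) = {post_h0:.4f}")        # ~12.18, ~0.924

# Neyman-Pearson test as in the text: cutoff at 1.96 standard errors.
alpha_np = 0.05                                # nominal level used in the text
c = mu0 + 1.96 * se
beta = norm.cdf(c, mu1, se)                    # Type II error probability
print(f"cutoff c = {c:.2f}, reject H0: {ybar > c}, beta = {beta:.4f}")

# Bayes rule rejects when ybar exceeds the midpoint (mu0 + mu1)/2.
c_bayes = 0.5 * (mu0 + mu1)
alpha_b = norm.sf(c_bayes, mu0, se)
beta_b = norm.cdf(c_bayes, mu1, se)
print(f"Bayes cutoff {c_bayes}: alpha = beta = {alpha_b:.4f}")
print(f"total error: NP = {alpha_np + beta:.4f}, Bayes = {alpha_b + beta_b:.4f}")
```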
Example 6.3 (Lindley’s paradox) Consider the case of testing whether or not a coin is fair, based on observed coin flips, \[ \mathcal{H}_{0}:\theta=\frac{1}{2}\text{ versus }\mathcal{H}_{1}:\theta \neq\frac{1}{2}\text{,}% \] based on \(T\) observations from \(y_{t}\sim Ber\left( \theta\right)\). As an example, Table 6.1 provides 4 datasets of differing lengths. Prior to considering the formal hypothesis tests, form your own opinion on the strength of evidence regarding the hypotheses in each data set. It is common for individuals, when confronted with this data, to conclude that the fourth sample provides the strongest evidence for the null and the first sample the weakest.
|  | #1 | #2 | #3 | #4 |
|---|---|---|---|---|
| # Flips | 50 | 100 | 400 | 10,000 |
| # Heads | 32 | 60 | 220 | 5098 |
| Percentage of heads | 64 | 60 | 55 | 50.98 |
Fisher’s solution to the problem posits an unbiased estimator, the sample mean, and computes the \(t-\)statistic under \(\mathcal{H}_{0}\): \[ t\left( y\right) =\frac{\overline{y}-E\left[ \overline{y} \mid \theta _{0}\right] }{se\left( \overline{y}\right) }=\sqrt{T}\left( 2\widehat {\theta}-1\right) \text{,}% \] where \(se\left(\overline{y}\right)\) is the standard error of \(\overline{y}\) under the null. The Bayesian solution requires the marginal likelihoods under the null and alternative, which are \[ p\left( y \mid \theta_{0}=1/2\right) =\prod_{t=1}^{T}p\left( y_{t} \mid \theta _{0}\right) =\left( \frac{1}{2}\right) ^{\sum_{t=1}^{T}y_{t}}\left( \frac{1}{2}\right) ^{T-\sum_{t=1}^{T}y_{t}}=\left( \frac{1}{2}\right) ^{T}, \tag{6.1}\] and, for a beta prior distribution, \(p\left( y \mid \mathcal{H}_{1}\right) =B\left( a_{T},A_{T}\right) /B\left(a,A\right)\), where \(B\left( \cdot,\cdot\right)\) is the beta function and \(a_{T}=a+\sum_{t=1}^{T}y_{t}\) and \(A_{T}=A+T-\sum_{t=1}^{T}y_{t}\) are the updated prior parameters.
To compare the results, note first that in the datasets given above, \(\widehat{\theta}\) and \(T\) generate \(t\left( y\right) \approx1.96\) in each case. Thus, for a significance level of \(\alpha=5\%\), the null is rejected for each sample size. Assuming a flat prior distribution, the Bayes factors are \[ \mathcal{BF}_{0,1}=\left\{ \begin{array} [c]{l}% 0.8178\text{ for }T=50\text{ }\\ 1.0952\text{ for }T=100\\ 2.1673\text{ for }T=400\\ 11.689\text{ for }T=10{,}000 \end{array} \right. , \] showing increasingly strong evidence in favor of \(\mathcal{H}_{0}\). Assuming equal prior weight for the hypotheses, the posterior probabilities of the null are 0.45, 0.523, 0.684, and 0.921, respectively. For the smallest sample, the Bayes factor implies roughly equal odds on the null and alternative. As the sample size increases, the weight of evidence favors the null, with a 92% probability for \(T=10{,}000\).
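The Bayes factors above can be reproduced with the beta-Bernoulli marginal likelihood and a flat Beta(1,1) prior; the sketch below uses log beta functions for numerical stability.

```python
# Sketch reproducing the coin-flip Bayes factors: BF_01 = (1/2)^T divided by
# B(a + heads, A + tails) / B(a, A), with a flat Beta(a=1, A=1) prior.
import numpy as np
from scipy.special import betaln

def bayes_factor_fair_coin(heads, T, a=1.0, A=1.0):
    log_m0 = T * np.log(0.5)                                   # log p(y | theta = 1/2)
    log_m1 = betaln(a + heads, A + T - heads) - betaln(a, A)   # log p(y | H1)
    return np.exp(log_m0 - log_m1)

for heads, T in [(32, 50), (60, 100), (220, 400), (5098, 10000)]:
    bf = bayes_factor_fair_coin(heads, T)
    print(f"T={T:6d}: BF_01 = {bf:7.4f}, P(H0|y) = {bf / (1 + bf):.3f}")
```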
Next, consider testing \(\mathcal{H}_{0}:\theta=\theta_{0}\) vs. \(\mathcal{H}_{1}:\theta\neq\theta_{0}\), based on \(T\) observations from \(y_{t}\sim \mathcal{N}\left( \theta,\sigma^{2}\right)\), where \(\sigma^{2}\) is known. This is the formal example used by Lindley to generate his paradox. Using \(p-\)values, the hypothesis is rejected if the \(t-\)statistic is greater than \(t_{\alpha}\). To generate the paradox, consider datasets for which \(\overline{y}\) is exactly \(t_{\alpha}\) standard errors away from \(\theta_{0}\), that is, \(\overline {y}^{\ast}=\theta_{0}+\sigma t_{\alpha}/\sqrt{T}\), and a uniform prior over the interval \(\left( \theta_{0}-I/2,\theta_{0}+I/2\right)\) under the alternative. If \(p_{0}\) is the prior probability of the null, then \[\begin{align*} P \left( \theta=\theta_{0} \mid \overline{y}^{\ast}\right) & =\frac{\exp\left( -\frac{1}{2}\frac{T\left( \overline{y}^{\ast}-\theta _{0}\right) ^{2}}{\sigma^{2}}\right) p_{0}}{\exp\left( -\frac{1}{2}% \frac{T\left( \overline{y}^{\ast}-\theta_{0}\right) ^{2}}{\sigma^{2}% }\right) p_{0}+\left( 1-p_{0}\right) \int_{\theta_{0}-I/2}^{\theta_{0}% +I/2}\exp\left( -\frac{1}{2}\frac{T\left( \overline{y}^{\ast}-\theta\right) ^{2}}{\sigma^{2}}\right) I^{-1}d\theta}\\ & =\frac{\exp\left( -\frac{1}{2}t_{\alpha}^{2}\right) p_{0}}{\exp\left( -\frac{1}{2}t_{\alpha}^{2}\right) p_{0}+\frac{\left( 1-p_{0}\right) }% {I}\int_{\theta_{0}-I/2}^{\theta_{0}+I/2}\exp\left( -\frac{1}{2}\left( \frac{\overline{y}^{\ast}-\theta}{\sigma/\sqrt{T}}\right) ^{2}\right) d\theta}\\ & \geq\frac{\exp\left( -\frac{1}{2}t_{\alpha}^{2}\right) p_{0}}{\exp\left( -\frac{1}{2}t_{\alpha}^{2}\right) p_{0}+\frac{\left( 1-p_{0}\right) }% {I}\sqrt{2\pi\sigma^{2}/T}}\rightarrow1\text{ as }T\rightarrow\infty\text{.}% \end{align*}\] In large samples, the posterior probability of the null approaches 1, whereas the \(p-\)value test always rejects the null. It is important to note that this holds for any \(t_{\alpha}\): even if the test were performed at the 1% level or lower, the \(p-\)value procedure rejects the null while the posterior probability of the null approaches one.
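A sketch of the limiting argument, holding the data exactly \(t_{\alpha}\) standard errors from \(\theta_{0}\) and assuming a uniform prior of (illustrative) width \(I\) under the alternative with \(p_{0}=1/2\):

```python
# Sketch of Lindley's paradox: the data are held exactly t_alpha standard
# errors away from theta0 (so the p-value stays at its critical level), while
# the posterior probability of the null rises toward one as T grows.
# Assumes a uniform prior of width I around theta0 under H1 and p0 = 1/2.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

theta0, sigma, I, p0, t_alpha = 0.0, 1.0, 2.0, 0.5, 1.96

def posterior_null(T):
    se = sigma / np.sqrt(T)
    ybar_star = theta0 + t_alpha * se                 # exactly "significant" data
    m0 = norm.pdf(ybar_star, theta0, se)              # p(ybar* | H0)
    m1, _ = quad(lambda th: norm.pdf(ybar_star, th, se) / I,
                 theta0 - I / 2, theta0 + I / 2)      # p(ybar* | H1), uniform prior
    return p0 * m0 / (p0 * m0 + (1 - p0) * m1)

for T in [10, 100, 1000, 10000, 100000]:
    print(f"T={T:6d}: P(theta = theta0 | ybar*) = {posterior_null(T):.3f}")
```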
6.4 Prior Sensitivity
One potential criticism of the previous examples is the choice of the prior distribution. How do we know that the prior is not somehow biased against rejecting the null, generating the paradoxes? Under this interpretation, the problem is not with the \(p-\)value but rather with the Bayesian procedure. One elegant way of dealing with this criticism is to search over priors and prior parameters that minimize the posterior probability of the null hypothesis, thus biasing the Bayesian procedure against accepting the null hypothesis.
To see this, consider the case of testing \(\mathcal{H}_{0}:\mu=\mu_{0}=0\) vs. \(\mathcal{H}_{1}:\mu\neq0\) with observations drawn from \(y_{t} \sim\mathcal{N}\left( \mu,\sigma^{2}\right)\), with \(\sigma\) known. With equal prior null and alternative probabilities, the probability of the null is \(P\left( \mathcal{H}_{0} \mid y\right) =\left( 1+\left( \mathcal{BF}_{0,1}\right) ^{-1}\right) ^{-1}\). Under the null, the sampling distribution of the sufficient statistic \(\overline{y}\) is \[ p\left( \overline{y} \mid \mathcal{H}_{0}\right) =\left( \frac{T}{2\pi\sigma^{2}}\right) ^{\frac{1}{2}}\exp\left( -\frac{1}{2}\left( \frac{\overline {y}-\mu_{0}}{\sigma/\sqrt{T}}\right) ^{2}\right) \text{.}% \] The criticism applies to the priors under the alternative. To analyze the sensitivity, consider four classes of priors under the alternative: (a) the class of normal priors, \(p\left( \mu \mid \mathcal{H}_{1}\right) \sim\mathcal{N}\left( a,A\right)\); (b) the class of all symmetric unimodal prior distributions; (c) the class of all symmetric prior distributions; and (d) the class of all proper prior distributions. These classes provide varying degrees of prior information, allowing a thorough examination of the strength of evidence.
In the first case, consider the standard conjugate prior distribution, \(p\left( \mu \mid \mathcal{H}_{1}\right) \sim\mathcal{N}\left( \mu_{0},A\right)\). Under the alternative, \[\begin{align*} p\left( y \mid \mathcal{H}_{1}\right) & =\int p\left( y \mid \mu,\mathcal{H}_{1}\right) p\left( \mu \mid \mathcal{H}_{1}\right) d\mu\\ & =\int p\left( \overline{y} \mid \mu,\mathcal{H}_{1}\right) p\left( \mu \mid \mathcal{H}_{1}\right) d\mu\text{,}% \end{align*}\] using the fact that \(\overline{y}\) is a sufficient statistic. Noting that \(p\left( \overline{y} \mid \mu,\mathcal{H}_{1}\right) \sim \mathcal{N}\left( \mu ,\sigma^{2}/T\right)\) and \(p\left( \mu \mid \mathcal{H}_{1}\right) \sim \mathcal{N}\left( \mu_{0},A\right)\), we can use the “substitute instead of integrate” trick to assert that \[ \overline{y}=\mu_{0}+\sqrt{A}\eta+\sqrt{\sigma^{2}/T}\varepsilon\text{,}% \] where \(\eta\) and \(\varepsilon\) are independent standard normals. Then, \(p\left( \overline{y} \mid \mathcal{H}_{1}\right) \sim\mathcal{N}\left( \mu_{0},A+\sigma^{2}/T\right)\). Thus, \[ \mathcal{BF}_{0,1}=\frac{p\left( y \mid \mathcal{H}_{0}\right) }{p\left( y \mid \mathcal{H}_{1}\right) }=\frac{p\left( \overline{y} \mid \mathcal{H}_{0}\right) }{p\left( \overline{y} \mid \mathcal{H}_{1}\right) }=\frac{\left( \sigma^{2}/T\right) ^{-\frac{1}{2}}}{\left( \sigma^{2}/T+A\right) ^{-\frac{1}{2}}}\frac {\exp\left( -\frac{1}{2}t^{2}\right) }{\exp\left( -\frac{1}{2}\frac {t^{2}\sigma^{2}/T}{A+\sigma^{2}/T}\right) }\text{,}% \] where \(t=\sqrt{T}\left( \overline{y}-\mu_{0}\right) /\sigma\). To operationalize the test, \(A\) must be selected. Here, \(A\) is chosen to minimize the posterior probability of the null, with \(P_{norm}\left( \mathcal{H}_{0} \mid y\right)\) denoting the resulting lower bound. For \(\left\vert t\right\vert \geq1\), the lower bound on the posterior probability of the null is \[ P_{norm}\left( \mathcal{H}_{0} \mid y\right) =\left[ 1+\left( \sqrt{e}\,\left\vert t\right\vert \exp\left( -t^{2}/2\right) \right) ^{-1}\right] ^{-1}, \] which is derived in a reference cited in the notes. This choice provides a maximal bias of the Bayesian approach toward rejecting the null. It is important to note that this is not a reasonable prior, as it was intentionally constructed to bias the test toward rejecting the null.
For the class of all proper prior distributions, it is also easy to derive the bound. From the equation above, minimizing the posterior probability is equivalent to minimizing the Bayes factor, \[ \mathcal{BF}_{0,1}=\frac{p\left( y \mid \mathcal{H}_{0}\right) }{p\left( y \mid \mathcal{H}_{1}\right) }\text{.}% \] Since \[ p\left( y \mid \mathcal{H}_{1}\right) =\int p\left( y \mid \mu,\mathcal{H}_{1}\right) p\left( \mu \mid \mathcal{H}_{1}\right) d\mu\leq p\left( y \mid \widehat{\mu}_{MLE},\mathcal{H}_{1}\right) \text{,}% \] where \(\widehat{\mu}_{MLE}=\arg\underset{\mu\neq0}{\max}\,p\left( y \mid \mu\right)\), the maximized likelihood under the alternative bounds the marginal likelihood from above and therefore provides a lower bound on the Bayes factor, \[ \underline{\mathcal{BF}}_{0,1}=\frac{p\left( y \mid \mathcal{H}_{0}\right) }{\underset{\mu\neq0}{\sup}p\left( y \mid \mu\right) }\text{.}% \] In this case, the bound is particularly easy to calculate and is given by \[ P_{all}\left( \mathcal{H}_{0} \mid y\right) =\left( 1+\exp\left( \frac{t^{2}}{2}\right) \right) ^{-1}\text{.}% \] A reference cited in the notes provides the bounds for the second and third cases, generating \(P_{s,u}\left( \mathcal{H}_{0} \mid y\right)\) and \(P_{s}\left( \mathcal{H}_{0} \mid y\right)\), respectively. All of the bounds depend only on the \(t-\)statistic and constants.
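The bounds for the normal class and for the class of all priors depend only on the \(t-\)statistic and can be evaluated directly; the sketch below reproduces the first and last probability columns of Table 6.2 under the formulas above.

```python
# Sketch of the lower bounds on P(H0 | y) as a function of the t-statistic,
# with equal prior probabilities: the bound over normal priors uses the
# minimizing Bayes factor sqrt(e)*|t|*exp(-t^2/2) (valid for |t| >= 1), and
# the bound over all priors plugs in the MLE, giving BF = exp(-t^2/2).
import numpy as np

def p_norm_bound(t):
    bf_min = np.sqrt(np.e) * abs(t) * np.exp(-t**2 / 2)
    return 1.0 / (1.0 + 1.0 / bf_min)

def p_all_bound(t):
    bf_min = np.exp(-t**2 / 2)
    return 1.0 / (1.0 + 1.0 / bf_min)

for t in [1.645, 1.960, 2.576, 3.291]:
    print(f"t={t:.3f}: P_norm >= {p_norm_bound(t):.3f}, P_all >= {p_all_bound(t):.4f}")
```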
Table 6.2 reports the \(t-\)statistics and associated \(p-\)values, with the remaining columns providing the posterior probability bounds. For the normal prior, even when the prior parameter \(A\) is chosen to minimize the probability of the null, the posterior probability of the null is much larger than the \(p-\)value in every case. For the standard case of a \(t-\)statistic of 1.96, \(P_{norm}\left( \mathcal{H}_{0} \mid y\right)\) is more than six times greater than the \(p-\)value. For \(t=2.576\), \(P_{norm}\left( \mathcal{H}_{0} \mid y\right)\) is about 13 times greater than the \(p-\)value. These probabilities fall somewhat for the more general prior classes. For example, for the class of all priors, a \(t-\)statistic of 1.96 (2.576) generates a lower bound for the posterior probability of 0.128 (0.035), more than 2 (3) times the corresponding \(p-\)value.
| \(t\)-stat | \(p\)-value | \(P_{norm}\left(H_{0} \mid y\right)\) | \(P_{s,u}\left( H_{0} \mid y\right)\) | \(P_{s}\left(H_{0} \mid y\right)\) | \(P_{all}\left(H_{0} \mid y\right)\) |
|---|---|---|---|---|---|
| 1.645 | 0.100 | 0.412 | 0.39 | 0.34 | 0.205 |
| 1.960 | 0.050 | 0.321 | 0.29 | 0.227 | 0.128 |
| 2.576 | 0.010 | 0.133 | 0.11 | 0.068 | 0.035 |
| 3.291 | 0.001 | 0.0235 | 0.018 | 0.0088 | 0.0044 |