Bayesian hypothesis testing with the Full Bayesian Evidence Test

Author: Riko Kelter

library(fbst)
#> Loading required package: cubature
#> Loading required package: ks
#> Loading required package: viridis
#> Loading required package: viridisLite
#> Loading required package: rstanarm
#> Loading required package: Rcpp
#> This is rstanarm version 2.32.1
#> - See https://mc-stan.org/rstanarm/articles/priors for changes to default priors!
#> - Default priors may change, so it's safest to specify priors, even if equivalent to the defaults.
#> - For execution on a local, multicore CPU with excess RAM we recommend calling
#>   options(mc.cores = parallel::detectCores())

This vignette explains how to use the Full Bayesian Evidence Test (FBET) for Bayesian hypothesis testing of an interval hypothesis against its alternative via the Bayesian evidence value, which includes the e-value of the FBST as a special (limiting) case.

Theory behind the FBET and Bayesian evidence value

While the Full Bayesian Significance Test (FBST) uses the e-value to quantify the evidence against a precise hypothesis, in a variety of contexts the assumption of a precise point-null hypothesis is unrealistic. Often, for example in biomedical research or the cognitive sciences, the assumption of an interval hypothesis is more appropriate. The Full Bayesian Evidence Test (FBET) generalizes the Pereira-Stern theory which underpins the FBST and allows for testing interval hypotheses. The FBET can be used with any standard parametric model, where θ ∈ Θ ⊆ ℝp is a (possibly vector-valued) parameter of interest, p(x|θ) is the likelihood and p(θ) is the density of the prior distribution Pϑ for the parameter θ.

The Bayesian evidence interval

First, the Bayesian evidence interval is required for the FBET. Let $s(\theta):=\frac{p(\theta|y)}{r(\theta)}$ be the surprise function of the Pereira-Stern FBST with a given reference function r(θ) and tangential set $\overline{T}(\nu):=\{\theta \in \Theta|s(\theta)> \nu \}$ to the null hypothesis H0 : θ = θ0. The expanded tangential set is defined as $\tilde{\overline{T}}(\nu)=\{\theta \in \Theta|s(\theta)\geq \nu \}$, which in contrast to $\overline{T}(\nu)$ also includes the parameter values θ for which s(θ) = ν.

The Bayesian evidence interval EIr(ν) with reference function r(θ) to level ν includes all parameter values θ for which $\theta \in \tilde{\overline{T}}(\nu)$.

It can be shown that the Bayesian evidence interval includes standard HPD intervals, support intervals and credible intervals as special cases.

To unify Bayesian interval estimation and hypothesis testing based on the $EI_r(\nu)$ via an interval hypothesis (or region of practical equivalence, ROPE), the Bayesian evidence value is introduced, which is associated with a Bayesian evidence interval: For a given evidence interval $EI_r(\nu)$ with reference function $r(\theta)$ to level $\nu$, the Bayesian evidence value $\text{Ev}_R^{EI_r(\nu)}(H_0)$ to a ROPE $R$ for the null hypothesis $H_0$ is given as
$$\text{Ev}_R^{EI_r(\nu)}(H_0) := \int_{EI_r(\nu) \cap R} p(\theta|y) d\theta$$
The corresponding Bayesian evidence value $\text{Ev}_R^{EI_r(\nu)}(H_1)$ to a ROPE $R$ for the alternative hypothesis $H_1$ is given as
$$\text{Ev}_R^{EI_r(\nu)}(H_1) := \int_{EI_r(\nu) \cap R^c} p(\theta|y) d\theta$$
where $R^c$ is the set complement of the ROPE $R$. The Bayesian evidence value $\text{Ev}_R^{EI_r(\nu)}(H_0)$ is the integral of the posterior density $p(\theta|y)$ over the intersection of the evidence interval $EI_r(\nu)$ with the ROPE (or interval hypothesis) $R$. That is, it is the integral over the evidence interval $EI_r(\nu)$ restricted to the ROPE (or interval hypothesis) $R$ around the null value of $H_0: \theta = \theta_0$. The Bayesian evidence value can also be called a generalised e-value, as it can be shown to recover the e-value of the FBST as a special case.
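With MCMC draws from the posterior, both evidence values can be approximated by simple Monte Carlo: the integral of the posterior over the intersection is just the share of posterior draws that fall both inside the evidence interval and inside the ROPE (or its complement). The following sketch illustrates this with simulated stand-in draws and a flat reference function r(θ) := 1; it is an illustration of the definition, not the fbst package's implementation:

```r
# Illustrative sketch (not the fbst implementation): approximate the
# Bayesian evidence values from posterior draws, using a flat reference
# function r(theta) := 1, so the surprise function is the posterior density.
set.seed(1)
draws <- rnorm(1e5, mean = 0.6, sd = 0.25)  # stand-in posterior draws
rope  <- c(-0.2, 0.2)                       # ROPE / interval hypothesis R
nu    <- 1                                  # evidence threshold

dens <- density(draws)                           # KDE of the posterior
s    <- approx(dens$x, dens$y, xout = draws)$y   # surprise s(theta) at each draw
inEI   <- s >= nu                                # draw lies in EI_r(nu)
inRope <- draws >= rope[1] & draws <= rope[2]    # draw lies in R

ev_H0 <- mean(inEI & inRope)    # evidence value in favour of H_0
ev_H1 <- mean(inEI & !inRope)   # evidence value in favour of H_1
```

With ν = 0 the evidence interval covers the whole posterior support and the two evidence values sum to one; with ν > 0, as here, part of the posterior mass lies outside the evidence interval, so they need not sum to one.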

The Bayesian evidence value depends on three quantities:

  • The ROPE R := [a, b], a < b, which includes the null value θ0 of H0 : θ = θ0, that is θ0 ∈ R. Thus, the ROPE R can be interpreted as an interval hypothesis H0 := [a, b] around the null value θ0.
  • the reference function r(θ) which is used for construction of the Bayesian evidence interval EIr(ν)
  • and the evidence-threshold ν used for deciding which values are included in the Bayesian evidence interval EIr(ν)

The evidence value also depends on the prior density p(θ), because the Bayesian evidence interval is based on the posterior density, which itself depends on the prior density p(θ). Naturally, the Bayesian evidence value also depends on the observed data y.

From a hypothesis testing perspective, a ROPE $R$ is interpreted as an interval hypothesis which includes the precise null value of the hypothesis $H_0$. From this point of view one could also omit the ROPE index $R$ and define the Bayesian evidence value directly for interval hypotheses $H_0 \subseteq \Theta$ as
$$\text{Ev}^{EI_r(\nu)}(H_0) := \int_{EI_r(\nu) \cap H_0} p(\theta|y) d\theta$$
The general approach of considering a (small) interval hypothesis instead of a point-null hypothesis goes back to Hodges & Lehmann (1954).

Running the FBET

To run the FBET, a posterior distribution is required, which can be obtained as in any standard Bayesian workflow:

  1. Specify a joint distribution for the outcome(s) and all relevant parameters. Typically, this takes the form of a marginal prior distribution for the parameters multiplied by a likelihood for the outcome(s) conditional on the parameters. This joint distribution is proportional to a posterior distribution of the parameters conditional on the observed data.
  2. Run a Markov-Chain-Monte-Carlo (MCMC) algorithm to draw samples from the posterior distribution.
  3. Evaluate the model fit with regard to the data (e.g. via posterior predictive checks) and possibly revise the model.

Based on the sample of posterior draws, the FBET can be performed as follows:

  1. Specify a null hypothesis value θ0 for the precise null hypothesis H0 : θ = θ0.
  2. Specify a ROPE R around the null hypothesis value θ0 based on prior subject-domain knowledge, pilot studies, or minimally relevant effect sizes that can be regarded as scientifically meaningful.

Conceptually, steps one and two can also be merged into a single step which specifies an interval hypothesis H0 : θ ∈ [θ0 − ε1, θ0 + ε2] around θ0 for ε1, ε2 > 0.

  3. Specify a reference function r(θ) which is used to build the surprise function $$s(\theta):=\frac{p(\theta|y)}{r(\theta)}$$ Typical choices include a flat reference function r(θ) := 1 or the density p(θ) of a prior distribution Pϑ for the parameter θ. In the former case, the surprise function becomes the posterior density p(θ|y). In the latter case, the surprise function s(θ) quantifies the ratio of posterior to prior, and parameter values θ with s(θ) ≥ 1 can be interpreted as being corroborated by observing the data y.

As a sidenote, the connection between statistical surprise, corroboration and statistical information is non-trivial; the interested reader is referred to Good (1968) for a detailed treatment of the topic. For current purposes it is legitimate to speak of corroboration in the above context, although replacing being corroborated by being informative with regard to H0 would be more accurate. Parameter values θ with s(θ) < 1 can be interpreted as not being corroborated by observing the data y.

  4. Select the evidence-threshold ν ≥ 0 and calculate the resulting Bayesian evidence interval EIr(ν) for the reference function r(θ) and threshold ν. The Bayesian evidence interval EIr(ν) includes only parameter values θ ∈ Θ which have been corroborated by a factor ν compared to the reference function r(θ), that is, parameter values θ for which the evidence is at least ν.
  5. Run the FBET and compute the Bayesian evidence value EvEIr(ν)R(H0) in favour of H0, and the Bayesian evidence value EvEIr(ν)R(H1) in favour of H1.

Note that the notion of evidence associated with the Bayesian evidence interval EIr(ν) stems from the relationship to the Bayes factor: Parameter values θ̃ ∈ EIr(ν) fulfill the condition s(θ̃) ≥ ν, which is equivalent to p(θ̃|y)/r(θ̃) ≥ ν. Whenever the reference function r(θ) is chosen as the model’s prior density p(θ) for the parameter θ, the ratio p(θ̃|y)/p(θ̃) can be identified as the Savage-Dickey density ratio, which is equal to the Bayes factor BF01 for the precise null hypothesis H0 : θ = θ̃ when Dickey’s continuity condition holds. Thus, the evidence for H0 : θ = θ̃ obtained through a traditional Bayes factor hypothesis test is at least ν for all values inside the Bayesian evidence interval EIr(ν).
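The Savage-Dickey connection can be made concrete in a few lines. The sketch below uses simulated stand-in posterior draws (not the package API) together with a kernel density estimate of the posterior to evaluate the surprise function at a single point θ̃ = 0, with the medium Cauchy prior as reference; under Dickey's continuity condition this ratio approximates the Bayes factor BF01 for H0 : δ = 0:

```r
# Illustrative sketch: the Savage-Dickey density ratio as the surprise
# function evaluated at a single point, with the prior as reference.
set.seed(1)
draws  <- rnorm(1e5, mean = 0.6, sd = 0.25)  # stand-in posterior draws
theta0 <- 0                                  # point of interest

dens <- density(draws)                       # KDE of the posterior
post_at_theta0  <- approx(dens$x, dens$y, xout = theta0)$y
prior_at_theta0 <- dcauchy(theta0, location = 0, scale = sqrt(2)/2)

# Savage-Dickey estimate of BF01 for H0: delta = theta0
bf01 <- post_at_theta0 / prior_at_theta0
```

For these stand-in draws bf01 falls well below one, mirroring evidence against δ = 0; values θ̃ inside EIr(ν) are exactly those for which this ratio is at least ν.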

Again, as a sidenote, it is worth mentioning that the Bayes factor is simply a difference in weak explanatory power, the latter being equal to statistical information, a term borrowed from information theory and, in particular, Kullback-Leibler information (or divergence). Thus, the evidence interval has close connections to information theory and to the weight of evidence, see Good (1968).

Now, we illustrate the FBET with two examples below. The first example is a Bayesian paired-sample t-test with a flat reference function, and the second applies the same test with a user-defined reference function.

Examples

Example 1: Paired-sample t-test with flat reference function

We use Student’s sleep data as illustration, which is available in the data frame sleep.

sleep
#>    extra group ID
#> 1    0.7     1  1
#> 2   -1.6     1  2
#> 3   -0.2     1  3
#> 4   -1.2     1  4
#> 5   -0.1     1  5
#> 6    3.4     1  6
#> 7    3.7     1  7
#> 8    0.8     1  8
#> 9    0.0     1  9
#> 10   2.0     1 10
#> 11   1.9     2  1
#> 12   0.8     2  2
#> 13   1.1     2  3
#> 14   0.1     2  4
#> 15  -0.1     2  5
#> 16   4.4     2  6
#> 17   5.5     2  7
#> 18   1.6     2  8
#> 19   4.6     2  9
#> 20   3.4     2 10

We select both groups and perform a traditional Bayes factor paired-sample t-test based on the model of Rouder et al. (2009). There, the standardized effect size δ := (μ − μ0)/σ is assumed to be zero under the null hypothesis H0, and Cauchy-distributed under the alternative H1, δ ∼ C(0, γ) for γ > 0.

require(BayesFactor)
#> Loading required package: BayesFactor
#> Loading required package: coda
#> Loading required package: Matrix
#> ************
#> Welcome to BayesFactor 0.9.12-4.7. If you have questions, please contact Richard Morey ([email protected]).
#> 
#> Type BFManual() to open the manual.
#> ************
#> 
#> Attaching package: 'BayesFactor'
#> The following object is masked from 'package:ks':
#> 
#>     compare
grp1 = sleep[1:10,]$extra
grp2 = sleep[11:20,]$extra
diff = grp1 - grp2
ttestBF(x = grp1, y = grp2, rscale = "medium", paired = TRUE)
#> Bayes factor analysis
#> --------------
#> [1] Alt., r=0.707 : 17.25888 ±0%
#> 
#> Against denominator:
#>   Null, mu = 0 
#> ---
#> Bayes factor type: BFoneSample, JZS

Based on a traditional Bayes factor test, there is strong evidence for the alternative H1 : δ ≠ 0 (BF10 ≈ 17.26) according to the scale of Jeffreys (1961). However, note that this Bayes factor tests the point-null hypothesis H0 : δ = 0 of exactly no effect, that is, no difference between the population means. Even without observing any data y, we can be reasonably sure that the effect size δ will almost surely not be exactly zero, so an interval hypothesis which separates scientifically meaningful from scientifically irrelevant parameter values constitutes a more adequate test in such a situation (interval Bayes factors would be the analogous choice when sticking to Bayes factors).

Therefore, we now obtain posterior MCMC samples and conduct the FBET based on a flat reference function r(θ) := 1, so that the surprise function s(θ) = p(θ|y)/r(θ) reduces to the posterior density p(θ|y). We use the evidence threshold ν = 0, so that the Bayesian evidence interval becomes EIr(0), which includes all parameter values θ ∈ Θ with posterior density p(θ|y) ≥ 0, that is, effectively the entire parameter space.

set.seed(42)
posteriorDraws = ttestBF(x = grp1, y = grp2, rscale = "medium", paired = TRUE,
                         posterior = TRUE, iterations = 100000)[,3]
set.seed(42)
result = fbet(posteriorDensityDraws = posteriorDraws, interval = c(-0.2,0.2), nu=0)
summary(result)
#> Full Bayesian Evidence Test:
#> Reference function: Flat 
#> Testing Hypothesis H_0:=[ -0.2 , 0.2 ]
#> Bayesian e-value in favour of H_0: 0.01416656 
#> Bayesian e-value against H_0: 0.9858334

The Bayesian evidence value EvEIr(ν)R(H0) in favour of H0 : θ ∈ [−0.2, 0.2] is equal to 0.014, while the Bayesian evidence value EvEIr(ν)R(H1) in favour of H1 is equal to 0.986. Thus, the evidence for the alternative H1 : θ ∉ [−0.2, 0.2] is much stronger than the evidence in favour of H0. In fact, 98.6% of the posterior probability mass is located outside the interval [−0.2, 0.2], which is expressed by the Bayesian evidence value in favour of H1. The following plot illustrates this result:

plot(result, type = "posterior")

The blue area amounts to the 1.4% of posterior probability mass located inside [−0.2, 0.2], which corroborates the null hypothesis H0 based on the chosen ROPE R := [−0.2, 0.2] around θ0 := 0. However, the evidence threshold ν = 0 requires no form of corroboration for the parameter values θ after observing the data y, which is shown in the above plot: the evidence threshold is not visible, as it is a horizontal dashed line at the ordinate y = 0.

If we choose a different hypothesis H0 : δ ∈ (−0.25, 0.25), we obtain:

set.seed(42)
result2 = fbet(posteriorDensityDraws = posteriorDraws, interval = c(-0.25,0.25), nu=0)
summary(result2)
#> Full Bayesian Evidence Test:
#> Reference function: Flat 
#> Testing Hypothesis H_0:=[ -0.25 , 0.25 ]
#> Bayesian e-value in favour of H_0: 0.02020536 
#> Bayesian e-value against H_0: 0.9797946
plot(result2, type = "posterior")

Example 2: Paired-sample t-test with user-specified reference function

In the above, a flat reference function r(θ) := 1 was used in the surprise function s(θ) and the evidence threshold ν was set to zero. However, the prior distribution on the parameter θ was a medium Cauchy prior $C(0,\sqrt{2}/2)$. As the prior beliefs are reflected in this distribution, it is reasonable to choose the Cauchy prior as the reference function instead (the surprise function then equals what Good (1968) motivated as statistical information, providing an independent justification of the expanded tangential set, which includes parameter values that are informative with regard to the hypothesis under consideration). Below, it is shown how to do this. Also, the evidence threshold is set to ν = 1, so that the Bayesian evidence interval EIr(1) includes only parameter values θ which attain a larger posterior density value than the corresponding prior density value. This time, we test H0 : δ ∈ (−0.5, 0.5) against H1 : δ ∉ (−0.5, 0.5):

set.seed(42)
result3 = fbet(posteriorDensityDraws = posteriorDraws, interval = c(-0.5,0.5), 
               nu=1, FUN = dcauchy, par=list(location = 0, scale = sqrt(2)/2))
summary(result3)
#> Full Bayesian Evidence Test:
#> Reference function: User-defined 
#> Testing Hypothesis H_0:=[ -0.5 , 0.5 ]
#> Bayesian e-value in favour of H_0: 0.01750199 
#> Bayesian e-value against H_0: 0.9108208

Now, based on the different reference function and evidence threshold, we see that the resulting Bayesian evidence values change accordingly. Below, the situation is illustrated:

plot(result3, type="surprise")

The horizontal dashed line shows the evidence-threshold ν = 1, and only parameter values θ which fulfill p(θ|y)/p(θ) ≥ 1 are included in the Bayesian evidence interval EIr(1). The resulting Bayesian evidence interval EIr(1) is shown as the vertical blue lines in the plot, and the interval hypothesis H0 := (−0.5, 0.5) as the dashed vertical blue lines. The resulting Bayesian evidence value EvEIr(ν)R(H0) in favour of H0 is visualized as the blue area, which amounts to 1.75% of the posterior probability mass. The Bayesian evidence value EvEIr(ν)R(H1) in favour of H1 is given as EvEIr(ν)R(H1) = 0.9108, which shows that 91.08% of the posterior probability mass support the alternative. It should be clear by now that omitting the ROPE R and instead working with an interval hypothesis H0 from the start only simplifies the notation. Henceforth, we therefore only speak of H0.

Now, if we choose a different interval hypothesis H0 := (−0.2, 0.2), we obtain:

result4 = fbet(posteriorDensityDraws = posteriorDraws, interval = c(-0.2,0.2), 
               nu=1, FUN = dcauchy, par=list(location = 0, scale = sqrt(2)/2))
summary(result4)
#> Full Bayesian Evidence Test:
#> Reference function: User-defined 
#> Testing Hypothesis H_0:=[ -0.2 , 0.2 ]
#> Bayesian e-value in favour of H_0: 0 
#> Bayesian e-value against H_0: 0.9283223

Now, based on the different interval hypothesis, we see that the resulting Bayesian evidence values change accordingly: the evidence for the null hypothesis drops to zero, while the evidence for the alternative has increased. Below, the situation is illustrated:

plot(result4, type="surprise")

Besides plotting the surprise function, we can also plot the resulting posterior density and visualize the evidence values. To do so, we simply omit the type argument.

plot(result4)

Also, we could require that values be corroborated more strongly, for example by choosing ν = 2. Then, every value inside the Bayesian evidence interval attains at least twice the value of the corresponding reference function r(θ). In the above example, we obtain:

set.seed(42)
result5 = fbet(posteriorDensityDraws = posteriorDraws, interval = c(-0.2,0.2), nu=2, 
               FUN = dcauchy, par=list(location = 0, scale = sqrt(2)/2))
summary(result5)
#> Full Bayesian Evidence Test:
#> Reference function: User-defined 
#> Testing Hypothesis H_0:=[ -0.2 , 0.2 ]
#> Bayesian e-value in favour of H_0: 0 
#> Bayesian e-value against H_0: 0.8546793

Evidently, as the evidence for H0 was already zero, the only thing which changes in this setting is that the evidence in favour of H1 decreases to about 85.47%. Thus, compared to the previous example we are now less than 90% sure that the alternative hypothesis holds. To arrive at this conclusion we must, however, accept that the reference function is sensible, and that we regard parameter values as evidential in favour of the null or alternative only if the ratio of posterior to prior density is at least ν = 2. The following plot visualizes the posterior in this setting:

plot(result5)

In contrast, note that the surprise function plot differs. Importantly, the evidence threshold ν is now equal to 2:

plot(result5, type = "surprise")

Conclusion

In this vignette it was demonstrated how to perform the Full Bayesian Evidence Test (FBET) for testing an interval hypothesis in the Bayesian paradigm. It was shown that the FBET can be applied to a wide range of statistical models as long as the posterior is obtainable via standard MCMC algorithms. Thus, the FBET presents a flexible approach to Bayesian hypothesis testing while being fully coherent with the likelihood principle and its associated benefits (e.g. the irrelevance of stopping rules and of censoring mechanisms). Also, the assumption of an interval hypothesis is often more realistic than the assumption of a precise point-null hypothesis, and the Bayesian evidence value provides a convenient way to quantify the evidence for a hypothesis based on the Bayesian evidence interval. Furthermore, requirements of differing strengths of evidence can be incorporated into the analysis. Theoretical details can be found in Kelter (2022).

References

Kelter, R. (2022). The Evidence Interval and the Bayesian Evidence Value - On a unified theory for Bayesian hypothesis testing and interval estimation. British Journal of Mathematical and Statistical Psychology. https://doi.org/10.1111/bmsp.12267

Kelter, R. (2021). fbst: An R package for the Full Bayesian Significance Test for testing a sharp null hypothesis against its alternative via the e-value. Behavior Research Methods. https://doi.org/10.3758/s13428-021-01613-6

Pereira, C. A. d. B., & Stern, J. M. (2020). The e-value: a fully Bayesian significance measure for precise statistical hypotheses and its research program. São Paulo Journal of Mathematical Sciences, 1–19. https://doi.org/10.1007/s40863-020-00171-7

Kelter, R. (2020). Analysis of Bayesian posterior significance and effect size indices for the two-sample t-test to support reproducible medical research. BMC Medical Research Methodology, 20(88). https://doi.org/10.1186/s12874-020-00968-2

Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225–237. https://doi.org/10.3758/PBR.16.2.225

Good, I. J. (1968). Corroboration, Explanation, Evolving Probability, Simplicity and a Sharpened Razor. The British Journal for the Philosophy of Science, Vol. 19, Issue 2. https://www.journals.uchicago.edu/doi/10.1093/bjps/19.2.123

Jeffreys, H. (1961). Theory of Probability. In Oxford Classic Texts in the Physical Sciences (3rd ed.). Oxford University Press.

Hodges, J. L., & Lehmann, E. L. (1954). Testing the Approximate Validity of Statistical Hypotheses. Journal of the Royal Statistical Society: Series B (Methodological), 16 (2), 261–268. https://doi.org/10.1111/j.2517-6161.1954.tb00169.x