Definition. The P-value is the probability of observing a test statistic (i.e., a summary of the data) that is as extreme as or more extreme than the currently observed test statistic, under a statistical model that assumes, among other things, that the hypothesis being tested is true. This can be expressed as Pr(data|H0), where "Pr" is read "the probability of" and "|" is read "given" or "conditional upon." The P-value should not be interpreted as the probability that H0 is true, which would instead be Pr(H0|data).
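The definition can be made concrete with a small sketch. Suppose we observe 8 heads in 10 coin tosses and H0 says the coin is fair; the one-sided P-value sums the probabilities, under H0, of all outcomes at least as extreme as the one observed. (The scenario, numbers, and the helper name `binom_tail_p` are illustrative assumptions, not part of the text above.)

```python
from math import comb

def binom_tail_p(n: int, k_obs: int, p0: float = 0.5) -> float:
    """Pr(K >= k_obs | H0: success probability = p0): the probability,
    under H0, of a result as extreme as or more extreme than observed."""
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k)
               for k in range(k_obs, n + 1))

# One-sided P-value for 8 heads in 10 tosses of a coin assumed fair:
p_value = binom_tail_p(n=10, k_obs=8)
print(round(p_value, 4))  # 56/1024 = 0.0547
```

Note that the sum runs over unobserved, more extreme outcomes as well as the observed one, which is exactly why the P-value is a tail probability rather than Pr(H0|data).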
Interpretation: two competing frameworks. P-values can be used in multiple ways. This has caused a great deal of confusion, because there are two competing and sometimes contradictory philosophical frameworks used to derive the P-value. The first framework, called significance testing, was formally developed and popularized by R. A. Fisher (Fisher, 1925). The second framework, called hypothesis testing, was developed by Jerzy Neyman and Egon Pearson (Neyman & Pearson, 1928, 1933). When we interpret the P-value by borrowing some concepts from Fisher's framework and some from Neyman & Pearson's, incoherent interpretations may result. It is therefore important to understand the objectives and basis of each framework.
Fisher's significance testing. P-values are to be used flexibly in this framework, with the P-value interpreted as "a rational and well-defined measure of reluctance to accept the hypotheses they test" (Fisher, 1973, page 47). Although many have mistakenly suggested a single threshold for determining "statistical significance" (myself included, mea culpa!), Fisher noted "no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas" (Fisher, 1973). Nonetheless, the smaller the P-value, the stronger the evidence against the null hypothesis. Fisher intended the P-value to be combined with other sources of information from within and outside the study, often based on background knowledge. Thus, the researcher should not rely solely on the P-value to reach a conclusion. Note that there is no alternative hypothesis in Fisher's significance testing framework, and that failure to reject the null hypothesis provides no evidence in its support.
Neyman-Pearson (N-P) hypothesis testing. The N-P hypothesis testing procedure is suited for decision-making, and less so for scientific inference. In this framework, we set acceptable rates for type I errors (false rejection of the null hypothesis) and type II errors (false retention of the null hypothesis) before the experiment begins (i.e., preexperimentally). The acceptable type I error rate is referred to as "alpha"; the acceptable type II error rate is referred to as "beta." Preexperimental error rates are not based on the data from the study. After the experiment is completed, we may calculate a P-value from the data and compare it to the preexperimental alpha level. If p < alpha, the null hypothesis is rejected. Note that in N-P hypothesis testing, the conclusion of the test is not intended to verify or falsify the specific hypotheses tested. Instead, it provides "rules for behavior" that are intended to limit the number of type I and type II errors in a long run of similar experiments. The N-P procedure has been criticized as non-scientific because it is incapable of interpreting the results of a single scientific study.
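The long-run guarantee behind the "rules for behavior" can be sketched by simulation: when H0 is true and we always reject at p < alpha, the fraction of (false) rejections settles near alpha over many repetitions. The two-sided z-test with known sigma, the sample sizes, and the seed below are all illustrative assumptions.

```python
import math
import random

def z_test_p(sample, mu0=0.0, sigma=1.0):
    """Two-sided P-value for H0: mean = mu0, with known sigma (z-test)."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))  # = 2 * Pr(Z > |z|)

random.seed(42)
alpha, trials, rejections = 0.05, 2000, 0
for _ in range(trials):
    data = [random.gauss(0.0, 1.0) for _ in range(30)]  # H0 is true here
    if z_test_p(data) < alpha:
        rejections += 1  # each rejection is a type I error

type_one_rate = rejections / trials
print(type_one_rate)  # hovers near alpha = 0.05 in the long run
```

The simulation illustrates the N-P claim precisely: nothing is said about any single experiment's hypothesis, only that the decision rule caps the error rate across the long run of similar experiments.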
| | Fisher Significance Testing | Neyman-Pearson Hypothesis Testing |
|---|---|---|
| Logical basis | Inductive reasoning. | Rules of behavior based on a quasi-deductive model. |
| Hypotheses tested | The null hypothesis.[1] There is no alternative hypothesis in this system. | Null and alternative hypotheses tailored to the situation. |
| Objective | The P-value is used as an informal measure of evidence to reflect on the credibility of the null hypothesis. | Alpha and beta levels are set pre-experimentally to limit the number of type I and type II errors in the long run.[2] |
| p = .04 vs. p = .06 | These results provide approximately the same level of evidence against the null hypothesis. | Assuming a pre-experiment alpha level of .05, p = .04 is a significant finding while p = .06 is a nonsignificant finding. |
| p = .04 vs. p = .001 | p = .001 provides much stronger evidence against the null hypothesis than does p = .04. | Assuming a pre-experiment alpha level of .05, both studies provide significant evidence to reject the null hypothesis, and the two P-values are treated equally. |
| Conclusion | The conclusions of the experiment should not be based on the P-value alone.[3] | Decisions should adhere to rejection and acceptance regions based on alpha and beta set up before the study. |
• In Fisher's significance testing framework, the P-value is an inductive measure that assigns a number as a measure of the credibility of the hypothesis being tested.
• The P-value is not a direct measure of inductive statistical evidence. Inductive statistical evidence is defined as the relative inductive support given to two hypotheses by the data. Fisher's P-value addresses only one hypothesis: the null hypothesis.[4]
• The alpha level in the N-P framework is akin but not identical to the P-value. Both the alpha level and the P-value are based on unobserved data in the tail region of the probability model defined by the null hypothesis. However, the P-value is postexperimental, while the alpha level is preexperimental. It is a mistake to view the postexperimental P-value as the smallest level of alpha at which the experimenter would reject the null hypothesis (Goodman 1993; Greenland 1991).
• Significance tests and hypothesis tests are both forms of frequentist inference. Other forms of statistical inference include Bayesian methods[5] and standardized likelihoods.
[1] The null hypothesis is the hypothesis to be nullified and is not necessarily restricted to a statement of "no association."
[2] A postexperimental P-value can be slotted into the hypothesis testing procedure by comparing it to the preexperimental alpha level.
[3] Fisher intended the P-value to be used informally, as a flexible inductive measure with inferences depending on background knowledge about the phenomenon under investigation.
[4] Goodman 1993 cites the book Probability and the Weighing of Evidence by I. J. Good (New York: Charles Griffin & Co, 1950).
[5] Bayesian statistics were called inverse probabilities until the middle of the twentieth century (Fienberg, 2006).