Erhard Reschenhofer

University of Vienna

Journal of Statistics Education Volume 9, Number 1 (2001)

Copyright © 2001 by Erhard Reschenhofer, all rights
reserved.

This text may be freely shared among individuals, but it
may not be republished in any medium without express
written consent from the author and advance notification of
the editor.

**Key Words:** Model selection; Selecting the level of significance; Testing.

**1. Introduction**

In statistics courses, students often find it difficult to understand the concept of a statistical test. An aggravating aspect of this problem is the seeming arbitrariness in the selection of the level of significance. In most hypothesis-testing exercises with a fixed level of significance, the students are just asked to choose the 5% level, and no explanation for this particular choice is given. This article tries to make this arbitrary choice more appealing by providing a nice geometric interpretation of approximate 5% hypothesis tests for means.

Usually, we want to know not only whether an observed deviation from the null hypothesis is statistically significant, but also whether it is of practical relevance. We can use the same geometrical approach that we use to illustrate hypothesis tests to distinguish qualitatively between small and large deviations.

The histograms of many datasets occurring in practice have the appearance of a bell. They are symmetric about their means and tail off rapidly as we move away from the means. A typical example is shown in Figure 1a, which summarizes the mean temperatures in May recorded from 1845 to 1978 in St. Louis. (This dataset will be described in more detail in Section 3.) Of course, histograms can also look quite different from the distribution in Figure 1a. They may be skewed, have thick tails, or exhibit more than one peak. In this paper, we are interested in the last case, particularly that of two peaks. Histograms with two peaks are called bimodal, and those with only one peak are called unimodal. An example of a bimodal histogram is shown in Figure 2b, which summarizes a dataset containing mean temperatures observed in July and in September. Here bimodality is due to the fact that the dataset is heterogeneous. It could easily be dissected into two more homogeneous parts by studying the July temperatures and the September temperatures separately.

Figure 1a. Histogram of the Mean St. Louis Temperature in May (1845-1978).

Figure 1b. Histogram of the Mean St. Louis Temperature in September (1845-1978).

Clearly, bimodality does not always occur when we have a mixture of two sets of observations with different means -- the means must be sufficiently different. Consider, for example, Figure 2a, which summarizes a dataset containing mean temperatures observed in May and September. In this case, the difference between the mean of the May temperatures and that of the September temperatures is too small to cause bimodality. If we want to assess the difference between the means of two datasets, we could examine the shape of the histogram of the combined dataset. Bimodality of this histogram could serve as a qualitative indicator for a big difference between the means. This idea will be explained in more detail in Section 3.

Figure 2a. Histogram of the Mean St. Louis Temperature in May and September (1845-1978).

Figure 2b. Histogram of the Mean St. Louis Temperature in July and September (1845-1978).

Figure 2c. Histogram of the Mean St. Louis Temperature in July and September (1845-1978).

(Choosing class intervals that are too small gives rise to spurious peaks!)

Figure 2d. Histogram of the Mean St. Louis Temperature
in July and September (1845-1978).

(Choosing class intervals that are too wide conceals genuine bimodality!)

A different question, namely whether or not an observed difference between two means is statistically significant, is discussed in Section 4. To answer this question, we must examine the distributions of the sample means rather than the distributions of individual observations. Again we might check for bimodality. But this time we must examine the combination of the distributions of the sample means. It turns out that bimodality occurs whenever the null hypothesis of identical means is rejected by a hypothesis test at an approximate 5% level of significance. Hence this approach provides a nice geometric interpretation of tests for differences between means at the 5% level.

The question of how the 5% level of significance was
chosen as a standard is examined in Section 2. Finally, Section 5 discusses the usefulness of
fixed-level significance testing versus the mere reporting
of *p*-values, describes class reaction to the
bimodality principle, and gives suggestions for covering
this material with students.

**2. The Choice of the 5% Level**

A crucial problem in statistics is to discriminate between two or more competing hypotheses or models. The first problem of this kind faced by a beginner is that of testing the null hypothesis that the mean of a normal distribution is equal to a specified value *c*. When the sample size is large, it is usually suggested that we reject this null hypothesis whenever the distance between *c* and the sample mean, $\bar{x}$, exceeds two standard deviations of the sample mean. A similar problem is that of testing the null hypothesis that the means of two normal distributions are identical. The latter null hypothesis is usually rejected whenever the distance between the two sample means, $\bar{x}$ and $\bar{y}$, exceeds two standard deviations of the difference $\bar{x} - \bar{y}$. In each of the two cases, the stated rejection rule guarantees that the probability of rejecting a true null hypothesis is only 5% (approximately).
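The one-sample rejection rule stated above can be sketched in a few lines of Python; the function name and the illustrative data below are mine, not the article's:

```python
import math

def reject_mean_test(sample, c):
    """Large-sample test of H0: mean = c. Reject when the distance between
    c and the sample mean exceeds two standard deviations of the sample
    mean (an approximate 5% level test)."""
    n = len(sample)
    xbar = sum(sample) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
    return abs(xbar - c) > 2 * s / math.sqrt(n)

# Illustrative data with sample mean 10.0 and s close to 1:
data = [9.0, 11.0] * 50
print(reject_mean_test(data, 10.0))  # False: xbar equals c exactly
print(reject_mean_test(data, 10.5))  # True: |xbar - c| = 0.5 > 2s/sqrt(n) ~ 0.20
```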

But how can the choice of the 5% level be justified? Cowles and Davis (1982) investigated the question of how the 5% level of significance was chosen as a standard. Examining early literature in probability and statistics, they found that Fisher (1925) was perhaps the first to formally mention the 5% level. In his book Statistical Methods for Research Workers, Fisher stated that deviations exceeding twice the standard deviation are regarded as significant. However, Cowles and Davis (1982) stressed that Fisher should not be credited with introducing the 5% level because the choice of this level by Fisher was not casual and arbitrary, but was influenced by previous scientific conventions. At the beginning of the 20th century, statements about statistical significance were still given in terms of the probable error, which was the nineteenth-century measure of the width of a distribution (see Porter 1986 and Stigler 1986). (The German astronomer Friedrich Wilhelm Bessel appears to have coined the term 'probable error' or 'der wahrscheinliche Fehler' by 1815 (see Walker 1929, p. 186). The term 'standard deviation' was introduced almost 80 years later by Karl Pearson (see Stigler 1986, p. 328).) Deviations exceeding three times the probable error were considered significant (see, e.g., Student 1908). The probable error is defined as the median deviation from the mean. If the mean coincides with the median, which is the case for symmetric distributions, the probable error is just half the interquartile range. Observing that the upper quartile of a standard normal distribution is approximately 0.674, we note that the probable error roughly corresponds to 2/3 of a standard deviation. In the normal case, a deviation of three probable errors therefore corresponds to a deviation of two standard deviations. Hence it seems that the 5% level has a longer history than is generally appreciated.
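The correspondence between three probable errors and two standard deviations is easy to verify numerically. The sketch below (plain Python, standard library only) locates the upper quartile of the standard normal by bisection:

```python
import math

# Standard normal CDF via the error function (math.erf is in the stdlib).
def phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Find the upper quartile z_{0.75} of the standard normal by bisection.
lo, hi = 0.0, 2.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if phi(mid) < 0.75:
        lo = mid
    else:
        hi = mid
probable_error = 0.5 * (lo + hi)

print(round(probable_error, 4))      # 0.6745: roughly 2/3 of a standard deviation
print(round(3 * probable_error, 4))  # 2.0235: three probable errors ~ 2 sd
print(round(2 * (1 - phi(2.0)), 4))  # 0.0455: P(|Z| > 2), the approximate 5% level
```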

**3. The Bimodality Principle**

We use a simple meteorological example to introduce the bimodality principle. The variable of interest is the monthly mean temperature in St. Louis, Missouri. Data are available for the period from January 1845 to December 1978 (see Marple 1987). In view of the extreme unreliability of long-term weather forecasts, these measurements may be considered as roughly independent observations. This dataset is considered as a sample of 134 years from the population of *all* years. For each month we have *n* = 134 observations. We want to compare the mean temperature in May, *M*, with that in September, *S*. We will use the notation *M _{i}* to indicate the *i*-th measurement of the variable *M*, and *S _{i}* for that of *S*. The sample means of *M* and *S* differ.

Clearly, it depends on the circumstances whether or not
this difference is considered as important. For an average
citizen of St. Louis it may be insignificant, whereas for
the operator of a solar power station it may be very
important. A purely formal approach for assessing the size
of this difference is to combine both samples into a single
sample and then produce a histogram for the combined
sample. If the distance between the means is large enough,
this histogram will exhibit two peaks, each of which
corresponds to a peak in one of the two original
histograms. In our case, the difference is too small. The
histogram for the combined dataset has only one peak (see
Figure 2a); hence it does not
indicate an important difference. In contrast, if we
compare the mean temperatures in July,
*J*_{1},..., *J _{n}*, with those
in September, we find two peaks in the histogram of the
combined dataset (see Figure 2b).
The bimodality of the histogram of the combined sample may
be considered an indication of an important location
difference between the two datasets. Indeed, the first peak
is close to the mean of the September measurements ($\bar{S}$ = 21.1) and the second
peak is close to the mean of the July measurements ($\bar{J}$ = 26.3).

The above procedure for distinguishing between important
and unimportant location differences is not completely
objective because it contains a subjective component,
namely the choice of the classes used for the construction
of the histograms. Unfortunately, this choice strongly
influences the appearance of the histogram. Choosing the
width of the class intervals too small could give rise to
spurious peaks (see Figure 2c). On
the other hand, genuine bimodality could be concealed by
choosing the width of the class intervals too large (see Figure 2d). An obvious way to get rid
of this subjective component is to use another graphical
tool for the description of the data instead of the
histogram. The probability distribution of a continuous
random variable like the air temperature is characterized
by its probability density function. The probability that
the random variable takes on a value in the interval from
*a* to *b* is just the area under the graph of
the probability density function between *a* and
*b*. A histogram can be regarded as an estimate of the
probability density function. Many continuous random
variables occurring in practice have bell-shaped
probability density functions. Figures
1a and 1b suggest that this
might be true also for our random variables *M* and
*S*. For both datasets, neither the Kolmogorov-Smirnov
test nor the Anderson-Darling test detects any deviation
from normality at the 10% level of significance. We may
therefore assume that their probability density functions
are of the normal type. Normal probability density
functions are completely determined by two parameters, the
mean and the standard deviation. Clearly, we do not know
the means and the standard deviations of the random
variables *M* and *S*, but we can use estimates
instead. Estimates of the means and the standard deviations
of *M* and *S* are obtained by calculating the sample means,

$\bar{M} = \frac{1}{n}\sum_{i=1}^{n} M_i$ and $\bar{S} = \frac{1}{n}\sum_{i=1}^{n} S_i$,

and the sample standard deviations,

$s_M = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(M_i - \bar{M})^2}$ and $s_S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(S_i - \bar{S})^2}$.

Substituting these estimates into the formula for the normal probability density function, we obtain estimates of the probability density functions of *M* and *S*, namely $f(x \mid \bar{M}, s_M)$ and $f(x \mid \bar{S}, s_S)$ (see Figures 3a and 3b).

Figure 3a. Estimated Probability Density Function for the Mean Temperature in May.

Figure 3b. Estimated Probability Density Function for the Mean Temperature in September.
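The estimation step just described can be sketched as follows; the temperature values below are hypothetical stand-ins, not the Marple (1987) data:

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal probability density f(x | mu, sigma)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def fit_normal(sample):
    """Return the sample mean and sample standard deviation (n - 1 denominator)."""
    n = len(sample)
    mu = sum(sample) / n
    s = math.sqrt(sum((x - mu) ** 2 for x in sample) / (n - 1))
    return mu, s

# Hypothetical May-like temperatures (degrees Celsius):
may = [17.8, 18.4, 19.1, 18.0, 17.5, 18.9, 19.3, 18.2]
mu, s = fit_normal(may)
print(round(mu, 2), round(s, 2))                             # 18.4 0.65
print(normal_pdf(mu, mu, s) > normal_pdf(mu + 1.0, mu, s))   # True: peak at the mean
```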

What we need next is a graphical summary of the combined dataset *M*_{1},..., *M _{n}*, *S*_{1},..., *S _{n}*. Combining the two probability density functions depicted in Figures 3a and 3b, we obtain the mixture density

$g(x) = \frac{1}{2} f(x \mid \bar{M}, s_M) + \frac{1}{2} f(x \mid \bar{S}, s_S)$

(see Figure 4a).

Figure 4a. Combination of the Estimated Probability Density Functions for May and September.

Figure 4b. Combination of the Estimated Probability Density Functions for July and September.

It can be shown (see the Appendix) that the average of two normal probability density functions with the same standard deviation, $\sigma$, is bimodal if and only if the distance between the two means exceeds $2\sigma$ (see Figure 5).

Figure 5. Averages of Two Normal Probability Density Functions with Equal Standard Deviations but Different Means.
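The behavior illustrated in Figure 5 can be checked numerically. The following sketch counts the local maxima of an equal-weight normal mixture on a fine grid (the function names are mine, not the article's):

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture(x, mu1, mu2, sigma):
    """Equal-weight average of two normal densities with a common sigma."""
    return 0.5 * normal_pdf(x, mu1, sigma) + 0.5 * normal_pdf(x, mu2, sigma)

def is_bimodal(mu1, mu2, sigma, steps=2000):
    """Count local maxima of the mixture on a fine grid."""
    lo, hi = min(mu1, mu2) - 4 * sigma, max(mu1, mu2) + 4 * sigma
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [mixture(x, mu1, mu2, sigma) for x in xs]
    peaks = sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])
    return peaks == 2

# Bimodal exactly when the means are more than two standard deviations apart:
print(is_bimodal(0.0, 1.9, 1.0))  # False: separation below 2 sigma
print(is_bimodal(0.0, 2.5, 1.0))  # True: separation above 2 sigma
```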

**4. Significance Tests**

In statistics, the difference between two sample means is usually assessed in two ways. First, the size of the difference is judged by its practical importance. In most applications, this can easily be accomplished without sophisticated decision rules. Only if the investigator has absolutely no clue as to which differences should be considered important might he/she have recourse to a formal rule like the one based on a bimodality check. According to this rule, called the bimodality principle, a location difference between two (estimated) normal probability density functions is regarded as large (or important) if their mixture density is bimodal. In the previous section, we applied this principle to distinguish between small and large location differences.

The second interesting question regarding the difference between two sample means is whether it is large enough to indicate that the population means also differ. In our example, we might wish to determine whether the overall mean temperature in May differs significantly from that in September. This question may be answered by applying a 5% level hypothesis test. In the second part of this section, we will show how the bimodality principle can be used to illustrate this test. But first we consider the one-sample case.

Suppose we are given a sample *x*_{1},..., *x _{n}* from a normal distribution with mean $\mu$ and standard deviation $\sigma$. We formulate a simple null hypothesis,

$H_0: \mu = c$ versus $H_A: \mu \neq c$.

The null hypothesis states that the mean $\mu$ is equal to a specified value *c*, and the alternative hypothesis states that the mean differs from this value. It is natural to test the null hypothesis by calculating the sample mean $\bar{x}$ and rejecting the null hypothesis whenever the discrepancy between $\bar{x}$ and *c* is too large. The significance of any discrepancy depends on the reliability of the sample mean. To assess the reliability of a sample mean, we may consider its sampling distribution. The sampling distribution of $\bar{x}$ is normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. An estimate of the latter is given by $s/\sqrt{n}$, where *s* is the sample standard deviation. Under the null hypothesis, $\bar{x}$ should be close to *c*; hence the two probability density functions $f(x \mid \bar{x}, s/\sqrt{n})$ and $f(x \mid c, s/\sqrt{n})$ should not differ too much (see Endnote). The null hypothesis could be rejected if their mixture density is bimodal. Recalling from Section 3 that the mixture density of two normal probability density functions is bimodal if the difference between the means exceeds two standard deviations, we note that in this case the bimodality principle rejects the null hypothesis if $|\bar{x} - c| > 2s/\sqrt{n}$. Thus the bimodality principle makes the same decision as the standard large sample significance test at the 5% level. (Actually a large sample *t*-test rejects the null hypothesis if $|\bar{x} - c| > 1.96\,s/\sqrt{n}$.)

We now return to our meteorological hypothesis testing problem, which involves two samples, *M*_{1},..., *M _{n}* and *S*_{1},..., *S _{n}*. The null hypothesis states that the means of *M* and *S* are identical. Following the bimodality principle, we reject this null hypothesis if the mixture of the two probability density functions $f(x \mid \bar{M}, \hat{\sigma}_D)$ and $f(x \mid \bar{S}, \hat{\sigma}_D)$ is bimodal, where

$\hat{\sigma}_D = \sqrt{s_M^2/n + s_S^2/n}$

is the estimated standard deviation of the difference $\bar{M} - \bar{S}$, i.e., if $|\bar{M} - \bar{S}| > 2\hat{\sigma}_D$. Rewriting this inequality as

$\frac{|\bar{M} - \bar{S}|}{\sqrt{s_M^2/n + s_S^2/n}} > 2,$

we notice immediately that our two-sample test based on a bimodality check agrees with the standard large sample test for comparing two means if the 5% level of significance is chosen for the latter test.

In our example, the left-hand side of this inequality exceeds 2, and hence the hypothesis of identical means is rejected at the 5% level of significance. Correspondingly, the combination of the probability density functions $f(x \mid \bar{M}, \hat{\sigma}_D)$ and $f(x \mid \bar{S}, \hat{\sigma}_D)$ exhibits two peaks (see Figure 6).

Figure 6. Combination of the Probability Density Functions of the Two Sample Means.
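A minimal sketch of the two-sample decision rule, using hypothetical July- and September-like values rather than the actual Marple (1987) series:

```python
import math

def two_sample_reject(x, y):
    """Large-sample two-sample test of equal means at the (approx.) 5% level:
    reject when |xbar - ybar| exceeds two standard deviations of the
    difference -- the same decision the bimodality principle makes."""
    nx, ny = len(x), len(y)
    xbar, ybar = sum(x) / nx, sum(y) / ny
    sx2 = sum((v - xbar) ** 2 for v in x) / (nx - 1)
    sy2 = sum((v - ybar) ** 2 for v in y) / (ny - 1)
    se_diff = math.sqrt(sx2 / nx + sy2 / ny)
    return abs(xbar - ybar) > 2 * se_diff

# Hypothetical samples (degrees Celsius), clearly separated means:
july = [26.0, 26.5, 26.3, 26.8, 25.9, 26.4]
september = [21.0, 21.3, 20.8, 21.5, 21.1, 20.9]
print(two_sample_reject(july, september))  # True: means differ by about 5 degrees
```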

**5. Discussion**

Today's statistical software calculates *p*-values automatically; hence the practice of fixed-level significance testing is no longer dictated by the availability of tables. Of course, stating whether a hypothesis is rejected or not at some level of significance is not as informative as giving the *p*-value itself. Reporting the actual *p*-value indeed makes it much easier for the reader of a report to judge the significance of a result. Nevertheless, there are still situations, e.g., in economic forecasting, where statisticians must decide for or against some hypothesis before they can carry on with their work. Ideally, if a statistician is going to make such a decision, he/she should take the consequences of his/her decision into account in choosing the level of significance. Unfortunately, this often cannot be accomplished in an objective and verifiable way. At best, it will only be possible to decide whether the 10% level is more appropriate than the 1% level, but certainly not whether the 4% level is more appropriate than the 6% level. Hence it still makes sense to have standards like the 1% level, the 5% level, or the 10% level. The mere existence of such standards already makes cheating more difficult. Clearly, if someone reports that he/she has rejected a hypothesis at the 6% level, the reader of the report will check suspiciously whether there are good reasons for using just this level of significance.

I have used the bimodality principle to illustrate
5%-level hypothesis tests in introductory statistics
courses for science, education, and engineering students.
However, I did not explain all the details and omitted the
proof. I just showed the figures and used approximately
half an hour to explain them. Student reaction was mixed.
Only a few students, particularly those who frequently
asked questions, explicitly appreciated the explanation.
The majority never did question the use of the 5% level and
therefore felt no need for an illustration. In my
explanation I focused on the coincidence that, on the one
hand, the critical value of a large sample *t*-test at
the 5% level is approximately 2, and, on the other hand,
the mixture density of two normal probability density
functions is bimodal if the difference between their means
exceeds two standard deviations. This material (including
the proof and all details) could possibly be appropriate
for an investigation involving extra effort outside of a
typical class. This may be an honors project associated
with a class or even a senior project.

**Acknowledgments**

I wish to thank the Editor, the Associate Editor, and the Referees for helpful comments. This paper was written at the Sultan Qaboos University, Oman.

**Endnote**

Note that the sample standard deviation *s* is a reasonable estimate of $\sigma$ under both *H*_{0} and *H _{A}*.
**Appendix**

**Theorem:** The mixture density

$g(x) = \frac{1}{2} f(x \mid \mu_1, \sigma) + \frac{1}{2} f(x \mid \mu_2, \sigma)$

of two normal probability density functions with the same standard deviation, $\sigma$, but with different means, $\mu_1$ and $\mu_2$, respectively, is bimodal if and only if $|\mu_1 - \mu_2| > 2\sigma$. (For descriptive properties of normal mixtures, see Robertson and Fryer 1969.)

**Proof:** By symmetry, $g$ will have either a maximum at $x_0 = (\mu_1 + \mu_2)/2$ (the unimodal case) or a local minimum at $x_0$ (the bimodal case). Indeed, $x_0$ is a stationary point because

$g'(x_0) = -\frac{1}{2\sigma^2}\left[(x_0 - \mu_1) f(x_0 \mid \mu_1, \sigma) + (x_0 - \mu_2) f(x_0 \mid \mu_2, \sigma)\right]$

$= -\frac{1}{2\sigma^2}\left[(x_0 - \mu_1) + (x_0 - \mu_2)\right] f(x_0 \mid \mu_1, \sigma)$

$= 0,$

where the second equality uses $f(x_0 \mid \mu_1, \sigma) = f(x_0 \mid \mu_2, \sigma)$ and the third uses $(x_0 - \mu_1) = -(x_0 - \mu_2)$.

Now we must check the second derivative to see whether a maximum or a minimum occurs:

$g''(x_0) = \frac{1}{2\sigma^2}\left[\left(\frac{(x_0 - \mu_1)^2}{\sigma^2} - 1\right) f(x_0 \mid \mu_1, \sigma) + \left(\frac{(x_0 - \mu_2)^2}{\sigma^2} - 1\right) f(x_0 \mid \mu_2, \sigma)\right]$

$= \frac{1}{\sigma^2}\left(\frac{(\mu_1 - \mu_2)^2}{4\sigma^2} - 1\right) f(x_0 \mid \mu_1, \sigma)$

$> 0,$

if $(\mu_1 - \mu_2)^2 > 4\sigma^2$ or, equivalently, if $|\mu_1 - \mu_2| > 2\sigma$. Thus, a minimum occurs only if the distance between the two means exceeds two standard deviations.
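The sign change of the second derivative at the midpoint can be confirmed numerically with a finite-difference approximation (an illustrative sketch, not part of the original article):

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def g(x, mu1, mu2, sigma):
    """Equal-weight mixture of two normal densities with a common sigma."""
    return 0.5 * normal_pdf(x, mu1, sigma) + 0.5 * normal_pdf(x, mu2, sigma)

def second_derivative_at_midpoint(mu1, mu2, sigma, h=1e-4):
    """Central finite-difference estimate of g'' at x0 = (mu1 + mu2) / 2."""
    x0 = 0.5 * (mu1 + mu2)
    return (g(x0 - h, mu1, mu2, sigma) - 2 * g(x0, mu1, mu2, sigma)
            + g(x0 + h, mu1, mu2, sigma)) / (h * h)

# g''(x0) > 0 (a local minimum, hence bimodality) exactly when |mu1 - mu2| > 2 sigma:
print(second_derivative_at_midpoint(0.0, 1.5, 1.0) < 0)  # True: unimodal case
print(second_derivative_at_midpoint(0.0, 3.0, 1.0) > 0)  # True: bimodal case
```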

**References**

Cowles, M., and Davis, C. (1982), "On the Origins of the .05 Level of Significance," *American Psychologist*, 37, 553-558.

Fisher, R. A. (1925), *Statistical Methods for Research Workers*, Edinburgh: Oliver & Boyd.

Marple, S. L., Jr. (1987), *Digital Spectral Analysis*, Englewood Cliffs, NJ: Prentice Hall.

Porter, T. M. (1986), *The Rise of Statistical Thinking 1820-1900*, Princeton, NJ: Princeton University Press.

Robertson, C. A., and Fryer, J. G. (1969), "Some Descriptive Properties of Normal Mixtures," *Skandinavisk Aktuarietidskrift*, 69, 137-146.

Stigler, S. M. (1986), *The History of Statistics*, Cambridge, MA: The Belknap Press of Harvard University Press.

Student (W. S. Gosset) (1908), "The Probable Error of a Mean," *Biometrika*, 6, 1-25.

Walker, H. M. (1929), *Studies in the History of Statistical Method*, Baltimore: Williams & Wilkins. Reprinted 1975, New York: Arno Press.

Erhard Reschenhofer

Department of Statistics and Decision Support Systems

University of Vienna

Universitätsstr.5

A-1010 Vienna

Austria
