David A. Rolls

The University of Melbourne, Australia

Journal of Statistics Education Volume 15, Number 3 (2007), www.amstat.org/publications/jse/v15n3/rolls.html

Copyright © 2007 by David A. Rolls all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the author and advance notification of the editor.

**Key Words:** convergence, Chernoff's Theorem, Cramér's Theorem, independent study, teaching, undergraduate statistics

In the branch of probability called "large deviations," rates of convergence (e.g. of the sample mean) are considered. The theory makes use of the moment generating function. So, particularly for sums of independent and identically distributed random variables, the theory can be made accessible to senior undergraduates after a first course in stochastic processes. This paper describes a directed independent study in large deviations offered to a strong senior, providing a sample outline and discussion of resources. Learning points are also highlighted.

Imagine flipping a fair coin many times. We model the result of flip *n* with random variable *X*_{n}, where "0" and "1" denote a head and a tail, respectively. For a fair coin,

*P*(*X*_{n} = 0) = *P*(*X*_{n} = 1) = 1/2.

For any *n* ≥ 1, let *S*_{n} = *X*_{1} + *X*_{2} + ⋯ + *X*_{n} denote the number of tails in the first *n* flips. Then *S*_{n}/*n* is the sample mean, and by the law of large numbers *S*_{n}/*n* → 1/2 as *n* → ∞.

An interesting question is *how quickly* does *P*(*S*_{n} ≥ *an*) converge to zero for fixed *a* > 1/2? The theory of large deviations answers this: the convergence is exponentially fast in *n*, and the theory identifies the exponential rate.
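The phenomenon can be previewed numerically: the tail probability is computable exactly from the binomial distribution and can be compared with the rate *I*(*a*) = *a* log *a* + (1 − *a*) log(1 − *a*) + log 2 that Lemma 2.1 below identifies. A short Python sketch (the value *a* = 0.6 is just an illustration):

```python
import math

def tail_prob(n, a):
    """Exact P(S_n >= a*n) for S_n ~ Binomial(n, 1/2) (fair coin)."""
    k0 = math.ceil(a * n)
    return sum(math.comb(n, k) for k in range(k0, n + 1)) / 2 ** n

def rate(a):
    """Large deviations rate I(a) = a log a + (1-a) log(1-a) + log 2."""
    return a * math.log(a) + (1 - a) * math.log(1 - a) + math.log(2)

a = 0.6
for n in (10, 100, 1000):
    # -(1/n) log P(S_n >= a*n) approaches I(a) from above as n grows
    print(n, -math.log(tail_prob(n, a)) / n, rate(a))
```

Watching the empirical exponent creep down toward *I*(0.6) ≈ 0.0201 makes the asymptotic nature of the result concrete.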

While usually taught at the graduate level, aspects of the theory can be made accessible to senior undergraduates, particularly when the sequence {*X*_{n}, *n* = 1, 2, ...} is independent and identically distributed (i.i.d.).

The rest of this paper is organized as follows. The remainder of Section 1 discusses a number of resources that are available in the area of large deviations. Section 2 provides more details and questions on the coin tossing example. Section 3 discusses Cramér's theorem, and describes some of the discussion points one could raise with an undergraduate. Section 4 discusses the project and how it progressed. Section 5 gives some concluding remarks, while Section 6 provides the outline for the project.

There are a number of references for large deviations, and the quality and level of the exposition vary. All four of the books listed here are texts, and include questions which can be assigned. Careful choices must be made to keep the difficulty appropriate.

A general introduction to the topic for discrete random variables can be found in Section 9 of Billingsley (1986). Generally that book is written at a more advanced level, but this optional section is something of an anomaly. Starting from the moment generating function, it builds up to a version of an important theorem, Chernoff's theorem, and gives an example in the context of statistical hypothesis testing.

The text by den Hollander (2000) is split roughly 50-50 between theory and applications. Chapter 1 (8 pages) provides introductory theory for i.i.d. sequences. It is rigorous without excessive detail. It does assume students know some basics about measures. For example, if *X* is a random variable with cumulative distribution function *F*, so that *F(x)* = *P(X* ≤ *x*), students are assumed to understand integrals like ∫_{R}*g(x)dF(x)*. This need not be a barrier to undergraduates, however. To make this notation accessible there are a couple of possibilities. They can be told this is simply *E*[*g*(*X*)]. Or, you can stipulate an assumption that *X* is (absolutely) continuous with density function ƒ so that ∫_{R}*g(x)dF(x)* = ∫_{R}*g(x)f(x)dx*. In this form the expected value is recognizable. Chapter 6 (4 pages) provides an accessible application to statistical hypothesis testing with a nice statement of the statistical problem of interest and a quick overview of Neyman-Pearson tests. Chernoff's theorem is used to find the exponential rate of decay of error probabilities with increasing sample size *n* in optimal Neyman-Pearson tests.

The text by Bucklew (1990) is written at an upper graduate student level with applications to information theory, communications theory or applied statistics. For undergraduates, the introduction (Bucklew 1990, pp. 1-4) is accessible and provides motivation. Two examples on Chernoff's theorem (Bucklew 1990, p. 35) are also accessible.

The text by Dembo & Zeitouni (1998) is a staple in large deviations. It is mathematical, thorough and goes well beyond the i.i.d. setting. So, most of the coverage is too advanced for use here. Even the notation could prove challenging. One section of the text, Section 2.2.1, deals with Cramér's Theorem in the i.i.d. setting. If a student is given a more accessible presentation beforehand, this section provides a glimpse into a more sophisticated presentation of the material. It is expected they would have to read and re-read the section, asking questions for clarification. To help decipher some notation, I told Rick that for a set *A*,

inf_{A} *I* = inf_{x∈A} *I*(*x*),

where *I* is the rate function.

Finally, I included two papers in the outline of the project. The paper by Weiss (1995) is a very accessible introduction, written for engineers. The paper by Chernoff (1952) illustrates the use of large deviations in the analysis of statistical hypothesis testing. In the project, reading Chernoff's paper is a choice. In fact, the texts above provide more accessible treatments of this application. While Rick was initially keen to read it, when it was time to choose later in the semester, he went with the text treatment.

The coin tossing example above leads to the following lemma.

**Lemma 2.1** (See den Hollander (2000)). *Let* {*X*_{n}, *n* = 1,2,...} *be i.i.d. with P*(*X*_{1} = 0) = *P*(*X*_{1} = 1) = 1/2 *and let S*_{n} = *X*_{1} + ⋯ + *X*_{n}. *Then for all a* ∈ (1/2, 1],

lim_{n→∞} (1/*n*) log *P*(*S*_{n} ≥ *an*) = −*I*(*a*),

*where I*(*a*) = *a* log *a* + (1 − *a*) log(1 − *a*) + log 2.

A direct proof of this result is possible and makes a good exercise. The interesting case has 0.5 < *a* ≤ 1, uses a counting argument, Stirling's approximation

*n*! ∼ *n*^{n}e^{−n}√(2π*n*) as *n* → ∞,

and the inequalities

C(*n*, *k*_{n}) ≤ Σ^{n}_{k=k_n} C(*n*, *k*) ≤ (*n* + 1) C(*n*, *k*_{n}),

where C(*n*, *k*) denotes the binomial coefficient and *k*_{n} = ⌈*an*⌉. (The maximum of C(*n*, *k*) over *an* ≤ *k* ≤ *n* is attained at *k* = ⌈*an*⌉.) A proof appears in den Hollander (2000, p. 5), but even that proof is sufficiently terse that students are sufficiently challenged if asked to give full details. For example, how does one correctly use the asymptotic result from Stirling's approximation? (It is *not true* that *n*! = *n*^{n}e^{−n}√(2π*n*) for each *n*.)
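One safe route, which a student can be guided toward, is to use Stirling's approximation in a form where the error is explicit and then take logarithms (treating *an* as an integer for simplicity):

```latex
n! = n^n e^{-n}\sqrt{2\pi n}\,(1+o(1))
\quad\Longrightarrow\quad
\frac{1}{n}\log n! = \log n - 1 + o(1),
\qquad\text{so that}\qquad
\frac{1}{n}\log\binom{n}{an}
  = \frac{1}{n}\left[\log n! - \log(an)! - \log((1-a)n)!\right]
  \longrightarrow -a\log a - (1-a)\log(1-a).
```

Combining this with the factor 2^{−n} and the counting bounds gives the limit −*I*(*a*). The point is that the approximation is only ever used through normalized logarithms, where the error term provably vanishes, never as an equality for a fixed *n*.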

The Fenchel-Legendre transform is

*I*(*x*) = sup_{θ} [θ*x* − log *M*(θ)],

where *M*(θ) = *E*[e^{θX_1}] is the moment generating function.

**Cramér's Theorem.** *Let* {*X*_{n}, *n* = 1,2,...} *be an i.i.d. sequence with mean* μ = *E*(*X*_{1}) *and moment generating function M*(θ) = *E*[e^{θX_1}] < ∞ *for all* θ. *Then for all a* > μ,

lim_{n→∞} (1/*n*) log *P*(*S*_{n} ≥ *an*) = −*I*(*a*).

*Proof*. See den Hollander (2000, p. 5).

This theorem includes the coin tossing example, but is much more general. Calculating *I*(*x*) for common distributions like Bernoulli(*p*), exponential(λ), Poisson(λ) and Normal(μ, σ^{2}) is a nice application of finding absolute maxima. Students, such as Rick, can be good at finding critical numbers, but may need reminding to justify that these are absolute maxima.
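For example, in the Normal(μ, σ²) case the calculation runs as follows:

```latex
M(\theta) = e^{\mu\theta + \sigma^2\theta^2/2}
\quad\Rightarrow\quad
\theta x - \log M(\theta) = \theta(x-\mu) - \tfrac{1}{2}\sigma^2\theta^2,
```

which has a critical number at θ* = (*x* − μ)/σ²; the second derivative −σ² < 0 confirms an absolute maximum, giving

```latex
I(x) = \theta^*(x-\mu) - \tfrac{1}{2}\sigma^2(\theta^*)^2
     = \frac{(x-\mu)^2}{2\sigma^2}.
```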

**Definition.** *A function f is lower semi-continuous if, whenever y*_{1}, *y*_{2},... *is a sequence of numbers such that y*_{n} → *y as n* → ∞,* lim inf_{n→∞} *f*(*y*_{n}) ≥ *f*(*y*).

The function *I* has a number of properties, including 1) *I* is non-negative, 2) *I* is lower semi-continuous, 3) *I*(*x*) = 0 iff *x* = *E*(*X*_{1}), and 4) *I* is convex. The level sets of *I* are the sets *I*^{-1}([0, *a*]) = {*x*: *I*(*x*) ≤ *a*}. Lower semi-continuity is equivalent to the level sets of *I* being closed. A proof of property (1) is trivial. A proof of property (2) is a nice application of the following lemma.

**Fatou's Lemma.** *If* {*Y*_{n}, *n* = 1,2,...} *is a sequence of non-negative valued random variables with finite mean, then*

*E*[lim inf_{n→∞} *Y*_{n}] ≤ lim inf_{n→∞} *E*[*Y*_{n}].

(Note that non-negativity in Fatou's lemma doesn't mean we can only consider non-negative sequences {*X*_{n}, *n* = 1,2,...}; the lemma is applied to a suitably chosen non-negative sequence.) The connection between limits and expectations deserves care, because the interchange

lim_{n→∞} *E*[*Y*_{n}] = *E*[lim_{n→∞} *Y*_{n}]

isn't necessarily true. Similarly, for a sequence of functions {ƒ_{n}, *n* = 1,2,...} where ƒ_{n} → ƒ pointwise, the interchange

lim_{n→∞} ∫ ƒ_{n}(*x*)*dx* = ∫ ƒ(*x*)*dx*

isn't necessarily true either.

These results are generally established with one of the "convergence theorems". Those theorems place stronger requirements on the sequence than Fatou's Lemma does. In this sense Fatou's Lemma is a lesser alternative: a weaker hypothesis, but a weaker result, and so it is used less often. But that weaker result is still enough to prove property (2)! For this first exposure, these convergence results need not be proved for the student. For a proof of property (3), Jensen's inequality is useful, which will probably require a hint.

By the properties of *I*, it is non-decreasing on (μ, ∞). The result of Cramér's Theorem can be written (den Hollander 2000, p. 10) as

lim_{n→∞} (1/*n*) log *P*(*S*_{n}/*n* ∈ *A*) = −inf_{x∈A} *I*(*x*) = −*I*(*a*), for *A* = [*a*, ∞) with *a* > μ.

It may be surprising to see that the probability is obtained from the minimizing value *x*^{*} in *A* which is closest to μ (i.e. the lower endpoint *a*). But this is the fundamental observation of large deviations: "rare events happen in the least unlikely of the unlikely ways".

Through this phase Rick worked hard, perhaps harder than he was used to. Through the entire semester we had a regularly-scheduled time each week when he would come by to ask questions, report progress or submit solutions. But I was accessible at other times too, and he would use those times to ask additional questions. This first phase generated the most questions from him, and led to the most face-to-face time. In some cases he asked my colleagues, whom he knew well, about supplemental texts for foundational probability or real analysis. He was challenged, but I think he rose to the occasion. Never did he appear discouraged.

The second phase, described in Section 6.5, was about 4 weeks long and involved computer simulation work. It was intentionally designed to be not too onerous since learning the necessary computer skills can take time. Rick chose to use R, with which he had some previous experience. Large deviation theory gives an exponential decay rate. The point here was to use simulations to demonstrate that rate. This required some mentoring, but far less involvement by me than in the other two phases. Since Rick had a lot of training in statistics he was able to give both point estimates and confidence intervals for that rate. Happily, all his confidence intervals covered the theoretical rate.
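Rick worked in R; as a language-agnostic illustration, here is a minimal Python sketch of the kind of estimate involved. The replication count, the single (*n*, *a*) pair, and the binomial normal-approximation confidence interval are illustrative choices of mine, not a reconstruction of Rick's actual code.

```python
import math
import random

random.seed(1)

def estimate_rate(n, a, mu=5.0, sigma=1.0, reps=100_000):
    """Monte Carlo point estimate and 95% CI for -(1/n) log P(S_n >= n*a),
    where S_n is a sum of n i.i.d. N(mu, sigma^2) variables."""
    hits = sum(
        sum(random.gauss(mu, sigma) for _ in range(n)) >= n * a
        for _ in range(reps)
    )
    p_hat = hits / reps
    se = math.sqrt(p_hat * (1 - p_hat) / reps)
    lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
    # -log(p)/n is decreasing in p, so the CI endpoints swap places
    return -math.log(p_hat) / n, -math.log(hi) / n, -math.log(lo) / n

point, lower, upper = estimate_rate(20, 5.5)
print(point, (lower, upper))
# Chernoff's theorem gives the asymptotic rate I(5.5) = (5.5 - 5)^2 / 2 = 0.125;
# for moderate n the finite-n exponent sits above it because of
# sub-exponential prefactors, which is itself a useful teaching point.
```

Repeating this over a range of *n* and plotting the estimates against the theoretical rate is essentially the phase 2 task.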

The final phase, Section 6.6, was the report. Rick's final report was 15 pages (double spaced) plus appendices for computer code and data. He chose to use Microsoft Word to prepare the document. The assignment is intentionally written to require some editorial decisions on what is worth including. It was intended that Rick describe what he learned in a careful mathematical way. This turned out to be a challenging phase, taking about three weeks and five drafts. Rick didn't need to write a senior thesis to graduate, so some of my comments were either about English writing in general or typical comments for someone with little to no experience writing reports in mathematics. In some cases, I had to prod him to include more of the work he had done. In addition, some teaching points from the first phase were clearly not mastered, and the report was an opportunity to reinforce them: Stirling's approximation is not an equality; Chernoff's theorem is an asymptotic result, not an inequality for every *n*; there is a difference between the rate function for a zero threshold Neyman-Pearson test and the test itself. I expected some points would not be mastered the first time around. This is one reason I assigned a written report over an oral presentation: an oral presentation doesn't lend itself to revisiting points for mastery. So the learning continued, basically right through until the last week.

With hindsight, I am reasonably happy with how phase 1 of the project developed. Finding the right balance of giving hints versus having a student try again is challenging but important if the student is to stay motivated. I tend to think Rick could have handled a bit more frustration from trying again.

I can see a few things I would do differently. To me, the written report is a valuable part of the assignment. A nice advantage is that, as far as I know, cut-and-paste from web resources isn't really possible. But it realistically requires 3-4 weeks for students with little writing experience, even with quick feedback on each draft, to achieve a report that a student could proudly show to other people. Reduced expectations might eliminate some iterations of the revision process. To provide the extra time, a week (maybe two) can be borrowed from the computer simulation time (phase 2) if a student is using software they have used before.

For the report, Word is cumbersome because of the volume and complexity of math content. LaTeX would be a natural choice, but the learning curve is too steep for a student with no previous experience. Next time, I would have a student try writing the report using "document mode" in the latest versions of Maple, assuming some prior familiarity with Maple. This allows presentation-quality documents to be prepared using a familiar Maple interface, syntax, symbol palettes and so on.

The use of the Normal distribution in phase 2 was a deliberate choice, but ultimately it wasn't fully exploited. In fact, the distribution of *S*_{n} is known exactly (*S*_{n} ~ N(5*n*, *n*)), and one can compare exact values of *P*(*S*_{n} ≥ *na*) with both the simulation estimates and the asymptotic rate from Chernoff's theorem.
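That comparison is easy to script. A sketch under the project's assumptions (*X*_{i} ~ N(5,1), so *S*_{n} ~ N(5*n*, *n*), with Cramér/Chernoff rate *I*(*a*) = (*a* − 5)²/2):

```python
import math

def exact_tail(n, a, mu=5.0):
    """Exact P(S_n >= n*a) when S_n ~ N(n*mu, n)."""
    z = (a - mu) * math.sqrt(n)      # standardized threshold
    return 0.5 * math.erfc(z / math.sqrt(2))

a = 5.5
rate = (a - 5.0) ** 2 / 2            # Cramer rate I(a) = 0.125
for n in (10, 100, 1000):
    p = exact_tail(n, a)
    print(n, -math.log(p) / n)       # approaches 0.125 from above
```

Seeing exactly how far the finite-*n* exponent sits above the asymptotic rate would have reinforced the point that Chernoff's theorem is an asymptotic result.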

Patrick Billingsley, *Probability and Measure*, 2nd ed., 1986. (pp. 142-149)

James A. Bucklew, *Large Deviation Techniques in Decision, Simulation, and Estimation*, 1990. (pp. 1-14, 91-93)

Amir Dembo, Ofer Zeitouni, *Large Deviations Techniques and Applications*, 1998. (Sections 2.2.1, 3.4) (more advanced reading)

Frank den Hollander, *Large Deviations*, 2000. (Chapters 1 and 6, a total of 15 pages)

H. Chernoff, "A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations", *Annals of Mathematical Statistics*, 23:493--507, 1952.

Alan Weiss, "An Introduction to Large Deviations for Communication Networks", *IEEE Journal on Selected Areas in Communications*, vol. 13, no. 6, August 1995 pp. 938--952. (Sections 1-5)

Read Bucklew pp. 1-4.

Read Weiss Section II.

- Suppose {*X*_{1}, *X*_{2}, ...} is an i.i.d. sequence with *P*(*X*_{i} = 0) = *P*(*X*_{i} = 1) = 0.5, so that *E*(*X*_{i}) = 0.5. Let *S*_{n} = Σ^{n}_{i=1} *X*_{i}. Use Stirling's approximation to *n*! and the Binomial Theorem to show for 0.5 < *x* ≤ 1

  lim_{n→∞} (1/*n*) log *P*(*S*_{n} ≥ *xn*) = −*I*(*x*),

  where *I*(*x*) = *x* log *x* + (1 − *x*) log(1 − *x*) + log 2.

Read Billingsley pp. 142-148, Weiss III-IVc.

- Compute the rate function (i.e. the Fenchel-Legendre transform) for these distributions: a) Bernoulli(*p*), b) exponential(λ), c) Poisson(10) and d) N(μ, σ^{2}).
- Use a computer package to plot the rate functions for these distributions: a) Bernoulli(0.5), b) exponential(10), c) Poisson(10) and d) N(5,1).
- Billingsley (Chapter 1, 9.2)
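For the plotting task a student needs numerical values of *I*(*x*). One possible sketch (in Python rather than R; the search bounds and the ternary-search approach are my choices, and for the exponential(λ) case the upper bound must stay below λ, since *M*(θ) is infinite for θ ≥ λ):

```python
import math

def legendre(x, log_mgf, lo=-50.0, hi=50.0, iters=200):
    """Evaluate I(x) = sup_theta [theta*x - log M(theta)] numerically.

    Ternary search works because theta -> theta*x - log M(theta)
    is concave wherever log M is finite."""
    f = lambda t: t * x - log_mgf(t)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return f((lo + hi) / 2)

# Log moment generating functions for two of the listed distributions
log_mgf_normal = lambda t: 5.0 * t + t * t / 2           # N(5, 1)
log_mgf_poisson = lambda t: 10.0 * (math.exp(t) - 1.0)   # Poisson(10)

print(legendre(6.0, log_mgf_normal))    # analytic value: (6-5)^2/2 = 0.5
print(legendre(12.0, log_mgf_poisson))  # analytic value: 12*log(1.2) - 2
```

A grid of *x* values fed through `legendre` gives the points to plot, and agreement with the closed-form answers from the previous exercise is a built-in check.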

Let *Y* be a random variable with moment generating function *M*(*t*). Show that *P*(*Y* > 0) = 0 implies that *P*(*Y* ≥ 0) = inf_{t} *M*(*t*).

Read Weiss IVd-V, Dembo & Zeitouni p.26 & Section 2.2.1

- a) State the Large Deviations Principle with rate function *I*(⋅) in the Cramér setting, in terms of closed sets *F* and open sets *G*.
- The Fenchel-Legendre transform is *I*(*x*) = sup_{θ} [θ*x* − log *M*(θ)], where *M*(θ) is the moment generating function.

  a) Show

  b) State the Large Deviations Principle with rate function *I*(⋅) in the Cramér setting for a set [*a*, ∞). (Hint: use Corollary 2.2.19.)

b) A probabilistic interpretation of Fatou's lemma states that if {*Y*_{n}, *n* = 1,2,...} is a sequence of non-negative valued random variables, then *E*[lim inf_{n→∞} *Y*_{n}] ≤ lim inf_{n→∞} *E*[*Y*_{n}].

Use Fatou's lemma to show *I*(*x*) is lower semi-continuous.

Read Billingsley pp. 148-149.

- Verify equation 9.24 in Billingsley.
- Consider the test following equation 9.27 in Billingsley. Imagine trying to distinguish between *p*_{1} = 0.2 and *p*_{2} = 0.2 + *t*, and also between *p*_{1} = 0.5 and *p*_{2} = 0.5 + *t*. Approximately, by what fraction can the sample size be smaller in the first case to achieve the same precision as in the second case?

- Read den Hollander (Chapter 6) and complete den Hollander exercise VI.6.

OR

- Read Chernoff (1952) and express in your own words the conclusion of Theorem 3.

For this part you'll work with the N(5,1) distribution.

- Suppose {*X*_{1}, *X*_{2}, ...} are i.i.d. random variables with a N(5,1) distribution, and *S*_{n} = Σ^{n}_{i=1} *X*_{i}. For any *a* > 5, what does Chernoff's theorem say about *P*(*S*_{n} ≥ *na*)?
- Using random number simulation, demonstrate the exponential decay and asymptotic rate described in Chernoff's theorem. That is, use simulations to approximate *P*(*S*_{n} ≥ *na*) and plot it versus *n* for values of *a* > 5 using appropriate axes, creating a different graph for each value of *a*.

Summarize the results of the exercises and computer simulation in a typed report of 8-10 pages.

Billingsley, P. (1986), *Probability and Measure*, Wiley Series in Probability and Mathematical Statistics, 2nd ed.,
John Wiley & Sons Inc., New York.

Bucklew, J. A. (1990), *Large Deviation Techniques in Decision, Simulation, and Estimation*, Wiley Series in
Probability and Mathematical Statistics: Applied Probability and Statistics, John Wiley & Sons Inc., New York.

Chernoff, H. (1952), "A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations", *Annals of Mathematical Statistics*, 23, 493-507.

Dembo, A. & Zeitouni, O. (1998), *Large Deviations Techniques and Applications*, Vol. 38 of *Applications of Mathematics (New York)*, Springer-Verlag, New York.

den Hollander, F. (2000), *Large Deviations*, Vol. 14 of *Fields Institute Monographs*, American
Mathematical Society, Providence, RI.

Weiss, A. (1995), "An introduction to large deviations for communication networks", *IEEE Journal on Selected Areas
in Communications*, 13, 938-952.

David A. Rolls

Department of Mathematics and Statistics

The University of Melbourne

Parkville, VIC 3010

Australia
*D.Rolls@ms.unimelb.edu.au*
