Yu-Liang Wu

An introduction to large deviation theory


1. Introduction

1.1. Overview

The aim of this lecture note is to give a very brief introduction to the theory of large deviations. In probability theory, a recurrent theme is the convergence of random variables, and the law of large numbers (LLN) is among the fundamental results concerning their “typical” behavior. But beyond this scope, mathematicians yearn for a more refined picture of the “atypical configurations” or “rare events” that deviate from the expected outcome either by a small or a large amount. In this pursuit, the former is addressed in the central limit theorem (CLT) while the latter gives rise to the large deviation principle (LDP). To demonstrate these, we would like to begin with the following example.

Example 1.1 (Bernoulli trial). A Bernoulli trial is a series of identical and independent experiments with exactly two possible outcomes, say $0$ and $1$, which may be modeled as a sequence of i.i.d. random variables $X_1, X_2, \dots$ with the Bernoulli distribution: $\mathbb{P}(X_i = 1) = p = 1 - \mathbb{P}(X_i = 0)$ for some $p \in (0, 1)$.

The core questions regarding these trials revolve around the number of $1$'s in an $n$-trial experiment, which is mathematically phrased as the random variable $S_n = X_1 + \cdots + X_n$. It is not hard to determine the distribution of $S_n$:

$$\mathbb{P}(S_n = k) = \binom{n}{k} p^k (1 - p)^{n - k}, \qquad k = 0, 1, \dots, n,$$

for which, here and throughout our study, we intend to examine the logarithm of the probability of the associated events. Specifically, by Stirling's approximation ($n! = \sqrt{2\pi n}\,(n/e)^n (1 + o(1))$),

$$\frac{1}{n} \log \mathbb{P}\bigl(S_n = \lfloor nx \rfloor\bigr) \xrightarrow[n \to \infty]{} -I_p(x), \qquad x \in [0, 1].$$

Asymptotically, given that $I_p$ is uniformly continuous on $[0, 1]$,

$$\lim_{n \to \infty} \frac{1}{n} \log \mathbb{P}\Bigl(\frac{S_n}{n} \in \Gamma\Bigr) = -\inf_{x \in \Gamma} I_p(x) \qquad \text{for intervals } \Gamma \subseteq [0, 1],$$

where

$$I_p(x) = x \log \frac{x}{p} + (1 - x) \log \frac{1 - x}{1 - p}$$

is plotted in Figure 1. In the language of large deviation theory, the sequence $(S_n/n)_{n \ge 1}$ is said to satisfy the large deviation principle (which will be defined rigorously later) with rate function $I_p$.

Some comments are in order.

  • $I_p$ is a convex function with

    $$I_p(x) \ge 0 \quad \text{and} \quad I_p(x) = 0 \iff x = p.$$

    In particular, its minimum is attained by a unique point $x = p$. This implies the law of large numbers by the Borel-Cantelli lemma.

  • Observe the second-order Taylor expansion

    $$I_p(x) = \frac{(x - p)^2}{2 p (1 - p)} + O\bigl(|x - p|^3\bigr).$$

    When considering a small deviation $x = p + t/\sqrt{n}$, this is “roughly” consistent with the central limit theorem:

    $$\mathbb{P}\bigl(S_n \approx np + t\sqrt{n}\bigr) \approx e^{-n I_p(p + t/\sqrt{n})} \approx e^{-t^2 / (2 p (1 - p))}.$$

Figure 1: rate function
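The asymptotics of Example 1.1 can be checked numerically. The sketch below is our own illustration (the values of $p$, $x$, and $n$ are chosen arbitrarily): it compares the normalized log-probability $\frac{1}{n}\log \mathbb{P}(S_n = \lfloor nx \rfloor)$ with $-I_p(x)$, where $I_p(x) = x\log(x/p) + (1-x)\log\bigl((1-x)/(1-p)\bigr)$.

```python
import math

# Bernoulli rate function I_p(x) = x log(x/p) + (1-x) log((1-x)/(1-p))
def rate(x, p):
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

# Exact log P(S_n = k) for S_n ~ Binomial(n, p), via log-gamma to avoid
# overflowing binomial coefficients.
def log_binom_pmf(n, k, p):
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

p, x, n = 0.3, 0.6, 20000
k = round(n * x)
normalized = log_binom_pmf(n, k, p) / n  # (1/n) log P(S_n = nx)
print(normalized, -rate(x, p))           # the two values agree to ~3 decimals
```

The residual gap between the two printed values is of order $\frac{1}{2n}\log n$, exactly the polynomial prefactor that Stirling's approximation discards.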

1.2. General settings

Throughout the note, we let $\{\mu_\varepsilon\}$ be a collection of probability measures on a common measurable space $(\mathcal{X}, \mathcal{B})$. The family is indexed by a set of non-negative real numbers $\varepsilon$ with an accumulation point at $0$, as we are primarily interested in the limiting behavior of these measures as $\varepsilon \to 0$. Within this framework, we assume that $\mathcal{X}$ is a Hausdorff topological space and that $\mathcal{B}$ contains the Borel $\sigma$-algebra $\mathcal{B}_{\mathcal{X}}$.

Definition 1.2 (rate function). Let $I : \mathcal{X} \to [0, \infty]$ be a function with sublevel sets $\Psi_I(\alpha) = \{x \in \mathcal{X} : I(x) \le \alpha\}$ and effective domain $\mathcal{D}_I = \{x \in \mathcal{X} : I(x) < \infty\}$.

  • $I$ is called a rate function if it is lower semicontinuous.
  • A rate function $I$ is said to be good if $\Psi_I(\alpha)$ is compact for all $\alpha \ge 0$.
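As a concrete illustration (our own example, not from the text): take $\mathcal{X} = \mathbb{R}$ and the Gaussian-type rate function $I(x) = x^2/2$; writing $\Psi_I(\alpha)$ for the sublevel set of Definition 1.2,

```latex
\Psi_I(\alpha) \;=\; \Bigl\{\, x \in \mathbb{R} : \tfrac{x^2}{2} \le \alpha \,\Bigr\}
            \;=\; \bigl[\, -\sqrt{2\alpha},\; \sqrt{2\alpha} \,\bigr],
\qquad \alpha \ge 0,
```

which is compact for every $\alpha$, so $I$ is a good rate function. Restricted to the open half-line $(0, \infty)$, the same $I$ remains a rate function but is no longer good, since the sublevel sets lose compactness.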

Definition 1.3 (large deviation principle). The family $\{\mu_\varepsilon\}$ is said to satisfy the large deviation principle with rate $I$ if for every $\Gamma \in \mathcal{B}$,

$$-\inf_{x \in \Gamma^{\circ}} I(x) \le \liminf_{\varepsilon \to 0} \varepsilon \log \mu_\varepsilon(\Gamma) \le \limsup_{\varepsilon \to 0} \varepsilon \log \mu_\varepsilon(\Gamma) \le -\inf_{x \in \overline{\Gamma}} I(x). \tag{1.1}$$

This definition of large deviation principle provides the most general framework for our discussion. In particular, one can further adapt the definition for any sequence of random variables as follows: The sequence is said to satisfy the large deviation principle if the associated family of probability laws satisfies the large deviation principle. This convention is applied to all statements concerning random variables.

The rationale behind our specific settings is tied directly to the following characterizations of the large deviation principle. By assuming $\mathcal{B} \supseteq \mathcal{B}_{\mathcal{X}}$, we immediately have the following equivalent statement.

Remark 1.4. Suppose $\mathcal{B} \supseteq \mathcal{B}_{\mathcal{X}}$. Then, $\{\mu_\varepsilon\}$ satisfies the large deviation principle with rate $I$ if and only if the following hold.

  • (upper bound) For any closed set $F \subseteq \mathcal{X}$,

    $$\limsup_{\varepsilon \to 0} \varepsilon \log \mu_\varepsilon(F) \le -\inf_{x \in F} I(x). \tag{1.2}$$
  • (lower bound) For any open set $G \subseteq \mathcal{X}$,

    $$\liminf_{\varepsilon \to 0} \varepsilon \log \mu_\varepsilon(G) \ge -\inf_{x \in G} I(x). \tag{1.3}$$

As for the Hausdorff assumption, it ensures that compact sets are well-behaved, particularly concerning the notion of exponential tightness.

Definition 1.5. Suppose $\mathcal{B}$ contains all compact subsets of $\mathcal{X}$. The family $\{\mu_\varepsilon\}$ is exponentially tight if for every $\alpha < \infty$, there exists a compact set $K_\alpha \subseteq \mathcal{X}$ such that

$$\limsup_{\varepsilon \to 0} \varepsilon \log \mu_\varepsilon(K_\alpha^c) < -\alpha.$$

As every compact set is closed in a Hausdorff space, we can relax the LDP conditions under the assumption of exponential tightness as follows:

Proposition 1.6. Suppose $\mathcal{B} \supseteq \mathcal{B}_{\mathcal{X}}$. If $\{\mu_\varepsilon\}$ is exponentially tight, then the following hold.

  • (upper bound) If (1.2) holds for every compact set, so does it for every closed set.
  • (lower bound) If (1.3) holds for every open set, then the rate function $I$ is good.

Notably, any $\{\mu_\varepsilon\}$ admitting a rate function that satisfies the relaxed conditions is said to satisfy the weak LDP.

Proof. Suppose $\{\mu_\varepsilon\}$ is exponentially tight and let $F$ be a closed set. For all $\alpha < \infty$, there exists a compact set $K_\alpha$ such that

$$\limsup_{\varepsilon \to 0} \varepsilon \log \mu_\varepsilon(K_\alpha^c) < -\alpha.$$

Hence, since $\mu_\varepsilon(F) \le \mu_\varepsilon(F \cap K_\alpha) + \mu_\varepsilon(K_\alpha^c)$,

$$\limsup_{\varepsilon \to 0} \varepsilon \log \mu_\varepsilon(F) \le \max\Bigl\{-\inf_{x \in F \cap K_\alpha} I(x),\ -\alpha\Bigr\} \le \max\Bigl\{-\inf_{x \in F} I(x),\ -\alpha\Bigr\} \xrightarrow[\alpha \to \infty]{} -\inf_{x \in F} I(x),$$

where the first inequality applies (1.2) to $F \cap K_\alpha$, which is compact due to the Hausdorff assumption.

With exponential tightness and the lower bound for open sets, we can find for every $\alpha < \infty$ a compact set $K_\alpha$ such that

$$-\inf_{x \in K_\alpha^c} I(x) \le \liminf_{\varepsilon \to 0} \varepsilon \log \mu_\varepsilon(K_\alpha^c) \le \limsup_{\varepsilon \to 0} \varepsilon \log \mu_\varepsilon(K_\alpha^c) < -\alpha,$$

where $K_\alpha^c$ is open due to the Hausdorff assumption. This naturally implies

$$\Psi_I(\alpha) = \{x : I(x) \le \alpha\} \subseteq K_\alpha,$$

which is a closed subset due to lower semicontinuity and therefore compact due to the Hausdorff assumption.⁠

2. Cramér’s theorem in $\mathbb{R}$

In this section, we establish the Large Deviation Principle for the empirical mean of i.i.d. random variables. We begin by introducing the primary tools of our analysis: the logarithmic moment generating function and its Fenchel-Legendre transform. Let $X_1, X_2, \dots$ be i.i.d. real-valued random variables with law $\mu$, and let $\mu_n$ be the law of the empirical mean $\hat{S}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$. We denote the expectation of $X_1$ by $\bar{x} = \mathbb{E}[X_1]$, whenever it is well-defined.

Definition 2.1. The logarithmic moment generating function (log MGF) associated with the law $\mu$ is defined as

$$\Lambda(\lambda) = \log \mathbb{E}\bigl[e^{\lambda X_1}\bigr] = \log \int_{\mathbb{R}} e^{\lambda x}\, \mu(dx). \tag{2.1}$$

Definition 2.2. The Fenchel-Legendre transform of $\Lambda$ is

$$\Lambda^*(x) = \sup_{\lambda \in \mathbb{R}} \bigl\{\lambda x - \Lambda(\lambda)\bigr\}. \tag{2.2}$$

Example 2.3. There is a geometric interpretation of the Fenchel-Legendre transform of $\Lambda$; that is, $-\Lambda^*(x)$ is the supremum of the $y$-intercepts of all lines with slope $x$ that lie below the graph of $\Lambda$.

Figure 2: rate function
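The supremum in (2.2) can also be probed numerically. The sketch below is our own illustration (the grid bounds and step count are arbitrary choices): it takes the standard normal log MGF $\Lambda(\lambda) = \lambda^2/2$, whose transform is known in closed form to be $\Lambda^*(x) = x^2/2$, and evaluates the supremum on a grid.

```python
# Grid search for Lambda*(x) = sup_lambda { lambda * x - Lambda(lambda) },
# illustrated with the standard normal log MGF Lambda(lam) = lam^2 / 2.
def legendre(Lam, x, lo=-10.0, hi=10.0, steps=100001):
    step = (hi - lo) / (steps - 1)
    return max(lam * x - Lam(lam)
               for lam in (lo + i * step for i in range(steps)))

Lam = lambda lam: lam * lam / 2.0
for x in (-1.5, 0.0, 2.0):
    print(x, legendre(Lam, x))  # matches x**2 / 2 up to grid error
```

Since the maximized function is concave in $\lambda$, the grid error is quadratic in the step size, so even this naive search recovers $\Lambda^*$ to high accuracy.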

 

For the empirical means, the large deviation principle takes the following form:

Theorem 2.4 (Cramér). Let $\Lambda$, $\Lambda^*$, $\mu$, and $\mu_n$ be as defined on $\mathbb{R}$. Then, $\{\mu_n\}$ satisfies the LDP with the convex rate function $\Lambda^*$, namely,

  1. For any closed set $F \subseteq \mathbb{R}$, $\displaystyle \limsup_{n \to \infty} \frac{1}{n} \log \mu_n(F) \le -\inf_{x \in F} \Lambda^*(x)$.
  2. For any open set $G \subseteq \mathbb{R}$, $\displaystyle \liminf_{n \to \infty} \frac{1}{n} \log \mu_n(G) \ge -\inf_{x \in G} \Lambda^*(x)$.

Furthermore,

  3. If $0 \in \mathcal{D}_\Lambda^{\circ}$, then $\Lambda^*$ is good.

To motivate the proof, we first examine the role of the log MGF. For real-valued random variables, Markov's inequality provides the key estimate for the LDP upper bound (1.2). Specifically, for any $\lambda \ge 0$ and $x \in \mathbb{R}$:

$$\mu_n([x, \infty)) = \mathbb{P}(\hat{S}_n \ge x) \le e^{-n \lambda x}\, \mathbb{E}\bigl[e^{n \lambda \hat{S}_n}\bigr] = e^{-n (\lambda x - \Lambda(\lambda))}.$$

Similarly, for any $\lambda \le 0$,

$$\mu_n((-\infty, x]) \le e^{-n (\lambda x - \Lambda(\lambda))}.$$
These observations suggest that is the natural candidate for the rate function. The following lemma summarizes the key properties of and .

Lemma 2.5. Let $\Lambda$ and $\Lambda^*$ be as defined. Then,

  1. $\Lambda$ is a convex function and $\Lambda^*$ is a convex rate function.
  2. Either of the following holds.

    • If $\Lambda(\lambda) < \infty$ only when $\lambda = 0$, then $\Lambda^*$ is identically zero.
    • If $\Lambda(\lambda) < \infty$ for some $\lambda > 0$ (respectively, $\lambda < 0$), then $\bar{x} < \infty$ (respectively, $\bar{x} > -\infty$) is well-defined. Under the circumstances, for all $x \ge \bar{x}$ (respectively, $x \le \bar{x}$),

      $$\Lambda^*(x) = \sup_{\lambda \ge 0} \bigl\{\lambda x - \Lambda(\lambda)\bigr\} \quad \Bigl(\text{respectively, } \sup_{\lambda \le 0}\Bigr), \tag{2.3}$$

      which satisfies

      • $\Lambda^*$ is decreasing on $(-\infty, \bar{x}]$ and increasing on $[\bar{x}, \infty)$, and
      • $\inf_{x \in \mathbb{R}} \Lambda^*(x) = 0$, and if $\bar{x}$ is finite, then $\Lambda^*(\bar{x}) = 0$.
  3. $\Lambda$ is differentiable in $\mathcal{D}_\Lambda^{\circ}$ with

     $$\Lambda'(\eta) = \frac{\mathbb{E}\bigl[X_1 e^{\eta X_1}\bigr]}{\mathbb{E}\bigl[e^{\eta X_1}\bigr]},$$

     and $\Lambda'(\eta) = y$ implies $\Lambda^*(y) = \eta y - \Lambda(\eta)$.

  4. If $0 \in \mathcal{D}_\Lambda^{\circ}$, then $\Lambda^*$ is a good rate function. Moreover, if $\mathcal{D}_\Lambda = \mathbb{R}$, then $\Lambda^*(x)/|x| \to \infty$ as $|x| \to \infty$.

Proof. (1) By Hölder's inequality, given any $\theta \in [0, 1]$ and any $\lambda_1, \lambda_2$ satisfying $\lambda = \theta \lambda_1 + (1 - \theta) \lambda_2$,

$$\Lambda(\lambda) = \log \mathbb{E}\bigl[(e^{\lambda_1 X_1})^{\theta} (e^{\lambda_2 X_1})^{1 - \theta}\bigr] \le \theta \Lambda(\lambda_1) + (1 - \theta) \Lambda(\lambda_2),$$

proving the convexity of $\Lambda$. The convexity of $\Lambda^*$ follows from definition:

$$\Lambda^*\bigl(\theta x_1 + (1 - \theta) x_2\bigr) = \sup_{\lambda} \bigl\{\theta (\lambda x_1 - \Lambda(\lambda)) + (1 - \theta)(\lambda x_2 - \Lambda(\lambda))\bigr\} \le \theta \Lambda^*(x_1) + (1 - \theta) \Lambda^*(x_2).$$

To prove the lower semicontinuity, observe that for any $\lambda$ and any sequence $x_n \to x$,

$$\liminf_{n \to \infty} \Lambda^*(x_n) \ge \liminf_{n \to \infty} \bigl(\lambda x_n - \Lambda(\lambda)\bigr) = \lambda x - \Lambda(\lambda),$$

implying $\liminf_{n \to \infty} \Lambda^*(x_n) \ge \Lambda^*(x)$. The non-negativity of $\Lambda^*$ follows from the fact $\Lambda(0) = 0$.

(2) The case $\mathcal{D}_\Lambda = \{0\}$ is automatic. If $\Lambda(\lambda) < \infty$ for some $\lambda > 0$, then, due to the fact that $e^{\lambda x} \ge \lambda x$ for all $x \ge 0$,

$$\mathbb{E}[X_1^+] \le \frac{1}{\lambda}\, \mathbb{E}\bigl[e^{\lambda X_1^+}\bigr] \le \frac{1}{\lambda}\bigl(1 + \mathbb{E}\bigl[e^{\lambda X_1}\bigr]\bigr) < \infty,$$

meaning both $\mathbb{E}[X_1^+] < \infty$ and $\bar{x} \in [-\infty, \infty)$ are well-defined. Hence, by Jensen's inequality,

$$\Lambda(\lambda) = \log \mathbb{E}\bigl[e^{\lambda X_1}\bigr] \ge \lambda \bar{x}.$$

The argument proceeds similarly for $\bar{x} \in (-\infty, \infty]$ whenever $\Lambda(\lambda) < \infty$ for some $\lambda < 0$.

We then proceed to prove (2.3) and its related properties. Since now $\bar{x}$ is well-defined, by Jensen's inequality, for all $x \ge \bar{x}$ and $\lambda \le 0$ (similar for $x \le \bar{x}$ and $\lambda \ge 0$),

$$\lambda x - \Lambda(\lambda) \le \lambda x - \lambda \bar{x} = \lambda (x - \bar{x}) \le 0 \le \Lambda^*(x), \tag{2.4}$$

from which (2.3) follows. Moreover, (2.3) implies the monotonicity on $(-\infty, \bar{x}]$ and $[\bar{x}, \infty)$. Finally, if $\bar{x}$ is finite, then by (2.4), $\Lambda^*(\bar{x}) = 0$. If $\bar{x} = -\infty$ (similar for $\bar{x} = +\infty$), we deduce from Markov's inequality that for all $x$ and $\lambda \ge 0$,

$$\lambda x - \Lambda(\lambda) \le \lambda x - \log\bigl(e^{\lambda x}\, \mu([x, \infty))\bigr) = -\log \mu([x, \infty)), \quad \text{hence} \quad \Lambda^*(x) \le -\log \mu([x, \infty)) \xrightarrow[x \to -\infty]{} 0,$$

from which it follows that $\inf_{x \in \mathbb{R}} \Lambda^*(x) = 0$, as desired.

(3) The differentiability follows from the dominated convergence theorem: for $\eta \in \mathcal{D}_\Lambda^{\circ}$, the difference quotients of $e^{\eta X_1}$ are dominated by an integrable function for all sufficiently small increments, so we may differentiate under the expectation to derive the derivative of $\Lambda$. Finally, the function $\lambda \mapsto \lambda y - \Lambda(\lambda)$ is concave, and thus $\Lambda'(\eta) = y$ implies that the supremum in (2.2) is attained at $\lambda = \eta$, proving the proposed property.

(4) Suppose $\mathcal{D}_\Lambda^{\circ}$ is a non-degenerate interval containing $0$. Then, for $\lambda_- < 0 < \lambda_+$ in $\mathcal{D}_\Lambda$,

$$\Lambda^*(x) \ge \max\{\lambda_+ x,\ \lambda_- x\} - \max\{\Lambda(\lambda_+),\ \Lambda(\lambda_-)\},$$

implying $\Lambda^*(x) \to \infty$ as $|x| \to \infty$. Hence, the sublevel set $\Psi_{\Lambda^*}(\alpha)$ is closed and bounded for all $\alpha$, and $\Lambda^*$ is good. When $\mathcal{D}_\Lambda = \mathbb{R}$, the result follows by letting $\lambda_{\pm} \to \pm\infty$.⁠

Proof of Theorem 2.4. (1) For brevity, let . If , then the inequality is trivial. Hence, we assume . Under the circumstances, the numbers and are different from . By Markov’s inequality, for all and ,

By Lemma 2.5,

Taking the normalized logarithm and letting yields the desired inequality.

(2) It suffices to show that for all measures and all ,

(2.5)

for once this is proved, one may simply consider the translation to deduce that the log MGF and that , which in turn imply

This proves the desired lower bound.

To prove (2.5), assume first that (a) , (b) , and (c) is boundedly supported. Under the circumstances, (a) and (b) imply as , and (c) implies is finite on . Consequently, by Lemma 2.5, there exists such that

Define a probability measure by

Observe that

By Lemma 2.5,

Hence, by the law of large numbers,

Hence, for all ,

proving (2.5) by letting .

Now, if the support of is not bounded, fix such that (a) , (b) . Hence, by letting be the normalized law of on , we have that

and that the log MGF associated with is

Hence, by the case of bounded support,

and therefore, by writing and ,

(2.6)

Since is increasing in , so is . Therefore, and the level sets are decreasing compact sets, admitting some in their intersection. By the monotone convergence theorem,

(2.7)

Combining (2.6) and (2.7) proves (2.5).

Finally, if or , then is monotone and . The bound then yields (2.5).

(3) It follows from Lemma 2.5(4).⁠ 

3. Gärtner-Ellis theorem

Let $\Lambda_n$ be the log MGF associated with the $d$-dimensional real random variables $Z_n$ with laws $\mu_n$, which can be defined as

$$\Lambda_n(\lambda) = \log \mathbb{E}\bigl[e^{\langle \lambda, Z_n \rangle}\bigr], \qquad \lambda \in \mathbb{R}^d.$$

The Gärtner-Ellis theorem states the following.

Definition 3.1. A convex function $\Lambda : \mathbb{R}^d \to (-\infty, \infty]$ is essentially smooth if:

  1. $\mathcal{D}_\Lambda^{\circ}$ is nonempty.
  2. $\Lambda$ is differentiable throughout $\mathcal{D}_\Lambda^{\circ}$.
  3. $\Lambda$ is steep, namely, $|\nabla \Lambda(\lambda_n)| \to \infty$ whenever $\{\lambda_n\}$ is a sequence in $\mathcal{D}_\Lambda^{\circ}$ converging to a boundary point of $\mathcal{D}_\Lambda^{\circ}$.

Theorem 3.2 (Gärtner-Ellis). Suppose that $\bar{\Lambda}(\lambda) := \lim_{n \to \infty} \frac{1}{n} \Lambda_n(n \lambda)$ exists for all $\lambda \in \mathbb{R}^d$ as an extended real number and $0 \in \mathcal{D}_{\bar{\Lambda}}^{\circ}$.

  1. For any closed set $F \subseteq \mathbb{R}^d$,

     $$\limsup_{n \to \infty} \frac{1}{n} \log \mu_n(F) \le -\inf_{x \in F} \bar{\Lambda}^*(x).$$

  2. For any open set $G \subseteq \mathbb{R}^d$,

     $$\liminf_{n \to \infty} \frac{1}{n} \log \mu_n(G) \ge -\inf_{x \in G \cap \mathcal{F}} \bar{\Lambda}^*(x),$$

     where $\mathcal{F}$ is the set of exposed points of $\bar{\Lambda}^*$ admitting an exposing hyperplane belonging to $\mathcal{D}_{\bar{\Lambda}}^{\circ}$.

  3. If $\bar{\Lambda}$ is an essentially smooth, lower semicontinuous function, then the LDP holds with the good rate function $\bar{\Lambda}^*$.

The proof of the third statement relies on several results from convex analysis; for clarity, we state these results here and defer their proofs.

Lemma 3.3. Under the same assumption as in Theorem 3.2, the following hold.

  1. $\bar{\Lambda}$ is a convex function with $\bar{\Lambda} > -\infty$ everywhere.
  2. $\bar{\Lambda}^*$ is a good convex rate function.
  3. Suppose that $y = \nabla \bar{\Lambda}(\eta)$ for some $\eta \in \mathcal{D}_{\bar{\Lambda}}^{\circ}$. Then $\bar{\Lambda}^*(y) = \langle \eta, y \rangle - \bar{\Lambda}(\eta)$. Moreover $y \in \mathcal{F}$, with $\eta$ being the exposing hyperplane for $y$.
Lemma 3.4 (Rockafellar). If $\Lambda$ is an essentially smooth, lower semicontinuous, convex function, then $\operatorname{ri}(\mathcal{D}_{\Lambda^*}) \subseteq \mathcal{F}$, the set of exposed points of $\Lambda^*$ with an exposing hyperplane in $\mathcal{D}_\Lambda^{\circ}$.

With these tools, we can now proceed with the proof of the Gärtner-Ellis theorem. For convenience, we first introduce the following auxiliary function:

Definition 3.5. The $\delta$-rate function associated with a rate function $I$ is a function $I^{\delta}$ defined as

$$I^{\delta}(x) = \min\bigl\{I(x) - \delta,\ 1/\delta\bigr\}. \tag{3.1}$$

Proof of Theorem 3.2. (1) It suffices to prove the inequality for all compact sets and that is exponentially tight.

The upper bound for compact sets follows essentially from Markov’s inequality. Choose for each a and an open neighborhood of such that

where is the -rate function associated with as defined in (3.1). Applying Markov’s inequality yields

which in turn implies

Now that is a compact set, one can find a finite cover with such that

proving the theorem by letting .

It is left to demonstrate the exponential tightness of $\{\mu_n\}$, which reduces to the exponential tightness of the marginals on each coordinate and follows essentially from Lemma 2.5(4). Let be sufficiently small so that . By Markov's inequality, we have that for all ,

(3.2)

where is the -th vector in the standard basis and the right-hand side converges to as . The same argument applies in the opposite direction. The two estimates combined prove the exponential tightness.

(2) The case for some is trivial, for everywhere. Assuming for all , one may adopt the change of measure argument as before in the following manner. Let be any fixed open set, , , and be an exposing hyperplane for . By continuity of , we choose an open neighborhood of such that

Now that is well-defined for every sufficiently small , we define a probability measure equivalent to :

so that

yielding

It remains to show that . To this end, we claim that

(3.3)
(3.4)

which together imply the desired estimate. To prove (3.3), define the function

and observe that

One deduces from Markov's inequality that

since is lower semicontinuous and is an exposing hyperplane. For (3.4), observe that , so the inequality follows from (3.2).

(3) With Lemma 3.3 and Lemma 3.4, we have that for every open set ,

⁠ 

 

We now prove the lemmas.

Proof of Lemma 3.3. (1) The convexity follows from Hölder’s inequality as in Lemma 2.5. Precisely, given any satisfying ,

proving the convexity of and hence .

If for some , then by convexity for all . Since , it follows by convexity that for all , contradicting the assumption that . Thus, everywhere.

(2) Since , it follows that for some , and since the convex function is continuous in . Therefore,

implying is bounded for every . The function is convex and lower semicontinuous by a routine check as conducted in Lemma 2.5. Combining these implies that is a good convex rate function.

(3) Suppose now that for some ,

Then, for every ,

In particular,

Since the inequality holds for all , . Hence, is an exposed point of with exposing hyperplane .⁠ 

 

Before proving Rockafellar’s lemma, let us recall the definition of relative interior of a convex set.

Definition 3.6. The relative interior of a nonempty convex set $C \subseteq \mathbb{R}^d$ is defined as

$$\operatorname{ri}(C) = \bigl\{x \in C : \text{for all } y \in C, \text{ there exists } \varepsilon > 0 \text{ such that } x - \varepsilon (y - x) \in C\bigr\}.$$

Proof of Lemma 3.4. Assume without loss of generality that for the lemma is vacuous otherwise. Under the circumstances, fix henceforth and define a function

Observe that is convex, lower semicontinuous, and . Moreover, . Therefore, from it follows that . By Lemma C.2, there exists such that . Let , so that is an essentially smooth, convex function and . Consequently, by Lemma C.3, is finite in a neighborhood of the origin and thus , at which is differentiable by the assumption. Hence, , implying that , i.e., . It now follows from Lemma 3.3(2) that . Since is arbitrary, the proof is complete.⁠

 

The following theorem is a generalization of the Gärtner-Ellis theorem (Theorem 3.2), which essentially follows from the same proof with marginal distributions replaced by projection measures in every direction. Its proof can be found in [1].

Theorem 3.7 (Baldi). Suppose is an exponentially tight family of probability measures on .

  1. For every closed set , .
  2. Let be the set of exposed points of with an exposing hyperplane for which

    (3.5)

    Then, for every open set ,

  3. If for every open set , , then satisfies the LDP with the good rate function .
Proof. See [1, Theorem 4.5.20].⁠ 

The theorem is accompanied by a generalized criterion of differentiability of in the following sense.

Definition 3.8 (Gateaux differentiable). Let be a topological vector space. A function is said to be Gateaux differentiable if for all , the function is differentiable at .
Corollary 3.9. Let be exponentially tight probability measures on the Banach space . Suppose that is finite-valued, Gateaux differentiable, and lower semicontinuous in with respect to the weak* topology. Then satisfies the LDP with the good rate function .
Proof. See [1, Corollary 4.5.27].⁠ 

4. Applications

4.1. Cramér’s theorem in $\mathbb{R}^d$

Combining the Gärtner-Ellis theorem with our previous results, we can establish the LDP for $d$-dimensional i.i.d. random variables. As before, let $\mu_n$ be the law of the empirical mean, and let $\Lambda$ and $\Lambda^*$ be defined accordingly.

Theorem 4.1 (Cramér). Suppose $\mathcal{D}_\Lambda = \mathbb{R}^d$. Then the family $\{\mu_n\}$ satisfies the LDP with the convex good rate function $\Lambda^*$.

While powerful, this theorem does not encompass all scenarios where an LDP holds; a refined version requires only that $0 \in \mathcal{D}_\Lambda^{\circ}$ for the result to remain valid.

Example 4.2. Let be a Borel probability measure on with a density and be its log MGF, where is the normalizing constant. We will show in the following that , so the large deviation principle holds with a good rate function; however, is not steep and hence the Gärtner-Ellis theorem does not apply.

On one hand, with the aid of the Cauchy-Schwarz inequality, we realize that the estimate

holds and therefore . On the other hand, given any , we have that along direction ,

is unbounded as if , implying . Finally, we have the uniform bound

Therefore, within the interior of ,

This proves that is not steep.

4.2. Large deviations for finite state Markov chains

In this section, we study the large deviations of Markov chains on a finite alphabet $\Sigma$. Let $\pi = (\pi(i, j))_{i, j \in \Sigma}$ be a stochastic matrix, and consider a Markov chain $(Y_n)_{n \ge 0}$ with transition probability $\mathbb{P}(Y_{n+1} = j \mid Y_n = i) = \pi(i, j)$. We denote by $\mathbb{P}_x$ the probability law of the chain starting at state $x$:

We are interested in the empirical mean

where for a given function . The log MGF can be analyzed using the following family of non-negative matrices:

Theorem 4.3 (Perron-Frobenius). Let $B$ be an irreducible non-negative matrix. Then $B$ possesses an eigenvalue $\rho$ (called the Perron–Frobenius eigenvalue) such that:

  1. $\rho$ is real and positive.
  2. For any eigenvalue $\lambda$ of $B$, $|\lambda| \le \rho$.
  3. There exist left and right eigenvectors corresponding to the eigenvalue $\rho$ that have strictly positive coordinates.
  4. The left and right eigenvectors $u$, $v$ corresponding to the eigenvalue $\rho$ are unique up to a constant multiple.
  5. For every index $i$ and every vector $z$ such that $z_j > 0$ for all $j$,

     $$\lim_{n \to \infty} \frac{1}{n} \log (B^n z)_i = \log \rho.$$
Proof. See, for example, Wikipedia.⁠ 
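Theorem 4.3 can be checked numerically via power iteration. The sketch below is our own illustration (the matrix is chosen arbitrarily): iterating $v \mapsto Bv$ with renormalization converges to the positive eigenvector, and the ratio $(Bv)_i / v_i$ to the Perron–Frobenius eigenvalue.

```python
# Power iteration for the Perron-Frobenius eigenvalue of an irreducible
# non-negative matrix: renormalized iterates of B v converge to the positive
# eigenvector, and (B v)_i / v_i converges to the eigenvalue rho.
def perron(B, iters=500):
    n = len(B)
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
        s = sum(w)                  # renormalize to keep the iterate bounded
        v = [wi / s for wi in w]
    Bv = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
    return Bv[0] / v[0], v          # eigenvalue estimate, positive eigenvector

B = [[0.0, 2.0], [1.0, 1.0]]        # irreducible; eigenvalues are 2 and -1
rho, v = perron(B)
print(rho, v)                       # rho is close to 2; v is entrywise positive
```

Convergence is geometric at rate $|\lambda_2|/\rho$, the ratio of the second-largest eigenvalue modulus to $\rho$, which is $1/2$ for this matrix.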

Theorem 4.4. Let be a finite state Markov chain possessing an irreducible transition matrix . For every , define

Then the empirical mean satisfies the LDP with the convex, good rate function . Explicitly, for any set , and any initial state ,

Proof. Consider the logarithmic generating function

By Gärtner–Ellis theorem (Theorem 3.2), it is enough to check that the limit

exists, is finite and differentiable everywhere in , and satisfies . To begin, note that

Since is irreducible, we have

To show that it is differentiable, we apply the implicit function theorem. Explicitly, consider the functions , parametrized by , defined by

Clearly, is continuously differentiable. Hence, it suffices to show that for every with the associated Perron-Frobenius eigenvalue and left and right eigenvectors and of satisfying , we have (1) and (2) the Jacobian matrix

is invertible. Indeed, (1) is clear, and (2) holds because if it did not, there would exist such that

which is impossible. This implies, by the implicit function theorem, that there exists a continuously differentiable function on a neighborhood of such that . In particular, is continuously differentiable on a neighborhood of . The proof is finished by noting that is arbitrary.⁠ 
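The scaled limit in the proof of Theorem 4.4 can be illustrated numerically. The following sketch is our own (the two-state chain, the function values `f`, and all variable names are assumptions): it computes $\frac{1}{n} \log \mathbb{E}_x\bigl[e^{\lambda \sum_{k \le n} f(Y_k)}\bigr]$ by iterating the tilted matrix with entries $\pi(i, j)\, e^{\lambda f(j)}$ on the all-ones vector. For a degenerate chain whose rows are identical, the steps are i.i.d., so the limit must equal the i.i.d. log MGF.

```python
import math

# (1/n) log E_x[ exp(lam * (f(Y_1) + ... + f(Y_n))) ], computed by iterating
# the tilted matrix P[i][j] = pi[i][j] * exp(lam * f[j]) on the all-ones
# vector, with rescaling to avoid overflow.
def scaled_log_mgf(pi, f, lam, n=400, x=0):
    m = len(pi)
    P = [[pi[i][j] * math.exp(lam * f[j]) for j in range(m)] for i in range(m)]
    v, logscale = [1.0] * m, 0.0
    for _ in range(n):
        v = [sum(P[i][j] * v[j] for j in range(m)) for i in range(m)]
        s = max(v)
        v, logscale = [vi / s for vi in v], logscale + math.log(s)
    return (logscale + math.log(v[x])) / n

pi = [[0.5, 0.5], [0.5, 0.5]]   # identical rows: the chain is an i.i.d. fair coin
f = [0.0, 1.0]
val = scaled_log_mgf(pi, f, 1.0)
print(val, math.log((1 + math.e) / 2))  # the two values agree
```

For a genuinely irreducible chain the same computation converges to $\log \rho$ of the tilted matrix, consistent with the Perron–Frobenius argument in the proof.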

As a corollary, we have the following.

Corollary 4.5. The empirical averages , with , satisfy the LDP with the good rate function

where and the inequality between vectors is compared entrywise.

Proof. The first equality follows immediately from Theorem 4.4 by taking

To prove the second equality, we first note that one inequality is more obvious than the other: by taking to be the left probability eigenvector of , we see the inequality “”. To prove the other inequality, assume and choose so that and that . Therefore, by definition,

finishing the proof.⁠ 

We also consider a derived process of consecutive pairs to obtain Sanov’s theorem. Such a process has the transition matrix defined by

As is discussed in the following, one can determine the large deviations of the pair empirical measure

For , we write

and call shift-invariant if .

Theorem 4.6. Assume that is irreducible. Then for every probability measure ,

Proof. By Corollary 4.5,

If is not invariant, then for some . For such that when and , we have if we let .

If is invariant,

Hence,

where and . This implies

Taking approaching proves the theorem.⁠ 

A Probability theory

A.1 Basic inequalities

Proposition A.1 (Borel-Cantelli). Let $\{A_n\}$ be a sequence of events in a probability space $(\Omega, \mathcal{F}, \mathbb{P})$. Then,

$$\sum_{n = 1}^{\infty} \mathbb{P}(A_n) < \infty \implies \mathbb{P}\Bigl(\limsup_{n \to \infty} A_n\Bigr) = 0.$$

Proposition A.2 (Markov). Let $X$ be a nonnegative random variable. Then, for every $a > 0$,

$$\mathbb{P}(X \ge a) \le \frac{\mathbb{E}[X]}{a}.$$

Proposition A.3 (Hölder). Let $X, Y$ be non-negative random variables. Then, for $p, q > 1$ satisfying $1/p + 1/q = 1$,

$$\mathbb{E}[XY] \le \mathbb{E}[X^p]^{1/p}\, \mathbb{E}[Y^q]^{1/q},$$

where the equality holds if and only if $X^p = c\, Y^q$ a.s. for some constant $c \ge 0$.

Proof. Without loss of generality, assume to paraphrase the proposition: with equality holding if and only if for some . Under the circumstances, it is not hard to verify that Young’s inequality

holds if and only if . The paraphrased proposition then follows by integrating both sides.⁠ 

Proposition A.4 (Jensen). Let $X$ be a real-valued random variable and suppose $\varphi$ is convex. If $\mathbb{E}[X]$ and $\mathbb{E}[\varphi(X)]$ are defined, then $\varphi(\mathbb{E}[X]) \le \mathbb{E}[\varphi(X)]$, with the usual conventions when either expectation is infinite.

Proof. We make use of the property of convex functions that

(A.1)

The left-hand side is by definition no less than the right-hand side, and thus it remains to show the converse inequality. To this end, it suffices to show that for all satisfying , there exists a linear function such that for all . Essentially, this is achieved by choosing

Now that (A.1) coincides with

(A.2)

one can enumerate the linear functions on the right-hand side by and obtain

If , then the proposition holds by letting . If (similar for ) and , then the right-hand side of the above still converges to , while the proposition is trivial when and .⁠

A.2 Radon-Nikodym theorem

Definition A.5 (absolute continuity). Let $\mu$ and $\nu$ be defined on a common measurable space $(\Omega, \mathcal{F})$. We say $\nu$ is absolutely continuous with respect to $\mu$, denoted by $\nu \ll \mu$, if $\nu(A) = 0$ for every $A \in \mathcal{F}$ satisfying $\mu(A) = 0$.

Theorem A.6 (Radon-Nikodym). Suppose $\mu, \nu$ are two $\sigma$-finite measures on a common measurable space $(\Omega, \mathcal{F})$ and $\nu \ll \mu$; then there exists an $\mathcal{F}$-measurable function $f \ge 0$ such that for every $A \in \mathcal{F}$,

$$\nu(A) = \int_A f \, d\mu.$$

A.3 Laws of large numbers

Theorem A.7 (Weak Law of Large Numbers). Suppose $X_1, X_2, \dots$ are i.i.d. and $\mathbb{E}|X_1| < \infty$. Then, $\frac{1}{n} \sum_{i=1}^{n} X_i \to \mathbb{E}[X_1]$ in probability.

Proof. Equivalently, we can assume is nonnegative and and prove . Let and with . Immediately,

On the other hand,

The theorem is proved by combining the above.⁠ 

Lemma A.8 (Kronecker). Suppose $0 < b_n \uparrow \infty$ and $x_1, x_2, \dots \in \mathbb{R}$. Then the convergence of $\sum_{n=1}^{\infty} x_n / b_n$ implies $\frac{1}{b_n} \sum_{k=1}^{n} x_k \to 0$.

Proof. Writing with , one may use summation by parts to deduce

The last expression converges to as .⁠ 

Proposition A.9 (Kolmogorov's Criterion of SLLN). Suppose $X_1, X_2, \dots$ are independent such that $\mathbb{E}[X_n] = 0$ and $\sum_{n=1}^{\infty} \operatorname{Var}(X_n)/n^2 < \infty$. Then, $\frac{1}{n} \sum_{k=1}^{n} X_k \to 0$ a.s.

Proof. By virtue of the independence, one observes that

which, in particular, implies is finite almost surely, which in turn implies the conclusion by Lemma A.8.⁠ 

Proposition A.10 (Strong Law of Large Numbers). Suppose $X_1, X_2, \dots$ are i.i.d. and $\mathbb{E}|X_1| < \infty$. Then, $\frac{1}{n} \sum_{i=1}^{n} X_i \to \mathbb{E}[X_1]$ a.s.
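As a numerical illustration of the statement (our own sketch, with Uniform(0,1) samples and a fixed seed as assumptions), the empirical mean settles near the true mean $1/2$:

```python
import random

# Empirical mean of i.i.d. Uniform(0, 1) variables approaches the mean 1/2.
random.seed(0)
n = 10**5
empirical_mean = sum(random.random() for _ in range(n)) / n
print(empirical_mean)  # close to 0.5
```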

Proof. It suffices to prove the case , for if otherwise, one may still apply the result to for every if (resp., if ) so that and that (resp., ), leading to the conclusion by letting .

To begin, truncate to define

Then, verify the assumption of Proposition A.9

to deduce

(A.3)

On the other hand, by the Borel-Cantelli lemma,

(A.4)

Combining (A.3) and (A.4) gives that

⁠ 

B Functional analysis

Theorem B.1 (Hahn-Banach Extension Theorem). Let be a real vector space. Suppose that

  • is a linear subspace of ,
  • is a convex functional,
  • is linear and .

Then, there exists a linear functional such that .

Proof. The theorem is proved by transfinite induction.

If , choose , and extend to by

where one may choose, if possible,

so that on . To see the above interval is indeed nonempty, note that the convexity of together with implies that for each ,

and hence the non-emptiness follows from the finite intersection property.

To conclude the proof, consider the set of all pairs such that is a subspace of and that is linear on with . Define further a partial order by

Hence, for any chain , the space is a vector space and the function whenever , which forms an upper bound of the chain. Hence, by Zorn’s lemma, there exists a maximal element , and by the first part of the proof.⁠

In the context of topological vector spaces, an equivalent formulation of Theorem B.1 is phrased in terms of separation, as follows. Recall that a core point of a subset of a vector space is a point satisfying that for every there exists such that for all , for which we denote .

Theorem B.2 (Hahn-Banach Separation Theorem). Let and be disjoint, nonempty, convex subsets of a real topological vector space . Assume that . Then there is a linear functional satisfying and for all . Moreover, if in addition , then can be chosen so that and for all and .

Proof of Theorem B.2 by Theorem B.1. We first prove the general case, after which we point out the necessary modifications for the case where .

Let and . Define so that and its associated gauge

Then, is a sublinear functional. Indeed, since , is real-valued. Moreover, for all by definition and by convexity of . Consider the subspace and define the linear functional . To apply the extension theorem, observe that and thus , yielding a linear extension of satisfying by Theorem B.1. To see separates and , observe that for any and ,

(B.1)

The number in question can be arbitrary in . To see that , note that for some and and as desired.

Now if , pick and proceed as before. Under the circumstances, is a neighborhood of . Hence, for all in the neighborhood of , proving the continuity. Furthermore, due to the fact , rendering (B.1) a strict inequality.⁠ 

Proof of Theorem B.1 by Theorem B.2. Let

Then, and are convex and every point in is a core point. Thus, by Theorem B.2, there exists a linear functional and numbers such that

(B.2)

Under the circumstances, we observe, by letting , that . It is clear that , for if otherwise, the inequality (B.1) above holds if and only if for all , contradicting Theorem B.2 as . Now that , the inequality (B.2) implies that

The proof is finished by choosing .⁠ 

Theorem B.3. Suppose and are two disjoint, nonempty, convex sets in a locally convex real topological vector space . If is compact and is closed, then there exists an such that .
Proof. Suppose is a convex neighborhood of such that . Apply Theorem B.2 to and to obtain such that since is compact and is open.⁠ 

C Basics in convex analysis

Lemma C.1 (duality). Let be a locally convex topological vector space. If is convex and lower semicontinuous, then .

Proof. We may assume, without loss of generality, that is not identically , for the lemma holds obviously otherwise. Define

which are convex subsets in the locally convex topological vector spaces and , respectively.

It is not hard to verify by definition that . Hence, it suffices to prove the other inequality. Equivalently, it asserts that if , then there is such that . Then, observe that is compact and is closed by lower semicontinuity of and nonempty by the fact is not identically . Moreover, under the assumption , . These altogether allow us to apply the Hahn-Banach separation theorem (specifically, Theorem B.3) to yield some and satisfying

(C.1)

where must hold as there exists with .

If , then (C.1) implies that . Moreover, as desired, for if otherwise, contradicting the definition of .

If , then for all , define for . In fact, since

Moreover,

yielding the desired when is sufficiently small.⁠ 

Lemma C.2. Suppose is a convex, lower semicontinuous function with and , then for some

Proof. Note that , and hence the lemma, by duality (Lemma C.1), is equivalent to the existence of such that for all . This latter claim, by convexity of , is equivalent to

To prove the claim, define

Obviously, is convex, compact, and nonempty. On the other hand, is nonempty, convex, and . Obviously, . To prove convexity, note that is a convex function and closure of a convex set is convex. Indeed,

Finally, to prove , we first show that . Under the circumstances, we deduce that is continuous at and thus follows naturally. To this end, note that implies . Indeed, and if , then and for all small ,

In particular, the above implies . Moreover, implies also since if and only if for all .

We now may apply Theorem B.3 to and to yield some such that

where since . Hence, for every with , we have that

The proof is concluded by choosing .⁠ 

Lemma C.3. Let be an essentially smooth, convex function. If and for some , then .

Proof. Since , it follows by convexity of that

Moreover, since ,

Because is essentially smooth, there exists a closed ball in which is differentiable. Hence,

Hence, for any and any different from , by the convexity of ,

where . Similarly, . Hence,

Observe that because of the convexity of . Hence, by assumption, exists, and by the preceding inequality, . Since is steep, by considering , in which case , we conclude that .⁠ 

References

  • [1] Dembo, Amir, and Zeitouni, Ofer, Large Deviations Techniques and Applications, Springer Berlin Heidelberg, 2010.