1. Introduction
1.1. Overview
The aim of this lecture note is to give a very brief introduction to the theory of large deviations. In probability theory, a recurrent theme is the convergence of random variables, and the law of large numbers (LLN) is among the fundamental results concerning their "typical" behavior. Beyond this scope, however, one yearns for a more refined picture of the "atypical configurations" or "rare events" that deviate from the expected outcome by either a small or a large amount. In this pursuit, the former is addressed by the central limit theorem (CLT), while the latter gives rise to the large deviation principle (LDP). To illustrate both regimes, we begin with the following example.
Example 1.1 (Bernoulli trials). A Bernoulli trial is an experiment with exactly two possible outcomes, say $0$ and $1$; we consider a series of identical and independent such trials in which the outcome $1$ occurs with probability $p \in (0,1)$.
The core questions regarding these trials revolve around the number of $1$'s among the first $n$ trials, which we denote by $S_n$,
for which, here and throughout our study, we intend to examine the logarithm of the probability of the associated events. Specifically, by Stirling's approximation ($n! = \sqrt{2\pi n}\,(n/e)^n(1+o(1))$), for integers $k = k_n$ with $k/n \to x \in (0,1)$,
$$\log\mathbb{P}(S_n = k) = \log\binom{n}{k}p^{k}(1-p)^{n-k} = -n\Big(\tfrac{k}{n}\log\tfrac{k/n}{p} + \big(1-\tfrac{k}{n}\big)\log\tfrac{1-k/n}{1-p}\Big) + O(\log n).$$
Asymptotically, given that $k/n \to x$, we obtain
$$\lim_{n\to\infty}\frac{1}{n}\log\mathbb{P}(S_n = k_n) = -I_p(x),$$
where
$$I_p(x) := x\log\frac{x}{p} + (1-x)\log\frac{1-x}{1-p}, \qquad x\in[0,1],$$
is plotted in Figure 1. In the language of large deviation theory, the sequence $\{S_n/n\}_{n\ge1}$ satisfies the large deviation principle with rate function $I_p$.
Some comments are in order.
- $I_p$ is a convex function with $I_p(x) = 0$ if and only if $x = p$. In particular, its minimum is attained at a unique point, $x = p$. Consequently, for every $\delta > 0$, $\mathbb{P}(|S_n/n - p| \ge \delta)$ decays exponentially in $n$ and is therefore summable. This implies the law of large numbers, $S_n/n \to p$ almost surely, by the Borel-Cantelli lemma.
- Observe that the second-order Taylor expansion of $I_p$ at $p$ reads
$$I_p(p + \delta) = \frac{\delta^2}{2p(1-p)} + O(\delta^3).$$
When considering a small deviation $\delta \approx t/\sqrt{n}$, this is "roughly" consistent with the central limit theorem:
$$\mathbb{P}\Big(\frac{S_n - np}{\sqrt{n}} \approx t\Big) \approx e^{-nI_p(p + t/\sqrt{n})} \approx e^{-\frac{t^2}{2p(1-p)}},$$
matching the Gaussian tail with variance $p(1-p)$.
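For a quick numerical sanity check of the above, one may compare a Monte Carlo estimate of $\frac{1}{n}\log\mathbb{P}(S_n/n \ge a)$ with the predicted value $-I_p(a)$; the sketch below is only illustrative, and the parameters $p$, $n$, $a$, and the sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

p, n, a = 0.5, 100, 0.65          # success probability, number of trials, threshold a > p
num_samples = 200_000             # number of independent runs of n trials

def rate(x, p):
    """Bernoulli rate function I_p(x) = x log(x/p) + (1-x) log((1-x)/(1-p))."""
    return x * np.log(x / p) + (1 - x) * np.log((1 - x) / (1 - p))

# Empirical means S_n / n over many independent runs.
means = rng.binomial(n, p, size=num_samples) / n

prob = np.mean(means >= a)        # Monte Carlo estimate of P(S_n/n >= a)
print("(1/n) log P(S_n/n >= a) ≈", np.log(prob) / n)
print("-I_p(a)                 =", -rate(a, p))
# The two values agree up to sub-exponential (in n) corrections, as the LDP predicts.
```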
1.2. General settings
Throughout the note, we let $\{\mu_\epsilon\}$ be a collection of probability measures on a common measurable space $(\mathcal{X}, \mathcal{B})$. The family is indexed by a set of non-negative real numbers $\epsilon$ with an accumulation point at $0$, as we are primarily interested in the limiting behavior of these measures as $\epsilon \to 0$. Within this framework, we assume that $\mathcal{X}$ is a Hausdorff topological space and that $\mathcal{B}$ contains the Borel $\sigma$-algebra $\mathcal{B}_{\mathcal{X}}$.
Definition 1.2 (rate function). Let $I: \mathcal{X} \to [0,\infty]$ be a function with sublevel sets $\Psi_I(\alpha) := \{x \in \mathcal{X} : I(x) \le \alpha\}$ and effective domain $\mathcal{D}_I := \{x \in \mathcal{X} : I(x) < \infty\}$.
- $I$ is called a rate function if it is lower semicontinuous, that is, if all sublevel sets $\Psi_I(\alpha)$ are closed.
- A rate function $I$ is said to be good if $\Psi_I(\alpha)$ is compact for all $\alpha \ge 0$.
Definition 1.3 (large deviation principle). The family $\{\mu_\epsilon\}$ is said to satisfy the large deviation principle with rate function $I$ if for every $\Gamma \in \mathcal{B}$,
$$-\inf_{x\in\Gamma^\circ}I(x) \le \liminf_{\epsilon\to0}\epsilon\log\mu_\epsilon(\Gamma) \le \limsup_{\epsilon\to0}\epsilon\log\mu_\epsilon(\Gamma) \le -\inf_{x\in\overline{\Gamma}}I(x). \tag{1.1}$$
This definition of the large deviation principle provides the most general framework for our discussion. In particular, one can further adapt the definition to any sequence of random variables as follows: a sequence $\{X_n\}$ is said to satisfy the large deviation principle if the associated family of probability laws $\{\mathbb{P}\circ X_n^{-1}\}$ satisfies the large deviation principle (with $\epsilon_n = 1/n$). This convention is applied to all statements concerning random variables.
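For instance, the Gaussian family $\mu_\epsilon = N(0, \epsilon)$ on $\mathcal{X} = \mathbb{R}$ satisfies the large deviation principle with the good rate function $I(x) = x^2/2$: for $\Gamma = [a, \infty)$ with $a > 0$, the standard Gaussian tail estimate gives
$$\epsilon\log\mu_\epsilon([a,\infty)) = \epsilon\log\mathbb{P}\big(\sqrt{\epsilon}\,Z \ge a\big) \longrightarrow -\frac{a^2}{2} = -\inf_{x\in[a,\infty)}I(x) \qquad (\epsilon\to0),$$
where $Z\sim N(0,1)$, and analogous computations handle general Borel sets.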
The rationale behind our specific settings is tied directly to the following characterizations of the large deviation principle. By assuming that $\mathcal{B}$ contains the Borel $\sigma$-algebra (so that open and closed sets are measurable), we immediately have the following equivalent statement.
Remark 1.4. Suppose $\mathcal{B} \supseteq \mathcal{B}_{\mathcal{X}}$. Then, $\{\mu_\epsilon\}$ satisfies the large deviation principle with rate function $I$ if and only if the following hold.
- (upper bound) For any closed set $F \subseteq \mathcal{X}$,
$$\limsup_{\epsilon\to0}\epsilon\log\mu_\epsilon(F) \le -\inf_{x\in F}I(x). \tag{1.2}$$
- (lower bound) For any open set $G \subseteq \mathcal{X}$,
$$\liminf_{\epsilon\to0}\epsilon\log\mu_\epsilon(G) \ge -\inf_{x\in G}I(x). \tag{1.3}$$
As for the Hausdorff assumption, it ensures that compact sets are closed, a fact used repeatedly below, particularly in connection with the notion of exponential tightness.
Definition 1.5. Suppose $\mathcal{B}$ contains all compact subsets of $\mathcal{X}$. The family $\{\mu_\epsilon\}$ is exponentially tight if for every $\alpha > 0$, there exists a compact set $K_\alpha \subseteq \mathcal{X}$ such that
$$\limsup_{\epsilon\to0}\epsilon\log\mu_\epsilon(K_\alpha^c) < -\alpha.$$
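For instance, the Gaussian family $\mu_\epsilon = N(0,\epsilon)$ considered above is exponentially tight: taking $K_\alpha = [-M, M]$ with $M > \sqrt{2\alpha}$, the same tail estimate gives
$$\limsup_{\epsilon\to0}\epsilon\log\mu_\epsilon\big([-M, M]^c\big) = -\frac{M^2}{2} < -\alpha.$$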
As every compact set is closed in a Hausdorff space, we can relax the LDP conditions under the assumption of exponential tightness as follows:
Proposition 1.6. Suppose $\mathcal{B} \supseteq \mathcal{B}_{\mathcal{X}}$. If $\{\mu_\epsilon\}$ is exponentially tight, then the following hold.
- (upper bound) If (1.2) holds for every compact set, then it holds for every closed set.
- (lower bound) If (1.3) holds for every open set, then the rate function $I$ is good.
Notably, any family $\{\mu_\epsilon\}$ admitting a rate function for which (1.2) holds for every compact set and (1.3) holds for every open set is said to satisfy the weak LDP.
Proof. Suppose $\{\mu_\epsilon\}$ is exponentially tight and let $F$ be a closed set. For every $\alpha > 0$, there exists a compact set $K_\alpha$ such that
$$\limsup_{\epsilon\to0}\epsilon\log\mu_\epsilon(K_\alpha^c) < -\alpha.$$
Hence, using $\mu_\epsilon(F) \le 2\max\{\mu_\epsilon(F\cap K_\alpha),\,\mu_\epsilon(K_\alpha^c)\}$,
$$\limsup_{\epsilon\to0}\epsilon\log\mu_\epsilon(F) \le \max\Big\{\limsup_{\epsilon\to0}\epsilon\log\mu_\epsilon(F\cap K_\alpha),\ -\alpha\Big\} \le \max\Big\{-\inf_{x\in F\cap K_\alpha}I(x),\ -\alpha\Big\} \le \max\Big\{-\inf_{x\in F}I(x),\ -\alpha\Big\},$$
where the second inequality applies (1.2) to $F\cap K_\alpha$, which is compact: $K_\alpha$ is closed due to the Hausdorff assumption, so $F\cap K_\alpha$ is a closed subset of the compact set $K_\alpha$. Letting $\alpha\to\infty$ yields (1.2) for every closed set.
With exponential tightness and the lower bound (1.3) for open sets, we can find for every $\alpha > 0$ a compact set $K_\alpha$ such that
$$-\inf_{x\in K_\alpha^c}I(x) \le \liminf_{\epsilon\to0}\epsilon\log\mu_\epsilon(K_\alpha^c) < -\alpha,$$
where $K_\alpha^c$ is open because $K_\alpha$ is closed due to the Hausdorff assumption. This naturally implies
$$\Psi_I(\alpha) = \{x\in\mathcal{X} : I(x)\le\alpha\} \subseteq K_\alpha,$$
which is a closed subset of $K_\alpha$ due to lower semicontinuity and therefore compact. Hence $I$ is good.
2. Cramér's theorem in $\mathbb{R}$
In this section, we establish the large deviation principle for the empirical mean of i.i.d. random variables. We begin by introducing the primary tools of our analysis: the logarithmic moment generating function and its Fenchel-Legendre transform. Let $X_1, X_2, \ldots$ be i.i.d. real-valued random variables with law $\mu$, and let $\mu_n$ be the law of the empirical mean $\hat{S}_n := \frac{1}{n}\sum_{i=1}^{n}X_i$. We denote the expectation of $X_1$ by $\bar{x} := \mathbb{E}[X_1]$, whenever it is well-defined.
Definition 2.1. The logarithmic moment generating function (log MGF) associated with the law $\mu$ is defined as
$$\Lambda(\lambda) := \log\mathbb{E}\big[e^{\lambda X_1}\big] = \log\int_{\mathbb{R}}e^{\lambda x}\,\mu(dx), \qquad \lambda\in\mathbb{R}.$$
Definition 2.2. The Fenchel-Legendre transform of $\Lambda$ is
$$\Lambda^*(x) := \sup_{\lambda\in\mathbb{R}}\big\{\lambda x - \Lambda(\lambda)\big\}, \qquad x\in\mathbb{R}.$$
Example 2.3. There is a geometric interpretation of the Fenchel-Legendre transform of $\Lambda$; that is, $-\Lambda^*(x)$ is the supremum of the $y$-intercepts of all lines with slope $x$ that lie below the graph of $\Lambda$.
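For instance, if $\mu = N(0,1)$, then $\Lambda(\lambda) = \lambda^2/2$, the supremum defining $\Lambda^*(x)$ is attained at $\lambda = x$, and $\Lambda^*(x) = x^2/2$. If $\mu$ is the Bernoulli law of Example 1.1, i.e., $\mu(\{1\}) = p = 1 - \mu(\{0\})$, then
$$\Lambda(\lambda) = \log\big(1 - p + p\,e^{\lambda}\big), \qquad \Lambda^*(x) = x\log\frac{x}{p} + (1-x)\log\frac{1-x}{1-p} \quad\text{for } x\in[0,1],$$
with $\Lambda^*(x) = \infty$ outside $[0,1]$, recovering the rate function $I_p$ of Example 1.1.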
For the empirical means, the large deviation principle takes the following form:
Theorem 2.4 (Cramér). Let $X_1, X_2, \ldots$, $\mu$, $\mu_n$, and $\Lambda^*$ be as defined above. Then, $\{\mu_n\}$ satisfies the LDP with the convex rate function $\Lambda^*$, namely,
1. For any closed set $F \subseteq \mathbb{R}$, $\displaystyle\limsup_{n\to\infty}\frac{1}{n}\log\mu_n(F) \le -\inf_{x\in F}\Lambda^*(x)$.
2. For any open set $G \subseteq \mathbb{R}$, $\displaystyle\liminf_{n\to\infty}\frac{1}{n}\log\mu_n(G) \ge -\inf_{x\in G}\Lambda^*(x)$.
Furthermore,
3. If $0 \in \mathcal{D}_\Lambda^\circ$, then $\Lambda^*$ is good.
To motivate the proof, we first examine the role of the log MGF. For real-valued random variables, Markov's inequality provides the key estimate for the LDP upper bound (1.2). Specifically, for any $x\in\mathbb{R}$ and $\lambda \ge 0$:
$$\mu_n([x,\infty)) = \mathbb{P}\big(\hat{S}_n \ge x\big) \le e^{-n\lambda x}\,\mathbb{E}\big[e^{\lambda(X_1+\cdots+X_n)}\big] = e^{-n(\lambda x - \Lambda(\lambda))}.$$
Similarly, for any $x\in\mathbb{R}$ and $\lambda \le 0$,
$$\mu_n((-\infty, x]) \le e^{-n(\lambda x - \Lambda(\lambda))}.$$
Optimizing over $\lambda$ in both bounds, these observations suggest that $\Lambda^*$ is the natural candidate for the rate function. The following lemma summarizes the key properties of $\Lambda$ and $\Lambda^*$.
Lemma 2.5. Let $\Lambda$ and $\Lambda^*$ be as defined. Then,
1. $\Lambda$ is a convex function and $\Lambda^*$ is a convex rate function.
2. Either of the following holds.
   - If $\Lambda(\lambda) < \infty$ only when $\lambda = 0$, then $\Lambda^*$ is identically zero.
   - If $\Lambda(\lambda) < \infty$ for some $\lambda > 0$ (respectively, $\lambda < 0$), then $\bar{x} \in [-\infty,\infty)$ (respectively, $\bar{x} \in (-\infty,\infty]$) is well-defined. Under the circumstances, for every $x \ge \bar{x}$ (respectively, $x \le \bar{x}$),
     $$\Lambda^*(x) = \sup_{\lambda \ge 0}\{\lambda x - \Lambda(\lambda)\} \qquad \Big(\text{respectively, } \sup_{\lambda \le 0}\{\lambda x - \Lambda(\lambda)\}\Big), \tag{2.3}$$
     which satisfies:
     - $\Lambda^*$ is decreasing on $(-\infty, \bar{x}]$ and increasing on $[\bar{x}, \infty)$, and
     - $\inf_{x\in\mathbb{R}}\Lambda^*(x) = 0$, and if $\bar{x}$ is finite, then $\Lambda^*(\bar{x}) = 0$.
3. $\Lambda$ is differentiable in $\mathcal{D}_\Lambda^\circ$ with
   $$\Lambda'(\eta) = \frac{\mathbb{E}\big[X_1 e^{\eta X_1}\big]}{\mathbb{E}\big[e^{\eta X_1}\big]}, \qquad \eta\in\mathcal{D}_\Lambda^\circ,$$
   and $\eta\,\Lambda'(\eta) - \Lambda(\eta) = \Lambda^*(\Lambda'(\eta))$.
4. If $0 \in \mathcal{D}_\Lambda^\circ$, then $\Lambda^*$ is a good rate function. Moreover, if $0 \in \mathcal{D}_\Lambda^\circ$, then $\Lambda^*(x) \to \infty$ as $|x| \to \infty$.
Proof. (1) By Hölder's inequality, given any $\lambda_1, \lambda_2 \in \mathbb{R}$ and any $\theta \in (0,1)$ (applied with exponents $1/\theta$ and $1/(1-\theta)$),
$$\mathbb{E}\big[e^{(\theta\lambda_1 + (1-\theta)\lambda_2)X_1}\big] \le \mathbb{E}\big[e^{\lambda_1 X_1}\big]^{\theta}\,\mathbb{E}\big[e^{\lambda_2 X_1}\big]^{1-\theta},$$
proving the convexity of $\Lambda$. The convexity of $\Lambda^*$ follows from the definition: for $\theta\in(0,1)$ and $x_1, x_2\in\mathbb{R}$,
$$\Lambda^*(\theta x_1 + (1-\theta)x_2) = \sup_{\lambda}\big\{\theta(\lambda x_1 - \Lambda(\lambda)) + (1-\theta)(\lambda x_2 - \Lambda(\lambda))\big\} \le \theta\Lambda^*(x_1) + (1-\theta)\Lambda^*(x_2).$$
To prove the lower semicontinuity, observe that for any $\lambda\in\mathbb{R}$ and any sequence $x_n \to x$,
$$\liminf_{n\to\infty}\Lambda^*(x_n) \ge \liminf_{n\to\infty}\big(\lambda x_n - \Lambda(\lambda)\big) = \lambda x - \Lambda(\lambda),$$
implying $\liminf_{n\to\infty}\Lambda^*(x_n) \ge \Lambda^*(x)$ upon taking the supremum over $\lambda$. The non-negativity of $\Lambda^*$ follows from the fact that $\Lambda(0) = 0$.
(2) The case $\mathcal{D}_\Lambda = \{0\}$ is automatic. If $\Lambda(\lambda) < \infty$ for some $\lambda > 0$, then, due to the fact that $x^+ \le \lambda^{-1}e^{\lambda x}$ for all $x$,
$$\mathbb{E}[X_1^+] \le \lambda^{-1}\,\mathbb{E}\big[e^{\lambda X_1}\big] < \infty,$$
meaning both $\mathbb{E}[X_1^+]$ and $\bar{x} = \mathbb{E}[X_1] \in [-\infty,\infty)$ are well-defined. Hence, by Jensen's inequality,
$$\Lambda(\lambda) \ge \mathbb{E}[\lambda X_1] = \lambda\bar{x} \qquad \text{for all } \lambda \ge 0. \tag{2.4}$$
The argument proceeds similarly for $\bar{x} \in (-\infty,\infty]$ whenever $\Lambda(\lambda) < \infty$ for some $\lambda < 0$.
We then proceed to prove (2.3) and its related properties. Since $\bar{x}$ is now well-defined, by Jensen's inequality, for all $x \ge \bar{x}$ and $\lambda < 0$ (and similarly for $x \le \bar{x}$ and $\lambda > 0$),
$$\lambda x - \Lambda(\lambda) \le \lambda x - \lambda\bar{x} = \lambda(x - \bar{x}) \le 0 \le \Lambda^*(x),$$
from which (2.3) follows. Moreover, (2.3) implies the monotonicity on $(-\infty,\bar{x}]$ and $[\bar{x},\infty)$. Finally, if $\bar{x}$ is finite, then (2.4) holds for all $\lambda\in\mathbb{R}$, whence $\Lambda^*(\bar{x}) \le 0$ and thus $\Lambda^*(\bar{x}) = 0$. If $\bar{x} = -\infty$ (similar for $\bar{x} = +\infty$), we deduce from Markov's inequality that for all $x$,
$$\Lambda^*(x) = \sup_{\lambda\ge0}\{\lambda x - \Lambda(\lambda)\} \le -\log\mu([x,\infty)),$$
from which it follows that $\lim_{x\to-\infty}\Lambda^*(x) = 0$ and hence $\inf_{x}\Lambda^*(x) = 0$, as desired.
(3) The differentiability follows from the dominated convergence theorem. Let so that as and whenever . Since for all sufficiently small , we may apply the dominated convergence theorem to derive the derivative of . Finally, the function is concave and thus implies , proving the proposed property.
(4) Suppose is a non-degenerate interval containing . Then, for
implying as . Hence, the sublevel set is closed and bounded for all , and is good. When , then the result follows by letting .
Proof of Theorem 2.4. (1) For brevity, let . If , then the inequality is trivial. Hence, we assume . Under the circumstances, the numbers and are different from . By Markov’s inequality, for all and ,
By Lemma 2.5,
Taking the normalized logarithm and letting yields the desired inequality.
(2) It suffices to show that for all measures and all ,
for once this is proved, one may simply consider the translation to deduce that the log MGF and that , which in turn imply
This proves the desired lower bound.
To prove (2.5), assume first that (a) , (b) , and (c) is boundedly supported. Under the circumstances, (a) and (b) imply as , and (c) implies is finite on . Consequently, by Lemma 2.5, there exists such that
Define a probability measure by
Observe that
By Lemma 2.5,
Hence, by the law of large numbers,
Hence, for all ,
proving (2.5) by letting .
Now, if the support of is not bounded, fix such that (a) , (b) . Hence, by letting be the normalized law of on , we have that
and that the log MGF associated with is
Hence, by the case of bounded support,
and therefore, by writing and ,
Since is increasing in , so is . Therefore, and the level sets are decreasing compact sets, admitting some in their intersection. By the monotone convergence theorem,
Combining (2.6) and (2.7) proves (2.5).
Finally, if or , then is monotone and . The bound then yields (2.5).
(3) It follows from Lemma 2.5(4).
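To close this section, here is a small numerical illustration of Cramér's theorem; the sketch is only indicative, and the choice of the law $\mathrm{Exp}(1)$ (for which $\Lambda(\lambda) = -\log(1-\lambda)$ for $\lambda < 1$ and $\Lambda^*(x) = x - 1 - \log x$ for $x > 0$), the values of $n$ and $a$, and the sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

n, a, num_samples = 50, 1.5, 100_000

# Log MGF of Exp(1): Lambda(t) = -log(1 - t) for t < 1 (+infinity otherwise).
ts = np.linspace(-5.0, 0.999, 20_001)
Lam = -np.log(1.0 - ts)

# Fenchel-Legendre transform at a via grid search (cf. the geometric picture of Example 2.3).
Lam_star_a = np.max(a * ts - Lam)
print("grid search:  Lambda*(a) ≈", Lam_star_a)
print("closed form:  a - 1 - log(a) =", a - 1.0 - np.log(a))

# Monte Carlo estimate of the decay rate of P(S_n/n >= a) for a > E[X_1] = 1.
means = rng.exponential(1.0, size=(num_samples, n)).mean(axis=1)
prob = np.mean(means >= a)
print("(1/n) log P(S_n/n >= a) ≈", np.log(prob) / n, "  vs  -Lambda*(a) =", -Lam_star_a)
```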
3. Gärtner-Ellis theorem
Let $\{Z_n\}$ be a sequence of $d$-dimensional real random vectors with laws $\{\mu_n\}$, and let $\Lambda_n$ be the log MGF associated with $Z_n$, which can be defined as
$$\Lambda_n(\lambda) := \log\mathbb{E}\big[e^{\langle\lambda, Z_n\rangle}\big], \qquad \lambda\in\mathbb{R}^d.$$
To state the Gärtner-Ellis theorem, we need the following definition.
Definition 3.1. A convex function $\Lambda: \mathbb{R}^d \to (-\infty,\infty]$ is essentially smooth if:
- $\mathcal{D}_\Lambda^\circ$ is nonempty.
- $\Lambda$ is differentiable throughout $\mathcal{D}_\Lambda^\circ$.
- $\Lambda$ is steep, namely, $|\nabla\Lambda(\lambda_n)| \to \infty$ whenever $\{\lambda_n\}$ is a sequence in $\mathcal{D}_\Lambda^\circ$ converging to a boundary point of $\mathcal{D}_\Lambda^\circ$.
Theorem 3.2 (Gärtner-Ellis). Suppose that
$$\Lambda(\lambda) := \lim_{n\to\infty}\frac{1}{n}\log\mathbb{E}\big[e^{n\langle\lambda, Z_n\rangle}\big] = \lim_{n\to\infty}\frac{1}{n}\Lambda_n(n\lambda)$$
exists for all $\lambda\in\mathbb{R}^d$ as an extended real number and $0\in\mathcal{D}_\Lambda^\circ$.
1. For any closed set $F\subseteq\mathbb{R}^d$,
$$\limsup_{n\to\infty}\frac{1}{n}\log\mu_n(F) \le -\inf_{x\in F}\Lambda^*(x).$$
2. For any open set $G\subseteq\mathbb{R}^d$,
$$\liminf_{n\to\infty}\frac{1}{n}\log\mu_n(G) \ge -\inf_{x\in G\cap\mathcal{F}}\Lambda^*(x),$$
where $\mathcal{F}$ is the set of exposed points of $\Lambda^*$ admitting an exposing hyperplane belonging to $\mathcal{D}_\Lambda^\circ$.
3. If $\Lambda$ is an essentially smooth, lower semicontinuous function, then the LDP holds with the good rate function $\Lambda^*$.
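For instance, if $Z_n = \hat{S}_n$ is the empirical mean of i.i.d. random vectors as in Section 2, then
$$\frac{1}{n}\log\mathbb{E}\big[e^{n\langle\lambda, \hat{S}_n\rangle}\big] = \log\mathbb{E}\big[e^{\langle\lambda, X_1\rangle}\big] = \Lambda(\lambda) \qquad\text{for every } n,$$
so the limit in the assumption exists trivially, and the Gärtner-Ellis theorem recovers Cramér's theorem whenever $0\in\mathcal{D}_\Lambda^\circ$; see Section 4.1.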
The proof of the third statement relies on several results from convex analysis; for clarity, we state these results here and defer their proofs.
Lemma 3.3. Under the same assumption as in Theorem 3.2, the following hold.
1. $\Lambda$ is a convex function with $\Lambda(\lambda) > -\infty$ everywhere.
2. $\Lambda^*$ is a good convex rate function.
3. Suppose that $y = \nabla\Lambda(\eta)$ for some $\eta\in\mathcal{D}_\Lambda^\circ$. Then $\Lambda^*(y) = \langle\eta, y\rangle - \Lambda(\eta)$. Moreover $y\in\mathcal{F}$, with $\eta$ being the exposing hyperplane for $y$.
Lemma 3.4 (Rockafellar). Suppose that $\Lambda$ is an essentially smooth, lower semicontinuous function with $0\in\mathcal{D}_\Lambda^\circ$. Then $\mathrm{ri}(\mathcal{D}_{\Lambda^*}) \subseteq \mathcal{F}$.
With these tools, we can now proceed with the proof of the Gärtner-Ellis theorem. For convenience, we first introduce the following auxiliary function:
Definition 3.5. The $\delta$-rate function associated with a rate function $I$ is the function $I^\delta$ defined as
$$I^\delta(x) := \min\Big\{I(x) - \delta,\ \frac{1}{\delta}\Big\}, \qquad \delta > 0. \tag{3.1}$$
Proof of Theorem 3.2. (1) It suffices to prove the inequality for all compact sets and that is exponentially tight.
The upper bound for compact sets follows essentially from Markov’s inequality. Choose for each a and an open neighborhood of such that
where is the -rate function associated with as defined in (3.1). Applying Markov’s inequality yields
which in turn implies
Now that is a compact set, one can find a finite cover with such that
proving the theorem by letting .
It remains to demonstrate the exponential tightness of , which is equivalent to the exponential tightness of the marginals of on each coordinate and follows essentially from Lemma 2.5 (4). Let be sufficiently small so that . By Markov's inequality, we have that for all ,
where is the -th vector in the standard basis and the right-hand side converges to as . The same argument applies to . The two estimates combined prove the exponential tightness.
(2) The case for some is trivial, for everywhere. Assuming for all , one may adopt the change of measure argument as before in the following manner. Let be any fixed open set, , , and be an exposing hyperplane for . By continuity of , we choose an open neighborhood of such that
Now that is well-defined for every sufficiently small , we define a probability measure equivalent to :
so that
yielding
It remains to show that . To this end, we claim that
which together imply the desired estimate. To prove (3.3), define the function
and observe that
One deduces from Markov's inequality that
since is lower semicontinuous and is an exposing hyperplane. For (3.4), observe that , so the inequality follows from (3.2).
(3) With Lemma 3.3 and Lemma 3.4, we have that for every open set ,
We now prove the lemmas.
Proof of Lemma 3.3. (1) The convexity follows from Hölder’s inequality as in Lemma 2.5. Precisely, given any satisfying ,
proving the convexity of and hence .
If for some , then by convexity for all . Since , it follows by convexity that for all , contradicting the assumption that . Thus, everywhere.
(2) Since , it follows that for some , and since the convex function is continuous in . Therefore,
implying is bounded for every . The function is convex and lower semicontinuous by a routine check as conducted in Lemma 2.5. Combining these implies that is a good convex rate function.
(3) Suppose now that for some ,
Then, for every ,
In particular,
Since the inequality holds for all , . Hence, is an exposed point of with exposing hyperplane .
Before proving Rockafellar’s lemma, let us recall the definition of relative interior of a convex set.
Definition 3.6. The relative interior of a nonempty convex set $C\subseteq\mathbb{R}^d$ is defined as
$$\mathrm{ri}(C) := \big\{x\in C : \text{for every } y\in C \text{ there exists } \epsilon > 0 \text{ such that } x - \epsilon(y - x)\in C\big\}.$$
Proof of Lemma 3.4. Assume without loss of generality that for the lemma is vacuous otherwise. Under the circumstances, fix henceforth and define a function
Observe that is convex, lower semicontinuous, and . Moreover, . Therefore, from it follows that . By Lemma C.2, there exists such that . Let , so that is an essentially smooth, convex function and . Consequently, by Lemma C.3, is finite in a neighborhood of the origin and thus , at which is differentiable by the assumption. Hence, , implying that , i.e., . It now follows from Lemma 3.3(2) that . Since is arbitrary, the proof is complete.
The following theorem is a generalization of the Gärtner-Ellis theorem (Theorem 3.2), which essentially follows from the same proof with marginal distributions replaced by projection measures in every direction. Its proof can be found in [1].
Theorem 3.7 (Baldi). Suppose is an exponentially tight family of probability measures on .
- For every closed set , .
-
Let be the set of exposed points of with an exposing hyperplane for which
(3.5)
Then, for every open set ,
- If for every open set , , then satisfies the LDP with the good rate function .
The theorem is accompanied by a generalized criterion of differentiability of in the following sense.
4. Applications
4.1. Cramér's theorem in $\mathbb{R}^d$
Combining the Gärtner-Ellis theorem with our previous results, we can establish the LDP for $d$-dimensional i.i.d. random variables $X_1, X_2, \ldots$. As before, let $\mu_n$ be the law of the empirical mean $\hat{S}_n = \frac{1}{n}\sum_{i=1}^{n}X_i$, and let $\Lambda$ and $\Lambda^*$ be defined accordingly.
While powerful, this theorem does not encompass all scenarios where an LDP holds; a refined version requires only that for the result to remain valid.
Example 4.2. Let be a Borel probability measure on with a density and be its log MGF, where is the normalizing constant. We will show in the following that , so the large deviation principle holds with a good rate function; however, is not steep and hence the Gärtner-Ellis theorem does not apply.
On the one hand, with the aid of the Cauchy-Schwarz inequality , we see that the estimate
holds and therefore . On the other hand, given any , we have that along direction ,
is unbounded as if , implying . Finally, we have the uniform bound
Therefore, within the interior of ,
This proves that is not steep.
4.2. Large deviations for finite state Markov chains
In this section, we study the large deviations of Markov chains on a finite alphabet $\Sigma$. Let $\pi = (\pi(x,y))_{x,y\in\Sigma}$ be a stochastic matrix, and consider a Markov chain $Y_0, Y_1, \ldots$ with transition probabilities $\pi(x,y) = \mathbb{P}(Y_{k+1} = y \mid Y_k = x)$. We denote by $\mathbb{P}_\sigma$ the probability law of the chain starting at state $\sigma\in\Sigma$:
$$\mathbb{P}_\sigma(Y_1 = y_1, \ldots, Y_n = y_n) = \pi(\sigma, y_1)\,\pi(y_1, y_2)\cdots\pi(y_{n-1}, y_n).$$
We are interested in the empirical mean
$$\hat{S}_n := \frac{1}{n}\sum_{k=1}^{n} f(Y_k),$$
where $f(Y_k)\in\mathbb{R}^d$ for a given function $f: \Sigma\to\mathbb{R}^d$. The log MGF can be analyzed using the following family of non-negative matrices:
$$\Pi_\lambda(x,y) := \pi(x,y)\,e^{\langle\lambda, f(y)\rangle}, \qquad x,y\in\Sigma,\ \lambda\in\mathbb{R}^d.$$
Theorem 4.3 (Perron-Frobenius). Let $B$ be an irreducible non-negative matrix. Then $B$ possesses an eigenvalue $\rho(B)$ (called the Perron–Frobenius eigenvalue) such that:
- $\rho(B)$ is real and positive.
- For any eigenvalue $\nu$ of $B$, $|\nu| \le \rho(B)$.
- There exist left and right eigenvectors corresponding to the eigenvalue $\rho(B)$ that have strictly positive coordinates.
- The left and right eigenvectors $\phi$, $\psi$ corresponding to the eigenvalue $\rho(B)$ are unique up to constant multiples.
- For every $x$ and every vector $u$ such that $u(y) > 0$ for all $y$,
$$\lim_{n\to\infty}\frac{1}{n}\log\sum_{y} B^n(x,y)\,u(y) = \log\rho(B).$$
Theorem 4.4. Let $Y_0, Y_1, \ldots$ be a finite state Markov chain possessing an irreducible transition matrix $\pi$. For every $\lambda\in\mathbb{R}^d$, define
$$\Lambda(\lambda) := \log\rho(\Pi_\lambda),$$
where $\rho(\Pi_\lambda)$ is the Perron-Frobenius eigenvalue of $\Pi_\lambda$.
Then the empirical mean $\hat{S}_n$ satisfies the LDP with the convex, good rate function $\Lambda^*$. Explicitly, for any set $\Gamma\subseteq\mathbb{R}^d$ and any initial state $\sigma\in\Sigma$,
$$-\inf_{z\in\Gamma^\circ}\Lambda^*(z) \le \liminf_{n\to\infty}\frac{1}{n}\log\mathbb{P}_\sigma\big(\hat{S}_n\in\Gamma\big) \le \limsup_{n\to\infty}\frac{1}{n}\log\mathbb{P}_\sigma\big(\hat{S}_n\in\Gamma\big) \le -\inf_{z\in\overline{\Gamma}}\Lambda^*(z).$$
Proof. Consider the logarithmic generating function
By Gärtner–Ellis theorem (Theorem 3.2), it is enough to check that the limit
exists, is finite and differentiable everywhere in , and satisfies . To begin, note that
Since is irreducible, we have
To show that it is differentiable, we apply the implicit function theorem. Explicitly, consider the functions , parametrized by , defined by
Clearly, is continuously differentiable. Hence, it suffices to show that for every with the associated Perron-Frobenius eigenvalue and left and right eigenvectors and of satisfying , we have (1) and (2) the Jacobian matrix
is invertible. Indeed, (1) is clear, and (2) holds because if it did not, there would exist such that
which is impossible. This implies, by the implicit function theorem, that there exists a continuously differentiable function on a neighborhood of such that . In particular, is continuously differentiable on a neighborhood of . The proof is finished by noting that is arbitrary.
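Before moving on, here is a small numerical illustration of Theorem 4.4 for a two-state chain; the sketch is purely illustrative, and the transition matrix, the function $f$ (taken to be the indicator of the second state, so that $\hat{S}_n$ is the fraction of time spent there), and the grids are arbitrary choices.

```python
import numpy as np

pi = np.array([[0.9, 0.1],
               [0.2, 0.8]])          # irreducible stochastic matrix
f = np.array([0.0, 1.0])             # f = indicator of the second state

def log_mgf(lam):
    """Lambda(lam) = log of the Perron-Frobenius eigenvalue of pi(x,y) * exp(lam * f(y))."""
    tilted = pi * np.exp(lam * f)[None, :]
    # For an irreducible non-negative matrix the PF eigenvalue is the largest real eigenvalue.
    return np.log(np.max(np.linalg.eigvals(tilted).real))

lams = np.linspace(-20.0, 20.0, 4001)
Lams = np.array([log_mgf(l) for l in lams])

def rate(z):
    """Lambda*(z) = sup_lam { lam * z - Lambda(lam) }, evaluated by grid search."""
    return np.max(z * lams - Lams)

for z in (0.1, 1/3, 0.6):
    print(f"Lambda*({z:.3f}) ≈ {rate(z):.4f}")
# The stationary distribution is (2/3, 1/3), so Lambda*(1/3) ≈ 0 and Lambda*(z) > 0 otherwise.
```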
As a corollary, we have the following.
Corollary 4.5. The empirical averages $L_n := \frac{1}{n}\sum_{k=1}^{n}\delta_{Y_k}$, with $\delta_y$ denoting the point mass at $y\in\Sigma$, satisfy the LDP with the good rate function
$$I(\nu) := \sup_{\lambda\in\mathbb{R}^{\Sigma}}\Big\{\langle\lambda, \nu\rangle - \Lambda(\lambda)\Big\} = \sup_{u > 0}\ \sum_{x\in\Sigma}\nu(x)\log\frac{u(x)}{(\pi u)(x)},$$
where $(\pi u)(x) := \sum_{y\in\Sigma}\pi(x,y)\,u(y)$ and the inequality $u > 0$ between vectors is compared entrywise.
Proof. The first equality follows immediately from Theorem 4.4 by taking
To prove the second equality, we first note that one inequality is more obvious than the other: by taking to be the left probability eigenvector of , we see the inequality “”. To prove the other inequality, assume and choose so that and that . Therefore, by definition,
finishing the proof.
We also consider the derived process $(Y_{k-1}, Y_k)_{k\ge1}$ of consecutive pairs to obtain Sanov's theorem. Such a process has the transition matrix $\pi^{(2)}$ defined by
$$\pi^{(2)}\big((x,y), (v,w)\big) := \mathbf{1}_{\{y = v\}}\,\pi(v, w), \qquad (x,y), (v,w)\in\Sigma\times\Sigma.$$
As is discussed in the following, one can determine the large deviations of the pair empirical measure
$$L_n^{(2)} := \frac{1}{n}\sum_{k=1}^{n}\delta_{(Y_{k-1}, Y_k)}.$$
For a probability measure $\nu$ on $\Sigma\times\Sigma$, we write
$$\nu_1(x) := \sum_{y\in\Sigma}\nu(x,y), \qquad \nu_2(y) := \sum_{x\in\Sigma}\nu(x,y)$$
for its marginals, and call $\nu$ shift-invariant if $\nu_1 = \nu_2$.
Theorem 4.6. Assume that is irreducible. Then for every probability measure ,
Proof. By Corollary 4.5,
If is not invariant, then for some . For such that when and , we have if we let .
If is invariant,
Hence,
where and . This implies
Taking approaching proves the theorem.
A Probability theory
A.1 Basic inequalities
Proposition A.1 (Borel-Cantelli). Let $\{A_n\}$ be a sequence of events in a probability space $(\Omega, \mathcal{F}, \mathbb{P})$. Then,
$$\sum_{n=1}^{\infty}\mathbb{P}(A_n) < \infty \implies \mathbb{P}\Big(\bigcap_{m=1}^{\infty}\bigcup_{n\ge m}A_n\Big) = 0.$$
Proposition A.2 (Markov). Let $X$ be a nonnegative random variable. Then, for every $a > 0$,
$$\mathbb{P}(X \ge a) \le \frac{\mathbb{E}[X]}{a}.$$
Proposition A.3 (Hölder). Let $X, Y$ be non-negative random variables. Then, for $p, q > 1$ satisfying $\frac{1}{p} + \frac{1}{q} = 1$,
$$\mathbb{E}[XY] \le \mathbb{E}[X^p]^{1/p}\,\mathbb{E}[Y^q]^{1/q},$$
where the equality holds if and only if $X^p = cY^q$ almost surely for some constant $c \ge 0$.
Proof. Without loss of generality, assume $\mathbb{E}[X^p] = \mathbb{E}[Y^q] = 1$ to paraphrase the proposition: $\mathbb{E}[XY] \le 1$, with equality holding if and only if $X^p = Y^q$ almost surely. Under the circumstances, it is not hard to verify that Young's inequality
$$XY \le \frac{X^p}{p} + \frac{Y^q}{q}$$
holds, with equality if and only if $X^p = Y^q$. The paraphrased proposition then follows by integrating both sides.
Proof. We make use of the property of convex functions that
The left-hand side is by definition larger than the other, and thus it remains to show the reverse inequality. To this end, it suffices to show that for all satisfying , there exists a linear function such that for all . Essentially, this is achieved by choosing
Now that (A.1) coincides with
one can enumerate the linear functions on the right-hand side by and obtain
If , then the proposition holds by letting . If (similar for ) and , then the right-hand side of the above still converges to , while the proposition is trivial when and .
A.2 Radon-Nikodym theorem
Theorem A.6 (Radon-Nikodym). Suppose $\mu$ and $\nu$ are two $\sigma$-finite measures on a common measurable space $(\Omega, \mathcal{F})$ and $\nu \ll \mu$. Then there exists an $\mathcal{F}$-measurable function $f \ge 0$ such that for every $A\in\mathcal{F}$,
$$\nu(A) = \int_A f\,d\mu.$$
A.3 Laws of large numbers
Proof. Equivalently, we can assume is nonnegative and and prove . Let and with . Immediately,
On the other hand,
The theorem is proved by combining the above.
Proof. Writing with , one may use summation by parts to deduce
The last expression converges to as .
Proof. By virtue of the independence, one observes that
which, in particular, implies is finite almost surely, which in turn implies the conclusion by Lemma A.8.
Proof. It suffices to prove the case , for if otherwise, one may still apply the result to for every if (resp., if ) so that and that (resp., ), leading to the conclusion by letting .
To begin, truncate to define
Then, verify the assumption of Proposition A.9
to deduce
On the other hand, by the Borel-Cantelli lemma,
Combining (A.3) and (A.4) gives that
B Functional analysis
Theorem B.1 (Hahn-Banach Extension Theorem). Let $V$ be a real vector space. Suppose that
- $W$ is a linear subspace of $V$,
- $p: V\to\mathbb{R}$ is a convex functional,
- $\ell: W\to\mathbb{R}$ is linear and $\ell \le p$ on $W$.
Then, there exists a linear functional $L: V\to\mathbb{R}$ such that $L|_W = \ell$ and $L \le p$ on $V$.
Proof. The theorem is proved by transfinite induction.
If , choose , and extend to by
where one may choose, if possible,
so that on . To see the above interval is indeed nonempty, note that the convexity of together with implies that for each ,
and hence the non-emptiness follows from the finite intersection property.
To conclude the proof, consider the set of all pairs such that is a subspace of and that is linear on with . Define further a partial order by
Hence, for any chain , the space is a vector space and the function whenever , which together form an upper bound of the chain. Therefore, by Zorn's lemma, there exists a maximal element , and by the first part of the proof.
In the context of topological vector spaces, an equivalent form of Theorem B.1 is phrased in terms of separation, as follows. Recall that a core point of a subset of a vector space is a point such that for every there exists such that for all , for which we denote .
Proof of Theorem B.2 by Theorem B.1. We first prove the general case, after which we point out the necessary modifications for the case where .
Let and . Define so that and its associated gauge
Then, is a sublinear functional. Indeed, since , is real-valued. Moreover, for all by definition and by convexity of . Consider the subspace and define the linear functional . To apply the extension theorem, observe that and thus , yielding a linear extension of satisfying by Theorem B.1. To see separates and , observe that for any and ,
The number in question can be arbitrary in . To see that , note that for some and and as desired.
Now if , pick and proceed as before. Under the circumstances, is a neighborhood of . Hence, for all in the neighborhood of , proving the continuity. Furthermore, due to the fact , rendering (B.1) a strict inequality.
Proof of Theorem B.1 by Theorem B.2. Let
Then, and are convex and every point in is a core point. Thus, by Theorem B.2, there exists a linear functional and numbers such that
Under the circumstances, we observe, by letting , that . It is clear that , for if otherwise, the inequality (B.1) above holds if and only if for all , contradicting Theorem B.2 as . Now that , the inequality (B.2) implies that
The proof is finished by choosing .
C Basics in convex analysis
Proof. We assume, without loss of generality, that is not identically , for the lemma holds trivially otherwise. Define
which are convex subsets in the locally convex topological vector spaces and , respectively.
It is not hard to verify by definition that . Hence, it suffices to prove the other inequality. Equivalently, it asserts that if , then there is such that . Then, observe that is compact and is closed by lower semicontinuity of and nonempty by the fact is not identically . Moreover, under the assumption , . These altogether allow us to apply the Hahn-Banach separation theorem (specifically, Theorem B.3) to yield some and satisfying
where must hold as there exists with .
If , then (C.1) implies that . Moreover, as desired, for if otherwise, contradicting the definition of .
If , then for all , define for . In fact, since
Moreover,
yielding the desired when is sufficiently small.
Proof. Note that , and hence the lemma, by duality (Lemma C.1), is equivalent to the existence of such that for all . This latter claim, by convexity of , is equivalent to
To prove the claim, define
Obviously, is convex, compact, and nonempty. On the other hand, is nonempty, convex, and . Clearly, . To prove convexity, note that is a convex function and the closure of a convex set is convex. Indeed,
Finally, to prove , we first show that . Under the circumstances, we deduce that is continuous at and thus follows naturally. To this end, note that implies . Indeed, and if , then and for all small ,
In particular, the above implies . Moreover, implies also since if and only if for all .
We now may apply Theorem B.3 to and to yield some such that
where since . Hence, for every with , we have that
The proof is concluded by choosing .
Proof. Since , it follows by convexity of that
Moreover, since ,
Because is essentially smooth, there exists a closed ball in which is differentiable. Hence,
Hence, for any and any different from , by the convexity of ,
where . Similarly, . Hence,
Observe that because of the convexity of . Hence, by assumption, exists, and by the preceding inequality, . Since is steep, by considering , in which case , we conclude that .
References
- [1] Dembo, Amir, and Zeitouni, Ofer, Large Deviations Techniques and Applications, Springer Berlin Heidelberg, 2010.