Probability Theory
By: Ao Chen
Published on: 9/1/2025
Updated on: 9/18/2025
A rough overview of probability theory.
Series: Probability, Random Variables and Stochastic Process
Tags: probability, signal, electronic engineering, finance
Modern mathematics views probability as a measure structure on a space. First, we need a method to divide subsets within the space into “measurable” and “non-measurable,” where only “measurable” subsets possess probability. These sets are precisely the “events” studied in traditional probability theory.
Basic Structure
σ-algebra
Given a non-empty set $\Omega$ as the universe, if $\mathcal{F}$ belongs to the power set of $\Omega$, i.e., $\mathcal{F} \subseteq 2^{\Omega}$, meaning $\mathcal{F}$ is a collection of subsets of $\Omega$, and it satisfies the following conditions:
- The universe is in $\mathcal{F}$: $\Omega \in \mathcal{F}$;
- For any set in $\mathcal{F}$, its complement is also in $\mathcal{F}$: $A \in \mathcal{F} \Rightarrow \Omega \setminus A \in \mathcal{F}$;
- For any countable collection of sets in $\mathcal{F}$, their union is in $\mathcal{F}$: $A_1, A_2, \ldots \in \mathcal{F} \Rightarrow \bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$,
then we call $\mathcal{F}$ a σ-algebra on $\Omega$.
It can be proven that a σ-algebra is closed under finite intersections, unions, and differences. Since $\Omega \in \mathcal{F}$, we have $\varnothing = \Omega \setminus \Omega \in \mathcal{F}$.
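These closure properties can be checked mechanically on a small finite universe; a minimal sketch, taking $\mathcal{F}$ to be the full power set (always a σ-algebra) of an illustrative universe $\{1, 2, 3\}$:

```python
from itertools import chain, combinations

# Build the power set 2^Omega of a small illustrative universe.
Omega = frozenset({1, 2, 3})
subsets = chain.from_iterable(combinations(Omega, r) for r in range(len(Omega) + 1))
F = {frozenset(s) for s in subsets}

# Axiom 1: the universe belongs to F.
assert Omega in F
# Axiom 2: closure under complement.
assert all(Omega - A in F for A in F)
# Axiom 3: closure under union (finite here, standing in for countable).
assert all((A | B) in F for A in F for B in F)
# Derived closure properties: intersection, difference, and the empty set.
assert all((A & B) in F and (A - B) in F for A in F for B in F)
assert frozenset() in F
```

Since the universe is finite, countable unions reduce to finite ones, so the finite checks above cover the axioms.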
Measure Space
If a set $\Omega$ has a σ-algebra $\mathcal{F}$, then this set with respect to $\mathcal{F}$ forms a measurable space $(\Omega, \mathcal{F})$.
Naturally, sets in $\mathcal{F}$ are called measurable, and sets not in $\mathcal{F}$ are non-measurable.
If we assign a non-negative, possibly infinite number to each measurable set, this corresponds to a function $\mu: \mathcal{F} \to [0, \infty]$, satisfying the following conditions:
- $\mu(\varnothing) = 0$;
- For any countable collection of sets $A_1, A_2, \ldots \in \mathcal{F}$, their union satisfies
$$\mu\left(\bigcup_{i=1}^{\infty} A_i\right) \le \sum_{i=1}^{\infty} \mu(A_i),$$
where equality holds if and only if for any indices $i, j$ with $i \ne j$, we have $\mu(A_i \cap A_j) = 0$ (note that we don't necessarily require $A_i \cap A_j = \varnothing$),
then we call $\mu$ a measure on the measurable space $(\Omega, \mathcal{F})$, and the measurable space with respect to the measure forms a measure space $(\Omega, \mathcal{F}, \mu)$.
It's easy to see that the measure has a maximum value $\mu(\Omega)$, which is called the limit capacity of the measure space $(\Omega, \mathcal{F}, \mu)$.
If the limit capacity $\mu(\Omega) < \infty$, then the measure space (with respect to the measure $\mu$) is said to be bounded or finite; if the limit capacity $\mu(\Omega) = 1$, then the measure space is called a probability space, and the measure is called a 0-1 measure or probability measure on $\Omega$.
Integration of Functions
Consider a function $f: \Omega \to \mathbb{R}$ on the measure space $(\Omega, \mathcal{F}, \mu)$. If for any real number $a$, we have $f^{-1}\big((a, +\infty)\big) \in \mathcal{F}$, meaning the preimage of any open interval extending from $a$ to $+\infty$ is measurable, then the function $f$ is called a measurable function.
Characteristic Function
First, we consider the integral of a characteristic function $\chi_A$ (also called the indicator function of $A$, not to be confused with the characteristic function $E[e^{i\omega X}]$ introduced later), where
$$\chi_A(x) = \begin{cases} 1, & x \in A, \\ 0, & x \notin A. \end{cases}$$
If $A$ is measurable, then we define
$$\int \chi_A \, \mathrm{d}\mu = \mu(A).$$
Of course, if $A \notin \mathcal{F}$ (i.e., $A$ is not measurable), then $\chi_A$ is not measurable and its integral is left undefined.
Now we define the integral of a so-called simple function or step function
$$f = \sum_{i} c_i \chi_{A_i}, \qquad \int f \, \mathrm{d}\mu = \sum_{i} c_i \, \mu(A_i).$$
Its convergence is determined entirely by the series defined by the summation: if the sum has finitely many terms, then it must converge; if it has infinitely many terms, then its convergence is given by the following limit
$$\lim_{n \to \infty} \sum_{i=1}^{n} c_i \, \mu(A_i).$$
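A minimal numerical sketch of this definition, using an assumed weighted counting measure on a four-point universe (the weights and the simple function below are illustrative, not from the article):

```python
# A toy measure space: Omega finite, mu a weighted counting measure.
Omega = {"a", "b", "c", "d"}
weight = {"a": 0.5, "b": 1.0, "c": 2.0, "d": 0.25}

def mu(A):
    """Measure of a subset A of Omega under the assumed weights."""
    return sum(weight[x] for x in A)

# A simple function f = 3*chi_{a,b} - 1*chi_{c}, given as (c_i, A_i) pieces.
pieces = [(3.0, {"a", "b"}), (-1.0, {"c"})]

# Integral of the simple function: sum_i c_i * mu(A_i).
integral = sum(c * mu(A) for c, A in pieces)
print(integral)  # 3*(0.5 + 1.0) - 1*2.0 = 2.5
```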
General Definition through Approximation
If $f \ge 0$ is a non-negative function, its integral is defined as the least upper bound of the integrals of all non-negative simple functions less than $f$:
$$\int f \, \mathrm{d}\mu = \sup \left\{ \int s \, \mathrm{d}\mu : 0 \le s \le f, \ s \text{ simple} \right\}.$$
Here, the $\sup$ can be transformed into a limit using the monotone sequence convergence principle.
For any measurable function $f$ with arbitrary values, we can decompose $f$ as
$$f = f^+ - f^-,$$
where
$$f^+ = \max(f, 0), \qquad f^- = \max(-f, 0).$$
We can see that both $f^+$ and $f^-$ are non-negative and measurable, so we can define
$$\int f \, \mathrm{d}\mu = \int f^+ \, \mathrm{d}\mu - \int f^- \, \mathrm{d}\mu.$$
For the integral over a general set $A \in \mathcal{F}$, noting that the product $f \chi_A$ is measurable, we can define it as
$$\int_A f \, \mathrm{d}\mu = \int f \chi_A \, \mathrm{d}\mu.$$
Probability Interpretation
We conventionally use $P$ to denote a probability measure. Consider the probability space $(\Omega, \mathcal{F}, P)$, where we interpret $\Omega$ as the sample space, its elements $\omega$ as samples, and a measurable function $X$ on the probability space as the value possessed by the sample; this constitutes the statistical meaning of probability.
Measurable sets in the sample space, as collections of samples, are called events.
We can model it as follows: Let $\Omega = \{\omega_1, \omega_2, \ldots, \omega_n\}$ be a finite sample space containing all samples, and let a measurable function $X$ assign a value $X(\omega_i)$ to each sample $\omega_i$. When the sample size is large enough, it is more economical to consider the statistical significance of $X$ rather than the specific value of each $X(\omega_i)$; it is likewise more economical to model $\Omega$ with an infinite cardinality, just as we use a distribution curve to replace a sufficiently dense histogram. Therefore, we define: a measurable function $X$ on the sample space is a random variable.
Taking a random variable $X$, we now consider the meaning of the integral
$$\int X \, \mathrm{d}P.$$
According to the definition,
$$\int X \, \mathrm{d}P = \lim \sum_i x_i \, P(A_i).$$
Here we use a set of simple notations:
- $x_i$ represents the value of an approximating simple function on the measurable set $A_i$;
- $\lim$ represents the limit process needed to obtain the $\sup$ over simple functions, which can be simply thought of as refining the partition $\{A_i\}$.
We see that $x_i$ reflects the value of $X$, and $P(A_i)$ can be viewed as the measure of the measurable set on which $X$ is closest to $x_i$, which we can briefly denote as $P(X \approx x_i)$; that is, we have
$$\int X \, \mathrm{d}P = \lim \sum_i x_i \, P(X \approx x_i).$$
This value reflects the sample-average property of $X$ and is called the expectation of $X$, denoted as $E[X]$ or simply $EX$.
Consider a simple example, $\Omega = \{\omega_1, \ldots, \omega_n\}$, with probability defined as
$$P(\{\omega_i\}) = p_i,$$
satisfying $\sum_{i=1}^{n} p_i = 1$. In this case, of course, all subsets of $\Omega$ should be measurable, i.e., $\mathcal{F} = 2^{\Omega}$. Let the random variable be $X(\omega_i) = x_i$. Note that $X$ is a simple function
$$X = \sum_{i=1}^{n} x_i \chi_{\{\omega_i\}},$$
then
$$E[X] = \sum_{i=1}^{n} x_i \, p_i.$$
That is, $E[X]$ is the weighted average of the $x_i$ with weights $p_i$.
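This weighted average is immediate to compute; a sketch with illustrative values $p_i$ and $x_i$ (not taken from the article):

```python
# E[X] = sum_i x_i * p_i for a finite probability space.
p = [0.2, 0.5, 0.3]    # probabilities p_i, summing to 1
x = [1.0, 4.0, 10.0]   # values x_i of the random variable X

assert abs(sum(p) - 1.0) < 1e-12  # sanity check: p is a distribution
EX = sum(xi * pi for xi, pi in zip(x, p))  # 0.2*1 + 0.5*4 + 0.3*10 = 5.2
print(EX)
```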
Lebesgue Integration
In this section, we consider the special case where $\Omega = \mathbb{R}$. For the interval
$$(a, b],$$
we define the measure
$$\mu\big((a, b]\big) = b - a.$$
Note that we do not use the common definition $|b - a|$, because we assume by default that the interval is $\varnothing$ when $b \le a$, so that $\mu\big((a, b]\big) = \max(b - a, 0)$ (the physical meaning of ReLU, the Rectified Linear Unit), which is consistent with the standard requirement $\mu(\varnothing) = 0$.
Similarly, we obtain such a measure for an arbitrary subset through approximation by countable interval covers:
$$\mu^*(A) = \inf \left\{ \sum_{i=1}^{\infty} (b_i - a_i) : A \subseteq \bigcup_{i=1}^{\infty} (a_i, b_i] \right\}.$$
This measure is called the Lebesgue outer measure.
However, it is worth noting that we must carefully select measurable sets, even though the Lebesgue outer measure is defined for any subset of $\mathbb{R}$. In fact, $(\mathbb{R}, 2^{\mathbb{R}}, \mu^*)$ is not a measure space, because $\mu^*$ cannot form a measure on $2^{\mathbb{R}}$ (hence it is called an "outer measure").
A measure requires countable additivity, that is, the following inequality holds
$$\mu\left(\bigcup_{i=1}^{\infty} A_i\right) \le \sum_{i=1}^{\infty} \mu(A_i),$$
with the condition for equality: the sets are pairwise disjoint.
If we abandon the condition for equality, then it is called subadditivity.
It can be proven that $\mu^*$ only satisfies subadditivity on $2^{\mathbb{R}}$, therefore it does not form a measure there. But we can find a σ-algebra $\mathcal{L} \subset 2^{\mathbb{R}}$ on which $\mu^*$ forms a measure space $(\mathbb{R}, \mathcal{L}, \mu^*)$. In fact, the determination of $\mathcal{L}$ is the famous Carathéodory measurability criterion, which only allows sets $E$ satisfying the condition
$$\mu^*(A) = \mu^*(A \cap E) + \mu^*(A \setminus E) \quad \text{for every } A \subseteq \mathbb{R}$$
to be measurable sets.
We call such a measure space the Lebesgue measure space, that is,
$$(\mathbb{R}, \mathcal{L}, m), \qquad m = \mu^*\big|_{\mathcal{L}},$$
where sets satisfying the Carathéodory criterion are called Lebesgue measurable, and sets not satisfying it are called Lebesgue non-measurable.
Through the Lebesgue measure, we can introduce the integral $\int f \, \mathrm{d}m$, called the Lebesgue integral. In fact, for almost everywhere continuous functions $f$, the Lebesgue integral degenerates to the Riemann integral
$$\int f \, \mathrm{d}m = \int f(x) \, \mathrm{d}x.$$
Therefore, the Lebesgue integral, as a generalization of the Riemann integral, does not conflict with its definition, and we can directly use the notation of the Riemann integral to denote the Lebesgue integral.
If a function $g$ is monotonically increasing and right-continuous, then we can define such an outer measure
$$\mu_g^*(A) = \inf \left\{ \sum_{i=1}^{\infty} \big( g(b_i) - g(a_i) \big) : A \subseteq \bigcup_{i=1}^{\infty} (a_i, b_i] \right\}.$$
This is called the Lebesgue–Stieltjes outer measure induced by $g$. Similarly, we can construct a Lebesgue–Stieltjes measure space and obtain the Lebesgue–Stieltjes integral
$$\int f \, \mathrm{d}\mu_g,$$
and for almost everywhere continuous functions $f$, the Lebesgue–Stieltjes integral degenerates to the Riemann–Stieltjes integral
$$\int f \, \mathrm{d}\mu_g = \int f(x) \, \mathrm{d}g(x).$$
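The degeneration to a Riemann–Stieltjes sum can be seen numerically; a sketch with the assumed integrand $f(x) = x$ and integrator $g(x) = x^2$ on $[0, 1]$, where the exact value is $\int_0^1 x \, \mathrm{d}(x^2) = \int_0^1 2x^2 \, \mathrm{d}x = 2/3$:

```python
# Riemann-Stieltjes sum: sum_k f(x_k) * (g(x_k) - g(x_{k-1})).
def stieltjes_sum(f, g, a, b, n):
    xs = [a + (b - a) * k / n for k in range(n + 1)]
    return sum(f(xs[k]) * (g(xs[k]) - g(xs[k - 1])) for k in range(1, n + 1))

approx = stieltjes_sum(lambda x: x, lambda x: x * x, 0.0, 1.0, 10_000)
print(approx)  # -> close to 2/3
```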
Distribution
By convention, a bare integral sign denotes integration over the entire sample space, $\int = \int_{\Omega}$.
Now we return to the probability space $(\Omega, \mathcal{F}, P)$, where we want to point out that the choice of probability $P$ is not fixed. When $(\Omega, \mathcal{F})$ is given, the choice of $P$ is called a probability distribution on the probability space.
After fixing the probability distribution $P$, for a random variable $X$, we define the distribution function
$$F(x) = P(X \le x).$$
Here we introduce the common notation $P(\text{cond})$, where cond represents a limiting condition on $X$ (such as the inequality $X \le x$), i.e., $P(X \le x) = P(\{\omega : X(\omega) \le x\})$. In fact, the random variable represents a statistical numerical distribution of samples, and the probability represents a probability distribution on the samples. The two combined yield the statistical behavior of probability and random variables, so we use $X \sim P$ or $X \sim F$ to describe this pattern, meaning that the random variable $X$ follows the probability distribution $P$ or $F$. In this sense, we define
$$E[f(X)] = \int f(X) \, \mathrm{d}P$$
or
$$E[f(X)] = \int f(x) \, \mathrm{d}F(x),$$
where $f$ is some algebraic function. Now we point out that under the condition $F(x) = P(X \le x)$, the above two expressions are equivalent. For simplicity, we consider the Riemann–Stieltjes integral, take a partition $x_0 < x_1 < \cdots < x_n$, and have
$$\sum_{i} f(x_i) \big( F(x_i) - F(x_{i-1}) \big) = \sum_{i} f(x_i) \, P(x_{i-1} < X \le x_i).$$
In the limit sense, the above expression clearly converges to $\int f(X) \, \mathrm{d}P$.
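For a discrete random variable the equivalence can be checked term by term; a sketch with an illustrative three-sample space (note that two samples share the same value, so the jump of $F$ there merges their probabilities):

```python
# Compare int f(X) dP (sum over samples) with int f(x) dF(x) (sum over jumps).
P = {"w1": 0.25, "w2": 0.25, "w3": 0.5}   # illustrative sample probabilities
X = {"w1": 2.0, "w2": 2.0, "w3": 5.0}     # w1 and w2 share the value 2

f = lambda t: t * t

lhs = sum(f(X[w]) * p for w, p in P.items())     # sample-space side
jumps = {}
for w, p in P.items():
    jumps[X[w]] = jumps.get(X[w], 0.0) + p       # jumps of F at each value
rhs = sum(f(v) * p for v, p in jumps.items())    # distribution side
print(lhs, rhs)  # both equal 0.5*4 + 0.5*25 = 14.5
```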
In essence, elements of the sample space can be of various kinds, and a random variable is a quantification of some aspect of the samples. The sample space itself is not a quantitative entity, so the probability distribution on it is not quantitative either; the distribution function quantifies the distribution of probability on the sample space through random variables.
The logic of this matter is as follows:
- The sample space itself is just a collection of samples
- Probability gives a measurable distribution on the sample space
- Although each probability value itself is a number
- Probability as a function of "sets" nevertheless has a distribution that is difficult to quantify
- A random variable is a quantification of some property of the sample itself
- Therefore, random variables and probability jointly quantify samples and their probability distributions
- Samples are quantified through random variables
- The probability distribution on samples is quantified through the distribution function
- In this system, we say that the random variable follows this distribution
Conditional Distribution
Consider the probability space $(\Omega, \mathcal{F}, P)$. For an event $B$ with $P(B) > 0$, define the following probability
$$P_B(A) = \frac{P(A \cap B)}{P(B)}.$$
We want to point out that $P_B$ is also a probability. In fact, according to countable additivity
$$P\left(\bigcup_i A_i\right) \le \sum_i P(A_i),$$
where equality holds when the sets are pairwise disjoint. According to the definition of $P_B$, we have
$$P_B\left(\bigcup_i A_i\right) = \frac{P\left(\bigcup_i (A_i \cap B)\right)}{P(B)} \le \frac{\sum_i P(A_i \cap B)}{P(B)} = \sum_i P_B(A_i),$$
where the sets $A_i \cap B$ are pairwise disjoint whenever the $A_i$ are, so using countable additivity, equality holds under exactly the same condition as before. In the derivation, we did not change the condition for equality, so $P_B$ indeed forms a probability space $(\Omega, \mathcal{F}, P_B)$. We call $P_B$ the conditional probability given that event $B$ occurs, denoted as $P(A \mid B) = P_B(A)$.
We study the integral of a random variable $X$ under the conditional probability
$$E[X \mid B] = \int X \, \mathrm{d}P_B.$$
Let $F$ be the distribution function of $X$ under $P$ and $F_B$ its distribution function under $P_B$; then according to the definition
$$E[X \mid B] = \int x \, \mathrm{d}F_B(x).$$
This actually does not yield an effective result, because we know too little about $F_B$.
Now consider another random variable $Y$, and take the condition to be $B = \{y < Y \le y + \Delta y\}$ with $\Delta y \to 0$; then we consider
$$P(X \le x \mid y < Y \le y + \Delta y) = \frac{P(X \le x,\ y < Y \le y + \Delta y)}{P(y < Y \le y + \Delta y)} = \frac{F(x, y + \Delta y) - F(x, y)}{F_Y(y + \Delta y) - F_Y(y)},$$
where $F(x, y)$ is the joint probability distribution of $(X, Y)$. Now, in the limit $\Delta y \to 0$, we have
$$P(X \le x \mid Y = y) = \frac{\partial F(x, y) / \partial y}{f_Y(y)}.$$
This means that the conditional density is
$$\frac{\partial^2 F(x, y)}{\partial x \, \partial y} \Big/ f_Y(y) = \frac{f(x, y)}{f_Y(y)}.$$
This is a function of $x$ with parameter $y$, which we denote as
$$f(x \mid y) = \frac{f(x, y)}{f_Y(y)},$$
called the conditional density, and its integral $F(x \mid y) = \int_{-\infty}^{x} f(t \mid y) \, \mathrm{d}t$ is called the conditional distribution.
In the derivation, we can see that we actually assumed the existence of the densities $f(x, y)$, $f_Y(y)$, and $f(x \mid y)$, which is necessary. Density, as the derivative of the distribution, reflects a second-order property, that is, the associative characteristics between samples, which is an important property of conditional variables.
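The ratio $f(x, y) / f_Y(y)$ has an exact discrete analogue, which a short sketch can verify (the joint pmf below is illustrative):

```python
# Discrete analogue of the conditional density: p(x|y) = p(x, y) / p_Y(y).
joint = {(0, 0): 0.1, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.4}

p_Y = {}
for (x, y), p in joint.items():
    p_Y[y] = p_Y.get(y, 0.0) + p                       # marginal of Y

cond = {(x, y): p / p_Y[y] for (x, y), p in joint.items()}

# Each conditional slice in x is itself a probability distribution.
for y0 in p_Y:
    total = sum(c for (x, y), c in cond.items() if y == y0)
    assert abs(total - 1.0) < 1e-12
print(cond[(1, 0)])  # -> 0.3 / 0.4 = 0.75 up to floating point
```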
We can further consider the integral
$$\int x \, f(x \mid y) \, \mathrm{d}x.$$
This integral is a function of $y$, which we denote as $g(y)$. It can be proven that composing it with $Y$ yields a random variable, which is denoted as
$$E[X \mid Y] = g(Y),$$
called the conditional expectation of $X$ given $Y$.
We can see that
$$E\big[E[X \mid Y]\big] = \int g(y) \, f_Y(y) \, \mathrm{d}y = \iint x \, f(x \mid y) \, f_Y(y) \, \mathrm{d}x \, \mathrm{d}y.$$
Using the Fubini theorem to simplify this integral,
$$E\big[E[X \mid Y]\big] = \int x \left( \int f(x, y) \, \mathrm{d}y \right) \mathrm{d}x = \int x \, f_X(x) \, \mathrm{d}x = E[X].$$
We can understand conditional expectation as follows: the random variable $Y$ serving as the condition itself has randomness, and the conditional expectation is the expectation of $X$ under fixed $Y$, so it inevitably carries the randomness of $Y$; this is why the conditional expectation is a random variable. And as we said, the randomness of the conditional expectation comes entirely from the condition $Y$, so taking the expectation of $E[X \mid Y]$ again eliminates that randomness, and its result
$$E\big[E[X \mid Y]\big] = E[X]$$
is just the expectation of $X$, which is precisely the embodiment of the law of total expectation, the continuous analogue of the law of total probability.
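This tower property can be checked by simulation; a sketch of a hypothetical two-stage model where $Y$ is uniform on $\{1, 2, 3\}$ and, given $Y = y$, $X$ is uniform on $[0, y]$, so $E[X \mid Y] = Y/2$ and $E\big[E[X \mid Y]\big] = E[Y]/2 = 1$:

```python
import random

# Monte-Carlo estimate of E[X] in a two-stage model: first draw Y,
# then draw X given Y; by the tower property E[X] = E[E[X|Y]] = E[Y]/2.
random.seed(0)
N = 100_000
total = 0.0
for _ in range(N):
    y = random.choice([1, 2, 3])   # the conditioning variable Y
    x = random.uniform(0.0, y)     # X | Y = y  ~  Uniform[0, y]
    total += x
EX = total / N
print(EX)  # -> close to 1.0
```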
Large Sample Limit
We call $E[X^n]$ the $n$-th moment of the random variable $X$. Just like the multipole moment expansion of potential fields in field theory, the $n$-th moment is the integral of the $n$-th power
$$E[X^n] = \int X^n \, \mathrm{d}P = \int x^n \, \mathrm{d}F(x).$$
The most typical characteristic is the 2nd moment, which reflects the property of second-order correlation. Here we directly give the definitions of the related second-order quantities:
- Second-order self-correlation of a random variable — fluctuation information
  - Variance $D[X] = E\big[(X - E[X])^2\big] = E[X^2] - (E[X])^2$
  - Standard deviation $\sigma_X = \sqrt{D[X]}$
- Second-order correlation between two random variables — correlation information
  - Correlation function $R_{XY} = E[XY]$
  - Covariance $\operatorname{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big] = R_{XY} - E[X]E[Y]$
  - Correlation coefficient $\rho_{XY} = \dfrac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}$
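These quantities can be computed directly from paired data; a sketch on illustrative samples where $Y = 2X$, so the correlation coefficient should come out as $1$:

```python
# Second-order quantities for paired data (population convention, divide by n).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # ys = 2 * xs, perfectly correlated

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
var_x = sum((x - mx) ** 2 for x in xs) / n           # variance D[X]
var_y = sum((y - my) ** 2 for y in ys) / n
std_x, std_y = var_x ** 0.5, var_y ** 0.5            # standard deviations
corr_fn = sum(x * y for x, y in zip(xs, ys)) / n     # correlation function E[XY]
cov = corr_fn - mx * my                              # covariance
rho = cov / (std_x * std_y)                          # correlation coefficient
print(var_x, cov, rho)  # -> 1.25, 2.5, and rho = 1 up to rounding
```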
Now suppose we have many random variables $X_1, X_2, \ldots, X_n$ that are independent and identically distributed (i.i.d.); we consider the sequence average
$$\overline{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i.$$
Let the common distribution function of the sequence be $F$, with expectation $\mu$ and variance $\sigma^2$, and let the distribution function of $\overline{X}_n$ be $F_n$, with expectation $\mu_n$ and variance $\sigma_n^2$. We want to study the limit behavior of $\mu_n$, $\sigma_n^2$, and $F_n$ as $n \to \infty$. The work on the numerical characteristics $\mu_n$, $\sigma_n^2$ is summarized as the Law of Large Numbers, while the work on the distribution function $F_n$ is the famous Central Limit Theorem.
Law of Large Numbers
We can directly calculate
$$\mu_n = E[\overline{X}_n] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \mu.$$
The second moment is
$$E[\overline{X}_n^2] = \frac{1}{n^2} \sum_{i, j} E[X_i X_j].$$
According to the relationship between covariance and the second-order correlation function,
$$E[X_i X_j] = \operatorname{Cov}(X_i, X_j) + E[X_i] E[X_j] = \begin{cases} \sigma^2 + \mu^2, & i = j, \\ \mu^2, & i \ne j, \end{cases}$$
since independence gives $\operatorname{Cov}(X_i, X_j) = 0$ for $i \ne j$. Substituting this and computing the two counts, $n$ diagonal terms and $n^2 - n$ off-diagonal terms, the second moment is
$$E[\overline{X}_n^2] = \frac{1}{n^2} \big( n(\sigma^2 + \mu^2) + (n^2 - n)\mu^2 \big) = \mu^2 + \frac{\sigma^2}{n}.$$
From this, we get the variance
$$\sigma_n^2 = E[\overline{X}_n^2] - \mu_n^2 = \frac{\sigma^2}{n} \to 0 \quad (n \to \infty).$$
These two results, $\mu_n = \mu$ and $\sigma_n^2 = \sigma^2 / n \to 0$, are called the Law of Large Numbers.
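Both conclusions are easy to confirm empirically; a sketch with i.i.d. Uniform[0,1] draws, for which $\mu = 1/2$ and $\sigma^2 = 1/12$, so the mean of $\overline{X}_n$ should sit near $1/2$ and its variance near $(1/12)/n$:

```python
import random

# Empirical mean and variance of the sequence average of n i.i.d. samples.
random.seed(1)
n, trials = 100, 20_000
means = [sum(random.random() for _ in range(n)) / n for _ in range(trials)]

emp_mean = sum(means) / trials
emp_var = sum((m - emp_mean) ** 2 for m in means) / trials
print(emp_mean, emp_var)  # -> near 0.5 and near (1/12)/100 ≈ 0.00083
```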
We now explain the meaning of such a sequence of i.i.d. random variables. Similar to the above, we consider a finite sample set $\Omega = \{\omega_1, \ldots, \omega_m\}$, in which case the random variable $X$ can be completely determined by the corresponding sample values $x_i = X(\omega_i)$, and similarly, the probability $P$ can be completely determined by the sample probability values $p_i = P(\{\omega_i\})$. Note that the distribution function at this time is
$$F(x) = \sum_{i} p_i \, \theta(x - x_i),$$
where $\theta$ is the Heaviside function or unit step response. Differentiating it gives the probability density
$$f(x) = \sum_{i} p_i \, \delta(x - x_i).$$
We can see that for two random variables $X, Y$, if they have the same distribution, then they must share the same collection of value–probability pairs $(x_i, p_i)$. But note that this does not directly imply $X = Y$. A simple counterexample is a fair coin toss, $\Omega = \{\text{heads}, \text{tails}\}$ with probability $1/2$ each, with $X$ the indicator of heads and $Y = 1 - X$ the indicator of tails. We have
$$X \sim Y, \qquad \text{but} \qquad X(\omega) \ne Y(\omega) \ \text{for every } \omega.$$
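One concrete counterexample (a fair coin, with X the indicator of heads and Y = 1 - X) is easy to verify by simulation; the sketch below is illustrative:

```python
import random

# X = indicator of heads, Y = 1 - X: same distribution, never equal.
random.seed(2)
X = [random.randint(0, 1) for _ in range(10_000)]
Y = [1 - x for x in X]

freq = lambda v, seq: sum(1 for s in seq if s == v) / len(seq)
print(freq(1, X), freq(1, Y))             # both near 0.5: same distribution
assert all(x != y for x, y in zip(X, Y))  # but pointwise never equal
```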
That is, a sequence of i.i.d. random variables cannot explain much from their values themselves, but we can be sure that their numerical characteristics must be consistent. We remember that a random variable is a quantification of some characteristic of the sample, and a sequence of i.i.d. random variables can be statistically understood as multiple independent surveys of a certain characteristic of the sample, with each random variable being the result of one survey. From the experience of multiple surveys in the real world, the result of each survey is unknown, but under stable conditions, the statistical quantities of the surveys will not change, which is consistent with the fact that the distribution of i.i.d. random variables does not change.
Central Limit Theorem
Finally, we look at the distribution function or probability distribution of $\overline{X}_n$. First, we need to introduce the so-called normal distribution or Gaussian distribution, which is defined by the following probability density in one dimension:
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).$$
It can be proven that the parameters $\mu$ and $\sigma^2$ are precisely the expectation and variance of the distribution, commonly denoted as $N(\mu, \sigma^2)$. We point out that the normal distribution, as its name suggests, is the distribution in a normal state: no matter what kind of i.i.d. random variable sequence we start from (so long as its variance is finite), the limiting distribution of its sequence average must be a normal distribution.
To illustrate this point, we first need to introduce an important tool — the characteristic function. For a random variable $X$, its characteristic function is defined as the mean
$$\varphi(\omega) = E\big[e^{i\omega X}\big].$$
Considering the distribution function $F$ and probability density $f$, the above expression can be written as
$$\varphi(\omega) = \int e^{i\omega x} \, \mathrm{d}F(x) = \int e^{i\omega x} f(x) \, \mathrm{d}x.$$
This is precisely the Fourier transform of the probability density! Going further, we define
$$M(s) = E\big[e^{sX}\big].$$
This is actually the (two-sided) Laplace transform of the probability density. This process is strikingly consistent with the process of extending the spectrum of a signal from the frequency domain to the s-domain!
Now we consider the properties of the characteristic function. Here we take a bit of engineering liberty (as we have actually done many times before), expanding as follows:
$$\varphi(\omega) = E\left[ \sum_{n=0}^{\infty} \frac{(i\omega X)^n}{n!} \right] = \sum_{n=0}^{\infty} \frac{(i\omega)^n}{n!} E[X^n].$$
Viewing this expression as the Maclaurin series of $\varphi$, we can immediately read off the derivatives of the characteristic function
$$\varphi^{(n)}(0) = i^n \, E[X^n].$$
That is, the $n$-th derivative of the characteristic function at $\omega = 0$ is, up to the factor $i^n$, precisely the $n$-th moment of the random variable.
Let's review our derivation: formally expanding $E[e^{i\omega X}]$ is essentially the termwise expansion of the exponential, because the expectation degenerates from the abstract integral over the sample space to the ordinary integral $\int e^{i\omega x} \, \mathrm{d}F(x)$ or $\int e^{i\omega x} f(x) \, \mathrm{d}x$ under the distribution function or probability density. Then the summation is brought outside the integral, which is justified by the uniform convergence of the inner series; and the inner series, as the Maclaurin series of the exponential function, converges uniformly on compact sets. Finally, for the derivatives of the characteristic function, we require $\varphi$ to be holomorphic. Although holomorphy is a rather strong requirement for complex functions, since $e^{i\omega x}$ restricted to real $\omega$ is a sufficiently smooth function in modulus and phase, the required smoothness is guaranteed.
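The moment-extraction property can be checked numerically; a sketch using an illustrative discrete $X$ and finite-difference derivatives of $\varphi(\omega) = E[e^{i\omega X}]$ at $\omega = 0$:

```python
import cmath

# Discrete X with illustrative values and probabilities.
xs = [-1.0, 0.0, 2.0]
ps = [0.3, 0.4, 0.3]

def phi(w):
    """Characteristic function phi(w) = E[exp(i w X)]."""
    return sum(p * cmath.exp(1j * w * x) for x, p in zip(xs, ps))

m1 = sum(p * x for x, p in zip(xs, ps))        # E[X]   = 0.3
m2 = sum(p * x * x for x, p in zip(xs, ps))    # E[X^2] = 1.5

h = 1e-4
d1 = (phi(h) - phi(-h)) / (2 * h)              # approximates phi'(0)  = i*E[X]
d2 = (phi(h) - 2 * phi(0) + phi(-h)) / h ** 2  # approximates phi''(0) = -E[X^2]
print(d1, d2)  # -> close to 0.3j and -1.5
```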
Now consider the limit of the sequence average $\overline{X}_n$ of the random variable sequence $X_i$. We note that the expectation and variance of $\overline{X}_n$ are $\mu$ and $\sigma^2/n$, and we transform it into a quantity with expectation $0$ and variance $1$:
$$Z_n = \frac{\overline{X}_n - \mu}{\sigma / \sqrt{n}}.$$
Similarly, standardizing $X_i$, we get
$$Y_i = \frac{X_i - \mu}{\sigma}.$$
Therefore, we have
$$Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Y_i.$$
Taking the exponential, we get
$$e^{i\omega Z_n} = \prod_{i=1}^{n} e^{i\omega Y_i / \sqrt{n}}.$$
Taking the expectation gives the characteristic function
$$\varphi_{Z_n}(\omega) = E\left[ \prod_{i=1}^{n} e^{i\omega Y_i / \sqrt{n}} \right].$$
Since the $X_i$ are i.i.d., the $Y_i$ are independent with a common characteristic function $\varphi_Y$, so
$$\varphi_{Z_n}(\omega) = \left[ \varphi_Y\!\left( \frac{\omega}{\sqrt{n}} \right) \right]^{n}.$$
We know that $E[Y] = 0$, $E[Y^2] = 1$, so
$$\varphi_Y\!\left( \frac{\omega}{\sqrt{n}} \right) = 1 - \frac{\omega^2}{2n} + o\!\left( \frac{1}{n} \right).$$
On the frequency axis, this gives the pointwise limit
$$\varphi_{Z_n}(\omega) = \left( 1 - \frac{\omega^2}{2n} + o\!\left( \frac{1}{n} \right) \right)^{n} \to e^{-\omega^2/2} \quad (n \to \infty).$$
Taking the inverse Fourier transform of both sides gives the probability density
$$f_Z(x) = \frac{1}{2\pi} \int e^{-i\omega x} e^{-\omega^2/2} \, \mathrm{d}\omega.$$
This is a Gaussian integral, which evaluates to
$$f_Z(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2},$$
that is, $Z_n$ tends to the standard normal $N(0, 1)$. Note that $\overline{X}_n = \mu + \dfrac{\sigma}{\sqrt{n}} Z_n$, so
$$\overline{X}_n \sim N\!\left( \mu, \frac{\sigma^2}{n} \right)$$
asymptotically. That is, $\overline{X}_n$ follows, in the large-$n$ limit, a normal distribution with expectation $\mu$ and variance $\sigma^2/n$. This result is called the Central Limit Theorem.
What we see is that for any i.i.d. sequence of random variables, the limit distribution of their sequence average is the same distribution. In other words, for any number of surveys of the same form on a sample, the mean of the survey results will tend towards a steady-state distribution, which is the normal distribution. In fact, what we see is that for a sufficiently large sample, the histogram of a certain numerical statistic in the limit case is precisely the normal distribution curve, which is what we often call the Gaussian curve (in physics/engineering fields) or bell curve (in finance/statistics fields).
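The convergence can be checked numerically as well; a sketch that standardizes means of i.i.d. Uniform[0,1] samples and compares two summary statistics against the standard normal (the sample sizes are illustrative):

```python
import math
import random

# Standardize means of n i.i.d. Uniform[0,1] draws: Z_n should be ~ N(0,1).
random.seed(3)
mu, sigma = 0.5, math.sqrt(1 / 12)
n, trials = 50, 20_000

z = []
for _ in range(trials):
    xbar = sum(random.random() for _ in range(n)) / n
    z.append((xbar - mu) / (sigma / math.sqrt(n)))

frac_below_0 = sum(1 for v in z if v <= 0) / trials  # should approach Phi(0) = 0.5
std_z = (sum(v * v for v in z) / trials) ** 0.5      # should approach 1
print(frac_below_0, std_z)  # -> near 0.5 and near 1.0
```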