Probability Theory
By: Ao Chen
Published on: 9/1/2025
Updated on: 9/18/2025
A rough overview of probability theory.
Series: Probability, Random Variables and Stochastic Process
Tags: probability, signal, electronic engineering, finance
Modern mathematics views probability as a measure structure on a space. First, we need a method to divide subsets within the space into “measurable” and “non-measurable,” where only “measurable” subsets possess probability. These sets are precisely the “events” studied in traditional probability theory.
Basic Structure
σ-algebra
Given a non-empty set $\Omega$ as the universe, if $\mathcal{F}$ belongs to the power set of $\Omega$, i.e., $\mathcal{F} \subseteq 2^{\Omega}$, meaning $\mathcal{F}$ is a collection of subsets of $\Omega$, and it satisfies the following conditions:
- The universe is in $\mathcal{F}$: $\Omega \in \mathcal{F}$;
- For any set in $\mathcal{F}$, its complement is also in $\mathcal{F}$: $A \in \mathcal{F} \Rightarrow \Omega \setminus A \in \mathcal{F}$;
- For any countable collection of sets in $\mathcal{F}$, their union is in $\mathcal{F}$: $A_1, A_2, \ldots \in \mathcal{F} \Rightarrow \bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$,
then we call $\mathcal{F}$ a σ-algebra on $\Omega$.
It can be proven that a σ-algebra is closed under finite intersections, unions, and differences. Since $\Omega \in \mathcal{F}$, we have $\varnothing = \Omega \setminus \Omega \in \mathcal{F}$.
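These closure properties can be checked mechanically on a small finite universe; a minimal sketch, taking $\mathcal{F}$ to be the full power set (always a σ-algebra) of an illustrative universe $\{1, 2, 3\}$:

```python
from itertools import chain, combinations

# Build the power set 2^Omega of a small illustrative universe.
Omega = frozenset({1, 2, 3})
subsets = chain.from_iterable(combinations(Omega, r) for r in range(len(Omega) + 1))
F = {frozenset(s) for s in subsets}

# Axiom 1: the universe belongs to F.
assert Omega in F
# Axiom 2: closure under complement.
assert all(Omega - A in F for A in F)
# Axiom 3: closure under union (finite here, standing in for countable).
assert all((A | B) in F for A in F for B in F)
# Derived closure properties: intersection, difference, and the empty set.
assert all((A & B) in F and (A - B) in F for A in F for B in F)
assert frozenset() in F
```

Since the universe is finite, countable unions reduce to finite ones, so the finite checks above cover the axioms.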
Measure Space
If a set $\Omega$ has a σ-algebra $\mathcal{F}$, then this set with respect to $\mathcal{F}$ forms a measurable space $(\Omega, \mathcal{F})$.
Naturally, sets in $\mathcal{F}$ are called measurable, and sets not in $\mathcal{F}$ are non-measurable.
If we assign a non-negative, possibly infinite number to each measurable set, this corresponds to a function $\mu: \mathcal{F} \to [0, \infty]$, satisfying the following conditions:
- $\mu(\varnothing) = 0$;
- For any countable collection of sets $A_1, A_2, \ldots \in \mathcal{F}$, their union satisfies
$$\mu\left(\bigcup_{i=1}^{\infty} A_i\right) \le \sum_{i=1}^{\infty} \mu(A_i),$$
where equality holds if and only if for any indices $i, j$ with $i \ne j$, we have $\mu(A_i \cap A_j) = 0$ (note that we don't necessarily require $A_i \cap A_j = \varnothing$),
then we call $\mu$ a measure on the measurable space $(\Omega, \mathcal{F})$, and the measurable space with respect to the measure forms a measure space $(\Omega, \mathcal{F}, \mu)$.
It's easy to see that the measure has a maximum value $\mu(\Omega)$, which is called the limit capacity of the measure space $(\Omega, \mathcal{F}, \mu)$.
If the limit capacity $\mu(\Omega) < \infty$, then the measure space (with respect to the measure $\mu$) is said to be bounded or finite; if the limit capacity $\mu(\Omega) = 1$, then the measure space is called a probability space, and the measure is called a 0-1 measure or probability measure on $\Omega$.
Integration of Functions
Consider a function $f: \Omega \to \mathbb{R}$ on the measure space $(\Omega, \mathcal{F}, \mu)$. If for any real number $a$, we have $f^{-1}\big((a, +\infty)\big) \in \mathcal{F}$, meaning the preimage of any open interval extending from $a$ to $+\infty$ is measurable, then the function $f$ is called a measurable function.
Characteristic Function
First, we consider the integral of a characteristic function $\chi_A$ (also called the indicator function of $A$, not to be confused with the characteristic function $E[e^{i\omega X}]$ introduced later), where
$$\chi_A(x) = \begin{cases} 1, & x \in A, \\ 0, & x \notin A. \end{cases}$$
If $A$ is measurable, then we define
$$\int \chi_A \, \mathrm{d}\mu = \mu(A).$$
Of course, if $A \notin \mathcal{F}$ (i.e., $A$ is not measurable), then $\chi_A$ is not measurable and its integral is left undefined.
Now we define the integral of a so-called simple function or step function
$$f = \sum_{i} c_i \chi_{A_i}, \qquad \int f \, \mathrm{d}\mu = \sum_{i} c_i \, \mu(A_i).$$
Its convergence is determined entirely by the series defined by the summation: if the sum has finitely many terms, then it must converge; if it has infinitely many terms, then its convergence is given by the following limit
$$\lim_{n \to \infty} \sum_{i=1}^{n} c_i \, \mu(A_i).$$
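A minimal numerical sketch of this definition, using an assumed weighted counting measure on a four-point universe (the weights and the simple function below are illustrative, not from the article):

```python
# A toy measure space: Omega finite, mu a weighted counting measure.
Omega = {"a", "b", "c", "d"}
weight = {"a": 0.5, "b": 1.0, "c": 2.0, "d": 0.25}

def mu(A):
    """Measure of a subset A of Omega under the assumed weights."""
    return sum(weight[x] for x in A)

# A simple function f = 3*chi_{a,b} - 1*chi_{c}, given as (c_i, A_i) pieces.
pieces = [(3.0, {"a", "b"}), (-1.0, {"c"})]

# Integral of the simple function: sum_i c_i * mu(A_i).
integral = sum(c * mu(A) for c, A in pieces)
print(integral)  # 3*(0.5 + 1.0) - 1*2.0 = 2.5
```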
General Definition through Approximation
If $f \ge 0$ is a non-negative function, its integral is defined as the least upper bound of the integrals of all non-negative simple functions less than $f$:
$$\int f \, \mathrm{d}\mu = \sup \left\{ \int s \, \mathrm{d}\mu : 0 \le s \le f, \ s \text{ simple} \right\}.$$
Here, the $\sup$ can be transformed into a limit using the monotone sequence convergence principle.
For any measurable function $f$ with arbitrary values, we can decompose $f$ as
$$f = f^+ - f^-,$$
where
$$f^+ = \max(f, 0), \qquad f^- = \max(-f, 0).$$
We can see that both $f^+$ and $f^-$ are non-negative and measurable, so we can define
$$\int f \, \mathrm{d}\mu = \int f^+ \, \mathrm{d}\mu - \int f^- \, \mathrm{d}\mu.$$
For the integral over a general set $A \in \mathcal{F}$, noting that the product $f \chi_A$ is measurable, we can define it as
$$\int_A f \, \mathrm{d}\mu = \int f \chi_A \, \mathrm{d}\mu.$$
Probability Interpretation
We conventionally use $P$ to denote a probability measure. Consider the probability space $(\Omega, \mathcal{F}, P)$, where we interpret $\Omega$ as the sample space, its elements $\omega$ as samples, and a measurable function $X$ on the probability space as the value possessed by the sample; this constitutes the statistical meaning of probability.
Measurable sets in the sample space, as collections of samples, are called events.
We can model it as follows: Let $\Omega = \{\omega_1, \omega_2, \ldots, \omega_n\}$ be a finite sample space containing all samples, and let a measurable function $X$ assign a value $X(\omega_i)$ to each sample $\omega_i$. When the sample size is large enough, it is more economical to consider the statistical significance of $X$ rather than the specific value of each $X(\omega_i)$; it is likewise more economical to model $\Omega$ with an infinite cardinality, just as we use a distribution curve to replace a sufficiently dense histogram. Therefore, we define: a measurable function $X$ on the sample space is a random variable.
Taking a random variable $X$, we now consider the meaning of the integral
$$\int X \, \mathrm{d}P.$$
According to the definition,
$$\int X \, \mathrm{d}P = \lim \sum_i x_i \, P(A_i).$$
Here we use a set of simple notations:
- $x_i$ represents the value of an approximating simple function on the measurable set $A_i$;
- $\lim$ represents the limit process needed to obtain the $\sup$ over simple functions, which can be simply thought of as refining the partition $\{A_i\}$.
We see that $x_i$ reflects the value of $X$, and $P(A_i)$ can be viewed as the measure of the measurable set on which $X$ is closest to $x_i$, which we can briefly denote as $P(X \approx x_i)$; that is, we have
$$\int X \, \mathrm{d}P = \lim \sum_i x_i \, P(X \approx x_i).$$
This value reflects the sample-average property of $X$ and is called the expectation of $X$, denoted as $E[X]$ or simply $EX$.
Consider a simple example, $\Omega = \{\omega_1, \ldots, \omega_n\}$, with probability defined as
$$P(\{\omega_i\}) = p_i,$$
satisfying $\sum_{i=1}^{n} p_i = 1$. In this case, of course, all subsets of $\Omega$ should be measurable, i.e., $\mathcal{F} = 2^{\Omega}$. Let the random variable be $X(\omega_i) = x_i$. Note that $X$ is a simple function
$$X = \sum_{i=1}^{n} x_i \chi_{\{\omega_i\}},$$
then
$$E[X] = \sum_{i=1}^{n} x_i \, p_i.$$
That is, $E[X]$ is the weighted average of the $x_i$ with weights $p_i$.
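This weighted average is immediate to compute; a sketch with illustrative values $p_i$ and $x_i$ (not taken from the article):

```python
# E[X] = sum_i x_i * p_i for a finite probability space.
p = [0.2, 0.5, 0.3]    # probabilities p_i, summing to 1
x = [1.0, 4.0, 10.0]   # values x_i of the random variable X

assert abs(sum(p) - 1.0) < 1e-12  # sanity check: p is a distribution
EX = sum(xi * pi for xi, pi in zip(x, p))  # 0.2*1 + 0.5*4 + 0.3*10 = 5.2
print(EX)
```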
Lebesgue Integration
In this section, we consider the special case where $\Omega = \mathbb{R}$. For the interval
$$(a, b],$$
we define the measure
$$\mu\big((a, b]\big) = b - a.$$
Note that we do not use the common definition $|b - a|$, because we assume by default that the interval is $\varnothing$ when $b \le a$, so that $\mu\big((a, b]\big) = \max(b - a, 0)$ (the physical meaning of ReLU, the Rectified Linear Unit), which is consistent with the standard requirement $\mu(\varnothing) = 0$.
Similarly, we obtain such a measure for an arbitrary subset through approximation by countable interval covers:
$$\mu^*(A) = \inf \left\{ \sum_{i=1}^{\infty} (b_i - a_i) : A \subseteq \bigcup_{i=1}^{\infty} (a_i, b_i] \right\}.$$
This measure is called the Lebesgue outer measure.
However, it is worth noting that we must carefully select measurable sets, even though the Lebesgue outer measure is defined for any subset of $\mathbb{R}$. In fact, $(\mathbb{R}, 2^{\mathbb{R}}, \mu^*)$ is not a measure space, because $\mu^*$ cannot form a measure on $2^{\mathbb{R}}$ (hence it is called an "outer measure").
A measure requires countable additivity, that is, the following inequality holds
$$\mu\left(\bigcup_{i=1}^{\infty} A_i\right) \le \sum_{i=1}^{\infty} \mu(A_i),$$
with the condition for equality: the sets are pairwise disjoint.
If we abandon the condition for equality, then it is called subadditivity.
It can be proven that $\mu^*$ only satisfies subadditivity on $2^{\mathbb{R}}$, therefore it does not form a measure there. But we can find a σ-algebra $\mathcal{L} \subset 2^{\mathbb{R}}$ on which $\mu^*$ forms a measure space $(\mathbb{R}, \mathcal{L}, \mu^*)$. In fact, the determination of $\mathcal{L}$ is the famous Carathéodory measurability criterion, which only allows sets $E$ satisfying the condition
$$\mu^*(A) = \mu^*(A \cap E) + \mu^*(A \setminus E) \quad \text{for every } A \subseteq \mathbb{R}$$
to be measurable sets.
We call such a measure space the Lebesgue measure space, that is,
$$(\mathbb{R}, \mathcal{L}, m), \qquad m = \mu^*\big|_{\mathcal{L}},$$
where sets satisfying the Carathéodory criterion are called Lebesgue measurable, and sets not satisfying it are called Lebesgue non-measurable.
Through the Lebesgue measure, we can introduce the integral $\int f \, \mathrm{d}m$, called the Lebesgue integral. In fact, for almost everywhere continuous functions $f$, the Lebesgue integral degenerates to the Riemann integral
$$\int f \, \mathrm{d}m = \int f(x) \, \mathrm{d}x.$$
Therefore, the Lebesgue integral, as a generalization of the Riemann integral, does not conflict with its definition, and we can directly use the notation of the Riemann integral to denote the Lebesgue integral.
If a function $g$ is monotonically increasing and right-continuous, then we can define such an outer measure
$$\mu_g^*(A) = \inf \left\{ \sum_{i=1}^{\infty} \big( g(b_i) - g(a_i) \big) : A \subseteq \bigcup_{i=1}^{\infty} (a_i, b_i] \right\}.$$
This is called the Lebesgue–Stieltjes outer measure induced by $g$. Similarly, we can construct a Lebesgue–Stieltjes measure space and obtain the Lebesgue–Stieltjes integral
$$\int f \, \mathrm{d}\mu_g,$$
and for almost everywhere continuous functions $f$, the Lebesgue–Stieltjes integral degenerates to the Riemann–Stieltjes integral
$$\int f \, \mathrm{d}\mu_g = \int f(x) \, \mathrm{d}g(x).$$
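The degeneration to a Riemann–Stieltjes sum can be seen numerically; a sketch with the assumed integrand $f(x) = x$ and integrator $g(x) = x^2$ on $[0, 1]$, where the exact value is $\int_0^1 x \, \mathrm{d}(x^2) = \int_0^1 2x^2 \, \mathrm{d}x = 2/3$:

```python
# Riemann-Stieltjes sum: sum_k f(x_k) * (g(x_k) - g(x_{k-1})).
def stieltjes_sum(f, g, a, b, n):
    xs = [a + (b - a) * k / n for k in range(n + 1)]
    return sum(f(xs[k]) * (g(xs[k]) - g(xs[k - 1])) for k in range(1, n + 1))

approx = stieltjes_sum(lambda x: x, lambda x: x * x, 0.0, 1.0, 10_000)
print(approx)  # -> close to 2/3
```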
Distribution
By convention, a bare integral sign denotes integration over the entire sample space, $\int = \int_{\Omega}$.
Now we return to the probability space $(\Omega, \mathcal{F}, P)$, where we want to point out that the choice of probability $P$ is not fixed. When $(\Omega, \mathcal{F})$ is given, the choice of $P$ is called a probability distribution on the probability space.
After fixing the probability distribution $P$, for a random variable $X$, we define the distribution function
$$F(x) = P(X \le x).$$
Here we introduce the common notation $P(\text{cond})$, where cond represents a limiting condition on $X$ (such as the inequality $X \le x$), i.e., $P(X \le x) = P(\{\omega : X(\omega) \le x\})$. In fact, the random variable represents a statistical numerical distribution of samples, and the probability represents a probability distribution on the samples. The two combined yield the statistical behavior of probability and random variables, so we use $X \sim P$ or $X \sim F$ to describe this pattern, meaning that the random variable $X$ follows the probability distribution $P$ or $F$. In this sense, we define
$$E[f(X)] = \int f(X) \, \mathrm{d}P$$
or
$$E[f(X)] = \int f(x) \, \mathrm{d}F(x),$$
where $f$ is some algebraic function. Now we point out that under the condition $F(x) = P(X \le x)$, the above two expressions are equivalent. For simplicity, we consider the Riemann–Stieltjes integral, take a partition $x_0 < x_1 < \cdots < x_n$, and have
$$\sum_{i} f(x_i) \big( F(x_i) - F(x_{i-1}) \big) = \sum_{i} f(x_i) \, P(x_{i-1} < X \le x_i).$$
In the limit sense, the above expression clearly converges to $\int f(X) \, \mathrm{d}P$.
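For a discrete random variable the equivalence can be checked term by term; a sketch with an illustrative three-sample space (note that two samples share the same value, so the jump of $F$ there merges their probabilities):

```python
# Compare int f(X) dP (sum over samples) with int f(x) dF(x) (sum over jumps).
P = {"w1": 0.25, "w2": 0.25, "w3": 0.5}   # illustrative sample probabilities
X = {"w1": 2.0, "w2": 2.0, "w3": 5.0}     # w1 and w2 share the value 2

f = lambda t: t * t

lhs = sum(f(X[w]) * p for w, p in P.items())     # sample-space side
jumps = {}
for w, p in P.items():
    jumps[X[w]] = jumps.get(X[w], 0.0) + p       # jumps of F at each value
rhs = sum(f(v) * p for v, p in jumps.items())    # distribution side
print(lhs, rhs)  # both equal 0.5*4 + 0.5*25 = 14.5
```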
In essence, elements of the sample space can be of various kinds, and a random variable is a quantification of some aspect of the samples. The sample space itself is not a quantitative entity, so the probability distribution on it is not quantitative either; the distribution function quantifies the distribution of probability on the sample space through random variables.
The logic of this matter is as follows:
- The sample space itself is just a collection of samples
- Probability gives a measurable distribution on the sample space
- Although each probability value itself is a number
- Probability as a function of "sets" nevertheless has a distribution that is difficult to quantify
- A random variable is a quantification of some property of the sample itself
- Therefore, random variables and probability jointly quantify samples and their probability distributions
- Samples are quantified through random variables
- The probability distribution on samples is quantified through the distribution function
- In this system, we say that the random variable follows this distribution
Conditional Distribution
Consider the probability space $(\Omega, \mathcal{F}, P)$. For an event $B$ with $P(B) > 0$, define the following probability
$$P_B(A) = \frac{P(A \cap B)}{P(B)}.$$
We want to point out that $P_B$ is also a probability. In fact, according to countable additivity
$$P\left(\bigcup_i A_i\right) \le \sum_i P(A_i),$$
where equality holds when the sets are pairwise disjoint. According to the definition of $P_B$, we have
$$P_B\left(\bigcup_i A_i\right) = \frac{P\left(\bigcup_i (A_i \cap B)\right)}{P(B)} \le \frac{\sum_i P(A_i \cap B)}{P(B)} = \sum_i P_B(A_i),$$
where the sets $A_i \cap B$ are pairwise disjoint whenever the $A_i$ are, so using countable additivity, equality holds under exactly the same condition as before. In the derivation, we did not change the condition for equality, so $P_B$ indeed forms a probability space $(\Omega, \mathcal{F}, P_B)$. We call $P_B$ the conditional probability given that event $B$ occurs, denoted as $P(A \mid B) = P_B(A)$.
We study the integral of a random variable $X$ under the conditional probability
$$E[X \mid B] = \int X \, \mathrm{d}P_B.$$
Let $F$ be the distribution function of $X$ under $P$ and $F_B$ its distribution function under $P_B$; then according to the definition
$$E[X \mid B] = \int x \, \mathrm{d}F_B(x).$$
This actually does not yield an effective result, because we know too little about $F_B$.
Now consider another random variable $Y$, and take the condition to be $B = \{y < Y \le y + \Delta y\}$ with $\Delta y \to 0$; then we consider
$$P(X \le x \mid y < Y \le y + \Delta y) = \frac{P(X \le x,\ y < Y \le y + \Delta y)}{P(y < Y \le y + \Delta y)} = \frac{F(x, y + \Delta y) - F(x, y)}{F_Y(y + \Delta y) - F_Y(y)},$$
where $F(x, y)$ is the joint probability distribution of $(X, Y)$. Now, in the limit $\Delta y \to 0$, we have
$$P(X \le x \mid Y = y) = \frac{\partial F(x, y) / \partial y}{f_Y(y)}.$$
This means that the conditional density is
$$\frac{\partial^2 F(x, y)}{\partial x \, \partial y} \Big/ f_Y(y) = \frac{f(x, y)}{f_Y(y)}.$$
This is a function of $x$ with parameter $y$, which we denote as
$$f(x \mid y) = \frac{f(x, y)}{f_Y(y)},$$
called the conditional density, and its integral $F(x \mid y) = \int_{-\infty}^{x} f(t \mid y) \, \mathrm{d}t$ is called the conditional distribution.
In the derivation, we can see that we actually assumed the existence of the densities $f(x, y)$, $f_Y(y)$, and $f(x \mid y)$, which is necessary. Density, as the derivative of the distribution, reflects a second-order property, that is, the associative characteristics between samples, which is an important property of conditional variables.
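The ratio $f(x, y) / f_Y(y)$ has an exact discrete analogue, which a short sketch can verify (the joint pmf below is illustrative):

```python
# Discrete analogue of the conditional density: p(x|y) = p(x, y) / p_Y(y).
joint = {(0, 0): 0.1, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.4}

p_Y = {}
for (x, y), p in joint.items():
    p_Y[y] = p_Y.get(y, 0.0) + p                       # marginal of Y

cond = {(x, y): p / p_Y[y] for (x, y), p in joint.items()}

# Each conditional slice in x is itself a probability distribution.
for y0 in p_Y:
    total = sum(c for (x, y), c in cond.items() if y == y0)
    assert abs(total - 1.0) < 1e-12
print(cond[(1, 0)])  # -> 0.3 / 0.4 = 0.75 up to floating point
```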
We can further consider the integral
$$\int x \, f(x \mid y) \, \mathrm{d}x.$$
This integral is a function of $y$, which we denote as $g(y)$. It can be proven that composing it with $Y$ yields a random variable, which is denoted as
$$E[X \mid Y] = g(Y),$$
called the conditional expectation of $X$ given $Y$.
We can see that
$$E\big[E[X \mid Y]\big] = \int g(y) \, f_Y(y) \, \mathrm{d}y = \iint x \, f(x \mid y) \, f_Y(y) \, \mathrm{d}x \, \mathrm{d}y.$$
Using the Fubini theorem to simplify this integral,
$$E\big[E[X \mid Y]\big] = \int x \left( \int f(x, y) \, \mathrm{d}y \right) \mathrm{d}x = \int x \, f_X(x) \, \mathrm{d}x = E[X].$$
We can understand conditional expectation as follows: the random variable $Y$ serving as the condition itself has randomness, and the conditional expectation is the expectation of $X$ under fixed $Y$, so it inevitably carries the randomness of $Y$; this is why the conditional expectation is a random variable. And as we said, the randomness of the conditional expectation comes entirely from the condition $Y$, so taking the expectation of $E[X \mid Y]$ again eliminates that randomness, and its result
$$E\big[E[X \mid Y]\big] = E[X]$$
is just the expectation of $X$, which is precisely the embodiment of the law of total expectation, the continuous analogue of the law of total probability.
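This tower property can be checked by simulation; a sketch of a hypothetical two-stage model where $Y$ is uniform on $\{1, 2, 3\}$ and, given $Y = y$, $X$ is uniform on $[0, y]$, so $E[X \mid Y] = Y/2$ and $E\big[E[X \mid Y]\big] = E[Y]/2 = 1$:

```python
import random

# Monte-Carlo estimate of E[X] in a two-stage model: first draw Y,
# then draw X given Y; by the tower property E[X] = E[E[X|Y]] = E[Y]/2.
random.seed(0)
N = 100_000
total = 0.0
for _ in range(N):
    y = random.choice([1, 2, 3])   # the conditioning variable Y
    x = random.uniform(0.0, y)     # X | Y = y  ~  Uniform[0, y]
    total += x
EX = total / N
print(EX)  # -> close to 1.0
```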
Large Sample Limit
We call $E[X^n]$ the $n$-th moment of the random variable $X$. Just like the multipole moment expansion of potential fields in field theory, the $n$-th moment is the integral of the $n$-th power
$$E[X^n] = \int X^n \, \mathrm{d}P = \int x^n \, \mathrm{d}F(x).$$
The most typical characteristic is the 2nd moment, which reflects the property of second-order correlation. Here we directly give the definitions of the related second-order quantities:
- Second-order self-correlation of a random variable — fluctuation information
  - Variance $D[X] = E\big[(X - E[X])^2\big] = E[X^2] - (E[X])^2$
  - Standard deviation $\sigma_X = \sqrt{D[X]}$
- Second-order correlation between two random variables — correlation information
  - Correlation function $R_{XY} = E[XY]$
  - Covariance $\operatorname{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big] = R_{XY} - E[X]E[Y]$
  - Correlation coefficient $\rho_{XY} = \dfrac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}$
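These quantities can be computed directly from paired data; a sketch on illustrative samples where $Y = 2X$, so the correlation coefficient should come out as $1$:

```python
# Second-order quantities for paired data (population convention, divide by n).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # ys = 2 * xs, perfectly correlated

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
var_x = sum((x - mx) ** 2 for x in xs) / n           # variance D[X]
var_y = sum((y - my) ** 2 for y in ys) / n
std_x, std_y = var_x ** 0.5, var_y ** 0.5            # standard deviations
corr_fn = sum(x * y for x, y in zip(xs, ys)) / n     # correlation function E[XY]
cov = corr_fn - mx * my                              # covariance
rho = cov / (std_x * std_y)                          # correlation coefficient
print(var_x, cov, rho)  # -> 1.25, 2.5, and rho = 1 up to rounding
```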
Now suppose we have many random variables $X_1, X_2, \ldots, X_n$ that are independent and identically distributed (i.i.d.); we consider the sequence average
$$\overline{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i.$$
Let the common distribution function of the sequence be $F$, with expectation $\mu$ and variance $\sigma^2$, and let the distribution function of $\overline{X}_n$ be $F_n$, with expectation $\mu_n$ and variance $\sigma_n^2$. We want to study the limit behavior of $\mu_n$, $\sigma_n^2$, and $F_n$ as $n \to \infty$. The work on the numerical characteristics $\mu_n$, $\sigma_n^2$ is summarized as the Law of Large Numbers, while the work on the distribution function $F_n$ is the famous Central Limit Theorem.
Law of Large Numbers
We can directly calculate
$$\mu_n = E[\overline{X}_n] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \mu.$$
The second moment is
$$E[\overline{X}_n^2] = \frac{1}{n^2} \sum_{i, j} E[X_i X_j].$$
According to the relationship between covariance and the second-order correlation function,
$$E[X_i X_j] = \operatorname{Cov}(X_i, X_j) + E[X_i] E[X_j] = \begin{cases} \sigma^2 + \mu^2, & i = j, \\ \mu^2, & i \ne j, \end{cases}$$
since independence gives $\operatorname{Cov}(X_i, X_j) = 0$ for $i \ne j$. Substituting this and computing the two counts, $n$ diagonal terms and $n^2 - n$ off-diagonal terms, the second moment is
$$E[\overline{X}_n^2] = \frac{1}{n^2} \big( n(\sigma^2 + \mu^2) + (n^2 - n)\mu^2 \big) = \mu^2 + \frac{\sigma^2}{n}.$$
From this, we get the variance
$$\sigma_n^2 = E[\overline{X}_n^2] - \mu_n^2 = \frac{\sigma^2}{n} \to 0 \quad (n \to \infty).$$
These two results, $\mu_n = \mu$ and $\sigma_n^2 = \sigma^2 / n \to 0$, are called the Law of Large Numbers.
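Both conclusions are easy to confirm empirically; a sketch with i.i.d. Uniform[0,1] draws, for which $\mu = 1/2$ and $\sigma^2 = 1/12$, so the mean of $\overline{X}_n$ should sit near $1/2$ and its variance near $(1/12)/n$:

```python
import random

# Empirical mean and variance of the sequence average of n i.i.d. samples.
random.seed(1)
n, trials = 100, 20_000
means = [sum(random.random() for _ in range(n)) / n for _ in range(trials)]

emp_mean = sum(means) / trials
emp_var = sum((m - emp_mean) ** 2 for m in means) / trials
print(emp_mean, emp_var)  # -> near 0.5 and near (1/12)/100 ≈ 0.00083
```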
We now explain the meaning of such a sequence of i.i.d. random variables. Similar to the above, we consider a finite sample set $\Omega = \{\omega_1, \ldots, \omega_m\}$, in which case the random variable $X$ can be completely determined by the corresponding sample values $x_i = X(\omega_i)$, and similarly, the probability $P$ can be completely determined by the sample probability values $p_i = P(\{\omega_i\})$. Note that the distribution function at this time is
$$F(x) = \sum_{i} p_i \, \theta(x - x_i),$$
where $\theta$ is the Heaviside function or unit step response. Differentiating it gives the probability density
$$f(x) = \sum_{i} p_i \, \delta(x - x_i).$$
We can see that for two random variables $X, Y$, if they have the same distribution, then they must share the same collection of value–probability pairs $(x_i, p_i)$. But note that this does not directly imply $X = Y$. A simple counterexample is a fair coin toss, $\Omega = \{\text{heads}, \text{tails}\}$ with probability $1/2$ each, with $X$ the indicator of heads and $Y = 1 - X$ the indicator of tails. We have
$$X \sim Y, \qquad \text{but} \qquad X(\omega) \ne Y(\omega) \ \text{for every } \omega.$$
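One concrete counterexample (a fair coin, with X the indicator of heads and Y = 1 - X) is easy to verify by simulation; the sketch below is illustrative:

```python
import random

# X = indicator of heads, Y = 1 - X: same distribution, never equal.
random.seed(2)
X = [random.randint(0, 1) for _ in range(10_000)]
Y = [1 - x for x in X]

freq = lambda v, seq: sum(1 for s in seq if s == v) / len(seq)
print(freq(1, X), freq(1, Y))             # both near 0.5: same distribution
assert all(x != y for x, y in zip(X, Y))  # but pointwise never equal
```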
That is, a sequence of i.i.d. random variables cannot explain much from their values themselves, but we can be sure that their numerical characteristics must be consistent. We remember that a random variable is a quantification of some characteristic of the sample, and a sequence of i.i.d. random variables can be statistically understood as multiple independent surveys of a certain characteristic of the sample, with each random variable being the result of one survey. From the experience of multiple surveys in the real world, the result of each survey is unknown, but under stable conditions, the statistical quantities of the surveys will not change, which is consistent with the fact that the distribution of i.i.d. random variables does not change.
Central Limit Theorem
Finally, we look at the distribution function or probability distribution of $\overline{X}_n$. First, we need to introduce the so-called normal distribution or Gaussian distribution, which is defined by the following probability density in one dimension:
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).$$
It can be proven that the parameters $\mu$ and $\sigma^2$ are precisely the expectation and variance of the distribution, commonly denoted as $N(\mu, \sigma^2)$. We point out that the normal distribution, as its name suggests, is the distribution in a normal state: no matter what kind of i.i.d. random variable sequence we start from (so long as its variance is finite), the limiting distribution of its sequence average must be a normal distribution.
To illustrate this point, we first need to introduce an important tool — the characteristic function. For a random variable $X$, its characteristic function is defined as the mean
$$\varphi(\omega) = E\big[e^{i\omega X}\big].$$
Considering the distribution function $F$ and probability density $f$, the above expression can be written as
$$\varphi(\omega) = \int e^{i\omega x} \, \mathrm{d}F(x) = \int e^{i\omega x} f(x) \, \mathrm{d}x.$$
This is precisely the Fourier transform of the probability density! Going further, we define
$$M(s) = E\big[e^{sX}\big].$$
This is actually the (two-sided) Laplace transform of the probability density. This process is strikingly consistent with the process of extending the spectrum of a signal from the frequency domain to the s-domain!
Now we consider the properties of the characteristic function. Here we take a bit of engineering liberty (as we have actually done many times before), expanding as follows:
$$\varphi(\omega) = E\left[ \sum_{n=0}^{\infty} \frac{(i\omega X)^n}{n!} \right] = \sum_{n=0}^{\infty} \frac{(i\omega)^n}{n!} E[X^n].$$
Viewing this expression as the Maclaurin series of $\varphi$, we can immediately read off the derivatives of the characteristic function
$$\varphi^{(n)}(0) = i^n \, E[X^n].$$
That is, the $n$-th derivative of the characteristic function at $\omega = 0$ is, up to the factor $i^n$, precisely the $n$-th moment of the random variable.
Let's review our derivation: formally expanding $E[e^{i\omega X}]$ is essentially the termwise expansion of the exponential, because the expectation degenerates from the abstract integral over the sample space to the ordinary integral $\int e^{i\omega x} \, \mathrm{d}F(x)$ or $\int e^{i\omega x} f(x) \, \mathrm{d}x$ under the distribution function or probability density. Then the summation is brought outside the integral, which is justified by the uniform convergence of the inner series; and the inner series, as the Maclaurin series of the exponential function, converges uniformly on compact sets. Finally, for the derivatives of the characteristic function, we require $\varphi$ to be holomorphic. Although holomorphy is a rather strong requirement for complex functions, since $e^{i\omega x}$ restricted to real $\omega$ is a sufficiently smooth function in modulus and phase, the required smoothness is guaranteed.
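The moment-extraction property can be checked numerically; a sketch using an illustrative discrete $X$ and finite-difference derivatives of $\varphi(\omega) = E[e^{i\omega X}]$ at $\omega = 0$:

```python
import cmath

# Discrete X with illustrative values and probabilities.
xs = [-1.0, 0.0, 2.0]
ps = [0.3, 0.4, 0.3]

def phi(w):
    """Characteristic function phi(w) = E[exp(i w X)]."""
    return sum(p * cmath.exp(1j * w * x) for x, p in zip(xs, ps))

m1 = sum(p * x for x, p in zip(xs, ps))        # E[X]   = 0.3
m2 = sum(p * x * x for x, p in zip(xs, ps))    # E[X^2] = 1.5

h = 1e-4
d1 = (phi(h) - phi(-h)) / (2 * h)              # approximates phi'(0)  = i*E[X]
d2 = (phi(h) - 2 * phi(0) + phi(-h)) / h ** 2  # approximates phi''(0) = -E[X^2]
print(d1, d2)  # -> close to 0.3j and -1.5
```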
Now consider the limit of the sequence average $\overline{X}_n$ of the random variable sequence $X_i$. We note that the expectation and variance of $\overline{X}_n$ are $\mu$ and $\sigma^2/n$, and we transform it into a quantity with expectation $0$ and variance $1$:
$$Z_n = \frac{\overline{X}_n - \mu}{\sigma / \sqrt{n}}.$$
Similarly, standardizing $X_i$, we get
$$Y_i = \frac{X_i - \mu}{\sigma}.$$
Therefore, we have
$$Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Y_i.$$
Taking the exponential, we get
$$e^{i\omega Z_n} = \prod_{i=1}^{n} e^{i\omega Y_i / \sqrt{n}}.$$
Taking the expectation gives the characteristic function
$$\varphi_{Z_n}(\omega) = E\left[ \prod_{i=1}^{n} e^{i\omega Y_i / \sqrt{n}} \right].$$
Since the $X_i$ are i.i.d., the $Y_i$ are independent with a common characteristic function $\varphi_Y$, so
$$\varphi_{Z_n}(\omega) = \left[ \varphi_Y\!\left( \frac{\omega}{\sqrt{n}} \right) \right]^{n}.$$
We know that $E[Y] = 0$, $E[Y^2] = 1$, so
$$\varphi_Y\!\left( \frac{\omega}{\sqrt{n}} \right) = 1 - \frac{\omega^2}{2n} + o\!\left( \frac{1}{n} \right).$$
On the frequency axis, this gives the pointwise limit
$$\varphi_{Z_n}(\omega) = \left( 1 - \frac{\omega^2}{2n} + o\!\left( \frac{1}{n} \right) \right)^{n} \to e^{-\omega^2/2} \quad (n \to \infty).$$
Taking the inverse Fourier transform of both sides gives the probability density
$$f_Z(x) = \frac{1}{2\pi} \int e^{-i\omega x} e^{-\omega^2/2} \, \mathrm{d}\omega.$$
This is a Gaussian integral, which evaluates to
$$f_Z(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2},$$
that is, $Z_n$ tends to the standard normal $N(0, 1)$. Note that $\overline{X}_n = \mu + \dfrac{\sigma}{\sqrt{n}} Z_n$, so
$$\overline{X}_n \sim N\!\left( \mu, \frac{\sigma^2}{n} \right)$$
asymptotically. That is, $\overline{X}_n$ follows, in the large-$n$ limit, a normal distribution with expectation $\mu$ and variance $\sigma^2/n$. This result is called the Central Limit Theorem.
What we see is that for any i.i.d. sequence of random variables, the limit distribution of their sequence average is the same distribution. In other words, for any number of surveys of the same form on a sample, the mean of the survey results will tend towards a steady-state distribution, which is the normal distribution. In fact, what we see is that for a sufficiently large sample, the histogram of a certain numerical statistic in the limit case is precisely the normal distribution curve, which is what we often call the Gaussian curve (in physics/engineering fields) or bell curve (in finance/statistics fields).
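The convergence can be checked numerically as well; a sketch that standardizes means of i.i.d. Uniform[0,1] samples and compares two summary statistics against the standard normal (the sample sizes are illustrative):

```python
import math
import random

# Standardize means of n i.i.d. Uniform[0,1] draws: Z_n should be ~ N(0,1).
random.seed(3)
mu, sigma = 0.5, math.sqrt(1 / 12)
n, trials = 50, 20_000

z = []
for _ in range(trials):
    xbar = sum(random.random() for _ in range(n)) / n
    z.append((xbar - mu) / (sigma / math.sqrt(n)))

frac_below_0 = sum(1 for v in z if v <= 0) / trials  # should approach Phi(0) = 0.5
std_z = (sum(v * v for v in z) / trials) ** 0.5      # should approach 1
print(frac_below_0, std_z)  # -> near 0.5 and near 1.0
```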