Probability for Randomized Algorithms
I always seem to need to refresh my probability theory knowledge, and I thought that writing down some of the core concepts might be useful for myself and for you as well. Since I usually understand concepts best in lay-person terms, I will try to stick to that in the rest of this post. It's also worth noting that this is not an overarching probability manual covering every possible concept. Instead, the concepts discussed here are the ones most useful within the domain of randomized algorithms.
With respect to background knowledge – I will try to keep this document self-contained, but it may be useful to look things up if something is confusing.
Contents are adapted from material produced by Prof. Srini Devadas, Prof. Hung Le, and Prof. Cameron Musco.
AI Usage Notice – I wrote all of the contents of this post but AI was used for spell check and reorganization.
Basic Terminology
- Random Experiment
- A process that produces an outcome that cannot be predicted with certainty beforehand. E.g., flipping a coin, rolling a die.
- Sample Space (\(S\) or \(\Omega\))
- The set of all possible outcomes. Coin flip: \(S = \{H, T\}\). Die roll: \(S = \{1,2,3,4,5,6\}\).
- Outcome
- A single possible result of an experiment — one element of the sample space.
- Event
- A subset of the sample space. E.g., \(E = \{2,4,6\}\) is the event of rolling an even number. Events can be simple (one outcome) or compound (multiple outcomes).
- Probability — \(P(A) = \frac{\text{favorable outcomes}}{\text{total outcomes}}\) (when all outcomes are equally likely)
- Measures how likely an event is to occur. Properties: \(0 \le P(A) \le 1\), \(P(S) = 1\), \(P(\emptyset) = 0\).
- Complement — \(P(A^c) = 1 - P(A)\)
- The event that \(A\) does not occur.
- Union — \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)
- The event that \(A\) or \(B\) (or both) occur.
- Intersection — \(A \cap B\)
- The event that both \(A\) and \(B\) occur. E.g., if \(A =\) even and \(B =\) greater than 3, then \(A \cap B = \{4,6\}\).
- Conditional Probability — \(P(A \mid B) = \frac{P(A \cap B)}{P(B)}\)
- The probability of \(A\) given that \(B\) has occurred, provided \(P(B) > 0\).
- Independence — \(P(A \cap B) = P(A)P(B)\)
- Two events are independent if the occurrence of one does not affect the probability of the other. E.g., two separate coin flips.
- Random Variable
- A function that assigns a numerical value to each outcome. Can be discrete (countable values) or continuous (values in an interval).
- Probability Distribution
- Describes how probabilities are assigned to a random variable’s values. PMF for discrete variables, PDF for continuous variables.
- Expected Value — \(E[X] = \sum_x x\, P(X=x)\)
- The long-run average value of a random variable.
- Variance — \(\text{Var}(X) = E[(X - E[X])^2]\)
- Measures the spread of a random variable around its expected value.
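As a quick sanity check of the definitions above, here is a short Python snippet (using a fair six-sided die as the example) that computes the expectation and variance directly from the PMF and compares \(E[X]\) against a Monte Carlo estimate:

```python
import random

# Exact E[X] and Var(X) for a fair six-sided die, straight from the definitions.
outcomes = [1, 2, 3, 4, 5, 6]
p = 1 / 6  # uniform PMF

expected = sum(x * p for x in outcomes)                    # E[X] = 3.5
variance = sum((x - expected) ** 2 * p for x in outcomes)  # Var(X) = 35/12

# A Monte Carlo estimate of E[X] should land close to the exact value.
random.seed(0)
samples = [random.choice(outcomes) for _ in range(100_000)]
empirical_mean = sum(samples) / len(samples)

print(expected, variance, empirical_mean)
```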
Scaling Laws
\(E[aX] = a\cdot E[X]\) for any constant \(a\) and random variable \(X\).
\[ E[aX] = \sum_x (ax) \cdot P(X=x) \]
Factor out the constant \(a\).
\[ = a \sum_x x \cdot P(X=x) \]
Recognize that the remaining sum is the definition of \(E[X]\).
\[ E[aX] = a\cdot E[X] \]
\(\text{Var}(aX) = a^2 \cdot \text{Var}(X)\) for any constant \(a\) and random variable \(X\).
Start from the identity \(\text{Var}(Z) = E[Z^2] - E[Z]^2\), derived in the next section.
\[ \text{Var}(aX) = E[(aX)^2] - E[aX]^2 \]
Apply the scaling of expectation.
\[ = E[a^2 X^2] - (a\cdot E[X])^2 \]
\[ = a^2 \cdot E[X^2] - a^2 \cdot E[X]^2 \]
Factor out \(a^2\).
\[ = a^2 (E[X^2] - E[X]^2) \]
Recognize the definition of variance.
\[ \text{Var}(aX) = a^2 \cdot \text{Var}(X) \]
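Both scaling laws are easy to verify numerically. Here is a minimal check on a small hand-picked PMF (the distribution and the constant \(a = 3\) are arbitrary choices for illustration):

```python
# Verify E[aX] = a*E[X] and Var(aX) = a^2 * Var(X) on a small PMF.
pmf = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}
a = 3

def expect(f):
    """Expectation of f(X) under the PMF above."""
    return sum(f(x) * p for x, p in pmf.items())

EX = expect(lambda x: x)
VarX = expect(lambda x: (x - EX) ** 2)

# Compute E[aX] and Var(aX) directly from the definitions, not the laws.
EaX = expect(lambda x: a * x)
VaraX = expect(lambda x: (a * x - EaX) ** 2)

assert abs(EaX - a * EX) < 1e-12
assert abs(VaraX - a ** 2 * VarX) < 1e-12
```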
Linearity of Expectation
\(E[X + Y] = E[X] + E[Y]\) for any two random variables \(X\) and \(Y\).
Start from the definition of expectation.
\[ E[Z] = \sum_z z\cdot P(Z=z) \]
Substitute \(X+Y\) for \(Z\).
\[ E[X + Y] = \sum_x \sum_y (x+y) \cdot P(X=x \cap Y=y) \]
Distribute the sum.
\[ E[X + Y] = \sum_x \sum_y x \cdot P(X=x \cap Y=y) + \sum_x \sum_y y \cdot P(X=x \cap Y=y) \]
Marginalize out the other variable in each term, using \(\sum_y P(X=x \cap Y=y) = P(X=x)\) (and symmetrically for \(Y\)).
\[ E[X + Y] = \sum_x x \cdot P(X=x) + \sum_y y \cdot P(Y=y) \]
Recognize both sums as expectations.
\[ E[X + Y] = E[X] + E[Y] \]
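Notably, the proof never used independence. A small check with two perfectly dependent variables (here \(Y = 7 - X\) for a die roll \(X\), a deliberately adversarial choice) makes the point concrete:

```python
# Linearity of expectation holds even for dependent variables.
# X is a fair die roll; Y = 7 - X is fully determined by X.
outcomes = [1, 2, 3, 4, 5, 6]
p = 1 / 6

EX = sum(x * p for x in outcomes)
EY = sum((7 - x) * p for x in outcomes)
EXplusY = sum((x + (7 - x)) * p for x in outcomes)  # X + Y is always 7

assert abs(EXplusY - (EX + EY)) < 1e-12
print(EX, EY, EXplusY)
```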
Linearity of Variance
Lemma I: \(\text{Var}(X) = E[X^2] - E[X]^2\) for any random variable \(X\).
Start from the definition of variance.
\[ \text{Var}(X) = E[(X - E[X])^2] \]
Let \(\mu = E[X]\) for convenience.
\[ = E[(X - \mu)^2] \]
Expand the square.
\[ = E[X^2 - 2\mu\cdot X + \mu^2] \]
Apply linearity of expectation.
\[ = E[X^2] + E[-2\mu\cdot X] + E[\mu^2] \]
Note that \(\mu = E[X]\) is a constant, so \(E[\mu^2] = \mu^2\) and \(E[-2\mu\cdot X] = -2\mu\cdot E[X]\).
\[ = E[X^2] -2\mu\cdot E[X] + \mu^2 \]
Substitute back in \(E[X]\) for \(\mu\).
\[ = E[X^2] -2\cdot E[X]\cdot E[X] + E[X]^2 \]
Combine like terms.
\[ = E[X^2] -2\cdot E[X]^2 + E[X]^2 \]
\[ \text{Var}(X) = E[X^2] - E[X]^2 \]
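A one-line numeric check of this identity on a small PMF (the distribution is an arbitrary example):

```python
# Check Var(X) = E[X^2] - E[X]^2 against the definitional form.
pmf = {0: 0.5, 1: 0.3, 2: 0.2}

EX = sum(x * p for x, p in pmf.items())
EX2 = sum(x * x * p for x, p in pmf.items())
var_from_definition = sum((x - EX) ** 2 * p for x, p in pmf.items())

assert abs(var_from_definition - (EX2 - EX ** 2)) < 1e-12
```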
Lemma II: \(E[X\cdot Y] = E[X]\cdot E[Y]\) for any two independent random variables \(X\) and \(Y\).
Start from the definition of expectation.
\[ E[Z] = \sum_z z\cdot P(Z=z) \]
Substitute \(X\cdot Y\) for \(Z\).
\[ E[X\cdot Y] = \sum_x \sum_y x\cdot y\cdot P(X=x \cap Y=y) \]
Since \(X\) and \(Y\) are independent, \(P(X=x \cap Y=y) = P(X=x)\cdot P(Y=y)\).
\[ = \sum_x \sum_y x\cdot y\cdot P(X=x)\cdot P(Y=y) \]
Factor out each variable’s terms into separate sums.
\[ = \sum_x x\cdot P(X=x) \times \sum_y y\cdot P(Y=y) \]
Recognize that the components are expectations.
\[ E[X\cdot Y] = E[X]\cdot E[Y] \]
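Since independence makes the joint PMF factor, the product rule can be checked by enumerating a small joint distribution built from two independent marginals (arbitrary example values):

```python
# E[XY] = E[X]E[Y] when the joint PMF factors (independence).
px = {0: 0.4, 1: 0.6}
py = {1: 0.5, 2: 0.5}

EX = sum(x * p for x, p in px.items())
EY = sum(y * q for y, q in py.items())

# Joint expectation of XY under the product distribution.
EXY = sum(x * y * p * q for x, p in px.items() for y, q in py.items())

assert abs(EXY - EX * EY) < 1e-12
```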
\(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\) for any two independent random variables \(X\) and \(Y\).
Start from Lemma I.
\[ \text{Var}(Z) = E[Z^2] - E[Z]^2 \]
Substitute \(X+Y\) for \(Z\).
\[ \text{Var}(X + Y) = E[(X + Y)^2] - E[X+Y]^2 \]
Expand the square and apply linearity of expectation.
\[ = E[X^2 + 2\cdot XY + Y^2] - (E[X] + E[Y])^2 \]
\[ = E[X^2] + 2\cdot E[XY] + E[Y^2] - (E[X]^2 + 2\cdot E[X]\cdot E[Y] + E[Y]^2) \]
Distribute the negative sign.
\[ = E[X^2] + 2\cdot E[XY] + E[Y^2] - E[X]^2 - 2\cdot E[X]\cdot E[Y] - E[Y]^2 \]
Apply Lemma II to replace \(E[XY]\) with \(E[X]\cdot E[Y]\).
\[ = E[X^2] + 2\cdot E[X]\cdot E[Y] + E[Y^2] - E[X]^2 - 2\cdot E[X]\cdot E[Y] - E[Y]^2 \]
Cancel the \(2\cdot E[X]\cdot E[Y]\) terms and regroup.
\[ = (E[X^2] - E[X]^2) + (E[Y^2] - E[Y]^2) \]
Recognize each group as a variance by Lemma I.
\[ \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) \]
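A numeric check: build the distribution of \(X + Y\) from two independent marginals (arbitrary example values) and confirm that the variances add:

```python
from itertools import product

# Var(X + Y) = Var(X) + Var(Y) for independent X and Y.
px = {1: 0.2, 2: 0.8}
py = {0: 0.3, 3: 0.7}

def mean_var(pmf):
    """Mean and variance of a PMF given as {value: probability}."""
    m = sum(x * p for x, p in pmf.items())
    v = sum((x - m) ** 2 * p for x, p in pmf.items())
    return m, v

_, vx = mean_var(px)
_, vy = mean_var(py)

# PMF of the sum under independence (joint probabilities multiply).
psum = {}
for (x, p), (y, q) in product(px.items(), py.items()):
    psum[x + y] = psum.get(x + y, 0.0) + p * q

_, vs = mean_var(psum)
assert abs(vs - (vx + vy)) < 1e-12
```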
Indicator Random Variables
An indicator random variable for an event \(A\) is defined as
\[ X_A = \begin{cases} 1 & \text{if } A \text{ occurs} \\ 0 & \text{otherwise} \end{cases} \]
A key property is that the expectation of an indicator equals the probability of the event:
\[ E[X_A] = P(A) \]
This is powerful because many quantities in randomized algorithms can be expressed as a sum of indicators. Combined with linearity of expectation, this lets us compute expectations without worrying about dependencies between events.
For example, consider counting the comparisons made by a randomized algorithm (randomized QuickSort is the classic case). Let \(X_{ij}\) be the indicator that elements \(i\) and \(j\) are compared during the algorithm. The total number of comparisons is \(X = \sum_{i < j} X_{ij}\). By linearity of expectation:
\[ E[X] = \sum_{i < j} E[X_{ij}] = \sum_{i < j} P(i \text{ and } j \text{ are compared}) \]
This reduces a complicated random variable to a sum of probabilities — no independence needed.
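Another classic indicator argument (my own example, not from the material above): the expected number of fixed points of a uniformly random permutation. With \(X_i\) indicating that position \(i\) holds element \(i\), we have \(E[X_i] = 1/n\), so \(E[X] = n \cdot \frac{1}{n} = 1\) by linearity, even though the indicators are dependent. Brute-force enumeration confirms it:

```python
from itertools import permutations

# Average number of fixed points over all permutations of {0, ..., n-1}.
n = 5
total_fixed = 0
num_perms = 0
for perm in permutations(range(n)):
    total_fixed += sum(1 for i, x in enumerate(perm) if i == x)
    num_perms += 1

print(total_fixed / num_perms)  # → 1.0, exactly as linearity predicts
```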
Union Bound
For any events \(A_1, A_2, \ldots, A_n\):
\[ P\left(\bigcup_{i=1}^{n} A_i\right) \le \sum_{i=1}^{n} P(A_i) \]
We prove this by induction on \(n\).
Base case (\(n = 1\)): \(P(A_1) \le P(A_1)\). Trivially true.
Inductive step: Assume the bound holds for \(n-1\) events. For \(n\) events, use the addition rule:
\[ P\left(\bigcup_{i=1}^{n} A_i\right) = P\left(\bigcup_{i=1}^{n-1} A_i\right) + P(A_n) - P\left(\left(\bigcup_{i=1}^{n-1} A_i\right) \cap A_n\right) \]
Since probabilities are non-negative, drop the subtracted term.
\[ \le P\left(\bigcup_{i=1}^{n-1} A_i\right) + P(A_n) \]
Apply the inductive hypothesis.
\[ \le \sum_{i=1}^{n-1} P(A_i) + P(A_n) = \sum_{i=1}^{n} P(A_i) \]
The union bound is often used to bound the overall failure probability of a randomized algorithm. If step \(i\) fails with probability \(p_i\), then the probability that any step fails is at most \(\sum p_i\).
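The bound makes no independence assumption, so it is safe even for highly correlated failures. A small simulation (with made-up failure probabilities) drives this home by making the failure events maximally dependent, all drawn from one shared uniform:

```python
import random

random.seed(1)
p = [0.01, 0.005, 0.02]      # per-step failure probabilities (arbitrary)
bound = sum(p)               # union bound on P(any step fails)

trials = 200_000
fails = 0
for _ in range(trials):
    u = random.random()      # one shared source: step i fails iff u < p[i]
    if any(u < pi for pi in p):
        fails += 1

empirical = fails / trials
assert empirical <= bound    # holds despite the extreme correlation
print(empirical, "<=", bound)
```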
Markov’s Inequality
For a non-negative random variable \(X\) and any \(a > 0\):
\[ P(X \ge a) \le \frac{E[X]}{a} \]
Start from the definition of expectation for a non-negative random variable.
\[ E[X] = \sum_x x \cdot P(X = x) \]
Split the sum into values below \(a\) and values at or above \(a\).
\[ = \sum_{x < a} x \cdot P(X = x) + \sum_{x \ge a} x \cdot P(X = x) \]
Drop the first sum (it is non-negative) and lower-bound \(x\) by \(a\) in the second.
\[ \ge \sum_{x \ge a} a \cdot P(X = x) \]
Factor out \(a\).
\[ = a \sum_{x \ge a} P(X = x) = a \cdot P(X \ge a) \]
Rearrange.
\[ P(X \ge a) \le \frac{E[X]}{a} \]
Markov’s inequality is loose but requires almost nothing — just non-negativity and a known expectation.
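A quick empirical comparison, using the number of heads in 10 fair coin flips (so \(E[X] = 5\)) and threshold \(a = 8\):

```python
import random

random.seed(0)
trials = 100_000
samples = [sum(random.randint(0, 1) for _ in range(10)) for _ in range(trials)]

a = 8
empirical = sum(s >= a for s in samples) / trials
markov_bound = 5 / a  # E[X] / a

# The true tail is far smaller than the bound -- Markov is loose but valid.
assert empirical <= markov_bound
print(empirical, "<=", markov_bound)
```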
Chebyshev’s Inequality
For any random variable \(X\) with finite variance and any \(k > 0\):
\[ P(|X - E[X]| \ge k) \le \frac{\text{Var}(X)}{k^2} \]
Define \(Y = (X - E[X])^2\). Note that \(Y\) is non-negative, and \(E[Y] = \text{Var}(X)\).
Apply Markov’s inequality to \(Y\) with threshold \(k^2\).
\[ P(Y \ge k^2) \le \frac{E[Y]}{k^2} = \frac{\text{Var}(X)}{k^2} \]
Observe that \(Y \ge k^2\) is equivalent to \((X - E[X])^2 \ge k^2\), which is equivalent to \(|X - E[X]| \ge k\).
\[ P(|X - E[X]| \ge k) \le \frac{\text{Var}(X)}{k^2} \]
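Using the same coin-flip setup as above (10 fair flips, so \(E[X] = 5\) and \(\text{Var}(X) = 2.5\)), Chebyshev gives a noticeably tighter handle on deviations than Markov:

```python
import random

random.seed(0)
trials = 100_000
samples = [sum(random.randint(0, 1) for _ in range(10)) for _ in range(trials)]

mu, var, k = 5.0, 2.5, 3.0  # exact mean/variance of 10 fair flips; k = 3
empirical = sum(abs(s - mu) >= k for s in samples) / trials
chebyshev_bound = var / k ** 2

assert empirical <= chebyshev_bound
print(empirical, "<=", chebyshev_bound)
```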
Chernoff Bounds
Chernoff bounds give exponentially strong concentration for sums of independent random variables. Let \(X_1, X_2, \ldots, X_n\) be independent indicator random variables, and let \(X = \sum_{i=1}^{n} X_i\) with \(\mu = E[X]\).
For any \(\delta > 0\):
\[ P(X \ge (1 + \delta)\mu) \le \left(\frac{e^\delta}{(1+\delta)^{(1+\delta)}}\right)^\mu \]
Simplified (\(0 < \delta \le 1\)): \(\quad P(X \ge (1+\delta)\mu) \le e^{-\mu\delta^2/3}\)
The key idea uses the moment generating function and Markov’s inequality.
For any \(t > 0\), the event \(X \ge a\) is equivalent to \(e^{tX} \ge e^{ta}\).
Apply Markov’s inequality.
\[ P(X \ge a) = P(e^{tX} \ge e^{ta}) \le \frac{E[e^{tX}]}{e^{ta}} \]
Since the \(X_i\) are independent, the MGF factors.
\[ E[e^{tX}] = \prod_{i=1}^{n} E[e^{tX_i}] \]
For an indicator variable with \(P(X_i = 1) = p_i\):
\[ E[e^{tX_i}] = 1 + p_i(e^t - 1) \le e^{p_i(e^t - 1)} \]
using the inequality \(1 + x \le e^x\).
\[ E[e^{tX}] \le e^{\mu(e^t - 1)} \]
Substitute \(a = (1+\delta)\mu\) and optimize over \(t\) by setting \(t = \ln(1+\delta)\).
\[ P(X \ge (1+\delta)\mu) \le \frac{e^{\mu(e^{\ln(1+\delta)} - 1)}}{e^{(1+\delta)\mu\ln(1+\delta)}} = \left(\frac{e^\delta}{(1+\delta)^{(1+\delta)}}\right)^\mu \]
For \(0 < \delta < 1\):
\[ P(X \le (1 - \delta)\mu) \le \left(\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right)^\mu \]
Simplified (\(0 < \delta < 1\)): \(\quad P(X \le (1-\delta)\mu) \le e^{-\mu\delta^2/2}\)
The argument mirrors the upper tail, but with \(t < 0\).
For any \(t < 0\), the event \(X \le a\) is equivalent to \(e^{tX} \ge e^{ta}\) (the inequality flips because \(t\) is negative).
Apply Markov’s inequality.
\[ P(X \le a) = P(e^{tX} \ge e^{ta}) \le \frac{E[e^{tX}]}{e^{ta}} \]
The MGF bound is the same as before since it holds for all \(t\).
\[ E[e^{tX}] \le e^{\mu(e^t - 1)} \]
Substitute \(a = (1-\delta)\mu\) and optimize over \(t\) by setting \(t = \ln(1-\delta)\) (which is negative for \(0 < \delta < 1\)).
\[ P(X \le (1-\delta)\mu) \le \frac{e^{\mu(e^{\ln(1-\delta)} - 1)}}{e^{(1-\delta)\mu\ln(1-\delta)}} = \left(\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right)^\mu \]
For \(0 < \delta \le 1\):
\[ P(|X - \mu| \ge \delta\mu) \le 2e^{-\mu\delta^2/3} \]
This is the most commonly used form in practice — it bounds deviation in either direction.
The event \(|X - \mu| \ge \delta\mu\) means either \(X \ge (1+\delta)\mu\) or \(X \le (1-\delta)\mu\).
Apply the union bound.
\[ P(|X - \mu| \ge \delta\mu) \le P(X \ge (1+\delta)\mu) + P(X \le (1-\delta)\mu) \]
Apply the simplified upper and lower tail bounds.
\[ \le e^{-\mu\delta^2/3} + e^{-\mu\delta^2/2} \]
Since \(e^{-\mu\delta^2/2} \le e^{-\mu\delta^2/3}\) for \(\mu, \delta > 0\):
\[ \le 2e^{-\mu\delta^2/3} \]
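To see how far inside the bound a typical sum sits, here is a simulation of \(n = 100\) fair-coin indicators (\(\mu = 50\)) with \(\delta = 0.3\), compared against the two-sided bound:

```python
import math
import random

random.seed(0)
n, p, delta = 100, 0.5, 0.3
mu = n * p

trials = 20_000
deviations = 0
for _ in range(trials):
    x = sum(random.random() < p for _ in range(n))  # sum of n indicators
    if abs(x - mu) >= delta * mu:
        deviations += 1

empirical = deviations / trials
chernoff_bound = 2 * math.exp(-mu * delta ** 2 / 3)  # 2e^{-mu delta^2 / 3}

assert empirical <= chernoff_bound
print(empirical, "<=", chernoff_bound)
```

The empirical deviation frequency is orders of magnitude below the bound here, which is typical: Chernoff bounds are exponentially strong but still conservative.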