These notes are intended to provide a quick refresher on the basics of probability before we introduce the use of randomization in algorithms. Probabilistic analysis tends to be non-trivial even in simple settings. Various tools have been developed over the years and it is helpful to see some simple templates. For the most part we will work with discrete (in fact finite) probability spaces, but we will also need some tools from continuous spaces. There are several subtleties involved in the formal aspects of probability theory when dealing with continuous spaces. These will not be of concern to us, but it is useful to know why one needs more involved definitions.
1.1. Discrete Probability Space and Events
We will start with simple finite probability spaces. Consider the experiment of tossing an unbiased coin which turns up heads with probability $1/2$ and tails with probability $1/2$, or tossing a 6-sided die where each side is equally likely (with probability $1/6$). To formally understand probability we need to define a probability space. In the finite setting it consists of two pieces of information: a set $\Omega$ of atomic or elementary events, and for each $\omega \in \Omega$ the probability that $\omega$ is realized.
Definition 1.1. A finite probability space is a pair $(\Omega, \mathbf{Pr})$ consisting of a finite set $\Omega$ of elementary events and a function $\mathbf{Pr}: \Omega \rightarrow [0,1]$ which assigns a probability $\mathbf{Pr}[\omega]$ to each $\omega \in \Omega$ such that $\sum_{\omega \in \Omega} \mathbf{Pr}[\omega] = 1$.
Example 1. An unbiased coin. $\Omega = \{H, T\}$ and $\mathbf{Pr}[H] = \mathbf{Pr}[T] = 1/2$.
Example 2. A biased coin. $\Omega = \{H, T\}$ and $\mathbf{Pr}[H] = 1/4$ and $\mathbf{Pr}[T] = 3/4$.
Example 3. A $6$-sided unbiased die. $\Omega = \{1,2,3,4,5,6\}$ and $\mathbf{Pr}[i] = 1/6$ for $1 \leq i \leq 6$.
From simple probability spaces one often creates more complex ones by the product construction.
Definition 1.2. Suppose $(\Omega_1, \mathbf{Pr}_1)$ and $(\Omega_2, \mathbf{Pr}_2)$ are two finite probability spaces. Then the product probability space $(\Omega, \mathbf{Pr})$ is defined as follows: $\Omega = \Omega_1 \times \Omega_2$ and $\mathbf{Pr}[(\omega_1, \omega_2)] = \mathbf{Pr}_1[\omega_1] \, \mathbf{Pr}_2[\omega_2]$ for all $\omega_1 \in \Omega_1, \omega_2 \in \Omega_2$.
The reader unfamiliar with product constructions should verify that the product space is indeed a probability space, that is, $\sum_{\omega \in \Omega} \mathbf{Pr}[\omega] = 1$.
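This verification is easy to sketch in code. The following minimal Python snippet (the choice of a biased coin and a fair die as the two factor spaces is arbitrary) builds the product space of Definition 1.2 and checks that its probabilities sum to 1:

```python
from itertools import product

# Two finite probability spaces from the examples above.
coin = {"H": 0.25, "T": 0.75}              # biased coin (Example 2)
die = {i: 1.0 / 6 for i in range(1, 7)}    # fair 6-sided die (Example 3)

# Product construction: Pr[(w1, w2)] = Pr1[w1] * Pr2[w2].
prod_space = {(w1, w2): p1 * p2
              for (w1, p1), (w2, p2) in product(coin.items(), die.items())}

total = sum(prod_space.values())  # should be 1, since each factor sums to 1
```

The check works because the double sum factors: $\sum_{\omega_1, \omega_2} \mathbf{Pr}_1[\omega_1]\mathbf{Pr}_2[\omega_2] = \big(\sum_{\omega_1} \mathbf{Pr}_1[\omega_1]\big)\big(\sum_{\omega_2} \mathbf{Pr}_2[\omega_2]\big) = 1$.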
One can easily extend the above product definition to the product of a finite number of spaces. Things get tricky with infinite products, but we will not need them.
Example 4. Two unbiased coins. One can view this as a single probability space $(\Omega, \mathbf{Pr})$ where $\Omega = \{HH, HT, TH, TT\}$ and the probability of each elementary event is $1/4$, or view it as the product of two spaces, each the space of one unbiased coin. The product view is often useful when considering independence. On the other hand, when correlations are present it is useful to view it as a single space.
Most things from finite probability spaces extend to a probability space over a countably infinite set. Recall that a set $A$ is countably infinite if there is a bijection from $A$ to the natural numbers. Examples include the set of even numbers, the set of primes, and the set of all strings over a finite alphabet.
Example 5. Let $\lambda > 0$ be a fixed real number. Consider the space $(\Omega, \mathbf{Pr})$ where $\Omega = \{0, 1, 2, \ldots\}$ is the set of non-negative integers and $\mathbf{Pr}(i) = e^{-\lambda} \frac{\lambda^i}{i!}$ for each $i \geq 0$. This is the Poisson distribution with mean $\lambda$. We see that $\sum_{i \geq 0} \mathbf{Pr}(i) = 1$ since $e^{\lambda} = \sum_{i \geq 0} \frac{\lambda^i}{i!}$.
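A quick numeric sanity check of this example, sketched in Python (the choice $\lambda = 2.5$ is arbitrary): truncating the infinite sum at a large index gives a total very close to $1$, and the truncated first moment is close to $\lambda$.

```python
import math

lam = 2.5  # an arbitrary choice of the mean

def poisson_partial_sum(lam, n):
    # Partial sum of e^{-lam} * lam^i / i! for i = 0..n; approaches 1 as n grows.
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(n + 1))

total = poisson_partial_sum(lam, 100)
# Truncated mean: sum of i * Pr(i); the tail beyond i = 100 is negligible here.
mean = sum(i * math.exp(-lam) * lam**i / math.factorial(i) for i in range(101))
```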
Question 1. Suppose $\Omega$ is the set of positive integers. We want a probability space where $\mathbf{Pr}(i) = c/i^2$ for each $i \geq 1$, where $c$ is a fixed constant. Is it possible? What if we wanted a probability space where $\mathbf{Pr}(i) = c/i$ for each $i$?
Events: In most probability spaces what we are really after are interesting events.
Definition 1.3. Given a probability space $(\Omega, \mathbf{Pr})$, an event is a subset of $\Omega$. In other words an event is a collection of elementary events. The probability of an event $A$, denoted by $\mathbf{Pr}[A]$, is $\sum_{\omega \in A} \mathbf{Pr}[\omega]$. The complement event of an event $A \subseteq \Omega$ is the event $\Omega \setminus A$, frequently denoted by $\bar{A}$.
Example 6. A pair of independent dice. $\Omega = \{(i, j) \mid 1 \leq i \leq 6, 1 \leq j \leq 6\}$. Let $A$ be the event that the sum of the two numbers on the dice is even. Then $A = \{(i, j) \in \Omega : (i+j) \text{ is even}\}$, and $\mathbf{Pr}[A] = |A|/36 = 1/2$.
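Since the space is tiny, $\mathbf{Pr}[A]$ can be checked by brute-force enumeration; a sketch in Python using exact rational arithmetic:

```python
from fractions import Fraction

# The 36 equally likely elementary events for a pair of dice.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
pr = Fraction(1, 36)  # uniform probability of each elementary event

# Event A: the sum of the two dice is even.
A = [(i, j) for (i, j) in omega if (i + j) % 2 == 0]
pr_A = len(A) * pr  # Pr[A] = |A| / 36
```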
Question 2. Consider the probability space corresponding to the Poisson distribution with mean $\lambda$. Let $A$ be the event that the outcome is an even number. What is $\mathbf{Pr}[A]$?
The following question illustrates that one can easily define events over relatively simple probability spaces but whose probability may not be easy to compute.
Question 3. Let $(\Omega, \mathbf{Pr})$ be the probability space corresponding to the product of the probability spaces of 100 unbiased coins. Let $A$ be the event that the number of heads is a prime number. What is $\mathbf{Pr}[A]$?
Algebra over events: Given two events $A, B$ we can define their union as the event $A \cup B$ and their intersection as the event $A \cap B$. We can easily prove the following in the setting of finite probability spaces via basic set theory:
$$\mathbf{Pr}[A] + \mathbf{Pr}[B] = \mathbf{Pr}[A \cup B] + \mathbf{Pr}[A \cap B],$$
which can be written alternatively as
$$\mathbf{Pr}[A \cup B] = \mathbf{Pr}[A] + \mathbf{Pr}[B] - \mathbf{Pr}[A \cap B].$$
What about the union of 3 events $A, B, C$? You can convince yourself via a Venn diagram that
$$\mathbf{Pr}[A \cup B \cup C] = \mathbf{Pr}[A] + \mathbf{Pr}[B] + \mathbf{Pr}[C] - \mathbf{Pr}[A \cap B] - \mathbf{Pr}[B \cap C] - \mathbf{Pr}[A \cap C] + \mathbf{Pr}[A \cap B \cap C].$$
A powerful generalization is the inclusion-exclusion formula for the union of a finite collection of events $A_1, A_2, \ldots, A_n$.
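One can spot-check the three-event formula above by enumeration on any small space; a sketch in Python (the particular events $A$, $B$, $C$ over a fair die are arbitrary choices):

```python
from fractions import Fraction

# Uniform space: outcomes of one fair die.
omega = set(range(1, 7))

def pr(E):
    # Probability of an event E (a subset of omega) under the uniform measure.
    return Fraction(len(E), len(omega))

A = {i for i in omega if i % 2 == 0}   # even outcome
B = {i for i in omega if i <= 3}       # outcome at most 3
C = {i for i in omega if i % 3 == 0}   # outcome divisible by 3

lhs = pr(A | B | C)
rhs = (pr(A) + pr(B) + pr(C)
       - pr(A & B) - pr(B & C) - pr(A & C)
       + pr(A & B & C))
```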
Probability space vs distribution: We often talk about probability distributions. What is the distinction between a probability space and a distribution? Distributions should be thought of as templates for an actual concrete experiment. When we say $X_1, X_2, \ldots, X_n$ are distributed according to, say, the Poisson distribution with mean $\lambda$, what we mean is that we instantiate the probability space for each $X_i$ using the values given by the distribution. This avoids the task of specifying in gory detail the probability space for standard cases.
1.2. Continuous Probability Spaces
Consider the experiment of picking a uniformly random number from the continuous interval $[0,1]$. It is natural to view the underlying probability space $\Omega$ as $[0,1]$ where the elementary events are the real numbers in the interval $[0,1]$. How should we define $\mathbf{Pr}$ over this space? Since $\Omega$ is uncountably infinite we cannot associate any non-zero value to $\mathbf{Pr}[x]$ for any $x \in [0,1]$; thus we need to set $\mathbf{Pr}[x] = 0$. On the other hand we intuitively see that the probability that a number picked uniformly from $[0,1]$ lies in the interval $[a,b]$ with $b \geq a$ should be its length $b - a$. How do we reconcile this with setting the probability of each atomic event in $[a,b]$ to $0$? One can see that it is not feasible to define probability measures over continuous spaces by defining the values over elementary events and then assigning the values to events. Hence a more general approach is to define the probability measure over all events simultaneously, as long as they satisfy certain basic and intuitive axioms. For the interval $[0,1]$ we would like $\mathbf{Pr}[A]$ for any event/subset $A \subseteq [0,1]$ to correspond to its "length", which is intuitive for interval events.
So an attempt would be to define $\mathbf{Pr}[A]$ for all $A \subseteq \Omega$, with the following natural axioms: (i) $\mathbf{Pr}[\Omega] = 1$; (ii) for each event $A$, $\mathbf{Pr}[\bar{A}] = 1 - \mathbf{Pr}[A]$; (iii) for any countably infinite set of disjoint events $A_1, A_2, \ldots$, we have $\mathbf{Pr}\left[\cup_{i=1}^{\infty} A_i\right] = \sum_{i=1}^{\infty} \mathbf{Pr}[A_i]$. The last property may appear strong since we ask for countably infinite unions, but this is needed in many applications of probability. It turns out that one cannot create such a measure due to various technical reasons and paradoxes. To avoid these issues standard measure theory takes the view that we do not need to assign a measure to all possible subsets of $\Omega$ but only to some subsets $\mathscr{F}$. We want $\mathscr{F}$ to have some basic closure properties as well as include all interesting sets that we can intuitively think of. For example, in the setting of $[0,1]$, all intervals and their complements and finite unions and intersections should be included. A set not in $\mathscr{F}$ is called non-measurable. Formally we define the notion of a $\sigma$-algebra.
Definition 1.4. Given a set $\Omega$, a collection of subsets $\mathscr{F} \subseteq \mathscr{P}(\Omega)$ is a $\sigma$-algebra if the following properties hold:
$\Omega \in \mathscr{F}$
$A \in \mathscr{F}$ implies that $\bar{A} \in \mathscr{F}$ (closure under complementation)
If $A_1, A_2, \ldots$ is a countably infinite sequence of sets such that each $A_i \in \mathscr{F}$ then $\cup_{i=1}^{\infty} A_i \in \mathscr{F}$ (closure under countable union)
Note that closure under complementation and countable union also implies closure under finite intersection. That is, if $A_1, A_2, \ldots, A_n \in \mathscr{F}$ then $\cap_{i=1}^{n} A_i \in \mathscr{F}$.
Remark 1.5. For a finite or a countably infinite probability space we let $\mathscr{F} = \mathscr{P}(\Omega)$, and hence it is often not explicitly mentioned.
With the discussion in place the formal definition of a probability space is as follows.
Definition 1.6. A probability space is a triple $(\Omega, \mathscr{F}, \mathbf{Pr})$ which consists of a set $\Omega$, a $\sigma$-algebra $\mathscr{F}$ over $\Omega$ and a probability measure $\mathbf{Pr}: \mathscr{F} \rightarrow [0,1]$ such that
$\mathbf{Pr}[\Omega] = 1$
For $A \in \mathscr{F}$, $\mathbf{Pr}[A] = 1 - \mathbf{Pr}[\bar{A}]$.
For any countably infinite sequence of disjoint sets $A_1, A_2, \ldots$ such that $A_i \in \mathscr{F}$ for all $i \geq 1$, $\mathbf{Pr}\left[\cup_i A_i\right] = \sum_{i=1}^{\infty} \mathbf{Pr}[A_i]$.
Density functions: Almost all of the time we don't deal with the above subtleties, since we typically define continuous probability measures via density functions and cumulative distribution functions. In other words we define the probability measure indirectly via integration. Consider $\Omega = [0,1]$. We specify a density function $f: [0,1] \rightarrow \mathbb{R}_{\geq 0}$ and let $\mathbf{Pr}[A] = \int_A f \, dx$. For instance if we want the uniform distribution over $[0,1]$ we let $f(x) = 1$ and we see that $\mathbf{Pr}[[a,b]] = \int_a^b 1 \, dx = b - a$. What this formula hides is that it doesn't tell us how to compute the standard Riemann integral for an arbitrary set $A \subset [0,1]$, but only over "nice" sets such as intervals and their finite unions and intersections. However, for all practical purposes in this course, and for most applications that most of us will encounter, defining probability measures via integration of "nice" functions over "nice" sets suffices.
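To see the integral definition in action numerically, here is a small sketch (a midpoint Riemann sum; the interval $[0.25, 0.6]$ is an arbitrary choice) checking that the uniform density assigns an interval its length:

```python
def riemann(f, a, b, n=100000):
    # Midpoint Riemann sum approximating the integral of f over [a, b].
    h = (b - a) / n
    return sum(f(a + (k + 0.5) * h) for k in range(n)) * h

uniform_density = lambda x: 1.0
# Pr[[0.25, 0.6]] under the uniform distribution should be 0.6 - 0.25 = 0.35.
p = riemann(uniform_density, 0.25, 0.6)
```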
Example 7. The standard normal distribution over the entire real line $(-\infty, \infty)$ is defined by the density function $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$.
Question 4. Consider the density function $f(x) = cx^2$ over the set $\Omega = [0,1]$ where $c$ is a fixed constant. What should $c$ be so that $\mathbf{Pr}[\Omega] = 1$? What is $\mathbf{Pr}[A]$ where $A = [0.2, 0.4]$? What about when $A = [0.1, 0.2] \cup [0.3, 0.6]$?
Remark 1.7. We will not mention the issue of the $\sigma$-algebra $\mathscr{F}$ from now on, even when discussing continuous probability measures.
1.3. Union Bound
The union bound is a very simple bound but extremely useful, because it does not make any assumptions about the events involved.
Lemma 1.8. Let $A, B$ be two events in a probability space $(\Omega, \mathbf{Pr})$. Then $\mathbf{Pr}[A \cup B] \leq \mathbf{Pr}[A] + \mathbf{Pr}[B]$. More generally, let $A_1, A_2, \ldots, A_n$ be a finite collection of events; then
$$\mathbf{Pr}\left[\bigcup_{i=1}^{n} A_i\right] \leq \sum_{i=1}^{n} \mathbf{Pr}[A_i].$$
Proof: The discrete case is easy to prove. We consider the continuous case. $A \cup B$ is the disjoint union of $A - B$, $A \cap B$, $B - A$ (if $A, B$ are in the $\sigma$-algebra then $A \cup B, A \cap B, A - B, B - A$ are also in the $\sigma$-algebra; do you see why?). Hence,
$$\begin{aligned}
\mathbf{Pr}[A \cup B] &= \mathbf{Pr}[A-B] + \mathbf{Pr}[A \cap B] + \mathbf{Pr}[B-A] \\
&\leq \mathbf{Pr}[A-B] + \mathbf{Pr}[A \cap B] + \mathbf{Pr}[A \cap B] + \mathbf{Pr}[B-A] \\
&= \mathbf{Pr}[A] + \mathbf{Pr}[B].
\end{aligned}$$
We used the third axiom of probability on additivity over disjoint sets in the first and third equalities and non-negativity of probability for the inequality in the second line.
1.4. Independent Events
Independence is a fundamental notion in probability.
Definition 1.9. Let $A, B$ be two events in a probability space. They are independent iff $\mathbf{Pr}[A \cap B] = \mathbf{Pr}[A]\,\mathbf{Pr}[B]$; otherwise they are dependent.
Example 8. Two coins. $\Omega = \{HH, TT, HT, TH\}$ and $\mathbf{Pr}[HH] = \mathbf{Pr}[TT] = \mathbf{Pr}[HT] = \mathbf{Pr}[TH] = 1/4$.
$A$ is the event that the first coin is heads and $B$ is the event that the second coin is tails. $A, B$ are independent.
$A$ is the event that both are not tails and $B$ is the event that the second coin is heads. $A, B$ are dependent.
The formal definition of independence is often not very useful in understanding when two events are independent, since calculating the probabilities of the various quantities is not straightforward. A more intuitive definition is that $A$ and $B$ are independent if knowing whether one of them happened does not tell us anything about whether the other happened. In complex settings, often the reason two events $A, B$ can be seen to be independent is that they are based on different parts of the probability space. In fact, in most cases, we define or construct probability spaces via independence! Consider the following example.
Example 9. Toss $100$ independent unbiased coins. Let $A$ be the event that among the first $5$ coins there are at least $2$ heads and let $B$ be the event that the number of tails in the last $10$ coins is more than $7$. Then $A, B$ are independent events.
Note that $A, B$ are obviously independent because the coins are defined to be independent. What is the probability space here? It is the product of the probability spaces of the individual coins. We formalize this in the lemma below.
Lemma 1.10. Let $(\Omega, \mathbf{Pr})$ be the product of $k$ probability spaces $(\Omega_1, \mathbf{Pr}_1), \ldots, (\Omega_k, \mathbf{Pr}_k)$. For $S \subseteq [k]$ we say that an event $A$ depends only on $S$ if $A = B \times C$ where $B = \prod_{i \notin S} \Omega_i$ and $C \subseteq \prod_{j \in S} \Omega_j$. If $A$ is an event that depends only on $S \subset [k]$ and $A'$ is an event that depends only on $T \subset [k]$, and $S \cap T = \emptyset$, then $A, A'$ are independent.
Proof: Exercise.
2. Random Variables
The most important concept that you need is the idea of a random variable. A random variable in a general context is simply a mapping/function from a probability space $(\Omega, \mathbf{Pr})$ to another probability/measure space $\Omega'$. However, for the most part, we will be mainly interested in real-valued random variables and indicator random variables.
Definition 2.1. Given a probability space $(\Omega, \mathbf{Pr})$, a random variable $X$ over $\Omega$ is a function that maps each elementary event to a real number. In other words $X: \Omega \rightarrow \mathbb{R}$.
An indicator (or binary) random variable is one whose range is $\{0,1\}$, that is, $X(\omega) = 0$ or $X(\omega) = 1$ for all $\omega \in \Omega$.
Discrete versus continuous random variables: If the underlying probability space is discrete then even when $X$ is a real-valued random variable the range is also effectively discrete, although it is embedded in the continuous space of real numbers. However, when the original space $\Omega$ is continuous one has to be more careful in general. Not all functions mapping $\Omega$ to $\mathbb{R}$ work; there is a notion of measurable functions that we will not discuss here. For calculations, we will need a density function $f_X$ for the probability distribution induced by $X$ over the real numbers. Given a density function $f$ defining the probability measure on $(\Omega, \mathbf{Pr})$, we can usually compute the density function $f_X$ associated with the transformation $X$ via calculus.
For real-valued random variables an important notion closely related to the density function is the cumulative distribution function (cdf). Typically the density is denoted by $f_X$ and the cumulative distribution function by $F_X$. The function $F_X: \mathbb{R} \rightarrow [0,1]$ is defined as $F_X(t) = \mathbf{Pr}[X \leq t]$. One can see that $\mathbf{Pr}[X \leq t] = \int_{-\infty}^{t} f_X(y) \, dy$. This also shows that $f_X = \frac{d}{dt} F_X$.
Example 10. Consider the probability space $(\Omega, \mathbf{Pr})$ where $\Omega = \{1, 2, \ldots, 6\}$, corresponding to the outcomes of a 6-sided die with uniform probability. Let $X: \Omega \rightarrow \mathbb{R}$ where $X(i) = i^2$. Then $X$ is an integer-valued random variable.
Example 11. Consider the probability space $(\Omega, \mathbf{Pr})$ where $\Omega = \{1, 2, \ldots, 6\}$, corresponding to the outcomes of a 6-sided die with uniform probability. Let $X: \Omega \rightarrow \{0,1\}$ where $X(i) = i \bmod 2$. Then $X$ is an indicator random variable.
Example 12. Consider the probability space $(\Omega, \mathbf{Pr})$ where $\Omega = [0,1]$ and $\mathbf{Pr}$ is the uniform distribution. Let $X: \Omega \rightarrow \mathbb{R}$ where $X(a) = \exp(a)$. Then $X$ is a real-valued random variable.
In the preceding example how do we compute the density function of $X$? Consider the uniform probability distribution over $[0,1]$. The density function $f$ corresponding to this is $f(x) = 1$ for all $x \in [0,1]$. The random variable $X$ maps $[0,1]$ to $[1, e]$, and in this case $X$ is a one-to-one increasing function. It is easier to work with the cumulative distribution functions $F$ and $F_X$. We see that $F(x) = x$ for $x \in [0,1]$. Since $X$ is one-to-one, for any $t \in [1, e]$ we have $F_X(t) = \mathbf{Pr}[X \leq t] = F(\ln t) = \ln t$. And hence $f_X(t) = \frac{d}{dt} F_X(t) = 1/t$.
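As a numeric cross-check of this calculation, the sketch below differentiates $F_X(t) = \ln t$ numerically and compares against the claimed density $1/t$ (the evaluation point $t = 2$ is an arbitrary choice in $[1, e]$):

```python
import math

def F_X(t):
    # CDF of X = exp(U) for U uniform on [0,1]: Pr[X <= t] = ln t for t in [1, e].
    return math.log(t)

def f_X(t, h=1e-6):
    # Density recovered as a central-difference derivative of the CDF.
    return (F_X(t + h) - F_X(t - h)) / (2 * h)

t = 2.0
approx = f_X(t)  # should be very close to 1/t = 0.5
```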
Why random variables? Random variables allow us to take specific views of a complex probability space and analyze that particular view. Consider any $\omega'$ in the range of a random variable $X: \Omega \rightarrow \Omega'$; $X$ collapses all atomic events in $X^{-1}(\omega')$ into $\omega'$. Thus one can view $X$ as partitioning the original space $\Omega$ and associating with each part of this partition a symbol; hence $X$ implicitly creates a new probability space. In the real-valued case, by associating a number instead of a symbol, we gain the ability to calculate various quantities of interest. Moreover, as we will see, the judicious use of random variables to calculate and estimate various quantities of interest is the key to the analysis of probabilistic methods and randomized algorithms.
Events and Indicator Random Variables: Recall that an event in a probability space is simply a subset of atomic events in the discrete case (which will be our main focus), or belongs to the $\sigma$-algebra in the continuous case. With each event $A$ we can associate an indicator random variable $X_A$ where $X_A(\omega) = 1$ if $\omega \in A$ and $0$ otherwise. Why is this useful? It is useful because we have associated a number with the event, and now we can do manipulations with numbers when considering multiple events, instead of working with set algebra, which can become quite complicated quickly. We will see several examples soon. Conversely, each indicator random variable $X: \Omega \rightarrow \{0,1\}$ corresponds to an event $A_X$ where $A_X = \{\omega \mid X(\omega) = 1\}$. Given a random variable $X$, what is $\mathbf{Pr}[X = b]$ for some $b$ in the range of $X$? We consider the event $X^{-1}(b) = \{\omega \in \Omega \mid X(\omega) = b\}$ and $\mathbf{Pr}[X = b] = \mathbf{Pr}\left[X^{-1}(b)\right]$.
2.1. Independence of Random Variables
We saw the definition of independence of events. A more general notion is that of independence of random variables.
Definition 2.2. Two discrete random variables $X, Y$ defined over the same probability space $(\Omega, \mathbf{Pr})$ are independent if for all $a, b$ we have $\mathbf{Pr}[X = a, Y = b] = \mathbf{Pr}[X = a]\,\mathbf{Pr}[Y = b]$. More generally, $X, Y$ are independent if $\mathbf{Pr}[X \in A, Y \in B] = \mathbf{Pr}[X \in A]\,\mathbf{Pr}[Y \in B]$ for all measurable sets $A, B$.
As in the discussion of independence of events, the formal definition is less useful for determining when random variables are actually independent. Intuitively, two random variables are independent if knowing the value of one does not tell us any information about the value of the other.
2.2. Expectation
Recall that the advantage of a real-valued random variable is that we can calculate with it. The most basic information associated with a real-valued random variable is its expectation.
Definition 2.3. For a real-valued discrete random variable $X$ over a probability space $(\Omega, \mathbf{Pr})$, the expectation of $X$, denoted by $\mathbf{E}[X]$, is defined as $\sum_{\omega \in \Omega} \mathbf{Pr}[\omega] X(\omega)$. In other words, the expectation is the average value of $X$ according to the probabilities given by $\mathbf{Pr}[\cdot]$.
Definition 2.4. For a real-valued random variable $X$ over a continuous probability space $(\Omega, \mathbf{Pr})$ where $f_X$ is the density function of $X$, the expectation of $X$ is defined as $\int_{-\infty}^{\infty} y f_X(y) \, dy$.
Remark 2.5. Note that for infinite probability spaces the expectation of a random variable may not be finite, in which case we say it does not exist. Consider the random variable $X$ where $\mathbf{Pr}[X = i] = c/i^2$ for all integers $i \geq 1$, where $c = 6/\pi^2$. Then $\mathbf{E}[X] = \sum_{i \geq 1} c/i$, which diverges to $\infty$.
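A short sketch illustrating the divergence: the partial sums $\sum_{i=1}^{n} c/i$ track $c \ln n$ and keep growing without bound as $n$ increases.

```python
import math

c = 6 / math.pi**2  # normalizing constant so that sum of c/i^2 equals 1

def partial_expectation(n):
    # n-th partial sum of E[X] = sum_{i>=1} i * (c / i^2) = sum_{i>=1} c / i.
    return sum(c / i for i in range(1, n + 1))

s_small = partial_expectation(10**3)
s_big = partial_expectation(10**6)
# The gap is roughly c * ln(10^6 / 10^3) ~ 4.2, so the sums never level off.
```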
Remark 2.6. Often one works with functions of random variables, which can be viewed as random variables themselves, but calculating the expectation of the resulting function can be done more directly than by computing the density function of the new random variable. Suppose $X$ is a real-valued random variable with density function $f_X$ and we are interested in the function $g(X)$ (say $g$ is given by $g(y) = y^2$). Then $\mathbf{E}[g(X)] = \int g(y) f_X(y) \, dy$.
Expectation of an indicator random variable: A very useful fact is the following easy observation.
Lemma 2.7. For an event $A$ let $X_A$ be the indicator random variable for $A$. Then $\mathbf{E}[X_A] = \mathbf{Pr}[A]$.
2.3. Linearity of Expectation and Examples
One of the most useful facts about expectation is the following simple one about linearity of expectation.
Lemma 2.8. Let $X, Y$ be two real-valued random variables over the same probability space. Then $\mathbf{E}[X + Y] = \mathbf{E}[X] + \mathbf{E}[Y]$. More generally, for any finite collection $X_1, X_2, \ldots, X_n$ of random variables over the same probability space and real numbers $a_1, a_2, \ldots, a_n$, we have $\mathbf{E}\left[\sum_{i=1}^{n} a_i X_i\right] = \sum_{i=1}^{n} a_i \mathbf{E}[X_i]$.
The preceding lemma easily follows from the definition of expectation.
We now give an example to illustrate the use of linearity of expectation. Let $G = (V, E)$ be a graph with $n$ vertices and $m$ edges. Let $G'$ be the graph resulting from independently deleting every vertex of $G$ with probability $1/2$. What is the expected number of edges in $G'$? How do we analyze this? First we define a random variable $X$ to denote the number of edges in $G'$. What we want is $\mathbf{E}[X]$. The straightforward formula for expectation is $\sum_{k=1}^{|E|} k \, \mathbf{Pr}[X = k]$. How do we calculate the quantity $\mathbf{Pr}[X = k]$ and how do we do the sum? This seems quite hard. The key is to express $X$ as a sum and then use linearity of expectation. For each edge $e \in E$ let $X_e$ be an indicator random variable for whether $e \in G'$. We see that $X = \sum_{e \in E} X_e$, and by linearity of expectation $\mathbf{E}[X] = \sum_{e \in E} \mathbf{E}[X_e]$. Since $X_e$ is an indicator random variable we know that $\mathbf{E}[X_e] = \mathbf{Pr}[X_e = 1]$.
Let $A_e$ be the event that edge $e \in E$ is present in $G'$. Then
$$\mathbf{Pr}\left[A_{e=(u,v)}\right] = \mathbf{Pr}[u \text{ and } v \text{ both present}] = \mathbf{Pr}[u \text{ present}] \cdot \mathbf{Pr}[v \text{ present}] = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}.$$
Here we use the independence in the experiment itself.
$$\mathbf{E}[X] = \mathbf{E}\left[\sum_{e \in E} X_e\right] = \sum_{e \in E} \mathbf{Pr}[A_e] = \frac{m}{4}.$$
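Because the probability space is small for a concrete graph, the answer $m/4$ can also be verified by exact enumeration over all vertex-deletion outcomes; a sketch in Python on a 4-cycle (an arbitrary example graph with $m = 4$, so the expectation should be exactly $1$):

```python
from itertools import product
from fractions import Fraction

# A small concrete graph: the 4-cycle on vertices 0..3 (m = 4 edges).
V = [0, 1, 2, 3]
E = [(0, 1), (1, 2), (2, 3), (3, 0)]

# Enumerate all 2^|V| outcomes of keeping/deleting each vertex independently.
expected_edges = Fraction(0)
for kept in product([False, True], repeat=len(V)):
    pr = Fraction(1, 2 ** len(V))  # each outcome is equally likely
    surviving = sum(1 for (u, v) in E if kept[u] and kept[v])
    expected_edges += pr * surviving
```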
Question 5. Let $G = (V, E)$ be a graph with $n$ vertices and $m$ edges. Assume $G$ has $t$ triangles (a triangle is a simple cycle on three vertices). Let $G'$ be the graph resulting from deleting each vertex of $G$ independently with probability $1/2$. What is the expected number of triangles in $G'$?
Question 6. Suppose $n$ balls are independently thrown into $n$ bins, where for each ball the bin is chosen uniformly at random. (i) What is the probability that bin $1$ is empty? (ii) What is the expected number of empty bins?
2.4. Independence and Product of Expectations
Linearity of expectation works even when the variables are not independent. There are situations when we need to work with products and here independence is useful.
Lemma 2.9. Let $X, Y$ be independent random variables. Then $\mathbf{E}[XY] = \mathbf{E}[X]\,\mathbf{E}[Y]$. More generally, for any finite collection $X_1, X_2, \ldots, X_n$ of mutually independent random variables, $\mathbf{E}\left[\prod_i X_i\right] = \prod_i \mathbf{E}[X_i]$.
We give a proof for the discrete setting.
Proof:
$$\begin{aligned}
\mathbf{E}[X \cdot Y] &= \sum_{\omega \in \Omega} \mathbf{Pr}[\omega] \, (X(\omega) \cdot Y(\omega)) \\
&= \sum_{x, y \in \mathbb{R}} \mathbf{Pr}[X = x \wedge Y = y] \, (x \cdot y) \\
&= \sum_{x, y \in \mathbb{R}} \mathbf{Pr}[X = x] \cdot \mathbf{Pr}[Y = y] \cdot x \cdot y \\
&= \left(\sum_{x \in \mathbb{R}} \mathbf{Pr}[X = x] \, x\right) \left(\sum_{y \in \mathbb{R}} \mathbf{Pr}[Y = y] \, y\right) = \mathbf{E}[X]\,\mathbf{E}[Y].
\end{aligned}$$
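The lemma can be spot-checked exactly on two independent fair dice, where $\mathbf{E}[X] = \mathbf{E}[Y] = 7/2$ and so $\mathbf{E}[XY]$ should equal $49/4$; a minimal sketch:

```python
from fractions import Fraction

die = range(1, 7)
pr = Fraction(1, 36)  # product space of two independent fair dice

E_X = sum(Fraction(i, 6) for i in die)            # expectation of one die: 7/2
E_XY = sum(pr * i * j for i in die for j in die)  # expectation of the product
```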
2.5. Variance and Higher Moments
Expectation of a random variable gives basic information about its behavior. Additional information can be obtained via higher moments. For an integer $t \geq 1$, the $t$'th moment of $X$ is defined as $\mathbf{E}[X^t]$. Expectation is the first moment. The second moment is closely related to a more standard measure, namely the variance.
Definition 2.10. Let $X$ be a real-valued random variable over a probability space such that $\mathbf{E}[X]$ is finite. The variance of $X$, denoted by $\mathbf{Var}[X]$, is $\mathbf{E}\left[(X - \mathbf{E}[X])^2\right]$.
Prove the following claim, which shows that $\mathbf{Var}[X]$ differs from the second moment by an additive term: $\mathbf{Var}[X] = \mathbf{E}[X^2] - (\mathbf{E}[X])^2$.
Note that $\mathbf{Var}[X]$ may not be finite even when $\mathbf{E}[X]$ is finite. Can you think of an example? Note that $\mathbf{Var}[X]$ is non-negative and measures how far $X$ deviates from its expectation. Since $\mathbf{Var}[X]$ measures the squared deviation, we also use $\sigma_X = \sqrt{\mathbf{Var}[X]}$, called the standard deviation, to measure the deviation.
Example 13. Consider a random variable $X$ where $X = -B$ with probability $1/2$ and $X = B$ with probability $1/2$. $\mathbf{E}[X] = 0$ for all values of $B$. On the other hand $\mathbf{Var}[X] = B^2$ and $\sigma_X = B$.
Question 7. What is $\mathbf{Var}[aX]$ for a scalar $a$, as a function of $a$ and $\mathbf{Var}[X]$?
Prove the following lemma.
Lemma 2.11. Let $X, Y$ be independent random variables. Then $\mathbf{Var}[X+Y]=\mathbf{Var}[X]+\mathbf{Var}[Y]$. More generally, if $X_{1}, X_{2}, \ldots, X_{n}$ are independent random variables then $\mathbf{Var}\left[\sum_{i=1}^{n} X_{i}\right]=\sum_{i=1}^{n} \mathbf{Var}\left[X_{i}\right]$.
$\mathbf{E}\left[X^{t}\right]$ is the $t$'th moment of $X$. Moments are typically split into odd and even based on the parity of $t$; the even moments are non-negative and hence tend to be more useful as measures of the deviation of $X$ from its expectation. If $\mathbf{E}\left[X^{t}\right]$ is small compared to $\mathbf{E}[X]$ for larger values of $t$, this indicates that $X$ is very close to its expectation and well-behaved. This leads to concentration inequalities, which find many applications.
3. Conditional Probability and Expectation
Conditional probability allows us to focus on subsets of the probability space. Suppose we have a probability space $(\Omega, \mathbf{Pr})$ and we have information that event $A$ happened but nothing else. If $A=\{\omega\}$ then there is no uncertainty left. However, if $A$ is a non-singleton set then we don't know which elementary event in $A$ may have happened. How do we capture this? Since the discrete case is easier to understand and essentially captures all the intuition, we will stick to it. A natural way is to define a new probability space over the set $A$, since only elementary events in $A$ are feasible now. How should we assign probabilities to each $\omega \in A$? Let $\mathbf{Pr}^{\prime}$ be the probability function over $A$ that we wish to compute. The most natural choice is to set $\mathbf{Pr}^{\prime}[\omega]=\mathbf{Pr}[\omega] / \mathbf{Pr}[A]$ for each $\omega \in A$. This ensures that $\sum_{\omega \in A} \mathbf{Pr}^{\prime}[\omega]=1$, and hence the new measure is indeed a probability measure over $A$. One can also think of $\mathbf{Pr}^{\prime}$ as a probability measure over $\Omega$ instead of over $A$: we set $\mathbf{Pr}^{\prime}[\omega]=0$ for each $\omega \notin A$. $\mathbf{Pr}^{\prime}$ is the probability measure conditioned on event $A$. Typically we write $\mathbf{Pr}^{\prime}$ as $\mathbf{Pr}[\cdot \mid A]$. For an event $B$ we say $\mathbf{Pr}[B \mid A]$ is the probability of $B$ conditioned on event $A$, which is simply $\mathbf{Pr}^{\prime}[B]$. From our definition we have the following easy consequence; note that this really comes from the definition.
Lemma 3.1. For any event $B$, $\mathbf{Pr}[B \mid A]=\mathbf{Pr}^{\prime}[B \cap A]=\frac{\mathbf{Pr}[A \cap B]}{\mathbf{Pr}[A]}$.
The above is often called Bayes's theorem or rule; I prefer "rule" because it follows easily from a definition rather than from a chain of non-trivial deductions. We note that Bayes's rule is applied in different ways depending on the context. In some applications we may have access to $\mathbf{Pr}[A \cap B]$ and $\mathbf{Pr}[A]$ and want to compute $\mathbf{Pr}[B \mid A]$. In others we may have access to $\mathbf{Pr}[B \mid A]$ and $\mathbf{Pr}[A]$ and want to compute $\mathbf{Pr}[A \cap B]$ (for instance in observational studies).
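Lemma 3.1 can be verified on a concrete example. The sketch below (events and names are my own choices) conditions a fair 6-sided die on the event "the outcome is even":

```python
from fractions import Fraction

# Check of Pr[B | A] = Pr[A ∩ B] / Pr[A] on a fair 6-sided die;
# events are represented as sets of outcomes.
omega = set(range(1, 7))

def pr(event):
    return Fraction(len(event), len(omega))

A = {2, 4, 6}   # the outcome is even
B = {4, 5, 6}   # the outcome is greater than 3

cond = pr(A & B) / pr(A)
print(cond)  # 2/3
```

Conditioning on $A$ doubles the weight of each even outcome, so two of the three remaining outcomes lie in $B$.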
Verify the claim below and see why it makes intuitive sense.
Corollary 3.2. $A, B$ are independent events iff $\mathbf{Pr}[B \mid A]=\mathbf{Pr}[B]$.
One can extend conditional probability to random variables in the natural way. Recall that a real-valued random variable $X$ over a probability space $(\Omega, \mathbf{Pr})$ is simply a function from $\Omega$ to $\mathbb{R}$. As such, $X$ does not explicitly depend on the probability measure. However, $\mathbf{E}[X]$ does indeed depend on $\mathbf{Pr}$. We define $\mathbf{E}[X \mid A]$ as the conditional expectation of $X$ given $A$. Its formal definition is simply the expectation of $X$ under the modified probability measure $\mathbf{Pr}^{\prime}=\mathbf{Pr}[\cdot \mid A]$. More formally, in the discrete case,
$$\mathbf{E}[X \mid A]=\sum_{\omega \in \Omega} \mathbf{Pr}[\omega \mid A]\, X(\omega)=\sum_{\omega \in A} \frac{\mathbf{Pr}[\omega]}{\mathbf{Pr}[A]}\, X(\omega).$$
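The formula can be evaluated directly on a small example. The following sketch (my own setup) takes $X$ to be the face value of a fair die and conditions on the event "the outcome is even":

```python
from fractions import Fraction

# E[X | A] computed from the definition: sum over omega in A of
# (Pr[omega] / Pr[A]) * X(omega), for a fair 6-sided die with X the
# face value and A = "the outcome is even".
pr = {w: Fraction(1, 6) for w in range(1, 7)}
A = {2, 4, 6}

pr_A = sum(pr[w] for w in A)
cond_exp = sum((pr[w] / pr_A) * w for w in A)
print(cond_exp)  # 4, compared with E[X] = 7/2
```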
4. Some Probability Distributions of Interest
We will encounter a small number of common distributions often and it is helpful to catalog them and some of their known properties. Often these distributions are parametrized. You can find a lot of useful information as well as pictures and examples on these distributions and much more on Wikipedia.
Bernoulli and Binomial distributions: Coin tosses are perhaps the simplest probability distributions and are extremely well-studied for their simplicity and applicability. Given a number $p \in[0,1]$, consider the two-element space $\{H, T\}$ with $\mathbf{Pr}[H]=p$ and $\mathbf{Pr}[T]=1-p$. This is called a Bernoulli distribution. Often we want to toss $n$ independent coins. Then the probability space has size $2^{n}$, corresponding to $\{H, T\}^{n}$. It is often simpler to use $\{1,0\}$ instead of $\{H, T\}$, in which case the probability space is the set of all binary strings of length $n$. The Binomial distribution with parameters $n$ and $p$ is obtained by considering a random variable $X$ where $X(\omega)$ is equal to the number of heads (or $1$'s) in the string $\omega$. We see that $\mathbf{Pr}[X=k]=\binom{n}{k} p^{k}(1-p)^{n-k}$. Thus the Binomial distribution can alternatively be thought of as a distribution over $\{0,1,2, \ldots, n\}$ where $\mathbf{Pr}[i]=\binom{n}{i} p^{i}(1-p)^{n-i}$.
Suppose $X$ has the Binomial distribution with parameters $n, p$. Then one way to understand $X$ is as a sum of $n$ independent Bernoulli random variables $Y_{1}, Y_{2}, \ldots, Y_{n}$ with parameter $p$. We then see that $\mathbf{E}[X]=\mathbf{E}\left[\sum_{i=1}^{n} Y_{i}\right]=n p$. We also have $\mathbf{Var}[X]=\sum_{i} \mathbf{Var}\left[Y_{i}\right]=n p(1-p)$. Note that a direct computation with binomial coefficients would be much messier.
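A quick numerical check (a sketch with parameters of my own choosing) confirms that the mean and variance computed directly from the Binomial pmf agree with the formulas $np$ and $np(1-p)$ from the sum-of-Bernoullis view:

```python
from math import comb

# Exact Binomial(n, p) pmf; the mean and variance computed from the pmf
# should match np and np(1-p) from the sum-of-Bernoullis argument.
n, p = 10, 0.3
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

mean = sum(k * pmf[k] for k in range(n + 1))
var = sum((k - mean) ** 2 * pmf[k] for k in range(n + 1))
print(mean, var)  # ~3.0 and ~2.1, i.e. np and np(1-p)
```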
Poisson distribution: The Poisson distribution has a single real parameter $\lambda>0$ and is a discrete distribution over the non-negative integers $\{0,1,2, \ldots\}$. For a random variable $X$ distributed according to this distribution, $\mathbf{Pr}[X=i]=e^{-\lambda} \frac{\lambda^{i}}{i!}$. One can prove that $\mathbf{E}[X]=\mathbf{Var}[X]=\lambda$. Do you see why? The Poisson distribution is very useful for a variety of reasons. It is connected to the Binomial distribution as follows: it is the limit distribution of the Binomial distribution with parameters $n, p$ where we take $n \rightarrow \infty$ while reducing $p$ so that $np=\lambda$. Thus the Poisson distribution approximates the Binomial distribution when $n$ is large and $p$ is small; this allows one to avoid messy computations with binomial coefficients and instead use the simpler formulas provided by the Poisson distribution. One very interesting property satisfied by the Poisson distribution is the following: if $X_{1}, X_{2}, \ldots, X_{n}$ are independent Poisson random variables with means $\lambda_{1}, \lambda_{2}, \ldots, \lambda_{n}$ then $X=\sum_{i=1}^{n} X_{i}$ has the Poisson distribution with mean $\sum_{i=1}^{n} \lambda_{i}$.
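The limit claim can be checked numerically. Below is a sketch (with $\lambda$ and $n$ chosen by me) comparing the first few probabilities of $\text{Binomial}(n, \lambda/n)$ with those of $\text{Poisson}(\lambda)$:

```python
from math import comb, exp, factorial

# Poisson(lam) vs Binomial(n, lam/n) for large n: the pointwise
# probabilities should be close, illustrating the limit claim.
lam, n = 2.0, 1000
p = lam / n

binom = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(6)]
poisson = [exp(-lam) * lam**i / factorial(i) for i in range(6)]
print(max(abs(b - q) for b, q in zip(binom, poisson)))  # close to 0
```

Increasing $n$ (while keeping $np=\lambda$ fixed) shrinks the discrepancy further.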
Geometric distribution: This distribution is closely related to the Bernoulli distribution and hence has a single parameter $p$. There are two variants, and both use the same name, so one has to be careful. The first variant has support $\{1,2, \ldots\}$ and counts the number of tosses of a Bernoulli coin until the first head is seen. If $X$ is distributed according to this variant then $\mathbf{Pr}[X=i]=(1-p)^{i-1} p$, $\mathbf{E}[X]=1/p$ and $\mathbf{Var}[X]=(1-p)/p^{2}$.
The second variant measures the number of failures before the first head. This is the first variant shifted by $-1$ and hence has support $\{0,1,2, \ldots\}$. For this second variant the expectation is $(1/p-1)$ and the variance is the same, $(1-p)/p^{2}$.
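The formulas for the first variant can be verified from the pmf directly. A sketch (truncating the infinite sums at a point where the tail is negligible, with $p$ chosen by me):

```python
# First-variant geometric: Pr[X = i] = (1-p)^(i-1) * p for i >= 1.
# Truncated sums over a long prefix approximate E[X] and Var[X];
# the tail beyond 2000 terms is negligible for p = 0.25.
p = 0.25
terms = range(1, 2000)
mean = sum(i * (1 - p) ** (i - 1) * p for i in terms)
var = sum((i - 1 / p) ** 2 * (1 - p) ** (i - 1) * p for i in terms)
print(mean, var)  # ~4.0 (= 1/p) and ~12.0 (= (1-p)/p^2)
```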
Uniform distribution: The uniform distribution over a finite state space is pretty simple. If the state space has size $k$ then each elementary event has probability $1/k$. This can be seen as a generalization of the unbiased Bernoulli distribution. Suppose we have $n$ such independent distributions. Their joint distribution corresponds to the experiment of throwing $n$ balls into $k$ bins. Many aspects of the balls-and-bins experiment have been studied, and we will see examples, applications, and connections to hashing.
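The balls-and-bins experiment is easy to simulate. A toy sketch (parameters and seed are my own choices, just to illustrate the setup):

```python
import random

# n balls thrown independently and uniformly into k bins; the loads
# always sum to n, and their spread previews the questions studied
# in balls-and-bins analyses (e.g. the maximum load).
random.seed(0)
n, k = 1000, 10
loads = [0] * k
for _ in range(n):
    loads[random.randrange(k)] += 1

print(loads)  # k bin loads summing to n
```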
The uniform distribution is also very important in continuous probability spaces. The most familiar one is the uniform distribution over a closed interval $[a, b]$ of the real line. The density function $f$ corresponding to this is $f(x)=1/(b-a)$ for $x \in[a, b]$ and $0$ outside the interval. In general we can define a uniform distribution over more complicated "shapes", especially in higher dimensions. Consider the uniform distribution over the unit circle in the Euclidean plane, or over a specified rectangle in the plane. In such cases the density function $f(x, y)$ is constant over the interior of the shape, where the constant is determined by the normalization factor, namely the area of the shape. Random variables can be used to modify the uniform distribution appropriately to create more complex distributions.
Normal distribution: One of the most fundamental probability distributions is the Normal or Gaussian distribution. It is a distribution over the entire Euclidean space in any fixed dimension $d$. In one dimension it is defined by two parameters, the mean $\mu$ and the variance $\sigma^{2}$ (or standard deviation $\sigma$). The density function of this distribution is $f(x)=\frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$. The standard Normal distribution is obtained by setting $\mu=0$ and $\sigma=1$, in which case the density becomes the simpler expression $\frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} x^{2}}$. The distribution is symmetric about the mean $\mu$. If $X$ is a random variable distributed according to the Normal distribution then, by construction, $\mathbf{E}[X]=\mu$ and $\mathbf{Var}[X]=\sigma^{2}$. The cumulative distribution function $F_{X}(t)=\mathbf{Pr}[X \leq t]$ is often needed, but there is no closed-form expression for it.
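Although $F_{X}$ has no closed form, standard libraries expose it through the error function, using the identity $F_{X}(t)=\frac{1}{2}\left(1+\operatorname{erf}\left(\frac{t-\mu}{\sigma\sqrt{2}}\right)\right)$. A minimal sketch:

```python
from math import erf, sqrt

# Normal CDF via the error function from the standard library:
# F(t) = (1 + erf((t - mu) / (sigma * sqrt(2)))) / 2.
def normal_cdf(t, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + erf((t - mu) / (sigma * sqrt(2.0))))

print(normal_cdf(0.0))             # 0.5, by symmetry about the mean
print(round(normal_cdf(1.96), 3))  # ~0.975 for the standard Normal
```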
Acknowledgments: We thank Manuel Torres for reading and corrections.
Alternatively it is also the same as $\int_{-\infty}^{\infty}\left(1-F_{X}(y)\right) dy$ where $F_{X}$ is the cumulative distribution function of $X$. This can be seen by integration by parts. ↩︎