Probability: The Basics

1. Introduction

These notes are intended to provide a quick refresher on the basics of probability before we introduce the use of randomization in algorithms. Probabilistic analysis tends to be non-trivial even in simple settings. Various tools have been developed over the years and it is helpful to see some simple templates. For the most part we will work with discrete (in fact finite) probability spaces, but we will also need some tools from continuous spaces. There are several subtleties involved in the formal aspects of probability theory when dealing with continuous spaces. These will not be of concern to us, but it is useful to know why one needs more involved definitions.

1.1. Discrete Probability Space and Events

We will start with simple finite probability spaces. Consider the experiment of tossing an unbiased coin which turns up heads with probability $1/2$ and tails with probability $1/2$, or tossing a 6-sided die where each side is equally likely (with probability $1/6$). To formally understand probability we need to define a probability space. In the finite setting it consists of two pieces of information: a set of possible atomic or elementary events $\Omega$, and for each $\omega \in \Omega$ the probability that $\omega$ is realized.
Definition 1.1. A finite probability space is a pair $(\Omega, \mathbf{Pr})$ consisting of a finite set $\Omega$ of elementary events and a function $\mathbf{Pr}: \Omega \rightarrow [0,1]$ which assigns a probability $\mathbf{Pr}[\omega]$ to each $\omega \in \Omega$ such that $\sum_{\omega \in \Omega} \mathbf{Pr}[\omega] = 1$.
Example 1. An unbiased coin. $\Omega = \{H, T\}$ and $\mathbf{Pr}[H] = \mathbf{Pr}[T] = 1/2$.
Example 2. A biased coin. $\Omega = \{H, T\}$ and $\mathbf{Pr}[H] = 1/4$ and $\mathbf{Pr}[T] = 3/4$.
Example 3. A $6$-sided unbiased die. $\Omega = \{1,2,3,4,5,6\}$ and $\mathbf{Pr}[i] = 1/6$ for $1 \leq i \leq 6$.
From simple probability spaces one often creates more complex ones by the product construction.
Definition 1.2. Suppose $(\Omega_1, \mathbf{Pr}_1)$ and $(\Omega_2, \mathbf{Pr}_2)$ are two finite probability spaces. Then the product probability space $(\Omega, \mathbf{Pr})$ is defined as follows: $\Omega = \Omega_1 \times \Omega_2$ and $\mathbf{Pr}[(\omega_1, \omega_2)] = \mathbf{Pr}_1[\omega_1]\,\mathbf{Pr}_2[\omega_2]$ for all $\omega_1 \in \Omega_1, \omega_2 \in \Omega_2$.
The reader unfamiliar with product constructions should verify that the product space is indeed a probability space; that is, $\sum_{\omega \in \Omega} \mathbf{Pr}[\omega] = 1$.
One can easily extend the above definition to the product of any finite number of spaces. Things get tricky if one wants infinite products, but we will not need them.
Example 4. Two unbiased coins. One can view this as a single probability space $(\Omega, \mathbf{Pr})$ where $\Omega = \{HH, HT, TH, TT\}$ and the probability of each elementary event is $1/4$, or view it as the product of two spaces, each the space of one unbiased coin. The product view is often useful when considering independence; on the other hand, when correlations are present it is useful to view it as a single space.
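To make the product construction concrete, here is a minimal Python sketch (the dictionary representation is just an illustrative choice) that builds the two-coin product space of Example 4 and checks that it is indeed a probability space:

```python
import itertools

# One unbiased coin as a finite probability space (Definition 1.1).
coin = {"H": 0.5, "T": 0.5}

# Product space of two coins (Definition 1.2): elementary events are
# pairs, and their probabilities multiply.
product_space = {
    (w1, w2): coin[w1] * coin[w2]
    for w1, w2 in itertools.product(coin, repeat=2)
}

print(product_space)                 # every pair has probability 0.25
print(sum(product_space.values()))   # 1.0, so it is a probability space
```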
Most things from finite probability spaces extend to probability spaces over countably infinite sets. Recall that a set $A$ is countably infinite if there is a bijection from $A$ to the natural numbers. Examples include the set of even numbers, the set of primes, and the set of all strings over a finite alphabet.
Example 5. Let $\lambda > 0$ be a fixed real number. Consider the space $(\Omega, \mathbf{Pr})$ where $\Omega = \{0, 1, 2, \ldots\}$ is the set of non-negative integers and $\mathbf{Pr}[i] = e^{-\lambda} \frac{\lambda^i}{i!}$ for each $i \geq 0$. This is the Poisson distribution with mean $\lambda$. We see that $\sum_{i \geq 0} \mathbf{Pr}[i] = 1$ since $e^{\lambda} = \sum_{i \geq 0} \frac{\lambda^i}{i!}$.
Question 1. Suppose $\Omega$ is the set of positive integers. We want a probability space where $\mathbf{Pr}[i] = c/i^2$ for each $i \geq 1$, where $c$ is a fixed constant. Is it possible? What if we instead wanted a probability space where $\mathbf{Pr}[i] = c/i$ for each $i$?
Events: In most probability spaces what we are really after are interesting events.
Definition 1.3. Given a probability space $(\Omega, \mathbf{Pr})$, an event is a subset of $\Omega$; in other words, an event is a collection of elementary events. The probability of an event $A$, denoted by $\mathbf{Pr}[A]$, is $\sum_{\omega \in A} \mathbf{Pr}[\omega]$. The complement event of an event $A \subseteq \Omega$ is the event $\Omega \setminus A$, frequently denoted by $\bar{A}$.
Example 6. A pair of independent dice. $\Omega = \{(i,j) \mid 1 \leq i \leq 6,\ 1 \leq j \leq 6\}$. Let $A$ be the event that the sum of the two numbers on the dice is even. Then $A = \{(i,j) \in \Omega : (i+j) \text{ is even}\}$ and $\mathbf{Pr}[A] = |A|/36 = 1/2$.
Question 2. Consider the probability space corresponding to the Poisson distribution with mean $\lambda$. Let $A$ be the event that the outcome is an even number. What is $\mathbf{Pr}[A]$?
The following question illustrates that one can easily define events, over relatively simple probability spaces, whose probability is not easy to compute.
Question 3. Let $(\Omega, \mathbf{Pr})$ be the probability space corresponding to the product of the probability spaces of $100$ unbiased coins. Let $A$ be the event that the number of heads is a prime number. What is $\mathbf{Pr}[A]$?
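There is no obvious closed form, but the probability can be evaluated mechanically: the number of heads equals $k$ with probability $\binom{100}{k}/2^{100}$, so one can sum over prime $k$. A small Python sketch of this computation (the helper `is_prime` is ours, not from the notes):

```python
from math import comb

def is_prime(k: int) -> bool:
    """Trial-division primality test; fine for k <= 100."""
    return k >= 2 and all(k % d for d in range(2, int(k ** 0.5) + 1))

# Pr[A] = sum over prime k of C(100, k) / 2^100, since the number of
# heads in 100 fair coins equals k with probability C(100, k) / 2^100.
n = 100
favorable = sum(comb(n, k) for k in range(n + 1) if is_prime(k))
print(favorable / 2 ** n)
```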
Algebra over events: Given two events $A, B$ we can define their union as the event $A \cup B$ and their intersection as $A \cap B$. We can easily prove the following in the setting of finite probability spaces via basic set theory:
$$\mathbf{Pr}[A] + \mathbf{Pr}[B] = \mathbf{Pr}[A \cup B] + \mathbf{Pr}[A \cap B],$$
which can be written alternatively as
$$\mathbf{Pr}[A \cup B] = \mathbf{Pr}[A] + \mathbf{Pr}[B] - \mathbf{Pr}[A \cap B].$$
What about the union of three events $A, B, C$? You can convince yourself via a Venn diagram that
$$\mathbf{Pr}[A \cup B \cup C] = \mathbf{Pr}[A] + \mathbf{Pr}[B] + \mathbf{Pr}[C] - \mathbf{Pr}[A \cap B] - \mathbf{Pr}[B \cap C] - \mathbf{Pr}[A \cap C] + \mathbf{Pr}[A \cap B \cap C].$$
A powerful generalization is the inclusion-exclusion formula for the union of a finite collection of events $A_1, A_2, \ldots, A_n$:
$$\mathbf{Pr}\left[\cup_{i=1}^{n} A_i\right] = \sum_{i=1}^{n} \mathbf{Pr}[A_i] - \sum_{1 \leq i < j \leq n} \mathbf{Pr}[A_i \cap A_j] + \sum_{1 \leq i < j < k \leq n} \mathbf{Pr}[A_i \cap A_j \cap A_k] - \cdots + (-1)^{n-1} \mathbf{Pr}\left[\cap_{i=1}^{n} A_i\right].$$
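One can sanity-check inclusion-exclusion by brute force on a small space. The following Python sketch uses the two-dice space with three illustrative events of our choosing and verifies, in exact rational arithmetic, that both sides of the three-event formula agree:

```python
import itertools
from fractions import Fraction

# Elementary events: outcomes of two fair dice, each with probability 1/36.
space = list(itertools.product(range(1, 7), repeat=2))

def pr(event):
    return Fraction(len(event), len(space))

A = {w for w in space if (w[0] + w[1]) % 2 == 0}   # sum is even
B = {w for w in space if w[0] <= 2}                # first die shows 1 or 2
C = {w for w in space if w[0] == w[1]}             # doubles

lhs = pr(A | B | C)
rhs = (pr(A) + pr(B) + pr(C)
       - pr(A & B) - pr(B & C) - pr(A & C)
       + pr(A & B & C))
print(lhs, rhs, lhs == rhs)   # the two sides agree exactly
```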
Probability space vs distribution: We often talk about probability distributions. What is the distinction between a probability space and a distribution? A distribution should be thought of as a template for an actual concrete experiment. We say $X_1, X_2, \ldots, X_n$ are distributed according to, say, the Poisson distribution with mean $\lambda$; what this means is that we instantiate the probability space for each $X_i$ using the values specified by the distribution. This avoids the task of specifying in gory detail the probability space for standard cases.

1.2. Continuous Probability Spaces

Consider the experiment of picking a uniformly random number from the continuous interval $[0,1]$. It is natural to view the underlying probability space $\Omega$ as $[0,1]$, where the elementary events are the real numbers in the interval $[0,1]$. How should we define $\mathbf{Pr}$ over this space? Since $\Omega$ is uncountably infinite we cannot associate any non-zero value with $\mathbf{Pr}[x]$ for any $x \in [0,1]$; thus we need to set $\mathbf{Pr}[x] = 0$. On the other hand, we intuitively see that the probability that a number picked uniformly from $[0,1]$ lies in the interval $[a,b]$ with $b \geq a$ should be its length $b - a$. How do we reconcile this with setting the probability of each atomic event in $[a,b]$ to $0$? One can see that it is not feasible to define probability measures over continuous spaces by defining values on elementary events and then deriving the values of events. Hence a more general approach is to define the probability measure over all events simultaneously, as long as the values satisfy certain basic and intuitive axioms. For the interval $[0,1]$ we would like $\mathbf{Pr}[A]$ for any event/subset $A \subseteq [0,1]$ to correspond to its "length", which is intuitive for interval events.
So an attempt would be to define $\mathbf{Pr}[A]$ for all $A \subseteq \Omega$, and the natural axioms would be the following: (i) $\mathbf{Pr}[\Omega] = 1$; (ii) for each event $A$, $\mathbf{Pr}[\bar{A}] = 1 - \mathbf{Pr}[A]$; (iii) for any countably infinite collection of disjoint events $A_1, A_2, \ldots$, we have $\mathbf{Pr}\left[\cup_{i=1}^{\infty} A_i\right] = \sum_{i=1}^{\infty} \mathbf{Pr}[A_i]$. The last property may appear strong since we ask for countably infinite unions, but it is needed in many applications of probability. It turns out that one cannot construct such a measure on all subsets due to various technical reasons and paradoxes. To avoid these issues, standard measure theory takes the view that we do not need to assign a measure to all possible subsets of $\Omega$ but only to some collection $\mathscr{F}$ of subsets. We want $\mathscr{F}$ to have some basic closure properties as well as to include all interesting sets that we can intuitively think of. For example, in the setting of $[0,1]$, all intervals, their complements, and their finite unions and intersections should be included. A set not in $\mathscr{F}$ is called non-measurable. Formally we define the notion of a $\sigma$-algebra.
Definition 1.4. Given a set $\Omega$, a collection of subsets $\mathscr{F} \subseteq \mathscr{P}(\Omega)$ is a $\sigma$-algebra if the following properties hold:
  • $\Omega \in \mathscr{F}$
  • $A \in \mathscr{F}$ implies that $\bar{A} \in \mathscr{F}$ (closure under complementation)
  • If $A_1, A_2, \ldots$ is a countably infinite sequence of sets such that each $A_i \in \mathscr{F}$, then $\cup_{i=1}^{\infty} A_i \in \mathscr{F}$ (closure under countable union)
Note that closure under complementation and countable union also implies closure under finite intersection (by De Morgan's laws). That is, if $A_1, A_2, \ldots, A_n \in \mathscr{F}$ then $\cap_{i=1}^{n} A_i \in \mathscr{F}$.
Remark 1.5. For a finite or countably infinite probability space we let $\mathscr{F} = \mathscr{P}(\Omega)$, and hence $\mathscr{F}$ is often not explicitly mentioned.
With the discussion in place, the formal definition of a probability space is as follows.
Definition 1.6. A probability space is a triple $(\Omega, \mathscr{F}, \mathbf{Pr})$ which consists of a set $\Omega$, a $\sigma$-algebra $\mathscr{F}$ over $\Omega$, and a probability measure $\mathbf{Pr}: \mathscr{F} \rightarrow [0,1]$ such that
  • $\mathbf{Pr}[\Omega] = 1$
  • For $A \in \mathscr{F}$, $\mathbf{Pr}[A] = 1 - \mathbf{Pr}[\bar{A}]$.
  • For any countably infinite sequence of disjoint sets $A_1, A_2, \ldots$ such that $A_i \in \mathscr{F}$ for all $i \geq 1$, $\mathbf{Pr}\left[\cup_{i} A_i\right] = \sum_{i=1}^{\infty} \mathbf{Pr}[A_i]$.
Density functions: Almost all of the time we don't deal with the above subtleties, since we typically define continuous probability measures via density functions and cumulative distribution functions. In other words, we define the probability measure indirectly via integration. Consider $\Omega = [0,1]$. We specify a density function $f: [0,1] \rightarrow \mathbb{R}_{\geq 0}$ and let $\mathbf{Pr}[A] = \int_A f\, dx$. For instance, if we want the uniform distribution over $[0,1]$ we let $f(x) = 1$, and we see that $\mathbf{Pr}[[a,b]] = \int_a^b 1\, dx = b - a$. What this formula hides is that it doesn't tell us how to compute the standard Riemann integral over an arbitrary set $A \subset [0,1]$, but only over "nice" sets such as intervals and their finite unions and intersections. However, for all practical purposes in this course, and for most applications that most of us will encounter, defining probability measures via integration of "nice" functions over "nice" sets suffices.
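To make this concrete, here is a small Python sketch that approximates $\mathbf{Pr}[[a,b]] = \int_a^b f\, dx$ by a midpoint Riemann sum; the function name `pr_interval` and the step count are illustrative choices, not anything from the notes:

```python
def pr_interval(f, a, b, steps=100_000):
    """Approximate Pr[[a, b]] = integral of the density f over [a, b]
    using a midpoint Riemann sum."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

uniform = lambda x: 1.0   # density of the uniform distribution on [0, 1]
print(pr_interval(uniform, 0.25, 0.75))   # approximately 0.5, the length
```

Swapping in any other density (normalized so that it integrates to $1$ over $\Omega$) gives the corresponding interval probabilities.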
Example 7. The standard Normal distribution over the entire real line $(-\infty, \infty)$ is defined by the density function $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$.
Question 4. Consider the density function $f(x) = c x^2$ over the set $\Omega = [0,1]$, where $c$ is a fixed constant. What should $c$ be so that $\mathbf{Pr}[\Omega] = 1$? What is $\mathbf{Pr}[A]$ where $A = [0.2, 0.4]$? What about when $A = [0.1, 0.2] \cup [0.3, 0.6]$?
Remark 1.7. We will not mention the issue of the $\sigma$-algebra $\mathscr{F}$ from now on, even when discussing continuous probability measures.

1.3. Union Bound

The union bound is a very simple bound, but it is extremely useful because it makes no assumptions about the dependence between the events.
Lemma 1.8. Let $A, B$ be two events in a probability space $(\Omega, \mathbf{Pr})$. Then $\mathbf{Pr}[A \cup B] \leq \mathbf{Pr}[A] + \mathbf{Pr}[B]$. More generally, let $A_1, A_2, \ldots, A_n$ be a finite collection of events. Then
$$\mathbf{Pr}\left[\cup_{i=1}^{n} A_i\right] \leq \sum_{i=1}^{n} \mathbf{Pr}[A_i].$$
Proof: The discrete case is easy to prove; we consider the continuous case. $A \cup B$ is the disjoint union of $A - B$, $A \cap B$, and $B - A$ (if $A, B$ are in the $\sigma$-algebra then $A \cup B$, $A \cap B$, $A - B$, $B - A$ are also in the $\sigma$-algebra; do you see why?). Hence,
$$\begin{aligned} \mathbf{Pr}[A \cup B] &= \mathbf{Pr}[A - B] + \mathbf{Pr}[A \cap B] + \mathbf{Pr}[B - A] \\ &\leq \mathbf{Pr}[A - B] + \mathbf{Pr}[A \cap B] + \mathbf{Pr}[A \cap B] + \mathbf{Pr}[B - A] \\ &= \mathbf{Pr}[A] + \mathbf{Pr}[B]. \end{aligned}$$
We used the third axiom of probability (additivity over disjoint sets) for the first and third equalities, and the non-negativity of probability for the inequality in the second line.

1.4. Independent Events

Independence is a fundamental notion in probability.
Definition 1.9. Let $A, B$ be two events in a probability space. They are independent iff $\mathbf{Pr}[A \cap B] = \mathbf{Pr}[A]\,\mathbf{Pr}[B]$; otherwise they are dependent.
Example 8. Two coins. $\Omega = \{HH, TT, HT, TH\}$ and $\mathbf{Pr}[HH] = \mathbf{Pr}[TT] = \mathbf{Pr}[HT] = \mathbf{Pr}[TH] = 1/4$.
  • $A$ is the event that the first coin is heads and $B$ is the event that the second coin is tails. $A, B$ are independent.
  • $A$ is the event that both coins are not tails and $B$ is the event that the second coin is heads. $A, B$ are dependent.
The formal definition of independence is often not very useful for determining whether two events are independent, since calculating the relevant probabilities is not always straightforward. A more intuitive characterization is that $A$ and $B$ are independent if knowing whether one of them happened tells us nothing about whether the other happened. In complex settings, the reason two events $A, B$ can be seen to be independent is often that they are based on different parts of the probability space. In fact, in most cases, we define or construct probability spaces via independence! Consider the following example.
Example 9. Toss $100$ independent unbiased coins. Let $A$ be the event that there are at least $2$ heads among the first $5$ coins, and let $B$ be the event that the number of tails among the last $10$ coins is more than $7$. Then $A, B$ are independent events.
Note that $A, B$ are obviously independent because the coins are defined to be independent. What is the probability space here? It is the product of the probability spaces of the individual coins. We formalize this in the lemma below.
Lemma 1.10. Let $(\Omega, \mathbf{Pr})$ be the product of $k$ probability spaces $(\Omega_1, \mathbf{Pr}_1), \ldots, (\Omega_k, \mathbf{Pr}_k)$. For $S \subseteq [k]$ we say that an event $A$ depends only on $S$ if $A = B \times C$ where $B = \prod_{i \notin S} \Omega_i$ and $C \subseteq \prod_{j \in S} \Omega_j$. If $A$ is an event that depends only on $S \subset [k]$, $A'$ is an event that depends only on $T \subset [k]$, and $S \cap T = \emptyset$, then $A, A'$ are independent.
Proof: Exercise.

2. Random Variables

The most important concept that you need is the idea of a random variable. A random variable, in a general context, is simply a mapping/function from a probability space $(\Omega, \mathbf{Pr})$ to another probability/measure space $\Omega'$. However, for the most part we will mainly be interested in real-valued random variables and indicator random variables.
Definition 2.1. Given a probability space $(\Omega, \mathbf{Pr})$, a random variable $X$ over $\Omega$ is a function that maps each elementary event to a real number; in other words, $X: \Omega \rightarrow \mathbb{R}$.
An indicator (or binary) random variable is one whose range is $\{0,1\}$; that is, $X(\omega) = 0$ or $X(\omega) = 1$ for all $\omega \in \Omega$.
Discrete versus continuous random variables: If the underlying probability space is discrete, then even when $X$ is a real-valued random variable its range is effectively discrete, although it is embedded in the continuum of real numbers. However, when the original space $\Omega$ is continuous one has to be more careful in general: not all functions mapping $\Omega$ to $\mathbb{R}$ work, and there is a notion of measurable functions that we will not discuss here. For calculations, we will need a density function $f_X$ for the probability distribution induced by $X$ over the real numbers. Given a density function $f$ defining the probability distribution $(\Omega, \mathbf{Pr})$, we can usually compute the density function $f_X$ associated with the transformation $X$ via calculus.
For real-valued random variables, an important notion closely related to the density function is the cumulative distribution function (cdf). Typically the density is denoted by $f_X$ and the cumulative distribution function by $F_X$. The function $F_X: \mathbb{R} \rightarrow [0,1]$ is defined as $F_X(t) = \mathbf{Pr}[X \leq t]$. One can see that $\mathbf{Pr}[X \leq t] = \int_{-\infty}^{t} f_X(y)\, dy$. This also shows that $f_X = \frac{d}{dy} F_X$.
Example 10. Consider the probability space $(\Omega, \mathbf{Pr})$ where $\Omega = \{1, 2, \ldots, 6\}$ corresponds to the outcomes of a 6-sided die with uniform probability. Let $X: \Omega \rightarrow \mathbb{R}$ where $X(i) = i^2$. Then $X$ is an integer-valued random variable.
Example 11. Consider the same probability space as in Example 10, and let $X: \Omega \rightarrow \{0,1\}$ where $X(i) = i \bmod 2$. Then $X$ is an indicator random variable.
Example 12. Consider the probability space $(\Omega, \mathbf{Pr})$ where $\Omega = [0,1]$ and $\mathbf{Pr}$ is the uniform distribution. Let $X: \Omega \rightarrow \mathbb{R}$ where $X(a) = \exp(a)$. Then $X$ is a real-valued random variable.
In the preceding example, how do we compute the density function of $X$? Consider the uniform probability distribution over $[0,1]$; the density function corresponding to it is $f(x) = 1$ for all $x \in [0,1]$. The random variable $X$ maps $[0,1]$ to $[1,e]$, and in this case $X$ is a one-to-one increasing function, so it is easier to work with the cumulative distribution functions $F$ and $F_X$. We see that $F(x) = x$ for $x \in [0,1]$. Since $X$ is one-to-one, for any $t \in [1,e]$, $F_X(t) = \mathbf{Pr}[X \leq t] = F(\ln t) = \ln t$, and hence $f_X(t) = \frac{d}{dt} F_X(t) = 1/t$.
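A quick Monte Carlo check of this change-of-variables calculation: sample $U$ uniformly, set $X = \exp(U)$, and compare the empirical value of $\mathbf{Pr}[X \leq t]$ against $\ln t$. A minimal sketch (the sample size and test points are arbitrary choices):

```python
import math
import random

random.seed(0)
# Sample X = exp(U) where U is uniform on [0, 1].
samples = [math.exp(random.random()) for _ in range(200_000)]

for t in (1.5, 2.0, 2.5):
    empirical = sum(x <= t for x in samples) / len(samples)
    print(t, round(empirical, 4), round(math.log(t), 4))  # F_X(t) vs ln t
```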
Why random variables? Random variables allow us to take specific views of a complex probability space and analyze that particular view. Consider any $\omega'$ in the range of a random variable $X: \Omega \rightarrow \Omega'$: $X$ collapses all atomic events in $X^{-1}(\omega')$ into $\omega'$. Thus, one can view $X$ as partitioning the original space $\Omega$ and associating a symbol with each part of this partition; hence $X$ implicitly creates a new probability space. In the real-valued case, by associating a number instead of a symbol, we gain the ability to calculate various quantities of interest. Moreover, as we will see, the judicious use of random variables to calculate and estimate various quantities of interest is the key to the analysis of probabilistic methods and randomized algorithms.
Events and Indicator Random Variables: Recall that an event in a probability space is simply a subset of atomic events in the discrete case (which will be our main focus), or a member of the $\sigma$-algebra in the continuous case. With each event $A$ we can associate an indicator random variable $X_A$, where $X_A(\omega) = 1$ if $\omega \in A$ and $X_A(\omega) = 0$ otherwise. Why is this useful? Because we have associated a number with the event, and when considering multiple events we can now manipulate numbers instead of working with set algebra, which can quickly become complicated. We will see several examples soon. Conversely, each indicator random variable $X: \Omega \rightarrow \{0,1\}$ corresponds to the event $A_X = \{\omega \mid X(\omega) = 1\}$. Given a random variable $X$, what is $\mathbf{Pr}[X = b]$ for some $b$ in the range of $X$? We consider the event $X^{-1}(b) = \{\omega \in \Omega \mid X(\omega) = b\}$ and set $\mathbf{Pr}[X = b] = \mathbf{Pr}\left[X^{-1}(b)\right]$.

2.1. Independence of Random Variables

We saw the definition of independence of events. A more general notion is that of independence of random variables.
Definition 2.2. Two discrete random variables $X, Y$ defined over the same probability space $(\Omega, \mathbf{Pr})$ are independent if for all $a, b$ we have $\mathbf{Pr}[X = a, Y = b] = \mathbf{Pr}[X = a]\,\mathbf{Pr}[Y = b]$. More generally, $X, Y$ are independent if $\mathbf{Pr}[X \in A, Y \in B] = \mathbf{Pr}[X \in A]\,\mathbf{Pr}[Y \in B]$ for all measurable sets $A, B$.
As in the discussion of independence of events, the formal definition is less useful for determining whether random variables are actually independent. Intuitively, two random variables are independent if knowing the value of one tells us nothing about the value of the other.

2.2. Expectation

Recall that the advantage of a real-valued random variable is that we can calculate with it. The most basic information associated with a real-valued random variable is its expectation.
Definition 2.3. For a real-valued discrete random variable $X$ over a probability space $(\Omega, \mathbf{Pr})$, the expectation of $X$, denoted by $\mathbf{E}[X]$, is defined as $\sum_{\omega \in \Omega} \mathbf{Pr}[\omega]\, X(\omega)$. In other words, the expectation is the average value of $X$ according to the probabilities given by $\mathbf{Pr}[\cdot]$.
Definition 2.4. For a real-valued random variable $X$ over a continuous probability space $(\Omega, \mathbf{Pr})$, where $f_X$ is the density function of $X$, the expectation of $X$ is defined as $\int_{-\infty}^{\infty} y f_X(y)\, dy$.[1]
Remark 2.5. Note that for infinite probability spaces the expectation of a random variable may not be finite, in which case we say it does not exist. Consider the random variable $X$ where $\mathbf{Pr}[X = i] = c/i^2$ for all integers $i \geq 1$, where $c = 6/\pi^2$. Then $\mathbf{E}[X] = \sum_{i \geq 1} c/i$, which diverges to $\infty$.
Remark 2.6. Often one works with functions of random variables, which can themselves be viewed as random variables, and calculating the expectation of the resulting function can be done more directly than by computing the density function of the new random variable. Suppose $X$ is a real-valued random variable with density function $f_X$ and we are interested in the function $g(X)$ (say $g$ is given by $g(y) = y^2$). Then $\mathbf{E}[g(X)] = \int g(y) f_X(y)\, dy$.
Expectation of an indicator random variable: A very useful fact is the following easy observation.
Lemma 2.7. For an event $A$, let $X_A$ be the indicator random variable for $A$. Then $\mathbf{E}[X_A] = \mathbf{Pr}[A]$.

2.3. Linearity of Expectation and Examples

One of the most useful facts about expectation is the following simple one about linearity of expectation.
Lemma 2.8. Let $X, Y$ be two real-valued random variables over the same probability space. Then $\mathbf{E}[X + Y] = \mathbf{E}[X] + \mathbf{E}[Y]$. More generally, for any finite collection $X_1, X_2, \ldots, X_n$ of random variables over the same probability space and real numbers $a_1, a_2, \ldots, a_n$, we have $\mathbf{E}\left[\sum_{i=1}^{n} a_i X_i\right] = \sum_{i=1}^{n} a_i \mathbf{E}[X_i]$.
The preceding lemma easily follows from the definition of expectation.
We now give an example to illustrate the use of linearity of expectation. Let $G = (V, E)$ be a graph with $n$ vertices and $m$ edges. Let $G'$ be the graph resulting from independently deleting every vertex of $G$ with probability $1/2$. What is the expected number of edges in $G'$? How do we analyze this? First we define a random variable $X$ to denote the number of edges in $G'$. What we want is $\mathbf{E}[X]$. The straightforward formula for expectation is $\sum_{k=1}^{|E|} k\, \mathbf{Pr}[X = k]$. How do we calculate the quantity $\mathbf{Pr}[X = k]$, and how do we do the sum? This seems quite hard. The key is to express $X$ as a sum and then use linearity of expectation. For each edge $e \in E$ let $X_e$ be an indicator random variable for whether $e \in G'$. We see that $X = \sum_{e \in E} X_e$, and by linearity of expectation $\mathbf{E}[X] = \sum_{e \in E} \mathbf{E}[X_e]$. Since $X_e$ is an indicator random variable we know that $\mathbf{E}[X_e] = \mathbf{Pr}[X_e = 1]$.
  • Event $A_e$ = edge $e \in E$ is present in $G'$.
  • $\mathbf{Pr}\left[A_{e=(u,v)}\right] = \mathbf{Pr}[u \text{ and } v \text{ are both present}] = \mathbf{Pr}[u \text{ is present}] \cdot \mathbf{Pr}[v \text{ is present}] = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$. Here we use the independence built into the experiment.
  • $\mathbf{E}[X_e] = \mathbf{Pr}[A_e]$.
$$\mathbf{E}[X] = \mathbf{E}\left[\sum_{e \in E} X_e\right] = \sum_{e \in E} \mathbf{Pr}[A_e] = \frac{m}{4}.$$
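A simulation can corroborate the $m/4$ answer. The sketch below uses a $10$-cycle as a hypothetical example graph, so $m = 10$ and the expectation is $2.5$; the graph and the trial count are arbitrary choices:

```python
import random

random.seed(1)
n = 10
edges = [(i, (i + 1) % n) for i in range(n)]   # a 10-cycle: m = 10, E[X] = 2.5

def surviving_edges():
    # Each vertex survives independently with probability 1/2.
    alive = {v for v in range(n) if random.random() < 0.5}
    return sum(1 for u, v in edges if u in alive and v in alive)

trials = 100_000
print(sum(surviving_edges() for _ in range(trials)) / trials)   # close to 2.5
```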
Question 5. Let $G = (V, E)$ be a graph with $n$ vertices and $m$ edges. Assume $G$ has $t$ triangles (a triangle is a simple cycle with three vertices). Let $G'$ be the graph resulting from independently deleting each vertex of $G$ with probability $1/2$. What is the expected number of triangles in $G'$?
Question 6. Suppose $n$ balls are independently thrown into $n$ bins, where for each ball the bin is chosen uniformly at random. (i) What is the probability that bin $1$ is empty? (ii) What is the expected number of empty bins?

2.4. Independence and Product of Expectations

Linearity of expectation holds even when the variables are not independent. However, there are situations where we need to work with products, and here independence is useful.
Lemma 2.9. Let $X, Y$ be independent random variables. Then $\mathbf{E}[XY] = \mathbf{E}[X]\,\mathbf{E}[Y]$. More generally, for any finite collection $X_1, X_2, \ldots, X_n$ of independent random variables, $\mathbf{E}\left[\prod_i X_i\right] = \prod_i \mathbf{E}[X_i]$.
We give a proof for the discrete setting.
Proof:
$$\begin{aligned} \mathbf{E}[X \cdot Y] &= \sum_{\omega \in \Omega} \mathbf{Pr}[\omega]\,(X(\omega) \cdot Y(\omega)) \\ &= \sum_{x, y \in \mathbb{R}} \mathbf{Pr}[X = x \wedge Y = y]\,(x \cdot y) \\ &= \sum_{x, y \in \mathbb{R}} \mathbf{Pr}[X = x] \cdot \mathbf{Pr}[Y = y] \cdot x \cdot y \\ &= \left(\sum_{x \in \mathbb{R}} \mathbf{Pr}[X = x]\, x\right)\left(\sum_{y \in \mathbb{R}} \mathbf{Pr}[Y = y]\, y\right) = \mathbf{E}[X]\,\mathbf{E}[Y]. \end{aligned}$$
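As a concrete check of Lemma 2.9, the following sketch computes $\mathbf{E}[XY]$ and $\mathbf{E}[X]\,\mathbf{E}[Y]$ exactly for two independent fair dice, using rational arithmetic:

```python
import itertools
from fractions import Fraction

# X and Y are the face values of two independent fair dice.
sixth = Fraction(1, 6)
E_X = sum(sixth * i for i in range(1, 7))                      # 7/2
E_XY = sum(sixth * sixth * i * j
           for i, j in itertools.product(range(1, 7), repeat=2))
print(E_XY, E_X * E_X, E_XY == E_X * E_X)   # 49/4 both ways
```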

2.5. Variance and Higher Moments

Expectation of a random variable gives basic information about its behavior. Additional information can be obtained via higher moments. For an integer $t \geq 1$, the $t$-th moment of $X$ is defined as $\mathbf{E}[X^t]$. Expectation is the first moment. The second moment is closely related to the more standard measure, namely the variance.
Definition 2.10. Let $X$ be a real-valued random variable over a probability space such that $\mathbf{E}[X]$ is finite. The variance of $X$, denoted by $\mathbf{Var}[X]$, is $\mathbf{E}\left[(X - \mathbf{E}[X])^2\right]$.
Prove the following claim which shows that V a r [ X ] V a r [ X ] Var[X]\mathbf{Var}[X] differs from the second moment by an additive term.
Claim 1. $\mathbf{Var}[X] = \mathbf{E}\left[(X - \mathbf{E}[X])^2\right] = \mathbf{E}[X^2] - (\mathbf{E}[X])^2$.
Note that $\mathbf{Var}[X]$ may not be finite even when $\mathbf{E}[X]$ is finite; can you think of an example? $\mathbf{Var}[X]$ is a non-negative quantity and measures how far $X$ deviates from its expectation. Since $\mathbf{Var}[X]$ is the square of the deviation, we also use $\sigma_X = \sqrt{\mathbf{Var}[X]}$, called the standard deviation, to measure the deviation.
Example 13. Consider a random variable $X$ where $X = -B$ with probability $1/2$ and $X = B$ with probability $1/2$, for some $B \geq 0$. Then $\mathbf{E}[X] = 0$ for all values of $B$; on the other hand, $\mathbf{Var}[X] = B^2$ and $\sigma_X = B$.
Question 7. What is $\mathbf{Var}[aX]$ for a scalar $a$, as a function of $a$ and $\mathbf{Var}[X]$?
Prove the following lemma.
Lemma 2.11. Let $X, Y$ be independent random variables. Then $\mathbf{Var}[X + Y] = \mathbf{Var}[X] + \mathbf{Var}[Y]$. More generally, if $X_1, X_2, \ldots, X_n$ are independent random variables then $\mathbf{Var}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \mathbf{Var}[X_i]$.
As noted above, $\mathbf{E}[X^t]$ is the $t$-th moment of $X$. Moments are typically split into odd and even moments based on the parity of $t$; the even moments are non-negative and hence tend to be more useful as measures of deviation. In particular, the even central moments $\mathbf{E}\left[(X - \mathbf{E}[X])^t\right]$ measure the deviation of $X$ from its expectation: if they remain small even for larger values of $t$, then $X$ is tightly concentrated around its expectation and well-behaved. This leads to concentration inequalities, which find many applications.

3. Conditional Probability and Expectation

Conditional probability allows us to focus on subsets of the probability space. Suppose we have a probability space $(\Omega, \mathbf{Pr})$ and we have the information that event $A$ happened, but nothing else. If $A = \{\omega\}$ then there is no uncertainty left. However, if $A$ is a non-singleton set, then we don't know which elementary event in $A$ may have happened. How do we capture this? Since the discrete case is easier to understand, and essentially captures all the intuition, we will stick to it. A natural way is to define a new probability space over the set $A$, since only elementary events in $A$ are feasible now. How should we assign probabilities to each $\omega \in A$? Let $\mathbf{Pr}'$ be the probability function over $A$ that we wish to compute. The most natural choice is to set $\mathbf{Pr}'[\omega] = \mathbf{Pr}[\omega]/\mathbf{Pr}[A]$ for each $\omega \in A$. This ensures that $\sum_{\omega \in A} \mathbf{Pr}'[\omega] = 1$, and hence the new measure is indeed a probability measure over $A$. One can also think of $\mathbf{Pr}'$ as a probability measure over $\Omega$ instead of over $A$: we set $\mathbf{Pr}'[\omega] = 0$ for each $\omega \notin A$. $\mathbf{Pr}'$ is the probability measure conditioned on event $A$. Typically we write $\mathbf{Pr}'$ as $\mathbf{Pr}[\cdot \mid A]$. For an event $B$ we say $\mathbf{Pr}[B \mid A]$ is the probability of $B$ conditioned on event $A$, which is simply $\mathbf{Pr}'[B]$. From our definition we have the following easy consequence; note that this is really coming from the definition.
Lemma 3.1. For any event $B$, $\mathbf{Pr}[B \mid A] = \mathbf{Pr}'[B \cap A] = \frac{\mathbf{Pr}[A \cap B]}{\mathbf{Pr}[A]}$.
The above is often called Bayes's theorem or rule; I prefer "rule" because it comes easily from a definition rather than from a chain of non-trivial deductions. We note that Bayes's rule is applied in different ways depending on the context. In some applications we may have access to $\mathbf{Pr}[A \cap B]$ and $\mathbf{Pr}[A]$ and want to compute $\mathbf{Pr}[B \mid A]$. In others we may have access to $\mathbf{Pr}[B \mid A]$ and $\mathbf{Pr}[A]$ and want to compute $\mathbf{Pr}[A \cap B]$ (for instance in observational studies).
Verify the claim below and see why it makes intuitive sense.
Corollary 3.2. $A, B$ are independent events iff $\mathbf{Pr}[B \mid A] = \mathbf{Pr}[B]$.
One can extend conditional probability to random variables in the natural way. Recall that a real-valued random variable $X$ over a probability space $(\Omega, \mathbf{Pr})$ is simply a function from $\Omega$ to $\mathbb{R}$. As such, $X$ does not explicitly depend on the probability measure; however, $\mathbf{E}[X]$ does indeed depend on $\mathbf{Pr}$. We define $\mathbf{E}[X \mid A]$ as the conditional expectation of $X$ given $A$. Its formal definition is simply the expectation of $X$ under the modified probability measure $\mathbf{Pr}' = \mathbf{Pr}[\cdot \mid A]$. More formally, in the discrete case,
$$\mathbf{E}[X \mid A] = \sum_{\omega \in \Omega} \mathbf{Pr}[\omega \mid A]\, X(\omega) = \sum_{\omega \in A} \frac{\mathbf{Pr}[\omega]}{\mathbf{Pr}[A]}\, X(\omega).$$
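As a small worked example of this formula, the sketch below computes $\mathbf{E}[X \mid A]$ for a fair die conditioned on the event $A$ that the outcome is even; each $\omega \in A$ is reweighted by $\mathbf{Pr}[\omega]/\mathbf{Pr}[A]$:

```python
from fractions import Fraction

# A fair die conditioned on A = {the outcome is even}.
pr = {w: Fraction(1, 6) for w in range(1, 7)}
A = [w for w in pr if w % 2 == 0]
pr_A = sum(pr[w] for w in A)                       # 1/2

E_X_given_A = sum((pr[w] / pr_A) * w for w in A)   # reweight within A
print(E_X_given_A)                                 # 4 = average of 2, 4, 6
```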

4. Some Probability Distributions of Interest

We will encounter a small number of common distributions often, and it is helpful to catalog them and some of their known properties. Often these distributions are parametrized. You can find a lot of useful information, as well as pictures and examples, on these distributions and much more on Wikipedia.
Bernoulli and Binomial distributions: Coin tosses are perhaps the simplest probability distribution, extremely well-studied for their simplicity and applicability. Given a number $p \in [0,1]$, consider the two-element space $\{H, T\}$ with $\mathbf{Pr}[H] = p$ and $\mathbf{Pr}[T] = 1 - p$. This is called a Bernoulli distribution. Often we want to toss $n$ independent coins; then the probability space has size $2^n$, corresponding to $\{H, T\}^n$. It is often simpler to use $\{1, 0\}$ instead of $\{H, T\}$, in which case the probability space is the set of all binary strings of length $n$. The Binomial distribution with parameters $n$ and $p$ is obtained by considering a random variable $X$ where $X(\omega)$ is equal to the number of heads (or $1$'s) in the string $\omega$. We see that $\mathbf{Pr}[X = k] = \binom{n}{k} p^k (1-p)^{n-k}$. The Binomial distribution can thus alternatively be thought of as a distribution over $\{0, 1, 2, \ldots, n\}$ where $\mathbf{Pr}[i] = \binom{n}{i} p^i (1-p)^{n-i}$.
Suppose $X$ has the Binomial distribution with parameters $n, p$. One way to understand $X$ is as a sum of $n$ independent Bernoulli random variables $Y_1, Y_2, \ldots, Y_n$ with parameter $p$. We then see that $\mathbf{E}[X] = \mathbf{E}\left[\sum_{i=1}^{n} Y_i\right] = np$. We also have $\mathbf{Var}[X] = \sum_i \mathbf{Var}[Y_i] = np(1-p)$. Note that a direct computation with binomial coefficients would be much messier.
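The sum-of-Bernoullis view also gives a direct way to sample $X$ and sanity-check these formulas. A minimal Python sketch (the parameters are arbitrary choices):

```python
import random

random.seed(2)
n, p, trials = 50, 0.3, 100_000

# X ~ Binomial(n, p) sampled as a sum of n independent Bernoulli(p) variables.
xs = [sum(random.random() < p for _ in range(n)) for _ in range(trials)]

mean = sum(xs) / trials
var = sum((x - mean) ** 2 for x in xs) / trials
print(mean, n * p)             # both near 15.0
print(var, n * p * (1 - p))    # both near 10.5
```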
Poisson distribution: The Poisson distribution has a single positive real parameter $\lambda > 0$ and is a discrete distribution over the non-negative integers $\{0, 1, 2, \ldots\}$. For a random variable $X$ distributed according to this distribution, $\mathbf{Pr}[X = i] = e^{-\lambda} \frac{\lambda^i}{i!}$. One can prove that $\mathbf{E}[X] = \mathbf{Var}[X] = \lambda$; do you see why? The Poisson distribution is very useful for a variety of reasons. It is connected to the Binomial distribution as follows: it is the limit of the Binomial distribution with parameters $n, p$ as we take $n \rightarrow \infty$ while reducing $p$ so that $np = \lambda$. Thus the Poisson distribution approximates the Binomial distribution when $p$ is very small; this allows one to avoid messy computations with binomial coefficients and instead use the simpler formulas provided by the Poisson distribution. One very interesting property of the Poisson distribution is the following: if $X_1, X_2, \ldots, X_n$ are independent Poisson random variables with means $\lambda_1, \lambda_2, \ldots, \lambda_n$, then $X = \sum_{i=1}^{n} X_i$ has the Poisson distribution with mean $\sum_{i=1}^{n} \lambda_i$.
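The limit statement can be seen numerically: with $n$ large and $p = \lambda/n$, the Binomial and Poisson probabilities nearly coincide. A small sketch comparing the two pmfs (the choices $n = 1000$ and $\lambda = 2$ are ours):

```python
from math import comb, exp, factorial

n, lam = 1000, 2.0
p = lam / n   # small p with np = lambda

for k in range(6):
    binomial = comb(n, k) * p ** k * (1 - p) ** (n - k)
    poisson = exp(-lam) * lam ** k / factorial(k)
    print(k, round(binomial, 6), round(poisson, 6))   # nearly identical values
```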
Geometric distribution: This distribution is closely related to the Bernoulli distribution and hence has a single parameter $p$. There are two variants, and both use the same name, so one has to be careful. The first variant has support $\{1, 2, \ldots\}$: the random variable counts the number of independent tosses of a coin with bias $p$ until (and including) the first head. If $X$ is distributed according to this variant, then $\mathbf{Pr}[X = i] = (1-p)^{i-1} p$, $\mathbf{E}[X] = 1/p$, and $\mathbf{Var}[X] = (1-p)/p^2$.
The second variant measures the number of failures (tails) before the first head. This is the first variant shifted by $-1$ and hence has support $\{0, 1, 2, \ldots\}$. For this second variant the expectation is $(1/p - 1)$ and the variance is the same, $(1-p)/p^2$.
Uniform distribution: The uniform distribution over a finite state space is pretty simple: if the state space has size $k$ then each elementary event has probability $1/k$. This can be seen as a generalization of the (unbiased) Bernoulli distribution. Suppose we have $n$ such independent distributions; their joint distribution corresponds to the experiment of throwing $n$ balls into $k$ bins. Many aspects of balls-and-bins distributions are studied, and we will see examples, applications, and connections to hashing.
The uniform distribution is also very important in continuous probability spaces. The most familiar case is the uniform distribution over a closed interval $[a, b]$ of the real line; the corresponding density function is $f(x) = 1/(b-a)$ for $x \in [a,b]$ and $0$ outside the interval. In general, we can define a uniform distribution over more complicated "shapes", especially in higher dimensions. Consider the uniform distribution over the unit circle in the Euclidean plane, or over a specified rectangle in the plane. In such cases the density function $f(x, y)$ is constant over the interior of the shape, where the constant is the normalization factor given by the area of the shape. Random variables can be used to modify the uniform distribution appropriately to create more complex distributions.
Normal distribution: One of the most fundamental probability distributions is the Normal or Gaussian distribution. It is a distribution over the entire Euclidean space in any fixed dimension $d$. In one dimension it is defined by two parameters, the mean $\mu$ and the variance $\sigma^2$ (or standard deviation $\sigma$). The density function of this distribution is $f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$. The standard Normal distribution is obtained by setting $\mu = 0$ and $\sigma = 1$, in which case the density becomes the simpler expression $\frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} x^2}$. The distribution is symmetric about the mean $\mu$. If $X$ is a random variable distributed according to the Normal distribution, by construction $\mathbf{E}[X] = \mu$ and $\mathbf{Var}[X] = \sigma^2$. The cumulative distribution function $F_X(t) = \mathbf{Pr}[X \leq t]$ is often needed, but there is no closed-form expression for it in terms of elementary functions.
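Although $F_X$ has no elementary closed form, it can be written via the error function as $F_X(t) = \frac{1}{2}\left(1 + \operatorname{erf}\left(\frac{t - \mu}{\sigma\sqrt{2}}\right)\right)$, which standard libraries expose. A minimal sketch using Python's `math.erf`:

```python
import math

def normal_cdf(t, mu=0.0, sigma=1.0):
    """F_X(t) for X ~ Normal(mu, sigma^2), via the error function erf."""
    return 0.5 * (1.0 + math.erf((t - mu) / (sigma * math.sqrt(2.0))))

print(normal_cdf(0.0))                      # 0.5 by symmetry about the mean
print(normal_cdf(1.0) - normal_cdf(-1.0))   # about 0.6827: mass within one sigma
```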
Acknowledgments: We thank Manuel Torres for reading and corrections.

  1. Alternatively, for a non-negative random variable $X$, the expectation also equals $\int_{0}^{\infty} \left(1 - F_X(y)\right) dy$, where $F_X$ is the cumulative distribution function of $X$. This can be seen by integration by parts. ↩︎
