Information Theory in Machine Learning

Introduction

Information can be represented as bits. One bit of information allows us to choose between two equally probable, or equiprobable, alternatives. In other words, in a scenario with 2 equiprobable choices, an instruction to pick one of them carries one bit of information. If each instruction is represented as a binary digit (0 = 'choice 1' and 1 = 'choice 2'), then this binary digit provides you with one bit of information. In the case of 'n' sequential binary forks leading to 'm' final choices (destinations), m = 2^n, or equivalently n = log_2(m). Information is thus carried by an ordered sequence of symbols whose meaning must be interpreted.
Please note: A binary digit is the value of a binary variable, whereas a bit is an amount of information. A binary digit (when averaged over both of its possible states) can convey between zero and one bit of information.
By now, we know what information means. It is a mathematical representation of uncertainty. Let us now understand what information theory is.
Figure 1: A binary symmetric channel showing probability of incorrect transmission of message given the transmitted symbol is x and the received symbol is y
When we transmit data, for instance a string of bits, over a communication channel or any computational medium, there is a probability that the received message will not be identical to the transmitted message. An ideal channel is one where this probability is zero (or nearly zero). However, most real-world channels have a non-zero probability 'f' of transmitting a bit incorrectly and a probability of (1 − f) that each bit is transmitted correctly. Given this probability of error, one needs solutions that reduce it so that information can be transmitted with minimum error. One option is the 'physical solution', where one revamps the physical attributes of the communication channel, for instance the circuitry. However, this can increase the operational cost. An alternative is the 'system' solution, based on 'information theory' and 'coding theory'. Under this approach, we accept the given noisy physical channel as it is and add communication systems to it so that we can detect and correct the errors introduced by the channel. This system normally comprises an encoder and a decoder. The 'system' solution can give you a reliable communication channel, with the only additional cost being the computation performed by the encoder and decoder. While coding theory helps you design the appropriate encoder and decoder, information theory helps you quantify the information you can transmit and lets you study the theoretical limitations and potentials of a system.
Information theory treats information as a physical entity, like energy or mass. It deals with theoretical analyses of how information can be transmitted over any channel: natural or man-made. Thus, it defines a few laws of information. Let us assume a basic system for information flow as follows:
Figure 2: Information Channel: A message (data) is encoded before being fed into a communication channel, which adds noise. The channel output has the encoded message with noise that is decoded by a receiver to recover the message.
There are a few laws on information that can be derived from the above channel:
  • There is a definite upper limit, the channel capacity, to the amount of information that can be communicated through that channel.
  • This limit shrinks as the amount of noise in the channel increases.
  • This limit can very nearly be reached by judicious packaging, or encoding, of data.
Essentially, information theory entails two broad techniques:
  1. Data Compression (source coding): More frequent events should have shorter encodings
  2. Error Correction (channel coding): Should be able to infer encoded event even if message is corrupted by noise
Both these methods require you to build probabilistic models of the data sources. This is why information theory is relevant to machine learning and data analytics. It not only helps you measure the information contained in a data source, but also helps you improve the results of predictive models built on that data. Before we study how information theory can be applied to machine learning, let us review a few basic terms.

Few important terms of Information Theory and Machine Learning

Codewords

Information theory represents data in the form of codewords. These codewords are representations of the actual data elements as sequences of binary digits, or bits. There are various techniques for mapping each symbol (data element) to a corresponding codeword.

Variable Length Codes

One traditional way of assigning codewords to data elements is to use fixed-length codes, i.e. every data element is assigned a code of the same length. Let us look at a visual representation. Suppose a variable 'Discount type' has the following 4 values: 'Promotion', 'Priority Customer', 'Repeat buy' and 'None'. If we assume that the 4 types of discounts are equally probable, they can safely be mapped to codewords of equal length, say 00, 01, 10 and 11, as shown in the image below:
Figure 3: Fixed Length Encoding: Every data element here is assumed to have an equal likelihood of occurrence. Hence, every codeword has the same fixed length (i.e. 2)
Now, if we know that discounts under 'Promotion' are more frequent than the rest and have been assigned the following probability values.
Figure 4: Varying probabilities of each data element indicating the likelihood of occurrence of the event
We can then design codes weighted according to these probability values, with the aim of reducing the total length of the message. Since the most frequent terms (data elements) are transmitted most often, we want to associate the shortest codewords with the most frequent terms. Hence, we design a code with the fewest bits for the most frequent data element. Let us look at an example.
Figure 5: Variable Length Encoding: Every data element has been mapped to a sequence of binary codes of varying length according to the probability of occurrence
This variable-length code represents each data element in a unique and exclusive manner. Moreover, if you take the vertical axis to represent the probability of each word, p(x), and the horizontal axis to represent the length of the corresponding codeword, L(x), the area covered by the encoding is the average code length. For the probabilities above, this amounts to 1.75 bits, a reduction from the fixed-length encoding, which costs 2 bits per element.
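To make the arithmetic concrete, here is a minimal Python sketch (my own illustration; it assumes the Figure 4 probabilities are 0.5, 0.25, 0.125 and 0.125 and the Figure 5 codeword lengths are 1, 2, 3 and 3, which reproduces the 1.75-bit average):

```python
# Assumed probabilities and codeword lengths (hypothetical values matching Figures 4 and 5).
probs   = {"Promotion": 0.5, "Priority Customer": 0.25, "Repeat buy": 0.125, "None": 0.125}
lengths = {"Promotion": 1,   "Priority Customer": 2,    "Repeat buy": 3,     "None": 3}

# Average codeword length: sum of p(x) * L(x) over all data elements.
avg_length = sum(probs[x] * lengths[x] for x in probs)
print(avg_length)  # 1.75 bits, versus 2 bits for the fixed-length code
```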

Optimal Encoding

The most optimal encoding has to strike a balance between the length of the message and the cost of the codeword. Let us look at how one affects the other. The cost of buying a codeword of length 0 is 1. This is because you are buying all possible codewords, and hence, if you want to have a codeword of length 0, you can’t have any other codeword. The cost of a codeword of length 1, like "0" or "1", is (1/2) because half of possible codewords start with "0" or "1" respectively. The cost of a codeword of length 2, like “01”, is (1/4) because a quarter of all possible codewords start with “01”. In general, the cost of codewords decreases exponentially with the length of the codeword.
Refer to the image below:
Figure 6: Cost of a codeword as a function of the length of the codeword
We know that the average length of the total code string is a function of the probability of each word and the length of its codeword: each word x contributes p(x)L(x) to the average length. This average length contribution is related to the cost through the length of the codeword. The amount we pay determines the length of the codeword, and the length of the codeword controls how much it adds to the average message length. We can picture the two of these together in the following way:
Figure 7: Relationship between Average Length Contribution and Cost of the Codeword
Thus, we can clearly establish that shorter codewords cost more and longer codewords cost less.
Figure 8: Costs of long and short codewords

Shannon Information Content

Let us consider a discrete random variable 'X' that takes values from a set of possible events {a_i}, where 'i' indexes the elements of the set. The Shannon information content, or information content, is the measure of information contained in the event X = a_i. Information content quantifies the surprisal of each outcome of an experiment: the higher the surprisal of an outcome, the higher its information content.
Thus, the information you gain when an event with probability p(X = a_i) occurs can be represented as:
\begin{align} \tag{1} h(X=a_{i}) = \log_{2}\left(\frac{1}{p(X=a_{i})}\right) \end{align}
The above equation shows that as the value of the probability 'p' increases, the information content decreases. You get a curve that looks something like this:
Figure 9: The information content function versus probability values
The information content is measured in bits. It is also known as self information.
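As a quick illustration (a minimal sketch, not from the original text), equation (1) can be evaluated directly:

```python
import math

def information_content(p):
    """Shannon information content, in bits, of an event with probability p (equation (1))."""
    return math.log2(1.0 / p)

print(information_content(0.5))    # 1.0 bit: a fair coin landing heads
print(information_content(0.125))  # 3.0 bits: a rarer, more 'surprising' event
```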

Entropy

While designing an optimal codeword for our data elements above, we see that the average length contribution of a message and its cost line up well. In other words, the height of the rectangle denoting the average length contribution of a message is almost equal to the height of the exponential curve denoting the cost of the codeword at the boundary. Thus, if we slightly increase the length of the codeword, the message length contribution will increase in proportion to its height at the boundary, while the cost will decrease in proportion to its height at the boundary.
Figure 10: Proportionality of cost with the height of the message length contribution
We can also say that the cost to make the codeword for a data element 'a' shorter is p(a). Now, let us generalize this to any message 'X'. For a message 'X' with likelihood of occurrence p(X), the cost can be approximated as p(X) too. We also know that the cost of a message of length 'L' is 1/2^L. If we invert this to obtain the length of a message that costs a given amount, we get L = log_2(1/cost). Substituting p(X) for the cost, we derive L as log_2(1/p(X)). Thus, the length of a message 'X' can be written in terms of its probability in the following equation:
\begin{align} L(X) = \log_{2}\left(\frac{1}{p(X)}\right) \end{align}
As we discussed earlier, there is a limit to how short the average message can get when communicating events drawn from a particular probability distribution 'p'. This limit, the average message length using the best possible code, is called the entropy or Shannon entropy of 'p', H(X). In other words, entropy is a measure of the unpredictability of the state X, or equivalently, of its average information content. The formula for entropy is as follows:
\begin{align} \tag{2} H(X) = \sum_{X} p(X)\log_{2}\left(\frac{1}{p(X)}\right) \end{align}
H(X) can also be denoted by H(p), where 'p' is the vector containing the probabilities of each of the discrete events. Entropy, too, is measured in bits.
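As a sketch, and again assuming the four discount probabilities are 0.5, 0.25, 0.125 and 0.125, entropy can be computed directly from equation (2):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: sum of p * log2(1/p) over all outcomes (equation (2))."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits, the optimal average code length
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits: equiprobable outcomes need the full 2 bits
```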

Conditional Entropy

The average uncertainty about the output value (say 'Y') given an input value ('X') is the conditional entropy H(Y|X). Here, H(Y|X) is 'the residual uncertainty (entropy) of Y given that we know the value of X'. Let us assume a channel with noise η, input 'X' and output 'Y'. In this case, H(Y|X) can also be represented as H(η). It can be written in the following form:
\begin{align} \tag{3} H(Y|X) = \sum_{x \in X} p(x) \sum_{y \in Y} p(y|x)\log_{2}\left(\frac{1}{p(y|x)}\right) \end{align}
The above equation can further be condensed into the following form:
\begin{align} \tag{4} H(Y|X) = \sum_{x \in X,\, y \in Y} p(x,y)\log_{2}\left(\frac{1}{p(y|x)}\right) \end{align}
You could modify this equation for a given value of the input, say X = a_i, in the following manner:
\begin{align} \tag{5} H(Y|X=a_{i}) = \sum_{y \in Y} p(y|x=a_{i})\log_{2}\left(\frac{1}{p(y|x=a_{i})}\right) \end{align}
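A minimal sketch of equation (4), using a made-up joint distribution over a binary input and output:

```python
import math

# Hypothetical joint distribution p(x, y) for a binary channel.
p_xy = {('0', '0'): 0.4, ('0', '1'): 0.1,
        ('1', '0'): 0.1, ('1', '1'): 0.4}
p_x = {'0': 0.5, '1': 0.5}  # marginal distribution of the input X

# H(Y|X) = sum over (x, y) of p(x, y) * log2(1 / p(y|x)), where p(y|x) = p(x, y) / p(x).
h_y_given_x = sum(p * math.log2(p_x[x] / p) for (x, y), p in p_xy.items())
print(h_y_given_x)  # ~0.72 bits of residual uncertainty about Y once X is known
```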

Cross Entropy or Joint Entropy

If X and Y are discrete random variables and p(X,Y) is the value of their joint probability distribution at (x, y), then the joint entropy of X and Y is:
\begin{align} \tag{6} H(Y,X) = \sum_{x \in X,\, y \in Y} p(y,x)\log_{2}\left(\frac{1}{p(y,x)}\right) \end{align}
Let's say you have two discrete distributions, p(x) and q(x), defined over the same set of elements. The cross-entropy can then be defined as the average length of the code needed to communicate an event drawn from one distribution using the optimal code for the other distribution. It can be represented as:
\begin{align} \tag{7} H_{p}(q) = \sum_{x} q(x)\log_{2}\left(\frac{1}{p(x)}\right) \end{align}
For continuous distributions, you simply replace the discrete summation with an integral. Cross-entropy is an important measure: it gives us a way to express how different two probability distributions are. The more the distributions 'p' and 'q' differ, the more the cross-entropy of 'p' with respect to 'q' exceeds the entropy of 'p'.
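A minimal sketch of equation (7) with two made-up distributions over the same events:

```python
import math

def cross_entropy(q, p):
    """Average bits needed to encode events drawn from q using a code optimised for p
    (equation (7)); q and p are dicts mapping events to probabilities."""
    return sum(q[x] * math.log2(1.0 / p[x]) for x in q)

q = {'a': 0.5, 'b': 0.3, 'c': 0.2}
p = {'a': 0.2, 'b': 0.3, 'c': 0.5}
print(cross_entropy(q, q))  # ~1.49 bits: the entropy of q itself
print(cross_entropy(q, p))  # ~1.88 bits: larger, because p's code is a poor fit for q
```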

Mutual Information

Let us consider a channel with input 'x', output 'y' and noise η.
Figure 11: Channel design demonstrating an ideal communication channel
In the above channel, H(x) and H(y) are the marginal entropies of the input and output respectively, while H(x|y) and H(y|x) are the conditional entropies of the input given the output and vice versa. We can now derive mutual information from this system. The mutual information I(x, y) between two variables, such as a channel input 'x' and output 'y', is the average amount of information that each value of 'x' provides about 'y'. It can be represented as:
\begin{align} \tag{8} I(x,y) = H(y) - H(y|x) = H(y) - H(\eta) = H(x) + H(y) - H(x,y) \end{align}
In terms of two random variables 'X' and 'Y', whose joint distribution is defined by p(X,Y), mutual information can be written in the following form:
\begin{align} \tag{9} I(X,Y) = H(Y) - H(Y|X) = \sum_{x \in X,\, y \in Y} p(x,y)\log_{2}\left(\frac{p(x,y)}{p(x)p(y)}\right) \end{align}
Another metric that can be derived from the mutual information is the variation of information. The variation of information is the information which isn’t shared between these two variables ('x' and 'y'). We can define it like so:
\begin{align} \tag{10} V(x,y) = H(x,y) - I(x,y) \end{align}
The variation of information between two variables is zero if knowing the value of one tells you the value of the other and increases as they become more independent.
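A minimal sketch computing the mutual information of equation (9) and the variation of information of equation (10) for a made-up joint distribution:

```python
import math

# Hypothetical joint distribution p(x, y); both marginals are (0.5, 0.5).
p_xy = {('0', '0'): 0.4, ('0', '1'): 0.1,
        ('1', '0'): 0.1, ('1', '1'): 0.4}
p_x = {'0': 0.5, '1': 0.5}
p_y = {'0': 0.5, '1': 0.5}

# Mutual information, equation (9): sum of p(x,y) * log2( p(x,y) / (p(x)p(y)) ).
mi = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

# Joint entropy, then variation of information, equation (10).
h_xy = sum(p * math.log2(1.0 / p) for p in p_xy.values())
vi = h_xy - mi
print(mi, vi)  # ~0.28 bits shared, ~1.44 bits not shared
```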

Conditional Mutual Information

The conditional mutual information is defined as the expected value of the mutual information of two random variables given the value of a third. For random variables X, Y, and Z, it is written I(X;Y|Z). Let us look at a representation of these three variables in the form of a Venn diagram:
Figure 12: Venn diagram of the information measures for the three variables X, Y and Z
Mathematically, it can be defined as:
\begin{align} \tag{11} I(X;Y|Z) = \sum_{z \in Z} p_{Z}(z) \sum_{x \in X} \sum_{y \in Y} p_{X,Y|Z}(x,y|z)\log\left(\frac{p_{X,Y|Z}(x,y|z)}{p_{X|Z}(x|z)\,p_{Y|Z}(y|z)}\right) \end{align}
This can be simplified to:
\begin{align} \tag{12} I(X;Y|Z) = \sum_{x \in X}\sum_{y \in Y}\sum_{z \in Z} p_{X,Y,Z}(x,y,z)\log\left(\frac{p_{Z}(z)\,p_{X,Y,Z}(x,y,z)}{p_{X,Z}(x,z)\,p_{Y,Z}(y,z)}\right) \end{align}
The above quantity is non-negative for discrete, jointly distributed random variables X, Y and Z. The probability mass functions denoted by 'p' can be evaluated using the conventional definitions of probability.

Multivariate mutual information

Multivariate mutual information (MMI), or multi-information, is, for three variables, the amount of information about Z that is yielded by knowing both X and Y together: the information that is mutual to Z and the (X, Y) pair, written I(X,Y;Z). In the above Venn diagram, it is the central region. Mathematically, a general form of multi-information can be represented as an alternating sum of entropies:
\begin{align} \tag{13} I(X_{1};X_{2};\ldots;X_{n}) = -\sum_{T \subseteq \{1,2,\ldots,n\}} (-1)^{|T|} H(T) \end{align}
Positive MMI is typical of common-cause structures. For example, clouds cause rain and also block the sun; therefore, the correlation between rain and darkness is partly accounted for by the presence of clouds, i.e. I(rain; dark | cloud) ≤ I(rain; dark). The result is a positive MMI I(rain; dark; cloud). A condition where I(Y;Z|X) ≥ I(Y;Z) indicates a negative MMI I(X;Y;Z).

Channel Capacity

The measure of channel capacity basically helps us answer how fast we can transmit information over a communication channel. One of the most common and convenient channels is the additive channel. This channel adds noise (η) to the encoded input values as they pass through it, so that the channel output is a noisy version 'y' of the channel input 'x'. It looks something like this:
\begin{align} y = x + \eta \end{align}
The channel capacity 'C' is the maximum amount of information that a channel can provide at its output about the input. It is often represented as:
\begin{align} \tag{14} C = \max\, I(x,y) \end{align}
where I(x,y) is the mutual information between the input and the output.
The rate at which information is transmitted through a channel can be determined using the entropies of three variables:
  1. the entropy H(x) of the input,
  2. the entropy H(y) of the output,
  3. the entropy H(η) of the noise in the channel.
If the output entropy is high, there is a large potential for information transmission; it depends on the input entropy and the level of noise. If the noise is low, then the amount of information transmitted can come close to the channel capacity. This further explains and reinforces equation [14]. However, the channel capacity gets progressively smaller as the noise increases. Capacity is usually expressed in bits per usage (bits per output symbol) or in bits per second (bits/s). The Shannon-Hartley theorem states that the channel capacity (C) is given by:
\begin{align} \tag{15} C = B\log_{2}(1 + S/N) \end{align}
Here, S/N is the signal-to-noise ratio of the channel, 'B' is the bandwidth of the channel in Hertz, and C is expressed in bits per second.
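As a quick numeric sketch (the channel values are illustrative assumptions), equation (15) can be evaluated directly:

```python
import math

def shannon_hartley_capacity(bandwidth_hz, snr_linear):
    """Channel capacity in bits per second, equation (15)."""
    return bandwidth_hz * math.log2(1.0 + snr_linear)

# Hypothetical telephone-like channel: 3 kHz of bandwidth and an SNR of 1000 (30 dB).
print(shannon_hartley_capacity(3000, 1000))  # ~29,900 bits per second
```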

Information Divergence

Information divergence is a measure of the dissimilarity between two probability distributions. This is a general concept used across various domains to compare distributions and their information content. In many machine learning objectives, it is used to fit an approximation of the form x ∼ μ, where x is the observed data (input) and μ is the approximation given by the model.
There is a large variety of information divergences. They behave like distances, although most of them are not true distance metrics (for instance, they need not be symmetric). We cover one of them in the next section.

Relative entropy or Kullback–Leibler divergence

KL divergence stands for Kullback-Leibler divergence. If we have two distributions p(x) and q(x) defined over the same set of events, the KL divergence quantifies the difference between these distributions. It is also known as relative entropy or information divergence. It is a non-symmetric measure of the difference between two probability distributions: the divergence of q(x) from p(x), denoted by D_KL(p||q) or D_KL(p(x), q(x)), is a measure of the information lost when q(x) is used to approximate p(x), and in general D_KL(p(x), q(x)) ≠ D_KL(q(x), p(x)). For two discrete distributions with p(x) > 0 and q(x) > 0, it can be represented in the following way:
\begin{align} \tag{16} D_{KL}(p(x), q(x)) = D_{KL}(p||q) = H(p,q) - H(p) = \sum_{x} p(x)\log\left(\frac{p(x)}{q(x)}\right) \end{align}
Thus, KL divergence or relative entropy measures the expected number of extra bits required to code samples from p(x) when using a code based on q(x) instead of a code based on p(x). Since KL divergence is a measure of dissimilarity, it is closely related to cross entropy and mutual information. If you remember from the last section, 'variation of information' also quantifies a notion of distance between variables. However, it differs slightly from KL divergence.
KL divergence gives us a distance between two distributions over the same variable or set of variables. In contrast, variation of information gives us distance between two jointly distributed variables. KL divergence talks about divergence between distributions, while variation of information does the same within a distribution.
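A minimal sketch (with made-up distributions) of equation (16), showing that the two directions of KL divergence differ:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits, equation (16); p and q are dicts over the same events."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

p = {'a': 0.8, 'b': 0.1, 'c': 0.1}
q = {'a': 0.4, 'b': 0.4, 'c': 0.2}
print(kl_divergence(p, q))  # 0.5 bits
print(kl_divergence(q, p))  # 0.6 bits: the divergence is not symmetric
```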
Since KL divergence is not a symmetric measure (i.e. D_KL(p(x), q(x)) ≠ D_KL(q(x), p(x))), we get two different objective functions to optimize when approximating p(x) as faithfully as possible. These functions are:
  1. Minimizing the forward KL divergence: arg min_θ D_KL(P||Q_θ)
  2. Minimizing the reverse KL divergence: arg min_θ D_KL(Q_θ||P)
Both the above optimization functions actually cause different types of approximations. Let us look at each one of them in a little detail:

Forward KL

Let us start with the base equation of forward KL:
\begin{align} \tag{17} \arg\min_{\theta} D_{KL}(P||Q_{\theta}) \end{align}
Let us start expanding equation [17] by substituting the definition of KL divergence from equation [16]:
\begin{align} \arg\min_{\theta} D_{KL}(P||Q_{\theta}) = \mathbb{E}_{x\sim P}[\log(1/q(X))] - \mathbb{E}_{x\sim P}[\log(1/p(X))] \end{align}
This can also be written as:
\begin{align} \tag{18} \arg\min_{\theta} D_{KL}(P||Q_{\theta}) = \mathbb{E}_{x\sim P}[\log(1/q(X))] - H(p(X)) \end{align}
In the above equation, the first term is the cross entropy between P and Q (also denoted by H(p,q)), while the second term H(p(X)) is the entropy of P (also written \mathbb{E}_{x\sim P}[-\log p(X)]). Since the entropy of p(X) is fixed with respect to θ, the final objective takes the following form:
\begin{align} \tag{19} \arg\min_{\theta} D_{KL}(P||Q_{\theta}) = \arg\min_{\theta}\, \mathbb{E}_{x\sim P}[\log(1/q(X))] = \arg\max_{\theta}\, \mathbb{E}_{x\sim P}[\log q(X)] \end{align}
The above equation looks just like the maximum likelihood estimation objective used in machine learning. This objective samples points from p(X) and tries to maximize the probability of occurrence of these points under q(X). It leads to mean-seeking behaviour, because the approximate distribution Q must cover all the modes (frequent events) and regions of high probability in P. In short: **"Wherever P has high probability, Q must also have high probability."** Let us look at an example of an approximate distribution for the same:
Figure 13: Approximate distribution for minimization of forward KL divergence
In the above diagram, the approximate distribution Q centers itself between the two modes of P, so that it can have high coverage of both. The 'forward KL divergence' optimization technique does not penalize Q for having high probability mass where P does not.

Application of forward KL divergence

Now, in supervised learning techniques employing 'empirical risk minimization', we use a dataset of samples D = {(x_i, y_i)} drawn from some ground-truth data distribution P(x,y) = P(x)P(y|x). In supervised learning we aim to learn a model that maps data elements X to Y, f: X → Y. This model should minimize the empirical risk, which is determined by a function measuring the loss of information, called the loss function L(f(x), y). A loss function maps an event to a real number, a value with some "cost". We will talk about loss functions in detail later. We need to optimize the loss function (reduce the cost) over some family of models f_θ. This creates an optimization objective function as follows:
\begin{align} \tag{20} \arg\min_{\theta}\, \mathbb{E}_{(x,y)\sim D}[L(f_{\theta}(x),y)] \end{align}
Optimizing this objective is equivalent to minimizing the divergence from an approximate distribution (Q_θ) to the true data distribution (P). This concept can be extended to both regression and classification in the following ways:
  1. Regression with Mean-Squared Error Loss: In regression, one of the most significant loss functions is the mean-squared error, which we aim to reduce in order to get the most accurate predictions. Suppose the distribution of your estimates, q_θ(y|x), is normally distributed around the regression output. The negative log-likelihood of this normal distribution is -log(q_θ) = (1/2)||y - f_θ(x)||^2 + C, where f_θ(x) are the predictions of the regression function and y are the actuals. Minimizing the negative log-likelihood of this normal distribution is hence equivalent to minimizing the mean-squared error loss (a small numeric sketch follows after this list).
  2. Classification with Cross Entropy Loss: In this case, the approximate distribution Q is the output of the classification model. It is represented as a discrete event distribution parameterized by a probability vector (the probability of an event belonging to each class). In classification you aim to reduce the cross-entropy loss (or log loss) of the predicted results against the ground truth.
The above logic can be applied to any loss function. This way, we can improve the accuracy of the machine learning models. This concept is widely applied to supervised learning techniques.
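As referenced in point 1 above, here is a minimal numeric sketch (my own example, assuming a unit-variance Gaussian) of why minimizing the Gaussian negative log-likelihood is the same as minimizing squared error:

```python
import math

def gaussian_nll(y, y_pred):
    """Negative log-likelihood of y under a unit-variance Gaussian centred at y_pred:
    0.5 * (y - y_pred)^2 plus a constant that does not depend on the prediction."""
    return 0.5 * (y - y_pred) ** 2 + 0.5 * math.log(2 * math.pi)

y, y_pred = 3.0, 2.2
print(gaussian_nll(y, y_pred))   # ~1.24, i.e. 0.32 plus a constant (~0.92)
print(0.5 * (y - y_pred) ** 2)   # 0.32: the squared-error term the two objectives share
```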

Reverse KL

Let us start with the base equation of reverse KL:
\begin{align} \tag{21} \arg\min_{\theta} D_{KL}(Q_{\theta}||P) \end{align}
The equation for reverse KL can be expanded as:
\begin{align} \arg\min_{\theta} D_{KL}(Q_{\theta}||P) = \arg\max_{\theta}\left( \mathbb{E}_{x\sim Q_{\theta}}[\log p(X)] + H(Q_{\theta}) \right) \end{align}
The above objective samples points from Q and tries to maximize the probability of these points under P. The entropy term encourages the approximate distribution to be as wide as possible. A good approximation under the reverse KL objective thus satisfies the condition: "Wherever Q has high probability, P must also have high probability."
This is known as mode-seeking behaviour, because any sample from the approximate distribution Q must lie within a mode of P (since samples from Q are required to be highly probable under P). Let us look at an approximate distribution to visualize the same.
Figure 14: Approximate distribution for minimization of reverse KL divergence
As we see here, the approximate distribution essentially encompasses the right mode of P. The reverse KL divergence does not penalize Q for not placing probability mass on the other mode of P.

Application of reverse KL divergence

Reverse KL divergence finds its application in reinforcement learning. The maximum-entropy reinforcement learning objective which is used in reinforcement learning models uses this principle extensively.

Differential Entropy

While applying statistics to information theory, we come across the concept of maximum-entropy probability distributions. Let us begin with the binary event of flipping a coin. If a random variable 'X' represents the toss of a (possibly biased) coin, we have P(X=H) = p and P(X=T) = (1 − p) with p ∈ [0,1]. If we compute the entropy for every possible value of p, we obtain a curve that peaks at p = 0.5: the fair coin is the maximum-entropy distribution for a binary variable.
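A minimal sketch of this binary entropy curve:

```python
import math

def binary_entropy(p):
    """Entropy in bits of a coin with P(heads) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(binary_entropy(p), 3))
# The curve peaks at 1 bit for p = 0.5: the fair coin is the maximum-entropy binary distribution.
```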

Loss functions

Loss functions are essentially the objective functions applied to many machine learning models, and they draw on concepts from information theory. While a lot of loss functions are designed on the basis of the information content or entropy of datasets, there are a few others that use simple mathematical operations to measure the accuracy and performance of machine learning models. Let us refer to the image below:
Figure 15: Some key loss functions in classification and regression models
Let us talk about a few loss functions in detail.

Cross Entropy Loss and Log Loss

To understand the definition and application of this loss function, note first that it is used in classification problems. This entropy-based loss function is widely used to measure the performance of classification algorithms that output the probability of a record belonging to a particular class. Cross entropy is the more generic form of another loss function, called the logarithmic loss or log loss, when it comes to machine learning algorithms. While log loss is used for binary classification algorithms, cross-entropy serves the same purpose for multiclass classification problems. In other words, log loss is used when there are 2 possible outcomes and cross-entropy is used when there are more than 2 possible outcomes.
These loss functions quantify the price paid for inaccurate predictions in classification problems, penalizing false classifications while taking into account the predicted class probabilities. The generic equation of the cross-entropy loss looks like the following:
\begin{align} \tag{22} CrossEntropyLoss = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij}\log_{2} p_{ij} \end{align}
Here, ‘M’ is the number of possible outcomes or labels and 'N' is the number of samples or instances in the dataset. p_{ij} is the model's probability of assigning label 'j' to instance 'i', and y_{ij} is 1 if the i-th instance actually has label 'j' and 0 otherwise. As you can see, the equation looks similar to the cross entropy equation above. Let us take an example.
Let us assume a problem where one has to predict the range of grades a student will score in an exam given his attributes. There are three possible outcomes: High, Medium and Low, one-hot encoded as (1,0,0), (0,1,0) and (0,0,1). Now, for a particular student, the predicted probabilities are (0.2, 0.7, 0.1). This indicates the predicted range of scores is most likely 'Medium', as its probability is the highest. Hence, the cross-entropy loss for this instance is -((0)·log2(0.2) + (1)·log2(0.7) + (0)·log2(0.1)) ≈ 0.5146.
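The same calculation as a minimal Python sketch:

```python
import math

def cross_entropy_loss(y_true, y_pred):
    """Cross-entropy loss in bits for one instance; y_true is one-hot, y_pred holds probabilities."""
    return -sum(t * math.log2(p) for t, p in zip(y_true, y_pred) if t > 0)

# The student-grade example: true class 'Medium', predicted probabilities (0.2, 0.7, 0.1).
print(cross_entropy_loss((0, 1, 0), (0.2, 0.7, 0.1)))  # ~0.51 bits
```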
The equation for cross-entropy loss (equation [22]) can be converted into log loss, by accounting for only two possible outcomes. Thus the formula looks something like this:
\begin{align} \tag{23} LogLoss = -\frac{1}{N} \sum_{i=1}^{N} \left(y_{i}\log_{2}p_{i} + (1-y_{i})\log_{2}(1-p_{i})\right) \end{align}
Following the same conventions as equation [22], here y_i denotes the outcome (the class) of the i-th instance, and p_i is the classifier's predicted probability that the i-th instance belongs to class 1. So if y_i takes the value 1, (1 − y_i) equals 0: these are the two classes of a binary classification problem.
We have already spoken about KL divergence as a loss function for both regression and classification problems. Let us look at a few other loss functions.

Focal Loss

Focal loss is an improvement on cross-entropy loss, often used for object detection with neural networks. In problems like object detection, a major issue is class imbalance caused by the presence of a large number of easily classified background examples. Focal loss is one loss function that tries to account for this class imbalance:
\begin{align} \tag{24} FL(p_{t}) = -(1-p_{t})^{\gamma}\log_{2}(p_{t}) \end{align}
Here, p_t is defined in the following way:
p_t = p when y = 1, and p_t = (1 − p) otherwise. This simply covers the two possible outcomes of a binary classifier.
This loss function is a dynamically scaled log loss, where the scaling factor (1 − p_t)^γ decays to zero as confidence in the correct class increases. This scaling factor automatically down-weights the contribution of easy examples (examples present in large numbers) during training and rapidly focuses the model on hard examples (sparse examples). Focal loss enables the user to train a high-accuracy, one-stage detector that significantly outperforms alternatives trained with hard example mining. Focal loss can be depicted using the graph below:
Figure 16: Focal loss adds a scaling factor to standard cross entropy loss
Here, we see that the focal loss function adds a factor (1 − p_t)^γ to the standard binary loss function. The relative loss for well-classified easy examples (with p_t > 0.5) is reduced for all γ > 0. As the value of γ increases, this loss reduces further towards the tail of the plots. In the above curve, the line corresponding to γ = 0 represents the cross-entropy (CE) loss, or more specifically, the log loss function.
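A minimal sketch comparing equation (24) for an easy and a hard example at a few values of γ:

```python
import math

def focal_loss(p_t, gamma):
    """Focal loss in bits for true-class probability p_t, equation (24)."""
    return -((1 - p_t) ** gamma) * math.log2(p_t)

# A well-classified example (p_t = 0.9) versus a hard example (p_t = 0.2).
for p_t in (0.9, 0.2):
    print(p_t, [round(focal_loss(p_t, g), 4) for g in (0, 1, 2)])
# gamma = 0 is ordinary log loss; larger gamma shrinks the easy example's loss much faster.
```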
There are several other functions that use probability-based concepts, not necessarily based on entropy and information content. You can explore their use and application according to your requirements and the kind of model you are trying to evaluate.

Learning Rate

Learning rate, also called step size, is a model hyperparameter that tells us how much to adjust the weights of our model network with respect to the gradient of the loss function. This concept comes from the optimization functions we saw previously. When applied to machine learning, the learning rate helps determine how to change the parameters of a model so that it converges well, i.e. achieves the best accuracy. In other words, if we have a hypothesis or model represented in the following manner:
\begin{align} \tag{25} h_{\theta} = \theta_{0} + \theta_{1}x \end{align}
we have to alter the weights (θ_0, θ_1) so that the above hypothesis best estimates the target. We thus apply the gradient descent (partial derivative) operation here and find the following relationship between the old and new weights:
new_weight = existing_weight − learning_rate * gradient
More precisely, it can be represented in the following manner:
\begin{align} \tag{26} \theta_{1} := \theta_{1} - \alpha \frac{\partial J(\theta_{1})}{\partial \theta_{1}} \end{align}
In the above equation, 'α' is the learning rate. It basically dictates to what extent newly acquired information about a dataset or a model overrides old information. The aim is to arrive at an optimal learning rate, because an optimal learning rate gives the best performance of the model, i.e. the least loss of information.
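A minimal gradient descent sketch for the linear model of equation (25), on made-up data with a squared-error loss:

```python
# Toy data generated from y = 1 + 2x; gradient descent should roughly recover these weights.
data = [(x, 1.0 + 2.0 * x) for x in range(10)]

theta0, theta1 = 0.0, 0.0
alpha = 0.01                      # the learning rate

for _ in range(2000):
    # Gradients of the mean-squared-error loss with respect to theta0 and theta1.
    grad0 = sum((theta0 + theta1 * x) - y for x, y in data) / len(data)
    grad1 = sum(((theta0 + theta1 * x) - y) * x for x, y in data) / len(data)
    theta0 -= alpha * grad0       # the update rule of equation (26)
    theta1 -= alpha * grad1

print(round(theta0, 2), round(theta1, 2))  # close to (1.0, 2.0)
```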
Figure 17: Effect of learning rate on loss function of a model
As we see above (Figure 17), the learning rate cannot be set too high or too low. With a very high learning rate, the updates of equation [26] will perpetually bounce haphazardly across the minimum of the curve (refer to Figure 18 below). With a value of α that is too low, the model might take too long to converge, which costs you computational power and time.
Figure 18: Effect of learning rate on gradient descent of a model

Information Gain

Information Gain is a key concept of information theory that applies significantly to machine learning. While building a machine learning model on an existing dataset, there are several attributes or fields that contribute to the information content of the dataset and thus the accuracy of the model. The addition or removal of these features can alter the performance of the model.
Information gain is the amount of information, or reduction in entropy, obtained by knowing the value of an attribute. It is calculated as the difference between the entropy of the distribution before the split on that attribute and the weighted entropy of the distributions after it. It basically measures how much "information" a feature gives us about the class, and a larger information gain is desirable. Information gain is a key concept used in building decision trees: decision tree algorithms always try to maximize information gain. The information gain of a split can be represented by the following formula:
\begin{align} \tag{27} InformationGain = entropy(ParentDataset) - [WeightedAverage] \times entropy(ChildDatasets) \end{align}
Information Gain is largely used in decision tree models to define the rules and make the necessary splits at each stage.
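A minimal sketch of equation (27) for a single candidate split, with made-up class labels:

```python
import math

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def information_gain(parent, children):
    """Entropy of the parent minus the weighted average entropy of the child splits."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ['yes'] * 5 + ['no'] * 5
children = [['yes'] * 4 + ['no'] * 1, ['yes'] * 1 + ['no'] * 4]  # one candidate split
print(information_gain(parent, children))  # ~0.28 bits gained by this split
```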

Information Bottleneck Method

The Information Bottleneck (IB) method [8] is an information-maximization technique that has applications across various fields. In machine learning, it is used in dimensionality reduction, clustering and even deep learning. Let us first look at how this principle works.
Suppose you have a system with input x ∈ X and a corresponding output (or another signal) y ∈ Y. We first compute how much information X gives about Y; this is the mutual information I(X,Y). The mutual information should be positive (i.e. Y must not be independent of the original signal X). Now, we need to encode X into a short code X_E such that the relevant information about Y is preserved as far as possible. This is the Information Bottleneck (IB) method. The IB method attempts to minimize the quantity I(X, X_E) to gain maximum compression, while maximizing the mutual information I(X_E, Y). Here, X_E is the information bottleneck.
The information between these variables is related in the following manner:
\begin{align} \tag{28} I(X_{E},Y) = \sum_{y \in Y}\sum_{x_{E} \in X_{E}} p(y,x_{E})\log\left(\frac{p(y,x_{E})}{p(y)p(x_{E})}\right) \le I(X,Y) \end{align}
This is because most real-world compression systems are lossy, and a lossy compression cannot convey more information than the original data X. In this process, there is a trade-off between compressing the representation and preserving meaningful information, and there is no single right solution for the trade-off. We look for a solution that keeps a fixed amount of meaningful information about the relevant signal 'Y' while minimizing the number of bits from the original signal 'X' (maximizing the compression). The solution is as follows:
\begin{align} \tag{29} \min_{p(X_{E}|X)}\left[ I(X,X_{E}) - \beta\, I(X_{E},Y) \right] \end{align}
Here, β is the Lagrange multiplier attached to the constraint on meaningful information while maintaining the normalization of the mapping p(X_E|X) for every x. β reflects the trade-off between compression and preservation of mutual information.

Application of IB method in Machine Learning

  1. Information Bottleneck method can widely be used in dimensionality reduction. This is because we need to compress the base dataset into its most concise form i.e. reduce the dimensions (attributes) in a dataset so as to retain the maximum information contained in the base dataset. We need to strike a balance between the number of attributes in the dataset and the information those attributes carry collectively.
  2. The IB method also finds application in clustering. Clustering can be viewed as lossy data compression because the identity of individual points is replaced by the identity of the cluster to which they are assigned. Information theory is relevant in this context because two points can be clustered together if the merger does not lose too much information about the relevant variable. For instance, you can improve the performance of K-means clustering so that it converges to the best possible clusters with maximum information by using the minimization objective defined in equation [29].
  3. IB method can also be used in another specific case of clustering i.e. topic modelling and document summarization and clustering. For this, it first generates a partitioning of the words (with encoded words: p ( W E | W ) p ( W E | W ) p(W_(E)|W)p(W_{E}|W)), which is supposed to preserve information about the documents. Then, the original document representation is replaced by a representation based on the word-clusters. Then, a partitioning of the documents p ( D E | D ) p ( D E | D ) p(D_(E)|D)p(D_{E}|D) is found that preserves information about the words.

Rate Distortion Theory

In information theory, rate distortion theory describes lossy data compression, i.e. the problem of compressing data into the minimum number of bits such that the data can be approximately reconstructed at the other end of the channel. Rate distortion theory helps determine the minimal number of bits per symbol, as measured by the rate R, that should be communicated over a channel so that the source (input signal) can be approximately reconstructed at the receiver (output signal) without exceeding an expected distortion D.
In the most conventional cases, distortion is measured as the expected value of the square of the difference between input and output signal (i.e., the mean squared error). Let us consider a channel consisting of an encoder and a decoder.
Figure 19: Rate distortion encoder and decoder
Here, X is the input and Y is the encoded output. The decoder transforms this signal Y into a reconstruction X̂ (also written X'). In the above channel, we can define distortion measures or functions in two common ways:
  • Hamming Distortion
\begin{align} \tag{30} d(X,\hat{X}) = \begin{cases} 0 & \text{if } X = \hat{X} \\ 1 & \text{if } X \neq \hat{X} \end{cases} \end{align}
  • Squared-error Distortion
\begin{align} \tag{31} d(X,\hat{X}) = (X - \hat{X})^{2} \end{align}

Information Theory in Feature Engineering and Feature Extraction

Knowledge discovery is a process that helps you engineer features or attributes in a base dataset to improve the performance of the machine learning models. Information theory has a lot of 'information measures' or metrics that can help you evaluate the importance and relevance of each new attribute that you develop. For instance, if you develop new attributes in a dataset, you should use measures like mutual information and conditional entropy to evaluate how significant an explanatory variable is with respect to the target or the dependent variable.
We have already seen in the sections above that conditional entropy captures how strongly two variables are related. Let H(Y|X=x_i) represent the conditional entropy of a variable 'Y' given X = x_i. Let us assume that 'Y' is the target variable of a machine learning model and 'X' is an explanatory variable, say a categorical variable with several labels x_i, where 'i' indexes the unique labels of 'X'. In this case, we want H(Y|X=x_i), aggregated over all 'i', to be close to zero. This is because H(Y|X) = 0 if and only if the value of 'Y' is completely determined by the value of 'X', while H(Y|X) = H(Y) if and only if 'Y' and 'X' are independent random variables. Our aim while designing and collecting variables is to achieve a conditional entropy close to 0. In terms of mutual information, we need to collect and engineer features such that the mutual information between 'Y' and 'X' is maximized. Below is a table of some important information measures covered so far that can be used for this purpose.
Figure 20: Information Measures that can be used as learning and content measures
An automated, relatively new method of deriving and creating features is feature learning. It helps you automatically discover (or learn) the data transformations and representations needed for feature detection in a dataset. There are two types of feature learning techniques: 1) supervised, where features are learned using labeled input data, and 2) unsupervised, where features are learned from unlabeled input data. Rate distortion theory is widely used in both supervised and unsupervised feature learning. If X is an instance in your dataset and Z represents the learned feature, the objective is to optimize the mutual information between X and Z subject to a bound on a distortion function. This means that we could design a distortion function d(X,Z) that quantifies the quality of the feature map and place bounds on it to ensure that the loss introduced by compression/data transformation is minimal.

Recent Works

In 2006, Hild II et al. introduced an information-theoretic approach to feature extraction in their paper Feature Extraction Using Information-Theoretic Learning [14]. To understand this method, let us rewrite equation [9] as:
\begin{align} \tag{32} I(X;C) = H(X) - H(X|C) \end{align}
The proposed method, called MRMI-SIG, replaces the entropy terms in the above equation with a specific form of Renyi's quadratic entropy [34]. This leads to the following MRMI-SIG information-theoretic criterion for feature extraction:
\begin{align} \tag{33} J = -\log \frac{1}{N_{T}} \sum_{n=1}^{N_{T}} G\left(\boldsymbol{x}(n)-\boldsymbol{x}(n-1),\, 2\sigma^{2}\right) + \sum_{j=1}^{N_{C}}\left(\frac{N_{j}}{N_{T}} \log \frac{1}{N_{j}} \sum_{n=1}^{N_{j}} G\left(\boldsymbol{x}_{j}(n)-\boldsymbol{x}_{j}(n-1),\, 2\sigma^{2}\right)\right) \end{align}
The proposed method extracts features from the raw dataset such that the second term on the right-hand side is maximized. This means that different data points from the same class (in the context of classification) end up very close to each other in the new output feature space. To prevent a trivial solution, wherein all features from all classes overlap perfectly, the first term on the right-hand side of the equation above measures the spread of the data irrespective of class and is also maximized.
In a more specific use case, Fault Diagnosis of Reciprocating Machinery, published by Huaqing Wang et al. [15] in 2009, KL divergence is used as the underlying concept for feature extraction. In fault diagnosis, the authors devise a way to compare an unknown distribution against a known reference distribution. They compute the KL information as the divergence of the diagnosis signal with respect to the expectation value of the reference feature variable. Similarly, the diagnosis information (DI) is calculated from the expectation value of the diagnosis feature variable.

Information Theory in Feature Selection

While feature engineering involves generating new features, feature selection is the key step in machine learning where you select the most important and relevant features for the given problem statement. Information theory is a key concept used in this step: you can design your optimization function using information-theoretic measures. Let us look at a few possible techniques.

Mutual Information

As we saw in equation [2], Shannon's definition of entropy relies on class prior probabilities. Let us also factor in conditional entropy from equation [4]. Assume a classification problem in which the class identity of the target variable is represented as 'C'. If 'X' represents your feature vector, the uncertainty of the class identity can be quantified using the conditional entropy:
H(C|X) = \sum_{x \in X} p(x) \sum_{c \in C} p(c|x) \log_{2}(1/p(c|x))
The amount by which the class uncertainty is reduced after having observed the feature vector 'X' is called the mutual information, I(C,X) = H(C) - H(C|X). Mutual information can also be defined using equation [9] above. Mutual information measures the dependence between two variables, in this case C and X: it is 0 when p(c,x) = p(c)p(x), i.e. when the joint density of C and X factorizes into the product of the marginal densities, which is the condition for independence of the two variables. (Mutual information can equivalently be written as the Kullback-Leibler divergence between p(c,x) and p(c)p(x).) We therefore design an optimization objective such that the mutual information between C and X is maximized, or equivalently H(C|X) is minimized. The following bounds justify this choice.
  1. Lower bound: A lower bound on the probability of error when estimating a discrete random variable 'C' from another random variable 'X' is given by Fano's inequality. It can be represented as:
Pr(c ≠ ĉ) ≥ (H(C|X) - 1)/log(N_c) = (H(C) - I(C,X) - 1)/log(N_c). Here, ĉ is the estimate of C after observing a sample of X (which can be scalar or multivariate) and N_c is the number of classes. We want to minimize the probability that ĉ ≠ c. This lower bound on the error probability is minimized when the mutual information between C and X is maximized; if we find such features, we achieve the lowest possible bound on the error of a classifier.
  2. Upper bound: The upper bound on the probability of error can be represented as follows:
Pr(c ≠ ĉ) ≤ (1/2)H(C|X) = (1/2)(H(C) - I(C,X)). Both the upper and lower bounds are minimized by maximizing the mutual information between C and X, or equivalently by minimizing H(C|X); a short sketch of mutual-information-based feature scoring follows this list.
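As a concrete, hedged illustration of ranking features by their mutual information with the class labels, the sketch below uses scikit-learn's mutual_info_classif estimator together with SelectKBest; the iris dataset and the choice of k = 2 are placeholders for illustration, not values used in this paper.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# X is the feature matrix, y the class labels C
X, y = load_iris(return_X_y=True)

# Estimate I(C, X_i) for every feature X_i and keep the k highest-scoring ones
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("estimated mutual information per feature:", selector.scores_)
print("indices of selected features:", selector.get_support(indices=True))
```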

Application of optimization of mutual information for feature selection

The idea is to design an objective function that operates on your high-dimensional training data. Let us represent the training dataset as (x_i, c), where x_i is the feature vector and c is the class label. The objective is to find a transformation g (or its parameter/weight vector w) such that x̂_i = g(w, x_i) maximizes I(C, X̂), the mutual information between the transformed data X̂ and the class labels C. For this process to work, we need to estimate I(C, X̂) as a differentiable function of the dataset. The optimization is iterative: the weights are updated by gradient ascent on I(C, X̂), scaled by a learning rate, until a satisfactory solution is found. The process is best represented by the diagram below, and a toy sketch follows the figure.
Figure 21: Feature selection by maximizing the mutual information between class labels and transformed features.
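The sketch below is a deliberately simplified, hedged version of this loop. It learns a single linear projection by finite-difference gradient ascent on a Gaussian-assumption proxy for I(C, X̂) (using the fact that the differential entropy of a Gaussian is a constant plus the log of its standard deviation). The proxy, the finite-difference gradient and every function name are assumptions made here for readability; the methods discussed in the literature use analytically differentiable estimators instead.

```python
import numpy as np

def gaussian_entropy(var):
    # differential entropy of a 1-D Gaussian with variance `var` (in nats)
    return 0.5 * np.log(2.0 * np.pi * np.e * var)

def mi_gaussian_proxy(w, X, y):
    """Proxy for I(C, X_hat) where x_hat = g(w, x) is a linear projection."""
    z = X @ w
    h_total = gaussian_entropy(z.var() + 1e-12)
    h_cond = 0.0
    for c in np.unique(y):
        zc = z[y == c]
        h_cond += (len(zc) / len(z)) * gaussian_entropy(zc.var() + 1e-12)
    return h_total - h_cond      # I(C, X_hat) = H(X_hat) - H(X_hat | C)

def maximize_mi(X, y, lr=0.1, steps=100, eps=1e-4):
    """Crude gradient ascent on the proxy via finite differences."""
    w = np.random.default_rng(0).normal(size=X.shape[1])
    for _ in range(steps):
        grad = np.zeros_like(w)
        for i in range(len(w)):
            w_plus = w.copy()
            w_plus[i] += eps
            grad[i] = (mi_gaussian_proxy(w_plus, X, y) - mi_gaussian_proxy(w, X, y)) / eps
        w += lr * grad
        w /= np.linalg.norm(w)   # keep the projection bounded
    return w

# toy usage with synthetic two-class data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(50, 3)), rng.normal(2, 1, size=(50, 3))])
y = np.array([0] * 50 + [1] * 50)
w = maximize_mi(X, y)
print(w, mi_gaussian_proxy(w, X, y))
```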

Renyi Entropy

We know that mutual information can be computed from probability mass functions, and histograms are one of the most popular ways of estimating them. The issue with histogram-based estimates of information measures is that they only work well with two or three variables; the number of bins explodes as the dimensionality grows. Kernel-based, non-parametric estimators scale better and let you evaluate and compare many variables at once, which makes them better suited to high-dimensional data. One entropy measure that admits such a non-parametric estimate is the Renyi entropy.
Renyi's entropy is a one-parameter generalization of the Shannon entropy that we have covered earlier. It can be represented in the following manner:
\begin{align} H_{\alpha}(X) = \frac{1}{1-\alpha} \log_{2}\left(\sum_{i=1}^{n} p_{i}^{\alpha}\right) \tag{34} \end{align}
In the limit α → 1, the equation recovers the Shannon entropy. α can take values in [0, ∞). Renyi's entropy has computational advantages: the quadratic case (α = 2), combined with kernel density estimates, can be computed in closed form directly from the samples, and the parameter α lets you find a distribution (rather than a single entropy value) that maximizes or minimizes the entropy under given constraints. For continuous variables, the sum is replaced by an integral.
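A small sketch of equation (34) for a discrete distribution; the probabilities in the example are arbitrary, and the α → 1 case falls back to the Shannon entropy, which is the limiting value.

```python
import numpy as np

def renyi_entropy(p, alpha, base=2.0):
    """H_alpha(X) = (1/(1-alpha)) * log(sum_i p_i^alpha), in bits by default."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(alpha, 1.0):
        return -np.sum(p * np.log(p)) / np.log(base)   # limiting case: Shannon entropy
    return np.log(np.sum(p ** alpha)) / ((1.0 - alpha) * np.log(base))

p = [0.5, 0.25, 0.125, 0.125]
for a in (0.5, 0.999, 2.0):
    print(f"alpha={a}: {renyi_entropy(p, a):.4f} bits")
# values for alpha close to 1 approach the Shannon entropy of 1.75 bits
```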
The above metrics can then be used to compute mutual information of a class variable with respect to the features. The aim would be to select features such that the mutual information is maximized.

Recent Works

In 2012, Gulshan Kumar et al. [16] published their work on an information-theoretic approach to feature selection. The paper discusses mutual information as a selection criterion for relevant features.

Information Theory in Model Selection

Model selection is an important step in machine learning. With an increasing number of algorithms available for both supervised and unsupervised learning, it is necessary to choose the best model. This also means penalizing a model for complexity when it introduces more complexity than is needed to fit the regularities of the data. We need a model with an optimal trade-off between goodness-of-fit and generalization (or generalizability). Generalization is what prevents the model from overfitting. As model complexity increases, the goodness-of-fit increases but the generalization ability of the model decreases; in other words, a very complex model tends to overfit. The image below represents this relationship.
Figure 22: Relationship between generalizability and model fit for varying degrees of model complexity
Let us now look at a few methods of model selection. A few common ways of selecting the best model are as follows:

AIC: Akaike Information Criterion

The Akaike Information Criterion (AIC) is a way of selecting a good model from a set of candidate models. It is based on the concept of KL divergence: the selected model is the one that minimizes the (estimated) Kullback-Leibler divergence between the model and the true data-generating process. In general, AIC is defined as follows.
\begin{align} AIC = 2K - 2\log(\hat{L}) \tag{35} \end{align} Here, K is the number of model parameters (the number of variables in the model plus the intercept) required to make the model estimations, \hat{L} denotes the maximized likelihood of the model, and thus \log(\hat{L}) is the model's log-likelihood. AIC can also be represented as:
AIC = 2K + T log(RSS) Here, RSS denotes the residual sum of squared errors, T denotes the number of observations, and K again stands for the number of model parameters or regressors. Given a set of candidate models for your data, the preferred model is the one with the smallest AIC value. In the equations above, AIC rewards goodness-of-fit (through the likelihood term) but penalizes the model by a term that grows with the number of parameters K. This penalty discourages overfitting, because adding parameters almost always improves the in-sample fit.
The above equation is slightly modified for small sample sizes (i.e. T/K ≤ 40). The corrected, second-order AIC (AIC_C) is:
\begin{align} AIC_{C} = 2K - 2\log(\hat{L}) + \frac{2K(K+1)}{T-K-1} \tag{36} \end{align} The above equation follows the same conventions as the previous equations. Every candidate model thus has an AIC value. Let us briefly see how to interpret and use AIC scores. For a given set of R models, list the AIC values as AIC_1, AIC_2, AIC_3, ..., AIC_R and let AIC_min be the minimum of those values. Then the quantity exp((AIC_min - AIC_i)/2) is proportional to the probability that the i-th model minimizes the (estimated) information loss. A short sketch of this computation follows.
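Below is a minimal sketch of equations (35)–(36) and of the relative-likelihood calculation; the log-likelihood values and parameter counts in the example are made-up placeholders.

```python
import numpy as np

def aic(log_lik, k):
    # AIC = 2K - 2 log(L_hat)
    return 2 * k - 2 * log_lik

def aicc(log_lik, k, t):
    # second-order correction used for small samples (roughly T/K <= 40)
    return aic(log_lik, k) + (2 * k * (k + 1)) / (t - k - 1)

# hypothetical candidate models: (maximized log-likelihood, number of parameters K)
models = {"m1": (-120.3, 3), "m2": (-118.9, 5), "m3": (-119.5, 4)}
t = 50  # number of observations

scores = {name: aicc(ll, k, t) for name, (ll, k) in models.items()}
best = min(scores.values())
relative = {name: np.exp((best - s) / 2) for name, s in scores.items()}
print(scores)     # smaller is better
print(relative)   # proportional to the probability of minimizing information loss
```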
While AIC is a useful measure for selecting models, it has a few caveats. AIC can only compare a given set of models; it cannot determine the absolute quality of a model. In other words, it will give you the best model out of the set of models you have (relative quality), but there may be another model outside your candidate set that is better. It is also advisable not to use AIC for comparing a very large number of models.

BIC: Bayesian information criterion

Bayesian information criterion (BIC) or Schwarz information criterion (also SIC, SBC, SBIC) is yet another information measure for model selection. It is closely related to AIC and can be interpreted in the same manner i.e. the model with the smallest BIC value is the best model. It can be defined as:
\begin{align} BIC = \log(T)K - 2\log(\hat{L}) \tag{37} \end{align} As we see, the equation differs from AIC only in the penalty term: following the same conventions as earlier, each parameter is penalized by log(T) instead of a constant 2. As a result, for all but very small samples the penalty for additional parameters is heavier in BIC than in AIC.
One advantage of BIC over AIC is consistency: if the true model is among the candidates, BIC selects it with probability approaching 1 as the sample size grows, whereas AIC tends to favor more complex (higher-dimensional) models and may keep selecting a model that is not the true one. In this sense, BIC is more consistent than AIC when estimating the best model.
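As a quick, hedged illustration of the difference between the two penalty terms (the log-likelihood and parameter count are again placeholders):

```python
import numpy as np

def aic(log_lik, k):
    return 2 * k - 2 * log_lik            # penalty per parameter: 2

def bic(log_lik, k, t):
    return np.log(t) * k - 2 * log_lik    # penalty per parameter: log(T)

log_lik, k = -118.9, 5
for t in (10, 100, 1000):
    print(t, round(aic(log_lik, k), 1), round(bic(log_lik, k, t), 1))
# the BIC penalty grows with the sample size, so extra parameters cost more than under AIC
```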

MDL: Minimum Description Length

The minimum description length (MDL) principle is based on algorithmic coding theory and represents both machine learning models and data as compressible codes. The working principle of MDL for model selection is as follows: the best model is the one that provides the shortest description of the data, in bits, by "compressing" the data as tightly as possible. In other words, the model that requires the fewest bits to describe the actual data is the best. As with the other two criteria described above, you choose the model with the smallest MDL (minimum description length).
There are multiple methods to estimate the value of an MDL for a model. Some of them are:
  1. Fisher Information Approximation (FIA)
  2. Normalized Maximum Likelihood (NML)
A detailed description of these techniques is beyond the scope of this paper.

Information Theory in Regression

We have studied how KL divergence can be used in regression to reduce the mean-squared loss. Let us also cover briefly how mutual information and linear regression are closely related.
We know from equation [9] that mutual information can be defined as H(Y) - H(Y|X). Since regression is a form of continuous estimation, we would like an information measure for continuous variables, namely differential entropy. We also saw earlier that the differential entropy of a Gaussian equals a constant plus the log of its standard deviation. We can apply this to the expression for mutual information, assuming that both Y and Y|X are Gaussian. If Y is standardized, it has a standard deviation of 1, while Y|X captures the residual uncertainty when estimating Y from X (the error) and hence has standard deviation σ_e, the standard deviation of the error term. Substituting into equation [9], we get:
\begin{align} I(X,Y) = H(Y) - H(Y|X) = [\text{constant} + \log(1)] - [\text{constant} + \log(\sigma_{e})] = -\log(\sigma_{e}) \tag{38} \end{align}
Now, we know that r-squared (R²) is a key measure in regression, representing the proportion of the variance in the target variable that is explained by the features/explanatory variables. The higher the value of R², the lower the error term, i.e. the smaller the loss of information while estimating Y from X. For a standardized target, R-squared can be written as:
\begin{align} R^{2} = 1 - \sigma_{e}^{2} \tag{39} \end{align}
This means σ_e can be written as √(1 - R²). Substituting this into the expression for mutual information above (equation [38]) gives:
\begin{align} I(X,Y) = \log\left(1/\sqrt{1-R^{2}}\right) \tag{40} \end{align} The above equation shows a direct relation between linear regression and mutual information: the better the linear fit, the larger the mutual information between X and Y. A short numerical check follows.
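The sketch below checks equation (40) numerically on synthetic, jointly Gaussian data: it fits a least-squares line, computes R², and compares log(1/√(1-R²)) with the closed-form Gaussian mutual information -0.5·log(1-ρ²). The data-generating parameters are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = 0.8 * x + 0.6 * rng.normal(size=n)     # standardized target: Var(y) is approximately 1

# least-squares fit and R^2
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
r2 = 1.0 - residuals.var() / y.var()

mi_from_r2 = np.log(1.0 / np.sqrt(1.0 - r2))   # equation (40), in nats
rho = np.corrcoef(x, y)[0, 1]
mi_gaussian = -0.5 * np.log(1.0 - rho ** 2)    # exact value for jointly Gaussian (X, Y)
print(mi_from_r2, mi_gaussian)                 # the two estimates agree closely
```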

Information Theory in Classification

We have covered cross-entropy loss and log loss as standard loss functions for many classification algorithms. These loss functions are used in binary classification as well as in multiclass classification.
These loss functions can be used to refine the performance of your classification algorithms. You can refer to equation [22] for the generic form of the cross-entropy loss; for binary classification it reduces to the log loss function, sketched below.
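A minimal sketch of the binary log loss; the clipping constant is an implementation detail added here to avoid log(0) and is not part of the definition.

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-12):
    """Binary log loss: -(1/N) * sum[y*log(p) + (1-y)*log(1-p)]."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

print(log_loss([1, 0, 1, 1], [0.9, 0.1, 0.8, 0.6]))  # confident, correct predictions -> low loss
print(log_loss([1, 0, 1, 1], [0.2, 0.9, 0.3, 0.4]))  # wrong, confident predictions -> high loss
```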
We also saw how entropy is used to build decision trees: entropy and information gain are used to choose each split, which helps you find the most relevant feature at every node (a small sketch follows). KL divergence is yet another significant information measure used in classification.
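A hedged sketch of the information-gain computation for a single candidate split; the toy labels and the split point are invented for illustration.

```python
import numpy as np

def entropy(labels):
    # Shannon entropy (bits) of a vector of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent_labels, left_labels, right_labels):
    """IG = H(parent) - weighted average of the child entropies after the split."""
    n = len(parent_labels)
    weighted_children = (len(left_labels) / n) * entropy(left_labels) \
                      + (len(right_labels) / n) * entropy(right_labels)
    return entropy(parent_labels) - weighted_children

parent = np.array([0, 0, 0, 1, 1, 1, 1, 0])
left, right = parent[:4], parent[4:]   # a hypothetical split on some feature threshold
print(information_gain(parent, left, right))
```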

Information Theory in Clustering

Clustering is a kind of unsupervised learning problem in machine learning. The goal of (hard) clustering is to assign data instances to groups (clusters) such that instances in the same cluster exhibit similar properties. We saw in the earlier sections how the Information Bottleneck method can be used for clustering by maximizing the compression of individual data points into clusters while ensuring that the loss of information is small (i.e. the relevant mutual information is preserved).
Rate-distortion theory is also used in clustering. It finds application in grouping input data into clusters so that visualization and analysis are driven by these groups instead of the individual elements. If the original data is represented by x and the representative (centroid) of cluster c is denoted x_c, then each data element x ∈ X belongs to a cluster c with a conditional probability p(c|x). We can distinguish between two types of clustering:
  1. hard clustering, where each data element belongs to exactly one cluster and thus p(c|x) = 1 for that (correct) cluster and p(c|x) = 0 for all other clusters; and
  2. soft clustering, where each data element is assigned to every cluster with a certain probability (in general, different from 0). Clustering can then be seen as the process of finding class representatives x_c (or cluster centroids) such that the average distortion d(x, x_c) is small while the correlation between the original data and the clusters remains high (see the sketch after this list).
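Below is a hedged sketch of hard and soft assignments p(c|x) and of the resulting average distortion, assuming a squared-Euclidean distortion d(x, x_c); the synthetic data, the fixed centroids, and the inverse-temperature parameter β are illustrative choices, not part of any cited method.

```python
import numpy as np

def soft_assignments(X, centroids, beta=1.0):
    """Soft memberships p(c|x) derived from squared-distance distortion.
    Larger beta pushes the assignments towards hard clustering."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)   # d(x, x_c)
    logits = -beta * d
    logits -= logits.max(axis=1, keepdims=True)                      # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def average_distortion(X, centroids, p_c_given_x):
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return (p_c_given_x * d).sum(axis=1).mean()

X = np.random.default_rng(0).normal(size=(100, 2))
centroids = np.array([[-1.0, 0.0], [1.0, 0.0]])
p_soft = soft_assignments(X, centroids, beta=2.0)
# hard clustering: p(c|x) = 1 for the nearest centroid and 0 otherwise
p_hard = (p_soft == p_soft.max(axis=1, keepdims=True)).astype(float)
print(average_distortion(X, centroids, p_hard), average_distortion(X, centroids, p_soft))
```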
Furthermore, we can use the concept of mutual information for information theoretic clustering.

Mutual Information Criterion for Information Theoretic Clustering

Let us assume that 'Y' represents the cluster identities of your data elements and 'X' represents the features or variables defining the attributes of these data points. The objective of clustering would then be to maximize the mutual information of X and Y, i.e. I(X,Y).
From equation [9], we know that I(X,Y) = H(X) - H(X|Y). The entropy of the features, H(X), does not depend on the clustering mechanism used, so to evaluate our clusters we only need the conditional entropy H(X|Y). The objective thus boils down to minimizing H(X|Y); a small sketch of this empirical computation is given below.
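Below is a minimal sketch of the empirical conditional entropy H(X|Y) for a single discretized feature and two candidate cluster assignments; the toy data and labelings are invented for illustration.

```python
import numpy as np

def entropy(values):
    # empirical Shannon entropy (bits) of a discrete sequence
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(x, y):
    """H(X|Y) = sum_y p(y) * H(X | Y = y), for a discretized feature x and cluster labels y."""
    x, y = np.asarray(x), np.asarray(y)
    h = 0.0
    for cluster in np.unique(y):
        mask = (y == cluster)
        h += mask.mean() * entropy(x[mask])
    return h

x = np.array([0, 0, 1, 1, 2, 2, 0, 1])       # a discretized feature
good = np.array([0, 0, 1, 1, 2, 2, 0, 1])    # clusters aligned with the feature
bad = np.array([0, 1, 0, 1, 0, 1, 0, 1])     # clusters unrelated to the feature
print(conditional_entropy(x, good), conditional_entropy(x, bad))  # lower is better
```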

Conclusion

The use of information theory extends beyond the applications described in this paper. Information theory can also be used in applications such as selecting distance metrics for algorithms like KNN, constructing factor graphs for Bayesian networks, and so on.
There are basically two modes of applying information theory in machine learning:
  1. Apply fundamental information measures and concepts such as entropy, mutual information and KL divergence as objective functions or regularization terms in an optimization problem. This helps improve the performance of an existing algorithm or model.
  2. Develop new algorithms and techniques using concepts from more sophisticated information theory, such as rate-distortion theory and coding theory. This can provide additional insight into existing machine learning techniques and, in the process, lead to improved versions of existing algorithms.

References

  1. <>
    . Retrieved from http://colah.github.io/posts/2015-09-Visual-Information/
  2. Stone, James V;
    <>
    . arXiv:1802.05968v3
  3. <>
    . Retrieved from https://dibyaghosh.com/blog/probability/kldivergence.html
  4. <>
    . Retrieved from https://www.inference.org.uk/itprnn/book.pdf
  5. Lin, Tsung-Yi; Goyal, Priya; Girshick, Ross; He, Kaiming; Dollar, Piotr;
    <>
    . arXiv:Physics/0004057v1.
  6. <>
    https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0
  7. <>
    . Retrieved from https://medium.com/coinmonks/what-is-entropy-and-why-information-gain-is-matter-4e85d46d2f01
  8. Tishby, Naftali; Pereira, Fernando C.; Bialek, William
    <>
    . arXiv:Physics/0004057v1.
  9. Rooyen B. V.; Williamson, R;
    <>
    . ArXiv abs/1504.00083 (2015)
  10. <>
    . Retrieved from https://www.johndcook.com/blog/2018/11/21/renyi-entropy/
  11. <>
    . Retrieved from https://www.r-bloggers.com/how-do-i-interpret-the-aic/
  12. <>
    . Retrieved from https://faculty.psy.ohio-state.edu/myung/personal/model selection tutorial.pdf
  13. Wei, Xiaokai;
    <>
    . 2011
  14. Kenneth E. Hild, II; Deniz Erdogmus; Kari Torkkola, and Jose C. Principe;
    <>
    . IEEE Trans Pattern Anal Mach Intell. 2006 Sep; 28(9): 1385–1392.
  15. Wang H, Chen P.
    <>
    Sensors (Basel). 2009;9(4):2415-36. doi: 10.3390/s90402415. Epub 2009 Apr 1. PMID: 22574021; PMCID: PMC3348826.
  16. Kumar Gulshan, Kumar Krishan
    <>
    <>
    Security Comm. Networks 2012; 5:178–185
