High Dimension Data Analysis - A tutorial and review for Dimensionality Reduction Techniques

In the current day and age, high dimensional datasets are a common occurrence in most areas of study, with the high volume and variety of data that streams in. Research in the field of deep learning and machine learning have led to development of high efficiency models and algorithms that can deal with high volume of high dimensional datasets. However, there are techniques to engineer some significant variables and reduce the dimensionality by either eliminating other redundant variables or by creating new variables by combining a few existing ones. High dimensionality poses a few problems and hence need to be dealt with. Some of these issues are:
  • High number of variables lead to higher computational and time complexity of machine learning models. Reducing the dimensionality can improve the efficiency of these models.
  • High number of variables in a dataset, when fed to a machine learning model can lead to overfitting, thus reducing the generalizability of the machine learning models
  • High number of variables in a dataset may make it noisy and increases ambiguity. By reducing the dimensionality and selecting features, we can improve the information content of the dataset.
  • Lower dimensionality of a dataset helps in visualizing it better.
  • For certain models like regression models, high number of variables may include some variables which have a high correlation with the target variable. This poses the problem of multicollinearity.
  • In this article, we investigate and study a few methods for selecting features from a high dimensional dataset, and thus reduce the dimensionality of the datasets.

    1. Feature Selection from base features

    Some techniques as described below are used to select features from the available base features in the dataset. These techniques do not alter the structure or definition of these variables, but simply select a subset of them.

    1.1. Low Variance and High Correlation Filters

    Statistically, variance is the measure of spread for a distribution of a random variable that determines the degree to which the values of a random variable differ from the expected value. In other words, it tells us how far the points are from the mean. The formula to variance is given by:
    σ 2 = Σ ( x x ¯ ) 2 n σ 2 = Σ ( x x ¯ ) 2 n sigma^(2)=(Sigma(x-( bar(x)))^(2))/(n)\sigma^{2}=\frac{\Sigma(x-\bar{x})^{2}}{n}
    If the variance is closer to zero, that means the variable is fairly constant and does not spread too far away from the mean. A variable with variance 0 means all the values of the variable are constant i.e. all values are the same.
    For a variable in the context of a dataset being prepared for machine learning, a variable with variance close to zero contains little to no information about the traget variable. This means it contains little impact on the target variable, and thus very little impact on the model's predictive ability.
    Since variance is dependent on the range of the variable, it is important to scale the variables, especially numerical variables before computing the variance. After the variances have been computed for all the scaled variables, you can choose a threshold for the variance. All variables with variance below the theshold can be dropped.
    For categorical variables, variables with few unique values but with more than 95% of the values belonging to a specific category can be dropped. For instance, if 95% of the records in a dataset belong to one particular country, say US, it is best to drop this variable as the information of the country would not improve or affect the performance of the model trained on this dataset.
    Boolean features are Bernoulli random variables, and the variance of such variables is given by:
    Var [ X ] = p ( 1 p ) Var [ X ] = p ( 1 p ) Var[X]=p(1-p)\operatorname{Var}[X]=p(1-p) For Boolean variables, we would want to remove all features that are either one or zero (true or false) in more than a certain threshold. For instance, if a variable has the value 'True' for 80% of the samples or more, we drop this variable. This threshold would be achieves using the above formula ( 0.8 ( 1 0.8 ) ) ( 0.8 ( 1 0.8 ) ) (0.8**(1-0.8))(0.8*(1-0.8)).
    A similar feature-selection criteria can be made on the basis of the correlation of the existing features. Multiple features which are highly correlated among themeselves can lead to multicollinearity in a model. In models, especially regression models, if a feature is a linear function of 1 or more features in the dataset, it makes it difficult to estimate the relationship between each independent variable and the dependent variable independently. Thus, it is important to eliminate multicollinearity by retaining only one of the highly correlated features.
    We use Pearson's correlation to find the correlation between numerical variables. To find correlations between categorical features, we can use the chi-squared test. The chi-squared statistical test considers 2 hypotheses:
  • H0 (Null Hypothesis) = The 2 variables to be compared are independent.
  • H1 (Alternate Hypothesis) = The 2 variables are dependent. Now, if the p-value obtained after conducting the test is less than 0.05, we reject the Null hypothesis and accept the Alternate hypothesis. If the p-value is greater that 0.05 we accept the Null hypothesis and reject the Alternate hypothesis i.e. it implies that the 2 variables are independent. A p-value of 0 would indicate that we can accept the alternate hypothesis with high confidence and that the two variables are highly correlated.
  • For determining the correlation between a categorical and a numerical variable, we can use correlation ratio ( η η eta\eta) which is defined as the weighted variance of the mean of each category divided by the variance of all samples. It helps us answer a simple question: Given a continuous number, how well can you know to which category it belongs to? The score has a range of [ 0 , 1 ] [ 0 , 1 ] [0,1][0,1] where a higher score indicates higher correlation. Using these scores, we can eliminate either of the variables in a pair that is highly correlated. One way of selecting a feature to retain, is to check the correlation of these features with the target. To train a model with good predictive ability, one should select the feature with a higher correlation with the target variable.

    1.2. Recursive Feature Elimination (RFE)

    Algorithm 1: Recursive Feature Elimination
    1.1 1.1 1.1\mathrm{1.1} Tune/Train the model on the training set using predictors S i , i = 1 , . . , S S i , i = 1 , . . , S S_(i),i=1,..,S\mathrm{S_{i}}, i=1,..,S
    1.2 1.2 1.2\mathrm{1.2} Determine the appropriate number of features in the final model ( n ) ( n ) (n)(n) and the subset size (number of features to be removed in each step: t t tt)
    1.3 1.3 1.3\mathrm{1.3} While i > n i > n i > ni > n do:
    1.3 .1 1.3 .1 1.3.1\mathrm{1.3.1} Calculate the importance of the predictors
    1.3 .2 1.3 .2 1.3.2\mathrm{1.3.2} Drop lowest t t tt features on the basis of lowest importance score
    1.3 .3 1.3 .3 1.3.3\mathrm{1.3.3} Tune/Train the model on the training set using the S i t S i t S_(i-t)S_{i-t} predictors
    1.3 .4 1.3 .4 1.3.4\mathrm{1.3.4} [Optional] Calculate model performance
    1.4 1.4 1.4\mathrm{1.4} end
    1.8 1.8 1.8\mathrm{1.8} Train the model using the optimal S n S n S_(n)S_{n} features
    Recursive Feature Elimination is wrapper-type feature selection technique. This means that a different machine learning algorithm is given and used in the core of the method, wrapped by RFE to help select features. Unlike filter-based feature selection methods that score each feature and select those features with the largest (or smallest) score, this method stops when it reaches a stopping criterion.
    In RFE, we aim to select features by recursively considering smaller sets of features in each iteration. First, the estimator is trained on the initial set of features, often the entire set of predictors and the importance score of each feature is computed either through any specific attribute like feature importance, p-values, variable coefficients or a callable. Then, the least important features are pruned from current set of features, the model is re-built, and importance scores are computed again. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
    One can specify the number of predictor subsets to evaluate as well as each subset’s size. Therefore, the variable subset size is a tuning parameter for RFE. The subset size that optimizes the performance criteria is used to select the predictors based on the importance rankings. The optimal subset is then used to train the final model.
    One important thing to note, is that the RFE method cannot be used with all models. RFE requires that the initial model uses the full predictor set. Hence, RFE cannot be used with some models like multiple linear regression, logistic regression, and linear discriminant analysis, when the number of predictors exceeds the number of samples. To use RFE in these cases, the number of predictors must first be reduced.

    1.3. Sequential Feature Selection (SFS)

    Sequential Feature Selection or SFS is another wrapper-method for feature selection. SFS can be either forward or backward. You can specify the desired number of features. The feature selection algorithm removes or adds one feature at a time based on the model performance until a feature subset of the desired size k k kk is reached.
    Forward-SFS is a greedy algorithm that iteratively finds the best new feature to add to the set of selected features. We initially start with zero features and find the one feature that maximizes a cross-validated score when an estimator is trained on this single feature. Once that first feature is selected, we repeat the procedure by adding a new feature to the set of selected features. The procedure stops when the desired number of selected features ( k ) ( k ) (k)(k) is reached.
    Backward-SFS follows the same idea but works in the opposite direction: instead of starting with no feature and greedily adding features, we start with all the features and greedily remove features from the set. The direction parameter controls whether forward or backward SFS is used.
    Algorithm 2: Backward SFS
    1.1 1.1 1.1\mathrm{1.1} Define the desired number of features
    1.2 1.2 1.2\mathrm{1.2} Tune/Train the model on the training set using all predictors
    1.3 1.3 1.3\mathrm{1.3} Compute Model Performance
    1.4 1.4 1.4\mathrm{1.4} Calculate the importance of the predictors
    1.5 1.5 1.5\mathrm{1.5} For each each subset of variables S i , S i , S_(i),\mathrm{S_{i},} i = 1 , . . . , S i = 1 , . . . , S i=1,...,Si=1,...,S do:
    1.5 .1 1.5 .1 1.5.1\mathrm{1.5.1} Greedily drop the feature responsible for the worst model performance at the given point; retain S i S i S_(i)S_{i} predictors
    1.5 .2 1.5 .2 1.5.2\mathrm{1.5.2} Tune/Train the model on the training set using the remaining S i S i S_(i)S_{i} predictors
    1.5 .3 1.5 .3 1.5.3\mathrm{1.5.3} Calculate model performance
    1.6 1.6 1.6\mathrm{1.6} end
    1.7 1.7 1.7\mathrm{1.7} Calculate the performance profile over the final S i S i S_(i)S_{i} features
    1.8 1.8 1.8\mathrm{1.8} Train the model using the optimal S i S i S_(i)S_{i} features
    While the forward SFS algorithm is easier to understand, there are minor diferences between backward SFS and RFE. SFS does not require the underlying model to expose variable coefficients or importance scores. The iterative step makes a greedy decision on the basis of the model performance. This algorithm includes a model selection on the basis of various parameters like Adj RSquared, Mallow's Cp, BIC or AIC, Accuracy or even Root Mean Squared Error (RMSE), depending on the nature of the model.
    SFS may however be slower considering that more models need to be evaluated, compared to RFE. For example in backward selection, the iteration going from m m mm features to m 1 m 1 m-1m - 1 features using k-fold cross-validation requires fitting m k m k m**km * k models, while RFE would require only a single fit.

    1.3.1. Definitions of some model-selection parameters

    AIC: The Akaike information criterion (AIC) is an estimator of out-of-sample prediction error and thereby relative quality of statistical models for a given set of data. It can be defined as:
    A I C = 2 ln ( L ) + 2 k A I C = 2 ln ( L ) + 2 k AIC=-2ln(L)+2kA I C=-2 \ln (L)+2 k
    Here, k k kk is the number of predictors or independent features and L L LL is the maximum value of the likelihood function for the model (The higher the number, the better the fit). We aim to minimize AIC, so the model with a lower AIC is selected as the better one.
    BIC: Bayesian Information Criteria (BIC) is a model selection parameter defined from Bayesian probability and inference. It can be defined as:
    B I C = k ln ( n ) 2 ln ( L ^ ) B I C = k ln ( n ) 2 ln ( L ^ ) BIC=k ln(n)-2ln( widehat(L))\mathrm{BIC}=k \ln (n)-2 \ln (\widehat{L})
    Here, ( L ^ ) ( L ^ ) ( widehat(L))(\widehat{L}) is the log-likelihood of the model, n n nn is the number of examples in the training dataset, and k k kk is the number of parameters in the model. Like AIC, we aim to minimize the BIC too i.e. we select the model with the lowest BIC. However, unlike AIC, BIC offers a higher penalty for complex models using the sample size of the dataset. If a model is trained on a dataset with a large sample size with relatively lower number of features, the model has a lower BIC, thus making it better than a model with smaller sample size and more features.
    MDL: The Minimum Description Length, or MDL for short, is a method for scoring and selecting a model. It is an information theoretic measure of model selection. It represents the minimum number of bits, or the minimum of the sum of the number of bits required to represent the data and the model and is defined as:
    M D L = L ( h ) + L ( D | h ) M D L = L ( h ) + L ( D | h ) MDL=L(h)+L(D|h)MDL = L(h) + L(D | h)
    Here, h h hh is the model, D D DD is the predictions made by the model, L ( h ) L ( h ) L(h)L(h) is the number of bits required to represent the model, and L ( D | h ) L ( D | h ) L(D|h)L(D | h) is the number of bits required to represent the predictions from the model on the training dataset.
    We also aim to reduce MDL, and hence the model with the lowest MDL is selected.

    1.4. Embedded Methods: L1 regularization for Feature Selection

    Let us talk about the two most common regularization techniques: L 1 L 1 L1L1 and L 2 L 2 L2L2. These regularization terms add a penalty that can be defined as:
    L1 norm penalty: α i = 1 n | w i | α i = 1 n | w i | alphasum_(i=1)^(n)|w_(i)|\alpha \sum_{i=1}^{n}|w_{i}| L2 norm penalty: α i = 1 n w i 2 α i = 1 n w i 2 alphasum_(i=1)^(n)w_(i)^(2)\alpha \sum_{i=1}^{n} w_{i}^{2}
    Here, L 1 L 1 L1L1 regularization, uses a penalty term which encourages the sum of the absolute values of the parameters to be small. L 2 L 2 L2L2 regularization, on the other hand, encourages the sum of the squares of the parameters to be small. It has frequently been observed that L 1 L 1 L1L1 regularization in many models causes many parameters to equal zero, so that the parameter vector is sparse. This makes it a natural candidate in feature selection settings, where we believe that many features should be ignored.
    image Graphical representation of L2 (left) and L1 (right) regularization
    In the above images, we see that L 1 L 1 L1L1 tends to generate sparser solutions than a quadratic regularizer like L 2 L 2 L2L2. In the formula defined fo L1-norm penalty, if we increase α α alpha\alpha further, the solution would become sparser and sparser, i.e. more and more features would have 0 as coefficients. The orange square area in the graph above will shrink to a point corresponding to 0.
    Thus, including a regularization term like L1-regularization with a machine learning model like linear regression to perform an embedded feature selection. Lasso regression uses L1 regularization on top of linear regression. It is like linear regression, but it uses a technique "shrinkage" where the coefficients of determination are shrunk towards zero.

    1.5. Tree-based Feature Selection

    Tree-based models can be used to compute impurity-based feature importances, which in turn can be used to discard irrelevant features. Let us consider the Random Forest models which computes importance scores for each feature based on the ‘Gini’ criterion during the training process. The Gini index is defined as follows:
    Gini ( D ) = 1 i = 1 k p i 2 Gini ( D ) = 1 i = 1 k p i 2 Gini(D)=1-sum_(i=1)^(k)p_(i)^(2)\operatorname{Gini}(D)=1-\sum_{i=1}^{k} p_{i}^{2}
    Here, given a dataset D D DD that contains samples from k k kk classes, the probability of samples belonging to class i i ii at a given node can be denoted as p i p i p_(i)p_{i}. The measure above tells us the probability of misclassifying an observation. Lower the Gini impurity, the better is the split. At every split ofa node in tree-based models, an attribute with the smallest Gini Impurity is selected for splitting the node.
    In other words, a lower Gini impurity index leads to a lower likelihood of misclassification. We can use impurity-based feature importances to discard irrelevant features using a meta-transformer for selecting features based on importance weights.

    2. Feature Extraction using Transformation/ Combination of Features

    There are a few methods that transform existing variables in the base dataset and/or combine them with other features to reduce the dimensionality while retaining maximum information/variance in the dataset.
    These methods can be classifed as linear or non-linear methods. The linear methods involve linearly projecting the original data onto a low-dimensional space and extract the transformed features. Some of these methods involve Principal Component Analysis (PCA), Factor Analysis (FA), Linear Discriminant Analysis (LDA) and Truncated Singular Value Decomposition (Truncated SVD). We will also cover some non-linear methods which work better with non-linear data.Some of those methods are: Kernel PCA, t-distributed Stochastic Neighbor Embedding (t-SNE), Multidimensional Scaling (MDS), and Isometric mapping (Isomap)

    2.1. Principal Component Analysis (PCA)

    In a dataset with large number of variables, it can be difficult to study and interpret the relationships between these variables. There can be too many pairwise correlations between the variables to consider. Moreover, it is difficult to visualize these variables and relationships.
    To interpret the data in a more meaningful form, it is necessary to reduce the number of variables to a few, interpretable linear combinations of the data. Each linear combination will correspond to a principal component. This is the essence of PCA.
    Let us assume a random vector X X XX:
    X = ( X 1 X 2 X p ) X = X 1 X 2 X p "X"=([X_(1)],[X_(2)],[vdots],[X_(p)])\textbf{X} = \left(\begin{array}{c} X_1\\ X_2\\ \vdots \\X_p\end{array}\right)
    X X XX represents all the features, represented by each vector X 1 , X 2 , . . . , X p X 1 , X 2 , . . . , X p X_(1),X_(2),...,X_(p)X_1, X_2,...,X_p in a dataset. If we compute the variance-covariance matrix of the vectors, it looks like:
    var ( X ) = Σ = ( σ 1 2 σ 12 σ 1 p σ 21 σ 2 2 σ 2 p σ p 1 σ p 2 σ p 2 ) var ( X ) = Σ = σ 1 2 σ 12 σ 1 p σ 21 σ 2 2 σ 2 p σ p 1 σ p 2 σ p 2 "var"("X")=Sigma=([sigma_(1)^(2),sigma_(12),dots,sigma_(1p)],[sigma_(21),sigma_(2)^(2),dots,sigma_(2p)],[vdots,vdots,ddots,vdots],[sigma_(p1),sigma_(p2),dots,sigma_(p)^(2)])\text{var}(\textbf{X}) = \Sigma = \left(\begin{array}{cccc}\sigma^2_1 & \sigma_{12} & \dots &\sigma_{1p}\\ \sigma_{21} & \sigma^2_2 & \dots &\sigma_{2p}\\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \dots & \sigma^2_p\end{array}\right)
    If we consider the linear combinations as below:
    Y 1 = e 11 X 1 + e 12 X 2 + + e 1 p X p Y 2 = e 21 X 1 + e 22 X 2 + + e 2 p X p Y p = e p 1 X 1 + e p 2 X 2 + + e p p X p Y 1      =      e 11 X 1 + e 12 X 2 + + e 1 p X p Y 2      =      e 21 X 1 + e 22 X 2 + + e 2 p X p           Y p      =      e p 1 X 1 + e p 2 X 2 + + e p p X p {:[Y_(1),=,e_(11)X_(1)+e_(12)X_(2)+cdots+e_(1p)X_(p)],[Y_(2),=,e_(21)X_(1)+e_(22)X_(2)+cdots+e_(2p)X_(p)],[,,vdots],[Y_(p),=,e_(p1)X_(1)+e_(p2)X_(2)+cdots+e_(pp)X_(p)]:}\begin{array}{lll} Y_1 & = & e_{11}X_1 + e_{12}X_2 + \dots + e_{1p}X_p \\ Y_2 & = & e_{21}X_1 + e_{22}X_2 + \dots + e_{2p}X_p \\ & & \vdots \\ Y_p & = & e_{p1}X_1 + e_{p2}X_2 + \dots +e_{pp}X_p\end{array}
    Each of the above equations represent the Y i Y i Y_(i)Y_i variables as a function of the random data X X XX. Here each component Y i Y i Y_(i)Y_i represents a principal component, a combination of several features. The equations can be seen as linear regressions where you can predict Y i Y i Y_(i)Y_{i} from X 1 , X 2 , . . . . , X p X 1 , X 2 , . . . . , X p X_(1),X_(2),....,X_(p)X_1, X_2,....,X_p with coefficients e p 1 , e p 2 , . . . , e p p e p 1 , e p 2 , . . . , e p p e_(p1),e_(p2),...,e_(pp)e_{p1}, e_{p2},...,e_{pp}. Since Y i Y i Y_(i)Y_i is function of the random data X X XX, its variance can be represented as:
    var ( Y i ) = k = 1 p l = 1 p e i k e i l σ k l = e i Σ e i var ( Y i ) = k = 1 p l = 1 p e i k e i l σ k l = e i Σ e i "var"(Y_(i))=sum_(k=1)^(p)sum_(l=1)^(p)e_(ik)e_(il)sigma_(kl)=e_(i)^(')Sigmae_(i)\text{var}(Y_i) = \sum\limits_{k=1}^{p}\sum\limits_{l=1}^{p}e_{ik}e_{il}\sigma_{kl} = \mathbf{e}'_i\Sigma\mathbf{e}_i
    and the covariance between Y i Y i Y_(i)Y_i and Y j Y j Y_(j)Y_j can be defined as:
    cov ( Y i , Y j ) = k = 1 p l = 1 p e i k e j l σ k l = e i Σ e j cov ( Y i , Y j ) = k = 1 p l = 1 p e i k e j l σ k l = e i Σ e j "cov"(Y_(i),Y_(j))=sum_(k=1)^(p)sum_(l=1)^(p)e_(ik)e_(jl)sigma_(kl)=e_(i)^(')Sigmae_(j)\text{cov}(Y_i, Y_j) = \sum\limits_{k=1}^{p}\sum\limits_{l=1}^{p}e_{ik}e_{jl}\sigma_{kl} = \mathbf{e}'_i\Sigma\mathbf{e}_j
    Now, the coefficients can be represented as a vector:
    e i = ( e i 1 e i 2 e i p ) e i = e i 1 e i 2 e i p e_(i)=([e_(i1)],[e_(i2)],[vdots],[e_(ip)])\mathbf{e}_i = \left(\begin{array}{c} e_{i1}\\ e_{i2}\\ \vdots \\ e_{ip}\end{array}\right)
    Now, if we look at the first Principal Component: Y 1 Y 1 Y_(1)Y_1, it is the linear combination of the x-variables that has maximum variance (among all linear combinations). It accounts for as much variation in the data as possible. Hence the coefficients e 11 , e 12 , , e 1 p e 11 , e 12 , , e 1 p e_(11,)e_(12),dots,e_(1p)\boldsymbol { e } _ { 11 , } \boldsymbol { e } _ { 12 } , \ldots , \boldsymbol { e } _ { 1 p } are defined such that variance is maximized, given the constraint that the sum of the squared coefficients is equal to one. This constraint is required to obtain a unique answer. This means that the below variance is maximized:
    var ( Y 1 ) = k = 1 p l = 1 p e 1 k e 1 l σ k l = e 1 Σ e 1 var ( Y 1 ) = k = 1 p l = 1 p e 1 k e 1 l σ k l = e 1 Σ e 1 "var"(Y_(1))=sum_(k=1)^(p)sum_(l=1)^(p)e_(1k)e_(1l)sigma_(kl)=e_(1)^(')Sigmae_(1)\text{var}(Y_1) = \sum\limits_{k=1}^{p}\sum\limits_{l=1}^{p}e_{1k}e_{1l}\sigma_{kl} = \mathbf{e}'_1\Sigma\mathbf{e}_1
    subject to the constraint:
    e 1 e 1 = j = 1 p e 1 j 2 = 1 e 1 e 1 = j = 1 p e 1 j 2 = 1 e_(1)^(')e_(1)=sum_(j=1)^(p)e_(1j)^(2)=1\mathbf{e}'_1\mathbf{e}_1 = \sum\limits_{j=1}^{p}e^2_{1j} = 1
    Similarly, the second principal component is the linear combination of x-variables that explains the maximum proportion of the remaining variation, with the constraint that the sum of the squared coefficients is equal to one and the additional constraint that the correlation between the first and second component is 0. It implies that the below variance is maximized:
    var ( Y 2 ) = k = 1 p l = 1 p e 2 k e 2 l σ k l = e 2 Σ e 2 var ( Y 2 ) = k = 1 p l = 1 p e 2 k e 2 l σ k l = e 2 Σ e 2 "var"(Y_(2))=sum_(k=1)^(p)sum_(l=1)^(p)e_(2k)e_(2l)sigma_(kl)=e_(2)^(')Sigmae_(2)\text{var}(Y_2) = \sum\limits_{k=1}^{p}\sum\limits_{l=1}^{p}e_{2k}e_{2l}\sigma_{kl} = \mathbf{e}'_2\Sigma\mathbf{e}_2
    given the first and second principal components are uncorrelated:
    cov ( Y 1 , Y 2 ) = k = 1 p l = 1 p e 1 k e 2 l σ k l = e 1 Σ e 2 = 0 cov ( Y 1 , Y 2 ) = k = 1 p l = 1 p e 1 k e 2 l σ k l = e 1 Σ e 2 = 0 "cov"(Y_(1),Y_(2))=sum_(k=1)^(p)sum_(l=1)^(p)e_(1k)e_(2l)sigma_(kl)=e_(1)^(')Sigmae_(2)=0\text{cov}(Y_1, Y_2) = \sum\limits_{k=1}^{p}\sum\limits_{l=1}^{p}e_{1k}e_{2l}\sigma_{kl} = \mathbf{e}'_1\Sigma\mathbf{e}_2 = 0
    All subsequent principal components have this same property – they are linear combinations that account for as much of the remaining variation as possible and they are not correlated with the other principal components. Moreover, all principal components are uncorrelated with one another.
    To find coefficients e i j e i j e_(ij)e_{ij} for a Principal Component, we need to calculate the eigenvalues and eigenvectors of the variance-covariance matrix Σ Σ Sigma\Sigma. If the eigenvalues of the variance-covariance matrix are repesented by λ 1 , λ 2 , . . . , λ p λ 1 , λ 2 , . . . , λ p lambda_(1),lambda_(2),...,lambda _(p)\lambda_1, \lambda_2, ... , \lambda_p such that:
    λ 1 λ 2 λ p λ 1 λ 2 λ p lambda_(1) >= lambda_(2) >= cdots >= lambda _(p)\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p
    The corresponding eigenvectors can be represented as e 1 , e 2 , , e p e 1 , e 2 , , e p e_(1),e_(2),dots,e_(p)\boldsymbol{e}_1 , \boldsymbol{e}_2 , \dots , \boldsymbol{e}_p. It can be proved that the elements for these eigenvectors are the coefficients of our principal components.
    Since principal components are a combination of multiple variables, interpreting it can be tough. To do so, we can look at the correlation between the principal components and the individual variables. The direction and the magnitude of the correlation score with each of the variables tells us how each principal component can be interpreted. Say a particular principal component has a strong correlation of 0.95 with a variable X 1 X 1 X_(1)X_1, followed by strong correlation with 2 other variables [ X m , X n ] [ X m , X n ] [X_(m),X_(n)][X_m, X_n], we can say that this PC is predominantly a measure and a function of X 1 X 1 X_(1)X_1. If a PC has a strong negative correlation with a variable X k X k X_(k)X_k, it means the PC increases with a decreasing X k X k X_(k)X_k. Thus, it is a negative function of that variable.
    It is important to note that one should perform feature scaling before running PCA especially, if there is a significant difference in the scale between the features of the dataset. One can define the number of principal components to be extracted from a dataset. This could be defined using the proportion of the explained variance by each of these principal components.

    2.2. Factor Analysis (FA)

    Factor Analysis (FA) is a dimensionality reduction technique that also helps extract latent variables which are not directly measured in a single variable but rather inferred from other variables in the dataset. These latent variables are called factors. It models observed variables, and their covariance structure, in terms of a smaller number of these latent unobserved factors.
    To understand the algorithm, let us start with a vector X X XX containing all variables for all samples:
    X = ( X 1 X 2 X p ) X = X 1 X 2 X p "X"=([X_(1)],[X_(2)],[vdots],[X_(p)])\textbf{X} = \left(\begin{array}{c} X_1\\ X_2\\ \vdots \\X_p\end{array}\right)
    This is a random vector, with a population mean. Assume that vector of traits X X XX is sampled from a population with population mean vector:
    μ = ( μ 1 μ 2 μ p ) = population mean vector μ = μ 1 μ 2 μ p = population mean vector mu=([mu_(1)],[mu_(2)],[vdots],[mu _(p)])="population mean vector"\boldsymbol{\mu} = \left(\begin{array}{c}\mu_1\\\mu_2\\\vdots\\\mu_p\end{array}\right) = \text{population mean vector}
    Here, E ( X i ) = μ i E X i = μ i E(X_(i))=mu_(i)\mathrm { E } \left( X _ { i } \right) = \mu _ { i } denotes the population mean of variable i i ii. If we consider m m mm latent common factors f 1 , f 2 , , f m f 1 , f 2 , , f m f_(1),f_(2),dots,f_(m)f _ { 1 } , f _ { 2 } , \dots , f _ { m }, such that m << p m << p m<<pm << p, then the common factors can be represented as :
    f = ( f 1 f 2 f m ) = vector of common factors f = f 1 f 2 f m = vector of common factors f=([f_(1)],[f_(2)],[vdots],[f_(m)])="vector of common factors"\mathbf{f} = \left(\begin{array}{c}f_1\\f_2\\\vdots\\f_m\end{array}\right) = \text{vector of common factors}
    Now, we represent each variable as a regression function of these factors as follows:
    (1) X 1 = μ 1 + l 11 f 1 + l 12 f 2 + + l 1 m f m + ϵ 1 (2) X 2 = μ 2 + l 21 f 1 + l 22 f 2 + + l 2 m f m + ϵ 2 (3) (4) X p = μ p + l p 1 f 1 + l p 2 f 2 + + l p m f m + ϵ p (1) X 1 = μ 1 + l 11 f 1 + l 12 f 2 + + l 1 m f m + ϵ 1 (2) X 2 = μ 2 + l 21 f 1 + l 22 f 2 + + l 2 m f m + ϵ 2 (3) (4) X p = μ p + l p 1 f 1 + l p 2 f 2 + + l p m f m + ϵ p {:(1)X_(1)=mu_(1)+l_(11)f_(1)+l_(12)f_(2)+cdots+l_(1m)f_(m)+epsilon_(1),(2)X_(2)=mu_(2)+l_(21)f_(1)+l_(22)f_(2)+cdots+l_(2m)f_(m)+epsilon_(2),(3)vdots,(4)X_(p)=mu _(p)+l_(p1)f_(1)+l_(p2)f_(2)+cdots+l_(pm)f_(m)+epsilon _(p):}\begin{align} X_1 & = \mu_1 + l_{11}f_1 + l_{12}f_2 + \dots + l_{1m}f_m + \epsilon_1\\ X_2 & = \mu_2 + l_{21}f_1 + l_{22}f_2 + \dots + l_{2m}f_m + \epsilon_2 \\ & \vdots \\ X_p & = \mu_p + l_{p1}f_1 + l_{p2}f_2 + \dots + l_{pm}f_m + \epsilon_p \end{align}
    Here, the variable means μ 1 , . . . , μ p μ 1 , . . . , μ p mu_(1),...,mu_(p)\mu_{1}, ... , \mu_{p} can be regarded as the intercepts for the multiple regression models defined above. The regression coefficients l i j l i j l_(ij)l_{ij} are called factor loadings and can be collected into a matrix like:
    L = ( l 11 l 12 l 1 m l 21 l 22 l 2 m l p 1 l p 2 l p m ) = matrix of factor loadings L = l 11 l 12 l 1 m l 21 l 22 l 2 m l p 1 l p 2 l p m = matrix of factor loadings L=([l_(11),l_(12),dots,l_(1m)],[l_(21),l_(22),dots,l_(2m)],[vdots,vdots,,vdots],[l_(p1),l_(p2),dots,l_(pm)])="matrix of factor loadings"\mathbf{L} = \left(\begin{array}{cccc}l_{11}& l_{12}& \dots & l_{1m}\\l_{21} & l_{22} & \dots & l_{2m}\\ \vdots & \vdots & & \vdots \\l_{p1} & l_{p2} & \dots & l_{pm}\end{array}\right) = \text{matrix of factor loadings}
    Moreover, the error terms ( ϵ i ) ( ϵ i ) (epsilon _(i))(\epsilon_i) called the specific factors. The vector for the same can be represented as:
    ϵ = ( ϵ 1 ϵ 2 ϵ p ) = vector of specific factors ϵ = ϵ 1 ϵ 2 ϵ p = vector of specific factors epsilon=([epsilon_(1)],[epsilon_(2)],[vdots],[epsilon _(p)])="vector of specific factors"\boldsymbol{\epsilon} = \left(\begin{array}{c}\epsilon_1\\\epsilon_2\\\vdots\\\epsilon_p\end{array}\right) = \text{vector of specific factors}
    Each of our response variables X X XX is predicted as a linear function of the unobserved common factors $f_1,f_2,...,f_m $. We have m m mm unobserved factors that control the variation among our data.
    The factor model has a few assumptions though:
    1. The specific factors or random errors all have mean zero.
    2. The common factors have variance one.
    3. The common factors are uncorrelated with each other.
    4. The specific factors are uncorrelated with each other.
    5. The specific factors are uncorrelated with the common factors.
    We can interpret factor analysis as an inversion of principal components analysis. While PCA represents new components as a function of combination of observed variables, FA models the observed variables as linear functions of the new unobserved variables called factors. However, both PCA and FA reduce the dimension of the dataset.

    2.3. Linear Discriminant Analysis

    Linear Discriminant Analysis, or LDA, is a linear machine learning algorithm used for multi-class classification. It can also be used for dimensionality reduction.
    Let us assume a dataset with K K KK classes (for the target variable). LDA helps us reduce the number of features to K 1 K 1 K-1K-1 classes.
    The aim of LDA is to:
  • Minimize the Inter-Class Variability: This can be done by including as many similar points within a particular class. This will ensure less number of misclassifications.
  • Maximize the Distance Between the Mean of Classes: It tries to keep the class means apart from each other and as far as possible. This increases the confidence during prediction.
  • Thus, LDA tries to separate (or discriminate) the samples in the training dataset by their class value. In this pursuit of dimensionality reduction, the model tries to find a linear combination of input variables that achieves the maximum separation for samples between classes (class centroids or means) and the minimum separation of samples within each class.
    To explain this better, let us compare LDA to PCA. PCA is an unsupervised machine learning method, that is it takes into account only the spectral data and its variance to reduce the number of features. It does not require the labels of the data samples.
    However, LDA makes use of the class labels to produce a dimensionality reduction that maximizes the distance between the classes. In other words, while PCA will find the projections that maximise the variance of the data irrespective of their classes, LDA will seek to maximise the distance between those classes while reducing the number of features.
    In the above image, we see how LDA projects the existing features onto an exis such that it separates the two classes in the projected space. The general overview of this algorithm looks like the following:
    Algorithm 3: Linear Discriminant Analysis
    1.1 1.1 1.1\mathrm{1.1} Compute the d-dimensional mean vector for the different classes from the dataset.
    1.2 1.2 1.2\mathrm{1.2} Compute the Scatter matrix (in between class and within the class scatter matrix)
    1.3 1.3 1.3\mathrm{1.3} Sort the Eigen Vector by decreasing Eigen Value and choosing k k kk eigenvector with the largest eigenvalue to from a d × k d × k d xx kd \times k dimensional matrix w w ww (where every column represent an eigenvector)
    1.4 1.4 1.4\mathrm{1.4} Used d × k d × k d xx kd \times k eigenvector matrix to transform the sample onto the new subspace.
    Although it looks like LDA can perform better than PCA in multi-class classification models, it is not always the case. The choice depends on the performance of the model built on the transformed dataset. One can also combine LDA and PCA as an ensemble to obtain transformed features.

    2.4. Truncated Singular Value Decomposition (SVD)

    The Singular-Value Decomposition, or SVD, is a matrix decomposition method for reducing a matrix to its constituent parts in order to make certain subsequent matrix calculations simpler. To understand Truncated SVD, let us briefly explain SVD.
    SVD is a method for low-dimensional representation of a high-dimensional matrix. In this process, it also makes it easy to eliminate the less important parts of that representation to produce an approximate representation with any desired number of dimensions. Thus, it is a dimensionality reduction technique.
    Let M M MM be an m × n m × n m xx nm × n matrix and rank r r rr. Now, we find matrices U U UU, Σ Σ SigmaΣ, and V V VV as shown in the image below:
    image Form of a singular-value decomposition
    It can be mathematically represented as:
    A m × n = U m × m . Σ m × n . V n × n T A m × n = U m × m . Σ m × n . V n × n T A_(m xx n)=U_(m xx m).Sigma_(m xx n).V_(n xx n)^(T)A_{m×n} = U_{m×m} . \Sigma_{m×n} . V^T_{n×n}
    Let us define the above matrices:
    1. U U UU is an m × r m × r m xx rm × r column-orthonormal matrix ; i.e., each of its columns is a unit vector and the dot product of any two columns is 0.
    2. V V VV is an n × r n × r n xx rn × r column-orthonormal matrix. Note that we always use V in its transposed form, so it is the rows of V T V T V^(T)V^T that are orthonormal.
    3. Σ Σ SigmaΣ is a diagonal matrix; that is, all elements not on the main diagonal are 0. The elements of Σ Σ SigmaΣ are called the singular values of M M MM.
    The key to understanding what SVD offers is in viewing the r r rr columns of U U UU, Σ Σ SigmaΣ, and V V VV as representing concepts that are hidden in the original matrix M M MM. To understand how it helps in dimensionality reduction, let us suppose we want to represent a very large matrix M M MM by its SVD components U U UU, Σ Σ SigmaΣ, and V V VV. However, these matrices are also too large to store conveniently. We can reduce the dimensionality of the three matrices by setting the smallest of the singular values to zero. If we set the s s ss smallest singular values to 0, then we can also eliminate the corresponding s s ss columns of U U UU and V V VV. We make the choice of setting the lowest singular values to zero when we reduce the number of dimensions minimizes the root-mean-square error (RMSE) between the original matrix M M MM and its approximation.
    In order to determine how many singular values need to be retained, it is a thumb-rule to retain enough singular values to make up 90% of the energy in Σ Σ SigmaΣ. This means, the sum of the squares of the retained singular values should be at least 90% of the sum of the squares of all the singular values.
    In truncated SVD, we take the k k kk largest singular values (such that 0 < k < r 0 < k < r 0 < k < r0<k<r) and their corresponding left and right singular vectors
    A m × n U t ( m × k ) , Σ t ( k × k ) , V t ( k × n ) T A m × n U t ( m × k ) , Σ t ( k × k ) , V t ( k × n ) T A_(m xx n)~~U_(t(m xx k)),Sigma_(t(k xx k)),V_(t(k xx n))^(T)A_{m×n}≈U_{t(m×k)},Σ_{t(k×k)},V^T_{t(k×n)}
    Contrary to PCA, this estimator does not center the data before computing the singular value decomposition. This means it can work with sparse matrices efficiently. Thus, we can used Truncated SVD as a dimensionality reduction technique for sparse datasets.

    2.5. Kernel PCA

    All the above methods are linear methods of dimensionality reduction. They work best for linear data. To be able to reduce the dimensions of non-linear data, we need non-linear methods. The first one discussed here is Kernel PCA. It is a non-linear version of PCA that uses kernels.
    image Linear and Non-linear dataset
    In the above figure, the data on the left is linearly separable. The one on the right is non-linear and needs a higher order polynomial to divide the complex boundary.
    In the conventional PCA algorithm, the process of matrix decomposition into eigenvectors is a linear transformation. Kernel PCA can be used to reduce the dimensions of data during classification of data whose decision boundaries are described by non-linear function. The idea is to go to a higher dimension space in which the decision boundary becomes linear. Let's say the decision boundary of the above data distribution is defined by a third order polynomial y = a + b x + c x 2 + d x 3 y = a + b x + c x 2 + d x 3 y=a+bx+cx^(2)+dx^(3)y = a + bx + cx^{2} + dx^{3}. If we plot this function on an x y x y x-yx-y plane, it will produce a wavy line, which could act as the decision boundary in the image above. Now if we project this to a higher-dimension space with axes x , x 2 , x 3 x , x 2 , x 3 x,x^(2),x^(3)x,x^{2}, x^{3} and y y yy. In this 4D space, the third order polynomial becomes a linear function and the decision boundary converts into a hyperplane. So, if we find a suitable transformation to convert our data into linear data, we can proceed to the usual PCA decomposition on the transformed data.
    Thus the algorithm can be simplified as below:
    Algorithm 4: Kernel PCA
    1.1 1.1 1.1\mathrm{1.1} Define the original set of n n nn features as x x x\mathrm{x}
    1.2 1.2 1.2\mathrm{1.2} Define ϕ ( x ) ϕ ( x ) phi(x)\phi(\mathbf{x}) as the non-linear combination (mapping) of these variables into a