High Dimension Data Analysis  A tutorial and review for Dimensionality Reduction Techniques
In the current day and age, high dimensional datasets are a common occurrence in most areas of study, with the high volume and variety of data that streams in. Research in the field of deep learning and machine learning have led to development of high efficiency models and algorithms that can deal with high volume of high dimensional datasets. However, there are techniques to engineer some significant variables and reduce the dimensionality by either eliminating other redundant variables or by creating new variables by combining a few existing ones. High dimensionality poses a few problems and hence need to be dealt with. Some of these issues are:
In this article, we investigate and study a few methods for selecting features from a high dimensional dataset, and thus reduce the dimensionality of the datasets.
1. Feature Selection from base features
Some techniques as described below are used to select features from the available base features in the dataset. These techniques do not alter the structure or definition of these variables, but simply select a subset of them.
1.1. Low Variance and High Correlation Filters
Statistically, variance is the measure of spread for a distribution of a random variable that determines the degree to which the values of a random variable differ from the expected value. In other words, it tells us how far the points are from the mean. The formula to variance is given by:
If the variance is closer to zero, that means the variable is fairly constant and does not spread too far away from the mean. A variable with variance 0 means all the values of the variable are constant i.e. all values are the same.
For a variable in the context of a dataset being prepared for machine learning, a variable with variance close to zero contains little to no information about the traget variable. This means it contains little impact on the target variable, and thus very little impact on the model's predictive ability.
Since variance is dependent on the range of the variable, it is important to scale the variables, especially numerical variables before computing the variance. After the variances have been computed for all the scaled variables, you can choose a threshold for the variance. All variables with variance below the theshold can be dropped.
For categorical variables, variables with few unique values but with more than 95% of the values belonging to a specific category can be dropped. For instance, if 95% of the records in a dataset belong to one particular country, say US, it is best to drop this variable as the information of the country would not improve or affect the performance of the model trained on this dataset.
Boolean features are Bernoulli random variables, and the variance of such variables is given by:
A similar featureselection criteria can be made on the basis of the correlation of the existing features. Multiple features which are highly correlated among themeselves can lead to multicollinearity in a model. In models, especially regression models, if a feature is a linear function of 1 or more features in the dataset, it makes it difficult to estimate the relationship between each independent variable and the dependent variable independently. Thus, it is important to eliminate multicollinearity by retaining only one of the highly correlated features.
We use Pearson's correlation to find the correlation between numerical variables. To find correlations between categorical features, we can use the chisquared test. The chisquared statistical test considers 2 hypotheses:
H0 (Null Hypothesis) = The 2 variables to be compared are independent. H1 (Alternate Hypothesis) = The 2 variables are dependent.
Now, if the pvalue obtained after conducting the test is less than 0.05, we reject the Null hypothesis and accept the Alternate hypothesis. If the pvalue is greater that 0.05 we accept the Null hypothesis and reject the Alternate hypothesis i.e. it implies that the 2 variables are independent. A pvalue of 0 would indicate that we can accept the alternate hypothesis with high confidence and that the two variables are highly correlated.
For determining the correlation between a categorical and a numerical variable, we can use correlation ratio ( ) which is defined as the weighted variance of the mean of each category divided by the variance of all samples. It helps us answer a simple question: Given a continuous number, how well can you know to which category it belongs to? The score has a range of where a higher score indicates higher correlation. Using these scores, we can eliminate either of the variables in a pair that is highly correlated. One way of selecting a feature to retain, is to check the correlation of these features with the target. To train a model with good predictive ability, one should select the feature with a higher correlation with the target variable.
1.2. Recursive Feature Elimination (RFE)
Algorithm 1: Recursive Feature Elimination  
Tune/Train the model on the training set using predictors 

Determine the appropriate number of features in the final model 

While 

Calculate the importance of the predictors  
Drop lowest 

Tune/Train the model on the training set using the 

[Optional] Calculate model performance  
end  
Train the model using the optimal 
Recursive Feature Elimination is wrappertype feature selection technique. This means that a different machine learning algorithm is given and used in the core of the method, wrapped by RFE to help select features. Unlike filterbased feature selection methods that score each feature and select those features with the largest (or smallest) score, this method stops when it reaches a stopping criterion.
In RFE, we aim to select features by recursively considering smaller sets of features in each iteration. First, the estimator is trained on the initial set of features, often the entire set of predictors and the importance score of each feature is computed either through any specific attribute like feature importance, pvalues, variable coefficients or a callable. Then, the least important features are pruned from current set of features, the model is rebuilt, and importance scores are computed again. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
One can specify the number of predictor subsets to evaluate as well as each subset’s size. Therefore, the variable subset size is a tuning parameter for RFE. The subset size that optimizes the performance criteria is used to select the predictors based on the importance rankings. The optimal subset is then used to train the final model.
One important thing to note, is that the RFE method cannot be used with all models. RFE requires that the initial model uses the full predictor set. Hence, RFE cannot be used with some models like multiple linear regression, logistic regression, and linear discriminant analysis, when the number of predictors exceeds the number of samples. To use RFE in these cases, the number of predictors must first be reduced.
1.3. Sequential Feature Selection (SFS)
Sequential Feature Selection or SFS is another wrappermethod for feature selection. SFS can be either forward or backward. You can specify the desired number of features. The feature selection algorithm removes or adds one feature at a time based on the model performance until a feature subset of the desired size is reached.
ForwardSFS is a greedy algorithm that iteratively finds the best new feature to add to the set of selected features. We initially start with zero features and find the one feature that maximizes a crossvalidated score when an estimator is trained on this single feature. Once that first feature is selected, we repeat the procedure by adding a new feature to the set of selected features. The procedure stops when the desired number of selected features is reached.
BackwardSFS follows the same idea but works in the opposite direction: instead of starting with no feature and greedily adding features, we start with all the features and greedily remove features from the set. The direction parameter controls whether forward or backward SFS is used.
Algorithm 2: Backward SFS  
Define the desired number of features  
Tune/Train the model on the training set using all predictors  
Compute Model Performance  
Calculate the importance of the predictors  
For each each subset of variables 

Greedily drop the feature responsible for the worst model performance at the given point; retain 

Tune/Train the model on the training set using the remaining 

Calculate model performance  
end  
Calculate the performance profile over the final 

Train the model using the optimal 
While the forward SFS algorithm is easier to understand, there are minor diferences between backward SFS and RFE. SFS does not require the underlying model to expose variable coefficients or importance scores. The iterative step makes a greedy decision on the basis of the model performance. This algorithm includes a model selection on the basis of various parameters like Adj RSquared, Mallow's Cp, BIC or AIC, Accuracy or even Root Mean Squared Error (RMSE), depending on the nature of the model.
SFS may however be slower considering that more models need to be evaluated, compared to RFE. For example in backward selection, the iteration going from features to features using kfold crossvalidation requires fitting models, while RFE would require only a single fit.
1.3.1. Definitions of some modelselection parameters
AIC: The Akaike information criterion (AIC) is an estimator of outofsample prediction error and thereby relative quality of statistical models for a given set of data. It can be defined as:
Here, is the number of predictors or independent features and is the maximum value of the likelihood function for the model (The higher the number, the better the fit). We aim to minimize AIC, so the model with a lower AIC is selected as the better one.
BIC: Bayesian Information Criteria (BIC) is a model selection parameter defined from Bayesian probability and inference. It can be defined as:
Here, is the loglikelihood of the model, is the number of examples in the training dataset, and is the number of parameters in the model. Like AIC, we aim to minimize the BIC too i.e. we select the model with the lowest BIC. However, unlike AIC, BIC offers a higher penalty for complex models using the sample size of the dataset. If a model is trained on a dataset with a large sample size with relatively lower number of features, the model has a lower BIC, thus making it better than a model with smaller sample size and more features.
MDL: The Minimum Description Length, or MDL for short, is a method for scoring and selecting a model. It is an information theoretic measure of model selection. It represents the minimum number of bits, or the minimum of the sum of the number of bits required to represent the data and the model and is defined as:
Here, is the model, is the predictions made by the model, is the number of bits required to represent the model, and is the number of bits required to represent the predictions from the model on the training dataset.
We also aim to reduce MDL, and hence the model with the lowest MDL is selected.
1.4. Embedded Methods: L1 regularization for Feature Selection
Let us talk about the two most common regularization techniques: and . These regularization terms add a penalty that can be defined as:
L1 norm penalty:
L2 norm penalty:
Here, regularization, uses a penalty term which encourages the sum of the absolute values of the parameters to be small. regularization, on the other hand, encourages the sum of the squares of the parameters to be small. It has frequently been observed that regularization in many models causes many parameters to equal zero, so that the parameter vector is sparse. This makes it a natural candidate in feature selection settings, where we believe that many features should be ignored.
Graphical representation of L2 (left) and L1 (right) regularization
In the above images, we see that tends to generate sparser solutions than a quadratic regularizer like . In the formula defined fo L1norm penalty, if we increase further, the solution would become sparser and sparser, i.e. more and more features would have 0 as coefficients. The orange square area in the graph above will shrink to a point corresponding to 0.
Thus, including a regularization term like L1regularization with a machine learning model like linear regression to perform an embedded feature selection. Lasso regression uses L1 regularization on top of linear regression. It is like linear regression, but it uses a technique "shrinkage" where the coefficients of determination are shrunk towards zero.
1.5. Treebased Feature Selection
Treebased models can be used to compute impuritybased feature importances, which in turn can be used to discard irrelevant features. Let us consider the Random Forest models which computes importance scores for each feature based on the ‘Gini’ criterion during the training process. The Gini index is defined as follows:
Here, given a dataset that contains samples from classes, the probability of samples belonging to class at a given node can be denoted as . The measure above tells us the probability of misclassifying an observation. Lower the Gini impurity, the better is the split. At every split ofa node in treebased models, an attribute with the smallest Gini Impurity is selected for splitting the node.
In other words, a lower Gini impurity index leads to a lower likelihood of misclassification. We can use impuritybased feature importances to discard irrelevant features using a metatransformer for selecting features based on importance weights.
2. Feature Extraction using Transformation/ Combination of Features
There are a few methods that transform existing variables in the base dataset and/or combine them with other features to reduce the dimensionality while retaining maximum information/variance in the dataset.
These methods can be classifed as linear or nonlinear methods. The linear methods involve linearly projecting the original data onto a lowdimensional space and extract the transformed features. Some of these methods involve Principal Component Analysis (PCA), Factor Analysis (FA), Linear Discriminant Analysis (LDA) and Truncated Singular Value Decomposition (Truncated SVD). We will also cover some nonlinear methods which work better with nonlinear data.Some of those methods are: Kernel PCA, tdistributed Stochastic Neighbor Embedding (tSNE), Multidimensional Scaling (MDS), and Isometric mapping (Isomap)
2.1. Principal Component Analysis (PCA)
In a dataset with large number of variables, it can be difficult to study and interpret the relationships between these variables. There can be too many pairwise correlations between the variables to consider. Moreover, it is difficult to visualize these variables and relationships.
To interpret the data in a more meaningful form, it is necessary to reduce the number of variables to a few, interpretable linear combinations of the data. Each linear combination will correspond to a principal component. This is the essence of PCA.
Let us assume a random vector :
If we consider the linear combinations as below:
Each of the above equations represent the variables as a function of the random data . Here each component represents a principal component, a combination of several features. The equations can be seen as linear regressions where you can predict from with coefficients . Since is function of the random data , its variance can be represented as:
and the covariance between and can be defined as:
Now, the coefficients can be represented as a vector:
Now, if we look at the first Principal Component: , it is the linear combination of the xvariables that has maximum variance (among all linear combinations). It accounts for as much variation in the data as possible. Hence the coefficients are defined such that variance is maximized, given the constraint that the sum of the squared coefficients is equal to one. This constraint is required to obtain a unique answer. This means that the below variance is maximized:
subject to the constraint:
Similarly, the second principal component is the linear combination of xvariables that explains the maximum proportion of the remaining variation, with the constraint that the sum of the squared coefficients is equal to one and the additional constraint that the correlation between the first and second component is 0. It implies that the below variance is maximized:
given the first and second principal components are uncorrelated:
All subsequent principal components have this same property – they are linear combinations that account for as much of the remaining variation as possible and they are not correlated with the other principal components. Moreover, all principal components are uncorrelated with one another.
To find coefficients for a Principal Component, we need to calculate the eigenvalues and eigenvectors of the variancecovariance matrix . If the eigenvalues of the variancecovariance matrix are repesented by such that:
The corresponding eigenvectors can be represented as . It can be proved that the elements for these eigenvectors are the coefficients of our principal components.
Since principal components are a combination of multiple variables, interpreting it can be tough. To do so, we can look at the correlation between the principal components and the individual variables. The direction and the magnitude of the correlation score with each of the variables tells us how each principal component can be interpreted. Say a particular principal component has a strong correlation of 0.95 with a variable , followed by strong correlation with 2 other variables , we can say that this PC is predominantly a measure and a function of . If a PC has a strong negative correlation with a variable , it means the PC increases with a decreasing . Thus, it is a negative function of that variable.
It is important to note that one should perform feature scaling before running PCA especially, if there is a significant difference in the scale between the features of the dataset. One can define the number of principal components to be extracted from a dataset. This could be defined using the proportion of the explained variance by each of these principal components.
2.2. Factor Analysis (FA)
Factor Analysis (FA) is a dimensionality reduction technique that also helps extract latent variables which are not directly measured in a single variable but rather inferred from other variables in the dataset. These latent variables are called factors. It models observed variables, and their covariance structure, in terms of a smaller number of these latent unobserved factors.
To understand the algorithm, let us start with a vector containing all variables for all samples:
This is a random vector, with a population mean. Assume that vector of traits is sampled from a population with population mean vector:
Here, denotes the population mean of variable . If we consider latent common factors , such that , then the common factors can be represented as :
Now, we represent each variable as a regression function of these factors as follows:
Here, the variable means can be regarded as the intercepts for the multiple regression models defined above. The regression coefficients are called factor loadings and can be collected into a matrix like:
Moreover, the error terms called the specific factors. The vector for the same can be represented as:
Each of our response variables is predicted as a linear function of the unobserved common factors $f_1,f_2,...,f_m $. We have unobserved factors that control the variation among our data.
The factor model has a few assumptions though:
 The specific factors or random errors all have mean zero.
 The common factors have variance one.
 The common factors are uncorrelated with each other.
 The specific factors are uncorrelated with each other.
 The specific factors are uncorrelated with the common factors.
We can interpret factor analysis as an inversion of principal components analysis. While PCA represents new components as a function of combination of observed variables, FA models the observed variables as linear functions of the new unobserved variables called factors. However, both PCA and FA reduce the dimension of the dataset.
2.3. Linear Discriminant Analysis
Linear Discriminant Analysis, or LDA, is a linear machine learning algorithm used for multiclass classification. It can also be used for dimensionality reduction.
Let us assume a dataset with classes (for the target variable). LDA helps us reduce the number of features to classes.
The aim of LDA is to:
Thus, LDA tries to separate (or discriminate) the samples in the training dataset by their class value. In this pursuit of dimensionality reduction, the model tries to find a linear combination of input variables that achieves the maximum separation for samples between classes (class centroids or means) and the minimum separation of samples within each class.
To explain this better, let us compare LDA to PCA. PCA is an unsupervised machine learning method, that is it takes into account only the spectral data and its variance to reduce the number of features. It does not require the labels of the data samples.
However, LDA makes use of the class labels to produce a dimensionality reduction that maximizes the distance between the classes. In other words, while PCA will find the projections that maximise the variance of the data irrespective of their classes, LDA will seek to maximise the distance between those classes while reducing the number of features.
In the above image, we see how LDA projects the existing features onto an exis such that it separates the two classes in the projected space. The general overview of this algorithm looks like the following:
Algorithm 3: Linear Discriminant Analysis  
Compute the ddimensional mean vector for the different classes from the dataset.  
Compute the Scatter matrix (in between class and within the class scatter matrix)  
Sort the Eigen Vector by decreasing Eigen Value and choosing 

Used 
Although it looks like LDA can perform better than PCA in multiclass classification models, it is not always the case. The choice depends on the performance of the model built on the transformed dataset. One can also combine LDA and PCA as an ensemble to obtain transformed features.
2.4. Truncated Singular Value Decomposition (SVD)
The SingularValue Decomposition, or SVD, is a matrix decomposition method for reducing a matrix to its constituent parts in order to make certain subsequent matrix calculations simpler. To understand Truncated SVD, let us briefly explain SVD.
SVD is a method for lowdimensional representation of a highdimensional matrix. In this process, it also makes it easy to eliminate the less important parts of that representation to produce an approximate representation with any desired number of dimensions. Thus, it is a dimensionality reduction technique.
Let be an matrix and rank . Now, we find matrices , , and as shown in the image below:
Form of a singularvalue decomposition
It can be mathematically represented as:
Let us define the above matrices:
is an columnorthonormal matrix ; i.e., each of its columns is a unit vector and the dot product of any two columns is 0. is an columnorthonormal matrix. Note that we always use V in its transposed form, so it is the rows of that are orthonormal. is a diagonal matrix; that is, all elements not on the main diagonal are 0. The elements of are called the singular values of .
The key to understanding what SVD offers is in viewing the columns of , , and as representing concepts that are hidden in the original matrix . To understand how it helps in dimensionality reduction, let us suppose we want to represent a very large matrix by its SVD components , , and . However, these matrices are also too large to store conveniently. We can reduce the dimensionality of the three matrices by setting the smallest of the singular values to zero. If we set the smallest singular values to 0, then we can also eliminate the corresponding columns of and . We make the choice of setting the lowest singular values to zero when we reduce the number of
dimensions minimizes the rootmeansquare error (RMSE) between the original matrix and its approximation.
In order to determine how many singular values need to be retained, it is a thumbrule to retain enough singular values to make up 90% of the energy in . This means, the sum of the squares of the retained
singular values should be at least 90% of the sum of the squares of all the singular values.
In truncated SVD, we take the largest singular values (such that ) and their corresponding left and right singular vectors
Contrary to PCA, this estimator does not center the data before computing the singular value decomposition. This means it can work with sparse matrices efficiently. Thus, we can used Truncated SVD as a dimensionality reduction technique for sparse datasets.
2.5. Kernel PCA
All the above methods are linear methods of dimensionality reduction. They work best for linear data. To be able to reduce the dimensions of nonlinear data, we need nonlinear methods. The first one discussed here is Kernel PCA. It is a nonlinear version of PCA that uses kernels.
Linear and Nonlinear dataset
In the above figure, the data on the left is linearly separable. The one on the right is nonlinear and needs a higher order polynomial to divide the complex boundary.
In the conventional PCA algorithm, the process of matrix decomposition into eigenvectors is a linear transformation. Kernel PCA can be used to reduce the dimensions of data during classification of data whose decision boundaries are described by nonlinear function. The idea is to go to a higher dimension space in which the decision boundary becomes linear. Let's say the decision boundary of the above data distribution is defined by a third order polynomial . If we plot this function on an plane, it will produce a wavy line, which could act as the decision boundary in the image above. Now if we project this to a higherdimension space with axes and . In this 4D space, the third order polynomial becomes a linear function and the decision boundary converts into a hyperplane. So, if we find a suitable transformation to convert our data into linear data, we can proceed to the usual PCA decomposition on the transformed data.
Thus the algorithm can be simplified as below:
Algorithm 4: Kernel PCA  
Define the original set of 
