Machine Learning - Less is enough

We don’t need to know everything to make a decision. When doctors assess whether a patient is obese, they measure the patient's weight and height. They could also ask about the patient's hair or eye color, but that information would not be useful. The weight and height are then combined into the body mass index (BMI) to make the assessment easier.

That is how human beings judge and evaluate in all areas, and it involves four main steps:
1) Considering all the information that is possible to collect;
2) Removing irrelevant information;
3) Putting the relevant information together;
4) Coming up with the final decision.

Sometimes steps one and two happen so fast that we don’t even notice we have eliminated some information from the decision. The idea of reducing or combining features (the characteristics of data points) is not limited to our own decision-making; it is also the basis of dimensionality reduction in machine learning modelling.

As part of machine learning (ML) modelling, we can remove irrelevant features and combine the relevant ones to come up with new features. These new features, also called embeddings or representations, can be more predictive of the target task in a supervised setting, reduce the computational cost of modelling or production, and even help us visualize high-dimensional data (data with a large number of features) in 2D figures. In this process, ML methods remove and combine the original elements to produce a smaller set of features (dimensions).

Dimensionality reduction

Considering feature-feature relationships, feature-output relationships, or both, dimensionality reduction (DR) approaches either remove or combine original features, resulting in fewer dimensions (features). If a DR approach uses the outputs in this process, it is a supervised approach; otherwise, it is unsupervised. Reducing the number of dimensions can help us to

1) Reduce the memory occupied by the data we want to further use in our technologies;
2) Reduce the running time of our machine learning models built on top of the new dimensions (features);
3) Potentially improve performance of the model;
4) Help to better understand the feature-feature and feature-output relationships;
5) Visualize the data in lower-dimensional space (with two or three embeddings).

Feature selection

Selecting a specific subset out of a pool of features is one type of dimensionality reduction. In supervised ML settings, regularization is a widely used approach to shrink or remove the features that contribute least to predicting the output values. In unsupervised ML settings, features with high sparsity or low variance across data points are sometimes removed, keeping the features with the highest information content. In both cases, the goal is to select the features that provide enough information for the target task at hand. Both ideas are sketched in the code below.
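Here is a minimal sketch of both ideas using scikit-learn; the dataset, the regularization strength, and the variance threshold are only illustrative choices, not recommendations.

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel, VarianceThreshold
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)  # 10 original features

# Supervised: L1 (lasso) regularization shrinks the coefficients of weak features
# to zero, and SelectFromModel keeps only the features with non-zero coefficients.
lasso = Lasso(alpha=0.1).fit(X, y)
X_supervised = SelectFromModel(lasso, prefit=True).transform(X)

# Unsupervised: drop features whose variance across data points falls below a
# threshold, i.e., features that barely change and carry little information.
X_unsupervised = VarianceThreshold(threshold=1e-3).fit_transform(X)

print(X.shape, X_supervised.shape, X_unsupervised.shape)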

Feature extraction

In some problems, we can combine the available features through a linear or nonlinear process to come up with new feature sets that are 1) smaller in number, helping us save on computational cost; and 2) better predictors of a target output variable when implemented in a supervised ML setting. In the feature extraction process, some of the original components may have a negligible effect on generating the new ones, but the primary goal is not to remove any feature. Principal component analysis (PCA) is a widely used approach that linearly combines the original features to come up with new embeddings.
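As a small illustration, here is a minimal sketch of PCA with scikit-learn; the dataset and the choice of two components are assumptions made only for this example.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)             # 4 original features
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scales

# Linearly combine the 4 original features into 2 new embeddings (principal components)
pca = PCA(n_components=2)
X_embedded = pca.fit_transform(X_scaled)

print(X_embedded.shape)               # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance captured by each component

Note that every original feature contributes to each principal component; none is removed outright, which is what separates feature extraction from feature selection.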

Figure 1. Extracting the body mass index (BMI) of patients from their physical characteristics, through feature selection and extraction, to assess obesity. (There is much more to obesity assessment; this is just a simple example to illustrate the concepts of feature selection and extraction.)

Feature extraction for visualization

Feature extraction can also be used to help us better visualize high-dimensional data. Two DR approaches that have been used extensively to visualize high-dimensional data in two-dimensional space are t-distributed stochastic neighbor embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP). The primary goal of these methods is to identify local neighborhoods and show groupings of data points. While UMAP can also be used to study large distances between data points and the density of data points within groups, t-SNE's results are not reliable for such interpretations. The use of t-SNE and UMAP to reduce 64-dimensional (64-pixel) hand-written digits to a two-dimensional space is shown in Figure 2.
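A rough sketch of the kind of code that produces such a figure is shown below; it assumes scikit-learn, matplotlib, and the umap-learn package are installed, and the parameters are illustrative defaults rather than tuned settings.

import matplotlib.pyplot as plt
import umap
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)  # 1797 images, 64 pixels (8x8) each

# Reduce the 64 pixel features to 2 embeddings with each method
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)
X_umap = umap.UMAP(n_components=2, random_state=0).fit_transform(X)

# Plot the two embeddings side by side, coloring each point by its digit label
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, embedding, title in zip(axes, [X_tsne, X_umap], ["t-SNE", "UMAP"]):
    points = ax.scatter(embedding[:, 0], embedding[:, 1], c=y, cmap="tab10", s=5)
    ax.set_title(title)
fig.colorbar(points, ax=axes, label="digit")
plt.show()

In a sketch like this, nearby points in the 2D plot correspond to digits that look similar in the original 64-pixel space, which is exactly the local-neighborhood structure these methods are designed to preserve.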

Figure 2. Application of t-SNE and UMAP as examples of dimensionality reduction approaches for two-dimensional visualization of the hand-written digits data available in the scikit-learn package in Python. Each color is associated with a digit (between 0 and 9).

In the next post of this series, we will talk about clustering as an unsupervised learning approach. Later in the series, we plan to introduce several other fundamental topics in machine learning, such as deep learning and transfer learning!

Stay tuned!

Author: Ali Madani

Editors: Andreas Windemuth & Chinmaya Sadangi

Dr. Ali Madani, Director of Machine Learning

Ali develops new deep learning models to improve drug-target interaction prediction. He completed his Ph.D. in Computational Biology at the University of Toronto, developing new feature selection approaches from omics profiles of patient tumors that are predictive of their survival and their response to drugs.