To become better at a task, supervised machine learning (ML) methods need feedback about their predictions so they can bring them closer to reality; this feedback loop is the optimization process. To use an optimization algorithm, we specify a loss function that measures the difference between predictions and reality (the true values). A large loss value means the predictions are not close enough to the true values. Optimization is the process of changing the parameters (weights) of an ML model, using an optimization algorithm, to make the loss function smaller for a target task (i.e. to bring the predictions closer to reality).

In our introduction to neural networks blog post, we introduced forward propagation and backpropagation as the processes of predicting outputs and learning from those predictions, respectively. Here we focus on the optimization algorithms that help an artificial neural network learn and become better at predicting the labels of data points in classification settings.

**Loss functions**

Loss functions in supervised learning are meant to provide a representative summary of the difference between an ML model's predictions and the true values for a target task. Depending on the type of supervised ML model, the loss summarizes the difference between continuous predicted and true values (regression) or between predicted and true categories, or classes (classification).
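To make this concrete, here is a minimal NumPy sketch of one common loss from each setting: mean squared error for regression and cross-entropy for classification. The function names and toy inputs are illustrative, not from any particular library.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error: a common regression loss."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def cross_entropy_loss(y_true, probs, eps=1e-12):
    """Categorical cross-entropy: a common classification loss.
    y_true holds integer class indices; probs holds predicted
    class probabilities, one row per data point."""
    probs = np.clip(np.asarray(probs), eps, 1.0)
    return -np.mean(np.log(probs[np.arange(len(y_true)), y_true]))

# Regression: predictions close to the true values give a small loss.
print(mse_loss([1.0, 2.0], [1.1, 1.9]))   # small
print(mse_loss([1.0, 2.0], [3.0, 0.0]))   # much larger

# Classification: confident correct predictions give a small loss.
print(cross_entropy_loss([0, 1], [[0.9, 0.1], [0.2, 0.8]]))
```

In both cases, the worse the predictions, the larger the loss value, which is exactly the signal the optimization process tries to shrink.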

**Gradient descent, stochastic gradient descent and batch gradient descent**

Optimization algorithms such as Gradient Descent (GD) try to minimize the loss function. At each step of the training process, the optimization algorithm decides how to update each of the weights of a neural network (or another machine learning model). Most optimization algorithms rely on the gradient vector of the cost function to update the weights; the main differences among them are how the gradient vector is used and which data points are used to calculate it.

In Gradient Descent, all data points are used to calculate the gradient of the cost function, and the weights of the model are then updated in the direction of maximum decrease of the cost. While effective for small datasets, this method can become computationally expensive and unsuitable for large datasets, because the cost must be calculated over all data points in every iteration of learning. An alternative is Stochastic Gradient Descent (SGD): instead of all data points, a single data point is selected in each iteration to calculate the cost and update the weights. But because one data point does not represent the whole dataset, this causes highly oscillating behavior in the weight updates. A compromise is mini-batch gradient descent, which is what tutorials and tools commonly mean by SGD: in each iteration, a batch of data points (rather than all of them, or only one) is used for the weight update. The mathematics behind these three approaches is shown in Figure 1.

**Figure 1. Comparison of gradient descent, stochastic gradient descent and mini-batch gradient descent. Mini-batch gradient descent is commonly referred to as SGD in tools, papers and packages.**
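The mini-batch variant can be sketched in a few lines of NumPy. This is an illustrative toy example (simple linear regression with an MSE loss); the learning rate, batch size, and epoch count are arbitrary choices for the demo, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3*x + 1 plus a little noise.
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(0, 0.05, size=200)

w, b = 0.0, 0.0            # model weights, initialized at zero
lr, batch_size = 0.1, 16   # hyperparameters (illustrative values)

for epoch in range(100):
    idx = rng.permutation(len(X))            # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = X[batch, 0], y[batch]
        err = (w * xb + b) - yb              # prediction error on the batch
        # Gradients of the MSE loss over this mini-batch only.
        grad_w = 2.0 * np.mean(err * xb)
        grad_b = 2.0 * np.mean(err)
        # Step in the direction of maximum decrease of the loss.
        w -= lr * grad_w
        b -= lr * grad_b

print(w, b)  # should approach the true values 3 and 1
```

Setting `batch_size` to the full dataset recovers plain gradient descent, and setting it to 1 recovers classic SGD, which is why the three methods in Figure 1 are really one template with different batch sizes.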

More optimization algorithms have been proposed in recent years to improve the performance of neural network models across a variety of applications; a popular example is the Adam optimizer. One intuition behind Adam is to counteract diminishing gradients: as optimization continues, gradient values tend to get smaller, and Adam compensates by using momentum, an exponentially decaying average of the gradients from previous steps, together with an adaptive per-parameter learning rate. More than ten different optimization algorithms, including Adam, are available in PyTorch's torch.optim module.
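The Adam update rule itself is short enough to sketch by hand. The NumPy snippet below is a minimal stand-alone illustration of the standard update equations, minimizing a toy quadratic; in practice you would simply use torch.optim.Adam rather than hand-rolling this, and the hyperparameter values are just the common defaults.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.1,
              beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # momentum: running average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = (w - 5)^2, whose minimum is at w = 5.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 201):
    grad = 2.0 * (w - 5.0)    # gradient of (w - 5)^2
    w, m, v = adam_step(w, grad, m, v, t)

print(w)  # should end up close to the minimum at 5
```

The division by the square root of the running squared-gradient average is what keeps the step size meaningful even as raw gradients shrink, which is the "avoiding diminishing gradients" intuition mentioned above.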

Here we briefly discussed the optimization process in supervised neural network modelling. In the next post of this series, we will discuss graph neural network modelling and its applications in drug discovery. Later in the series, we plan to introduce several other fundamental and advanced topics in machine learning, such as attention in neural network modelling!

Stay tuned!

Author: Ali Madani

Editor: Andreas Windermuth & Chinmaya Sadangi