Stochastic Gradient Descent For Deep Learning

Understanding the parameter optimization process for deep learning models.

Samuel Ozechi
5 min read · Sep 21, 2024

Gradient descent is an optimization algorithm for finding the minima of a differentiable function. It is generally used in machine learning to optimize the parameters of machine learning models.

It optimizes a model by iteratively adjusting the model's parameters in the direction that reduces the cost function (the prediction error), until the parameters reach the global minimum of the function, where the prediction error is smallest.

Fig. 1. 2D representation of gradient descent.

It is widely used for optimizing deep learning models, where it is applied over several iterations of feed-forward and backward propagation to derive the optimal weights for a neural network model.

The workflow for optimizing deep learning models with gradient descent usually involves several iterations of the following steps (a minimal code sketch follows the list):

1. Randomly initializing a set of weights for the model parameters.

2. Computing the prediction error using a defined cost function for the data points.

3. Determining the gradients of the cost function with respect to the model parameters.

4. Using these gradients to update the parameters' weights such that the prediction error is reduced.
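
As a concrete illustration of these steps, below is a minimal NumPy sketch for a single-input linear model with a mean squared error cost. The toy data, initial weights and learning rate are made up for illustration; this is the batch variant, which uses all the data points for each update.

```python
import numpy as np

# Toy data for illustration: y is roughly 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=100)

# 1. Randomly initialize the model parameters (weight and bias)
w, b = rng.normal(scale=0.01, size=2)
learning_rate = 0.1

for epoch in range(100):
    # 2. Forward pass and prediction error (mean squared error)
    y_pred = w * x + b
    loss = np.mean((y_pred - y) ** 2)

    # 3. Gradients of the cost function with respect to w and b
    grad_w = np.mean(2 * (y_pred - y) * x)
    grad_b = np.mean(2 * (y_pred - y))

    # 4. Update the parameters in the direction that reduces the loss
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
```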

Deep learning algorithms commonly utilize vast amounts of data for training; therefore, updating the weights of the model parameters only after computing the loss for all the available data points slows down the training process, as more time is spent computing and aggregating the total loss than actually updating the model parameters.

Stochastic gradient descent (SGD) is a variation of the gradient descent algorithm which updates the model after evaluating the loss of each individual data point. By updating the parameters far more frequently, it increases the speed of the learning process and improves the efficiency of large-scale training.

Applying Stochastic Gradient Descent For Deep Learning

Neural network models consist of multiple layers of interconnected neurons with several parameters. Deep learning aims at optimizing the weights of these parameters by minimizing the loss and thereby developing a model that makes accurate estimations.

Fig 2. Workflow of a neural network (image source)

Consider the linear combination of the parameters of a single node in the network as:

y = wx + b

where y is the output, w is the weight of the input, x is the input and b is the bias parameter.

The goal of deep learning is to find optimal values for w and b that accurately predict y. Following the workflow above, we would initialize small random weights for the parameters (w and b), run a forward propagation using the input (x) to obtain a predicted output, and evaluate the loss to determine how well the model currently performs.

Optimizing the model requires minimizing its loss over several iterations. The cost function used for optimizing a deep learning model largely depends on the specific problem (regression, classification, object detection, etc.). Some commonly used cost functions include Mean Squared Error, Mean Absolute Error, Log Loss and Quadratic Loss.
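
For reference, a few of these cost functions can be written in NumPy as follows. This is a simplified sketch with illustrative function names; deep learning frameworks ship their own, more robust implementations.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average of squared prediction errors; common for regression
    return np.mean((y_true - y_pred) ** 2)

def mean_absolute_error(y_true, y_pred):
    # Average of absolute prediction errors; less sensitive to outliers
    return np.mean(np.abs(y_true - y_pred))

def log_loss(y_true, y_pred, eps=1e-12):
    # Binary cross-entropy for classification; y_pred holds predicted
    # probabilities, clipped to avoid taking log(0)
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```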

After evaluating the loss, the model parameters are optimized by re-adjusting their weights with respect to the loss gradient at every iteration (epoch). When using batch gradient descent, the weights are updated only once per epoch, because a single loss gradient, aggregated from the loss values of all the data points, is used for each update.

When optimizing the model using stochastic gradient descent, the weights are updated multiple times in each epoch: the number of weight updates per epoch equals the number of data points (N) in the dataset. Rather than aggregating the loss values of all data points into a single gradient, the weights are adjusted using the loss gradient of each data point, visited in random order, which leads to faster model convergence.
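
Continuing the linear-model sketch from earlier, the stochastic variant visits the data points in a random order and performs one weight update per point, giving N updates per epoch. Again, the data and hyperparameter values are purely illustrative.

```python
import numpy as np

# Same toy setup as in the earlier sketch
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=100)
w, b = rng.normal(scale=0.01, size=2)
learning_rate = 0.01

for epoch in range(20):
    # Visit the data points in a random order each epoch
    for i in rng.permutation(len(x)):
        # Loss gradient for a single data point (squared error)
        error = (w * x[i] + b) - y[i]
        grad_w = 2 * error * x[i]
        grad_b = 2 * error

        # One weight update per data point: N updates per epoch
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
```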

While the model converges faster than with batch gradient descent, this also incurs a larger number of weight updates, one per sample point in each training epoch. The technique can therefore become more resource-intensive.

An alternative to updating the weights at every sample point is the mini-batch SGD approach, where small batches of data points are randomly selected and used to update the weights of the model, thereby combining the advantages of the batch and stochastic gradient descent approaches.
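
A mini-batch version of the same sketch shuffles the data once per epoch, splits it into small batches (a made-up batch size of 16 here), and performs one update per batch:

```python
import numpy as np

# Same toy setup as in the earlier sketches
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=100)
w, b = rng.normal(scale=0.01, size=2)
learning_rate, batch_size = 0.1, 16

for epoch in range(50):
    indices = rng.permutation(len(x))  # shuffle once per epoch
    for start in range(0, len(x), batch_size):
        batch = indices[start:start + batch_size]
        error = (w * x[batch] + b) - y[batch]

        # Gradient averaged over the mini-batch only
        grad_w = np.mean(2 * error * x[batch])
        grad_b = np.mean(2 * error)

        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
```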

Hyperparameters of Stochastic Gradient Descent

To apply SGD to deep learning efficiently, its movement across the loss landscape towards the global minimum needs to be effectively controlled. This is done using the following hyperparameters of the stochastic gradient descent algorithm (a short sketch of the corresponding update rule follows the list).

1. Learning Rate: The learning rate controls the magnitude of the adjustments made to the model parameters during training. It is important to set the learning rate carefully: too small a value unnecessarily slows down convergence, while too large a value can overshoot the global minimum because the steps taken downslope are too big.

Fig. 3. Effect of different learning rates on model convergence (image source)

2. Momentum: This accelerates SGD downslope and dampens unnecessary oscillations across the loss landscape. This is important because the noisy per-sample gradients can cause SGD to oscillate around the descent path rather than progressing steadily towards the global minimum.

3. Nesterov Acceleration: This refines momentum by evaluating the gradient at the look-ahead position the current momentum would carry the parameters to, rather than at the current position. It reduces the likelihood of overshooting beyond the global minimum by slowing down the SGD as it approaches the global minimum.
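
To make these knobs concrete, here is a minimal sketch of a single SGD update step with a learning rate, momentum and optional Nesterov acceleration. The function name and arguments are illustrative, not a particular library's API; frameworks expose the same hyperparameters directly (for example, PyTorch's torch.optim.SGD accepts lr, momentum and nesterov arguments).

```python
def sgd_update(w, velocity, grad_fn, learning_rate=0.01,
               momentum=0.9, nesterov=True):
    """One SGD step with momentum and optional Nesterov acceleration.
    grad_fn(params) should return the loss gradient at those parameters."""
    if nesterov:
        # Evaluate the gradient at the "look-ahead" position the current
        # momentum would carry the parameters to, which curbs overshooting
        grad = grad_fn(w + momentum * velocity)
    else:
        grad = grad_fn(w)

    # Momentum accumulates past updates and damps oscillations, while the
    # learning rate scales the size of each step downslope
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity

# Illustrative usage on a made-up quadratic loss f(w) = (w - 3)^2
w, v = 0.0, 0.0
for _ in range(200):
    w, v = sgd_update(w, v, lambda p: 2 * (p - 3))  # w moves towards 3
```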

Conclusion

Stochastic gradient descent is an optimization algorithm that speeds up the convergence of deep learning models. It applies multiple weight updates to the model parameters in every epoch, one per data point, rather than a single aggregated update. By so doing, SGD improves efficiency and reduces the training time of models that use huge amounts of data.

While the basic SGD method could become computationally expensive, the mini-batch SGD method applies SGD using batches of the data points, thereby reducing the computations incurred when using the basic SGD.

The step size, acceleration and approach of SGD towards the global minimum are controlled by the learning rate, momentum and Nesterov acceleration hyperparameters respectively.
