This article is the third in a series of articles aimed at demystifying neural networks and outlining how to design and implement them. In this article, I will discuss the following concepts related to the optimization of neural networks:
- Challenges with optimization
- Adaptive Learning Rates
- Parameter Initialization
- Batch Normalization
You can access the previous articles below. The first provides a simple introduction to neural networks for those who are unfamiliar with the topic. The second covers more intermediate topics such as activation functions, neural architectures, and loss functions.
These tutorials are largely based on the notes and examples from multiple classes taught at Harvard and Stanford in the computer science and data science departments.
All the code that is discussed in this and subsequent tutorials on the topics of (fully connected) neural networks will be accessible through my Neural Networks GitHub repository, which can be found at the link below.

https://github.com/mrdragonbear/Neural-Networks
Challenges with Optimization
When talking about optimization in the context of neural networks, we are discussing non-convex optimization.
Convex optimization involves a function in which there is only one optimum, corresponding to the global optimum (maximum or minimum). There is no concept of local optima for convex optimization problems, making them relatively easy to solve — these are common introductory topics in undergraduate and graduate optimization classes.
Non-convex optimization involves a function which has multiple optima, only one of which is the global optimum. Depending on the loss surface, it can be very difficult to locate the global optimum.
For a neural network, the curve or surface that we are talking about is the loss surface. Since we are trying to minimize the prediction error of the network, we are interested in finding the global minimum on this loss surface — this is the aim of neural network training.
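The difficulty of non-convex optimization can be seen even in one dimension. The sketch below runs plain gradient descent on a small polynomial with two minima (the polynomial and its values are illustrative, not taken from the article): depending on where we start, we land in either the global minimum or a shallower local one.

```python
# Illustrative non-convex function with two minima:
# a global minimum near x ~ -1.47 and a local minimum near x ~ 1.35.
def loss(x):
    return x**4 - 4 * x**2 + x

def grad(x):
    # Derivative of the loss above.
    return 4 * x**3 - 8 * x + 1

def gradient_descent(x0, lr=0.01, steps=200):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Starting on different sides of the hill between the two minima
# leads gradient descent to different optima.
print(gradient_descent(-2.0))  # converges to the global minimum (x ~ -1.47)
print(gradient_descent(+2.0))  # converges to the local minimum (x ~ 1.35)
```

The loss surface of a real neural network is the same story in millions of dimensions: the starting point (the parameter initialization) and the path taken determine which minimum we end up in.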
There are multiple problems associated with this:
What is a reasonable learning rate to use? Too small a learning rate takes too long to converge, while too large a learning rate will cause the updates to overshoot the minimum, so the network will not converge.
How do we avoid getting stuck in local optima? One local optimum may be surrounded by particularly steep regions of the loss surface, and it may be difficult to ‘escape’ this local optimum.
What if the loss surface morphology changes? Even if we can find the global minimum, there is no guarantee that it will remain the global minimum indefinitely. A good example of this is when training on a dataset that is not representative of the actual data distribution — when applied to new data, the loss surface will look different. This is one reason why making the training and test datasets representative of the total data distribution is of such high importance. Another good example is data whose distribution habitually changes due to its dynamic nature — for instance, user preferences for popular music or movies, which change day-to-day and month-to-month.
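The learning-rate trade-off described above can be demonstrated on a trivial quadratic loss (a sketch; the function and the specific rates are illustrative assumptions, not from the article):

```python
# Gradient descent on f(x) = x^2, whose gradient is 2x.
# The qualitative behaviour at each learning rate is what matters here.
def descend(lr, x0=1.0, steps=20):
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # update: x <- x - lr * f'(x)
    return x

print(descend(lr=0.01))  # too small: after 20 steps, still far from the minimum at 0
print(descend(lr=0.5))   # well chosen: reaches the minimum almost immediately
print(descend(lr=1.1))   # too large: each step overshoots, and x diverges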
Fortunately, there are methods available that provide ways to tackle all of these challenges, thus mitigating their potentially negative ramifications.
Previously, local minima were viewed as a major problem in neural network training. Nowadays, researchers have found that when using sufficiently large neural networks, most local minima incur a low cost, and thus it is not particularly important to find the true global minimum — a local minimum with reasonably low error is acceptable.