Neural network training can be a long process. Encog provides many different training methods to choose from, and many of them have multiple parameters that you must tune. Understanding some of the basics of neural network training will help you pick the training method that best suits your needs.
I wrote this blog post to explain some of the differences among these training methods, and to describe, at a high level, how they work. This will help you understand why you can get completely different results between training runs. Through this series, I also hope to define some technical terms, such as search space, dimensionality, stochastic, deterministic, derivative and gradient descent.
First, we must step back and look at what we are really doing when we train a neural network. Training a neural network means fitting its weights to training data. Training data consists of inputs and ideal outputs. The neural network is taught to produce output close to the ideal output when presented with the corresponding input. The degree to which the neural network's actual output does not match the ideal output is the training error. As training progresses, you should see the training error decrease.
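To make "training error" concrete, here is a minimal sketch of one common error measure, mean squared error. This is plain Java rather than Encog API, and the numbers are invented for illustration; Encog's training classes report a comparable error value for you.

public class TrainingError {

    // Mean squared error between the network's actual outputs
    // and the ideal outputs from the training data.
    public static double meanSquaredError(double[][] actual, double[][] ideal) {
        double sum = 0;
        int count = 0;
        for (int row = 0; row < actual.length; row++) {
            for (int col = 0; col < actual[row].length; col++) {
                double diff = ideal[row][col] - actual[row][col];
                sum += diff * diff;
                count++;
            }
        }
        return sum / count;
    }

    public static void main(String[] args) {
        // Ideal outputs from the training data, and what a
        // (still untrained) network actually produced.
        double[][] ideal  = { {0.0}, {1.0}, {1.0}, {0.0} };
        double[][] actual = { {0.3}, {0.6}, {0.4}, {0.5} };
        System.out.println("Training error: " + meanSquaredError(actual, ideal));
    }
}

As training progresses, this number should fall toward zero.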
For most training algorithms, we are simply modifying the individual weights and biases of a neural network. For this article, we can think of the biases as nothing more than additional weights. It is often very handy to think of a neural network's weights as one long array of values. For example, if the neural network had three weights, you might represent them as follows.
[-2.34, 0.55, -1.222]
At this point we do not care about layers, neurons or any other such structural element. We have a neural network of some arbitrary structure. That structure results in three connections, and these are the three weights. Training is now the process of adjusting these three weights to produce a neural network that minimizes training error.
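Encog lets you see this flat view directly. The sketch below builds a small network and prints its weights as one long array; the class and method names are from my memory of the Encog 3 API, so check them against your version.

import org.encog.engine.network.activation.ActivationSigmoid;
import org.encog.neural.networks.BasicNetwork;
import org.encog.neural.networks.layers.BasicLayer;

public class FlatWeights {
    public static void main(String[] args) {
        // Build a small network; the exact structure is not important here.
        BasicNetwork network = new BasicNetwork();
        network.addLayer(new BasicLayer(null, true, 2));
        network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 3));
        network.addLayer(new BasicLayer(new ActivationSigmoid(), false, 1));
        network.getStructure().finalizeStructure();
        network.reset(); // start with random weights

        // The entire network, weights and biases alike,
        // reduced to one long array of doubles.
        double[] weights = network.getStructure().getFlat().getWeights();
        for (double w : weights) {
            System.out.println(w);
        }
    }
}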
You can think of those three weights as coordinates in a three-dimensional world. This is why neural network training is often called a "search problem". Adjusting each of these three weights, or coordinates, moves the neural network to a different place in the "search space". Each location in this search space produces some training error when the network is evaluated against the training data. The goal of training is to improve that error. A location with a slightly better error rate is generally very close to the network's current location, so training moves the neural network through the search space one small step at a time, each step hopefully improving the training error to some degree. Most training algorithms call each move an iteration, or epoch.
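In Encog, each call to iteration() is one such move. Here is a sketch of the usual training loop, patterned after Encog's XOR example and assuming the network from the previous listing:

import org.encog.ml.data.MLDataSet;
import org.encog.ml.data.basic.BasicMLDataSet;
import org.encog.neural.networks.training.propagation.resilient.ResilientPropagation;

double[][] input = { {0,0}, {0,1}, {1,0}, {1,1} };
double[][] ideal = { {0}, {1}, {1}, {0} };
MLDataSet trainingSet = new BasicMLDataSet(input, ideal);

ResilientPropagation train = new ResilientPropagation(network, trainingSet);

int epoch = 1;
do {
    train.iteration(); // one move through the search space
    System.out.println("Epoch " + epoch + ", error: " + train.getError());
    epoch++;
} while (train.getError() > 0.01);
train.finishTraining();

Each pass through the loop should print a (usually) smaller error, which is the network settling into a lower spot in the search space.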
It is very much like a balloon pilot searching for a good place to land. The pilot has only limited control. The water is not a good landing site, but there are good sites beyond it. The idea is not to find the best possible landing site; that would take too long. The idea is to find an acceptable landing site. The same is true when training a neural network.
The search-space analogy works well. Consider the above neural network with three weights. Using these three dimensions, we can move up/down, left/right and forward/backward. Our goal is to adjust our position in these three dimensions so as to improve the error. There are many ways to determine which direction to move; these are the training methods, and this is why there are so many of them. They all seek to move you toward a better error rate.
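In Encog this shows up as different implementations of the same training interface, so changing the move strategy is essentially a one-line change. A sketch, again assuming Encog 3 class names:

import org.encog.ml.train.MLTrain;
import org.encog.neural.networks.training.propagation.back.Backpropagation;
import org.encog.neural.networks.training.propagation.resilient.ResilientPropagation;

// Each trainer decides differently which direction to move in the
// search space; the loop that calls iteration() stays the same.
MLTrain train = new ResilientPropagation(network, trainingSet);

// Or classic backpropagation, with a learning rate and momentum:
// MLTrain train = new Backpropagation(network, trainingSet, 0.7, 0.3);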
Most neural networks have far more than three weights. The dimensionality of the search is always the number of weights in the neural network. A "high-dimensionality search" is a search over a neural network with many weights. We can't visualize a world with more than three dimensions, but it is simple enough to think about: higher dimensionality just means there are more coordinates needed to describe our position.
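Put another way, the dimensionality of the search is just the length of that flat weight array from earlier:

// The search-space dimensionality is simply the weight count.
int dimensions = network.getStructure().getFlat().getWeights().length;
System.out.println("Searching in " + dimensions + " dimensions");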
Two other important ideas in this search space are the global minimum and local minima. The global minimum is the location in the search space where the error rate is the lowest possible value. Even with a small number of dimensions, the global minimum is the "Lost City of Atlantis": you will never find it. And really, you would not want to find it, because the global minimum usually means the neural network has memorized the training data and is now overfitted. You will always stop training at a local minimum. Strictly speaking, a local minimum is a location where the error is lower than at all nearby locations; in practice, you stop at a local minimum whose error rate is below what you set as the minimum accepted error.
Local minima are not always bad. A local minimum is bad when the neural network becomes stuck there: the training algorithm is happy with the local minimum it found and does not seek anything better. When your neural network just stops dead at a high error rate during training, this is generally what has happened. The neural network has found a low point in the search space, but the search space is extremely inhospitable in every direction from there, and the training algorithm simply can't find a way out.
In this case, you might have to randomize the network. Randomizing the network is essentially like picking the neural network up and dropping it at a random new location in the search space. This lets it escape from the bad location it was stuck in.
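Here is a sketch of what that can look like in practice, building on the earlier training loop. The stall thresholds are invented for illustration, and whether a trainer's internal state fully resets after reset() may depend on the training method, so treat this as a starting point rather than a recipe:

double lastError = Double.MAX_VALUE;
int stalledEpochs = 0;

do {
    train.iteration();
    double error = train.getError();

    // Count epochs where the error barely moved.
    if (lastError - error < 1e-6) {
        stalledEpochs++;
    } else {
        stalledEpochs = 0;
    }
    lastError = error;

    // Stuck at a high error? Drop the network somewhere new.
    if (stalledEpochs > 50 && error > 0.01) {
        network.reset(); // re-randomize all weights
        stalledEpochs = 0;
        lastError = Double.MAX_VALUE;
    }
} while (train.getError() > 0.01);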
There are likely many different local minima that would give you an acceptable error rate; it is just a matter of finding them, and not all are created equal. Consider this example. You are on a road trip, and it is time to stop for lunch. There are many places to eat lunch, and some will be better than others, depending on what you like. If you are a vegetarian, a hamburger stand is not going to work too well. You also do not want to spend two hours looking for the best possible place to eat lunch. So you use whatever you have: maps, signs, your iPhone and GPS. And you settle on the best place you can find. This is how training works.
This blog post gave you a high-level introduction to neural network training. You saw what global and local minima are. In the next blog post, we will look at the different types of training.