This was originally posted Mon, 10/26/2009, and needs some updating. I will do this soon.
This article shows how the Multi Propagation (MPROP) algorithm was implemented for Encog for Java. Though this article focuses on the Java implementation the C# version would be very similar. MPROP is based on resilient propagation, but is designed to work well with multicore computers and gain maximum performance.
Multicore computers are becoming more and more common. There seems to be only some many gigahertz that computer manufactures can squeeze out of CPU’s. The real growth in computer performance will likely be in the number of cores contained in a computer’s CPU. As of the writing of this document, October 25, 2009, Intel i7 computers can be had for around $1000 USD. An i7 computer makes use of a Quadcore Hyperthreading CPU. This results in 8 processes being available to programs running on this computer. It is virtually impossible to buy a single-core desktop computer.
Unfortunately programs will not take advantage of these new multicore machines unless the program is written to be multithread. A non-threaded application will simply consume nearly all of the processing power of one core and leave the remaining cores virtually idle. Writing programs to be multithread can be tricky. You must be able to break the task up into smaller packets that each thread can process. At some point the threads usually must communicate with each other and aggregate the job back together.
Neural network training is a very time consuming task. Computers can run for hours, if not days, on a single training task. Supervised neural networks are generally trained with resilient propagation (RPROP) or back propagation. RPROP is the more modern of the two and is almost always the preferred solution. I wanted to enhance the Encog Artificial Intelligence Framework to make use of multithreading to provide fast training on multicore machines. I began Googling for how others might have done it. Unfortunately I did not find much on the topic of multithreaded implementations of back propagation or resilient propagation. I found some solutions, but I had my doubts as to how effective they would with large numbers of processors. I wanted a solution that would work with a potentially large number of processors. I did not want a great deal of synchronization between the threads either, as I may want to run this from a grid of computers at some point.
At this point, as of Encog 2.2, I have a fairly efficient multithreaded training process in place. It only works on a single computer at this point, I will leave a grid implementation for a future version of Encog. This is implemented in the Encog class MultiPropagation. Multi Propagation, or MPROP, is a special training technique introduced in Encog that is based on Resilient Propagation.
A short example is provided that will train a neural network with both MPROP and RPROP. This allows me to compare the overall performance of MPROP on various computer hardware. The following is the main method performs this comparison.
Both MPROP and RPROP will be fed the same training data and neural network. However, the neural network will be reset for each training algorithm so that no learning from the previous training carries through. The training data is random.
The neural network is composed with the following parameters.
Input Neurons: 40
The training data is composed of 20,000 input and ideal data pairs.
All of the Encog training algorithms implement the Train interface. This makes them fairly interchangeable. The implementation of the evaluateMPROP and evaluateRPROP is very similar. The implementation of evaluateMPROP is shown here.
MultiPropagation train = new MultiPropagation(network,data);
This is a typical Encog training routine. Here we loop through 20 training iterations. We track the number of seconds that this took. A similar process is done for RPROP.
Multi Propagation Implementation
All propagation training techniques work similarly. Whether it be back propagation, resilient propagation or the Manhattan update rule, the technique is similar. There are two three distinct steps:
- Perform a Regular Feed Forward Pass
- Process the levels backwards and determine the errors at each level
- Apply the changes to the weights and thresholds
First, a regular feed forward pass is performed. The output from each level is kept so that the error for each level can be evaluated independent. Second, the errors are calculated at each level, and the derivatives of each of the activation functions are used to calculate gradient descents. These gradients will be used in the third step.
The third step is what varies among the different training algorithms. Backpropagation simply takes the gradient descents and scales them by a learning rate. The scaled gradient descents are then directly applied to the weights and thresholds. The Manhattan Update Rule only uses the sign of the gradient to decide in which direction to affect the weight. The weight is then changed in either the positive or negative direction by a fixed constant.
RPROP keeps an individual delta value for every weight and thresholds and only uses the sign of the gradient descent to increase or decrease the delta amounts. The delta amounts are then applied to the weights and thresholds.
The MPROP algorithm uses threads to perform steps 1 & 2. The training data is broken into packets that are distributed among the threads. At the beginning of each iteration threads are started to handle each of these packets. Once all threads have completed, a single thread aggregates all of the results from the threads and applies them to the neural network. There is a very brief amount of time where only one thread is executing, at the end of the iteration. This can be seen from the following monitor.
As you can see from the above image, the i7 is currently running at 100%. You can clearly see the end of each iteration, where each of the processors falls briefly. Fortunately, this is a very brief time and does not have a large impact on overall training efficiency. I did try implementations where I did not force the threads to wait at the end of the iteration for a resynchronization. However these did not provide as efficient of training because the RPROP algorithm, upon which MPROP is based, needs all changes applied before the next iteration begins.
The MPROP algorithms uses a number of threads equal to the number of processors reported by Java, plus 1. Unless there is a single processor, then the MPROP algorithms falls back to a regular single threaded RPROP algorithm.
I tried MPROP, using the above mentioned network and training data on three different computer platforms. First we will look at it on an i7 quadcore.
Dell Studio XPS 8000, 8gig RAM Intel i7 at 2.79GHz
Training 20 Iterations with RPROP
As you can see this machine had a factor improvement of about 4 times. This quadcore uses hyperthreading so 8 processors are reported to Java. Therefore MPROP used 9 threads. Despite the fact that hyperthreading reports twice the number of cores than are physically present, I do not find that it executes anywhere near as fast as additional cores. However, it does help. The fact that I got 4 times with 4 cores is really very good. Threading introduces overhead, without the hyperthreading it is very unlikely that the factor would have been 4 or higher. A factor of 4 implies that the thread switching was perfect, and not even the single threaded synchronization time at the end of each MPROP iteration affected it. Hyperthreading is to thank for this. Still, hyperthreading did not get us anywhere near 7 or 8 times.
Now we will look at a Dual Core computer, with no hyperthreading. Here you see the results from a Dual Core iMac.
Training 20 Iterations with RPROP
As you can see the factor improvement over single threaded RPROP is 1.89. There is no hyperthreading, so it is just the two cores executing. Due to threading overhead and iteration synchronization times, we do not get a perfectly efficient factor of 2.0.
We can also see the results on a single core, hyperthreading computer.
Dimension 8100, Intel Pentium 4 CPU, 3.00 ghtz. 1GB ram
Training 20 Iterations with RPROP
This computer is single core, yet has hyperthreading, so Java reports processors as two. Even with only hyperthreading, a factor improvement is still present, and the MPROP threading algorithm is still beneficial. If MPROP were run on a true single core computer, with no hyperthreading, Java would report the processor count as one and the MPROP algorithms would fall back to single threading.
MPROP offers great performance improvements over the single threaded MPROP algorithms in cases where a reasonably large training set is present and multicore hardware is used. If neither of these two factors are present, MPROP will fall back to RPROP for training. Because of this, it is the best general purpose training algorithm for Encog.
MPROP was introduced with Encog 2.2, but will continue to be enhanced with future versions of Encog. Improvements will include further optimization of the end of iteration synchronization and moving as much of this synchronization code to the threads as possible. Also options will be added at some point to allow MPROP to operate on a grid of computers.