This article shows how the Multi Propagation (MPROP) algorithm was implemented for Encog for Java. Though this article focuses on the Java implementation, the C# version would be very similar. MPROP is based on resilient propagation, but it is designed to work well with multicore computers and deliver maximum performance.

Multicore computers are becoming more and more common. There seem to be only so many gigahertz that computer manufacturers can squeeze out of CPUs. The real growth in computer performance will likely be in the number of cores contained in a computer's CPU. As of the writing of this document, October 25, 2009, Intel i7 computers can be had for around $1,000 USD. An i7 computer makes use of a quad-core hyperthreaded CPU. This results in eight logical processors being available to programs running on this computer. It is virtually impossible to buy a single-core desktop computer.

Unfortunately, programs will not take advantage of these new multicore machines unless they are written to be multithreaded. A non-threaded application will simply consume nearly all of the processing power of one core and leave the remaining cores virtually idle. Writing multithreaded programs can be tricky. You must be able to break the task up into smaller packets that each thread can process. At some point the threads usually must communicate with each other and aggregate the results back together.

Neural network training is a very time-consuming task. Computers can run for hours, if not days, on a single training task. Supervised neural networks are generally trained with resilient propagation (RPROP) or backpropagation. RPROP is the more modern of the two and is almost always the preferred solution. I wanted to enhance the Encog Artificial Intelligence Framework to make use of multithreading to provide fast training on multicore machines. I began Googling for how others might have done it. Unfortunately, I did not find much on the topic of multithreaded implementations of backpropagation or resilient propagation. I found some solutions, but I had my doubts as to how effective they would be with large numbers of processors. I wanted a solution that would work with a potentially large number of processors. I did not want a great deal of synchronization between the threads either, as I may want to run this on a grid of computers at some point.

As of Encog 2.2, I have a fairly efficient multithreaded training process in place. It only works on a single computer at this point; I will leave a grid implementation for a future version of Encog. This is implemented in the Encog class MultiPropagation. Multi Propagation, or MPROP, is a special training technique introduced in Encog that is based on resilient propagation.

A short example is provided that trains a neural network with both MPROP and RPROP. This allows me to compare the overall performance of MPROP on various computer hardware. The following main method performs this comparison.

```java
Logging.stopConsoleLogging();
BasicNetwork network = generateNetwork();
NeuralDataSet data = generateTraining();
double rprop = evaluateRPROP(network, data);
network.reset();
double mprop = evaluateMPROP(network, data);
double factor = rprop / mprop;
System.out.println("Factor improvement:" + factor);
```

Both MPROP and RPROP will be fed the same training data and neural network. However, the neural network will be reset for each training algorithm so that no learning from the previous training carries through. The training data is random.

The neural network is configured with the following parameters.

| Attribute | Value |
|---|---|
| Input Neurons | 40 |
| Output Neurons | 20 |
| Hidden Layer #1 Neurons | 60 |
The training data is composed of 20,000 input and ideal data pairs.
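For reference, the network and training data could be built along the following lines. This is a minimal sketch assuming the Encog 2.x API (BasicNetwork, BasicLayer, and BasicNeuralDataSet); the exact calls inside the example's generateNetwork and generateTraining helpers may differ.

```java
import java.util.Random;

import org.encog.neural.data.NeuralDataSet;
import org.encog.neural.data.basic.BasicNeuralDataSet;
import org.encog.neural.networks.BasicNetwork;
import org.encog.neural.networks.layers.BasicLayer;

public static BasicNetwork generateNetwork() {
    BasicNetwork network = new BasicNetwork();
    network.addLayer(new BasicLayer(40)); // input layer
    network.addLayer(new BasicLayer(60)); // hidden layer #1
    network.addLayer(new BasicLayer(20)); // output layer
    network.getStructure().finalizeStructure();
    network.reset(); // randomize the weights
    return network;
}

public static NeuralDataSet generateTraining() {
    // 20,000 random input/ideal pairs
    final Random rand = new Random();
    final double[][] input = new double[20000][40];
    final double[][] ideal = new double[20000][20];
    for (int i = 0; i < input.length; i++) {
        for (int j = 0; j < input[i].length; j++) {
            input[i][j] = rand.nextDouble();
        }
        for (int j = 0; j < ideal[i].length; j++) {
            ideal[i][j] = rand.nextDouble();
        }
    }
    return new BasicNeuralDataSet(input, ideal);
}
```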

All of the Encog training algorithms implement the Train interface. This makes them fairly interchangeable. The implementations of evaluateMPROP and evaluateRPROP are very similar. The implementation of evaluateMPROP is shown here.

```java
MultiPropagation train = new MultiPropagation(network, data);
long start = System.currentTimeMillis();
System.out.println("Training 20 Iterations with MPROP");
for (int i = 1; i <= 20; i++) {
    train.iteration();
    System.out.println("Iteration #" + i + " Error:" + train.getError());
}
train.finishTraining();
long stop = System.currentTimeMillis();
double diff = ((double) (stop - start)) / 1000.0;
System.out.println("MPROP Result:" + diff + " seconds.");
System.out.println("Final MPROP error: " + network.calculateError(data));
return diff;
```

This is a typical Encog training routine. Here we loop through 20 training iterations. We track the number of seconds that this took. A similar process is done for RPROP.
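For comparison, here is a sketch of what evaluateRPROP might look like, assuming Encog's ResilientPropagation trainer. It mirrors the MPROP version above, minus the finishTraining call.

```java
import org.encog.neural.data.NeuralDataSet;
import org.encog.neural.networks.BasicNetwork;
import org.encog.neural.networks.training.propagation.resilient.ResilientPropagation;

public static double evaluateRPROP(BasicNetwork network, NeuralDataSet data) {
    ResilientPropagation train = new ResilientPropagation(network, data);
    long start = System.currentTimeMillis();
    System.out.println("Training 20 Iterations with RPROP");
    for (int i = 1; i <= 20; i++) {
        train.iteration();
        System.out.println("Iteration #" + i + " Error:" + train.getError());
    }
    long stop = System.currentTimeMillis();
    double diff = ((double) (stop - start)) / 1000.0;
    System.out.println("RPROP Result:" + diff + " seconds.");
    System.out.println("Final RPROP error: " + network.calculateError(data));
    return diff;
}
```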

All propagation training techniques work similarly. Whether it is backpropagation, resilient propagation, or the Manhattan update rule, the technique is similar. There are three distinct steps:

1. Perform a regular feedforward pass

2. Process the levels backwards and determine the errors at each level

3. Apply the changes to the weights and thresholds

First, a regular feedforward pass is performed. The output from each level is kept so that the error for each level can be evaluated independently. Second, the errors are calculated at each level, and the derivatives of each of the activation functions are used to calculate gradient descents. These gradients will be used in the third step.

The third step is what varies among the different training algorithms. Backpropagation simply takes the gradient descents and scales them by a learning rate. The scaled gradient descents are then directly applied to the weights and thresholds. The Manhattan Update Rule only uses the sign of the gradient to decide in which direction to affect the weight. The weight is then changed in either the positive or negative direction by a fixed constant.

RPROP keeps an individual delta value for every weight and threshold and only uses the sign of the gradient descent to increase or decrease the delta amounts. The delta amounts are then applied to the weights and thresholds.
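To make the contrast concrete, the three weight-update rules might be sketched as follows. This is illustrative Java, not Encog's actual implementation; the arrays, the learningRate and fixedConstant values, the 1.2/0.5 RPROP factors, and the sign convention (gradients pointing downhill) are assumptions.

```java
// Backpropagation: scale the gradient by a learning rate and apply it.
weights[i] += learningRate * gradients[i];

// Manhattan update rule: only the sign of the gradient matters;
// move by a fixed constant in that direction.
weights[i] += Math.signum(gradients[i]) * fixedConstant;

// RPROP (simplified): keep a per-weight delta, grow it while the
// gradient keeps its sign, and shrink it when the sign flips.
if (gradients[i] * lastGradients[i] > 0) {
    deltas[i] = Math.min(deltas[i] * 1.2, maxDelta); // same sign: accelerate
} else if (gradients[i] * lastGradients[i] < 0) {
    deltas[i] = Math.max(deltas[i] * 0.5, minDelta); // sign flip: back off
}
weights[i] += Math.signum(gradients[i]) * deltas[i];
lastGradients[i] = gradients[i];
```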

The MPROP algorithm uses threads to perform steps 1 & 2. The training data is broken into packets that are distributed among the threads. At the beginning of each iteration threads are started to handle each of these packets. Once all threads have completed, a single thread aggregates all of the results from the threads and applies them to the neural network. There is a very brief amount of time where only one thread is executing, at the end of the iteration. This can be seen from the following monitor.

As you can see from the above image, the i7 is running at 100%. You can clearly see the end of each iteration, where the load on each of the processors falls briefly. Fortunately, this is a very brief time and does not have a large impact on overall training efficiency. I did try implementations where I did not force the threads to wait at the end of the iteration for a resynchronization. However, these did not train as efficiently, because the RPROP algorithm, upon which MPROP is based, needs all changes applied before the next iteration begins.
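Conceptually, the per-iteration fork/join described above might look like the following sketch. This is not Encog's actual MultiPropagation code: the ExecutorService, DataPacket, GradientWorker (a Callable<double[]> that returns partial gradients for its packet), weightCount, and applyUpdates are all hypothetical names used for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// One training iteration: fork workers over the training packets,
// join them all, then apply the aggregated gradients single-threaded.
public void iteration(ExecutorService pool, List<DataPacket> packets)
        throws InterruptedException, ExecutionException {

    // Steps 1 and 2: each worker runs the feedforward pass and the
    // backward gradient calculation for its own packet of training data.
    List<Future<double[]>> results = new ArrayList<Future<double[]>>();
    for (DataPacket packet : packets) {
        results.add(pool.submit(new GradientWorker(network, packet)));
    }

    // Join: sum the partial gradients from every worker.
    double[] totals = new double[weightCount];
    for (Future<double[]> future : results) {
        double[] partial = future.get(); // blocks until this worker finishes
        for (int i = 0; i < totals.length; i++) {
            totals[i] += partial[i];
        }
    }

    // Step 3: apply the RPROP-style update once, on a single thread,
    // so every change is in place before the next iteration begins.
    applyUpdates(totals);
}
```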

The MPROP algorithm uses a number of threads equal to the number of processors reported by Java, plus one. If there is only a single processor, the MPROP algorithm falls back to a regular single-threaded RPROP algorithm.
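In Java, the processor count comes from Runtime.availableProcessors; here is a minimal sketch of the rule just described:

```java
int processors = Runtime.getRuntime().availableProcessors();
if (processors > 1) {
    int threadCount = processors + 1; // one thread per reported processor, plus one
    // ...split the training data into threadCount packets and train with MPROP...
} else {
    // only one processor reported: fall back to single-threaded RPROP
}
```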

I tried MPROP, using the above-mentioned network and training data, on three different computer platforms. First we will look at it on an i7 quad-core.

Dell Studio XPS 8000, 8GB RAM, Intel i7 at 2.79GHz

```
Training 20 Iterations with RPROP
Iteration #1 Error:1.0592062021803321
Iteration #2 Error:1.0112968157018771
Iteration #3 Error:0.9650583848127503
Iteration #4 Error:0.9269433225981621
Iteration #5 Error:0.8947162095367102
Iteration #6 Error:0.8714873694194031
Iteration #7 Error:0.8445288449926142
Iteration #8 Error:0.8186688302191717
Iteration #9 Error:0.7952278955734976
Iteration #10 Error:0.7717422560410586
Iteration #11 Error:0.7475048877257578
Iteration #12 Error:0.7235382011165326
Iteration #13 Error:0.7026047081990957
Iteration #14 Error:0.6843757761100023
Iteration #15 Error:0.6685206160475999
Iteration #16 Error:0.6539311876046258
Iteration #17 Error:0.6412660225209257
Iteration #18 Error:0.630790400329957
Iteration #19 Error:0.6211146795350724
Iteration #20 Error:0.6136882493691617
RPROP Result:128.562 seconds.
Final RPROP error: 0.6075224766406004
Training 20 Iterations with MPROP
Iteration #1 Error:0.6075212244066446
Iteration #2 Error:0.8665463281875874
Iteration #3 Error:0.8316846996192032
Iteration #4 Error:0.7451195340393163
Iteration #5 Error:0.7005024644028119
Iteration #6 Error:0.6691870245157884
Iteration #7 Error:0.649034289358449
Iteration #8 Error:0.6339114535879514
Iteration #9 Error:0.6208812103003265
Iteration #10 Error:0.6111566730037973
Iteration #11 Error:0.6056166450414902
Iteration #12 Error:0.6003765685919015
Iteration #13 Error:0.5964873091129251
Iteration #14 Error:0.5932816072550446
Iteration #15 Error:0.5905725872184455
Iteration #16 Error:0.5882703219084173
Iteration #17 Error:0.5863667500894574
Iteration #18 Error:0.5848003831853418
Iteration #19 Error:0.5835759158206529
Iteration #20 Error:0.5823759906353797
MPROP Result:31.88 seconds.
Final MPROP error: 0.5814082684159508
Factor improvement:4.032685069008783
```

As you can see, this machine had a factor improvement of about 4 times. This quad-core uses hyperthreading, so 8 processors are reported to Java. Therefore MPROP used 9 threads. Although hyperthreading reports twice the number of cores than are physically present, I do not find that it executes anywhere near as fast as additional cores would. However, it does help. The fact that I got a factor of 4 with 4 cores is really very good. Threading introduces overhead; without the hyperthreading, it is very unlikely that the factor would have been 4 or higher. A factor of 4 would imply that the thread switching was perfect and that not even the single-threaded synchronization time at the end of each MPROP iteration affected it. Hyperthreading is to thank for this. Still, hyperthreading did not get us anywhere near 7 or 8 times.
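Amdahl's law makes this concrete. If a fraction P of each iteration runs in parallel across N logical processors and the rest stays serial, the maximum speedup is:

Speedup = 1 / ((1 − P) + P / N)

As a rough illustration (the 95% figure is an assumption, not a measurement): with P = 0.95 and N = 8, the bound is 1 / (0.05 + 0.95/8) ≈ 5.9. Even a small serial section, like the end-of-iteration synchronization, caps the gain well below 8 times.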

Now we will look at a dual-core computer with no hyperthreading. Here you see the results from a dual-core iMac.

```
Training 20 Iterations with RPROP
Iteration #1 Error:1.0619945526007815
Iteration #2 Error:1.0173279127855563
Iteration #3 Error:0.9728967012381747
Iteration #4 Error:0.933266210736963
Iteration #5 Error:0.902990819036054
Iteration #6 Error:0.8785929993141319
Iteration #7 Error:0.852770802106324
Iteration #8 Error:0.8297385766666532
Iteration #9 Error:0.8038142080023881
Iteration #10 Error:0.7800412962894463
Iteration #11 Error:0.7587560796078602
Iteration #12 Error:0.7356506865399463
Iteration #13 Error:0.7151090444337569
Iteration #14 Error:0.695744709113637
Iteration #15 Error:0.6788368720802751
Iteration #16 Error:0.6642652711631868
Iteration #17 Error:0.6509332872251975
Iteration #18 Error:0.6397801404435584
Iteration #19 Error:0.630090330257044
Iteration #20 Error:0.6216381426133933
RPROP Result:183.834 seconds.
Final RPROP error: 0.6146150864453944
Training 20 Iterations with MPROP
Iteration #1 Error:0.6146143819685393
Iteration #2 Error:0.861988806667595
Iteration #3 Error:0.8245303438423693
Iteration #4 Error:0.7518132811207181
Iteration #5 Error:0.7081523967374347
Iteration #6 Error:0.6712984917380188
Iteration #7 Error:0.652201535422028
Iteration #8 Error:0.6406601654553405
Iteration #9 Error:0.629090967114433
Iteration #10 Error:0.6175595827587673
Iteration #11 Error:0.6122245715859175
Iteration #12 Error:0.6062311438183174
Iteration #13 Error:0.6013160315144382
Iteration #14 Error:0.5977755359770852
Iteration #15 Error:0.5946842580058522
Iteration #16 Error:0.5920646899231164
Iteration #17 Error:0.5896362050394485
Iteration #18 Error:0.5876920979572654
Iteration #19 Error:0.5859706453734917
Iteration #20 Error:0.5844223947199683
MPROP Result:97.25 seconds.
Final MPROP error: 0.5831411341205304
Factor improvement:1.8903239074550129
```

As you can see, the factor improvement over single-threaded RPROP is 1.89. There is no hyperthreading, so it is just the two cores executing. Due to threading overhead and iteration synchronization times, we do not get a perfectly efficient factor of 2.0.

We can also see the results on a single-core, hyperthreaded computer.

Dimension 8100, Intel Pentium 4 CPU at 3.00GHz, 1GB RAM

```
Training 20 Iterations with RPROP
Iteration #1 Error:1.055330758837017
Iteration #2 Error:1.0086543528834082
Iteration #3 Error:0.9635498434678869
Iteration #4 Error:0.9234062461127825
Iteration #5 Error:0.893359620995546
Iteration #6 Error:0.8676510392528938
Iteration #7 Error:0.8426219917532086
Iteration #8 Error:0.8176701896241206
Iteration #9 Error:0.7900163103983693
Iteration #10 Error:0.7663833064550073
Iteration #11 Error:0.7423741422166754
Iteration #12 Error:0.7186589370627757
Iteration #13 Error:0.6973331339716679
Iteration #14 Error:0.6786518719968172
Iteration #15 Error:0.6632366051944965
Iteration #16 Error:0.647960280068216
Iteration #17 Error:0.637116419836954
Iteration #18 Error:0.6275179497202836
Iteration #19 Error:0.6193774112511847
Iteration #20 Error:0.6130185860535315
RPROP Result:501.408 seconds.
Final RPROP error: 0.6072752146745306
Training 20 Iterations with MPROP
Iteration #1 Error:0.607274926192379
Iteration #2 Error:0.8534056403923543
Iteration #3 Error:0.8121626914024609
Iteration #4 Error:0.7330129725685066
Iteration #5 Error:0.6952683973977223
Iteration #6 Error:0.6695707142291197
Iteration #7 Error:0.6540724010870805
Iteration #8 Error:0.6338389819166642
Iteration #9 Error:0.6234153712989258
Iteration #10 Error:0.6134344366833728
Iteration #11 Error:0.605518025749937
Iteration #12 Error:0.6015147936004235
Iteration #13 Error:0.5972975056266082
Iteration #14 Error:0.5941906318175534
Iteration #15 Error:0.5912954856379496
Iteration #16 Error:0.5890932691798955
Iteration #17 Error:0.587244744073592
Iteration #18 Error:0.585618048750272
Iteration #19 Error:0.5842085212827279
Iteration #20 Error:0.5829523593081855
MPROP Result:408.015 seconds.
Final MPROP error: 0.5819121474538267
Factor improvement:1.228895996470718
```

This computer is single-core, yet has hyperthreading, so Java reports two processors. Even with only hyperthreading, a factor improvement is still present, and the MPROP threading algorithm is still beneficial. If MPROP were run on a true single-core computer with no hyperthreading, Java would report the processor count as one and the MPROP algorithm would fall back to single threading.

MPROP offers great performance improvements over the single-threaded RPROP algorithm in cases where a reasonably large training set is present and multicore hardware is used. If neither of these two factors is present, MPROP will fall back to RPROP for training. Because of this, it is the best general-purpose training algorithm for Encog.

MPROP was introduced with Encog 2.2, but it will continue to be enhanced in future versions of Encog. Improvements will include further optimization of the end-of-iteration synchronization and moving as much of this synchronization code into the threads as possible. Options will also be added at some point to allow MPROP to operate on a grid of computers.

For more information on Encog visit http://www.heatonresearch.com/encog/
