Initial Impressions of GPU Programming

In this article I will discuss some of the primary issues with GPU programming. This article will focus on how we implemented the first version of GPU support in the Encog Machine Learning Framework. This article focuses mainly on using OpenCL from Java or C#; however, it is general enough to apply to any language that can communicate with OpenCL.

Overall I am happy with the first generation of Encog OpenCL support. Some things worked well, others did not. In this article I will cover both. I will describe what areas we had issues with, and how we plan to improve GPU support in future versions.

Encog is a machine learning framework for Java, C# and Silverlight. The latest version of Encog added GPU support. The GPU support in Encog 2.4 is functional, and in some cases even speeds up processing time. However, in some cases it can cause the processing to slow down. While working on this version of Encog I have learned a fair amount about how to use the GPU, and about some of the improvements that I hope to make in future versions. In this article, I will share some of the things that I have learned about GPU programming.

Before I get too far into the article, let me define the layers. There is some software you must install before using your GPU this way. At the lowest level there is the GPU. You will need an OpenCL driver to make use of your GPU. OpenCL abstracts the differences between different video card vendors and defines a common, C-like language used to write small programs for the GPU. These small programs are called kernels. Java and C# cannot communicate directly with OpenCL. You must use an intermediary called a binding. For Java I use the JOCL binding; for C# I use a binding called CLOO.

GPU programming is all about parallel programming. Parallel programming is a major trend in computer programming. Some say that it could become one of the biggest paradigm shifts in computer programming since object-oriented programming. CPU manufacturers have hit the wall with clock speeds. CPU speeds hit around 3 GHz several years ago and have not substantially improved. The growth is now in cores. Computers of the future will come with multiple processors, each packed with many cores.

GPU programming is a natural extension of this trend. GPU programming allows you to make use of your graphics card, just as you make use of your CPU. The tasks that you assign to your graphics card may have nothing to do with graphics programming. Graphics cards can be used for general computing. Additionally, where your computer may have 2, 4 or 8 cores, many GPUs come with more than 100.

This may seem like a processing goldmine! You can tap your graphics card, which may very well have greater processing power than your computer's actual CPU. For some tasks, this is certainly the case. However, there are some serious limitations inherent in GPU programming that you should be aware of.

Some of the current GPU limitations that you should be aware of include:

  • No official support in programming languages
  • Must program GPU portions in a C-like language
  • Kernels are executed in isolation
  • Parallel programming is still hard

In the next sections we will examine each of these limitations.

No Official Support

Current programming languages all have well developed class libraries to support threading. Threading allows you to make use of multicore CPUs. This is not at all the case with GPU programming. None of the major programming languages has direct support for GPU programming. Rather, you will have to make use of third party frameworks, called bindings, that handle the lower level intricacies of executing programs on your GPU. There are several to choose from for each programming language.

At this point GPU usage is still very much a bleeding edge technology. At some point, you will no doubt see it become a standard part of operating systems and programming languages. Already, the Macintosh operating system has incorporated OpenCL. OpenCL is one of the standard ways to implement GPU programming. OpenCL can be used directly from C/C++. However, if you are using a higher level language, such as Java or C#, you will need to use a binding to talk to the low level OpenCL API.

This all feels very bleeding edge. I have little doubt that once this is all integrated better, I will be throwing out my current OpenCL codebase and making use of whatever method my programming language officially provides me. However, such is the case when you make use of bleeding edge technologies.

Kernels are Written in a C-Like Language

GPU programming is a polyglot programming technique. The most common example of polyglot programming is the database application. You are dealing with two programming languages. Most of your program is written in your high-level language of choice, Java, for example. However, your Java application contains SQL that will be executed on the database. Often, this SQL is simply pasted into strings contained in your Java source code. Sometimes your Java code even modifies, or generates, the SQL before sending it off to the database.

This is exactly the same situation with OpenCL. You cannot write your entire application in OpenCL. It will need some sort of host application. The host application will be written in a high level programming language of your choice. In this article, we will continue to assume that it is Java. For certain areas, you will rewrite your Java code into OpenCL. It is by no means a direct translation. You will be using a C-like language.

Because you cannot make use of Java, or any other higher level language, with OpenCL, you will find yourself often having to implement the application twice. Typically, you will need to isolate performance critical parts of your application and then translate those areas into the OpenCL C-like language. You will also, likely, need to keep your original Java code around as well. You do not want to assume that the end user is going to be running with OpenCL, so you will need to keep the Java code functional as well. Additionally, if you want to get every bit of speed out of the machine that you can, it will be necessary to be able to execute your Java based CPU code at the same time as you execute the GPU code. Parallel programming becomes even more difficult, as you must now manage CPU and GPU tasks at the same time.
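As a rough illustration of this duplication, the sketch below keeps a tiny OpenCL kernel as a Java string, much like embedded SQL, next to the equivalent plain Java loop. The kernel name scale, the class name, and the method name are made up for this example and are not taken from Encog. Actually compiling and running the kernel would require a binding such as JOCL, which is left out here.

```java
// Illustrative only: names here are hypothetical, not Encog's.
final class PolyglotExample {
    // OpenCL C source carried around as an ordinary Java string.
    static final String KERNEL_SOURCE =
          "__kernel void scale(__global float* data, const float factor) {\n"
        + "    int i = get_global_id(0);\n"
        + "    data[i] = data[i] * factor;\n"
        + "}\n";

    // The same computation in Java, kept for machines without OpenCL.
    static void scaleOnCpu(float[] data, float factor) {
        for (int i = 0; i < data.length; i++) {
            data[i] = data[i] * factor;
        }
    }
}
```

Note that the loop in the Java version disappears in the kernel: each GPU work item handles one index, which it obtains from get_global_id.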

One of the advertised benefits of OpenCL is that it provides a heterogeneous environment to use both CPUs and GPUs. The idea is great: just write all of the processor critical sections of your code in OpenCL and execute them on either a CPU or GPU. The only problem is that not all card manufacturers support this mode. AMD supports this just fine. However, nVidia does not support the CPU as an OpenCL device, at least not as of the writing of this article. So, unless I wanted to cut out nVidia as a supported platform, I was forced to mix both Java and OpenCL code to execute simultaneously. This was somewhat complex to do, and I am sure this part will all be "throw away" code once nVidia catches up to AMD on CPU processing.

The OpenCL code is C-based, so there are no objects or higher level object oriented programming concepts. However, you do get pointers and all of the other lower level C constructs. This can result in some very efficient, and fast, code. However, if you are more used to Java, this language can feel like a bit of a step backwards. I started in C, as one of my earliest programming languages, so this aspect was not as much of a hardship. However, I did not like having to program the same thing twice, once for Java, and once for OpenCL.

OpenCL is Isolated

You can really only put certain parts of your application inside of an OpenCL kernel. You pass parameters and blocks of memory to the kernel. You cannot pass higher level objects to the kernel. You certainly cannot pass a database connection, or access the file system from a kernel. The kernels are very SQL-like, in that you pass data for the kernels to work on, and you receive data back from them. Additionally, you cannot pass data to the kernels as Java objects. Everything is usually passed as primitives or arrays of primitives.
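In practice this means flattening your objects before they go anywhere near the GPU. The following is a minimal sketch of that step, assuming a hypothetical Pair class that holds one training item. It is not Encog's actual data class, and the layout (inputs followed by ideal outputs, one item after another) is just one reasonable choice.

```java
import java.util.List;

final class FlattenExample {
    // Hypothetical training item: an input vector and its ideal output.
    static final class Pair {
        final double[] input;
        final double[] ideal;

        Pair(double[] input, double[] ideal) {
            this.input = input;
            this.ideal = ideal;
        }
    }

    // Packs all pairs into one float array that could be copied to a GPU buffer.
    static float[] flatten(List<Pair> pairs, int inputSize, int idealSize) {
        int stride = inputSize + idealSize;
        float[] flat = new float[pairs.size() * stride];
        int offset = 0;
        for (Pair p : pairs) {
            for (int i = 0; i < inputSize; i++) {
                flat[offset + i] = (float) p.input[i];
            }
            for (int i = 0; i < idealSize; i++) {
                flat[offset + inputSize + i] = (float) p.ideal[i];
            }
            offset += stride;
        }
        return flat;
    }
}
```

The kernel then needs only the array, the input size, and the ideal size to find any item by index.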

Because OpenCL is isolated, you can easily end up spending considerable amounts of time simply ferrying data to and from the GPU. The time that it takes to move data is a major consideration in grid computing. Breaking up a task over multiple processing units is not free. This sort of distribution always introduces overhead. If the overhead becomes greater than the amount of extra processing done by additional CPUs or GPUs, then your end result will be slower than if you had simply used the CPU alone.
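A back-of-the-envelope way to think about this trade-off is sketched below. The model is a deliberate simplification of my own (one fixed transfer cost plus the CPU time divided by a speedup factor), and the numbers you feed it would have to come from your own measurements.

```java
final class OffloadEstimate {
    // Estimated total GPU time in milliseconds: data transfer plus accelerated compute.
    static double gpuTotalMs(double transferMs, double cpuComputeMs, double gpuSpeedup) {
        return transferMs + cpuComputeMs / gpuSpeedup;
    }

    // True only when the GPU path, including transfer overhead, beats the CPU alone.
    static boolean worthOffloading(double transferMs, double cpuComputeMs, double gpuSpeedup) {
        return gpuTotalMs(transferMs, cpuComputeMs, gpuSpeedup) < cpuComputeMs;
    }
}
```

With a 5 ms transfer and a tenfold speedup, a 100 ms job is worth moving (about 15 ms in total), while a 4 ms job is not, because the transfer alone costs more than the whole CPU run.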

Your GPU has a “Day Job”

It is easy to forget that your GPU is really included in your computer to run the display. This is its primary function, or its “Day Job”. While the GPU is running OpenCL code, it cannot run the display. The GPU does not multitask on this level. Because of this, it is often necessary to write your kernel as a series of short calls. If you take too long to return from your kernel, the system will become very unresponsive. If you take more than a few seconds, the operating system will think the GPU has crashed, and will “bounce” it, ending your OpenCL execution.

This can make things somewhat challenging. The whole idea of OpenCL is to allow you to offload computation intensive code to the GPU. But you can’t run for too long, only a second or two, before the operating system shuts you down. Of course, you can disable this “timeout” in the operating system. For unattended jobs, I often do this. But the computer will be nearly unusable while it runs. Further, if your OpenCL program does enter some sort of an endless loop, with the timeout disabled, you will not be able to regain control of your computer by normal means. You will likely have to reach for the power button.
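One way to stay under the timeout is to plan the work as many small launches instead of one large one. The sketch below only does the planning on the Java side. The class and method names are hypothetical, and choosing a safe chunk size for a given card is something you would have to find by experiment.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkedLaunch {
    // Splits [0, totalItems) into {offset, count} pairs of at most maxPerCall items.
    // Each pair would drive one short kernel call.
    static List<int[]> plan(int totalItems, int maxPerCall) {
        if (maxPerCall <= 0) {
            throw new IllegalArgumentException("maxPerCall must be positive");
        }
        List<int[]> chunks = new ArrayList<>();
        for (int offset = 0; offset < totalItems; offset += maxPerCall) {
            int count = Math.min(maxPerCall, totalItems - offset);
            chunks.add(new int[] {offset, count});
        }
        return chunks;
    }
}
```

Between launches the GPU gets a chance to go back to its day job and redraw the screen.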

Parallel Programming is Still Hard

Basically it comes down to the simple fact that parallel programming is hard. If you were having problems getting your application to gain performance improvements from a quad core, imagine now having 100 cores to deal with. Some applications are very conducive to grid programming. Other applications simply cannot be executed in parallel.

In some ways GPU programming is more like grid programming than traditional multicore thread processing. This is especially true if you need to mix CPU and GPU tasks. If you must move data between your CPU and GPU tasks you must deal with a much higher overhead than when you are simply sharing data between multiple CPU cores.

If your task is not easily made parallel, then using the GPU is not going to be that easy. Additionally, the GPU is best used for mathematically intensive tasks. Certain programming constructs, which are common in Java, can completely destroy the performance of an OpenCL application. One such construct is the ubiquitous if-statement. The if-statement will kill GPU performance. It was not all that long ago that GPUs did not support if-statements at all.

Challenges for Encog 2.4

As of Encog 2.4, OpenCL support has been added. It will be extended considerably in future versions. Not all situations will train a neural network faster using OpenCL. In this section I will describe some of the areas that proved difficult during the OpenCL implementation.

Supervised neural networks train by looping over training data and modifying the neural network weights so that the neural network produces output that more closely matches the desired output. This is done over a series of iterations, or epochs. Each iteration will present the entire training set to the neural network. The next iteration will present the same training set to the neural network again. Each iteration will modify the weights slightly to improve the overall error level of the neural network.

The problem is that neural network training is not a process that can easily be executed in parallel. Encog has supported multithreaded training for several versions. Multithreaded training allows Encog to take advantage of a multicore CPU. To support multithreading I had Encog create a thread for each CPU core. These threads are kept in a pool, so that thread creation overhead is not a factor.

The training data is split over each thread. If the amount of training data for each thread is too small, say, below 100 items, then the training reverts to single threaded mode. If there is not enough training data the overhead of breaking the job up does not warrant the use of multithreading. Most training cases will have more than 100 items per CPU core.
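The decision described above can be sketched in a few lines. This is my own simplified illustration, not Encog's actual code. The threshold of 100 comes from the text, while the class and method names are invented for the example.

```java
final class WorkerPlan {
    // Below this many items per thread, splitting the job is not worth the overhead.
    static final int MIN_ITEMS_PER_THREAD = 100;

    // Returns how many threads to use for one training iteration.
    static int threadCount(int trainingItems, int cpuCores) {
        if (cpuCores < 1) {
            throw new IllegalArgumentException("cpuCores must be at least 1");
        }
        if (trainingItems / cpuCores < MIN_ITEMS_PER_THREAD) {
            return 1; // revert to single threaded mode
        }
        return cpuCores;
    }
}
```

For example, 300 items on a quad core gives only 75 items per thread, so training stays single threaded, while 10,000 items would use all four cores.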

Encog then starts each thread processing its part of the training data. Once the threads have finished, they must wait for the main thread to aggregate all of their results together. Once the results have been aggregated together the network is ready to begin the next iteration. The bottleneck is this aggregation time. The amount of work to do during the aggregation is very small; however, it requires all of the threads to wait. If one thread finishes early, then it is going to sit and wait for the others to finish. For CPU only, this wait is not too much of a problem. The threads nearly always finish very close to the same time, though performance could be improved if the wait could be eliminated. However, multicore performance on Encog is generally very good. This is especially true for larger training sets. If there is a large amount of data to train, then the training iterations will be longer and the CPUs will stay working longer before aggregations.
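The shape of one such iteration looks roughly like the sketch below. It is a stand-in, not Encog's training code: the per-thread work is reduced to summing squared errors over a slice of an array, and the pool is passed in so that it can be reused from one iteration to the next.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

final class ParallelIteration {
    // Stand-in for real per-item training work: sums squared errors over one slice.
    static double sumSquares(double[] errors, int from, int to) {
        double sum = 0.0;
        for (int i = from; i < to; i++) {
            sum += errors[i] * errors[i];
        }
        return sum;
    }

    // Runs one iteration on a reusable pool and aggregates the partial results.
    static double runIteration(ExecutorService pool, double[] errors, int workers) {
        int chunk = (errors.length + workers - 1) / workers;
        List<Callable<Double>> tasks = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            final int from = Math.min(errors.length, w * chunk);
            final int to = Math.min(errors.length, from + chunk);
            tasks.add(() -> sumSquares(errors, from, to));
        }
        try {
            double total = 0.0;
            // invokeAll returns only after every worker has finished,
            // so a fast worker idles here until the slowest one is done.
            for (Future<Double> partial : pool.invokeAll(tasks)) {
                total += partial.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```

The aggregation loop itself is trivial. The cost is in the waiting, which is why uneven workers hurt so much.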

To support the GPU I simply treated the GPU as though it were another CPU core. I gave it a portion of the training data and placed it in a pool with the other threads. For larger training data this seems to give a good result. However, balancing becomes a problem. You can't just give the GPU an equal portion of the training data. If there is any considerable gap between how long the GPU and CPU worked on a training set during an iteration, then training time is going to be terrible. The real challenge turned out to be getting the GPU and CPU to finish an iteration at the same time. The balancing code is manual at this point. You must enter a ratio that tells Encog in what proportion to assign work between the GPU and CPU. Additionally, if there is more than one GPU, there is no ratio to give different GPUs different amounts. Future versions of Encog may need more advanced balancing code.
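To make the balancing problem concrete, here is a small sketch. It assumes the ratio is expressed as the fraction of training items handed to the GPU, which may not match how Encog actually expresses it. The rebalance method is purely hypothetical: it shows one way an automatic balancer might pick a new ratio from the times measured in the previous iteration, and nothing like it exists in Encog 2.4.

```java
final class WorkSplit {
    // Number of training items for the GPU; the CPU threads share the rest.
    // gpuRatio is an assumed convention: 0.0 means all CPU, 1.0 means all GPU.
    static int gpuItems(int totalItems, double gpuRatio) {
        if (gpuRatio < 0.0 || gpuRatio > 1.0) {
            throw new IllegalArgumentException("gpuRatio must be in [0, 1]");
        }
        return (int) Math.round(totalItems * gpuRatio);
    }

    // Hypothetical adjustment: estimate how fast each side worked last time,
    // then choose the ratio at which both would finish together.
    static double rebalance(double gpuRatio, double gpuMillis, double cpuMillis) {
        double gpuRate = gpuRatio / gpuMillis;          // share of work per ms on the GPU
        double cpuRate = (1.0 - gpuRatio) / cpuMillis;  // share of work per ms on the CPU
        return gpuRate / (gpuRate + cpuRate);
    }
}
```

If an even split took the GPU 100 ms and the CPU 300 ms, this suggests moving to roughly a 75/25 split, at which point both sides would be expected to take about 150 ms.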

Another challenge is managing local memory. Local memory is a very high-speed area of memory that the GPU provides. I like to load the weight matrix of a neural network there. This greatly speeds up processing. The problem is that the local memory of a GPU is very small, usually just 16 KB or 32 KB. If someone creates a neural network with thousands of neurons, there is just not enough local memory. If you run out of local memory, the OpenCL kernel crashes. The kernel has to be written slightly differently to make use of local memory. Therefore, future versions of Encog may need separate kernel versions depending on whether you are going to use local memory or not. Further, the local memory is so much faster that I would not want to give up on it entirely just because some weight matrices do not fit.
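A host-side check along these lines could choose between two kernel variants before anything is sent to the GPU. This is only a sketch. It assumes the weights are stored as 4-byte floats, and the kernel names train_local and train_global are invented for the example. The local memory size itself can be queried from OpenCL through clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE, which your binding should expose.

```java
final class LocalMemoryCheck {
    static final int BYTES_PER_FLOAT = 4;

    // True when the whole weight array fits in the device's local memory.
    static boolean weightsFit(int weightCount, long localMemBytes) {
        return (long) weightCount * BYTES_PER_FLOAT <= localMemBytes;
    }

    // Picks a (hypothetical) kernel variant instead of letting the kernel crash.
    static String chooseKernel(int weightCount, long localMemBytes) {
        return weightsFit(weightCount, localMemBytes) ? "train_local" : "train_global";
    }
}
```

With 16 KB of local memory this leaves room for 4,096 weights at most, and less in practice if the kernel keeps anything else there.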

Yet another challenge is the fact that if-statements are prohibitively slow on a GPU. Neural networks can make use of several different types of activation functions. You almost need an if-statement to tell the code which one the user wants to use. I ended up using conditionally compiled code to determine which activation function to use.
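The general idea can be sketched as follows. The macro names and the kernel fragment are my own illustration rather than Encog's actual kernel. What is standard is that the OpenCL compiler accepts -D options in the build options string passed to clBuildProgram, so the choice is made once when the kernel is compiled instead of on every call.

```java
final class KernelOptions {
    enum Activation { TANH, SIGMOID, LINEAR }

    // Fragment of OpenCL C: the preprocessor selects one ACTIVATE macro,
    // so the compiled kernel contains no runtime branch for this choice.
    static final String ACTIVATION_MACROS =
          "#if defined(ACTIVATION_TANH)\n"
        + "#define ACTIVATE(x) tanh(x)\n"
        + "#elif defined(ACTIVATION_SIGMOID)\n"
        + "#define ACTIVATE(x) (1.0f / (1.0f + exp(-(x))))\n"
        + "#else\n"
        + "#define ACTIVATE(x) (x)\n"
        + "#endif\n";

    // Build options string to hand to the OpenCL compiler through your binding.
    static String buildOptions(Activation activation) {
        return "-D ACTIVATION_" + activation.name();
    }
}
```

The price is that the program must be rebuilt whenever the user picks a different activation function.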

Plans for Future Encog Versions

Improvements to GPU processing will be implemented over several releases of Encog. The next version, Encog 2.5, will focus more on using the GPU totally separately from the CPU. Balancing will not be a factor. This will be very useful when you must train several neural networks. The CPU can be training one neural network, and the GPU another. If there are multiple GPUs, then they can work on different neural networks. One application that I work on must train a neural network for each of the Fortune 500 companies. I will be able to create a queue of networks to train, and Encog will loop through each and execute them in parallel using the GPU and CPU. Since the GPU and CPU will be totally isolated, there will be no balancing needed. This process would greatly speed up training 500 neural networks.
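The queue idea might look something like the sketch below. It is a guess at the overall shape and not the planned Encog 2.5 design. Devices and networks are represented by plain strings, and the actual training call is replaced by a placeholder that records which device took which job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

final class TrainingQueue {
    // One worker thread per device; each pulls the next network until none are left.
    // Returns a map from network name to the device that handled it.
    static Map<String, String> drain(List<String> networks, List<String> devices) {
        Queue<String> queue = new ConcurrentLinkedQueue<>(networks);
        Map<String, String> trainedOn = new ConcurrentHashMap<>();
        List<Thread> workers = new ArrayList<>();
        for (String device : devices) {
            Thread worker = new Thread(() -> {
                String network;
                while ((network = queue.poll()) != null) {
                    // Placeholder for training this network on this device.
                    trainedOn.put(network, device);
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return trainedOn;
    }
}
```

Because each device simply takes the next job when it is free, a faster device ends up training more networks without any ratio having to be configured.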

In future versions of Encog, beyond 2.5, we will undertake using the GPU and CPU on the same neural network. It will either involve real-time automatic load balancing, or removal of the need for all threads to wait at the end of an iteration. I tend to like the latter better. However, I am not totally sure it is possible. It would remove the concept of an iteration from network training. The threads would simply keep training and somehow synchronize their weights in a lightweight manner that does not require thread blocking.

Synchronizing the weights will be the tricky part. It would not be a straight copy. The threads are training on different parts of the dataset. You do not want to lose the training effects of the other threads as the global weights are updated.
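To show what "not a straight copy" could mean, here is one speculative sketch. It is only an assumption about how such a merge might work, not a design Encog has committed to. Each worker takes a snapshot of the global weights, trains on its own data, and then adds only the change it made back into the global weights. The short synchronized sections still block briefly, so this does not fully achieve the non-blocking goal.

```java
final class SharedWeights {
    private final double[] global;

    SharedWeights(double[] initial) {
        this.global = initial.clone();
    }

    // Copy of the current global weights for a worker to start training from.
    synchronized double[] snapshot() {
        return global.clone();
    }

    // Adds what the worker learned since its snapshot instead of overwriting,
    // so updates merged by other workers in the meantime are kept.
    synchronized void merge(double[] snapshot, double[] trained) {
        for (int i = 0; i < global.length; i++) {
            global[i] += trained[i] - snapshot[i];
        }
    }
}
```

If two workers start from the same snapshot and each changes a different weight, both changes survive the merge, whereas a straight copy would keep only the last writer's result.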

Conclusions

GPUs can be extremely fast processors, given the right situation. This is very cutting edge technology, and quite far from mainstream use. The Encog project will continue to research the use of advanced GPU processing in future versions. I see it as a very important direction for future AI programming.

For the current version of Encog, GPU support seems to work best with a medium-sized neural network and a large training set. By medium-sized neural network I mean a neural network that has around 50-100 nodes. Such a network should fit into the local memory of most video cards. A large data set, with 10,000 items or more, gives each thread, and the GPU, enough to work with. Future versions of Encog should get good GPU performance in a wider array of situations.


Copyright 2005 - 2012 by Heaton Research, Inc. Heaton Research™ and Encog™ are trademarks of Heaton Research.