Chapter 3: Using Activation Functions

  • Activation Functions
  • Derivatives and Propagation Training
  • Choosing an Activation Function

Activation functions are used by many neural network architectures to scale the output from layers. Encog provides many different activation functions that can be used to construct neural networks. In this chapter you will be introduced to these activation functions.

The Role of Activation Functions

Activation functions are attached to layers. They are used to scale data output from a layer. Encog applies a layer's activation function to the data that the layer is about to output. If you do not specify an activation function for BasicLayer, the hyperbolic tangent activation function is used by default. The following code creates several BasicLayer objects with the default hyperbolic tangent activation function.

    BasicNetwork network = new BasicNetwork();
    network.AddLayer(new BasicLayer(2));
    network.AddLayer(new BasicLayer(3));
    network.AddLayer(new BasicLayer(1));
    network.Structure.FinalizeStructure();
    network.Reset();

If you would like to use an activation function other than the hyperbolic tangent function, use code similar to the following:

    ActivationSigmoid a = new ActivationSigmoid();
    BasicNetwork network = new BasicNetwork();
    network.AddLayer(new BasicLayer(a,true,2));
    network.AddLayer(new BasicLayer(a,true,3));
    network.AddLayer(new BasicLayer(a,true,1));
    network.Structure.FinalizeStructure();
    network.Reset();

The sigmoid activation function is assigned to the variable a and passed to each BasicLayer created in the AddLayer calls. The true value that was also introduced specifies that the BasicLayer should also have threshold values.

The IActivationFunction Interface

All classes that are to serve as activation functions must implement the IActivationFunction interface. This interface is shown in Listing 3.1.

Listing 3.1: The IActivationFunction Interface

    public interface IActivationFunction : IEncogPersistedObject
    {
        void ActivationFunction(double[] d);
        void DerivativeFunction(double[] d);
        bool HasDerivative
        {
            get;
        }
    }

The actual activation function is implemented inside of the ActivationFunction method. The ActivationSIN class is a very simple activation function that implements the sine wave. You can see the ActivationFunction implementation below.

    public override void ActivationFunction(double[] d)
    {
        for (int i = 0; i < d.Length; i++)
        {
            d[i] = BoundMath.Sin(d[i]);
        }
    }

As you can see, the activation simply applies the sine function to the array of provided values. This array represents the output neuron values that the activation function is to scale. It is important that the function be given the entire array at once. Some of the activation functions perform operations, such as averaging, that require seeing the entire output array.

You will also notice from the above code that a special class, named BoundMath, is used to calculate the sine. BoundMath removes “not a number” and “infinity” values. Sometimes, during training, unusually large or small numbers may be generated. The BoundMath class eliminates these values by bounding them to either a very large or a very small number. The sine function will not create an out-of-bounds number, so BoundMath is used here primarily for completeness.

However, we will soon see other functions that could produce out-of-bounds numbers. Exponent and radical functions can be particularly prone to this. Once a “not a number” (NaN) value is introduced into the neural network, the neural network will no longer produce useful results. As a result, bounds checking must be performed.
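The idea of bounding results can be sketched in a few lines. The class below is a hypothetical stand-in written only for illustration; it is not Encog's BoundMath, and the limit values and the choice to map NaN to zero are assumptions.

```csharp
using System;

// Hypothetical illustration of bounds checking; NOT Encog's BoundMath.
public static class BoundedMathSketch
{
    // Assumed limits, chosen only for this example.
    public const double TooBig = 1.0e20;
    public const double TooSmall = -1.0e20;

    // Replace NaN and out-of-range values with finite ones.
    public static double Bound(double d)
    {
        if (double.IsNaN(d)) return 0.0;      // assumption: NaN becomes zero
        if (d > TooBig) return TooBig;        // also catches +Infinity
        if (d < TooSmall) return TooSmall;    // also catches -Infinity
        return d;
    }

    // Exponent with the result bounded to the assumed limits.
    public static double Exp(double d)
    {
        return Bound(Math.Exp(d));
    }

    public static void Main()
    {
        Console.WriteLine(double.IsInfinity(Math.Exp(1000.0))); // True
        Console.WriteLine(Exp(1000.0) == TooBig);               // True
        Console.WriteLine(Exp(0.0));                            // 1
    }
}
```

Without the bound, the infinity produced by the first call would spread through every later calculation that uses it.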

Derivatives of Activation Functions

If you would like to use propagation training with your activation function, then the activation function must have a derivative. Propagation training will be covered in greater detail in Chapter 5, “Propagation Training”. The derivative is calculated by a function named DerivativeFunction.

    public override void DerivativeFunction(double[] d)
    {
        for (int i = 0; i < d.Length; i++)
        {
            d[i] = BoundMath.Cos(d[i]);
        }
    }

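The DerivativeFunction works much like the ActivationFunction: an array of values is passed in, and each value is replaced with the derivative calculated at that value. Here the cosine is used because it is the derivative of the sine.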
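When writing your own activation function, it can be useful to check the derivative numerically. The following is a minimal sketch, independent of Encog, that compares a central-difference estimate against the cosine. The class name, method name, and step size are assumptions made for this example.

```csharp
using System;

// Hypothetical helper for checking a derivative; not part of Encog.
public static class DerivativeCheckSketch
{
    // Central-difference estimate of f'(x); h is an assumed step size.
    public static double NumericDerivative(Func<double, double> f, double x, double h = 1e-6)
    {
        return (f(x + h) - f(x - h)) / (2.0 * h);
    }

    public static void Main()
    {
        double[] samples = { -2.0, -0.5, 0.0, 1.0, 3.0 };
        foreach (double x in samples)
        {
            double numeric = NumericDerivative(Math.Sin, x);
            double analytic = Math.Cos(x); // the derivative used by the sine activation
            Console.WriteLine("x={0}: numeric={1:F6}, analytic={2:F6}", x, numeric, analytic);
        }
    }
}
```

If the two columns disagree by more than a small rounding error, the derivative implementation is likely wrong.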

Encog Activation Functions

The next sections will explain each of the activation functions supported by Encog. There are several factors to consider when choosing an activation function. Firstly, the type of neural network you are using may dictate the activation function you must use. Secondly, you should consider if you would like to train the neural network using propagation. Propagation training requires an activation function that provides a derivative. You must also consider the range of numbers you will be dealing with. This is because some activation functions deal with only positive numbers or numbers in a particular range.

ActivationBiPolar

The ActivationBiPolar activation function is used with neural networks that require bipolar numbers. Bipolar numbers are either true or false. A true value is represented by a bipolar value of 1; a false value is represented by a bipolar value of -1. The bipolar activation function ensures that any numbers passed to it are either -1 or 1. The ActivationBiPolar function does this with the following code:

    if (d[i] > 0)
    {
        d[i] = 1;
    }
    else
    {
        d[i] = -1;
    }

As you can see, the output from this activation is limited to either -1 or 1. This sort of activation function is used with neural networks that require bipolar output from one layer to the next. There is no derivative function for bipolar, so this activation function cannot be used with propagation training.

ActivationCompetitive

The ActivationCompetitive function is used to force only a select group of neurons to win. The winner is the group of neurons that has the highest output. The outputs of each of these neurons are held in the array passed to this function. The size of the winning group of neurons is definable. The function will first determine the winners. All non-winning neurons will be set to zero. The winners then share the sum of the winning outputs: in the code shown in this section, each winner's output is divided by that sum, so the winning values add up to one.

This function begins by creating an array that will track whether each neuron has already been selected as one of the winners. We also keep a running sum of the winning outputs found so far.

bool[] winners = new bool[d.Length];

double sumWinners = 0;

First, we loop maxWinners times, finding one winner on each pass.

    for (int i = 0; i < this.maxWinners; i++)
    {
        double maxFound = Double.MinValue;
        int winner = -1;

Now, we must find one winner. We will loop over all of the neuron outputs and find the one with the highest output.

        for (int j = 0; j < d.Length; j++)
        {

If this neuron has not already won, and its output is higher than the maximum found so far, it becomes the current candidate. It remains the winner unless a later neuron has a higher activation.

            if (!winners[j] && d[j] > maxFound)
            {
                winner = j;
                maxFound = d[j];
            }
        }

Keep the sum of the winners that were found, and mark this neuron as a winner. Marking it a winner will prevent it from being chosen again. The sum of the winning outputs will ultimately be divided among the winners.

        sumWinners += maxFound;
        winners[winner] = true;
    }

Now that we have the correct number of winners, we must adjust the values for winners and non-winners. The non-winners will all be set to zero. The winners will share the sum of the values held by all winners.

    for (int i = 0; i < d.Length; i++)
    {
        if (winners[i])
        {
            d[i] = d[i] / sumWinners;
        }
        else
        {
            d[i] = 0.0;
        }
    }

This sort of activation function can be used with competitive learning neural networks, such as the Self-Organizing Map. This activation function has no derivative, so it cannot be used with propagation training.
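Because the competitive function was presented in fragments, it may help to see the pieces assembled. The following standalone sketch follows the fragments shown in this section, with two assumptions that are not taken from the original listing: maxWinners is passed as a parameter instead of being read from a field, and the winner count is capped at the array length as a guard.

```csharp
using System;

// Standalone sketch assembled from the fragments in this section.
// The class name, method name, and the Math.Min guard are assumptions.
public static class CompetitiveSketch
{
    public static void Apply(double[] d, int maxWinners)
    {
        bool[] winners = new bool[d.Length];
        double sumWinners = 0;

        // Added guard: never look for more winners than there are neurons.
        int count = Math.Min(maxWinners, d.Length);

        for (int i = 0; i < count; i++)
        {
            double maxFound = double.MinValue;
            int winner = -1;

            // Find the highest output that has not already won.
            for (int j = 0; j < d.Length; j++)
            {
                if (!winners[j] && d[j] > maxFound)
                {
                    winner = j;
                    maxFound = d[j];
                }
            }

            sumWinners += maxFound;
            winners[winner] = true;
        }

        // Winners are divided by the sum of the winning outputs; all others become zero.
        for (int i = 0; i < d.Length; i++)
        {
            d[i] = winners[i] ? d[i] / sumWinners : 0.0;
        }
    }

    public static void Main()
    {
        double[] output = { 1.0, 3.0, 2.0, 0.5 };
        Apply(output, 2);

        // The two largest values (3.0 and 2.0) win and are divided by their
        // sum, 5.0; the other two entries are set to zero.
        foreach (double v in output)
        {
            Console.WriteLine(v);
        }
    }
}
```

With two winners, the example leaves 0.6 and 0.4 in the positions that held 3.0 and 2.0, so the winning values sum to one.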

ActivationGaussian

The ActivationGaussian function is based on the Gaussian function. The Gaussian function produces the familiar bell-shaped curve. The equation for the Gaussian function is shown in Equation 3.1.

Equation 3.1: The Gaussian Function

\[ f(x) = a e^{-\frac{(x - b)^2}{2c^2}} \]

There are three different constants that are fed into the Gaussian function. The constant a represents the curve’s peak. The constant b represents the position of the curve. The constant c represents the width of the curve.

Figure 3.1: The Graph of the Gaussian Function

The Gaussian function is implemented in C# as follows.

    return this.peak
        * BoundMath.Exp(-Math.Pow(x - this.center, 2)
            / (2.0 * this.width * this.width));

The Gaussian activation function is not a commonly used activation function. However, it can be used when finer control is needed over the activation range. The curve can be aligned to somewhat approximate certain functions.

The radial basis function layer provides an even finer degree of control, as it can be used with multiple Gaussian functions. There is a valid derivative of the Gaussian function; therefore, the Gaussian function can be used with propagation training. The radial basis function layer is covered in Chapter 14, “Common Neural Network Patterns”.
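To get a feel for the three constants, the Gaussian can be evaluated on its own. This is a minimal sketch that uses Math.Exp in place of BoundMath so that it runs without Encog. The class and method names and the example constants are assumptions.

```csharp
using System;

// Standalone Gaussian sketch; names and constants are illustrative assumptions.
public static class GaussianSketch
{
    // peak, center and width correspond to the constants a, b and c.
    public static double Gaussian(double x, double peak, double center, double width)
    {
        return peak * Math.Exp(-Math.Pow(x - center, 2) / (2.0 * width * width));
    }

    public static void Main()
    {
        double peak = 1.0, center = 0.0, width = 1.0; // assumed example values

        // At the center the function returns the peak value.
        Console.WriteLine(Gaussian(center, peak, center, width));

        // One width away from the center it falls to peak * exp(-0.5), about 0.61.
        Console.WriteLine(Gaussian(center + width, peak, center, width));

        // Far from the center the output approaches zero.
        Console.WriteLine(Gaussian(center + 5.0 * width, peak, center, width));
    }
}
```

Changing width stretches or narrows the bell, while center slides it along the x-axis.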

ActivationLinear

The ActivationLinear function is really no activation function at all. It simply implements the linear function. The linear function can be seen in Equation 3.2.

Equation 3.2: The Linear Activation Function

\[ f(x) = x \]

The graph of the linear function is a simple line, as seen in Figure 3.2.

Figure 3.2: Graph of the Linear Activation Function

The C# implementation for the linear activation function is very simple. It does nothing. The input is returned as it was passed.

    public void ActivationFunction(double[] d)
    {
        // Intentionally empty: the values in d are left unchanged.
    }

The linear function is used primarily for specific types of neural networks that have no activation function, such as the self-organizing map. The linear activation function has a constant derivative of one, so it can be used with propagation training. The output layer of a feedforward neural network trained with propagation sometimes uses linear layers.

ActivationLOG

The ActivationLOG activation function uses an algorithm based on the log function. The following C# code shows how this is calculated.

    if (d[i] >= 0)
    {
        d[i] = BoundMath.Log(1 + d[i]);
    }
    else
    {
        d[i] = -BoundMath.Log(1 - d[i]);
    }

This produces a curve similar to the hyperbolic tangent activation function, which will be discussed later in this chapter. You can see the graph for the logarithmic activation function in Figure 3.3.

Figure 3.3: Graph of the Logarithmic Activation Function

The logarithmic activation function can be useful to prevent saturation. A hidden node of a neural network is considered saturated when, on a given set of inputs, the output is approximately 1 or -1 in most cases. This can slow training significantly. This makes the logarithmic activation function a possible choice when training is not successful using the hyperbolic tangent activation function.

As illustrated in Figure 3.3, the logarithmic activation function spans both positive and negative numbers. This means it can be used with neural networks where negative number output is desired. Some activation functions, such as the sigmoid activation function, will only produce positive output. The logarithmic activation function does have a derivative, so it can be used with propagation training.
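The difference in saturation is easiest to see side by side. The sketch below is a standalone illustration that uses Math.Log rather than BoundMath; the class and method names are assumptions. The derivative shown, 1 / (1 + |x|), follows from differentiating the two branches of the function and is not copied from Encog's source.

```csharp
using System;

// Standalone sketch of the logarithmic activation; names are assumptions.
public static class LogActivationSketch
{
    public static double Activate(double x)
    {
        return x >= 0 ? Math.Log(1 + x) : -Math.Log(1 - x);
    }

    // Derivative of both branches, written as a single expression.
    public static double Derivative(double x)
    {
        return 1.0 / (1.0 + Math.Abs(x));
    }

    public static void Main()
    {
        double[] inputs = { -50.0, -5.0, -1.0, 0.0, 1.0, 5.0, 50.0 };
        foreach (double x in inputs)
        {
            // tanh flattens out near -1 and 1; the log curve keeps growing slowly.
            Console.WriteLine("x={0,6}: log={1,8:F4}  tanh={2,8:F4}",
                x, Activate(x), Math.Tanh(x));
        }
    }
}
```

For large inputs the hyperbolic tangent column settles at 1 or -1, while the logarithmic column continues to change, which is why it is less likely to saturate.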

ActivationSigmoid

The ActivationSigmoid activation function should only be used when positive number output is expected, because the ActivationSigmoid function will only produce positive output. The equation for the ActivationSigmoid function can be seen in Equation 3.3.

Equation 3.3: The ActivationSigmoid Function

\[ f(x) = \frac{1}{1 + e^{-x}} \]

The ActivationSigmoid function will move negative numbers into the positive range. This can be seen in Figure 3.4, which shows the graph of the sigmoid function.

Figure 3.4: Graph of the ActivationSigmoid Function

The ActivationSigmoid function is a very common choice for feedforward and simple recurrent neural networks. However, you must be sure that the training data does not expect negative output numbers. If negative numbers are required, consider using the hyperbolic tangent activation function.
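The positive-only range can be confirmed with a small standalone sketch. It assumes the standard logistic form of the sigmoid, 1 / (1 + e^-x), and the standard identity for its derivative, s(1 - s). The class and method names are assumptions, and the code is not taken from Encog.

```csharp
using System;

// Standalone sigmoid sketch; assumes the standard logistic form.
public static class SigmoidSketch
{
    public static double Sigmoid(double x)
    {
        return 1.0 / (1.0 + Math.Exp(-x));
    }

    // Derivative expressed through the sigmoid's own output: s * (1 - s).
    public static double SigmoidDerivative(double x)
    {
        double s = Sigmoid(x);
        return s * (1.0 - s);
    }

    public static void Main()
    {
        double[] inputs = { -5.0, -1.0, 0.0, 1.0, 5.0 };
        foreach (double x in inputs)
        {
            // Every output lies between 0 and 1, even for negative inputs.
            Console.WriteLine("x={0,4}: sigmoid={1:F4}  derivative={2:F4}",
                x, Sigmoid(x), SigmoidDerivative(x));
        }
    }
}
```

An input of zero maps to 0.5, the midpoint of the output range, which is also where the derivative is largest.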

ActivationSIN

The ActivationSIN activation function is based on the sine function. It is not a commonly used activation function. However, it is sometimes useful for certain data that periodically changes over time. The graph for the ActivationSIN function is shown in Figure 3.5.

Figure 3.5: Graph of the SIN Activation Function

The ActivationSIN function works with both negative and positive values. Additionally, the ActivationSIN function has a derivative and can be used with propagation training.

ActivationSoftMax

The ActivationSoftMax activation function is an activation that will scale all of the input values so that their sum will equal one. The ActivationSoftMax activation function is sometimes used as a hidden layer activation function.

The activation function begins by summing the natural exponent of all of the neuron outputs.

    double sum = 0;
    for (int i = 0; i < d.Length; i++)
    {
        d[i] = BoundMath.Exp(d[i]);
        sum += d[i];
    }

The output from each of the neurons is then scaled according to this sum. This produces outputs that will sum to 1.

    for (int i = 0; i < d.Length; i++)
    {
        d[i] = d[i] / sum;
    }

The ActivationSoftMax is generally used in the hidden layer of a neural network or a classification neural network.
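The two loops above can be combined into one standalone method and tried on sample values. This sketch uses Math.Exp instead of BoundMath so that it runs without Encog; the class name, method name, and input values are assumptions.

```csharp
using System;

// Standalone softmax sketch following the two loops shown in this section.
public static class SoftMaxSketch
{
    public static void Apply(double[] d)
    {
        double sum = 0;

        // First pass: replace each value with its natural exponent and total them.
        for (int i = 0; i < d.Length; i++)
        {
            d[i] = Math.Exp(d[i]);
            sum += d[i];
        }

        // Second pass: scale by the total so the outputs sum to one.
        for (int i = 0; i < d.Length; i++)
        {
            d[i] = d[i] / sum;
        }
    }

    public static void Main()
    {
        double[] output = { 1.0, 2.0, 3.0 }; // assumed example values
        Apply(output);

        double total = 0;
        foreach (double v in output)
        {
            Console.WriteLine("{0:F4}", v);
            total += v;
        }
        Console.WriteLine("total={0:F4}", total);
    }
}
```

The ordering of the inputs is preserved: the largest input receives the largest share of the total.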

ActivationTANH

The ActivationTANH activation function is an activation function that uses the hyperbolic tangent function. The hyperbolic tangent activation function is probably the most commonly used activation function, as it works with both negative and positive numbers. The hyperbolic tangent function is the default activation function for Encog. The equation for the hyperbolic tangent activation function can be seen in Equation 3.4.

Equation 3.4: The Hyperbolic Tangent Activation Function

\[ f(x) = \frac{e^{2x} - 1}{e^{2x} + 1} \]

The fact that the hyperbolic tangent activation function accepts both positive and negative numbers can be seen in Figure 3.6, which shows the graph of the hyperbolic tangent function.

Figure 3.6: Graph of the Hyperbolic Tangent Activation Function

The hyperbolic tangent, written as (e^2x - 1) / (e^2x + 1), calls the natural exponent function twice. This is an expensive function call, and there is no need to make it twice. The expression can be rearranged into the algebraically equivalent form -1 + 2 / (1 + e^-2x), which needs only one call. The following code calculates the hyperbolic tangent this way.

    private double ActivationFunction(double d)
    {
        return -1 + (2 / (1 + BoundMath.Exp(-2 * d)));
    }

The hyperbolic tangent function is a very common choice for feedforward and simple recurrent neural networks. The hyperbolic tangent function has a derivative, so it can be used with propagation training.
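The single-exponent form used in the code above can be compared against the hyperbolic tangent built into .NET to see how closely the two agree. This is a standalone sketch that uses Math.Exp in place of BoundMath; the class and method names are assumptions.

```csharp
using System;

// Compares the single-exponent form with .NET's Math.Tanh.
public static class TanhSketch
{
    // Same expression as the listing above, with Math.Exp in place of BoundMath.Exp.
    public static double TanhOneExp(double x)
    {
        return -1 + (2 / (1 + Math.Exp(-2 * x)));
    }

    // Largest absolute difference from Math.Tanh over the given samples.
    public static double MaxDifference(double[] samples)
    {
        double worst = 0;
        foreach (double x in samples)
        {
            worst = Math.Max(worst, Math.Abs(TanhOneExp(x) - Math.Tanh(x)));
        }
        return worst;
    }

    public static void Main()
    {
        double[] samples = { -10.0, -2.0, -0.5, 0.0, 0.5, 2.0, 10.0 };
        foreach (double x in samples)
        {
            Console.WriteLine("x={0,5}: oneExp={1:F6}  tanh={2:F6}",
                x, TanhOneExp(x), Math.Tanh(x));
        }
        Console.WriteLine("max difference = {0:E2}", MaxDifference(samples));
    }
}
```

Any difference that remains comes from floating-point rounding, since the two expressions are mathematically the same function.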

Summary

Encog uses activation functions to scale the output from neural network layers. By default, Encog will use a hyperbolic tangent function, which is a good general-purpose activation function. Any class that acts as an activation function must implement the IActivationFunction interface. This interface requires the implementation of several methods. First, an ActivationFunction method must be created to actually perform the activation function. Second, a DerivativeFunction method should be implemented to return the derivative of the activation function. If there is no way to take a derivative of the activation function, then an error should be thrown. Only activation functions that have a derivative can be used with propagation training.

The ActivationBiPolar activation function class is used when your network only accepts bipolar numbers. The ActivationCompetitive activation function class is used for competitive neural networks, such as the Self-Organizing Map. The ActivationGaussian activation function class is used when you want a Gaussian curve to represent the activation function. The ActivationLinear activation function class is used when you want to have no activation function at all. The ActivationLOG activation function class works similarly to the ActivationTANH activation function class, except that it is less prone to saturation when used in a hidden layer. The ActivationSigmoid activation function class is similar to the ActivationTANH activation function class, except only positive numbers are returned. The ActivationSIN activation function class can be used for periodic data. The ActivationSoftMax activation function class scales the output so that the sum is one.

Up to this point we have covered all of the major components of neural networks. Layers contain the neurons and threshold values. Synapses connect the layers together. Activation functions sit inside the layers and scale the output. Tags allow special layers to be identified. Properties allow configuration values to be associated with the neural network. The next chapter will introduce the Encog Workbench. The Encog Workbench is a GUI application that lets you build neural networks composed of all of these elements.

Questions for Review

1. When might you choose a sigmoid layer over the hyperbolic tangent layer?

2. What are the ramifications of choosing an activation function that does not have a way to calculate a derivative?

3. Which activation function should be used if you want no activation function at all for your layer?

4. Which activation function produces output that sums to one?

5. When might a logarithmic activation function be chosen over a hyperbolic tangent activation function?

Terms

BiPolar Activation Function

Competitive Activation Function

Derivative

Gaussian Activation Function

Linear Activation Function

LOG Activation Function

Sigmoid Activation Function

SIN Activation Function

SoftMax Activation Function

TANH Activation Function





Copyright 2005 - 2012 by Heaton Research, Inc. Heaton Research™ and Encog™ are trademarks of Heaton Research.