Chapter 6: Obtaining Data for Encog

  • Finding Data for Neural Networks
  • Why Normalize?
  • Specifying Normalization Sources
  • Specifying Normalization Targets
  • Managing Long Training Times

Neural networks can provide profound insights into the data supplied to them. However, you can’t just feed any sort of data directly into a neural network. This “raw” data must usually be normalized into a form that the neural network can process. This chapter will show how to normalize “raw” data for use by Encog.

Before we can normalize data, we must first have data. Once you decide what you would like your neural network to do, you must find data so that you can teach the neural network how to perform a task. Fortunately, the Internet provides a wealth of information that can be used with neural networks.

Where to Get Data for Neural Networks

The Internet is a rich source of data for neural networks, and the data found there comes in many different formats. One of the most convenient is the comma-separated value (CSV) format. When the data is not available as a download, it may be necessary to write a spider or bot to collect it.

One very useful source for neural network data is called Data.gov. This is a site maintained by the United States Government. This site acts as a repository for a great deal of statistical data. It can be accessed from the following URL.

http://www.data.gov/

Another useful site is the Knowledge Discovery site, which is run by the University of California at Irvine.

http://kdd.ics.uci.edu/

The Knowledge Discovery site is a repository of various datasets that have been donated to the University of California. One of these datasets will be used for this chapter’s example.

What is Normalization?

Data obtained from sites, such as those listed above, often cannot be directly fed into neural networks. Neural networks can be very “intelligent”, but you cannot simply feed any sort of data into a neural network and expect a meaningful result. Often the data must first be normalized. We will begin by looking at what normalization is.

Neural networks are designed to accept floating-point numbers as their input. Usually these input numbers should be in either the range of -1 to +1 or 0 to +1 for maximum efficiency. Your choice of which range is often dictated by your choice of activation function, as certain activation functions have a positive range and others have both a negative and positive range. The sigmoid activation function, for example, has a range of only positive numbers, whereas the hyperbolic tangent activation function has a range of positive and negative numbers.

Normalizing Numeric Values

Numeric data is very commonly used as both input and output data. By numeric data I mean integer or floating-point numbers whose values have meaning as numbers. For example, it is significant that input “a” is larger than input “b”. US zip codes are an example of seemingly numeric values that do not have this kind of meaning. The fact that zip code 63123 is larger than 63121 is meaningless. Zip codes are not numeric values; they are nominal values. Nominal values are normalized differently than numeric values. The process for normalizing nominal values is covered in the next section.

In this chapter we will see how to normalize real-world data for Encog. We will examine data collected by the United States Forest Service. This data provides statistical information for a large number of small areas of forest. We will attempt to create a neural network that analyzes the statistics about an area of forest and predicts the type of tree cover that area has.

There are several numeric values provided for each area of the forest that was sampled. One of these numeric values is elevation. Elevation is definitely a numeric value. Consider a case where “point a” is at 1,000 meters and “point b” is at 2,000 meters. The fact that “point b” is higher than “point a” is quite significant. The difference between these two values is also quite significant. Elevation is therefore a good example of a numeric value.

Encog normalizes numeric values by either encoding or mapping the input values. The simplest form of numeric normalization used by Encog is encoding. Encoding allows you to specify numeric ranges that should be mapped to a specific value. For example, you could specify that every number between 1 and 1,000 should be mapped to 0.1. Additionally, every number between 1,001 and 2,000 should be mapped to 0.2. You can provide as many of these mappings as needed. The OutputFieldEncode class handles this sort of normalization.

Mapping is a slightly more complex way of normalizing numeric values. Mapping allows you to map one numeric range to another. Equation 6.1 shows this.

Equation 6.1: Normalizing Numeric Values
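Assuming the standard linear (min-max) rescaling, which is what the variables listed below describe, the mapping can be written as:

$$ f(x) = \frac{(x - \mathit{min})\,(\mathit{high} - \mathit{low})}{\mathit{max} - \mathit{min}} + \mathit{low} $$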

Where:

  • x = The value to normalize
  • min = The minimum value x will ever reach
  • max = The maximum value that x will ever reach
  • low = The low value of the range to normalize into (typically -1 or 0)
  • high = The high value of the range to normalize into (typically 1)

As you can see from the above variables we must know the minimum and maximum values that the data will reach. If mapped normalization is used, Encog must make two passes over the input data. The first pass will collect the maximum and minimum values. The second pass will actually normalize the input values.
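The two passes described above can be sketched in plain C#. This is a minimal illustration under stated assumptions, not Encog's implementation: the RangeMapping class and its Map and Normalize methods are hypothetical names, and the mapping is assumed to be the usual linear rescaling from [min, max] into [low, high].

```csharp
using System;

// Hypothetical helper (not part of Encog) that illustrates two-pass range mapping.
public static class RangeMapping
{
    // Linearly rescales x from [min, max] into [low, high].
    public static double Map(double x, double min, double max, double low, double high)
    {
        if (max == min)
        {
            // Assumption: when every value is identical, return the middle of the target range.
            return (low + high) / 2.0;
        }
        return (x - min) / (max - min) * (high - low) + low;
    }

    // Normalizes a whole column of values in two passes.
    public static double[] Normalize(double[] values, double low, double high)
    {
        if (values.Length == 0)
        {
            return new double[0];
        }

        // Pass 1: collect the minimum and maximum values.
        double min = values[0];
        double max = values[0];
        foreach (double v in values)
        {
            min = Math.Min(min, v);
            max = Math.Max(max, v);
        }

        // Pass 2: map every value into the target range.
        double[] result = new double[values.Length];
        for (int i = 0; i < values.Length; i++)
        {
            result[i] = Map(values[i], min, max, low, high);
        }
        return result;
    }
}
```

For example, RangeMapping.Normalize(new double[] { 1.0, 2.0, 3.0, 4.0, 5.0 }, 0.1, 0.9) maps 1.0 to 0.1, 3.0 to 0.5, and 5.0 to 0.9 (up to floating-point rounding).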

Normalizing Nominal Values

Nominal values are used to name things. One very common example of a simple nominal value is gender. Another is the answer to any sort of boolean question, where the value is either “yes/true” or “no/false”. However, not all nominal values have only two possible values.

Nominal values can also be used to describe an attribute of something, such as color. Neural networks deal best with nominal values where the set is fixed. One nominal variable that will be used later in this chapter is “soil type”; we will have 40 different soil types that the neural network will use to determine the type of tree that would likely grow there. The “tree type” is also a nominal variable. In both cases the size of the set is fixed: we only deal with 40 different soil types and seven different tree types.

Nominal values are used both for neural network input and output. When used with neural network input, the nominal value describes an attribute of whatever you are trying to recognize. An example of this is the soil type. When used with neural network output, nominal values allow the neural network to communicate what something is. An example of this is the type of tree that would grow on the soil type specified by the neural network input.

Encog supports two different ways to encode nominal values. The simplest means of representing nominal values is called “one-of-n” encoding. One-of-n encoding can often be hard to train, especially if there are more than a few nominal types that you are trying to encode. Equilateral encoding is usually a better choice than the simpler “one-of-n” encoding. Both encoding types will be explored in the next two sections.


Understanding one-of-n Normalization

One-of-n is a very simple form of normalization. For an example, consider the forest cover example that we will examine in this chapter. The input to the neural network is statistics about a sample of forest region. The output signifies which of seven different tree types may be covering this land. The seven tree types are listed as follows:

  • Spruce/Fir
  • Lodgepole Pine
  • Ponderosa Pine
  • Cottonwood/Willow
  • Aspen
  • Douglas-fir
  • Krummholz

If we were using one-of-n normalization, the neural network would have seven output neurons. Each of these seven neurons would represent one tree type. The tree type predicted by the neural network would correspond to the output neuron with the highest activation.

Generating training data for one-of-n is relatively easy. Simply assign a +1 to the neuron that corresponds to the tree that should have been chosen, and a -1 to the remaining neurons. For example, the Spruce/Fir tree type “ideal output” would be encoded as follows.

1,-1,-1,-1,-1,-1,-1

Likewise, the Lodgepole Pine, the second tree type in the list, would be encoded as follows.

-1,1,-1,-1,-1,-1,-1

The OutputOneOf class performs this sort of normalization. The one-of-n encoding is usually a good choice for input neurons. The example shown later in this chapter uses one-of-n to normalize the soil types used to predict the tree cover.
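The one-of-n scheme described above is small enough to sketch without Encog. The OneOfN class below is a hypothetical helper written for illustration only; it is not the OutputOneOf class, and the parameter order is an assumption.

```csharp
using System;

// Hypothetical helper (not part of Encog) that illustrates one-of-n encoding.
public static class OneOfN
{
    // Returns n values: "high" at the chosen index and "low" everywhere else.
    public static double[] Encode(int index, int n, double high, double low)
    {
        if (index < 0 || index >= n)
        {
            throw new ArgumentOutOfRangeException(nameof(index));
        }
        double[] result = new double[n];
        for (int i = 0; i < n; i++)
        {
            result[i] = (i == index) ? high : low;
        }
        return result;
    }

    // Returns the index of the neuron with the highest activation.
    public static int Decode(double[] activations)
    {
        if (activations.Length == 0)
        {
            throw new ArgumentException("No activations supplied.", nameof(activations));
        }
        int best = 0;
        for (int i = 1; i < activations.Length; i++)
        {
            if (activations[i] > activations[best])
            {
                best = i;
            }
        }
        return best;
    }
}
```

OneOfN.Encode(0, 7, 1.0, -1.0) yields 1,-1,-1,-1,-1,-1,-1, the Spruce/Fir pattern shown above, and OneOfN.Decode recovers the tree index from a set of network outputs by picking the highest activation.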

Understanding Equilateral Normalization

The output neurons are constantly checked against the ideal output values provided in the training set. The error between the actual output and the ideal output is represented by a percent. This can cause a problem for the one-of-n normalization method. Consider a case where the neural network predicted a Lodgepole Pine when it should have predicted a Spruce/Fir. We would have output and ideal as follows:

Ideal Output: 1,-1,-1,-1,-1,-1,-1

Actual Output: -1,1,-1,-1,-1,-1,-1

The problem is that only two output neurons are incorrect. We would like to spread the “guilt” for this error over more of the neurons. To do this, we must come up with a unique set of values for each tree type. Each set of values should have an equal Euclidean distance from the others. The equal distance ensures that incorrectly choosing tree 3 for tree 4 has the same error weight as choosing tree 5 for tree 1.

The following code segment shows how to use the Equilateral class to generate these values.

Equilateral eq = new Equilateral(7,-1,1);
for(int i=0;i<7;i++)
{
    StringBuilder line = new StringBuilder();
    line.Append(i);
    line.Append(':');
    double[] d = eq.Encode(i);
    for(int j=0;j<d.Length;j++)
    {
        if( j>0 )
            line.Append(',');
        line.Append(d[j]);
    }
    Console.WriteLine(line.ToString());
}

This would produce the following output.

0:0.7637,0.4409,0.3118,0.2415,0.1972,0.1666
1:-0.7637,0.4409,0.3118,0.2415,0.1972,0.1666
2:0.0,-0.8819,0.3118,0.2415,0.1972,0.1666
3:0.0,0.0,-0.9354,0.2415,0.1972,0.1666
4:0.0,0.0,0.0,-0.9660,0.1972,0.1666
5:0.0,0.0,0.0,0.0,-0.9860,0.1666
6:0.0,0.0,0.0,0.0,0.0,-1.0

These are the values that would be used for tree types 0 through 6. As you can see, the difference between any two of these sets usually involves more than one neuron. This will spread the training more effectively. Equilateral normalization requires at least three sets. If there are only two sets, simply use one-of-n encoding; with only two output neurons, the error is necessarily spread equally over both.

The equilateral normalization technique produces one fewer output neuron than one-of-n. Notice that each of the above sets contains only six numbers. This is a side effect of finding values that are equal in distance.

What is meant by each of the sets being equal in distance from each other? It means that their Euclidean distance is equal. The Euclidean distance can be calculated using Equation 6.2.

Equation 6.2: Euclidean Distance
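Assuming the squared differences are averaged over the n values before the square root is taken, which is the form consistent with the 0.623 figure quoted below, the distance can be written as:

$$ d = \sqrt{\frac{1}{n}\sum_{k=1}^{n} (i_k - a_k)^2} $$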

In the above equation, the variable “i” represents the ideal output value and the variable “a” represents the actual output value. There are “n” pairs of ideal and actual values. Any two sets of values in the above listing are separated by a Euclidean distance of approximately 0.623. Equilateral normalization is implemented using the Equilateral class in Encog.
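The equal-distance property can be checked numerically. The sketch below assumes that the distance averages the squared differences over the n values before taking the square root; the EquilateralCheck class is a hypothetical name and is not part of Encog.

```csharp
using System;

// Hypothetical helper (not part of Encog) for checking the equal-distance property.
public static class EquilateralCheck
{
    // Square root of the mean squared difference between two equally sized sets.
    // Assumption: the sum of squares is divided by n before the square root is taken.
    public static double Distance(double[] ideal, double[] actual)
    {
        if (ideal.Length != actual.Length)
        {
            throw new ArgumentException("Both sets must have the same length.");
        }
        double sum = 0.0;
        for (int k = 0; k < ideal.Length; k++)
        {
            double diff = ideal[k] - actual[k];
            sum += diff * diff;
        }
        return Math.Sqrt(sum / ideal.Length);
    }
}
```

Applied to rows 0 and 1 of the listing above, Distance returns approximately 0.62. Rows 5 and 6, which differ in two positions rather than one, give the same result to within the rounding of the listed values.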

Using the DataNormalization Class

Encog supports normalization using the DataNormalization class. The normalization class works by accepting data through input fields and processing them into output fields. There are really two ways to use the DataNormalization class. They are summarized as follows:

  • Batch processing
  • Single record processing

In single record processing, you provide a set of numbers to be normalized. These numbers are normalized and returned to you. Batch processing accepts a data source and data target. All records are read from the data source, then normalized, and written to the data target. Often you will batch process data from one CSV file to another. Batch processing is particularly useful when training a neural network. Single record processing is very useful for when you are actually using the neural network.

Using Normalization in Batch Mode

The general format for using the DataNormalization class in batch mode is as follows.

DataNormalization norm = new DataNormalization();
norm.Report = this;
norm.Storage = ...data target... ;
norm.AddInputField( ... Input Field 1 ... );
norm.AddInputField( ... Input Field 2 ... );
norm.AddInputField( ... Input Field 3 ... );
norm.AddOutputField(... Output Field 1 ... );
norm.AddOutputField(... Output Field 2 ... );
norm.AddOutputField(... Output Field 3 ... );
norm.AddSegregator(... Segregator 1...);
norm.Process();

First, a new DataNormalization object is created. Then a reporting object is specified. The reporting object will receive status updates as normalization progresses. A normalization job can take some time to process when dealing with large data sets.

Next, the input and output fields are added. The input fields specify the data sources to use. The data may come from one single source, or it may come from several to be aggregated together. The output fields specify how the data should be normalized. Segregators can also be added to trim some of the data. Segregators can be very useful for separating the data into groups. One group might be used to train the network, a second might be used to evaluate the network after it has been trained. The forest cover example, shown later in this chapter, will expand upon batch normalization processing.

Using Normalization in Single Record Mode

Using a DataNormalization object in single record mode is a simplified form of batch mode. There is no need to specify a reporting object, segregators or data target. The results for a single record will be calculated very quickly and returned in an INeuralData object.

DataNormalization norm = new DataNormalization();
norm.AddInputField( ... Input Field 1 ... );
norm.AddInputField( ... Input Field 2 ... );
norm.AddInputField( ... Input Field 3 ... );
norm.AddOutputField(... Output Field 1 ... );
norm.AddOutputField(... Output Field 2 ... );
norm.AddOutputField(... Output Field 3 ... );
INeuralData input = norm.BuildForNetworkInput(data);
INeuralData output = this.network.Compute(input);

As you can see, the input and output fields are created, and added, just as before. However, rather than calling Process, we call BuildForNetworkInput. This method will create input suitable for the Compute method of the BasicNetwork class.

Specifying the Input Fields

Input fields map to the individual elements from the raw data. Output fields specify the normalized fields that are produced by the normalization class. However, these “output fields” will then become the input to the neural network. There is not necessarily a one-to-one correspondence between the normalization input fields and output fields. Some of the raw data may be ignored, or used to filter the rest of the data.

There are several different types of input fields that the Encog normalization class can work with. All input fields must implement the interface IInputField. Input fields simply specify where to get a value from; they do not specify how to normalize it. The output fields will specify how fields are to be normalized. Input can be taken from a number of different sources. The different input field types are covered in the next sections.

Using BasicInputField

The BasicInputField class is the simplest of the input fields. It is the base class for most of the other input field types. It simply passes its current value to the output fields. It is not capable of reading from an input source. The BasicInputField is useful in its own right. It is often used when a normalization object is to be constructed for single record mode. BasicInputField objects can be used to provide a place to store the prenormalized data. The example program in Chapter 8, “Other Supervised Training Methods” demonstrates this concept.

The following code shows a BasicInputField being set up for single record mode.

IInputField fuelIN;
norm.AddInputField(fuelIN = new BasicInputField());
fuelIN.Max = 200;
fuelIN.Min = 0;

This input field will hold the amount of fuel remaining in a spacecraft. These lines are from the example you will see in Chapter 8. Notice that the minimum and maximum values are set explicitly. In batch mode, these values will be calculated as the data is processed. However, in single-record mode, the min and max values must be supplied to Encog.

Using InputFieldArray1D

Data to be normalized can be read from a one-dimensional array. Each element in the array maps to one record that will be fed into the neural network. Because the array is one-dimensional, only a single field per record is allowed. Of course, you can aggregate multiple one-dimensional arrays by using multiple InputFieldArray1D objects. The following code shows how to use an InputFieldArray1D object.

double[] ARRAY_1D = { 1.0,2.0,3.0,4.0,5.0 };
IInputField a;
double[] arrayOutput = new double[5];
NormalizationStorageArray1D target =
    new NormalizationStorageArray1D(arrayOutput);
DataNormalization norm = new DataNormalization();
norm.Report = new NullStatusReportable();
norm.Storage = target;
norm.AddInputField(a = new InputFieldArray1D(false,ARRAY_1D));
norm.AddOutputField(new OutputFieldRangeMapped(a,0.1,0.9));
norm.Process();

The above code normalizes the values contained in ARRAY_1D and stores them in arrayOutput. You will notice that there are two parameters passed to the InputFieldArray1D object, as seen here.

new InputFieldArray1D(false,ARRAY_1D)

The first parameter specifies whether this field is actually used by the neural network. If the value were false, the field would likely only be used for comparison purposes. The second parameter specifies the actual array.

It is somewhat limiting that a one-dimensional array only allows a single field per array. Encog also allows you to use a two-dimensional array, which is more flexible.

Using InputFieldArray2D

Encog can normalize data using a two-dimensional array. Each of the array’s rows becomes one record. Each column can be used as a single input field. It is not necessary to make use of every column in the array. The following code segment shows how to use InputFieldArray2D.

double[][] ARRAY_2D = {
    new double[5] {1.0,2.0,3.0,4.0,5.0},
    new double[5] {6.0,7.0,8.0,9.0, 10.0} };
IInputField a,b;
double[][] arrayOutput = new double[2][];
arrayOutput[0] = new double[2];
arrayOutput[1] = new double[2];
NormalizationStorageArray2D target =
    new NormalizationStorageArray2D(arrayOutput);
DataNormalization norm = new DataNormalization();
norm.Report = new NullStatusReportable();
norm.Storage = target;
norm.AddInputField(a = new InputFieldArray2D(true,ARRAY_2D,0));
norm.AddInputField(b = new InputFieldArray2D(true,ARRAY_2D,1));
norm.AddOutputField(new OutputFieldRangeMapped(a,0.1,0.9));
norm.AddOutputField(new OutputFieldRangeMapped(b,0.1,0.9));
norm.Process();

You will notice that there are three parameters passed to the InputFieldArray2D object, as seen here.

InputFieldArray2D(true,ARRAY_2D,0);

The first parameter specifies whether this field is actually used by the neural network. If the value were false, the field would likely only be used for comparison purposes. The second parameter specifies the actual array. The third parameter specifies the column that this field should map to. The value zero specifies the first column.

Using InputFieldCSV

One of the most commonly used input field types is the InputFieldCSV class. This class allows fields to be read from a CSV file. Often the output fields will be written to a CSV file as well. The following code shows how to define three fields to be read in from a CSV file.

double[][] outputArray = new double[2][];
outputArray[0] = new double[3];
outputArray[1] = new double[3];
IInputField a;
IInputField b;
IInputField c;
DataNormalization norm = new DataNormalization();
norm.Report = new NullStatusReportable();
norm.AddInputField(a = new InputFieldCSV(false,FILENAME,0));
norm.AddInputField(b = new InputFieldCSV(false,FILENAME,1));
norm.AddInputField(c = new InputFieldCSV(false,FILENAME,2));
norm.AddOutputField(new OutputFieldRangeMapped(a,0.1,0.9));
norm.AddOutputField(new OutputFieldRangeMapped(b,0.1,0.9));
norm.AddOutputField(new OutputFieldRangeMapped(c,0.1,0.9));
// Store the normalized values in the two-dimensional array.
norm.Storage = new NormalizationStorageArray2D(outputArray);
norm.Process();

You will notice that there are three parameters passed to the InputFieldCSV object, as seen here.

new InputFieldCSV(false,FILENAME,0)

The first parameter specifies whether this field is actually used by the neural network. If the value were false, the field would likely only be used for comparison purposes. The second parameter specifies the filename of the CSV file. The third parameter specifies the column that this field should map to. The value zero specifies the first column.

Using InputFieldNeuralDataSet

It is also possible to take input fields from an INeuralDataSet object. This is done using an InputFieldNeuralDataSet object. The INeuralDataSet objects will more often be the target of normalization, rather than the source. However, it may sometimes be useful to normalize from an INeuralDataSet. The following code reads from an INeuralDataSet and normalizes the results to a two-dimensional array.

IInputField a,b;
double[][] arrayOutput = new double[2][];
arrayOutput[0] = new double[2];
arrayOutput[1] = new double[2];
BasicNeuralDataSet dataset = new BasicNeuralDataSet(ARRAY_2D,null);
NormalizationStorageArray2D target = new NormalizationStorageArray2D(arrayOutput);
DataNormalization norm = new DataNormalization();
norm.Report = new NullStatusReportable();
norm.Storage = target;
norm.AddInputField(a = new InputFieldNeuralDataSet(false,dataset,0));
norm.AddInputField(b = new InputFieldNeuralDataSet(false,dataset,1));
norm.AddOutputField(new OutputFieldRangeMapped(a,0.1,0.9));
norm.AddOutputField(new OutputFieldRangeMapped(b,0.1,0.9));
norm.Process();

You will notice that there are three parameters passed to the InputFieldNeuralDataSet object, as seen here.

new InputFieldNeuralDataSet(false,dataset,0)

The first parameter specifies whether this field is actually used by the neural network. If the value were false, the field would likely only be used for comparison purposes. The second parameter specifies the dataset to use. The third parameter specifies the column that this field should map to. The value zero specifies the first column.

Specifying the Output Fields

Encog uses output fields to process the input fields. The output fields of normalization are what will be fed into the neural network, so they become the inputs to the network. If you specify eight output fields on your normalization object, you will need to have a neural network with eight input neurons to receive this data.

There are many different kinds of output field types. The type of output field specifies the type of normalization that you are using. It is very common to use different normalization types for different input neurons. Each output field will usually map to a single input field and will process that input field’s values. Sometimes an output field will not map to a specific input field. Such an output field is considered synthetic. Synthetic fields will be discussed later in this section.

Encog currently supports two types of output fields. Grouped fields are grouped together with other output fields. The values of the members of the group influence each other. Non-grouped fields act independently. In this section you will learn about both grouped and nongrouped fields. We will begin with field groups.

Understanding Field Groups

Several of the output fields supported by Encog can belong to field groups. A field group is a collection of objects associated with a group object. Any group object must be of a class that implements the IFieldGroup interface. Grouped fields do not act independently. The other grouped fields affect the value of a single field in a group.

Using OutputFieldMultiplicative

The OutputFieldMultiplicative object implements a grouped field that makes use of multiplicative normalization. Because multiplicative normalization is grouped, the values of the individual grouped fields will influence each other. The following code shows three different output fields being set up with multiplicative normalization.

MultiplicativeGroup group = new MultiplicativeGroup();
norm.AddOutputField(new OutputFieldMultiplicative(group,a));
norm.AddOutputField(new OutputFieldMultiplicative(group,b));
norm.AddOutputField(new OutputFieldMultiplicative(group,c));

The multiplicative normalization algorithm ensures that all grouped fields are between the range of -1 and +1. Further, it ensures that their vector length is defined as one. The vector length is the square root of a sum of squares, as shown in Equation 6.3.

Equation 6.3: Calculating Vector Length for Multiplicative Normalization
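Following the description in the text, with x_1 through x_n standing for the values of the n grouped fields (the symbol names are assumptions), the vector length l is:

$$ l = \sqrt{\sum_{i=1}^{n} x_i^2} $$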

The equation above essentially squares the value of every grouped field. The resulting length is the square root of the sum of these individual squares. We then divide each of the field values by this length, as shown in Equation 6.4.

Equation 6.4: Multiplicative Normalization of Each Value
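Writing x_i for a grouped field value and l for the vector length of the group (symbol names assumed), each normalized value is the field value divided by the length:

$$ x_i' = \frac{x_i}{l} $$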

The multiplicative normalization type can be very useful for vector quantization. One of the problems with multiplicative normalization is that the sign of the input fields is completely disregarded. This is because each of the inputs is squared. Because of this, Z-axis normalization is often used in place of multiplicative normalization.

Using OutputFieldZAxis

Z-axis normalization is often used with self-organizing maps. Encog implements Z-axis normalization using the OutputFieldZAxis class. Z-axis normalization accomplishes the same goal as multiplicative normalization, in that it causes the vector length of the grouped fields to be one. The following code shows how to set up three input fields to be normalized using Z-axis normalization.

ZAxisGroup group = new ZAxisGroup();
norm.AddOutputField(new OutputFieldZAxis(group,a));
norm.AddOutputField(new OutputFieldZAxis(group,b));
norm.AddOutputField(new OutputFieldZAxis(group,c));
norm.AddOutputField(new OutputFieldZAxisSynthetic(group));

One thing that you should notice from the above code is that the number of input and output fields is not the same. Z-axis normalization will always produce one more output field than the number of input fields provided to it. This additional output field, which uses the OutputFieldZAxisSynthetic class, is called a synthetic field.

To perform Z-axis normalization, a normalization factor is calculated. This factor is calculated using Equation 6.5.

Equation 6.5: Z-Axis Normalization Factor
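A common definition of the Z-axis normalization factor, assumed here, depends only on the number of grouped fields n (the symbol names are assumptions):

$$ f = \frac{1}{\sqrt{n}} $$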

The normalization factor is calculated independently of the actual data. This allows the sign of the data to be preserved. This factor is then applied to each of the grouped fields. This step is performed in Equation 6.6.

Equation 6.6: Normalizing with the Z-Axis Normalization Factor
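Writing f for the normalization factor and x_i for the value of the i-th grouped field (symbol names assumed), each field is simply scaled by the factor:

$$ x_i' = x_i \, f $$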

The synthetic field must now be calculated. This is where Z-axis normalization derives its name. The additional field is thought of as an additional axis, just as the z-axis is an imaginary axis used in computer graphics to give the appearance of three dimensions on a two-dimensional display. The synthetic field is calculated with Equation 6.7.

Equation 6.7: The Z-Axis Synthetic Field
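One form of the synthetic field that yields a unit vector length, assuming a normalization factor of f = 1/\sqrt{n} for n grouped fields and writing l for the vector length of the raw field values (all of these symbols are assumptions), is:

$$ s = f \sqrt{n - l^2} $$

With this choice, the squares of the n normalized fields sum to l^2/n and s^2 = (n - l^2)/n, so the total is one. The expression requires l^2 \le n, which holds when every raw field lies between -1 and +1.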

Either Z-axis or multiplicative normalization should be used when you need a consistent vector length. Z-axis normalization is usually a better choice than multiplicative. One of the few times that multiplicative normalization may perform better than Z-axis is when all of the input fields are near zero. In this case the synthetic field may dominate them.
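The following sketch shows one way Z-axis normalization might be computed by hand. It is not Encog's implementation: the ZAxisSketch class is a hypothetical name, the factor is assumed to be one over the square root of the field count, the synthetic field is assumed to be that factor times the square root of the field count minus the squared vector length, and every input is assumed to lie between -1 and +1.

```csharp
using System;

// Hypothetical helper (not part of Encog) that illustrates Z-axis normalization.
public static class ZAxisSketch
{
    // Returns n + 1 values: the n scaled fields followed by the synthetic field.
    // Assumption: every input lies in [-1, +1], so the squared length never exceeds n.
    public static double[] Normalize(double[] fields)
    {
        int n = fields.Length;
        if (n == 0)
        {
            throw new ArgumentException("At least one field is required.", nameof(fields));
        }

        // The factor depends only on the number of fields, not on their values.
        double factor = 1.0 / Math.Sqrt(n);

        // Squared vector length of the raw fields.
        double sumOfSquares = 0.0;
        foreach (double x in fields)
        {
            sumOfSquares += x * x;
        }

        double[] result = new double[n + 1];
        for (int i = 0; i < n; i++)
        {
            // Scaling by a positive factor preserves the sign of each field.
            result[i] = fields[i] * factor;
        }

        // Synthetic field; clamp to zero if rounding pushes the radicand negative.
        double radicand = n - sumOfSquares;
        result[n] = radicand > 0.0 ? factor * Math.Sqrt(radicand) : 0.0;
        return result;
    }
}
```

Under these assumptions the returned values always have a vector length of one. For inputs of 0.0, 0.0, and 0.0 the three scaled fields are zero and the synthetic field is 1 (up to rounding), which illustrates how the synthetic field can dominate when all inputs are near zero.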

Using OutputFieldDirect

The OutputFieldDirect is a very simple field class that simply passes the input field directly to the output. Normalization is not performed. The following code shows how to set up a direct output field.

norm.AddOutputField(new OutputFieldDirect(inputField));

Direct output fields can be very useful when an input field is already normalized or is already within an acceptable range.

Using OutputFieldRangeMapped

The OutputFieldRangeMapped field object allows an input field to be mapped into a specific range. The range usually chosen is either -1 to +1 or 0 to +1, or something close to one of these. The following line of code shows how to use OutputFieldRangeMapped.

norm.AddOutputField(
    new OutputFieldRangeMapped(inputField,0.0,1.0));

In the above code, the input field inputField is mapped to a range between 0.0 and 1.0. This is one of the most commonly used neural network normalization techniques.

Using OutputFieldEncode

The OutputFieldEncode field object allows different ranges of an input field to be mapped to output field values. The following code shows a typical set up for this field type.

OutputFieldEncode encode = new OutputFieldEncode(inputField);
encode.AddRange(0, 999, 0.1);
encode.AddRange(1000, 1999, 0.2);
encode.AddRange(2000, 2999, 0.3);
encode.CatchAll = 0.5;

Here you see an encode field that will encode three different ranges. This code also includes a “catch all”. If the input field is between 0 and 999, the output value will be 0.1. Likewise, if the input field falls within one of the other two ranges, the output will be either 0.2 or 0.3. If the input field does not match any of the ranges provided, the output will be the “catch all” value of 0.5. If a “catch all” value is not provided, it defaults to zero. The OutputFieldEncode field can provide more precise control than the OutputFieldRangeMapped field; however, the ranges need to be defined manually.
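The lookup that this kind of encoding performs can be sketched in a few lines of standalone C#. The RangeEncoder class below is a hypothetical stand-in written for illustration, not the OutputFieldEncode class, and it assumes that both ends of each range are inclusive.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Encog) that illustrates range encoding with a catch-all.
public sealed class RangeEncoder
{
    private readonly List<(double Low, double High, double Value)> ranges =
        new List<(double Low, double High, double Value)>();

    // Value returned when no range matches; defaults to zero.
    public double CatchAll { get; set; } = 0.0;

    // Registers a range; inputs from low to high (inclusive, by assumption) map to value.
    public void AddRange(double low, double high, double value)
    {
        ranges.Add((low, high, value));
    }

    // Returns the value of the first matching range, or the catch-all value.
    public double Encode(double x)
    {
        foreach (var range in ranges)
        {
            if (x >= range.Low && x <= range.High)
            {
                return range.Value;
            }
        }
        return CatchAll;
    }
}
```

With the three ranges and the catch-all value from the example above, Encode(500) returns 0.1 and Encode(5000) returns 0.5. Under the inclusive-bounds assumption, a fractional input such as 999.5 falls between the first two ranges and also receives the catch-all value, so ranges over continuous data should be chosen without gaps.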

Using OutputOneOf

The last two output field types that will be examined are used to encode nominal values. Nominal values indicate set membership. Later in this chapter we will look at an example program that attempts to predict the type of tree that will live on a sample area of land. This tree type is a nominal value, as there are seven distinct tree types that were sampled from the land surveyed.

As discussed previously, there are two different ways that this group can be represented in a neural network. These two approaches are called one-of-n and equilateral normalization. For more information on the differences between one-of-n and equilateral normalization, refer to the material earlier in this chapter.

To implement one-of-n encoding in Encog use the OutputOneOf class. The following lines of code demonstrate how to set up one-of-n encoding.

OutputOneOf outType = new OutputOneOf(1.0,0.0);
outType.AddItem(coverType, 1);
outType.AddItem(coverType, 2);
outType.AddItem(coverType, 3);
outType.AddItem(coverType, 4);
outType.AddItem(coverType, 5);
outType.AddItem(coverType, 6);
outType.AddItem(coverType, 7);
norm.AddOutputField(outType, true);

Not all output field objects create a single output field. The above code would actually create seven output fields. If the item were a member of one of the sets, the corresponding output field would have a value of 1.0, otherwise it would have a value of 0.0. These two values were specified by the constructor which created an OutputOneOf object called outType.

Using OutputEquilateral

To make use of equilateral normalization, you should use the OutputEquilateral class. The following lines of code show how to set up an OutputEquilateral set of output fields.

OutputEquilateral outType = new OutputEquilateral(1.0,0.0);
outType.AddItem(coverType, 1);
outType.AddItem(coverType, 2);
outType.AddItem(coverType, 3);
outType.AddItem(coverType, 4);
outType.AddItem(coverType, 5);
outType.AddItem(coverType, 6);
outType.AddItem(coverType, 7);
norm.AddOutputField(outType, true);

The OutputEquilateral object created above will actually create six output fields. As previously discussed in this chapter, equilateral normalization produces one fewer output than the number of items provided. As a result, you must have at least three item classes for equilateral normalization to be effective.
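The reason seven items need only six outputs is that n points can be placed equally far apart in n - 1 dimensions. The following plain C# sketch shows one commonly used way of building such a set of points. It is offered as an illustration under that assumption and should not be read as the code inside OutputEquilateral. The class and method names are hypothetical.

```csharp
using System;

// Hypothetical illustration of equilateral encoding; not part of Encog.
public static class EquilateralSketch
{
    // Builds n points in (n - 1) dimensions that are all the same distance
    // apart, then rescales every coordinate from [-1, 1] to [low, high].
    public static double[][] Build(int n, double high, double low)
    {
        if (n < 3) throw new ArgumentException("At least three classes are required.");

        double[][] m = new double[n][];
        for (int i = 0; i < n; i++) m[i] = new double[n - 1];

        // Start with two points on a line.
        m[0][0] = -1.0;
        m[1][0] = 1.0;

        // Add one point, and one dimension, at a time.
        for (int k = 2; k < n; k++)
        {
            double r = k;
            double f = Math.Sqrt(r * r - 1.0) / r;

            // Shrink the existing points to make room for the new one.
            for (int i = 0; i < k; i++)
                for (int j = 0; j < k - 1; j++)
                    m[i][j] *= f;

            // Give the existing points a coordinate in the new dimension.
            for (int i = 0; i < k; i++)
                m[i][k - 1] = -1.0 / r;

            // The new point sits on the axis of the new dimension.
            for (int j = 0; j < k - 1; j++)
                m[k][j] = 0.0;
            m[k][k - 1] = 1.0;
        }

        for (int i = 0; i < n; i++)
            for (int j = 0; j < n - 1; j++)
                m[i][j] = (m[i][j] + 1.0) / 2.0 * (high - low) + low;

        return m;
    }

    public static double Distance(double[] a, double[] b)
    {
        double sum = 0.0;
        for (int i = 0; i < a.Length; i++)
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.Sqrt(sum);
    }

    public static void Main()
    {
        double[][] m = Build(7, 1.0, 0.0);
        Console.WriteLine("Outputs per item: " + m[0].Length);
        Console.WriteLine("Distance 0-1: " + Distance(m[0], m[1]));
        Console.WriteLine("Distance 2-6: " + Distance(m[2], m[6]));
    }
}
```

Because every pair of points is the same distance apart, no class is treated as "closer" to another class than to any other when the error is calculated.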

Using Segregators

Segregators are used to exclude certain records from normalization. You can segregate based on input field values, or you can simply segregate a certain percentage of the records. This allows you to exclude certain records altogether, or simply to separate records into different sets. This can be a very effective way to build training and evaluation sets.

All Encog segregators implement the ISegregator interface. The following sections will examine the different Encog segregation types.

Using IndexRangeSegregator

The IndexRangeSegregator is useful when you know exactly how many records are in your dataset and you would like to specify a certain range. For example, if you had 10,000 records, and you knew that you wanted records 1 through 7,500, you might choose to use an IndexRangeSegregator. The following lines of code illustrate this concept.

IndexRangeSegregator segregator =

new IndexRangeSegregator(0,7499);

norm.AddSegregator(segregator);

There are several disadvantages to such a simple approach. First, you must know exactly how many records there are to normalize. Second, the records you are collecting will all occur next to each other. If you are simply grabbing the first 7,500 records, you are only accessing records from the first part of the dataset. It would be better to have a more uniform distribution.

Using IndexSampleSegregator

The IndexSampleSegregator does not require you to know the size of the dataset, and it provides a more uniform distribution. The following lines of code show how to set up an IndexSampleSegregator.

IndexSampleSegregator segregator =

new IndexSampleSegregator(start,stop,size);

norm.AddSegregator(segregator);

The variables start, stop and size specify how to select elements. First, you must select a sample size. The dataset is processed in consecutive samples of this size, repeated over and over as elements are processed. Only elements whose index within the sample falls between the start and stop indexes, inclusive, will be included. For example, consider the following list of ten records.

Record 0

Record 1

Record 2

Record 3

Record 4

Record 5

Record 6

Record 7

Record 8

Record 9

We will specify a sample size of five, a start index of zero and an ending index of three. The following records would be included.

Record 0: Sample 0, Sample index 0, Included

Record 1: Sample 0, Sample index 1, Included

Record 2: Sample 0, Sample index 2, Included

Record 3: Sample 0, Sample index 3, Included

Record 4: Sample 0, Sample index 4, Not Included

Record 5: Sample 1, Sample index 0, Included

Record 6: Sample 1, Sample index 1, Included

Record 7: Sample 1, Sample index 2, Included

Record 8: Sample 1, Sample index 3, Included

Record 9: Sample 1, Sample index 4, Not Included

Because the sample repeats through the dataset, a much more uniform distribution is achieved.
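The selection rule used in the listing above amounts to a remainder calculation. The following plain C# sketch reproduces the ten-record example. It reflects my reading of the behavior described here rather than the Encog source, and the IndexSampleSketch class is a hypothetical name.

```csharp
using System;

// Hypothetical illustration of sample-based selection; not part of Encog.
public static class IndexSampleSketch
{
    // A record is kept when its position inside its sample
    // lies between start and stop, inclusive.
    public static bool IsIncluded(int recordIndex, int start, int stop, int size)
    {
        int sampleIndex = recordIndex % size;
        return sampleIndex >= start && sampleIndex <= stop;
    }

    public static void Main()
    {
        const int start = 0, stop = 3, size = 5;

        for (int record = 0; record < 10; record++)
        {
            Console.WriteLine(
                "Record " + record +
                ": Sample " + (record / size) +
                ", Sample index " + (record % size) +
                ", " + (IsIncluded(record, start, stop, size) ? "Included" : "Not Included"));
        }
    }
}
```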

Using IntegerBalanceSegregator

Sometimes the training set will contain too many samples of one particular item. Consider the forest cover example that will be presented in this chapter. For all of the land areas sampled, certain trees are far more common than others. This could cause the more prevalent tree types to saturate the training data. To prevent this from happening, the data should be balanced. The IntegerBalanceSegregator class can perform such a balance. The following lines of code show how to set up an IntegerBalanceSegregator.

IntegerBalanceSegregator segregator = new IntegerBalanceSegregator(balanceField,count);

norm.AddSegregator(segregator);

You must specify a field to balance on, called balanceField. Additionally, a count must be provided to tell Encog the maximum number of records for each unique value in the balanceField. To determine how many unique sets there are, Encog will truncate each unique value in the balanceField to an integer. Every unique integer will be allowed up to count records.

Once the normalization has been processed, you can display how many samples were present for each unique integer value on the balancing field. The following code shows how to do this.

norm.Process();

Console.WriteLine("Samples per tree type:");

Console.WriteLine(segregator.DumpCounts());

This will display a simple listing of the count for each of the trees.
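The balancing rule can be illustrated without Encog. The sketch below keeps a running count for each integer value of the balance field and drops any record once its value has reached the limit. The BalanceSketch class, its parameters and the sample records are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of integer balancing; not part of Encog.
public static class BalanceSketch
{
    // Keeps a record only while its class has been kept fewer than 'count' times.
    public static List<double[]> Balance(
        IEnumerable<double[]> records,
        int balanceField,
        int count,
        out Dictionary<int, int> kept)
    {
        var result = new List<double[]>();
        kept = new Dictionary<int, int>();

        foreach (double[] record in records)
        {
            // Truncate the field to an integer to find its class.
            int key = (int)record[balanceField];

            kept.TryGetValue(key, out int seen);
            if (seen < count)
            {
                kept[key] = seen + 1;
                result.Add(record);
            }
        }
        return result;
    }

    public static void Main()
    {
        // Made-up records: field 1 holds the class to balance on.
        var records = new List<double[]>
        {
            new[] { 0.5, 1.0 },
            new[] { 0.7, 1.0 },
            new[] { 0.2, 1.0 },
            new[] { 0.9, 2.0 }
        };

        List<double[]> balanced = Balance(records, 1, 2, out Dictionary<int, int> kept);

        Console.WriteLine("Records kept: " + balanced.Count);
        foreach (KeyValuePair<int, int> pair in kept)
        {
            Console.WriteLine(pair.Key + " -> " + pair.Value + " count");
        }
    }
}
```

Note that a class with fewer records than the limit simply contributes everything it has, so the result is capped rather than perfectly even.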

Using RangeSegregator

The range segregator object allows you to exclude records because the value of one of the input fields falls in a specific range. For example, the forest cover data contains a field that designates which wilderness area the data was collected from. If you wished to only process data from one wilderness area, you would use a range segregator. The following lines of code show how to use a RangeSegregator.

RangeSegregator seg = new RangeSegregator(inputField,false);

seg.AddRange(1, 10, true);

norm.AddSegregator(seg);

The above code would only allow records in the range of 1 to 10. The true value on the AddRange method call indicates that values in this range are included. The false value on the constructor indicates that records that do not fall under any of the defined ranges should be excluded.
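The interaction between the two Boolean values is easy to get backwards, so the following plain C# sketch spells out the decision. It is an illustration of the behavior described above, not Encog's RangeSegregator. The RangeFilter class is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of range-based segregation; not part of Encog.
public class RangeFilter
{
    private readonly List<(double Low, double High, bool Include)> ranges =
        new List<(double Low, double High, bool Include)>();
    private readonly bool defaultInclude;

    // 'defaultInclude' decides the fate of values that match no range.
    public RangeFilter(bool defaultInclude)
    {
        this.defaultInclude = defaultInclude;
    }

    public void AddRange(double low, double high, bool include)
    {
        ranges.Add((low, high, include));
    }

    public bool ShouldInclude(double value)
    {
        foreach (var r in ranges)
        {
            if (value >= r.Low && value <= r.High)
                return r.Include;
        }
        return defaultInclude;
    }
}

public static class RangeFilterDemo
{
    public static void Main()
    {
        var filter = new RangeFilter(false);   // exclude anything unmatched
        filter.AddRange(1, 10, true);          // include values from 1 to 10

        Console.WriteLine(filter.ShouldInclude(5));    // inside the range
        Console.WriteLine(filter.ShouldInclude(11));   // outside every range
    }
}
```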

Normalization Targets

Normalization targets specify what Encog should actually do with the normalized data that it generates. Every normalization target must implement the INormalizationStorage interface. Normalization targets are provided for arrays, CSV fields and INeuralDataSet objects. The following sections describe the Encog normalization targets.

Using the NormalizationStorageArray1D Class

A one-dimensional array can be used as the target for normalized data. The NormalizationStorageArray1D class is used to do this. The one-dimensional array is limited in that it can hold only a single field. The following code shows how to normalize to a one-dimensional array.

double[] arrayOutput = new double[5];

NormalizationStorageArray1D target =

new NormalizationStorageArray1D(arrayOutput);

DataNormalization norm = new DataNormalization();

norm.Storage = target;

As you can see from the above code, the one-dimensional array is created and passed to the constructor of the NormalizationStorageArray1D object. If more than one output field is generated by the normalization class, an error will occur. To support multiple fields per record, the NormalizationStorageArray2D class should be used.

Using NormalizationStorageArray2D

A two-dimensional array can be used as the target for normalized data. The NormalizationStorageArray2D class is used to do this. The two-dimensional array is less limited than the one-dimensional array, in that it can hold multiple fields. The following code shows how to set up normalization to a two-dimensional array.

double[][] arrayOutput = new double[2][];

arrayOutput[0] = new double[2];

arrayOutput[1] = new double[2];

NormalizationStorageArray2D target = new NormalizationStorageArray2D(arrayOutput);

DataNormalization norm = new DataNormalization();

norm.Storage = target;

As you can see from the above code, the two-dimensional array is created and passed to the constructor of the NormalizationStorageArray2D object.

Using NormalizationStorageCSV

A very common technique is to use a CSV file to hold the normalized data. The following code defines a normalization target that will save to a CSV file.

String file = "output.csv";

DataNormalization norm = new DataNormalization();

norm.Storage = new NormalizationStorageCSV(file);

Once the normalization process is complete the CSV file will hold the results of the normalization. This CSV file can then be used for neural network training.

Using NormalizationStorageNeuralDataSet

You can also normalize directly into an INeuralDataSet. The following code shows how to create a new dataset and normalize directly into it.

DataNormalization norm = new DataNormalization();

norm.Storage = new NormalizationStorageNeuralDataSet(2, 1);

norm.Process();

INeuralDataSet training = norm.Storage.DataSet;

Once the normalization process is complete, the dataset will contain the results of the normalization.

You can also pass an already created INeuralDataSet, such as a BufferedNeuralDataSet, into the constructor of the NormalizationStorageNeuralDataSet. This can be a powerful technique for saving the normalized data to a binary file that can be used for training later. Training from binary files is much faster than training from CSV files. The following lines show how this is done.

BufferedNeuralDataSet buffer =

new BufferedNeuralDataSet(filename);

DataNormalization norm = new DataNormalization();

norm.Storage = new NormalizationStorageNeuralDataSet(buffer);

buffer.BeginLoad(inputLayerSize, outputLayerSize);

norm.Process();

buffer.EndLoad();

The above code would create a binary file that contains the results of the normalization. This binary file could be used later to train a neural network. The forest cover example, shown in the next section, uses this technique.

Running the Forest Cover Example

To demonstrate how to use normalization this chapter presents an example that attempts to predict what type of trees might be growing on an area of wilderness. It will use publicly available data. This example is meant to be very “real world”. It demonstrates the steps that you might go through with a neural network project of your own. The following four steps are needed to set up and process a neural network.

  • Obtain the data
  • Generate training and evaluation files
  • Train the neural network
  • Evaluate the neural network

We will begin in the next section with obtaining the raw data.

Obtaining the Raw Data

The data that we will use was obtained from the United States Forest Service (USFS). It can be downloaded from the University of California at Irvine, at the following URL.

http://kdd.ics.uci.edu/databases/covertype/covertype.html

The data to be used is described as follows (from the Web site).

The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from US Geological Survey (USGS) and USFS data. Data is in raw form (not scaled) and contains binary (0 or 1) columns of data for qualitative independent variables (wilderness areas and soil types).

Summary Statistics

Number of instances (observations): 581012

Number of attributes: 54

Attribute breakdown: 12 measures, but 54 columns of data (10 quantitative variables, 4 binary wilderness areas and 40 binary soil type variables)

Missing attribute values: None

The file that you will download is named covtype.data. This file will be used in the next step to generate training and evaluation data.

Generating Data Files

You should place the covtype.data file from the last section in a directory. For this example, I will assume that it is in the location c:\data. You must modify the Constant.cs file to reflect this location. You can see the line to modify here.

/// <summary>

/// The base directory that all of the data for this example

/// is stored in.

/// </summary>

public const String BASE_DIRECTORY = "c:\\data\\";

All of the other data files are based on this directory. To generate the files, run the forest example with the parameter generate. You must also specify an e or an o parameter to determine if equilateral or one-of-n normalization should be used for the tree types. Generally, you will get better results with equilateral. For more information on the difference between equilateral and one-of-n normalization, refer to the material earlier in this chapter. For example, to generate the files with equilateral normalization you would use the following arguments.

ConsoleExamples ForestCover generate e

Once the program executes, you will see the following output:

Step 1: Generate training and evaluation files

Generate training file

10000/0 Processing data (single pass)

20000/0 Processing data (single pass)

30000/0 Processing data (single pass)

40000/0 Processing data (single pass)

50000/0 Processing data (single pass)

60000/0 Processing data (single pass)

70000/0 Processing data (single pass)

80000/0 Processing data (single pass)

90000/0 Processing data (single pass)

100000/0 Processing data (single pass)

110000/0 Processing data (single pass)

120000/0 Processing data (single pass)

130000/0 Processing data (single pass)

140000/0 Processing data (single pass)

...

390000/0 Processing data (single pass)

400000/0 Processing data (single pass)

410000/0 Processing data (single pass)

420000/0 Processing data (single pass)

430000/0 Processing data (single pass)

Generate evaluation file

10000/0 Processing data (single pass)

20000/0 Processing data (single pass)

30000/0 Processing data (single pass)

40000/0 Processing data (single pass)

50000/0 Processing data (single pass)

60000/0 Processing data (single pass)

70000/0 Processing data (single pass)

80000/0 Processing data (single pass)

90000/0 Processing data (single pass)

100000/0 Processing data (single pass)

110000/0 Processing data (single pass)

120000/0 Processing data (single pass)

130000/0 Processing data (single pass)

140000/0 Processing data (single pass)

Step 2: Balance training to have the same number of each tree

10000/0 Processing data (single pass)

20000/0 Processing data (single pass)

Samples per tree type:

1 -> 3000 count

2 -> 3000 count

3 -> 3000 count

4 -> 2066 count

5 -> 3000 count

6 -> 3000 count

7 -> 3000 count

Step 3: Normalize training data

0/0 Analyzing file

10000/0 First pass, analyzing file

20000/0 First pass, analyzing file

10000/20066 Second pass, normalizing data

20000/20066 Second pass, normalizing data

First, when you generate the data files, covtype.data is split into training data and evaluation data. The training data, which is 75% of the file, is named training.csv. The evaluation data, which is 25% of the file, is named evaluate.csv.
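The size of each file follows directly from the sampling rule. The sketch below counts how many of the 581,012 records land in each file, using the start, stop and size values that the example's Step1 method passes to Copy (0, 2, 4 for training and 3, 3, 4 for evaluation). The remainder-based rule is my reading of the IndexSampleSegregator description earlier in this chapter, and the SplitCount class is a hypothetical name.

```csharp
using System;

// Hypothetical illustration of the 75/25 split; not part of Encog.
public static class SplitCount
{
    // Counts the records whose index within each sample
    // lies between start and stop, inclusive.
    public static int CountIncluded(int records, int start, int stop, int size)
    {
        int count = 0;
        for (int i = 0; i < records; i++)
        {
            int sampleIndex = i % size;
            if (sampleIndex >= start && sampleIndex <= stop)
                count++;
        }
        return count;
    }

    public static void Main()
    {
        const int records = 581012;   // instances in covtype.data

        int training = CountIncluded(records, 0, 2, 4);
        int evaluation = CountIncluded(records, 3, 3, 4);

        Console.WriteLine("Training records:   " + training);
        Console.WriteLine("Evaluation records: " + evaluation);
    }
}
```

Under these assumptions the evaluation file receives 145,253 records, which agrees with the total case count reported when the network is evaluated.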

Next, the training data is balanced so that there are at most 3,000 of each tree type. The data has considerably more of some tree types than others. Balancing decreases training time, and also prevents one tree type from saturating the weight matrix with its patterns. The balanced tree data is written to the file balance.csv. There is no need to balance the evaluation data. The evaluation data is meant to be what the neural network faces after it is trained. We want to do nothing to “stage” the evaluation data.

Once the data has been balanced it must be normalized. The data is still in raw form in the balance.csv file. At this point the data has been pared down, but it is still in the same form as it was in the original covtype.data file. The normalized data is written to the normalized.csv file. This is the file that will be used to train the neural network. The DataNormalization object is also saved to the forest.eg file. The forest.eg file is an Encog XML persistence file. Encog persistence will be covered in Chapter 7. The exact process that was used to normalize each field will be covered later in this chapter when the source code to the forest example is reviewed.

Now that the files have been generated, the neural network is ready to train. Training will be covered in the next section.

Training the Network

There are two methods provided for training. The first is simple console-mode training. For console training you must specify how long you would like the neural network to train in the Constant.cs file. There is a constant named TRAINING_MINUTES that specifies how long to train the network. The default is 10 minutes; however, you can change it to any number you like. Longer training times will generally produce better results. You can see the setting here.

/// <summary>

/// How many minutes to train for (console mode only)

/// </summary>

public const int TRAINING_MINUTES = 10;

To begin console-mode training the following command should be used.

ConsoleExamples ForestCover train

Of course you will need to add the appropriate path and class path information. Once the program executes, you will see the following output.

Converting training file to binary

Beginning training...

Iteration #1 Error:45.093191% elapsed time = 00:00:23 time left = 00:10:00

Iteration #2 Error:45.660918% elapsed time = 00:00:46 time left = 00:10:00

Iteration #3 Error:44.983507% elapsed time = 00:01:09 time left = 00:09:00

Iteration #4 Error:49.432105% elapsed time = 00:01:32 time left = 00:09:00

Iteration #5 Error:39.701852% elapsed time = 00:01:55 time left = 00:09:00

Iteration #6 Error:30.401943% elapsed time = 00:02:18 time left = 00:08:00

Iteration #25 Error:13.369462% elapsed time = 00:09:48 time left = 00:01:00

Iteration #26 Error:13.275960% elapsed time = 00:10:14 time left = 00:00:00

Training complete, saving network...

The ten-minute default is not enough to thoroughly train the neural network. However, it is enough for a quick example of what the program is capable of. In this example the neural network was trained to around 13% error.

It is also possible to train using the GUI. GUI training displays statistics about training and does not require the training time to be specified. To begin in GUI training mode run the example with the traingui argument.

ConsoleExamples ForestCover traingui

Of course you will need to add the appropriate path and class path information. Once the program executes, you will see the training dialog. Figure 6.1 shows the GUI training being used.

Figure 6.1: GUI Training

When you are ready to stop training, simply click “Stop” and training will cease. Once training has stopped, the neural network will be saved to the forest.eg file. As you can see from the above dialog, I ran the training for over two days. I then allowed it to continue further, although training progressed very slowly after this point. Training was stopped once it had reached 63,328 iterations, which took five days and eleven hours. The additional three days of training only lowered the error rate from 7.4% to 7.19%.

Now that the neural network has been trained, it is time to evaluate its performance.

Evaluating the Network

While evaluating the performance of the neural network the evaluate.csv file is used. This file contains the 25% of the raw data that was saved for evaluation. To evaluate the neural network you should run the example with the evaluate argument.

ConsoleExamples ForestCover evaluate

Of course you will need to add the appropriate path and class path information. Once the program executes, you will see the following output.

Total cases:145253

Correct cases:92725

Correct percent:64%

Tree Type #0 - Correct/total: 35560/52986(67%)

Tree Type #1 - Correct/total: 39151/70779(55%)

Tree Type #2 - Correct/total: 6724/8947(75%)

Tree Type #3 - Correct/total: 650/681(95%)

Tree Type #4 - Correct/total: 2227/2384(93%)

Tree Type #5 - Correct/total: 3451/4348(79%)

Tree Type #6 - Correct/total: 4962/5128(97%)

The above output is from a neural network that was trained to a 7.19% error rate. Overall, the success rate was 64%. However, you will notice that tree type #1 is the primary reason for this somewhat low score. Most of the other tree types scored 70% or higher, and some scored 90% or higher.
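The percentages in the output above can be recomputed from the raw counts. The following plain C# sketch does so, assuming each figure is rounded to the nearest whole percent. The AccuracyCheck class is a hypothetical name, and the counts are copied from the evaluation output.

```csharp
using System;

// Hypothetical helper for checking the reported percentages; not part of Encog.
public static class AccuracyCheck
{
    // Percentage of correct cases, rounded to the nearest whole number.
    public static int Percent(int correct, int total)
    {
        return (int)Math.Round(100.0 * correct / total);
    }

    public static void Main()
    {
        // Counts taken from the evaluation output, tree types #0 to #6.
        int[] correct = { 35560, 39151, 6724, 650, 2227, 3451, 4962 };
        int[] total = { 52986, 70779, 8947, 681, 2384, 4348, 5128 };

        int sumCorrect = 0;
        int sumTotal = 0;
        for (int i = 0; i < correct.Length; i++)
        {
            sumCorrect += correct[i];
            sumTotal += total[i];
            Console.WriteLine("Tree Type #" + i + ": " +
                Percent(correct[i], total[i]) + "%");
        }

        Console.WriteLine("Overall: " + sumCorrect + "/" + sumTotal +
            " (" + Percent(sumCorrect, sumTotal) + "%)");
    }
}
```

Tree types #0 and #1 together account for more than 85% of the evaluation cases, which is why their lower scores dominate the overall figure.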

Further training may be able to improve it. More advanced handling of the data may improve it as well. This example does not make use of the “wilderness area” column. This column tells from which wilderness area the data was collected. You may want to limit the example to only one wilderness area, or in some way incorporate this field into the input data for the neural network. The four areas are relatively close, so it is unlikely that it will have a significant effect; however, it is an area for further study.

Another method to further refine the results might be to examine what tree type the network is consistently guessing incorrectly for tree type 0. It could be that these two species of trees are very similar and some additional criteria might be required to tell them apart.

In the past few sections you saw how to execute the forest cover example. In the next section we will examine how the forest example was constructed.

Understanding the Forest Cover Example

The last few sections described how to execute the forest cover example. We will now look at the source code behind the forest cover neural network example. There are several files that make up this example. These files are listed here.

  • Constant.cs – Configuration information for the program.
  • Evaluate.cs – Evaluate the trained neural network.
  • ForestCover.cs – Main entry point for the program.
  • GenerateData.cs – Generate the data files.
  • TrainNetwork.cs – Train the neural network.

The Constant class contains configuration items that you can change. For example, you can set the number of hidden neurons to use. By default the program uses 100 hidden neurons. The main entry point for the program is the ForestCover class. This class is shown in Listing 6.1.

Listing 6.1: The Forest Cover Program Entry Point

using System;

using System.Collections.Generic;

using System.Linq;

using System.Text;

using Encog.Util.Logging;

using Encog.Normalize;

using Encog.Persist;

using System.IO;

using ConsoleExamples.Examples;

using Encog.Examples.Adaline;

namespace Encog.Examples.Forest

{

public class ForestCover : IExample

{

public static ExampleInfo Info

{

get

{

ExampleInfo info = new ExampleInfo(

typeof(ForestCover),

"forest",

"Forest Cover Prediction",

"Predicts forest cover using normalization.");

return info;

}

}

private IExampleInterface app;

public void Generate(bool useOneOf)

{

GenerateData generate = new GenerateData(this.app);

generate.Step1();

generate.Step2();

DataNormalization norm = generate.Step3(useOneOf);

EncogPersistedCollection encog =

new EncogPersistedCollection(

Constant.TRAINED_NETWORK_FILE, FileMode.CreateNew);

encog.Add(Constant.NORMALIZATION_NAME, norm);

}

public void Train(bool useGUI)

{

TrainNetwork program = new TrainNetwork(this.app);

program.Train(useGUI);

}

public void PerformEvaluate()

{

Evaluate evaluate = new Evaluate(this.app);

evaluate.PerformEvaluate();

}

public void Execute(IExampleInterface app)

{

this.app = app;

if (app.Args.Length < 1)

{

app.WriteLine(

"Usage: ForestCover [generate [e/o]/train/traingui/evaluate] ");

}

else

{

Logging.StopConsoleLogging();

if (String.Compare(

app.Args[0], "generate", true) == 0)

{

if (app.Args.Length < 2)

{

app.WriteLine("When using generate, you must specify an 'e' or an 'o' as the second parameter.");

}

else

{

bool useOneOf;

if (String.Compare(app.Args[1],

"e", true) == 0)

useOneOf = false;

else

useOneOf = true;

Generate(useOneOf);

}

}

else if (String.Compare(app.Args[0],

"train", true) == 0)

Train(false);

else if (String.Compare(app.Args[0],

"traingui", true) == 0)

Train(true);

else if (String.Compare(app.Args[0],

"evaluate", true) == 0)

PerformEvaluate();

}

}

}

}

As you can see this class is mainly concerned with passing control to one of the other classes listed above. We will examine each of these classes in the following sections.

Generating Training and Evaluation Data

The Generate method is used to generate the training and evaluation data. This method begins by accepting a parameter to determine if one-of-n normalization should be used.

public void Generate(bool useOneOf)

{

Next, an instance of the GenerateData class is created. This class will be examined later in this section.

GenerateData generate = new GenerateData(this.app);

Steps one and two of the generation process are executed. Step one segregates the data into training and evaluation files. Step two balances the numbers of cover types we have so that one cover type does not saturate the training.

generate.Step1();

generate.Step2();

Next, step three of file generation is executed, and the DataNormalization object that it used is obtained.

DataNormalization norm = generate.Step3(useOneOf);

The normalization object is then saved to an Encog persistence file. Encog persistence will be covered in greater detail in Chapter 7.

EncogPersistedCollection encog =

new EncogPersistedCollection(

Constant.TRAINED_NETWORK_FILE, FileMode.CreateNew);

encog.Add(Constant.NORMALIZATION_NAME, norm);

The Generate method makes use of methods from the GenerateData class. The GenerateData class is shown in Listing 6.2.

Listing 6.2: The Forest Cover Data File Generation

using System;

using System.Collections.Generic;

using System.Linq;

using System.Text;

using Encog.Normalize.Input;

using Encog.Normalize;

using Encog.Normalize.Output.Nominal;

using Encog.Normalize.Target;

using Encog.Normalize.Output;

using Encog.Normalize.Segregate.Index;

using Encog.Normalize.Segregate;

namespace Encog.Examples.Forest

{

public class GenerateData : IStatusReportable

{

private IExampleInterface app;

public GenerateData(IExampleInterface app)

{

this.app = app;

}

public void BuildOutputOneOf(DataNormalization norm,

IInputField coverType)

{

OutputOneOf outType = new OutputOneOf(0.9, 0.1);

outType.AddItem(coverType, 1);

outType.AddItem(coverType, 2);

outType.AddItem(coverType, 3);

outType.AddItem(coverType, 4);

outType.AddItem(coverType, 5);

outType.AddItem(coverType, 6);

outType.AddItem(coverType, 7);

norm.AddOutputField(outType, true);

}

public void BuildOutputEquilateral(

DataNormalization norm, IInputField coverType)

{

OutputEquilateral outType =

new OutputEquilateral(0.9, 0.1);

outType.AddItem(coverType, 1);

outType.AddItem(coverType, 2);

outType.AddItem(coverType, 3);

outType.AddItem(coverType, 4);

outType.AddItem(coverType, 5);

outType.AddItem(coverType, 6);

outType.AddItem(coverType, 7);

norm.AddOutputField(outType, true);

}

public void Copy(

String source,

String target,

int start,

int stop,

int size)

{

IInputField[] inputField = new IInputField[55];

DataNormalization norm = new DataNormalization();

norm.Report = this;

norm.Storage = new NormalizationStorageCSV(target);

for (int i = 0; i < 55; i++)

{

inputField[i] = new InputFieldCSV(

true, source, i);

norm.AddInputField(inputField[i]);

IOutputField outputField =

new OutputFieldDirect(inputField[i]);

norm.AddOutputField(outputField);

}

// load only the part we actually want,

// i.e. training or eval

IndexSampleSegregator segregator2 =

new IndexSampleSegregator(start, stop, size);

norm.AddSegregator(segregator2);

norm.Process();

}

public void Narrow(

String source, String target, int field, int count)

{

IInputField[] inputField = new IInputField[55];

DataNormalization norm = new DataNormalization();

norm.Report = this;

norm.Storage = new NormalizationStorageCSV(target);

for (int i = 0; i < 55; i++)

{

inputField[i] = new InputFieldCSV(true,

source, i);

norm.AddInputField(inputField[i]);

IOutputField outputField =

new OutputFieldDirect(inputField[i]);

norm.AddOutputField(outputField);

}

IntegerBalanceSegregator segregator =

new IntegerBalanceSegregator(

inputField[field], count);

norm.AddSegregator(segregator);

norm.Process();

app.WriteLine("Samples per tree type:");

app.WriteLine(segregator.DumpCounts());

}

public void Step1()

{

app.WriteLine(

"Step 1: Generate training and evaluation files");

app.WriteLine(

"Generate training file");

Copy(Constant.COVER_TYPE_FILE,

Constant.TRAINING_FILE, 0, 2, 4); // take 3/4

app.WriteLine("Generate evaluation file");

Copy(Constant.COVER_TYPE_FILE,

Constant.EVALUATE_FILE, 3, 3, 4); // take 1/4

}

public void Step2()

{

app.WriteLine(

"Step 2: Balance training to have the same number of each tree");

Narrow(Constant.TRAINING_FILE,

Constant.BALANCE_FILE, 54, 3000);

}

public DataNormalization Step3(bool useOneOf)

{

app.WriteLine("Step 3: Normalize training data");

IInputField inputElevation;

IInputField inputAspect;

IInputField inputSlope;

IInputField hWater;

IInputField vWater;

IInputField roadway;

IInputField shade9;

IInputField shade12;

IInputField shade3;

IInputField firepoint;

IInputField[] wilderness = new IInputField[4];

IInputField[] soilType = new IInputField[40];

IInputField coverType;

DataNormalization norm =

new DataNormalization();

norm.Report = this;

norm.Storage =

new NormalizationStorageCSV(Constant.NORMALIZED_FILE);

norm.AddInputField(inputElevation =

new InputFieldCSV(true, Constant.BALANCE_FILE, 0));

norm.AddInputField(inputAspect =

new InputFieldCSV(true, Constant.BALANCE_FILE, 1));

norm.AddInputField(inputSlope =

new InputFieldCSV(true, Constant.BALANCE_FILE, 2));

norm.AddInputField(

hWater = new InputFieldCSV(true, Constant.BALANCE_FILE, 3));

norm.AddInputField(

vWater = new InputFieldCSV(true, Constant.BALANCE_FILE, 4));

norm.AddInputField(

roadway = new InputFieldCSV(true, Constant.BALANCE_FILE, 5));

norm.AddInputField(

shade9 = new InputFieldCSV(true, Constant.BALANCE_FILE, 6));

norm.AddInputField(

shade12 = new InputFieldCSV(true, Constant.BALANCE_FILE, 7));

norm.AddInputField(

shade3 = new InputFieldCSV(true, Constant.BALANCE_FILE, 8));

norm.AddInputField(

firepoint = new InputFieldCSV(true,

Constant.BALANCE_FILE, 9));

for (int i = 0; i < 4; i++)

{

norm.AddInputField(wilderness[i] =

new InputFieldCSV(true, Constant.BALANCE_FILE, 10 + i));

}

for (int i = 0; i < 40; i++)

{

norm.AddInputField(soilType[i] =

new InputFieldCSV(true, Constant.BALANCE_FILE, 14 + i));

}

norm.AddInputField(coverType =

new InputFieldCSV(false, Constant.BALANCE_FILE, 54));

norm.AddOutputField(

new OutputFieldRangeMapped(inputElevation, 0.1, 0.9));

norm.AddOutputField(

new OutputFieldRangeMapped(inputAspect, 0.1, 0.9));

norm.AddOutputField(

new OutputFieldRangeMapped(inputSlope, 0.1, 0.9));

norm.AddOutputField(

new OutputFieldRangeMapped(hWater, 0.1, 0.9));

norm.AddOutputField(

new OutputFieldRangeMapped(vWater, 0.1, 0.9));

norm.AddOutputField(

new OutputFieldRangeMapped(roadway, 0.1, 0.9));

norm.AddOutputField(

new OutputFieldRangeMapped(shade9, 0.1, 0.9));

norm.AddOutputField(

new OutputFieldRangeMapped(shade12, 0.1, 0.9));

norm.AddOutputField(

new OutputFieldRangeMapped(shade3, 0.1, 0.9));

norm.AddOutputField(

new OutputFieldRangeMapped(firepoint, 0.1, 0.9));

for (int i = 0; i < 40; i++)

{

norm.AddOutputField(

new OutputFieldDirect(soilType[i]));

}

if (useOneOf)

BuildOutputOneOf(norm, coverType);

else

BuildOutputEquilateral(norm, coverType);

norm.Process();

return norm;

}

public void Report(int total, int current,

String message)

{

app.WriteLine(current + "/" + total + " " + message);

}

}

}

The Copy method is used twice by the first step. It essentially copies one CSV file to another, while segregating away some of the data. This is how the training and evaluation CSV files are created. The Copy method begins by accepting a source and target file. The start, stop and size parameters are used with an IndexSampleSegregator. For more information on the meaning of these three parameters, refer to the description of IndexSampleSegregator earlier in this chapter.

public void Copy(

String source,

String target,

int start,

int stop,

int size)

{

First we create an array of input fields to hold the 55 fields that make up the cover type CSV file downloaded earlier in this chapter.

IInputField[] inputField = new IInputField[55];

A DataNormalization object is created that reports its progress to the current object, and has a normalization target of a CSV file. This sends the output to the CSV file specified by the target parameter.

DataNormalization norm = new DataNormalization();

norm.Report = this;

norm.Storage = new NormalizationStorageCSV(target);

Now we must create all 55 input and output fields. The input fields come from fields in the CSV file, using InputFieldCSV. The output fields are all direct copies of the input fields, using OutputFieldDirect.

for (int i = 0; i < 55; i++)

{

inputField[i] = new InputFieldCSV(true, source, i);

norm.AddInputField(inputField[i]);

IOutputField outputField =

new OutputFieldDirect(inputField[i]);

norm.AddOutputField(outputField);

}

Next a segregator is created. It divides the file into samples of size rows each. Only rows whose index within a sample falls between start and stop, inclusive, are written to the target file.

IndexSampleSegregator segregator2 =

new IndexSampleSegregator(start, stop, size);

norm.AddSegregator(segregator2);

norm.Process();

}
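The index rule used by the segregator can be sketched in plain C#. This is only an illustration of the rule described above, not Encog's implementation; the class and method names (IndexSampleSketch, ShouldInclude, Select) are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class IndexSampleSketch
{
    // Keep a row when its position inside the current sample of `size` rows
    // falls between `start` and `stop`, inclusive.
    public static bool ShouldInclude(long rowIndex, int start, int stop, int size)
    {
        long indexInSample = rowIndex % size;
        return indexInSample >= start && indexInSample <= stop;
    }

    // Filter any row sequence using the rule above.
    public static IEnumerable<T> Select<T>(
        IEnumerable<T> rows, int start, int stop, int size)
    {
        return rows.Where((row, i) => ShouldInclude(i, start, stop, size));
    }
}
```

With eight rows numbered 0 to 7, Select(rows, 0, 2, 4) keeps rows 0, 1, 2, 4, 5 and 6, while Select(rows, 3, 3, 4) keeps rows 3 and 7.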

Apart from segregation, the Copy method performs no actual normalization. The Copy method is used by the Step1 method.

public void Step1()

{

Step one generates the training and evaluation files.

app.WriteLine(

"Step 1: Generate training and evaluation files");

app.WriteLine(

"Generate training file");

First, we create the training file. We specify a sample size of four. This breaks the file up into sections of four rows. The first four rows in the file make up the first sample. The next four rows make up the second sample, and so on. Specifying a start and stop index of zero and two means that, of each four-row sample, we will use the rows at index zero, one and two. The row at index three will be left for the evaluation data. As a result, we use 75% of the data for training.

Copy(Constant.COVER_TYPE_FILE,

Constant.TRAINING_FILE, 0, 2, 4); // take 3/4

Next we create the evaluation file. We again use a sample size of four; however, the starting and stopping indexes are both three. This means that we will only use the fourth row (index 3) from each sample.

app.WriteLine("Generate evaluation file");

Copy(Constant.COVER_TYPE_FILE,

Constant.EVALUATE_FILE, 3, 3, 4); // take 1/4

The result is that we use 25% of the data for evaluation. The Narrow method is used by step two to narrow down the files and allow a maximum of 3,000 rows of each tree type.

public void Narrow(

String source,

String target,

int field,

int count)

{

The Narrow method accepts the source and target files to use. We also specify the field to narrow on, as well as the maximum number of rows to allow for each distinct value of that field. This method begins very similarly to the Copy method. We create 55 fields to be directly copied from the input field to the output field.

IInputField[] inputField = new IInputField[55];

DataNormalization norm = new DataNormalization();

norm.Report = this;

norm.Storage = new NormalizationStorageCSV(target);

for (int i = 0; i < 55; i++)

{

inputField[i] = new InputFieldCSV(true, source, i);

norm.AddInputField(inputField[i]);

IOutputField outputField =

new OutputFieldDirect(inputField[i]);

norm.AddOutputField(outputField);

}

The Narrow method differs from the Copy method in that an IntegerBalanceSegregator is used. This segregator will allow at most count rows for each distinct value of the specified balancing field.

IntegerBalanceSegregator segregator =

new IntegerBalanceSegregator(

inputField[field], count);

norm.AddSegregator(segregator);

The normalization is now performed.

norm.Process();

app.WriteLine("Samples per tree type:");

app.WriteLine(segregator.DumpCounts());

The last activity performed by the Narrow method is to display the counts for each unique value found in the balancing field. Step three will actually normalize the data. This is covered in the next section.
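The balancing idea behind the Narrow method can be sketched without Encog. The following is a minimal illustration that assumes each row is an array of integers; BalanceSketch and its Narrow method are hypothetical names, not part of the Encog API.

```csharp
using System;
using System.Collections.Generic;

public static class BalanceSketch
{
    // Keep at most `maxPerValue` rows for each distinct value found in
    // column `field`. `counts` reports how many rows were kept per value.
    public static List<int[]> Narrow(
        IEnumerable<int[]> rows, int field, int maxPerValue,
        out SortedDictionary<int, int> counts)
    {
        counts = new SortedDictionary<int, int>();
        var kept = new List<int[]>();
        foreach (int[] row in rows)
        {
            int key = row[field];
            counts.TryGetValue(key, out int seen); // seen is 0 for a new key
            if (seen < maxPerValue)
            {
                counts[key] = seen + 1;
                kept.Add(row);
            }
        }
        return kept;
    }
}
```

Rows beyond the limit for a value are simply skipped, so a very common tree type cannot dominate the training file.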

Normalizing the Data

The Step3 method normalizes the data. It accepts a parameter to tell whether we are using one-of-n normalization.

public DataNormalization Step3(bool useOneOf)

{

app.WriteLine("Step 3: Normalize training data");

First we must create a number of local variables to use to set up the normalization object.

IInputField inputElevation;

IInputField inputAspect;

IInputField inputSlope;

IInputField hWater;

IInputField vWater;

IInputField roadway;

IInputField shade9;

IInputField shade12;

IInputField shade3;

IInputField firepoint;

IInputField[] wilderness = new IInputField[4];

IInputField[] soilType = new IInputField[40];

IInputField coverType;

The normalization object is created. It will report to the current object, and it will output to a CSV file.

DataNormalization norm = new DataNormalization();

norm.Report = this;

norm.Storage =

new NormalizationStorageCSV(Constant.NORMALIZED_FILE);

Next we must add all of the input fields from the data file. There are 55 of them. We start with the elevation field, and add in the other simple fields.

norm.AddInputField(inputElevation =

new InputFieldCSV(true, Constant.BALANCE_FILE, 0));

norm.AddInputField(inputAspect =

new InputFieldCSV(true, Constant.BALANCE_FILE, 1));

norm.AddInputField(inputSlope =

new InputFieldCSV(true, Constant.BALANCE_FILE, 2));

norm.AddInputField(hWater =

new InputFieldCSV(true, Constant.BALANCE_FILE, 3));

norm.AddInputField(vWater =

new InputFieldCSV(true, Constant.BALANCE_FILE, 4));

norm.AddInputField(roadway =

new InputFieldCSV(true, Constant.BALANCE_FILE, 5));

norm.AddInputField(shade9 =

new InputFieldCSV(true, Constant.BALANCE_FILE, 6));

norm.AddInputField(shade12 =

new InputFieldCSV(true, Constant.BALANCE_FILE, 7));

norm.AddInputField(shade3 =

new InputFieldCSV(true, Constant.BALANCE_FILE, 8));

norm.AddInputField(firepoint =

new InputFieldCSV(true, Constant.BALANCE_FILE, 9));

Once the initial fields have been added, we must add in the wilderness areas and soil types. Both of these are arrays of fields. There are four wilderness areas and 40 soil types.

for (int i = 0; i < 4; i++)

{

norm.AddInputField(wilderness[i] =

new InputFieldCSV(true, Constant.BALANCE_FILE, 10 + i));

}

for (int i = 0; i < 40; i++)

{

norm.AddInputField(soilType[i] =

new InputFieldCSV(true, Constant.BALANCE_FILE, 14 + i));

}

The final field is the cover type; it is at index 54 in the CSV file. The cover type is what we are attempting to predict.

norm.AddInputField(coverType =

new InputFieldCSV(false, Constant.BALANCE_FILE, 54));

For the initial fields, we will range map them to values between 0.1 and 0.9. The values 0.1 and 0.9 were chosen over 0.0 and 1.0 to keep the data away from the extreme ends of the range that the neural network can produce. That range is 0.0 to 1.0 because this neural network uses a sigmoid activation function.

norm.AddOutputField(

new OutputFieldRangeMapped(inputElevation, 0.1, 0.9));

norm.AddOutputField(

new OutputFieldRangeMapped(inputAspect, 0.1, 0.9));

norm.AddOutputField(

new OutputFieldRangeMapped(inputSlope, 0.1, 0.9));

norm.AddOutputField(

new OutputFieldRangeMapped(hWater, 0.1, 0.9));

norm.AddOutputField(

new OutputFieldRangeMapped(vWater, 0.1, 0.9));

norm.AddOutputField(

new OutputFieldRangeMapped(roadway, 0.1, 0.9));

norm.AddOutputField(

new OutputFieldRangeMapped(shade9, 0.1, 0.9));

norm.AddOutputField(

new OutputFieldRangeMapped(shade12, 0.1, 0.9));

norm.AddOutputField(

new OutputFieldRangeMapped(shade3, 0.1, 0.9));

norm.AddOutputField(

new OutputFieldRangeMapped(firepoint, 0.1, 0.9));
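Range mapping is a linear rescaling. The sketch below shows the arithmetic, assuming the minimum and maximum of the field have already been found in an earlier pass over the data. RangeMapSketch and Map are hypothetical names used only for illustration.

```csharp
using System;

public static class RangeMapSketch
{
    // Linearly map `value` from [dataMin, dataMax] onto [low, high].
    public static double Map(
        double value, double dataMin, double dataMax, double low, double high)
    {
        if (dataMax == dataMin)
        {
            throw new ArgumentException("dataMax must differ from dataMin");
        }
        return (value - dataMin) / (dataMax - dataMin) * (high - low) + low;
    }
}
```

For a field whose observed values run from 0 to 360, Map(180, 0, 360, 0.1, 0.9) gives approximately 0.5, and the two endpoints map to 0.1 and 0.9.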

The soil types already have values of 0 or 1, so they can be directly placed into the network. It might be interesting to try values of 0.1 and 0.9 to see how they affect the efficiency of the neural network. However, because these are absolute Boolean values, we placed them at the extremes of 0.0 and 1.0.

for (int i = 0; i < 40; i++)

{

norm.AddOutputField(new OutputFieldDirect(soilType[i]));

}

The cover type is normalized using either equilateral or one-of-n. The methods BuildOutputOneOf and BuildOutputEquilateral can be seen in Listing 6.2.

if (useOneOf)

BuildOutputOneOf(norm, coverType);

else

BuildOutputEquilateral(norm, coverType);
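As a reminder of what one-of-n encoding produces, here is a minimal sketch in plain C#. It is not the Encog implementation; OneOfNSketch and Encode are hypothetical names, and the high and low values are left to the caller as an assumption.

```csharp
using System;

public static class OneOfNSketch
{
    // Build an array with one element per class. The element for
    // `classIndex` is set to `high`; every other element is set to `low`.
    public static double[] Encode(
        int classIndex, int classCount, double high, double low)
    {
        if (classIndex < 0 || classIndex >= classCount)
        {
            throw new ArgumentOutOfRangeException(nameof(classIndex));
        }
        var result = new double[classCount];
        for (int i = 0; i < classCount; i++)
        {
            result[i] = (i == classIndex) ? high : low;
        }
        return result;
    }
}
```

With seven tree types, one-of-n therefore needs seven output neurons, whereas equilateral encoding needs one fewer.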

Finally, the normalization is performed and the normalization object is returned.

norm.Process();

return norm;

}

The training data has now been normalized. The neural network can be trained.

Training the Network

Now that the data is ready, the network must be trained. The code for training is shown in Listing 6.3.

Listing 6.3: The Forest Cover Network Training

using System;

using System.Collections.Generic;

using System.Linq;

using System.Text;

using Encog.Neural.Networks;

using Encog.Neural.NeuralData;

using Encog.Neural.Activation;

using Encog.Neural.Networks.Layers;

using Encog.Neural.Networks.Logic;

using Encog.Persist;

using Encog.Normalize;

using Encog.Neural.Data.Buffer;

using Encog.Util.Simple;

using System.IO;

namespace Encog.Examples.Forest

{

public class TrainNetwork

{

private IExampleInterface app;

public TrainNetwork(IExampleInterface app)

{

this.app = app;

}

public BasicNetwork GenerateNetwork(

INeuralDataSet trainingSet)

{

BasicNetwork network = new BasicNetwork();

network.AddLayer(

new BasicLayer(new ActivationSigmoid(), true,

trainingSet.InputSize));

network.AddLayer(new BasicLayer(

new ActivationSigmoid(), true,

Constant.HIDDEN_COUNT));

network.AddLayer(

new BasicLayer(new ActivationSigmoid(),

true, trainingSet.IdealSize));

network.Logic = new FeedforwardLogic();

network.Structure.FinalizeStructure();

network.Reset();

return network;

}

public void Train(bool useGUI)

{

app.WriteLine("Converting training file to binary");

EncogPersistedCollection encog =

new EncogPersistedCollection(

Constant.TRAINED_NETWORK_FILE, FileMode.Open);

DataNormalization norm =

(DataNormalization)encog.Find(

Constant.NORMALIZATION_NAME);

EncogUtility.ConvertCSV2Binary(

Constant.NORMALIZED_FILE, Constant.BINARY_FILE,

norm.GetNetworkInputLayerSize(),

norm.GetNetworkOutputLayerSize(), false);

BufferedNeuralDataSet trainingSet =

new BufferedNeuralDataSet(Constant.BINARY_FILE);

BasicNetwork network = (BasicNetwork)encog.Find(

Constant.TRAINED_NETWORK_NAME);

if (network == null)

network = EncogUtility.SimpleFeedForward(

norm.GetNetworkInputLayerSize(),

Constant.HIDDEN_COUNT,

0,

norm.GetNetworkOutputLayerSize(),

false);

if (useGUI)

{

EncogUtility.TrainDialog(network, trainingSet);

}

else

{

EncogUtility.TrainConsole(network,

trainingSet, Constant.TRAINING_MINUTES);

}

app.WriteLine(

"Training complete, saving network...");

encog.Add(Constant.TRAINED_NETWORK_NAME, network);

}

}

}

The Train method will actually train the neural network. A parameter is passed in that indicates if GUI training should be used or not.

public void Train(bool useGUI)

{

The network and normalization objects are read from the forest.eg file, which is an Encog persistence file.

app.WriteLine("Converting training file to binary");

EncogPersistedCollection encog =

new EncogPersistedCollection(

Constant.TRAINED_NETWORK_FILE, FileMode.Open);

DataNormalization norm =

(DataNormalization)encog.Find(Constant.NORMALIZATION_NAME);

The neural network will train from the normalized.csv file. Generally, it is a bad idea to train directly from a CSV file. A CSV file contains ASCII-encoded numbers, and these must be parsed for each line every time the file is read. It is much better to parse all of the numbers once, and store them in a binary file. The neural network is then trained from this binary file. To convert the CSV file to a binary file, use the ConvertCSV2Binary method.

EncogUtility.ConvertCSV2Binary(

Constant.NORMALIZED_FILE, Constant.BINARY_FILE,

norm.GetNetworkInputLayerSize(),

norm.GetNetworkOutputLayerSize(), false);
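The benefit of parsing once can be illustrated with standard .NET classes. The sketch below does not reproduce Encog's binary file format; it simply writes each number as a raw 8-byte double and reads the rows back. BinaryCacheSketch, CsvToBinary and ReadRows are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

public static class BinaryCacheSketch
{
    // Parse every number in the CSV file once and store it as a raw double.
    public static void CsvToBinary(string csvPath, string binaryPath)
    {
        using (var writer = new BinaryWriter(File.Create(binaryPath)))
        {
            foreach (string line in File.ReadLines(csvPath))
            {
                if (line.Trim().Length == 0) continue; // skip blank lines
                foreach (string cell in line.Split(','))
                {
                    writer.Write(
                        double.Parse(cell, CultureInfo.InvariantCulture));
                }
            }
        }
    }

    // Stream rows of `columns` doubles back without any text parsing.
    public static IEnumerable<double[]> ReadRows(string binaryPath, int columns)
    {
        using (var reader = new BinaryReader(File.OpenRead(binaryPath)))
        {
            long rowCount = reader.BaseStream.Length / (sizeof(double) * columns);
            for (long r = 0; r < rowCount; r++)
            {
                var row = new double[columns];
                for (int c = 0; c < columns; c++)
                {
                    row[c] = reader.ReadDouble();
                }
                yield return row;
            }
        }
    }
}
```

Because ReadRows yields one row at a time, the whole file never has to be held in memory, and each training pass avoids the cost of text parsing.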

A BufferedNeuralDataSet is then created that will read from the newly created binary file. This allows training on a file that is too large to fit in memory. As the rows are needed, they are read from the file. Also, because it is binary data, no time is wasted reparsing the same rows over and over.

BufferedNeuralDataSet trainingSet =

new BufferedNeuralDataSet(Constant.BINARY_FILE);

The application first looks for an existing network in the persistence file. If none is found, a new neural network is created using a utility function. This utility function is a quick way to create a simple feedforward network. The input size, hidden layer one size, hidden layer two size, output layer size and activation function type are passed in, in that order. The final parameter, which specifies the activation type, uses false for sigmoid and true for hyperbolic tangent.

BasicNetwork network = (BasicNetwork)encog.Find(

Constant.TRAINED_NETWORK_NAME);

if (network == null)

network = EncogUtility.SimpleFeedForward(

norm.GetNetworkInputLayerSize(),

Constant.HIDDEN_COUNT,

0,

norm.GetNetworkOutputLayerSize(),

false);

Now that the network has been created, it must be trained. A utility method is also used to train the network. In Chapter 5 we trained by looping and repeatedly calling the iteration method of the trainer. You can also use the TrainConsole or TrainDialog methods of EncogUtility to perform this training.

if (useGUI)

{

EncogUtility.TrainDialog(network, trainingSet);

}

else

{

EncogUtility.TrainConsole(network,

trainingSet, Constant.TRAINING_MINUTES);

}
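Training for a fixed amount of time amounts to the familiar iteration loop with a clock added. The sketch below is a generic illustration, not the code inside EncogUtility. It assumes the caller supplies a delegate that runs one training iteration and returns the current error; TimedTrainingSketch, TrainFor and targetError are hypothetical names.

```csharp
using System;
using System.Diagnostics;

public static class TimedTrainingSketch
{
    // Call `iterate` (one training iteration that returns the current error)
    // until the error reaches `targetError` or the time budget runs out.
    // Returns the number of iterations performed.
    public static int TrainFor(
        Func<double> iterate, TimeSpan budget, double targetError)
    {
        var clock = Stopwatch.StartNew();
        int iterations = 0;
        double error;
        do
        {
            error = iterate();
            iterations++;
        } while (error > targetError && clock.Elapsed < budget);
        return iterations;
    }
}
```

At least one iteration always runs, and the loop stops as soon as either condition is met.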

You can train using console mode only by calling the TrainConsole method. You can also use the GUI to train. Using the GUI displays the dialog seen in Figure 6.1.

app.WriteLine("Training complete, saving network...");

encog.Add(Constant.TRAINED_NETWORK_NAME, network);

Once the network has been trained, it is saved to an Encog persistence file.

Evaluating the Network

Now that the network has been trained, it can be evaluated. The code used to evaluate the neural network is shown in Listing 6.4.

Listing 6.4: The Forest Cover Network Evaluation

using System;

using System.Collections.Generic;

using System.Linq;

using System.Text;

using Encog.Neural.Networks;

using Encog.Normalize;

using Encog.Normalize.Output.Nominal;

using Encog.Neural.Data;

using System.IO;

using Encog.Persist;

using Encog.Util.CSV;

using Encog.Util;

namespace Encog.Examples.Forest

{

public class Evaluate

{

private IExampleInterface app;

private int[] treeCount = new int[10];

private int[] treeCorrect = new int[10];

public Evaluate(IExampleInterface app)

{

this.app = app;

}

public void KeepScore(int actual, int ideal)

{

treeCount[ideal]++;

if (actual == ideal)

treeCorrect[ideal]++;

}

public BasicNetwork LoadNetwork()

{

String file = Constant.TRAINED_NETWORK_FILE;

if (!File.Exists(file))

{

app.WriteLine("Can't read file: " + file);

return null;

}

EncogPersistedCollection encog =

new EncogPersistedCollection(file, FileMode.Open);

BasicNetwork network = (BasicNetwork)encog.Find(

Constant.TRAINED_NETWORK_NAME);

if (network == null)

{

app.WriteLine(

"Can't find network resource: " + Constant.TRAINED_NETWORK_NAME);

return null;

}

return network;

}


public DataNormalization LoadNormalization()

{

String file = Constant.TRAINED_NETWORK_FILE;

EncogPersistedCollection encog =

new EncogPersistedCollection(file, FileMode.Open);

DataNormalization norm =

(DataNormalization)encog.Find(

Constant.NORMALIZATION_NAME);

if (norm == null)

{

app.WriteLine(

"Can't find normalization resource: "

+ Constant.NORMALIZATION_NAME);

return null;

}

return norm;

}

public int DetermineTreeType(

OutputEquilateral eqField, INeuralData output)

{

int result = 0;

if (eqField != null)

{

result = eqField.Equilateral.Decode(output.Data);

}

else

{

double maxOutput = double.NegativeInfinity;

result = -1;

for (int i = 0; i < output.Count; i++)

{

if (output.Data[i] > maxOutput)

{

maxOutput = output.Data[i];

result = i;

}

}

}

return result;

}

public void PerformEvaluate()

{

BasicNetwork network = LoadNetwork();

DataNormalization norm = LoadNormalization();

ReadCSV csv = new ReadCSV(

Constant.EVALUATE_FILE.ToString(), false, ',');

double[] input = new double[norm.InputFields.Count];

OutputEquilateral eqField = (OutputEquilateral)

norm.FindOutputField(typeof(OutputEquilateral), 0);

int correct = 0;

int total = 0;

while (csv.Next())

{

total++;

for (int i = 0; i < input.Length; i++)

{

input[i] = csv.GetDouble(i);

}

INeuralData inputData =

norm.BuildForNetworkInput(input);

INeuralData output = network.Compute(inputData);

int coverTypeActual = DetermineTreeType(

eqField, output);

int coverTypeIdeal = (int)csv.GetDouble(54) - 1;

KeepScore(coverTypeActual, coverTypeIdeal);

if (coverTypeActual == coverTypeIdeal)

{

correct++;

}

}

app.WriteLine("Total cases:" + total);

app.WriteLine("Correct cases:" + correct);

double percent = (double)correct / (double)total;

app.WriteLine("Correct percent:"

+ Format.FormatPercentWhole(percent));

for (int i = 0; i < 7; i++)

{

double p = ((double)this.treeCorrect[i]

/ (double)this.treeCount[i]);

app.WriteLine("Tree Type #"

+ i

+ " - Correct/total: "

+ this.treeCorrect[i]

+ "/" + treeCount[i] + "("

+ Format.FormatPercentWhole(p) + ")");

}

}

}

}

The PerformEvaluate method is called to evaluate the neural network. It begins by loading the network and the normalization objects. Both of these objects are read from the Encog persistence file.

public void PerformEvaluate()

{

BasicNetwork network = LoadNetwork();

DataNormalization norm = LoadNormalization();

We will now open the evaluation file for use with a ReadCSV object. The contents of the evaluation file will be read and the network evaluated. For evaluation we will only pass over the contents of the file once, so it is okay to read a CSV file without converting it to binary, as we did earlier.

ReadCSV csv = new ReadCSV(

Constant.EVALUATE_FILE.ToString(), false, ',');

Next, we allocate a double array to hold the input fields that will be presented to the neural network.

double[] input = new double[norm.InputFields.Count];

We also obtain the OutputEquilateral object from the normalization object. We will use the OutputEquilateral to interpret the results of the neural network and see which actual tree type the neural network is predicting. If one-of-n normalization was used instead, no such field is found and eqField is null.

OutputEquilateral eqField = (OutputEquilateral)

norm.FindOutputField(typeof(OutputEquilateral), 0);

We now loop over every row in the CSV file. We will count the number of correct records, as well as the total number of records.

int correct = 0;

int total = 0;

while (csv.Next())

{

total++;

The input for the neural network is read right from the CSV file and loaded into an array to be normalized.

for (int i = 0; i < input.Length; i++)

{

input[i] = csv.GetDouble(i);

}

Next, the normalization object is used to normalize the input data.

INeuralData inputData =

norm.BuildForNetworkInput(input);

The data is presented to the neural network. The output will tell us what tree type the network predicted for the input data.

INeuralData output = network.Compute(inputData);

The neural network was trained with the tree type converted to an equilateral normalized array. As a result, the output from the neural network is normalized and must be converted to an actual tree number. The DetermineTreeType method takes the equilateral normalized output from the neural network and converts it to a tree type number. We also get the ideal tree type (that should have been predicted) from the CSV file. The cover types in the file are numbered from one, so we subtract one to obtain a zero-based index.

int coverTypeActual = DetermineTreeType(eqField, output);

int coverTypeIdeal = (int)csv.GetDouble(54) - 1;

The KeepScore method is a very simple method that keeps track of correct guesses by the neural network on a tree type basis. This allows us to see that the neural network is better at predicting some tree types than others.

KeepScore(coverTypeActual, coverTypeIdeal);

We also keep track of the overall correct count.

if (coverTypeActual == coverTypeIdeal)

{

correct++;

}

}

Finally, we display the statistics of how well the evaluation went.

app.WriteLine("Total cases:" + total);

app.WriteLine("Correct cases:" + correct);

double percent = (double)correct / (double)total;

app.WriteLine("Correct percent:"

+ Format.FormatPercentWhole(percent));

for (int i = 0; i < 7; i++)

{

double p = ((double)this.treeCorrect[i]

/ (double)this.treeCount[i]);

app.WriteLine("Tree Type #"

+ i

+ " - Correct/total: "

+ this.treeCorrect[i]

+ "/" + treeCount[i] + "("

+ Format.FormatPercentWhole(p) + ")");

}

}

We will now examine how the DetermineTreeType method converts an equilateral output from the neural network into an actual tree number. The DetermineTreeType method is passed both the equilateral object, from the normalization object, as well as the neural network output.

public int DetermineTreeType(

OutputEquilateral eqField, INeuralData output)

{

We are going to loop over all of the equilateral encodings for each of the seven tree types, held in the equilateral object. Whichever one has the lowest Euclidean distance to the neural network output is considered to be the tree type that the neural network predicted.

int result = 0;

First, we see if equilateral normalization was used. If it was, simply use the Decode method. This method will determine which tree type had the lowest Euclidean distance.

if (eqField != null)

{

result = eqField.Equilateral.Decode(output.Data);

}

else

{

For one-of-n encoding, we loop over all of the output neurons and see which has the highest activation. The neuron with the highest activation corresponds to the tree type that was predicted.

double maxOutput = double.NegativeInfinity;

result = -1;

for (int i = 0; i < output.Count; i++)

{

if (output.Data[i] > maxOutput)

{

maxOutput = output.Data[i];

result = i;

}

}

}

Finally, return the result.

return result;
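The nearest-code idea behind equilateral decoding can be sketched in plain C#. This is an illustration under stated assumptions, not Encog's Decode method: the code table is supplied by the caller, and NearestCodeSketch and Decode are hypothetical names.

```csharp
using System;

public static class NearestCodeSketch
{
    // Return the index of the row in `codes` that is closest to `output`.
    // Squared distances are compared, which gives the same ordering as
    // Euclidean distances without needing a square root.
    public static int Decode(double[][] codes, double[] output)
    {
        int best = -1;
        double bestDistance = double.PositiveInfinity;
        for (int c = 0; c < codes.Length; c++)
        {
            double sum = 0.0;
            for (int j = 0; j < output.Length; j++)
            {
                double d = codes[c][j] - output[j];
                sum += d * d;
            }
            if (sum < bestDistance)
            {
                bestDistance = sum;
                best = c;
            }
        }
        return best;
    }
}
```

As a small made-up example, take three classes coded as the corners of a triangle: (0, 0), (1, 0) and (0.5, 0.866). A network output of (0.9, 0.1) lies closest to the second corner, so Decode returns 1.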

The forest example is a very good starting point for creating an Encog-based application that classifies input data into specific groups. In this case, the tree types were these groups. Any such application will have to go through the process of generating data, training and evaluation.

Summary

In this chapter you saw how to normalize data for a neural network. Neural networks can very rarely handle data in a raw form. To normalize the data you restrict and map the data into specific ranges. You also convert nominal data into arrays of values using either equilateral or one-of-n encoding.

Encog provides the DataNormalization class to make normalization easier. This class supports a variety of normalization types. This class makes use of the IInputField, IOutputField, ISegregator, and INormalizationStorage interfaces. Using implementations of these four interfaces, many different normalization techniques can be achieved.

IInputField-derived objects are added to the DataNormalization class to define where raw input data should come from. CSV files are a very common choice, as they are a convenient means of storing many rows of numeric data. There are also input fields defined for arrays and INeuralDataSet objects.

IOutputField-derived objects are added to the DataNormalization class to define how the input fields should be normalized. Encog supports many different normalization types. There are normalization types for multiplicative, z-axis, one-of-n, equilateral and range-mapped normalization techniques.

ISegregator-derived objects are added to the DataNormalization class to define which rows should not be processed. There are many different segregators available. You can choose to take a sample for training or evaluation. You can also exclude rows based on the values of their fields. Rows can be removed to maintain balance, and prevent one row type from saturating the network.

A single INormalizationStorage-derived object tells the DataNormalization object what to do with the normalized data. Data can be written to a variety of targets, such as arrays, CSV files and INeuralDataSet objects.

This chapter demonstrated an example that attempts to predict the type of tree that may cover a wilderness area. The example used real-world data provided by the United States Forest Service. It showed how to normalize this raw data and predict forest cover, and it demonstrated many common techniques in neural network programming, such as normalization, training and evaluation.

This chapter also introduced Encog persistence files. These files can contain different types of Encog objects. These files are very useful because it can take days to properly train a neural network. The next chapter will expand on Encog persistence files and show how they are used.

Questions for Review

  1. Why is it necessary to normalize data for neural networks?
  2. Describe the purpose of each of the following Encog normalization object types: input fields, output fields, segregators and output targets.
  3. When is it necessary for Encog to do a “two-pass” normalization process?
  4. What is the difference between z-axis normalization and multiplicative normalization? When would you use each?
  5. What is the difference between one-of-n normalization and equilateral normalization?
  6. Suppose a gender field has the values male and female. Would you use one-of-n or equilateral normalization? Why?
  7. What are balance segregators used for?
  8. Training data read directly from CSV files can be slow, as the data must be reparsed for each iteration. How can an Encog application overcome this?
  9. Describe what happens in each of these phases of a typical neural network program: generation, training, and evaluation.
  10. What advantage does a two-dimensional array have over a one-dimensional array for output storage?

Terms

CSV

EG File

Equilateral Normalization

Euclidean Distance

Evaluation

Field Group

Input Field

Multiplicative Normalization

Nominal Value

Normalization

Normalization Target

Numeric Value

one-of-n Normalization

Output Field

Segregator

Training

Vector Length

Z-Axis Normalization


Copyright 2005 - 2012 by Heaton Research, Inc. Heaton Research™ and Encog™ are trademarks of Heaton Research.