Workbench Regression Example

Miles Per Gallon

This example shows how to use the Encog Workbench to perform regression. It also uses the Encog Analyst. The same example is available as the Command Line Regression Example. Regression is the process by which a Machine Learning Method learns to produce numeric output from input data. With Supervised Learning, the Machine Learning Method is given a Training Set and learns to transform each Training Set element into the appropriate output value.

This example will use the Encog Analyst to learn to predict the miles per gallon for an automobile. It makes use of the MPG Data Set, which provides several attributes, including the miles per gallon (MPG), for a number of US cars from the 1970s and early 1980s.

Steps for Running the Example

To walk through this example, follow these steps. This example has been updated for Encog 3.0.

Step 1: Download the MPG Data

First, start up the Encog Workbench and create a new project. Name it anything you like, such as MPG Example. This will create an empty folder to hold your data. You now need to obtain your data. The Encog Workbench contains a number of built-in data sets; however, it does not contain the MPG Data Set. This data set can be downloaded from the UCI Machine Learning Repository.

Choose Tools:Generate Training Data from the Encog Workbench menu bar. Select the Download from URL option, and name the file something such as mpg.csv. This should create a CSV File. Specify the following URL.

http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
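
If you prefer to fetch the file outside the workbench, the download can be sketched in a few lines of Python. This is only an illustration: the function name download_mpg is hypothetical, the destination name mpg.csv simply mirrors the name suggested above, and the sketch does not check whether the URL is still being served.

```python
import urllib.request

# URL quoted in this step; whether it is still served is not verified here.
MPG_URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"


def download_mpg(dest="mpg.csv", url=MPG_URL):
    """Fetch the raw data file and save it under the name used in this example.

    download_mpg is a hypothetical helper, not part of Encog.
    """
    urllib.request.urlretrieve(url, dest)
    return dest
```

Calling download_mpg() from the project folder would leave a file named mpg.csv next to your other project files.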

You can see a small sample of this data here.

18.0   8   307.0      130.0      3504.      12.0   70  1	"chevrolet chevelle malibu"
15.0   8   350.0      165.0      3693.      11.5   70  1	"buick skylark 320"
18.0   8   318.0      150.0      3436.      11.0   70  1	"plymouth satellite"
16.0   8   304.0      150.0      3433.      12.0   70  1	"amc rebel sst"
17.0   8   302.0      140.0      3449.      10.5   70  1	"ford torino"
15.0   8   429.0      198.0      4341.      10.0   70  1	"ford galaxie 500"
14.0   8   454.0      220.0      4354.       9.0   70  1	"chevrolet impala"

As you can see there are several attributes provided. The attributes, listed in order, are:

  • Column 1. mpg: continuous
  • Column 2. cylinders: multi-valued discrete
  • Column 3. displacement: continuous
  • Column 4. horsepower: continuous
  • Column 5. weight: continuous
  • Column 6. acceleration: continuous
  • Column 7. model year: multi-valued discrete
  • Column 8. origin: multi-valued discrete
  • Column 9. car name: string (unique for each instance)
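
To make the column layout concrete, here is a minimal Python sketch that splits one line of the sample into named fields. The short column names are our own choice, and parse_row is a hypothetical helper, not part of Encog. Horsepower is deliberately left as text, on the assumption that the file may mark missing values with a non-numeric placeholder.

```python
import shlex

# Short names for the nine attributes listed above (our own naming).
COLUMNS = ["mpg", "cylinders", "displacement", "horsepower", "weight",
           "acceleration", "year", "origin", "name"]


def parse_row(line):
    """Split one whitespace-delimited line into a dict keyed by column name.

    shlex.split honours the double quotes around the car name, so a name
    containing spaces stays a single field.
    """
    fields = shlex.split(line)
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    # Convert the continuous columns to floats.
    for key in ("mpg", "displacement", "weight", "acceleration"):
        row[key] = float(row[key])
    # Convert the discrete columns to integers.
    for key in ("cylinders", "year", "origin"):
        row[key] = int(row[key])
    # Horsepower stays a string (assumption: it may hold a missing-value marker).
    return row


sample = '15.0   8   429.0      198.0      4341.      10.0   70  1\t"ford galaxie 500"'
row = parse_row(sample)
print(row["mpg"], row["cylinders"], row["name"])
```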

We would like to create a Machine Learning Method that will learn to predict the MPG when provided with some of the other attributes. We will attempt to predict column 1 using columns 2, 3, 4, 5 and 6. Columns 7, 8 and 9 will not be used for prediction. We will divide the data into a training data set and an evaluation data set. The larger training data set will be used for the Machine Learning Method to learn from. The evaluation data set will be used to test the Machine Learning Method on data that it was not trained with. It is also possible to use cross validation with a single data set.
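
The idea of shuffling the rows and dividing them into a larger training set and a smaller evaluation set can be sketched as follows. This is not how Encog implements it; segregate is a hypothetical helper, the seed is arbitrary, and the 75/25 ratio is the one that appears in the EGA File generated in Step 2.

```python
import random


def segregate(rows, train_fraction=0.75, seed=42):
    """Shuffle a copy of the rows, then cut it into training and evaluation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for 100 parsed data rows
train, evaluation = segregate(rows)
print(len(train), len(evaluation))
```

Every row ends up in exactly one of the two sets, so the evaluation data is never seen during training.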

Step 2: Use the Analyst Wizard

Now that you have input data in the workbench, you should use the Encog Analyst Wizard to create an Encog Analyst File. The Encog Analyst File (*.ega) is a script file that tells Encog Analyst how to process your data file. To generate an EGA File, right-click mpg.csv and choose Analyst Wizard.... This will show a screen similar to the following.

Analyst-wizard-1.png

You must change several values; the image above is generic. Set the values as follows.

  • Source CSV File(*.csv) - Leave as is. The image above shows iris, but it will be whatever file you right-clicked to bring up the wizard.
  • File Format - As you can see from the sample data above, this CSV file is space (or tab) delimited. Choose Decimal Point (USA/English) & Space Separator.
  • Machine Learning - Set this to Feedforward or Support Vector Machine. For this example, I will assume you used Support Vector Machine.
  • Goal - Set this to Regression.
  • Target Field - Set this to field:1. There are no headers, so we need to identify the field by number. This is the MPG field.
  • CSV Headers - Set this field to false (unchecked). As you can see above, there are no headers. Most UCI Machine Learning Repository files have no headers.
  • Normalization Range - Set to -1 to 1.
  • Missing Values - Set to DiscardMissing.
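
Two of these settings are easy to illustrate. The Python sketch below shows range normalization to -1 to 1 and a simple way of discarding rows with missing values. It is an illustration under assumptions rather than Encog's implementation: the function names are hypothetical, the "?" marker for a missing value is an assumption, and the MPG minimum and maximum (9 and 46.6) are the values reported in the statistics of the EGA File generated in this step.

```python
def normalize(value, data_low, data_high, norm_low=-1.0, norm_high=1.0):
    """Map value from [data_low, data_high] onto [norm_low, norm_high]."""
    span = data_high - data_low
    return (value - data_low) / span * (norm_high - norm_low) + norm_low


def denormalize(value, data_low, data_high, norm_low=-1.0, norm_high=1.0):
    """Inverse of normalize: map a normalized value back to the data range."""
    return (value - norm_low) / (norm_high - norm_low) * (data_high - data_low) + data_low


def discard_missing(rows, marker="?"):
    """Keep only rows in which no field equals the missing-value marker (assumed '?')."""
    return [row for row in rows if marker not in row]


MPG_LOW, MPG_HIGH = 9.0, 46.6  # from the generated file's statistics

scaled = normalize(18.0, MPG_LOW, MPG_HIGH)
restored = denormalize(scaled, MPG_LOW, MPG_HIGH)
kept = discard_missing([["18.0", "130.0"], ["25.0", "?"]])
print(scaled, restored, kept)
```

The output of a trained method is in the normalized range, so something like denormalize is needed to turn it back into miles per gallon.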

Encog Analyst will now generate an EGA File with the same base name as your data file. You should now see two files in the workbench project area: mpg.csv and mpg.ega. Double-click the mpg.ega file and you will see a window similar to the following. This is a generic image of the EGA File Editor, so the source code will be different.

Analyst-ega-1.png

This shows you the EGA File that was generated by analyzing the MPG data. You can see the complete file here. The file text below is not generic; yours should be very similar.

[HEADER]
[HEADER:DATASOURCE]
rawFile=FILE_RAW
sourceFile=
sourceFormat=decpnt|space
sourceHeaders=f
[SETUP]
[SETUP:CONFIG]
allowedClasses=integer,string
csvFormat=decpnt|comma
inputHeaders=f
maxClassCount=50
[SETUP:FILENAMES]
FILE_RANDOMIZE=mpg_random.csv
FILE_EVAL_NORM=mpg_eval_norm.csv
FILE_BALANCE=mpg_balance.csv
FILE_EVAL=mpg_eval.csv
FILE_RAW=mpg.csv
FILE_ML=mpg_train.eg
FILE_OUTPUT=mpg_output.csv
FILE_CLUSTER=mpg_cluster.csv
FILE_NORMALIZE=mpg_norm.csv
FILE_TRAINSET=mpg_train.egb
FILE_TRAIN=mpg_train.csv
[DATA]
[DATA:CONFIG]
goal=classification
[DATA:STATS]
"name","isclass","iscomplete","isint","isreal","amax","amin","mean","sdev"
"field:1",0,1,0,1,46.6,9,23.5145728643,7.8061590613
"field:2",1,1,1,1,8,3,5.4547738693,1.6988659605
"field:3",0,1,0,1,455,68,193.425879397,104.1387635271
"field:4",0,1,0,0,0,0,11.4572864322,0
"field:5",0,1,0,1,5140,1613,2970.4246231156,845.7772335198
"field:6",0,1,0,1,24.8,8,15.5680904523,2.7542223176
"field:7",1,1,1,1,82,70,76.0100502513,3.6929784656
"field:8",1,1,1,1,3,1,1.5728643216,0.8010466374
"field:9",0,1,0,0,0,0,0,0
[DATA:CLASSES]
"field","code","name"
"field:2","3","3",4
"field:2","4","4",204
"field:2","5","5",3
"field:2","6","6",84
"field:2","8","8",103
"field:7","70","70",29
"field:7","71","71",28
"field:7","72","72",28
"field:7","73","73",40
"field:7","74","74",27
"field:7","75","75",30
"field:7","76","76",34
"field:7","77","77",28
"field:7","78","78",36
"field:7","79","79",29
"field:7","80","80",29
"field:7","81","81",29
"field:7","82","82",31
"field:8","1","1",249
"field:8","2","2",70
"field:8","3","3",79
[NORMALIZE]
[NORMALIZE:CONFIG]
sourceFile=FILE_TRAIN
targetFile=FILE_NORMALIZE
[NORMALIZE:RANGE]
"name","io","timeSlice","action","high","low"
"field:1","output",0,"range",1,-1
"field:2","input",0,"equilateral",1,-1
"field:3","input",0,"range",1,-1
"field:4","input",0,"ignore",0,0
"field:5","input",0,"range",1,-1
"field:6","input",0,"range",1,-1
"field:7","input",0,"equilateral",1,-1
"field:8","input",0,"equilateral",1,-1
"field:9","input",0,"ignore",0,0
[RANDOMIZE]
[RANDOMIZE:CONFIG]
sourceFile=FILE_RAW
targetFile=FILE_RANDOMIZE
[CLUSTER]
[CLUSTER:CONFIG]
clusters=0
sourceFile=FILE_EVAL
targetFile=FILE_CLUSTER
type=kmeans
[BALANCE]
[BALANCE:CONFIG]
balanceField=
countPer=
sourceFile=
targetFile=
[SEGREGATE]
[SEGREGATE:CONFIG]
sourceFile=FILE_RANDOMIZE
[SEGREGATE:FILES]
"file","percent"
"FILE_TRAIN",75
"FILE_EVAL",25
[GENERATE]
[GENERATE:CONFIG]
sourceFile=FILE_NORMALIZE
targetFile=FILE_TRAINSET
[ML]
[ML:CONFIG]
architecture=?:B->TANH->31:B->TANH->?
evalFile=FILE_EVAL
machineLearningFile=FILE_ML
outputFile=FILE_OUTPUT
trainingFile=FILE_TRAINSET
type=feedforward
[ML:TRAIN]
arguments=
cross=
targetError=0.01
type=rprop
[TASKS]
[TASKS:task-cluster]
cluster
[TASKS:task-create]
create
[TASKS:task-evaluate]
evaluate
[TASKS:task-evaluate-raw]
set ML.CONFIG.evalFile="FILE_EVAL_NORM"
set NORMALIZE.CONFIG.sourceFile="FILE_EVAL"
set NORMALIZE.CONFIG.targetFile="FILE_EVAL_NORM"
normalize
evaluate-raw
[TASKS:task-full]
randomize
segregate
normalize
generate
create
train
evaluate
[TASKS:task-generate]
randomize
segregate
normalize
generate
[TASKS:task-train]
train

Now that the wizard has generated the EGA File, we will make several changes to it in the next section. For more information on the format of this file, see the article on EGA Files and Encog Analyst.
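
The layout of the file is simple enough to read with a few lines of code: a bracketed name starts a section, and the lines that follow belong to it until the next bracketed name. The Python sketch below is a hypothetical reader written only to illustrate that layout; it is not Encog's own parser, and the function names are made up for this example.

```python
def read_ega_sections(text):
    """Group the lines of an EGA file by the [SECTION] header they follow."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]          # e.g. "SETUP:FILENAMES"
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


def read_properties(lines):
    """Turn key=value lines into a dict; lines without '=' are skipped."""
    props = {}
    for line in lines:
        key, sep, value = line.partition("=")
        if sep:
            props[key] = value
    return props


sample = """[SETUP:FILENAMES]
FILE_RAW=mpg.csv
FILE_TRAIN=mpg_train.csv
[TASKS:task-full]
randomize
segregate
"""

sections = read_ega_sections(sample)
filenames = read_properties(sections["SETUP:FILENAMES"])
print(filenames["FILE_RAW"], sections["TASKS:task-full"])
```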

Step 3: Modifications to the EGA File

Sometimes your EGA file will need no modifications. This is not the case with the MPG example. There are no headers, so we want to rename the fields. There are also fields we do not want to use. First, look at the field definitions.

"field:1",0,1,0,1,46.6,9,23.5145728643,7.8061590613
"field:2",1,1,1,1,8,3,5.4547738693,1.6988659605
"field:3",0,1,0,1,455,68,193.425879397,104.1387635271
"field:4",0,1,0,0,0,0,11.4572864322,0
"field:5",0,1,0,1,5140,1613,2970.4246231156,845.7772335198
"field:6",0,1,0,1,24.8,8,15.5680904523,2.7542223176
"field:7",1,1,1,1,82,70,76.0100502513,3.6929784656
"field:8",1,1,1,1,3,1,1.5728643216,0.8010466374
"field:9",0,1,0,0,0,0,0,0

This should be changed to:

"mpg",0,1,0,1,46.6,9,23.5145728643,7.8061590613
"cylinders",0,1,1,1,8,3,5.4547738693,1.6988659605
"displacement",0,1,0,1,455,68,193.425879397,104.1387635271
"horsepower",0,1,0,0,0,0,11.4572864322,0
"weight",0,1,0,1,5140,1613,2970.4246231156,845.7772335198
"acceleration",0,1,0,1,24.8,8,15.5680904523,2.7542223176
"year",0,1,1,1,82,70,76.0100502513,3.6929784656
"origin",0,1,1,1,3,1,1.5728643216,0.8010466374
"name",0,1,0,0,0,0,0,0

We are not going to use year, origin or name. However, do not delete them: they are still fields in the file, even though we will not be using them. We have renamed the fields, and we are treating none of them as classes. Encog wants to treat cylinders, year and origin as classes because each has only a few distinct values, so they look like classes. However, cylinders is really a number: more cylinders generally means more power. Year is not used anyway, so we do not need its class information. In general, if one year were much better than another for cars (think wine), then year should be a class. If cars become more efficient as the years progress, then year should be a number. I would tend to treat year as a number, because cars are progressively getting more efficient; it is not as if one year is a particularly good "vintage". As a result, your classes section is empty, as seen here.

[DATA:CLASSES]
[NORMALIZE]
[NORMALIZE:CONFIG]
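
To see why a column such as cylinders or year looks like a class, consider a rule of the following kind. This Python sketch is a guess at the sort of heuristic involved, not Encog's actual logic: looks_like_class is a hypothetical function, it only considers integer columns, and the limit of 50 simply echoes the maxClassCount setting in the generated file.

```python
def looks_like_class(values, max_class_count=50):
    """Return True if the column has few distinct values and all are integers.

    A simplified, hypothetical rule; the real analyst may decide differently.
    """
    distinct = set(values)
    if len(distinct) > max_class_count:
        return False

    def is_int(text):
        try:
            int(text)
            return True
        except ValueError:
            return False

    return all(is_int(v) for v in distinct)


cylinders = ["8", "8", "4", "6", "4", "8"]
mpg = ["18.0", "15.0", "16.0", "17.0"]
print(looks_like_class(cylinders), looks_like_class(mpg))
```

A rule like this cannot tell an ordered quantity from a set of labels, which is why the choice between class and number is left to you.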

The final version of the EGA File is shown here.

[HEADER]
[HEADER:DATASOURCE]
rawFile=FILE_RAW
sourceFile=
sourceFormat=decpnt|space
sourceHeaders=f
[SETUP]
[SETUP:CONFIG]
allowedClasses=integer,string
csvFormat=decpnt|comma
inputHeaders=f
maxClassCount=50
[SETUP:FILENAMES]
FILE_RANDOMIZE=mpg_random.csv
FILE_EVAL_NORM=mpg_eval_norm.csv
FILE_EVAL=mpg_eval.csv
FILE_RAW=mpg.csv
FILE_ML=mpg_train.eg
FILE_OUTPUT=mpg_output.csv
FILE_CLUSTER=mpg_cluster.csv
FILE_NORMALIZE=mpg_norm.csv
FILE_TRAINSET=mpg_train.egb
FILE_TRAIN=mpg_train.csv
[DATA]
[DATA:CONFIG]
goal=regression
[DATA:STATS]
"name","isclass","iscomplete","isint","isreal","amax","amin","mean","sdev"
"field:1",0,1,0,1,46.6,9,23.5145728643,7.8061590613
"field:2",1,1,1,1,8,3,5.4547738693,1.6988659605
"field:3",0,1,0,1,455,68,193.425879397,104.1387635271
"field:4",0,0,0,1,230,46,104.4693877551,38.4420327144
"field:5",0,1,0,1,5140,1613,2970.4246231156,845.7772335198
"field:6",0,1,0,1,24.8,8,15.5680904523,2.7542223176
"field:7",1,1,1,1,82,70,76.0100502513,3.6929784656
"field:8",1,1,1,1,3,1,1.5728643216,0.8010466374
"field:9",0,1,0,0,0,0,0,0
[DATA:CLASSES]
"field","code","name"
"field:2","3","3",4
"field:2","4","4",204
"field:2","5","5",3
"field:2","6","6",84
"field:2","8","8",103
"field:7","70","70",29
"field:7","71","71",28
"field:7","72","72",28
"field:7","73","73",40
"field:7","74","74",27
"field:7","75","75",30
"field:7","76","76",34
"field:7","77","77",28
"field:7","78","78",36
"field:7","79","79",29
"field:7","80","80",29
"field:7","81","81",29
"field:7","82","82",31
"field:8","1","1",249
"field:8","2","2",70
"field:8","3","3",79
[NORMALIZE]
[NORMALIZE:CONFIG]
missingValues=DiscardMissing
sourceFile=FILE_TRAIN
targetFile=FILE_NORMALIZE
[NORMALIZE:RANGE]
"name","io","timeSlice","action","high","low"
"field:1","output",0,"range",1,-1
"field:2","input",0,"equilateral",1,-1
"field:3","input",0,"range",1,-1
"field:4","input",0,"range",1,-1
"field:5","input",0,"range",1,-1
"field:6","input",0,"range",1,-1
"field:7","input",0,"equilateral",1,-1
"field:8","input",0,"equilateral",1,-1
"field:9","input",0,"ignore",0,0
[RANDOMIZE]
[RANDOMIZE:CONFIG]
sourceFile=FILE_RAW
targetFile=FILE_RANDOMIZE
[CLUSTER]
[CLUSTER:CONFIG]
clusters=2
sourceFile=FILE_EVAL
targetFile=FILE_CLUSTER
type=kmeans
[BALANCE]
[BALANCE:CONFIG]
balanceField=
countPer=
sourceFile=
targetFile=
[SEGREGATE]
[SEGREGATE:CONFIG]
sourceFile=FILE_RANDOMIZE
[SEGREGATE:FILES]
"file","percent"
"FILE_TRAIN",75
"FILE_EVAL",25
[GENERATE]
[GENERATE:CONFIG]
sourceFile=FILE_NORMALIZE
targetFile=FILE_TRAINSET
[ML]
[ML:CONFIG]
architecture=?->R(type=new,kernel=rbf)->?
evalFile=FILE_EVAL
machineLearningFile=FILE_ML
outputFile=FILE_OUTPUT
trainingFile=FILE_TRAINSET
type=svm
[ML:TRAIN]
arguments=
cross=
targetError=0.05
type=svm-search
[TASKS]
[TASKS:task-cluster]
cluster
[TASKS:task-create]
create
[TASKS:task-evaluate]
evaluate
[TASKS:task-evaluate-raw]
set ML.CONFIG.evalFile="FILE_EVAL_NORM"
set NORMALIZE.CONFIG.sourceFile="FILE_EVAL"
set NORMALIZE.CONFIG.targetFile="FILE_EVAL_NORM"
normalize
evaluate-raw
[TASKS:task-full]
randomize
segregate
normalize
generate
create
train
evaluate
[TASKS:task-generate]
randomize
segregate
normalize
generate
[TASKS:task-train]
train

Step 4: Visualizing Your Data

You will notice that the EGA File Editor, which was opened in the last step, has several buttons on the top. The Visualize button provides several ways to visualize your data. Clicking the Visualize button will provide you with a list of data visualizations provided by Encog Analyst.

Ranges Report

The first is the range report. The range report tells you what ranges your data columns were in. Encog Analyst determined this while analyzing the data file. All of these ranges are saved inside of the EGA File. You can see some of the data produced in a range report here:

Analyst-ranges-1.png

The above is a generic range report from Encog Analyst and shows iris data. The MPG data would be presented in a similar form, although the values will differ. There is additional information if you scroll down. This information is necessary for the analyst to normalize your data. Most Machine Learning Methods require some form of normalization to work with data.
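
The figures in a range report are ordinary summary statistics. The Python sketch below computes the maximum, minimum, mean and standard deviation for two columns, using the seven sample rows shown in Step 1. It is only an illustration: range_report is a hypothetical helper, and the use of the population standard deviation is an assumption, since this example does not say which form Encog Analyst reports.

```python
import statistics


def range_report(columns):
    """Compute max, min, mean and standard deviation for each named column."""
    report = {}
    for name, values in columns.items():
        report[name] = {
            "max": max(values),
            "min": min(values),
            "mean": statistics.mean(values),
            "sdev": statistics.pstdev(values),  # population form (assumption)
        }
    return report


# MPG and weight from the seven sample rows shown in Step 1.
columns = {
    "mpg": [18.0, 15.0, 18.0, 16.0, 17.0, 15.0, 14.0],
    "weight": [3504.0, 3693.0, 3436.0, 3433.0, 3449.0, 4341.0, 4354.0],
}

report = range_report(columns)
for name, stats in report.items():
    print(name, stats)
```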

Scatter Plot

Miles per Gallon Scatter Plot

A scatter plot can also show some interesting information about your data. You can easily see whether your data forms clusters. Choose Scatter Plot from the dialog that appears when you click the Visualize button in the EGA File Editor. You will be prompted for the attributes you wish to plot. Place a check in all check boxes. This will produce a multivariate scatter plot, as seen here. This multivariate scatter plot shows each pairing of the MPG attributes, which allows you to see how the pairs relate to each other. Ideally you will see large clusters of similarly colored dots. If you do not, your data is either very noisy or is not expressed in a way that will be easy for a Machine Learning Method to learn.

Step 5: Execute the Analyst Script

Execute the Encog Analyst Script

Now that the EGA File has been created, you can execute it. Click the Execute button in the EGA File Editor that was opened in Step 2. This takes the data through seven steps. There may be more or fewer steps for other Encog Analyst projects, depending on the options chosen. The entire execution should take under a minute on most computers.

This process will also create a number of files. The complete list of files in this project is:

  • mpg.csv - The raw data.
  • mpg.ega - The EGA File. This is the Encog Analyst script.
  • mpg_eval.csv - The evaluation data.
  • mpg_norm.csv - The normalized version of mpg_train.csv.
  • mpg_output.csv - The output from running mpg_eval.csv.
  • mpg_random.csv - The randomized version of mpg.csv.
  • mpg_train.csv - The training data.
  • mpg_train.eg - The Machine Learning Method that was trained.
  • mpg_train.egb - The binary training data, created from mpg_norm.csv.

Step 6: Examine the Output

To see how well the newly trained Machine Learning Method performed, examine mpg_output.csv. You can see part of this file here.

"mpg","cylinders","displacement","horsepower","weight","acceleration","year","origin","name","Output:mpg"
20.0,4,130.0,102.0,3150.,15.7,76,2,volvo 245,27.0766085659
28.4,4,151.0,90.00,2670.,16.0,79,1,buick skylark limited,27.011264803
32.8,4,78.00,52.00,1985.,19.4,78,3,mazda glc deluxe,31.8777234817
19.0,6,232.0,100.0,2901.,16.0,74,1,amc hornet,19.4085398701
26.0,4,79.00,67.00,1963.,15.5,74,2,volkswagen dasher,32.4327434887
26.0,4,122.0,80.00,2451.,16.5,74,1,ford pinto,29.5519134777

As you can see, the learning method's output (far right) tries to match the expected output (first column). For most cars it is fairly close; for others it is not. This is data that the Machine Learning Method was not trained with, so we are seeing how well it performs on new data.
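
Rather than judging the fit by eye, you can put a number on it. The Python sketch below computes the mean absolute error and the root mean square error between the mpg column and the Output:mpg column, using the six rows shown above. The score function is a hypothetical helper; for the full evaluation you would read mpg_output.csv instead of the embedded sample.

```python
import csv
import io
import math

# The six sample rows shown above, with the header from mpg_output.csv.
sample = '''"mpg","cylinders","displacement","horsepower","weight","acceleration","year","origin","name","Output:mpg"
20.0,4,130.0,102.0,3150.,15.7,76,2,volvo 245,27.0766085659
28.4,4,151.0,90.00,2670.,16.0,79,1,buick skylark limited,27.011264803
32.8,4,78.00,52.00,1985.,19.4,78,3,mazda glc deluxe,31.8777234817
19.0,6,232.0,100.0,2901.,16.0,74,1,amc hornet,19.4085398701
26.0,4,79.00,67.00,1963.,15.5,74,2,volkswagen dasher,32.4327434887
26.0,4,122.0,80.00,2451.,16.5,74,1,ford pinto,29.5519134777
'''


def score(csv_text, actual="mpg", predicted="Output:mpg"):
    """Return (mean absolute error, root mean square error) for the two columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    errors = [float(row[predicted]) - float(row[actual]) for row in reader]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


mae, rmse = score(sample)
print(f"MAE  = {mae:.2f} mpg")
print(f"RMSE = {rmse:.2f} mpg")
```

Six rows are far too few to judge the model; the same calculation over the whole evaluation file gives a more meaningful figure.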

Understanding the Example

This is an example of regression. The output from the Machine Learning Method is a number: the expected miles per gallon for a car. In this example we used a Support Vector Machine. From a purely "black box" standpoint, a Support Vector Machine is very similar to a Neural Network: both accept input data and produce output data. For regression, the input and output of a Support Vector Machine are identical to those of a Neural Network. For classification, the output of a Support Vector Machine is slightly different. Encog Analyst hides these differences.

External Links

The completed example can be downloaded here.
