Command Line Regression Example
This example shows how to use the Encog Command Line Utility, together with the Encog Analyst, to perform regression. Regression is the process where a Machine Learning Method learns to produce numeric output from input data. Using Supervised Learning, the Machine Learning Method is provided with a Training Set and learns to transform each Training Set element into the appropriate output value.
This example will use the Encog Analyst to learn to predict the miles per gallon for an automobile. This example makes use of the MPG Data Set. This data set provides several attributes, including the miles per gallon (MPG), for a number of US cars from the 1970s and early 1980s.
Steps for Running the Example
To walk through this example follow these steps. This example has been updated to Encog 3.0.
Step 1: Download the MPG Data
First, create an empty directory to hold the data for this example. Name it anything you like, such as MPG Example. You now need to obtain your data. This data set can be downloaded from the UCI Machine Learning Repository. Save it in that directory as mpg.csv, the file name used by the commands later in this example.
http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
You can see a small sample of this data here.
18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"
15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"
18.0 8 318.0 150.0 3436. 11.0 70 1 "plymouth satellite"
16.0 8 304.0 150.0 3433. 12.0 70 1 "amc rebel sst"
17.0 8 302.0 140.0 3449. 10.5 70 1 "ford torino"
15.0 8 429.0 198.0 4341. 10.0 70 1 "ford galaxie 500"
14.0 8 454.0 220.0 4354. 9.0 70 1 "chevrolet impala"
As you can see there are several attributes provided. The attributes, listed in order, are:
- Column 1. mpg: continuous
- Column 2. cylinders: multi-valued discrete
- Column 3. displacement: continuous
- Column 4. horsepower: continuous
- Column 5. weight: continuous
- Column 6. acceleration: continuous
- Column 7. model year: multi-valued discrete
- Column 8. origin: multi-valued discrete
- Column 9. car name: string (unique for each instance)
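The column layout above can be read with a few lines of Python. This is only a sketch and not part of Encog: the function name parse_line and the dictionary keys are hypothetical, and treating "?" as a missing value is an assumption about how the raw file marks gaps.

```python
import shlex

# Column names taken from the attribute list above (the keys are illustrative).
COLUMNS = ["mpg", "cylinders", "displacement", "horsepower",
           "weight", "acceleration", "model_year", "origin", "car_name"]


def parse_line(line):
    """Split one whitespace-delimited record; shlex keeps the quoted name whole."""
    values = shlex.split(line)
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    record = dict(zip(COLUMNS, values))
    # Convert the numeric columns; the car name stays a string.
    for key in COLUMNS[:-1]:
        # Assumption: a missing value appears as "?" in the raw file.
        record[key] = None if record[key] == "?" else float(record[key])
    return record


sample = '18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"'
row = parse_line(sample)
print(row["mpg"], row["car_name"])
```

Splitting on plain whitespace would break the car name into several fields, which is why the sketch uses shlex to respect the quotes.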
We would like to create a Machine Learning Method that will learn to predict the MPG, when provided some of the other attributes. We will attempt to predict column 1, using columns 2,3,4,5 and 6. Columns 7, 8 and 9 are not useful for prediction. We will divide this training set into a training data set and an evaluation data set. The larger training data set will be used for the Machine Learning Method to learn from. The evaluation data set will be used to test the Machine Learning Method on data that it was not trained with. It is also possible to use cross validation, and use a single data set.
Step 2: Use the Analyst Wizard
Now that you have the input data, use the Encog Analyst Wizard to create an Encog Analyst File. The Encog Analyst File (*.ega) is a script file that tells Encog Analyst how to process your data file. To generate an EGA File, run the command below.
D:\test>EncogCmd wizard mpg.csv
Encog 3.0.0(32-bit) Command Line Utility
Copyright 2011 by Heaton Research, Inc.
Released under the Apache License
Executing command: wizard
Enter value for [headers] (default=True): f
Enter value for [format] (default=decpnt|comma):
Enter value for [goal] (default=c): r
Enter value for [targetField] (default=): field:1
Enter value for [method] (default=ff): svm
Enter value for [range] (default=-1t1):
Enter value for [missing] (default=DiscardMissing):
Enter value for [lagWindow] (default=0):
Enter value for [leadWindow] (default=0):
Enter value for [includeTarget] (default=False):
Enter value for [normalize] (default=True):
Enter value for [randomize] (default=True):
Enter value for [segregate] (default=True):
Enter value for [balance] (default=False):
Enter value for [cluster] (default=False):
Analyzing data
Saving analyst file
Done. Runtime was 00:00:00 (748ms).
D:\test>
Make sure you enter all settings from above. These values are explained here.
- Source CSV File (*.csv) - This is the file named on the command line, mpg.csv in this example.
- File Format - As you can see from the sample data above, this file is space (or tab) delimited. Choose decimal point (USA/English) with a space separator, which appears as decpnt|space in the sourceFormat setting of the generated EGA File.
- Machine Learning - Set this to Feedforward (ff) or Support Vector Machine (svm). This example assumes you used Support Vector Machine (SVM).
- Goal - Set this to Regression (r).
- Target Field - Set this to field:1. There are no headers, so we need to identify the field by number. This is the MPG field.
- CSV Headers - Set this field to false. As you can see above, there are no headers. Most UCI Machine Learning Repository files have no headers.
- Normalization Range - Set to -1 to 1.
- Missing Values - Set to DiscardMissing.
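To make the normalization range concrete, the sketch below maps a value linearly into -1 to 1 and back again. It illustrates ordinary range normalization and is not taken from Encog's source; the function names are hypothetical, and the minimum and maximum of 9 and 46.6 are the MPG values the analyst reports for this data set.

```python
def normalize(x, data_min, data_max, low=-1.0, high=1.0):
    """Map x from [data_min, data_max] linearly onto [low, high]."""
    return (x - data_min) / (data_max - data_min) * (high - low) + low


def denormalize(y, data_min, data_max, low=-1.0, high=1.0):
    """Inverse of normalize: map y from [low, high] back to the data range."""
    return (y - low) / (high - low) * (data_max - data_min) + data_min


# Observed MPG range reported by the analyst for this data set.
MPG_MIN, MPG_MAX = 9.0, 46.6

scaled = normalize(18.0, MPG_MIN, MPG_MAX)
print(scaled, denormalize(scaled, MPG_MIN, MPG_MAX))
```

The smallest observed value maps to -1 and the largest to 1. The inverse mapping is what turns a normalized prediction back into miles per gallon.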
Encog Analyst will now generate an EGA File with the same base name as your data file. You should now see two files in your example directory: mpg.csv and mpg.ega.
The EGA File that was generated by analyzing the MPG data is shown below. This listing is specific to this example, so yours should be very similar.
[HEADER]
[HEADER:DATASOURCE]
rawFile=FILE_RAW
sourceFile=
sourceFormat=decpnt|space
sourceHeaders=f
[SETUP]
[SETUP:CONFIG]
allowedClasses=integer,string
csvFormat=decpnt|comma
inputHeaders=f
maxClassCount=50
[SETUP:FILENAMES]
FILE_RANDOMIZE=mpg_random.csv
FILE_EVAL_NORM=mpg_eval_norm.csv
FILE_BALANCE=mpg_balance.csv
FILE_EVAL=mpg_eval.csv
FILE_RAW=mpg.csv
FILE_ML=mpg_train.eg
FILE_OUTPUT=mpg_output.csv
FILE_CLUSTER=mpg_cluster.csv
FILE_NORMALIZE=mpg_norm.csv
FILE_TRAINSET=mpg_train.egb
FILE_TRAIN=mpg_train.csv
[DATA]
[DATA:CONFIG]
goal=classification
[DATA:STATS]
"name","isclass","iscomplete","isint","isreal","amax","amin","mean","sdev"
"field:1",0,1,0,1,46.6,9,23.5145728643,7.8061590613
"field:2",1,1,1,1,8,3,5.4547738693,1.6988659605
"field:3",0,1,0,1,455,68,193.425879397,104.1387635271
"field:4",0,1,0,0,0,0,11.4572864322,0
"field:5",0,1,0,1,5140,1613,2970.4246231156,845.7772335198
"field:6",0,1,0,1,24.8,8,15.5680904523,2.7542223176
"field:7",1,1,1,1,82,70,76.0100502513,3.6929784656
"field:8",1,1,1,1,3,1,1.5728643216,0.8010466374
"field:9",0,1,0,0,0,0,0,0
[DATA:CLASSES]
"field","code","name"
"field:2","3","3",4
"field:2","4","4",204
"field:2","5","5",3
"field:2","6","6",84
"field:2","8","8",103
"field:7","70","70",29
"field:7","71","71",28
"field:7","72","72",28
"field:7","73","73",40
"field:7","74","74",27
"field:7","75","75",30
"field:7","76","76",34
"field:7","77","77",28
"field:7","78","78",36
"field:7","79","79",29
"field:7","80","80",29
"field:7","81","81",29
"field:7","82","82",31
"field:8","1","1",249
"field:8","2","2",70
"field:8","3","3",79
[NORMALIZE]
[NORMALIZE:CONFIG]
sourceFile=FILE_TRAIN
targetFile=FILE_NORMALIZE
[NORMALIZE:RANGE]
"name","io","timeSlice","action","high","low"
"field:1","output",0,"range",1,-1
"field:2","input",0,"equilateral",1,-1
"field:3","input",0,"range",1,-1
"field:4","input",0,"ignore",0,0
"field:5","input",0,"range",1,-1
"field:6","input",0,"range",1,-1
"field:7","input",0,"equilateral",1,-1
"field:8","input",0,"equilateral",1,-1
"field:9","input",0,"ignore",0,0
[RANDOMIZE]
[RANDOMIZE:CONFIG]
sourceFile=FILE_RAW
targetFile=FILE_RANDOMIZE
[CLUSTER]
[CLUSTER:CONFIG]
clusters=0
sourceFile=FILE_EVAL
targetFile=FILE_CLUSTER
type=kmeans
[BALANCE]
[BALANCE:CONFIG]
balanceField=
countPer=
sourceFile=
targetFile=
[SEGREGATE]
[SEGREGATE:CONFIG]
sourceFile=FILE_RANDOMIZE
[SEGREGATE:FILES]
"file","percent"
"FILE_TRAIN",75
"FILE_EVAL",25
[GENERATE]
[GENERATE:CONFIG]
sourceFile=FILE_NORMALIZE
targetFile=FILE_TRAINSET
[ML]
[ML:CONFIG]
architecture=?:B->TANH->31:B->TANH->?
evalFile=FILE_EVAL
machineLearningFile=FILE_ML
outputFile=FILE_OUTPUT
trainingFile=FILE_TRAINSET
type=feedforward
[ML:TRAIN]
arguments=
cross=
targetError=0.01
type=rprop
[TASKS]
[TASKS:task-cluster]
cluster
[TASKS:task-create]
create
[TASKS:task-evaluate]
evaluate
[TASKS:task-evaluate-raw]
set ML.CONFIG.evalFile="FILE_EVAL_NORM"
set NORMALIZE.CONFIG.sourceFile="FILE_EVAL"
set NORMALIZE.CONFIG.targetFile="FILE_EVAL_NORM"
normalize
evaluate-raw
[TASKS:task-full]
randomize
segregate
normalize
generate
create
train
evaluate
[TASKS:task-generate]
randomize
segregate
normalize
generate
[TASKS:task-train]
train
Now that the EGA file has been generated, we will make several changes to it in the next section. For more information on the format of this file, see the article on EGA Files and Encog Analyst.
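The EGA file shown above is organized as bracketed section headers, each followed by either name=value settings or comma-separated rows. As a rough illustration of that layout, and not Encog's own parser, the sketch below groups lines by section. The function name parse_ega is hypothetical, and the sample text is a shortened fragment in the same layout.

```python
def parse_ega(text):
    """Group lines under their [SECTION] or [SECTION:SUBSECTION] header."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # A header starts a new (initially empty) section.
            current = line[1:-1]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


# A shortened fragment in the same layout as the listing above.
sample = """[DATA]
[DATA:CONFIG]
goal=regression
[SEGREGATE:FILES]
"file","percent"
"FILE_TRAIN",75
"FILE_EVAL",25
"""

sections = parse_ega(sample)
config = dict(line.split("=", 1) for line in sections["DATA:CONFIG"])
print(config["goal"])
print(sections["SEGREGATE:FILES"])
```

A script like this can be handy for checking a setting such as the goal or the train/evaluation percentages before running the analyst.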
Step 3: Modifications to the EGA File
Sometimes your EGA file will need no modifications. This is not the case with the MPG example. There are no headers, so we want to rename the fields. There are also fields we do not want to use. First, look at the field definitions.
"field:1",0,1,0,1,46.6,9,23.5145728643,7.8061590613
"field:2",1,1,1,1,8,3,5.4547738693,1.6988659605
"field:3",0,1,0,1,455,68,193.425879397,104.1387635271
"field:4",0,1,0,0,0,0,11.4572864322,0
"field:5",0,1,0,1,5140,1613,2970.4246231156,845.7772335198
"field:6",0,1,0,1,24.8,8,15.5680904523,2.7542223176
"field:7",1,1,1,1,82,70,76.0100502513,3.6929784656
"field:8",1,1,1,1,3,1,1.5728643216,0.8010466374
"field:9",0,1,0,0,0,0,0,0
This should be changed to:
"mpg",0,1,0,1,46.6,9,23.5145728643,7.8061590613
"cylinders",1,1,1,1,8,3,5.4547738693,1.6988659605
"displacement",0,1,0,1,455,68,193.425879397,104.1387635271
"horsepower",0,1,0,0,0,0,11.4572864322,0
"weight",0,1,0,1,5140,1613,2970.4246231156,845.7772335198
"acceleration",0,1,0,1,24.8,8,15.5680904523,2.7542223176
"year",0,1,1,1,82,70,76.0100502513,3.6929784656
"origin",0,1,1,1,3,1,1.5728643216,0.8010466374
"name",0,1,0,0,0,0,0,0
We are not going to use year, origin or name. However, do not delete them: they are still fields in the file, even though we will not use them. We have renamed the fields, and we are treating none of them as classes. Encog wants to treat year and cylinders as classes because each has so few distinct values that they look like classes. However, cylinders is really a number: more cylinders means more power. Year is not used anyway, so we don't need its class information. If one year were much better than another for cars (think wine), then year should be a class. If cars become more efficient as the years progress, then year should be a number. I would tend to treat year as a number. Cars are progressively getting more efficient; it's not as if one year is a particularly good "vintage". As a result, your classes section is empty, as seen here.
[DATA:CLASSES]
[NORMALIZE]
[NORMALIZE:CONFIG]
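Renaming the fields by hand is easy to get wrong, so the substitution described above could also be scripted. The sketch below only rewrites the leading "field:N" token of a line using the names chosen in this step. The helper rename_field is hypothetical, and whether other sections of the EGA file need the same names is not covered here.

```python
import re

# Descriptive names chosen in this step, keyed by the original field number.
NAMES = {1: "mpg", 2: "cylinders", 3: "displacement", 4: "horsepower",
         5: "weight", 6: "acceleration", 7: "year", 8: "origin", 9: "name"}

# Matches a quoted "field:N" token at the very start of a line.
FIELD = re.compile(r'^"field:(\d+)"')


def rename_field(line):
    """Replace a leading "field:N" with its descriptive name; other lines pass through."""
    def swap(match):
        number = int(match.group(1))
        # Unknown field numbers are left unchanged.
        return '"%s"' % NAMES.get(number, "field:%d" % number)
    return FIELD.sub(swap, line)


print(rename_field('"field:1",0,1,0,1,46.6,9,23.5145728643,7.8061590613'))
```

Note that this does not touch the isclass column. That flag still has to be reviewed by hand as discussed above.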
The final version of the EGA File is shown here.
[HEADER]
[HEADER:DATASOURCE]
rawFile=FILE_RAW
sourceFile=
sourceFormat=decpnt|space
sourceHeaders=f
[SETUP]
[SETUP:CONFIG]
allowedClasses=integer,string
csvFormat=decpnt|comma
inputHeaders=f
maxClassCount=50
[SETUP:FILENAMES]
FILE_RANDOMIZE=mpg_random.csv
FILE_EVAL_NORM=mpg_eval_norm.csv
FILE_EVAL=mpg_eval.csv
FILE_RAW=mpg.csv
FILE_ML=mpg_train.eg
FILE_OUTPUT=mpg_output.csv
FILE_CLUSTER=mpg_cluster.csv
FILE_NORMALIZE=mpg_norm.csv
FILE_TRAINSET=mpg_train.egb
FILE_TRAIN=mpg_train.csv
[DATA]
[DATA:CONFIG]
goal=regression
[DATA:STATS]
"name","isclass","iscomplete","isint","isreal","amax","amin","mean","sdev"
"field:1",0,1,0,1,46.6,9,23.5145728643,7.8061590613
"field:2",1,1,1,1,8,3,5.4547738693,1.6988659605
"field:3",0,1,0,1,455,68,193.425879397,104.1387635271
"field:4",0,0,0,1,230,46,104.4693877551,38.4420327144
"field:5",0,1,0,1,5140,1613,2970.4246231156,845.7772335198
"field:6",0,1,0,1,24.8,8,15.5680904523,2.7542223176
"field:7",1,1,1,1,82,70,76.0100502513,3.6929784656
"field:8",1,1,1,1,3,1,1.5728643216,0.8010466374
"field:9",0,1,0,0,0,0,0,0
[DATA:CLASSES]
"field","code","name"
"field:2","3","3",4
"field:2","4","4",204
"field:2","5","5",3
"field:2","6","6",84
"field:2","8","8",103
"field:7","70","70",29
"field:7","71","71",28
"field:7","72","72",28
"field:7","73","73",40
"field:7","74","74",27
"field:7","75","75",30
"field:7","76","76",34
"field:7","77","77",28
"field:7","78","78",36
"field:7","79","79",29
"field:7","80","80",29
"field:7","81","81",29
"field:7","82","82",31
"field:8","1","1",249
"field:8","2","2",70
"field:8","3","3",79
[NORMALIZE]
[NORMALIZE:CONFIG]
missingValues=DiscardMissing
sourceFile=FILE_TRAIN
targetFile=FILE_NORMALIZE
[NORMALIZE:RANGE]
"name","io","timeSlice","action","high","low"
"field:1","output",0,"range",1,-1
"field:2","input",0,"equilateral",1,-1
"field:3","input",0,"range",1,-1
"field:4","input",0,"range",1,-1
"field:5","input",0,"range",1,-1
"field:6","input",0,"range",1,-1
"field:7","input",0,"equilateral",1,-1
"field:8","input",0,"equilateral",1,-1
"field:9","input",0,"ignore",0,0
[RANDOMIZE]
[RANDOMIZE:CONFIG]
sourceFile=FILE_RAW
targetFile=FILE_RANDOMIZE
[CLUSTER]
[CLUSTER:CONFIG]
clusters=2
sourceFile=FILE_EVAL
targetFile=FILE_CLUSTER
type=kmeans
[BALANCE]
[BALANCE:CONFIG]
balanceField=
countPer=
sourceFile=
targetFile=
[SEGREGATE]
[SEGREGATE:CONFIG]
sourceFile=FILE_RANDOMIZE
[SEGREGATE:FILES]
"file","percent"
"FILE_TRAIN",75
"FILE_EVAL",25
[GENERATE]
[GENERATE:CONFIG]
sourceFile=FILE_NORMALIZE
targetFile=FILE_TRAINSET
[ML]
[ML:CONFIG]
architecture=?->R(type=new,kernel=rbf)->?
evalFile=FILE_EVAL
machineLearningFile=FILE_ML
outputFile=FILE_OUTPUT
trainingFile=FILE_TRAINSET
type=svm
[ML:TRAIN]
arguments=
cross=
targetError=0.05
type=svm-search
[TASKS]
[TASKS:task-cluster]
cluster
[TASKS:task-create]
create
[TASKS:task-evaluate]
evaluate
[TASKS:task-evaluate-raw]
set ML.CONFIG.evalFile="FILE_EVAL_NORM"
set NORMALIZE.CONFIG.sourceFile="FILE_EVAL"
set NORMALIZE.CONFIG.targetFile="FILE_EVAL_NORM"
normalize
evaluate-raw
[TASKS:task-full]
randomize
segregate
normalize
generate
create
train
evaluate
[TASKS:task-generate]
randomize
segregate
normalize
generate
[TASKS:task-train]
train
Step 4: Execute the Analyst Script
Now that the EGA File has been created, you can execute it. This can be done with the following command.
D:\test>EncogCmd analyst mpg.ega
Encog 3.0.0(32-bit) Command Line Utility
Copyright 2011 by Heaton Research, Inc.
Released under the Apache License
Executing command: analyst
Beginning Task#1/7 : randomize
1 : Analyzing
398/398 : Done analyzing
1/398 : Processing
398/398 : Done processing
Task randomize completed, task elapsed time 00:00:00
Beginning Task#2/7 : segregate
1 : Analyzing
398/398 : Done analyzing
1/398 : Processing
398/398 : Done processing
Task segregate completed, task elapsed time 00:00:00
Beginning Task#3/7 : normalize
1 : Processing
0 : Done processing
Task normalize completed, task elapsed time 00:00:01
Beginning Task#4/7 : generate
Task generate completed, task elapsed time 00:00:02
Beginning Task#5/7 : create
Task create completed, task elapsed time 00:00:02
Beginning Task#6/7 : train
Iteration #1 Error:Infinity% elapsed time = 00:00:03
Iteration #2 Error:Infinity% elapsed time = 00:00:03
Iteration #3 Error:Infinity% elapsed time = 00:00:04
Iteration #4 Error:Infinity% elapsed time = 00:00:04
Iteration #5 Error:Infinity% elapsed time = 00:00:04
Iteration #6 Error:Infinity% elapsed time = 00:00:04
Iteration #7 Error:Infinity% elapsed time = 00:00:04
Iteration #8 Error:Infinity% elapsed time = 00:00:04
Iteration #9 Error:Infinity% elapsed time = 00:00:04
Iteration #10 Error:Infinity% elapsed time = 00:00:04
...
Iteration #114 Error:17.503746% elapsed time = 00:00:24
Iteration #115 Error:17.503746% elapsed time = 00:00:24
Iteration #116 Error:17.503746% elapsed time = 00:00:25
Iteration #117 Error:17.503746% elapsed time = 00:00:25
Iteration #118 Error:17.503746% elapsed time = 00:00:25
Iteration #119 Error:17.503746% elapsed time = 00:00:25
Iteration #120 Error:17.503746% elapsed time = 00:00:25
Iteration #121 Error:17.503746% elapsed time = 00:00:26
Iteration #122 Error:17.503746% elapsed time = 00:00:27
Iteration #123 Error:17.503746% elapsed time = 00:00:29
Iteration #124 Error:17.503746% elapsed time = 00:00:31
Iteration #125 Error:0.436507% elapsed time = 00:00:32
Task train completed, task elapsed time 00:00:32
Beginning Task#7/7 : evaluate
1 : Analyzing
100/100 : Done analyzing
1/100 : Processing
100/100 : Done processing
Task evaluate completed, task elapsed time 00:00:33
Done. Runtime was 00:00:33 (33160ms).
D:\test>
Running the script takes the data through seven tasks, listed below. There may be more or fewer tasks for other Encog Analyst projects, depending on what options are chosen. The entire execution should take under a minute on most computers.
- Step 1: Randomize - Shuffle the file into a random order.
- Step 2: Segregate - Create a Training Data Set and an Evaluation Data Set
- Step 3: Normalize - Normalize the data into a form usable by the selected Machine Learning Method
- Step 4: Generate - Generate the training data into an EGB File that can be used to train.
- Step 5: Create - Generate the selected Machine Learning Method.
- Step 6: Train - Train the selected Machine Learning Method.
- Step 7: Evaluate - Evaluate the Machine Learning Method.
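The first two tasks in the list above, Randomize and Segregate, amount to a shuffle followed by a 75/25 split. The sketch below shows the idea in plain Python. It is not Encog's implementation: the function name, the fixed seed and the rounding rule are assumptions, although rounding down does reproduce the 100 evaluation rows reported in the log for the 398-row file.

```python
import random


def randomize_and_segregate(rows, train_percent=75, seed=42):
    """Shuffle a copy of the rows, then split it into training and evaluation sets."""
    shuffled = list(rows)
    # A fixed seed keeps the split repeatable (an illustrative choice).
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the 398 records in mpg.csv.
rows = list(range(398))
train, evaluation = randomize_and_segregate(rows)
print(len(train), len(evaluation))
```

Shuffling before the split matters because the raw file is ordered, so taking the last quarter unshuffled would evaluate only on the newest cars.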
This process will also create a number of files. The complete list of files in this project is:
- mpg.csv - The raw data.
- mpg.ega - The EGA File. This is the Encog Analyst script.
- mpg_eval.csv - The evaluation data.
- mpg_norm.csv - The normalized version of mpg_train.csv.
- mpg_output.csv - The output from running mpg_eval.csv.
- mpg_random.csv - The randomized (shuffled) version of mpg.csv.
- mpg_train.csv - The training data.
- mpg_train.eg - The Machine Learning Method that was trained.
- mpg_train.egb - The binary training data, created from mpg_norm.csv.
Step 5: Ranges Report
The range report tells you what range each of your data columns falls in. Encog Analyst determined these ranges while analyzing the data file, and all of them are saved inside the EGA File.
The range report is contained in the file mpg.html. This information is necessary for the analyst to normalize your data. Most Machine Learning Methods require some form of normalization to work with data.
Step 6: Examine the Output
To see how well the newly trained Machine Learning Method performed, examine mpg_output.csv. You can see part of this file here.
"mpg","cylinders","displacement","horsepower","weight","acceleration","year","origin","name","Output:mpg"
20.0,4,130.0,102.0,3150.,15.7,76,2,volvo 245,27.0766085659
28.4,4,151.0,90.00,2670.,16.0,79,1,buick skylark limited,27.011264803
32.8,4,78.00,52.00,1985.,19.4,78,3,mazda glc deluxe,31.8777234817
19.0,6,232.0,100.0,2901.,16.0,74,1,amc hornet,19.4085398701
26.0,4,79.00,67.00,1963.,15.5,74,2,volkswagen dasher,32.4327434887
26.0,4,122.0,80.00,2451.,16.5,74,1,ford pinto,29.5519134777
As you can see, the learning method's output (far right) tries to match the expected output (first column). For most cars it is fairly close; for others it is not. This is data that the Machine Learning Method was not trained with, so we are seeing how well it performs on new data.
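Rather than comparing the two columns by eye, the gap between expected and predicted values can be summarized with a couple of standard error measures. The sketch below computes the mean absolute error and root mean squared error over the six rows excerpted above. The function name error_summary is hypothetical, and these measures are not necessarily the error figure Encog prints during training.

```python
import csv
import io
import math

# Header and rows copied from the mpg_output.csv excerpt above.
OUTPUT_SAMPLE = (
    '"mpg","cylinders","displacement","horsepower","weight",'
    '"acceleration","year","origin","name","Output:mpg"\n'
    "20.0,4,130.0,102.0,3150.,15.7,76,2,volvo 245,27.0766085659\n"
    "28.4,4,151.0,90.00,2670.,16.0,79,1,buick skylark limited,27.011264803\n"
    "32.8,4,78.00,52.00,1985.,19.4,78,3,mazda glc deluxe,31.8777234817\n"
    "19.0,6,232.0,100.0,2901.,16.0,74,1,amc hornet,19.4085398701\n"
    "26.0,4,79.00,67.00,1963.,15.5,74,2,volkswagen dasher,32.4327434887\n"
    "26.0,4,122.0,80.00,2451.,16.5,74,1,ford pinto,29.5519134777\n"
)


def error_summary(csv_text, actual="mpg", predicted="Output:mpg"):
    """Return (mean absolute error, root mean squared error) for two columns."""
    rows = csv.DictReader(io.StringIO(csv_text))
    errors = [float(r[predicted]) - float(r[actual]) for r in rows]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


mae, rmse = error_summary(OUTPUT_SAMPLE)
print(f"MAE: {mae:.2f} mpg, RMSE: {rmse:.2f} mpg")
```

On these six rows the mean absolute error works out to about 3.3 mpg. Six rows are far too few to judge the model, so in practice the whole of mpg_output.csv would be passed in.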
Understanding the Example
This is an example of regression. The output from the Machine Learning Method is a number: the expected miles per gallon for a car. In this example we used a Support Vector Machine. From a purely "black box" standpoint, a Support Vector Machine is very similar to a Neural Network. Both accept input data and produce output data. For regression, the input and output of a Support Vector Machine are identical to those of a Neural Network. For classification, the output of a Support Vector Machine is slightly different. Encog Analyst hides these differences.
External Links
The completed example can be downloaded here.
