Workbench Bayesian Classification Example
This example shows how to use the Encog Workbench to perform classification with a Bayesian Network. This example also uses the Encog Analyst. Classification is the process by which a Machine Learning Method learns to assign data to classes. The structure and probability tables for the Bayesian Network are constructed from the Training Set. Using this information, the Bayesian Network learns to classify each Training Set element into the appropriate class. The Machine Learning Method should then be able to classify new data into appropriate classes, based on what was learned from the Training Set.
This example will use the Encog Analyst to learn to classify the species of Iris presented to it. This example makes use of the Iris Data Set. This is a classic training set that presents four attributes and one species label for 150 irises.
Steps for Running the Example
To walk through this example, follow these steps. This example requires Encog 3.1 or later. Earlier versions of Encog do not include support for Bayesian Networks.
Step 1: Generate the Iris Data
First, start up the Encog Workbench. Create a new project. Name it anything you like, such as Iris example. This will create an empty folder to hold your data. You now need to obtain your data. The Encog Workbench contains a number of built-in data sets, and the Iris Data Set is one of these. Choose Tools:Generate Training Data from the Encog Workbench menu bar. Choose to generate the Iris Data Set, and name it something such as iris.csv. This should create a CSV File. You can see a small sample of this data here.
|Sepal Length|Sepal Width|Petal Length|Petal Width|Species|
As you can see there are three species of iris. Measurements are provided for each. We would like to create a Machine Learning Method that will learn to predict what type of iris we have by simply providing the four measurements. We will divide this training set into a training data set and an evaluation data set. The larger training data set will be used for the Machine Learning Method to learn from. The evaluation data set will be used to test the Machine Learning Method on data that it was not trained with. It is also possible to use cross validation, and use a single data set.
Step 2: Use the Analyst Wizard
Now that you have input data in the workbench, you should use the Encog Analyst Wizard to create an Encog Analyst File. The Encog Analyst File (*.ega) is a script file that tells Encog Analyst how to process your file data. To generate an EGA File, right-click iris.csv and choose Analyst Wizard.... This will show a screen similar to the following.
You must change the value shown above for CSV File Headers. Place a check in the CSV File Header box. Optionally, you could specify that the target field is species, which is the name of the column heading in the CSV File. However, since this is a classification problem and there is only one class field, the Encog Analyst can determine that this one class field is what you are trying to classify. If there were multiple class fields, you would have to enter a target field. For this example you must also change the Machine Learning option to Bayesian Network.
Because you chose a Bayesian Network, you will be prompted for a few extra parameters.
For this example you should not choose a Naive Bayesian Network. The example will work just fine if you choose Naive Bayes, which is one particular structure of Bayesian Network that is often very effective. However, we want Encog to choose from all available Bayesian structures. You should also keep the default value of 3 for the Evidence Bands. This determines how the continuous input values will be mapped to discrete classes. Using a value of three maps each input to three classes. If your data is more complex, you may wish to use more bands. You should never use fewer than two bands.
Encog Analyst will now generate an EGA File with the same base name as your data file. You should now see two files in the workbench project area: iris.csv and iris.ega. Double-click the iris.ega file and you will see the following.
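The effect of the Evidence Bands setting can be sketched in a few lines of Python. This is only an illustration, not Encog's implementation: it assumes equal-width bands between a column's minimum and maximum, which is consistent with the band boundaries that appear in the architecture line of the generated EGA File. The function names band_edges and band_index are hypothetical.

```python
def band_edges(low, high, bands=3):
    """Return bands + 1 equally spaced edges covering [low, high].

    Assumption: equal-width banding between the column minimum and maximum.
    """
    if bands < 2:
        raise ValueError("use at least two bands")
    width = (high - low) / bands
    return [low + i * width for i in range(bands)] + [high]


def band_index(value, edges):
    """Map a continuous value to the index of the band that contains it."""
    last = len(edges) - 2  # index of the top band
    for i in range(last):
        if value < edges[i + 1]:
            return i
    return last  # anything at or above the last inner edge is in the top band


# sepal_l ranges from 4.3 to 7.9 in the Iris data
edges = band_edges(4.3, 7.9)
print([round(e, 4) for e in edges])
print(band_index(5.1, edges), band_index(6.0, edges), band_index(7.9, edges))
```

With more bands the discrete classes become narrower, which may suit more complex data but leaves fewer training rows per band.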
This shows you the EGA File that was generated by analyzing the Iris data. You can see the complete file here.
[HEADER]
[HEADER:DATASOURCE]
rawFile=FILE_RAW
sourceFile=
sourceFormat=
sourceHeaders=t
[SETUP]
[SETUP:CONFIG]
allowedClasses=integer,string
csvFormat=decpnt|comma
inputHeaders=t
maxClassCount=50
[SETUP:FILENAMES]
FILE_RANDOMIZE=iris_random.csv
FILE_EVAL_NORM=iris_eval_norm.csv
FILE_EVAL=iris_eval.csv
FILE_RAW=iris.csv
FILE_ML=iris_train.eg
FILE_OUTPUT=iris_output.csv
FILE_CLUSTER=iris_cluster.csv
FILE_NORMALIZE=iris_norm.csv
FILE_TRAINSET=iris_train.egb
FILE_TRAIN=iris_train.csv
[DATA]
[DATA:CONFIG]
goal=classification
[DATA:STATS]
"name","isclass","iscomplete","isint","isreal","amax","amin","mean","sdev"
"sepal_l",0,1,0,1,7.9,4.3,5.8483221477,0.8280812566
"sepal_w",0,1,0,1,4.4,2,3.0543624161,0.4358764778
"petal_l",0,1,0,1,6.9,1,3.7738255034,1.7653696439
"petal_w",0,1,0,1,2.5,0.1,1.2060402685,0.7622673736
"species",1,1,0,0,0,0,0,0
[DATA:CLASSES]
"field","code","name"
"species","Iris-setosa","Iris-setosa",49
"species","Iris-versicolor","Iris-versicolor",50
"species","Iris-virginica","Iris-virginica",50
[NORMALIZE]
[NORMALIZE:CONFIG]
missingValues=DiscardMissing
sourceFile=FILE_TRAIN
targetFile=FILE_NORMALIZE
[NORMALIZE:RANGE]
"name","io","timeSlice","action","high","low"
"sepal_l","input",0,"pass",0,0
"sepal_w","input",0,"pass",0,0
"petal_l","input",0,"pass",0,0
"petal_w","input",0,"pass",0,0
"species","input",0,"single",0,0
[RANDOMIZE]
[RANDOMIZE:CONFIG]
sourceFile=FILE_RAW
targetFile=FILE_RANDOMIZE
[CLUSTER]
[CLUSTER:CONFIG]
clusters=3
sourceFile=FILE_EVAL
targetFile=FILE_CLUSTER
type=kmeans
[BALANCE]
[BALANCE:CONFIG]
balanceField=
countPer=
sourceFile=
targetFile=
[SEGREGATE]
[SEGREGATE:CONFIG]
sourceFile=FILE_RANDOMIZE
[SEGREGATE:FILES]
"file","percent"
"FILE_TRAIN",75
"FILE_EVAL",25
[GENERATE]
[GENERATE:CONFIG]
sourceFile=FILE_NORMALIZE
targetFile=FILE_TRAINSET
[ML]
[ML:CONFIG]
architecture=P(sepal_l[Type0:4.3 to 5.5,Type1:5.5 to 6.7,Type2:6.7 to 7.9]) P(sepal_w[Type0:2 to 2.8,Type1:2.8 to 3.6,Type2:3.6 to 4.4]) P(petal_l[Type0:1 to 2.9667,Type1:2.9667 to 4.9333,Type2:4.9333 to 6.9]) P(petal_w[Type0:0.1 to 0.9,Type1:0.9 to 1.7,Type2:1.7 to 2.5]) P(species[Iris-setosa,Iris-versicolor,Iris-virginica])
evalFile=FILE_EVAL
machineLearningFile=FILE_ML
outputFile=FILE_OUTPUT
query=P(species|sepal_l,sepal_w,petal_l,petal_w)
trainingFile=FILE_TRAINSET
type=bayesian
[ML:TRAIN]
arguments=maxParents=1,estimator=simple,search=k2,init=naive
cross=
targetError=0.05
type=bayesian
[TASKS]
[TASKS:task-cluster]
cluster
[TASKS:task-create]
create
[TASKS:task-evaluate]
evaluate
[TASKS:task-evaluate-raw]
set ML.CONFIG.evalFile="FILE_EVAL_NORM"
set NORMALIZE.CONFIG.sourceFile="FILE_EVAL"
set NORMALIZE.CONFIG.targetFile="FILE_EVAL_NORM"
normalize
evaluate-raw
[TASKS:task-full]
randomize
segregate
normalize
generate
create
train
evaluate
[TASKS:task-generate]
randomize
segregate
normalize
generate
[TASKS:task-train]
train
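If you want to inspect an EGA File outside the workbench, its bracketed section headers make it easy to split apart. The following Python sketch is not part of Encog; it assumes the file on disk stores one header, key=value pair, or CSV row per line, and the function name parse_sections is hypothetical. The sample text is a fragment taken from the listing above.

```python
def parse_sections(text):
    """Group the lines of an EGA-style file under their [SECTION] headers.

    Assumption: one header, key=value pair, or CSV row per line.
    """
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


SAMPLE = """[SEGREGATE:FILES]
"file","percent"
"FILE_TRAIN",75
"FILE_EVAL",25
[ML:TRAIN]
arguments=maxParents=1,estimator=simple,search=k2,init=naive
cross=
targetError=0.05
type=bayesian
"""

for name, lines in parse_sections(SAMPLE).items():
    print(name, lines)
```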
Now that the EGA file has been generated the wizard will make no further changes to it. You can make additional customizations to the EGA file by directly editing its text. For more information on the format of this file, see the article on EGA Files and Encog Analyst.
Step 3: Visualizing Your Data
You will notice that the EGA File Editor, which was opened in the last step, has several buttons on the top. The Visualize button provides several ways to visualize your data. Clicking the Visualize button will provide you with a list of data visualizations provided by Encog Analyst.
The first is the range report. The range report tells you what ranges your data columns were in. Encog Analyst determined this while analyzing the data file. All of these ranges are saved inside of the EGA File. You can see some of the data produced in a range report here:
There is additional information if you scroll down. This information is necessary for the analyst to normalize your data. Most Machine Learning Methods require some form of normalization to work with data. Bayesian Networks do not require data to be normalized; however, they deal with discrete input. If your data is continuous, as is the case with the Iris Data Set, it must be mapped into discrete ranges. By default the Encog Analyst breaks each input into three discrete bands. This value can be changed in the wizard dialog that is displayed when you generate an EGA File from the Iris data.
A scatter plot can also show some interesting information about your data. You can easily see if your data forms clusters. Choose Scatter Plot from the dialog that appears when you click the Visualize button in the EGA File Editor. You will be prompted for the attributes you wish to plot. Place a check in all check boxes. This will produce a multivariate scatter plot, as seen here. This multivariate scatter plot shows pairings of each of the attributes, which allows you to see how the pairs relate to each other. Ideally you will see large clusters of similarly colored dots. If you do not, your data is either very noisy or is simply not expressed in a way that is going to be easy for a Machine Learning Method to learn. The Iris data set does have well-defined clusters.
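The kind of per-column summary found in a range report can be sketched in Python as follows. This is an illustration only, not Encog's code: the function name column_ranges is hypothetical, and the use of the population standard deviation is an assumption. The three rows are taken from the Iris measurements that appear elsewhere in this example.

```python
import statistics


def column_ranges(rows, fields):
    """Compute min, max, mean, and standard deviation for each numeric field.

    Assumption: population standard deviation; Encog may compute sdev differently.
    """
    report = {}
    for field in fields:
        values = [float(row[field]) for row in rows]
        report[field] = {
            "min": min(values),
            "max": max(values),
            "mean": statistics.fmean(values),
            "sdev": statistics.pstdev(values),
        }
    return report


rows = [
    {"sepal_l": "4.6", "petal_l": "1.5"},
    {"sepal_l": "5.4", "petal_l": "1.5"},
    {"sepal_l": "5.8", "petal_l": "4.0"},
]
for field, stats in column_ranges(rows, ["sepal_l", "petal_l"]).items():
    print(field, stats)
```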
Another interesting feature of the Iris data set is that the three clusters are not all linearly separable. Iris Setosa is linearly separable from the other two, but Iris Versicolor and Iris Virginica are not linearly separable from each other, at least not on all pairings.
Additionally, there are only two clusters if you do not have species information. Imagine all dots were black: you would see only two clusters. Because of this, a simple unsupervised clustering Machine Learning Method would not be able to learn the difference. This also illustrates the difference between clustering and classification. Clustering is unsupervised and simply places data into natural clusters. Classification is generally supervised, and learns to classify new data that it has not yet seen.
Step 4: Execute the Analyst Script
Now that the EGA File has been created, you can execute it. Click the Execute button in the EGA File Editor that was opened in Step 2. This takes the data through seven steps. There may be more or fewer steps for other Encog Analyst projects, depending on what options are chosen. The entire execution should take under a minute on most computers.
- Step 1: Randomize - Shuffle the file into a random order.
- Step 2: Segregate - Create a Training Data Set and an Evaluation Data Set
- Step 3: Normalize - Normalize the data into a form usable by the selected Machine Learning Method
- Step 4: Generate - Generate the training data into an EGB File that can be used to train.
- Step 5: Create - Generate the selected Machine Learning Method.
- Step 6: Train - Train the selected Machine Learning Method.
- Step 7: Evaluate - Evaluate the Machine Learning Method.
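The first two steps above, Randomize and Segregate, can be sketched in Python. This is a minimal illustration rather than Encog's implementation: the 75/25 split comes from the [SEGREGATE:FILES] section of the EGA File, while the function name randomize_and_segregate, the seed parameter, and the rounding rule are assumptions.

```python
import random


def randomize_and_segregate(rows, train_percent=75, seed=None):
    """Shuffle the rows, then split them into training and evaluation sets.

    Assumption: the training share is rounded down to a whole number of rows.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]


# 150 row indices stand in for the 150 irises
train, evaluation = randomize_and_segregate(range(150), seed=1)
print(len(train), len(evaluation))
```

Shuffling before splitting matters when the source file is grouped by class, because an unshuffled split could otherwise leave one species out of the training set entirely.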
This process will also create a number of files. The complete list of files in this project is:
- iris.csv - The raw data.
- iris.ega - The EGA File. This is the Encog Analyst script.
- iris_eval.csv - The evaluation data.
- iris_norm.csv - The normalized version of iris_train.csv.
- iris_output.csv - The output from running iris_eval.csv.
- iris_random.csv - The randomized output from running iris.csv.
- iris_train.csv - The training data.
- iris_train.eg - The Machine Learning Method that was trained.
- iris_train.egb - The binary training data, created from iris_norm.csv.
Step 5: Examine the Output
To see how well the newly trained Machine Learning Method performed, examine iris_output.csv. You can see part of this file here.
"sepal_l","sepal_w","petal_l","petal_w","species","Output:species" 4.6,3.1,1.5,0.2,Iris-setosa,Iris-setosa 5.4,3.4,1.5,0.4,Iris-setosa,Iris-setosa 5.8,2.6,4.0,1.2,Iris-versicolor,Iris-versicolor 5.5,2.6,4.4,1.2,Iris-versicolor,Iris-versicolor 5.2,3.5,1.5,0.2,Iris-setosa,Iris-setosa 4.9,2.4,3.3,1.0,Iris-versicolor,Iris-versicolor 6.2,2.8,4.8,1.8,Iris-virginica,Iris-virginica 5.5,2.4,3.8,1.1,Iris-versicolor,Iris-versicolor 5.8,2.8,5.1,2.4,Iris-virginica,Iris-virginica 5.6,2.5,3.9,1.1,Iris-versicolor,Iris-versicolor
As you can see, the learning method's output (far-right column) matches the expected output (second-to-last column) well.
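Rather than comparing the two columns by eye, you can count the matches with a short script. The Python sketch below is not part of Encog; the function name accuracy is hypothetical, and the embedded rows are the ten sample rows shown above. To check the full file, you would read iris_output.csv from your project folder instead.

```python
import csv
import io

# The ten sample rows from iris_output.csv shown above.
SAMPLE = '''"sepal_l","sepal_w","petal_l","petal_w","species","Output:species"
4.6,3.1,1.5,0.2,Iris-setosa,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa,Iris-setosa
5.8,2.6,4.0,1.2,Iris-versicolor,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor,Iris-versicolor
5.2,3.5,1.5,0.2,Iris-setosa,Iris-setosa
4.9,2.4,3.3,1.0,Iris-versicolor,Iris-versicolor
6.2,2.8,4.8,1.8,Iris-virginica,Iris-virginica
5.5,2.4,3.8,1.1,Iris-versicolor,Iris-versicolor
5.8,2.8,5.1,2.4,Iris-virginica,Iris-virginica
5.6,2.5,3.9,1.1,Iris-versicolor,Iris-versicolor
'''


def accuracy(csv_text, expected="species", predicted="Output:species"):
    """Return the fraction of rows where the predicted class equals the expected one."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("no data rows found")
    hits = sum(1 for row in rows if row[expected] == row[predicted])
    return hits / len(rows)


print(accuracy(SAMPLE))  # all ten sample rows match, so this prints 1.0
```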
Step 6: Analyze the Network
This example assumes that you used a Bayesian Network as the Machine Learning Method. However, a feedforward neural network, Support Vector Machine, or other compatible Machine Learning Method could have been used in Step 2. Encog makes Machine Learning Methods very interchangeable.
You can examine the Bayesian Network created with this example. Double click the iris_train.eg file, and choose the Visualize button. Choose Network Structure. You will see the network structure.
A Bayesian network is based on probabilities. For example, the probability structure that was learned for the Iris data is shown here.
P(sepal_l|species) P(sepal_w|species) P(petal_l|species) P(petal_w|species) P(species)
The above line expresses five probabilities: the probability of the species, and the probability of each of the four input values given the species. By examining the training data, the K2 training algorithm created the above structure, as well as the probability tables for the Bayesian Network. Because Bayesian networks deal in probabilities, they have several advantages over other machine learning methods. Bayesian networks deal very well with missing data: if you omit some of the input measurements, the Bayesian Network will still give an expected species. Additionally, when presented with input data, the Bayesian Network will tell you the probability of each of the species.
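To see why a structure like this copes with missing inputs, consider the following Python sketch. It is not Encog's query engine, and every number in the tables is hypothetical, chosen only for illustration. It multiplies the prior P(species) by P(input band|species) for each input that is supplied and then normalizes the result; any input that is omitted is simply left out of the product.

```python
# Hypothetical probability tables, NOT the values Encog learned.
# Each list holds P(band 0), P(band 1), P(band 2) for that species.
PRIOR = {"Iris-setosa": 1 / 3, "Iris-versicolor": 1 / 3, "Iris-virginica": 1 / 3}
CPT = {
    "petal_l": {
        "Iris-setosa": [0.90, 0.05, 0.05],
        "Iris-versicolor": [0.10, 0.80, 0.10],
        "Iris-virginica": [0.05, 0.25, 0.70],
    },
    "petal_w": {
        "Iris-setosa": [0.90, 0.05, 0.05],
        "Iris-versicolor": [0.05, 0.85, 0.10],
        "Iris-virginica": [0.05, 0.15, 0.80],
    },
}


def posterior(prior, cpt, evidence):
    """Return P(species | evidence) for the bands given in evidence.

    evidence maps a field name to its band index; omitted fields are ignored.
    """
    scores = {}
    for species, p in prior.items():
        for field, band in evidence.items():
            p *= cpt[field][species][band]
        scores[species] = p
    total = sum(scores.values())
    return {species: score / total for species, score in scores.items()}


# Both petal measurements fall in the top band.
print(posterior(PRIOR, CPT, {"petal_l": 2, "petal_w": 2}))
# petal_w is missing; the query still returns a probability for every species.
print(posterior(PRIOR, CPT, {"petal_l": 2}))
```

With less evidence the probabilities are less sharply peaked, but the query still produces a probability for each species.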
Understanding the Example
The Encog Analyst shielded you from a fair amount of complexity. All normalization decisions were made automatically and encoded into the EGA File. If you like, you can change any of the options in the EGA File and rerun the example; the wizard is only meant to give you a starting point. For the Bayesian Network in this example, the [NORMALIZE:RANGE] section of the EGA File uses the pass action for the four measurements and the single action for species. If a neural network is selected in Step 2 instead, equilateral class normalization is often a good choice when there are more than two classes. Equilateral normalization requires one fewer output neuron than the total number of classes, so the three iris species would need two output neurons.
For a neural network, the analyst wizard also makes a quick estimate of how many hidden neurons might be needed. You may get better results by varying this number.
The completed example can be downloaded here.