Basic Classification in R: Neural Networks and Support Vector Machines

2013-06-12

In this article I will introduce you to classification in R, using the Iris data
set. Iris is a classic data set that is often used to demonstrate machine learning.
It provides four measurements (sepal length, sepal width, petal length, and petal
width) for 150 flowers from three iris species. Data such as this typically comes in
a CSV file. The first few lines of the iris CSV file look like this.

"sepal_l","sepal_w","petal_l","petal_w","species"
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa

You can download the above file here.

Reading a CSV File in R

R resolves relative file names against the current working directory, which you can check with getwd(); when R starts this is often your home directory. You can also specify a full path. R already has the iris data set built into the variables iris and iris3, but I will assume that you want to use your own data, so I will demonstrate how to load the iris.csv file. In R 4.0 and later, read.csv no longer converts text columns to factors by default, so stringsAsFactors=TRUE is passed to keep species as a factor. The following command loads the Iris data set.

irisdata <- read.csv(file="iris.csv",header=TRUE,sep=",",stringsAsFactors=TRUE)

You can also load the data directly over the web.

irisdata <- read.csv("http://www.heatonresearch.com/dload/data/iris.csv",header=TRUE,sep=",",stringsAsFactors=TRUE)
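As a quick check that a file loads the way you expect, the following sketch writes a few of the rows shown earlier to a temporary file and reads them back. The temporary file and the names csvPath and sampleData are only illustrative; with your own data you would point read.csv at the real file. It passes stringsAsFactors=TRUE because R 4.0 and later leave text columns as character vectors otherwise.

```r
# Write a few rows to a temporary CSV file (a stand-in for iris.csv).
csvPath <- tempfile(fileext = ".csv")
writeLines(c(
  '"sepal_l","sepal_w","petal_l","petal_w","species"',
  '5.1,3.5,1.4,0.2,Iris-setosa',
  '4.9,3.0,1.4,0.2,Iris-setosa',
  '4.7,3.2,1.3,0.2,Iris-setosa'
), csvPath)

# Relative file names are resolved against the working directory.
getwd()

# Read the file; stringsAsFactors=TRUE makes species a factor.
sampleData <- read.csv(csvPath, header = TRUE, sep = ",",
                       stringsAsFactors = TRUE)

str(sampleData)   # column names and types
nrow(sampleData)  # number of data rows read
```

str is a convenient way to confirm that the four measurements were read as numbers and that species was read as a factor.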

Now that the iris data set is loaded, you can display the entire data set just by entering the variable name.

> irisdata
sepal_l sepal_w petal_l petal_w species
1 5.1 3.5 1.4 0.2 Iris-setosa
2 4.9 3.0 1.4 0.2 Iris-setosa
3 4.7 3.2 1.3 0.2 Iris-setosa
4 4.6 3.1 1.5 0.2 Iris-setosa
5 5.0 3.6 1.4 0.2 Iris-setosa
6 5.4 3.9 1.7 0.4 Iris-setosa
7 4.6 3.4 1.4 0.3 Iris-setosa
...

You can also use the summary function to provide a very useful summary of the iris data.

> summary(irisdata)
 sepal_l sepal_w petal_l petal_w
 Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
 Median :5.800 Median :3.000 Median :4.350 Median :1.300
 Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
 Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
 species
 Iris-setosa :50
 Iris-versicolor:50
 Iris-virginica :50
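The summary above describes the data set as a whole. A per-species breakdown is often more telling for classification. The sketch below uses the iris data frame that ships with R rather than the CSV file, so the column names (Sepal.Length, Species, and so on) differ from those in irisdata; the names speciesCounts and speciesMeans are illustrative.

```r
# Built-in iris data frame: Sepal.Length, Sepal.Width,
# Petal.Length, Petal.Width and Species.
data(iris)

# Number of rows per species.
speciesCounts <- table(iris$Species)
print(speciesCounts)

# Mean of each measurement within each species.
speciesMeans <- aggregate(. ~ Species, data = iris, FUN = mean)
print(speciesMeans)
```

Large differences between the per-species means, especially for the petal measurements, hint at which columns will help a classifier most.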

Training and Validation Data

It is often useful to split the data into training and validation sets. This lets you validate the support vector machine (SVM) or artificial neural network (ANN) on data it was never trained on. The Iris data set has 150 rows. For our training set we will sample 100 of them at random and keep the remaining 50 for validation. This is done with the following commands.

irisTrainData = sample(1:150,100)
irisValData = setdiff(1:150,irisTrainData)

It is very important to note that the above vectors are only indexes, and not the actual data. To obtain the actual data you must use one of the following commands.

irisdata[irisTrainData,]
irisdata[irisValData,]
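Because sample draws at random, each run produces a different split. A minimal sketch of a repeatable split follows, assuming you are free to fix the random seed; the seed value 42 and the names trainIdx and valIdx are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed, chosen only so the split repeats

n <- 150
trainIdx <- sample(1:n, 100)        # 100 random row numbers for training
valIdx   <- setdiff(1:n, trainIdx)  # the remaining 50 for validation

length(trainIdx)                     # 100
length(valIdx)                       # 50
length(intersect(trainIdx, valIdx))  # 0, the two sets do not overlap
```

Fixing the seed makes results comparable between runs, which helps when you are tuning parameters such as the SVM cost or the number of hidden neurons.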

Using a Support Vector Machine (SVM)

I will now show you how to train a support vector machine for the Iris data set. First, we load the kernlab package, which provides the SVM implementation.

library(kernlab)

Next, we create a radial basis function (RBF) that will be used during training. This will be used as the kernel function.

rbf <- rbfdot(sigma=0.1)
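To make the role of sigma concrete, here is a minimal base-R sketch of the Gaussian RBF kernel in the form that the kernlab documentation gives for rbfdot, k(x, y) = exp(-sigma * ||x - y||^2). The function name rbfKernel is hypothetical and is not part of kernlab; it only illustrates the calculation.

```r
# Gaussian RBF kernel: similarity falls off with squared distance.
rbfKernel <- function(x, y, sigma = 0.1) {
  exp(-sigma * sum((x - y)^2))
}

a <- c(5.1, 3.5, 1.4, 0.2)  # first setosa row from the CSV
b <- c(4.9, 3.0, 1.4, 0.2)  # second setosa row from the CSV

rbfKernel(a, a)  # identical points give exactly 1
rbfKernel(a, b)  # nearby points give a value just below 1
```

A larger sigma makes the kernel value drop faster with distance, so each training point influences a smaller neighborhood.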

Next we train the SVM.

irisSVM <- ksvm(species~.,data=irisdata[irisTrainData,],type="C-bsvc",kernel=rbf,C=10,prob.model=TRUE)

Next we get the fitted values for this iris SVM.

fitted(irisSVM)

Next we test on the validation set, with probabilities as output. The -5 drops the fifth column, species, because that is the value we are trying to predict.

predict(irisSVM, irisdata[irisValData,-5], type="probabilities")

This produces output similar to the following.

 Iris-setosa Iris-versicolor Iris-virginica
 [1,] 0.964182671 0.022183652 0.013633677
 [2,] 0.952685528 0.032202528 0.015111944
 [3,] 0.966094194 0.021206723 0.012699083
 [4,] 0.965805632 0.020603214 0.013591154
 [5,] 0.962410318 0.024487673 0.013102009
 [6,] 0.964783325 0.022303353 0.012913322
 [7,] 0.975483475 0.012628443 0.011888082
 [8,] 0.918612644 0.060459572 0.020927784
 [9,] 0.953575715 0.030428791 0.015995494
[10,] 0.948050721 0.035563597 0.016385682
...

The above shows the predictions for the first 10 rows of the validation set. The numbers are probabilities, one column per species, and the column with the highest probability in each row is the predicted species. These ten rows are all Iris-setosa, so there is not much variety. If you run the above command in R, you will see the other species as well.
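Picking the most probable species for each row can be done with max.col. The sketch below hard-codes three rows copied from the output above (rows 1, 2, and 8) so that it runs on its own; the names probs and predicted are illustrative.

```r
# Three rows copied from the probability output above.
probs <- matrix(c(
  0.964182671, 0.022183652, 0.013633677,
  0.952685528, 0.032202528, 0.015111944,
  0.918612644, 0.060459572, 0.020927784
), ncol = 3, byrow = TRUE,
dimnames = list(NULL, c("Iris-setosa", "Iris-versicolor", "Iris-virginica")))

# max.col returns, for each row, the index of the largest column.
predicted <- colnames(probs)[max.col(probs)]
predicted

# Each row of probabilities sums to approximately 1.
rowSums(probs)
```

Applied to the full matrix returned by predict, the same two steps give one predicted label per validation row, which you could then compare against irisdata$species[irisValData] using table.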

Using a Neural Network (ANN)

I will now show you how to do the same thing using an artificial neural network. First, we load the nnet package.

library(nnet)

The neural network requires the species to be encoded using one-of-n encoding: each species gets its own column, which holds 1 for rows of that species and 0 otherwise. This can be done with the following command.

ideal <- class.ind(irisdata$species)
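To show what one-of-n encoding produces, here is a small base-R sketch that builds the same kind of indicator matrix by hand for four made-up labels. It is only an illustration of the idea, not the implementation used by class.ind; the names species and oneOfN are illustrative.

```r
species <- factor(c("Iris-setosa", "Iris-versicolor",
                    "Iris-virginica", "Iris-setosa"))

# One column per level: 1 where the row belongs to that level, else 0.
oneOfN <- sapply(levels(species), function(lev) as.numeric(species == lev))
oneOfN

# Every row contains exactly one 1.
rowSums(oneOfN)
```

Each row of the result is the target output the network is trained to reproduce for that flower.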

We can now train a neural network for the training data.

irisANN = nnet(irisdata[irisTrainData,-5], ideal[irisTrainData,], size=10, softmax=TRUE)

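Now we can test the output from the neural network. Because the network was trained on a matrix of targets rather than a formula, predict cannot return class labels with type="class", so we request the raw outputs and take the column with the largest value in each row.

predictions <- predict(irisANN, irisdata[irisValData,-5], type="raw")
colnames(ideal)[max.col(predictions)]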
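To judge how well the network does, you can compare its predictions with the known species. The following is a self-contained sketch under a few assumptions: it uses the iris data frame that ships with R instead of the CSV file (so the species column is Species), it fixes an arbitrary seed, and it asks predict for raw outputs and picks the largest column itself. The names ann, raw, confusion, and accuracy are illustrative.

```r
library(nnet)

set.seed(1)  # arbitrary seed so the split and initial weights repeat
data(iris)

trainIdx <- sample(1:150, 100)
valIdx   <- setdiff(1:150, trainIdx)

ideal <- class.ind(iris$Species)  # one-of-n targets

# trace=FALSE only suppresses the training log.
ann <- nnet(iris[trainIdx, -5], ideal[trainIdx, ], size = 10,
            softmax = TRUE, trace = FALSE)

# Raw outputs: one row per flower, one column per species.
raw <- predict(ann, iris[valIdx, -5], type = "raw")
predicted <- levels(iris$Species)[max.col(raw)]

# Confusion matrix and fraction of validation rows classified correctly.
confusion <- table(predicted = predicted, actual = iris$Species[valIdx])
accuracy <- mean(predicted == iris$Species[valIdx])
print(confusion)
print(accuracy)
```

The exact accuracy depends on the random split and the initial weights, so expect it to vary if you change the seed.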

The new series of books will cover R, as well as the usual Java and C#. You can pledge ($7) at Kickstarter and pre-order and support this project.

Heaton Research
