The discovery of the Rosetta Stone in 1799 allowed scholars to finally decipher ancient Egyptian hieroglyphics. The actual content of the Rosetta stone is a very mundane government decree issued by the Egyptian government around 196 BC. The importance of the Rosetta stone is that this mundane decree was written in Ancient Egyptian using both hieroglyphic script and Demotic script, as well as Ancient Greek. The Ancient Greek allowed the translation of the Egyptian hieroglyphics. Sometimes it helps to see what you do not understand represented as what you do understand. The Rosetta Stone has become a metaphor for that which allows the decoding of something else.
This article attempts a “Rosetta stone” of data science by showing a simple classification in the following languages:
- Python - I use Anaconda Python.
- MATLAB - See Octave for a free alternative.
- SAS - I used the free University edition.
We will begin with a simple data science classification problem. Classification is where you would like the model to learn to identify a non-numeric outcome for input data. In this case, we would like to input information about passengers on the RMS Titanic and get a prediction on if that passenger might survive. This is a binomial/binary prediction because there are two outcomes: survive (1) or perish (0). To perform this classification a GLM Logistic Regression is used. This technique achieves approximately an 80% accuracy rate.
Considering the pandemonium that was the final hours of the RMS Titanic, an 80% accuracy rate is decent. Given the limited information on the passengers, you simply can’t predict everything. There will always be noise. Generally females had a much higher probability of survival than males. For females the class of their ticket was the second most important factor in survival. For males, age was the second most important factor. However, there is always noise/outliers. Consider Miss Helen Loraine Allison, a female 1st class passenger age 2. Simply by the numbers, she should have survived. But she did not. Somehow she was separated from her parents in the pandemonium that was the sinking Titanic. Trying to predict Helen’s death is not constructive. Similarly, Mr. Karl Edwart Dahl, a male 3rd class passenger age 45. Simply by the numbers, he should have died. But he did not. Likewise, trying to predict cases like Mr. Dahl is simply not productive. An 80% accuracy rate for Titanic is not bad.
The model used in these examples is a Logistic regression. There are more complex models, such as random forests, gradient boosted machines, or deep neural networks. However, because the purpose of this article is to highlight the languages, a relatively simple model was chosen.
Some common features to all languages will be mentioned here:
Age - There are a number of missing ages. The Kaggle Titanic tutorial participants have found some very clever means of estimating missing ages. However, for this simple example we will simply fill in all missing ages with the median of the age.
Embarked - There are two records where the port that the passenger embarked from is missing. These are simply filled in as the most common port (Southampton). Also, the three departure ports in Europe are encoded into dummy variables.
Dropped Fields - The fields ‘Name’, ‘PassengerId’, ‘Ticket’, ‘Cabin’ are all dropped because they are not particularly predictive. Several published solutions do increase accuracy by using these fields. However, to keep this example relatively simple we will not use them.
Crossvalidation - The data are split into a training set and validation set. The model is trained using the training set and evaluated using the validation set. This gives a better indication of how accurate the model is by evaluating it on data that it had not seen during training.
The first language provided is Python 3.x. Python is a general purpose programming language that is widely used both by data scientists and software developers alike.
The first section simply imports the needed libraries and defines a convenience function that will encode dummy variables. Pandas is used for all data preprocessing.
Next, we load the data set CSV and perform the preprocessing. Male and female are converted to 1 and 0. Some of the languages have a categorical type and do not require this conversion. Embarked is encoded to dummies and missing values are filled in.
Next, the data are setup for the logistic regression. Python requires that the predictors (x) be separated from the target (y). The target is whether the person survived or not. The predictors are all of the other values used to determine survival. Some languages will allow us to keep the predictors and target together. We also split the training and validation sets.
Next we setup the Logistic regression classifier and fit the model to the training set.
Finally, we evaluate the accuracy of the model by having it predict the validation set.
R and Python are the two heavyweights of open source data science. According to KDDNuggets, R is the most popular programming language for data science – but it is pretty close. R does win the award for the shortest program in this article. The small size of the R source code is a result of several sets that R handles for you.
The first section loads the data file and preprocesses the code.
You will notice that unlike Python, R allows the x and y values to remain in the same dataframe. Additionally, there is no need to encode sex or embarked, because R loaded these values as factors (categoricals) and they are automatically encoded.
Next the training and validation sets are split.
The Logistic regression model is created. R allows more of the GLM inner-workings to be specified than Python. Below you can see that we specify both the family (binomial, which means there are two outputs). Additionally, the link function is the logit, or logistic function.
The predictions are made. These predictions are all probabilities of survival. The round function ensures that any prediction >0.5 is treated as survival. Finally, the accuracy is calculated and displayed.
MATLAB is a commercial programming language that is favored by many areas of academia and industry. My primary exposure to MATLAB was in academia. I used MATLAB for several classes as an undergraduate. My phd advisor used MATLAB for many of his projects. The free programming language Octave is somewhat compatible with MATLAB.
In MATLAB, everything is a matrix. A single integer is just a matrix of height and width of 1. Semicolons tell MATLAB if you want the return value printed. A semicolon will suppress output to the console of the return value of a function. Julia makes similar use of the semicolon.
Like the previous examples, the first step is to load and preprocess the data. MATLAB does have a categorical type, and we use it for both Embarked and Sex. However, we will still need to encode these values. Unlike R, dummy variables are not automatically created for us.
MATLAB, just like Python, requires that the x and y be split into two variables.
The dummy variables for Embarked are concatenated into the matrix.
Next, the training and validation sets are split.
Just like the previous examples, a binomial logist regression is created and fit to the training.
Finally, the validation set is used to create predictions and accuracy is evaluated.
SAS is a commercial language used to create statistical models. The first thing that you will notice about SAS is that the two primary statement are PROC and DATA. Both the PROC and DATA statements are ended by a RUN statement. A SAS program is essentially made up of PROC and DATA statements that pass data sets between each other. The data set is the only object type and it is stored to disk at each step.
SAS programs can have other statements beyond PROC and DATA, such as a macro language. However, the macro language functions as macros and effectively repeats PROC and DATA segments of the source file. A PROC statement calls a predefined SAS function. A DATA statement loops over a data set and modifies it.
The first step is to read in the CSV file. The OUT parameter specifies that the CSV file will be stored in a temporary binary file named train. Because SAS performs all operations on binary files, it does not require that everything fit into RAM. All of the other language examples from this article require everything to fit into RAM.
Next, the missing ages are filled in with their median values. This is done with a DATA statement that loops over the train data set that was just loaded in. The resulting data set is written to the same data set as the source. The DATA parameter specifies the input and the OUT parameter specifies the output.
Next, we will separate the training and validation set. The first step is to add a column, named selected to the train data set that specifies if it is part of the ultimate training set or not. This is done by using PROC SURVEYSELECT, a total of 70% of the data will have a selected variable with a value of 1. The input and output data sets are both train. The previous train data set (which held all rows) is replaced by a new train data set that contains only the training data.
Once the train data set has been properly labeled, it is split into two data sets that are named train and validate.
Now that we have a train and validation data set we are ready to fit the Logistic regression. This fitting is performed with PROC LOGISTIC. The two categorical values, Sex and Embarked must be specified with the CLASS parameters. The ref setting means to encode as dummy variables.
The formula that we are regressing is that Survived is equal to some regression of Sex, Age, Pclass, Parch, SibSp, and Embarked. The model parameters themselves are written to a binary file called model. The descending flag specifies that the model will predict values of 1 (survived), as opposed to 0 (perished).
Now that the model has been fit, we can predict with another call to PROC LOGISTIC. This will create a data set named pred that contains the validate data set augmented with predictions. Because we are predicting a binary outcome, two additional columns are added to the data set: P_1 specifies the probability that the person survived and P_2 specifies the probability that the person perished. These two values sum to 1.0 for each row.
Next a prediction data set is created where any Survived probability is greater than or equal to 0.5 is assumed to have survived is created. Only the PassengerId, Survived, and P_1 columns are kept. Additionally, a new column named pred_survived is created that holds a value of 1 if the probability of survival was greater or equal than 0.5.
Finally, a confusion matrix is generated that measures model performance. This will tells us the percent of survived and perished predictions that were correct.
Julia is a relatively new programming language that the data science community is showing increased support for. Julia can use advanced linear algebra packages to perform matrix operations with great performance; however, loops and traditional programming constructs will also perform at near C/C++ speed.
The first step is to read the CSV file into a DataFrame.
The usual preprocessing steps are performed. In Julia you will notice that function names and operators sometimes have a prefix/suffix, such as ! or .. The ! suffix, such as delete! notes that this function will modify what is passed to it. The . prefix/suffix, such as .~ means that the not(~) is applied to each member of the vector, not the whole vector.
Next we split the training and validation sets into df_train and df_validate.
The model is created and the validation set is scored for prediction. The formula of the regression uses the same regression format as R. The prediction probabilities are rounded, just like the other languages. This rounding means that probabilities of survival of 0.5 and higher indicate survived (1); whereas, 0 indicates perished.
Finally we can report the accuracy.
Now you have seen a simple Logistic regression in Python, R, MATLAB, SAS, and Julia. You can see some of the parallels and differences between the languages. Python and MATLAB require that x and y be split; whereas, the others do not. Julia and R use a similar regression formula format.