Heaton Research

Quick and Very Dirty Data Wrangling Example

Data science is often described as the intersection of statistics, domain knowledge, and
hacking skills. An important part of those hacking skills is data wrangling: data are
rarely in the exact form that you need. I am currently working on an example for
AIFH Vol 3 that will use a SOM to compare nations based on several statistics. I could
not find a dataset that fit exactly what I was looking for, so I decided to create my
own.

I wanted a list of countries with three data points that somehow indicate each
nation’s prosperity. I chose GDP, lifespan, and literacy rate. Remember, this is a
computer science experiment, not a sociology experiment. I am sure others could come up
with a much better set of data points to compare countries, but for my example
program these will work just fine.

I could not find a dataset that already combined all three, but all of the data is
available on Wikipedia. To wrangle it, I wrote a simple Python script. I am really
starting to like Python for quick scripting projects; I could have also used R, Groovy,
Perl, or a host of others. The end result looks something like this:

ATG,Antigua and Barbuda,1220,75.8,0.984
[Full File]
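Each output row is a comma-separated record: ISO code, country name, GDP, lifespan, and literacy rate. Writing the rows with Python's csv module (rather than hand-joining strings) avoids quoting headaches with any country names that contain commas. A minimal sketch using the sample row above:

```python
import csv
import io

# Write one record in the output format: code, name, GDP, lifespan, literacy.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["ATG", "Antigua and Barbuda", 1220, 75.8, 0.984])

print(buf.getvalue().strip())  # ATG,Antigua and Barbuda,1220,75.8,0.984
```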

You can download the entire contents of Wikipedia as a data dump, and this is usually how
you should deal with Wikipedia data. Do not use HTTP to pull large volumes of pages from
Wikipedia; that is a good way to get blocked. The dump also stores pages as wikitext
rather than HTML, which is much easier to parse. I simply pulled the nation codes
page, GDP, literacy, and lifespan pages into text files that my Python script could parse.
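Wikitext tables are fairly regular: rows are separated by `|-` and cells begin with `|`. A rough sketch of the kind of parsing involved (the sample text, column layout, and helper name are made up for illustration; the real pages each need their own tweaks):

```python
# A made-up fragment in the shape of a simple two-column wikitext table.
sample = """{|
|-
| [[Antigua and Barbuda]] || 1220
|-
| [[Freedonia]] || 500
|}"""

def parse_rows(wikitext):
    """Yield (name, value) pairs from a simple two-column wikitext table."""
    for line in wikitext.splitlines():
        line = line.strip()
        # Skip table open/close markers and row separators.
        if not line.startswith("|") or line in ("|-", "|}"):
            continue
        cells = [c.strip() for c in line.lstrip("|").split("||")]
        if len(cells) < 2:
            continue
        name = cells[0].strip("[]")  # drop the [[...]] link brackets
        yield name, float(cells[1])

rows = dict(parse_rows(sample))
# rows == {"Antigua and Barbuda": 1220.0, "Freedonia": 500.0}
```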

I joined the files together using the nation name as a key. If a nation’s name
did not appear in all of the lists, I discarded that nation.
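The join itself is simple once each source file has been parsed into a dictionary keyed by country name: intersect the key sets and keep only countries present in every source. A minimal sketch (the dictionary contents are hypothetical, not my actual parsed data):

```python
# Hypothetical parsed sources, each keyed by country name.
gdp = {"Antigua and Barbuda": 1220, "Freedonia": 500}
lifespan = {"Antigua and Barbuda": 75.8}
literacy = {"Antigua and Barbuda": 0.984, "Freedonia": 0.9}

# Keep only countries that appear in every source (an inner join).
common = set(gdp) & set(lifespan) & set(literacy)

joined = {name: (gdp[name], lifespan[name], literacy[name])
          for name in sorted(common)}
# "Freedonia" is discarded because it lacks a lifespan entry.
```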

You can see my Python code here. The code could be more readable, but it gets the job
done; it is a quick data-wrangling hack. If I needed to re-pull the data frequently,
particularly if it were high-velocity data, I would do something more formal.