Data science is often described as the intersection of statistics, domain knowledge, and hacking skills. One important part of hacking skills is data wrangling: data are rarely in the exact form that you need. I am currently working on an example for AIFH Vol 3 that will use a self-organizing map (SOM) to compare nations based on several statistics. I could not find a dataset that fit exactly what I was looking for, so I decided to create my own.

I wanted a list of countries with three different data points that somehow indicate that nation’s prosperity. I chose GDP, lifespan and literacy rate. Remember, this is a computer science experiment, not a sociology experiment. I am sure others could come up with a much better set of data points to compare countries. However, for my example program these will work just fine.

I could not find a data set that was already assembled, but all of this data is contained in Wikipedia, so I wrote a simple Python script to wrangle it. I am really starting to like Python for quick scripting projects; I could also have used R, Groovy, Perl, or a host of other languages. The end result looks something like this:

code,country,gdp,lifespan,literacy
AFG,Afghanistan,20650,60,0.431
ALB,Albania,12800,74,0.98
DZA,Algeria,215700,73.12,0.918
AND,Andorra,4800,84.2,1.0
AGO,Angola,124000,52,0.826
ATG,Antigua and Barbuda,1220,75.8,0.984
[Full File]

You can download the entire contents of Wikipedia as a database dump, and that is usually how you should work with Wikipedia data in bulk. Do not use HTTP to pull large volumes of data from Wikipedia's web servers; that is a good way to get blocked. The dump also contains wiki markup rather than rendered HTML, which is much easier to parse. I simply pulled the nation codes page and the GDP, literacy, and lifespan pages into text files that my Python script could parse.
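The parsing amounts to walking each saved page's table row by row. Here is a minimal sketch, assuming the saved pages use standard wiki table markup (`|-` row separators, `|`/`||` cell separators); real pages usually need page-specific tweaks, and the function name is mine, not from the original script:

```python
def parse_wiki_table(lines):
    """Yield each row of a wiki-markup table as a list of cell strings.

    Works on any iterable of lines (an open file, a list, ...).
    Header cells ("!") and table delimiters ("{|", "|}") are skipped.
    """
    row = []
    for line in lines:
        line = line.strip()
        if line.startswith("|-"):            # row separator: emit the row so far
            if row:
                yield row
            row = []
        elif line.startswith("|") and not line.startswith("|}"):
            # several cells may be packed onto one line, separated by "||"
            cells = line.lstrip("|").split("||")
            row.extend(c.strip() for c in cells)
    if row:                                   # last row has no trailing "|-"
        yield row
```

Used as, for example, `rows = list(parse_wiki_table(open("gdp.txt")))`, after which each row is a plain list of strings ready for cleanup.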

I joined the files together using the nation name as a key. If a nation's name did not appear in all of the lists, I discarded that nation.
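The join itself can be sketched with plain dictionaries keyed on nation name, writing a row only when the nation appears in every source. The function name, dictionary contents, and output filename below are illustrative assumptions; the real script builds the dictionaries from the parsed pages:

```python
import csv

def join_and_write(codes, gdp, lifespan, literacy, path):
    """codes maps nation name -> ISO code; the others map name -> value.

    Nations missing from any of the three value dicts are discarded,
    matching the join-then-drop approach described above.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["code", "country", "gdp", "lifespan", "literacy"])
        for name in sorted(codes):
            # keep only nations present in all lists
            if name in gdp and name in lifespan and name in literacy:
                writer.writerow([codes[name], name,
                                 gdp[name], lifespan[name], literacy[name]])
```

Using the nation name as the key is fragile (spellings differ between pages), which is exactly why mismatched names simply fall out of the result rather than raising an error.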

You can see my Python code here. The code could be more readable, but it gets the job done; it is a quick data-wrangling hack. If I needed to re-pull the data on a frequent basis, particularly if it were high-velocity data, I would build something more formal.