Wikipedia contains a vast amount of data that can be put to use in computer programs for a variety of purposes. However, the sheer size of Wikipedia makes this difficult. You should not access Wikipedia data programmatically through the live site: doing so would generate a large volume of additional traffic for Wikipedia and would likely result in your IP address being banned. Instead, download an offline copy of Wikipedia for your use. A variety of Wikipedia dump files are available; for this demonstration we will use the XML file that contains just the latest version of each Wikipedia article. The file that you will need to download is named enwiki-latest-pages-articles.xml.bz2.
The file is compressed with bzip2, so you must decompress it before use.
Format of the Wikipedia XML Dump
Do not try to open the enwiki-latest-pages-articles.xml file directly with an XML or text editor; it is very large. The listing below shows the beginning of this file. As you can see, the file is made up of page tags that contain revision tags.
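To make the nesting concrete, the following small sketch parses a synthetic sample that mirrors the dump's shape (a mediawiki root holding page elements, each containing a revision). The sample text is illustrative only, not the exact contents of the dump:

```python
import xml.etree.ElementTree as etree

# Synthetic sample mirroring the dump's nesting; the real file is far larger.
sample = """<mediawiki>
  <page>
    <title>Example</title>
    <ns>0</ns>
    <id>1</id>
    <revision>
      <id>100</id>
      <text>Article wikitext goes here...</text>
    </revision>
  </page>
</mediawiki>"""

root = etree.fromstring(sample)
for page in root.iter("page"):
    # Each <page> carries a <title>, <ns>, <id>, and a nested <revision>.
    print(page.find("title").text)
```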
To read this file, it is important that the XML be streamed rather than read entirely into memory, as a DOM parser would do. The iterparse function in the xml.etree.ElementTree module can be used for this. The following imports are needed for this example. For the complete source code see the following GitHub link.
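As a minimal sketch of the streaming approach, the generator below walks the file with iterparse and clears each page element after it has been consumed, so memory use stays bounded regardless of file size (the function name is my own, not the article's):

```python
import xml.etree.ElementTree as etree

def stream_pages(source):
    """Yield each completed <page> element from a streamed XML source."""
    for event, elem in etree.iterparse(source, events=("start", "end")):
        # Tags may carry a '{namespace}' prefix, so match on the suffix.
        if event == "end" and elem.tag.endswith("page"):
            yield elem
            elem.clear()  # release the finished subtree to bound memory use
```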
The following constants are defined to specify the three export files and the path. Adjust the path to the location on your computer that holds the Wikipedia articles XML dump.
This example program will separate the articles, redirects, and templates into three CSV files.
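Such constants might look like the following. The names and the path are placeholders of my choosing, not the article's exact values; adjust them to your machine:

```python
# Hypothetical constant names and a placeholder path; adjust to your setup.
PATH_WIKI_XML = "/data/wikipedia/"
FILENAME_WIKI = "enwiki-latest-pages-articles.xml"
FILENAME_ARTICLES = "articles.csv"
FILENAME_REDIRECT = "redirect.csv"
FILENAME_TEMPLATE = "template.csv"
```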
I use the following function to display elapsed time. This program typically took about 30 minutes on my computer.
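A sketch of such a helper is shown below; the function name and H:MM:SS format are assumptions:

```python
import time

def hms_string(sec_elapsed):
    """Format a duration in seconds as H:MM:SS.SS (name is an assumption)."""
    h = int(sec_elapsed / 3600)
    m = int(sec_elapsed % 3600 / 60)
    s = sec_elapsed % 60
    return f"{h}:{m:02d}:{s:05.2f}"

start_time = time.time()
# ... long-running work would go here ...
print(f"Elapsed time: {hms_string(time.time() - start_time)}")
```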
The following function is used to strip the namespaces from the tags.
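ElementTree reports each tag as '{namespace}name', so a small helper can strip the braced prefix. A sketch, assuming the helper is named strip_tag_name:

```python
def strip_tag_name(tag):
    """Remove the '{namespace}' prefix that ElementTree attaches to tag names."""
    idx = tag.rfind("}")
    return tag[idx + 1:] if idx != -1 else tag
```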
Set up the filenames according to the path:
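One way to build the full paths, assuming the constants described earlier (placeholder names, repeated here so this snippet stands alone):

```python
import os

# Placeholder constants, repeated so this snippet is self-contained.
PATH_WIKI_XML = "/data/wikipedia/"
FILENAME_WIKI = "enwiki-latest-pages-articles.xml"
FILENAME_ARTICLES = "articles.csv"
FILENAME_REDIRECT = "redirect.csv"
FILENAME_TEMPLATE = "template.csv"

pathWikiXML = os.path.join(PATH_WIKI_XML, FILENAME_WIKI)
pathArticles = os.path.join(PATH_WIKI_XML, FILENAME_ARTICLES)
pathArticlesRedirect = os.path.join(PATH_WIKI_XML, FILENAME_REDIRECT)
pathTemplateRedirect = os.path.join(PATH_WIKI_XML, FILENAME_TEMPLATE)
```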
Reset counters to track the types of pages found.
Begin streaming the XML file and write the header rows for the three CSV files that will be built from the data found in the XML.
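A sketch of this step: open the three output files, write a header row to each, and start the iterparse stream. The function name and column choices are assumptions, not the article's exact layout:

```python
import csv
import xml.etree.ElementTree as etree

def open_outputs_and_stream(xml_path, articles_path, redirect_path, template_path):
    """Open the three CSV outputs, write header rows, and stream the XML."""
    with open(articles_path, "w", newline="", encoding="utf-8") as a_fh, \
         open(redirect_path, "w", newline="", encoding="utf-8") as r_fh, \
         open(template_path, "w", newline="", encoding="utf-8") as t_fh:
        articles = csv.writer(a_fh)
        redirects = csv.writer(r_fh)
        templates = csv.writer(t_fh)

        # Header rows for the three output files (assumed columns).
        articles.writerow(["id", "title", "redirect"])
        redirects.writerow(["id", "title", "redirect"])
        templates.writerow(["id", "title"])

        for event, elem in etree.iterparse(xml_path, events=("start", "end")):
            pass  # per-tag processing goes here
```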
Process all of the start/end tags and obtain the name (tname) of each tag.
For end tags, collect the title, id, redirect, ns and page tags, which have the following meanings:
title - The title of the page.
id - The internal Wikipedia ID for the page.
redirect - The page this page redirects to, if any.
ns - The namespace, which identifies the type of page; namespace 10 is a template page.
page - The page itself, which contains the previously listed tags.
The following code processes these tag types:
Once a page ends, we can collect the other values.
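Putting these steps together, the following is a runnable sketch of the whole loop. The function and variable names are my own, and the CSV columns are assumptions; the routing logic (namespace 10 to templates, pages with a redirect target to redirects, everything else to articles) follows the description above:

```python
import xml.etree.ElementTree as etree

def strip_tag_name(tag):
    """Remove the '{namespace}' prefix ElementTree attaches to tag names."""
    idx = tag.rfind("}")
    return tag[idx + 1:] if idx != -1 else tag

def export_pages(xml_source, articles_writer, redirect_writer, template_writer):
    """Stream the dump, routing each page to the matching CSV writer."""
    title = None
    page_id = None
    redirect = ""
    ns = 0
    counts = {"articles": 0, "redirects": 0, "templates": 0}

    for event, elem in etree.iterparse(xml_source, events=("start", "end")):
        tname = strip_tag_name(elem.tag)
        if event == "start":
            if tname == "page":
                # Reset per-page state at the start of each <page>.
                title = None
                page_id = None
                redirect = ""
                ns = 0
        else:  # end tag: collect the values described above
            if tname == "title":
                title = elem.text
            elif tname == "id" and page_id is None:
                # The first <id> inside a page is the page ID; later <id>
                # tags belong to the revision or contributor.
                page_id = int(elem.text)
            elif tname == "redirect":
                redirect = elem.attrib.get("title", "")
            elif tname == "ns":
                ns = int(elem.text)
            elif tname == "page":
                # Once a page ends, write it to the appropriate file.
                if ns == 10:
                    template_writer.writerow([page_id, title])
                    counts["templates"] += 1
                elif redirect:
                    redirect_writer.writerow([page_id, title, redirect])
                    counts["redirects"] += 1
                else:
                    articles_writer.writerow([page_id, title, redirect])
                    counts["articles"] += 1
                elem.clear()  # free the finished subtree
    return counts
```

The per-page state is reset on the page start tag rather than after the end tag so that values from one page can never leak into the next.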