Heaton Research

Reading Wikipedia XML Dumps with Python

Wikipedia contains a vast amount of data, and it is possible to make use of this data in computer programs for a variety of purposes. However, the sheer size of Wikipedia makes this difficult. You should not pull this data from the live Wikipedia site programmatically; such access would generate a large volume of additional traffic for Wikipedia and would likely result in your IP address being banned. Rather, you should download an offline copy of Wikipedia for your use. There are a variety of Wikipedia dump files available. However, for this demonstration we will make use of the XML file that contains just the latest version of each Wikipedia article. The file that you will need to download is named:

enwiki-latest-pages-articles.xml

This file can be found at the following location:

https://dumps.wikimedia.org/enwiki/latest/

The file is compressed with bzip2 (the actual download is named enwiki-latest-pages-articles.xml.bz2), so you must decompress it before use.
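
If you prefer to decompress the download from Python rather than with a command-line tool, the standard library's bz2 module can do so in a streaming fashion. This is a minimal sketch; the compressed filename and output path are assumptions that you should adjust for your machine.

import bz2
import shutil

# Stream-decompress the bzip2 dump without loading it into memory.
with bz2.open('enwiki-latest-pages-articles.xml.bz2', 'rb') as src, \
        open('enwiki-latest-pages-articles.xml', 'wb') as dst:
    shutil.copyfileobj(src, dst, length=1024 * 1024)  # copy in 1 MB chunks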

Format of the Wikipedia XML Dump

Do not try to open the enwiki-latest-pages-articles.xml file directly with an XML or text editor, as it is very large. The listing below shows the beginning of this file. As you can see, the file is made up of page tags, each of which contains revision tags.

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>https://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.29.0-wmf.12</generator>
    <case>first-letter</case>
    <namespaces>
      ...
    </namespaces>
  </siteinfo>
  <page>
    <title>AccessibleComputing</title>
    <ns>0</ns>
    <id>10</id>
    <redirect title="Computer accessibility" />
    <revision>
      <id>631144794</id>
      <parentid>381202555</parentid>
      <timestamp>2014-10-26T04:50:23Z</timestamp>
      <contributor>
        <username>Paine Ellsworth</username>
        <id>9092818</id>
      </contributor>
      <comment>add [[WP:RCAT|rcat]]s</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">#REDIRECT [[Computer accessibility]]

{{Redr|move|from CamelCase|up}}</text>
      <sha1>4ro7vvppa5kmm0o1egfjztzcwd0vabw</sha1>
    </revision>
  </page>
  <page>
    <title>Anarchism</title>
    <ns>0</ns>
    <id>12</id>
    <revision>
      <id>766348469</id>
      <parentid>766047928</parentid>
      <timestamp>2017-02-19T18:08:07Z</timestamp>
      <contributor>
        <username>GreenC bot</username>
        <id>27823944</id>
      </contributor>
      <minor />
      <comment>Reformat 1 archive link. [[User:Green Cardamom/WaybackMedic_2.1|Wayback Medic 2.1]]</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">
...
      </text>
    </revision>
  </page>
  ...
</mediawiki>
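
If you would like to peek at the start of the file yourself without opening it in an editor, a few lines of Python are enough. This sketch simply prints the first lines of the decompressed dump; the filename is assumed to match the download described above.

# Print only the first 40 lines of the dump; never load the whole file.
with open('enwiki-latest-pages-articles.xml', encoding='utf-8') as f:
    for line_number, line in enumerate(f):
        if line_number >= 40:
            break
        print(line.rstrip())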

To read this file, it is important that the XML is streamed rather than read entirely into memory, as a DOM parser would do. The iterparse function in the xml.etree.ElementTree module can do this; the basic pattern is sketched below, and the imports needed for this example follow it. For the complete source code see the following GitHub link.
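
The key idea is to handle each element as soon as its end tag arrives and then release it with elem.clear(), so memory use stays roughly constant no matter how large the dump is. This is a simplified sketch, not the full program from this article; the filename is assumed to match the download described above.

import xml.etree.ElementTree as etree

# Streaming pattern: process each element as it completes, then free it.
for event, elem in etree.iterparse('enwiki-latest-pages-articles.xml', events=('end',)):
    # ... inspect elem.tag and elem.text here ...
    elem.clear()  # release the element so it can be garbage collected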

import xml.etree.ElementTree as etree
import codecs
import csv
import time
import os

The following constants specify the XML dump filename, the three output CSV files, and the path. Adjust PATH_WIKI_XML to the location on your computer that holds the Wikipedia articles XML dump.

PATH_WIKI_XML = 'C:\\Users\\jeffh\\data\\'
FILENAME_WIKI = 'enwiki-latest-pages-articles.xml'
FILENAME_ARTICLES = 'articles.csv'
FILENAME_REDIRECT = 'articles_redirect.csv'
FILENAME_TEMPLATE = 'articles_template.csv'
ENCODING = "utf-8"

This example program separates the articles, redirects, and templates into three CSV files.

I use the following function to display the elapsed time. This program typically took about 30 minutes to run on my computer.

# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)
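
As a quick illustration (not part of the original program), an elapsed time of 3,825.5 seconds formats as 1:03:45.50:

# Example usage of hms_string.
print(hms_string(3825.5))  # prints 1:03:45.50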

The following function strips the namespace prefix from a tag name.

def strip_tag_name(t):
    # Remove the '{namespace}' prefix, if present, leaving the bare tag name
    idx = t.rfind("}")
    if idx != -1:
        t = t[idx + 1:]
    return t
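
In the dump, iterparse reports every tag with the MediaWiki export namespace prefixed, for example {http://www.mediawiki.org/xml/export-0.10/}page. The helper above reduces that to the bare name, as this quick illustration (not part of the original program) shows:

# Example: strip the namespace prefix that iterparse includes in tag names.
print(strip_tag_name('{http://www.mediawiki.org/xml/export-0.10/}page'))  # prints: page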

Set up the full file paths by joining each filename with the path:

pathWikiXML = os.path.join(PATH_WIKI_XML, FILENAME_WIKI)
pathArticles = os.path.join(PATH_WIKI_XML, FILENAME_ARTICLES)
pathArticlesRedirect = os.path.join(PATH_WIKI_XML, FILENAME_REDIRECT)
pathTemplateRedirect = os.path.join(PATH_WIKI_XML, FILENAME_TEMPLATE)

Initialize counters to track the types of pages found, and record the start time.

totalCount = 0
articleCount = 0
redirectCount = 0
templateCount = 0
title = None
start_time = time.time()

Open the three CSV files that will be built from the data found in the XML, and write their header rows.

with codecs.open(pathArticles, "w", ENCODING) as articlesFH, \
        codecs.open(pathArticlesRedirect, "w", ENCODING) as redirectFH, \
        codecs.open(pathTemplateRedirect, "w", ENCODING) as templateFH:
    articlesWriter = csv.writer(articlesFH, quoting=csv.QUOTE_MINIMAL)
    redirectWriter = csv.writer(redirectFH, quoting=csv.QUOTE_MINIMAL)
    templateWriter = csv.writer(templateFH, quoting=csv.QUOTE_MINIMAL)

    articlesWriter.writerow(['id', 'title', 'redirect'])
    redirectWriter.writerow(['id', 'title', 'redirect'])
    templateWriter.writerow(['id', 'title'])

Stream the file with iterparse, handling both start and end events and obtaining the stripped name (tname) of each tag. This loop is nested inside the with block above; when a page start tag is seen, the per-page variables are reset.

    for event, elem in etree.iterparse(pathWikiXML, events=('start', 'end')):
        tname = strip_tag_name(elem.tag)

        if event == 'start':
            if tname == 'page':
                title = ''
                id = -1
                redirect = ''
                inrevision = False
                ns = 0
            elif tname == 'revision':
                # Do not pick up on revision id's
                inrevision = True

For end tags, handle the title, id, redirect, ns, and page tags, which mean:

  • title - The title of the page.
  • id - The internal Wikipedia ID for the page.
  • redirect - The title of the page that this page redirects to, if any.
  • ns - The namespace, which identifies the type of page; namespace 10 is a template page.
  • page - The page itself, which contains the previously listed tags.

The following code processes these tag types:

        else:
            if tname == 'title':
                title = elem.text
            elif tname == 'id' and not inrevision:
                id = int(elem.text)
            elif tname == 'redirect':
                redirect = elem.attrib['title']
            elif tname == 'ns':
                ns = int(elem.text)

Once a page's end tag is reached, we can write the collected values to the appropriate CSV file: templates (namespace 10) go to the template file, pages with a redirect target go to the redirect file, and everything else is written as a regular article.

            elif tname == 'page':
                totalCount += 1

                if ns == 10:
                    templateCount += 1
                    templateWriter.writerow([id, title])
                elif len(redirect) > 0:
                    # A non-empty redirect target means this page is a redirect
                    redirectCount += 1
                    redirectWriter.writerow([id, title, redirect])
                else:
                    articleCount += 1
                    articlesWriter.writerow([id, title, redirect])

Display a progress update every 100,000 pages, and clear the element to free memory.


                if totalCount > 1 and (totalCount % 100000) == 0:
                    print("{:,}".format(totalCount))

                elem.clear()

Display final stats.

elapsed_time = time.time() - start_time

print("Total pages: {:,}".format(totalCount))
print("Template pages: {:,}".format(templateCount))
print("Article pages: {:,}".format(articleCount))
print("Redirect pages: {:,}".format(redirectCount))
print("Elapsed time: {}".format(hms_string(elapsed_time)))
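
Once the run completes, the three CSV files can be loaded like any other tabular data for further processing. A minimal sketch, assuming you have pandas installed and are reusing the path variables defined earlier:

import pandas as pd

# Load the generated CSV files for further processing.
articles = pd.read_csv(pathArticles)
redirects = pd.read_csv(pathArticlesRedirect)
templates = pd.read_csv(pathTemplateRedirect)

print(articles.head())
print("Articles loaded: {:,}".format(len(articles)))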