Replacing dead links for the Dutch Royal Library (Koninklijke Bibliotheek)

Posted on 12 September 2015 in wikimedia

This is mostly a set of notes to allow others to perform these replacements as well. They are written for the specific case of the KB, but should be applicable to URL replacements in general.

The original requests can be found at 1 and 2.

Requirements:

  • a working pywikibot installation
  • a set of replacements from the KB, as an Excel file (e.g. this set)

What we need to do:

  • reformat the data into a format Pywikibot accepts
  • run the replacements on the live wikis

Step 1: Reformatting

We split the file into two files: one with replacements and one with pages to work on. This makes the rest of the processing much easier, at the cost of a few extra requests to the wikis.

  • Create a file called urls which contains all the wiki page URLs listed in the Excel file, one per line
  • Create a file called replacements which contains the two columns with old and new URLs. These columns should be tab-separated, and should not have any headers.
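
If the sheet is first exported as tab-separated text, the split can also be scripted. The following is only a sketch: it assumes three columns in the order wiki page URL, old URL, new URL, plus one header row. That layout and the names split_export and write_files are illustrative assumptions, not taken from the KB file.

```python
def split_export(lines, has_header=True):
    """Split tab-separated export lines into page URLs and (old, new) pairs.

    Assumes three columns: wiki page URL, old URL, new URL. This layout is
    an assumption; adjust the unpacking to match the actual sheet.
    """
    urls = []
    replacements = []
    for index, line in enumerate(lines):
        if has_header and index == 0:
            continue  # skip the header row
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        page_url, old, new = line.split("\t")
        urls.append(page_url)
        replacements.append((old, new))
    return urls, replacements


def write_files(urls, replacements):
    """Write the 'urls' and 'replacements' files used in the rest of these notes."""
    with open("urls", "w") as url_file:
        for url in urls:
            url_file.write(url + "\n")
    with open("replacements", "w") as repl_file:
        for old, new in replacements:
            # tab-separated, no header line
            repl_file.write(old + "\t" + new + "\n")
```

Calling split_export on the lines of the exported file and passing the result to write_files would produce both files in the format described above.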

Reformatting urls

We now have to reformat the URLs to page titles Pywikibot understands. We use the following script for that:

# Reformat 'urls' into 'nlpages' and 'pages'

import re

# Capture the language code (first host label) and the page title after /wiki/
matcher = re.compile(r".*?//(\w+)\..*?/wiki/(.*)")

with open('urls') as urls, open('nlpages', 'w') as nlout, open('pages', 'w') as oout:
    for line in urls:
        line = line.strip()
        match = matcher.match(line)
        if not match:
            print("Failed to match %s" % line)
            continue
        lang, pagename = match.groups()
        link = "[[%s:%s]]\n" % (lang, pagename)
        # Dutch pages get their own file, as they use a different edit summary
        if lang == "nl":
            nlout.write(link)
        else:
            oout.write(link)

This script reads the list of URLs and extracts the language code and page title from each URL (this assumes you are only working on a single wiki family). The pages are split between the local (nl) wiki and the other wikis, as we will use a different edit summary for each.
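
To see what the pattern does, it can be tried on a few URLs in isolation. The sketch below reuses the same regular expression; the helper name url_to_link and the sample URLs are only illustrations.

```python
import re

# Same pattern as in the script above.
matcher = re.compile(r".*?//(\w+)\..*?/wiki/(.*)")


def url_to_link(url):
    """Return a '[[lang:title]]' link for a wiki page URL, or None if it does not match."""
    match = matcher.match(url.strip())
    if not match:
        return None
    lang, pagename = match.groups()
    return "[[%s:%s]]" % (lang, pagename)


# Illustrative URLs, not taken from the KB data.
print(url_to_link("https://nl.wikipedia.org/wiki/Koninklijke_Bibliotheek_(Nederland)"))
# [[nl:Koninklijke_Bibliotheek_(Nederland)]]
print(url_to_link("https://en.wikipedia.org/wiki/National_Library_of_the_Netherlands"))
# [[en:National_Library_of_the_Netherlands]]
print(url_to_link("https://www.kb.nl/"))
# None (no /wiki/ path, so the script would report "Failed to match")
```

The language code is simply the first label of the host name, which is why this only works while all pages are in one wiki family.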

Reformatting replacements

We reformat the replacements file into two files: one with plain-text replacements and one with regex-based replacements. The first is used for replacements within the same site, while the second is used for e.g. kb.nl to archive.org replacements.

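import re

replacements = set()

with open('replacements') as f, \
        open('replacements_kb_kb', 'w') as out1, \
        open('replacements_kb_other', 'w') as out2:
    for line in f:
        old, new = line.strip().replace('\r', '').split("\t")
        # Skip duplicate (old, new) pairs
        if (old, new) in replacements:
            continue
        replacements.add((old, new))

        new = new.replace("https://", "http://")
        if new.startswith("http://www.kb.nl"):
            # Replacement within kb.nl: strip the leading "http://www."
            # (11 characters) from both URLs; this assumes the old URL
            # also starts with "http://www."
            new = new[11:]
            old = old[11:]
            out1.write(old + "\n" + new + "\n")
        else:
            # Replacement to another site: match the old URL as a regex,
            # with or without scheme, "//" and "www."; re.escape keeps
            # "." and "?" in the URL from being read as regex operators
            old = re.escape(old.split("www.")[1])
            old = r"(https?:)?(//)?(www\.)?" + old
            out2.write(old + "\n" + new + "\n")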

This gives us two files: replacements_kb_kb and replacements_kb_other. In both, each old URL is on one line and its replacement on the next.
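
The effect of the optional prefix in the regex-based file can be checked with a small experiment. This is a sketch with made-up URLs. The helper name build_pattern is hypothetical, and the URL tail is passed through re.escape here so that dots match literally.

```python
import re


def build_pattern(old):
    """Build a regex matching `old` with or without scheme, '//' and 'www.'."""
    tail = old.split("www.")[1]  # assumes the old URL contains "www."
    return re.compile(r"(https?:)?(//)?(www\.)?" + re.escape(tail))


# Illustrative pair; these URLs are made up for the example.
old = "http://www.kb.nl/oude/pagina.html"
new = "http://web.archive.org/web/2015/http://www.kb.nl/oude/pagina.html"

pattern = build_pattern(old)
for text in ["http://www.kb.nl/oude/pagina.html",
             "https://kb.nl/oude/pagina.html",
             "//www.kb.nl/oude/pagina.html"]:
    # Each variant is rewritten to the same new URL.
    print(pattern.sub(new, text))
```

Note that an archive URL of this form contains the old URL, so applying the same replacement a second time to an already fixed page would match again inside the new link.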

Running replacements

We now run replace.py four times, once for each combination of the newly generated page and replacement files. We start with the local replacements:

pwb.py replace "-file:nlpages" "-replacementfile:replacements_kb_kb" "-summary:[[nl:Wikipedia:Verzoekpagina_voor_bots#Dode_URLs_.28404s.29_vervangen_in_artikelen_-_in_bulk_via_Excel-bestand_-_2e_ronde]]"
pwb.py replace "-file:nlpages" "-replacementfile:replacements_kb_other" "-regex" "-summary:[[nl:Wikipedia:Verzoekpagina_voor_bots#Dode_URLs_.28404s.29_vervangen_in_artikelen_-_in_bulk_via_Excel-bestand_-_2e_ronde]]"

and then the replacements on other wikis:

pwb.py replace "-file:pages" "-replacementfile:replacements_kb_kb" "-summary:Human-assisted replacement of broken URLs based on [[nl:Wikipedia:Verzoekpagina_voor_bots#Dode_URLs_.28404s.29_vervangen_in_artikelen_-_in_bulk_via_Excel-bestand_-_2e_ronde]]"
pwb.py replace "-file:pages" "-replacementfile:replacements_kb_other" "-regex" "-summary:Human-assisted replacement of broken URLs based on [[nl:Wikipedia:Verzoekpagina_voor_bots#Dode_URLs_.28404s.29_vervangen_in_artikelen_-_in_bulk_via_Excel-bestand_-_2e_ronde]]"

Each replace.py run will show us the replacements that will be applied. In the case of the regex replacements, extra care should be taken to make sure the replacements are correct.
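
One possible pre-check before starting the regex runs, which was not part of the original workflow, is to confirm that every pattern in replacements_kb_other at least compiles. The sketch below assumes the alternating pattern / replacement layout described earlier; check_patterns is a hypothetical helper name.

```python
import re


def check_patterns(lines):
    """Return (line_number, pattern, error) for every pattern that fails to compile.

    `lines` holds alternating pattern / replacement lines; only the
    pattern lines (1st, 3rd, 5th, ...) are checked.
    """
    problems = []
    stripped = [line.rstrip("\r\n") for line in lines]
    for index in range(0, len(stripped), 2):
        pattern = stripped[index]
        try:
            re.compile(pattern)
        except re.error as exc:
            problems.append((index + 1, pattern, str(exc)))
    return problems


# Possible usage (file name as generated earlier in these notes):
#
#     with open("replacements_kb_other") as f:
#         for number, pattern, error in check_patterns(f):
#             print("line %d: %s (%s)" % (number, pattern, error))
```

This only catches syntax errors. A pattern that compiles but matches too much or too little still has to be spotted in the diffs that replace.py shows.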