Replacing dead links for the Dutch Royal Library (Koninklijke Bibliotheek)
Posted on 12 September 2015 in wikimedia
This is mostly a set of notes to allow others to perform these replacements as well. They are written for the specific case of the KB, but should be applicable to URL replacements in general.
The original requests can be found at 1 and 2.
Requirements:
- a working pywikibot installation
- a set of replacements from the KB, as an Excel file (e.g. this set)
What we need to do:
- reformat the data into a format Pywikibot accepts
- run the replacements on the live wikis
Step 1: Reformatting
We split the file into two files: one with replacements and one with pages to work on. This makes the rest of the processing much easier, at the cost of a few extra requests to the wikis.
- Create a file called `urls`, which contains all the wiki page URLs listed in the Excel file
- Create a file called `replacements`, which contains the two columns with old and new URLs. These columns should be tab-separated, and should not have any headers.
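How these two files are produced is up to you; below is a minimal sketch using openpyxl, assuming a file named kb_replacements.xlsx with a header row, the wiki page URL in column A and the old and new URLs in columns B and C (adjust to the actual layout of the KB file):

# Minimal sketch: extract 'urls' and 'replacements' from the Excel file.
# Assumptions: filename kb_replacements.xlsx, a header row, page URL in
# column A, old URL in column B, new URL in column C.
from openpyxl import load_workbook

wb = load_workbook('kb_replacements.xlsx', read_only=True)
ws = wb.active

with open('urls', 'w') as urls, open('replacements', 'w') as repls:
    # min_row=2 skips the header row; duplicate page URLs are not filtered
    for page_url, old_url, new_url in ws.iter_rows(min_row=2, max_col=3, values_only=True):
        if not page_url:
            continue
        urls.write(page_url + "\n")
        repls.write(old_url + "\t" + new_url + "\n")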
Reformatting urls
We now have to reformat the URLs to page titles Pywikibot understands. We use the following script for that:
# Reshape 'urls' into 'nlpages' and 'pages'
import re

# Capture the language code (subdomain) and the page title from each wiki URL
matcher = re.compile(r".*?//(\w+)\..*?/wiki/(.*)")

nlout = open('nlpages', 'w')
oout = open('pages', 'w')

for line in open('urls'):
    line = line.strip()
    match = matcher.match(line)
    if not match:
        print("Failed to match", line)
    else:
        lang, pagename = match.groups()
        line = "[[%s:%s]]\n" % (lang, pagename)
        if lang == "nl":
            nlout.write(line)
        else:
            oout.write(line)

nlout.close()
oout.close()
This script reads the list of URLs, and parses the language code and page title from the URL (this assumes you are only working on a single wiki family). The pages are split between the local (`nl`) wiki and other wikis, as we will use a different edit summary for both.
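For example (with hypothetical article titles), full article URLs are turned into interwiki-style page links of the kind the -file argument reads:

# Quick check of the URL-to-page-link conversion on two sample URLs
# (the article titles are hypothetical; the real input is the 'urls' file)
import re

matcher = re.compile(r".*?//(\w+)\..*?/wiki/(.*)")

for url in ["https://nl.wikipedia.org/wiki/Koninklijke_Bibliotheek_(Nederland)",
            "https://en.wikipedia.org/wiki/Royal_Library_of_the_Netherlands"]:
    lang, pagename = matcher.match(url).groups()
    print("[[%s:%s]]" % (lang, pagename))

# prints:
# [[nl:Koninklijke_Bibliotheek_(Nederland)]]
# [[en:Royal_Library_of_the_Netherlands]]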
Reformatting replacements
We reformat the replacements file into two files: one with plain-text replacements and one with regex-based replacements. The first is used for replacements that stay on the same site, while the second is used for replacements to other sites (e.g. kb.nl to archive.org).
# Split 'replacements' into literal (kb.nl to kb.nl) and regex-based
# (kb.nl to other sites) replacement files
f = open('replacements')
out1 = open('replacements_kb_kb', 'w')
out2 = open('replacements_kb_other', 'w')

replacements = set()
for line in f:
    old, new = line.strip().replace('\r', '').split("\t")
    # Skip duplicate (old, new) pairs
    if (old, new) in replacements:
        continue
    replacements.add((old, new))

    new = new.replace("https://", "http://")
    if new.startswith("http://www.kb.nl"):
        # New URL is still on kb.nl: strip the "http://www." prefix
        # (11 characters) from both sides for a plain-text replacement
        new = new[11:]
        old = old[11:]
        out1.write(old + "\n" + new + "\n")
    else:
        # New URL points elsewhere (e.g. archive.org): turn the old URL
        # into a regex that matches any protocol/www variant
        old = old.split("www.")[1]
        old = r"(https?:)?(//)?(www\.)?" + old
        out2.write(old + "\n" + new + "\n")

out1.close()
out2.close()
Which gives us two files: `replacements_kb_kb` and `replacements_kb_other`.
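For example, a made-up kb.nl-to-kb.nl pair appears in `replacements_kb_kb` with the http://www. prefix stripped from both URLs:

kb.nl/organisatie/oude-pagina.html
kb.nl/organisatie/nieuwe-pagina.html

while a made-up kb.nl-to-archive.org pair appears in `replacements_kb_other` as a regex matching any protocol/www variant of the old URL, followed by the full new URL (both files use the alternating old/new lines written by the script above):

(https?:)?(//)?(www\.)?kb.nl/dossiers/verdwenen.html
http://web.archive.org/web/20150101000000/http://www.kb.nl/dossiers/verdwenen.html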
Running replacements
We now run `replace.py` four times, once for each combination of the newly-generated files.
We start with the local replacements:
pwb.py replace "-file:nlpages" "-replacementfile:replacements_kb_kb" "-summary:[[nl:Wikipedia:Verzoekpagina_voor_bots#Dode_URLs_.28404s.29_vervangen_in_artikelen_-_in_bulk_via_Excel-bestand_-_2e_ronde]]"`
pwb.py replace "-file:nlpages" "-replacementfile:replacements_kb_other" "-regex" "-summary:[[nl:Wikipedia:Verzoekpagina_voor_bots#Dode_URLs_.28404s.29_vervangen_in_artikelen_-_in_bulk_via_Excel-bestand_-_2e_ronde]]"
and then the replacements on other wikis:
pwb.py replace "-file:pages" "-replacementfile:replacements_kb_kb" "-summary:Human-assisted replacement of broken URLs based on [[nl:Wikipedia:Verzoekpagina_voor_bots#Dode_URLs_.28404s.29_vervangen_in_artikelen_-_in_bulk_via_Excel-bestand_-_2e_ronde]]"
pwb.py replace "-file:pages" "-replacementfile:replacements_kb_other" "-regex" "-summary:Human-assisted replacement of broken URLs based on [[nl:Wikipedia:Verzoekpagina_voor_bots#Dode_URLs_.28404s.29_vervangen_in_artikelen_-_in_bulk_via_Excel-bestand_-_2e_ronde]]"
Each `replace.py` run will show us the replacements that will be applied. In the case of the regex replacements, extra care should be taken to make sure the replacements are correct.
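Before letting the bot edit, it can also help to spot-check the regex replacements offline against a snippet of wikitext. A minimal sketch (the sample wikitext and URL are made up for illustration):

# Spot-check the regex replacements against a sample piece of wikitext
# before running the bot (the sample text and URL are hypothetical)
import re

def load_pairs(path):
    # The replacement files contain alternating old/new lines
    lines = [l.rstrip("\n") for l in open(path)]
    return list(zip(lines[0::2], lines[1::2]))

sample = "See [http://www.kb.nl/dossiers/verdwenen.html the KB website] for details."

for old, new in load_pairs('replacements_kb_other'):
    updated, count = re.subn(old, new, sample)
    if count:
        print("%s\n  -> %s" % (old, updated))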