Wikicorpus

This repo records a list of Wikipedia-related corpora.

Off-the-shelf
- wiki.xml: Extracted file from the English Wiki-dump (10 Oct 2014) using Wikipedia_Extractor

Build-It-Yourself
- SeedLing: a seed corpus for the Human Language Project
- Lucene Wiki: After downloading wiki.xml, you can use WikiIndexer.py to index the text with pylucene.
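
Before wiki.xml can be indexed, its articles have to be read one at a time. The following is a minimal sketch of that step, not the code in WikiIndexer.py. It assumes wiki.xml follows the usual Wikipedia Extractor layout, where each article sits between a `<doc id="..." url="..." title="...">` line and a `</doc>` line. The names `iter_docs`, `DOC_OPEN`, and `SAMPLE` are hypothetical, and the sample text is made up for illustration.

```python
import io
import re
from typing import Iterator, TextIO, Tuple

# Assumed header layout; attribute order may differ between extractor versions.
DOC_OPEN = re.compile(
    r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)" title="(?P<title>[^"]*)">'
)


def iter_docs(stream: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (doc_id, title, text) for each <doc>...</doc> block in the stream."""
    doc_id = None
    title = None
    lines = []
    for line in stream:
        stripped = line.strip()
        match = DOC_OPEN.match(stripped)
        if match:
            # A new article starts: remember its metadata and reset the buffer.
            doc_id, title = match.group("id"), match.group("title")
            lines = []
        elif stripped == "</doc>" and doc_id is not None:
            # The article ends: emit it and wait for the next header.
            yield doc_id, title, "\n".join(lines).strip()
            doc_id = None
            title = None
        elif doc_id is not None:
            # Body line inside an article.
            lines.append(line.rstrip("\n"))


# Made-up sample in the assumed format (not taken from the real dump).
SAMPLE = (
    '<doc id="1" url="https://example.org/1" title="Example A">\n'
    "Example A\n"
    "\n"
    "First article body.\n"
    "</doc>\n"
    '<doc id="2" url="https://example.org/2" title="Example B">\n'
    "Example B\n"
    "\n"
    "Second article body.\n"
    "</doc>\n"
)

docs = list(iter_docs(io.StringIO(SAMPLE)))
```

For the real file, the same generator can be fed a handle from `open("wiki.xml", encoding="utf-8")`, so the dump is streamed rather than loaded into memory at once. Each yielded tuple could then be passed to whatever indexer is in use.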