Skip to content

CristianCantoro/Wikipedia-Template-Parser

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Wikipedia-Template-Parser

A simple library for extracting data from Wikipedia templates

Examples

Get all pages that contain the given template

from wikipedia_template_parser import pages_with_template

pages_with_template("Template:Infobox_Italian_comune", lang="en")

By default the script doesn't return user and template pages; if you want them to be returned, you can set the skip_users_and_templates param:

pages_with_template("Template:Infobox_Italian_comune", lang="en", skip_users_and_templates=False)

Get key/value data from all templates used in the given page

from wikipedia_template_parser import data_from_templates

data_from_templates("Trento", lang="it")

Instead of requesting the page on the fly, you can pass the wikitext to data_from_templates and have it parsed:

# get just the text
pisa_text = get_wikitext_from_api("Torre pendente di Pisa", "it")

# an alternative could be be:
#   with open('some_file_with_wikitext.txt', 'r') as in_:
#        pisa_text = in_.read().decode('utf-8')

# manipulate the text as you wish
# ... do stuff ...

data_from_templates("Torre pendente di Pisa",
                    lang="it",
                    wikitext=pisa_text
                    )

For pages containing the {{coord}} template (it.wiki), data are parsed and returned as a dictionary with 'lat' and 'lon' keys:

from wikipedia_template_parser import data_from_templates

data_from_templates("Cattedrale di San Vigilio", "it")

[{'data': {'lat': '46.067017', 'lon': '11.121385'}, 'name': u'coord'}, 
(...)]

If the coordinates are embedded in another template, like for Template:Infobox_struttura_militare/man you can:

from wikipedia_template_parser import data_from_templates

data_from_templates("Forte Campo Luserna", "it", extra_coords={
                    'infobox struttura militare': [  # lowercase and with no underscores
                        ['LatGradi', 'LatPrimi', 'LatSecondi', 'LatNS'],  # the latitude attributes in the template data
                        ['LongGradi', 'LongPrimi', 'LongSecondi', 'LongEW'],  # the longitude ones
                    ]
                })

[{'data': {'lat': '45.926814', 'lon': '11.3366', (...)}, 'name': u'Infobox_struttura_militare'}, 
(...)]

Get all pages in a given category. A maxdepth parameter can be specified, if omitted it is set to zero (subcategories are not visited)

from wikipedia_template_parser import pages_in_category

pages_in_category("Categoria:Architetture_religiose_d'Italia", "it", maxdepth=20):

pages_in_category("Categoria:Chiese_di_Prato","it")

About

A simple library for extracting data from Wikipedia templates

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%