Attribute Analyzer

Extract useful relationships from Wikipedia's Infobox attributes using mappings from Chris's WikipediaBase. Work for the CSAIL InfoLab

Generating an Infobox Attribute Graph

How to extract unrendered ==> rendered mappings from Chris's WikipediaBase, and then create a network (or graph) such as this one:

Steps:

1: First, find a machine with Chris's WikipediaBase installed, or otherwise install WikipediaBase.

$ git clone https://github.com/fakedrake/WikipediaBase.git
$ sudo pip install wikipediabase
$ sudo apt-get install libxml2-dev libxslt1-dev python-dev  # some ubuntu machines might not have some packages installed
$ sudo pip install -r requirements.txt

2: Clone this repo and copy some key files onto the machine with WikipediaBase installed.

$ git clone https://github.com/michaelsilver/Attribute-Analyzer.git
$ cp Attribute-Analyzer/createNetwork.py WikipediaBase/
$ cp Attribute-Analyzer/synonym_network.py WikipediaBase/

You will also need to put infoboxes.xlsx in the WikipediaBase directory. If you have it on your computer, you can use scp. Syntax: scp /path/to/file username@a:/path/to/destination

3: Run allInfoboxAttributes.py and infoboxes.json should pop up.

$ python allInfoboxAttributes.py

Congratulations, you've now stolen all the data out of WikipediaBase that we need! All the unrendered ==> rendered attribute mappings are now stored in infoboxes.json

4: Put a full clone of the repo on any machine of your choosing, and put infoboxes.json. This way you will be indipendant of WikipediaBase

$ git clone https://github.com/michaelsilver/Attribute-Analyzer.git
$ mkdir data/  # to structure the repo the way we need it
$ scp username@remote-machine:/path/to/file ./data/  # put infoboxes.json where it needs to be
$ python createNetwork.py

Now the graph is saved in data/attributeSynonyms.gpickle. All done; you now have the graph, proceed to "Analyzing the Attribute Graph"

Analyzing the Attribute Graph

All analysis tools are located in the synonym_network.py libarary. To use it, import the necessary libaries, load the saved graph, and you can then do whatever analysis you want.

$ python
>>> import networkx as nx
>>> import synonym_network as sn
>>> G = nx.read_gpickle("data/attributeSynonyms.gpickle")

Then you can do whatever you want with the graph now saved in the variable G; for example,

>>> G.nodes()

will print out all the nodes in the network.

Files

File	Description
`allInfoboxAttributes.py`	Loops through `infoboxes.xlsx` and creates a JSON dictionary (`infoboxes.json`) with all of the `unrendered : rendered` attribute pairs, organized by infobox template name
`findEmpty.py`	Saves another JSON file with a list of `{"Template:Infobox <missed infobox1>" : # of pages, "Template:Infobox <missed infobox2>" : # of pages, … etc}` for infoboxes that `get_meta_infobox('<TEMPLATE_NAME>').rendered_keys()` returns `{}`
`createNetwork.py`	Creates a network of unrendered and rendered infobox attributes in an attempt to identify synonyms. Each node is an attribute, and a directed edge links an unrendered to rendered attribute (in that direction).

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
attribute_analyzer		attribute_analyzer
images		images
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly