This is a school project containing a script that parses KAF / NAF files in Python.
Clone the repository from GitHub:
git clone git@github.com:SBelkaid/Python4Linguist.git
You will need the following Python libraries installed. Usually running the listed pip command is enough:
- lxml (http://lxml.de/): pip install --user lxml
- BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/): pip install --user bs4
- pykml: pip install --user pykml
- numpy: pip install --user numpy
- nltk (http://nltk.org/): pip install --user nltk

Alternatively, install Anaconda (https://www.continuum.io/downloads), which bundles nltk along with other useful modules such as numpy, matplotlib and pandas.
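As a quick, optional sanity check (this helper is not part of the repository), you can verify that all dependencies are importable before running the scripts:

```python
import importlib

def check_installed(names):
    """Return the subset of module names that can actually be imported."""
    found = []
    for name in names:
        try:
            importlib.import_module(name)
            found.append(name)
        except ImportError:
            pass
    return found

# Any library missing from the printed list still needs to be installed.
print(check_installed(["lxml", "bs4", "pykml", "numpy", "nltk"]))
```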
These Python scripts read a KAF or NAF file and parse it, extracting data using XPath. Make sure to place the folder containing the theses (defined in parser.py as DIR_NAME = 'thesis_vu_2015') in the same folder as the parser script. Example of usage:
NameOfComputer: python parser.py #to parse all the files in the theses folder
NameOfComputer: python showStats.py #visualisation
NameOfComputer: python generateKML.py #generation of kml files
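Under the hood, the parsing step uses XPath over the KAF/NAF XML. A minimal sketch with lxml is shown below; the sample document and the element names (wf tokens inside a text layer, following the NAF format) are illustrative, and the actual XPath expressions in parser.py may differ:

```python
from lxml import etree

# A tiny illustrative NAF-style document (not taken from the repository).
NAF_SAMPLE = b"""<NAF xml:lang="en">
  <text>
    <wf id="w1" sent="1">Hello</wf>
    <wf id="w2" sent="1">world</wf>
  </text>
</NAF>"""

root = etree.fromstring(NAF_SAMPLE)
# Select the text of every word form in the text layer.
tokens = root.xpath("//text/wf/text()")
print(tokens)  # -> ['Hello', 'world']
```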
parser.py generates a scripties.json file containing the information parsed from the XML files. It is a dictionary that mirrors the original folder structure of where the files were located, holding the parsed data mentioned above.
scraper.py retrieves all DBpedia location URLs from the scripties.json file and orders them by study programme. It then crawls the URLs, parses the page source and builds a dictionary with URLs as keys and coordinates as values.
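The parsing step can be sketched with BeautifulSoup as follows. The markup below is made up for illustration only; the real DBpedia pages structure their coordinates differently, and scraper.py may extract them another way:

```python
from bs4 import BeautifulSoup

# Hypothetical page source containing coordinates (illustrative markup).
HTML = '<html><body><span class="geo">52.334; 4.866</span></body></html>'

soup = BeautifulSoup(HTML, "html.parser")
# Split the "lat; lon" text into two floats.
lat, lon = (float(p) for p in soup.find("span", class_="geo").text.split(";"))

# Build the url -> coordinates dictionary the script produces.
coords = {"http://dbpedia.org/resource/Amsterdam": (lat, lon)}
print(coords)
```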
showStats.py prints some statistics per language within a programme and per programme alone. To view the statistics for an individual thesis, do the following:
>>> import json
>>> scripties = json.load(open('scripties.json', 'r'))
>>> scripties['ges']['en'][u'Scriptie_Alders_trim.txt.naf.nohyphen']
The above will show all the stats available for the given author.
generateKML.py extracts all locations from all theses of one master programme, gathers the coordinates for these entities and puts them in a KML file.
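The shape of that output can be sketched as follows. This example uses only the standard library for illustration (the actual script uses pykml), and the helper name and input format are assumptions, not code from the repository:

```python
import xml.etree.ElementTree as ET

def placemark_kml(locations):
    """Build a minimal KML document from {place name: (lat, lon)}."""
    kml = ET.Element("kml", xmlns="http://www.opengis.net/kml/2.2")
    doc = ET.SubElement(kml, "Document")
    for name, (lat, lon) in locations.items():
        pm = ET.SubElement(doc, "Placemark")
        ET.SubElement(pm, "name").text = name
        point = ET.SubElement(pm, "Point")
        # KML coordinates use "longitude,latitude" order.
        ET.SubElement(point, "coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")

print(placemark_kml({"Amsterdam": (52.37, 4.9)}))
```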
To show the stats at a more general level instead of per individual author, uncomment the following lines in showStats.py:
# statsPerLanguageAndProgram(stats)
# statsPerProgramme(stats)
Tests have been made available in some docstrings. These can be run like so:
python -m doctest parser.py -v
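For reference, docstring tests look like the sketch below. count_tokens is an illustrative helper, not a function from the repository:

```python
def count_tokens(sentence):
    """Return the number of whitespace-separated tokens.

    >>> count_tokens("KAF and NAF files")
    4
    >>> count_tokens("")
    0
    """
    return len(sentence.split())

if __name__ == "__main__":
    # Equivalent to running: python -m doctest thisfile.py -v
    import doctest
    doctest.testmod(verbose=True)
```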
This is the visualisation of the frequency of the types per language, generated when executing the showStats.py script. This can be done after the parser has finished and the scripties.json file has been generated:
NameOfComputer: python showStats.py
TODO: add multithreading
- Soufyan Belkaid
- s.belkaid@student.vu.nl
- Vrije Universiteit Amsterdam
nothing special