This is a school project containing a script that parses KAF / NAF files in Python.
Clone the repository from GitHub:
git clone git@github.com:SBelkaid/Python4Linguist.git
You will need the following Python libraries installed. Usually running the listed pip command is enough:
- lxml (http://lxml.de/): pip install --user lxml
- BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/): pip install --user bs4
- pykml: pip install --user pykml
- numpy: pip install --user numpy
- nltk (http://nltk.org/): pip install --user nltk

Alternatively, install Anaconda (https://www.continuum.io/downloads), which bundles nltk along with other useful modules such as numpy, matplotlib and pandas.
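As a quick, optional sanity check (this helper is not part of the repository), you can verify that all dependencies are importable before running the scripts:

```python
import importlib

def check_installed(names):
    """Return the subset of module names that can actually be imported."""
    found = []
    for name in names:
        try:
            importlib.import_module(name)
            found.append(name)
        except ImportError:
            pass
    return found

# Any library missing from the printed list still needs to be installed.
print(check_installed(["lxml", "bs4", "pykml", "numpy", "nltk"]))
```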
These Python scripts read a KAF or NAF file and parse it, extracting data using XPath. Make sure to place the folder containing the theses (defined in parser.py as DIR_NAME = 'thesis_vu_2015') in the same folder as the parser script. Example of usage:
NameOfComputer: python parser.py #to parse all the files in the theses folder
NameOfComputer: python showStats.py #visualisation
NameOfComputer: python generateKML.py #generation of kml files
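Under the hood, the parsing step uses XPath over the KAF/NAF XML. A minimal sketch with lxml is shown below; the sample document and the element names (wf tokens inside a text layer, following the NAF format) are illustrative, and the actual XPath expressions in parser.py may differ:

```python
from lxml import etree

# A tiny illustrative NAF-style document (not taken from the repository).
NAF_SAMPLE = b"""<NAF xml:lang="en">
  <text>
    <wf id="w1" sent="1">Hello</wf>
    <wf id="w2" sent="1">world</wf>
  </text>
</NAF>"""

root = etree.fromstring(NAF_SAMPLE)
# Select the text of every word form in the text layer.
tokens = root.xpath("//text/wf/text()")
print(tokens)  # -> ['Hello', 'world']
```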
parser.py generates a scripties.json file containing the information parsed from the XML files. It is a dictionary that mirrors the original folder structure of where the files were located, holding the parsed data mentioned above.
scraper.py retrieves all DBpedia location URLs from the scripties.json file and orders them by study programme. It then crawls the URLs, parses the page source and builds a dictionary with URLs as keys and coordinates as values.
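The parsing step can be sketched with BeautifulSoup as follows. The markup below is made up for illustration only; the real DBpedia pages structure their coordinates differently, and scraper.py may extract them another way:

```python
from bs4 import BeautifulSoup

# Hypothetical page source containing coordinates (illustrative markup).
HTML = '<html><body><span class="geo">52.334; 4.866</span></body></html>'

soup = BeautifulSoup(HTML, "html.parser")
# Split the "lat; lon" text into two floats.
lat, lon = (float(p) for p in soup.find("span", class_="geo").text.split(";"))

# Build the url -> coordinates dictionary the script produces.
coords = {"http://dbpedia.org/resource/Amsterdam": (lat, lon)}
print(coords)
```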
showStats.py prints some statistics per language within a programme and per programme alone. To view the statistics for an individual thesis, do the following:
>>> import json
>>> scripties = json.load(open('scripties.json', 'r'))
>>> scripties['ges']['en'][u'Scriptie_Alders_trim.txt.naf.nohyphen']
The above will show all the stats available for the given author.
generateKML.py extracts all locations from all theses of one master programme, gathers the coordinates for these entities and puts them in a KML file.
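The shape of that output can be sketched as follows. This example uses only the standard library for illustration (the actual script uses pykml), and the helper name and input format are assumptions, not code from the repository:

```python
import xml.etree.ElementTree as ET

def placemark_kml(locations):
    """Build a minimal KML document from {place name: (lat, lon)}."""
    kml = ET.Element("kml", xmlns="http://www.opengis.net/kml/2.2")
    doc = ET.SubElement(kml, "Document")
    for name, (lat, lon) in locations.items():
        pm = ET.SubElement(doc, "Placemark")
        ET.SubElement(pm, "name").text = name
        point = ET.SubElement(pm, "Point")
        # KML coordinates use "longitude,latitude" order.
        ET.SubElement(point, "coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")

print(placemark_kml({"Amsterdam": (52.37, 4.9)}))
```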
To show the stats at a more general level instead of per individual author, uncomment the following lines in showStats.py:
# statsPerLanguageAndProgram(stats)
# statsPerProgramme(stats)
Tests have been made available in some docstrings. These can be run like so:
python -m doctest parser.py -v
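For reference, docstring tests look like the sketch below. count_tokens is an illustrative helper, not a function from the repository:

```python
def count_tokens(sentence):
    """Return the number of whitespace-separated tokens.

    >>> count_tokens("KAF and NAF files")
    4
    >>> count_tokens("")
    0
    """
    return len(sentence.split())

if __name__ == "__main__":
    # Equivalent to running: python -m doctest thisfile.py -v
    import doctest
    doctest.testmod(verbose=True)
```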
This is the visualisation of the frequency of the types per language, generated when executing the showStats.py script. This can be done after the parser has finished and the scripties.json file has been generated:
NameOfComputer: python showStats.py
TODO: add multithreading
- Soufyan Belkaid
- s.belkaid@student.vu.nl
- Vrije Universiteit Amsterdam
nothing special