This site contains part of the materials we will use during the tutorial on Corpus Statistics with Open Source Tools at NASSLLI 2016. The tutorial will be interactive: basic analytical concepts and techniques will be illustrated on the datasets listed below. You are expected to bring a laptop with a Git client installed.
REMARK: This site already contains a case study in corpus analysis that we will discuss
together. At the end of the tutorial, the notes, slides and some extra sample code will
be uploaded to this repository.
The course will rely on two pillars: (1) the R statistical analysis environment and (2) the Python scripting language. A companion tool for R is the RStudio IDE. For Python, you can use the IDE of your choice (e.g., Eclipse with the PyDev plugin). During the tutorial I will help you install and set up most of the required tools and resources, though only for Linux environments. Below, I list the main requirements and references. Additional (but minor) libraries and references will be mentioned as we go.
- R 2.0+, with libraries:
  - languageR (English datasets)
  - infotheo (Shannon entropy)
  - xlsx (to write/read .xls and .csv files)
- RStudio 0.99+ (IDE for R)
- Python 2.7+, with libraries:
  - NumPy 1.0+ (numerical computation)
  - Matplotlib 1.0+ (plotting)
  - SciPy 1.0+ (basic statistics)
  - NLTK 2.0+ (NLP)
  - Gensim (word embeddings)
- Word2Vec models
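
To check the Python side of this setup before the tutorial, a minimal sketch along the following lines may help. It assumes a Python 3 interpreter and that the libraries above are importable under their usual names (`numpy`, `matplotlib`, `scipy`, `nltk`, `gensim`). The helper names `is_installed` and `check_requirements` are made up for illustration and are not part of the course materials. The script does not check library versions.

```python
import importlib.util
import sys

# Import names of the Python libraries listed above.
# Assumption: each library is importable under its usual name.
REQUIRED_MODULES = ["numpy", "matplotlib", "scipy", "nltk", "gensim"]


def is_installed(module_name):
    """Return True if the module can be located (it is not imported)."""
    return importlib.util.find_spec(module_name) is not None


def check_requirements(module_names=REQUIRED_MODULES):
    """Map each module name to True (found) or False (missing)."""
    return {name: is_installed(name) for name in module_names}


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for name, found in check_requirements().items():
        print(f"{name:12s} {'ok' if found else 'MISSING'}")
```

Running the script prints one line per library, so anything reported as missing can be installed before the first session.
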
- Peter Dalgaard. Introductory Statistics with R. Springer, 2009.
- Stefan T. Gries. "Useful statistics for corpus linguistics". In: A Mosaic of Corpus Linguistics: Selected Approaches, pp. 269-291. Peter Lang, 2010.
- Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, 1999.
- Steven Bird, Ewan Klein and Edward Loper. Natural Language Processing with Python. O'Reilly, 2009.
- R. H. Baayen. Analyzing Linguistic Data: A Practical Introduction to Statistics. Cambridge University Press, 2008.