CoLing

A repo of my experiments in Computational Linguistics, till they are mature enough to require their own repos.

##Hot Now

- List of a LOT of Nepali last names!
- Use diff-match-patch to identify common spelling errors.
- Just added the IPYNB.
- Added the segmentation tool. Works remarkably well.
- Need to figure out how to use Word2Vec similarities for the best segmentation.
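
A minimal sketch of the spelling-error idea above. It is an assumption about the approach, not the code from the notebook: it uses Python's standard `difflib` as a stand-in for diff-match-patch, and it assumes a list of (misspelled, correct) word pairs already exists. The names `edit_ops` and `common_errors` and the sample pairs are made up for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def edit_ops(wrong, right):
    """Yield (wrong_fragment, right_fragment) for every span where the words differ."""
    matcher = SequenceMatcher(None, wrong, right, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            yield wrong[i1:i2], right[j1:j2]


def common_errors(pairs):
    """Count how often each character-level edit occurs across all word pairs."""
    counts = Counter()
    for wrong, right in pairs:
        counts.update(edit_ops(wrong, right))
    return counts


# Hypothetical (misspelled, correct) pairs: short vs. long "i" vowel sign.
pairs = [("नेपालि", "नेपाली"), ("पानि", "पानी")]
for (wrong_part, right_part), n in common_errors(pairs).most_common():
    print(repr(wrong_part), "->", repr(right_part), n)
```

Counting edits rather than whole words is what surfaces systematic confusions (such as one vowel sign being swapped for another) instead of a list of individual typos.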

##Papers/Projects of note

###Papers

Here's an interesting list of recent papers/works in Computational Linguistics that I've been following:

Computational Linguistics and Deep Learning by Prof Chris Manning of Stanford.

Gives a nice outline of the field, and explains how traditional NLP people should embrace neural networks and deep learning instead of begrudging them.

A Primer on Neural Network Models for Natural Language Processing by Prof Yoav Goldberg.

Is a really, really good intro to the field, and brings one up-to-date with the happenings in the field.

Distributed Representations of Words and Phrases and their Compositionality by Mikolov et al.

The paper that introduced the Word2Vec model to the world. There are many important follow-up papers too, and I'll update this list with them.

###Projects

For those more humanistically inclined, Benjamin Schmidt has a fantastic post titled "Rejecting the gender binary: a vector-space operation" that will most certainly give one ideas for experiments.
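
The operation named in that title, vector rejection, removes from a word vector its component along some direction (for example, a gender axis such as the difference between the vectors for "he" and "she"). Below is a minimal pure-Python sketch. The toy 2-D vectors and the names `dot` and `reject` are illustrative assumptions, not code from the post or from this repo.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def reject(v, direction):
    """Return the part of v that is orthogonal to direction.

    This is v minus its projection onto direction.
    """
    scale = dot(v, direction) / dot(direction, direction)
    return [a - scale * b for a, b in zip(v, direction)]


# Toy example: treat the first axis as the "gender" direction.
gender_axis = [1.0, 0.0]
word = [3.0, 4.0]
neutral = reject(word, gender_axis)
print(neutral)                    # [0.0, 4.0]
print(dot(neutral, gender_axis))  # 0.0, nothing left along the axis
```

With real embeddings the same function would be applied to every word vector, after which nearest-neighbour queries no longer depend on the rejected direction.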

##Downloads

You can download the pre-trained models that I created here. Look in the "Model" folder for instructions and other details.

##Currently Reading and Exploring

This is a list of papers I'm reading/exploring and either figuring out a way to implement or waiting for implementation.

##Running Experiments

Because experiments can often fail miserably, making the experimenter look foolish for not considering the obvious, my first thought was to leave the models for the following experiments out of the datasets that have been made public. However, collecting data is work in itself, and so is processing it. All the data and models mentioned in this repo are available, most likely at my OneDrive linked in the README.

If they are not, send me a message, and they will be made available at the earliest.

- Stripping the Nepali corpus of all vowels, to see by how much the number of 'one-off' words/misspellings decreases.
- Using 3-grams to replace rare words, such that the vocabulary is not reduced. Kind of like a poor man's version of char-level embeddings. Out-of-vocabulary (OOV) words are a problem: 375,000 words have more than 10 occurrences, while 1.4 M words are used less often than that. Misspellings plus morphological complexity are a problem. This is ALL strictly untrained. Difference with existing work.
- Combining the above two, along with ways to retrofit them (look at the reference papers) to make sure multiple meanings of de-voweled words are preserved. This is STRICTLY work in progress.
- Related to the above: need to be able to retrofit the Nepali trained model using a Nepali dictionary. The UChicago librarian had offered to help, but has now gone AWOL. Need to reestablish contact and work on that.
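
The first two experiments above can be sketched as follows. This is a rough illustration under stated assumptions, not the code used for the experiments: "vowels" is taken to mean the Devanagari dependent vowel signs U+093E to U+094C, rare means 10 occurrences or fewer, and the `<`/`>` word-boundary padding and all function names are made up here.

```python
from collections import Counter

# Assumption: strip only the dependent vowel signs (matras), U+093E..U+094C.
MATRAS = {chr(cp) for cp in range(0x093E, 0x094D)}


def devowel(word):
    """Remove dependent vowel signs from a Devanagari word."""
    return "".join(ch for ch in word if ch not in MATRAS)


def count_singletons(tokens):
    """Number of distinct words that occur exactly once ('one-off' words)."""
    return sum(1 for n in Counter(tokens).values() if n == 1)


def char_trigrams(word):
    """Character 3-grams of a word, padded with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def replace_rare(tokens, min_count=10):
    """Keep words seen more than min_count times; split the rest into 3-grams."""
    counts = Counter(tokens)
    out = []
    for tok in tokens:
        if counts[tok] > min_count:
            out.append(tok)
        else:
            out.extend(char_trigrams(tok))
    return out


# Toy corpus: one correct spelling used twice, one misspelling used once.
corpus = ["नेपाली", "नेपालि", "नेपाली"]
print(count_singletons(corpus))                        # 1
print(count_singletons([devowel(t) for t in corpus]))  # 0
print(replace_rare(corpus, min_count=1))
```

Comparing `count_singletons` before and after `devowel` gives the decrease in one-off words. `replace_rare` keeps frequent words intact, so the vocabulary of common words is unchanged while rare words and misspellings share 3-gram tokens.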
#####Projects in digital humanities
- Co-occurrence of castes according to activities?
- Gendered words, and the relationship between gendered pronouns and neighboring words
- More stuff here.

##Work in Progress

Using ~5 million Nepali tweets to train a word2vec model, applying lessons from the running experiments.
