shirish93/CoLing


CoLing

A repo of my experiments in Computational Linguistics, till they are mature enough to require their own repos.

##Hot Now

  • List of a LOT of Nepali last names!

  • Use diff-match-patch to identify common spelling errors

  • Just added the IPYNB

  • Added the segmentation tool. Works remarkably well.
  • Need to figure out the use of Word2Vec similarities for best segmentation.
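
The diff-based spelling-error idea in the list above can be sketched roughly as follows. This is only a minimal illustration, not the notebook's actual code: it uses Python's standard `difflib` as a stand-in for the diff library named above, and the function name `edit_patterns` and the word pairs are made up for the example.

```python
from collections import Counter
from difflib import SequenceMatcher


def edit_patterns(pairs):
    """Count character-level edits that turn each misspelling into its correction.

    `pairs` is an iterable of (misspelled, correct) strings. The result maps
    (wrong_substring, right_substring) to how often that edit was seen.
    """
    patterns = Counter()
    for wrong, right in pairs:
        # autojunk=False turns off difflib's heuristic that treats very
        # frequent characters as junk in long sequences.
        matcher = SequenceMatcher(None, wrong, right, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                patterns[(wrong[i1:i2], right[j1:j2])] += 1
    return patterns


# Hypothetical (misspelled, correct) pairs, for illustration only.
example_pairs = [("नेपालि", "नेपाली"), ("पानि", "पानी")]

for (wrong_part, right_part), count in edit_patterns(example_pairs).most_common():
    print(repr(wrong_part), "->", repr(right_part), count)
```

Sorting the counter by frequency surfaces the most common confusions first. In a real run the pairs would have to come from somewhere, for example a dictionary lookup or nearest-neighbour matching, which this sketch does not cover.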

##Papers/Projects of note

###Papers

Here's an interesting list of recent papers/works in Computational Linguistics that I've been following:

Gives a nice outline of the field, and explains how traditional NLP people should embrace Neural Networks and deep learning instead of begrudging them.

Is a really, really good intro to the field, and brings one up-to-date with the happenings in the field.

The paper that introduced the Word2Vec model to the world. There are many important follow-up papers too, which I'll update this list with.


###Projects


#Downloads

You can download the pre-trained models that I created here. Look in the "Model" folder for instructions and other details.


##Currently Reading and Exploring

This is a list of papers I'm reading/exploring and either figuring out a way to implement or waiting for implementation.


##Running Experiments

Because experiments can often fail miserably,
making the experimenter look foolish for not considering the obvious,
my first thought was to leave the models for the following experiments
out of the datasets that have been made public.

However, collecting data is work in itself, and so is processing it.
All the data models mentioned in this repo are therefore available,
most likely at my OneDrive linked in the README.

If they are not, send me a message, and I will make them available as
soon as possible.

  • Stripping the Nepali corpus of all vowels, to see by how much 'one-off' words/misspellings decrease.

  • Using 3-grams to replace rare words, so that the vocabulary is not reduced. Kind of like a poor man's version of char-level embeddings. Out-of-vocabulary (OOV) words are a problem: about 375,000 words have more than 10 occurrences, and 1.4 M words are used less than that. Misspellings plus morphological complexity are a problem. This is ALL strictly untrained. Difference with existing work.

  • Combining the above two, along with ways to retrofit them (look at the reference papers) to make sure multiple meanings of de-voweled words are preserved. This is STRICTLY work in progress.

  • Related to above: Need to be able to retrofit the Nepali trained model using a Nepali dictionary. The UChicago librarian had offered to help, but has now gone AWOL. Need to reestablish contact and work on that.

  • #####Projects in digital humanities

    • Co-occurrence of castes according to activities?
    • Gendered words, and the relationship between gendered pronouns and neighboring words
    • More stuff here.
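
The first two experiments in the list above (vowel stripping and 3-gram replacement of rare words) could be prototyped along these lines. This is a rough sketch under stated assumptions, not the code used in this repo: "vowels" is taken to mean only the Devanagari dependent vowel signs (matras, U+093E to U+094C), rare words are those with 10 or fewer occurrences, and every function name here is hypothetical.

```python
from collections import Counter

# Assumption: only the dependent vowel signs (matras) U+093E..U+094C are
# stripped. Independent vowels, anusvara, and virama are left untouched.
MATRAS = {codepoint: None for codepoint in range(0x093E, 0x094D)}


def strip_matras(word):
    """Return the word with Devanagari dependent vowel signs removed."""
    return word.translate(MATRAS)


def count_singletons(tokens):
    """Number of distinct tokens that occur exactly once ('one-off' words)."""
    return sum(1 for count in Counter(tokens).values() if count == 1)


def char_ngrams(word, n=3):
    """Character n-grams of the word, padded with '<' and '>' boundary marks.

    Note: slicing by code point can split a consonant from its vowel sign.
    """
    padded = f"<{word}>"
    if len(padded) <= n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def replace_rare(tokens, min_count=10, n=3):
    """Keep words seen more than min_count times; expand the rest to n-grams."""
    counts = Counter(tokens)
    output = []
    for token in tokens:
        if counts[token] > min_count:
            output.append(token)
        else:
            output.extend(char_ngrams(token, n))
    return output


# Toy corpus, made up for illustration: two spellings of one word, plus a
# word that repeats.
toy_corpus = ["नेपाली", "नेपालि", "किताब", "किताब"]
devoweled = [strip_matras(token) for token in toy_corpus]
print("singletons before:", count_singletons(toy_corpus))
print("singletons after:", count_singletons(devoweled))
```

In the toy corpus the two spellings collapse to the same de-voweled form, which is the effect the first experiment is trying to measure at corpus scale. Different words can also collapse to the same form, which is why the retrofitting step mentioned above matters.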

##Work in Progress

  • Using ~5 million Nepali tweets to train a word2vec model/use lessons from running experiments.
