POS_tagger_lstm

A POS tagger trained on the Sequoia corpus with an LSTM (PyTorch).

Given a sequence of words (a sentence), predict the sequence of their part-of-speech tags.

  • Model 1: LSTM using one-hot vectors to encode words
  • Model 2: LSTM using pretrained word embedding vectors to encode words
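
A minimal sketch of what such an LSTM tagger can look like in PyTorch is given below; the class and parameter names (LSTMTagger, hidden_dim, num_tags, ...) are illustrative assumptions, not the repository's exact code.

```python
import torch
import torch.nn as nn

class LSTMTagger(nn.Module):
    """Minimal sketch of an LSTM POS tagger (hypothetical names, not the repo's exact module)."""
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_tags):
        super().__init__()
        # Model 2 uses a (pretrained) embedding layer; for Model 1 this can be replaced
        # by one-hot vectors of dimension vocab_size fed directly to the LSTM.
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_tags)

    def forward(self, word_ids):
        # word_ids: (batch, seq_len) tensor of word indices
        embeds = self.embedding(word_ids)   # (batch, seq_len, embedding_dim)
        lstm_out, _ = self.lstm(embeds)     # (batch, seq_len, hidden_dim)
        return self.out(lstm_out)           # (batch, seq_len, num_tags) tag scores

# Usage sketch:
# model = LSTMTagger(vocab_size=10000, embedding_dim=100, hidden_dim=64, num_tags=17)
# scores = model(torch.randint(0, 10000, (1, 12)))
```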

The baseline for this NLP task (part-of-speech tagging) is the most-frequent-tag baseline: each word is assigned the tag it takes most often in the training data.
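
A sketch of this baseline, assuming training sentences given as lists of (word, tag) pairs (the actual input format in the repository may differ):

```python
from collections import Counter, defaultdict

def train_mft_baseline(tagged_sentences):
    """Map each word to its most frequent tag; unknown words get the overall most frequent tag."""
    word_tag_counts = defaultdict(Counter)
    tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            word_tag_counts[word][tag] += 1
            tag_counts[tag] += 1
    fallback = tag_counts.most_common(1)[0][0]
    lexicon = {w: c.most_common(1)[0][0] for w, c in word_tag_counts.items()}
    return lambda words: [lexicon.get(w, fallback) for w in words]

# Usage sketch:
# tag = train_mft_baseline([[("le", "DET"), ("chat", "NC"), ("dort", "V")]])
# tag(["le", "chat", "mange"])  # -> ["DET", "NC", "V"] ("mange" is unknown, gets the fallback tag)
```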

Requirements

  • Python 3.8.5
  • PyTorch

You need to download the French word embeddings "vecs100-linear-frwiki", trained by M. Coavoux with word2vec (skip-gram model) on the Wikipedia dump frwiki-20140804-corpus.xml.bz2 (650 million words, available at http://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/):

http://www.linguist.univ-paris-diderot.fr/~mcandito/vecs100-linear-frwiki.bz2

Put this file at the root of the repository.

NB: other word embeddings can be used, provided the embedding dimension is 100 and the file is a text file with one embedding per line: the token (word) first, followed by the float values of its vector, all separated by spaces.
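
A sketch of a loader for a file in this format (a hypothetical helper, assuming the archive has been decompressed; the repository's own loading code may differ):

```python
import numpy as np

def load_embeddings(path, dim=100):
    """Read a text file with one embedding per line: a token followed by `dim` space-separated floats."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # skip malformed lines or a possible header
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# Usage sketch:
# vecs = load_embeddings("vecs100-linear-frwiki")
```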

Corpus

https://deep-sequoia.inria.fr/

Corentin Ribeyre, Marie Candito, and Djamé Seddah. 2014. Semi-Automatic Deep Syntactic Annotations of the French Treebank. In Proceedings of the 13th International Workshop on Treebanks and Linguistic Theories (TLT13). Tübingen Universität, Tübingen, Germany.
