- Word embeddings are an important component of many language models that produce state-of-the-art results in several NLP tasks;
- SparseNLP proposes an alternative approach for deriving word embeddings. In contrast with traditional dense vector representations, it creates a sparse distributed representation (SDR) for each word. The representation shares the main idea behind word embeddings, formulated by Firth in 1957: "a word is characterized by the company it keeps";
- The whole methodology is validated by using these word embeddings as language models in several natural language processing tasks.
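To make the contrast with dense vectors concrete, here is a minimal sketch of how sparse representations can be compared. It assumes each SDR is stored as the set of its active bit positions and that similarity is the number of shared active bits; that is a common convention for SDRs, not necessarily what SparseNLP does internally. The words, bit positions, and `SDR_SIZE` are invented for illustration.

```python
# Hypothetical fingerprint size; the real value depends on the project configuration.
SDR_SIZE = 64 * 64

# Each SDR is stored as the set of indices of its active bits (made-up values).
fingerprints = {
    "piano": {3, 17, 256, 1024, 2047},
    "organ": {3, 17, 300, 1024, 4000},
    "tax": {5, 99, 1500, 2222, 3999},
}


def overlap(a, b):
    """Count the active bits shared by two SDRs."""
    return len(a & b)


def sparsity(sdr, size=SDR_SIZE):
    """Fraction of bits that are active in an SDR."""
    return len(sdr) / size


print(overlap(fingerprints["piano"], fingerprints["organ"]))
print(overlap(fingerprints["piano"], fingerprints["tax"]))
print(sparsity(fingerprints["piano"]))
```

Under this convention, words that occur in similar contexts would share many active bits, while unrelated words would share few or none.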
git clone https://github.com/avsilva/sparse-nlp.git
TODO: pip install -r requirements.txt
In order to use SparseNLP you need your data stored in a database table with 2 columns:
- id (int): primary key
- cleaned_text (str): text tokens for each sentence
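As a rough illustration of the expected layout, the sketch below creates such a table and inserts two rows. It assumes SQLite and a table named `sentences`; both are placeholders chosen for the example, since only the two columns are specified here.

```python
import sqlite3

# Assumption: SQLite and the table name "sentences" are illustrative only.
# The required columns are id (int, primary key) and cleaned_text (str).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentences (id INTEGER PRIMARY KEY, cleaned_text TEXT NOT NULL)"
)

# Made-up sample rows: one cleaned, tokenized sentence per row.
rows = [
    (1, "harmonium is a keyboard instrument"),
    (2, "some harmoniums have two manuals"),
]
conn.executemany("INSERT INTO sentences (id, cleaned_text) VALUES (?, ?)", rows)
conn.commit()

# Read the rows back in id order to check the layout.
stored = conn.execute(
    "SELECT id, cleaned_text FROM sentences ORDER BY id"
).fetchall()
print(stored)
```

Adapt the connection and table name to whichever database the project is configured to read from.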
- nosetests --cover-package=.\sparsenlp --with-coverage --nologcapture -x
- python -m unittest -v
(run just one class test)
- python -m unittest -q tests.test_datacleaner.TestDataClean
- py.test -q -s tests/test_datacleaner.py::TestDataClean
(run just one functional test)
- python -m unittest -q tests.test_datacleaner.TestDataClean.test_ingestfiles_json_to_dict
- Training Corpora Definition - 16 Feb
- Corpora pre-processing - 28 Feb
- Sentence tokenization - 12 Mar
- Sentence vectorization - 26 Mar
- Word to sentence database - 9 Apr
- Cluster sentences - 23 Apr
- Word fingerprint - 4 May
- Text fingerprint - 17 May
- Evaluation - 30 Aug
Wikipedia dumps from Wikimedia, 2018-01-01
github - attardi/wikiextractor
Document files contain a series of Wikipedia articles, each represented by an XML doc element:
<doc>...</doc>
<doc>...</doc>
...
<doc>...</doc>
The element doc has the following attributes:
- id, which identifies the document by means of a unique serial number
- url, which provides the URL of the original Wikipedia page.
The content of a doc element consists of plain text, one paragraph per line.
Example:
<doc id="2" url="http://it.wikipedia.org/wiki/Harmonium">
Harmonium.
L'harmonium è uno strumento musicale azionato con una tastiera, detta manuale.
Sono stati costruiti anche alcuni harmonium con due manuali.
...
</doc>
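A file in this format could be read with something like the sketch below. It uses a regular expression rather than an XML parser, on the assumption that a file holds several top-level doc elements and so is not a single well-formed XML document. The function name `parse_docs` and the second sample document are hypothetical.

```python
import re

# Matches one <doc ...>...</doc> element and captures its id, url, and body.
DOC_RE = re.compile(
    r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)"[^>]*>\n(?P<body>.*?)\n?</doc>',
    re.DOTALL,
)


def parse_docs(text):
    """Return one dict per doc element, with its non-empty lines as paragraphs."""
    docs = []
    for match in DOC_RE.finditer(text):
        paragraphs = [
            line for line in match.group("body").splitlines() if line.strip()
        ]
        docs.append(
            {
                "id": match.group("id"),
                "url": match.group("url"),
                "paragraphs": paragraphs,
            }
        )
    return docs


# The first document is abridged from the example above; the second is made up.
sample = """<doc id="2" url="http://it.wikipedia.org/wiki/Harmonium">
Harmonium.
L'harmonium è uno strumento musicale azionato con una tastiera, detta manuale.
</doc>
<doc id="3" url="http://it.wikipedia.org/wiki/Example">
Example.
</doc>
"""

docs = parse_docs(sample)
for doc in docs:
    print(doc["id"], doc["url"], len(doc["paragraphs"]))
```

Each paragraph can then be cleaned and tokenized into the sentences that end up in the `cleaned_text` column.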
Evaluation code repository: github - kudkudak/word-embeddings-benchmarks
Evaluation methods: arxiv.org/abs/1702.02170
Other alternative methods: github - mfaruqui/eval-word-vectors