
SparseNLP

Introduction

  • Word embeddings are an important component of many language models and are capable of producing state-of-the-art results in several NLP tasks;
  • SparseNLP proposes an alternative approach to deriving word embeddings. In contrast with traditional dense vector representations, it creates a sparse distributed representation (SDR) for each word. The representation still follows the idea formulated by Firth in 1957: "a word is characterized by the company it keeps" (a toy contrast of the two kinds of representation follows this list);
  • The whole methodology is validated by using these word embeddings as language models in several Natural Language Processing tasks.
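
As a rough illustration of the difference, not the project's actual parameters: a dense embedding packs meaning into a few hundred real numbers, while an SDR spreads it over thousands of bits of which only a small fraction are active. The 128x128 grid size and 2% sparsity below are assumptions chosen for the sketch.

import numpy as np

rng = np.random.default_rng(0)

# Dense embedding: a short vector of real numbers (e.g. 300 dimensions).
dense = rng.standard_normal(300)

# Sparse distributed representation: a long binary vector in which only a
# small fraction of bits are active (illustrative 128x128 grid, ~2% sparsity).
n_bits = 128 * 128
n_active = int(0.02 * n_bits)
sdr = np.zeros(n_bits, dtype=np.uint8)
sdr[rng.choice(n_bits, size=n_active, replace=False)] = 1

# For SDRs, similarity is naturally measured as the overlap of active bits.
def overlap(a, b):
    return int(np.sum(a & b))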

Installation

git clone https://github.com/avsilva/sparse-nlp.git

TODO: pip install -r requirements.txt

How to use it

To use SparseNLP, your data must be stored in a database table with two columns (a minimal schema sketch follows this list):

  • id (int): primary key
  • cleaned_text (str): text tokens for each sentence
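
A minimal sketch of the expected schema, using SQLite purely for illustration; the table name sentences is a hypothetical choice, and the project may target a different database engine:

import sqlite3

conn = sqlite3.connect("sparsenlp.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS sentences (
        id INTEGER PRIMARY KEY,      -- unique sentence id
        cleaned_text TEXT NOT NULL   -- cleaned text tokens for the sentence
    )
    """
)
conn.execute(
    "INSERT INTO sentences (cleaned_text) VALUES (?)",
    ("the harmonium is a keyboard instrument",),
)
conn.commit()
conn.close()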

Tests

  • nosetests --cover-package=.\sparsenlp --with-coverage --nologcapture -x

  • python -m unittest -v

  • Run a single test class:

    python -m unittest -q tests.test_datacleaner.TestDataClean

    py.test -q -s tests/test_datacleaner.py::TestDataClean

  • Run a single test method:

    python -m unittest -q tests.test_datacleaner.TestDataClean.test_ingestfiles_json_to_dict

Project Planning

  1. Training Corpora Definition - 16 Feb
  2. Corpora pre-processing - 28 Feb
  3. Sentence tokenization - 12 Mar
  4. Sentence vectorization - 26 Mar
  5. Word to sentence database - 9 Apr
  6. Cluster sentences - 23 Apr
  7. Word fingerprint - 4 May
  8. Text fingerprint - 17 May
  9. Evaluation - 30 Aug
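
A toy sketch of how steps 4 through 8 might fit together. This is an assumption about the pipeline, not the project's implementation: it uses scikit-learn's TfidfVectorizer and k-means where the project may use different vectorization and clustering (e.g. a self-organizing map), and the cluster count and top-k cutoff are illustrative.

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "the harmonium is a keyboard instrument",
    "the piano is a keyboard instrument",
    "dogs bark loudly at night",
]

# Step 4: sentence vectorization.
X = TfidfVectorizer().fit_transform(sentences)

# Step 6: cluster sentences; each cluster becomes one bit position of the
# fingerprint (2 clusters only to keep the toy example readable).
n_clusters = 2
labels = MiniBatchKMeans(n_clusters=n_clusters, random_state=0, n_init=3).fit_predict(X)

# Step 7: a word's fingerprint activates the clusters whose sentences
# mention it most often ("the company it keeps").
def word_fingerprint(word, top_k=1):
    counts = np.zeros(n_clusters)
    for sentence, label in zip(sentences, labels):
        if word in sentence.split():
            counts[label] += 1
    active = np.argsort(counts)[::-1][:top_k]
    fingerprint = np.zeros(n_clusters, dtype=np.uint8)
    fingerprint[active[counts[active] > 0]] = 1
    return fingerprint

print(word_fingerprint("keyboard"))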

1. Training Corpora Definition

Wikipedia dumps from Wikimedia, dated 2018-01-01.

1.1 Extracting plain text from Wikipedia dumps

github - attardi/wikiextractor
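
A typical invocation looks like the following; the dump filename and output directory are illustrative, and the full option list is in the wikiextractor README:

python WikiExtractor.py -o extracted enwiki-20180101-pages-articles.xml.bz2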

Each extracted file contains a series of Wikipedia articles, each represented by an XML doc element:

<doc>...</doc>
<doc>...</doc>
...
<doc>...</doc>

The element doc has the following attributes:

  • id, which identifies the document by means of a unique serial number
  • url, which provides the URL of the original Wikipedia page

The content of a doc element consists of pure text, one paragraph per line.

Example (from the Italian Wikipedia; the excerpt introduces the harmonium as a keyboard-operated musical instrument):

<doc id="2" url="http://it.wikipedia.org/wiki/Harmonium">
Harmonium.
L'harmonium è uno strumento musicale azionato con una tastiera, detta manuale.
Sono stati costruiti anche alcuni harmonium con due manuali.
...
</doc>
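
Because a file of consecutive doc elements is not a single well-formed XML document, one simple way to read it is a regular-expression scan. A minimal sketch; iter_docs is a hypothetical helper, and it assumes each extracted file is small enough to read into memory:

import re

# One pattern per doc element; DOTALL lets ".*?" span the multi-line body.
DOC_RE = re.compile(
    r'<doc id="(?P<id>\d+)" url="(?P<url>[^"]*)"[^>]*>(?P<text>.*?)</doc>',
    re.DOTALL,
)

def iter_docs(path):
    # Yield (id, url, text) for every doc element in an extracted file.
    with open(path, encoding="utf-8") as handle:
        for match in DOC_RE.finditer(handle.read()):
            yield match.group("id"), match.group("url"), match.group("text").strip()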

9. Evaluation

Evaluation code repository: github - kudkudak/word-embeddings-benchmarks

Evaluation methods: arxiv.org/abs/1702.02170

Alternative evaluation methods: github - mfaruqui/eval-word-vectors
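
A minimal sketch of running one similarity benchmark with the word-embeddings-benchmarks package, following the usage shown in its README (it downloads GloVe and the MEN dataset on first use; evaluating SparseNLP's own fingerprints would mean loading them through the package's embedding classes instead of fetch_GloVe):

from web.datasets.similarity import fetch_MEN
from web.embeddings import fetch_GloVe
from web.evaluate import evaluate_similarity

# Fetch a reference embedding and a word-similarity dataset.
embedding = fetch_GloVe(corpus="wiki-6B", dim=300)
men = fetch_MEN()

# Spearman correlation between embedding similarities and human judgements.
print(evaluate_similarity(embedding, men.X, men.y))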
