
spacy_russian_tokenizer: Russian segmentation and tokenization rules for spaCy

Tokenization in Russian is not a simple task when it comes to compound words connected by hyphens. Some of them (e.g. "какой-то", "кое-что", "бизнес-ланч") should be treated as a single token, while others (e.g. "суп-харчо", "инженер-программист") should be split into multiple tokens. Correct tokenization is especially important when training a language model, because in most training datasets (e.g. SynTagRus) tokens are split or merged correctly, and wrong tokenization reduces the model's quality. Example of the default behaviour:

from spacy.lang.ru import Russian
text = "Не ветер, а какой-то ураган!"
nlp = Russian()
doc = nlp(text)
print([token.text for token in doc])
# ['Не', 'ветер', ',', 'а', 'какой', '-', 'то', 'ураган', '!']
# Notice that the word "какой-то" is split into three tokens.

This package uses the spaCy Matcher API to create rules for specific cases and exceptions in Russian.
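
For illustration, here is a minimal sketch of the general idea (assuming spaCy 2.x, which this package targets; the pattern and the 'HYPHENATED' key are illustrative, not the package's actual rules): a Matcher pattern finds the token-hyphen-token sequence, and the retokenizer merges it back into one token.

from spacy.lang.ru import Russian
from spacy.matcher import Matcher

nlp = Russian()
matcher = Matcher(nlp.vocab)
# Match the sequence "какой", "-", "то" produced by the default tokenizer.
matcher.add('HYPHENATED', None, [{'ORTH': 'какой'}, {'ORTH': '-'}, {'ORTH': 'то'}])

doc = nlp("Не ветер, а какой-то ураган!")
# Merge every matched span back into a single token.
with doc.retokenize() as retokenizer:
    for match_id, start, end in matcher(doc):
        retokenizer.merge(doc[start:end])
print([token.text for token in doc])
# ['Не', 'ветер', ',', 'а', 'какой-то', 'ураган', '!']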

Installation

pip install git+https://github.com/aatimofeev/spacy_russian_tokenizer.git

Implementation

Basically, the package is a collection of manually tuned Matcher patterns. Most of them were acquired from the SynTagRus vocabulary and the lemma dictionary of the Russian National Corpus (НКРЯ).
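
As a rough, hypothetical illustration (not the package's real data), a merge pattern for hyphenated indefinite pronouns such as "какой-то", "кто-либо" or "что-нибудь" could look like this in spaCy's extended pattern syntax (available since spaCy 2.1):

# Hypothetical pattern shape: any alphabetic token, a hyphen, and one of
# the particles "то", "либо", "нибудь".
ILLUSTRATIVE_PATTERN = [
    {'IS_ALPHA': True},
    {'ORTH': '-'},
    {'LOWER': {'IN': ['то', 'либо', 'нибудь']}},
]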

Usage

Core patterns are collected in the MERGE_PATTERNS variable.

from spacy.lang.ru import Russian
from spacy_russian_tokenizer import RussianTokenizer, MERGE_PATTERNS
text = "Не ветер, а какой-то ураган!"
nlp = Russian()
russian_tokenizer = RussianTokenizer(nlp, MERGE_PATTERNS)
nlp.add_pipe(russian_tokenizer, name='russian_tokenizer')
doc = nlp(text)
print([token.text for token in doc])
# ['Не', 'ветер', ',', 'а', 'какой-то', 'ураган', '!']
# Notice that the word "какой-то" now remains a single token.

One can also add patterns found in SynTagRus but absent from the Russian National Corpus:

from spacy.lang.ru import Russian
from spacy_russian_tokenizer import RussianTokenizer, MERGE_PATTERNS, SYNTAGRUS_RARE_CASES
text = "«Фобос-Грунт» — российская автоматическая межпланетная станция (АМС)."
nlp = Russian()
russian_tokenizer = RussianTokenizer(nlp, MERGE_PATTERNS + SYNTAGRUS_RARE_CASES)
nlp.add_pipe(russian_tokenizer, name='russian_tokenizer')
doc = nlp(text)
print([token.text for token in doc])
# ['«', 'Фобос-Грунт', '»', '—', 'российская', 'автоматическая', 'межпланетная', 'станция', '(', 'АМС', ')', '.']
