Skip to content

bplank/multilingualtokenizer

Repository files navigation

multilingualtokenizer

A trivial punctuation-based sentence splitter and tokenizer for multi-lingual data.

Requires python3 and the regex package. Install with pip install regex or conda install regex.

Usage:

python trivialssplitter.py FILE > OUTPUT.s
python tinytokenizer.py FILE > OUTPUT.s
python tinytokenizer.py --conll FILE > OUTPUT.t

The --conll option outputs one token per line. Default is to have one sentence per line.

About

A trivial punctuation-based sentence splitter and tokenizer for multilingual data.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages