flair-extra

This repository is mainly a collection of Python scripts that simplify some common workflows when using flair:

Preprocessing, formatting and analyzing text datasets
Training embeddings / language models (from scratch)
Performing Named-Entity-Recognition and Intent-Detection
Evaluating trained models

Example

Say you have a raw text corpus and a NER dataset from the same domain. You want to train your own flair language model in order to use it for training a NER model for your domain. Steps in that workflow include:

Preprocess, format and analyze your corpus/datasets
Train a custom language model (LM)
Train an NER model (with your previously trained LM, maybe even in combination with other LMs/embeddings)
Evaluate the performance of your final model

The following examples assume that you are familiar with flair's base classes, data types, etc. You can get started with the flair tutorials here. Most parts of this workflow (with output) are illustrated as Jupyter notebooks.

Preprocess

E.g. clean the corpus, replace umlauts, and remove accents and punctuation tokens:

$ cd scripts/language_modeling
$ python preprocess_corpus.py -cuap /path/to/corpus/ /path/to/corpus_proc/
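To make the cleaning steps concrete, here is a minimal sketch of umlaut replacement, accent removal and punctuation-token removal using only the standard library. The function names and the umlaut mapping are assumptions for illustration; they are not taken from preprocess_corpus.py, which may work differently.

```python
import string
import unicodedata

# Hypothetical mapping; the actual script may transliterate differently.
UMLAUT_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def replace_umlauts(text):
    """Transliterate German umlauts and sharp s to ASCII digraphs."""
    return text.translate(UMLAUT_MAP)


def remove_accents(text):
    """Decompose characters and drop the combining marks (é -> e)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_punctuation_tokens(text):
    """Drop whitespace-separated tokens that consist only of punctuation."""
    kept = [
        tok for tok in text.split()
        if not all(ch in string.punctuation for ch in tok)
    ]
    return " ".join(kept)


def preprocess_line(line):
    # Umlauts go first: accent removal alone would turn "ü" into "u".
    line = replace_umlauts(line)
    line = remove_accents(line)
    return remove_punctuation_tokens(line)
```

The order matters in this sketch: running accent removal before umlaut replacement would strip the diaeresis and lose the distinction between "u" and "ü".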

Format

E.g. create a corpus folder from a plain text file, using 97% as training, 1% as validation and 2% as test set, and split the training set into 20 parts:

$ cd scripts/language_modeling
$ python make_corpus_folder.py /path/to/corpus/ /path/to/corpus/folder/ -p 97-1-2 -s 20
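The split itself can be sketched as follows. This is an illustration under assumptions, not the code of make_corpus_folder.py: the function names are hypothetical, and it assumes a contiguous (unshuffled) split of lines. flair's language model trainer expects the result as a folder containing a train/ directory with the split files plus valid.txt and test.txt.

```python
def parse_portions(spec):
    """Turn a string such as '97-1-2' into three integer percentages."""
    train, valid, test = (int(part) for part in spec.split("-"))
    if train + valid + test != 100:
        raise ValueError("portions must sum to 100")
    return train, valid, test


def split_lines(lines, spec="97-1-2", n_splits=20):
    """Split lines into training chunks, a validation set and a test set."""
    train_pct, valid_pct, _ = parse_portions(spec)
    n_train = len(lines) * train_pct // 100
    n_valid = len(lines) * valid_pct // 100

    train = lines[:n_train]
    valid = lines[n_train:n_train + n_valid]
    test = lines[n_train + n_valid:]  # remainder goes to the test set

    # Distribute the training lines over n_splits contiguous chunks
    # whose sizes differ by at most one line.
    base, extra = divmod(len(train), n_splits)
    chunks, start = [], 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        chunks.append(train[start:start + size])
        start += size
    return chunks, valid, test
```

Each chunk would then be written to its own file inside train/, so that the trainer can load one part at a time instead of the whole corpus.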

E.g. create a column corpus folder (NER) from a column file, using 70% as training, 20% as validation and 10% as test set, and shuffle the lines:

$ cd scripts/named_entity_recognition
$ python make_nercorpus_folder.py /path/to/ner_dataset /path/to/destination/ -p 70-20-10 --shuffle
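For a column file, a split has to keep the token lines of one sentence together. The sketch below assumes the usual column format with blank lines between sentences and shuffles whole sentences rather than single lines; whether make_nercorpus_folder.py does exactly this is an assumption, and the function names are hypothetical.

```python
import random


def read_sentences(column_text):
    """Group a column-format string into sentences (lists of token lines)."""
    sentences, current = [], []
    for line in column_text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the current sentence.
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def split_sentences(sentences, portions=(70, 20, 10), shuffle=False, seed=0):
    """Split sentences into train/valid/test by percentage."""
    sentences = list(sentences)
    if shuffle:
        # A fixed seed keeps the split reproducible.
        random.Random(seed).shuffle(sentences)
    n_train = len(sentences) * portions[0] // 100
    n_valid = len(sentences) * portions[1] // 100
    train = sentences[:n_train]
    valid = sentences[n_train:n_train + n_valid]
    test = sentences[n_train + n_valid:]
    return train, valid, test
```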

Analyze

Analyze text corpus: lines, tokens, etc.; most common tokens; wordcloud

from modules.corpus_analysis import TextAnalysis

# Placeholder path: point this at your own corpus
ta = TextAnalysis('path/to/corpus')

print(ta.obtain_statistics())
print(ta.most_common_tokens(nr_tokens=15, stop_words=['the', 'is',...]))
ta.wordcloud(stop_words=[], savefig_file=None, figsize=(15, 15))
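As a rough idea of what a most-common-tokens count involves, here is a standalone sketch with collections.Counter. It assumes whitespace tokenization and case-insensitive stop word matching; the internals of TextAnalysis may differ, and the function below is not part of modules.corpus_analysis.

```python
from collections import Counter


def most_common_tokens(lines, nr_tokens=15, stop_words=()):
    """Return the nr_tokens most frequent tokens, ignoring stop words."""
    stop = {word.lower() for word in stop_words}
    counts = Counter(
        token.lower()
        for line in lines
        for token in line.split()  # naive whitespace tokenization
        if token.lower() not in stop
    )
    return counts.most_common(nr_tokens)
```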

Analyze a ColumnCorpus (NER): tags/tokens stats; most common tokens; wordcloud; visualize sentences, tag distribution, most common tokens per tags

from nltk.corpus import stopwords  # requires the NLTK stopwords corpus
from modules.corpus_analysis import ColumnCorpusAnalysis

columns = {0: 'text', 1: 'ner', 2: 'pos'}
# Placeholder path: point this at your own corpus folder
cca = ColumnCorpusAnalysis(path='path/to/corpus_folder', columns=columns, tag_types=['ner', 'pos'])

print(cca.obtain_statistics(tag_type='ner'))
print(cca.most_common_tokens(nr_tokens=20, stop_words=stopwords.words('german')))
cca.wordcloud(savefig_file=None, figsize=(15, 15))
cca.visualize_ner_tags(display_index=range(5))
cca.tag_distribution(savefig_file=None, tag_type='ner', figsize=(13, 10))
cca.most_common_tokens_per_tags(max_tokens=10, tag_type='ner', print_without_count=True)
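The tag distribution and the most common tokens per tag can both be derived from simple counts over the column file. The following sketch assumes whitespace-separated columns laid out as in the columns dictionary above (token in column 0, NER tag in column 1); it is an illustration, not the implementation of ColumnCorpusAnalysis.

```python
from collections import Counter, defaultdict


def tag_statistics(column_text, token_col=0, tag_col=1):
    """Count tags and, for each tag, the tokens that carry it."""
    tag_counts = Counter()
    tokens_per_tag = defaultdict(Counter)
    for line in column_text.splitlines():
        fields = line.split()
        if len(fields) <= max(token_col, tag_col):
            continue  # blank line (sentence boundary) or incomplete row
        token, tag = fields[token_col], fields[tag_col]
        tag_counts[tag] += 1
        tokens_per_tag[tag][token] += 1
    return tag_counts, tokens_per_tag
```

Calling tokens_per_tag[tag].most_common(10) on the result gives the ten most frequent tokens for that tag.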

Train a Flair Language Model

First, set the training parameters (epochs, learning rate, etc.) in an options file (e.g. options_lm.json). Then:

$ cd scripts/language_modeling
$ python train_lm_flair.py -c /path/to/corpus/ -t /path/to/model/folder/ -o options_lm.json [--continue_training]
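The schema of the options file is defined by the training script and is not reproduced here. As a hedged illustration of how such a file could be read, the sketch below merges user-supplied JSON with defaults. The key names are borrowed from the arguments of flair's LanguageModel and LanguageModelTrainer; whether options_lm.json uses these keys, and the default values shown, are assumptions.

```python
import json

# Hypothetical defaults; the real keys are defined by train_lm_flair.py.
DEFAULT_OPTIONS = {
    "is_forward_lm": True,
    "hidden_size": 1024,
    "sequence_length": 250,
    "mini_batch_size": 100,
    "learning_rate": 20.0,
    "max_epochs": 10,
}


def load_options(json_text, defaults=DEFAULT_OPTIONS):
    """Parse JSON options and overlay them on the defaults."""
    user_options = json.loads(json_text)
    unknown = set(user_options) - set(defaults)
    if unknown:
        # Failing early catches typos such as "max_epoch".
        raise KeyError(f"unknown option(s): {sorted(unknown)}")
    return {**defaults, **user_options}
```

Rejecting unknown keys is a design choice in this sketch: a misspelled option would otherwise be ignored silently and training would run with the default value.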

Train a NER Model

Again, set the parameters in an options file. Say you have trained a forward and a backward LM, stored in fwd-lm.pt and bwd-lm.pt. Then:

$ cd scripts/named_entity_recognition
$ python train_ner_flair.py -c /path/to/corpus/ -t /path/to/model/folder/ -o options_ner_flair [--continue_training] [--tensorboard] -e fwd-lm.pt bwd-lm.pt

It will use a stacked combination of all embeddings you specify after the flag -e.
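Stacking here means that each token's vectors from all embeddings are concatenated into one longer vector, which is what flair's StackedEmbeddings does. The toy sketch below shows the idea with plain lists. The two embedder functions are made-up stand-ins for a forward and a backward LM and do not resemble real flair embeddings.

```python
def stack_embeddings(tokens, embedders):
    """Concatenate, per token, the vectors produced by every embedder."""
    per_embedder = [embed(tokens) for embed in embedders]
    stacked = []
    for vectors in zip(*per_embedder):
        combined = []
        for vector in vectors:
            combined.extend(vector)
        stacked.append(combined)
    return stacked


def toy_forward(tokens):
    """Stand-in forward model: characters seen so far, left to right."""
    total, out = 0, []
    for token in tokens:
        total += len(token)
        out.append([float(total), 1.0])
    return out


def toy_backward(tokens):
    """Stand-in backward model: characters seen so far, right to left."""
    total, out = 0, []
    for token in reversed(tokens):
        total += len(token)
        out.append([float(total)])
    return out[::-1]  # restore the original token order
```

The stacked vector length is the sum of the individual lengths (2 + 1 = 3 in this toy case), so adding more embeddings increases the input size of the NER model accordingly.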

Evaluate trained Model

Predictions; precision, recall and F1 scores (total and per tag) as table and as plot; training curves, weights, etc.

from modules.model_evaluation import SequenceTaggerEvaluation

# Placeholder path: point this at your own model folder
ste = SequenceTaggerEvaluation('path/to/model/folder/', model='best-model.pt')

text = "Lorem Ipsum ..."
sentences = SequenceTaggerEvaluation._preprocess(text)
ste.predict(sentences)

ste.result_tables()
ste.plot_tag_stats(mode='tp_fn',...)
ste.plot_training_curves()
ste.plot_weights()
ste.plot_learning_rate()
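For reference, the per-tag scores in such a table follow from true positive, false positive and false negative counts. The sketch below computes precision, recall and F1 per tag plus a micro-averaged total. It is independent of SequenceTaggerEvaluation, and treating the "total" row as a micro average is an assumption.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def score_table(counts):
    """counts maps tag -> (tp, fp, fn); returns scores per tag and in total."""
    table = {tag: precision_recall_f1(*c) for tag, c in counts.items()}
    # Micro average: sum the counts over all tags, then score once.
    totals = [sum(column) for column in zip(*counts.values())] or [0, 0, 0]
    table["total (micro)"] = precision_recall_f1(*totals)
    return table
```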

The code in this repository was used for the thesis: "Effects of different Word Embeddings on the Performance of Intent Detection and Named Entity Recognition for German texts" (Parikh, 2019).
