pyparadigm

A project for automatic construction of morphological paradigms

1. Assigning words to paradigms according to full inflection tables.
python3 pyparadigm.py <inflection_tables> <LANGUAGE_CODE> <maximal_gap> <maximal_initial_gap> <paradigm_processing_mode> <words_by_paradigms_file> <paradigm_stats_file>
<inflection_tables> : input file with inflection tables, in the same format as data/Latin/latin_noun_paradigms.txt
<LANGUAGE_CODE> : language code; supported values are LA (Latin), RU (Russian), FI (Finnish) and RU_verbs (Russian verbs)
<maximal_gap> : maximal gap length in the LCS (longest common subsequence) method
<maximal_initial_gap> : maximal initial gap length in the LCS method
<paradigm_processing_mode> : how to handle multiple variants of a word form: 'first' uses only the first variant, 'all' considers all variants
<words_by_paradigms_file> : output file with words and their paradigms, in the same format as data/Latin/nouns_by_paradigms.txt
<paradigm_stats_file> : output file listing the paradigms, with one member word for each paradigm
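
Step 1 takes seven positional arguments, so a small wrapper can make the order harder to get wrong. The sketch below only assembles the command line described above and checks the two arguments whose allowed values are listed. The helper name `build_step1_command`, the example gap values and the output paths are illustrative assumptions, not part of the project.

```python
# Language codes and processing modes listed in the description above.
SUPPORTED_LANGUAGES = {"LA", "RU", "FI", "RU_verbs"}
PROCESSING_MODES = {"first", "all"}


def build_step1_command(inflection_tables, language_code, maximal_gap,
                        maximal_initial_gap, processing_mode,
                        words_by_paradigms_file, paradigm_stats_file):
    """Return the argv list for step 1 (hypothetical helper, not part of pyparadigm)."""
    if language_code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language_code!r}")
    if processing_mode not in PROCESSING_MODES:
        raise ValueError(f"mode must be 'first' or 'all', got {processing_mode!r}")
    if maximal_gap < 0 or maximal_initial_gap < 0:
        raise ValueError("gap lengths must be non-negative")
    # Arguments are positional, in the order given in the usage line.
    return [
        "python3", "pyparadigm.py",
        inflection_tables, language_code,
        str(maximal_gap), str(maximal_initial_gap),
        processing_mode, words_by_paradigms_file, paradigm_stats_file,
    ]


command = build_step1_command(
    "data/Latin/latin_noun_paradigms.txt", "LA",
    2, 1,                        # assumed example values for the two gap limits
    "first",
    "nouns_by_paradigms.txt",    # assumed output path
    "noun_paradigm_stats.txt",   # assumed output path
)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the directory that contains pyparadigm.py.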

2. Automatic detection of paradigms

python transform_for_learning.py <words_by_paradigm_file> <paradigm_stats_file> <inflection_tables> <lemmas_with_codes> <paradigms_with_codes>
Transforms the output of pyparadigm.py into the format used for paradigm learning.
<words_by_paradigm_file> : first output file of step 1
<paradigm_stats_file> : second output file of step 1
<inflection_tables> : input file of step 1; required only for ordering
<lemmas_with_codes> : output file with lemmas and paradigm codes for later learning
<paradigms_with_codes> : output file with paradigms and their codes
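
The transformation step reuses all three files from step 1, which is easy to mix up on the command line. As a rough sketch, the files can be grouped in one record and the call assembled from it. `Step1Files` and `build_transform_command` are hypothetical names, and the file paths in the example are assumed.

```python
from typing import List, NamedTuple


class Step1Files(NamedTuple):
    """Files involved in step 1 (hypothetical container, for illustration only)."""
    inflection_tables: str        # input of step 1
    words_by_paradigm_file: str   # first output of step 1
    paradigm_stats_file: str      # second output of step 1


def build_transform_command(step1: Step1Files, lemmas_with_codes: str,
                            paradigms_with_codes: str) -> List[str]:
    """Assemble the transform_for_learning.py call in the documented argument order."""
    return [
        "python", "transform_for_learning.py",
        step1.words_by_paradigm_file,
        step1.paradigm_stats_file,
        step1.inflection_tables,
        lemmas_with_codes,
        paradigms_with_codes,
    ]


step1 = Step1Files(
    inflection_tables="data/Latin/latin_noun_paradigms.txt",
    words_by_paradigm_file="nouns_by_paradigms.txt",   # assumed path
    paradigm_stats_file="noun_paradigm_stats.txt",     # assumed path
)
# The two output names below are assumed as well.
transform_command = build_transform_command(step1, "lemmas.txt", "paradigm_codes.txt")
print(" ".join(transform_command))
```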

2a. python learn_paradigms.py [-p] [-m] [-T <train_data_dir>] [-O <test_output_dir>] cross-validation <paradigms_with_codes> <lemmas_with_codes> <max_feature_lengths> <feature_fractions> <train_data_fractions> <folds_number> [<feature_selection_method>]
-p : also predict class probabilities (off by default)
-m : allow several paradigms to be predicted for one lemma (off by default)
-T <train_data_dir> : directory where the train data splits are written
-O <test_output_dir> : directory where the classification results are written
<paradigms_with_codes> : second output file of the previous step
<lemmas_with_codes> : first output file of the previous step
<max_feature_lengths> : comma-separated list of maximal feature lengths
<feature_fractions> : comma-separated list of feature fractions; each value is the proportion of features selected at the data preprocessing step
<train_data_fractions> : comma-separated list of train data fractions
<folds_number> : number of folds in cross-validation
<feature_selection_method> : feature selection algorithm, 'ambiguity' or 'log_odds'; 'ambiguity' is the default and preferred choice
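
Three of the arguments above are comma-separated lists. The sketch below shows one way such values could be parsed and combined into a grid of settings. It assumes that every combination of the three lists is evaluated and that fractions lie in (0, 1]; the README states neither, and `parse_list` and `parameter_grid` are hypothetical helpers, not functions from learn_paradigms.py.

```python
from itertools import product


def parse_list(text, convert):
    """Split a comma-separated command-line value and convert each item."""
    return [convert(item) for item in text.split(",") if item.strip()]


def parameter_grid(max_feature_lengths, feature_fractions, train_data_fractions):
    """Return every combination of the three comma-separated settings.

    Assumption: each combination is evaluated; the README does not say so.
    """
    lengths = parse_list(max_feature_lengths, int)
    features = parse_list(feature_fractions, float)
    train = parse_list(train_data_fractions, float)
    # Assumed validity range for a fraction: greater than 0, at most 1.
    for value in features + train:
        if not 0.0 < value <= 1.0:
            raise ValueError(f"fraction out of range (0, 1]: {value}")
    return list(product(lengths, features, train))


grid = parameter_grid("3,5", "0.1,0.5", "1.0")
for length, feature_fraction, train_fraction in grid:
    print(length, feature_fraction, train_fraction)
```

With two feature lengths, two feature fractions and one train data fraction, the grid holds four settings.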
