AlexeySorokin/pyparadigm

pyparadigm

A project for automatic construction of morphological paradigms

1. Assigning words to paradigms according to full inflection tables.
python3 pyparadigm.py <inflection_tables> <LANGUAGE_CODE> <maximal_gap> <maximal_initial_gap> <paradigm_processing_mode> <words_by_paradigms_file> <paradigm_stats_file>
<inflection_tables> : input file with inflection tables, in the same format as data/Latin/latin_noun_paradigms.txt
<LANGUAGE_CODE> : code of the language (supported values: LA for Latin, RU for Russian, FI for Finnish, RU_verbs for Russian verbs)
<maximal_gap> : maximal gap length in the LCS (longest common subsequence) method
<maximal_initial_gap> : maximal initial gap length in the LCS method
<paradigm_processing_mode> : how to handle several variants of one word form: 'first' uses only the first variant, 'all' considers all the variants
<words_by_paradigms_file> : output file containing words and their paradigms, in the same format as data/Latin/nouns_by_paradigms.txt
<paradigm_stats_file> : output file listing the paradigms, with one example member for each paradigm
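
The two gap parameters above refer to a longest-common-subsequence (LCS) computation over word forms. As a rough illustration of the general idea only, the sketch below computes a plain LCS and folds it over the forms of one word. It is not taken from pyparadigm.py: the function names are hypothetical, the gap limits are not modelled, and folding pairwise LCS is a simple heuristic rather than an exact multi-string LCS.

```python
def longest_common_subsequence(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (classic DP)."""
    rows, cols = len(a) + 1, len(b) + 1
    # length[i][j] holds the LCS length of a[:i] and b[:j].
    length = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
            else:
                length[i][j] = max(length[i - 1][j], length[i][j - 1])
    # Walk back from the bottom-right corner to recover the characters.
    out = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif length[i - 1][j] >= length[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


def common_part(forms):
    """Fold the pairwise LCS over all word forms (heuristic, not exact)."""
    result = forms[0]
    for form in forms[1:]:
        result = longest_common_subsequence(result, form)
    return result


# Hypothetical input: a few forms of the Latin noun "rosa".
latin_forms = ["rosa", "rosae", "rosam", "rosarum"]
shared = common_part(latin_forms)
```

Here `shared` is the material common to all listed forms; what remains in each form is the variable part that a paradigm would describe.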

2. Automatic detection of paradigms

  1. python transform_for_learning.py <words_by_paradigms_file> <paradigm_stats_file> <inflection_tables> <lemmas_with_codes> <paradigms_with_codes>
    Transforms the output of pyparadigm.py to the format used for paradigm learning
    <words_by_paradigms_file> : first output file of step 1
    <paradigm_stats_file> : second output file of step 1
    <inflection_tables> : input file of step 1, required only for ordering
    <lemmas_with_codes> : output file with lemmas and paradigm codes for future learning
    <paradigms_with_codes> : output file with paradigms and their codes
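
The way the files flow from step 1 into the transformation step can be sketched as follows. This is only an illustration of the argument order given above: the helper names, the output file names and the gap values of 1 are assumptions made for the example, while the script names, the LA code and the path data/Latin/latin_noun_paradigms.txt come from this README.

```python
import subprocess


def step1_command(tables, language, max_gap, max_initial_gap, mode,
                  words_out, stats_out):
    """Argument list for pyparadigm.py, in the order given in step 1."""
    return ["python3", "pyparadigm.py", tables, language,
            str(max_gap), str(max_initial_gap), mode, words_out, stats_out]


def transform_command(words_out, stats_out, tables, lemmas_out, paradigms_out):
    """Argument list for transform_for_learning.py (step 2, item 1)."""
    return ["python", "transform_for_learning.py",
            words_out, stats_out, tables, lemmas_out, paradigms_out]


def run_all(commands):
    """Run the commands in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


tables = "data/Latin/latin_noun_paradigms.txt"
# Hypothetical output file names, chosen only for this example.
words_out = "nouns_by_paradigms.out"
stats_out = "paradigm_stats.out"

commands = [
    # Gap values of 1 are placeholders, not recommended settings.
    step1_command(tables, "LA", 1, 1, "first", words_out, stats_out),
    transform_command(words_out, stats_out, tables,
                      "lemmas_with_codes.out", "paradigms_with_codes.out"),
]
# Call run_all(commands) from the repository root to execute both steps.
```

Both output files of step 1 are passed unchanged to the transformation step, and the inflection tables file is passed a second time.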

  2. python learn_paradigms.py [-p] [-m] [-T <train_data_dir>] [-O <test_output_dir>] cross-validation <paradigms_with_codes> <lemmas_with_codes> <max_feature_lengths> <feature_fractions> <train_data_fractions> <folds_number> [<feature_selection_method>]
    -p : if given, class probabilities are also predicted (off by default)
    -m : if given, multiple paradigms can be predicted for one lemma (off by default)
    <train_data_dir> : directory where the train data splits are written
    <test_output_dir> : directory where the classification results are written
    <paradigms_with_codes> : second output file of the previous step
    <lemmas_with_codes> : first output file of the previous step
    <max_feature_lengths> : comma-separated list of maximal feature lengths
    <feature_fractions> : comma-separated list of feature fractions; each value sets the proportion of features selected at the data preprocessing step
    <train_data_fractions> : comma-separated list of train data fractions
    <folds_number> : number of folds in cross-validation
    <feature_selection_method> : feature selection algorithm, 'ambiguity' or 'log_odds' (the default and preferred value is 'ambiguity')
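
Since three of the positional arguments are comma-separated lists, a small helper can make the call less error-prone. The sketch below only assembles the argument list in the order shown in the usage line. The function name, the file names and all numeric values are hypothetical, and the exact number format that learn_paradigms.py expects for fractions is an assumption.

```python
def cross_validation_command(paradigms_with_codes, lemmas_with_codes,
                             max_feature_lengths, feature_fractions,
                             train_data_fractions, folds_number,
                             feature_selection_method=None,
                             predict_probabilities=False,
                             multiple_paradigms=False,
                             train_data_dir=None, test_output_dir=None):
    """Build the learn_paradigms.py argument list for cross-validation."""

    def joined(values):
        # Lists are passed on the command line as comma-separated strings.
        return ",".join(str(value) for value in values)

    command = ["python", "learn_paradigms.py"]
    if predict_probabilities:
        command.append("-p")
    if multiple_paradigms:
        command.append("-m")
    if train_data_dir is not None:
        command += ["-T", train_data_dir]
    if test_output_dir is not None:
        command += ["-O", test_output_dir]
    command += ["cross-validation", paradigms_with_codes, lemmas_with_codes,
                joined(max_feature_lengths), joined(feature_fractions),
                joined(train_data_fractions), str(folds_number)]
    if feature_selection_method is not None:
        command.append(feature_selection_method)
    return command


# Hypothetical file names and parameter grids, for illustration only.
example = cross_validation_command(
    "paradigms_with_codes.out", "lemmas_with_codes.out",
    max_feature_lengths=[3, 4, 5],
    feature_fractions=[0.1, 0.5],
    train_data_fractions=[0.5, 1.0],
    folds_number=5,
    feature_selection_method="ambiguity",
    predict_probabilities=True,
    test_output_dir="results",
)
```

The resulting list can be passed to `subprocess.run(example, check=True)` from the repository root.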
