GitHub - anukat2015/compositional-semantics: Toolkit for learning compositional functions over word vector representations.

compositional-semantics is a package for learning compositional functions given vector representations of words. A special feature is the detection and handling of non-compositional phrases.

System Requirements

NumPy and SciPy

End-to-end Pipeline Instructions

Inputs: word vectors, context vectors, PPDB

In order to make things faster, it is better to run the TrainingExtractor module a priori on PPDB:
```
python extract_training.py -A X ppdb_loc > sorted_ppdb_examples
```

The -A X flag indicates we should output all POS pairs: specific pairs for the top X, and the rest in a generic 'X X' pair type. Other options available are:

-a: adjective-noun
-n: noun-noun
-d: determiner-noun
-v: verb-noun
-p: POS tags have also been provided along with PPDB data

Run the learning module to learn and output the compositional functions:
```
python learning.py wordVecs sorted_ppdb_examples parameters
```

Flags:

-n: normalize word vectors to be unit norm (recommended)
-j X: sets number of jobs/cores. Default is 4
-f: filter duplicates; this removes training examples where a word is repeated ('trade union' --> 'union')
-r X: sets the regressor to be used when learning the compositional functions. Default is lasso
-d: diagonal approximation: only the diagonal of the resulting tensor parameters is active
-c: concatenative model. Instead of N^2 covariates, where N is the dimension of one vector, we have 2*N covariates. Note that the diagonal approximation and the concatenative model flags cannot be used together.
-p: use PPDB; this means that the second argument passed to the script is actually PPDB, not the POS-sorted examples
-P: linguistic regularization prior strength
-m: multivariate version of whatever regressor is chosen with -r. By default, we distribute estimation over cores, but with -m it's a single core process, but you get better R^2

If you want to evaluate the learned representations against the human evaluations (in the Mitchell & Lapata 2010 study), you can run the evaluation module:
```
python evaluation.py wordVecs parameters < mitchell_lapata_scores
```

Flags:

-j: adjective-noun phrase similarity evaluation
-J: adjective-noun non-compositionality evaluation
-n: noun-noun phrase similarity evaluation
-N: noun-noun non-compositionality evaluation
-m: output correlation scores from a simple point-wise multiplicative model (no learning, so parameters argument is dummy)
-a: output correlation scores from a simple point-wise additive model (no learning, so parameters argument is dummy)
-d: output most divergent and least divergent similarity pairs between learned representations and human evaluations
-c: concatenative model. If the parameters have been learned using a concatenative model, then this flag is required.
-e: compute Spearman's correlation scores only with the most extreme similarities on the 1-7 scale

By default, input vectors are assumed normalized.

If you want to get non-compositionality scores for phrases using these parameters:

python non_comp_detect.py wordVecs contextVecs parameters unigram_counts per_sentence_grammar_loc_in < dev_corpus_with_pos_tags

Flags:

-c: concatenative model. If the parameters have been learned using a concatenative model, then this flag is required.
-b X: bin the values; argument is the type of binning (i.e., constant size bins or constant width bins); adds a type of non-linearity and more discrimination
-B X: number of bins; default: 10
-a: averaging; divide the log probability by the number of words to normalize for context length (beginning/end sentence effects)
-l X: context length; the number of words to look at on one side of the phrase (so the window is double this)
-s X: number of stop words to filter in the context. 0 means do not filter stop words.
-n X: apply normalization (and use X cores in the process). Not using this flag will leave scores as unnormalized.
-u: apply unigram correction; this is a heuristic that divides the score, which is the probability of the context given the phrasal representation, by the product of the unigram probabilities of the context. Only works if the -n flag is enabled.
-P: output perplexity instead of log probability; this also takes into account the length of the context
-C: score non-compositionality using a simple cosine similarity heuristic (ignoring context), where we average the cosine similarity of the phrasal representation with each of the word representations.
-v: print phrasal representation vectors only; useful output if you want to compute the distances between these representations and directly learned word2vec phrasal representations.
-V: print phrasal representation vectors when writing out per-sentence grammars as extra features in the MT grammar
-f: print and score phrases with full i.e., 2 times context length words in the context.
-p: print correlation and distance information by POS pair only
-d X: compute correlation with distance between directly learned phrasal representation and composed representation; only works for bigrams. X denotes the location of the file with the distances pre-computed.
-r: select right-branching when computing compositional representations for phrases of length greater than 2. The default is left-branching.
-h: impose heuristic headedness when computing compositional representations. For 'JJ-NN' and 'NN-NN' the right word is the head word and only its representation is used; for 'VV-NN' the verb is the head word.
-w X: write per-sentence grammars out; the grammars are in the same format as the grammars being read in, except we add an extra 'Segmentation' feature based on the non-compositionality score
-N: featurize rules witih non-terminals as well, based on the lexical items they contain.
-A: evaluate non-compositionality using simple pointwise additive compositional representations.
-M: evaluate non-compositionality using simple pointwise multiplicative compositional representations.

Note that in order to do non-compositionality detection, an evaluation set must be provided with the POS tags (separated by '#'). In order to do this, run the following command:

java -mx5000m -cp $POSTAGGER/stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model $POSTAGGER/models/english-bidirectional-distsim.tagger -textFile eval.lc-tok -outputFormat slashTags -tagSeparator \# -tokenize false > eval.lc-tok.pos

Things to add

include version of word2vec where we can output context vectors as well in repository (context vectors are necessary for non-compositionality calculation)
stand-alone interface to CompoModel module where we output the phrasal representations given a list of phrases

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
ACL		ACL
NAACL		NAACL
analysis		analysis
utils		utils
LICENSE		LICENSE
README.md		README.md
compute_composed.py		compute_composed.py
evaluation.py		evaluation.py
extract_training.py		extract_training.py
learning.py		learning.py
non_comp_detect.py		non_comp_detect.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ACL

ACL

NAACL

NAACL

analysis

analysis

utils

utils

LICENSE

LICENSE

README.md

README.md

compute_composed.py

compute_composed.py

evaluation.py

evaluation.py

extract_training.py

extract_training.py

learning.py

learning.py

non_comp_detect.py

non_comp_detect.py

Repository files navigation

System Requirements

End-to-end Pipeline Instructions

Things to add

About

Releases

Packages

Languages

License

anukat2015/compositional-semantics

Folders and files

Latest commit

History

Repository files navigation

System Requirements

End-to-end Pipeline Instructions

Things to add

About

Resources

License

Stars

Watchers

Forks

Languages