To train the lexical functions, the corpus should be prepared as one or more files with one sentence per line. Punctuation and numbers are removed, and the words are preprocessed (lemmatized). The sentence tokens are POS-tagged (the [Stanford POS Tagger](http://nlp.stanford.edu/software/tagger.shtml) can be used) and separated by a single space. [Here](https://github.com/anupama-gupta/AN_Composition/blob/master/sample_sentences.txt) are a few sample sentences.
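The cleaning step (before lemmatization and tagging, which are not shown) can be sketched roughly as follows; the function name and lowercasing are illustrative, not part of the toolkit:

```python
import re

def clean_line(line):
    """Remove punctuation and numbers, lowercase, and rejoin the tokens
    with single spaces. Lemmatization and POS tagging (e.g. with the
    Stanford POS Tagger) would follow this step and are omitted here."""
    line = re.sub(r"[^A-Za-z\s]", " ", line)  # strip punctuation and digits
    return " ".join(line.lower().split())     # single-space-separated tokens

print(clean_line("The small town, founded in 1850, grew quickly!"))
```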
The demo script runs the entire pipeline on a given corpus. It collects vocabulary counts, constructs a cooccurrence matrix, and creates a semantic space of vectors, which are used to train the composition models (lexical functions). In the final step it lists the nearest neighbours of a given compound (e.g. small_town) to verify the quality of the predicted compound vector.
Usage :
$ git clone https://github.com/anupama-gupta/AN_Composition
$ cd AN_Composition
$ ./demo.sh /path/corpus
After the corpus is created, the lexical functions can be learned using the following four tools :
Constructs unigram (adjectives and nouns), bigram (adjective-noun compounds), or context (bag-of-words) counts from the corpus, and optionally thresholds the resulting vocabulary by total vocabulary size. Vocabulary files are generated as output.
$ python vocab_count.py /path/corpus --unigrams c1 --bigrams c2 --contexts c3
where :
c1 : number of most frequent unigrams to keep
c2 : number of most frequent bigrams to keep
c3 : number of most frequent context words to keep
output files :
- [/dict/unigrams_vocab.txt](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
- [/dict/bigrams_vocab.txt](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
- [/dict/contexts_vocab.txt](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
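Thresholding the vocabulary to the most frequent entries, as vocab_count does, amounts to the following sketch (a toy stand-in, not the tool's actual code):

```python
from collections import Counter

def build_vocab(sentences, top_k):
    """Count token frequencies over the corpus and keep the top_k
    most frequent types (ties are broken by first occurrence)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return [word for word, _ in counts.most_common(top_k)]

corpus = ["the old tree stood", "the small town grew", "the old town slept"]
print(build_vocab(corpus, 3))
```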
Constructs cooccurrence counts (unigram-context or bigram-context) from the corpus. The files containing the unigrams or bigrams are obtained by running 'vocab_count' above. The user may specify optional parameters such as the context window size and the number of worker processes. This tool generates a sparse matrix file of cooccurrence counts.
$ python cooccur.py /path/corpus file1 --unigrams file2 --bigrams file3 --workers 8
where :
file1 - [/dict/contexts_vocab.txt](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
file2 - [/dict/unigrams_vocab.txt](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
file3 - [/dict/bigrams_vocab.txt](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
output files :
- [/dict/unigrams_cooccur.txt](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
- [/dict/bigrams_cooccur.txt](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
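Conceptually, the windowed counting that cooccur performs looks like this sketch (simplified to a symmetric window, with no multiprocessing):

```python
from collections import defaultdict

def cooccurrence_counts(sentences, targets, contexts, window=2):
    """Count, for each target word, how often each context word appears
    within `window` tokens on either side; returns a sparse dict."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            if word not in targets:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in contexts:
                    counts[(word, tokens[j])] += 1
    return dict(counts)

print(cooccurrence_counts(["the small town grew fast"], {"town"}, {"small", "fast"}))
```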
Constructs a vector space from the cooccurrence counts obtained from 'cooccur'. The vectors are weighted using positive pointwise mutual information (PPMI), normalized to unit length, and then reduced to 300 dimensions using singular value decomposition (SVD).
#### a. To create the unigram semantic space:
$ python semantic_space.py unigram_space file1 file2 file3
where :
file1 - [/dict/contexts_vocab.txt](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
file2 - [/dict/unigrams_vocab.txt](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
file3 - [/dict/unigrams_cooccur.txt](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
output file :
[/space/unigrams_space.pkl](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
#### b. To create the bigram semantic space:
$ python semantic_space.py bigram_space file1 file2 file3
where :
file1 - [/dict/contexts_vocab.txt](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
file2 - [/space/unigrams_space.pkl](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
file3 - [/dict/bigrams_cooccur.txt](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
output file :
[/space/bigrams_space.pkl](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
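The PPMI weighting, row normalization, and SVD reduction described above can be sketched with NumPy (using a tiny matrix and 2 dimensions instead of 300):

```python
import numpy as np

def ppmi_svd(counts, dim):
    """Weight a raw count matrix with positive PMI, L2-normalize the
    rows, and reduce them to `dim` dimensions via truncated SVD."""
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row * col))
    ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)
    norms = np.linalg.norm(ppmi, axis=1, keepdims=True)
    ppmi = np.divide(ppmi, norms, out=np.zeros_like(ppmi), where=norms > 0)
    u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    return u[:, :dim] * s[:dim]

vectors = ppmi_svd(np.array([[4.0, 0.0, 2.0], [0.0, 3.0, 1.0]]), 2)
print(vectors.shape)
```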
This tool performs the following three tasks: learning the lexical functions (learn_ADJ, learn_TENSOR), composing compound vectors (ADJ_space, TENSOR_space), and listing nearest neighbours (neighbours_ADJ, neighbours_TENSOR, neighbours_bigrams).
$ python lex_functions.py learn_ADJ file1 file2
where :
file1 - [/space/unigrams_space.pkl](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
file2 - [/space/bigrams_space.pkl](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
output file :
[/matrices/ADJ_matrices.pkl](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
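In the lexical function model, learn_ADJ presumably fits one matrix per adjective by least-squares regression, mapping noun vectors (from the unigram space) onto the observed vectors of the corresponding adjective-noun compounds (from the bigram space); the toy data below is illustrative only:

```python
import numpy as np

# Toy data: rows of N are noun vectors (unigram space); rows of P are
# the observed vectors of the corresponding adjective-noun compounds
# (bigram space) for one adjective, e.g. "small".
N = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
P = np.array([[2.0, 1.0], [0.0, 3.0], [2.0, 4.0]])

# Solve min_A ||N A^T - P||^2, so that compound ≈ A @ noun.
A_t, *_ = np.linalg.lstsq(N, P, rcond=None)
A = A_t.T

print(A @ N[0])  # should approximate P[0], the observed compound vector
```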
$ python lex_functions.py learn_TENSOR file1 file2
where :
file1 - [/space/unigrams_space.pkl](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
file2 - [/space/bigrams_space.pkl](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
output file :
[/matrices/TENSOR_matrix.pkl](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
$ python lex_functions.py ADJ_space file1 file2 file3
where :
file1 - [/space/unigrams_space.pkl](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
file2 - [/space/bigrams_space.pkl](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
file3 - [/matrices/ADJ_matrices.pkl](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
output file :
[/composed_space/composed_space_ADJ.pkl](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
$ python lex_functions.py TENSOR_space file1 file2 file3
where :
file1 - [/space/unigrams_space.pkl](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
file2 - [/space/bigrams_space.pkl](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
file3 - [/matrices/TENSOR_matrix.pkl](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
output file :
[/composed_space/composed_space_TENSOR.pkl](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
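For the TENSOR variant, a common formulation contracts a single third-order tensor with the adjective vector to obtain an adjective-specific matrix, which is then applied to the noun vector; whether this matches TENSOR_matrix.pkl exactly is an assumption. A shape-level sketch with random toy vectors:

```python
import numpy as np

d = 3
rng = np.random.default_rng(0)
T = rng.standard_normal((d, d, d))  # third-order composition tensor
adj = rng.standard_normal(d)        # e.g. the vector for "small"
noun = rng.standard_normal(d)       # e.g. the vector for "town"

# Contract the tensor with the adjective vector to get a (d, d) matrix,
# then apply that matrix to the noun to obtain the composed compound.
A = np.tensordot(T, adj, axes=([2], [0]))
compound = A @ noun
print(compound.shape)
```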
$ python lex_functions.py neighbours_ADJ file1 compound file2 file3
where :
compound - e.g. good_boy, old_tree, young_actor
file1 - [/space/unigrams_space.pkl](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
file2 - a [semantic space file](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt) (in /space or in /composed_space). This is the space in which the neighbours will be searched.
file3 - [/matrices/ADJ_matrices.pkl](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
output :
list of nearest neighbours and their cosine similarity
$ python lex_functions.py neighbours_TENSOR file1 compound file2 file3
where :
compound - e.g. good_boy, old_tree, young_actor
file1 - [/space/unigrams_space.pkl](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
file2 - a [semantic space file](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt) (in /space or in /composed_space). This is the space in which the neighbours will be searched.
file3 - [/matrices/TENSOR_matrix.pkl](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
output :
list of nearest neighbours and their cosine similarity
$ python lex_functions.py neighbours_bigrams compound file1 file2
where :
file1 - a [semantic space file](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt) (in /space or in /composed_space). This is the space in which the neighbours will be searched.
file2 - [/space/bigrams_space.pkl](https://github.com/anupama-gupta/AN_Composition/blob/master/file_links.txt)
output :
list of nearest neighbours and their cosine similarity
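The neighbour search in all three commands boils down to ranking the vectors of the chosen space by cosine similarity to the (composed or observed) compound vector; the labels and vectors below are toy stand-ins:

```python
import numpy as np

def nearest_neighbours(query, space, labels, k=2):
    """Return the k labels whose vectors have the highest cosine
    similarity to the query vector, with their similarities."""
    space_n = space / np.linalg.norm(space, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = space_n @ q
    order = np.argsort(-sims)[:k]
    return [(labels[i], float(sims[i])) for i in order]

space = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
labels = ["village", "hamlet", "skyscraper"]
print(nearest_neighbours(np.array([1.0, 0.05]), space, labels))
```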