This repo contains codes and, selfcontained, some documentation about implementation of theoretical issues of my PhD thesis on Natural Language Processing (NLP). This work is entitled "Learning Kernels for Distributional Semantic Clustering" in which we present a novel (distributional) semantic embedding method, which aims at consistently performing semantic clustering at sentence level. Taking into account special aspects of Vector Space Models (VSMs), we propose to learn reproducing kernels in classification tasks. By this way, capturing spectral features from data is possible. These features make it theoretically plausible to model semantic similarity criteria in Hilbert spaces, i.e. the embedding spaces where consistency of semantic similarity measures and stability of algorithms implementing them can be ensured. We could improve the semantic assessment over embeddings, which are criterion-derived distributional representations from traditional semantic vectors (e.g. windowed cooccurrence counting). The learned kernel could be easily transferred to clustering methods, where the Class Imbalance Problem is considered (e.g. semantic clustering of polysemic definitions of terms. Their multimple meanings are Zipf distributed: Passonneau, et.al, 2012).
See the initial publications at my personal web site for futher details: www.corpus.unam.mx:8069/member/1
We have made some tests with toy datasets. At the moment we have parallelized the tool in a local machine with multiple CPUs.
-- Dependencies --
Ubunbtu 14.04 - The Operating system we developed this tool
modshogun - The Shogun Machine Learning Toolbox
scipy - The scientific Python stack
-- Command line --
We need executing the following piped command in the Ubuntu shell (be sure all files are in the same system file path where you execute the command):
$ python gridGen.py -f gridParameterDic.txt -t 5 | ./mklParallel.sh | python mklReducer.py [-options] > o.txt
The gridGen.py
script writes, e.g., 5 parameter search paths to the stdout
separately. These paths are randomly generated from the file gridParameterDic.txt
which contains a python dictionary of parameters. The set of paths is read by the mklParallel.sh
bash script who yields multiple training jobs (The OS manages the thing if there are more processes than machine cores). Until the results of these jobs are written at all to stdout, the mklReducer.py
script reads them, prints them and emits the one with the maximum performance (command python mklReducer.py -h
for seeing [-options]
). These results are subsequently written either to an output file, e.g. o.txt
, or to the stdout (if the last one is not specified by >
).
Currently the training a test sets are provided in the same file. This dataset file must be specified in the configuration file called mkl_object.conf
. Its first line is the name of the file containing data samples (one sample vector by each line of the file), the second line is the name of the file containing labels {-1, 1}
, associated to each sample; one label by line in the file. The third line is optional (unlike the two first ones) and may specify the directory where the two first files are stored in. The portion of samples (and labels) used for training and testing, respectively, may be specified in code (for now) as the first argument of the function load_binData()
, which is called at the beginning of the mklCall.py
script. The current default is 75% of samples for training; for testing the remaining ahead ones.
After results are emitted, we can use the resulting best path as parameter set for performing inner products between pairs of input word/phrase/sentence (WPS) vectors. These pairwise products are easily conceived as consistent semantic similarity scores.
According to the approach given in SemEval2015-task 1, we are currently posing our problem as a two class one. The class 1 corresponds to similar sentence pairs and the class 0 (-1) to dissimilar ones ({s_a, s_b}
). A similarity criterion is learned from labels by the multikernel machine during training. Given that this machine classifies single vectors (but not pairs of them), we propose different sentence vector combination schemata, i.e. sentence_pair_vector = combine_pair{combine_sa(word_vector_1, word_vector_2,...), combine_sb(word_vector_1, word_vector_2,...)}
, where sentence_pair_vector
is associated to a label in {0, 1}
and fed to the multikernel machine. We estimate the combine_pair{}
operation allows holding dissimilarity features between combine_sa()
and combine_sb()
vectors (s_a
sentence and s_b
sentence, respectively), so the multikernel machine will filter them. Constituent word vectors word_vector_i
of each sentence are simply window co-occurrence counting.
We are also planning using morph and PMI vectors as alternative inputs to our combination schemata. A binary NBayes classifier as well as a logistic output neural network are considered as baseline algorithms for addressing the task.