Distributional Semantics Stability and Consistence

This repo contains codes and, selfcontained, some documentation about implementation of theoretical issues of my PhD thesis on Natural Language Processing (NLP). This work is entitled "Learning Kernels for Distributional Semantic Clustering" in which we present a novel (distributional) semantic embedding method, which aims at consistently performing semantic clustering at sentence level. Taking into account special aspects of Vector Space Models (VSMs), we propose to learn reproducing kernels in classification tasks. By this way, capturing spectral features from data is possible. These features make it theoretically plausible to model semantic similarity criteria in Hilbert spaces, i.e. the embedding spaces where consistency of semantic similarity measures and stability of algorithms implementing them can be ensured. We could improve the semantic assessment over embeddings, which are criterion-derived distributional representations from traditional semantic vectors (e.g. windowed cooccurrence counting). The learned kernel could be easily transferred to clustering methods, where the Class Imbalance Problem is considered (e.g. semantic clustering of polysemic definitions of terms. Their multimple meanings are Zipf distributed: Passonneau, et.al, 2012).

See the initial publications at my personal web site for futher details: www.corpus.unam.mx:8069/member/1

Usage

We have made some tests with toy datasets. At the moment we have parallelized the tool in a local machine with multiple CPUs.

-- Dependencies --

Ubunbtu 14.04 - The Operating system we developed this tool

modshogun - The Shogun Machine Learning Toolbox

scipy - The scientific Python stack

-- Command line --

We need executing the following piped command in the Ubuntu shell (be sure all files are in the same system file path where you execute the command):

$ python gridGen.py -f gridParameterDic.txt -t 5 | ./mklParallel.sh | python mklReducer.py [-options] > o.txt

The gridGen.py script writes, e.g., 5 parameter search paths to the stdout separately. These paths are randomly generated from the file gridParameterDic.txt which contains a python dictionary of parameters. The set of paths is read by the mklParallel.sh bash script who yields multiple training jobs (The OS manages the thing if there are more processes than machine cores). Until the results of these jobs are written at all to stdout, the mklReducer.py script reads them, prints them and emits the one with the maximum performance (command python mklReducer.py -h for seeing [-options]). These results are subsequently written either to an output file, e.g. o.txt, or to the stdout (if the last one is not specified by >).

Currently the training a test sets are provided in the same file. This dataset file must be specified in the configuration file called mkl_object.conf. Its first line is the name of the file containing data samples (one sample vector by each line of the file), the second line is the name of the file containing labels {-1, 1}, associated to each sample; one label by line in the file. The third line is optional (unlike the two first ones) and may specify the directory where the two first files are stored in. The portion of samples (and labels) used for training and testing, respectively, may be specified in code (for now) as the first argument of the function load_binData(), which is called at the beginning of the mklCall.py script. The current default is 75% of samples for training; for testing the remaining ahead ones.

After results are emitted, we can use the resulting best path as parameter set for performing inner products between pairs of input word/phrase/sentence (WPS) vectors. These pairwise products are easily conceived as consistent semantic similarity scores.

Current idea

According to the approach given in SemEval2015-task 1, we are currently posing our problem as a two class one. The class 1 corresponds to similar sentence pairs and the class 0 (-1) to dissimilar ones ({s_a, s_b}). A similarity criterion is learned from labels by the multikernel machine during training. Given that this machine classifies single vectors (but not pairs of them), we propose different sentence vector combination schemata, i.e. sentence_pair_vector = combine_pair{combine_sa(word_vector_1, word_vector_2,...), combine_sb(word_vector_1, word_vector_2,...)}, where sentence_pair_vector is associated to a label in {0, 1} and fed to the multikernel machine. We estimate the combine_pair{} operation allows holding dissimilarity features between combine_sa() and combine_sb() vectors (s_a sentence and s_b sentence, respectively), so the multikernel machine will filter them. Constituent word vectors word_vector_i of each sentence are simply window co-occurrence counting.

We are also planning using morph and PMI vectors as alternative inputs to our combination schemata. A binary NBayes classifier as well as a logistic output neural network are considered as baseline algorithms for addressing the task.

Name		Name	Last commit message	Last commit date
Latest commit History 152 Commits
db_word_space		db_word_space
mkl_outs		mkl_outs
.gitignore		.gitignore
LICENSE		LICENSE
MKL.out		MKL.out
MKL_sts_H10.out		MKL_sts_H10.out
README.md		README.md
bin_label_ape_gut.dat		bin_label_ape_gut.dat
file_names_util.py		file_names_util.py
fm_ape_gutXX.dat		fm_ape_gutXX.dat
gridGen.py		gridGen.py
gridObj.py		gridObj.py
gridObj.pyc		gridObj.pyc
gridParameterDic.txt		gridParameterDic.txt
gridParameterDicConvW2v.txt		gridParameterDicConvW2v.txt
gridParameterDicD2VH10puces.txt		gridParameterDicD2VH10puces.txt
gridParameterDicD2VH10sts.txt		gridParameterDicD2VH10sts.txt
gridParameterDicD2VH20sts.txt		gridParameterDicD2VH20sts.txt
gridParameterDicGauss.txt		gridParameterDicGauss.txt
gridParameterDicGaussLogCoo.txt		gridParameterDicGaussLogCoo.txt
gridParameterDicLess.txt		gridParameterDicLess.txt
gridParameterDicLessCmin.txt		gridParameterDicLessCmin.txt
gridParameterDicPolyGauss.txt		gridParameterDicPolyGauss.txt
gridParameterDicSVD.txt		gridParameterDicSVD.txt
gridParameterDicSparse.txt		gridParameterDicSparse.txt
labels_toy.txt		labels_toy.txt
mkl.py		mkl.py
mklCall.py		mklCall.py
mklCall.py~		mklCall.py~
mklObj.py		mklObj.py
mklObj.py.save		mklObj.py.save
mklObj.py.save.1		mklObj.py.save.1
mklObj.pyc		mklObj.pyc
mklObj.py~		mklObj.py~
mklParallel.sh		mklParallel.sh
mklReducer.py		mklReducer.py
mkl_1.out		mkl_1.out
mkl_all-eng-515e3-NO-test-correct_d2v_H10_sub_m5w8.out		mkl_all-eng-515e3-NO-test-correct_d2v_H10_sub_m5w8.out
mkl_all-eng-515e3-NO-test-correct_d2v_H20_sub_m5w8.out		mkl_all-eng-515e3-NO-test-correct_d2v_H20_sub_m5w8.out
mkl_object.conf		mkl_object.conf
mkl_objectD2VH10sts.conf		mkl_objectD2VH10sts.conf
mkl_objectD2VH20sts.conf		mkl_objectD2VH20sts.conf
mkl_objectPucesEn.conf		mkl_objectPucesEn.conf
mkl_objectSVD.conf		mkl_objectSVD.conf
mkl_regressor.py		mkl_regressor.py
mkl_regressor.pyc		mkl_regressor.pyc
mkl_test.py		mkl_test.py
mkl_test.pyc		mkl_test.pyc
out_puces.out		out_puces.out
out_puces_sort.out		out_puces_sort.out
output_csbsp_sub.txt		output_csbsp_sub.txt
output_gs_sub_ccbsp_r.txt		output_gs_sub_ccbsp_r.txt
paths.txt		paths.txt
plotting.py		plotting.py
resultados.txt		resultados.txt
resultados_pairs_headlines_conv_dense70-30.txt		resultados_pairs_headlines_conv_dense70-30.txt
resultados_pairs_headlines_dense_convs_coocurrences70-30_svd.txt		resultados_pairs_headlines_dense_convs_coocurrences70-30_svd.txt
resultados_pairs_headlines_dense_sub70-30_svd.txt		resultados_pairs_headlines_dense_sub70-30_svd.txt
t.py		t.py
test_data10.mtx		test_data10.mtx
test_labels10.mtx		test_labels10.mtx
test_paths.txt		test_paths.txt
toy_data.dat		toy_data.dat
toy_labels.dat		toy_labels.dat
toy_reg_labels.dat		toy_reg_labels.dat
train_data10.mtx		train_data10.mtx
train_labels10.mtx		train_labels10.mtx

License

wandabwa2004/distributionalSemanticStabilityThesis

Folders and files

Latest commit

History

Repository files navigation

Distributional Semantics Stability and Consistence

Usage

Current idea

About

Resources