Navigation Menu

Skip to content

vmanisha/QueryExpansion

Repository files navigation

Packages for entity tagged query analysis

Requirements

Entity Tagging and query clustering
	Following python packages are required to run respective modules. 
	1. NLTK (stemming) and punkt for sentence extraction and tokenization
	2. cluster (k-mediods)
	3. distance (jaccard)
	5. numpy
	6. html2text
	7. pattern (for html processing)
	8. lxml (questions package needs it for parsing yahoo questions dataset)
	

	Following resources have to be downloaded
	1. dbpedia article category file 
	2. dbpedia entity type file


Files and input parameters

 Feature generation 
	Following features are calculated for each query pair
	1. Session cosine
	2. Ngrams cosine and jaccard
	3. Users cosine
	4. URls cosine
	5. Entity Type cosine
	6. Entity Category cosine
	7. Entity cosine
	
	They are combined by weights (tuned on training dataset) 

	Important files and functions
	1. features/generateFeatures.py : takes query file and tags queries with entity, category and their types
	3. features/__init__.py : find out weight matrix -- findPairwiseDistance(featFile, outFile)
	4. /entity/category/findCategoryClusters.py : finds out clusters per entity type. 


 Steps followed to cluster
	INPUT : Aol-type formated file (tab seperated)
		<uid> query <timestamp> <click rank> <curl>
	
	1. Tag queries with entities
	2. accumulate all features with following options 
		-- uid : true or false  (whether user id present or not, if not then put in dummy id there)
		-- wtype : phrase or queries 
	3. write the feature file
	4. find pairwise similarity
	5. output clusters with following options
	
	OUTPUT : folder with clusters 
		each file in folder has following format
		-- algoTypeK.txt : number of final clusters (k) computed using algo (algoType)
		-- <type> <wtype> <wtype> .... <wtype>  (tab seperated)  : Each line contains a type (wiki category or type file) and its cluster of wtypes (either words, phrases or queries)
	

EXECUTION SEQUENCE (to run!!)
	Add project into the classpath -- export PYTHONPATH=<path-to-source-code>
	1. Run dexter: java -jar dexter-2.1.0.jar 
	2. Run features/generateFeatures.py with required commands:
		'--iFile', '--oFile', '--typeFile', '--catFile', '--uid', '--wtype'		
	3. Run features/__init__.py featFile outFile
	4. Run entity/category/findCategoryClusters.py with required commands
		 '--featFile','--distFile','--outDir','--algo','--lowerLimit','--upperLimit'

	

About

Modules to generate query expansion terms

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published