GitHub - vmanisha/QueryExpansion: Modules to generate query expansion terms

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 116 Commits
Whoosh		Whoosh
bayes		bayes
clueweb_docs		clueweb_docs
clustering		clustering
dbPedia		dbPedia
entity		entity
evaluate		evaluate
features		features
plots		plots
queryLog		queryLog
questions		questions
random_code		random_code
randomwalk		randomwalk
scripts		scripts
search		search
sequence		sequence
tasks		tasks
tweetcrawl		tweetcrawl
utils		utils
.gitignore		.gitignore
Readme		Readme
ToDo		ToDo

Repository files navigation

Packages for entity tagged query analysis

Requirements

Entity Tagging and query clustering
	Following python packages are required to run respective modules. 
	1. NLTK (stemming) and punkt for sentence extraction and tokenization
	2. cluster (k-mediods)
	3. distance (jaccard)
	5. numpy
	6. html2text
	7. pattern (for html processing)
	8. lxml (questions package needs it for parsing yahoo questions dataset)
	

	Following resources have to be downloaded
	1. dbpedia article category file 
	2. dbpedia entity type file


Files and input parameters

 Feature generation 
	Following features are calculated for each query pair
	1. Session cosine
	2. Ngrams cosine and jaccard
	3. Users cosine
	4. URls cosine
	5. Entity Type cosine
	6. Entity Category cosine
	7. Entity cosine
	
	They are combined by weights (tuned on training dataset) 

	Important files and functions
	1. features/generateFeatures.py : takes query file and tags queries with entity, category and their types
	3. features/__init__.py : find out weight matrix -- findPairwiseDistance(featFile, outFile)
	4. /entity/category/findCategoryClusters.py : finds out clusters per entity type. 


 Steps followed to cluster
	INPUT : Aol-type formated file (tab seperated)
		<uid> query <timestamp> <click rank> <curl>
	
	1. Tag queries with entities
	2. accumulate all features with following options 
		-- uid : true or false  (whether user id present or not, if not then put in dummy id there)
		-- wtype : phrase or queries 
	3. write the feature file
	4. find pairwise similarity
	5. output clusters with following options
	
	OUTPUT : folder with clusters 
		each file in folder has following format
		-- algoTypeK.txt : number of final clusters (k) computed using algo (algoType)
		-- <type> <wtype> <wtype> .... <wtype>  (tab seperated)  : Each line contains a type (wiki category or type file) and its cluster of wtypes (either words, phrases or queries)
	

EXECUTION SEQUENCE (to run!!)
	Add project into the classpath -- export PYTHONPATH=<path-to-source-code>
	1. Run dexter: java -jar dexter-2.1.0.jar 
	2. Run features/generateFeatures.py with required commands:
		'--iFile', '--oFile', '--typeFile', '--catFile', '--uid', '--wtype'		
	3. Run features/__init__.py featFile outFile
	4. Run entity/category/findCategoryClusters.py with required commands
		 '--featFile','--distFile','--outDir','--algo','--lowerLimit','--upperLimit'