vmanisha/QueryExpansion
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Packages for entity tagged query analysis Requirements Entity Tagging and query clustering Following python packages are required to run respective modules. 1. NLTK (stemming) and punkt for sentence extraction and tokenization 2. cluster (k-mediods) 3. distance (jaccard) 5. numpy 6. html2text 7. pattern (for html processing) 8. lxml (questions package needs it for parsing yahoo questions dataset) Following resources have to be downloaded 1. dbpedia article category file 2. dbpedia entity type file Files and input parameters Feature generation Following features are calculated for each query pair 1. Session cosine 2. Ngrams cosine and jaccard 3. Users cosine 4. URls cosine 5. Entity Type cosine 6. Entity Category cosine 7. Entity cosine They are combined by weights (tuned on training dataset) Important files and functions 1. features/generateFeatures.py : takes query file and tags queries with entity, category and their types 3. features/__init__.py : find out weight matrix -- findPairwiseDistance(featFile, outFile) 4. /entity/category/findCategoryClusters.py : finds out clusters per entity type. Steps followed to cluster INPUT : Aol-type formated file (tab seperated) <uid> query <timestamp> <click rank> <curl> 1. Tag queries with entities 2. accumulate all features with following options -- uid : true or false (whether user id present or not, if not then put in dummy id there) -- wtype : phrase or queries 3. write the feature file 4. find pairwise similarity 5. output clusters with following options OUTPUT : folder with clusters each file in folder has following format -- algoTypeK.txt : number of final clusters (k) computed using algo (algoType) -- <type> <wtype> <wtype> .... <wtype> (tab seperated) : Each line contains a type (wiki category or type file) and its cluster of wtypes (either words, phrases or queries) EXECUTION SEQUENCE (to run!!) Add project into the classpath -- export PYTHONPATH=<path-to-source-code> 1. Run dexter: java -jar dexter-2.1.0.jar 2. Run features/generateFeatures.py with required commands: '--iFile', '--oFile', '--typeFile', '--catFile', '--uid', '--wtype' 3. Run features/__init__.py featFile outFile 4. Run entity/category/findCategoryClusters.py with required commands '--featFile','--distFile','--outDir','--algo','--lowerLimit','--upperLimit'
About
Modules to generate query expansion terms
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published