Skip to content

ragavvenkatesan/SearchEngine

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SearchEngine

This is the repository for my Information Retrieval course project from ASU Spring 2015 course. It was implemented progressively.

update 0: TF-IDF based retrieval

update 1: added options for Authorites and Hubs

update 2: added options for page rank

To run the code. First you need to setup jython and install pylucene. Then run the code as jython SearchEngine.py <lexicon_filename>

The code CreateLexicon.py running as jython CreateLexicon.py <lexicon_filname> will just create an index and save it down in a .pkl.gz file. This also can save document norms pre-saved. The code SearchFiles.py is the re-writing of the code SearchFiles.java into python. The code LinkAnalysis.py is the re-writing of the code LinkAnalysis.java into python. This can be used to understand how pylucene works. These can be used to understand how pylucene works.

Setting up SearchFiles.py

  • verbose = False will run the code silently.
  • create_lexicon_flag = True. if True will rebuild lexicon from scratch, if False will load a pre-created one as supplied in sys_arg[1]
  • normalize = False. if True will use document norms and normalized tf-idf, False will not.
  • n_retrieves = 10 number of documents to retreive
  • tf_idf_flag = True True retrieves based on Tf/idf, False retrieves based on only Tf.
  • ah_flag = True will run the code using Authorites and Hubs weighting.
  • pr_flag = True will run the code using Page Rank weighting.
  • directory = '../index' will setup the location of index.
  • having both the above False will simply run just TF-IDF.
  • linksFile and citationsFile are text files containing links and citations needed for HITS and pagerank algorithms.
  • root_set_size is a variable that determines the root set for hits algorithm.
  • maxIter determines how many iterations for power iterating to get eigen values.

About

This is the repository for my Information Retrieval course project from ASU Spring 2015 course.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published