ShubhamAgrawal-13/Wiki-Search-Engine

Wiki Search Engine

Author: Shubham Agrawal

Institute: IIIT-Hyderabad

Searching in Wiki search engine

1. Index Creation:


  1. Indexing - Creates a posting list for each word in the documents.
  2. Merging - While merging, builds a multi-level index and creates a secondary index for fast searching.
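As a rough illustration of the indexing step, here is a minimal sketch of building posting lists in memory. The names `tokenize` and `build_postings`, the tokenization rule, and the `(doc_id, term_frequency)` posting format are assumptions made for this example; indexer.py may store postings differently.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Assumption: lowercase and keep runs of letters and digits only.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_postings(docs):
    """Map each word to a list of (doc_id, term_frequency) pairs sorted by doc_id."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for word, tf in Counter(tokenize(docs[doc_id])).items():
            postings[word].append((doc_id, tf))
    return dict(postings)

# Hypothetical two-document corpus.
docs = {1: "Cricket world cup", 2: "World cup final, world record"}
index = build_postings(docs)
```

Because documents are visited in increasing doc ID order, every posting list comes out sorted, which keeps the later merge and intersection steps simple.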

The threshold for index creation is 26,000 documents.

The threshold for merging and starting a new index file is 1,00,000 (100,000) words.

For merging, Python's heapq module is used.
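The merge can be pictured as a k-way merge of sorted partial indexes. The sketch below uses `heapq.merge` from the standard library; the function name `merge_runs`, the in-memory run format, and the idea of recording the first word of each output chunk as the secondary index are assumptions for illustration, not a description of what merger.py actually writes.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_runs(runs, max_words):
    """Merge sorted (word, postings) runs into chunks of at most max_words words.

    Returns (chunks, secondary), where secondary[i] is the first word of chunks[i].
    """
    chunks, secondary = [], []
    merged = heapq.merge(*runs, key=itemgetter(0))  # lazy k-way merge by word
    for word, group in groupby(merged, key=itemgetter(0)):
        # Combine the postings of the same word coming from different runs.
        postings = sorted(p for _, plist in group for p in plist)
        if not chunks or len(chunks[-1]) >= max_words:
            chunks.append([])        # start a new "index file"
            secondary.append(word)   # remember its first word
        chunks[-1].append((word, postings))
    return chunks, secondary

# Two hypothetical partial indexes, each sorted by word.
run_a = [("cricket", [(1, 1)]), ("world", [(1, 1)])]
run_b = [("cup", [(2, 1)]), ("world", [(2, 2)])]
chunks, secondary = merge_runs([run_a, run_b], max_words=2)
```

In the real setting each run would be read line by line from a partial index file and `max_words` would be the 1,00,000-word threshold; `heapq.merge` only holds one pending entry per run in memory.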

There are 3 files for index creation:

1. indexer.py
2. merger.py
3. title_mapper.py 

Index statistics:

  1. I ran it on 34 XML files, around 42.6 GB of the English Wikipedia dump.

  2. Index size = 11.1 GB

  3. Number of index files = 132

  4. Each inverted index file contains around 1 lakh (100,000) words.

  5. Number of tokens (words in index files) = 13125805

  6. Total number of documents = 9829059

  7. I used my own doc ID scheme and created title_mappping.txt to map titles to doc IDs.

  8. I created a secondary index for the title file, as it was around 350 MB, for fast document title retrieval (title_mapper.py).

2. Searching:


There are 2 files for searching:

1. search.py 
	- Takes queries.txt as its first argument.
2. search_one_query.py 
	- Takes the query string as its first argument.

Query types:

  1. Phrase Query: Eg: cricket world cup
  2. Field Query: Eg: t:World Cup i:2019 c:Cricket
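To make the two query forms concrete, here is a small sketch of splitting a query into its fields. The function name `parse_query` and the rule "a single lowercase letter followed by a colon starts a field" are assumptions; this README does not spell out what t, i and c stand for, so the sketch treats them as opaque prefixes.

```python
import re

# Assumption: a field starts with one lowercase letter followed by a colon.
FIELD_RE = re.compile(r"\b([a-z]):")

def parse_query(query):
    """Return {field: text}. A query without prefixes is stored under the key None."""
    matches = list(FIELD_RE.finditer(query))
    if not matches:
        return {None: query.strip()}
    fields = {}
    for m, nxt in zip(matches, matches[1:] + [None]):
        end = nxt.start() if nxt else len(query)
        fields[m.group(1)] = query[m.end():end].strip()
    return fields

field_query = parse_query("t:World Cup i:2019 c:Cricket")
phrase_query = parse_query("cricket world cup")
```

Here `field_query` holds one entry per prefix (`t`, `i`, `c`), while `phrase_query` holds the whole phrase under `None`.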

Method for searching:

  1. Binary search the secondary index (multi_level.txt) to find the number of the inverted index file that contains each query word.

  2. Get the posting lists of the words in the query.

  3. Take the intersection of the posting lists.

  4. Rank the resulting documents by TF-IDF score.

  5. Display the top K results.
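The steps above can be sketched end to end as follows. This is a simplified in-memory version: `secondary` stands in for multi_level.txt as a sorted list of the first word of each index file, `chunks` stands in for the index files themselves, and the weighting `(1 + log tf) * log(N / df)` is one common TF-IDF variant chosen for illustration, not necessarily the formula used in search.py.

```python
import math
from bisect import bisect_right

def find_file(secondary, word):
    """Binary search: index of the file whose first word is the largest one <= word."""
    return bisect_right(secondary, word) - 1

def search(words, secondary, chunks, n_docs, k):
    plists = []
    for w in words:
        i = find_file(secondary, w)
        plist = dict(chunks[i]).get(w) if i >= 0 else None
        if not plist:
            return []                      # a missing word empties the intersection
        plists.append(dict(plist))         # doc_id -> term frequency
    common = set(plists[0]).intersection(*plists[1:])
    scores = {
        d: sum((1 + math.log(p[d])) * math.log(n_docs / len(p)) for p in plists)
        for d in common
    }
    # Highest score first; ties broken by doc ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]

# Hypothetical data: two "index files" and their secondary index.
secondary = ["cricket", "world"]
chunks = [
    [("cricket", [(1, 1), (3, 2)]), ("cup", [(1, 1), (2, 1), (3, 1)])],
    [("world", [(1, 1), (2, 2), (3, 1)])],
]
top = search(["cricket", "cup"], secondary, chunks, n_docs=4, k=2)
```

The binary search only touches the small secondary index, so a query opens just the index files that can contain its words instead of scanning all of them.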
