GitHub - aditya260694/Wikipedia-Corpus-Search-Engine: A search engine that indexes a 42 GB Wikipedia XML corpus and answers queries by finding relevant pages.

WIKIPEDIA SEARCH ENGINE - PHASE 1 Aditya Chandran 201201115

The index consists of multiple files that store the word-to-page mapping, along with the number of occurrences of each word in each page's Title, Infobox, Body Text, and Links.
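To make the mapping concrete, here is a minimal in-memory sketch of a per-field inverted index. It is not the repository's code: the field names, the `build_index` helper, and the sample pages are all assumptions made for illustration, and the real index is written to files rather than kept in a dictionary.

```python
from collections import Counter, defaultdict

# Hypothetical field names standing in for Title, Infobox, Body Text, Links.
FIELDS = ("title", "infobox", "body", "links")

def build_index(pages):
    """Build index[field][word][page_id] = occurrence count.

    pages maps a page id to a dict of field name -> raw text.
    """
    index = {field: defaultdict(dict) for field in FIELDS}
    for page_id, sections in pages.items():
        for field in FIELDS:
            # Naive tokenization: lowercase and split on whitespace.
            counts = Counter(sections.get(field, "").lower().split())
            for word, n in counts.items():
                index[field][word][page_id] = n
    return index

# Made-up sample pages, for illustration only.
pages = {
    1: {"title": "Python language", "body": "python is a language python"},
    2: {"title": "Snake", "body": "a python is a snake"},
}
index = build_index(pages)
print(index["body"]["python"])   # {1: 2, 2: 1}
print(index["title"]["python"])  # {1: 1}
```

Keeping a separate mapping per field lets a query weight a title match differently from a body match, which is one plausible reason for recording the counts per field.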

PACKAGES TO BE INSTALLED

blist can be downloaded from https://pypi.python.org/packages/source/b/blist/blist-1.3.6.tar.gz. To install it, run python setup.py install.

EXTRA FEATURES

Created a secondary index for the Body Text information, i.e. Body Text is split into multiple files of 500 MB each, and the start word of each file is recorded.

Hashed the document IDs so that storing multiple occurrences takes less space.

Removed stop words; the list is in stopwords.txt.

Used a B-tree based implementation of lists and dictionaries (blist), which ensures that even large data can be handled efficiently in main memory.
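A secondary index of start words can be searched with a binary search to pick the one Body Text file that may contain a query word. The sketch below assumes the start words are kept in sorted order. The file names, the start words, and the `file_for_word` helper are hypothetical and not taken from the repository.

```python
import bisect

# Hypothetical secondary index: the first word stored in each Body Text file,
# in sorted order, with the matching file names at the same positions.
START_WORDS = ["aardvark", "gamma", "nebula", "tundra"]
FILE_NAMES = ["body0.txt", "body1.txt", "body2.txt", "body3.txt"]

def file_for_word(word):
    """Return the file whose word range could contain `word`, or None."""
    # bisect_right counts the start words <= word; the candidate file is the
    # last one whose start word does not sort after the query word.
    pos = bisect.bisect_right(START_WORDS, word) - 1
    if pos < 0:
        return None  # word sorts before the first indexed word
    return FILE_NAMES[pos]

print(file_for_word("gamma"))   # body1.txt
print(file_for_word("python"))  # body2.txt
print(file_for_word("aaa"))     # None
```

With this layout, only one file needs to be opened per query term, instead of scanning every 500 MB chunk.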

The index for the corpus will be written to src/NewIndex.

Repository files (2 commits):

src
README.md
run.sh
stopwords.txt
