
The WIKIEngine

Introduction -

WIKIEngine is a Wikipedia-based search engine built on top of a multi-threaded and efficient web crawler.

The WIKIEngine can retrieve relevant Wikipedia pages as well as images associated with the user's search.

Currently, WIKIEngine queries over 100,000 pages and over 50,000 images indexed by the web crawler.

In addition to searching, the WIKIEngine can also provide search suggestions ("did you mean" functionality) when it cannot find any relevant results for the user's search.
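As a rough sketch of how such a suggestion could be produced (not the project's actual implementation), the query can be fuzzy-matched against the indexed keys; the key list and cutoff below are illustrative assumptions:

```python
import difflib

def suggest(query, indexed_keys, max_suggestions=3):
    # Return the indexed terms closest to a query that produced no results.
    # `indexed_keys` is assumed to be an iterable of terms from the page index;
    # the 0.7 cutoff is an arbitrary illustrative threshold.
    return difflib.get_close_matches(query.lower(), indexed_keys,
                                     n=max_suggestions, cutoff=0.7)

# Example: suggest("machne", ["machine", "engine", "marine"]) -> ["machine"]
```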

WIKIEngine categorizes search results into most relevant and other relevant results, based on how relevant each result page or image is to the user's search.
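A minimal sketch of that split, assuming each result already carries a numeric relevance score (the score field and threshold below are illustrative, not taken from the repository):

```python
def categorize_results(results, threshold=0.5):
    # `results` is assumed to be a list of (page, score) pairs sorted by score;
    # the 0.5 threshold is an illustrative cut-off, not the project's rule.
    most_relevant = [page for page, score in results if score >= threshold]
    other_relevant = [page for page, score in results if score < threshold]
    return most_relevant, other_relevant
```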

WIKIEngine's web crawler scrapes pages for the most relevant data and then indexes them into its database. Its features include merging (taking the union of) pages under the same key, recording page references for each key, using NLP to optimize the key indexing for the pages, and customizable crawl parameters.
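A minimal sketch of that indexing idea, assuming NLTK is used for the NLP step; the tokenization, stemming, and index layout here are illustrative rather than the crawler's exact scheme:

```python
from collections import defaultdict

from nltk.corpus import stopwords           # requires the NLTK "stopwords" data
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize     # requires the NLTK "punkt" data

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def add_page_to_index(index, url, text):
    # Union pages under the same stemmed key: index[key] is a set of page URLs.
    for token in word_tokenize(text.lower()):
        if token.isalpha() and token not in stop_words:
            index[stemmer.stem(token)].add(url)

index = defaultdict(set)
add_page_to_index(index,
                  "https://en.wikipedia.org/wiki/Python_(programming_language)",
                  "Python is a high-level programming language")
```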

The web app for WIKIEngine is developed using Flask and deployed on Heroku.
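For context, a minimal Flask app of this shape might look like the sketch below; the route, template name, and placeholder results are assumptions, not the actual apps.py:

```python
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def search():
    # Hypothetical handler: the real apps.py wires the query into the search code.
    query = request.form.get("query", "") if request.method == "POST" else ""
    results = []  # placeholder for the engine's ranked results
    return render_template("index.html", query=query, results=results)

if __name__ == "__main__":
    app.run()
```

On Heroku, the Procfile (described in the code overview below) tells the dyno how to start this app.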

Visit the appropriate files to understand the functionality in detail; I have tried to add brief comments for the functions.

Deployed WIKIEngine -

The WIKIEngine is publicly deployed on Heroku at https://wikiengine.herokuapp.com.

[Screenshot of the deployed WIKIEngine]

Spell check / search suggestion functionality -

[Screenshot of the spell check / search suggestion functionality]

Code overview -

static [Folder] - CSS and image resources for the web app

templates [Folder] - HTML templates for the web app

Procfile - specifies the commands executed by the web app on startup

apps.py - defines the app instance of the Flask object; the application is started and requests are handled from here

Networking.py - requests the page and handles the response

Parser.py - parses the page to extract data for indexing

Pre-process-for-search.py - pre-processes the index created by the crawler

Update_index.py - extends and updates the index after a crawl

index.json - JSON-based page index

imgind.txt - cPickle-based image index (space-efficient and faster to load)

ind.txt - cPickle-based page index (space-efficient and faster to load; see the loading sketch after this list)

main_spider.py - base web crawler handler

wikiSearch.py - base search handler
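As the loading sketch referenced above: a pickled index such as ind.txt or imgind.txt could be read back at search time roughly like this (the function and the encoding note are assumptions, not wikiSearch.py):

```python
import pickle  # on Python 3, pickle uses the C-accelerated backend that cPickle provided

def load_index(path):
    # Read a pickled index file such as ind.txt or imgind.txt.
    # If the file was written under Python 2's cPickle, pickle.load may need
    # encoding="latin-1" to decode old string objects.
    with open(path, "rb") as f:
        return pickle.load(f)

page_index = load_index("ind.txt")
image_index = load_index("imgind.txt")
```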

Usage and Installation -

Requires Python 3.x installed

Python > 3.5 is recommended (nothing will break 😁)

Clone the repo

Install the dependencies

pip install -r requirements.txt

To run the WIKIEngine

python apps.py

To run the web crawler

python main_spider.py

This will ask for a seed URL as the starting point for the crawl.
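To give a feel for the multi-threaded crawl loop that starts from a seed URL, here is a rough sketch; the thread count, queue handling, and fetch step are assumptions that stand in for Networking.py, Parser.py, and main_spider.py rather than reproducing them:

```python
import threading
from queue import Queue

import requests

NUM_WORKERS = 8        # illustrative thread count, not the crawler's setting
frontier = Queue()     # URLs waiting to be crawled

def worker():
    while True:
        url = frontier.get()
        try:
            response = requests.get(url, timeout=10)
            # In the real project, parsing, indexing, and link extraction
            # (Parser.py / Update_index.py) would process `response` here, and
            # newly discovered links would be pushed back onto the frontier.
        except requests.RequestException:
            pass
        finally:
            frontier.task_done()

if __name__ == "__main__":
    seed = input("Enter the seed URL: ")
    frontier.put(seed)
    for _ in range(NUM_WORKERS):
        threading.Thread(target=worker, daemon=True).start()
    frontier.join()    # wait until every queued URL has been processed
```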

Yeah!!! Just three simple commands 🤓, but you can definitely play with the code.
