
The WIKIEngine

Introduction -

WIKIEngine is a Wikipedia-based search engine built on top of a multi-threaded and efficient web crawler.

The WIKIEngine can retrieve relevant Wikipedia pages as well as images associated with the user's search.

Currently, WIKIEngine queries over 100,000 pages and over 50,000 images indexed by the web crawler.

In addition to searching, the WIKIEngine can also provide search suggestions ("did you mean" functionality) when it cannot find any relevant results for the user's search.
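As a rough sketch of how such a suggestion could be produced (not the project's actual implementation), the query can be fuzzy-matched against the indexed keys; the key list and cutoff below are illustrative assumptions:

```python
import difflib

def suggest(query, indexed_keys, max_suggestions=3):
    # Return the indexed terms closest to a query that produced no results.
    # `indexed_keys` is assumed to be an iterable of terms from the page index;
    # the 0.7 cutoff is an arbitrary illustrative threshold.
    return difflib.get_close_matches(query.lower(), indexed_keys,
                                     n=max_suggestions, cutoff=0.7)

# Example: suggest("machne", ["machine", "engine", "marine"]) -> ["machine"]
```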

WIKIEngine categorizes search results into most relevant and other relevant results, based on how relevant each result page or image is to the user's search.
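A minimal sketch of that split, assuming each result already carries a numeric relevance score (the score field and threshold below are illustrative, not taken from the repository):

```python
def categorize_results(results, threshold=0.5):
    # `results` is assumed to be a list of (page, score) pairs sorted by score;
    # the 0.5 threshold is an illustrative cut-off, not the project's rule.
    most_relevant = [page for page, score in results if score >= threshold]
    other_relevant = [page for page, score in results if score < threshold]
    return most_relevant, other_relevant
```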

WIKIEngine's web crawler scrapes pages for the most relevant data and then indexes them into its database. Its features include merging (taking the union of) pages under the same key, recording page references for each key, using NLP to optimize the key indexing for the pages, and customizable crawl parameters.
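A minimal sketch of that indexing idea, assuming NLTK is used for the NLP step; the tokenization, stemming, and index layout here are illustrative rather than the crawler's exact scheme:

```python
from collections import defaultdict

from nltk.corpus import stopwords           # requires the NLTK "stopwords" data
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize     # requires the NLTK "punkt" data

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def add_page_to_index(index, url, text):
    # Union pages under the same stemmed key: index[key] is a set of page URLs.
    for token in word_tokenize(text.lower()):
        if token.isalpha() and token not in stop_words:
            index[stemmer.stem(token)].add(url)

index = defaultdict(set)
add_page_to_index(index,
                  "https://en.wikipedia.org/wiki/Python_(programming_language)",
                  "Python is a high-level programming language")
```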

The web app for WIKIEngine is developed using Flask and deployed on Heroku.
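For context, a minimal Flask app of this shape might look like the sketch below; the route, template name, and placeholder results are assumptions, not the actual apps.py:

```python
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def search():
    # Hypothetical handler: the real apps.py wires the query into the search code.
    query = request.form.get("query", "") if request.method == "POST" else ""
    results = []  # placeholder for the engine's ranked results
    return render_template("index.html", query=query, results=results)

if __name__ == "__main__":
    app.run()
```

On Heroku, the Procfile (described in the code overview below) tells the dyno how to start this app.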

Visit the appropriate files to understand the functionality in detail; I have tried to add brief comments for the functions.

Deployed WIKIEngine -

The WIKIEngine is publicly deployed on Heroku at https://wikiengine.herokuapp.com.

[Screenshot of the deployed WIKIEngine]

Spell check / search suggestion functionality -

[Screenshot of the spell check / search suggestion functionality]

Code overview -

static [Folder] - CSS and image resources for the web app

templates [Folder] - HTML templates for the web app

Procfile - specifies the commands executed by the web app on startup

apps.py - defines the app instance of the Flask object; the application is started and requests are handled from here

Networking.py - requests the page and handles the response

Parser.py - parses the page to extract data for indexing

Pre-process-for-search.py - pre-processes the index created by the crawler

Update_index.py - extends and updates the index after a crawl

index.json - JSON-based page index

imgind.txt - cPickle-based image index (space-efficient and faster to load)

ind.txt - cPickle-based page index (space-efficient and faster to load; see the loading sketch after this list)

main_spider.py - base web crawler handler

wikiSearch.py - base search handler
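As the loading sketch referenced above: a pickled index such as ind.txt or imgind.txt could be read back at search time roughly like this (the function and the encoding note are assumptions, not wikiSearch.py):

```python
import pickle  # on Python 3, pickle uses the C-accelerated backend that cPickle provided

def load_index(path):
    # Read a pickled index file such as ind.txt or imgind.txt.
    # If the file was written under Python 2's cPickle, pickle.load may need
    # encoding="latin-1" to decode old string objects.
    with open(path, "rb") as f:
        return pickle.load(f)

page_index = load_index("ind.txt")
image_index = load_index("imgind.txt")
```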

Usage and Installation -

Requires Python 3.x installed

Python > 3.5 is recommended (nothing will break 😁)

Clone the repo

Install the dependencies

pip install -r requirements.txt

To run the WIKIEngine

python apps.py

To run the web crawler

python main_spider.py

This will ask for a seed URL as the starting point for the crawl.
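To give a feel for the multi-threaded crawl loop that starts from a seed URL, here is a rough sketch; the thread count, queue handling, and fetch step are assumptions that stand in for Networking.py, Parser.py, and main_spider.py rather than reproducing them:

```python
import threading
from queue import Queue

import requests

NUM_WORKERS = 8        # illustrative thread count, not the crawler's setting
frontier = Queue()     # URLs waiting to be crawled

def worker():
    while True:
        url = frontier.get()
        try:
            response = requests.get(url, timeout=10)
            # In the real project, parsing, indexing, and link extraction
            # (Parser.py / Update_index.py) would process `response` here, and
            # newly discovered links would be pushed back onto the frontier.
        except requests.RequestException:
            pass
        finally:
            frontier.task_done()

if __name__ == "__main__":
    seed = input("Enter the seed URL: ")
    frontier.put(seed)
    for _ in range(NUM_WORKERS):
        threading.Thread(target=worker, daemon=True).start()
    frontier.join()    # wait until every queued URL has been processed
```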

Yeah!!! Just three simple commands 🤓, but you can definitely play with the code.
