WIKIEngine is a Wikipedia-based search engine built on top of an efficient, multi-threaded web crawler.
WIKIEngine can return relevant Wikipedia pages as well as images associated with the user's search.
Currently, WIKIEngine queries over 100,000 pages and over 50,000 images indexed by the web crawler.
In addition to searching, WIKIEngine also provides search suggestions ("did you mean" functionality) when it cannot find any relevant results for the user's search.
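The "did you mean" behavior described above can be sketched with fuzzy matching against the index keys. This is only an illustration, not the repository's implementation: the `index_keys` list and the `suggest` helper are hypothetical stand-ins, using the standard-library `difflib` matcher.

```python
import difflib

# Hypothetical sample of index keys; the real keys come from the crawler's index.
index_keys = ["python", "physics", "philosophy", "photography"]

def suggest(query, keys, cutoff=0.6):
    """Return up to three close matches to a query term, for a 'did you mean' prompt."""
    return difflib.get_close_matches(query.lower(), keys, n=3, cutoff=cutoff)

suggest("phyton", index_keys)  # a misspelled query still finds "python"
```

Raising `cutoff` makes suggestions stricter; an empty result means no key is close enough to suggest.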
WIKIEngine categorizes search results into most relevant and other relevant results, based on how closely each result page or image matches the user's search.
WIKIEngine's web crawler scrapes pages for the most relevant data and then indexes them into its database. Its features include creating union pages under the same key, recording page references for each key, using NLP to optimize the key indexing of pages, and customizable crawl parameters.
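The "union pages under the same key" idea above can be sketched as an inverted index: each key maps to the union of all pages referencing it, with occurrence counts. The `build_index` function and the toy `pages` data are assumptions for illustration only; the real crawler extracts words from live Wikipedia pages and applies NLP normalization.

```python
from collections import defaultdict

def build_index(pages):
    """Build an inverted index: each key maps to the union of pages
    (with reference counts) in which that key occurs."""
    index = defaultdict(dict)  # key -> {page_url: occurrence_count}
    for url, words in pages.items():
        for word in words:
            key = word.lower()  # crude normalization, standing in for real NLP
            index[key][url] = index[key].get(url, 0) + 1
    return index

# Toy pages standing in for crawled Wikipedia content.
pages = {
    "wiki/Python": ["Python", "language", "code"],
    "wiki/Java":   ["Java", "language", "code", "code"],
}
index = build_index(pages)
```

Keys shared by several pages (like "language" here) end up as one union entry, and the per-page counts can feed the relevance ranking.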
The WIKIEngine web app is developed using Flask and deployed on Heroku.
See the relevant files to understand the functionality in detail; I have tried to add a brief comment for each function.
WIKIEngine is publicly deployed on a Heroku server at https://wikiengine.herokuapp.com.
Spell check / search suggestion functionality:
static[Folder]- CSS and image resources for the web app
templates[Folder]- HTML templates for the web app
Procfile- specifies the commands executed by the web app on startup
apps.py- defines the app object (an instance of the Flask class); the application is started and requests are handled from here
Networking.py- requests pages and handles the responses
Parser.py- parses pages to extract data for indexing
Pre-process-for-search.py- pre-processes the index created by the crawler
Update_index.py- extends and updates the index after a crawl
index.json- JSON-based page index
imgind.txt- cPickle-based image index (space-efficient and faster loading)
ind.txt- cPickle-based page index (space-efficient and faster loading)
main_spider.py- base web crawler handler
wikiSearch.py- base search handler
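How apps.py ties the pieces together can be sketched as a minimal Flask app that looks queries up in the page index. This is a hedged illustration, not the actual apps.py: the `/search` route, the `INDEX` dictionary, and its contents are hypothetical (the real app would load the pickled index from ind.txt, as noted in the comments).

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

# Hypothetical stand-in for the real page index. In apps.py the pickled
# index (ind.txt) would be loaded instead, e.g.:
#   import pickle
#   with open("ind.txt", "rb") as f:
#       INDEX = pickle.load(f)
INDEX = {"python": ["wiki/Python_(programming_language)"]}

@app.route("/search")
def search():
    # Look the query term up in the page index and return matching pages.
    query = request.args.get("q", "").lower()
    return jsonify(results=INDEX.get(query, []))

# To serve locally, as apps.py does on startup:
#   app.run()
```

Pickling the index (rather than re-parsing JSON) is what makes startup loading faster and the on-disk file smaller, as noted for ind.txt and imgind.txt above.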
Requires Python 3.x installed
Python > 3.5 recommended (nothing will break 😁)
Clone the repo
Install the dependencies
pip install -r requirements.txt
To run WIKIEngine
python apps.py
To run the web crawler
python main_spider.py
This will prompt for a seed URL as the starting point for the crawl.
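The seed-URL crawl above can be sketched as a breadth-first traversal of a link graph. The `crawl_order` function and the toy `graph` are illustrative assumptions: the real main_spider.py fetches pages over HTTP and parses their links, whereas this sketch swaps in an in-memory link graph.

```python
from collections import deque

def crawl_order(seed, links_for, max_pages=5):
    """Return the order pages would be visited from a seed URL (BFS),
    stopping after max_pages (a customizable crawl parameter)."""
    frontier, seen, visited = deque([seed]), {seed}, []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in links_for(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Toy link graph standing in for real HTTP fetching and parsing.
graph = {
    "wiki/Seed": ["wiki/A", "wiki/B"],
    "wiki/A": ["wiki/B", "wiki/C"],
    "wiki/B": [],
    "wiki/C": [],
}
order = crawl_order("wiki/Seed", lambda u: graph.get(u, []))
# order == ["wiki/Seed", "wiki/A", "wiki/B", "wiki/C"]
```

The `seen` set keeps the crawler from revisiting pages, and `max_pages` mirrors the customizable crawl parameters mentioned earlier.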
Yeah!!! Just three simple commands 🤓, but you can definitely play with the code.