A search engine that lets users query the Word and PowerPoint files uploaded to the Moodle CMS of BITS Pilani Hyderabad.
- Search through the contents of the documents on the CMS
- Returns the documents that most closely match the query
- For each document, shows the 5 most similar sentences containing your query words
- The index of documents is updated regularly and dynamically, so there is no need to rebuild it every time
- Backend in MongoDB for persisting the index
- Fully documented code, viewable from docs
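
To make the "5 most similar sentences" feature concrete, here is a minimal sketch of one way such a ranking could work. It is an illustration only, not the project's actual implementation: the names `tokenize` and `top_sentences` are hypothetical, and plain word overlap is an assumed stand-in for whatever similarity measure the code really uses.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_sentences(query: str, sentences: List[str], k: int = 5) -> List[str]:
    """Return up to k sentences sharing the most distinct words with the query.

    Hypothetical helper: word overlap is an assumed scoring rule.
    """
    query_words = set(tokenize(query))
    scored = []
    for position, sentence in enumerate(sentences):
        overlap = len(query_words & set(tokenize(sentence)))
        if overlap > 0:
            # Negative overlap sorts best matches first; position breaks ties
            # in favour of sentences that appear earlier in the document.
            scored.append((-overlap, position, sentence))
    scored.sort()
    return [sentence for _, _, sentence in scored[:k]]
```

Sentences that share no word with the query are dropped, so fewer than `k` results may be returned.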
- Clone this repo, or click "Download as Zip" and extract the files.
- Rename `sample_config.toml` to `config.toml` and set the required values.
- Ensure Python 3.7 is installed and on your system `PATH`.
- Install pipenv using `pip install -U pipenv`.
- In the project folder, run `pipenv install` to install all Python dependencies.
- Download the nltk datasets:
  - Run `pipenv run python` to start an interpreter.
  - `>>> import nltk`
  - `>>> nltk.download("stopwords")`
  - `>>> nltk.download("wordnet")`
  - `>>> nltk.download("genesis")`
- [For doc support] Install `catdoc` to enable extraction from `.doc` files using `apt install catdoc` (Ubuntu). If you are on Windows, you can skip processing `doc` files by removing it from `ALLOWED_EXTS` in the config file.
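
If you would rather not type the nltk downloads into an interpreter, they can be scripted. The sketch below is not part of the repo: `download_datasets` and `NLTK_DATASETS` are hypothetical names, and the optional `download` parameter exists only so the loop can be exercised without network access. By default it calls `nltk.download`, which returns `True` on success.

```python
from typing import Callable, List, Optional

# Dataset names taken from the installation steps above.
NLTK_DATASETS = ["stopwords", "wordnet", "genesis"]


def download_datasets(
    download: Optional[Callable[[str], bool]] = None,
) -> List[str]:
    """Download each dataset and return the names whose download failed."""
    if download is None:
        # Imported lazily so this module loads even before nltk is installed.
        import nltk

        download = nltk.download
    failed = []
    for name in NLTK_DATASETS:
        if not download(name):
            failed.append(name)
    return failed
```

Calling `download_datasets()` inside `pipenv run python` fetches all three datasets; an empty list back means every download reported success.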
To generate the index, run `pipenv run python indexer.py`. It goes through all the courses you are enrolled in on your CMS account and, whenever it encounters a new file, processes it and adds it to the index.
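
The "only add new files" behaviour can be pictured with the sketch below. It is a simplified illustration under stated assumptions, not the code in `indexer.py`: a plain dict stands in for the MongoDB-backed index, `update_index` is a hypothetical name, and splitting on whitespace is a placeholder for the real text processing.

```python
from typing import Dict, Iterable, List, Tuple


def update_index(
    index: Dict[str, List[str]],
    files: Iterable[Tuple[str, str]],
) -> List[str]:
    """Add files not yet in the index and return the ids that were added.

    `index` maps a file id to its processed tokens (assumed layout).
    `files` yields (file_id, extracted_text) pairs.
    """
    added = []
    for file_id, text in files:
        if file_id in index:
            continue  # already indexed on an earlier run, so skip it
        index[file_id] = text.lower().split()  # placeholder processing step
        added.append(file_id)
    return added
```

Because existing entries are never touched, re-running the function over the same files is cheap and leaves the index unchanged.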
To query the index, run `pipenv run python main.py`.