web-search-engine

Required Installations:

Install the following (using pip/conda):

bs4
pandas
nltk

Specify File Paths:

You may have to change the pathToWebpages (file path to webpages folder) and pathToBook (file path to JSON bookkeeping file) variables depending on your directory structure.

To change these variables, go to processing.py and specify your file paths on lines 20-21.
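As an illustration of the path setup described above, here is a minimal sketch. Only the variable names pathToWebpages and pathToBook come from this README; the folder name, the JSON file name, and the check_paths helper are hypothetical, and processing.py may well store the paths as plain strings rather than Path objects.

```python
from pathlib import Path

# Hypothetical locations: replace these with your own directory layout.
pathToWebpages = Path("WEBPAGES_RAW")
pathToBook = pathToWebpages / "bookkeeping.json"


def check_paths(webpages, book):
    """Return a list of problems found; an empty list means both paths look usable.

    This helper is not part of the project; it only illustrates a sanity check
    you could run before starting the indexer.
    """
    problems = []
    if not Path(webpages).is_dir():
        problems.append(f"webpages folder not found: {webpages}")
    if not Path(book).is_file():
        problems.append(f"bookkeeping file not found: {book}")
    return problems


if __name__ == "__main__":
    for problem in check_paths(pathToWebpages, pathToBook):
        print(problem)
```

Running a check like this before indexer.py can surface a wrong path immediately instead of partway through indexing.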

Running the Code:

Run indexer.py to create the inverted index. The index is saved locally as a compressed pickle file, invertedIdx.pbz2.

Run gui.py to start the GUI where you may input search queries and get results.
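The compressed pickle file mentioned above can be pictured with the following sketch. It assumes the .pbz2 extension means a pickle written through bz2 compression, which this README does not state explicitly; the function names save_compressed and load_compressed are hypothetical and are not taken from processing.py.

```python
import bz2
import pickle


def save_compressed(obj, path):
    """Pickle obj and write it to path through a bz2-compressed stream."""
    with bz2.BZ2File(path, "wb") as f:
        pickle.dump(obj, f)


def load_compressed(path):
    """Read a bz2-compressed pickle from path and return the original object."""
    with bz2.BZ2File(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    # Toy index for illustration: token -> {docID: term frequency}.
    toy_index = {"search": {"0/1": 3, "0/2": 1}, "engine": {"0/1": 1}}
    save_compressed(toy_index, "toy_index.pbz2")
    print(load_compressed("toy_index.pbz2") == toy_index)
```

As with any pickle, only load index files you created yourself, since unpickling untrusted data can execute arbitrary code.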

Brief Description of the Python Files:

indexer.py: Creates the inverted index and saves it to the pickle file invertedIdx.pkl
processing.py: Contains most of the functions for this project. It provides the functions that indexer.py calls to create the inverted index: creating a list of tokens from an HTML file, lemmatizing the tokens and classifying them as valid or not, assigning specific weights to tokens in certain HTML tags (such as headers and titles), inserting each (token, doc) pair into the index, and computing the tf-idf score for the tokens. It also computes tf-idf scores for the tokens in the query, normalizes these scores, and computes and returns the cosine similarity between the query and all the documents in the index. Additionally, it retrieves tokens from a docID, and compresses and decompresses the pickle file. Finally, it contains functions whose results are passed to gui.py, such as functions to retrieve the title of a doc, a description of a doc, and the top 20 results of the query.
gui.py: Creates the GUI.
query.py: Ranks the top 20 results of the query by their score after calling the respective functions from processing.py, and passes this to the GUI.
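The tf-idf scoring and cosine-similarity ranking described above can be sketched on a toy corpus as follows. This is not the code from processing.py or query.py: the function names, the index layout (token to {docID: term frequency}), and the log-scaled weighting formula are assumptions chosen for illustration, and the HTML tag weights and lemmatization are left out.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """docs: {doc_id: [tokens]} -> {token: {doc_id: raw term frequency}}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for token, tf in Counter(tokens).items():
            index[token][doc_id] = tf
    return index


def tf_idf(tf, df, n_docs):
    """Log-scaled term frequency times inverse document frequency (assumed variant)."""
    return (1 + math.log10(tf)) * math.log10(n_docs / df)


def rank(query_tokens, index, n_docs, k=20):
    """Return up to k (doc_id, cosine similarity) pairs, best match first."""
    # Squared length of every document vector, accumulated over all tokens.
    doc_norm_sq = defaultdict(float)
    for postings in index.values():
        for doc_id, tf in postings.items():
            doc_norm_sq[doc_id] += tf_idf(tf, len(postings), n_docs) ** 2

    # Query weights, ignoring tokens that never occur in the corpus.
    q_weights = {
        t: tf_idf(tf, len(index[t]), n_docs)
        for t, tf in Counter(query_tokens).items()
        if t in index
    }
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))

    # Dot product between the query and each document sharing a token with it.
    dots = defaultdict(float)
    for t, qw in q_weights.items():
        for doc_id, tf in index[t].items():
            dots[doc_id] += qw * tf_idf(tf, len(index[t]), n_docs)

    scores = {}
    for doc_id, dot in dots.items():
        denom = q_norm * math.sqrt(doc_norm_sq[doc_id])
        if denom > 0:  # skip vectors whose weights are all zero
            scores[doc_id] = dot / denom
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


if __name__ == "__main__":
    docs = {"a": ["cat", "sat"], "b": ["dog", "ran"], "c": ["cat", "cat", "dog"]}
    index = build_index(docs)
    for doc_id, score in rank(["cat"], index, len(docs)):
        print(doc_id, round(score, 3))
```

Normalizing by vector length keeps long documents from winning simply because they contain more tokens, which is the usual motivation for ranking by cosine similarity rather than by the raw dot product.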

tanyasreenagesh/WebSearchEngine
