READ_ME

Semantic Search using Word Embeddings

Name: Aishwarya Sahani

UIN: 652324475

The scope of the project is to crawl and retrieve the web pages belonging to the “uic.edu” domain. These pages are then preprocessed, and the tokenized words are represented as word-embedding vectors. Multiple word embeddings, such as word2vec and GloVe, are evaluated to find the best embedding for this purpose. The word vectors are combined with tf-idf values to form the final representation. Cosine similarity is used to measure how similar each document is to the query, and the top 10 documents by similarity are selected. These pages are then sorted using the PageRank algorithm to rank them by relevance.
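The retrieval step described above can be sketched in a few lines. This is a minimal illustration, not the code in searchEngine.py: the function names, the 2-dimensional toy embeddings, and the choice to use the raw embedding of the query word as the query vector are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {token: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            tok: (cnt / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        })
    return weights

def weighted_vector(weights, embeddings, dim):
    """Sum the embedding of each token scaled by its tf-idf weight."""
    vec = [0.0] * dim
    for tok, w in weights.items():
        emb = embeddings.get(tok)
        if emb is None:
            continue  # skip out-of-vocabulary tokens
        for i, value in enumerate(emb):
            vec[i] += w * value
    return vec

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=10):
    """Indices of the k documents most similar to the query."""
    scored = [(cosine(query_vec, v), idx) for idx, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

# Toy 2-dimensional embeddings (hypothetical values, for illustration only).
embeddings = {
    "library": [1.0, 0.0], "books": [0.9, 0.1],
    "parking": [0.0, 1.0], "permit": [0.1, 0.9],
}
docs = [["library", "books"], ["parking", "permit"]]
doc_vecs = [weighted_vector(w, embeddings, 2) for w in tfidf_weights(docs)]
ranking = top_k(embeddings["books"], doc_vecs, k=10)
```

Here `ranking` lists document indices from most to least similar, so the library page comes before the parking page for the query word "books".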

The search engine combines the information retrieval technique of tf-idf with a language modeling approach such as word2vec or GloVe to get the best results. To order the results by importance, the PageRank algorithm is used to rank them.
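As a rough illustration of the ranking step, the sketch below runs a standard power-iteration PageRank over a small link graph and then reorders a list of retrieved pages by score. It is not the code in PageRank.py; the damping factor of 0.85, the iteration count, and the toy graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: page "c" receives the most in-links.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)

# Reorder pages that were already selected by similarity.
top_hits = ["b", "a", "c"]
reranked = sorted(top_hits, key=lambda page: ranks[page], reverse=True)
```

In this arrangement, similarity decides which pages are shown and PageRank only decides their order, which matches the two-stage description above.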

Steps to run:

Download the Crawled Pages Dataset from the link and save it in the project folder Links
Download the 300-dimensional GloVe embeddings.
(Optional) Download the 300-dimensional word2vec embeddings for comparison.
Place the embedding file(s) in the Embeddings folder.
Run the searchEngine.py file.
The system will ask for a query. Input your query, or press Enter to evaluate the pre-defined queries.
Wait for the results. Check the file Output.txt for a summary of the results.
To change the embeddings, uncomment the line in searchEngine.py that loads the embeddings you want to use.
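For readers curious about what loading the GloVe file involves: the published GloVe text files store one word per line followed by its space-separated vector components. The sketch below parses that layout into a dictionary. The function name `load_glove` and the 3-dimensional sample lines are assumptions for illustration; searchEngine.py may load the file differently.

```python
def load_glove(lines, dim=300):
    """Parse GloVe-style text lines into a {word: vector} dict."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip lines that do not have exactly dim components
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

# Tiny in-memory sample with dim=3 (hypothetical values).
sample = ["uic 0.1 0.2 0.3", "chicago 0.4 0.5 0.6", "broken 1.0"]
vectors = load_glove(sample, dim=3)
```

With a real file, the same function can be given an open file handle, since it only iterates over lines.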

Crawling the UIC (uic.edu) domain:

Run the file crawler.py.
Set the min_url count to control the number of pages to fetch.
Check the file Links.csv for the results.
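The core of a domain-restricted crawl can be sketched as follows. This is a single-threaded simplification and not the code in crawler.py: the names `in_domain`, `crawl`, `get_links`, and `max_pages` are hypothetical, and page fetching is replaced by a lookup in a fake in-memory site so the example runs without network access.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse

def in_domain(url, domain="uic.edu"):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)

def crawl(seed, get_links, max_pages):
    """Breadth-first crawl; get_links(url) returns the hrefs found on a page."""
    seen = {seed}
    frontier = deque([seed])
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for href in get_links(url):
            # Resolve relative links and drop #fragments before deduplicating.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if in_domain(absolute) and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited

# Fake site standing in for real HTTP requests (hypothetical pages).
fake_web = {
    "https://www.uic.edu/": ["/about", "https://cs.uic.edu/", "https://example.com/"],
    "https://www.uic.edu/about": ["/#top"],
    "https://cs.uic.edu/": [],
}
crawled = crawl("https://www.uic.edu/", lambda u: fake_web.get(u, []), max_pages=10)
```

In a real crawler, `get_links` would download the page with requests and extract anchors with bs4, and the loop could be parallelized with a thread pool, as the libraries listed in this README suggest.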

Libraries used:

math: to perform mathematical operations like logarithms
ast: to use literal_eval to convert a string to an expression
numpy: for large, multi-dimensional arrays and matrices, and array operations
pandas: for data manipulation and analysis
re: for regular expression operations
pkl (pickle): handling pickle files
nltk: word lemmatizer
glob: file retrieval
spacy: for stop word list
string: for string punctuations during cleaning
concurrent: thread handling with ThreadPoolExecutor (from concurrent.futures import ThreadPoolExecutor)
queue: using data structure Queue
requests: request web pages
urllib: working with URLs
bs4: parsing HTML docs
sys: exception handling
time: for time handling & formatting
gensim: word2vec embeddings
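
To show how the cleaning-related libraries fit together, here is a minimal preprocessing sketch that uses only re and string. The small stop word set stands in for the spacy list, and lemmatization with nltk is left out; both simplifications, along with the function name `preprocess`, are assumptions for the example.

```python
import re
import string

# Tiny stand-in for a full stop word list (hypothetical subset).
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "for"}

# Translation table that turns every punctuation character into a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def preprocess(text):
    """Lowercase, strip punctuation, tokenize, and drop stop words."""
    cleaned = text.lower().translate(_PUNCT_TABLE)
    tokens = re.findall(r"[a-z]+", cleaned)
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = preprocess("The Department of Computer Science, UIC: Admissions & Aid!")
```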

Contents:

Embeddings: folder containing the 300-dimensional GloVe embedding file and/or the 300-dimensional word2vec embeddings
crawler.py: run this file to crawl the web
searchEngine.py: run this file to run the search engine
PageRank.py: contains the PageRank algorithm
utils.py: contains the utility methods used during the implementation
Links.csv: contains the dataset of crawled pages
relevance.txt: contains the gold standard of results for the predefined queries
Output.txt: contains the summary of results
Report.pdf: contains the report of the project
