Search_Engine

A search engine project based on a corpus of 10 million tweets.

Each tweet first goes through the parser, which tokenizes it and removes all stopwords, irrelevant words, and symbols.
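
The parsing step can be sketched roughly as follows. This is a minimal illustration, not the project's actual parser: the `parse_tweet` name, the tiny `STOPWORDS` set, and the rule of dropping URLs and punctuation are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stopword list; the project's real list is not shown here.
STOPWORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in", "on", "for", "rt"}

URL_RE = re.compile(r"https?://\S+")    # links are treated as irrelevant (assumption)
TOKEN_RE = re.compile(r"[a-z0-9']+")    # keeps letters, digits, apostrophes; drops other signs


def parse_tweet(text):
    """Lowercase a tweet, strip URLs and symbols, and drop stopwords."""
    text = URL_RE.sub(" ", text.lower())
    tokens = TOKEN_RE.findall(text)
    return [token for token in tokens if token not in STOPWORDS]


terms = parse_tweet("RT The markets are UP today!!! https://t.co/abc #stocks")
```

Here `terms` holds only the content-bearing words of the tweet, ready to be passed to the indexer.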

After parsing, the tweet goes to the indexer, which stores the tweet ID under each term that appears in the tweet.
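
A term-to-tweet-ID mapping of this kind is usually called an inverted index. The sketch below shows one possible shape for it, assuming the parsed tweets are available as a dictionary from tweet ID to term list; `build_inverted_index` and the sample data are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(parsed_tweets):
    """Map each term to the sorted list of tweet IDs that contain it."""
    postings = defaultdict(set)
    for tweet_id, terms in parsed_tweets.items():
        for term in terms:
            postings[term].add(tweet_id)  # a set avoids duplicate IDs for repeated terms
    return {term: sorted(ids) for term, ids in postings.items()}


# Illustrative input only: tweet ID -> terms produced by the parser.
sample = {
    101: ["stock", "market", "stock"],
    102: ["market", "crash"],
    103: ["cat", "video"],
}
index = build_inverted_index(sample)
```

With 10 million tweets, an index like this would normally be built in batches and written to disk; the in-memory version above only illustrates the data structure.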

After all the tweets have been indexed, an LDA model is used to separate the tweets into topics.
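
To make the topic-separation step concrete, here is a small pure-Python LDA fitted with collapsed Gibbs sampling. It is a stand-in written for illustration only: a real project would more likely rely on an existing LDA library, and the function name `fit_lda`, the hyperparameters `alpha` and `beta`, and the toy documents are all assumptions.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns one topic distribution per document."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    # z[d][i] is the topic currently assigned to the i-th token of document d.
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [Counter(zd) for zd in z]                # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # token counts per topic
    for doc, zd in zip(docs, z):
        for word, topic in zip(doc, zd):
            topic_word[topic][word] += 1
            topic_total[topic] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                old = z[d][i]
                # Remove the token's current assignment from all counts.
                doc_topic[d][old] -= 1
                topic_word[old][word] -= 1
                topic_total[old] -= 1
                # Resample proportionally to p(topic | doc) * p(word | topic).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][word] += 1
                topic_total[new] += 1

    # Smoothed, normalised topic distribution for each document.
    return [
        [(doc_topic[d][k] + alpha) / (len(doc) + alpha * num_topics)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]


toy_docs = [["cat", "dog"], ["dog", "cat", "pet"], ["stock", "market"], ["market", "trade", "stock"]]
theta = fit_lda(toy_docs, num_topics=2, iterations=50)
# The dominant topic of each tweet is the index of its largest probability.
dominant = [max(range(2), key=lambda k: dist[k]) for dist in theta]
```

Each entry of `theta` is a probability distribution over topics for one tweet; taking the largest entry gives the single topic a tweet is "connected to" in the sense used by the searcher.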

When a query arrives, the searcher checks the LDA model to find the most relevant topic and retrieves all the tweets connected to that topic.
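
One way to picture this lookup, assuming the trained model exposes per-topic word probabilities and each tweet has already been assigned a dominant topic. The names `most_relevant_topic` and `candidates_for_query`, the `floor` probability for unseen words, and the numbers below are hypothetical.

```python
import math


def most_relevant_topic(query_terms, topic_word_probs, floor=1e-6):
    """Return the index of the topic under which the query terms are most likely."""
    def log_score(word_probs):
        # Unseen words get a small floor probability so the log is always defined.
        return sum(math.log(word_probs.get(term, floor)) for term in query_terms)

    return max(range(len(topic_word_probs)), key=lambda k: log_score(topic_word_probs[k]))


def candidates_for_query(query_terms, topic_word_probs, tweet_topics):
    """Collect the IDs of all tweets whose dominant topic matches the query's topic."""
    topic = most_relevant_topic(query_terms, topic_word_probs)
    return [tweet_id for tweet_id, t in tweet_topics.items() if t == topic]


# Illustrative values only, not taken from the project's model.
topic_word_probs = [
    {"stock": 0.4, "market": 0.4},  # topic 0
    {"cat": 0.5, "dog": 0.3},       # topic 1
]
tweet_topics = {1: 0, 2: 1, 3: 0}   # tweet ID -> dominant topic
candidates = candidates_for_query(["stock", "price"], topic_word_probs, tweet_topics)
```

Restricting the candidate set to one topic keeps the ranking step small, at the cost of missing relevant tweets that the model placed in a different topic.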

Meanwhile, the searcher sends each of these tweets to the ranker, which assigns it a rank by cosine similarity.
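
Cosine similarity between a query and a tweet can be computed as in the sketch below. It uses raw term counts as vector weights, which is an assumption: the weighting scheme actually used by the ranker (for example TF-IDF) is not stated. The `cosine_similarity` function name is also hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(query_terms, tweet_terms):
    """Cosine of the angle between two term-count vectors; 0.0 if either is empty."""
    q, t = Counter(query_terms), Counter(tweet_terms)
    dot = sum(count * t[term] for term, count in q.items())
    norm = math.sqrt(sum(c * c for c in q.values())) * math.sqrt(sum(c * c for c in t.values()))
    return dot / norm if norm else 0.0


query = ["stock", "market"]
scores = {
    1: cosine_similarity(query, ["stock", "market"]),  # identical terms
    2: cosine_similarity(query, ["stock"]),            # partial overlap
    3: cosine_similarity(query, ["cat", "video"]),     # no overlap
}
```

Tweets that share more terms with the query, relative to their length, receive scores closer to 1; tweets with no shared terms score 0.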

The 100 tweets with the best scores are saved in a CSV file.
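
Selecting and saving the top results might look like the following sketch. The `save_top_results` name, the `tweet_id` and `score` column headers, and the output path are assumptions; the project's actual CSV layout is not described.

```python
import csv
import heapq


def save_top_results(scores, path, limit=100):
    """Write the `limit` highest-scoring (tweet_id, score) pairs to a CSV file."""
    # nlargest avoids sorting the full candidate set when only the top few are needed.
    top = heapq.nlargest(limit, scores.items(), key=lambda item: item[1])
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["tweet_id", "score"])  # hypothetical header
        writer.writerows(top)
    return top
```

Calling `save_top_results(scores, "results.csv")` with a dictionary of tweet ID to score would write a header row followed by at most 100 rows, best score first.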

# Technology

Based on Python 3.7.
Install all the packages listed in the requirements.txt file before running.


amityosefi/Search_Engine
