SearchEngine

This project is a search engine written in Python that uses the Vector Space Model for retrieval. The corpus is a list of 92 health articles containing approximately 5000 unique words.

To run the code, open the directory containing this Readme file in a command line and type “python query.py”. The project has a few dependencies, such as NLTK, which can be installed with pip. After a short while you will be prompted to enter your query. Type the query into the terminal; the documents are then ranked and the results are stored in the total_scores.csv file.

Documentation

Py files

BeautifulSoup.py: Scrapes all the documents from https://www.medicalnewstoday.com/popular using Beautiful Soup. The text is extracted from elements with the class “article_body”.
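The extraction step can be illustrated without any third-party package. The sketch below is not the project's code: it swaps Beautiful Soup for the standard library's html.parser, and the class name ArticleBodyParser, the VOID_TAGS set, and the sample HTML are all made up for illustration. It only shows the idea of keeping the text nested inside the element whose class is “article_body”.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class ArticleBodyParser(HTMLParser):
    """Collects text nested inside an element with class "article_body"."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while we are inside the article body
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or "article_body" in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

# Made-up sample page; a real run would fetch the article HTML first.
SAMPLE_HTML = (
    '<html><body><div class="nav">Menu</div>'
    '<div class="article_body"><p>Sleep paralysis is common.</p>'
    '<p>It is usually <b>harmless</b><br></p></div>'
    '<footer>About</footer></body></html>'
)

parser = ArticleBodyParser()
parser.feed(SAMPLE_HTML)
parser.close()
article_text = " ".join(parser.chunks)
print(article_text)
```

The navigation and footer text are skipped because they sit outside the “article_body” element.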

T_N.py: Takes files_list (the list of documents) from lod.py as its input argument, performs tokenization and normalization, and stores the tokenized words in T_N.txt. Tokenization is done with NLTK.

stop_words.py: Opens the file T_N.txt, removes the stop words, and stores the remaining words in another file, stop_word.txt.

stemmed_words.py: Opens the file stop_words.txt, stems the words using the Porter stemmer, and stores them in the stemmed.txt file.

lod.py: Contains two file lists and the definitions for tokenization, normalization, stop word removal, stemming and unique_list. Functions implemented in lod:
    T_N(file_content): Tokenizes and normalizes the given input “file_content”.
    remove_stopwords(file): Removes stop words from the input “file”.
    stem_words(files): Stems the input “files”.
    unique_words(text): Removes duplicate words from the input “text”.
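As a rough picture of what these helpers do, here is a standard-library sketch. It is an assumption-laden stand-in, not the contents of lod.py: the project tokenizes with NLTK, whereas this sketch uses a regular expression; STOP_WORDS is a tiny hypothetical list; the function name tokenize_and_normalize and the sample sentence are invented; and stemming is left out.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "may"}

def tokenize_and_normalize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    """Keep only the tokens that are not in STOP_WORDS."""
    return [token for token in tokens if token not in STOP_WORDS]

def unique_words(tokens):
    """Drop duplicates while keeping the first-seen order."""
    return list(dict.fromkeys(tokens))

sample = "The brain and the heart: brain calcium may cause harm."
tokens = tokenize_and_normalize(sample)
content_words = remove_stopwords(tokens)
vocabulary = unique_words(content_words)
print(vocabulary)
```

Keeping the first-seen order in unique_words is a design choice of the sketch; a plain set would also remove duplicates but would not give a stable word order.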

inverted_index.py: Assigns each document an ID, then builds the inverted index, the term frequency table for each word, and the idf table. The tables are stored in tf_document_data.csv and idf_table.csv.
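The two core structures can be sketched in a few lines. This is a minimal illustration under assumptions, not the module itself: the function name build_index, the dictionary layout, and the three toy documents are hypothetical, and the CSV output is omitted.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Build a term frequency table and an inverted index.

    docs maps a document id to its list of already processed tokens.
    """
    # Term frequency table: document id -> {term: count in that document}.
    tf_table = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}

    # Inverted index: term -> set of ids of the documents containing it.
    inverted_index = defaultdict(set)
    for doc_id, counts in tf_table.items():
        for term in counts:
            inverted_index[term].add(doc_id)
    return tf_table, inverted_index

# Toy corpus; the ids play the role of the document ids assigned by the module.
docs = {
    1: ["brain", "calcium", "brain"],
    2: ["blood", "sugar"],
    3: ["brain", "sugar"],
}
tf_table, inverted_index = build_index(docs)
print(tf_table[1]["brain"], sorted(inverted_index["brain"]))
```

The number of ids stored for a term in the inverted index is its document frequency, which is the input an idf calculation needs.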

document_frequency.py: The function “idf” in this module calculates the idf (inverse document frequency) of each word.
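This Readme does not say which idf variant the module uses. The sketch below assumes the common form idf(t) = log10(N / df(t)), where N is the number of documents and df(t) is the number of documents containing the term t. The signature, the toy index, and the choice to return 0.0 for unseen terms are assumptions made for illustration.

```python
import math

def idf(term, inverted_index, n_docs):
    """Return log10(N / df), or 0.0 for a term that appears in no document."""
    df = len(inverted_index.get(term, ()))
    return math.log10(n_docs / df) if df else 0.0

# Toy inverted index over four documents: term -> ids of documents containing it.
inverted_index = {
    "brain": {1, 3},
    "sugar": {2, 3},
    "calcium": {1},
    "health": {1, 2, 3, 4},
}
n_docs = 4
idf_table = {term: idf(term, inverted_index, n_docs) for term in inverted_index}
print(idf_table)
```

Under this formula a word that occurs in every document gets an idf of 0, so it contributes nothing to the ranking, while rarer words get larger weights.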

stem_documents.py: Stems each document in files_list (imported from lod.py), creates the files listed in stem_document_list (also from lod.py), and stores the stemmed documents in them.

query.py: Takes the query from the user and writes the document names and cosine scores, in descending order of score, into the total_scores.csv file.
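The ranking step can be sketched as follows. This is an illustration under assumptions rather than the code in query.py: documents and the query are represented as sparse term-to-weight dictionaries (for example tf-idf weights), the function names and CSV column names are hypothetical, and an in-memory buffer stands in for total_scores.csv so the sketch does not touch the file system.

```python
import csv
import io
import math

def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse term -> weight dictionaries."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (document name, score) pairs sorted by descending score."""
    scores = [(name, cosine(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

def write_scores(ranked, handle):
    """Write the ranked pairs as CSV rows to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["document", "cosine_score"])  # hypothetical column names
    writer.writerows(ranked)

# Toy vectors with made-up weights.
doc_vecs = {
    "1.txt": {"brain": 1.0, "calcium": 1.0},
    "2.txt": {"blood": 1.0, "sugar": 1.0},
    "3.txt": {"brain": 1.0, "sugar": 1.0, "diet": 1.0},
}
query_vec = {"brain": 1.0}

ranked = rank_documents(query_vec, doc_vecs)
buffer = io.StringIO()  # stands in for open("total_scores.csv", "w", newline="")
write_scores(ranked, buffer)
print(buffer.getvalue())
```

Dividing by both vector lengths is what makes the score independent of document length: 1.txt outranks 3.txt here because the query term makes up a larger share of its vector.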
