WebSearchEngine-API

Implementations of different techniques used in a web search engine.

Classification techniques:

1. Naive Bayes

The key is how to compute the posterior probability P(c_i | d_j) that document d_j belongs to category c_i. According to Bayes' formula, the posterior probability is computed from the prior probability P(c_i) and the likelihood P(d_j | c_i):

$$ P(c_i \mid d_j) = \frac{P(d_j \mid c_i)\, P(c_i)}{P(d_j)} $$

The category with the highest posterior probability is then chosen as the final category of document d_j.

Here P(c_i) denotes the probability of category c_i in the training set and P(d_j) denotes the probability of document d_j in the training set. Because P(d_j) is invariant for a given document d_j across all categories, it can be dropped, and the final category is decided by the following formula:

$$ c^{*} = \arg\max_{c_i} P(d_j \mid c_i)\, P(c_i) $$

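We have used Multinomial Naive Bayes, which takes into account the term frequency in the class, the total term count of the class, and the vocabulary of the dataset. With add-one (Laplace) smoothing, the standard estimate built from these three quantities, the probability of a token w_k given class c_i is calculated by

$$ P(w_k \mid c_i) = \frac{\mathrm{tf}(w_k, c_i) + 1}{N_{c_i} + |V|} $$

where tf(w_k, c_i) is the frequency of w_k in class c_i, N_{c_i} is the total number of terms in class c_i, and |V| is the size of the vocabulary.

Using this, the probability of a document given the class is given by

$$ P(d_j \mid c_i) = \prod_{k=1}^{n_j} P(w_k \mid c_i) $$

where w_1, ..., w_{n_j} are the tokens of document d_j.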
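The computation above can be sketched in a few lines of Python. This is a minimal illustration, not the code from this repository: it assumes add-one (Laplace) smoothing, sums log-probabilities to avoid underflow, and skips tokens never seen in training. The names `train_multinomial_nb` and `classify` and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels):
    """Collect the counts needed for a smoothed multinomial Naive Bayes model."""
    class_doc_counts = Counter(labels)   # documents per class, used for P(c_i)
    term_counts = defaultdict(Counter)   # term frequency per class
    vocabulary = set()                   # vocabulary of the whole dataset
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_doc_counts, term_counts, vocabulary


def classify(tokens, model):
    """Return the class maximizing log P(c_i) + sum of log P(w_k | c_i)."""
    class_doc_counts, term_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_doc_counts.items():
        class_total = sum(term_counts[label].values())  # term count of the class
        score = math.log(n_docs / total_docs)           # log prior
        for token in tokens:
            if token not in vocabulary:
                continue  # assumption: ignore tokens never seen in training
            smoothed = (term_counts[label][token] + 1) / (class_total + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set (illustrative only).
train_docs = [
    ["cheap", "pills", "cheap"],
    ["meeting", "agenda"],
    ["cheap", "offer"],
    ["project", "meeting"],
]
train_labels = ["spam", "ham", "spam", "ham"]
model = train_multinomial_nb(train_docs, train_labels)
print(classify(["cheap", "offer", "pills"], model))
```

Working in log space leaves the arg max unchanged, because the logarithm is monotonic.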

2. K-means Clustering

Algorithm Pseudocode:

Pick K mean vectors using the labeled data.
Calculate the initial means, then allow documents to be assigned to a different cluster even when this contradicts their label tags. We do this step to avoid overfitting the data.
Iterate until convergence:
- Assign each document x_i to its closest mean vector μ_j.
- Update each mean vector μ_j to be the mean of the x_i’s assigned to it.

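Distances between documents and means are calculated using cosine similarity (the documents are normalized by their length). The objective is to minimize an error function, which plays a role similar to the loss in gradient descent: it is the sum of the distances between the documents and the means of their assigned clusters.

Error Function:

$$ E = \sum_{j=1}^{K} \sum_{x_i \in C_j} \mathrm{dist}(x_i, \mu_j) $$

where C_j is the set of documents assigned to mean vector μ_j and dist is the cosine-based distance.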
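The procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: documents are dense term-weight lists, distance is taken as 1 minus cosine similarity, each mean is re-normalized after the update, and the loop stops when assignments no longer change. The helper names (`means_from_labels`, `kmeans`) and the toy vectors are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


def cosine_distance(a, b):
    """1 - cosine similarity; for unit vectors the similarity is the dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def means_from_labels(docs, labels):
    """Seed one mean vector per label from the labeled documents."""
    seeds = []
    for label in sorted(set(labels)):
        members = [d for d, l in zip(docs, labels) if l == label]
        seeds.append([sum(col) / len(members) for col in zip(*members)])
    return seeds


def kmeans(docs, initial_means, max_iters=100):
    docs = [normalize(d) for d in docs]
    means = [normalize(m) for m in initial_means]
    assignment = None
    for _ in range(max_iters):
        # Assignment step: each document goes to its closest mean vector.
        new_assignment = [
            min(range(len(means)), key=lambda j: cosine_distance(d, means[j]))
            for d in docs
        ]
        if new_assignment == assignment:
            break  # converged: no document changed cluster
        assignment = new_assignment
        # Update step: each mean becomes the mean of its assigned documents.
        for j in range(len(means)):
            members = [d for d, a in zip(docs, assignment) if a == j]
            if members:  # keep the old mean if a cluster ends up empty
                means[j] = normalize([sum(col) / len(members) for col in zip(*members)])
    # Error: sum of distances between documents and their assigned means.
    error = sum(cosine_distance(d, means[a]) for d, a in zip(docs, assignment))
    return assignment, means, error


# The third document is labeled 0 but lies closer to the label-1 documents.
docs = [[1.0, 0.0], [0.9, 0.1], [0.2, 0.8], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 0, 1, 1]
assignment, means, error = kmeans(docs, means_from_labels(docs, labels))
print(assignment, error)
```

In this toy run the third document moves to the other cluster despite its label, which is the behavior the second pseudocode step describes.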

3. K-Nearest Neighbor

The model for kNN is the entire training dataset. When a prediction is required for an unseen data instance, the kNN algorithm searches the training dataset for the k most similar instances. The prediction attribute of these most similar instances is summarized and returned as the prediction for the unseen instance. The decision rule in kNN can be written as:
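One common way to write this rule, assumed here for illustration, is a majority vote over the set N_k(x) of the k nearest training instances:

$$ \hat{y}(x) = \arg\max_{c} \sum_{(x_i, y_i) \in N_k(x)} \mathbb{1}[y_i = c] $$

The sketch below implements that vote in Python. Cosine similarity is an assumed similarity measure, and `knn_predict` and the toy data are hypothetical names that do not come from this repository.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(query, train_vectors, train_labels, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # The "model" is simply the training set: rank all of it by similarity.
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine_similarity(query, pair[0]),
        reverse=True,
    )
    # Summarize the labels of the k most similar instances.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training data (illustrative only).
train_vectors = [[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.1, 1.0], [0.2, 0.9]]
train_labels = ["sports", "sports", "politics", "politics", "politics"]
print(knn_predict([0.8, 0.3], train_vectors, train_labels, k=3))
```

A similarity-weighted vote is a frequent alternative; it replaces the count of each label with the sum of the neighbors' similarities.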

These methods are used to find the relevance of a given document to a query, or to retrieve a set of pages.

We also implement PageRank to rank pages according to their popularity once we have found a set of pages relevant to the user.

We use Google's PageRank formulation, which adds random teleportation so that the ranking copes with spider traps and dead ends.

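Equation:

$$ r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + \frac{1 - \beta}{N} $$

where r_j is the rank of page j, the sum runs over the pages i that link to j, d_i is the number of out-links of page i, N is the total number of pages, and β is the probability of following a link rather than teleporting. The rank of a dead end (d_i = 0) is spread evenly over all N pages.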
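A minimal power-iteration sketch of this kind of PageRank follows. It is an illustration under stated assumptions rather than the repository's code: the graph is a dictionary of out-links, β defaults to 0.85, and any rank that "leaks" (the 1 - β teleport share plus everything held by dead ends) is re-inserted uniformly. The function name `pagerank` and the toy graph are hypothetical.

```python
def pagerank(links, beta=0.85, tol=1e-10, max_iters=200):
    """Power iteration with teleportation; `links` maps each page to its out-links."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start from the uniform distribution
    for _ in range(max_iters):
        new_rank = {node: 0.0 for node in nodes}
        # Follow links with probability beta, splitting rank across out-links.
        for node in nodes:
            targets = links.get(node, [])
            for target in targets:
                new_rank[target] += beta * rank[node] / len(targets)
        # Re-insert the leaked rank uniformly. This covers the (1 - beta)
        # teleport share and the full rank of dead ends, and it keeps a
        # spider trap from absorbing all of the rank.
        leaked = 1.0 - sum(new_rank.values())
        for node in nodes:
            new_rank[node] += leaked / n
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Toy graph (illustrative only): D is a dead end.
links = {"A": ["B", "C", "D"], "B": ["C"], "C": ["A"], "D": []}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

Because the leaked rank is added back on every iteration, the scores keep summing to 1 and can be read as a probability distribution over pages.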
