Skip to content

saraivaufc/search-engine

Repository files navigation

search-engine

Codacy Badge

Simplistic Term Vector Model

We designed and implemented a web search engine based on the vector space model. This search engine uses the non-relational database mongodb and the tookit of natural language processing nltk.

Dataset

File: data.csv

Source: https://dataverse.harvard.edu/dataset.xhtml?id=3010077

Description: This is a reusable publicly-available dataset for “media bias” studies. The content of this dataset is publish date, title, subtitle and text for 3824 news articles. These articles are collected by a project within 3 months from December of 2016 to march 2017. The source of these news articles are from ABC News, CNN news, The Huffington Post, BBC News, DW News, TASS News, Al Jazeera News, China Daily and RTE News. All of them are collected by using RSS feeds of each news sites. (2017-3-31)

Install MongoDB

$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 68818C72E52529D4
$ sudo echo "deb http://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
$ sudo apt-get update
$ sudo apt-get install -y mongodb-org
$ sudo systemctl start mongod
$ sudo systemctl enable mongod

Create a database

$ mongo
$ use search-engine

Performing the ingestion of the articles in the database

$ python3 ingestion.py

Run the search algorithm

$ python3 search.py
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

About

Search engine based on the vector space model

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages