The dataset and other files related to model training can be found in `auto_news_backend`. Python notebooks covering each step of the process, from EDA and text preprocessing to model creation, can be found in the same folder.
The dataset is created from the Intel India Market Intelligence Newsletters archives and has around 400 articles across seven categories: iot, telecom, dc (data center), cloud, cc (client computing), ai (artificial intelligence), and industry. Around 15% of the data was used for testing.
The model uses an SVM with grid-search cross-validation to categorize the articles. The model achieved an accuracy of 77.89% on the test data, which can be improved once more data is added to the corpus.
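For orientation, a minimal sketch of such a pipeline (assuming scikit-learn with TF-IDF features; the actual notebooks in `auto_news_backend` may differ in details, and `load_articles` is a hypothetical loader):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical loader: texts are article bodies, labels are their categories.
texts, labels = load_articles()

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.15, stratify=labels, random_state=42
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svm", SVC(probability=True)),  # probabilities back the confidence threshold used later
])

# Grid-search cross-validation over an illustrative parameter grid.
params = {"svm__C": [0.1, 1, 10], "svm__kernel": ["linear", "rbf"]}
search = GridSearchCV(pipeline, params, cv=5).fit(X_train, y_train)
print(search.score(X_test, y_test))  # accuracy on the held-out ~15%
```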
- Clone this repository.
- Go to `auto_news_backend` and install the requirements using the following command:
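  A likely invocation, assuming a standard `requirements.txt` in that folder:

  ```sh
  pip install -r requirements.txt
  ```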
- To run the API, go to `auto_news_backend/api` in the project directory and then run:
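  Presumably (assuming the Flask entry point is named `app.py`):

  ```sh
  python app.py
  ```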
  This will run the API on `localhost:5000`, which can be changed in the `app.run()` method.
This API uses POST requests to communicate. All requests must include a content type of `application/json` (except requests to `localhost:5000/download`), and the body must be valid JSON.
You send: Any additional sources (each with a name, its website's URL, and its RSS feed URL), along with a flag ("0", "1", or "2") that determines whether to search the default sources or only the new ones.
You get: A JSON response.
Example Request: `POST` to `/crawlweb`

```
Accept: application/json
Content-Type: application/json

{
  "sources": [{"source": "cio_etc_dc", "url": "https://cio.economictimes.indiatimes.com/", "rss": "https://cio.economictimes.indiatimes.com/rss/data-center"}],
  "add": "0"
}
```
Successful Response:

```json
{
  "StatusMessage": "Crawling Done"
}
```

Failed Response:

```json
{
  "StatusMessage": "Error Occured"
}
```
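As a quick illustration, the same request via Python's `requests` library (a sketch, assuming the API is running on `localhost:5000`):

```python
import requests

payload = {
    "sources": [{
        "source": "cio_etc_dc",
        "url": "https://cio.economictimes.indiatimes.com/",
        "rss": "https://cio.economictimes.indiatimes.com/rss/data-center",
    }],
    "add": "0",  # flag selecting default vs. newly added sources, as described above
}
resp = requests.post("http://localhost:5000/crawlweb", json=payload)
print(resp.json()["StatusMessage"])  # "Crawling Done" on success
```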
You send: The date range within which news is needed.
You get: A JSON response.
Example Request: `POST` to `/crawlgoogle`

```
Accept: application/json
Content-Type: application/json

{
  "startDate": "2020-07-28",
  "endDate": "2020-07-31"
}
```
Successful Response:

```json
{
  "StatusMessage": "Crawling Done"
}
```

Failed Response:

```json
{
  "StatusMessage": "Error Occured"
}
```
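The same request pattern from the earlier sketch applies here, only with a date-range body:

```python
import requests

resp = requests.post(
    "http://localhost:5000/crawlgoogle",
    json={"startDate": "2020-07-28", "endDate": "2020-07-31"},
)
print(resp.json()["StatusMessage"])
```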
You send: A flag ("google" or "other") that selects which crawled articles to categorize (those from Google News or those from the websites' RSS feeds), and a confidence score that sets the prediction-probability threshold below which articles are eliminated.
You get: A JSON response.
Example Request: `POST` to `/predictCategory`

```
Accept: application/json
Content-Type: application/json

{
  "value": "google",
  "confidence": "40.0"
}
```
Successful Response:

```json
{
  "StatusMessage": "Predicting Done"
}
```

Failed Response:

```json
{
  "StatusMessage": "Error Occured"
}
```
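A corresponding sketch (same assumptions as above):

```python
import requests

# Categorize the crawled Google News articles; drop predictions below 40% confidence.
resp = requests.post(
    "http://localhost:5000/predictCategory",
    json={"value": "google", "confidence": "40.0"},
)
print(resp.json()["StatusMessage"])  # "Predicting Done" on success
```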
You send: A flag ("other" or "google") indicating whether to download the Google-predicted articles or the websites' articles.
You get: The file on success, or a JSON response on failure.
Example Request: `POST` to `/download`

```
response-type: "blob"

{
  "value": "google"
}
```
Successful Response:
The file will be downloaded.

Failed Response:

```json
{
  "StatusMessage": "Error Occured"
}
```
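Since `/download` returns a file rather than JSON on success, a sketch would save the raw bytes (the output filename here is an assumption):

```python
import requests

resp = requests.post("http://localhost:5000/download", json={"value": "google"})
resp.raise_for_status()
with open("predicted_articles.csv", "wb") as f:  # hypothetical output filename
    f.write(resp.content)
```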
The default sources for Websites Crawling are stored in the `auto_news_backend/sources.json` file.
The query terms for Google Crawling are stored in the `auto_news_backend/query_terms.txt` file.
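Presumably, each entry in `sources.json` follows the same shape as the `sources` array in the `/crawlweb` request body, e.g.:

```json
[
  {
    "source": "cio_etc_dc",
    "url": "https://cio.economictimes.indiatimes.com/",
    "rss": "https://cio.economictimes.indiatimes.com/rss/data-center"
  }
]
```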
To run the React Application, in the project directory, you can run:
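```sh
npm start
```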
Runs the app in the development mode.
Open http://localhost:3000 to view it in the browser.
The page will reload if you make edits.
You will also see any lint errors in the console.
The first screen of the application, `src/Crawl.js`, does the crawling; the sources can be chosen between Websites' RSS and Google News. Depending on the choice, the respective API call is made: a POST request to `localhost:5000/crawlweb` for websites or `localhost:5000/crawlgoogle` for Google News.
The second screen, `src/Predict.js`, does the categorisation and makes a POST request to `localhost:5000/predictCategory`.
The third screen, `src/Download.js`, downloads the required file by making a POST request to `localhost:5000/download`.