NewsScrapper

This project is a scrapper manager that scrapes news from different websites. You create your own scrappers by subclassing the parent class, and the scrapper manager calls all of them.

Each scrapper stores its processed content in a database defined in a config file. The scrapped content is also published on a WordPress site.

The database class provides the following functions:

createTableIfNotExist(): creates a table named after the section being scrapped if no such table exists yet.
getURLs(): returns the list of URLs already scrapped for the current section. This is useful for avoiding repeated content in the database.
addData(): adds items to the database.
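
The three functions above could be sketched roughly as follows. This is a minimal illustration rather than the project's database class: it uses Python's built-in sqlite3 module as a stand-in for the real database connection, and the class name `SectionDB`, the two-column table layout, and the constructor arguments are all assumptions made for the example.

```python
import sqlite3


class SectionDB:
    """Hypothetical stand-in for the project's database class."""

    def __init__(self, connection, table):
        self.conn = connection
        # Table names cannot be bound as SQL parameters, so accept only
        # plain identifiers before interpolating the name into a query.
        if not table.isidentifier():
            raise ValueError("invalid table name: %r" % table)
        self.table = table

    def createTableIfNotExist(self):
        # Create the section table only when it is missing.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS %s ("
            "url TEXT PRIMARY KEY, title TEXT NOT NULL)" % self.table
        )

    def getURLs(self):
        # Return the URLs already stored for this section as a set.
        rows = self.conn.execute("SELECT url FROM %s" % self.table)
        return {row[0] for row in rows}

    def addData(self, items):
        # Insert items; rows whose URL is already stored are skipped.
        self.conn.executemany(
            "INSERT OR IGNORE INTO %s (url, title) VALUES (?, ?)" % self.table,
            [(item["url"], item["title"]) for item in items],
        )
        self.conn.commit()


db = SectionDB(sqlite3.connect(":memory:"), "economy")
db.createTableIfNotExist()
db.addData([{"url": "http://example.com/a", "title": "A"}])
db.addData([{"url": "http://example.com/a", "title": "A again"}])
```

Using the URL as the primary key is one way to keep repeated articles out of the table even if a scrapper forgets to check getURLs() first.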

Scrapper definition

In the lib folder we find the Scrapper parent class, which defines the functions common to all scrappers. These functions are:

addItemsToMysql(): calls the db function addData(), which stores the scrapped items in the database.
addItemsToWordpress(): stores the scrapped content in the WordPress site.

For each new scrapper we must write a new class in the lib/scrappers folder and add its definition to lib/scrapperFactory.py.
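
To make the relationship between the parent class and its subclasses concrete, here is a rough sketch of how such a parent class could be laid out. It is not the code in lib/scrapper.py: the attribute names, the argument passed to addData(), and the `FakeDB` and `ScrapperDemo` classes are assumptions for illustration, and the WordPress function is left out because its client API is not described here.

```python
class Scrapper:
    """Illustrative parent class; attribute names are assumptions."""

    def __init__(self, db, wpinfo, table, url, slug, log):
        self.db = db
        self.wpinfo = wpinfo
        self.table = table
        self.url = url
        self.slug = slug
        self.log = log
        self.items = []

    def scrape(self):
        # Each scrapper subclass provides its own implementation.
        raise NotImplementedError("each scrapper defines its own scrape()")

    def addItemsToMysql(self):
        # Delegate storage to the db object (assumed to take the item list).
        self.db.addData(self.items)


class FakeDB:
    """Hypothetical in-memory db used only to run this example."""

    def __init__(self):
        self.stored = []

    def addData(self, items):
        self.stored.extend(items)


class ScrapperDemo(Scrapper):
    def scrape(self):
        self.items = [{"title": "Demo", "url": "http://example.com/demo"}]


fakeDB = FakeDB()
demo = ScrapperDemo(fakeDB, None, "demo", "http://example.com", "demo", None)
demo.scrape()
demo.addItemsToMysql()
```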

Defining new scrappers

To create a new scrapper, follow these steps:

Create a new scrapper class in lib/scrappers folder

The class skeleton looks like this:

#!/usr/bin/python3

import sys
sys.path.append('../../')

from lib.scrapper import Scrapper

class ScrapperWebsite1Economy( Scrapper ):

    def __init__( self, db, wpinfo, table, url, slug, log ):
        Scrapper.__init__( self, db, wpinfo, table, url, slug, log )

    def scrape( self ):
        # Write your scrape function here.

        self.log.info( "We can write logs from this function too" )

        # Get the URLs that have already been scrapped to avoid storing
        # repeated content in the database. This assumes the parent class
        # keeps the db object as self.db, as it does with self.log.

        storedItems = self.db.getURLs()

        # Build the list of scrapped items here, skipping any URL that is
        # already in storedItems.

        ourScrappedContent = []

        # At the end of this function the items must be stored in self.items.

        self.items = ourScrappedContent

Each item you scrape can have the following data:

item['title'] -> The title of the article.
item['description'] -> The description of the article.
item['url'] -> The original URL of the article.
item['image_url'] -> An image for the article.
item['video_url'] -> A YouTube video about the article.
item['content'] -> The content of the article.
item['slug'] -> The slug (permalink) of the article.
item['keywords'] -> Keywords for the article.
item['referer'] -> Name of the referer.
item['referer_url'] -> URL of the referer.

The manager only needs the title and the URL of each article; the other fields are optional.
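
Because only the title and the URL are required, a scrapper could filter its items before assigning them to self.items. The helpers below are hypothetical (neither `isValidItem` nor `newItems` exists in the project); they only illustrate checking the required fields and dropping URLs that were already stored.

```python
REQUIRED_FIELDS = ("title", "url")


def isValidItem(item):
    # An item is usable when every required field is present and non-empty.
    return all(item.get(field) for field in REQUIRED_FIELDS)


def newItems(items, storedURLs):
    """Keep valid items whose URL has not been seen yet."""
    seen = set(storedURLs)
    result = []
    for item in items:
        if isValidItem(item) and item["url"] not in seen:
            seen.add(item["url"])
            result.append(item)
    return result
```

Inside scrape(), the second argument would be the result of getURLs(), so the database only receives articles it does not already contain.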

Finally, we need to add our new class to the ScrapperFactory class in lib/scrapperFactory.py.

import sys
sys.path.append('../')

from lib.scrappers import website1_economy
from lib.scrappers import website1_politics
from lib.scrappers import website2_science
from lib.scrappers import website2_sports
from lib.scrappers import YOUR_NEW_SCRAPPER

class ScrapperFactory( object ):

    @staticmethod
    def factory( type, db, wpinfo, table, url, slug, log ):
        if type == "website1_economy": return website1_economy.ScrapperWebsite1Economy( db, wpinfo, table, url, slug, log )
        if type == "website1_politics": return website1_politics.ScrapperWebsite1Politics( db, wpinfo, table, url, slug, log )
        if type == "website2_science": return website2_science.ScrapperWebsite2Science( db, wpinfo, table, url, slug, log )
        if type == "website2_sports": return website2_sports.ScrapperWebsite2Sports( db, wpinfo, table, url, slug, log )
        if type == "YOUR_NEW_SCRAPPER": return YOUR_NEW_SCRAPPER.ScrapperYOUR_NEW_SCRAPPER( db, wpinfo, table, url, slug, log )
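
As the number of scrappers grows, the chain of if statements could be replaced by a dictionary lookup. The sketch below is an alternative layout, not the project's factory: the `registry` attribute is an assumption, and stub classes stand in for the real scrapper modules so the example runs on its own. Unlike the version above, it raises an error for an unknown type instead of silently returning None.

```python
class ScrapperStub:
    """Stand-in for the real Scrapper parent class."""

    def __init__(self, db, wpinfo, table, url, slug, log):
        self.table = table
        self.url = url


class ScrapperWebsite1Economy(ScrapperStub):
    pass


class ScrapperWebsite2Sports(ScrapperStub):
    pass


class ScrapperFactory:
    # Maps the type name used in the configuration to a scrapper class.
    registry = {
        "website1_economy": ScrapperWebsite1Economy,
        "website2_sports": ScrapperWebsite2Sports,
    }

    @staticmethod
    def factory(type, db, wpinfo, table, url, slug, log):
        try:
            scrapperClass = ScrapperFactory.registry[type]
        except KeyError:
            raise ValueError("unknown scrapper type: %s" % type)
        return scrapperClass(db, wpinfo, table, url, slug, log)


scrapper = ScrapperFactory.factory(
    "website2_sports", None, None, "sports", "http://example.com", "sports", None
)
```

With this layout, adding a scrapper means adding one import and one entry to the dictionary.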

Example

Here is a custom-made scrapper that gets news from the boxing section of clarin.com.

## TODO

Allow setting the WordPress post category names.
The WordPress client uses http; it could use https too.
The "Source" name should be configurable.
The origin "source" of the content should be optional.
Write an argument parsing module to disable post submission to WordPress.
Dockerize this app.
