
Advance Web Scraping

Using Airflow to schedule, monitor, and log; Postgres as the backend and data store; and Redis as the broker.


Getting Started

Checklist to start the services:

  • Make sure Docker is running and volume mounting is enabled.
  • Clone advance_scraping.
  • Set the environment variables.
  • Run the services with a single docker-compose command.

Git Clone

git clone https://github.com/Proteusiq/advance_scraping.git
cd advance_scraping

Set Environment Variables

Set the environment variables in .env. Edit the contents of .env_demo and save it as .env.

WARNING: Remember to add your .env to .gitignore. Do not share your secrets.
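
For example, a minimal .env might look like this. The variable names are taken from those referenced later in this README; POSTGRES_USER and POSTGRES_PASSWORD are assumptions and all values are placeholders, so check .env_demo for the authoritative list:

# Airflow admin credentials (used by the Airflow UI below)
ADMIN_USER=danpra
ADMIN_PASSWORD=airflowpwd
# pgAdmin login
PGADMIN_DEFAULT_EMAIL=pgadmin@example.com
PGADMIN_DEFAULT_PASSWORD=admin
# Postgres credentials -- these variable names are an assumption
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgrespwd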

Check the environment variables to be set by docker-compose with:
docker-compose config

Make sure you can see the environment variables docker-compose fetches from .env.

See: docker-compose options.

Start services with a single command:

WARNING: The Postgres container has an issue persisting data after a restart. Until that is fixed, we use a labeled volume. Run docker volume create --name=pgdata to create a named volume (delete it with docker volume rm pgdata).
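
With pgdata created, docker-compose can mount it as an external volume. The declaration in docker-compose.yml typically looks like this (illustrative; the repo's actual file may differ):

volumes:
  pgdata:
    external: true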

docker-compose up

Note: Only the initial build takes a while. Go grab a cup of coffee while Docker downloads and installs the necessary tools. You can run the services in detached mode with the --detach or -d flag; this leaves the services running in the background.
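
For example, start detached and then follow the logs (both are standard docker-compose commands):

docker-compose up -d
docker-compose logs -f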

See: docker-compose up options

UI Services:


  • Airflow: address: localhost:8080 default_user: danpra default_pwd: changeme
  • pgAdmin: address: localhost:5050 default_email: pgadmin@example.com default_pwd: admin
  • Flower: address: localhost:5555
  • RedisInsight: address: localhost:8001
  • Grafana: address: localhost:3000

Airflow UI

Head to localhost:8080 in your browser. Log in with the credentials set in your ADMIN_USER and ADMIN_PASSWORD environment variables, for example danpra and airflowpwd.

Postgres Admin Tool

Head to localhost:5050. Log in with the credentials set in your PGADMIN_DEFAULT_EMAIL and PGADMIN_DEFAULT_PASSWORD environment variables, for example danpra@example.com and postgrespwd.


To add a connection to the Postgres DB in pgAdmin, click Add New Server. Type any name and select the Connection tab: Name: Boliga > Host name/address: postgres > enter the Postgres username and password, then click Save.
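
As a quick sanity check without pgAdmin, you can also open a psql shell inside the container. This assumes the compose service is named postgres (matching the host name above); substitute your own Postgres username:

docker-compose exec postgres psql -U <your_postgres_user>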


Grafana

Head to localhost:3000. Log in with the user admin and the password grafanapwd. Change these credentials in containers/grafana/config.monitoring. Add Postgres as a data source, using the same Postgres username and password as in pgAdmin.


Charts coming soon

Stop services with:

Press Ctrl + C to stop the services without destroying your volume data. Then run:

docker-compose down

Use docker-compose down -v to remove the volumes as well.

WARNING: Remember to back up your data before removing volumes.

docker-compose down -v

See: docker-compose down options

Web Scraping and Design Patterns [Opinionated Rambling]

A lazy programmer, like me, loves to write less yet comprehensive code. (Yes, I said it.) Design patterns in Python are not as useful as in other languages like Java, C#, and C++, and in most cases they are overkill. To design a simple bolig [Danish for estate] scraping tool covering different estate websites in Denmark, I decided to use a bit of the Singleton pattern and the Abstract Factory pattern.

Bolig (pipelines.boliger.bolig.Bolig) ensures that there is a single instance, one object shared by all other bolig-related classes. The form of singleton used is early instantiation: we create the instance at load time.

The Bolig class also defines an interface [abstract class] for creating families of related objects without specifying their concrete subclasses. This isolates client code from the implementation and keeps all objects consistent: we want the same functions with different implementations, so get_page and get_pages are always called the same way while each subclass supplies its own logic, as the inheritance tree and the sketch below show.

# inheritance tree
Bolig                   # singleton and abstract
Boliga(Bolig)           # overrides get_page and get_pages with boliga.dk API logic
Services(Bolig)         # overrides get_page and get_pages with home.dk and estate.dk API logic
BoligaRecent(Boliga)    # initialized with the recent boliga URL
BoligaSold(Boliga)      # initialized with the sold boliga URL
Home(Services)          # initialized with the home.dk recent-homes URL
Estate(Services)        # initialized with the estate.dk recent-homes URL
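
A minimal sketch of the two patterns, assuming nothing about the repo beyond the class names above (the real implementation in pipelines.boliger.bolig differs in detail):

from abc import ABC, abstractmethod

class Bolig(ABC):
    """Abstract interface: every scraper is called the same way."""

    _instances = {}  # singleton bookkeeping: one instance per subclass

    def __new__(cls, *args, **kwargs):
        # Return the already-created instance if one exists
        if cls not in Bolig._instances:
            Bolig._instances[cls] = super().__new__(cls)
        return Bolig._instances[cls]

    @abstractmethod
    def get_page(self, url):
        """Fetch and parse a single page."""

    @abstractmethod
    def get_pages(self, urls):
        """Fetch and parse many pages."""

class Boliga(Bolig):
    def get_page(self, url):
        # boliga.dk API logic would live here
        return {"source": "boliga", "url": url}

    def get_pages(self, urls):
        return [self.get_page(url) for url in urls]

class BoligaRecent(Boliga):
    pass  # in the repo, initialized with the recent-boliga URL

# early instantiation: the shared instance is created at load time
recent = BoligaRecent()
assert recent is BoligaRecent()  # the same object every time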

Todo:

  • Add web-scraper examples
  • Add simple Airflow examples
  • Add an introduction-to-Airflow README
  • Add a custom error-handling class
  • Add tests for Airflow DAGs and scrapers
  • Add Grafana visualizations of tasks and estate prices

Dev

Repos that made this project possible, and lots of GitHub issues:

Docker Apache Airflow

Docker Basics:

Kill all containers:

docker container ps | awk '{ print $1 }' | tail -n+2 > tmp.txt; for line in $(cat tmp.txt); do docker container kill $line; done; rm tmp.txt
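
The same thing without the temp file, using the -q flag to list only container IDs:

docker container kill $(docker container ls -q)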
