Uses Airflow to schedule, monitor, and log tasks, with Postgres as the backend and data store and Redis as the broker.
Checklist to start our services
- make sure docker is running, and volume mounting is enabled.
- git clone advance_scraping
- set environment variables
- run the service with a single docker-compose command
git clone https://github.com/Proteusiq/advance_scraping.git
cd advance_scraping
Set environment variables in .env: edit the contents of .env_demo and save the result as .env
WARNING: Remember to add your .env to .gitignore. Do not share your secrets
docker-compose config
Make sure you can see the environment variables that docker-compose fetches from .env
See: docker-compose options.
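As a complement to docker-compose config, a small script can flag variables that are missing or empty before the services start. The sketch below is only an illustration: it assumes a plain KEY=VALUE file format, and the REQUIRED tuple is an assumption built from the four variable names this README uses for the Airflow and pgAdmin logins. Adjust it to whatever .env_demo actually lists.

```python
from pathlib import Path

# Assumption: these are the variables worth checking; they are the names this
# README mentions for the Airflow and pgAdmin logins, not an exhaustive list.
REQUIRED = (
    "ADMIN_USER",
    "ADMIN_PASSWORD",
    "PGADMIN_DEFAULT_EMAIL",
    "PGADMIN_DEFAULT_PASSWORD",
)


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes, if any.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_variables(text: str, required=REQUIRED) -> list:
    """Return the required names that are absent or set to an empty value."""
    values = parse_env(text)
    return [name for name in required if not values.get(name)]


if __name__ == "__main__":
    env_file = Path(".env")
    if env_file.exists():
        missing = missing_variables(env_file.read_text())
        print("missing:", missing or "none")
    else:
        print(".env not found; copy .env_demo to .env first")
```

Run it from the repository root next to .env. It only reads the file and never prints the secret values.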
WARNING: The Postgres container has an issue with persisting data after a restart. Until that is fixed, we will use a named volume. Run
docker volume create --name=pgdata
to create a named volume (to delete it, run docker volume rm pgdata)
docker-compose up
Note: Only the initial build takes a while. Go grab a cup of coffee while Docker downloads and installs the necessary tools.
You can run the services in detached mode with the --detach or -d flag. This leaves the services running in the background.
See: docker-compose up options
- pgAdmin:
  address: localhost:5050
  default_email: pgadmin@example.com
  default_pwd: admin
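Once docker-compose up has finished, it can help to confirm that the web UIs are reachable before opening a browser. The following is a minimal sketch using only the standard library. The SERVICES mapping is an assumption based on the ports this README mentions (8080 for Airflow, 5050 for pgAdmin, 3000 for Grafana), and it only checks that a TCP port accepts connections, not that the application behind it is healthy.

```python
import socket

# Assumption: port numbers taken from the addresses used in this README.
SERVICES = {"Airflow": 8080, "pgAdmin": 5050, "Grafana": 3000}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


def service_status(host: str = "localhost", services: dict = SERVICES) -> dict:
    """Map each service name to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, port in services.items()}


if __name__ == "__main__":
    for name, up in service_status().items():
        print(f"{name}: {'up' if up else 'not reachable'}")
```

A port that reports "not reachable" right after startup may simply still be initializing, so retrying after a few seconds is reasonable.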
Head to localhost:8080 in your browser. Log in with the credentials set in your ADMIN_USER and ADMIN_PASSWORD environment variables. Example: user danpra and password airflowpwd
Head to localhost:5050. Log in with the credentials set in your PGADMIN_DEFAULT_EMAIL and PGADMIN_DEFAULT_PASSWORD environment variables. Example: email danpra@example.com and password postgrespwd
To add a connection to the postgres DB in pgAdmin, click Add New Server. Type any name and select Connection. Example: Name: Boliga > Host name/address: postgres > enter the Postgres username and password, then click Save
Head to localhost:3000. Log in with user admin and password grafanapwd. You can change these credentials in containers/grafana/config.monitoring. Add postgres as a data source with the same Postgres username and password used in pgAdmin.
Charts coming soon
Press Ctrl + C to stop our services without destroying your volume data. Then run
docker-compose down
Use docker-compose down -v to also remove the volumes.
WARNING: Remember to back up your data before removing volumes.
docker-compose down -v
See: docker-compose down options
A lazy programmer, like me, loves to write less yet comprehensive code (yes, I said it :)). Design patterns are not as useful in Python as in languages like Java, C#, and C++, and in most cases they are overkill. Still, to design a simple bolig (Danish for estate) scraping tool for different estate websites in Denmark, I decided to use a bit of the Singleton pattern and the Abstract Factory pattern.
Bolig (pipelines.boliger.bolig.Bolig) ensures that there is a single instance, one object, that can be used by all other bolig-related classes. The form of singleton design used is Early Instantiation: the instance is created at load time.
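To illustrate what Early Instantiation means in Python, here is a minimal sketch. It is not the code in pipelines.boliger.bolig: SharedScraper, pages_fetched, and record_fetch are hypothetical names invented for this example. The point is only that the single instance is built when the module is imported, and every later call to the class hands back that same object.

```python
class SharedScraper:
    """Hypothetical stand-in for a shared scraper object (not the real Bolig)."""

    _instance = None

    def __new__(cls):
        # Create the object once; every later call returns the same instance.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.pages_fetched = 0  # state shared by all users
        return cls._instance

    def record_fetch(self) -> int:
        """Count a fetched page and return the running total."""
        self.pages_fetched += 1
        return self.pages_fetched


# Early instantiation: the instance exists as soon as this module is imported,
# rather than being created lazily on first use.
SCRAPER = SharedScraper()
```

Because SharedScraper() always returns SCRAPER, state such as the page counter is shared by every part of the code that asks for a scraper.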
The Bolig class also defines an interface (an abstract class) for creating families of related objects without specifying their concrete subclasses (functions). This keeps all objects consistent by isolating the client code from the implementation. We want to use the same functions with different implementations: get_page and get_pages are always called in the same way, but their implementation differs.
# inheritance tree
Bolig # singleton and abstract
    Boliga(Bolig) # overrides get_page and get_pages for boliga.dk API logic
        BoligaRecent(Boliga) # initialized with the recent boliga URL
        BoligaSold(Boliga) # initialized with the sold boliga URL
    Services(Bolig) # overrides get_page and get_pages for home.dk and estate.dk API logic
        Home(Services) # initialized with the home.dk recent-homes URL
        Estate(Services) # initialized with the estate.dk recent-homes URL
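The inheritance tree above can be sketched with Python's abc module. This is a hedged illustration, not the project's implementation: the class names mirror the tree, but the constructor signature, the placeholder example.com URLs, and the GET-versus-POST difference between Boliga and Services are all assumptions made for the example. The methods return a description of the request instead of touching the network.

```python
from abc import ABC, abstractmethod


class Bolig(ABC):
    """Common interface: callers only ever use get_page and get_pages."""

    def __init__(self, url: str):
        self.url = url

    @abstractmethod
    def get_page(self, page: int = 0) -> dict:
        """Describe (or perform) the request for a single page."""

    @abstractmethod
    def get_pages(self, start: int = 0, end: int = 3) -> list:
        """Return the results for pages start..end-1."""


class Boliga(Bolig):
    # Assumption for illustration: this family sends GET query parameters.
    def get_page(self, page: int = 0) -> dict:
        return {"method": "GET", "url": self.url, "params": {"page": page}}

    def get_pages(self, start: int = 0, end: int = 3) -> list:
        return [self.get_page(p) for p in range(start, end)]


class Services(Bolig):
    # Assumption for illustration: this family sends a POST with a JSON body.
    def get_page(self, page: int = 0) -> dict:
        return {"method": "POST", "url": self.url, "json": {"page": page}}

    def get_pages(self, start: int = 0, end: int = 3) -> list:
        return [self.get_page(p) for p in range(start, end)]


class BoligaRecent(Boliga):
    def __init__(self):
        super().__init__("https://example.com/boliga/recent")  # placeholder URL


class Home(Services):
    def __init__(self):
        super().__init__("https://example.com/home/recent")  # placeholder URL


# Client code is identical for every site; only the implementation differs.
for scraper in (BoligaRecent(), Home()):
    print(type(scraper).__name__, scraper.get_pages(0, 2))
```

Because both methods are abstract on Bolig, instantiating Bolig directly raises a TypeError, which is what forces each site family to supply its own logic.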
- Add web-scraper examples
- Add simple Airflow examples
- Add an introduction to Airflow README
- Add custom error handling class
- Add tests for Airflow DAGs and scrapers
- Add Grafana visualization of tasks and estate prices
Docker Basics:
Kill all containers
docker container kill $(docker container ps -q)