packtpub-crawler

Download a FREE eBook every day from www.packtpub.com

This crawler automates the following steps:

- log in to your private account
- claim the daily free eBook
- parse the title, description and other useful information
- download your preferred format: .pdf, .epub or .mobi
- download the source code and book cover
- upload the files to Google Drive
- store the data on Firebase
- send a notification via email
- schedule a daily job on Heroku or with Docker

Default command

# upload pdf to drive, store data and notify via email
python script/spider.py -c config/prod.cfg -u drive -s firebase -n

Other options

# download all formats
python script/spider.py --config config/prod.cfg --all

# download only one format: pdf|epub|mobi
python script/spider.py --config config/prod.cfg --type pdf

# also download additional material: source code (if it exists) and book cover
python script/spider.py --config config/prod.cfg -t pdf --extras
# equivalent (default is pdf)
python script/spider.py -c config/prod.cfg -e

# download and then upload to Drive (anyone with the download URL can download it)
python script/spider.py -c config/prod.cfg -t epub --upload drive
python script/spider.py --config config/prod.cfg --all --extras --upload drive
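
The flags used in the commands above can be summarised with a small argparse sketch. This is only an illustration: build_parser and formats_to_download are hypothetical names, the parser merely mirrors the options shown in this README, and script/spider.py may define them differently (for example, it is an assumption here that --all takes precedence over --type).

```python
import argparse


def build_parser():
    # Hypothetical parser: it only mirrors the flags used in the commands above
    # and is not the parser that script/spider.py actually defines.
    parser = argparse.ArgumentParser(prog="spider.py")
    parser.add_argument("-c", "--config", required=True, help="path to the config file")
    parser.add_argument("-t", "--type", choices=["pdf", "epub", "mobi"], default="pdf",
                        help="single format to download (default: pdf)")
    parser.add_argument("--all", action="store_true", help="download every format")
    parser.add_argument("-e", "--extras", action="store_true",
                        help="also download source code and book cover")
    parser.add_argument("-u", "--upload", choices=["drive"], help="upload target")
    parser.add_argument("-s", "--store", choices=["firebase"], help="storage target")
    parser.add_argument("-n", "--notify", action="store_true",
                        help="send an email notification")
    return parser


def formats_to_download(args):
    # Assumption: --all wins over --type; otherwise a single format is used.
    return ["pdf", "epub", "mobi"] if args.all else [args.type]


args = build_parser().parse_args(["-c", "config/prod.cfg", "-e"])
print(formats_to_download(args), args.extras)
```

Parsing the argument list explicitly, as in the last lines, makes it easy to check how a given command line would be interpreted without running the crawler.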

Basic setup

Before you start you should:

- Verify that your currently installed version of Python is 2.x: python --version
- Clone the repository: git clone https://github.com/niqdev/packtpub-crawler.git
- Install all the dependencies (you might need sudo privileges): pip install -r requirements.txt
- Create a config file: cp config/prod_example.cfg config/prod.cfg
- Change your Packtpub credentials in the config file

[credential]
credential.email=PACKTPUB_EMAIL
credential.password=PACKTPUB_PASSWORD
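
To check that the credentials were actually changed, a file in this format can be read with the standard library. The following is a minimal sketch, not the crawler's own loader: load_credentials and still_placeholder are hypothetical helpers, and it uses Python 3's configparser module (in Python 2 the module is named ConfigParser).

```python
import configparser

# Same layout as the [credential] section shown above.
SAMPLE = """
[credential]
credential.email=PACKTPUB_EMAIL
credential.password=PACKTPUB_PASSWORD
"""


def load_credentials(text):
    # interpolation=None keeps any '%' characters in a password literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return (parser.get("credential", "credential.email"),
            parser.get("credential", "credential.password"))


def still_placeholder(email, password):
    # True while the example values have not been replaced.
    return email == "PACKTPUB_EMAIL" or password == "PACKTPUB_PASSWORD"


email, password = load_credentials(SAMPLE)
print(still_placeholder(email, password))
```

For a real file, parser.read("config/prod.cfg") would replace read_string.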

Now you should be able to claim and download your first eBook

python script/spider.py --config config/prod.cfg

Upload setup

According to the documentation, the Drive API requires OAuth 2.0 for authentication, so to upload files you should:

- Go to Google APIs Console and create a new Drive project named PacktpubDrive
- On API manager > Overview menu
  - Enable Google Drive API
- On API manager > Credentials menu
  - In OAuth consent screen tab set PacktpubDrive as the product name shown to users
  - In Credentials tab create credentials of type OAuth client ID and choose Application type Other named PacktpubDriveCredentials
- Click Download JSON and save the file as config/client_secrets.json
- Change your Drive credentials in the config file

[drive]
...
drive.client_secrets=config/client_secrets.json
drive.gmail=GOOGLE_DRIVE@gmail.com

Now you should be able to upload your eBook to Drive

python script/spider.py --config config/prod.cfg --upload drive

Only the first time, you will be prompted to log in through a browser with JavaScript enabled (text-based browsers will not work) in order to generate config/auth_token.json. You should also copy the FOLDER_ID into the config file, otherwise a new folder with the same name will be created on every run.

[drive]
...
drive.default_folder=packtpub
drive.upload_folder=FOLDER_ID

Documentation: OAuth, Quickstart, example and permissions

Database setup

Create a new Firebase project and copy the database secret from your settings

https://console.firebase.google.com/project/PROJECT_NAME/settings/database

and update the config file

[firebase]
firebase.database_secret=DATABASE_SECRET
firebase.url=https://PROJECT_NAME.firebaseio.com
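
As a rough illustration of how these two values fit together: the Firebase Realtime Database REST API addresses a node as URL/NODE.json and accepts a legacy database secret in the auth query parameter. The sketch below only builds the request URL and a JSON body and sends nothing. The node name books and the record fields are assumptions for the example, not necessarily what the crawler stores.

```python
import json
from urllib.parse import quote, urlencode


def firebase_endpoint(base_url, secret, node):
    # REST address of a node: <url>/<node>.json, with the legacy database
    # secret passed in the "auth" query parameter.
    return "{}/{}.json?{}".format(
        base_url.rstrip("/"), quote(node), urlencode({"auth": secret}))


def book_payload(title, description):
    # Hypothetical record shape; the real crawler may store different fields.
    return json.dumps({"title": title, "description": description}, sort_keys=True)


print(firebase_endpoint("https://PROJECT_NAME.firebaseio.com", "DATABASE_SECRET", "books"))
print(book_payload("Example title", "Example description"))
```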

Now you should be able to store your eBook details on Firebase

python script/spider.py --config config/prod.cfg --upload drive --store firebase

Notification setup

To send a notification via email using Gmail you should:

- Allow "less secure apps" and "DisplayUnlockCaptcha" on your account
- See "Troubleshoot sign-in problems" and the examples if the login fails
- Change your Gmail credentials in the config file

[notify]
...
notify.username=EMAIL_USERNAME@gmail.com
notify.password=EMAIL_PASSWORD
notify.from=FROM_EMAIL@gmail.com
notify.to=TO_EMAIL_1@gmail.com,TO_EMAIL_2@gmail.com
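
The notify.to value is a comma-separated list, so it has to be split before a message is sent. A minimal sketch of building such a message with the standard library follows. build_notification and the subject prefix are hypothetical, and the sending part is left commented out because it needs real credentials.

```python
from email.mime.text import MIMEText


def build_notification(sender, recipients_csv, title, description):
    # notify.to holds a comma-separated list, so split it into addresses.
    recipients = [addr.strip() for addr in recipients_csv.split(",") if addr.strip()]
    message = MIMEText(description, "plain", "utf-8")
    message["Subject"] = "[packtpub-crawler] {}".format(title)  # assumed subject format
    message["From"] = sender
    message["To"] = ", ".join(recipients)
    return message, recipients


message, recipients = build_notification(
    "FROM_EMAIL@gmail.com",
    "TO_EMAIL_1@gmail.com,TO_EMAIL_2@gmail.com",
    "Example title",
    "Example description",
)
print(recipients)

# Sending through Gmail (smtp.gmail.com, port 587, STARTTLS) would look like:
# import smtplib
# server = smtplib.SMTP("smtp.gmail.com", 587)
# server.starttls()
# server.login("EMAIL_USERNAME@gmail.com", "EMAIL_PASSWORD")
# server.sendmail(message["From"], recipients, message.as_string())
# server.quit()
```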

Now you should be able to notify your accounts

python script/spider.py --config config/prod.cfg --upload drive --notify

Heroku setup

Create a new branch

git checkout -b heroku-scheduler

Update the .gitignore and commit your changes

# remove
config/prod.cfg
config/client_secrets.json
config/auth_token.json
# add
dev/
config/dev.cfg
config/prod_example.cfg

Create, configure and deploy the scheduler

heroku login
# create a new app
heroku create APP_NAME
# or if you already have an existing app
heroku git:remote -a APP_NAME

# deploy your app
git push -u heroku heroku-scheduler:master
heroku ps:scale clock=1

# useful commands
heroku ps
heroku logs --ps clock.1
heroku logs --tail
heroku run bash

Update script/scheduler.py with your own preferences.

More info about Heroku Scheduler, Clock Processes, Add-on and APScheduler
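
script/scheduler.py is not reproduced in this README, so the following is only a dependency-free sketch of the core calculation a daily clock process needs: how long to wait until the next run at a chosen time. seconds_until_next_run is a hypothetical helper and does not use APScheduler, which may be what the actual script relies on.

```python
from datetime import datetime, timedelta


def seconds_until_next_run(now, hour, minute=0):
    # Next occurrence of hour:minute: today if still ahead, otherwise tomorrow.
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)
    return (target - now).total_seconds()


# One hour before a 09:00 run, and one hour after it.
print(seconds_until_next_run(datetime(2017, 1, 1, 8, 0), hour=9))   # 3600.0
print(seconds_until_next_run(datetime(2017, 1, 1, 10, 0), hour=9))  # 82800.0
```

A clock process could sleep for the returned number of seconds and then launch the spider command, then repeat.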

Docker setup

Build your image

docker build -t niqdev/packtpub-crawler:1.3.0 .

Run manually

docker run \
  --rm \
  --name my-packtpub-crawler \
  niqdev/packtpub-crawler:1.3.0 \
  python script/spider.py --config config/prod.cfg --upload drive

Run scheduled crawler in background

docker run \
  --detach \
  --name my-packtpub-crawler \
  niqdev/packtpub-crawler:1.3.0

# useful commands
docker exec -i -t my-packtpub-crawler bash
docker logs -f my-packtpub-crawler

Development (only for spidering)

Run a simple static server with

node dev/server.js

and test the crawler with

python script/spider.py --dev --config config/dev.cfg --all

Disclaimer

This project is just a proof of concept and is not intended for any illegal usage. I'm not responsible for any damage or abuse; use it at your own risk.
