Distributed-Web-Crawler-with-Celery

Python: selenium, beautifulsoup2, celery, rabbitmq, Amazon AWS (EC2, S3)

client.py: runs on a single machine and is the "main" script; it collects URLs and sends them to the workers (a minimal dispatch sketch follows the usage lines below)

    python client.py            # crawl everything
    python client.py <number>   # test only pages 1 through <number>
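
A minimal sketch of the dispatch loop, assuming the worker task is named save_page and lives in proj/task.py; the URL pattern and default page count are placeholders, since the real script builds its own URL list:

    # client.py -- sketch of the dispatch loop (task name and URL pattern are assumptions)
    import sys
    from proj.task import save_page               # Celery task defined for the workers

    def collect_urls(limit):
        # Placeholder URL pattern; the real script gathers its URLs with selenium/beautifulsoup.
        return ["https://www.kaggle.com/rankings?page=%d" % i for i in range(1, limit + 1)]

    if __name__ == "__main__":
        # "python client.py 50" tests pages 1..50 only; no argument crawls the default range.
        limit = int(sys.argv[1]) if len(sys.argv) > 1 else 100
        for url in collect_urls(limit):
            save_page.delay(url)                   # queue the task on RabbitMQ for a worker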

proj folder: deployed on the worker machines; its Celery tasks fetch pages and save the HTML to S3 storage (a minimal sketch follows the file list below)

__init__.py
celery.py
task.py
upload.py
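
A minimal sketch of how these files can fit together, assuming a RabbitMQ broker on localhost and a bucket named my-crawler-bucket (both assumptions); the real task drives selenium instead of the plain HTTP fetch shown here:

    # proj/celery.py -- the Celery app shared by all workers (broker URL is an assumption)
    from celery import Celery

    app = Celery('proj',
                 broker='amqp://guest:guest@localhost//',   # RabbitMQ
                 include=['proj.task'])

    # proj/task.py -- one task per URL: fetch the page and store the HTML in S3
    import urllib.request
    import boto3
    from proj.celery import app

    s3 = boto3.client('s3')

    @app.task
    def save_page(url):
        html = urllib.request.urlopen(url).read()        # real task: selenium + beautifulsoup
        key = url.rstrip('/').split('/')[-1] + '.html'   # derive an object key from the URL
        s3.put_object(Bucket='my-crawler-bucket', Key=key, Body=html)

Workers would then be started on each EC2 instance with something like celery -A proj worker --loglevel=info.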

admin folder: after the client and the workers have finished all tasks, these scripts verify that the crawl is complete (a minimal check sketch follows the list below)

check.py: finds the first missing HTML file, in case an error happened during the crawl
checkAndUpload.py: finds the missing files and also uploads them to S3
count.py: counts the total number of users on kaggle.com
upload.py: used by checkAndUpload.py; the same file as the one in the proj folder
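
A minimal sketch of the completeness check, assuming the crawl saved numbered keys such as 1.html, 2.html, ... in an S3 bucket (bucket name, key scheme, and expected count are all assumptions):

    # admin/check.py -- find the first missing HTML file in the bucket
    import boto3
    from botocore.exceptions import ClientError

    BUCKET = 'my-crawler-bucket'     # assumption
    EXPECTED = 100                   # number of pages the client dispatched

    s3 = boto3.client('s3')

    def exists(key):
        # head_object raises ClientError (404) when the object is missing
        try:
            s3.head_object(Bucket=BUCKET, Key=key)
            return True
        except ClientError:
            return False

    for page in range(1, EXPECTED + 1):
        if not exists('%d.html' % page):
            print('first missing page:', page)
            break
    else:
        print('no pages are missing')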

Install all required libraries before running the code.
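
Exact package names depend on your environment; one plausible starting point (beautifulsoup is published on PyPI as beautifulsoup4, and boto3 is the usual client for S3):

    pip install selenium beautifulsoup4 celery boto3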
