A script that crawls jet.com for a list of all products and stores them in a Redis database.
Crawling is a two-step process. First, the master node runs and inserts the URLs to be crawled into SQS. The worker node(s) then crawl those URLs and place the results in a Redis database for serving to clients.
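The master/worker flow above can be sketched as follows. This is an illustrative in-memory model only: the `queue` and `store` below are stand-ins for SQS and Redis (the real tool would use an SQS client and a Redis client), and the function names are hypothetical.

```python
from collections import deque

# In-memory stand-ins for SQS (a queue) and Redis (a key-value store).
queue = deque()
store = {}

def master(urls):
    # Master node: enqueue URLs for the workers to crawl.
    for url in urls:
        queue.append(url)

def worker(crawl):
    # Worker node(s): drain the queue and store each crawl result
    # so it can be served to clients.
    while queue:
        url = queue.popleft()
        store[url] = crawl(url)

master(["https://jet.com/c/electronics"])
worker(lambda url: {"url": url, "status": "crawled"})
```

Multiple workers can drain the same queue independently, which is what makes the second step horizontally scalable.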
Inject the following environment variables into the running container:
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
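A container launch might look like the following sketch; the image name is a placeholder and the credential values are elided on purpose.

```shell
# Hypothetical invocation -- substitute your real image name and credentials.
docker run \
  -e AWS_ACCESS_KEY_ID=... \
  -e AWS_SECRET_ACCESS_KEY=... \
  <your-crawler-image> python3 run.py get_categories
```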
Follow the setup steps above, then run the tool in each of the three modes:
```
python3 run.py [get_categories|get_items|get_details]
```
- `get_categories` retrieves a list of categories to crawl against and places them in SQS.
- `get_items` retrieves a list of items to crawl.
- `get_details` retrieves the price details for each item.
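The three-mode command line above suggests a simple dispatch in `run.py`. This is a hedged sketch: the handler bodies are placeholders, since the real implementations live in the project.

```python
import sys

# Placeholder handlers for the three crawl stages; the real versions
# would talk to SQS and Redis.
def get_categories():
    return "categories -> SQS"

def get_items():
    return "items -> SQS"

def get_details():
    return "details -> Redis"

MODES = {
    "get_categories": get_categories,
    "get_items": get_items,
    "get_details": get_details,
}

def main(argv):
    # Dispatch on the single positional mode argument.
    mode = argv[1] if len(argv) > 1 else ""
    if mode not in MODES:
        raise SystemExit("usage: run.py [get_categories|get_items|get_details]")
    return MODES[mode]()

if __name__ == "__main__":
    print(main(sys.argv))
```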