usage: scrape-quotes [-h] [--company-ids COMPANY_IDS [COMPANY_IDS ...]]
[--wait-element-timeout WAIT_ELEMENT_TIMEOUT]
[--dom-ready-timeout DOM_READY_TIMEOUT]
[--pool-size POOL_SIZE] [--batch-size BATCH_SIZE]
[--cache-capacity CACHE_CAPACITY] [--headless]
db_config_file_name
positional arguments:
db_config_file_name The database configuration file name. The available
names is/are: db_config.aws.ini, db_config.uc.ini
optional arguments:
-h, --help show this help message and exit
--company-ids COMPANY_IDS [COMPANY_IDS ...]
A list of car rental company id, defaults to [1, 2,
3]. Thrifty: 1, Budget: 2, GO rentals: 3
--wait-element-timeout WAIT_ELEMENT_TIMEOUT
The timeout used by the Selenium explicit waits,
defaults to 10 seconds
--dom-ready-timeout DOM_READY_TIMEOUT
The timeout used to wait until the document.readyState
becomes complete, defaults to 10 seconds,
--pool-size POOL_SIZE
The number of processes that the pool creates to
scrape quotes in parallel, defaults to 8
--batch-size BATCH_SIZE
The number of pending scraping requests retrieved from
the database at a time, defaults to 1000
--cache-capacity CACHE_CAPACITY
The maximun number of rental quotes that the cache can
accommodate, defaults to 1000
--headless Enable to use headless Chrome
-
selenium==3.141.0
pip install selenium
-
beautifulsoup4==4.9.3
pip install beautifulsoup4
-
mysql-connector-python==8.0.21
pip install mysql-connector-python
- A Rental Route is defined as a pair of locations. One represents the pick-up location and the other represents the drop-off location. Each car rental company offers customers a number of rental routes.
- A Rental Duration represents the number of rental days. Adding a rental duration to a pick-up date results in a drop-off date.
- When the program starts to run, the Scraping Date is the current date and the Pick-up Date is the day next to the scraping date. The program keeps running unless it is forced to quit. As time passes, the scraping date is always the current date as time passes. The pick-up date is always the day next to the scraping date.
- A Scraping Request is an object which encapsulates data required to operate on a car rental company's website to get rental quotes. It consists of a company, a pick-up location, a pick-up date, a pick-up time, a drop-off location, a drop-off date and a drop-off time.
- A Processed Scraping Request is a scraping request whose corresponding rental quotes have been scraped and saved to the database.
- A Scraping Request Statistics is an object which reflects the total number of scraping requests and the number of processed scraping requests for a designated scraping date. The formula to calulcate the total number of scraping requests for a designated scraping date is: Number of rental routes * Number of pick-up times * Number of rental durations
- Three companies of scraping tasks contain different locations in New Zealand, Thrifty with 28 locations, Budget with 35 and Gorentals with 6. The integration of the locations reduce the duplication works.
- Only 5 rental durations are available, namely 1 day, 2 days, 3 days, 4 days and 5 days.
- Only 2 pick-up times are available, namely 10:00 AM and 02:00 PM.
- add_office
- debug_msg
- get_scraping_request_statistics
- get_location_name
- get_pending_scraping_requests
- refresh_rental_routes
For Thrifty, it has 28 locations, namely 784 (28 * 28) rental routes. There are 7840 (784 * 2 * 5) scraping requests. Each scraping request could bring 1 to 10 rental quotes. So the total of rental quotes could be 7840 to 78400 (7840 * 10).
For Budget, it has 35 locations, namely 1225 (35 * 35) rental routes. There are 12250 (1225 * 2 * 5) scraping requests. Each scraping request could bring 1 to 10 rental quotes. So the total of rental quotes could be 12250 to 122500 (12250 * 10).
For GO Rentals, it has 6 locations, namely 36 (6 * 6) rental routes. There are 360 (36 * 2 * 5) scraping requests. Each scraping request could bring around 20 rental quotes. So the total of rental quotes could be 360 to 7200 (360 * 20).
In summary, we have 3 companies, 2045 (784 + 1225 + 36) rental routes, 20450 (7840 + 12250 + 360) scraping requests and collect 20450 (7840 + 12250 + 360) to 208100 (78400 + 122500 + 7200) rental quotes.
- https://selenium-python.readthedocs.io/
- Beautiful Soup Documentation
- https://dev.mysql.com/doc/connector-python/en/
- https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- https://stackoverflow.com/questions/43164411/why-do-we-still-need-parser-like-beautifulsoup-if-we-can-use-selenium
- Check the timezone of db server: SELECT @@global.time_zone;
- Set timezone in MySql workbench: set time_zone = '+13:00';
- WebDriver click() vs JavaScript click()
- Selenium WebDriver How to Resolve Stale Element Reference Exception?
- ChromeDriver - WebDriver for Chrome
- Context managers and multiprocessing pools
- multiprocessing.Pool: What's the difference between map_async and imap?
- In what situation do we need to use 'multiprocessing.Pool.imap_unordered'?
- Add RentalsScraping to /usr/local/bin
ln -s [The absolute path to bin/scrape_quotes.py] /usr/local/bin/scrape-quotes
chmod 755 /usr/local/bin/scrape-quotes
- Logging in Python
- Python3: Logging With Multiprocessing