I built this project during the seven-week Insight Data Engineering Fellows Program,
which helps recent grads and experienced software engineers learn the latest open source technologies
by building a data platform to handle large datasets.
DealGalaxy is a batch platform that helps optimize the online shopping experience. The platform gives
you the cheapest price each day. The website can be found at www.dealgalgaxy.site
This application is built on Amazon Web Services. Every day it calculates each shopping website's total discount by adding the cash back percentage from the Ebates website, the gift card discount from the eBay website, and the coupon discount. It then applies that total discount to the items sold on those shopping websites and calculates the cheapest price for each item.
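The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the names `SiteDiscount`, `discounted_price`, and `cheapest_offer` are hypothetical, the three percentages are simply added as the description says, and the cap at 100% is an assumption added to keep prices non-negative.

```python
from dataclasses import dataclass


@dataclass
class SiteDiscount:
    """Daily discount components for one shopping website (all in percent)."""
    site: str
    cash_back_pct: float   # cash back percentage (from Ebates)
    gift_card_pct: float   # gift card discount (from eBay)
    coupon_pct: float      # coupon discount

    @property
    def total_pct(self) -> float:
        # The three components are added; the 100% cap is an assumption.
        total = self.cash_back_pct + self.gift_card_pct + self.coupon_pct
        return min(total, 100.0)


def discounted_price(list_price: float, discount: SiteDiscount) -> float:
    """Apply a website's total discount to an item's list price."""
    return round(list_price * (1 - discount.total_pct / 100), 2)


def cheapest_offer(list_prices: dict, discounts: dict) -> tuple:
    """Return (site, price) for the lowest discounted price of one item.

    list_prices maps site name -> list price of the item on that site.
    discounts maps site name -> SiteDiscount for the current day.
    Sites without a known discount are skipped; raises ValueError if
    no site is left to compare.
    """
    offers = {
        site: discounted_price(price, discounts[site])
        for site, price in list_prices.items()
        if site in discounts
    }
    best_site = min(offers, key=offers.get)
    return best_site, offers[best_site]
```

For example, calling `cheapest_offer({"site_a": 100.0, "site_b": 90.0}, discounts)` compares the same item across two hypothetical sites after each site's total discount is applied, so a site with a higher list price can still win if its combined discount is large enough.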
A multithreaded Python script runs at midnight and pushes the scraped data into S3 using the AWS Python SDK. AWS Data Pipeline then loads the data from S3 into Redshift, where another Python script updates each website's total discount and each item's discounted price using the current day's information. Finally, Flask serves as the web server to visualize the information.
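The first stage of this pipeline, scraping on multiple threads and pushing the results to S3, might look roughly like the sketch below. It is an assumption-laden outline rather than the project's script: `scrape_all`, `s3_key`, `upload_results`, and `make_s3_put` are hypothetical names, the `raw/<date>/<site>.json` key layout is invented for illustration, and the per-site `fetch` function is left to the caller. The only AWS SDK call used is boto3's `put_object` on an S3 client.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from datetime import date


def scrape_all(sites, fetch, max_workers=8):
    """Run fetch(site) for every site on a thread pool.

    Returns {site: result}. pool.map preserves input order, so zipping
    the results back onto the site list is safe.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(sites, pool.map(fetch, sites)))


def s3_key(site, day):
    """Build the object key for one site's daily dump (hypothetical layout)."""
    return f"raw/{day.isoformat()}/{site}.json"


def upload_results(results, bucket, day, put_object):
    """Upload each site's records as one JSON object; return the keys written.

    put_object is any callable accepting Bucket, Key and Body keyword
    arguments, such as the put_object method of a boto3 S3 client.
    """
    keys = []
    for site, records in results.items():
        key = s3_key(site, day)
        body = json.dumps(records).encode("utf-8")
        put_object(Bucket=bucket, Key=key, Body=body)
        keys.append(key)
    return keys


def make_s3_put():
    """Return the real S3 uploader (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the rest of the module runs without it
    return boto3.client("s3").put_object
```

A nightly run would then call `scrape_all(sites, fetch)` followed by `upload_results(results, bucket, date.today(), make_s3_put())`, with the bucket name and site list supplied by the deployment. Passing the uploader in as a parameter keeps the scraping logic testable without touching AWS.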
Redshift is used for batch processing because it combines a relational SQL interface with columnar storage in a distributed architecture. The scraper collects about 7 GB of data every day, and past data is kept for analysis and predictions.
1) Finding the cheapest price for an item
2) Visualizing past cash back information
The website is also able to answer the following questions:
3) The current day's trending website, that is, the one offering the biggest discount
4) eBay gift card Buy-It-Now percentage and total number