This project was completed during the Insight Engineering Fellow Program.
This project is inspired by a Spark Summit East talk by Vida Ha. The main tool is Spark, which is used to run batch processing and join queries on DataFrames. The engineering challenge is to reduce the time and space cost of two join queries:
- Join a person information table with a credit score table to figure out each client's credit score.
- Join a daily transaction table with a card table to add card information to today's transaction information.
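To make the second join concrete, here is a minimal sketch in plain Python (not Spark code) of a build-and-probe hash join, the idea behind Spark's broadcast hash join: index the smaller table by the join key, then stream the larger table past that index. The helper `hash_join`, the column names (`card_id`, `txn_id`, `amount`, `card_type`), and the sample rows are all hypothetical and are not taken from the project. The sketch assumes an inner join, so transactions without a matching card are dropped.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, key):
    """Inner-join two lists of dicts on `key` using build-and-probe."""
    # Build phase: index the smaller table by the join key.
    index = defaultdict(list)
    for row in build_rows:
        index[row[key]].append(row)

    # Probe phase: stream the larger table and look up matches.
    joined = []
    for row in probe_rows:
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical sample rows; column names are illustrative only.
transactions = [
    {"txn_id": 100, "card_id": 2, "amount": 12.5},
    {"txn_id": 101, "card_id": 1, "amount": 40.0},
    {"txn_id": 102, "card_id": 3, "amount": 7.0},  # no matching card
]
cards = [
    {"card_id": 1, "card_type": "credit"},
    {"card_id": 2, "card_type": "debit"},
]

# The daily transaction table is the smaller side in this project,
# so it is used as the build side and the card table is probed.
enriched = hash_join(transactions, cards, "card_id")
```

Building on the smaller side keeps the in-memory index small, while the larger table is only scanned once. Whether this strategy fits the real tables depends on whether the smaller side fits in executor memory, which this sketch does not address.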
The data is synthetically generated. Table sizes:
- Person info table: 3,000,000
- Card info table: 3,000,000,000
- Daily transaction table: 100,000,000