Spark-ETL is a data integration tool built on Spark and Simple-Salesforce.
* Spark - http://spark.apache.org/downloads.html (version 1.5.2 was used for the current implementation)
* Simple-Salesforce - https://github.com/heroku/simple-salesforce
#Dependencies
* Python 3
* Simple-Salesforce
* Spark 1.5.2
#Functionality
The Spark-ETL application performs the following operations (a minimal sketch follows the list):
* Connects to a Salesforce org
* Extracts data from the tables defined in the property files
* Loads each table's data into a Spark DataFrame
* Creates a Spark SQL table for further operations
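The sketch below shows one way this flow fits together; the credentials, table name, and field list are placeholders, and the property-file plumbing is omitted. `registerTempTable` is the Spark 1.5 API for exposing a DataFrame to Spark SQL.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
from simple_salesforce import Salesforce

sc = SparkContext(appName="spark-etl")
sqlContext = SQLContext(sc)

# Connect to the Salesforce org (credentials are placeholders).
sf = Salesforce(username="user@example.com", password="secret",
                security_token="token")

# Extract a table; in the real app the SOQL and field list
# would come from the property files in etl-config.
result = sf.query_all("SELECT Id, Name FROM Account")

# Drop the 'attributes' metadata simple-salesforce adds to each record.
rows = [Row(**{k: v for k, v in rec.items() if k != "attributes"})
        for rec in result["records"]]

# Load into a Spark DataFrame and register a Spark SQL table.
df = sqlContext.createDataFrame(rows)
df.registerTempTable("Account")
sqlContext.sql("SELECT COUNT(*) FROM Account").show()
```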
#Structure
There is a top-level folder called 'spark-etl' with four subfolders: 'connections', 'etl-config', 'logs' and 'scripts'.
* spark-etl
  * connections - contains connection information in JSON files. Copy sfdc_connections.json.template here (a sketch of how the app might read it follows this list).
  * etl-config - contains table properties for data processing. Copy the corresponding property-file template here.
  * logs - the log file, spark_etl.log, is generated here by the app.
  * scripts - copy all .py scripts here.
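As a sketch of how the app might read a connection definition, assuming SPARK_ETL_HOME is set. The JSON schema (a top-level object keyed by connection name, with username/password/security_token fields) and the connection name "my_org" are assumptions here; check sfdc_connections.json.template for the real keys.

```python
import json
import os

spark_etl_home = os.environ["SPARK_ETL_HOME"]

# Assumed schema -- verify against sfdc_connections.json.template.
path = os.path.join(spark_etl_home, "connections", "sfdc_connections.json")
with open(path) as f:
    connections = json.load(f)

conn = connections["my_org"]  # hypothetical connection name
print(conn["username"])
```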
#Installation/Configuration
###Environment and Application Configuration
####Environment Variables
Besides setting up the Spark-specific variables (refer to the Spark documentation), we need to set one top-level spark-etl variable: SPARK_ETL_HOME.

    export SPARK_ETL_HOME=../../../spark-etl

Alternatively, start the main script spark_etl_main.py from the 'scripts' folder; the app will set all of its own variables appropriately if the recommended folder structure was followed.
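One way this path resolution could work (a sketch, not necessarily the app's exact logic): use SPARK_ETL_HOME when it is set, and otherwise fall back to the parent of the 'scripts' folder, which is 'spark-etl' under the recommended layout.

```python
import os

# Use SPARK_ETL_HOME if set; otherwise assume this script lives in
# spark-etl/scripts and take its parent directory as the home folder.
SPARK_ETL_HOME = os.environ.get(
    "SPARK_ETL_HOME",
    os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

CONNECTIONS_DIR = os.path.join(SPARK_ETL_HOME, "connections")
ETL_CONFIG_DIR = os.path.join(SPARK_ETL_HOME, "etl-config")
LOG_FILE = os.path.join(SPARK_ETL_HOME, "logs", "spark_etl.log")
```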
#Next Steps
* SFDC namespace as a parameter
* Serialize data into HiveMetastore tables
* Store data in csv files on AWS S3
* Load data into a relational database (PostgreSQL, MySQL)
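For the HiveMetastore and relational-database items, the Spark 1.5 DataFrame writer already offers the needed hooks. A sketch of what those targets could look like, reusing the `df` from the sketch above; the JDBC URL, table names, and credentials are placeholders, and none of this is implemented yet:

```python
# Persist the DataFrame as a HiveMetastore table
# (requires a HiveContext instead of a plain SQLContext).
df.write.mode("overwrite").saveAsTable("salesforce_account")

# Load the same DataFrame into PostgreSQL over JDBC
# (the PostgreSQL JDBC driver must be on the Spark classpath).
df.write.jdbc(url="jdbc:postgresql://localhost:5432/etl",
              table="account",
              mode="overwrite",
              properties={"user": "etl", "password": "secret"})
```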