
# spark-etl

Spark-ETL is a data integration tool built on Spark and Simple-Salesforce.

- Spark - http://spark.apache.org/downloads.html. Version 1.5.2 was used for the current implementation.
- Simple-Salesforce - https://github.com/heroku/simple-salesforce

# Dependencies

- Python 3
- Simple-Salesforce
- Spark 1.5.2

# Functionality

The SPARK-ETL application performs the following operations:

- Connects to a Salesforce org
- Extracts data from tables defined in property files
- Loads each table's data into a Spark DataFrame
- Creates a Spark SQL table for further operations
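
For illustration, here is a minimal sketch of that flow, assuming an `Account` table and inline credentials; in the real app, connection details and table lists come from the JSON/property files described below:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
from simple_salesforce import Salesforce

# Connect to the Salesforce org (credentials below are placeholders;
# the app reads them from connections/*.json).
sf = Salesforce(username="user@example.com", password="secret", security_token="token")

# Extract: pull the records for one table defined in a property file.
result = sf.query_all("SELECT Id, Name FROM Account")
records = [{k: v for k, v in rec.items() if k != "attributes"} for rec in result["records"]]

# Load: build a Spark DataFrame (Spark 1.5.x API) from the extracted rows.
sc = SparkContext(appName="spark-etl")
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([Row(**rec) for rec in records])

# Register a Spark SQL table for further operations.
df.registerTempTable("Account")
sqlContext.sql("SELECT COUNT(*) FROM Account").show()
```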

# Structure

There is a top-level folder called `spark-etl` and four subfolders: `connections`, `etl-config`, `logs`, and `scripts`.

spark-etl/
	connections - contains connection information in JSON files. Copy sfdc_connections.json.template here.
	etl-config - contains table properties for data processing. Copy sfdc_connections.json.template here.
	logs - the log file generated by the app is written here. Name: spark_etl.log
	scripts - copy all .py scripts here.
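
As a rough illustration of how a file in `connections` might be consumed, the sketch below assumes the JSON holds `username`, `password`, and `security_token` keys (which is what simple-salesforce needs to authenticate); the actual template may use different field names:

```python
import json
import os

from simple_salesforce import Salesforce

# Read one connection definition from the connections folder.
conn_path = os.path.join(os.environ["SPARK_ETL_HOME"], "connections", "sfdc_connections.json")
with open(conn_path) as f:
    conn = json.load(f)

# Field names are illustrative; adjust to match the actual template.
sf = Salesforce(
    username=conn["username"],
    password=conn["password"],
    security_token=conn["security_token"],
)
```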

# Installation/Configuration

### Environment and Application Configuration

Environment Variables

Besides setting the Spark-specific variables (refer to the Spark documentation), we need to set one top-level spark-etl variable, `SPARK_ETL_HOME`:

	export SPARK_ETL_HOME=../../../spark-etl

Alternatively, make sure to start the main script `spark_etl_main.py` from the `scripts` folder; the app will set all of its own variables appropriately if the recommended folder structure was followed.
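
One possible way the scripts could resolve `SPARK_ETL_HOME` when the variable is not exported, assuming the folder layout above (the exact logic in `spark_etl_main.py` may differ):

```python
import os

# Prefer an explicit SPARK_ETL_HOME; otherwise assume scripts/ sits directly
# under the spark-etl home and use the parent of this script's directory.
SPARK_ETL_HOME = os.environ.get(
    "SPARK_ETL_HOME",
    os.path.dirname(os.path.dirname(os.path.abspath(__file__))),
)

CONNECTIONS_DIR = os.path.join(SPARK_ETL_HOME, "connections")
ETL_CONFIG_DIR = os.path.join(SPARK_ETL_HOME, "etl-config")
LOGS_DIR = os.path.join(SPARK_ETL_HOME, "logs")
```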

# Troubleshooting

# Future functionality

- SFDC namespace as a parameter
- Serialize data into Hive metastore tables
- Store data in CSV files on AWS S3
- Load data into a relational database (PostgreSQL, MySQL)
