Spark-ETL is a data integration tool built on Spark and Simple-Salesforce.
* Spark - http://spark.apache.org/downloads.html (version 1.5.2 was used for the current implementation)
* Simple-Salesforce - https://github.com/heroku/simple-salesforce
#Dependencies
* Python 3
* Simple-Salesforce
* Spark 1.5.2
#Functionality
The Spark-ETL application performs the following operations (a minimal sketch follows the list):
* Connects to a Salesforce org
* Extracts data from the tables defined in the property files
* Loads each table's data into a Spark DataFrame
* Creates a Spark SQL table for further operations
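The sketch below shows one way this flow fits together; the credentials, table name, and field list are placeholders, and the property-file plumbing is omitted. `registerTempTable` is the Spark 1.5 API for exposing a DataFrame to Spark SQL.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
from simple_salesforce import Salesforce

sc = SparkContext(appName="spark-etl")
sqlContext = SQLContext(sc)

# Connect to the Salesforce org (credentials are placeholders).
sf = Salesforce(username="user@example.com", password="secret",
                security_token="token")

# Extract a table; in the real app the SOQL and field list
# would come from the property files in etl-config.
result = sf.query_all("SELECT Id, Name FROM Account")

# Drop the 'attributes' metadata simple-salesforce adds to each record.
rows = [Row(**{k: v for k, v in rec.items() if k != "attributes"})
        for rec in result["records"]]

# Load into a Spark DataFrame and register a Spark SQL table.
df = sqlContext.createDataFrame(rows)
df.registerTempTable("Account")
sqlContext.sql("SELECT COUNT(*) FROM Account").show()
```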
#Structure
There is a top-level folder called 'spark-etl' with four subfolders: 'connections', 'etl-config', 'logs' and 'scripts'.
* spark-etl
  * connections - contains connection information in JSON files. Copy sfdc_connections.json.template here (a sketch of how the app might read it follows this list).
  * etl-config - contains table properties for data processing. Copy the corresponding property-file template here.
  * logs - the log file, spark_etl.log, is generated here by the app.
  * scripts - copy all .py scripts here.
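As a sketch of how the app might read a connection definition, assuming SPARK_ETL_HOME is set. The JSON schema (a top-level object keyed by connection name, with username/password/security_token fields) and the connection name "my_org" are assumptions here; check sfdc_connections.json.template for the real keys.

```python
import json
import os

spark_etl_home = os.environ["SPARK_ETL_HOME"]

# Assumed schema -- verify against sfdc_connections.json.template.
path = os.path.join(spark_etl_home, "connections", "sfdc_connections.json")
with open(path) as f:
    connections = json.load(f)

conn = connections["my_org"]  # hypothetical connection name
print(conn["username"])
```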
#Installation/Configuration
###Environment and Application Configuration
####Environment Variables
Besides setting up the Spark-specific variables (refer to the Spark documentation), we need to set one top-level spark-etl variable: SPARK_ETL_HOME.

    export SPARK_ETL_HOME=../../../spark-etl

Alternatively, start the main script spark_etl_main.py from the 'scripts' folder; the app will set all of its own variables appropriately if the recommended folder structure was followed.
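One way this path resolution could work (a sketch, not necessarily the app's exact logic): use SPARK_ETL_HOME when it is set, and otherwise fall back to the parent of the 'scripts' folder, which is 'spark-etl' under the recommended layout.

```python
import os

# Use SPARK_ETL_HOME if set; otherwise assume this script lives in
# spark-etl/scripts and take its parent directory as the home folder.
SPARK_ETL_HOME = os.environ.get(
    "SPARK_ETL_HOME",
    os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

CONNECTIONS_DIR = os.path.join(SPARK_ETL_HOME, "connections")
ETL_CONFIG_DIR = os.path.join(SPARK_ETL_HOME, "etl-config")
LOG_FILE = os.path.join(SPARK_ETL_HOME, "logs", "spark_etl.log")
```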
#Next Steps
* SFDC namespace as a parameter
* Serialize data into HiveMetastore tables
* Store data in csv files on AWS S3
* Load data into a relational database (PostgreSQL, MySQL)
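For the HiveMetastore and relational-database items, the Spark 1.5 DataFrame writer already offers the needed hooks. A sketch of what those targets could look like, reusing the `df` from the sketch above; the JDBC URL, table names, and credentials are placeholders, and none of this is implemented yet:

```python
# Persist the DataFrame as a HiveMetastore table
# (requires a HiveContext instead of a plain SQLContext).
df.write.mode("overwrite").saveAsTable("salesforce_account")

# Load the same DataFrame into PostgreSQL over JDBC
# (the PostgreSQL JDBC driver must be on the Spark classpath).
df.write.jdbc(url="jdbc:postgresql://localhost:5432/etl",
              table="account",
              mode="overwrite",
              properties={"user": "etl", "password": "secret"})
```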