On average, 90% of data exploration time is spent on prep work. We aim to alleviate this pain by creating a normalized form of our domain objects and making them available for processing with Spark, establishing a clear contract between the data platform and all downstream consumers.
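As a rough illustration of that contract, a downstream Spark job can read a normalized domain table directly from S3. The bucket name, path layout, column names, and Parquet format below are assumptions for illustration, not a documented interface.

```python
# Hypothetical downstream consumer of the normalized domain objects.
# S3 path, Parquet format, and column names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalog-consumer").getOrCreate()

# Read the Candidates domain table written by the catalog job.
candidates = spark.read.parquet("s3a://example-data-platform/catalog/candidates/")

# Downstream computation works against the normalized schema, not raw source tables.
candidates.groupBy("current_employer").count().show()
```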
A Python-based Spark application. It can be run standalone, or packaged as an egg file and run in a hosted Spark environment.
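A minimal packaging sketch, assuming a standard setuptools layout; the package name, version, and metadata are placeholders rather than the project's actual configuration.

```python
# setup.py - minimal setuptools configuration for building the egg
# via `python setup.py bdist_egg`. Name and version are placeholders.
from setuptools import setup, find_packages

setup(
    name="catalog",
    version="0.1.0",
    packages=find_packages(),
    install_requires=open("requirements.txt").read().splitlines(),
)
```

The resulting egg under dist/ can then be attached to the hosted Spark environment, for example via spark-submit --py-files.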
After checking out the repository, create a Python virtual environment and run pip3 install -r requirements.txt
Then run python ./catalog/dao/write_catalog.py
By default, this reads data from the Postgres follower database; additional data sources can be added. It normalizes the dataset, builds the domain models, including Candidates, Jobs, and Employers, and persists them to S3.
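The sketch below outlines that flow: read from the Postgres follower over JDBC, normalize into domain DataFrames, and persist to S3. Connection details, table names, paths, and the normalization step itself are illustrative assumptions, not the actual write_catalog.py implementation.

```python
# Illustrative read -> normalize -> persist flow; connection settings,
# table names, and S3 paths are assumptions, not the real configuration.
import os

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("write-catalog").getOrCreate()

jdbc_url = "jdbc:postgresql://follower.example.internal:5432/app"
props = {
    "user": "readonly",
    "password": os.environ.get("PG_PASSWORD", ""),
    "driver": "org.postgresql.Driver",
}

# Read a raw table from the Postgres follower.
raw_candidates = spark.read.jdbc(jdbc_url, "candidates", properties=props)

# Normalize into the domain model (example: drop soft-deleted rows, trim names).
candidates = (
    raw_candidates
    .where(F.col("deleted_at").isNull())
    .withColumn("full_name", F.trim(F.col("full_name")))
)

# Persist the domain table to S3 as Parquet.
candidates.write.mode("overwrite").parquet(
    "s3a://example-data-platform/catalog/candidates/"
)
```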
TODO: Databricks API.
If you find a problem, please file an issue.