Data lake architecture POC hosted on OSIRIM (https://osirim.irit.fr/site/)

This repository is the main working repository for the data lake architecture. The goals are :

Create an opensource architecture for data management based on data lake architecture
Handle any type of data
Handle any volumetry of datas
Make it easy to deploy the architecture
Make an interoperable architecture through the usage of REST API whenever it is possible

Table of content

The project
- Data exchanges between services
TODO
- Development on the project
- Development for the project
Other informations
Contacts

Context

This project is supported by neOCampus, OSIRIM, the CNRS, the IRIT and the SMAC team in IRIT.

This project has been started with a internship for 2nd year of Master in Statistic and Decisional Computing (SID at Université Toulouse 3 - Paul-Sabatier in Toulouse) funded by neOCampus. The project has been continued through a 1year-fixed-term contract by the CNRS and the OSIRIM platform.

NeOCampus is an operation based in the University with numerous research laboratory. A big part of the network used in this operation is used by sensors and effector. But we can find a lot of other kind of informations. The goal of this project is to create an architecture of data lake that can handle the needs and the data in neOCampus operation.
OSIRIM (Observatoire des Systèmes d'Indexation et de Recherche d'Information Multimedia) is one of IRIT's platforms. It is mainly supported by the European Regional Development Fund (ERDF), the French government, the Midi-Pyrénées region and the Centre National de la Recherche Scientifique (CNRS).
The SMAC team is interested in modeling and problem solving in complex systems using multi-agen technology. ( https://www.irit.fr/departement/intelligence-collective-interaction/equipe-smac/ )

Name		Name	Last commit message	Last commit date
Latest commit History 140 Commits
Openstack/swift		Openstack/swift
apache_airflow		apache_airflow
dataset/input_file_test		dataset/input_file_test
frontend		frontend
git_image		git_image
influxdb		influxdb
mongodb		mongodb
neo4j		neo4j
.gitignore		.gitignore
CODEOWNERS		CODEOWNERS
README.md		README.md
Toward_a_datalake.pdf		Toward_a_datalake.pdf
python_test_script.py		python_test_script.py
requirements.txt		requirements.txt
step_to_reproduce.md		step_to_reproduce.md

	Swift	Metadata `MongoDB`	Airflow	Airflow `Jobs`	Neo4J `"Gold" zone`	InfluxDB`"Gold" zone`	Mongodb`"Gold" zone`
Todo
Working

truongdx5/docker_datalake

Folders and files

Latest commit

History

Repository files navigation

Data lake architecture POC hosted on OSIRIM (https://osirim.irit.fr/site/)

Table of content

Context

The project

Concept

Current architecture

Areas description

Raw data management area or landing area

Metadata management area

Process area

Consumption zone or processed data area or gold zone

Services area

Security and monitoring area

Services available

Tools used

Diagrams

Activity diagram

Data integration activity diagram for Apache Airflow

How to

Insert a new data

Process a data already inserted

Access to services

TCP Ports used

API descrption

Deploy the architecture

Create an Airflow job

Create an Airflow pipeline

Integrate a new process pipeline in Airflow

Problems already encountered

Data formats in

Openstack Swift

MongoDB metadata database

Data exchanges between services

TODO

Development on the project

Development for the projet

Around the architecture

Documentation

What to change for a production deployment

How to go further

Neo4J as image recommendation system

Other information

Used tools

MsSQL 20xx

Tools not used

In the input area

In process area

In processed data area

Start Openstack Swift docker container

More documentation

Licence

Contacts

About

Resources

Stars

Watchers

Forks

Languages