
gdc-data-exercise

Small Python app to clean CSV files and insert them into a PostgreSQL database.

Time spent: 5 hours

Prerequisites

You need Docker (19.03.0 or later) and Docker Compose to run this app. See the official Docker documentation for installation instructions.

Installing

git clone https://github.com/nicolazg/gdc-data-exercise.git
cd gdc-data-exercise/
docker-compose build

Running

docker-compose up app

Technical choices

PostgreSQL

  • open-source object-relational database
  • high concurrency, ACID compliance
  • adheres more closely to the SQL standard

SQLAlchemy

I chose a more generic abstraction layer at the cost of some performance; as a result, it is easier to change the database provider. Nevertheless, using the chunksize parameter of the pandas to_sql method approximates a bulk insert and yields acceptable performance, as sketched below.
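As an illustration, here is a minimal sketch of that pattern; the connection string, CSV file, and table name are assumptions, not the project's actual values.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string, CSV file, and table name.
engine = create_engine("postgresql://user:password@localhost:5432/gdc")
df = pd.read_csv("users.csv")

# chunksize batches the generated INSERT statements, which approximates
# a bulk insert while staying provider-agnostic through SQLAlchemy.
df.to_sql("users", engine, if_exists="append", index=False, chunksize=1000)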

Functional choices

I chose to store users' connections in a separate table (and dropped the misc column from the user table) to make this information easier to exploit. However, this split takes a lot of time and makes the processing more complicated. A sketch of the resulting schema follows.
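A minimal sketch of what such a split could look like with SQLAlchemy declarative models; every table and column name here is illustrative, not the project's actual schema.

from sqlalchemy import Column, DateTime, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class User(Base):
    # Hypothetical user table; the original misc column is dropped.
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    connections = relationship("Connection", back_populates="user")

class Connection(Base):
    # One row per user connection, extracted from the former misc data.
    __tablename__ = "connections"
    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey("users.id"))
    logged_at = Column(DateTime)
    user = relationship("User", back_populates="connections")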

Data science part

I didn't have enough time to complete this part as I had planned.

My initial goal was to analyze the time between the publication of an ad and its transaction, in order to detect when a user should be alerted to bump their ad back to the top of the listings.

Unfortunately, I only produced a graph showing the distribution of this delta by category (as shown below).

[Chart: distribution of the publication-to-transaction delta by category]
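For reference, a minimal sketch of how such a delta could be computed and plotted with pandas and matplotlib; the file and column names (ads.csv, published_at, sold_at, category) are assumptions, not the project's actual data model.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names for illustration.
ads = pd.read_csv("ads.csv", parse_dates=["published_at", "sold_at"])

# Days elapsed between publication and transaction.
ads["delta_days"] = (ads["sold_at"] - ads["published_at"]).dt.days

# Distribution of the delta per category, as a box plot.
ads.boxplot(column="delta_days", by="category")
plt.ylabel("days between publication and transaction")
plt.show()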

Testing

Install pytest 6.1.2, then run pytest at the root of the project to launch the few existing tests.
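For example:

pip install pytest==6.1.2
pytest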

Next steps and optimizations

  • Adding more tests (not enough time)
  • Speeding up the application
    • optimizing some data-processing steps
    • using a native database driver and bulk inserts (see the sketch below)
  • Adding security to the database (schema)
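For the bulk-insert idea above, a minimal sketch using psycopg2's execute_values; the connection parameters and users table are hypothetical, and psycopg2 is only one possible driver choice, not the project's current code.

import psycopg2
from psycopg2.extras import execute_values

# Hypothetical connection parameters and target table.
conn = psycopg2.connect("dbname=gdc user=user password=password host=localhost")

rows = [("alice", "2020-01-01"), ("bob", "2020-01-02")]

with conn, conn.cursor() as cur:
    # execute_values folds many rows into a single INSERT statement,
    # avoiding the per-row round trips of ORM-level inserts.
    execute_values(cur, "INSERT INTO users (name, created_at) VALUES %s", rows)

conn.close()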
