qDiff - sponsored by Qventus

Overview

A tool for finding the difference between multiple data sources which should have the same value.

Heavily tested versions

Django	Python 2	Python 3
1.11.15	V
2.0.5		V

Goals

Reduce the efforts required to check data validation from different source.
Report the differences to user.
Resolve the differences for user basing on input rules.

Scenarios

Comparing tables within same database
Comparing tables from different databases
Comparing tables with different range of data within same/different database
Comparing table and CSV file
Comparing unordered CSV file and database

Setup locally

install on python2 Venv

One-time setup on Mac

in these steps, the bold and italic sentences are the commands for terminal

install brew

sudo xcodebuild -license
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

install mysql via brew
```
brew install mysql
```
start mysql
```
brew services start mysql
```
install rabbitmq
```
brew install rabbitmq
```

start rabbitmq

/usr/local/sbin/rabbitmq-server -detached

pull the source code

cd ~
git clone https://github.com/analyticsMD/datadiff.git

create virtual environment p2env for qdiff
```
cd datadiff
python -m venv p2env
```
activate the virtual environment
```
source p2env/bin/activate
```
install dependency for qdiff
```
pip install -r requirements.txt
```

create database

mysql -uroot -p
CREATE DATABASE qdiff;
exit;

init tables in the database

python manage.py makemigrations
python manage.py migrate

start Celery workers

celery -A qdiff worker -l info --detach

do a test for sanity check
```
python manage.py test
```
run the server for demo!!
```
python manage.py runserver
```
check URL http://127.0.0.1:8000/

Consecutive runs

After you restart your device, you only need to do the following to launch Qdiff

start rabbitmq server

/usr/local/sbin/rabbitmq-server -detached

activate virtual environment

cd ~/datadiff
source p2env/bin/activate

start Celery workers

celery -A qdiff worker -l info --detach

run the django server
```
python manage.py runserver
```

install on python3 Venv

One-time setup on Mac

in these steps, the bold and italic sentences are the commands for terminal

install brew

sudo xcodebuild -license
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

install mysql via brew
```
brew install mysql
```
start mysql
```
brew services start mysql
```
install rabbitmq
```
brew install rabbitmq
```

start rabbitmq

/usr/local/sbin/rabbitmq-server -detached

install python36
```
brew install python
```

pull the source code

cd ~
git clone https://github.com/analyticsMD/datadiff.git

create virtual environment p3env for qdiff
```
cd datadiff
python3 -m venv p3env
```
activate the virtual environment
```
source p3env/bin/activate
```
install dependency for qdiff
```
pip install -r requirements.txt
```

create database

mysql -uroot -p
CREATE DATABASE qdiff;
exit;

init tables in the database

python manage.py makemigrations
python manage.py migrate

start Celery workers

celery -A qdiff worker -l info --detach

do a test for sanity check
```
python manage.py test
```
run the server for demo!!
```
python manage.py runserver
```
check URL http://127.0.0.1:8000/

Consecutive runs

After you restart your device, you only need to do the following to launch Qdiff

start rabbitmq server

/usr/local/sbin/rabbitmq-server -detached

activate virtual environment

cd ~/datadiff
source p3env/bin/activate

start Celery workers

celery -A qdiff worker -l info --detach

run the django server
```
python manage.py runserver
```

Setup on EC2

detail guides

Install

detail guides

install mysql
```
sudo yum install mysql-server
```
install python 3.6
```
sudo yum install python36
```

clone repository

ssh-agent bash -c 'ssh-add /path/to/your/private/rsakey; git clone git@github.com:analyticsMD/datadiff.git'

install gcc for compiling mysql-connector
```
sudo yum install gcc
```
install python-devel for compiling mysql-connector
```
sudo yum install -y python36-devel
```

install dependency

python3 -m pip install -r datadiff/requirements.txt --user

Setup database - mysql

detail guides

start mysql server
```
sudo service mysqld start
```
login into mysql with root user
```
mysql -u root -p
```

create database qdiff and qdiff_test

mysql> create database qdiff;
mysql> create database qdiff_test;

change database password in the /qdiff/setting/settings.py and settings_test.py

install the database schema

python3 manage.py makemigrations
python3 manage.py migrate

install rabbitmq as broker for celery

detail guides

install Erlang Version 20.1

cd /opt
sudo wget https://github.com/rabbitmq/erlang-rpm/releases/download/v20.1.7/erlang-20.1.7-1.el6.x86_64.rpm
sudo rpm -ivh erlang-20.1.7-1.el6.x86_64.rpm

install Socat
```
sudo yum install socat
```

RabbitMQ v3.7.0

sudo wget https://dl.bintray.com/rabbitmq/all/rabbitmq-server/3.7.0/rabbitmq-server-3.7.0-1.el6.noarch.rpm
sudo rpm -ivh rabbitmq-server-3.7.0-1.el6.noarch.rpm

start rabbitmq, run the command
```
rabbitmq-server
```
or you can start rabbitmq as service
```
sudo service rabbitmq-server start
```

start celery worker, use daemon or inline cli.

detail guides

celery -A qdiff worker -l info --detach

You can also check http://docs.celeryproject.org/en/latest/userguide/daemonizing.html

Sanity test

detail guides

run command
```
sudo python3 manage.py test
```
Cheers if all the test cases are successful

Deploy with Nginx, UWSGI in EC2

Please make sure you already finish this section

detail guides

pull the code into the folder /opt/datadiff/

detail guides

make directory /opt/
```
sudo mkdir /opt
```

clone the code with git

cd /opt
git clone git@github.com:analyticsMD/datadiff.git

install uwsgi

detail guides

run command
```
sudo python3 -m pip install uwsgi
```

install nginx

detail guides

run command
```
sudo yum install nginx
```

create qdiff.conf file for nginx

detail guides

run command
```
sudo vim /etc/nginx/nginx.conf
```

edit context for your requirement, this is a simple sample

user ec2-user ec2-user;
worker_processes auto;
error_log /var/log/nginx/error.log;
pid /var/run/nginx.pid;
include /usr/share/nginx/modules/*.conf;

events {
    worker_connections 1024;
}
http {
    server_names_hash_bucket_size 128;
    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log  main;

    sendfile            on;
    tcp_nopush          on;
    tcp_nodelay         on;
    keepalive_timeout   65;
    types_hash_max_size 2048;
    include             /etc/nginx/mime.types;
    default_type        application/octet-stream;
    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/sites-enable/*;
    index   index.html index.htm;
}

create qdiff_nginx.conf for nginx

detail guides

make two directories, /etc/nginx/sites-enable/ and /etc/nginx/sites-available/
```
mkdir /etc/nginx/sites-enable/
mkdir /etc/nginx/sites-available/
```
create qdiff_nginx.conf in /etc/nginx/sites-available/
```
sudo vim /etc/nginx/sites-available/qdiff_nginx.conf
```

edit content as follow

# the upstream component nginx needs to connect to
upstream django {
    server unix:///opt/datadiff/qdiff.sock; # for a file socket
    # server 127.0.0.1:8001; # for a web port socket
}

# configuration of the server
server {
    # the port your site will be served on
    listen      8000;
    # the domain name it will serve for
    server_name ec2-54-183-250-158.us-west-1.compute.amazonaws.com; # substitute your machine's IP address or FQDN
    charset     utf-8;

    # max upload size
    client_max_body_size 1000M;   # adjust to taste

    # Django media
    location /media  {
        alias /opt/datadiff/media;  # your Django project's media files - amend as required
    }
    location /static {
        alias /opt/datadiff/static; # your Django project's static files - amend as required
    }

    # Finally, send all non-media requests to the Django server.
    location / {
        uwsgi_pass  django;
        include     /opt/datadiff/uwsgi_params; # the uwsgi_params file you installed
    }
}

make sure the permissions of the folder /var/lib/nginx/tmp/client_body/ is readable

detail guides

check the read write permissions

ls -l /var/lib/nginx/tmp/

it should return

drwxr-xr-x 2 ec2-user ec2-user 4096 Jul 31 19:36 client_body
drwx------ 2 ec2-user ec2-user 4096 Jul 24 02:06 fastcgi
drwx------ 2 ec2-user ec2-user 4096 Jul 24 02:06 proxy
drwx------ 2 ec2-user ec2-user 4096 Jul 24 02:06 scgi
drwx------ 7 ec2-user ec2-user 4096 Jul 31 19:36 uwsgi

start the qdiff

detail guides

make sure the rabbit mq already start
start the nginx service
```
sudo service nginx start
```

start uwsgi worker

cd /opt/datadiff
uwsgi --socket qdiff.sock --module setting.wsgi --chmod-socket=664 --daemonize uwsgi.log

start and test your machine

detail guides

access URL host:port/tasks to check if it works or not

for example: http://ec2-xx-xx-xx-xx.us-west-1.compute.amazonaws.com/tasks

if not working, check following files for debug

/var/log/nginx/error.log    #nginx reverse server log
/opt/datadiff/uwsgi.log     #uwsgi server log
/opt/datadiff/dev.log       #qdiff log

System architecture

Architecture

details

When input databases as datasources, users need to generate a database access token first. The database access token is an encrypted JSON file which contains the database setting for Django.

Components

details

Data reader

Using Django DB cursor to read the data
Supporting multiple databases and file sources

file reader

Support CSV excel

Comparators

Field Comparators

A wrapper for package tableschema

It will query setting.settings.SCHEMA_INFER_LIMIT number records and infer the schema from them.
The setting.settings.SCHEMA_INFER_CONFIDENCE permits the minor occurance of abnomral, default is 1.00.
The setting.settings.SCHEMA_CSV_MISSING_VALUES and SCHEMA_DATABASE_MISSING_VALUES is used to configurate what should be consider as missing.

Value Comparators

Variable definitions

data1, data2: the input data, can be queryset, list, dictionary

item1, item2: the elements from data1 and data2

dict1, dict2: the list for saving unmatch records
steps
1. Sort the data1 and data2
2. Iterate over data1 and data2 at same time, item1 comes from data1 and item2 comes from data2
  1. If the item1 is identical to item2, iterate next item pair
  2. Else if the item1 in dict2, save the all elements in dict2 except item1 as conflicted results, iterate next item1
  3. Else if the item2 in dict1, save the all elements in dict1 except item2 as conflicted results, iterate next item2
  4. Else, the item1 is different from the item2, put item1 in dict1 and item2 in dict2, iterate next item pair
Complexity.

given m,n = len(data1),len(data2)

Time complexity:
```
 O(m+n)
```
Space complexity:
```
 O(m+n)
```

pseudo code in python

qDiff(data1, data2):
    iter1 = iter(sorted(data1))
    iter2 = iter(sorted(data2))
    temp_dict1 ={}
    temp_dict2 ={}
    item1 = None
    item2 = None
    try:
        while True:
            item1 = next(iter1)
            item2 = next(iter2)
            h1 = hash(item1)
            h2 = hash(item2)
            if h1==h2:
                item1 = None
                item2 = None
                continue
            elif h1 in temp_dict2 or h2 in temp_dict1:
                if h1 in temp_dict2:
                    temp_dict2.pop(h1)
                    saveToConflictedResult(temp_dict2.values())
                if h2 in temp_dict1:
                    temp_dict1.pop(h2) 
                    saveToConflictedResult(temp_dict1.values())
            else:
                temp_dict1[h1]=item1
                temp_dict2[h2]=item2
            item1 = None
            item2 = None
    except StopIteration as e:
        if not item1:
            saveToConflictedResult(list(iter2))
        else:
            saveToConflictedResult([item1] + list(iter1))

Report viewer

Providing the comparison result
GUI for accepting rules for resolving ((Not implemented yet))

Rule parser (Not implemented yet)

Parsing the input rules and save as rule set for reuse

Rules:

  Where to write the resolved result
  Left join, right join, inner join, and outer join
  Condition based rule, (E.X. when field1 == 0 and field2 > 3)

Conflict resolver (Not implemented yet)

Filtering the conflicted results basing on the input rules

ERD

Entities

Task
- Information about the data source
- Uploaded file path
- Database information (This will not contain password)
- Datetime, Recording the start time and end time for performance evaluation
- Owner (Not implemented yet)
Conflict record
- Raw data
- What source it belongs to
- The name of raw table
Raw Table
- Fields here are dynamics, this table schema is depending on the schema of given datasources
Report
- Which generator will process the data
- Generated file path
- Parameters for the generator, in JSON format
Rule set (Not implemented yet)
- Name
- Description
- Rule, formatted rules in json format

Name		Name	Last commit message	Last commit date
Latest commit History 224 Commits
diagrams		diagrams
qdiff		qdiff
setting		setting
.gitignore		.gitignore
README.html		README.html
README.md		README.md
TODO.md		TODO.md
manage.py		manage.py
qdiff.png		qdiff.png
requirements.txt		requirements.txt
uwsgi.ini		uwsgi.ini
uwsgi_params		uwsgi_params

lin13k/qdiff

Folders and files

Latest commit

History

Repository files navigation

qDiff - sponsored by Qventus

Overview

Goals

Setup locally

One-time setup on Mac

Consecutive runs

One-time setup on Mac

Consecutive runs

Setup on EC2

Install

Setup database - mysql

install rabbitmq as broker for celery

start celery worker, use daemon or inline cli.

Sanity test

Deploy with Nginx, UWSGI in EC2

pull the code into the folder /opt/datadiff/

install uwsgi

install nginx

create qdiff.conf file for nginx

create qdiff_nginx.conf for nginx

make sure the permissions of the folder /var/lib/nginx/tmp/client_body/ is readable

start the qdiff

start and test your machine

System architecture

Architecture

Components

Data reader

file reader

Comparators

Field Comparators

Value Comparators

Report viewer

Rule parser (Not implemented yet)

Conflict resolver (Not implemented yet)

ERD

About

Resources

Stars

Watchers

Forks

Languages