Skip to content

lin13k/qdiff

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

qDiff - sponsored by Qventus

Overview

A tool for finding the difference between multiple data sources which should have the same value.

Heavily tested versions
Django Python 2 Python 3
1.11.15 V
2.0.5 V

Goals

  1. Reduce the efforts required to check data validation from different source.
  2. Report the differences to user.
  3. Resolve the differences for user basing on input rules.
Scenarios
  • Comparing tables within same database
  • Comparing tables from different databases
  • Comparing tables with different range of data within same/different database
  • Comparing table and CSV file
  • Comparing unordered CSV file and database

Setup locally

install on python2 Venv

One-time setup on Mac

in these steps, the bold and italic sentences are the commands for terminal

  1. install brew

    sudo xcodebuild -license
    /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
    
  2. install mysql via brew

    brew install mysql
    
  3. start mysql

    brew services start mysql
    
  4. install rabbitmq 

    brew install rabbitmq
    
  5. start rabbitmq

    /usr/local/sbin/rabbitmq-server -detached
    
  6. pull the source code

    cd ~
    git clone https://github.com/analyticsMD/datadiff.git
    
  7. create virtual environment p2env for qdiff

    cd datadiff
    python -m venv p2env
    
  8. activate the virtual environment 

    source p2env/bin/activate
    
  9. install dependency for qdiff

    pip install -r requirements.txt
    
  10. create database 

    mysql -uroot -p
    CREATE DATABASE qdiff;
    exit;
    
  11. init tables in the database

    python manage.py makemigrations
    python manage.py migrate
    
  12. start Celery workers

    celery -A qdiff worker -l info --detach
    
  13. do a test for sanity check

    python manage.py test
    
  14. run the server for demo!!

    python manage.py runserver
    
  15. check URL http://127.0.0.1:8000/

Consecutive runs

After you restart your device, you only need to do the following to launch Qdiff

  1. start rabbitmq server

    /usr/local/sbin/rabbitmq-server -detached
    
  2. activate virtual environment

    cd ~/datadiff
    source p2env/bin/activate
    
  3. start Celery workers

    celery -A qdiff worker -l info --detach
    
  4. run the django server

    python manage.py runserver
    
install on python3 Venv

One-time setup on Mac

in these steps, the bold and italic sentences are the commands for terminal

  1. install brew

    sudo xcodebuild -license
    /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
    
  2. install mysql via brew

    brew install mysql
    
  3. start mysql

    brew services start mysql
    
  4. install rabbitmq 

    brew install rabbitmq
    
  5. start rabbitmq

    /usr/local/sbin/rabbitmq-server -detached
    
  6. install python36

    brew install python
    
  7. pull the source code

    cd ~
    git clone https://github.com/analyticsMD/datadiff.git
    
  8. create virtual environment p3env for qdiff

    cd datadiff
    python3 -m venv p3env
    
  9. activate the virtual environment 

    source p3env/bin/activate
    
  10. install dependency for qdiff

    pip install -r requirements.txt
    
  11. create database 

    mysql -uroot -p
    CREATE DATABASE qdiff;
    exit;
    
  12. init tables in the database

    python manage.py makemigrations
    python manage.py migrate
    
  13. start Celery workers

    celery -A qdiff worker -l info --detach
    
  14. do a test for sanity check

    python manage.py test
    
  15. run the server for demo!!

    python manage.py runserver
    
  16. check URL http://127.0.0.1:8000/

Consecutive runs

After you restart your device, you only need to do the following to launch Qdiff

  1. start rabbitmq server

    /usr/local/sbin/rabbitmq-server -detached
    
  2. activate virtual environment

    cd ~/datadiff
    source p3env/bin/activate
    
  3. start Celery workers

    celery -A qdiff worker -l info --detach
    
  4. run the django server

    python manage.py runserver
    

Setup on EC2

detail guides

Install

detail guides
  1. install mysql
    sudo yum install mysql-server
  2. install python 3.6
    sudo yum install python36
  3. clone repository
    ssh-agent bash -c 'ssh-add /path/to/your/private/rsakey; git clone git@github.com:analyticsMD/datadiff.git'
  4. install gcc for compiling mysql-connector
    sudo yum install gcc
  5. install python-devel for compiling mysql-connector
    sudo yum install -y python36-devel
  6. install dependency
    python3 -m pip install -r datadiff/requirements.txt --user

Setup database - mysql

detail guides
  1. start mysql server
    sudo service mysqld start
  2. login into mysql with root user
    mysql -u root -p
  3. create database qdiff and qdiff_test
    mysql> create database qdiff;
    mysql> create database qdiff_test;
  4. change database password in the /qdiff/setting/settings.py and settings_test.py
  5. install the database schema
    python3 manage.py makemigrations
    python3 manage.py migrate

install rabbitmq as broker for celery

detail guides
  1. install Erlang Version 20.1
    cd /opt
    sudo wget https://github.com/rabbitmq/erlang-rpm/releases/download/v20.1.7/erlang-20.1.7-1.el6.x86_64.rpm
    sudo rpm -ivh erlang-20.1.7-1.el6.x86_64.rpm
    
  2. install Socat
    sudo yum install socat
    
  3. RabbitMQ v3.7.0
    sudo wget https://dl.bintray.com/rabbitmq/all/rabbitmq-server/3.7.0/rabbitmq-server-3.7.0-1.el6.noarch.rpm
    sudo rpm -ivh rabbitmq-server-3.7.0-1.el6.noarch.rpm
    
  4. start rabbitmq, run the command
    rabbitmq-server
    
  5. or you can start rabbitmq as service
    sudo service rabbitmq-server start
    

start celery worker, use daemon or inline cli.

detail guides
celery -A qdiff worker -l info --detach

You can also check http://docs.celeryproject.org/en/latest/userguide/daemonizing.html

Sanity test

detail guides
  1. run command
    sudo python3 manage.py test
  2. Cheers if all the test cases are successful

Deploy with Nginx, UWSGI in EC2

Please make sure you already finish this section

detail guides

pull the code into the folder /opt/datadiff/

detail guides
  1. make directory /opt/
    sudo mkdir /opt
  2. clone the code with git
    cd /opt
    git clone git@github.com:analyticsMD/datadiff.git

install uwsgi

detail guides
  1. run command
    sudo python3 -m pip install uwsgi

install nginx

detail guides
  1. run command
    sudo yum install nginx

create qdiff.conf file for nginx

detail guides
  1. run command
    sudo vim /etc/nginx/nginx.conf
  2. edit context for your requirement, this is a simple sample
    user ec2-user ec2-user;
    worker_processes auto;
    error_log /var/log/nginx/error.log;
    pid /var/run/nginx.pid;
    include /usr/share/nginx/modules/*.conf;
    
    events {
        worker_connections 1024;
    }
    http {
        server_names_hash_bucket_size 128;
        log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                          '$status $body_bytes_sent "$http_referer" '
                          '"$http_user_agent" "$http_x_forwarded_for"';
    
        access_log  /var/log/nginx/access.log  main;
    
        sendfile            on;
        tcp_nopush          on;
        tcp_nodelay         on;
        keepalive_timeout   65;
        types_hash_max_size 2048;
        include             /etc/nginx/mime.types;
        default_type        application/octet-stream;
        include /etc/nginx/conf.d/*.conf;
        include /etc/nginx/sites-enable/*;
        index   index.html index.htm;
    }
    

create qdiff_nginx.conf for nginx

detail guides
  1. make two directories, /etc/nginx/sites-enable/ and /etc/nginx/sites-available/
    mkdir /etc/nginx/sites-enable/
    mkdir /etc/nginx/sites-available/
  2. create qdiff_nginx.conf in /etc/nginx/sites-available/
    sudo vim /etc/nginx/sites-available/qdiff_nginx.conf
  3. edit content as follow
    # the upstream component nginx needs to connect to
    upstream django {
        server unix:///opt/datadiff/qdiff.sock; # for a file socket
        # server 127.0.0.1:8001; # for a web port socket
    }
    
    # configuration of the server
    server {
        # the port your site will be served on
        listen      8000;
        # the domain name it will serve for
        server_name ec2-54-183-250-158.us-west-1.compute.amazonaws.com; # substitute your machine's IP address or FQDN
        charset     utf-8;
    
        # max upload size
        client_max_body_size 1000M;   # adjust to taste
    
        # Django media
        location /media  {
            alias /opt/datadiff/media;  # your Django project's media files - amend as required
        }
        location /static {
            alias /opt/datadiff/static; # your Django project's static files - amend as required
        }
    
        # Finally, send all non-media requests to the Django server.
        location / {
            uwsgi_pass  django;
            include     /opt/datadiff/uwsgi_params; # the uwsgi_params file you installed
        }
    }
    
    

make sure the permissions of the folder /var/lib/nginx/tmp/client_body/ is readable

detail guides
  1. check the read write permissions
    ls -l /var/lib/nginx/tmp/
    it should return
    drwxr-xr-x 2 ec2-user ec2-user 4096 Jul 31 19:36 client_body
    drwx------ 2 ec2-user ec2-user 4096 Jul 24 02:06 fastcgi
    drwx------ 2 ec2-user ec2-user 4096 Jul 24 02:06 proxy
    drwx------ 2 ec2-user ec2-user 4096 Jul 24 02:06 scgi
    drwx------ 7 ec2-user ec2-user 4096 Jul 31 19:36 uwsgi
    

start the qdiff

detail guides
  1. make sure the rabbit mq already start

  2. start the nginx service

    sudo service nginx start
  3. start uwsgi worker

    cd /opt/datadiff
    uwsgi --socket qdiff.sock --module setting.wsgi --chmod-socket=664 --daemonize uwsgi.log

start and test your machine

detail guides
  1. access URL host:port/tasks to check if it works or not

    for example: http://ec2-xx-xx-xx-xx.us-west-1.compute.amazonaws.com/tasks

  2. if not working, check following files for debug

    /var/log/nginx/error.log    #nginx reverse server log
    /opt/datadiff/uwsgi.log     #uwsgi server log
    /opt/datadiff/dev.log       #qdiff log

System architecture

Architecture

details When input databases as datasources, users need to generate a database access token first. The database access token is an encrypted JSON file which contains the database setting for Django.

Components

details

Data reader

  • Using Django DB cursor to read the data
  • Supporting multiple databases and file sources

file reader

  • Support CSV excel

Comparators

Field Comparators

A wrapper for package tableschema

  • It will query setting.settings.SCHEMA_INFER_LIMIT number records and infer the schema from them.
  • The setting.settings.SCHEMA_INFER_CONFIDENCE permits the minor occurance of abnomral, default is 1.00.
  • The setting.settings.SCHEMA_CSV_MISSING_VALUES and SCHEMA_DATABASE_MISSING_VALUES is used to configurate what should be consider as missing.
Value Comparators
  1. Variable definitions

    data1, data2: the input data, can be queryset, list, dictionary

    item1, item2: the elements from data1 and data2

    dict1, dict2: the list for saving unmatch records

  2. steps

    1. Sort the data1 and data2
    2. Iterate over data1 and data2 at same time, item1 comes from data1 and item2 comes from data2
      1. If the item1 is identical to item2, iterate next item pair
      2. Else if the item1 in dict2, save the all elements in dict2 except item1 as conflicted results, iterate next item1
      3. Else if the item2 in dict1, save the all elements in dict1 except item2 as conflicted results, iterate next item2
      4. Else, the item1 is different from the item2, put item1 in dict1 and item2 in dict2, iterate next item pair
  3. Complexity.

    given m,n = len(data1),len(data2)

    Time complexity:

     O(m+n)
    

    Space complexity:

     O(m+n)
    
  4. pseudo code in python

    qDiff(data1, data2):
        iter1 = iter(sorted(data1))
        iter2 = iter(sorted(data2))
        temp_dict1 ={}
        temp_dict2 ={}
        item1 = None
        item2 = None
        try:
            while True:
                item1 = next(iter1)
                item2 = next(iter2)
                h1 = hash(item1)
                h2 = hash(item2)
                if h1==h2:
                    item1 = None
                    item2 = None
                    continue
                elif h1 in temp_dict2 or h2 in temp_dict1:
                    if h1 in temp_dict2:
                        temp_dict2.pop(h1)
                        saveToConflictedResult(temp_dict2.values())
                    if h2 in temp_dict1:
                        temp_dict1.pop(h2) 
                        saveToConflictedResult(temp_dict1.values())
                else:
                    temp_dict1[h1]=item1
                    temp_dict2[h2]=item2
                item1 = None
                item2 = None
        except StopIteration as e:
            if not item1:
                saveToConflictedResult(list(iter2))
            else:
                saveToConflictedResult([item1] + list(iter1))

Report viewer

  • Providing the comparison result
  • GUI for accepting rules for resolving ((Not implemented yet))

Rule parser (Not implemented yet)

  • Parsing the input rules and save as rule set for reuse

  • Rules:

      Where to write the resolved result
      Left join, right join, inner join, and outer join
      Condition based rule, (E.X. when field1 == 0 and field2 > 3)
    

Conflict resolver (Not implemented yet)

  • Filtering the conflicted results basing on the input rules

ERD

Entities
  1. Task

    • Information about the data source
    • Uploaded file path
    • Database information (This will not contain password)
    • Datetime, Recording the start time and end time for performance evaluation
    • Owner (Not implemented yet)
  2. Conflict record

    • Raw data
    • What source it belongs to
    • The name of raw table
  3. Raw Table

    • Fields here are dynamics, this table schema is depending on the schema of given datasources
  4. Report

    • Which generator will process the data
    • Generated file path
    • Parameters for the generator, in JSON format
  5. Rule set (Not implemented yet)

    • Name
    • Description
    • Rule, formatted rules in json format

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published