Skip to content

wllacer/dana-cube

Repository files navigation

dana-cube

Dana-cube is a tool to automate the design, execution and visualization of cross reference queries, aka pivot tables aka multidimensional aggregate queries.

*Go to the english version of the rationale

Dana-cube es una herramienta para automatizar el diseño, ejecución y visualización de consultas de referencias cruzadas aka tablas pivote aka consultas agregadas multidimensionales

*Ir a la descripción de objetivos en Español

Tabla de contenidos / Table of contents

Objetivos en Español

Cual es el problema:

Una herramienta que me permita resolver, quizas el más común de los problemas con bases de datos SQL: resolver una agregación (suma, promedio, ...) por dos o mas parámetros y presentarlos en formato matricial (hoja de calculo) Y de forma dinámica, es decir que no necesite estar programada previamente.

Un ejemplo canónico sería de consulta de este tipo sería dame la suma total de ventas de un periodo desglosada por linea de producto y ámbito territorial y lo de dinámica que ahora pueda cambiar el cruce a vendedor y rating financiero del cliente sin necesidad de que este cruce este previamente creado.

Primero querría explicar el por qué es útil una herramienta especifica para ello. Veamoslo con un ejemplo. Imaginen una base de datos de resultados electorales y lo que queremos es una consulta del tipo

Dame la suma de votos cruzados por partido y distrito electoral.

En teoría esto es una consulta muy sencilla en SQL

    select partido_id,provincia,sum(votes_presential)
    from votos_locales
    group by partido_id,provincia

Que devuelve los resultados de esta forma

"993"	"12"	"2134"
"993"	"46"	"7101"
"993"	"47"	"1570"
"1033"	"23"	"1188"
"1070"	"24"	"1027"
"1079"	"01"	"10468"
"1079"	"02"	"33676"
"1079"	"03"	"154037"
"1079"	"04"	"44320"
"1079"	"05"	"15892"
"1079"	"06"	"45252"
"1079"	"07"	"71446"
....

Toda la información está aquí, pero un analista resulta mejor servido si los datos pudieran ser presentados de format matricial (partidos como columnas, distritos como filas) y, si es posible, con información mas legible, por ejemplo como


                               C's       EH Bildu        EAJ-PNV             PP           PSOE        PODEMOS           GBAI        CCa-PNC        IU-UPeC   ...
Grand Total               3,500,541        218,467        301,585      7,215,752      5,530,779      3,182,082         30,554         81,750        923,133  
Araba/Álava                  10,468         21,179         28,297         33,609         25,293         48,265                                        6,794                                                                                                         
Albacete                     33,676                                       85,152         65,074         32,155                                        9,277                                                                                                         
Alicante/Alacant            154,037                                      296,709        188,367                                                      33,293
Almería                      44,320                                      117,407         89,022         39,482                                       10,776                                                                                                         
Ávila                        15,892                                       46,963         20,129         11,863                                        3,878                                                                                                         
Badajoz                      45,252                                      137,501        148,347         47,203                                       12,281                                                                                                         
Balears, Illes               71,446                                      140,542         88,542        111,416                                       11,434
Barcelona                   386,143                                      321,268        463,612                                                            
Burgos                       33,373                                       81,780         44,488         36,612                                       10,099                                                                                                         
Cáceres                      28,293                                       87,729         84,532         34,552                                        7,216                                                                                                         
Cádiz                        94,707                                      179,054        180,667        130,215                                       38,798                                                                                                         
Castellón/Castelló           48,220                                       98,341         66,450                                                       9,565
...

Qué alternativas hay

Desgraciadamente este tipo de presentación NO esta disponible directamente en la mayoría de programas de consulta. Algunos productos (como Oracle y MSSQL) ofrecen soluciones privadas para generar este tipo de consultas, pero casi nunca están disponibles en las herramientas generales.

En la mayoría de las aplicaciones existen programas de informes que realizan algun tipo de operación de este tipo, pero carecen de flexibilidad lo que se puede consultar y cruzar esta predefindo por la aplicación original y no puede crearse un nuevo cruce sobre la marcha.

Existen una serie de productos en el mercado (conocidos como data warehouse, business inteligence, ...) que permiten estas consultas dinámicas, y muchas mas cosas sobre ello. Normalmente estan vinculados a una base de datos o un tipo de problema concreto, y tienden a ser muy complejas y costosas en recursos (técnicos y de personal, al menos). Lo que nosotros queremos es una herramienta simple, y limitada al problema de las referencias cruzadas

Las consultas de referencias cruzadas de MS Access y las tablas Pivot en muchas hojas de cálculo ofrecen la funcionalidad. Y he visto algunas soluciones extremadamente interesantes con ellas. Pero el mecanismo exige que el usuario trabaje sobre la estructura interna de la base de datos, que puede ser muy compleja.

Además, la estabilidad, mantenibilidad, seguridad y las posibilidades de distribuir estas soluciones sobre Access y/o Excel son muy problemáticas. Mi experiencia con el rendimiento de los enlaces via ODBC es menor que buena. Y como antiguo administrador, "enganchar" a una base de datos corporativa "real", una herramienta tan incontrolable como es la interfaz ODBC, es una pesadilla desde el punto de vista de la seguridad y, probablemente también, del rendimiento.

Que ofrecemos

ofrecemos un programa danacube.py para que los usuarios puedan ejectuar y visualizar interactivamente las distintas consultas cruzadas sobre una base de datos.

Screenshot

Como se puede ver en el ejemplo, no referenciamos directamente a la base de datos subyacente, sino a una abstracción Cada instancia se ejecuta contra lo que denominamos un Cubo esto es la visión de una tabla de datos (o estructura equivalente), los campos (fields) sobre los que queremos hacer las agregaciones y los criterios de agrupación (que denominamos guias). Estos criterios pueden ser campos escalares o agrupaciones jerarquizadas. Si las guias son fechas, automaticamente ofrecemos la posibilidad de jeraquizarlos (por años, años-mes, ...). Cada consulta cruzada entre dos guias lo que denominamos Vista.

Las definiciones de estos Cubos se hacen a través de ficheros de texto (json_), con lo que los posibles accesos a la base de datos están controlados externamente.

Y el acceso se realiza internamente a través de una API con la intención de simplificar la generación de estas matrices / referencias cruzadas; permitiendo, además integrarlas en otras aplicaciones . A traves de esta interfaz, la generación del ejemplo anterior puede reducirse al siguiente código (sin formateo)

from dana-cube.support.util.jsonmgr import load_cubo
from dana-cube.core import *

mis_cubos = load_cubo()
cubo = Cubo(mis_cubos["datos light"])
vista = Vista(cubo,'provincia','partidos importantes','sum','votes_presential',totalizado=True)
resultado = vista.toList()
for linea in resultado:
    print(linea)

Un usuario final de la herramienta no puede definir (ni utilizar) nada que no se encuentre en el fichero de configuración. ¿Por qué? Una razón es permitir que el usuario trabaje con una visión de los datos que no tiene por que corresponder con la estructura interna de la base de datos, sino a un modelo conceptual mas cercano a su visión de los datos. Por otro lado, es un modo para los DBAs para limitar que esta disponible para consultas dentro de la base de datos, mas alla de la seguridad intrinseca de la base de datos

Danacube.py permite generar dinámicamente una serie de gráficos sobre los datos presentados y exportar los resultados a otras herramientas para un análisis posterior, mas detallado. O a traves de extensiones de código (vea como hacerlo ) crear sus propias herramientas de análisis

Además ofrecemos los programas siguientes

  • cubebrowse.py Diseñado para manipular las definiciones de los cubos. Aunque al ser un fichero json pueden ser editados a mano en caso de necesidad

Screenshot

  • danabrowse.py Para navegar por el contenido de los servidores de base de datos en nuestro entoro y, si se considera conveniente, generar directamente definiciones de cubos desde el catalogo de las bases de datos database

Screenshot

  • danaquery.py Una herramienta muy simple para ejecutar consultas SQL en nuestro entorno

Screenshot

Donde funciona

La herramienta está programada con python3 + PyQt5 La infraestructura es agnósstica respecto del gestor de base de datos. Para ello utilizamos SqlAlchemy como "data backend", de modo que, en teoría cualquier gestor accesible para ella puede ser utilizado con dana-cube; aunque no es descartable que requiera de pequeños ajustes

Rationale in English

Table of Contents

What's the problem:

Probably the most common problem an SQL database users is to solve the need to resolve an aggregate (sum, aver,...) by two or more parameters and show them in array (spreadsheet) form.

An example will explain it. Supose a database with electoral results, and what whe need is a query of the type Give me the sum of votes per party and and per electoral district .In SQL this is a very simple query, like

    select partido_id,provincia,sum(votes_presential)
    from votos_locales
    group by partido_id,provincia

and gives results in the form

"993"	"12"	"2134"
"993"	"46"	"7101"
"993"	"47"	"1570"
"1033"	"23"	"1188"
"1070"	"24"	"1027"
"1079"	"01"	"10468"
"1079"	"02"	"33676"
"1079"	"03"	"154037"
"1079"	"04"	"44320"
"1079"	"05"	"15892"
"1079"	"06"	"45252"
"1079"	"07"	"71446"
....

While all the info is there, it's clear that an analyst is better server if data could be shown in an array format (parties as rows, electoral districts as columns) and with some additional info, more or less like


                               C's       EH Bildu        EAJ-PNV             PP           PSOE        PODEMOS           GBAI        CCa-PNC        IU-UPeC   ...
Grand Total               3,500,541        218,467        301,585      7,215,752      5,530,779      3,182,082         30,554         81,750        923,133  
Araba/Álava                  10,468         21,179         28,297         33,609         25,293         48,265                                        6,794                                                                                                         
Albacete                     33,676                                       85,152         65,074         32,155                                        9,277                                                                                                         
Alicante/Alacant            154,037                                      296,709        188,367                                                      33,293
Almería                      44,320                                      117,407         89,022         39,482                                       10,776                                                                                                         
Ávila                        15,892                                       46,963         20,129         11,863                                        3,878                                                                                                         
Badajoz                      45,252                                      137,501        148,347         47,203                                       12,281                                                                                                         
Balears, Illes               71,446                                      140,542         88,542        111,416                                       11,434
Barcelona                   386,143                                      321,268        463,612                                                            
Burgos                       33,373                                       81,780         44,488         36,612                                       10,099                                                                                                         
Cáceres                      28,293                                       87,729         84,532         34,552                                        7,216                                                                                                         
Cádiz                        94,707                                      179,054        180,667        130,215                                       38,798                                                                                                         
Castellón/Castelló           48,220                                       98,341         66,450                                                       9,565
...

Sadly, this is usually NOT available in most data query programs. Some DB products (fi. Oracle and MSSQL) offer their own private means of generate such queries, but they're not always available in general tools. MS Access cross reference queries or Pivot Tables available in several spreadsheet programs, DO offer this functionality,I've seen quite a few more than interesting.

But, stability, maintainability, and the chance of distributing it, is another story. My own experience with ODBC linking performance is less than good. And as a former DBA, piggybacking an uncontrollable tool (as the ODBC interface is) on a corporate database is a security and performance nightmare

What we provide

We provide a module (and its API) which aims to simplify the generation of such arrays /cross references, allowing to integrate it in any Python module. Thru such an interface the generation of the above sample could be reduced to the following Python core (with formatting extracted):

from dana-cube.util.jsonmgr import load_cubo
from dana-cube.core import *

mis_cubos = load_cubo()
cubo = Cubo(mis_cubos["datos light"])
vista = Vista(cubo,'provincia','partidos importantes','sum','votes_presential',totalizado=True)
resultado = vista.toList()
for linea in resultado:
    print(linea)

As you can see from the sample, we do not refer directly to the underlying database, but to an abstraction. Each instance runs against what we call a Cube. This is the view of a data table (or table-like DB object -a view, a select statement, ...) and the definition of the potential fields (other tools call it measures ) to query and the criteria over which to search, which we call guides ( other tools call dimensions or hierarchies ). This criteria can be scalar fields or hierarchical structures. If the guide is a date field; we automatically provide (for SQLITE, MySQL, PostGreSQL and Oracle, atm) for several subindexes (years, years-month, ...). And every different cross aggregation is what we call a Vista. We provide this abstraction as an Json text file ( vide infra )

As you will notice, this is, in this way is not designed as an end user tool , rather it is designed to be used for knowledgable users (DBAs, developers, data owners) or as a ready made API cum sample tool to be integrated in other's people work (as it still is in heavy development, Caveat emptor ).

But we do provide and end user tool, which can be used standalone: It is called danacube.py Is our main tool where we execute our aggregate accesses to the database (to the cube), and provide means to show graphics or to export the results into several data formats

Screenshot

An end user can not define their own cubes, nor use guides outside what's defined in the configuration file. ¿Why? One reason is that it allows showing the end user an view to the data which doesn't need to ajust exactly to the internal database structure. On the other hadn, it's a way carefull DBAs can limit what is available for search in a production database, beyond the database own security.

The tool is open to user extensions, so special test can be run over the data. See docs here-still only spanish- for the user functon module

  • cubebrowse.py Is a tool designed to manipulate the cube definitions. They are a plain Json file (see below) and can be edited by hand if necessary

Screenshot

  • danabrowse.py We can browse the contents of the database servers in our environment, and if necessary, generate direct cube definitions from the catalog of the database

Screenshot

  • danaquery.py A very simple tool to execute arbitrary sql code against database servers in our environment

Screenshot

Where to run the tool

The tool is programmed in python3 + PyQt5 but it might be possible to be run under Python2 (we try to be as much compatible as possible, but haven't tested it in a while) . And we make avaliable the core functionality without the need of using Qt (but functionaly behind the main version)

The infrastructure is database agnostic. We use SqlAlchemy as data backend, so in theory every database which can be accessed thru it should be available, although a few specific changes might be needed

Documentation

It's a work in progress

Sample Data

We will provide a test database (with results of the Spanish General Election in 2015) for several supported databases, with minimal changes between them. You will find both a sample_data.zip and a sample_data.tar.gz file in the root directory of the project, there you'll find both a cube definition file and a DB dump for the samples

As a matter of fact, the tool grew analizing those data

Dependencies

Besides PyQt, we use:

  • SqlAlchemy as a data backend (only core functionality).Main reason is transparent access to the catalogs
  • DateUtil for some date related functions
  • SqlParse ( Optional ) for some trace outputs
  • XlsxWriter Guess it ...
  • Matplotlib for all the graphic stuff

License

For my part, while I (Werner Llácer Viciano) retain all ownership of the code, this is an open source product.

Best policy would be to comply to the terms of the LGPL license family (I find it the most honest license both for the authors and the users)

a good enough policy would be to acknowledge my autorship of this piece of code and to send upstream all corrections and enhancements to the original functionality.

If in doubt, or your legal overlords demand some clear answer, then, my code is LGPL v2 and/or greater than licensed. For . more grave doubts please contact me, We'll broker an favorable agreement.

A few caveats :

  • Any module installed via the "user functions" functionality is NOT bound to the above mentioned license. That functionality was intendended for the users to extend the capabilities of the tool, so they remain (whatever the mode of linking) property of their developers. In spite of this, those provided by my as samples, are still mine, and licencing aplies (but you have unrestricted right to fork)
  • From my point of view, data used and/ore produced by the tool, are owned by their users, not, in any way, by the tool (i.e. me)
  • IIRC, the *GPL licenses demand that the code must be made available by the distributor. I think that linking where the source of Danacube resides covers it. Although it might be, from the practical POV a bad idea: repositories may move, version changes could be destructive, and any private enhancement could get lost, ...
  • I'm not into legal hairsplitting, so just to avoid confusion, I don't mind if the code is used in/distributed with products under other open source license in the broad sense, as long as they honor my licensing for my code.
  • IIRC, my jurisdiction -Spain- DOES NOT recognize software patents, so don't bother me with that. I wonder, anyhow, what would be "patentable" in my algorithms ... and if, not covered by thousands of previous art.

Qt, PyQt -and the additional libraries-, licensing might impose other restrictions, please keep an eye on it (AFAIK PyQt is/was GPL licensed)

DANACUBE enters ALPHA

What does it means?

  • We deem that we have achieved a functional 'completeness' of the tool, so it should be useful for valiant user; but that it still lacks proper outside testing (so, for sure, many bugs ahead) and documentation (hope to solve it soon)

  • What we know it's missing:

    • Unknown bugs all around (i know i'm not perfect). And a few known ;-)
    • Public API for the core functionality is still open (need input for use cases)
    • The user interface is implemented just for my needs and lacks internationalization (worse still, it's now a mix of english and spanish)
    • It's reasonably well tested with Sqlite, MySQL, PostgreSQL,;
    • As of the last release Oracle 's support, while it works, is still missing thorough testing..
    • I haven't had the chance to adapt/test it against DB2 or MSSQL Server ¿Any volunteer?
    • Nor performance, neither security have been, till now, top priority goals. You've been warned
    • Legalese is missing in code (copyrights, licence specs, and so on)

Active tasks can be read here

Help Needed

We could profit from someone knowledgable in UI development. I'm a DBA type and, well, it's kind of difficult for me

Testers are welcome.

Out there ...

If you feel my package isn't enough for you have a look at the following projects of similar scope and technology. They might be of interest for you

Contact

You can email me regarding this application thru the address danacube.sup at gmail dot com

A.M.D.G. & B.V.M.