leroyjmcclure/DataGristle

Datagristle is a toolbox of tough and flexible data connectors and analyzers.
It's an interactive mix of ETL and data analysis, optimized for rapid analysis and manipulation of a wide variety of data.

It's neither an enterprise ETL tool nor an enterprise analysis, reporting, or data mining tool. It's intended to be an easily adopted tool for technical analysts that combines the most useful subset of data transformation and analysis capabilities necessary to do 80% of the work. Its open source Python codebase allows it to be easily extended with custom code to handle that always-challenging last 20%.

Current Status: Strong support for easy analysis and simple transformations of CSV files.

#Next Steps:

  • attractive PDF output of gristle_determinator.py
  • metadata database population

#Its objectives include:

  • multi-platform (Unix, Linux, Mac OS, Windows with effort)
  • multi-language (primarily Python)
  • free - no cripple-licensing
  • primary audience is programming data analysts - not non-technical analysts
  • primary environment is the command line rather than a windowed graphical desktop or Eclipse
  • extensible
  • allow a bi-directional iteration between ETL & data analysis
  • can quickly perform initial data analysis prior to longer-duration, deeper analysis with heavier-weight tools.

#Installation

```pip install datagristle```

```easy_install datagristle```
  • Or download the tarball from PyPI

#Dependencies

  • Python 2.6 or Python 2.7

#Mature Utilities Provided in This Release:

  • gristle_determinator.py
    • Identifies file formats, generates metadata, prints file analysis report
    • This is the most mature utility. The other utilities also use it, so you generally do not need to enter file structure info yourself.
  • gristle_freaker.py
    • Produces a frequency distribution of multiple columns from an input file.
  • gristle_slicer.py
    • Used to extract a subset of columns and rows out of an input file.
  • gristle_viewer.py
    • Shows one record from a file at a time - formatted based on metadata.
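
To make the format detection and frequency distribution ideas above concrete, here is a minimal sketch using only the Python standard library. It is not datagristle's implementation and does not use its API: the function names `sniff_dialect` and `column_frequencies` and the sample data are hypothetical, and the code assumes Python 3 rather than the Python 2.6/2.7 this release targets.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data: pipe-delimited with a header row.
SAMPLE = "id|state|color\n1|CO|red\n2|CO|blue\n3|NY|red\n4|CO|red\n"


def sniff_dialect(sample, candidates=",|\t;"):
    """Guess the CSV dialect of a text sample with the stdlib csv.Sniffer."""
    return csv.Sniffer().sniff(sample, delimiters=candidates)


def column_frequencies(text, dialect, columns, has_header=True):
    """Count how often each combination of values occurs in the given columns."""
    reader = csv.reader(io.StringIO(text), dialect)
    if has_header:
        next(reader, None)  # skip the header row
    counts = Counter()
    for row in reader:
        counts[tuple(row[col] for col in columns)] += 1
    return counts


if __name__ == "__main__":
    dialect = sniff_dialect(SAMPLE)
    print("delimiter:", repr(dialect.delimiter))
    for key, count in column_frequencies(SAMPLE, dialect, [1, 2]).most_common():
        print(key, count)
```

Detecting the dialect once and passing it to later steps mirrors the way the determinator spares the other utilities from needing explicit file structure info.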

#Immature Utilities Provided in This Release:

  • gristle_differ.py
    • Shows differences between two files
  • gristle_file_converter.py
    • Converts a CSV file from one dialect to another. Can handle multi-character field delimiters as well as record delimiters.
  • gristle_filter.py
    • Applies simple filter logic to a file.
  • gristle_scalar.py
    • Performs scalar operations (min, max, avg, count unique, etc.) on a file
  • gristle_validator.py
    • Validates a file - currently just confirms number of fields for each row.
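
The field-count validation and dialect conversion described above can be sketched with the standard library as follows. This is an illustration under stated assumptions, not the code behind gristle_validator.py or gristle_file_converter.py: `invalid_rows` and `convert_dialect` are hypothetical names, the code assumes Python 3, and the stdlib `csv` module only accepts single-character delimiters, so the multi-character delimiter support mentioned above is not covered.

```python
import csv
import io


def invalid_rows(text, expected_fields, delimiter=","):
    """Return (record_number, field_count) for every row with the wrong field count."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    return [
        (number, len(row))
        for number, row in enumerate(reader, start=1)
        if len(row) != expected_fields
    ]


def convert_dialect(text, in_delimiter, out_delimiter):
    """Rewrite delimited text with a different single-character field delimiter."""
    source = csv.reader(io.StringIO(text, newline=""), delimiter=in_delimiter)
    target = io.StringIO(newline="")
    writer = csv.writer(
        target,
        delimiter=out_delimiter,
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that contain the new delimiter
        lineterminator="\n",
    )
    writer.writerows(source)
    return target.getvalue()


if __name__ == "__main__":
    print(invalid_rows("a,b,c\n1,2\n3,4,5\n", expected_fields=3))
    print(convert_dialect("a|b,c|d\n", "|", ","), end="")
```

Quoting on output matters here: a field that was harmless under the old delimiter may contain the new one, and would otherwise change the field count that the validation step checks.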

#Future utilities:

  • gristle_metadata.py
    • Manages metadata - allows users to query, add, update, and delete file, field, transformation, and reporting descriptions.
  • gristle_generator
    • Generates test data based on gristle metadata
  • gristle_validator
    • Confirms validity of database and file structure and contents.
  • gristle_file_joiner.py
    • Joins two files on their common keys and produces a new file
  • gristle_grouper.py
    • Reads a file, aggregates on a given set of fields, produces a new file
  • gristle_db_loader.py
    • Loads a file into a database
  • gristle_db_extractor.py
    • Extracts data from a database into a file
  • gristle_field_merge.py
    • Prints the matched values from multiple files side by side along with counts

#Licensing

  • Gristle uses the BSD license - see the separate LICENSE file for further information
