GitHub - rayleyva/DataGristle: Tough and flexible tools for data analysis, transformation, validation and movement.

Datagristle is a toolbox of tough and flexible data connectors and analyzers.
It's kind of an interactive mix between ETL and data analysis optimized for rapid analysis and manipulation of a wide variety of data.

It's neither an enterprise ETL tool, nor an enterprise analysis, reporting, or data mining tool. It's intended to be an easily-adopted tool for technical analysts that combines the most useful subset of data transformation and analysis capabilities necessary to do 80% of the work. Its open source python codebase allows it to be easily extended to with custom code to handle that always challenging last 20%.

Current Status: Strong support for easy analysis and simple transformations of csv files.

More info is on the DataGristle wiki here: https://github.com/kenfar/DataGristle/wiki

#Next Steps:

attractive PDF output of gristle_determinator.py
metadata database population

#Its objectives include:

multi-platform (unix, linux, mac os, windows with effort)
multi-language (primarily python)
free - no cripple-licensing
primary audience is programming data analysts - not non-technical analysts
primary environment is command-line rather than windows, graphical desktop or eclipse
extensible
allow a bi-directional iteration between ETL & data analysis
can quickly perform initial data analysis prior to longer-duration, deeper analysis with heavier-weight tools.

#Installation

Using pip (preferred) or easyinstall:

$ pip install datagristle
$ easy_install datagristle

Or install manually from pypi:

$ mkdir ~\Downloads
$ wget https://pypi.python.org/packages/source/d/datagristle/datagristle-0.46.tar.gz
$ tar -xvf easy_install datagristle
$ cd ~\Downloads\datagristle-*
$ python setup.py install

#Dependencies

Python 2.6 or Python 2.7

#Mature Utilities Provided in This Release:

gristle_slicer
- Used to extract a subset of columns and rows out of an input file.
gristle_freaker
- Produces a frequency distribution of multiple columns from input file.
gristle_viewer
- Shows one record from a file at a time - formatted based on metadata.
gristle_determinator
- Identifies file formats, generates metadata, prints file analysis report
- This is the most mature - and also used by the other utilities so that you generally do not need to enter file structure info.

#gristle_slicer Extracts subsets of input files based on user-specified columns and rows. The input csv file can be piped into the program through stdin or identified via a command line option. The output will default to stdout, or redirected to a filename via a command line option.

The columns and rows are specified using python list slicing syntax -
so individual columns or rows can be listed as can ranges.   Inclusion
or exclusion logic can be used - and even combined.

Examples:
   $ gristle_slicer sample.csv
                Prints all rows and columns
   $ gristle_slicer sample.csv -c":5, 10:15" -C 13
                Prints columns 0-4 and 10,11,12,14 for all records
   $ gristle_slicer sample.csv -C:-1
                Prints all columns except for the last for all records
   $ gristle_slicer sample.csv -c:5 -r-100
                Prints columns 0-4 for the last 100 records
   $ gristle_slicer sample.csv -c:5 -r-100 -d'|' --quoting=quote_all
                Prints columns 0-4 for the last 100 records, csv
                dialect info (delimiter, quoting) provided manually)
   $ cat sample.csv | gristle_slicer -c:5 -r-100 -d'|' --quoting=quote_all
                Prints columns 0-4 for the last 100 records, csv
                dialect info (delimiter, quoting) provided manually)

#gristle_freaker Creates a frequency distribution of values from columns of the input file and prints it out in columns - the first being the unique key and the last being the count of occurances.

Examples:
   $ gristle_freaker sample.csv -d '|'  -c 0
                Creates two columns from the input - the first with
                unique keys from column 0, the second with a count of
                how many times each exists.
   $ gristle_freaker sample.csv -d '|'  -c 0 --sortcol 1 --sortorder forward --writelimit 25
                In addition to what was described in the first example, 
                this example adds sorting of the output by count ascending 
                and just prints the first 25 entries.
   $ gristle_freaker sample.csv -d '|'  -c 0 --sampling_rate 3 --sampling_method interval
                In addition to what was described in the first example,
                this example adds a sampling in which it only references
                every third record.
   $ gristle_freaker sample.csv -d '|'  -c 0,1
                Creates three columns from the input - the first two
                with unique key combinations from columns 0 & 1, the 
                third with the number of times each combination exists.

#gristle_viewer Displays a single record of a file, one field per line, with field names displayed as labels to the left of the field values. Also allows simple navigation between records.

Examples:
   $ gristle_viewer sample.csv -r 3 
                Presents the third record in the file with one field per line
                and field names from the header record as labels in the left
                column.
   $ gristle_viewer sample.csv -r 3  -d '|' -q quote_none
                In addition to what was described in the first example this
                adds explicit csv dialect overrides.

#gristle_determinator Analyzes the structures and contents of csv files in the end producing a report of its findings. It is intended to speed analysis of csv files by automating the most common and frequently-performed analysis tasks. It's useful in both understanding the format and data and quickly spotting issues.

Examples:
   $ gristle_determinator japan_station_radiation.csv
                This command will analyze a file with radiation measurements
                from various Japanese radiation stations.

    File Structure:
    format type:       csv
    field cnt:         4
    record cnt:        100
    has header:        True
    delimiter:                   
    csv quoting:       False   
    skipinitialspace:  False    
    quoting:           QUOTE_NONE  
    doublequote:       False   
    quotechar:         "       
    lineterminator:    '\n'    
    escapechar:        None    

    Field Analysis Progress: 
    Analyzing field: 0
    Analyzing field: 1
    Analyzing field: 2
    Analyzing field: 3

    Fields Analysis Results: 

        ------------------------------------------------------
        Name:             station_id           
        Field Number:     0                    
        Wrong Field Cnt:  0                    
        Type:             timestamp            
        Min:              1010000001           
        Max:              1140000006           
        Unique Values:    99                   
        Known Values:     99                   
        Top Values not shown - all values are unique

        ------------------------------------------------------
        Name:             datetime_utc         
        Field Number:     1                    
        Wrong Field Cnt:  0                    
        Type:             timestamp            
        Min:              2011-02-28 15:00:00  
        Max:              2011-02-28 15:00:00  
        Unique Values:    1                    
        Known Values:     1                    
        Top Values: 
            2011-02-28 15:00:00                      x 99 occurrences

        ------------------------------------------------------
        Name:             sa                   
        Field Number:     2                    
        Wrong Field Cnt:  0                    
        Type:             integer              
        Min:              -999                 
        Max:              52                   
        Unique Values:    35                   
        Known Values:     35                   
        Mean:             2.45454545455        
        Median:           38.0                 
        Variance:         31470.2681359        
        Std Dev:          177.398613681        
        Top Values: 
            41                                       x 7 occurrences
            42                                       x 7 occurrences
            39                                       x 6 occurrences
            37                                       x 5 occurrences
            46                                       x 5 occurrences
            17                                       x 4 occurrences
            38                                       x 4 occurrences
            40                                       x 4 occurrences
            45                                       x 4 occurrences
            44                                       x 4 occurrences

        ------------------------------------------------------
        Name:             ra                   
        Field Number:     3                    
        Wrong Field Cnt:  0                    
        Type:             integer              
        Min:              -888                 
        Max:              0                    
        Unique Values:    2                    
        Known Values:     2                    
        Mean:             -556.121212121       
        Median:           -888.0               
        Variance:         184564.833792        
        Std Dev:          429.610095077        
        Top Values: 
            -888                                     x 62 occurrences
            0                                        x 37 occurrences

#Licensing

Gristle uses the BSD license - see the separate LICENSE file for further information

#Copyright

Name		Name	Last commit message	Last commit date
Latest commit History 191 Commits
data		data
datagristle.egg-info		datagristle.egg-info
docs		docs
gristle		gristle
scripts		scripts
.gitignore		.gitignore
.pylintrc		.pylintrc
LICENSE		LICENSE
README.md		README.md
README.txt		README.txt
setup.py		setup.py
tox.ini		tox.ini

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data

data

datagristle.egg-info

datagristle.egg-info

docs

docs

gristle

gristle

scripts

scripts

.gitignore

.gitignore

.pylintrc

.pylintrc

LICENSE

LICENSE

README.md

README.md

README.txt

README.txt

setup.py

setup.py

tox.ini

tox.ini

Repository files navigation

About

Releases

Packages

License

rayleyva/DataGristle

Folders and files

Latest commit

History

Repository files navigation

About

Resources

License

Stars

Watchers

Forks