Abathur

"Look at flesh, see only potential. Strands, sequences, twisting, separating, joining. See how it could be better. Eat flesh, splinter bone. Inside me, can touch it. Weave it. Spin it. Make it great."

-- Abathur (The evolution master) tells Sarah Kerrigan about his work

Abathur aims to be an easy-to-use automated machine learning and data torturing toolkit that makes the data talk.

Feature Extraction

usage: abathur extract [-h] [--query-param] param queries output

Extract (aggregated) features from a sql database.

positional arguments:
  param          the file with query params or the SQL query that finds the
                 params.
  queries        the set of queries and feature names to be executed.
  output         the output file.

optional arguments:
  -h, --help     show this help message and exit
  --query-param  the given param file is a query file. by default we assume
                 it's a file that contains a list of query params.

abathur extract is an ad hoc feature extraction tool for relational (SQL) databases, in which every feature value is extracted with its own query. This is not computationally efficient, but it does the job and makes it easy to extract features for a given set of targets.

param file

abathur extract expects a param file. A param file is either a CSV file containing the query parameters or a SQL file containing a query that returns them.

An example of CSV param file content:

ident,param_age
1,35
3,35
12,35
32,35

An example of SQL param file content:

select id as ident, 35 as param_age from users where age > 35

You need to add the --query-param option if the param file is a SQL file.
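The two param file forms can be illustrated with a minimal Python sketch. It is not Abathur's actual implementation: load_params, the in-memory SQLite database, and the users table are assumptions made for illustration. Either way, the result is one dictionary of parameters per target row.

```python
import csv
import io
import sqlite3


def load_params(text, conn=None, is_query=False):
    """Return one {column: value} dict per param row (hypothetical helper).

    `text` is the content of the param file. With is_query=True it is run
    as SQL against `conn`; otherwise it is parsed as CSV with a header row.
    """
    if is_query:
        cursor = conn.execute(text)
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


# CSV param file: values are read as strings.
csv_params = load_params("ident,param_age\n1,35\n3,35\n")

# SQL param file, run against a hypothetical in-memory users table.
conn = sqlite3.connect(":memory:")
conn.execute("create table users (id integer, age integer)")
conn.executemany("insert into users values (?, ?)", [(1, 40), (2, 30), (3, 50)])
sql_params = load_params(
    "select id as ident, 35 as param_age from users where age > 35 order by id",
    conn=conn,
    is_query=True,
)
```

Note that in this sketch CSV values arrive as strings while SQL values keep their database types.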

query file

The second argument passed to abathur extract is a query file. A query file is a JSON file that maps each feature name to the SQL query that extracts it. The SQL should contain {param_key} placeholders, where param_key is a column name in the param file; each placeholder is replaced with the corresponding value from the param file.

An example of query file content:

{
    "n_followers": "select count(*) from follow where follow_user_id={ident}",
    "n_follow": "select count(*) from follow where user_id={ident} and age>{param_age}"
}
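To make the one-query-per-value idea concrete, here is a hedged Python sketch that fills the placeholders for each param row and runs each query. The function extract_features, the use of str.format for substitution, the in-memory SQLite database, and the follow table schema are all assumptions for illustration, not Abathur's actual code.

```python
import sqlite3


def extract_features(conn, params, queries):
    """Run every query once per param row (hypothetical helper)."""
    rows = []
    for param in params:
        row = {"ident": param["ident"]}
        for name, template in queries.items():
            # Assumption: placeholders are filled with str.format.
            sql = template.format(**param)
            row[name] = conn.execute(sql).fetchone()[0]
        rows.append(row)
    return rows


# Hypothetical schema and data, chosen only to make the queries runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table follow (user_id integer, follow_user_id integer, age integer)"
)
conn.executemany(
    "insert into follow values (?, ?, ?)",
    [(1, 3, 40), (1, 12, 30), (3, 1, 50), (12, 1, 36)],
)

queries = {
    "n_followers": "select count(*) from follow where follow_user_id={ident}",
    "n_follow": "select count(*) from follow where user_id={ident} and age>{param_age}",
}
params = [{"ident": 1, "param_age": 35}, {"ident": 3, "param_age": 35}]

features = extract_features(conn, params, queries)
```

Because the values are pasted directly into the SQL text, this pattern is only safe with param files you trust.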

For help with command-line options:

abathur extract --help

Clustering

usage: abathur cluster [-h] [--ignore [IGNORE [IGNORE ...]]]
                         feat_filename output

Perform clustering of the given data set.

positional arguments:
  feat_filename         The input feature file
  output                The output file name

optional arguments:
  -h, --help            show this help message and exit
  --ignore [IGNORE [IGNORE ...]]
                        The features (column names) to be ignored. Usually the
                        ID field.

abathur cluster takes an input feature file and performs clustering. The output is a file containing the cluster id for each corresponding row in the input feature file. It uses the information theoretic approach detailed in [1] to estimate the best number of clusters. The implemented algorithm slightly improves on [1] by running the jump method multiple times and taking the lowest number.
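The jump method from [1] can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Abathur's implementation: the simple k-means with farthest-first initialisation, the function names, the toy data, and the reading of "lowest number" as the minimum estimate across runs are all assumptions. The method computes the k-means distortion d_k for each k, transforms it as d_k^(-p/2) where p is the number of dimensions, and picks the k with the largest jump between successive transformed values.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_distortion(points, k, rng, n_iter=100):
    """Run k-means once; return the mean squared distance per dimension."""
    # Assumption: farthest-first initialisation from a random starting point.
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if updated == centers:
            break
        centers = updated
    total = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return total / (len(points) * len(points[0]))


def jump_method(points, max_k, rng):
    """Return the k in 1..max_k with the largest jump in transformed distortion."""
    power = len(points[0]) / 2.0  # transformation power p/2
    previous = 0.0  # the transformed distortion for k = 0 is defined as 0
    best_k, best_jump = 1, float("-inf")
    for k in range(1, max_k + 1):
        # max_k must stay below the number of distinct points so d_k > 0.
        transformed = kmeans_distortion(points, k, rng) ** -power
        if transformed - previous > best_jump:
            best_k, best_jump = k, transformed - previous
        previous = transformed
    return best_k


def estimate_n_clusters(points, max_k, repeats=5, seed=0):
    """Repeat the jump method and keep the lowest estimate."""
    return min(
        jump_method(points, max_k, random.Random(seed + i)) for i in range(repeats)
    )


# Toy data: three well-separated 3x3 grids of points.
blob_centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
offsets = (-0.5, 0.0, 0.5)
points = [(cx + dx, cy + dy) for cx, cy in blob_centers for dx in offsets for dy in offsets]

n_clusters = estimate_n_clusters(points, max_k=6)
```

Taking the minimum across runs favours the more conservative estimate when individual k-means runs disagree.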

Abathur Config

Abathur expects a config file in ~/.abathur.conf with the following content:

{
    "db_connection_string": "sqlalchemy_style_connection_string"
}
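As an illustration, a config file of this shape could be read as sketched below. The helper load_connection_string and the sqlite:///example.db URL are assumptions for the example; how Abathur itself reads the file is not shown in this document. The demo writes to a temporary file so it does not touch a real ~/.abathur.conf.

```python
import json
import os
import tempfile


def load_connection_string(path="~/.abathur.conf"):
    """Read the JSON config and return its connection string (hypothetical helper)."""
    with open(os.path.expanduser(path)) as handle:
        config = json.load(handle)
    return config["db_connection_string"]


# Demo with a temporary config file instead of the real one.
with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as tmp:
    json.dump({"db_connection_string": "sqlite:///example.db"}, tmp)

url = load_connection_string(tmp.name)
os.remove(tmp.name)
```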

For more about SQLAlchemy connection strings, see SQLAlchemy Database URLs.

Reference

[1] Catherine A. Sugar and Gareth M. James (2003). "Finding the number of clusters in a data set: An information theoretic approach". Journal of the American Statistical Association 98 (January): 750-763.
