turbopanda: Pandas. But Smarter

Turbo-charging the Pandas library in an integrative, meta-orientated style.

The aim of this library is to extend the functionality of a number of Python packages, including the pandas library, and to integrate them cohesively into a unified approach to data modelling, including machine learning.

The basic idea

The main purpose is to build a layer on top of pandas that manages the main data and associates meta information with the columns. This meta information remembers the user's interactions with the data, specifically those that group columns by name or by some other defining feature.

Motivation

There are a number of areas where the Pandas library is lacklustre from a user standpoint. We'll cover a few of these in more detail and then explain TurboPandas' response to these particular issues.

For details, read the ISSUES markdown file found in the repository.

How to use: The Basics

You will need to import the package as:

import turbopanda as turb

All of the heavy lifting comes in the MetaPanda object which acts as a hood over the top of a pandas.DataFrame:

# where df is a pandas.DataFrame object.
import turbopanda as turb
g = turb.MetaPanda(df)

Alternatively a MetaPanda object can be created using the in-built read function found in turbopanda (which also includes glob-like calls for multiple files):

import turbopanda as turb
g = turb.read("rna.csv")

The representation of the object presents the dataset in terms of its dimensions and memory usage, which is useful to know at a glance.

The raw pandas object can be accessed through the df_ attribute:

g.df_.head(2)

-	Protein_IDs	Majority_protein_IDs	Protein_names	...
0	Q96IC2;Q96IC2-2;H3BM72;H3BV93;H3BSC5	Q96IC2;Q96IC2-2;H3BM72;H3BV93;H3BSC5	Putative RNA exonuclease NEF-sp	...
1	H0YGH4;P01023;H0YGH6;F8W7L3	H0YGH4;P01023	Alpha-2-macroglobulin	...

The metadata can be accessed through the meta_ attribute, which is created automatically upon instantiation:

g.meta_.head(2)

-	true_type	is_unique	potential_id
Protein_IDs	object	True	True
Majority_protein_IDs	object	True	True
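To make the idea of per-column metadata concrete, here is a minimal sketch in plain pandas of how a table like meta_ could be derived. It is not turbopanda's implementation: the toy data, the variable names and the rule used for is_unique (every value in the column is distinct) are assumptions for illustration, and potential_id is left out because its definition is not given here.

```python
import pandas as pd

# Toy data standing in for the RNA dataset (values are made up).
df = pd.DataFrame({
    "Protein_IDs": ["Q96IC2", "H0YGH4", "P01023"],
    "Gene": ["NEF", "A2M", "A2M"],
    "Intensity": [1.5, 2.0, 3.25],
})

# One row of metadata per column of df.
# Assumption: "is_unique" means every value in the column is distinct.
meta = pd.DataFrame({
    "true_type": df.dtypes.astype(str),
    "is_unique": df.nunique() == len(df),
})

print(meta)
```

Because the metadata is indexed by column name, any boolean column in it can later be used to pick out a subset of the data columns.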

Accessing column subsets

Accessing column subsets of a DataFrame in traditional pandas can be clunky. MetaPanda overrides __getitem__ to accept flexible forms of input, including:

regex: string regular expressions
type-casting: using a specific type
Direct-column: the column name/pandas.Index
meta-info: Using selections from the meta_ attributes

Example inputs could include:

g[object].head()

Returns all of the columns of type object. Or we could return all the columns whose name obeys some regular expression:

g["Intensity_[MG12S]*_1"].head()

Or we could return all of the columns that are unique identifiers, as determined by the is_unique column of meta_:

g["is_unique"].head()

Sometimes the columns returned may not be what the user expects, so we provide view and view_not functions, which merely return the pd.Index or list-like representation of the column names identified:

g.view(object)
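For comparison, the three kinds of selector described in this section (by type, by regular expression, and by meta-information) can be written in plain pandas roughly as follows. This is a sketch of the equivalent pandas calls, not of how turbopanda resolves selectors internally; the toy column names and values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "Protein_IDs": ["Q96IC2", "H0YGH4", "P01023"],
    "Intensity_G1_1": [1.5, 2.0, 3.25],
    "Intensity_S_1": [0.5, 0.25, 4.0],
    "Count": [3, 3, 7],
})

# By type: names of the columns whose dtype is float64.
float_cols = list(df.select_dtypes(include=["float64"]).columns)

# By regex: names of the columns matching a regular expression.
regex_cols = list(df.filter(regex=r"Intensity_[MG12S]*_1").columns)

# By meta-information: names of the columns whose values are all distinct.
unique_cols = [name for name in df.columns if df[name].is_unique]

print(float_cols)
print(regex_cols)
print(unique_cols)
```

Each selector needs a different pandas call, which is the friction that a single __getitem__ accepting all three forms is meant to remove.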

Complex access by multi-views

turbopanda facilitates more complex selections of columns by keeping, by default, the union of the search terms. For example:

g.view(float, "Gene")

Returns all of the columns of type float, together with all of the columns whose name contains the word 'Gene'.
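The difference between keeping the union of two selectors and keeping their intersection can be illustrated in plain pandas with Index.union and Index.intersection. This is a sketch under the assumption that each search term resolves to a set of column names; the toy columns are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Protein_IDs": ["Q96IC2", "H0YGH4"],
    "Gene_names": ["NEF", "A2M"],
    "Gene_score": [0.1, 0.9],
    "Intensity": [1.5, 2.0],
})

# Each search term resolves to an Index of column names.
float_cols = df.select_dtypes(include=["float64"]).columns
gene_cols = df.filter(regex="Gene").columns

# Union: columns matched by either term.
either = float_cols.union(gene_cols, sort=False)

# Intersection: columns matched by both terms.
both = float_cols.intersection(gene_cols)

print(list(either))
print(list(both))
```

Here the union contains Gene_names, Gene_score and Intensity, while the intersection contains only Gene_score.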

Transformations to columns

Often in pandas, operations are applied across the entire DataFrame, which can be annoying if you just want to transform a selection of columns and apply the changes in place, or create a new column. turbopanda solves this with the transform function:

g.transform(lambda x: x**2, float)

This takes every column of type float and applies a square function to it. The lambda in this case accepts a pandas.Series object representing a given column, and is expected to return a result of the same size as its input.
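For reference, the same effect in plain pandas takes two steps: select the float columns, then assign the transformed values back. This is a sketch of the equivalent pandas code with made-up data, not of what transform does internally.

```python
import pandas as pd

df = pd.DataFrame({
    "Gene": ["A", "B"],
    "Count": [2, 3],
    "Intensity": [1.5, 2.0],
    "Score": [3.0, 0.5],
})

# Step 1: find the float columns.
float_cols = df.select_dtypes(include=["float64"]).columns

# Step 2: square each float column (x is a pandas.Series) and write it back.
df[float_cols] = df[float_cols].apply(lambda x: x ** 2)

print(df)
```

The integer and string columns are left untouched, and each transformed column keeps its original length.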

This is just a small sample of the functionality provided by turbopanda, which also includes machine learning, parallelism and caching features essential for modern statistical research coding practices.

Installation

turbopanda requires a number of dependencies in order to function well; you can find these in the dependencies file. The majority of the requirements can be met by using the Anaconda distribution.

We recommend you use Jupyter to work with turbopanda given the benefits of quick development of code, with fast visualisation.

Using pip

Open your Anaconda Prompt or Terminal and type:

pip install turbopanda

If you are using a conda environment, make sure you're in the appropriate environment before calling this.

From Cloning the GitHub Repository

Alternatively if you are cloning this GitHub repository, use:

git clone https://github.com/gregparkes/turbopanda.git
# move into the cloned repository, where environment.yml lives
cd turbopanda
conda env create -f environment.yml
# or source activate turbopanda...
conda activate turbopanda

Now within the turbopanda environment run your Jupyter notebook:

jupyter notebook

Changelog

Details of specific and ongoing changes can be found either in the Changelog file or in the GitHub repository.

Acknowledgments

We would like to acknowledge the following sources for inspiration for much of this work:

pandas dev team: Forming a solid backbone package to build upon
pingouin python library: For inspiration and code regarding correlation analysis
pyitlib library: For inspiration on mutual information and entropy
matplotlib library
patsy library: For inspiration on how to formulate design matrices
PythonCentral tutorials for code validation
Wikipedia for many topics

Ensure that any use of this material is appropriately referenced and in compliance with the license.
