gregparkes/turbopanda

turbopanda: Pandas. But Smarter

Turbo-charging the Pandas library in an integrative, meta-orientated style.

The aim of this library is to extend the functionality of a number of Python packages, including the pandas library, and to integrate them into a unified approach to data modelling, including machine learning.

The basic idea

The main purpose is to build a layer on top of pandas which manages the main data and associates meta information with the columns. This meta information remembers the interactions the user has with the data, specifically those that group columns by name or by some other defining feature.

Motivation

There are a number of areas in which the Pandas library is lacklustre from a user standpoint. We'll cover a few of these in more detail and then explain TurboPandas' response to these particular issues.

For details, read the ISSUES markdown file found in the repository.

How to use: The Basics

You will need to import the package as:

import turbopanda as turb

All of the heavy lifting comes in the MetaPanda object which acts as a hood over the top of a pandas.DataFrame:

# where df is a pandas.DataFrame object.
import turbopanda as turb
g = turb.MetaPanda(df)

Alternatively a MetaPanda object can be created using the in-built read function found in turbopanda (which also includes glob-like calls for multiple files):

import turbopanda as turb
g = turb.read("rna.csv")

The representation of the object presents the dataset in terms of its dimensions and memory usage, which is useful to know at a glance.
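For the multi-file case mentioned above, the following is a rough plain-pandas sketch of what reading by a glob pattern involves. It is not turbopanda's implementation: the file names `rna_1.csv` and `rna_2.csv` are hypothetical, and how `turb.read` returns or combines several files is not described here, so this sketch simply keeps one DataFrame per file.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Write two small, hypothetical CSV files to read back.
    for name in ("rna_1.csv", "rna_2.csv"):
        pd.DataFrame({"x": [1, 2]}).to_csv(os.path.join(tmp, name), index=False)

    # Expand the glob pattern, then read each matching file.
    paths = sorted(glob.glob(os.path.join(tmp, "rna_*.csv")))
    frames = {os.path.basename(p): pd.read_csv(p) for p in paths}

print(list(frames))
```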

The raw pandas object can be accessed through the df_ attribute:

g.df_.head(2)
- Protein_IDs Majority_protein_IDs Protein_names ...
0 Q96IC2;Q96IC2-2;H3BM72;H3BV93;H3BSC5 Q96IC2;Q96IC2-2;H3BM72;H3BV93;H3BSC5 Putative RNA exonuclease NEF-sp ...
1 H0YGH4;P01023;H0YGH6;F8W7L3 H0YGH4;P01023 Alpha-2-macroglobulin ...

Metadata, by contrast, can be accessed through the meta_ attribute, which is created automatically upon instantiation:

g.meta_.head(2)
- true_type is_unique potential_id
Protein_IDs object True True
Majority_protein_IDs object True True
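To give a feel for what a per-column metadata table contains, here is a minimal plain-pandas sketch that builds something similar by hand. It is an illustration under assumptions, not how turbopanda computes meta_: the toy data is made up, and only the true_type and is_unique columns are imitated.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
df = pd.DataFrame({
    "Protein_IDs": ["Q96IC2", "H0YGH4", "P01023"],
    "Intensity_G1_1": [1.5, 2.0, 2.0],
})

# One row of metadata per column, indexed by column name.
meta = pd.DataFrame({
    "true_type": df.dtypes.astype(str),
    "is_unique": df.apply(lambda s: s.is_unique),
})

print(meta)
```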

Accessing column subsets

Accessing column subsets of a DataFrame in traditional pandas can be clunky. turbopanda overrides __getitem__ to accept flexible forms of input, including:

  • regex: string regular expressions
  • type-casting: using a specific type
  • Direct-column: the column name/pandas.Index
  • meta-info: Using selections from the meta_ attributes

Example inputs include:

g[object].head()

Returns all of the columns of type object. Or we could return all the columns whose name obeys some regular expression:

g["Intensity_[MG12S]*_1"].head()

Or we could return all of the columns that are unique identifiers, as determined by the meta_ column, is_unique:

g["is_unique"].head()
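For comparison, the type-based and regex-based selections above can be approximated in plain pandas as follows. This is a sketch with made-up column names, not turbopanda's own code. Note that `DataFrame.filter(regex=...)` matches with `re.search`, which may differ from the matching rule turbopanda applies.

```python
import pandas as pd

# Hypothetical columns chosen to exercise both selectors.
df = pd.DataFrame({
    "Gene_names": ["NEF", "A2M"],
    "Intensity_G1_1": [1.0, 2.0],
    "Intensity_S_1": [3.0, 4.0],
    "Intensity_G1_2": [5.0, 6.0],
    "Count": [1, 2],
})

# Select by type: every float column.
float_cols = df.select_dtypes(include="float").columns

# Select by name: columns whose name matches the regular expression.
regex_cols = df.filter(regex=r"Intensity_[MG12S]*_1").columns

print(list(float_cols))
print(list(regex_cols))
```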

Sometimes the columns returned may not be what the user expects, so we provide the view and view_not functions, which return only the pd.Index or list-like representation of the column names identified:

g.view(object)
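The idea of inspecting only the matched names, and their complement, can be sketched in plain pandas as below. The column names are hypothetical, and treating view_not as "every column that was not selected" is an assumption drawn from its name rather than from documented behaviour.

```python
import pandas as pd

# Hypothetical frame with one integer and two float columns.
df = pd.DataFrame({"id": [1, 2], "x": [0.5, 1.5], "y": [2.5, 3.5]})

# Names of the columns a float selector would pick out.
selected = df.select_dtypes(include="float").columns

# Complement: the remaining column names, in their original order.
not_selected = df.columns.drop(selected)

print(list(selected), list(not_selected))
```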

Complex access by multi-views

turbopanda supports more complex column selections. By default, multiple search terms are combined by taking their union, for example:

g.view(float, "Gene")

Returns every column of type float, together with every column whose name contains the word 'Gene'.
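To make the union semantics concrete, the sketch below resolves two selectors separately in plain pandas and combines them, with the intersection shown for contrast. The column names are invented for illustration, and the code is not taken from turbopanda.

```python
import pandas as pd

# Hypothetical columns: one string, two float, one integer.
df = pd.DataFrame({
    "Gene_names": ["NEF", "A2M"],
    "Gene_score": [0.1, 0.9],
    "Intensity_1": [1.0, 2.0],
    "Count": [1, 2],
})

by_type = df.select_dtypes(include="float").columns    # float columns
by_name = df.columns[df.columns.str.contains("Gene")]  # names containing "Gene"

# Union: columns matched by either selector, kept in the frame's column order.
union = [c for c in df.columns if c in by_type or c in by_name]

# Intersection, for contrast: columns matched by both selectors.
both = [c for c in df.columns if c in by_type and c in by_name]

print(union)
print(both)
```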

Transformations to columns

Often in pandas, operations are applied across the entire dataframe, which can be annoying if you just want to transform a selection of columns and either write the changes back in place or create a new column. turbopanda solves this with the transform function:

g.transform(lambda x: x**2, float)

This takes every column of type float and squares it. The lambda receives each selected column as a pandas.Series and is expected to return a result of the same length.
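The plain-pandas steps that this call replaces look roughly like the following: select the float columns, apply the function to each one, and write the result back. This is a sketch on made-up data, not turbopanda's implementation.

```python
import pandas as pd

# Hypothetical frame: two float columns and one integer column.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "n": [1, 2]})

# Pick the float columns, square each one, and assign the result back.
float_cols = df.select_dtypes(include="float").columns
df[float_cols] = df[float_cols].apply(lambda x: x ** 2)

print(df)
```

The integer column `n` is left untouched because it is not part of the selection.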

This is just a small sample of the functionality provided by turbopanda, which also includes machine learning, parallel and caching features that are essential for modern statistical research coding practices.

Installation

turbopanda requires a number of dependencies in order to function well; you can find these in the dependencies file. The majority of the requirements can be met by using the Anaconda distribution.

We recommend you use Jupyter to work with turbopanda given the benefits of quick development of code, with fast visualisation.

Using pip

Open your Anaconda Prompt or Terminal and type:

pip install turbopanda

If you are using a conda environment, make sure you're in the appropriate environment before calling this.

From Cloning the GitHub Repository

Alternatively if you are cloning this GitHub repository, use:

git clone https://github.com/gregparkes/turbopanda.git
# move into the cloned repository so that environment.yml can be found
cd turbopanda
conda env create -f environment.yml
# or source activate turbopanda...
conda activate turbopanda

Now within the turbopanda environment run your Jupyter notebook:

jupyter notebook

Changelog

Details as to specific and on-going changes can be found either in the Changelog file or in the GitHub repository.

Acknowledgments

We would like to acknowledge the following sources for inspiration for much of this work:

  • pandas dev team: Forming a solid backbone package to build upon
  • pingouin python library: For inspiration and code regarding correlation analysis
  • pyitlib library: For inspiration on mutual information and entropy
  • matplotlib library
  • patsy library: For inspiration on how to formulate design matrices
  • PythonCentral tutorials for code validation
  • Wikipedia for many topics


Ensure that any use of this material is appropriately referenced and in compliance with the license.
