Skip to content

hishoss/hyperlearn

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

90 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

drawing

Faster, Leaner GPU Sklearn, Statsmodels written in PyTorch

GitHub issues Github All Releases Depfu Currently badges don't work --> will update later :)


50%+ Faster, 50%+ less RAM usage, GPU support re-written Sklearn, Statsmodels combo with new novel algorithms.

HyperLearn is written completely in PyTorch, NoGil Numba, Numpy, Pandas, Scipy & LAPACK, and mirrors (mostly) Scikit Learn. HyperLearn also has statistical inference measures embedded, and can be called just like Scikit Learn's syntax (model.confidence_interval_)


drawing

Comparison of Speed / Memory

Algorithm n p Time(s) RAM(mb) Notes
Sklearn Hyperlearn Sklearn Hyperlearn
QDA (Quad Dis A) 1000000 100 54.2 22.25 2,700 1,200 Now parallelized
LinearRegression 1000000 100 5.81 0.381 700 10 Guaranteed stable & fast

Time(s) is Fit + Predict. RAM(mb) = max( RAM(Fit), RAM(Predict) )

I've also added some preliminary results for N = 5000, P = 6000 drawing

Since timings are not good, I have submitted 2 bug reports to Scipy + PyTorch:

  1. EIGH very very slow --> suggesting an easy fix #9212 scipy/scipy#9212
  2. SVD very very slow and GELS gives nans, -inf #11174 pytorch/pytorch#11174

Help is really needed! Email me or message me @ danielhanchen@gmail.com!


Key Methodologies and Aims


1. Embarrassingly Parallel For Loops

  • Including Memory Sharing, Memory Management
  • CUDA Parallelism through PyTorch & Numba

2. 50%+ Faster, 50%+ Leaner

3. Why is Statsmodels sometimes unbearably slow?

  • Confidence, Prediction Intervals, Hypothesis Tests & Goodness of Fit tests for linear models are optimized.
  • Using Einstein Notation & Hadamard Products where possible.
  • Computing only what is necessary to compute (Diagonal of matrix and not entire matrix).
  • Fixing the flaws of Statsmodels on notation, speed, memory issues and storage of variables.

4. Deep Learning Drop In Modules with PyTorch

  • Using PyTorch to create Scikit-Learn like drop in replacements.

5. 20%+ Less Code, Cleaner Clearer Code

  • Using Decorators & Functions where possible.
  • Intuitive Middle Level Function names like (isTensor, isIterable).
  • Handles Parallelism easily through hyperlearn.multiprocessing

6. Accessing Old and Exciting New Algorithms

  • Matrix Completion algorithms - Non Negative Least Squares, NNMF
  • Batch Similarity Latent Dirichelt Allocation (BS-LDA)
  • Correlation Regression
  • Feasible Generalized Least Squares FGLS
  • Outlier Tolerant Regression
  • Multidimensional Spline Regression
  • Generalized MICE (any model drop in replacement)
  • Using Uber's Pyro for Bayesian Deep Learning

Goals & Development Schedule

Will Focus on & why:

1. Singular Value Decomposition & QR Decomposition

* SVD/QR is the backbone for many algorithms including:
    * Linear & Ridge Regression (Regression)
    * Statistical Inference for Regression methods (Inference)
    * Principal Component Analysis (Dimensionality Reduction)
    * Linear & Quadratic Discriminant Analysis (Classification & Dimensionality Reduction)
    * Pseudoinverse, Truncated SVD (Linear Algebra)
    * Latent Semantic Indexing LSI (NLP)
    * (new methods) Correlation Regression, FGLS, Outlier Tolerant Regression, Generalized MICE, Splines (Regression)

  1. Port all important Numpy functions to faster alternatives (ONGOING)
  • Singular Value Decomposition (50% Complete) *

About

50%+ Faster, 50%+ less RAM usage, GPU support re-written Sklearn, Statsmodels combo with new novel algorithms.

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Jupyter Notebook 98.0%
  • Python 2.0%