gitter-badger/simple_ml

 
 


Hand-written implementations of the basic machine learning workflow and algorithms, using only numpy and the Python standard library.


TODO list:

  • test cases
  • an efficient bp network
  • more optimal methods
  • train test split func in helper
  • other feature select method to add
  • lasso and Ridge
  • add GBDT feature select
  • update Readme
  • setup.py
  • examples
  • get more datasets

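Feature engineering

Feature preprocessing

PCA dimensionality reduction

When the number of features is smaller than the number of samples:

import numpy as np
from simple_ml.pca import *

pca = PCA(1)  # keep a single principal component
a = np.array([[1,3,2], [3,5,1], [4,7,3], [1,2,0], [0,2,1]])
print(pca.fit_transform(a))
print(pca.explain_ratio)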
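
For readers who want to see what a PCA of this kind computes, here is a minimal numpy-only sketch. It assumes the classic approach (eigendecomposition of the feature covariance matrix); the function name `pca_sketch` is hypothetical and is not part of simple_ml, whose internals may differ.

```python
import numpy as np

def pca_sketch(X, n_components):
    """Project X (n_samples, n_features) onto its top principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Sample covariance of the features, shape (n_features, n_features).
    cov = np.cov(X_centered, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    explained_ratio = eigvals[order] / eigvals.sum()
    return X_centered @ components, explained_ratio

a = np.array([[1, 3, 2], [3, 5, 1], [4, 7, 3], [1, 2, 0], [0, 2, 1]])
projected, ratio = pca_sketch(a, 1)
print(projected.shape)  # (5, 1)
print(ratio)
```

The sign of each projected column is arbitrary, so results can differ from another implementation by a factor of -1.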

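PCA for high-dimensional data

When the number of features is much larger than the number of samples, PCA is carried out in the lower-dimensional space via matrix decomposition:

import numpy as np
from simple_ml.pca import *

pca = SuperPCA(1)
a = np.array([[1,3,2], [3,5,1], [4,7,3], [1,2,0], [0,2,1]])
print(pca.fit_transform(a.T))  # a.T has 3 samples and 5 features
print(pca.explain_ratio)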
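
One common way to handle far more features than samples is to decompose the small (n_samples x n_samples) Gram matrix instead of the large covariance matrix. The sketch below illustrates that trick under this assumption; whether SuperPCA works exactly this way is not confirmed here, and `pca_gram_sketch` is a hypothetical name.

```python
import numpy as np

def pca_gram_sketch(X, n_components):
    """PCA through the Gram matrix; n_components must not exceed the rank of the centered data."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    gram = Xc @ Xc.T  # (n_samples, n_samples), cheap when n_samples is small
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals, top_vecs = eigvals[order], eigvecs[:, order]
    # If u is a unit eigenvector of Xc Xc^T with eigenvalue lam > 0,
    # then Xc^T u / sqrt(lam) is a unit eigenvector of Xc^T Xc.
    components = Xc.T @ top_vecs / np.sqrt(top_vals)
    return Xc @ components, components

a = np.array([[1, 3, 2], [3, 5, 1], [4, 7, 3], [1, 2, 0], [0, 2, 1]])
proj, comps = pca_gram_sketch(a.T, 1)  # 3 samples, 5 features
print(proj.shape)  # (3, 1)
```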

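Feature selection

Filter methods

Four filter selection methods are currently provided:

  • Variance
  • Correlation coefficient
  • Chi-squared test
  • Mutual information

Example:

    import numpy as np
    from simple_ml.filter_select import *
    X = np.random.random(20).reshape(-1, 4)  # 5 samples, 4 features
    Y = np.random.randint(0,2,5)             # 5 binary labels
    mf = MyFilter(filter_type=FilterType.chi2, top_k=3)
    mf.fitTransform(X,Y)
    mf.transform(X)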
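
To make the idea of a filter method concrete, here is a small sketch of the simplest one, the variance filter, which keeps the `top_k` columns with the largest variance. It is an illustration only: `variance_filter_sketch` is a hypothetical helper and does not reflect how MyFilter is implemented.

```python
import numpy as np

def variance_filter_sketch(X, top_k):
    """Keep the top_k highest-variance columns of X, in their original order."""
    X = np.asarray(X, dtype=float)
    variances = X.var(axis=0)
    keep = np.sort(np.argsort(variances)[::-1][:top_k])
    return X[:, keep], keep

X = np.array([[1.0, 0.0, 10.0, 5.0],
              [1.0, 1.0, 20.0, 5.1],
              [1.0, 0.0, 30.0, 4.9],
              [1.0, 1.0, 40.0, 5.0]])
X_selected, kept = variance_filter_sketch(X, top_k=2)
print(kept)  # [1 2]
```

The constant first column and the nearly constant last column are dropped because they carry almost no variance.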

Model evaluation

Scoring metrics

For binary classification:

  • accuracy
  • precision
  • recall
  • f1
  • auc
  • ROC plot

For multi-class classification:

  • f1micro
  • f1macro
  • f1weight

For regression:

  • explainedvariance
  • absoluteerror
  • squarederror
  • RMSE (root mean squared error)
  • RMSLE (root mean squared log error, less sensitive to abnormal values)
  • r2
  • medianabsoluteerror

Example:

    import numpy as np
    from simple_ml.score import *
    print(classify_accuracy(np.array([1,0,1]), np.array([1, 1, 1])))
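
As a reference for what the binary-classification scores measure, the sketch below computes accuracy, precision, recall and F1 from the confusion counts. The function name and the argument order (`y_true` first) are assumptions made for this illustration, not the signatures used in simple_ml.score.

```python
import numpy as np

def binary_scores_sketch(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for binary labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    accuracy = float(np.mean(y_true == y_pred))
    # Guard against empty denominators by falling back to 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

scores = binary_scores_sketch([1, 0, 1], [1, 1, 1])
print(scores)
```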

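Plotting classification results

Note:

  • This plotting method trains the model internally and then plots. If there are more than 2 features, the data is first reduced to 2 dimensions and the model is trained on the reduced data, rather than being trained first and plotted afterwards, because a prediction has to be made for every 2-D point on the plot. The model must therefore support a 2-D training set (for example, a random forest with m > 2 does not).
  • If you want to train first and plot afterwards, and there are more than 2 features, the decision regions cannot be drawn.

Example:

    from simple_ml import classify_plot
    # model is a classifier instance; it is trained inside classify_plot
    classify_plot.classify_plot(model, X_train, y_train, X_test, y_test, title='My Support Vector Machine')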

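Cross-validation

Two cross-validation methods are currently provided:

  • Hold-out (holdout)
  • K-fold (kfolder)

The accepted parameters are:

  1. Model instance
  2. Feature data
  3. Label data
  4. Cross-validation type
  5. Proportion of training samples (hold-out only)
  6. Number of cross-validation rounds

Example: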

    from simple_ml.cross_validation import *
    cross_validation(model, X, y, CrossValidationType.holdout, 0.3, 5)
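
The sketch below shows how a k-fold split can be generated with numpy alone: shuffle the sample indices, cut them into k folds, and use each fold once as the test set. `k_fold_indices_sketch` and its `seed` parameter are hypothetical names for this illustration; the splitting logic inside simple_ml.cross_validation may differ.

```python
import numpy as np

def k_fold_indices_sketch(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)  # folds differ in size by at most one
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices_sketch(10, 5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Each model score would then be computed on `X[test_idx]` after fitting on `X[train_idx]`, and the k scores averaged.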

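Classification algorithms

Class conventions

base.py defines BaseClassifier, the abstract class that every classification algorithm inherits from.

Its main responsibilities are:

  • Validating the X and Y inputs
  • Determining the type of Y: continuous, binary, or multi-class
  • Declaring class attributes such as the number of samples, the number of variables, the training set, and the test set

The methods that must be overridden are:

  • fit(X,Y): fit the model to the dataset X and Y
  • predict(X): predict on the given test set
  • score(X,Y): score the predictions for the given X and Y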
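
The contract described above can be sketched with Python's `abc` module. Everything here (class names, attribute names, the label-type strings) is hypothetical and only mirrors the responsibilities listed; it is not the actual content of base.py.

```python
from abc import ABC, abstractmethod
import numpy as np

class BaseClassifierSketch(ABC):
    """Illustrative base class: input checks plus three abstract methods."""

    def _check_inputs(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        if X.ndim != 2:
            raise ValueError("X must be 2-D (n_samples, n_features)")
        if y.ndim != 1 or len(y) != X.shape[0]:
            raise ValueError("y must be 1-D with one label per sample")
        self.sample_num, self.variable_num = X.shape
        return X, y

    @staticmethod
    def _label_type(y):
        y = np.asarray(y)
        if np.issubdtype(y.dtype, np.floating) and not np.all(y == np.round(y)):
            return "continuous"
        return "binary" if len(np.unique(y)) <= 2 else "multi_class"

    @abstractmethod
    def fit(self, X, y): ...

    @abstractmethod
    def predict(self, X): ...

    @abstractmethod
    def score(self, X, y): ...

class MajorityClassifier(BaseClassifierSketch):
    """Toy subclass that always predicts the most frequent training label."""

    def fit(self, X, y):
        X, y = self._check_inputs(X, y)
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)

    def score(self, X, y):
        return float(np.mean(self.predict(X) == np.asarray(y)))

clf = MajorityClassifier().fit([[0], [1], [2]], [1, 1, 0])
print(clf.predict([[5], [6]]), clf.score([[0], [1], [2]], [1, 1, 0]))
```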

KNN-related algorithms

Simple KNN

Example:

    from simple_ml.knn import *
    from dataset.classify_data import get_iris
    knn_test = myKNN(K=3,distance_type=DisType.CosSim)
    X, y = get_iris()
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3)
    knn_test.fit(X_train, y_train)
    print(knn_test.predict(X_test))
    print(knn_test.score(X_test, y_test))
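
For intuition, a k-nearest-neighbour vote with cosine similarity can be sketched as follows. The function is a hypothetical stand-in, assumes no all-zero rows, and is not the implementation behind myKNN or DisType.CosSim.

```python
import numpy as np

def knn_predict_cosine_sketch(X_train, y_train, X_test, k=3):
    """Majority vote among the k training rows most similar by cosine similarity."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)
    # Normalise rows to unit length so a dot product equals the cosine similarity.
    train_unit = X_train / np.linalg.norm(X_train, axis=1, keepdims=True)
    test_unit = X_test / np.linalg.norm(X_test, axis=1, keepdims=True)
    similarity = test_unit @ train_unit.T  # (n_test, n_train)
    nearest = np.argsort(-similarity, axis=1)[:, :k]
    predictions = []
    for row in nearest:
        labels, counts = np.unique(y_train[row], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

X_train = [[1, 0], [2, 0.1], [0, 1], [0.1, 2], [3, 0.2]]
y_train = [0, 0, 1, 1, 0]
knn_pred = knn_predict_cosine_sketch(X_train, y_train, [[5, 0.5], [0.2, 4]], k=3)
print(knn_pred)  # [0 1]
```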

KD tree

Coming soon

Logistic regression

Example

    import numpy as np
    from simple_ml.logistic import *
    X = np.array([[2,1], [4,2], [3,3], [4,1], [3,2], [2,3], [1,3]])
    y = np.array([1,2,0,1,0,1,2])
    lr = MyLogisticRegression(step=0.01,tol=1e-10)
    lr.fit(X, y)
    print(lr.predict(X))
    print(lr.score(X, y))
    lr.auc_plot(X, y)

Bayes-related algorithms

Naive Bayes

Example

    import numpy as np
    from simple_ml.naive_bayes import *
    X = np.array([[0, 0, 0, 1],
               [0, 1, 0, 0],
               [1, 1, 0, 1],
               [0, 1, 1, 1],
               [0, 0, 0, 0]])
    y = np.array([0,1,0,1,0])
    nb = MyNaiveBayes()
    nb.fit(X, y)
    X_test = np.array([0, 0, 0, 0]).reshape(1, -1)
    print(nb.predict(X_test))

Semi-naive Bayes

Coming soon

Bayes minimum error

Note: only discrete labels are supported

import numpy as np
from simple_ml.bayes import MyBayesMinimumError

X = np.array([[2,1],
             [0,3],
             [3,0],
             [1,2],
             [2,0],
             [0,1.5]])
y = np.array([1,0,1,0,1,0])
bme = MyBayesMinimumError()
bme.fit(X, y)
print(bme.predict(X))

Bayes minimum risk

Note: only discrete labels are supported

import numpy as np
from simple_ml.bayes import MyBayesMinimumRisk

X = np.array([[2,1],
             [0,3],
             [3,0],
             [1,2],
             [2,0],
             [0,1.5]])
y = np.array([1,0,1,0,1,0])
bme = MyBayesMinimumRisk(np.array([[0,10], [1,0]]))
bme.fit(X, y)
print(bme.predict(X))
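
The difference between the minimum-error and minimum-risk rules can be shown on posterior probabilities alone. The sketch below assumes that `loss[i, j]` is the cost of predicting class j when the true class is i; this convention, and the function name, are assumptions and may not match how MyBayesMinimumRisk interprets its matrix argument.

```python
import numpy as np

def minimum_risk_decision_sketch(posteriors, loss):
    """Pick, for each sample, the class with the smallest expected loss."""
    posteriors = np.asarray(posteriors, dtype=float)  # (n_samples, n_classes)
    loss = np.asarray(loss, dtype=float)              # (n_classes, n_classes)
    # risk[n, j] = sum_i P(class i | x_n) * loss[i, j]
    risk = posteriors @ loss
    return np.argmin(risk, axis=1), risk

loss = np.array([[0, 10], [1, 0]])
posteriors = np.array([[0.2, 0.8], [0.05, 0.95]])
decisions, risk = minimum_risk_decision_sketch(posteriors, loss)
print(decisions)                      # [0 1]
print(np.argmax(posteriors, axis=1))  # [1 1], the minimum-error decisions
```

For the first sample class 1 is more probable, yet the high cost of wrongly predicting class 1 makes class 0 the lower-risk decision.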

Tree-based algorithms

CART

Example

    import numpy as np
    from simple_ml.tree import *
    np.random.seed(1234)
    rt = RegressionTree(min_leaf_samples=3)
    X = np.random.rand(20, 10)
    Y = np.random.rand(20)
    X_test = np.random.rand(10).reshape(1, -1)  # one test sample with 10 features
    rt.fit(X, Y)
    print(rt.predict(X_test))

Random forest

Example

    from simple_ml.tree import *
    X, y = get_iris()
    X_train,X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    mrf = MyRandomForest(2)
    mrf.fit(X_train, y_train)
    print(mrf.predict(X_test))
    print(y_test)
    mrf.classifyPlot(X_test, y_test)

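Support vector machine

  • Only binary classification is supported for now
  • The following kernel functions are provided:
    class KernelType(Enum):
        linear = 0      # linear kernel
        polynomial = 1  # polynomial kernel
        gassian = 2     # Gaussian kernel
        laplace = 3     # Laplace kernel
        sigmoid = 4     # sigmoid kernel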
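
For reference, the textbook forms of these five kernels can be written in a few lines of numpy. The parameter names and default values below (`degree`, `coef0`, `sigma`, `beta`, `theta`) are assumptions for illustration; the parametrisation used inside simple_ml.svm may differ.

```python
import numpy as np

def linear_kernel(x, z):
    return float(np.dot(x, z))

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    return float((np.dot(x, z) + coef0) ** degree)

def gaussian_kernel(x, z, sigma=1.0):
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2 * sigma ** 2)))

def laplace_kernel(x, z, sigma=1.0):
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-np.linalg.norm(diff) / sigma))

def sigmoid_kernel(x, z, beta=1.0, theta=-1.0):
    return float(np.tanh(beta * np.dot(x, z) + theta))

x, z = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(linear_kernel(x, z))      # 11.0
print(polynomial_kernel(x, z))  # 144.0
print(gaussian_kernel(x, z), laplace_kernel(x, z), sigmoid_kernel(x, z))
```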

Example

    import numpy as np
    from simple_ml.svm import *
    from simple_ml.classify_data import get_iris
    X, y = get_iris()
    # keep only classes 1 and 2 to obtain a binary problem
    X = X[(y==1) | (y==2)]
    y = y[(y==1) | (y==2)]
    # relabel the two classes as +1 and -1
    y = np.array([i if i ==1 else -1 for i in y])
    mysvm = MySVM(0.6, 0.001, 0.00001, 50, KernelType.linear)
    mysvm.fit(X, y)
    print(mysvm.alphas, mysvm.b)
    print(mysvm.predict(X))
    mysvm.classifyPlot(X, y)

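Neural networks

BP neural network

Only the single-sample case has been implemented so far

Clustering

K-means clustering

Example

    import numpy as np
    from simple_ml.cluster import *
    X = np.array([1, 2,3, 5,6, 10,11,12,20, 35]).reshape(-1, 2)  # small hand-made dataset
    X = np.random.rand(50, 2)  # overrides the line above with 50 random 2-D points
    km = MyKMeans(3, DisType.Minkowski, d=2)
    km.fit(X)
    print(km.labels)
    # plot
    import matplotlib.pyplot as plt
    plt.scatter(x=X[:,0], y=X[:, 1], c=km.labels)
    plt.show()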
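
The standard k-means iteration (Lloyd's algorithm) alternates between assigning points to the nearest center and moving each center to the mean of its points. The sketch below uses Euclidean distance, which is assumed here to correspond to the Minkowski distance with d=2; `k_means_sketch` and its parameters are hypothetical and not the MyKMeans implementation.

```python
import numpy as np

def k_means_sketch(X, k, n_iter=100, seed=0):
    """Cluster the rows of X into k groups with Lloyd's algorithm."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distances[n, j] is the Euclidean distance from point n to center j.
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        # Move each center to the mean of its points; keep it if the cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
cluster_labels, cluster_centers = k_means_sketch(points, k=2)
print(cluster_labels)
```

Cluster numbering depends on the random initialisation, so only the grouping is meaningful, not the label values.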

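Hierarchical clustering

Example

    import numpy as np
    from simple_ml.cluster import *
    X = np.array([1, 2,3, 5,6, 10,11,12,20, 35]).reshape(-1, 2)  # small hand-made dataset
    X = np.random.rand(50, 2)  # overrides the line above with 50 random 2-D points
    km = MyHierarchical(DisType.Minkowski, d=2)
    km.fit(X)
    print(km.max_dis)
    print(km.cluster(km.max_dis/4))  # cut the hierarchy at a quarter of the maximum distance
    # plot
    import matplotlib.pyplot as plt
    plt.scatter(x=X[:,0], y=X[:, 1], c=km.labels)
    plt.show()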

Boosting

AdaBoost

from simple_ml.ensemble import MyAdaBoost
import numpy as np
X = np.array([[2,1], [4,2], [3,3], [4,1], [3,2], [2,3], [1,3]])
y = np.array([1,2,0,1,0,1,2])
lr = MyAdaBoost(nums=10)
lr.fit(X, y)
lr.predict(X)

GBDT

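  • Only 0-1 (binary) features are supported
  • Only continuous labels are supported
  • Only squared loss is supported

import numpy as np
from simple_ml.ensemble import *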

X = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1]).reshape(2, -1).T
y = np.array([3., 3.2, 2., 2.1, 1.5, 2.3, 1.4, 2.1])
gbdt = MyGBDT()
gbdt.fit(X, y)
print(gbdt.predict(np.array([[1, 1], [0, 0], [1, 0], [0, 1]])))
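
Under the three restrictions above (0-1 features, continuous labels, squared loss), gradient boosting reduces to repeatedly fitting a small tree to the current residuals. The sketch below uses one-split stumps and a fixed learning rate; all names and defaults are hypothetical, and MyGBDT may build deeper trees or differ in other details.

```python
import numpy as np

def fit_stump(X, residual):
    """Return (feature, mean_if_0, mean_if_1) for the best single 0-1 split, or None."""
    best = None
    for j in range(X.shape[1]):
        mask = X[:, j] == 1
        if mask.all() or not mask.any():
            continue  # this feature cannot split the data
        left, right = residual[~mask].mean(), residual[mask].mean()
        sse = ((residual[~mask] - left) ** 2).sum() + ((residual[mask] - right) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, j, left, right)
    return None if best is None else best[1:]

def gbdt_fit_sketch(X, y, n_rounds=20, learning_rate=0.5):
    base = y.mean()
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_rounds):
        residual = y - pred  # negative gradient of 0.5 * (y - f)^2
        stump = fit_stump(X, residual)
        if stump is None:
            break
        j, left, right = stump
        pred = pred + learning_rate * np.where(X[:, j] == 1, right, left)
        stumps.append(stump)
    return base, learning_rate, stumps

def gbdt_predict_sketch(model, X):
    base, learning_rate, stumps = model
    pred = np.full(len(X), base)
    for j, left, right in stumps:
        pred = pred + learning_rate * np.where(X[:, j] == 1, right, left)
    return pred

X_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1]).reshape(2, -1).T
y_demo = np.array([3., 3.2, 2., 2.1, 1.5, 2.3, 1.4, 2.1])
model = gbdt_fit_sketch(X_demo, y_demo)
print(gbdt_predict_sketch(model, np.array([[1, 1], [0, 0], [1, 0], [0, 1]])))
```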

Losers Always Whine About Their Best

Dedicated to everyone who tirelessly strives for their dreams
