Cluster_ML
无缺失值处理,因此无函数实现。
def Replace_Data(inputfile):
return Data
#将数据里面的Month类型和VisitorType类型还有Weekend类型进行数值转化;
#返回处理完的数据DataFrame类型
- 已实现
def Corr(inputfile):
return True
#输出相关性,相关性热力图,分层相关性热力图
- 已实现
Parameters
-
n_clustersint, default=8
The number of clusters to form as well as the number of centroids to generate.
-
init*{‘k-means++’, ‘random’, ndarray, callable}, default=’k-means++’*
Method for initialization:‘k-means++’ : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. See section Notes in k_init for more details.‘random’: choose
n_clusters
observations (rows) at random from data for the initial centroids.If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.If a callable is passed, it should take arguments X, n_clusters and a random state and return an initialization. -
n_initint, default=10
Number of time the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.
-
max_iterint, default=300
Maximum number of iterations of the k-means algorithm for a single run.
-
tolfloat, default=1e-4
Relative tolerance with regards to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence. It’s not advised to set
tol=0
since convergence might never be declared due to rounding errors. Use a very small number instead. -
precompute_distances*{‘auto’, True, False}, default=’auto’*
Precompute distances (faster but takes more memory).‘auto’ : do not precompute distances if n_samples * n_clusters > 12 million. This corresponds to about 100MB overhead per job using double precision.True : always precompute distances.False : never precompute distances.Deprecated since version 0.23: ‘precompute_distances’ was deprecated in version 0.22 and will be removed in 0.25. It has no effect.
-
verboseint, default=0
Verbosity mode.
-
random_stateint, RandomState instance, default=None
Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. See Glossary.
-
copy_xbool, default=True
When pre-computing distances it is more numerically accurate to center the data first. If copy_x is True (default), then the original data is not modified. If False, the original data is modified, and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean. Note that if the original data is not C-contiguous, a copy will be made even if copy_x is False. If the original data is sparse, but not in CSR format, a copy will be made even if copy_x is False.
-
n_jobsint, default=None
The number of OpenMP threads to use for the computation. Parallelism is sample-wise on the main cython loop which assigns each sample to its closest center.
None
or-1
means using all processors.Deprecated since version 0.23:n_jobs
was deprecated in version 0.23 and will be removed in 0.25. -
algorithm*{“auto”, “full”, “elkan”}, default=”auto”*
K-means algorithm to use. The classical EM-style algorithm is “full”. The “elkan” variation is more efficient on data with well-defined clusters, by using the triangle inequality. However it’s more memory intensive due to the allocation of an extra array of shape (n_samples, n_clusters).For now “auto” (kept for backward compatibiliy) chooses “elkan” but it might change in the future for a better heuristic.Changed in version 0.18: Added Elkan algorithm
Attributes
-
**cluster_centers_**ndarray of shape (n_clusters, n_features)
Coordinates of cluster centers. If the algorithm stops before fully converging (see
tol
andmax_iter
), these will not be consistent withlabels_
. -
**labels_**ndarray of shape (n_samples,)
Labels of each point
-
**inertia_**float
Sum of squared distances of samples to their closest cluster center.
-
**n_iter_**int
Number of iterations run.
def K_Means_(inputfile,n):
return True
#n为想聚到多少类
#用肘部法确定最佳的K值
#输出从1-n每个聚类的效果和图片
1.K-means算法在多种不同情况下的聚类表现得并不太好,我们可以看到K-means得到的簇更偏向于球形,这 意味着K-means算法不能处理非球形簇的聚类问题,而现实中数据的分布情况是十分复杂的,所以K-means 算法不太适用于现实大多数情况。
2.K-means算法需要预先确定聚类的个数。
3.K-means算法对初始选取的聚类中心点敏感。我们可以看到在Aggression数据聚类中,K-means算法会把不 同簇聚合在一起来满足球形状簇的聚类,而这种情况下得到的中心点在两个簇中间。这说明此时K-means陷 入了一个局部最优解,而陷入局部最优的一个原因是初始化中心点的位置不太好。
- 已实现
Parameters
-
n_clustersint or None, default=2
The number of clusters to find. It must be
None
ifdistance_threshold
is notNone
. -
affinitystr or callable, default=’euclidean’
Metric used to compute the linkage. Can be “euclidean”, “l1”, “l2”, “manhattan”, “cosine”, or “precomputed”. If linkage is “ward”, only “euclidean” is accepted. If “precomputed”, a distance matrix (instead of a similarity matrix) is needed as input for the fit method.
-
memorystr or object with the joblib.Memory interface, default=None
Used to cache the output of the computation of the tree. By default, no caching is done. If a string is given, it is the path to the caching directory.
-
connectivityarray-like or callable, default=None
Connectivity matrix. Defines for each sample the neighboring samples following a given structure of the data. This can be a connectivity matrix itself or a callable that transforms the data into a connectivity matrix, such as derived from kneighbors_graph. Default is None, i.e, the hierarchical clustering algorithm is unstructured.
-
compute_full_tree*‘auto’ or bool, default=’auto’*
Stop early the construction of the tree at n_clusters. This is useful to decrease computation time if the number of clusters is not small compared to the number of samples. This option is useful only when specifying a connectivity matrix. Note also that when varying the number of clusters and using caching, it may be advantageous to compute the full tree. It must be
True
ifdistance_threshold
is notNone
. By defaultcompute_full_tree
is “auto”, which is equivalent toTrue
whendistance_threshold
is notNone
or thatn_clusters
is inferior to the maximum between 100 or0.02 * n_samples
. Otherwise, “auto” is equivalent toFalse
. -
linkage*{“ward”, “complete”, “average”, “single”}, default=”ward”*
Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observation. The algorithm will merge the pairs of cluster that minimize this criterion.ward minimizes the variance of the clusters being merged.average uses the average of the distances of each observation of the two sets.complete or maximum linkage uses the maximum distances between all observations of the two sets.single uses the minimum of the distances between all observations of the two sets.New in version 0.20: Added the ‘single’ option
-
distance_thresholdfloat, default=None
The linkage distance threshold above which, clusters will not be merged. If not
None
,n_clusters
must beNone
andcompute_full_tree
must beTrue
.New in version 0.21.
Attributes
-
**n_clusters_**int
The number of clusters found by the algorithm. If
distance_threshold=None
, it will be equal to the givenn_clusters
. -
**labels_**ndarray of shape (n_samples)
cluster labels for each point
-
**n_leaves_**int
Number of leaves in the hierarchical tree.
-
**n_connected_components_**int
The estimated number of connected components in the graph.New in version 0.21:
n_connected_components_
was added to replacen_components_
. -
**children_**array-like of shape (n_samples-1, 2)
The children of each non-leaf node. Values less than
n_samples
correspond to leaves of the tree which are the original samples. A nodei
greater than or equal ton_samples
is a non-leaf node and has childrenchildren_[i - n_samples]
. Alternatively at the i-th iteration, children[i][0] and children[i][1] are merged to form noden_samples + i
def Agglo(inputfile,n):
return True
#输入文件不带最后一列
#输出聚类以后的图形
Parameters
-
X*{array-like, sparse (CSR) matrix} of shape (n_samples, n_features) or (n_samples, n_samples)*
A feature array, or array of distances between samples if
metric='precomputed'
. -
epsfloat, default=0.5
The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
-
min_samplesint, default=5
The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
-
metricstring, or callable
The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by
sklearn.metrics.pairwise_distances
for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square during fit. X may be a sparse graph, in which case only “nonzero” elements may be considered neighbors. -
metric_paramsdict, default=None
Additional keyword arguments for the metric function.New in version 0.19.
-
algorithm*{‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, default=’auto’*
The algorithm to be used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors. See NearestNeighbors module documentation for details.
-
leaf_sizeint, default=30
Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
-
pfloat, default=2
The power of the Minkowski metric to be used to calculate distance between points.
-
sample_weightarray-like of shape (n_samples,), default=None
Weight of each sample, such that a sample with a weight of at least
min_samples
is by itself a core sample; a sample with negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1. -
n_jobsint, default=None
The number of parallel jobs to run for neighbors search.
None
means 1 unless in ajoblib.parallel_backend
context.-1
means using all processors. See Glossary for more details. If precomputed distance are used, parallel execution is not available and thus n_jobs will have no effect.
Returns
-
core_samplesndarray of shape (n_core_samples,)
Indices of core samples.
-
labelsndarray of shape (n_samples,)
Cluster labels for each point. Noisy samples are given the label -1.
def DBS_(inputfile):
return True
#传入文件需带最后一列
#输出下列内容
#Estimated number of clusters: 6
#Estimated number of noise points: 12225
#Homogeneity: 0.003
#Completeness: 0.023
#V-measure: 0.006
#Adjusted Rand Index: -0.013
#Adjusted Mutual Information: 0.005
#Silhouette Coefficient: -0.325
#聚类图形
- 已实现
def Judge_1(inputfile,n):
return True
#n为聚成几类
#输出KMeans、AgglomerativeClustering、DBSCAN聚类效果
- 已实现
def Judge_2(inputfile,n):
return True
#同上
- 已实现
#%%
import pandas as pd
import numpy as np
data=pd.read_csv('data/online_shoppers_intention.csv')
data.head()
#%%
from Quick_Init import *
data=Replace_Data('data/online_shoppers_intention_1.csv')
data.head()
data.to_csv('data/trans_1.csv')
#%%
data.describe().round(2).T
#数据描述
#%%
from Quick_Init import *
K_Means_('data/trans_1.csv',10)
#%%
from Quick_Init import *
Judge_1('data/trans_1.csv',2)
#%%
from Quick_Init import *
Judge_2('data/trans_2.csv',5)
#%%
from Quick_Init import *
DBS_('data/trans_2.csv')
#%%
from Quick_Init import *
Agglo('data/trans_1.csv',5)