The previous version was a bit messy, so I'm reorganizing it. 4/16/2014
I uploaded the PPT, but it is still under review. I was going to say "see the PPT for the figures", but for now you'll have to look them up in the books and papers...
First, about Hierarchical Clustering:
A hierarchical method creates a hierarchical decomposition of the given set of data objects. Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone.
CURE
Material on CURE is easy to find; the data mining textbook (《数据挖掘导论》, Introduction to Data Mining) covers it in detail, so it is one of the easier topics to read up on. You can refer to http://wiki.madio.net/index.php?doc-view-996 and search for the paper: Sudipto Guha, Rajeev Rastogi, Kyuseok Shim, "CURE: An Efficient Clustering Algorithm for Large Databases".
CURE is a clustering algorithm that combines several techniques into a method that can handle large data sets, outliers, and clusters that are non-spherical and of non-uniform size.
Shrinking representative points toward the center helps avoid problems with noise and outliers (a small sketch of this representative-point step follows the algorithm below).
Output: the set of clusters
1. Draw a random sample from the data set.
The CURE paper is notable for explicitly deriving a formula for how large this sample should be in order to guarantee, with high probability, that all clusters are represented by a minimum number of points.
2. Partition the sample into p equal-sized partitions.
3. Cluster the points in each partition into m/pq clusters using CURE's hierarchical clustering algorithm to obtain a total of m/q clusters.
4. Use CURE's hierarchical clustering algorithm to cluster the m/q clusters found in the previous step until only k clusters remain.
5. Eliminate outliers. This is the second phase of outlier elimination.
6. Assign all remaining data points to the nearest cluster to obtain a complete clustering.
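As a rough illustration of the representative-point idea, here is a minimal numpy sketch, assuming a cluster is passed in as an array of points; the function name, the number of representatives c, and the shrink factor alpha are illustrative choices rather than values from the paper.

```python
import numpy as np

def cure_representatives(cluster_points, c=10, alpha=0.3):
    """Pick up to c well-scattered points from a cluster, then shrink them
    toward the cluster centroid by the factor alpha (illustrative sketch)."""
    cluster_points = np.asarray(cluster_points, dtype=float)
    centroid = cluster_points.mean(axis=0)

    # Start with the point farthest from the centroid, then repeatedly add the
    # point farthest from the representatives chosen so far ("well-scattered" points).
    reps = [cluster_points[np.argmax(np.linalg.norm(cluster_points - centroid, axis=1))]]
    while len(reps) < min(c, len(cluster_points)):
        dists = np.min([np.linalg.norm(cluster_points - r, axis=1) for r in reps], axis=0)
        reps.append(cluster_points[np.argmax(dists)])

    # Shrinking toward the center dampens the effect of noise and outliers.
    return np.array([r + alpha * (centroid - r) for r in reps])
```

In the full algorithm, distances between clusters are then measured between these shrunken representatives rather than between centroids or all pairs of points.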
From the earlier experiments you can see that CURE is better able to handle clusters of arbitrary shapes and sizes.
However, CURE cannot handle clusters of differing densities.
In other words, the similarity between clusters is based on the number of points from different clusters that have common neighbors.
A pair of points is defined to be neighbors if their similarity is greater than some threshold.
Use a hierarchical clustering scheme to cluster the data:
1. Obtain a sample of points from the data set.
2. Compute the link value for each set of points, i.e., transform the original similarities (computed with the Jaccard coefficient) into similarities that reflect the number of shared neighbors between points (see the sketch after these steps).
3. Perform an agglomerative hierarchical clustering on the data using the "number of shared neighbors" as the similarity measure and maximizing the "shared neighbors" objective function.
4. Assign the remaining points to the clusters that have been found.
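To make the link-value step concrete, here is a minimal sketch assuming binary (transaction-style) data, so that the Jaccard coefficient applies; the function name and the neighbor threshold theta are illustrative assumptions.

```python
import numpy as np

def link_matrix(binary_data, theta=0.5):
    """Turn Jaccard similarities into shared-neighbor ("link") counts (sketch).

    Two points are neighbors if their Jaccard similarity exceeds theta;
    link(x, y) is then the number of neighbors that x and y share."""
    data = np.asarray(binary_data, dtype=bool)
    n = len(data)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            union = np.logical_or(data[i], data[j]).sum()
            inter = np.logical_and(data[i], data[j]).sum()
            sim[i, j] = inter / union if union else 0.0

    neighbors = sim > theta
    np.fill_diagonal(neighbors, False)   # do not count a point as its own neighbor
    adj = neighbors.astype(int)
    return adj @ adj                     # entry (x, y) = number of shared neighbors
```

The agglomerative step then uses these link counts, rather than the raw Jaccard similarities, to decide which clusters to merge.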
Start with the proximity matrix
Consider each point as a node in a graph
Each edge between two nodes has a weight which is the proximity between the two points
Initially the proximity graph is fully connected
MIN (single-link) and MAX (complete-link) can be viewed as starting with this graph
In the simplest case, clusters are connected components in the graph.
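As a minimal sketch of the connected-components case, assuming a precomputed distance matrix and an illustrative distance threshold (both the function and parameter names are assumptions):

```python
import numpy as np
from collections import deque

def connected_component_clusters(dist, threshold):
    """Treat points as graph nodes, connect pairs closer than `threshold`,
    and return the connected components as cluster labels (sketch)."""
    dist = np.asarray(dist, dtype=float)
    n = len(dist)
    adj = dist < threshold
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        queue = deque([start])
        while queue:                      # breadth-first search over the graph
            u = queue.popleft()
            for v in range(n):
                if adj[u, v] and labels[v] == -1:
                    labels[v] = current
                    queue.append(v)
        current += 1
    return labels
```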
- The main properties are the relative closeness and relative interconnectivity of the clusters
- Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters
- The merging scheme preserves self-similarity
- Build a k-nearest neighbor graph (a small sketch of this step follows the list)
- Partition the graph using a multilevel graph partitioning algorithm
- Repeat
- Merge the clusters that best preserve the cluster self-similarity with respect to relative interconnectivity and relative closeness
- Until no more clusters can be merged
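Here is a minimal numpy sketch of the first step, building the k-nearest-neighbor graph (the function name and the default k are illustrative); the multilevel partitioning and the merge loop would normally be handled by a dedicated graph-partitioning tool and are not shown.

```python
import numpy as np

def knn_graph(points, k=5):
    """Boolean adjacency matrix connecting every point to its k nearest
    neighbors, symmetrized so that edges are undirected (sketch)."""
    points = np.asarray(points, dtype=float)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)        # a point is not its own neighbor

    nearest = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros(dist.shape, dtype=bool)
    rows = np.repeat(np.arange(len(points)), k)
    adj[rows, nearest.ravel()] = True
    return adj | adj.T                    # symmetrize: undirected k-NN graph
```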
If two points are similar to many of the same points, then they are similar to one another, even if a direct measurement of similarity does not indicate this.
Find the k-nearest neighbors of all points.
if two points, x and y, are not among the k-nearest neighbors of each other then
    similarity(x, y) <- 0
else
    similarity(x, y) <- number of shared neighbors
end if
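The pseudocode above can be turned into a small runnable sketch; here is one possible numpy version, where the helper name and the default k are purely illustrative:

```python
import numpy as np

def snn_similarity(points, k=5):
    """Shared-nearest-neighbor similarity: the number of shared k-nearest
    neighbors if x and y are in each other's k-NN lists, else 0 (sketch)."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    knn = [set(np.argsort(dist[i])[:k]) for i in range(n)]

    sim = np.zeros((n, n), dtype=int)
    for x in range(n):
        for y in range(n):
            if x != y and x in knn[y] and y in knn[x]:
                sim[x, y] = len(knn[x] & knn[y])   # number of shared neighbors
            # otherwise the similarity stays 0, matching the pseudocode
    return sim
```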
Jarvis-Patrick clustering
First, the k-nearest neighbors of all points are found.
In graph terms this can be regarded as breaking all but the k strongest links from a point to other points in the proximity graph.
A pair of points is put in the same cluster if
the two points share more than a threshold number of neighbors and
the two points are in each other's k-nearest neighbor lists.
For instance, we might choose a nearest-neighbor list of size 20 and put points in the same cluster if they share more than 10 near neighbors.
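Putting the pieces together, a sketch of Jarvis-Patrick clustering built on the snn_similarity() sketch above might look like this (k=20 and t=10 mirror the example just given; everything else is an illustrative assumption):

```python
def jarvis_patrick(points, k=20, t=10):
    """Link two points if they are in each other's k-NN lists and share more
    than t neighbors; clusters are the connected components of that graph."""
    sim = snn_similarity(points, k=k)     # 0 unless the points are mutual k-NN
    n = len(points)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:                      # depth-first search over the SNN graph
            u = stack.pop()
            for v in range(n):
                if sim[u, v] > t and labels[v] == -1:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels
```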
(2) It works well for high-dimensional data and is particularly good at finding tight clusters of strongly related objects.
Not all objects are clustered.
It is hard to choose the best values for the parameters.
Storage requirements of the JP clustering algorithm are only O(km).
The basic time complexity of JP clustering is O(m²).
For low-dimensional Euclidean data, the k-nearest neighbors can be found more efficiently, which reduces the time complexity from O(m²) to O(m log m).