Clustering - k-means & SOM
Chapters
- General Terms and tools
- PCA
- Hebbian Learning
- Kernel-PCA
- Source Separation
- ICA
- Infomax ICA
- Second Order Source Separation
- FastICA
- Stochastic Optimization
- Clustering
- k-means Clustering
- Pairwise Clustering
- Self-Organising Maps
- Locally Linear Embedding
- Estimation Theory
- Density Estimation
- Kernel Density Estimation
- Parametric Density Estimation
- Mixture Models - Estimation Models
K-means Clustering
K-means Clustering is good at finding equally sized clusters of data points.
Parameters
- Distance Function (Usually Euclidean)
- Number of clusters
Drawbacks
- Copes poorly with clusters of very different sizes or densities (it tends to produce clusters of similar extent)
- Cannot automatically select the number of clusters (resolution)
Two algorithms
- Batch K-Means
- Online K-Means
Online K-Means is generally preferable: it is less likely to get stuck in poor local minima and it can be used for streaming data (that is why it is called 'online').
Batch K-Means
- initialize prototypes $\vec{w}_q$ randomly around the center of mass
- loop until the assignments no longer change:
    - assign all data points to their closest prototype
    - choose each new $\vec{w}_q$ so that it minimizes the error function. For the Euclidean distance this is the center of mass of the cluster, i.e. the sum of all data points in the cluster divided by the number of data points in the cluster: $\vec{w}_q = \dfrac{\sum_{\alpha} m_q^{(\alpha)}\vec{x}^{(\alpha)}}{\sum_{\alpha} m_q^{(\alpha)}}$, where $m_q^{(\alpha)} \in \{0,1\}$ marks the assignment of data point $\alpha$ to cluster $q$
    - re-assign all data points to their closest prototype
- end loop
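A minimal NumPy sketch of the batch loop above, assuming squared Euclidean distance (the function name `batch_kmeans` and the initialization details are illustrative, not from the lecture):

```python
import numpy as np

def batch_kmeans(X, M, n_iter=100, rng=None):
    """Batch k-means. X: (N, D) data matrix, M: number of prototypes."""
    rng = rng or np.random.default_rng(0)
    # initialize prototypes randomly around the center of mass
    w = X.mean(axis=0) + 0.01 * rng.standard_normal((M, X.shape[1]))
    for _ in range(n_iter):
        # assignment step: each data point goes to its closest prototype
        labels = np.argmin(((X[:, None, :] - w[None, :, :]) ** 2).sum(-1), axis=1)
        # update step: each prototype becomes the center of mass of its cluster
        for q in range(M):
            if np.any(labels == q):
                w[q] = X[labels == q].mean(axis=0)
    return w, labels
```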
Problem
The batch K-Means objective is not convex. In a convex optimization problem every local minimum is also a global minimum; since this is not the case here, we are not guaranteed to converge to the global optimum.
Online K-Means
- Initialize all prototypes $\vec{w}_q$ randomly around the first data point
- Select a learning rate $\varepsilon$
- For each data point $\vec{x}^{(\alpha)}$ (presented in random order, or as it arrives from the stream):
    - find the closest prototype $\vec{w}_q$
    - nudge that prototype a bit in the direction of the data point: $\Delta\vec{w}_q = \varepsilon\left(\vec{x}^{(\alpha)} - \vec{w}_q\right)$
Online K-Means is less likely to be stuck in a local minimum.
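A corresponding sketch of the online update rule (the learning rate `eps` and the initialization spread are assumptions):

```python
import numpy as np

def online_kmeans(stream, M, eps=0.05, rng=None):
    """Online k-means on a stream (iterable) of data points."""
    rng = rng or np.random.default_rng(0)
    stream = iter(stream)
    x0 = np.asarray(next(stream), dtype=float)
    # initialize all prototypes randomly around the first data point
    w = x0 + 0.01 * rng.standard_normal((M, x0.shape[0]))
    for x in stream:
        x = np.asarray(x, dtype=float)
        q = np.argmin(((w - x) ** 2).sum(axis=1))  # closest prototype
        w[q] += eps * (x - w[q])                   # nudge it towards the point
    return w
```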
Robustness
- Cluster labels are not ordered: the same partition can come back with the labels permuted, so clusterings from different runs can only be compared up to a permutation of the labels.
- A good choice of the number of clusters can be selected by validation (see below).
Validation
TODO: page 19
The following is likely wrong; better to read up on it in the original paper1:
Algorithm:

```python
# Stability-based validation (sketch): cluster several splits of the data
# for each candidate m and compare how similar the resulting clusterings are.
for m in range(m_min, m_max + 1):
    clusterings = []
    for x_split in split_in_pieces(x, r):      # r splits of the data x
        clusterings.append(find_clusters(x_split, m))
    # TODO: compute pairwise dissimilarities between the clusterings,
    # after matching labels (labels are only defined up to permutation)
    d = dissimilarity(clusterings[0], clusterings[1])
```
Pairwise Clustering
Pairwise Clustering works on pairwise dissimilarities $d_{\alpha\beta}$ between data points (e.g. a distance matrix) rather than on the data vectors themselves.
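One common way to write the pairwise clustering cost function (notation assumed here: binary assignments $m_q^{(\alpha)}$ and dissimilarities $d_{\alpha\beta}$; check the lecture notes for the exact convention):

$$
E = \frac{1}{2} \sum_{q=1}^{M} \frac{\sum_{\alpha,\beta} m_q^{(\alpha)} m_q^{(\beta)}\, d_{\alpha\beta}}{\sum_{\gamma} m_q^{(\gamma)}}
$$

i.e. the average within-cluster dissimilarity summed over clusters. For $d_{\alpha\beta} = \lVert\vec{x}^{(\alpha)} - \vec{x}^{(\beta)}\rVert^2$ this reduces to the k-means cost, which is why soft k-means (below) covers the Euclidean case.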
Mean-Field Approximation (not very relevant)
- simulated annealing (slow)
- mean-field approximation (good and fast, robust against local optima)
Soft-K-means clustering (Euclidean distances) (Very Relevant)
Algorithm
- choose the number of partitions $M$
- choose parameters:
    - initial noise parameter $\beta_0$ (an inverse temperature: larger $\beta$ means less noise)
    - final noise parameter $\beta_f$
    - annealing factor $\eta$
    - convergence criterion $\theta$
- initialize the prototypes $\vec{w}_q$
- while $\beta < \beta_f$:
    - repeat EM until the prototypes change by less than $\theta$:
        - compute the assignment probabilities $m_q^{(\alpha)} = \dfrac{\exp\left(-\frac{\beta}{2}\lVert\vec{x}^{(\alpha)}-\vec{w}_q\rVert^2\right)}{\sum_{p}\exp\left(-\frac{\beta}{2}\lVert\vec{x}^{(\alpha)}-\vec{w}_p\rVert^2\right)}$
        - compute the new prototypes $\vec{w}_q = \dfrac{\sum_{\alpha} m_q^{(\alpha)}\vec{x}^{(\alpha)}}{\sum_{\alpha} m_q^{(\alpha)}}$
    - increase $\beta$ (i.e. reduce the noise): $\beta \leftarrow \eta\,\beta$
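A rough NumPy sketch of the annealed EM loop (the parameter names and the multiplicative annealing step are assumptions following the description above):

```python
import numpy as np

def soft_kmeans(X, M, beta0=0.1, beta_final=100.0, eta=1.1, theta=1e-4, rng=None):
    """Soft k-means with annealing. X: (N, D) data, M: number of prototypes."""
    rng = rng or np.random.default_rng(0)
    w = X.mean(axis=0) + 0.01 * rng.standard_normal((M, X.shape[1]))
    beta = beta0
    while beta < beta_final:
        while True:  # EM at the current noise level until convergence
            d2 = ((X[:, None, :] - w[None, :, :]) ** 2).sum(-1)        # (N, M)
            a = np.exp(-0.5 * beta * (d2 - d2.min(axis=1, keepdims=True)))
            m = a / a.sum(axis=1, keepdims=True)                       # soft assignments
            w_new = (m.T @ X) / m.sum(axis=0)[:, None]                 # new prototypes
            converged = np.abs(w_new - w).max() < theta
            w = w_new
            if converged:
                break
        beta *= eta  # anneal: increase beta, i.e. reduce the noise
    return w, m
```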
Self-Organising Maps (SOM)
Self-Organising Maps are useful for dimensionality reduction while preserving neighbourhood relations. At the same time, SOM is a clustering method, in contrast to LLE, which projects all data points into the lower-dimensional space (see below).
Parameters
- number of partitions/neurons
- annealing schedule: learning rate $\varepsilon$ and annealing factor
- prototypes $\vec{w}_q$: initialized at the center of mass plus random noise
Algorithm
- for each data point $\vec{x}^{(\alpha)}$, drawn in random order:
    - get the closest prototype (the winner): $q^* = \arg\min_q \lVert\vec{x}^{(\alpha)} - \vec{w}_q\rVert$
      Note: the index $q$ lives in the map space while $\vec{w}_q$ lives in the data space.
    - change all prototypes using: $\Delta\vec{w}_q = \varepsilon\, h(q, q^*)\left(\vec{x}^{(\alpha)} - \vec{w}_q\right)$

Example choice of the neighbourhood function: a Gaussian in map space, $h(q, q^*) = \exp\left(-\dfrac{\lVert q - q^*\rVert^2}{2\sigma^2}\right)$ with width $\sigma$.
Annealing
- Decrease the neighbourhood width $\sigma$ (and the learning rate $\varepsilon$) slowly over time

This algorithm would be similar to online k-means if one were to update only the winning prototype, i.e. if the neighbourhood function collapsed to $h(q, q^*) = \delta_{q q^*}$.
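A minimal sketch of SOM training for a one-dimensional map with a Gaussian neighbourhood (the per-step multiplicative decay of `eps` and `sigma` is just one possible annealing schedule, not necessarily the lecture's):

```python
import numpy as np

def train_som(X, M=20, eps=0.1, sigma=3.0, decay=0.999, n_epochs=10, rng=None):
    """Train a 1-D SOM with M neurons on data X of shape (N, D)."""
    rng = rng or np.random.default_rng(0)
    q_idx = np.arange(M)  # neuron positions in map space
    # prototypes: center of mass plus random noise (in data space)
    w = X.mean(axis=0) + 0.01 * rng.standard_normal((M, X.shape[1]))
    for _ in range(n_epochs):
        for x in X[rng.permutation(len(X))]:               # random order
            q_star = np.argmin(((w - x) ** 2).sum(axis=1))  # winner neuron
            h = np.exp(-0.5 * (q_idx - q_star) ** 2 / sigma ** 2)  # map-space neighbourhood
            w += eps * h[:, None] * (x - w)                 # move all prototypes
            eps *= decay                                    # anneal learning rate
            sigma *= decay                                  # shrink the neighbourhood
    return w
```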
Locally Linear Embedding
“locally linear embedding (LLE), an unsupervised learning algorithm that computes low dimensional, neighborhood preserving embeddings of high dimensional data.”2
In Locally Linear Embedding the idea is to split the data into small overlapping patches: for each data point, weights $W_{ij}$ are found that best reconstruct it as a linear combination of its neighbours, and the same weights are then used to place the points in a lower-dimensional space.
Algorithm
Parameters: number of neighbours $K$, embedding dimension $d$
- find the $K$ nearest neighbours of each data point $\vec{x}_i$
- calculate the reconstruction weights $W_{ij}$ by minimizing the cost function $E(W) = \sum_i \lVert \vec{x}_i - \sum_j W_{ij}\vec{x}_j \rVert^2$, subject to $W_{ij} = 0$ whenever $\vec{x}_j$ is not a neighbour of $\vec{x}_i$ and $\sum_j W_{ij} = 1$
- calculate the embedding coordinates $\vec{y}_i$ by minimizing, with the weights held fixed, the cost function $\Phi(Y) = \sum_i \lVert \vec{y}_i - \sum_j W_{ij}\vec{y}_j \rVert^2$

LLE is very good at preserving neighbourhoods in high-dimensional data, and its optimization steps have global solutions (a least-squares problem for the weights, an eigenvalue problem for the embedding), so it does not suffer from local minima.
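A compact NumPy sketch of the three steps (the brute-force neighbour search and the regularizer `reg` are implementation choices of this sketch, not part of the lecture notes):

```python
import numpy as np

def lle(X, K=10, d=2, reg=1e-3):
    """Locally Linear Embedding. X: (N, D) data, returns (N, d) coordinates."""
    N = X.shape[0]
    W = np.zeros((N, N))
    for i in range(N):
        # 1. K nearest neighbours of x_i (excluding x_i itself)
        nbrs = np.argsort(np.linalg.norm(X - X[i], axis=1))[1:K + 1]
        # 2. reconstruction weights: local least squares with sum-to-one constraint
        Z = X[nbrs] - X[i]                      # neighbours centred on x_i
        C = Z @ Z.T
        C += reg * np.trace(C) * np.eye(K)      # regularize (needed if K > D)
        wi = np.linalg.solve(C, np.ones(K))
        W[i, nbrs] = wi / wi.sum()
    # 3. embedding: bottom non-constant eigenvectors of (I - W)^T (I - W)
    Mmat = (np.eye(N) - W).T @ (np.eye(N) - W)
    _, vecs = np.linalg.eigh(Mmat)
    return vecs[:, 1:d + 1]
```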