Unsupervised Learning Methods Exam Preparation
Following the principle “you have only understood something thoroughly if you can explain it”, here are my prep notes for Machine Intelligence II. If no source is indicated, the content comes from the lecture slides.
Note: This was written first and foremost for my own understanding, so it might contain incomplete explanations.
Chapters
- General Terms and tools
- PCA
- Hebbian Learning
- Kernel-PCA
- Source Separation
- ICA
- Infomax ICA
- Second Order Source Separation
- FastICA
- Stochastic Optimization
- Clustering
- k-means Clustering
- Pairwise Clustering
- Self-Organising Maps
- Locally Linear Embedding
- Estimation Theory
- Density Estimation
- Kernel Density Estimation
- Parametric Density Estimation
- Mixture Models - Estimation Models
General Terms and tools
Many of the methods rely on some general methodology that gets reused throughout. Need a refresher on Matrix multiplication? Oh, and the dot product is the same thing as a scalar product.
Centered Data
Centering data means shifting its center of mass to 0: for every dimension, the mean is computed and then subtracted from all data points.
\[X = X - \frac{1}{p}\sum_{\alpha=1}^p x^{(\alpha)}\]
The subtracted mean \(\frac{1}{p}\sum_{\alpha=1}^p x^{(\alpha)}\) is also called the first moment.
or with numpy:
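(a minimal sketch, assuming the data matrix `X` has shape `(p, n)` with one data point per row)

```python
import numpy as np

X = np.random.rand(100, 3)   # made-up data: p = 100 points in n = 3 dimensions

# subtract the per-dimension mean (the first moment) from every data point
X_centered = X - X.mean(axis=0)

# sanity check: every dimension now has (numerically) zero mean
assert np.allclose(X_centered.mean(axis=0), 0)
```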
Covariance matrix
Assuming \(p\) centered data points \(x^{(\alpha)}\):
\[\begin{align*} C_{ij} &= \frac{1}{p}\sum_{\alpha=1}^p x_i^{(\alpha)}x_j^{(\alpha)} \\ \text{or} \quad C &= \frac{1}{p}\underline{x}^T\underline{x} \end{align*}\]
Whitened Data
Whitening transforms the data so that its covariance matrix becomes the identity matrix. The whitened data is then uncorrelated (but might still be statistically dependent). This is useful e.g. for finding outliers.
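A minimal numpy sketch of both steps, on made-up centered data with one point per row; the whitening here uses the eigendecomposition of the covariance matrix (PCA whitening), which is one common choice:

```python
import numpy as np

X = np.random.rand(100, 3)            # made-up data, one point per row
X_centered = X - X.mean(axis=0)
p, n = X_centered.shape

# covariance matrix: C = (1/p) * X^T X (for centered X)
C = X_centered.T @ X_centered / p

# PCA whitening: rotate onto the eigenvectors and rescale by 1/sqrt(eigenvalue)
# (assumes all eigenvalues are clearly positive)
eigvals, eigvecs = np.linalg.eigh(C)
X_white = X_centered @ eigvecs / np.sqrt(eigvals)

# sanity check: the whitened data has (approximately) the identity covariance matrix
assert np.allclose(X_white.T @ X_white / p, np.eye(n))
```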
Kullback-Leibler-Divergence
The Kullback-Leibler divergence measures the difference between two probability distributions, in this example \(P\) and \(\hat P\) (it is not symmetric, so it is not a true distance). Fitting the parametric model \(\hat{P}(\underline x; \underline w)\) to \(P\) then means minimizing the divergence with respect to the parameters \(\underline w\):
\[D_{KL}[P(\underline x), \hat{P}(\underline x; \underline w)] = \int d\underline{x}\, P(\underline x) \ln \frac{P(\underline x)}{\hat{P}(\underline x; \underline w)} \overset{!}{=} \underset{\underline w}{\min}\]
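A small numpy sketch for the discrete case, where the integral becomes a sum over states; `P` and `P_hat` are made-up example distributions:

```python
import numpy as np

P     = np.array([0.1, 0.4, 0.3, 0.2])      # "true" distribution (made up)
P_hat = np.array([0.25, 0.25, 0.25, 0.25])  # model distribution (made up)

# D_KL[P, P_hat] = sum_x P(x) * ln(P(x) / P_hat(x))
D_kl = np.sum(P * np.log(P / P_hat))
print(D_kl)   # >= 0, and in general D_KL[P, P_hat] != D_KL[P_hat, P]
```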
Jacobian Matrix
For a function \(f: I\!R^n \rightarrow I\!R^m\) the Jacobian matrix is the \(m \times n\) matrix of all first-order partial derivatives:
\[\begin{bmatrix} \frac{\partial f_1}{\partial x_1}& \cdots & \frac{\partial f_1}{\partial x_n}\\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1}& \cdots & \frac{\partial f_m}{\partial x_n}\\ \end{bmatrix}\]
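As a small worked example, for \(f(x_1, x_2) = (x_1^2 x_2,\; 5x_1 + \sin x_2)\) the Jacobian matrix is
\[\begin{bmatrix} 2x_1x_2 & x_1^2 \\ 5 & \cos x_2 \end{bmatrix}\]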
Mercer's theorem
From the slides:
Every positive semidefinite kernel \(k\) corresponds to a scalar product in some metric feature space.
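A quick numerical illustration (not a proof), using the Gaussian/RBF kernel as an example of a positive semidefinite kernel: its Gram matrix on any set of points has no negative eigenvalues (up to numerical error):

```python
import numpy as np

X = np.random.rand(50, 3)   # made-up data points
sigma = 1.0

# Gaussian / RBF kernel: k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma**2))

# Gram matrix of a positive semidefinite kernel: no negative eigenvalues
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() > -1e-10   # tolerance for numerical error
```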
Markov Process
A Markov process depends only on the most recent state: the probabilities of which state it moves to next are independent of all older states.
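A tiny sketch with a made-up two-state transition matrix: sampling the next state only needs the current state, never the history:

```python
import numpy as np

# T[i, j] = probability of moving from state i to state j (made-up values)
T = np.array([[0.9, 0.1],
              [0.5, 0.5]])

rng = np.random.default_rng(0)
state = 0
trajectory = [state]
for _ in range(10):
    # the Markov property: the next state depends only on the current state
    state = rng.choice(2, p=T[state])
    trajectory.append(int(state))
print(trajectory)
```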
Variance
\[\sigma^2 = E[(x-\mu)^2]\]
Discrete
\[\sigma ^2 = \frac{1}{p}\sum_{\alpha=1}^p(x_\alpha-\mu)^2\]
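In numpy, a quick check that the discrete formula matches the built-in (assuming `x` is a 1-D array of samples):

```python
import numpy as np

x = np.random.rand(1000)              # made-up samples
mu = x.mean()

# average squared deviation from the mean
var_manual = np.mean((x - mu) ** 2)

# np.var uses the same 1/p normalisation by default (ddof=0)
assert np.isclose(var_manual, x.var())
```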