Estimation theory - Kernel Density Estimation
Chapters
- General Terms and tools
- PCA
- Hebbian Learning
- Kernel-PCA
- Source Separation
- ICA
- Infomax ICA
- Second Order Source Separation
- FastICA
- Stochastic Optimization
- Clustering
- k-means Clustering
- Pairwise Clustering
- Self-Organising Maps
- Locally Linear Embedding
- Estimation Theory
- Density Estimation
- Kernel Density Estimation
- Parametric Density Estimation
- Mixture Models - Estimation Models
Density Estimation
The goal of density estimation is to estimate the probability density \(\hat P(\underline x)\) at every point \(\underline x\) of the vector space, given a finite set of data points.
There are two approaches:
- parametric (model based)
- Gaussian Densities
- nonparametric (data driven)
- Kernel Density Estimate
Kernel Density Estimation (illustrated with the gliding histogram)
Parameter
- \(h\) width of rectangle
Histogram Kernel \(H\)
- \(\underline x\) is the point at which we want to estimate the density
- \(\underline u = \frac{\underline x - \underline x^{(\alpha)}}{h}\) is the distance between \(\underline x\) and a data point, scaled by the rectangle width \(h\); this scaling means every component of \(\underline u\) lies in \((-\frac{1}{2},\frac{1}{2})\) exactly when the data point falls inside the hypercube of side length \(h\) centred at \(\underline x\)
The kernel answers the question: does the data point lie inside our rectangle of width \(h\)?
\[\begin{align*} H(\underline u) = \begin{cases} 1,&|u_j|<\frac{1}{2}\ \forall j\in\{1,\dots,n\}\\ 0,&\text{else} \end{cases} \end{align*}\]The density estimate is the fraction of data points that fall into the rectangle around \(\underline x\), divided by the rectangle's volume:
\[\hat P(\underline x) = \frac{1}{p\,h^n}\sum^p_{\alpha=1} H\!\left(\frac{\underline x - \underline x^{(\alpha)}}{h}\right)\]
- \(h\) width of the rectangle
- \(n\) number of dimensions
- \(p\) number of data points
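As a concrete illustration, here is a minimal NumPy sketch of the box-kernel ("gliding histogram") estimate above; the function names and the toy data are my own choices, not from the lecture.

```python
import numpy as np

def box_kernel(u):
    # H(u) = 1 if every component satisfies |u_j| < 1/2, else 0
    return np.all(np.abs(u) < 0.5, axis=-1).astype(float)

def gliding_histogram(x, data, h):
    # density estimate at x: count data points inside the hypercube of
    # side length h centred at x, normalised by p * h^n
    p, n = data.shape
    u = (x - data) / h              # shape (p, n): scaled distances
    return box_kernel(u).sum() / (p * h**n)

# toy example (illustration only): 500 samples from a 2D standard normal
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))
print(gliding_histogram(np.zeros(2), data, h=0.5))
```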
Drawbacks of Gliding Histograms
- "Bumpy": the estimate jumps whenever a data point enters or leaves the rectangle (especially noticeable with few data points or in high dimensions)
- The rectangle is not a very good kernel choice: its hard edges make the estimate discontinuous
- The optimal size of \(h\) is non-trivial and requires model selection: a small \(h\) leads to overfitting (a spiky estimate), a large \(h\) oversmooths
**Alternative: Gaussian kernel**
A Gaussian kernel can be used instead of the rectangle; it removes the discontinuities and avoids most of the side effects above.
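A corresponding sketch with a Gaussian kernel, using the same interface and toy data as the box-kernel sketch above (again just an illustration):

```python
import numpy as np

def gaussian_kde(x, data, h):
    # same structure as the box estimator, but with an isotropic Gaussian kernel:
    # K(u) = (2*pi)^(-n/2) * exp(-|u|^2 / 2), where u = (x - x_alpha) / h
    p, n = data.shape
    u = (x - data) / h
    k = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (n / 2)
    return k.sum() / (p * h ** n)
```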
Parametric Density Estimation
Parametric density estimation assumes a family of model densities and finds good values for its parameters \(\underline w\) from the data (the same model-selection machinery can also be used to choose the kernel width \(h\) above).
Family of parametric density functions: \(\hat{P}(\underline x;\underline w)\)
For a Gaussian model the parameter vector \(\underline w\) consists of the mean \(\underline\mu\) and the covariance matrix \(\underline\Sigma\); their optimal values \(\underline\mu^*\) and \(\underline\Sigma^*\) are derived below.
Cost function for model selection (the average negative log-likelihood of the training data):
\[E^T = -\frac{1}{p}\sum^p_{\alpha=1}\ln\hat{P}(\underline x^{(\alpha)};\underline w) \overset{!}{=}\min_{\underline w}\]
Problem: minimizing the training costs leads to overfitting.
==> We need \(E^G\), the generalization cost, but it relies on knowledge of the true density \(P\) ==> use a proxy function
\[\hat{E}^G=\frac{1}{p}\sum^p_{\beta=1}e^{(\beta)}\]
where \(e^{(\beta)} = -\ln\hat{P}(\underline x^{(\beta)};\underline w)\) is evaluated on held-out (test) data points.
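To make the proxy concrete, one can split the data, evaluate the average negative log-likelihood of the held-out points for several candidate widths, and keep the best one. This is only a sketch of that idea; the split sizes, the candidate values of \(h\), and the reuse of the `gaussian_kde` helper from the sketch above are my assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.normal(size=(600, 2))          # toy data, illustration only
train, test = samples[:400], samples[400:]   # held-out points for the proxy

def neg_log_likelihood(test_points, train_points, h):
    # proxy \hat{E}^G: average of e^(beta) = -ln P_hat(x^(beta)) over held-out points
    vals = [gaussian_kde(x, train_points, h) for x in test_points]
    return -np.mean(np.log(np.maximum(vals, 1e-300)))  # guard against log(0)

candidate_h = [0.05, 0.1, 0.2, 0.5, 1.0, 2.0]
best_h = min(candidate_h, key=lambda h: neg_log_likelihood(test, train, h))
print("selected bandwidth:", best_h)
```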
Alternative approach: select the model that gives the highest probability to the already known data points (maximum likelihood).
\[\hat{P}(\{\underline x^{(\alpha)}\};\underline w )\overset{!}{=}\max_{\underline w}\]
For i.i.d. data points the joint probability factorizes, \(\hat{P}(\{\underline x^{(\alpha)}\};\underline w)=\prod^p_{\alpha=1}\hat{P}(\underline x^{(\alpha)};\underline w)\), so maximizing it is equivalent to minimizing the negative log-likelihood:
\[-\sum^p_{\alpha=1}\ln\hat{P}(\underline x^{(\alpha)};\underline w)\overset{!}{=}\min_{\underline w}\]
This can be minimized numerically (e.g. by simple gradient descent); for a Gaussian model there is a closed-form solution.
Conditions for the optimum in the multivariate Gaussian case (gradients of \(E^T\) set to zero):
\[\begin{align*} \frac{\partial E^T}{\partial \underline\mu} & =\underline 0 & \implies \underline \mu ^* & = \frac{1}{p}\sum^p_{\alpha=1}\underline x ^{(\alpha)} \\ \frac{\partial E^T}{\partial \underline\Sigma} & =\underline 0 & \implies \underline \Sigma ^* & = \frac{1}{p}\sum^p_{\alpha=1}(\underline x ^{(\alpha)}-\underline \mu ^*)(\underline x^{(\alpha)}-\underline \mu^*)^T \\ \end{align*}\]
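These closed-form solutions translate directly into code; a minimal sketch with made-up toy data (function and variable names are my own):

```python
import numpy as np

def fit_gaussian_ml(data):
    # maximum-likelihood estimates for a multivariate Gaussian:
    # mu* = sample mean, Sigma* = (biased) sample covariance
    p = data.shape[0]
    mu = data.mean(axis=0)
    centered = data - mu
    sigma = centered.T @ centered / p   # note: 1/p, not 1/(p-1)
    return mu, sigma

# toy data, illustration only
rng = np.random.default_rng(2)
samples = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.3], [0.3, 0.5]], size=1000)
mu_hat, sigma_hat = fit_gaussian_ml(samples)
print(mu_hat)      # close to [1, -2]
print(sigma_hat)   # close to the true covariance
```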