
Density Estimation

The goal of density estimation is to estimate the probability density at every point of the vector space from a finite set of data points.

There are two approaches

  • parametric (model based)
    • Gaussian Densities
  • nonparametric (data driven)
    • Kernel Density Estimate

Kernel Density Estimation (illustrated with the Gliding Histogram)

Parameter

  • h: width of the rectangle (edge length of the hypercube)

Histogram Kernel H

  • x is the point at which we want to estimate the density
  • u is the distance between two points, scaled by the width h; the threshold of 1/2 (rather than 1) means the rectangle of width h is centered on x

Does the vector given by u end inside the rectangle of width h (H = 1) or outside (H = 0)?

$$H(\underline u) = \begin{cases} 1, & |u_j| < \frac{1}{2},\ j = 1, \dots, n \\ 0, & \text{else} \end{cases}$$

The estimation of density

  • h width of the rectangle
  • n number of dimensions
  • p number of data points
$$\hat{P}(\underline x) = \underbrace{\frac{1}{h^n}}_{\text{divide by volume with width } h} \cdot \frac{1}{p} \sum_{\alpha=1}^{p} H\!\left(\frac{\underline x - \underline x^{(\alpha)}}{h}\right)$$
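As a minimal sketch of this formula (function names are mine, not from the lecture), the following Python snippet evaluates the gliding-histogram estimate for a data set stored as a (p, n) NumPy array:

```python
import numpy as np

def histogram_kernel(u):
    """H(u): 1 if every component satisfies |u_j| < 1/2, else 0."""
    return np.all(np.abs(u) < 0.5, axis=-1).astype(float)

def gliding_histogram_density(x, data, h):
    """P_hat(x) = 1/(h^n) * 1/p * sum_alpha H((x - x_alpha) / h)."""
    p, n = data.shape
    u = (x - data) / h                     # shape (p, n): scaled distances to each data point
    return histogram_kernel(u).sum() / (p * h**n)

# usage: density at the origin from 200 samples of a 2-D standard normal
rng = np.random.default_rng(0)
data = rng.standard_normal((200, 2))
print(gliding_histogram_density(np.zeros(2), data, h=0.5))
```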

Drawbacks of Gliding Histograms

  • “Bumpy”: the estimate jumps whenever a data point enters or leaves the rectangle (especially with few data points or high dimensionality)
  • The rectangular kernel is not really a good choice (hard edges, every point inside is weighted equally)
  • The optimal size of h is non-trivial and needs model selection; a smaller h leads to overfitting

**Alternatively: Gaussian Kernel**

A Gaussian kernel can be used instead of the rectangle, which removes most of these side effects: the estimate becomes smooth, since each data point contributes a continuously decaying weight.
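A sketch of the same estimator with a Gaussian kernel; the isotropic standard-normal kernel scaled by h is my choice of normalization, not taken from the notes:

```python
import numpy as np

def gaussian_kde_density(x, data, h):
    """KDE with an isotropic Gaussian kernel of width h instead of the rectangle."""
    p, n = data.shape
    u = (x - data) / h                                    # scaled distances, shape (p, n)
    kernel = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi) ** (n / 2)
    return kernel.sum() / (p * h**n)
```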

Parametric Density Estimation

Note: for a Gaussian model, μ (the mean) and Σ (the covariance matrix) are the parameters; together they compose w.

Parametric density estimation introduces a cost-based model selection framework, which can also be used to find a good value for h.

Family of parametric density functions: $\hat{P}(\underline x; \underline w)$

Cost function for model selection

$$E^T = -\frac{1}{p} \sum_{\alpha=1}^{p} \ln \hat{P}(\underline x^{(\alpha)}; \underline w) \overset{!}{=} \min_{\underline w}$$
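A small sketch of this training cost, assuming `density` is any callable $\hat{P}(\underline x; \underline w)$, e.g. one of the estimators sketched above (the function name is mine):

```python
import numpy as np

def training_cost(density, data):
    """E^T = -1/p * sum_alpha ln P_hat(x^(alpha); w): mean negative log-density on the training set."""
    return -np.mean([np.log(density(x)) for x in data])
```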

Problem: Minimizing the training costs leads to overfitting

==> We need EG, the generalization cost, but it relies on knowledge of the true density P ==> use a proxy function

$$\hat{E}^G = \frac{1}{p} \sum_{\beta=1}^{p} e^{(\beta)}$$
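The notes do not say how the individual costs $e^{(\beta)}$ are obtained; a common proxy is cross-validation. The sketch below assumes a 1-D Gaussian model fitted on each training split and scored by its negative log-likelihood on the held-out split; the fold count and all names are illustrative assumptions:

```python
import numpy as np

def neg_log_likelihood(test, mu, var):
    """e = -<ln P_hat> on held-out data for a 1-D Gaussian model (illustrative choice)."""
    return np.mean(0.5 * np.log(2 * np.pi * var) + (test - mu) ** 2 / (2 * var))

def proxy_generalization_cost(data, n_folds=5):
    """E_hat^G: average held-out cost e^(beta) over cross-validation folds."""
    folds = np.array_split(data, n_folds)
    errors = []
    for beta in range(n_folds):
        train = np.concatenate([f for i, f in enumerate(folds) if i != beta])
        mu, var = train.mean(), train.var()          # fit w = (mu, sigma^2) on the training part
        errors.append(neg_log_likelihood(folds[beta], mu, var))
    return np.mean(errors)
```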

Alternative approach: Select the parameters w that give the highest probability (likelihood) to the already observed data points. Since the logarithm is monotone, maximizing this likelihood is equivalent to minimizing the negative sum of log-probabilities:

$$\hat{P}(\{\underline x^{(\alpha)}\}; \underline w) \overset{!}{=} \max_{\underline w} \;\Leftrightarrow\; -\sum_{\alpha=1}^{p} \ln \hat{P}(\underline x^{(\alpha)}; \underline w) \overset{!}{=} \min$$

The minimization can probably be done with simple gradient descent (see the sketch below); for a Gaussian model it can also be solved analytically, as in the next section.
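As an illustration of the gradient descent idea (my sketch, not from the notes), here $E^T$ is minimized for a 1-D Gaussian model with parameters w = (μ, σ²):

```python
import numpy as np

def fit_gaussian_gradient_descent(data, lr=0.1, steps=1000):
    """Minimize E^T = -<ln P_hat> for a 1-D Gaussian by gradient descent on w = (mu, var)."""
    mu, var = 0.0, 1.0
    for _ in range(steps):
        d = data - mu
        grad_mu = -np.mean(d) / var                             # dE^T / dmu
        grad_var = 0.5 / var - np.mean(d ** 2) / (2 * var**2)   # dE^T / dvar
        mu -= lr * grad_mu
        var = max(var - lr * grad_var, 1e-8)                    # keep the variance positive
    return mu, var

# usage: the result should approach the sample mean and (biased) sample variance
rng = np.random.default_rng(1)
sample = rng.normal(2.0, 1.5, size=500)
print(fit_gaussian_gradient_descent(sample))
```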

Conditions for the multivariate Gaussian case

$$\frac{\partial E^T}{\partial \underline\mu} = 0 \;\Rightarrow\; \underline\mu = \frac{1}{p} \sum_{\alpha=1}^{p} \underline x^{(\alpha)}$$

$$\frac{\partial E^T}{\partial \Sigma} = 0 \;\Rightarrow\; \Sigma = \frac{1}{p} \sum_{\alpha=1}^{p} (\underline x^{(\alpha)} - \underline\mu)(\underline x^{(\alpha)} - \underline\mu)^T$$
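The same conditions as a minimal code sketch (function name mine): the ML estimates are the sample mean and the biased sample covariance, dividing by p rather than p − 1:

```python
import numpy as np

def fit_gaussian_ml(data):
    """Closed-form ML estimates for a multivariate Gaussian: mu and (biased) covariance Sigma."""
    p, n = data.shape
    mu = data.mean(axis=0)                 # mu = 1/p * sum_alpha x^(alpha)
    centered = data - mu
    sigma = centered.T @ centered / p      # Sigma = 1/p * sum_alpha (x - mu)(x - mu)^T
    return mu, sigma
```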

Mixture Models - EM
