Performance measures
These are exam preparation notes, subpar and certainly not of divine quality.
See the index with all articles in this series.
Choice of error function
Usually the squared error is used.
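For reference, with targets $t_n$ and model outputs $y(x_n, w)$ over $N$ training examples (notation assumed here, not taken from the lecture), the sum-of-squares error is:

$$E(w) = \frac{1}{2} \sum_{n=1}^{N} \big( y(x_n, w) - t_n \big)^2$$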
Cross-Entropy
From Wikipedia:
In information theory, the cross entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an “unnatural” probability distribution q, rather than the “true” distribution p.
The difference between the cross-entropy and the true entropy is measured by the KL divergence.
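The standard relation between the two (with $H(p)$ the entropy of $p$) is:

$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$$

so the cross-entropy is minimal exactly when $q = p$.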
We have the problem that we do not know the true distribution $p$, and therefore also cannot compute the expectation under it. Our expectation is that the following approximation holds:

$$H(p, q) = -\mathbb{E}_{x \sim p}\big[\log q(x)\big] \approx -\frac{1}{N} \sum_{n=1}^{N} \log q(x_n)$$

where the $x_n$ are samples drawn from $p$, i.e. the training data.
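A tiny Python illustration of this approximation (the distributions and the sample size are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.7, 0.2, 0.1])   # "true" distribution (unknown in practice)
q = np.array([0.5, 0.3, 0.2])   # model distribution

# Exact cross-entropy (only computable because we made p up)
true_cross_entropy = -np.sum(p * np.log(q))

# Empirical estimate from samples drawn from p (this is all we have in practice)
samples = rng.choice(len(p), size=10_000, p=p)
empirical_cross_entropy = -np.mean(np.log(q[samples]))

print(true_cross_entropy, empirical_cross_entropy)  # the two should be close
```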
Empirical Risk Minimization (ERM)
Based on what we learned from Cross-Entropy, to reduce the expected risk under the unknown true distribution we minimize the empirical risk on the training sample instead:

$$\hat{R}(f) = \frac{1}{N} \sum_{n=1}^{N} L\big(f(x_n), t_n\big)$$

where $L$ is the chosen loss function.
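A minimal Python sketch of computing the empirical risk for a squared loss (the `predict` callable and the variable names are illustrative assumptions):

```python
import numpy as np

def empirical_risk(predict, X, t, loss=lambda y, t: 0.5 * (y - t) ** 2):
    """Average loss of `predict` over the training sample (X, t)."""
    y = np.array([predict(x) for x in X])
    return np.mean(loss(y, np.asarray(t)))

# Example: a linear model y = 2x evaluated on a toy sample
X = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([0.1, 2.2, 3.9, 6.1])
print(empirical_risk(lambda x: 2.0 * x, X, t))
```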
Overfitting
If your model's complexity exceeds the complexity of the data, you end up fitting the noise in the data rather than the underlying structure.
In general you want to minimize the generalization error, i.e. the expected error on unseen data, not just the error on the training set.
Overfitting ↔ Underfitting: a model that is too complex overfits, a model that is too simple underfits.
Cross Validation
Cross validation is usually done in an n-fold way: the data is divided into n random, similarly sized buckets (folds). Training is then done on all but one bucket (e.g. 9/10 of the data) and testing on the remaining one (1/10). This is repeated so that every bucket serves as the test set once.
Averaging the n errors gives an estimate of the generalization error without permanently sacrificing part of the data as a held-out test set. The final model is then often trained on the entire data set.
Also known as resampling.
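A rough sketch of n-fold cross validation in Python (the `train` and `error` callables are placeholders assumed for illustration):

```python
import numpy as np

def cross_validation_error(X, t, train, error, n_folds=10, seed=0):
    """Estimate the generalization error with n-fold cross validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))           # shuffle once
    folds = np.array_split(indices, n_folds)    # n similarly sized buckets
    errors = []
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = train(X[train_idx], t[train_idx])                # fit on n-1 buckets
        errors.append(error(model, X[test_idx], t[test_idx]))    # test on the held-out bucket
    return np.mean(errors)
```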
Bias Variance Trade-Off
The observed data points are generated by the underlying model plus noise, $t = f(x) + \epsilon$.
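The standard decomposition of the expected squared error (a textbook result, reconstructed here rather than taken from the lecture) is:

$$\mathbb{E}\big[(t - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}$$

A more complex model typically has lower bias but higher variance, and vice versa.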
Stochastic Approximation / Online Learning / Gradient Descent
The weights are updated step by step in the direction of the negative gradient of the error function:

$$w_{t+1} = w_t - \eta \, \nabla E(w_t)$$

where $\eta$ is the learning rate (step size).
Often gradient descent is done in mini-batches: the gradient is estimated on a small random subset of the data in each step, which is cheaper per update and adds a stochastic element.
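A minimal sketch of mini-batch gradient descent for a linear model with squared error (the function name, the specific model, and the default constants are assumptions for illustration):

```python
import numpy as np

def minibatch_gd(X, t, eta=0.01, batch_size=32, epochs=100, seed=0):
    """Mini-batch gradient descent for a linear model y = X @ w with squared error."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            Xb, tb = X[idx], t[idx]
            grad = Xb.T @ (Xb @ w - tb) / len(idx)   # gradient of 0.5 * mean squared error
            w -= eta * grad                           # step against the gradient
    return w
```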
Improvements
Adapt the step size: increase the step size while the error keeps decreasing, decrease it when the error increases.
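One common heuristic in this spirit (sometimes called the "bold driver" rule; the exact factors below are illustrative assumptions) looks roughly like this:

```python
def adapt_step_size(eta, prev_error, new_error, grow=1.05, shrink=0.5):
    """Grow the step size while the error decreases, shrink it sharply when it increases."""
    if new_error < prev_error:
        return eta * grow    # still improving: take slightly bigger steps
    return eta * shrink      # overshot: back off strongly
```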
Regularization
The error function is extended by a penalty term that punishes model complexity, typically large weights. To paraphrase this, the penalty biases the optimization towards simpler models, trading a slightly higher training error for better generalization; its strength is controlled by a factor $\lambda$.
Forms:
- Weight decay: in every update, always subtract a small fraction of your weight vector $w$ from $w$ itself (this corresponds to an $L_2$ penalty on the weights).
- Different norms as penalty terms, e.g. $\|w\|_p$ with $p = 0.1, 0.5, 1, 2, 4$.
General form:

$$\tilde{E}(w) = E(w) + \lambda \sum_i |w_i|^p$$
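A sketch of a gradient descent update with $L_2$ weight decay ($p = 2$); the function name and constants are assumptions for illustration:

```python
def gd_step_with_weight_decay(w, grad, eta=0.01, lam=1e-4):
    """One gradient descent step on the regularized error E(w) + (lam/2) * ||w||^2."""
    return w - eta * (grad + lam * w)   # the extra lam * w term shrinks the weights each step
```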
Parametric vs. Nonparametric Algorithms
Adapted from an external source: parametric algorithms have a fixed number of parameters (and therefore a fixed model complexity), regardless of how much data they see. Examples:
- Logistic Regression
- Linear Discriminant Analysis
- Perceptron
- Naive Bayes
- Simple Neural Networks
Nonparametric algorithms make few assumptions about the underlying data; their effective complexity can grow with the amount of training data. Examples:
- k-Nearest Neighbors
- Decision Trees like CART and C4.5
- Support Vector Machines
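To illustrate the nonparametric idea, here is a minimal k-nearest-neighbors classifier in Python; its "model" is simply the stored training set rather than a fixed set of parameters (all names and data are illustrative):

```python
import numpy as np

def knn_predict(X_train, t_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)     # distance to every stored training point
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    labels, counts = np.unique(t_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                 # majority vote

# The model grows with the data: adding points changes the decision boundary.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
t_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, t_train, np.array([0.95, 0.9])))  # -> 1
```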