A loss function is a measurable function that induces a risk functional whose minimizer defines the learned decision rule.

A loss function:

  1. Defines what error means: encodes task semantics (distance, ranking error, divergence)
  2. Defines the optimization geometry: induces gradients and curvature in parameter space
  3. Defines the statistical model implicitly: many are negative log-likelihoods of assumed noise models
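Point 3 can be checked numerically. A minimal sketch (illustrative, not from the text): the squared loss equals the Gaussian negative log-likelihood with unit variance up to an additive constant, so the two have the same minimizer in the prediction.

```python
import numpy as np

# Illustrative check of point 3: squared loss vs. Gaussian NLL.
rng = np.random.default_rng(0)
y = rng.normal(size=5)       # targets
y_hat = rng.normal(size=5)   # predictions
sigma = 1.0                  # assumed noise scale

squared_loss = 0.5 * (y - y_hat) ** 2
gaussian_nll = 0.5 * ((y - y_hat) / sigma) ** 2 + 0.5 * np.log(2 * np.pi * sigma**2)

# The difference is a constant independent of y_hat, so both losses
# are minimized by the same prediction.
print(np.allclose(gaussian_nll - squared_loss, 0.5 * np.log(2 * np.pi)))  # True
```

The same correspondence holds more generally: cross-entropy is the NLL of a categorical noise model, absolute error of a Laplace one.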

Given spaces $X,Y$ and a parametric model $f_\theta\colon X\to Y$, a pointwise loss function $\ell$ has the form:

$$ \ell\colon Y\times Y\to\mathbb R_{\ge 0} $$

The expected risk $\mathcal R(\theta)$ of parameter $\theta$ is then defined as:

$$ \mathcal R(\theta)=\mathbb E_{(x,y)\sim\mathcal D}\left[ \ell(y,f_\theta(x)) \right] $$

Since the true distribution $\mathcal D$ is unknown, we minimize the empirical risk:

$$ \widehat{\mathcal R}(\theta)=\frac1n\sum^n_{i=1}\ell(y_i,f_\theta(x_i)) $$
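The empirical risk is just an average of pointwise losses over the sample. A minimal sketch, assuming a 1-D linear model $f_\theta(x)=\theta x$ and squared loss (both chosen here for illustration):

```python
import numpy as np

# Empirical risk: average pointwise loss over the sample.
def empirical_risk(theta, xs, ys, loss):
    preds = theta * xs  # illustrative linear model f_theta(x) = theta * x
    return float(np.mean([loss(y, p) for y, p in zip(ys, preds)]))

squared = lambda y, y_hat: (y - y_hat) ** 2

xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 2.0 * xs  # noiseless data generated with theta = 2

print(empirical_risk(2.0, xs, ys, squared))  # 0.0 at the generating parameter
print(empirical_risk(1.0, xs, ys, squared) > 0)  # True elsewhere
```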

Training is then:

$$ \theta_*=\argmin_\theta \widehat{\mathcal R}(\theta) $$
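In practice the argmin is found by iterative optimization. A sketch under illustrative assumptions (1-D linear model, squared loss, plain gradient descent): the gradient of $\frac1n\sum_i(y_i-\theta x_i)^2$ with respect to $\theta$ is $\frac2n\sum_i(\theta x_i-y_i)x_i$.

```python
import numpy as np

# Gradient descent on the empirical risk of a 1-D linear model
# under squared loss (illustrative setup, not from the text).
rng = np.random.default_rng(1)
xs = rng.normal(size=100)
ys = 3.0 * xs + 0.1 * rng.normal(size=100)  # generating parameter near 3

theta, lr = 0.0, 0.1
for _ in range(200):
    grad = 2.0 * np.mean((theta * xs - ys) * xs)  # d/d(theta) of empirical risk
    theta -= lr * grad

print(theta)  # converges near the generating parameter 3.0
```

With this convex loss, gradient descent converges to the unique minimizer of the empirical risk; nonconvex models reach only local minima or saddle points.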

Perspectives

Decision-theoretic

A loss function is a penalty assigned to predicting $\hat y$ when the true outcome is $y$:

$$ \ell(y,\hat y) $$

A learner seeks a decision rule minimizing expected loss.

The optimal predictor $f^*$ under loss $\ell$ is: