Learning to be uncertain: Of soft labels and soft losses

Motivation

  • Robust UQ is paramount for trustworthy models!
    • Medical diagnostics
    • Autonomous driving
    • Automated decisions
  • Softmax predictions are often over-confident
  • Can confidence calibration solve this?
    • How can we measure it?
    • How can we encourage it?

Relation between confidence and uncertainty

  • Is uncertainty the absence of confidence? \[U\left(\mathbf{y}|\mathbf{x}\right) = 1 - \max_K\left(\mathbf{p}\left(\mathbf{y}|\mathbf{x}\right)\right) \]
  • How does confidence calibration help uncertainty calibration?

Uncertainty Quantification

  • Deep ensembles [1]
  • MC dropout [2]
  • density-based methods
    • evidential deep learning [3]
    • Gaussian process [4]

[1] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable predictive uncertainty estimation using deep ensembles,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[2] Y. Gal and Z. Ghahramani, “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning,” in Proceedings of the 33rd International Conference on Machine Learning, PMLR, 2016, pp. 1050–1059.
[3] B. Charpentier, D. Zügner, and S. Günnemann, “Posterior Network: Uncertainty Estimation without OOD Samples via Density-Based Pseudo-Counts,” Advances in Neural Information Processing Systems, vol. 33, 2020.
[4] J. Z. Liu, Z. Lin, S. Padhy, D. Tran, T. Bedrax-Weiss, and B. Lakshminarayanan, “Simple and Principled Uncertainty Estimation with Deterministic Deep Learning via Distance Awareness,” Advances in Neural Information Processing Systems, vol. 33, 2020.

Aleatoric Uncertainty

  • associated with irreducible noise (property of the data)
  • first-order uncertainty
  • for ensembles: average over members’ entropy \[ U_{alea} = \frac{1}{L}\sum_{l=1}^L\sum_{k=1}^K -(p_{l,k}\cdot \log_2(p_{l,k})) \]

Epistemic Uncertainty

  • associated with reducible noise (model uncertainty)
  • including more data can reduce the uncertainty
  • for ensembles: mutual information between the prediction and the ensemble member \(\boldsymbol{\theta}\) \[ U_{epis} = \mathbb{H}\left(\mathbb{E}\left( p\left(\mathbf{y}|\mathbf{x}\right)\right)\right) - \mathbb{E}\left(\mathbb{H}\left( p\left(\mathbf{y}|\mathbf{x}\right)\right)\right) = \mathbb{I}\left(\mathbf{y};\boldsymbol{\theta}\,|\,\mathbf{x}\right) \]
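A minimal NumPy sketch of both decompositions, assuming member_probs holds the softmax outputs of all L ensemble members in an array of shape (L, N, K); the function name and shapes are illustrative:

```python
import numpy as np

def ensemble_uncertainties(member_probs, eps=1e-12):
    """Aleatoric/epistemic decomposition for a deep ensemble.

    member_probs: array of shape (L, N, K) with each member's softmax output.
    Returns (aleatoric, epistemic, total), each of shape (N,), in bits.
    """
    # entropy of each member's prediction, averaged over members -> aleatoric
    member_entropy = -np.sum(member_probs * np.log2(member_probs + eps), axis=-1)  # (L, N)
    aleatoric = member_entropy.mean(axis=0)

    # entropy of the averaged prediction -> total uncertainty
    mean_probs = member_probs.mean(axis=0)                                          # (N, K)
    total = -np.sum(mean_probs * np.log2(mean_probs + eps), axis=-1)

    # mutual information H(E[p]) - E[H(p)] -> epistemic
    epistemic = total - aleatoric
    return aleatoric, epistemic, total
```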

Evaluating Classification Uncertainty

  • no ground truth
  • common metrics:
    • calibration metrics (Brier score, ECE)
    • classification with rejection
    • out-of-distribution detection

Confidence Calibration

  • predictions with similar confidence should have similar accuracy
  • confidence should be a predictor for accuracy
  • we can calculate a simple metric over M confidence bins (S_j samples, accuracy A_j, and mean confidence C_j in bin j; see the sketch below the figure): \[ECE = \sqrt{\sum_{j=1}^M \frac{S_j}{N} \cdot |A_j - C_j|^2} \]
images/karandikar_fig_6_cropped.png
Fig: Confidence-calibration curves, cropped from Karandikar et al., Soft Calibration Objectives for Neural Networks, NeurIPS 2021
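A minimal sketch of the binned metric above, assuming per-sample confidences (max softmax probability) and correctness indicators; the bin count and names are illustrative:

```python
import numpy as np

def ece_l2(confidences, correct, num_bins=15):
    """L2 expected calibration error over M equal-width confidence bins."""
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    n = len(confidences)
    ece_sq = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        s_j = in_bin.sum()
        if s_j == 0:
            continue
        a_j = correct[in_bin].mean()       # bin accuracy A_j
        c_j = confidences[in_bin].mean()   # bin confidence C_j
        ece_sq += (s_j / n) * (a_j - c_j) ** 2
    return np.sqrt(ece_sq)
```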

Calibration losses

  • temperature scaling:
    • goal: improve confidence calibration
    • find a single temperature parameter that minimizes the NLL on held-out data (see the sketch below)
  • post-hoc calibration can never change the uncertainty ranking
  • training with a calibration loss can!
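A minimal sketch of temperature scaling, assuming held-out logits and labels and a plain gradient fit of a single log-temperature (optimizer and step count are illustrative):

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, steps=200, lr=0.01):
    """Find a single temperature T > 0 that minimizes the NLL on held-out data."""
    val_logits = val_logits.detach()
    log_t = torch.zeros(1, requires_grad=True)      # optimize log T so that T stays positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()

# Dividing all logits by T > 0 never changes the predicted class of a sample,
# it only rescales the confidences.
```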

Expected Calibration Error

  • problematic hard binning operation
    • causes artifacts
    • is not differentiable
  • solution: soft bin membership using Gaussian kernels \(g\) [5] \[\text{SB-ECE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^M u^*_{j}(c_i) \cdot |A_j - c_i|^2} \] with the soft membership function \[\textbf{u}^*(c) = \text{softmax}(\textbf{g}(c)),\,\, g_j(c_i)=-(c_i-\xi_j)^2/T\]

[5] A. Karandikar et al., “Soft Calibration Objectives for Neural Networks,” in Advances in Neural Information Processing Systems, Curran Associates, Inc., 2021, pp. 29768–29779.
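A simplified PyTorch sketch of the soft-binned objective above, taking A_j as the membership-weighted bin accuracy; the bin centers ξ_j and the temperature T are hyperparameters, and this is an illustration of the formula rather than the reference implementation of [5]:

```python
import torch

def sb_ece(confidences, correct, num_bins=15, temperature=0.01, eps=1e-8):
    """Differentiable soft-binned ECE (simplified form of the SB-ECE above).

    confidences: (N,) max softmax probability per sample
    correct:     (N,) float tensor, 1.0 for correct predictions, else 0.0
    """
    xi = torch.linspace(0.0, 1.0, num_bins, device=confidences.device)      # bin centers

    # soft membership u*_j(c_i) = softmax_j( -(c_i - xi_j)^2 / T )
    g = -(confidences.unsqueeze(1) - xi.unsqueeze(0)) ** 2 / temperature    # (N, M)
    u = torch.softmax(g, dim=1)

    # membership-weighted bin accuracy A_j
    a = (u * correct.unsqueeze(1)).sum(dim=0) / (u.sum(dim=0) + eps)        # (M,)

    # sqrt( 1/N * sum_i sum_j u*_j(c_i) * (A_j - c_i)^2 )
    sq = (u * (a.unsqueeze(0) - confidences.unsqueeze(1)) ** 2).sum() / confidences.numel()
    return torch.sqrt(sq + eps)
```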

Introducing SB-MCE

  • we propose a differentiable marginal calibration loss: SB-MCE \[\text{SB-MCE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^M\sum_{k=1}^K u^*_{j,k}(c_{i,k}) \cdot |A_{j,k} - c_{i,k}|^2} \]
  • intended baseline: SB-ECE calibration loss
  • intended competitor: repulsive deep ensembles (function-space diversity)
  • we use the same architecture: WideResNet-28-10
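A sketch of the marginal loss exactly as written above, with c_{i,k} the predicted probability of class k and A_{j,k} the membership-weighted frequency of class k in bin j; this illustrates the formula, not the exact training implementation:

```python
import torch

def sb_mce(probs, labels, num_bins=15, temperature=0.01, eps=1e-8):
    """Differentiable marginal soft-binned calibration loss (sketch of SB-MCE).

    probs:  (N, K) softmax outputs
    labels: (N,)  integer class labels
    """
    n, k = probs.shape
    onehot = torch.nn.functional.one_hot(labels, k).float()                 # (N, K)
    xi = torch.linspace(0.0, 1.0, num_bins, device=probs.device)

    # per-class soft membership u*_{j,k}(c_{i,k})
    g = -(probs.unsqueeze(1) - xi.view(1, -1, 1)) ** 2 / temperature        # (N, M, K)
    u = torch.softmax(g, dim=1)

    # membership-weighted per-class accuracy A_{j,k}
    a = (u * onehot.unsqueeze(1)).sum(dim=0) / (u.sum(dim=0) + eps)         # (M, K)

    sq = (u * (a.unsqueeze(0) - probs.unsqueeze(1)) ** 2).sum() / n
    return torch.sqrt(sq + eps)
```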

From Confidence Calibration to Uncertainty Calibration

  • calibrated confidence:
    • predictions with similar confidence should have similar accuracy
    • confidence should be a predictor for accuracy
  • calibrated uncertainty:
    • predictions with similar uncertainty should have similar accuracy
    • lower uncertainty should be a predictor for higher accuracy

Evaluating Uncertainty Calibration

  • no ground truth, but a downstream task: classification with rejection (see the sketch below the figure)
images/ARC_aleatoric_unct_test_cifar10_r3779.png
Fig: ARC for aleatoric uncertainty on CIFAR10 test set
  • the green line shows the accuracy among the (rejected) samples of the most recent bin (lower is better)
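A minimal sketch of how such an accuracy-rejection curve can be computed from per-sample uncertainty scores and correctness indicators (names and the grid of rejection rates are illustrative):

```python
import numpy as np

def accuracy_rejection_curve(uncertainty, correct, num_points=21):
    """Accuracy on the retained samples after rejecting the most uncertain ones."""
    order = np.argsort(uncertainty)                # ascending: most certain samples first
    correct_sorted = correct[order]
    n = len(correct)
    rates, accs = [], []
    for r in np.linspace(0.0, 0.95, num_points):
        keep = max(1, int(round((1.0 - r) * n)))   # reject the r most uncertain samples
        rates.append(r)
        accs.append(correct_sorted[:keep].mean())
    return np.array(rates), np.array(accs)
```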

Experimental Setup

  • Deep ensemble of 5 WideResNet-28-10 models
    • cross-entropy loss
    • ReLU activation
    • data augmentation
  • CIFAR10 data
    • 32 x 32 color images (50,000 train / 10,000 test)
    • 10 classes (deer, dog, airplane, …)

Experimental Results

  • Deep ensemble of 5 WideResNet-28-10 models, cross-entropy loss, ReLU activation; secondary calibration loss: none, SB-ECE, or SB-MCE
arch | L_prim | L_sec | tgts | Aval  | Atst  | Aood  | cLL   | bSc  | ECE_TS | run
WN   | CE     | -     | hard | 91.84 | 24.17 | 22.37 | -5.08 | 1.13 | 44.07  | 3741
WN   | CE     | SBECE | hard | 92.51 | 52.10 | 44.48 | -3.85 | 0.78 | 31.62  | 3733
WN   | CE     | SBMCE | hard | 93.88 | 60.35 | 50.40 | -2.88 | 0.68 | 25.34  | 3752

Experimental Results: Baseline Ensemble

  • Deep ensemble of 5 WideResNet-28-10 models, cross-entropy loss, ReLU activation
  • the model memorizes the training data
    • very high accuracy with very low uncertainty on validation set
    • very low accuracy on test set
arch | L_prim | L_sec | tgts | Aval  | Atst  | Aood  | cLL   | bSc  | ECE_TS | run
WN   | CE     | -     | hard | 91.84 | 24.17 | 22.37 | -5.08 | 1.13 | 44.07  | 3741

Baseline Ensemble: Uncertainty Histograms

  • not just a little overfitting
images/histogram_total_unct_val_cifar10_r3741.png
Fig: Uncertainty histogram on CIFAR10 validation set

Baseline Ensemble: Uncertainty Histograms

  • not just a little overfitting
images/histogram_total_unct_test_cifar10_r3741.png
Fig: Uncertainty histogram on CIFAR10 test set

Experimental Results: No overfitting, no problem!

  • Deep ensemble of 15 ConvNeXt models, cross-entropy loss, GELU activation, stochastic depth; with no calibration loss and with SB-ECE/SB-MCE variants as secondary loss (see table)
arch | L_prim | L_sec   | tgts | Aval  | Atst  | Aood  | cLL   | bSc  | ECE_TS | run
CN   | CE     | -       | hard | 82.81 | 82.18 | 68.13 | -0.51 | 0.25 | 0.83   | 3750
CN   | CE     | SBECE_l | hard | 83.10 | 82.41 | 67.84 | -0.52 | 0.25 | 1.85   | 3755
CN   | CE     | SBMCE_l | hard | 82.64 | 82.59 | 68.42 | -0.59 | 0.26 | 5.22   | 3756

Experimental Results: No overfitting, no problem!

  • Deep ensemble of 15 ConvNeXt models, cross-entropy loss, no calibration loss
images/ARC_total_unct_OOD_cifar10_2_r3750.png
Fig: ARC for total uncertainty on CIFAR10.2 as OOD set

Experimental Results: No overfitting, no problem!

  • Deep ensemble of 15 ConvNeXt models, cross-entropy loss, SB-MCE calibration loss
images/ARC_total_unct_OOD_cifar10_2_r3756.png
Fig: ARC for total uncertainty on CIFAR10.2 as OOD set

Recap

  • calibration loss has little effect if the model is already calibrated
  • the baseline architecture overfits on CIFAR10
    • in fact, it memorizes the training data
    • very high accuracy with very low uncertainty
  • calibration loss has little effect if the model memorizes the data

Two Ways to Make Matters Worse

  • logit calibration reduces overfitting & breaks UQ
  • inverted calibration loss deteriorates UQ

Experimental Results II: Metrics

  • standard metrics give little indication of UQ failure
arch | L_prim | L_sec   | tgts | Aval  | Atst  | Aood  | cLL   | bSc  | ECE_TS | run
WN   | CE     | -       | hard | 91.84 | 24.17 | 22.37 | -5.08 | 1.13 | 44.07  | 3741
WN   | CE     | SBMCE_s | hard | 92.01 | 23.72 | 22.85 | -3.10 | 1.16 | 47.64  | 3773
WN   | CE     | SBMCE_i | hard | 91.60 | 24.63 | 23.33 | -3.07 | 1.15 | 47.99  | 3775
WN   | CE     | SBMCE_l | hard | 93.88 | 60.35 | 50.40 | -2.88 | 0.68 | 25.34  | 3752

Experimental Results II: baseline

  • Deep ensemble of 5 WideResNet-28-10 models, cross-entropy loss, no calibration loss
images/ARC_total_unct_test_cifar10_r3741.png
Fig: ARC for total uncertainty on CIFAR10 test set
  • poor accuracy, but the UQ is helping

Experimental Results II: SB-MCE

  • Deep ensemble of 5 WideResNet-28-10 models, cross-entropy loss, SB-MCE calibration loss
images/ARC_aleatoric_unct_test_cifar10_r3773.png
Fig: ARC for aleatoric uncertainty on CIFAR10 test set
  • slightly worse metrics, but the ARC looks slightly better

Experimental Results II: reverse SB-MCE

  • Deep ensemble of 5 WideResNet-28-10 models, cross-entropy loss, reverse-sign SB-MCE calibration loss
images/ARC_aleatoric_unct_test_cifar10_r3775.png
Fig: ARC for aleatoric uncertainty on CIFAR10 test set
  • comparable metrics, but the ARC is slightly lower
  • shows a connection between ARC and (marginal) calibration

Experimental Results II: logit calibration

  • Deep ensemble of 5 WideResNet-28-10 models, cross-entropy loss, SB-MCE logit calibration loss (best by all metrics!)
images/ARC_aleatoric_unct_test_cifar10_r3752.png
Fig: ARC for aleatoric uncertainty on CIFAR10 test set
  • best metrics, yet counter-productive UQ!

Experimental Results II: logit calibration

  • Deep ensemble of 5 WideResNet-28-10 models, cross-entropy loss, SB-MCE logit calibration loss (best by all metrics!)
images/ARC_rw_aleatoric_unct_test_cifar10_r3752.png
Fig: ARC for aleatoric uncertainty on CIFAR10 test set
  • best metrics, yet counter-productive UQ!

Uncertainty Calibration Metric

  • we need a better metric that captures the ARC!
  • idea: predictions with similar uncertainty should have a similar accuracy
    • small number of fixed-width bins
    • the uncertainty should predict the accuracy
    • fit a spline to the uncertainty distribution

The bin containing the highest uncertainties should have the lowest accuracy, the bin containing the lowest uncertainties should have the highest accuracy, and the accuracy should decrease monotonically from the lowest-uncertainty bin to the highest-uncertainty bin.
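As a toy sketch only (the spline-based variant is still open): bin the samples by uncertainty into fixed-width bins and penalize every increase of bin accuracy towards higher uncertainty; all names below are illustrative, not the final metric:

```python
import numpy as np

def monotonicity_violation(uncertainty, correct, num_bins=10):
    """Sums every increase in bin accuracy towards higher uncertainty (0 = monotone)."""
    edges = np.linspace(uncertainty.min(), uncertainty.max(), num_bins + 1)
    idx = np.clip(np.digitize(uncertainty, edges[1:-1]), 0, num_bins - 1)
    accs = [correct[idx == j].mean() for j in range(num_bins) if np.any(idx == j)]
    # accs is ordered from the lowest- to the highest-uncertainty bin;
    # any increase towards higher uncertainty counts as a violation
    return np.clip(np.diff(np.array(accs)), 0.0, None).sum()
```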

A Final Hypothesis

The third way to obstruct UQ: Training all ensemble members simultaneously on identical samples?

  • the alternative base architecture (ConvNeXt) made heavy use of stochastic depth
    • same network architecture for each member (by design)
    • samples were identical, networks were different (in practice)
  • preliminary result: stochastic depth improved the test accuracy

Time-Series Forecasting Under Uncertainty

Predicting Electricity Spot Prices

  • Goal: Predict the electricity spot market price for the next 48 hours
images/energy-charts_Electricity_production_and_spot_prices_in_Germany_in_week_14_2025.png
Fig: Day-ahead spot market and simplified energy market chart for Germany

Electricity Spot Prices: Sources of Uncertainty

  • Renewable energy is cheaper, but weather-dependent
  • Spot prices are obtained by auction: player interactions
  • Availability of power plants changes
    • installed solar capacity +50% in two years
    • extended droughts, maintenance, …
    • different capabilities to adjust power output
  • Interactions between neighboring markets

Regression with Uncertainty

  • Model outputs predicted (future) value
    • variance of ensemble prediction
    • direct prediction of a prediction interval
  • What is the connection between “confidence” and uncertainty now?
  • Probabilistic evaluation:

Does the predicted interval cover the true value at the desired rate?

Regression Uncertainty Evaluation

  • Prediction Interval Coverage Probability (PICP)
  • Mean Predicted Interval Width (MPIW)
  • captured MPIW (interval width averaged only over covered samples) avoids rewarding narrow intervals on missed predictions (see the sketch below)
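A minimal sketch of these three quantities for predicted intervals [lower, upper] (array names are illustrative):

```python
import numpy as np

def interval_metrics(y_true, lower, upper):
    """PICP, MPIW and captured MPIW for predicted intervals."""
    covered = (y_true >= lower) & (y_true <= upper)
    picp = covered.mean()                            # fraction of targets inside their interval
    width = upper - lower
    mpiw = width.mean()                              # mean interval width
    mpiw_capt = width[covered].mean() if covered.any() else np.nan   # width over covered points only
    return picp, mpiw, mpiw_capt
```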

Anomaly Detection

  • Detection of anomalous inputs
    • reconstruction error
    • modelling the normal state
  • Independent from the model’s prediction

Can we detect inputs for which the predictive model and its UQ become unreliable?

Summary (Goal)

  • End-to-end model confidence calibration improves uncertainty calibration
  • Marginal calibration supersedes total calibration
  • Introduced a new uncertainty calibration metric
  • Used the uncertainty calibration as a secondary loss
  • Second project: time-series prediction with inherent uncertainty and anomaly detection

Appendix A: The CIFAR10-H dataset

  • the CIFAR10 test set, with each image labeled by roughly 50 human annotators
  • it gives us a sort of ground truth label distribution
  • we can compute the first-order uncertainty based on this distribution

Category Soft Labels

  • the authors used this as a baseline, but only trained on the test set
  • we can use this to train on the train set, too!
  • a natural competitor for such an approach: label smoothing (see the sketch below)
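A minimal sketch of how both target types enter the same cross-entropy objective; the smoothing factor eps is illustrative:

```python
import torch
import torch.nn.functional as F

def soft_label_loss(logits, soft_targets):
    """Cross-entropy against a full target distribution (e.g. CIFAR10-H soft labels)."""
    return -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

def label_smoothing_targets(labels, num_classes, eps=0.1):
    """Label-smoothing baseline: (1 - eps) on the annotated class, eps spread uniformly."""
    onehot = F.one_hot(labels, num_classes).float()
    return (1.0 - eps) * onehot + eps / num_classes
```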

CIFAR10-H Category Soft Labels

  • this table shows the mean human soft labels arranged as a confusion matrix (rows: image class, columns: mean annotated probability)
class 0 1 2 3 4 5 6 7 8 9
0 0.9482 0.008 0.0111 0.0023 0.0013 0.0013 0.0028 0.0013 0.0202 0.0036
1 0.0027 0.9684 0.0007 0.0016 0.0011 0.0008 0.0006 0.0005 0.001 0.0225
2 0.0039 0.001 0.9443 0.012 0.0098 0.0086 0.0132 0.004 0.0024 0.0008
3 0.0016 0.0015 0.0143 0.9119 0.0075 0.0427 0.0133 0.0036 0.0021 0.0015
4 0.0016 0.001 0.0113 0.0096 0.9019 0.0209 0.0074 0.0434 0.0017 0.0012
5 0.0007 0.0005 0.0058 0.0348 0.0037 0.9459 0.0031 0.0038 0.0008 0.0008
6 0.0006 0.0003 0.0116 0.014 0.0058 0.0066 0.9581 0.0014 0.0009 0.0006
7 0.0012 0.0006 0.0026 0.0019 0.0079 0.0074 0.0006 0.9756 0.0009 0.0012
8 0.0116 0.0035 0.0022 0.0019 0.0011 0.0012 0.0023 0.0008 0.9691 0.0064
9 0.0018 0.0242 0.0007 0.0011 0.0008 0.001 0.0007 0.0012 0.0043 0.9641

Human-Based Rejection

  • this figure shows the accuracy-rejection curve for a model evaluated on CIFAR10 test set
  • the second curve is a counterfactual that uses the same predictions, paired with the uncertainty values computed from the human annotations
images/ARC_total_unct_test_cifar10_with_cfh_r3767.png
Fig: ARC for total uncertainty on CIFAR10 test set, with the counterfactual human-uncertainty curve

Appendix B: Failure Cases I

  • Deep ensemble of 15 ConvNeXt models, cross-entropy loss, no calibration loss
images/ARC_total_unct_OOD_cifar10_2_r3750.png
Fig: ARC for total uncertainty on CIFAR10.2 as OOD set

Failure cases: Repulsive Deep Ensemble

  • Same hyper params as before, but trained as a repulsive ensemble
images/ARC_total_unct_OOD_cifar10_2_r3749.png
Fig: ARC for total uncertainty on CIFAR10.2 as OOD set
  • This ARC is completely flat because all validation samples received almost identical uncertainty values (0.968–0.999; the same holds for the test set), while all OOD samples received lower uncertainty (0.0–0.93)

Appendix B: Failure Cases II

  • Deep ensemble of 5 WideResNet-28-10 models, cross-entropy loss, SB-MCE calibration loss, soft category targets, ReLU activation (best of its kind!)
  • counter-productive UQ!
images/ARC_total_unct_test_cifar10_r3753.png
Fig: ARC for total uncertainty on CIFAR10 test set

Failure Cases: GELU activation

  • Same hyper params as before, only changed the activation function to GELU
  • almost identical performance metrics… yet:
images/ARC_total_unct_test_cifar10_r3764.png
Fig: ARC for total uncertainty on CIFAR10 test set
  • This may be a bug: after re-running the evaluation (-d -w -e -o cifar10_2), the results looked normal. What caused these unusually narrow UQ values, and why are they completely different when using eval_run.py instead of the evaluation that runs directly after train_model.py?