Learning to be uncertain: Of soft labels and soft losses

Motivation

  • Robust UQ is paramount for trustworthy models!
    • Medical diagnostics
    • Autonomous driving
    • Automated decisions
  • Softmax predictions are often over-confident
  • Can confidence calibration solve this?
    • How can we measure it?
    • How can we encourage it?

Relation between confidence and uncertainty

  • Is uncertainty the absence of confidence? \[U\left(\mathbf{y}|\mathbf{x}\right) = 1 - \max_K\left(\mathbf{p}\left(\mathbf{y}|\mathbf{x}\right)\right) \]
  • How does confidence calibration help uncertainty calibration?

Uncertainty Quantification

  • Deep ensembles [1]
  • MC dropout [2]
  • density-based methods
    • evidential deep learning [3]
    • Gaussian process [4]

[1] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable predictive uncertainty estimation using deep ensembles,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[2] Y. Gal and Z. Ghahramani, “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning,” in Proceedings of the 33rd International Conference on Machine Learning, PMLR, 2016, pp. 1050–1059.
[3] B. Charpentier, D. Zügner, and S. Günnemann, “Posterior Network: Uncertainty Estimation without OOD Samples via Density-Based Pseudo-Counts,” Advances in Neural Information Processing Systems, vol. 33, 2020.
[4] J. Z. Liu, Z. Lin, S. Padhy, D. Tran, T. Bedrax-Weiss, and B. Lakshminarayanan, “Simple and Principled Uncertainty Estimation with Deterministic Deep Learning via Distance Awareness,” Advances in Neural Information Processing Systems, vol. 33, 2020.

Aleatoric Uncertainty

  • associated with irreducible noise (property of the data)
  • first-order uncertainty
  • for ensembles: average over members’ entropy \[ U_{alea} = \frac{1}{L}\sum_{l=1}^L\sum_{k=1}^K -(p_{l,k}\cdot \log_2(p_{l,k})) \]

Epistemic Uncertainty

  • associated with reducible noise (model uncertainty)
  • including more data can reduce the uncertainty
  • for ensembles: mutual information between the prediction and the ensemble member \(\boldsymbol{\theta}\) \[ U_{epis} = \mathbb{H}\left(\mathbb{E}\left( p\left(\mathbf{y}|\mathbf{x}\right)\right)\right) - \mathbb{E}\left(\mathbb{H}\left( p\left(\mathbf{y}|\mathbf{x}\right)\right)\right) = \mathbb{I}\left(\mathbf{y};\boldsymbol{\theta}\,|\,\mathbf{x}\right) \]
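A minimal NumPy sketch of both decompositions, assuming member_probs holds the softmax outputs of all L ensemble members in an array of shape (L, N, K); the function name and shapes are illustrative:

```python
import numpy as np

def ensemble_uncertainties(member_probs, eps=1e-12):
    """Aleatoric/epistemic decomposition for a deep ensemble.

    member_probs: array of shape (L, N, K) with each member's softmax output.
    Returns (aleatoric, epistemic, total), each of shape (N,), in bits.
    """
    # entropy of each member's prediction, averaged over members -> aleatoric
    member_entropy = -np.sum(member_probs * np.log2(member_probs + eps), axis=-1)  # (L, N)
    aleatoric = member_entropy.mean(axis=0)

    # entropy of the averaged prediction -> total uncertainty
    mean_probs = member_probs.mean(axis=0)                                          # (N, K)
    total = -np.sum(mean_probs * np.log2(mean_probs + eps), axis=-1)

    # mutual information H(E[p]) - E[H(p)] -> epistemic
    epistemic = total - aleatoric
    return aleatoric, epistemic, total
```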

Evaluating Classification Uncertainty

  • no ground truth
  • common metrics:
    • calibration metrics (Brier score, ECE)
    • classification with rejection
    • out-of-distribution detection

Confidence Calibration

  • predictions with similar confidence should have similar accuracy
  • confidence should be a predictor for accuracy
  • we can calculate a simple metric over M confidence bins (S_j samples, accuracy A_j, and mean confidence C_j in bin j; see the sketch below the figure): \[ECE = \sqrt{\sum_{j=1}^M \frac{S_j}{N} \cdot |A_j - C_j|^2} \]
images/karandikar_fig_6_cropped.png
Fig: Confidence-calibration curves, cropped from Karandikar et al., Soft Calibration Objectives for Neural Networks, NeurIPS 2021
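A minimal sketch of the binned metric above, assuming per-sample confidences (max softmax probability) and correctness indicators; the bin count and names are illustrative:

```python
import numpy as np

def ece_l2(confidences, correct, num_bins=15):
    """L2 expected calibration error over M equal-width confidence bins."""
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    n = len(confidences)
    ece_sq = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        s_j = in_bin.sum()
        if s_j == 0:
            continue
        a_j = correct[in_bin].mean()       # bin accuracy A_j
        c_j = confidences[in_bin].mean()   # bin confidence C_j
        ece_sq += (s_j / n) * (a_j - c_j) ** 2
    return np.sqrt(ece_sq)
```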

Calibration losses

  • temperature scaling:
    • goal: improve confidence calibration
    • find a single temperature parameter that minimizes the NLL on held-out data (see the sketch below)
  • post-hoc calibration can never change the uncertainty ranking
  • training with a calibration loss can!
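A minimal sketch of temperature scaling, assuming held-out logits and labels and a plain gradient fit of a single log-temperature (optimizer and step count are illustrative):

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, steps=200, lr=0.01):
    """Find a single temperature T > 0 that minimizes the NLL on held-out data."""
    val_logits = val_logits.detach()
    log_t = torch.zeros(1, requires_grad=True)      # optimize log T so that T stays positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()

# Dividing all logits by T > 0 never changes the predicted class of a sample,
# it only rescales the confidences.
```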

Expected Calibration Error

  • problematic hard binning operation
    • causes artifacts
    • is not differentiable
  • solution: soft bin membership using Gaussian kernels \(g\) [5] \[\text{SB-ECE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^M u^*_{j}(c_i) \cdot |A_j - c_i|^2} \] with the soft membership function \[\textbf{u}^*(c) = \text{softmax}(\textbf{g}(c)),\,\, g_j(c_i)=-(c_i-\xi_j)^2/T\]

[5] A. Karandikar et al., “Soft Calibration Objectives for Neural Networks,” in Advances in Neural Information Processing Systems, Curran Associates, Inc., 2021, pp. 29768–29779.
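A simplified PyTorch sketch of the soft-binned objective above, taking A_j as the membership-weighted bin accuracy; the bin centers ξ_j and the temperature T are hyperparameters, and this is an illustration of the formula rather than the reference implementation of [5]:

```python
import torch

def sb_ece(confidences, correct, num_bins=15, temperature=0.01, eps=1e-8):
    """Differentiable soft-binned ECE (simplified form of the SB-ECE above).

    confidences: (N,) max softmax probability per sample
    correct:     (N,) float tensor, 1.0 for correct predictions, else 0.0
    """
    xi = torch.linspace(0.0, 1.0, num_bins, device=confidences.device)      # bin centers

    # soft membership u*_j(c_i) = softmax_j( -(c_i - xi_j)^2 / T )
    g = -(confidences.unsqueeze(1) - xi.unsqueeze(0)) ** 2 / temperature    # (N, M)
    u = torch.softmax(g, dim=1)

    # membership-weighted bin accuracy A_j
    a = (u * correct.unsqueeze(1)).sum(dim=0) / (u.sum(dim=0) + eps)        # (M,)

    # sqrt( 1/N * sum_i sum_j u*_j(c_i) * (A_j - c_i)^2 )
    sq = (u * (a.unsqueeze(0) - confidences.unsqueeze(1)) ** 2).sum() / confidences.numel()
    return torch.sqrt(sq + eps)
```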

Introducing SB-MCE

  • we propose a differentiable marginal calibration loss: SB-MCE \[\text{SB-MCE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^M\sum_{k=1}^K u^*_{j,k}(c_{i,k}) \cdot |A_{j,k} - c_{i,k}|^2} \]
  • intended baseline: SB-ECE calibration loss
  • intended competitor: repulsive deep ensembles (function-space diversity)
  • we use the same architecture: WideResNet-28-10
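A sketch of the marginal loss exactly as written above, with c_{i,k} the predicted probability of class k and A_{j,k} the membership-weighted frequency of class k in bin j; this illustrates the formula, not the exact training implementation:

```python
import torch

def sb_mce(probs, labels, num_bins=15, temperature=0.01, eps=1e-8):
    """Differentiable marginal soft-binned calibration loss (sketch of SB-MCE).

    probs:  (N, K) softmax outputs
    labels: (N,)  integer class labels
    """
    n, k = probs.shape
    onehot = torch.nn.functional.one_hot(labels, k).float()                 # (N, K)
    xi = torch.linspace(0.0, 1.0, num_bins, device=probs.device)

    # per-class soft membership u*_{j,k}(c_{i,k})
    g = -(probs.unsqueeze(1) - xi.view(1, -1, 1)) ** 2 / temperature        # (N, M, K)
    u = torch.softmax(g, dim=1)

    # membership-weighted per-class accuracy A_{j,k}
    a = (u * onehot.unsqueeze(1)).sum(dim=0) / (u.sum(dim=0) + eps)         # (M, K)

    sq = (u * (a.unsqueeze(0) - probs.unsqueeze(1)) ** 2).sum() / n
    return torch.sqrt(sq + eps)
```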

From Confidence Calibration to Uncertainty Calibration

  • calibrated confidence:
    • predictions with similar confidence should have similar accuracy
    • confidence should be a predictor for accuracy
  • calibrated uncertainty:
    • predictions with similar uncertainty should have similar accuracy
    • lower uncertainty should be a predictor for higher accuracy

Evaluating Uncertainty Calibration

  • no ground truth, but a downstream task: classification with rejection (see the sketch below the figure)
images/ARC_aleatoric_unct_test_cifar10_r3779.png
Fig: ARC for aleatoric uncertainty on CIFAR10 test set
  • the green line shows the accuracy among the (rejected) samples of the most recent bin (lower is better)
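A minimal sketch of how such an accuracy-rejection curve can be computed from per-sample uncertainty scores and correctness indicators (names and the grid of rejection rates are illustrative):

```python
import numpy as np

def accuracy_rejection_curve(uncertainty, correct, num_points=21):
    """Accuracy on the retained samples after rejecting the most uncertain ones."""
    order = np.argsort(uncertainty)                # ascending: most certain samples first
    correct_sorted = correct[order]
    n = len(correct)
    rates, accs = [], []
    for r in np.linspace(0.0, 0.95, num_points):
        keep = max(1, int(round((1.0 - r) * n)))   # reject the r most uncertain samples
        rates.append(r)
        accs.append(correct_sorted[:keep].mean())
    return np.array(rates), np.array(accs)
```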

Experimental Setup

  • Deep ensemble of 5 WideResNet-28-10 models
    • cross-entropy loss
    • ReLU activation
    • data augmentation
  • CIFAR10 data
    • 32 x 32 color images (50,000 train / 10,000 test)
    • 10 classes (deer, dog, airplane, …)

Experimental Results

  • Deep ensemble of 5 WideResNet-28-10 models, cross-entropy loss, ReLU activation; secondary calibration loss: none, SB-ECE, or SB-MCE
arch | L_prim | L_sec | tgts | Aval  | Atst  | Aood  | cLL   | bSc  | ECE_TS | run
WN   | CE     | -     | hard | 91.84 | 24.17 | 22.37 | -5.08 | 1.13 | 44.07  | 3741
WN   | CE     | SBECE | hard | 92.51 | 52.10 | 44.48 | -3.85 | 0.78 | 31.62  | 3733
WN   | CE     | SBMCE | hard | 93.88 | 60.35 | 50.40 | -2.88 | 0.68 | 25.34  | 3752

Experimental Results: Baseline Ensemble

  • Deep ensemble of 5 WideResNet-28-10 models, cross-entropy loss, ReLU activation
  • the model memorizes the training data
    • very high accuracy with very low uncertainty on validation set
    • very low accuracy on test set
arch | L_prim | L_sec | tgts | Aval  | Atst  | Aood  | cLL   | bSc  | ECE_TS | run
WN   | CE     | -     | hard | 91.84 | 24.17 | 22.37 | -5.08 | 1.13 | 44.07  | 3741

Baseline Ensemble: Uncertainty Histograms

  • not just a little overfitting
images/histogram_total_unct_val_cifar10_r3741.png
Fig: Uncertainty histogram on CIFAR10 validation set

Baseline Ensemble: Uncertainty Histograms

  • not just a little overfitting
images/histogram_total_unct_test_cifar10_r3741.png
Fig: Uncertainty histogram on CIFAR10 test set

Experimental Results: No overfitting, no problem!

  • Deep ensemble of 15 ConvNeXt models, cross-entropy loss, GELU activation, stochastic depth; with no calibration loss and with SB-ECE/SB-MCE variants as secondary loss (see table)
arch | L_prim | L_sec   | tgts | Aval  | Atst  | Aood  | cLL   | bSc  | ECE_TS | run
CN   | CE     | -       | hard | 82.81 | 82.18 | 68.13 | -0.51 | 0.25 | 0.83   | 3750
CN   | CE     | SBECE_l | hard | 83.10 | 82.41 | 67.84 | -0.52 | 0.25 | 1.85   | 3755
CN   | CE     | SBMCE_l | hard | 82.64 | 82.59 | 68.42 | -0.59 | 0.26 | 5.22   | 3756

Experimental Results: No overfitting, no problem!

  • Deep ensemble of 15 ConvNeXt models, cross-entropy loss, no calibration loss
images/ARC_total_unct_OOD_cifar10_2_r3750.png
Fig: ARC for total uncertainty on CIFAR10.2 as OOD set

Experimental Results: No overfitting, no problem!

  • Deep ensemble of 15 ConvNeXt models, cross-entropy loss, SB-MCE calibration loss
images/ARC_total_unct_OOD_cifar10_2_r3756.png
Fig: ARC for total uncertainty on CIFAR10.2 as OOD set

Recap

  • calibration loss has little effect if the model is already calibrated
  • the baseline architecture overfits on CIFAR10
    • in fact, it memorizes the training data
    • very high accuracy with very low uncertainty
  • calibration loss has little effect if the model memorizes the data

Two Ways to Make Matters Worse

  • logit calibration reduces overfitting & breaks UQ
  • inverted calibration loss deteriorates UQ

Experimental Results II: Metrics

  • standard metrics give little indication of UQ failure
arch | L_prim | L_sec   | tgts | Aval  | Atst  | Aood  | cLL   | bSc  | ECE_TS | run
WN   | CE     | -       | hard | 91.84 | 24.17 | 22.37 | -5.08 | 1.13 | 44.07  | 3741
WN   | CE     | SBMCE_s | hard | 92.01 | 23.72 | 22.85 | -3.10 | 1.16 | 47.64  | 3773
WN   | CE     | SBMCE_i | hard | 91.60 | 24.63 | 23.33 | -3.07 | 1.15 | 47.99  | 3775
WN   | CE     | SBMCE_l | hard | 93.88 | 60.35 | 50.40 | -2.88 | 0.68 | 25.34  | 3752

Experimental Results II: baseline

  • Deep ensemble of 5 WideResNet-28-10 models, cross-entropy loss, no calibration loss
images/ARC_total_unct_test_cifar10_r3741.png
Fig: ARC for total uncertainty on CIFAR10 test set
  • poor accuracy, but the UQ is helping

Experimental Results II: SB-MCE

  • Deep ensemble of 5 WideResNet-28-10 models, cross-entropy loss, SB-MCE calibration loss
images/ARC_aleatoric_unct_test_cifar10_r3773.png
Fig: ARC for aleatoric uncertainty on CIFAR10 test set
  • slightly worse metrics, but the ARC looks slightly better

Experimental Results II: reverse SB-MCE

  • Deep ensemble of 5 WideResNet-28-10 models, cross-entropy loss, reverse-sign SB-MCE calibration loss
images/ARC_aleatoric_unct_test_cifar10_r3775.png
Fig: ARC for aleatoric uncertainty on CIFAR10 test set
  • comparable metrics, but the ARC is slightly lower
  • shows a connection between ARC and (marginal) calibration

Experimental Results II: logit calibration

  • Deep ensemble of 5 WideResNet-28-10 models, cross-entropy loss, SB-MCE logit calibration loss (best by all metrics!)
images/ARC_aleatoric_unct_test_cifar10_r3752.png
Fig: ARC for aleatoric uncertainty on CIFAR10 test set
  • best metrics, yet counter-productive UQ!

Experimental Results II: logit calibration

  • Deep ensemble of 5 WideResNet-28-10 models, cross-entropy loss, SB-MCE logit calibration loss (best by all metrics!)
images/ARC_rw_aleatoric_unct_test_cifar10_r3752.png
Fig: ARC for aleatoric uncertainty on CIFAR10 test set
  • best metrics, yet counter-productive UQ!

Uncertainty Calibration Metric

  • we need a better metric that captures the ARC!
  • idea: predictions with similar uncertainty should have a similar accuracy
    • small number of fixed-width bins
    • the uncertainty should predict the accuracy
    • fit a spline to the uncertainty distribution

The bin containing the highest uncertainties should have the lowest accuracy, the bin containing the lowest uncertainties should have the highest accuracy, and the accuracy should decrease monotonically from the lowest-uncertainty bin to the highest-uncertainty bin.
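As a toy sketch only (the spline-based variant is still open): bin the samples by uncertainty into fixed-width bins and penalize every increase of bin accuracy towards higher uncertainty; all names below are illustrative, not the final metric:

```python
import numpy as np

def monotonicity_violation(uncertainty, correct, num_bins=10):
    """Sums every increase in bin accuracy towards higher uncertainty (0 = monotone)."""
    edges = np.linspace(uncertainty.min(), uncertainty.max(), num_bins + 1)
    idx = np.clip(np.digitize(uncertainty, edges[1:-1]), 0, num_bins - 1)
    accs = [correct[idx == j].mean() for j in range(num_bins) if np.any(idx == j)]
    # accs is ordered from the lowest- to the highest-uncertainty bin;
    # any increase towards higher uncertainty counts as a violation
    return np.clip(np.diff(np.array(accs)), 0.0, None).sum()
```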

A Final Hypothesis

The third way to obstruct UQ: Training all ensemble members simultaneously on identical samples?

  • the alternative base architecture (ConvNeXt) made heavy use of stochastic depth
    • same network architecture for each member (by design)
    • samples were identical, networks were different (in practice)
  • preliminary result: stochastic depth improved the test accuracy

Time-Series Forecasting Under Uncertainty

Predicting Electricity Spot Prices

  • Goal: Predict the electricity spot market price for the next 48 hours
images/energy-charts_Electricity_production_and_spot_prices_in_Germany_in_week_14_2025.png
Fig: Day-ahead spot market and simplified energy market chart for Germany

Electricity Spot Prices: Sources of Uncertainty

  • Renewable energy is cheaper, but weather-dependent
  • Spot prices are obtained by auction: player interactions
  • Availability of power plants changes
    • installed solar capacity +50% in two years
    • extended droughts, maintenance, …
    • different capabilities to adjust power output
  • Interactions between neighboring markets

Regression with Uncertainty

  • Model outputs predicted (future) value
    • variance of ensemble prediction
    • direct prediction of a prediction interval
  • What is the connection between “confidence” and uncertainty now?
  • Probabilistic evaluation:

Does the predicted interval cover the true value at the desired rate?

Regression Uncertainty Evaluation

  • Prediction Interval Coverage Probability (PICP)
  • Mean Predicted Interval Width (MPIW)
  • captured MPIW (interval width averaged only over covered samples) avoids rewarding narrow intervals on missed predictions (see the sketch below)
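A minimal sketch of these three quantities for predicted intervals [lower, upper] (array names are illustrative):

```python
import numpy as np

def interval_metrics(y_true, lower, upper):
    """PICP, MPIW and captured MPIW for predicted intervals."""
    covered = (y_true >= lower) & (y_true <= upper)
    picp = covered.mean()                            # fraction of targets inside their interval
    width = upper - lower
    mpiw = width.mean()                              # mean interval width
    mpiw_capt = width[covered].mean() if covered.any() else np.nan   # width over covered points only
    return picp, mpiw, mpiw_capt
```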

Anomaly Detection

  • Detection of anomalous inputs
    • reconstruction error
    • modelling the normal state
  • Independent from the model’s prediction

Can we detect inputs for which the predictive model and its UQ become unreliable?

Summary (Goal)

  • End-to-end model confidence calibration improves uncertainty calibration
  • Marginal calibration supersedes total calibration
  • Introduced a new uncertainty calibration metric
  • Used the uncertainty calibration as a secondary loss
  • Second project: time-series prediction with inherent uncertainty and anomaly detection

Appendix A: The CIFAR10-H dataset

  • the CIFAR10 test set, with each image labeled by roughly 50 human annotators
  • it gives us a sort of ground truth label distribution
  • we can compute the first-order uncertainty based on this distribution

Category Soft Labels

  • the authors used this as a baseline, but only trained on the test set
  • we can use this to train on the train set, too!
  • a natural competitor for such an approach: label smoothing (see the sketch below)
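A minimal sketch of how both target types enter the same cross-entropy objective; the smoothing factor eps is illustrative:

```python
import torch
import torch.nn.functional as F

def soft_label_loss(logits, soft_targets):
    """Cross-entropy against a full target distribution (e.g. CIFAR10-H soft labels)."""
    return -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

def label_smoothing_targets(labels, num_classes, eps=0.1):
    """Label-smoothing baseline: (1 - eps) on the annotated class, eps spread uniformly."""
    onehot = F.one_hot(labels, num_classes).float()
    return (1.0 - eps) * onehot + eps / num_classes
```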

CIFAR10-H Category Soft Labels

  • this table shows the mean human soft labels arranged as a confusion matrix (rows: image class, columns: mean annotated probability)
class 0 1 2 3 4 5 6 7 8 9
0 0.9482 0.008 0.0111 0.0023 0.0013 0.0013 0.0028 0.0013 0.0202 0.0036
1 0.0027 0.9684 0.0007 0.0016 0.0011 0.0008 0.0006 0.0005 0.001 0.0225
2 0.0039 0.001 0.9443 0.012 0.0098 0.0086 0.0132 0.004 0.0024 0.0008
3 0.0016 0.0015 0.0143 0.9119 0.0075 0.0427 0.0133 0.0036 0.0021 0.0015
4 0.0016 0.001 0.0113 0.0096 0.9019 0.0209 0.0074 0.0434 0.0017 0.0012
5 0.0007 0.0005 0.0058 0.0348 0.0037 0.9459 0.0031 0.0038 0.0008 0.0008
6 0.0006 0.0003 0.0116 0.014 0.0058 0.0066 0.9581 0.0014 0.0009 0.0006
7 0.0012 0.0006 0.0026 0.0019 0.0079 0.0074 0.0006 0.9756 0.0009 0.0012
8 0.0116 0.0035 0.0022 0.0019 0.0011 0.0012 0.0023 0.0008 0.9691 0.0064
9 0.0018 0.0242 0.0007 0.0011 0.0008 0.001 0.0007 0.0012 0.0043 0.9641

Human-Based Rejection

  • this figure shows the accuracy-rejection curve for a model evaluated on CIFAR10 test set
  • the second curve is a counterfactual that uses the same predictions, paired with the uncertainty values computed from the human annotations
images/ARC_total_unct_test_cifar10_with_cfh_r3767.png
Fig: ARC for total uncertainty on CIFAR10 test set, with the counterfactual human-uncertainty curve

Appendix B: Failure Cases I

  • Deep ensemble of 15 ConvNeXt models, cross-entropy loss, no calibration loss
images/ARC_total_unct_OOD_cifar10_2_r3750.png
Fig: ARC for total uncertainty on CIFAR10.2 as OOD set

Failure cases: Repulsive Deep Ensemble

  • Same hyper params as before, but trained as a repulsive ensemble
images/ARC_total_unct_OOD_cifar10_2_r3749.png
Fig: ARC for total uncertainty on CIFAR10.2 as OOD set
  • This ARC is completely flat because all validation samples received almost identical uncertainty values (0.968–0.999; the same holds for the test set), while all OOD samples received lower uncertainty (0.0–0.93)

Appendix B: Failure Cases II

  • Deep ensemble of 5 WideResNet-28-10 models, cross-entropy loss, SB-MCE calibration loss, soft category targets, ReLU activation (best of its kind!)
  • counter-productive UQ!
images/ARC_total_unct_test_cifar10_r3753.png
Fig: ARC for total uncertainty on CIFAR10 test set

Failure Cases: GELU activation

  • Same hyper params as before, only changed the activation function to GELU
  • almost identical performance metrics… yet:
images/ARC_total_unct_test_cifar10_r3764.png
Fig: ARC for total uncertainty on CIFAR10 test set
  • This may be a bug: after re-running the evaluation (-d -w -e -o cifar10_2), the results looked normal. What caused these unusually narrow UQ values, and why are they completely different when using eval_run.py instead of the evaluation that runs directly after train_model.py?