- predictions with similar confidence should have similar accuracy
- confidence should be a predictor for accuracy
- we can calculate a simple metric, the expected calibration error: \[ECE = \sqrt{\sum_{j=1}^M \frac{S_j}{N} \cdot (A_j - C_j)^2} \] where the $N$ predictions are sorted into $M$ confidence bins, $S_j$ is the number of samples in bin $j$, $A_j$ the accuracy of that bin, and $C_j$ its mean confidence
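A binned version of this metric can be computed directly from the maximum softmax confidences and a 0/1 correctness vector. The following is a minimal sketch matching the formula above; the function name, equal-width binning, and default bin count are my assumptions and not taken from the training scripts:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned calibration error as in the formula above:
    sqrt( sum_j (S_j / N) * (A_j - C_j)^2 ).

    confidences: max softmax probability per prediction, shape (N,)
    correct:     1 if the prediction was right, else 0, shape (N,)
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidences)

    # Equal-width bins over [0, 1]; each prediction is assigned to one bin.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1])

    err = 0.0
    for j in range(n_bins):
        mask = bin_ids == j
        s_j = mask.sum()
        if s_j == 0:
            continue  # empty bins contribute nothing
        a_j = correct[mask].mean()       # accuracy A_j of bin j
        c_j = confidences[mask].mean()   # mean confidence C_j of bin j
        err += (s_j / n) * (a_j - c_j) ** 2
    return np.sqrt(err)
```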
arch | L_prim | L_sec | tgts | Aval | Atst | Aood | cLL | bSc | ECE_TS | run |
---|---|---|---|---|---|---|---|---|---|---|
WN | CE | - | hard | 91.84 | 24.17 | 22.37 | -5.08 | 1.13 | 44.07 | 3741 |
WN | CE | SBECE | hard | 92.51 | 52.10 | 44.48 | -3.85 | 0.78 | 31.62 | 3733 |
WN | CE | SBMCE | hard | 93.88 | 60.35 | 50.40 | -2.88 | 0.68 | 25.34 | 3752 |
arch | L_prim | L_sec | tgts | Aval | Atst | Aood | cLL | bSc | ECE_TS | run |
---|---|---|---|---|---|---|---|---|---|---|
CN | CE | - | hard | 82.81 | 82.18 | 68.13 | -0.51 | 0.25 | 0.83 | 3750 |
CN | CE | SBECE_l | hard | 83.10 | 82.41 | 67.84 | -0.52 | 0.25 | 1.85 | 3755 |
CN | CE | SBMCE_l | hard | 82.64 | 82.59 | 68.42 | -0.59 | 0.26 | 5.22 | 3756 |
arch | L_prim | L_sec | tgts | Aval | Atst | Aood | cLL | bSc | ECE_TS | run |
---|---|---|---|---|---|---|---|---|---|---|
WN | CE | - | hard | 91.84 | 24.17 | 22.37 | -5.08 | 1.13 | 44.07 | 3741 |
WN | CE | SBMCE_s | hard | 92.01 | 23.72 | 22.85 | -3.10 | 1.16 | 47.64 | 3773 |
WN | CE | SBMCE_i | hard | 91.60 | 24.63 | 23.33 | -3.07 | 1.15 | 47.99 | 3775 |
WN | CE | SBMCE_l | hard | 93.88 | 60.35 | 50.40 | -2.88 | 0.68 | 25.34 | 3752 |
The bin with the highest uncertainty should have the lowest accuracy, the bin with the lowest uncertainty should have the highest accuracy, and accuracy should increase monotonically from the highest-uncertainty bin to the lowest-uncertainty bin (see the sketch below).
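One way to check this is to sort the test predictions into equal-count bins by an uncertainty score and compare per-bin accuracies. A minimal sketch, assuming predictive entropy of the softmax output as the uncertainty measure (the runs above may use a different score):

```python
import numpy as np

def accuracy_per_uncertainty_bin(probs, labels, n_bins=10):
    """Accuracy of equal-count bins, ordered from lowest to highest uncertainty."""
    probs = np.asarray(probs, dtype=float)   # (N, n_classes) softmax outputs
    labels = np.asarray(labels)

    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)   # predictive entropy
    correct = (probs.argmax(axis=1) == labels).astype(float)

    order = np.argsort(entropy)              # lowest uncertainty first
    bins = np.array_split(order, n_bins)     # equal-count bins
    return np.array([correct[idx].mean() for idx in bins])

# The returned accuracies should decrease (roughly) monotonically,
# i.e. np.all(np.diff(accs) <= 0) up to noise in small bins.
```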
Is the third way to obstruct UQ to train all ensemble members simultaneously on identical samples?
Does the uncertain prediction cover the true value at a desired rate?
Can we detect inputs for which the predictive model and its UQ become unreliable?
class | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.9482 | 0.008 | 0.0111 | 0.0023 | 0.0013 | 0.0013 | 0.0028 | 0.0013 | 0.0202 | 0.0036 |
1 | 0.0027 | 0.9684 | 0.0007 | 0.0016 | 0.0011 | 0.0008 | 0.0006 | 0.0005 | 0.001 | 0.0225 |
2 | 0.0039 | 0.001 | 0.9443 | 0.012 | 0.0098 | 0.0086 | 0.0132 | 0.004 | 0.0024 | 0.0008 |
3 | 0.0016 | 0.0015 | 0.0143 | 0.9119 | 0.0075 | 0.0427 | 0.0133 | 0.0036 | 0.0021 | 0.0015 |
4 | 0.0016 | 0.001 | 0.0113 | 0.0096 | 0.9019 | 0.0209 | 0.0074 | 0.0434 | 0.0017 | 0.0012 |
5 | 0.0007 | 0.0005 | 0.0058 | 0.0348 | 0.0037 | 0.9459 | 0.0031 | 0.0038 | 0.0008 | 0.0008 |
6 | 0.0006 | 0.0003 | 0.0116 | 0.014 | 0.0058 | 0.0066 | 0.9581 | 0.0014 | 0.0009 | 0.0006 |
7 | 0.0012 | 0.0006 | 0.0026 | 0.0019 | 0.0079 | 0.0074 | 0.0006 | 0.9756 | 0.0009 | 0.0012 |
8 | 0.0116 | 0.0035 | 0.0022 | 0.0019 | 0.0011 | 0.0012 | 0.0023 | 0.0008 | 0.9691 | 0.0064 |
9 | 0.0018 | 0.0242 | 0.0007 | 0.0011 | 0.0008 | 0.001 | 0.0007 | 0.0012 | 0.0043 | 0.9641 |
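For context, a table like the one above could be produced as a row-normalized confusion matrix (assuming rows are true classes, columns predicted classes, each row summing to ~1; it could just as well show per-class mean softmax probabilities). A minimal sketch of the former reading:

```python
import numpy as np

def row_normalized_confusion_matrix(preds, labels, n_classes=10):
    """Confusion matrix with each row (true class) normalized to sum to 1."""
    preds = np.asarray(preds)
    labels = np.asarray(labels)

    mat = np.zeros((n_classes, n_classes))
    np.add.at(mat, (labels, preds), 1.0)                  # count (true, predicted) pairs
    mat /= np.maximum(mat.sum(axis=1, keepdims=True), 1)  # normalize each true-class row
    return np.round(mat, 4)
```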
`-d -w -e -o cifar10_2`) the results looked normal. What may have caused these weirdly narrow UQ values? What causes them to be completely different when using `eval_run.py` instead of relying on the evaluation just after `train_model.py`?