Crossvalidation and the Disappointed Optimizer

2019/03/25

Wherein I introduce and explain the well-studied mechanism behind the pervasive phenomenon in life of things being just a bit worse than expected. I will walk through some of the discussion in the excellent article, The Optimizer’s Curse: Skepticism and Postdecision Surprise in Decision Analysis by Smith and Winkler.

Have you noticed that the checkout lanes you choose at the grocery store tend to be slower than you expected, the products you order off Amazon disappoint, and your second dates are not as exciting as your firsts? If you have, and suspect that the universe is conspiring against you, let me assure you that your suspicions are well-founded. Indeed the fundamental laws of probability have carefully been arrayed to scupper your satisfaction. If you haven’t noticed any such conspiracy, I congratulate you on either your consistent practice of Zen Buddhism or on your finely tuned skepticism about things.

The mechanism that may explain this seeming conspiracy, phenomena as diverse as the replication crisis in the social sciences and the underwhelming production of oil wells and wind farms, and even superstitions like the Sports Illustrated cover jinx, is what Smith and Winkler call the Optimizer’s Curse. Let’s look a bit closer at the mechanism (using notation slightly different from Smith and Winkler).

Imagine that you are confronted with a choice between \(n\) options, \(1, \ldots, n\). You would of course want to select the option with the highest value to you. Let the true values of these \(n\) options to you be the vector \(\mu = (\mu_1, \ldots, \mu_n)\). If these true values were known to you, then you would choose option \(m\) where \(m = \mbox{argmax}_i\, \mu_i\).

Instead assume that you only have estimates \(V = (v_1, \ldots, v_n)\) of the true values. Further assume that the estimates are unbiased; that is, on average each \(v_i\) neither overestimates nor underestimates its corresponding true value \(\mu_i\). An obvious strategy is to choose option \(k = \mbox{argmax}_i\, v_i\), the option with the largest estimate of value. After the choice is made, the true value \(\mu_k\) is revealed to you. We want to know what the surprise \(\mu_k - v_k\) is, on average.

We have \[\mu_k - v_k \leq \mu_k - v_m \leq \mu_m - v_m\] The first inequality is because \(v_k \geq v_m\) (that’s why you chose option \(k\)) and the second inequality is because \(\mu_m \geq \mu_k\), since the \(m^{th}\) option is the best (unbeknownst to you).

Now taking expectations with respect to \(V\) we have \[ E[\mu_k - v_k ] \leq E[\mu_m - v_m] = 0\] The equality of the second term to zero is due to our assumption that the estimates are all (and in particular the \(m^{th}\) one is) unbiased. The expected surprise is therefore non-positive. In fact, it is zero only when \(P(k = m) = 1\). That is, if there is any chance that you can make a wrong choice based on the estimates, then your surprises are on average unpleasant.
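
To see the curse in action, here is a minimal simulation sketch (the normal distributions for the true values and the estimation noise are arbitrary choices of mine): the average surprise of the naive "pick the largest estimate" rule comes out negative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_options, n_trials = 10, 100_000
surprises = []
for _ in range(n_trials):
    mu = rng.normal(0.0, 1.0, size=n_options)        # true values (arbitrary distribution)
    v = mu + rng.normal(0.0, 1.0, size=n_options)    # unbiased but noisy estimates
    k = np.argmax(v)                                  # pick the option that looks best
    surprises.append(mu[k] - v[k])                    # realized surprise
print(np.mean(surprises))  # negative: on average the chosen option disappoints
```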

Disappointment Mitigation

If you knew that the true values are being generated from a known distribution \(P(\mu)\), the bias can be completely eliminated by turning the Bayesian crank.

In this case the generative model for our problem can be written as (read \(\sim\) “is distributed as”) \[\begin{aligned} \mu &\sim P(\mu)\\ v_i &\sim P(.| \mu_i) \end{aligned} \]

The procedure to follow under this model is to first estimate the posterior means \(E[\mu|V] \triangleq (\hat{\mu}_1, \ldots, \hat{\mu}_n)\) and then select the “best” option based on the posterior means, \(k = \mbox{argmax}_i\, \hat{\mu}_i = \mbox{argmax}_i\, E[\mu_i|V]\).

What is the expected surprise under this model? To make things clear I will make it explicit as to what variables the expectations are over.

\[\begin{aligned} E[\mu_k - \hat{\mu}_k] &\triangleq E_{\mu, V}[\mu_k - \hat{\mu}_k] \\ &= E_V[E_\mu[\mu_k - \hat{\mu}_k|V]] \\ &= E_V[E_\mu[\mu_k|V] - \hat{\mu}_k] \\ &= E_V[\hat{\mu}_k - \hat{\mu}_k] = 0 \end{aligned}\]

The first line is just the definition of what we mean by average surprise. The second line uses the law of total expectation. The third line uses the fact that since \(\hat{\mu}_k\) is a function of \(V\), it is a constant when \(V\) is fixed. The last line is from the definition of the posterior estimate \(\hat{\mu}_k\).

Therefore, if you use the posterior estimates both to select among your options and as your expectation of the resulting value, then your average surprise is zero. The Bayesian posterior expectation shrinks the estimates \(v_i\) towards the prior mean of \(\mu_i\), which is what makes the procedure unbiased.
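
For a concrete instance, suppose the true values come from a normal prior and the estimates have normal noise, so the posterior mean has a closed form. A small sketch (the prior and noise parameters below are made up) shows that selecting and forecasting with the shrunken posterior means removes the bias:

```python
import numpy as np

rng = np.random.default_rng(1)
n_options, n_trials = 10, 100_000
m0, tau2, sigma2 = 0.0, 1.0, 1.0       # assumed prior mean/variance and noise variance
shrink = tau2 / (tau2 + sigma2)        # normal-normal posterior weight on the estimate
surprises = []
for _ in range(n_trials):
    mu = rng.normal(m0, np.sqrt(tau2), size=n_options)
    v = mu + rng.normal(0.0, np.sqrt(sigma2), size=n_options)
    mu_hat = shrink * v + (1 - shrink) * m0   # posterior means E[mu_i | v_i]
    k = np.argmax(mu_hat)                     # select on the posterior means
    surprises.append(mu[k] - mu_hat[k])       # surprise relative to the posterior mean
print(np.mean(surprises))  # approximately zero
```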

Depending on the distribution \(P(V|\mu)\), the option selected by the Bayesian procedure (i.e., optimizing over \(\hat{\mu}\)) can be different from the one selected by optimizing over \(V\). For example, if we knew that the estimate with the largest magnitude, say \(v_j\), is very noisy, option \(j\) may no longer be selected after the shrinkage due to the posterior expectation. An option with a smaller but more accurate estimate may win out.
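
As a toy numerical illustration (the numbers are made up): suppose the prior is \(\mu_i \sim N(0, 1)\). An option with estimate \(v_1 = 3\) but noise variance \(9\) has posterior mean \(\hat{\mu}_1 = \frac{1}{1+9}\cdot 3 = 0.3\), while an option with the smaller but more precise estimate \(v_2 = 2\) and noise variance \(0.25\) has posterior mean \(\hat{\mu}_2 = \frac{1}{1+0.25}\cdot 2 = 1.6\). After shrinkage the second option wins.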

We know that the Bayesian procedure is unbiased, but does it produce better decisions? In other words, how does \(E_{\mu, V}[\mu_k]\) when \(k = \mbox{argmax}_i\, \hat{\mu}_i = \mbox{argmax}_i\, E[\mu_i|V]\) compare to \(E_{\mu, V}[\mu_j]\) when \(j = \mbox{argmax}_i\, v_i\) ? This question is not explicitly addressed by Smith and Winkler, but the following quick argument shows that the Bayesian procedure not only produces unbiased estimates of the value of the decisions you make, but it also produces better decisions. Using these definitions of \(k\) and \(j\) we have \[ E_{\mu, V}[\mu_k] - E_{\mu, V}[\mu_j] = E_V[E_\mu[\mu_k|V]] - E_V[E_\mu[\mu_j|V]] = E_V[\hat{\mu}_k - \hat{\mu}_j] \geq 0\] The first equality uses the same trick of the law of total expectation, the second equality is from the definition of \(\hat{\mu}\) and the inequality is because \(k\) is the index which maximizes the posterior mean. Therefore, unless the two procedures always produce the same decisions, on average the Bayesian procedure yields greater value.

A Subtlety. When I first read the proof of unbiasedness of the Bayesian procedure I was puzzled by it, because under a sufficiently flat prior the Bayesian decision procedure arbitrarily closely approximates maximizing over the estimates \(v_i\). So how can the former procedure be unbiased when the latter is not?

I realized that there is a subtle sleight-of-hand in the proof. Note that the fourth line of the proof of unbiasedness of the Bayesian procedure uses the fact that \(E_\mu[\mu_k|V] = \hat{\mu}_k\). This expectation with respect to \(\mu\) is being conducted by the universe after the decision, whereas the expectation in \(\hat{\mu}_i = E[\mu_i|V]\) to obtain the estimates was conducted by you to make the decision. Therefore, as a quick simulation will verify, if your notion of the prior distribution on \(\mu\) differs from the universe’s, the guarantee of unbiasedness does not hold. (Another way to think about this is to notice that as the universe’s choice of the prior becomes flat and wide, the probability of making a wrong choice becomes small.)
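
Here is a minimal sketch of such a simulation (both the universe’s prior and your assumed prior are invented parameters): when your assumed prior is much flatter than the true one, the shrinkage is too weak and the negative surprise reappears.

```python
import numpy as np

rng = np.random.default_rng(2)
n_options, n_trials = 10, 100_000
true_tau2, assumed_tau2, sigma2 = 1.0, 100.0, 1.0     # universe's prior vs your (too flat) prior
shrink = assumed_tau2 / (assumed_tau2 + sigma2)       # your shrinkage factor, close to 1
surprises = []
for _ in range(n_trials):
    mu = rng.normal(0.0, np.sqrt(true_tau2), size=n_options)   # universe draws true values
    v = mu + rng.normal(0.0, np.sqrt(sigma2), size=n_options)
    mu_hat = shrink * v                                         # your posterior means (prior mean 0)
    k = np.argmax(mu_hat)
    surprises.append(mu[k] - mu_hat[k])
print(np.mean(surprises))  # noticeably negative again: unbiasedness is lost
```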

Generally, since our belief about \(\mu\) will be flatter and wider (indicating our ignorance) than the true prior, the optimizer’s bias is not completely eliminated even with the Bayesian procedure. In this case one could apply Smith and Winkler’s heuristic of a hierarchical prior, which results in posterior estimates that shrink towards a weighted average of a global prior mean of the values and the mean of the estimates \(\bar{v} = \frac{1}{n}\sum_i v_i\).
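
A sketch of one such hierarchical model in the normal case (the specific distributional assumptions here are mine, not necessarily Smith and Winkler’s exact formulation): let \(m \sim N(m_0, \beta^2)\), \(\mu_i \sim N(m, \tau^2)\) and \(v_i \sim N(\mu_i, \sigma^2)\) with a common noise variance \(\sigma^2\). Then \[E[m|V] = \frac{m_0/\beta^2 + n\bar{v}/(\tau^2 + \sigma^2)}{1/\beta^2 + n/(\tau^2 + \sigma^2)}, \qquad \hat{\mu}_i = \frac{\tau^2 v_i + \sigma^2 E[m|V]}{\tau^2 + \sigma^2},\] so each estimate is shrunk towards a precision-weighted combination of the global prior mean \(m_0\) and the mean of the estimates \(\bar{v}\).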

Train-Test Split and Crossvalidation

Let us change the setup a bit and postulate that instead of one estimate for each option, we have two independent estimates. That is, for each option \(i\) we have two unbiased estimates \(v_i^A\) and \(v_i^B\) which, conditioned on the true value \(\mu_i\), are independent. What would be the bias of our estimates if we optimize using the first set of estimates \(v_i^A\), but estimate the future values from the second set of estimates \(v_i^B\)? That is, first select the option \(k = \mbox{argmax}_i\, v_i^A\), but expect that the value of option \(k\) is \(v_k^B\). The expected surprise is given by

\[\begin{aligned} E[\mu_k - v_k^B] &= \sum_{i=1}^n E[\mu_k - v_k^B|k = i] P(k=i) \\ &= \sum_{i=1}^n E[\mu_i - v_i^B|v_i^A > v_{j \neq i}^A] P(v_i^A > v_{j \neq i}^A ) \\ &= \sum_{i=1}^n E[\mu_i - v_i^B] P(v_i^A > v_{j \neq i}^A ) = 0 \end{aligned} \] The first equality is from the law of total expectation, the second uses the definition of \(k = i\), the third uses the conditional independence of the \(A\) and \(B\) estimates, and the final equality follows because all the \(B\) estimates are unbiased.
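
A quick simulation sketch (again with made-up normal distributions) confirms this: selecting on the \(A\) estimates but forecasting with the \(B\) estimate of the selected option yields zero average surprise, while forecasting with the \(A\) estimate does not.

```python
import numpy as np

rng = np.random.default_rng(3)
n_options, n_trials = 10, 100_000
surprise_a, surprise_b = [], []
for _ in range(n_trials):
    mu = rng.normal(0.0, 1.0, size=n_options)          # true values
    v_a = mu + rng.normal(0.0, 1.0, size=n_options)    # estimate set A
    v_b = mu + rng.normal(0.0, 1.0, size=n_options)    # independent estimate set B
    k = np.argmax(v_a)                                  # select using A only
    surprise_a.append(mu[k] - v_a[k])                   # biased: selected and evaluated on A
    surprise_b.append(mu[k] - v_b[k])                   # unbiased: evaluated on held-out B
print(np.mean(surprise_a), np.mean(surprise_b))  # negative vs. roughly zero
```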

Since we are using only part of the information we have for the decision making and the remaining part for estimating the future value of the decision there is a natural trade-off: we can use the estimate with the smaller variance (between \(A\) and \(B\)) to make the decision and the other for value estimation, or vice versa. We are choosing between making worse decisions or making worse (but still unbiased) estimates of the future values from those decisions.

We can now relate this whole discussion to how we train machine learning models. Postdecision surprise is hardly a foreign concept to data scientists who regularly select the “best” among competing models to release into the wild, based on some estimate of their accuracy on a training dataset.

The above discussion implies that if we perform a train-test split of our data, train and select between the models based solely on the train-set and use the test-set only to evaluate the performance of the selected model, then we can be confident that on average the real-world performance will not disappoint (as long as the performance evaluation on the test-set is unbiased). Just to be clear, this guarantee of unbiasedness is over the real-world performance of all such models trained and evaluated in this way.

So the procedure to train and evaluate machine learning models ought to be to split the data into two sets, perform model selection on one, and perform model evaluation on the other after the selection is done. The model selection on the train-set can use whatever crossvalidation procedure maximizes the probability of picking the best model, but must use no information from the test-set.
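
As a concrete sketch of this workflow in scikit-learn (the dataset, the model, and the candidate hyperparameters here are placeholders): model selection happens entirely inside the training set via crossvalidation, and the test set is touched exactly once, to report the performance of the already-selected model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Hold out a test set that plays no role in model selection.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Select the model by crossvalidation on the training set only.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
                      cv=5)
search.fit(X_train, y_train)

# Evaluate the selected model once on the untouched test set.
print(search.best_params_, search.score(X_test, y_test))
```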

However, the temptation of using information from the test-set to select between models is too much even for the most conscientious data scientist. In fact, the most common procedure is to perform crossvalidation on the entire dataset to both select the model and to judge its accuracy. This whole discussion should have convinced you that this procedure is almost sure to produce disappointing real-world performance.

In Estimating the Maximum Expected Value: An Analysis of (Nested) Cross Validation and the Maximum Sample Average, Van Hasselt studies the case of performing selection when the value estimates are biased. In particular he considers nested crossvalidation, i.e., the data is split into \(K\) folds and within each fold another \(L\)-fold crossvalidation is performed (train on \(L-1\) internal folds, evaluate on the remaining fold, and average over all the data). He distinguishes between two cases: using \(K-1\) folds for maximization and the remaining fold for accuracy estimation (called Low Bias Crossvalidation), and using one fold for optimization and the remaining \(K-1\) folds for accuracy estimation (called Low Variance Crossvalidation). In certain cases nested crossvalidation can yield negative bias (with increased variance), which may be desirable. He concludes that there is no universally good way to guarantee a good estimator of performance, and experimentally shows that even the widely recommended 5- or 10-fold crossvalidation can yield poor estimates.
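
For reference, this nested structure is the standard nested crossvalidation pattern; a sketch in scikit-learn terms (with \(K=5\), \(L=3\), and a placeholder model and grid chosen by me) roughly corresponds to the Low Bias variant, where the inner search does the maximization and each held-out outer fold supplies the accuracy estimate.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)   # L internal folds for selection
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)   # K outer folds for estimation

# In each outer fold, the grid search selects hyperparameters on the K-1 training folds,
# and the held-out outer fold supplies the accuracy estimate; the scores are then averaged.
selector = GridSearchCV(SVC(), param_grid={"C": [0.1, 1.0, 10.0]}, cv=inner_cv)
scores = cross_val_score(selector, X, y, cv=outer_cv)
print(scores.mean())
```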

Miscellany

Reinforcement learning using the Q-learning algorithm iteratively updates an estimate of the state-action value (so-called Q-value) as follows: \[Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha (r_t + \gamma \,\mbox{max}_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t))\] Notice how the updated state-action value depends on the estimate of the value of the maximizing action in the next state. We know from all the discussion above that, because we are using the current estimate \(Q_t\) to both maximize and evaluate the value of the maximizing action, it will be biased high after the update, even if the current estimate \(Q_t\) is an unbiased estimator of the value of every state-action pair.
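
A tabular rendering of this update (the state/action space sizes and the learning parameters below are placeholders) makes the double role of \(Q_t\) explicit: the same table supplies both the maximizing action and the value plugged into the target.

```python
import numpy as np

n_states, n_actions = 5, 3
alpha, gamma = 0.1, 0.9
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    # The same table Q both picks the maximizing action and values it,
    # which is what introduces the upward bias discussed above.
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
```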

Van Hasselt proposed Double Q-Learning to mitigate this problem: analogous to the two estimators \(A\) and \(B\) above, two \(Q\) functions are maintained and estimated on different sets of data. In each step a randomly selected \(Q\) function is updated; the maximizing action is chosen using that same \(Q\) function, but its value is obtained from the other. He empirically shows that Double Q-learning converges more quickly than the vanilla version.
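
A sketch of the corresponding tabular update (same placeholder sizes and parameters as above): on each step one table is chosen at random to be updated; it picks the maximizing action, while the other table supplies that action's value.

```python
import numpy as np

n_states, n_actions = 5, 3
alpha, gamma = 0.1, 0.9
Q_a = np.zeros((n_states, n_actions))
Q_b = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def double_q_update(s, a, r, s_next):
    # Randomly pick which table to update; the updated table chooses the
    # maximizing action, but its value comes from the other table.
    if rng.random() < 0.5:
        update, other = Q_a, Q_b
    else:
        update, other = Q_b, Q_a
    a_star = np.argmax(update[s_next])
    target = r + gamma * other[s_next, a_star]
    update[s, a] += alpha * (target - update[s, a])
```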