- Why is the calibration so poor?
- Can it be corrected?
- Should it be corrected?
First, the universe might be non-stationary. However, the training data isn't that old, and validation during model construction was done using temporally out-of-sample data, so I don't think that explains it.
Second, calibrating on the historical data might have been misleading. We know from contextual bandit problems that when given historical data where only the value of chosen alternatives is revealed (i.e., nothing is learned about actions not taken), the observed rewards have to be discounted by action probabilities, and generalization to actions with low or no historical probability is suspect. We didn't take any of that into account.
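To make the discounting by action probabilities concrete, here is a minimal sketch of inverse propensity scoring on simulated logs. Everything in it is an assumption for illustration: the two actions `"a"` and `"b"`, their logging probabilities and true mean rewards, and the helper names `ips_value` and `simulate_logs` are all hypothetical and have nothing to do with the actual decision problem.

```python
import random

def ips_value(logs, target_policy):
    """Inverse-propensity-scored estimate of a target policy's mean reward.

    logs: list of (action, reward, logging_probability) tuples, where the
    reward is only known for the action that was actually taken.
    target_policy: function mapping an action to the probability that the
    policy being evaluated would choose it.
    """
    total = 0.0
    for action, reward, prob in logs:
        # Reweight each observed reward by how much more (or less) often
        # the target policy would have taken the logged action.
        total += target_policy(action) * reward / prob
    return total / len(logs)

def simulate_logs(n, seed=0):
    """Hypothetical historical data: action "b" is better but rarely chosen."""
    rng = random.Random(seed)
    logging_probs = {"a": 0.9, "b": 0.1}   # assumed historical policy
    true_mean = {"a": 0.2, "b": 0.6}       # assumed true reward rates
    logs = []
    for _ in range(n):
        action = "a" if rng.random() < logging_probs["a"] else "b"
        reward = 1.0 if rng.random() < true_mean[action] else 0.0
        logs.append((action, reward, logging_probs[action]))
    return logs

logs = simulate_logs(200_000)

# Naive average of logged rewards: describes the historical policy only.
naive = sum(reward for _, reward, _ in logs) / len(logs)

# IPS estimate of what "always choose b" would have earned.
always_b = ips_value(logs, lambda action: 1.0 if action == "b" else 0.0)
```

The naive average reflects the historical mix of actions, while the reweighted estimate recovers the value of the rarely chosen action. The division by `prob` is also why generalization to actions with low historical probability is suspect: the weights, and hence the variance, blow up as `prob` approaches zero.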
Third, and what I find most interesting, is that the data driving this calibration plot might be misleading. In particular, any estimator will make both overestimation and underestimation errors; but when these estimates are fed to a linear program (which is doing a stylized argmax), presumably overestimates are generally selected and underestimates are generally suppressed.
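The selection effect is easy to reproduce in a toy simulation. The sketch below replaces the linear program with a plain top-k selection by estimated value, which is a simplifying assumption, and the uniform true values, Gaussian noise level, and function name `selection_bias` are all made up for illustration.

```python
import random

def selection_bias(n_items=10_000, k=500, noise=0.2, seed=0):
    """Compare the mean estimation error over all items with the mean
    error over the k items that have the highest estimates."""
    rng = random.Random(seed)
    truth = [rng.random() for _ in range(n_items)]
    # Unbiased estimator: true value plus zero-mean noise.
    estimate = [t + rng.gauss(0.0, noise) for t in truth]
    errors = [e - t for e, t in zip(estimate, truth)]

    # Stand-in for the linear program: pick the k largest estimates.
    chosen = sorted(range(n_items), key=lambda i: estimate[i], reverse=True)[:k]

    mean_all = sum(errors) / n_items
    mean_chosen = sum(errors[i] for i in chosen) / k
    return mean_all, mean_chosen

mean_all, mean_chosen = selection_bias()
```

Over all items the errors average out to roughly zero, but among the selected items the mean error is clearly positive: an estimator that is calibrated on the whole population looks miscalibrated when judged only on the decisions it caused.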
The shape of the graph lends credence to the third explanation. There are essentially two types of coefficients in the linear program, and although the same model estimates both kinds, it was calibrated separately on each type; so there is a large catenary-style shape on the right and a smaller, compressed catenary on the left. In both cases the calibration is good at the low and high end and consistently underestimating in the middle. At the low end underestimation is presumably harder to achieve, so that is consistent with this explanation; why the calibration looks good at the high end is less clear.
I actually expected a small effect due to selection of overestimates by the linear program, but the magnitude of the distortion surprised me, so I think the third explanation is only a partial one.
Can it be corrected? Here's a thought experiment: suppose I fit a spline to the above graph to create a calibration correction and then define a new model which is equal to the old model but with the output passed through the calibration correction. I then use this new model in the linear program to make decisions. What happens? In particular, will the calibration plot going forward look better?
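For concreteness, the correction in this thought experiment might look something like the sketch below. It uses a binned, piecewise-linear map as a simple stand-in for a spline, and the data are synthetic: `old_model`, the squared relationship between prediction and outcome, and the helper names are assumptions for illustration, not the real model or plot.

```python
from bisect import bisect_right

def fit_calibration(predicted, observed, n_bins=10):
    """Sort by predicted value, cut into bins, and return one knot per bin:
    (mean prediction in the bin, mean observed outcome in the bin)."""
    pairs = sorted(zip(predicted, observed))
    size = max(1, len(pairs) // n_bins)
    knots = []
    for start in range(0, len(pairs), size):
        chunk = pairs[start:start + size]
        mean_p = sum(p for p, _ in chunk) / len(chunk)
        mean_o = sum(o for _, o in chunk) / len(chunk)
        knots.append((mean_p, mean_o))
    return knots

def apply_calibration(knots, x):
    """Piecewise-linear interpolation through the knots, clamped at the ends."""
    xs = [knot[0] for knot in knots]
    if x <= xs[0]:
        return knots[0][1]
    if x >= xs[-1]:
        return knots[-1][1]
    i = bisect_right(xs, x)  # xs[i - 1] <= x < xs[i]
    (x0, y0), (x1, y1) = knots[i - 1], knots[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Synthetic calibration data: outcomes are the square of the prediction,
# so the uncorrected model is off everywhere except at 0 and 1.
predicted = [i / 100 for i in range(100)]
observed = [p ** 2 for p in predicted]
knots = fit_calibration(predicted, observed)

def old_model(x):
    """Hypothetical uncorrected model (identity, for the demo)."""
    return x

def new_model(x):
    """Old model with its output passed through the calibration correction."""
    return apply_calibration(knots, old_model(x))

corrected_mid = new_model(0.5)
```

On this synthetic data the corrected output at 0.5 lands close to the observed 0.25. Whether that still holds once the corrected scores are fed back into the linear program is exactly the open question.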
I'm not sure what would happen. It feels like the linear program will be fighting my attempts to calibrate by continuing to prefer overestimates to underestimates, essentially forcing me to systematically underestimate in order to calibrate the extreme values. At that point the linear program could switch to using other (ranges of) estimates and I could end up playing a game of whack-a-mole. If this is true, perhaps a way to proceed is to make a small fraction of decisions obliviously (e.g., by assigning them a random score and mixing them in) and only use those decisions to calibrate the model.
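Here is a rough sketch of what making a small fraction of decisions obliviously could look like. As before, top-k selection stands in for the linear program, and the function name `decide_with_exploration`, the 10% exploration fraction, and the simulated values are all assumptions for illustration.

```python
import random

def decide_with_exploration(estimates, k, explore_fraction=0.1, seed=0):
    """Choose k items: a small fraction uniformly at random (ignoring the
    estimates), the rest by taking the highest remaining estimates.

    Returns (explore_indices, exploit_indices)."""
    rng = random.Random(seed)
    n_explore = int(round(k * explore_fraction))
    indices = list(range(len(estimates)))
    # Oblivious decisions: selected without looking at the model output.
    explore = set(rng.sample(indices, n_explore))
    remaining = [i for i in indices if i not in explore]
    remaining.sort(key=lambda i: estimates[i], reverse=True)
    exploit = remaining[:k - n_explore]
    return sorted(explore), exploit

# Simulated unbiased estimator: true value plus zero-mean noise.
rng = random.Random(1)
truth = [rng.random() for _ in range(10_000)]
estimates = [t + rng.gauss(0.0, 0.2) for t in truth]

explore, exploit = decide_with_exploration(estimates, k=2_000)

def mean_error(indices):
    """Average of (estimate - truth) over the given items."""
    return sum(estimates[i] - truth[i] for i in indices) / len(indices)

mean_explore = mean_error(explore)   # calibration data free of selection
mean_exploit = mean_error(exploit)   # distorted by selecting on the estimate
```

In this toy setting the obliviously chosen decisions show roughly zero mean error while the argmax-style decisions show a positive one, so only the former give an undistorted view of calibration. The price is that the exploration decisions are, on average, worse than the ones the optimizer would have made.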
Less pessimistically, I do suspect that some convex combination of the old (uncorrected) and new (corrected) model would appear more calibrated in the sense of the above graph.
Should it be corrected? This is the big question: will the business metrics associated with this decision problem improve if the above graph is made to look more like $y = x$? It's not clear to me, because the above graph basically represents extreme values: we don't see what's happening with the decisions that are not taken. Maybe the answer is yes, because the underestimation pattern looks systematic enough that it might be a characteristic of the linear program we are solving: in that case, calibrating the extreme values and screwing up the mean could still be a good thing to do. Maybe the answer is no, because the underestimation pattern is a characteristic of a calibrated estimator coupled with a linear program, and by attempting to correct it the linear program ends up getting the wrong information to make decisions.
Well, we'll be running a live test related to this calibration issue, so hopefully I'll have more to say about it in the next few weeks.