Here's something cool about the offset tree: the algorithm is well defined even when some actions have no support in the historical policy, i.e., when $p (a | x) = 0$. The regret bound theorem doesn't go through but what does happen is interesting. Given an internal node with input actions $\lambda$ and $\phi$, where $\phi$ is never observed historically, the expected importance weight difference is $\left(r (\lambda) - \frac{1}{2}\right)$. So basically, if the observed action has an average reward which exceeds the median, it sticks with it: otherwise it switches to the never seen before action.

Of course, blindly generalizing to an action never before observed intuitively sounds like a dangerous idea. This intuition is formalized in Strehl et. al., where several elements combine to avoid this problem: an empirical policy estimator which truncates the importance weight of historically rare events; a high-probability regret bound for empirical value maximization where the comparison class of policies are those with sufficient historical support; and in practice a modification of the contextual bandit argmax regression to consider only those choices that have sufficient historical support.

So this last part got my attention, because once again regression makes it easy to introduce a constraint. Fortunately, for a fixed set of historical data and a fixed decision about what the level of sufficient historical support needs to be, the set of actions which are effectively "not allowed" for a given input can be computed. Thus by using the forfeit offset tree, the resulting learned policy can be constrained to have sufficient support.

## No comments:

## Post a Comment