Thursday, September 23, 2010

Aggregate Forfeit Offset Tree Note

In a previous post I talked about the aggregate forfeit offset tree, which handles the problem of learning with partial feedback in the case that the feedback is aggregate, i.e., only the total sum of rewards for a subset of the actions is observed. The regret bound proof required that the historical policy satisfying a kind of detailed balance'' condition, namely, that the two sets \begin{aligned} \Upsilon_{\lambda, \neg \phi} &= \{ \mathcal{A} | \mathcal{A} \cap \omega = \emptyset, \lambda \in \mathcal{A}, \phi \not \in \mathcal{A}, E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0 \}, \\ \Upsilon_{\neg \lambda, \phi} &= \{ \mathcal{A} | \mathcal{A} \cap \omega = \emptyset, \lambda \not \in \mathcal{A}, \phi \in \mathcal{A}, E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0 \}. \end{aligned} had to be the same when $\lambda$ was replaced by $\phi$ in each set. Basically, if a set containing $\lambda$ and not $\phi$ is possible in the historical policy, the corresponding set with $\lambda$ replaced by $\phi$ had to be possible as well. This way, the expected contribution of the other rewards cancels and the reward of the individual action is revealed. Also, $|\Upsilon_{\lambda, \neg \phi}| = |\Upsilon_{\neg \lambda, \phi}| > 0$, i.e., there needs to be some sets with only one action and not the other.

Well I wanted to remind myself that if the historical policy doesn't obey this detailed balance condition, it can be possible to modify the historical data to create an effective historical policy that does obey the condition. For instance if $|\Upsilon_{\lambda, \neg \phi}| > |\Upsilon_{\neg \lambda, \phi}| > 0$, I can reject some of the extra sets in $\Upsilon_{\lambda, \neg \phi}$ such that $|\Upsilon_{\lambda, \neg \phi}| = |\Upsilon_{\neg \lambda, \phi}|$. Of course, all of the normalization constants and factors that depend upon the historical policy $p$ need to be adjusted because now the historical policy is $p^\prime$ (which is $p$ followed by a rejection step).