## Friday, September 24, 2010

### Positional Effects

In a previous post I talked about constrained cost-sensitive best m with partial feedback (CSBM-PF) and one way to reduce it to constrained cost-sensitive multiclass classification with partial feedback (CSMC-PF). The CSBM variant I considered was about choosing the best set of choices when the reward of a set is the sum of the rewards of its elements, and the reduction to CSMC works by "pick the best choice, then the next best choice, and so on." A major inspiring application is optimizing sets of content (or advertising) elements, and indeed I have a content optimization problem in mind for my current gig. In practice for such applications positional effects are important, so I thought I'd try to incorporate them.

My reduction works as follows: first the highest reward choice is chosen, then its reward is adjusted to $-\infty$, and the process is repeated until a set of size $m$ has been achieved. This is essentially how I've always seen it done in the past (e.g., construct a regressor and fill positions in sorted order), but for this greedy approach to work the positional dependence cannot be arbitrary: it must be that $\sum_{i=1}^m \max_{a_i \in A \setminus \{ a^*_1, \ldots, a^*_{i-1} \} } r (a_i, i) = \max_{a \in \tilde A^m}\; \sum_{i=1}^m r (a_i, i).$ Here $r: A \times [1, m] \to [0, 1]$ gives the (conditionally expected) rewards, $A$ is the set of actions, $[1, m]$ is the set of positions, and the right hand side maximum is over vectors of actions $\tilde A^m$ without duplicates (i.e., the same action cannot be chosen at multiple positions).

This "greedy works" condition is actually a much weaker condition than I'm used to assuming. For instance, here's an example set of expected rewards for which greedy works: $\begin{array}{|c|c|c|c|} \hline \mbox{Action} & r (a, 1) & r (a, 2) & r (a, 3) \\ \hline 1 & 10 & 5 & 1 \\ \hline 2 & 2 & 4 & 1 \\ \hline 3 & 1 & 2 & 3 \\ \hline \end{array}$ The first action has decreasing reward by position, while the second action prefers a particular position and the third action has increasing reward by position. So I thought I'd have some fun exploring the above "greedy works" condition.
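To make the toy example concrete, here's a small sketch (my code, not part of the original argument; the function names are mine) that checks greedy position-by-position filling against exhaustive search over duplicate-free action vectors on the reward table above.

```python
from itertools import permutations

# Toy table from above: r[a][i] = expected reward of action a in position i
r = [[10, 5, 1],
     [2, 4, 1],
     [1, 2, 3]]
m = 3

def greedy_total(r, m):
    """Fill positions in order, always taking the best remaining action
    (equivalently: pick the best, set its reward to -infinity, repeat)."""
    available = set(range(len(r)))
    total = 0
    for i in range(m):
        best = max(available, key=lambda a: r[a][i])
        total += r[best][i]
        available.remove(best)
    return total

def brute_total(r, m):
    """Maximum total reward over all duplicate-free action vectors."""
    return max(sum(r[a][i] for i, a in enumerate(p))
               for p in permutations(range(len(r)), m))

# Greedy picks action 1, then action 2, then action 3: 10 + 4 + 3
assert greedy_total(r, m) == brute_total(r, m) == 17
```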

**Lemma: Position Maximizers**
If there is at least one maximizer of $\sum_{j=1}^m r (a_j, j)$, then there is a maximizer $\tilde a^*$ of $\sum_{j=1}^m r (a_j, j)$ which uses an individual position maximizer $a^*_i$ for every $i \in [1, m]$, where $a^*_i$ maximizes $r (a, i)$.

Proof: Let $a^*_i$ be a maximizer of $r (a, i)$, and assume $\tilde a^*$ is a maximizer of $\sum_{j=1}^m r (a_j, j)$ that does not use $a^*_i$ in any position. Define $a^\prime$ as $\tilde a^*$ with position $i$ replaced by $a^*_i$; since $\tilde a^*$ does not use $a^*_i$ anywhere, this replacement introduces no duplicate. Since $a^*_i$ is a maximizer of $r (a, i)$, the total reward cannot decrease, therefore $a^\prime$ is also a maximizer of $\sum_{j=1}^m r (a_j, j)$. Repeating this argument for each position $i$ eventually yields a maximizer that uses an individual position maximizer in every position.
Now, here is a sufficient condition for "greedy works."
**Sufficient Condition: Swap Supremacy**
If $r (a, i) \geq r (a^\prime, i) \implies \forall j > i: r (a, i) + r (a^\prime, j) \geq r (a, j) + r (a^\prime, i),$ then $\sum_{i=1}^m \max_{a_i \in A \setminus \{ a^*_1, \ldots, a^*_{i-1} \} } r (a_i, i) = \max_{a \in \tilde A^m}\; \sum_{i=1}^m r (a_i, i).$

Proof: The proof is by induction on $m$. The base case is $m = 1$, in which case greedy always works.

To show $m - 1 \Rightarrow m$, note from the lemma that there is a maximizer of the right hand side that uses $a^*_1$ in some position. If that position is $j \neq 1$, construct a new maximizer of the right hand side by swapping positions 1 and $j$: by the precondition of the theorem (applied with $i = 1$), this is guaranteed not to decrease the total reward. Since both the left hand and right hand sides of the desired result now use $a^*_1$ in position 1, its reward can be subtracted from both sides, yielding $\sum_{i^\prime=1}^{m-1} \max_{a_{i^\prime} \in A_1 \setminus \{ {a^\prime}^*_1, \ldots, {a^\prime}^*_{i^\prime-1} \} } r (a_{i^\prime}, i^\prime + 1) = \max_{a \in \tilde A_1^{m-1}}\; \sum_{i^\prime=1}^{m-1} r (a_{i^\prime}, i^\prime + 1),$ where $A_1 = A \setminus \{ a^*_1 \}$, $i^\prime = i - 1$, and ${a^\prime}^*_{i^\prime}$ maximizes $r (a, i^\prime + 1)$ over $A_1$. This is the inductive hypothesis for $m - 1$ positions over the action set $A_1$, completing the induction.
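For concreteness, here's a brute-force check of the swap supremacy precondition (a sketch of mine, not from the original argument); it confirms that the toy table from above violates the condition, even though greedy happens to work there.

```python
def swap_supremacy(r):
    """Check: r(a,i) >= r(a',i) implies, for all j > i,
    r(a,i) + r(a',j) >= r(a,j) + r(a',i), for every action pair."""
    n, m = len(r), len(r[0])
    for a in range(n):
        for ap in range(n):
            for i in range(m):
                if r[a][i] < r[ap][i]:
                    continue
                for j in range(i + 1, m):
                    if r[a][i] + r[ap][j] < r[a][j] + r[ap][i]:
                        return False
    return True

toy = [[10, 5, 1], [2, 4, 1], [1, 2, 3]]
assert not swap_supremacy(toy)  # actions 2 and 3 break it at positions 1 and 2
```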
The toy example above is sufficient to show that the condition in the above theorem is not necessary: actions 2 and 3 violate the condition for positions 1 and 2, yet greedy still works, because action 1 keeps them out of position 1.

It is an interesting assumption, however, because rearranged a bit, $r (a, i) - r (a^\prime, i) \geq 0 \implies \forall j > i: r (a, i) - r (a^\prime, i) \geq r (a, j) - r (a^\prime, j),$ it suggests that the policy regret at a later position can be upper bounded by the policy regret at an earlier position. This is almost good enough to let an offset tree style regret bound proof go through. It turns out what is needed is two-fold: the expected importance weight difference at a particular node has to be an upper bound on the policy regret, and the sign of the expected importance weight difference at a particular node has to be the same as the sign of the policy regret (so that the correct importance-weighted decision is the correct policy decision). So I could add one additional condition, $r (a, i) - r (a^\prime, i) \geq 0 \implies \forall j > i: r (a, j) - r (a^\prime, j) \geq 0,$ i.e., the relative order of the actions does not change as position increases.

With this assumption I could use historical instances from earlier positions to train offset trees for later positions, mitigating the requirements on historical data (without some kind of structural assumption, every action has to have some chance of being tried at every position for the regret bound to go through). What about training offset trees for earlier positions? Well, if the relative order of the actions does not change as position increases, I can always find the best action at the last position, and that will be the best action at any position. Thus all the historical data would get projected forward onto the last position and the resulting forfeit offset tree would be rebuilt with the constraint set increasing over time.
The historical policy need only have some chance of trying every action pair at an overlapping set of positions in order for the regret bound to go through.
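The order-preservation condition is easy to test mechanically. Here's a sketch of the check (my code; the names and example tables are mine), applied to a table that satisfies it and to the toy table, which does not.

```python
def order_preserved(r):
    """Check: r(a,i) >= r(a',i) implies r(a,j) >= r(a',j) for all j > i,
    i.e. the relative order of actions never changes with position."""
    n, m = len(r), len(r[0])
    for a in range(n):
        for ap in range(n):
            for i in range(m):
                if r[a][i] >= r[ap][i]:
                    if any(r[a][j] < r[ap][j] for j in range(i + 1, m)):
                        return False
    return True

# Rewards shrink with position but the action ranking is fixed, so the
# best action at the last position is the best action everywhere
monotone = [[5, 4, 3], [3, 2, 1], [2, 1, 0]]
assert order_preserved(monotone)

# The toy table reverses actions 1 and 3 between positions 1 and 3
assert not order_preserved([[10, 5, 1], [2, 4, 1], [1, 2, 3]])
```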

Now to just make this work.