
Tuesday, May 10, 2011

Cost Sensitive Multi Label: An Observation

I'm faced with a cost-sensitive multi-label classification (CSML) problem, i.e., there is a set of labels $K$ and I am to assign any or all of them to instances (in my case, of text documents). One approach is to treat it as a cost-sensitive multiclass classification (CSMC) problem on the power set of labels $\mathcal{P} (K)$. At that point, I could reduce to binary classification using a filter tree. This has some advantages, such as consistency (zero regret on the induced subproblems implies zero regret on the original problem). It also has substantial disadvantages, both theoretical (the constant in the regret bound scales as $O (2^{|K|})$) and practical (the most general implementation would have time complexity scaling as $O(2^{|K|})$).

Another popular approach is to learn $|K|$ independent binary classifiers, query them independently at test time, and output the union. This has nice practical properties (time complexity scaling as $O (|K|)$). However, decomposing a problem into a set of independent subproblems is generally a formula for creating an inconsistent reduction, so I was suspicious of this approach.

So here's a fun observation: learning a set of independent binary classifiers is equivalent to a filter tree approach on the CSMC problem with the following conditions.
  1. The filter tree is a scoring filter tree, i.e., the classifier at each node is \[
    \Phi (x) = 1_{f (x; \lambda) > f (x; \phi)}.
    \]
  2. The scoring function further decomposes into a weighted set of indicator functions, \[
    f (x; \lambda) = \sum_{k \in K} 1_{k \in \lambda} f_k (x)
    \]
  3. The loss function for the CSMC problem is Hamming loss.
  4. The tree is constructed such that at level $k$, the two inputs to each node differ only in the presence or absence of the $k^{\mathrm{th}}$ element of $K$. Thinking of the elements of $\mathcal{P} (K)$ as bit strings, the tournaments at level $k$ decide the $k^{\mathrm{th}}$ bit of the binary expansion.

In this case, here's what happens. At the $k^\mathrm{th}$ level, the tournaments are all between sets that differ only in their $k^\mathrm{th}$ significant bit. The classifier for every node at this level has the form \[
\Phi (x) = 1_{f_k (x) > 0}
\] and all the importance weights for all the subproblems at this level are identical (because of the use of Hamming loss). Thus the entire $k^\mathrm{th}$ level of the tree is equivalent to a binary classifier which is independently learning whether or not to include the $k^\mathrm{th}$ element of $K$ into the final result.
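
For concreteness, take $|K| = 2$, so that $\mathcal{P} (K) = \{ \emptyset, \{1\}, \{2\}, \{1,2\} \}$. The level 1 tournaments pair the sets differing only in element 1, i.e., $(\emptyset \textrm{ vs. } \{1\})$ and $(\{2\} \textrm{ vs. } \{1,2\})$. Under Hamming loss the cost difference in either tournament is \[
1_{1 \in Y} - 1_{1 \not \in Y},
\] whose magnitude is always 1 and whose sign depends only upon whether $1 \in Y$; so both tournaments carry the same importance weight and prefer the same outcome, which is exactly the independent binary subproblem for the first label.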

So the good news is: this means learning independent binary classifiers is consistent if Hamming loss is the loss function on the CSML problem. Of course, the constant in the regret bound is huge, and the structural assumptions on the classifiers are significant, so in practice performance might be quite poor.

What about other loss functions? If the structural assumptions on the classifiers are retained, then each level of the filter tree can still be summarized with a single binary classifier. However, the importance weights of the examples fed to this classifier depend upon the outputs of the classifiers at the previous levels.

As an example, consider 0-1 loss on the entire set of labels. At the lowest level of the tree, the only tournament with a non-zero importance weight (cost difference) is the one which considers whether or not to include the first label conditional on all other labels being correct. At the second level of the tree, the only tournament that could possibly have a non-zero importance weight is the one that considers whether or not to include the second label conditional on all other labels being correct. However, if the first classifier made an error, this condition will not filter up the tree, and all importance weights will be zero. In general, as soon as one of the classifiers makes a mistake, training stops. So the training procedure can be roughly outlined as:
  1. Given a training datum $(x, Y) \in X \times \mathcal{P} (K)$,
  2. For each $k = 1 \ldots |K|$
    1. Let $\hat y_k$ be the prediction of classifier $k$ of whether label $k$ is in the target set.
    2. Add $(x, 1_{k \in Y})$ to training set for classifier $k$. Note all non-zero importance weights are 1 so this is just binary classification.
    3. If $\hat y_k \neq 1_{k \in Y}$, stop iterating over $k$ for this training datum.
If the classifiers make mistakes frequently, this will end up decimating the data to the point of uselessness. Leaving that problem aside, this procedure is intuitively pleasing because it does not waste classifier resources later in the chain on decisions that don't matter according to 0-1 loss on the entire set.
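
Here is a minimal perl sketch of that outline, assuming a hypothetical per-label online learner exposing predict and learn methods (e.g., one vowpal instance per label); it is just the outline above, not a tuned implementation.

sub train_chained ($$$)
{
  my ($classifiers, $x, $labels) = @_;    # $labels: hash ref whose keys are the elements of Y

  foreach my $k (0 .. $#$classifiers)
    {
      my $y = exists $labels->{$k} ? 1 : 0;                     # 1_{k \in Y}
      my $yhat = $classifiers->[$k]->predict ($x) > 0 ? 1 : 0;  # \hat y_k

      # unit importance weight, so this is just binary classification
      $classifiers->[$k]->learn ($x, $y);

      # once a mistake is made, later decisions don't matter under
      # 0-1 loss on the entire set, so stop iterating over $k
      last if $yhat != $y;
    }
}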

Wednesday, March 30, 2011

Filter Tree Reduction: Perl Implementation

Lately I've been solving multiclass classification problems with vowpal using a machine learning reduction. Ideally I would have programmed this in C using a reductions API provided by vowpal. In practice, vowpal has been in flux; therefore to isolate myself I've been treating vowpal as a black box with which I communicate via IPC. There is a penalty for this approach: I estimate my total throughput would be at least 4 times larger if I implemented the reduction within vowpal (based upon the output of top). Hopefully John and crew will provide a stable vowpal reduction API in the near future.

In the meantime, although it is a bit pokey, the reduction I'm presenting here is still practical. In addition, sometimes just seeing an implementation of something can really crystallize the concepts, so I thought I'd present the reduction here.

The Strategy

The starting point is the Filter Tree reduction of cost-sensitive multiclass classification to importance weighted binary classification. In this reduction, class labels are arranged into a March-madness style tournament, with winners playing winners until one class label emerges victorious: that is the resulting prediction. When two class labels ``play each other'', what really happens is an importance weighted classifier decides who wins based upon the associated instance features $x$.

In practice I'm using a particular kind of filter tree which I call a scoring filter tree. Here the importance weighted classifier is constrained to be of the form \[
\Psi_{\nu} (x) = 1_{f (x; \lambda) > f (x; \phi)}.
\] Here $\lambda$ and $\phi$ are the two class labels who are ``playing each other'' to see who advances in the tournament. What this equation says is:
  1. There is a function $f$ which says how good each class label is given the instance features $x$.
  2. The better class label always beats the other class label.
This implies that the winner of the tournament is the best team according to $f$. This makes $f$ look like a scoring function (like what would be obtained from argmax regression) and essentially one can ignore the tournament structure at test time. The use of the tournament at train time is critical however to obtaining good performance on noisy problems (i.e., low regret).

The Implementation

I'll assume that we're trying to classify between $|K|$ labels denoted by integers $\{ 1, \ldots, |K|\}$. I'll also assume an input format which is very close to vowpal's native input format, but with a cost vector instead of a label. \[
c_1,\ldots,c_{|K|}\; \textrm{importance}\; \textrm{tag}|\textrm{namespace}\; \textrm{feature} \ldots
\] So for instance a 3 class problem input line might look like \[
0.7,0.2,1.3\; 0.6\; \textrm{idiocracy}|\textrm{items}\; \textrm{hotlatte}\; |\textrm{desires}\; \textrm{i}\; \textrm{like}\; \textrm{money}
\] The best choice (lowest cost) class here is 2.

Test Time
Applying the model is easier to understand than training it, so I'll start there. Within the perl I transform this into a set of vowpal input lines where each line corresponds to a particular class label $n$, \[
\; \textrm{tag}|\textrm{namespace}n\; \textrm{feature} \ldots
\] Essentially the cost vector and importance weight are stripped out (since there is no learning happening right now), the tag is stripped out (I handle that separately), and each namespace has the class label appended to it. Since vowpal uses the first letter to identify namespaces, options that operate on namespaces (e.g., -q, --ignore) will continue to work as expected. So for instance continuing with the above example we would generate three lines \[
|\textrm{items}1\; \textrm{hotlatte}\; |\textrm{desires}1\; \textrm{i}\; \textrm{like}\; \textrm{money}\; |\_1\; k
\] \[
|\textrm{items}2\; \textrm{hotlatte}\; |\textrm{desires}2\; \textrm{i}\; \textrm{like}\; \textrm{money}\; |\_2\; k
\] \[
|\textrm{items}3\; \textrm{hotlatte}\; |\textrm{desires}3\; \textrm{i}\; \textrm{like}\; \textrm{money}\; |\_3\; k
\] Each of these lines is fed to vowpal, and the class label that has the lowest vowpal output is selected as the winner of the tournament. That last feature $k$ in the namespace _ is providing a class label localized version of the constant feature that vowpal silently provides on every example.

Train Time
At train time I essentially run the tournament: but since I know the actual costs, I update the classifier based upon who ``should have won''. The importance weight of the update is determined by the absolute difference in costs between the two teams that just played. So in the case of two class labels $i$ and $j$ there will be a training input fed to vowpal of the form, \[
1\; \omega\; \textrm{tag}|\textrm{namespace$i$:1}\; \textrm{feature} \ldots |\textrm{namespace$j$:-1}\; \textrm{feature} \ldots |\textrm{\_$i$:1} \; k\; |\textrm{\_$j$:-1}\; k
\] where $\omega = \textrm{importance} * \mbox{abs} (c_i - c_j)$, i.e., the original importance weight scaled by the absolute difference in the actual costs. Here I'm leveraging the ability to provide a weight on a namespace which multiplies the weights on all the features in the namespace. (What about that pesky constant feature that vowpal always provides? It's still there and really it shouldn't be. The right way to deal with this would be to patch vowpal to accept an option not to provide the constant feature. However I want to present something that works with an unpatched vowpal, so instead I feed another training input with everything negated in order to force the constant feature to stay near zero.)
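
To make this concrete with the earlier three class example: if class labels 1 and 2 were to play each other (which pairs actually meet depends upon how the tournament is laid out), the update would carry importance weight \[
\omega = 0.6 * \mbox{abs} (0.7 - 0.2) = 0.3,
\] and class label 2, having the lower cost, is the one that ``should have won''.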

When a team wins a game they should not have won, they still advance in the tournament. Intuitively, the classifier needs to recover gracefully from mistakes made previously in the tournament, so this is the right thing to do.

What's Missing
Here are some things I'd like to improve:
  1. Implement inside vowpal instead of outside via IPC.
  2. In the implementation I manually design the tournament based upon a particular number of classes. It would be better to automatically construct the tournament.
  3. It would be nice to have a concise way to specify sparse cost-vectors. For example when all errors are equally bad all that is needed is the identity of the correct label.
  4. The above strategy doesn't work with hinge loss, and I don't know why (it appears to work with squared and logistic loss). Probably I've made a mistake somewhere. Caveat emptor!

The Code

There are two pieces:
  • vowpal.pm: this encapsulates the communication with vowpal. You'll need this to get it to work, but mostly this is boring unix IPC stuff.
    • It's not very good at detecting that the underlying vw did not start successfully (e.g., due to attempting to load a model that does not exist). However you will notice this since it just hangs.
  • filter-tree: perl script where the reduction implementation actually lives. You invoke this to get going. Mostly it takes the same arguments as vw itself and just passes them through, with some exceptions:
    1. You have to read data from standard input. I could intercept --data arguments and emulate them, but I don't.
    2. You can't use the --passes argument because of the previous statement.
    3. I do intercept the -p argument (for outputting predictions) and emulate this at the reduction level.

The output you see from filter-tree looks like the output from vw, but it is not. It's actually from the perl script, and is designed to look like vw output suitably modified for the multiclass case.

Here's an example invocation:
% zcat traindata.gz | head -1000 | ./filter-tree --adaptive -l 1 -b 22 --loss_function logistic -f model.users.b22  
average    since       example  example    current  current  current
loss       last        counter   weight      label  predict features
1.000000   1.000000          1      1.0     1.0000   0.0000       16
0.500000   0.000000          2      2.0     1.0000   1.0000       15
0.500000   0.500000          4      4.0     2.0000   1.0000       20
0.375000   0.250000          8      8.0     2.0000   2.0000       19
0.562500   0.750000         16     16.0     5.0000   2.0000       23
0.437500   0.312500         32     32.0     0.0000   1.0000       14
0.281250   0.125000         64     64.0     1.0000   1.0000       16
0.312500   0.343750        128    128.0     0.0000   1.0000       16
0.347656   0.382812        256    256.0     1.0000   1.0000       13
0.322266   0.296875        512    512.0     1.0000   1.0000       20

finished run
number of examples = 1000
weighted examples sum = 1000
average cost-sensitive loss = 0.287
average classification loss = 0.287
best constant for cost-sensitive = 1
best constant cost-sensitive loss = 0.542
best constant for classification = 1
best constant classification loss = 0.542
minimum possible loss = 0.000
confusion matrix
15      1       0       1       0       1       0
77      416     53      23      5       0       1
14      41      281     56      8       3       2
0       0       0       1       0       1       0
0       0       0       0       0       0       0
0       0       0       0       0       0       0
0       0       0       0       0       0       0

The -p argument outputs a tab separated set of columns. The first column is the predicted class label, the next $|K|$ columns are the scoring function values per class label, and the last column is the instance tag.

As is typical, the source code is (unfortunately) the best documentation.

filter-tree

#! /usr/bin/env perl

use warnings;
use strict;

use vowpal;

$SIG{INT} = sub { die "caught SIGINT"; };

# if this looks stupid it is: these used to be actual class names,
# but i didn't want to release code with the actual class labels that i'm using
use constant {
  ZERO => 0,
  ONE => 1,
  TWO => 2,
  THREE => 3,
  FOUR => 4,
  FIVE => 5,
  SIX => 6, 
};

sub argmin (@)
{
  my (@list) = @_;
  my $argmin = 0;

  foreach my $x (1 .. $#list)
    {
      if ($list[$x] < $list[$argmin])
        {
          $argmin = $x;
        }
    }

  return $argmin;
}

# append a per-class-label suffix to every namespace in the line, so that
# vowpal learns separate weights for each class label
sub tweak_line ($$)
{
  my ($suffix, $rest) = @_;

  $rest =~ s/\|(\S*)/\|${1}${suffix}/g;

  return $rest;
}

# play one ``game'' of the tournament between class labels $la and $lb: when the
# costs differ, feed vowpal a training example (and its negation, to cancel the
# constant feature) with importance weight scaled by the absolute cost difference.
# returns the score difference so the caller knows who advances.
sub train_node ($$$$$$$$$)
{
  my ($m, $la, $lb, $pa, $pb, $ca, $cb, $i, $rest) = @_;

  my $argmin = ($ca < $cb) ? -1 : 1;
  my $absdiff = abs ($ca - $cb);

  if ($absdiff > 0)
    {
      chomp $rest;
      my $w = $i * $absdiff;

      my $plusone = 1;
      my $minusone = -1;
      my $chirp = (rand () < 0.5) ? 1 : -1;

      $argmin *= $chirp;
      $plusone *= $chirp;
      $minusone *= $chirp;

      $m->send ("$argmin $w",
                tweak_line ("${la}:$plusone", " |$rest |_ k"),
                tweak_line ("${lb}:$minusone", " |$rest |_ k\n"))->recv ()
      or die "vowpal failed to respond";

      $argmin *= -1;
      $plusone *= -1;
      $minusone *= -1;

      $m->send ("$argmin $w",
                tweak_line ("${la}:$plusone", " |$rest |_ k"),
                tweak_line ("${lb}:$minusone", " |$rest |_ k\n"))->recv ()
      or die "vowpal failed to respond";
   }

  return $pa - $pb;
}

sub print_update ($$$$$$$$)
{
  my ($total_loss, $since_last, $delta_weight, $example_counter, 
      $example_weight, $current_label, $current_predict, 
      $current_features) = @_;

  printf STDERR "%-10.6f %-10.6f %8lld %8.1f   %s %8.4f %8lu\n",
         $example_weight > 0 ? $total_loss / $example_weight : -1,
         $delta_weight > 0 ? $since_last / $delta_weight : -1,
         $example_counter,
         $example_weight,
         defined ($current_label) ? sprintf ("%8.4f", $current_label) 
                                  : " unknown",
         $current_predict,
         $current_features;
}

#---------------------------------------------------------------------
#                                main                                 
#---------------------------------------------------------------------

srand 69;

my @my_argv;
my $pred_fh;

while (@ARGV)
  {
    my $arg = shift @ARGV;
    last if $arg eq '--';

    if ($arg eq "-p")
      {
        my $pred_file = shift @ARGV or die "-p argument missing";
        $pred_fh = new IO::File $pred_file, "w" or die "$pred_file: $!";
      }
    else
      {
        push @my_argv, $arg;
      }
  }

my $model = new vowpal join " ", @my_argv;

print STDERR <<EOD;
average    since       example  example    current  current  current
loss       last        counter   weight      label  predict features
EOD

my $total_loss = 0;
my $since_last = 0;
my $example_counter = 0;
my $example_weight = 0;
my $delta_weight = 0;
my $dump_interval = 1;
my @best_constant_loss = map { 0 } (ZERO .. SIX);
my @best_constant_classification_loss = map { 0 } (ZERO .. SIX);
my $minimum_possible_loss = 0;
my $classification_loss = 0;
my $mismatch = 0;
my %confusion;

while (defined ($_ = <>))
  {
    my ($preline, $rest) = split /\|/, $_, 2;

    die "bad preline $preline" 
      unless $preline =~ /^([\d\.]+)?\s+([\d\.]+\s+)?(\S+)?$/;

    my $label = $1;
    my $importance = $2 ? $2 : 1;
    my $tag = $3;

    my ($actual_tag, @costs) = split /,/, $tag;

    die "bad tag $tag" unless @costs == 0 || @costs == 8;

    my @scores = 
      map { my $s = $model->send (tweak_line ($_, " |$rest |_ k"))->recv ();
            chomp $s;
            $s
          } (ZERO .. SIX);
    my $current_prediction = argmin @scores;

    if (@costs == 8)
      {
        # it turned out better for my problem to combine classes 6 and 7.
        # costs are already inverted and subtracted from 1, so, 
        # have to subtract 1 when doing this
        
        my $class_seven = pop @costs;
        $costs[SIX] += $class_seven - 1;

        # zero level

        my $zero_one = train_node ($model,
                                   ZERO,
                                   ONE,
                                   $scores[ZERO],
                                   $scores[ONE],
                                   $costs[ZERO],
                                   $costs[ONE],
                                   $importance,
                                   $rest) <= 0
                       ? ZERO : ONE;

        my $two_three = train_node ($model,
                                    TWO,
                                    THREE,
                                    $scores[TWO],
                                    $scores[THREE],
                                    $costs[TWO],
                                    $costs[THREE],
                                    $importance,
                                    $rest) <= 0
                        ? TWO : THREE;

        my $four_five = train_node ($model,
                                    FOUR,
                                    FIVE,
                                    $scores[FOUR],
                                    $scores[FIVE],
                                    $costs[FOUR],
                                    $costs[FIVE],
                                    $importance,
                                    $rest) <= 0
                        ? FOUR : FIVE;

        # SIX gets a pass

        # first level: (zero_one vs. two_three), (four_five vs. SIX)

        my $fleft = train_node ($model,
                                $zero_one,
                                $two_three,
                                $scores[$zero_one],
                                $scores[$two_three],
                                $costs[$zero_one],
                                $costs[$two_three],
                                $importance,
                                $rest) <= 0
                    ? $zero_one : $two_three;

        my $fright = train_node ($model,
                                 $four_five,
                                 SIX,
                                 $scores[$four_five],
                                 $scores[SIX],
                                 $costs[$four_five],
                                 $costs[SIX],
                                 $importance,
                                 $rest) <= 0
                     ? $four_five : SIX;

        # second level: fleft vs. fright

        my $root = train_node ($model,
                               $fleft,
                               $fright,
                               $scores[$fleft],
                               $scores[$fright],
                               $costs[$fleft],
                               $costs[$fright],
                               $importance,
                               $rest) <= 0
                   ? $fleft : $fright;

        $total_loss += $importance * $costs[$root];
        $since_last += $importance * $costs[$root];
        $example_weight += $importance;
        $delta_weight += $importance;

        my $best_prediction = argmin @costs;

        foreach my $c (ZERO .. SIX)
          {
            $best_constant_loss[$c] += $importance * $costs[$c];
            if ($c != $best_prediction)
              {
                $best_constant_classification_loss[$c] += $importance;
              }
          }

        $minimum_possible_loss += $importance * $costs[$best_prediction];
        $classification_loss += ($current_prediction == $best_prediction) ? 0 : 1;
        ++$confusion{"$current_prediction:$best_prediction"};

        ++$mismatch if $root ne $current_prediction;
      }

    print $pred_fh (join "\t", $current_prediction, @scores, $actual_tag), "\n"
      if defined $pred_fh;

    ++$example_counter;
    if ($example_counter >= $dump_interval)
      {
        my @features = split /\s+/, $rest;         # TODO: not really

        print_update ($total_loss, 
                      $since_last,
                      $delta_weight,
                      $example_counter,
                      $example_weight,
                      (@costs) ? (argmin @costs) : undef,
                      $current_prediction,
                      scalar @features);

        $dump_interval *= 2;
        $since_last = 0;
        $delta_weight = 0;
      }
  }

my $average_loss = sprintf "%.3f", $example_weight > 0 ? $total_loss / $example_weight : -1;

my $best_constant = argmin @best_constant_loss;
my $best_constant_average_loss = sprintf "%.3f", $example_weight > 0 ? $best_constant_loss[$best_constant] / $example_weight : -1;

my $best_constant_classification = argmin @best_constant_classification_loss;
my $best_constant_classification_average_loss = sprintf "%.3f", $example_weight > 0 ? $best_constant_classification_loss[$best_constant_classification] / $example_weight : -1;

my $minimum_possible_average_loss = sprintf "%.3f", $example_weight > 0 ? $minimum_possible_loss / $example_weight : -1;

my $classification_average_loss = sprintf "%.3f", $example_weight > 0 ? $classification_loss / $example_weight : -1;

print <<EOD;

finished run
number of examples = $example_counter
weighted examples sum = $example_weight
average cost-sensitive loss = $average_loss
average classification loss = $classification_average_loss
best constant for cost-sensitive = $best_constant
best constant cost-sensitive loss = $best_constant_average_loss
best constant for classification = $best_constant_classification
best constant classification loss = $best_constant_classification_average_loss
minimum possible loss = $minimum_possible_average_loss
confusion matrix
EOD
#train/test mismatch = $mismatch

foreach my $pred (ZERO .. SIX)
  {
    print join "\t", map { $confusion{"$pred:$_"} || 0 } (ZERO .. SIX);
    print "\n";
  }

vowpal.pm

# vowpal.pm

package vowpal;

use warnings;
use strict;

use POSIX qw (tmpnam mkfifo);
use IO::File;
use IO::Pipe;
use IO::Poll;

# fork a vw child process: examples are written to vw's stdin via a pipe,
# and predictions are read back through a named fifo passed to vw as -p
sub new ($$)
{
  my $class = shift;
  my $args = shift;

  my $pred_pipename = tmpnam () or die $!;
  my $pred_pipe = mkfifo ($pred_pipename, 0700) or die $!;
  my $pred_fd = POSIX::open ($pred_pipename, 
                             &POSIX::O_RDONLY | 
                             &POSIX::O_NONBLOCK | 
                             &POSIX::O_NOCTTY) or die $!;
  my $pred_fh = new IO::Handle;
  $pred_fh->fdopen ($pred_fd, "r") or die $!;
  POSIX::fcntl ($pred_fh, 
                &POSIX::F_SETFL, 
                POSIX::fcntl ($pred_fh, &POSIX::F_GETFL, 0) & ~&POSIX::O_NONBLOCK);

  my $data_fh = new IO::Pipe or die $!;
  open my $oldout, ">&STDOUT" or die "Can't dup STDOUT: $!";
  eval
    {
      open STDOUT, ">", "/dev/null" or die "Can't redirect STDOUT: $!";
      eval
        {
          open my $olderr, ">&STDERR" or die "Can't dup STDERR: $!";
          eval
            {
              open STDERR, ">", "/dev/null" or die "Can't redirect STDERR: $!";
              $data_fh->writer ("vw $args -p $pred_pipename --quiet") or die $!;
              $data_fh->autoflush (1);
            };
          open STDERR, ">&", $olderr or die "Can't restore STDERR: $!";
          die $@ if $@;
        };
      open STDOUT, ">&", $oldout or die "Can't restore STDOUT: $!";
      die $@ if $@;
    };
  die $@ if $@;

  my $poll = new IO::Poll;
  $poll->mask ($data_fh => POLLOUT);
  $poll->poll ();
  $poll->remove ($data_fh);
  $poll->mask ($pred_fh => POLLIN);

  my $self = { data_fh => $data_fh,
               pred_fh => $pred_fh,
               pred_file => $pred_pipename,
               poll => $poll,
               args => $args };

  bless $self, $class;
  return $self;
}

# write one or more input lines to the vw child
sub send ($@)
{
  my $self = shift;

  $self->{'data_fh'}->print (@_);

  return $self;
}

# block until the vw child emits the next prediction line
sub recv ($)
{
  my $self = shift;

  $self->{'poll'}->poll ();
  return $self->{'pred_fh'}->getline ();
}

sub DESTROY
{
  my $self = shift;

  $self->{'data_fh'}->close ();
  $self->{'pred_fh'}->close ();
  unlink $self->{'pred_file'};
}

1;

Monday, November 22, 2010

Minimax Constrained CSMC: Minor Progress

In a previous post I talked about ad serving, and why regression based approaches still dominate even though other approaches to cost-sensitive multiclass classification (CSMC) have lower regret bounds. In my view, it comes down to practical issues, and an important practical issue in ad serving is that the set of actions (ads) that are allowed for a given decision instance (ad serving request) can be volatile. Furthermore in many cases there is no reason to believe the pattern of constraints is statistically stable between training sets and test sets, e.g., due to advertisers experimenting with budgets. Therefore I feel the constraints are best modeled adversarially, a situation I call minimax constrained CSMC.

I'll repeat the setting for minimax constrained CSMC. There is a distribution $D = D_x \times D_{\tilde c|x}$, where $\tilde c: K \to \mathbb{R}$ takes values in the regular reals $\mathbb{R}$. Then, an adversary comes in and manufactures a cost vector $c$ in the extended reals $\mathbf{R}$ by setting some of the components to $\infty$; these choices are revealed via $\omega$ prior to a decision being elicited. In this case the regret of a particular classifier $h: X \times \mathcal{P} (K) \to K$ is given by \[ \nu (h) = E_{x \sim D_x} \left[ \max_{\omega \in \mathcal{P} (K)} \left\{ E_{\tilde c \sim D_{\tilde c|x}} \left[ c (h (x, \omega)) \right] - \min_{k \in K}\; E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right] \right\} \right]. \] This contrasts with average constrained CSMC, where the distribution of constraints ($\omega$) is stable from training to test data. For average constrained CSMC, tree based reductions work when modified to have disallowed options forfeit their tournaments. This doesn't work for minimax constrained CSMC, however, as the following simple counterexample shows. Suppose $X = \emptyset$, $K = \{1, 2, 3\}$, and $\tilde c$ is deterministic and such that $\tilde c (1) < \tilde c (3) < \tilde c (2)$, and suppose the tree first pairs $\{1, 2\}$ while giving 3 a pass, and then pairs $\{1, 3\}$. Suppose the classifier used at each tree node is $1_{f (a) > f (b)}$ for some function $f: K \to \mathbb{R}$. If the training is done only with data where $\omega = \emptyset$, the regret on the training data can be brought to zero if $f (1) = 1$, $f (3) = 3$, and $f (2) = 2$. However when $\omega = \{1\}$ at test time there will be regret.

What's going on here? The situation is similar to a ranking reduction to classification, where $f$ induces a linear ordering over the elements. In that case the classification error averaged over input pairs provides a bound on the AUC error averaged over input sets. Of course, AUC is too coarse an objective function since it is only sensitive to ordering errors and not magnitudes. However this does suggest that more pairs of elements need to be compared during training other than the $(|K| - 1)$ comparisons done during one pass of the filter tree. If every pair must be compared during training, then perhaps $|K|/2$ passes over the filter tree are required.

Therefore consider a sequence of average constrained CSMC classifiers $\Psi_n$ indexed by $n \in [1, |K|]$. These induce a sequence of sets $\{ \tilde \omega_n | n \in [0, |K|] \}$ defined via \[
\begin{aligned}
\tilde \omega_0 &= \emptyset, \\
\tilde \omega_n &= \tilde \omega_{n-1} \cup \{ \Psi_n (x, \tilde \omega_{n-1}) \}.
\end{aligned}
\] Essentially this is a sequence of average constrained CSMC classifiers that are determining the best action, the next best action, and so on (in the same fashion as reduction from cost-sensitive best m to cost-sensitive multiclass). Next consider the index \[
\eta (x, \omega) = \min \{ n \in [1, |K|] | \Psi_n (x, \tilde \omega_{n-1}) \not \in \omega \}. \] If $\omega \neq K$, this index always exists. I'll construct a classifier when $\omega \neq K$ via \[ h (x, \omega) = \Psi_{\eta (x, \omega)} (x, \tilde \omega_{\eta (x, \omega) -1}).
\] (When $\omega = K$, the regret is always zero whatever choice the classifier makes, so I'll just ignore that case going forward). The regret for a particular $(x, \omega)$ is given by \[
\begin{aligned}
\nu (x, \omega) &= E_{\tilde c \sim D_{\tilde c|x}} \left[ c (h (x, \omega)) \right] - \min_{k \in K}\; E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right] \\
&= E_{\tilde c \sim D_{\tilde c|x}} \left[ c (h (x, \omega)) \right] - \min_{k \in K \setminus \tilde \omega_{\eta (x, \omega) - 1}} E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right] \\
&= E_{\tilde c \sim D_{\tilde c|x}} \left[ c \left(\Psi_{\eta (x, \omega)} (x, \tilde \omega_{\eta (x, \omega) -1})\right) \right] - \min_{k \in K \setminus \tilde \omega_{\eta (x, \omega) - 1}} E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right], \\
\end{aligned}
\] where the second line follows from $\tilde \omega_{\eta (x, \omega) - 1} \subseteq \omega$, and the third line from the definition of $h$. Now the last line is the per-instance regret of the $\eta (x, \omega)^{\textrm{th}}$ average constrained CSMC classifier trained on the distribution induced by the first $(\eta (x, \omega) - 1)$ classifiers. Thus \[
\max_{\omega \in \mathcal{P} (K)} \nu (x, \omega) = \max_n \left\{ E_{\tilde c \sim D_{\tilde c|x}} \left[ c (\Psi_n (x, \tilde \omega_{n-1})) \right] - \min_{k \in K \setminus \tilde \omega_{n - 1}} E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right] \right\}.
\] This suggests a procedure where $|K|$ forfeit filter tree passes are done per instance; while this seems like a factor of 2 too many, note forfeitures do not generate classification instances, which eliminates half of the comparisons. The minimax constrained CSMC regret would be \[
\nu (h) \leq (|K| - 1) E_{x \sim D_x} \left[ \max_n\; q_n (x, \tilde \omega_{n-1}) \right]
\] where $q_n (x, \tilde \omega_{n-1})$ is the average per-node importance-weighted regret of the $n^{\textrm{th}}$ forfeit filter tree trained on the distribution induced by the first $(n-1)$ forfeit filter trees.

At first blush this seems too unwieldy to use in practice, but two modifications might make it practical. The first is to reuse the same tree for every $\Psi_n$ instead of keeping $n$ separate trees; the regret bound still holds, although the proper training procedure is not immediately obvious to me. The second is to consider the case where the number of constraints is bounded, i.e., $|\omega| \leq z \ll |K|$, such that training and testing costs are only increased by $z$. This seems reasonable in practice.
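
For concreteness, here is a minimal perl sketch of the resulting test-time procedure $h$; the array of trained average constrained CSMC classifiers and their predict interface are hypothetical stand-ins (e.g., each element could be a forfeit filter tree).

sub minimax_predict ($$$)
{
  my ($psi, $x, $omega) = @_;            # $omega: hash ref of disallowed actions

  my %tilde_omega;                       # \tilde \omega_{n-1}, grown greedily

  foreach my $n (0 .. $#$psi)
    {
      my $k = $psi->[$n]->predict ($x, \%tilde_omega);

      return $k unless exists $omega->{$k};     # first prediction not in \omega
      $tilde_omega{$k} = 1;                     # otherwise add it to \tilde \omega_n
    }

  return undef;                          # \omega = K: any choice has zero regret
}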

Tuesday, October 26, 2010

Why do Ad Servers use Regression?

The post title is a bit presumptuous, because 1) I don't know that all ad servers use regression, and 2) even if they did it's difficult to speculate why. So this is really, ``Why have I used regression for ad serving in the past?'' But that's less catchy.

Why even ask the question? Because ad serving looks like cost-sensitive multiclass classification, and reducing cost-sensitive multiclass classification to regression leads to regret bounds that are worse than reduction to binary classification.

So here's a laundry list of issues I've encountered in the past, how a regression reduction deals with them, and how reduction to binary classification might deal with them.

The Set of Actions is Changing

First, let me say that I've used regression even in cases where the set of actions wasn't really changing that quickly. For instance, I was involved with a domain monetization product where the actions were a list of bidded keyword phrases (monetization was via driving to a search engine results page). Such a list changes infrequently (e.g., monthly) and modestly (not too many ``Lady Gaga''s are made per unit time). So really, I had no excuse there.

In the case where the set of actions really does change significantly over time (e.g., contextual targeting of sponsored search advertisements, where new ads appear frequently), it is tempting to think that a regressor trained on previous data would generalize reasonably to a novel instance, after all the new instance will share lots of features with existing instances (e.g., words and word phrases) and will be shown in similar contexts (e.g., web pages). This is tempting, but dangerous. I learned the hard way that one has to be very careful about graduating an action from exploration to exploitation traffic. (``Learning the hard way'' is a nice euphemism for ``I built something that didn't work''). Nonetheless, even acknowledging the care required to move from exploration policy to exploitation policy, it is fair to say that regression makes it easy to ``mix a new action in''.

Given that transition from exploration to exploitation is a controlled process, how might it work in a reduction to binary classification? Some of these reductions are structured as tournaments organized as a binary tree. Consider adding a single action. In that case, one can create a new root node whose children are the old root node and the new action. This new root node essentially has to learn, ``Under what situations should I take the new action, versus doing whatever I would have done before when the new action was not available?'' Building out the tree in this fashion would result in a very unbalanced tree. Adding many actions in one sweep would mitigate the problem a bit, since an entire tree can be stitched under the new root node, but unless the number of actions is doubling this will still lead to lack of balance. However, it could be a good choice as an incremental operation, with $|A_{new}| + 1$ novel binary classification subproblems to train comprising $\log_2 (|A_{new}|)$ sequential steps.

Another strategy is to add new actions (or delete old actions) at the leaf level. Converting an existing leaf to an internal node with children being the new action and the action at the former leaf would require $1 + \log_2 (|A_{old}|)$ novel binary classification subproblems to train, since the entire path to the root must be relearned. Conservatively, if this is done for a set of new actions the total number of retrains is scaled by $|A_{new}|$, but in fact many paths to the root will be shared if the replacements are located near each other in the tree. I suspect the actual cost is something like $|A_{new}| + \log_2 (|A_{old}|/|A_{new}|)$, i.e., a complete tree of $|A_{new}|$ classifiers plus one shared path of length $\log_2 (|A_{old}|/|A_{new}|)$ to the root. I also suspect these retrains can be done in $\log_2 (|A_{old}|)$ sequential steps.

In some cases it is not unreasonable to simply consider retraining the entire tree; each level can be trained in parallel so the number of sequential steps is $\log_2 (|A|)$, with a total number of retrains $|A|$. Given nonstationarity, feature innovation, etc. a complete retrain has to occur periodically anyway.

Intra-Query Constraints

This is similar to the set of actions changing, but while the above section was about how the universe of possible actions can change, this section is about how on an individual instance certain actions might not be allowed.

There are two different situations that I've identified. The first, which I call ``average constrained CSMC'', involves constraints that change very slowly if at all, such that they can be modeled as part of the problem instance with training and testing data drawn IID. These are things like ``this advertisement is not allowed on web pages with pornographic content,'' which almost never changes over the lifetime of an advertisement (perhaps at the very beginning due to an error in specification of a campaign).

The second, which I call ``minimax constrained CSMC'', involves constraints that change rapidly, such that the distribution of the constraints on the training set bears no relationship to the distribution of constraints on the test set. These are things like ``this advertiser has exhausted their budget,'' which given how advertisers experiment with budgets can be quite volatile. Constraints here are modeled as imposed adversarially, and a solution is required to get good regret over all possible settings of constraints.

An interesting result is that the argmax regression reduction has the same regret bound for unconstrained, average constrained, and minimax constrained CSMC. This is achieved by simply taking the argmax of the regression score over the set of actions that are allowed on this instance.

In the average constrained case, tree based reductions can be modified such that disallowed actions forfeit their tournaments, and an analogous regret bound to the unconstrained case can be derived. I don't have any results for the minimax constrained case for tree based reductions yet, although I have a small problem example which indicates that forfeiting alone does not achieve good results.

I strongly suspect that minimax constrained CSMC has to be well understood for regression to be dethroned from advertising.

Inter-Query Constraints

This refers to properties that need to be enforced across a set of queries. Budget constraints are the canonical example, where greedy delivery is known to have a worst-case competitive ratio of $\frac{1}{2}$. Again with no excuse (other than lack of knowledge), I've used regression even in the case where there were no inter-query constraints: a system for contextually targeting eBay affiliate ads. Affiliate programs only pay you when they get paid so essentially they have infinite budget.

However often such constraints must be addressed. OR has been dealing with such constraints for decades, and OR pervasively reduces to regression. If budgets are specified in dollars, and regression estimates purport to be of expected revenue, then some ad serving problems with budget constraints can be attacked using network flow algorithms. Such algorithms are fast enough to re-run periodically as actuals flow in to overcome the inevitably large errors in traffic volume estimates. (The size of an ad network that can leverage this approach goes up as CPU and memory get cheaper).

It seems plausible to dethrone regression here, by reducing ad serving to cost-sensitive multiclass classification leveraging approaches like Policy Learning by Dynamic Programming. It might make a nice PhD thesis for somebody (it is somewhat practical, so perhaps lacks panache). In the meantime I will plod along: I've improved my intuition around stochastic shortest path and eventually hope to play around with reducing flow to CSMC.

I also wonder if approximate online methods for optimizing with budget constraints, which involve argmax on adjusted regression estimates, might also be applicable to other CSMC reductions. For example with Mehta et al.'s $\psi (x) = 1 - e^{x-1}$ remaining budget discounting function, a tree based reduction could be trained using the remaining budget discounted observed reward rather than the actual observed reward. Whether this makes sense requires further thought: my understanding of the analysis of such algorithms is that they assume the regression is perfect, and the performance bound is due to the online nature of the query sequence. It would be interesting to augment the analysis with additional terms for regret in the regression, such that a tree based approach could be said to do better.

Selecting a Set

CSMC reductions choose a single action from a set of actions, but often in ad serving multiple ads are selected at once. Not always, however: display advertising is often a single ad display, and mobile screen real estate can be scarce. For sponsored search (or contextual ad serving of sponsored search advertisements) populating multiple positions is the norm.

If the reward associated with a set is the sum of the individual action rewards, then regression handles set selection quite naturally: merely select the top $m$ actions by estimated value, rather than only the first. The regret bound is almost identical to the single action case, with an extra factor of $\sqrt{\min \{m,|A|-m\}}$. The (undesirable) square root dependence on the regressor regret is preserved. Fortunately, this problem can also be reduced to average constrained CSMC. The basic strategy is ``choose the best action, then the next best action, etc.'' The regret has an extra factor of $m$ (worse) but preserves the linear dependence on CSMC regret (better).
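
Here is a minimal perl sketch of that greedy ``best, then next best'' reduction; the predict interface on the average constrained CSMC classifier is a hypothetical stand-in.

sub choose_best_m ($$$$)
{
  my ($csmc, $x, $m, $omega) = @_;       # $omega: hash ref of disallowed actions

  my %forbidden = %$omega;
  my @chosen;

  foreach my $slot (1 .. $m)
    {
      my $a = $csmc->predict ($x, \%forbidden);   # best remaining action
      push @chosen, $a;
      $forbidden{$a} = 1;                         # no duplicates within the set
    }

  return @chosen;
}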

For ad serving, however, the assumption of linear rewards is too strong in practice, as there are usually significant positional effects. Fortunately, if the reward dependence upon position obeys swap supremacy and preservation of relative order (as is implied by a monotonic action-independent multiplicative positional modulation), then a similar technique can be used to solve the problem of selecting the best set of actions when the reward associated with a set is the sum of individual action-position rewards via reduction to average constrained CSMC.

If the reward of a set of actions is not the sum of individual action rewards, one option is to treat entire sets as actions. In ad serving this is generally infeasible but in content optimization (e.g., adaptive UI) this can be viable. If externalities between actions only flow forward by position (e.g., a serial scan model in a vertical presentation), it feels intuitively like a stochastic shortest path problem but I haven't verified this.

In every ad server I've ever worked on, the reward of a set of actions was assumed linear in the individual action rewards, possibly with a positional correction. Thus, there really is no excuse for using regression merely because the problem involves selecting sets.

Summary

Overall, the two big issues that I feel are preventing the dethroning of regression from ad serving are 1) adversarially imposed intra-query constraints and 2) inter-query constraints. Any ad serving problem that does not exhibit these properties should be a slam dunk for more advanced CSMC reductions. For instance, any ad serving problem which monetizes via search engine landing pages (e.g., actions are bidded phrases) does not exhibit these properties; neither do meta-monetization problems (e.g., dynamically selecting between several ad networks).

I'll be noodling on intra-query and inter-query constraints for CSMC in my spare time.

Sunday, October 3, 2010

A Positional Offset Tree

My previous post summarized my recent thoughts regarding cost-sensitive best m with partial feedback (CSBM-PF) given positional effects. A major inspiring application is optimizing sets of content (or advertising) elements, for which positional effects are typically important. After playing around a bit I decided to wave a theoretical white flag and go with a simplifying assumption of the rewards factoring into an action-specific position-independent factor and a position-specific action-independent factor. It will turn out, however, that even this does not allow me to nicely use data from later positions to inform regret at earlier positions. I'm starting to suspect there is something fundamentally wrong about using data from later positions.

The approach has two parts. The first part is a modification of the offset tree to incorporate positional effects, which is what this post is about. The second part is a slight modification of the reduction from CSBM to CSMC to construct entire sets. I'll be focusing on the first part in this post. The punch line is that by normalizing the positional presentation history of each action, I can use data from previous positions to inform the regret at the current position.

The setup is as follows. There is a distribution $D = D_x \times D_{\omega|x} \times D_{r|\omega,x}$, where $r: A \times [1, m] \to [0, 1] \cup \{ -\infty \}$ takes values in the unit interval augmented with $-\infty$, and the components of $r$ which are $-\infty$-valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (A \times [1, m])$ (i.e., $\omega$ is a subset of $A \times [1, m]$). Allowed outputs in response to a problem instance are the $m$-vectors over $A$ without duplicates, denoted \[ \Xi_m = \{ \xi \in A^m | \xi_i = \xi_j \iff i = j \}.\] The regret of a particular deterministic policy $h: X \times \mathcal{P} (A) \to \Xi_m$ is given by \[ v (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{\xi \in \Xi_m}\; E_{r \sim D_{r|\omega,x}} \left[ \sum_{n=1}^m r (\xi_n, n) \right] - E_{r \sim D_{r|\omega,x}} \left[ \sum_{n=1}^m r (h_n (x, \omega), n) \right] \right]. \] I'll assume that the historical policy is using a known conditional distribution over $\Xi_m$ given an instance $p (\xi | x, \omega)$. Note that for certain $\omega$ there might be no elements of $\Xi_m$ which are feasible, i.e., which achieve a finite reward, in which case the regret is always zero. Therefore the interesting parts of the problem space are those $\omega$ for which some elements of $\Xi_m$ are feasible.

The simplifying assumption is that the rewards for an action-position pair factor as \[ r (a, i) = \kappa (i) \tilde r (a) \] where $i > j \implies \kappa (i) < \kappa (j)$, and $\kappa (i) \geq 0$ for all $i$. Note $\kappa$ is a random variable here (like $\tilde r$). I'm not assuming that the positional factors are known or fixed, although I am forced to assume $\kappa$ and $\tilde r$ vary independently. I'll switch from talking about $D_{r | x, \omega}$ to talking about $D_{\tilde r | x, \omega} \times D_{\kappa | x, \omega}$.

With the above assumption it turns out that selecting the actions by position in a greedy fashion is optimal. The point of the positional offset tree is to use data from multiple positions to inform the selection of an action at a particular position. I'll switch to talking about the regret for selecting a single action $h: X \times \mathcal{P} (A) \to A$ at a particular fixed position $z$, \[
\begin{aligned}
v_z (h) &= E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{a \in A}\; E_{r \sim D_{r | x, \omega}} \left[ r (a, z) \right] - E_{r \sim D_{r | x, \omega}} \left[ r (h (x, \omega), z) \right] \right] \\
&= E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ E_{\kappa \sim D_{\kappa|x,\omega}} \left[ \kappa (z) \right] \left( \max_{a \in A}\; E_{\tilde r \sim D_{\tilde r|\omega,x}} \left[ \tilde r (a) \right] - E_{\tilde r \sim D_{\tilde r|\omega,x}} \left[ \tilde r (h (x, \omega)) \right] \right) \right].
\end{aligned}
\] The no-duplicate constraint can't be seen at a single position, but it will be satisfied in practice by manipulating $\omega$ when reducing set selection to individual action selection by position.
Algorithm: Positional Forfeit Offset Tree Train
Data: Partially labelled constrained positional CSMC training data set $S$.
Input: Position $z$ for which to create classifier.
Input: Importance-weighted binary classification routine $\mbox{Learn}$.
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Result: Trained classifiers $\{\Psi_\nu | \nu \in \Lambda (T) \}$.
For each $\nu \in \Lambda (T)$ from leaves to roots:
  1. $S_\nu = \emptyset$.
  2. For each example $(x, \omega, \xi, \{ r (a, i) | (a, i) \in \xi \}, p (\cdot | x, \omega)) \in S$:
    1. Let $\lambda$ and $\phi$ be the two classes input to $\nu$ (the predictions of the left and right subtrees on input $(x, \omega)$ respectively).
    2. If $(\lambda, z) \in \omega$, predict $\phi$ for the purposes of constructing training input for parent node (``$\lambda$ forfeits'');
    3. else if $(\phi, z) \in \omega$, predict $\lambda$ for the purposes of constructing training input for parent node (``$\phi$ forfeits'');
    4. else foreach $n$ in $\Upsilon_{\lambda,\phi} = \{ n \in [1, z] | E_{\xi \sim p} \left[ 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} \right] E_{\xi \sim p} \left[ 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega}\right] > 0 \}$:
      1. Let $\alpha = |\Upsilon_{\lambda,\phi}|^{-1} \frac{E_{\xi \sim p} \left[\sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}{E_{\xi^\prime \sim p} \left[ 1_{\xi^\prime_n = \xi_n} 1_{(\xi_n, n) \not \in \omega} \right]}$
      2. Let $y = 1_{\xi_n = \lambda}$.
      3. If $r (\xi_n, n) < \frac{1}{2}$, $S_\nu \leftarrow S_\nu \cup \left\{ \left( x, 1 - y, \alpha \left( \frac{1}{2} - r (\xi_n, n) \right) \right) \right\}$;
      4. else $S_\nu \leftarrow S_\nu \cup \left\{ \left( x, y, \alpha \left( r (\xi_n, n) - \frac{1}{2} \right) \right) \right\}$.
  3. Let $\Psi_\nu = \mbox{Learn} (S_\nu)$.
Return $\{\Psi_\nu | \nu \in \Lambda (T) \}$.

Algorithm: Positional Forfeit Offset Tree Test
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Input: Trained classifiers $\{\Psi_\nu | \nu \in \Lambda (T) \}$.
Input: Instance realization $(x, \omega)$.
Result: Predicted label $k$.
  1. Let $\nu$ be the root node.
  2. Repeat until $\nu$ is a leaf node:
    1. If all the labels of the leaves in the left-subtree of $\nu$ are in $\omega$, traverse to the right child;
    2. else if all the labels of the leaves in the right-subtree of $\nu$ are in $\omega$, traverse to the left child;
    3. else if $\Psi_\nu (x) = 1$, traverse to the left child;
    4. else (when $\Psi_\nu (x) = 0$ and at least one label in each subtree is not in $\omega$), traverse to the right child.
  3. Return leaf label $k$.
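
Here is a minimal perl sketch of the test procedure above; the node representation (hashes with left, right, labels, label, and psi fields) and the classifier interface are hypothetical stand-ins, and the hash passed in as $omega_z corresponds to $\omega_z = \{ a | (a, z) \in \omega \}$, the labels infeasible at position $z$.

sub pfot_predict ($$$)
{
  my ($node, $x, $omega_z) = @_;         # $omega_z: hash ref of labels infeasible at position z

  until (exists $node->{'label'})        # leaves carry a 'label'
    {
      my @left_ok  = grep { !exists $omega_z->{$_} } @{$node->{'left'}{'labels'}};
      my @right_ok = grep { !exists $omega_z->{$_} } @{$node->{'right'}{'labels'}};

      if    (!@left_ok)            { $node = $node->{'right'}; }   # left subtree forfeits
      elsif (!@right_ok)           { $node = $node->{'left'}; }    # right subtree forfeits
      elsif ($node->{'psi'}->($x)) { $node = $node->{'left'}; }    # \Psi_\nu (x) = 1
      else                         { $node = $node->{'right'}; }
    }

  return $node->{'label'};
}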

Motivating the Update

The basic idea is to importance-weight the historical data so that each pair of ads being compared are getting the same expected positional treatment. This reduces the requirement on the historical policy from ``generalization is not safe unless an action has a chance to be shown at a particular position'' to ``generalization is not safe unless each pair of actions has a chance to be shown at a particular position at or before the one under consideration''. (Ok, maybe that's a bit underwhelming).

For a fixed $(x, \omega, \kappa, \tilde r)$ and an internal node with left input $\lambda$ and right input $\phi$, the expected importance weight for $\lambda$ is \[
\begin{aligned}
w_{\lambda|\tilde r,\kappa} = \frac{E_{\xi \sim p} \left[ \sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} \alpha_{\lambda,n} \left( \kappa (n) \tilde r (\xi_n) - \frac{1}{2} \right)_+ + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \alpha_{\phi,n} \left( \frac{1}{2} - \kappa (n) \tilde r (\xi_n) \right)_+ \right]}{E_{\xi \sim p} \left[ \sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]},
\end{aligned} \] where $\alpha_{\lambda,n}$ and $\alpha_{\phi,n}$ are to be determined scaling factors, and \[ \Upsilon_{\lambda,\phi} = \{ n \in [1, z] | E_{\xi \sim p} \left[ 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} \right] E_{\xi \sim p} \left[ 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega}\right] > 0 \} \] is the set of feasible positions with shared support at or before the current position. This suggests \[
\alpha_{\lambda,n} \propto
\frac{E_{\xi \sim p} \left[\sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}{E_{\xi \sim p} \left[ 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} \right]},
\] which yields \[
\begin{aligned}
w_{\lambda|\tilde r,\kappa} &\propto \sum_{n \in \Upsilon_{\lambda,\phi}} \left(\kappa (n) \tilde r (\lambda) - \frac{1}{2} \right)_+ + \left(\frac{1}{2} - \kappa (n) \tilde r (\phi)\right)_+, \\
w_{\phi|\tilde r,\kappa} &\propto \sum_{n \in \Upsilon_{\lambda,\phi}} \left(\kappa (n) \tilde r (\phi) - \frac{1}{2} \right)_+ + \left(\frac{1}{2} - \kappa (n) \tilde r (\lambda)\right)_+, \\
w_{\lambda|\tilde r,\kappa} - w_{\phi|\tilde r,\kappa} &\propto \left( \tilde r (\lambda) - \tilde r (\phi) \right) \sum_{n \in \Upsilon_{\lambda,\phi}} \kappa (n).
\end{aligned}
\] It is not possible to make this exactly equal to the policy regret difference since the positional factors are unknown random variables, but the monotonicity constraint implies $\sum_{n \in \Upsilon_{\lambda,\phi}} \kappa (n) \geq |\Upsilon_{\lambda,\phi}| \kappa (z)$. Thus with the choices \[
\begin{aligned}
\alpha_{\lambda,n} &=
|\Upsilon_{\lambda,\phi}|^{-1} \frac{E_{\xi \sim p} \left[\sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}{E_{\xi \sim p} \left[ 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} \right]}, \\
\alpha_{\phi,n} &=
|\Upsilon_{\lambda,\phi}|^{-1} \frac{E_{\xi \sim p} \left[\sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}{E_{\xi \sim p} \left[ 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]},
\end{aligned}
\] we get an expected importance weight difference which both has the right sign and has a magnitude at least equal to the policy regret for position $z$, \[
\begin{aligned}
E_{D_{\tilde r | x, \omega} \times D_{\kappa | x, \omega}} \left[ w_{\lambda|\tilde r,\kappa} - w_{\phi|\tilde r, \kappa} \right] &= E_{D_{\tilde r | x, \omega}} \left[ \tilde r (\lambda) - \tilde r (\phi) \right] E_{D_{\kappa | x, \omega}} \left[ \frac{1}{|\Upsilon_{\lambda,\phi}|}\sum_{n \in \Upsilon_{\lambda,\phi}} \kappa (n) \right], \\
\mbox{sgn} \left( E_{D_{\tilde r | x, \omega} \times D_{\kappa | x, \omega}} \left[ w_{\lambda|\tilde r,\kappa} - w_{\phi|\tilde r, \kappa} \right] \right) &= \mbox{sgn} \left( E_{D_{\kappa|x,\omega}} [ \kappa (z) ] E_{D_{\tilde r | x, \omega}} \left[ \tilde r (\lambda) - \tilde r (\phi) \right] \right), \\
\left|E_{D_{\tilde r | x, \omega} \times D_{\kappa | x, \omega}} \left[ w_{\lambda|\tilde r,\kappa} - w_{\phi|\tilde r, \kappa} \right] \right| &\geq E_{D_{\kappa|x,\omega}} [ \kappa (z) ] \left| E_{D_{\tilde r | x, \omega}} \left[ \tilde r (\lambda) - \tilde r (\phi) \right] \right|.
\end{aligned}
\] This turns out to be sufficient to make a regret bound proof strategy work. If instead I try to use data from all positions with shared support, I end up with $E_{D_{\kappa|x,\omega}} [ \kappa (m) ]$ as the leading factor in the last inequality, which is too small by a factor of $E_{D_{\kappa|x,\omega}} [ \kappa (z) ] / E_{D_{\kappa|x,\omega}} [ \kappa (m) ]$. I could scale the conditional regret and come up with another proof bound but that bound isn't useful to me, since I have no way of computing the $\kappa$ ratio in practice.

Since I'm not using data from later positions, I suspect I can relax my assumptions a bit and assume only swap supremacy and preservation of relative order and still have things work out.

Regret Analysis

The regret analysis for the positional forfeit offset tree is very similar to the regret analysis for the forfeit offset tree. The main difference is that instead of the expected importance weight difference being equal to the policy regret, it merely bounds the policy regret. This is sufficient for the proof strategy to work, which is worth noting for other situations where the best that can be done is to bound the policy regret.

Let $\Psi = (T, \{\Psi_\nu | \nu \in \Lambda (T) \})$ denote a particular positional forfeit offset tree (i.e., a choice of a binary tree and a particular set of node classifiers), let $z$ denote the fixed position the tree is trained for, and let $h^\Psi$ denote the policy that results from the tree. The regret analysis leverages an induced importance-weighted binary distribution $D^\prime (\Psi)$ over triples $(x^\prime, y, w)$ defined as follows:
  1. Draw $(x, \omega, \kappa, \tilde r)$ from $D$.
  2. Draw $\nu$ uniform over the internal nodes $\Lambda (T)$ of the binary tree.
  3. Let $x^\prime = (x, \nu)$.
  4. Let $\lambda$ and $\phi$ be the two classes input to $\nu$ (the predictions of the left and right subtrees on input $x$ respectively).
  5. If $(\lambda, z) \in \omega$, create importance-weighted binary example $(x^\prime, 0, 0)$;
  6. else if $(\phi, z) \in \omega$, create importance-weighted binary example $(x^\prime, 1, 0)$;
  7. else (when $(\lambda, z) \not \in \omega$ and $(\phi, z) \not \in \omega$):
    1. Draw $n$ uniform over $\Upsilon_{\lambda, \phi}$.
    2. Draw $\xi$ from $p (\xi | x, \omega)$.
    3. If $\xi_n \neq \lambda$ and $\xi_n \neq \phi$, reject sample;
    4. else if $(\xi_n, n) \in \omega$, reject sample;
    5. else (when ($\xi_n = \lambda$ or $\xi_n = \phi$) and $(\xi_n, n) \not \in \omega$):
      1. Let $\alpha = |\Upsilon_{\lambda,\phi}|^{-1} \frac{E_{\xi^\prime \sim p} \left[\sum_{n^\prime \in \Upsilon_{\lambda, \phi}} 1_{\xi^\prime_{n^\prime} = \lambda} 1_{(\lambda, n^\prime) \not \in \omega} + 1_{\xi^\prime_{n^\prime} = \phi} 1_{(\phi, n^\prime) \not \in \omega} \right]}{E_{\xi^\prime \sim p} \left[ 1_{\xi^\prime_n = \xi_n} 1_{(\xi_n, n) \not \in \omega} \right]}$
      2. Let $y = 1_{\xi_n = \lambda}$
      3. If $r (\xi_n, n) < \frac{1}{2}$, create importance-weighted binary example \[\left( x^\prime, 1 - y, \alpha \left( \frac{1}{2} - r (\xi_n, n) \right) \right);\]
      4. else, create importance-weighted binary example \[ \left( x^\prime, y, \alpha \left( r (\xi_n, n) - \frac{1}{2} \right) \right). \]
The induced distribution $D^\prime (\Psi)$ depends upon the particular tree, but for any fixed tree is well defined. Now I'd like to relate the policy regret of $h^\Psi$ to the importance-weighted binary regret of $\Psi$, \[ \begin{aligned} q (\Psi) &= E_{(x^\prime, y, w) \sim D^\prime (\Psi)} \left[ w 1_{y \neq \Psi (x^\prime)} \right] \\ &= \frac{1}{|\Lambda (T)|} \sum_{\nu \in \Lambda (T)} E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ q_\nu (\Psi | x, \omega) \right], \end{aligned} \] where \[ q_\nu (\Psi | x, \omega) = \begin{cases} 0 & \mbox{if } \Gamma (\nu_\lambda) \setminus \omega_z = \emptyset \mbox{ or } \Gamma (\nu_\phi) \setminus \omega_z = \emptyset; \\ 0 & \mbox{if } \Psi_\nu (x) = 1_{w_\lambda > w_\phi}; \\ \left| w_\lambda - w_\phi \right| & \mbox{otherwise}, \end{cases} \] is the importance weighted regret at internal node $\nu$, $\Gamma (\nu)$ refers to the set of labels (leaves) in the subtree rooted at $\nu$, $\nu_\lambda$ refers to the left child of $\nu$, $\nu_\phi$ refers to the right child of $\nu$, $\omega_z = \{ a | (a, z) \in \omega \}$ is the set of infeasible actions at this position, $w_\lambda$ is the expected importance weight for the left child conditioned on $(x, \omega)$, and $w_\phi$ is the expected importance weight for the right child conditioned on $(x, \omega)$.
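To make the sampling procedure above concrete, here is a minimal Python sketch of steps 5 through 7 for a single internal node. This is my own illustration (the function and argument names are mine, not from any library): $\alpha$ is assumed precomputed from the known historical policy, $\Upsilon_{\lambda,\phi}$ is passed in explicitly, and $\omega$ is represented as a set of (action, position) pairs.

import random

# Sketch of steps 5-7 of the induced distribution at one internal node.
# 'upsilon' is the list of feasible positions with shared support for (lam, phi),
# 'alpha' is the scaling factor from the formula above, 'xi' maps positions to the
# actions the historical policy showed, and 'r' returns the realized reward.
def induced_example(x, node, omega, lam, phi, xi, r, z, upsilon, alpha):
    if (lam, z) in omega:                        # lambda forfeits
        return ((x, node), 0, 0.0)
    if (phi, z) in omega:                        # phi forfeits
        return ((x, node), 1, 0.0)
    n = random.choice(upsilon)                   # uniform over feasible positions
    a = xi[n]                                    # action shown at position n
    if a not in (lam, phi) or (a, n) in omega:
        return None                              # reject sample
    y = 1 if a == lam else 0
    reward = r(a, n)                             # realized r(xi_n, n)
    if reward < 0.5:
        return ((x, node), 1 - y, alpha * (0.5 - reward))
    return ((x, node), y, alpha * (reward - 0.5))

# Toy invocation with made-up values.
example = induced_example(x="doc", node=3, omega=set(), lam="a", phi="b",
                          xi={1: "a", 2: "c"}, r=lambda a, n: 0.8 if a == "a" else 0.2,
                          z=2, upsilon=[1, 2], alpha=1.5)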
Theorem:Regret Bound
For all partially labelled CSMC distributions $D$ such that $r = \kappa \tilde r$ as above; all historical policies $p (\xi | x, \omega)$ such that for all pairs of actions $\lambda, \phi$, $\Upsilon_{\lambda, \phi} \neq \emptyset$ whenever $(\lambda, z) \not \in \omega$ and $(\phi, z) \not \in \omega$; and all positional forfeit offset trees $\Psi$, \[ v_z (h^\Psi) \leq (|A| - 1) q (\Psi) \] where $q (\Psi)$ is the importance-weighted binary regret on the induced subproblem.

Proof: See Appendix.
Thus a particular positional forfeit offset tree, trained for a position $z$ using data from $z$ and previous positions, can be used to select the best action at that particular $z$. The next step is to compose individual positional forfeit offset trees into a set selector by using the reduction of CSBM to CSMC with the slight modification of passing the position $z$ to each subproblem.

Since the result is a bit underwhelming, it's best to turn it around and say the following: normalizing the presentation history by position is not sufficient to justify using data from later positions to inform regret at earlier positions, even given a very strong structural assumption about how the rewards vary by position. If I did use the data from all positions, I'd end up with a bound of the form \[ v_z (h^\Psi) \leq (|A| - 1) E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \frac{E_{\kappa \sim D_{\kappa|x,\omega}} \left[ \kappa (z) \right]}{E_{\kappa \sim D_{\kappa|x,\omega}} \left[ \kappa (m) \right]} q (\Psi | x, \omega) \right], \] which, although sufficient to establish consistency of the reduction, I don't see how to exploit in practice: I don't know how to manage optimization tradeoffs between the different $(x, \omega)$ since I don't know $\frac{E_{\kappa \sim D_{\kappa|x,\omega}} \left[ \kappa (z) \right]}{E_{\kappa \sim D_{\kappa|x,\omega}} \left[ \kappa (m) \right]}$.

Appendix

This is the proof of the regret bound. It is done in terms of $r$, instead of $\kappa \tilde r$, so that I can easily adapt it to the weaker assumptions of swap supremacy and preservation of relative order.

Consider a fixed $(x, \omega)$. It is useful to talk about the conditional policy regret experienced at an internal node $\nu$, \[ v_z (h^\Psi | x, \omega, \nu) = \max_{k \in \Gamma (\nu)} E_{r \sim D_{r | x, \omega}} \left[ r (k, z) \right] - E_{r \sim D_{r | x, \omega}} \left[ r (h_\nu^\Psi (x, \omega), z) \right], \] where $h_\nu^\Psi$ is the prediction at internal node $\nu$. When $\nu$ is the root of the tree, $v_z (h^\Psi | x, \omega, \nu)$ is the positional forfeit offset tree policy regret conditional on $(x, \omega)$.

The proof strategy is to bound $v_z (h^\Psi | x, \omega, \nu) \leq \sum_{m \in \Lambda (\nu)} q_m (\Psi | x, \omega)$ via induction. The base case is trivially satisfied for trees with only one leaf (no internal nodes) since it evaluates to $0 \leq 0$. To show the recursion at a particular internal node $\nu$, let $\lambda$ and $\phi$ be the predictions of the left subtree ($\nu_\lambda$) and right subtree ($\nu_\phi$).
Case 1: $\Gamma (\nu_\lambda) \setminus \omega_z = \emptyset$. In this case $(\lambda, z) \in \omega$ and forfeits, so $\phi$ is chosen. There must be a maximizer in the right subtree, since all values in the left subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = \nu$ and for $m \in \Lambda (\nu_\lambda)$ by definition. Therefore \[ \begin{aligned} v_z (h^\Psi | x, \omega, \nu) &= \max_{k \in \Gamma (\nu)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi, z) \right] \\ &= \max_{k \in \Gamma (\nu_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi, z) \right] \\ &= v_z (h^\Psi | x, \omega, \nu_\phi) \\ &\leq \sum_{m \in \Lambda (\nu_\phi)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (\nu)} q_m (\Psi | x, \omega). \end{aligned} \]
Case 2: $\Gamma (\nu_\lambda) \setminus \omega_z \neq \emptyset$ and $\Gamma (\nu_\phi) \setminus \omega_z = \emptyset$. In this case $(\phi, z) \in \omega$ and $(\lambda, z) \not \in \omega$, so $\phi$ forfeits and $\lambda$ is chosen. There must be a maximizer in the left subtree, since all values in the right subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = \nu$ and for $m \in \Lambda (\nu_\phi)$ by definition. Therefore \[ \begin{aligned} v_z (h^\Psi | x, \omega, \nu) &= \max_{k \in \Gamma (\nu)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda, z) \right] \\ &= \max_{k \in \Gamma (\nu_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda, z) \right] \\ &= v_z (h^\Psi | x, \omega, \nu_\lambda) \\ &\leq \sum_{m \in \Lambda (\nu_\lambda)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (\nu)} q_m (\Psi | x, \omega). \end{aligned} \]
Case 3: $\Gamma (\nu_\lambda) \setminus \omega_z \neq \emptyset$ and $\Gamma (\nu_\phi) \setminus \omega_z \neq \emptyset$. This is the ``normal'' offset tree case, where both $(\lambda, z) \not \in \omega$ and $(\phi, z) \not \in \omega$, so no forfeiture happens. As shown above, the expected importance weight difference conditioned on $(x, \omega)$ has the same sign as the policy regret between $(\lambda, z)$ and $(\phi, z)$, and has a magnitude which is at least equal to the policy regret between $(\lambda, z)$ and $(\phi, z)$.

Assume without loss of generality that the classifier chooses $\phi$. If the maximizer comes from the right subtree, then \[ \begin{aligned} v_z (h^\Psi | x, \omega, \nu) &= \max_{k \in \Gamma (\nu_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi, z) \right] \\ &= v_z (h^\Psi | x, \omega, \nu_\phi) \\ &\leq \sum_{m \in \Lambda (\nu_\phi)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (\nu)} q_m (\Psi | x, \omega). \end{aligned} \] If the maximizer comes from the left subtree, then \[ \begin{aligned} v_z (h^\Psi | x, \omega, \nu) &= \max_{k \in \Gamma (\nu_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi, z) \right] \\ &= E_{r \sim D_{r|\omega,x}} \left[ r (\lambda, z) - r (\phi, z) \right] + v_z (h^\Psi | x, \omega, \nu_\lambda) \\ &\leq q_\nu (\Psi | x, \omega) + v_z (h^\Psi | x, \omega, \nu_\lambda) \\ &\leq q_\nu (\Psi | x, \omega) + \sum_{m \in \Lambda (\nu_\lambda)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (\nu)} q_m (\Psi | x, \omega). \end{aligned} \] Terminating the induction at the root yields \[ v_z (h^\Psi | x, \omega) \leq \sum_{\nu \in \Lambda (T)} q_\nu (\Psi | x, \omega) = |\Lambda (T)| q (\Psi | x, \omega). \] Taking the expectation of both sides with respect to $D_x \times D_{\omega|x}$ and noting $|\Lambda (T)| = (|A| - 1)$ completes the proof.

Monday, September 27, 2010

Positional Effects: Part II

In a previous post I talked about cost-sensitive best m with partial feedback (CSBM-PF) in the presence of positional effects, as I try to muddle through how to optimize a combinatorial content page with a single feedback event. Several different possibilities suggested themselves:
  1. Assume nothing about positional dependence, negative externalities, etc. In this case one can treat entire sets as actions and use the offset tree directly. Disadvantages of this approach include: generalization is restricted to combinations which have historical support; and training time computation scales as the number of combinations. However if the number of combinations is small this is perhaps the best way to go.
  2. Assume a greedy approach to constructing result vectors is optimal, but otherwise do not assume anything about positional dependence (or negative externalities?). In this case a sequence of offset trees arranged in greedy fashion reduces CSBM-PF to CSMC-PF. Generalization is restricted to combinations in which the individual action-position pairs have historical support, rather than entire combinations. Training time calculation is $m |A|$ rather than ${|A| \choose m}$.
    1. Ideally, I could find a necessary condition for the greedy approach to be optimal and use that in the learning algorithm, but I haven't found one.
  3. Assume swap supremacy (sufficient for the greedy approach to be optimal) and preservation of relative order by position (orthogonal to greedy, but necessary to make what follows work). Again a sequence of offset trees arranged in greedy fashion reduces CSBM-PF to CSMC-PF, but data is reused at later positions. Generalization is restricted to combinations in which the individual action has historical support at or before the position where it is being used. I had hoped to be able to use data from later positions to inform earlier positions under these assumptions, but further reflection suggests no. For instance, consider the following expected reward distribution as $\epsilon \to 0$, \[
    \begin{array}{|c|c|c|}
    \mbox{Action} & r (a, 1) & r (a, 2) \\ \hline
    1 & 1 + 1 / \epsilon & 1 + \epsilon \\
    2 & 1 & 1
    \end{array}
    \] which obeys swap supremacy and preserves relative order by position; yet a small regret in the second position leads to a large regret in the first position. Basically the assumptions I have so far are not strong enough to let me go backwards. This is sad because exploration at the earlier positions is the most expensive, in the sense that they are the higher value spots.
Well I tried to be fancy, and it was informative, but ultimately I'm going to go with a common assumption: that the positional dependence and the action dependence factorize, i.e., $r (a, i) = \kappa (i) \tilde r (a)$, where $\kappa$ and $\tilde r$ are independent random variables. If $\kappa (i)$ is non-negative and monotonically decreasing in $i$, then this implies swap supremacy and preservation of relative order by position. The difference is that it gives a projection factor for changing positions from $i$ to $j$, $E_{\kappa | x, \omega} [ \kappa (j) ] / E_{\kappa | x,\omega}[ \kappa (i) ]$, which will (hopefully) allow me to generalize across position and mitigate the historical support requirement.
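As a concrete (entirely made-up) illustration of what that projection factor buys, a reward estimate at one position can be rescaled to another position using estimates of $E [ \kappa (\cdot) ]$:

# Under r(a, i) = kappa(i) * rtilde(a) with kappa independent of rtilde,
# E[r(a, j)] = E[r(a, i)] * E[kappa(j)] / E[kappa(i)]. Illustrative numbers only.
E_kappa = {1: 1.0, 2: 0.6, 3: 0.35}    # estimated positional factors
r_hat_at_2 = 0.12                      # estimated E[r(a, 2)] for some action a

r_hat_at_1 = r_hat_at_2 * E_kappa[1] / E_kappa[2]   # projected estimate, roughly 0.2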

Now to just make that work.

Friday, September 24, 2010

Positional Effects

In a previous post I talked about constrained cost-sensitive best m with partial feedback (CSBM-PF) and one way to reduce it to constrained cost-sensitive multiclass classification with partial feedback (CSMC-PF). The CSBM variant I considered was about choosing the best set of choices when the reward of a set is the sum of the rewards of its elements, and the reduction to CSMC works by “pick the best choice, then the next best choice, and so on”. A major inspiring application is optimizing sets of content (or advertising) elements, and indeed I have a content optimization problem in mind for my current gig. In practice, positional effects are important for such applications, so I thought I'd try to incorporate them.

My reduction works as follows: first the highest reward choice is chosen, then its reward is adjusted to $-\infty$, and the process is repeated until a set of size $m$ has been achieved. This is essentially how I've always seen it done in the past (e.g., construct a regressor, and fill positions in sorted order), but for this greedy approach to work the positional dependence cannot be arbitrary: it must be that \[ \sum_{i=1}^m \max_{a_i \in A \setminus \{ a^*_1, \ldots, a^*_{i-1} \} } r (a_i, i) = \max_{a \in \tilde A^m}\; \sum_{i=1}^m r (a_i, i). \] Here $r: A \times [1, m] \to [0, 1]$ are the (conditionally expected) rewards, $A$ are the actions, and $[1, m]$ are the positions, and the right hand side maximum is over vectors of actions $\tilde A^m$ that do not have duplicates (i.e., the same action cannot be chosen at multiple positions). This ``greedy works'' condition is actually a much weaker condition than I'm used to assuming. For instance, here's an example set of expected rewards for which greedy works: \[
\begin{array}{|c|c|c|c|}
\mbox{Action} & r (a, 1) & r (a, 2) & r (a, 3) \\ \hline
1 & 10 & 5 & 1 \\
2 & 2 & 4 & 1 \\
3 & 1 & 2 & 3
\end{array}
\] The first action has decreasing reward by position, while the second action prefers a particular position and the third action has increasing reward by position. So I thought I'd have some fun exploring the above ``greedy works'' condition.
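Here is a quick brute-force check in Python (purely illustrative, not part of any argument) that greedy does attain the optimum on the toy table above:

from itertools import permutations

# Brute-force verification that greedy matches the optimum on the toy table;
# r[a][i] is the expected reward of action a at (0-indexed) position i.
r = {1: (10, 5, 1), 2: (2, 4, 1), 3: (1, 2, 3)}
m = 3

def total(assignment):
    return sum(r[a][i] for i, a in enumerate(assignment))

best = max(permutations(r, m), key=total)

greedy, remaining = [], set(r)
for i in range(m):
    a = max(remaining, key=lambda act: r[act][i])
    greedy.append(a)
    remaining.remove(a)

assert total(greedy) == total(best)   # both are 17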

Lemma:Position Maximizers
If there is at least one maximizer of $\sum_{j=1}^m r (a_j, j)$, then there is a maximizer $\tilde a^*$ of $\sum_{j=1}^m r (a_j, j)$ which, for every $i \in [1, m]$, uses an individual position maximizer $a^*_i$ of $r (a, i)$ in some position.

Proof: Let $a^*_i$ be a maximizer of $r (a, i)$, and assume $\tilde a^*$ is a maximizer of $\sum_{j=1}^m r (a_j, j)$ that does not use $a^*_i$ in any position. Define $a^\prime$ as $\tilde a^*$ with the action in position $i$ replaced by $a^*_i$. Since $a^*_i$ is a maximizer of $r (a, i)$, the resulting total reward cannot decrease; therefore $a^\prime$ is also a maximizer of $\sum_{j=1}^m r (a_j, j)$. Repeating this argument for each position $i$ eventually yields a maximizer that uses an individual position maximizer from every position.
Now, here is a sufficient condition for the ``greedy works'' property.
Sufficient Condition:Swap Supremacy
If \[
r (a, i) \geq r (a^\prime, i) \implies \forall j > i: r (a, i) + r (a^\prime, j) \geq r (a, j) + r (a^\prime, i),
\] then \[
\sum_{i=1}^m \max_{a_i \in A \setminus \{ a^*_1, \ldots, a^*_{i-1} \} } r (a_i, i) = \max_{a \in \tilde A^m}\; \sum_{i=1}^m r (a_i, i).
\] Proof: The proof is by induction. The base case is when $m = 1$, in which case greedy always works.

To show $m - 1 \Rightarrow m$, note from the lemma it follows that there is a maximizer of the right hand side that uses $a^*_1$ in some position. If that position is $j \neq 1$, then construct a new maximizer of the right hand side by swapping positions 1 and $j$: this is guaranteed not to decrease the total reward due to the precondition of the theorem. Since both the left hand and right hand side of the desired result use $a^*_1$ in position 1, it can be subtracted from both sides, yielding \[
\sum_{i^\prime=1}^{m-1} \max_{a^\prime_{i^\prime} \in A_1 \setminus \{ {a^\prime}^*_1, \ldots, {a^\prime}^*_{i^\prime-1} \} } r (a^\prime_{i^\prime}, i^\prime) = \max_{a \in \tilde A_1^{m-1}}\; \sum_{i^\prime=1}^{m-1} r (a_{i^\prime}, i^\prime),
\] where $A_1 = A \setminus \{ a^*_1 \}$ and $i^\prime = i - 1$ relabels the remaining positions; this holds by the induction hypothesis applied to the remaining $m - 1$ positions.
The toy example above is sufficient to show that the condition in the above theorem is not necessary: actions 2 and 3 violate the condition for positions 1 and 2, yet greedy still works. This is because they are not used in position 1 due to action 1. It is an interesting assumption, however, because when rearranging a bit, \[ r (a, i) - r (a^\prime, i) \geq 0 \implies \forall j > i: r (a, i) - r (a^\prime, i) \geq r (a, j) - r (a^\prime, j), \] it suggests that the policy regret at a later position can be upper bounded by the policy regret at an earlier position. This is almost good enough to let an offset tree style regret bound proof go through. It turns out what is needed is two-fold: the expected importance weight difference at a particular node has to be an upper bound on the policy regret, and the sign of the expected importance weight difference at a particular node has to be the same as the sign of the policy regret (so that the correct importance-weighted decision is the correct policy decision). So I could add one additional condition, \[ r (a, i) - r (a^\prime, i) \geq 0 \implies \forall j > i: r (a, j) - r (a^\prime, j) \geq 0, \] i.e., the relative order of the actions does not change as position increases. With this assumption I could use historical instances from previous positions to train offset trees for later positions, mitigating the requirements on historical data (without some kind of structural assumption, every action has to have some chance of being tried at every position for the regret bound to go through).

What about training offset trees for earlier positions? Well if the relative order of the actions does not change as position increases, I can always find the best action in the last position, and that will be the best action in any position. Thus all the historical data would get projected forward onto the last position and the resulting forfeit offset tree would be rebuilt with the constraint set increasing over time. The historical policy need only have some chance of trying every action pair at an overlapping set of positions in order for the regret bound to go through.
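For concreteness, here are small Python checkers (the helper names are mine) for the two conditions, on a table $r[a][i]$ of expected rewards with 0-indexed positions. On the toy table above both return False, e.g., action 1 beats action 3 at position 1 but not at position 3:

# Checkers for swap supremacy and preservation of relative order by position.
def swap_supremacy(r, actions, positions):
    return all(r[a][i] + r[b][j] >= r[a][j] + r[b][i]
               for a in actions for b in actions
               for i in positions for j in positions
               if j > i and r[a][i] >= r[b][i])

def preserves_relative_order(r, actions, positions):
    return all(r[a][j] >= r[b][j]
               for a in actions for b in actions
               for i in positions for j in positions
               if j > i and r[a][i] >= r[b][i])

r = {1: (10, 5, 1), 2: (2, 4, 1), 3: (1, 2, 3)}
print(swap_supremacy(r, list(r), range(3)))            # False
print(preserves_relative_order(r, list(r), range(3)))  # False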

Now to just make this work.

Thursday, September 23, 2010

Aggregate Forfeit Offset Tree Note

In a previous post I talked about the aggregate forfeit offset tree, which handles the problem of learning with partial feedback in the case that the feedback is aggregate, i.e., only the total sum of rewards for a subset of the actions is observed. The regret bound proof required that the historical policy satisfy a kind of ``detailed balance'' condition, namely, that the two sets \[
\begin{aligned}
\Upsilon_{\lambda, \neg \phi} &= \{ \mathcal{A} | \mathcal{A} \cap \omega = \emptyset, \lambda \in \mathcal{A}, \phi \not \in \mathcal{A}, E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0 \}, \\
\Upsilon_{\neg \lambda, \phi} &= \{ \mathcal{A} | \mathcal{A} \cap \omega = \emptyset, \lambda \not \in \mathcal{A}, \phi \in \mathcal{A}, E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0 \}.
\end{aligned}
\] had to be the same when $\lambda$ was replaced by $\phi$ in each set. Basically, if a set containing $\lambda$ and not $\phi$ is possible in the historical policy, the corresponding set with $\lambda$ replaced by $\phi$ had to be possible as well. This way, the expected contribution of the other rewards cancels and the reward of the individual action is revealed. Also, $|\Upsilon_{\lambda, \neg \phi}| = |\Upsilon_{\neg \lambda, \phi}| > 0$, i.e., there need to be some sets that contain one of the two actions but not the other.

Well I wanted to remind myself that if the historical policy doesn't obey this detailed balance condition, it can be possible to modify the historical data to create an effective historical policy that does obey the condition. For instance if $|\Upsilon_{\lambda, \neg \phi}| > |\Upsilon_{\neg \lambda, \phi}| > 0$, I can reject some of the extra sets in $\Upsilon_{\lambda, \neg \phi}$ such that $|\Upsilon_{\lambda, \neg \phi}| = |\Upsilon_{\neg \lambda, \phi}|$. Of course, all of the normalization constants and factors that depend upon the historical policy $p$ need to be adjusted because now the historical policy is $p^\prime$ (which is $p$ followed by a rejection step).
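Here is one way (a sketch of my own, for a single pair of actions, and stricter than merely equalizing the cardinalities) to implement that rejection: keep a historical record only if the set obtained by swapping $\lambda$ and $\phi$ also appears in the historical data, which yields an effective policy $p^\prime$ obeying the correspondence for that pair.

# Rejection sketch for one pair (lam, phi). 'records' is a list of
# (frozenset of actions, total reward) pairs from the historical policy.
# Sets containing both or neither action are always kept. All normalization
# factors must then be recomputed from the resulting effective policy p'.
def enforce_detailed_balance(records, lam, phi):
    produced = {A for A, _ in records}           # distinct sets seen historically
    def swapped(A):
        if lam in A and phi not in A:
            return frozenset(A - {lam} | {phi})
        if phi in A and lam not in A:
            return frozenset(A - {phi} | {lam})
        return A
    return [(A, s) for (A, s) in records if swapped(A) in produced]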

Sunday, September 19, 2010

An Aggregate Forfeit Offset Tree

In a previous post I talked about cost-sensitive best m (CSBM) with partial feedback. The idea is to pick a set of actions where the reward for the set is the sum of the individual element rewards. The nature of the partial feedback I considered previously was that only the rewards for the chosen set of actions were revealed, but each individual reward within that set was revealed. Another plausible scenario is that only the total reward of the set of actions chosen is revealed. This is harder because it is both partial feedback and aggregate feedback. However there is a problem at my current gig where a set of page elements is to be optimized, but there is only one way for the user to positively interact with the page, i.e., individual page components are the unit of decision making but not the unit of user feedback. If the cardinality of the sets in question is small, treating whole sets as actions and directly utilizing the offset tree is an option. For the initial problem I'm dealing with here, there are ${9 \choose 2} = 36$ combinations so this is totally viable. Taking over more portions of the page would scale this up maybe 2 or 3 orders of magnitude but I have a lot of data so maybe this would still work.

Still, there is that itch to scratch $\ldots$ my hope is to use an offset tree approach but for individual actions not sets, composing them into a set selector with my constrained CSBM to constrained CSMC reduction. The first step is to solve constrained CSMC with aggregate feedback, i.e., pick the best feasible action given historical data consisting of sets of actions and the associated total summed reward. The constrained CSMC setup is as follows. There is a distribution $D = D_x \times D_{\omega|x} \times D_{r|\omega,x}$ where $r: A \to [0, 1] \cup \{ -\infty \}$ takes values on the unit interval augmented with $-\infty$, and the components of $r$ that are $-\infty$ valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (A)$ (i.e., $\omega$ is a subset of $A$). The regret of a particular deterministic policy $h: X \times \mathcal{P} (A) \to A$ is \[ v (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{k \in A}\; E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (h (x, \omega)) \right] \right]. \] I'll assume that the historical policy uses a known conditional distribution over the power set of actions given an instance $p (\mathcal{A} | x, \omega)$. I'll use the shorthand $\mathcal{A}$ to refer to realizations from $\mathcal{P} (A)$. Instead of historical data containing the rewards for each element of $\mathcal{A}$, there is only $\sum_{a \in \mathcal{A}} r (a)$.
Algorithm:Aggregate Forfeit Offset Tree Train
Data: Constrained CSMC with aggregate feedback training data set $S$.
Input: Importance-weighted binary classification routine $\mbox{Learn}$.
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Result: Trained classifiers $\{\Psi_n | n \in \Lambda (T) \}$.
  1. For each $n \in \Lambda (T)$ from leaves to root:
    1. $S_n = \emptyset$.
    2. For each example $\left(x, \omega, \mathcal{A}, \sum_{a \in \mathcal{A}} r (a), p (\cdot | x, \omega)\right) \in S$ with $\mathcal{A} \cap \omega = \emptyset$:
      1. Let $\lambda$ and $\phi$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $(x, \omega)$ respectively).
      2. If $\lambda \in \omega$, predict $\phi$ for the purposes of constructing training input for parent node (``$\lambda$ forfeits'');
      3. else if $\phi \in \omega$, predict $\lambda$ for the purposes of constructing training input for parent node (``$\phi$ forfeits'');
      4. else if ($\lambda \in \mathcal{A}$ and $\phi \not \in \mathcal{A}$) or ($\lambda \not \in \mathcal{A}$ and $\phi \in \mathcal{A}$):
        1. Let \[ \alpha = { |A \setminus \omega| - 2 \choose |\mathcal{A}| - 1 }^{-1} \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} (1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}})\right]}{E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ]}. \]
        2. If $\sum_{a \in \mathcal{A}} r (a) < \frac{|\mathcal{A}|}{2}$, $S_n \leftarrow S_n \cup \left\{ \left( x, 1_{\phi \in \mathcal{A}}, \alpha \left( \frac{|\mathcal{A}|}{2} - \sum_{a \in \mathcal{A}} r (a) \right) \right) \right\}$;
        3. else $S_n \leftarrow S_n \cup \left\{ \left( x, 1_{\lambda \in \mathcal{A}}, \alpha \left( \sum_{a \in \mathcal{A}} r (a) - \frac{|\mathcal{A}|}{2} \right) \right) \right\}$.
    3. Let $\Psi_n = \mbox{Learn} (S_n)$.
  2. Return $\{\Psi_n | n \in \Lambda (T) \}$.
Comment: This assumes a historical policy where $|\mathcal{A}|$ is a constant almost surely, and all feasible sets have positive probability.
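To make the update concrete, here is a Python sketch (my own function and argument names, not the post's) of step 1.2 for a single node and a single historical example, in the simple case described in the comment, with the historical policy given explicitly as a dictionary from frozensets to probabilities:

from math import comb

# Per-node update for one historical example. Assumes the caller has already
# filtered to examples with A_shown disjoint from omega, that |A_shown| is
# constant under the historical policy, that omega is a subset of the n_actions
# actions, and that 'p' maps every feasible frozenset to its (positive) probability.
def node_example(x, omega, A_shown, total_reward, p, lam, phi, n_actions):
    if lam in omega or phi in omega:
        return None                              # forfeit: no example for this node
    in_lam, in_phi = lam in A_shown, phi in A_shown
    if in_lam == in_phi:
        return None                              # both or neither present: uninformative
    numer = sum(prob for A, prob in p.items()
                if not (A & omega) and ((lam in A) != (phi in A)))
    alpha = numer / (comb(n_actions - len(omega) - 2, len(A_shown) - 1) * p[A_shown])
    half = len(A_shown) / 2.0
    if total_reward < half:
        return (x, int(in_phi), alpha * (half - total_reward))
    return (x, int(in_lam), alpha * (total_reward - half))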
Algorithm:Aggregate Forfeit Offset Tree Test
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Input: Trained classifiers $\{\Psi_n | n \in \Lambda (T) \}$.
Input: Instance realization $(x, \omega)$.
Result: Predicted label $k$.
  1. Let $n$ be the root node.
  2. Repeat until $n$ is a leaf node:
    1. If all the labels of the leaves in the left-subtree of $n$ are in $\omega$, traverse to the right child;
    2. else if all the labels of the leaves in the right-subtree of $n$ are in $\omega$, traverse to the left child;
    3. else if $\Psi_n (x) = 1$, traverse to the left child;
    4. else (when $\Psi_n (x) = 0$ and at least one label in each subtree is not in $\omega$), traverse to the right child.
  3. Return leaf label $k$.
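And here is a matching sketch of the test-time traversal with forfeiture; the tree representation (a leaf label, or a tuple of classifier, left subtree, right subtree) is just my own convenience for illustration:

# Test-time traversal with forfeiture. A node is either a leaf label or a tuple
# (classifier, left, right); classifier(x) == 1 means traverse to the left child.
def labels(node):
    if not isinstance(node, tuple):
        return {node}
    _, left, right = node
    return labels(left) | labels(right)

def predict(tree, x, omega):
    node = tree
    while isinstance(node, tuple):
        classifier, left, right = node
        if labels(left) <= omega:       # every label on the left is infeasible
            node = right
        elif labels(right) <= omega:    # every label on the right is infeasible
            node = left
        elif classifier(x) == 1:
            node = left
        else:
            node = right
    return node

# Example: with "a" infeasible, the left subtree forfeits at the lower node.
tree = (lambda x: 1, (lambda x: 0, "a", "b"), "c")
assert predict(tree, x=None, omega={"a"}) == "b"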

Motivating the Update

The basic idea is to use the total reward as the signal in an offset tree, but only attributing it when one but not both of the inputs to a node is in the set of actions. The key to leveraging the filter tree style regret bound proof strategy is to ensure that the expected importance weight difference at an internal node is equal to the policy regret with respect to the two inputs to that node. Since the total reward is a linear combination of individual rewards, it is possible to compare the action values by evaluating their difference when co-occurring with the same actions. The update is chosen such that when the expectation is taken, sets that differ only in the actions input to a particular node combine to contribute to the expected importance weight difference.
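A toy numeric example (made-up per-action rewards) of that cancellation:

# Two historical sets that differ only in lambda versus phi have total rewards
# whose difference is exactly r(lambda) - r(phi); the shared action cancels.
r = {"lam": 0.9, "phi": 0.4, "other": 0.7}

def total(A):
    return sum(r[a] for a in A)

assert abs((total({"lam", "other"}) - total({"phi", "other"})) - (r["lam"] - r["phi"])) < 1e-9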

Jumping ahead a bit, for a fixed $(x, \omega, r)$ and an internal node with left input $\lambda \not \in \omega$ and right input $\phi \not \in \omega$, the expected importance weight for $\lambda$ is \[
\begin{aligned}
w_{\lambda|r} &= \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} \alpha_{\lambda, \neg \phi} \left( \sum_{a \in \mathcal{A}} r (a) - \frac{|\mathcal{A}|}{2} \right)_+ \right ]}{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]} \\
&\quad + \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \alpha_{\neg \lambda, \phi} \left( \frac{|\mathcal{A}|}{2} - \sum_{a \in \mathcal{A}} r (a) \right)_+ \right ]}{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]},
\end{aligned}
\] where $(x)_+ = \max (x, 0)$, and $\alpha_{\lambda,\neg \phi}$ and $\alpha_{\neg \lambda, \phi}$ are to be determined scaling factors. This suggests \[
\alpha_{\neg \lambda, \phi} = \alpha_{\lambda, \neg \phi} \propto \begin{cases} \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]}{E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ]} & \mbox{if } E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0; \\ 0 & \mbox{otherwise}, \end{cases}
\] which yields \[
\begin{aligned}
w_{\lambda|r} &\propto \sum_{\mathcal{A} \in \Upsilon_{\lambda, \neg \phi}} \left(\sum_{a \in \mathcal{A}} r (a) - \frac{|\mathcal{A}|}{2} \right)_+ + \sum_{\mathcal{A} \in \Upsilon_{\neg \lambda, \phi}} \left( \frac{|\mathcal{A}|}{2} - \sum_{a \in \mathcal{A}} r (a) \right)_+, \\
w_{\phi|r} &\propto \sum_{\mathcal{A} \in \Upsilon_{\neg \lambda,\phi}} \left( \sum_{a \in \mathcal{A}} r (a) - \frac{|\mathcal{A}|}{2} \right)_+ + \sum_{\mathcal{A} \in \Upsilon_{\lambda, \neg \phi}} \left( \frac{|\mathcal{A}|}{2} - \sum_{a \in \mathcal{A}} r (a) \right)_+, \\
\end{aligned}
\] where \[
\begin{aligned}
\Upsilon_{\lambda, \neg \phi} &= \{ \mathcal{A} | \mathcal{A} \cap \omega = \emptyset, \lambda \in \mathcal{A}, \phi \not \in \mathcal{A}, E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0 \}, \\
\Upsilon_{\neg \lambda, \phi} &= \{ \mathcal{A} | \mathcal{A} \cap \omega = \emptyset, \lambda \not \in \mathcal{A}, \phi \in \mathcal{A}, E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0 \}.
\end{aligned}
\] Now if a set containing $\lambda$ and not $\phi$ is possible under the historical policy if and only if the corresponding set with $\lambda$ replaced by $\phi$ is possible under the historical policy, a condition I shall denote $\Upsilon_{\lambda, \neg \phi} \sim \Upsilon_{\neg \lambda, \phi}$, then the expected importance weight difference is \[
w_{\lambda|r} - w_{\phi|r} \propto | \Upsilon | \left( r (\lambda) - r (\phi) \right),
\] and therefore the proper choice when $|\Upsilon_{\lambda,\neg \phi}| = |\Upsilon_{\neg \lambda, \phi}| \doteq |\Upsilon| > 0$ is \[
\alpha_{\neg \lambda, \phi} = \alpha_{\lambda, \neg \phi} = \begin{cases} |\Upsilon|^{-1} \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]}{E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ]} & \mbox{if } E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0; \\ 0 & \mbox{otherwise}. \end{cases}
\] In the simplest case where all entirely feasible sets have positive probability under the historical policy, and all sets constructed by the historical policy have the same $|\mathcal{A}|$, then $|\Upsilon| = { |A \setminus \omega| - 2 \choose |\mathcal{A}| - 1 }$.
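A quick enumeration (illustrative numbers only) confirming that count in the simplest case:

from itertools import combinations
from math import comb

# With a historical policy that puts positive probability on every feasible set
# of fixed size k, count the sets containing exactly one of two chosen actions.
A = set(range(6))          # actions
omega = {0}                # infeasible actions
k = 3                      # constant set size under the historical policy
lam, phi = 1, 2

feasible = [frozenset(S) for S in combinations(sorted(A - omega), k)]
upsilon_lam = [S for S in feasible if lam in S and phi not in S]
upsilon_phi = [S for S in feasible if phi in S and lam not in S]

assert len(upsilon_lam) == len(upsilon_phi) == comb(len(A - omega) - 2, k - 1)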

In some cases a historical policy that does not obey $\Upsilon_{\lambda, \neg \phi} \sim \Upsilon_{\neg \lambda, \phi}$ can be modified via rejecting a portion of the historical data into an effective historical policy that does obey $\Upsilon_{\lambda, \neg \phi} \sim \Upsilon_{\neg \lambda, \phi}$.

Regret Analysis

The regret analysis for the aggregate forfeit offset tree is almost identical to the regret analysis for the forfeit offset tree.

Let $\Psi = (T, \{\Psi_n | n \in \Lambda (T) \})$ denote a particular aggregate forfeit offset tree (i.e., a choice of a binary tree and a particular set of node classifiers), and let $h^\Psi$ denote the policy that results from the aggregate forfeit offset tree. The regret analysis leverages an induced importance-weighted binary distribution $D^\prime (\Psi)$ over triples $(x^\prime, y, w)$ defined as follows:
  1. Draw $(x, \omega, r)$ from $D$.
  2. Draw $n$ uniform over the internal nodes $\Lambda (T)$ of the binary tree.
  3. Let $x^\prime = (x, n)$.
  4. Let $\lambda$ and $\phi$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $x$ respectively).
  5. If $\lambda \in \omega$, create importance-weighted binary example $(x^\prime, 0, 0)$;
  6. else if $\phi \in \omega$, create importance-weighted binary example $(x^\prime, 1, 0)$;
  7. else (when $\lambda \not \in \omega$ and $\phi \not \in \omega$):
    1. Draw $\mathcal{A}$ from $p (\mathcal{A} | x, \omega)$.
    2. If $\mathcal{A} \cap \omega \neq \emptyset$, reject sample;
    3. else if ($\lambda \in \mathcal{A}$ and $\phi \not \in \mathcal{A}$) or ($\lambda \not \in \mathcal{A}$ and $\phi \in \mathcal{A}$):
      1. Let \[ \alpha = |\Upsilon|^{-1} \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} (1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}})\right]}{E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ]}, \] with $|\Upsilon|$ as defined above.
      2. If $\sum_{a \in \mathcal{A}} r (a) < \frac{|\mathcal{A}|}{2}$, create importance-weighted binary example \[\left( x^\prime, 1_{\phi \in \mathcal{A}}, \alpha \left( \frac{|\mathcal{A}|}{2} - \sum_{a \in \mathcal{A}} r (a) \right) \right) ;\]
      3. else (when $\sum_{a \in \mathcal{A}} r (a) \geq \frac{|\mathcal{A}|}{2}$), create importance-weighted binary example \[ \left( x^\prime, 1_{\lambda \in \mathcal{A}}, \alpha \left( \sum_{a \in \mathcal{A}} r (a) - \frac{|\mathcal{A}|}{2} \right) \right) ;\]
    4. else reject sample.
The induced distribution $D^\prime (\Psi)$ depends upon the particular aggregate forfeit offset tree, but for any fixed aggregate forfeit offset tree is well defined. Now I'd like to relate the policy regret of $h^\Psi$ to the importance-weighted binary regret of $\Psi$, \[ \begin{aligned} q (\Psi) &= E_{(x^\prime, y, w) \sim D^\prime (\Psi)} \left[ w 1_{y \neq \Psi (x^\prime)} \right] \\ &= \frac{1}{|\Lambda (T)|} \sum_{n \in \Lambda (T)} E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ q_n (\Psi | x, \omega) \right], \end{aligned} \] where \[ q_n (\Psi | x, \omega) = \begin{cases} 0 & \mbox{if } \Gamma (n_\lambda) \setminus \omega = \emptyset \mbox{ or } \Gamma (n_\phi) \setminus \omega = \emptyset; \\ 0 & \mbox{if } \Psi_n (x) = 1_{w_\lambda > w_\phi}; \\ \left| w_\lambda - w_\phi \right| & \mbox{otherwise}, \end{cases} \] is the importance weighted regret at internal node $n$, $\Gamma (n)$ refers to set of labels (leaves) in the subtree rooted at $n$, $n_\lambda$ refers to the left child of $n$, $n_\phi$ refers to the right child of $n$, $w_\lambda$ is the expected importance weight for the left child conditioned on $(x, \omega)$, and $w_\phi$ is the expected importance weight for the right child conditioned on $(x, \omega)$.

Theorem:Regret Bound
For all CSMC distributions $D$; all historical policies $p$ such that for all pairs of actions $\lambda$ and $\phi$, $\Upsilon_{\lambda, \neg \phi} \sim \Upsilon_{\neg \lambda, \phi} \neq \emptyset$ whenever $\lambda \not \in \omega$ and $\phi \not \in \omega$, and such that $E_{\mathcal{A} \sim p} [ 1_{a \in \mathcal{A}} | x, \omega ] > 0$ whenever $a \not \in \omega$; and all aggregate forfeit offset trees $\Psi$, \[ v (h^\Psi) \leq (|A| - 1) q (\Psi), \] where $q (\Psi)$ is the importance-weighted binary regret on the induced subproblem.

Proof: See Appendix.
While this is pleasantly tidy, there is still a blemish: identifying constraints with penalties on particular actions seemed natural in previous contexts, but here a more plausible scenario is penalties on particular combinations of actions. That starts to look like stochastic shortest path (SSP) without recourse with partial (aggregate?) feedback and a non-fully connected graph. In OR they reduce many problems to SSP, so maybe it's time to revisit SSP now that I have a better command of the offset tree.

Appendix

This is the proof of the regret bound.

Consider a fixed $(x, \omega)$. It is useful to talk about the conditional policy regret experienced at an internal node $n$, \[ v (h^\Psi | x, \omega, n) = \max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (h^\Psi_n (x, \omega)) \right]. \] where $h_n^\Psi$ is the prediction at internal node $n$. When $n$ is the root of the tree, $v (h^\Psi | x, \omega, n)$ is the forfeit offset tree policy regret conditional on $(x, \omega)$.

The proof strategy is to bound $v (h^\Psi | x, \omega, n) \leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega)$ via induction. The base case is trivially satisfied for trees with only one leaf (no internal nodes) since it evaluates to $0 \leq 0$. To show the recursion at a particular internal node $n$, let $\lambda$ and $\phi$ be the predictions of the left subtree ($n_\lambda$) and right subtree ($n_\phi$).

Case 1: $\Gamma (n_\lambda) \setminus \omega = \emptyset$. In this case $\lambda \in \omega$ and forfeits, so $\phi$ is chosen. There must be a maximizer in the right subtree, since all values in the left subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_\lambda)$ by definition. Therefore \[ \begin{aligned} v (h^\Psi | x, \omega, n) &=
\max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= \max_{k \in \Gamma (n_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= v (h^\Psi | x, \omega, n_\phi) \\ &\leq \sum_{m \in \Lambda (n_\phi)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \]
Case 2: $\Gamma (n_\lambda) \setminus \omega \neq \emptyset$ and $\Gamma (n_\phi) \setminus \omega = \emptyset$. In this case $\phi \in \omega$ and $\lambda \not \in \omega$, so $\phi$ forfeits and $\lambda$ is chosen. There must be a maximizer in the left subtree, since all values in the right subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_\phi)$ by definition. Therefore \[ \begin{aligned} v (h^\Psi | x, \omega, n) &=
\max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) \right] \\ &= \max_{k \in \Gamma (n_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) \right] \\ &= v (h^\Psi | x, \omega, n_\lambda) \\ &\leq \sum_{m \in \Lambda (n_\lambda)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \]
Case 3: $\Gamma (n_\lambda) \setminus \omega \neq \emptyset$ and $\Gamma (n_\phi) \setminus \omega \neq \emptyset$. This is the ``normal'' offset tree case, where both $\lambda \not \in \omega$ and $\phi \not \in \omega$ so no forfeiture happens. As shown above, the expected importance weights conditioned on $(x, \omega, r)$ and $\lambda \not \in \omega$ and $\phi \not \in \omega$ satisfy \[ | w_\lambda - w_\phi | = \left| E_{r \sim D_{r|\omega,x}} \left[ w_{\lambda|r} - w_{\phi|r} \right] \right| = \left| E_{r \sim D_{r|\omega,x}} [r (\lambda) - r (\phi)] \right|, \] i.e., the importance-weighted regret at an internal node is equal to the policy regret with respect to the two actions input to that node.

Assume without loss of generality that the classifier chooses $\phi$. If the maximizer comes from the right subtree, then \[ \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= v (h^\Psi | x, \omega, n_\phi) \\ &\leq \sum_{m \in \Lambda (n_\phi)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \] If the maximizer comes from the left subtree, then \[ \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) - r (\phi) \right] + v (h^\Psi | x, \omega, n_\lambda) \\ &= q_n (\Psi | x, \omega) + v (h^\Psi | x, \omega, n_\lambda) \\ &\leq q_n (\Psi | x, \omega) + \sum_{m \in \Lambda (n_\lambda)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \] Terminating the induction at the root yields \[ v (h^\Psi | x, \omega) \leq \sum_{n \in \Lambda (T)} q_n (\Psi | x, \omega) = |\Lambda (T)| q (\Psi | x, \omega). \] Taking the expectation of both sides with respect to $D_x \times D_{\omega|x}$ and noting $|\Lambda (T)| = (|A| - 1)$ completes the proof.