## Tuesday, May 10, 2011

### Cost-Sensitive Multi-Label: An Observation

I'm faced with a cost-sensitive multi-label classification (CSML) problem, i.e., there is a set of labels $K$ and I am to assign any subset of them to instances (in my case, text documents). One approach is to treat it as a cost-sensitive multi-class classification (CSMC) problem on the power set of labels $\mathcal{P} (K)$. At that point, I could reduce to binary classification using a filter tree. This has some advantages, such as consistency (zero regret on the induced subproblems implies zero regret on the original problem). It also has substantial disadvantages, both theoretical (the constant in the regret bound scales as $O (2^{|K|})$) and practical (the most general implementation has time complexity scaling as $O (2^{|K|})$).
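For concreteness, here is a small sketch of the power-set CSMC view, using Hamming distance to the target set as an illustrative cost: each subset of $K$ becomes a class, and the number of classes blows up as $2^{|K|}$.

```python
from itertools import combinations

# Illustrative only: the CSMC problem induced on the power set P(K).
# For a target label set Y, take the cost of predicting a subset S to be
# its Hamming distance to Y; the class space has 2^|K| elements.
K = ["a", "b", "c"]
Y = {"a", "c"}

subsets = [frozenset(c) for r in range(len(K) + 1) for c in combinations(K, r)]
costs = {S: len(S ^ Y) for S in subsets}  # symmetric difference = Hamming distance

print(len(subsets))                               # 8 = 2^3 classes
print(min(costs, key=costs.get) == frozenset(Y))  # True: the zero-cost class is Y
```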

Another popular approach is to learn $|K|$ independent binary classifiers, one per label; at test time, they are queried independently and the union of the positive predictions is output. This has nice practical properties (time complexity scaling as $O (|K|)$). However, decomposing a problem into a set of independent subproblems is generally a recipe for creating an inconsistent reduction, so I was suspicious of this approach.
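A minimal sketch of this approach; the base learner here is a trivial majority-vote rule purely to keep the sketch self-contained, and would be any binary classifier in practice.

```python
# Sketch of the independent-binary-classifiers approach: one classifier
# per label, trained and queried independently; prediction is the union.

def train_binary(data):
    """Toy stand-in for a binary learner: a constant majority-vote rule."""
    positives = sum(y for _, y in data)
    vote = 1 if 2 * positives > len(data) else 0
    return lambda x: vote

def train_independent(data, labels):
    """data: list of (x, Y) with Y a set of labels; one classifier per label."""
    return {k: train_binary([(x, int(k in Y)) for x, Y in data]) for k in labels}

def predict(classifiers, x):
    """Union of the labels whose classifier fires."""
    return {k for k, c in classifiers.items() if c(x) == 1}

labels = ["sports", "politics", "tech"]
data = [("d1", {"sports"}), ("d2", {"sports", "tech"}), ("d3", {"tech"})]
model = train_independent(data, labels)
print(predict(model, "d4"))  # {'sports', 'tech'} under the majority rule
```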

So here's a fun observation: learning a set of independent binary classifiers is equivalent to a filter tree approach on the CSMC problem under the following conditions.
1. The filter tree is a scoring filter tree, i.e., the classifier at a node whose two input label sets are $\lambda$ and $\phi$ is $\Phi (x) = 1_{f (x; \lambda) > f (x; \phi)}.$
2. The scoring function further decomposes into a weighted sum of indicator functions, $f (x; \lambda) = \sum_{k \in K} 1_{k \in \lambda} f_k (x)$.
3. The loss function for the CSMC problem is Hamming loss.
4. The tree is constructed such that at level $k$, the two inputs to each node differ only in the presence or absence of the $k^{\mathrm{th}}$ element of $K$. Thinking of the elements of $\mathcal{P} (K)$ as bit strings, the tournaments at level $k$ decide the $k^{\mathrm{th}}$ bit of the binary expansion of the winner.

In this case, here's what happens. At the $k^\mathrm{th}$ level, the tournaments are all between sets that differ only in their $k^\mathrm{th}$ significant bit. The classifier for every node at this level has the form $\Phi (x) = 1_{f_k (x) > 0}$ and all the importance weights for all the subproblems at this level are identical (because of the use of Hamming loss). Thus the entire $k^\mathrm{th}$ level of the tree is equivalent to a binary classifier which is independently learning whether or not to include the $k^\mathrm{th}$ element of $K$ into the final result.
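This decomposition is easy to sanity-check numerically: for fixed scores $f_k (x)$, brute-force maximization of $f$ over $\mathcal{P} (K)$ picks out exactly the labels with positive score, which is what the independent per-level decisions produce. A small sketch, with hypothetical score values:

```python
from itertools import combinations

# Fixed x, hypothetical per-label scores f_k(x).
f = {0: 1.3, 1: -0.2, 2: 0.7, 3: -2.1}
K = list(f)

def score(subset):
    """Decomposed scoring function: f(x; lambda) = sum of f_k(x) over k in lambda."""
    return sum(f[k] for k in subset)

# Brute force over the power set P(K).
best = max(
    (frozenset(c) for r in range(len(K) + 1) for c in combinations(K, r)),
    key=score,
)

# Independent per-bit thresholding: include label k iff f_k(x) > 0.
independent = frozenset(k for k in K if f[k] > 0)

print(best == independent)  # True
```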

So the good news is: this means learning independent binary classifiers is consistent if Hamming loss is the loss function on the CSML problem. Of course, the constant in the regret bound is huge, and the structural assumptions on the classifiers are significant, so in practice performance might be quite poor.

What about other loss functions? If the structural assumptions on the classifiers are retained, then each level of the filter tree can still be summarized with a single binary classifier. However, the importance weights of the examples fed to this classifier depend upon the outputs of the classifiers at the previous levels.

As an example, consider 0-1 loss on the entire set of labels. At the lowest level of the tree, the only tournament with a non-zero importance weight (cost difference) is the one which considers whether or not to include the first label conditional on all other labels being correct. At the second level of the tree, the only tournament that could possibly have a non-zero importance weight is the one that considers whether or not to include the second label conditional on all other labels being correct. However, if the first classifier made an error, this condition will not filter up the tree, and all importance weights will be zero. In general, as soon as one of the classifiers makes a mistake, training stops. So the training procedure can be roughly outlined as:
1. Given a training datum $(x, Y) \in X \times \mathcal{P} (K)$,
2. For each $k = 1 \ldots |K|$:
   1. Let $\hat y_k$ be the prediction of classifier $k$ of whether label $k$ is in the target set.
   2. Add $(x, 1_{k \in Y})$ to the training set for classifier $k$. Note all non-zero importance weights are 1, so this is just binary classification.
   3. If $\hat y_k \neq 1_{k \in Y}$, stop iterating over $k$ for this training datum.

If the classifiers make mistakes frequently, this will end up decimating the data to the point of uselessness. Leaving that problem aside, this procedure is intuitively pleasing because it does not waste classifier resources later in the chain on decisions that don't matter according to 0-1 loss on the entire set.
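A minimal sketch of the procedure above; the per-label classifiers are hypothetical stand-ins, and only the construction of the filtered training sets is shown.

```python
# Sketch of the filtered training procedure for 0-1 loss on the whole label
# set: data for classifier k is generated only while all earlier classifiers
# were correct on this datum.

def build_training_sets(data, classifiers):
    """data: list of (x, Y); classifiers: one callable per label index.
    Returns one binary training set per label."""
    K = len(classifiers)
    sets = [[] for _ in range(K)]
    for x, Y in data:
        for k in range(K):
            target = int(k in Y)
            sets[k].append((x, target))        # importance weight is 1
            if classifiers[k](x) != target:    # mistake: later decisions no
                break                          # longer matter under 0-1 loss
    return sets

data = [(0, {0, 1}), (1, {1})]
clfs = [lambda x: 1, lambda x: 1]  # toy classifiers: always include label k
sets = build_training_sets(data, clfs)
print([len(s) for s in sets])  # [2, 1]: the second datum stops after a mistake
```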

1. Blogger's comment system was down, so John Langford emailed me the following to post:

(1) The regret bound for a filter tree scales with the depth. The maximum depth required is log(possible outputs), implying that if there are N of K partial labels, then the depth is:
log (K-choose-N) ~= N log (K/N). The time complexity can be the same. The real problem is space complexity, which could be K-choose-N.

(2) The K independent predictors approach can be made consistent via using (logistic or squared loss) regression rather than classification and then thresholding appropriately.

(3) If I understand correctly, this construction is depth K? In that case, the regret bound could be relatively large.

(4) My favored approach is a bit different. Instead of having a single classifier at each node, you would have two classifiers at each node, each of which predicts whether you should descend into a subtree. This approach has time complexity N log K and space complexity K. Training at the nodes is slightly tricky, because it's extra important to descend when there are two labels in a particular subtree. Hence, having an importance weight equal to the sum of correct labels in the subtree seems good (or something more complex if you have a more complex loss function than hamming). I believe a regret bound substantially better than the general cost sensitive case could be proved for the algorithm. If we are willing to accept time complexity K, then we can use a representational trick along the lines of what you are thinking where we sum up the predicted function values for all members of the subtree.
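A rough sketch of the test-time descent in this approach: a balanced binary tree over the $K$ labels, with a per-subtree "descend?" decision at each internal node. The node classifiers here are hypothetical hand-coded stand-ins; a real implementation would learn them with the importance weighting described above.

```python
# Test-time descent: at each internal node, two classifiers decide whether
# to enter the left and right subtrees; reaching a leaf includes that label.
# When N labels fire, roughly N log K nodes are visited.

def predict_tree(labels, descend_left, descend_right, x):
    if len(labels) == 1:
        return set(labels)
    mid = len(labels) // 2
    out = set()
    if descend_left(labels[:mid], x):
        out |= predict_tree(labels[:mid], descend_left, descend_right, x)
    if descend_right(labels[mid:], x):
        out |= predict_tree(labels[mid:], descend_left, descend_right, x)
    return out

# Toy stand-in classifiers: descend iff the subtree contains a wanted label.
wanted = {"b", "d"}
go = lambda subtree, x: bool(wanted & set(subtree))
print(predict_tree(["a", "b", "c", "d"], go, go, None))  # {'b', 'd'}
```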

2. >> (1) The regret bound for a filter tree scales with the depth. The maximum
>> depth required is log(possible outputs), implying that if there are N of K
>> partial labels, then the depth is:
>> log (K-choose-N) ~= N log (K/N). The time complexity can be the same. The
>> real problem is space complexity, which could be K-choose-N.

I agree that in practice most of the elements of the power set need not be
reached (i.e., most things have few labels).

>>
>> (2) The K independent predictors approach can be made consistent via using
>> (logistic or squared loss) regression rather than classification and then
>> thresholding appropriately.

So you are saying ... using a proper loss ... each independent
classifier estimates P (element k present | x). Is this consistent
only for Hamming loss, or is it consistent for any loss (e.g., 0-1
loss on the entire set)?

For 0-1 loss it seems like this cannot be. Consider a 3 element
problem, null example space, and representing each element of the
power set by a 3-bit string,

000: probability 49%
110: probability 25.5%
101: probability 25.5%

Independent classifiers will learn to output 100, right? Whereas
filtering the training data for the later classifiers in the manner I
describe on the blog will result in learning to output 110 which is
better for 0-1 loss.
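Spelling out the arithmetic: under this distribution a prediction is correct only when it matches the realized bit string exactly, so 100 is never correct (expected 0-1 loss 1), while 110 is correct 25.5% of the time (expected loss 1 - 0.255 = 0.745).

```python
# Expected 0-1 loss of a candidate prediction under the example distribution:
# a prediction is correct only if it matches the realized label set exactly.
dist = {"000": 0.49, "110": 0.255, "101": 0.255}

def expected_01_loss(prediction):
    return 1.0 - dist.get(prediction, 0.0)

print(expected_01_loss("100"))  # independent classifiers: never exactly right
print(expected_01_loss("110"))  # filtered chain: correct 25.5% of the time
```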

>>
>> (3) If I understand correctly, this construction is depth K? In that case,
>> the regret bound could be relatively large.

Yes it is depth K and the regret bound is large. I was just trying to
verify the approach is consistent.

>>
>> (4) My favored approach is a bit different. Instead of having a single
>> classifier at each node, you would have two classifiers at each node,
>> each of which predicts whether you should descend into a subtree. This
>> approach has time complexity N log K and space complexity K. Training at
>> the nodes is slightly tricky, because it's extra important to descend when
>> there are two labels in a particular subtree. Hence, having an importance
>> weight equal to the sum of correct labels in the subtree seems good (or
>> something more complex if you have a more complex loss function than
>> hamming). I believe a regret bound substantially better than the general
>> cost sensitive case could be proved for the algorithm. If we are willing to
>> accept time complexity K, then we can use a representational trick along the
>> lines of what you are thinking where we sum up the predicted function values
>> for all members of the subtree.

Yes that does seem promising.

3. > For 0-1 loss it seems like this cannot be. Consider a 3 element
> problem, null example space, and representing each element of the
> power set by a 3-bit string,
>
> 000: probability 49%
> 110: probability 25.5%
> 101: probability 25.5%
>
> Independent classifiers will learn to output 100, right? Whereas
> filtering the training data for the later classifiers in the manner I
> describe on the blog will result in learning to output 110 which is
> better for 0-1 loss.
>

Whoops, if I train the classifiers in reverse order, I get 000, which
is pretty good, but not optimal, even though each subproblem was
solved optimally.

More thought needed :)

4. Ok I get it now. When training the classifiers in reverse, the regret at the first level of the tree is not zero (at the 000 vs 001 node, it is zero; but at the 100 vs. 101 node, it is nonzero). Basically by sharing the same classifier across the entire level of the tree, I make it very difficult to achieve zero regret (summed across the tree!).

Nonetheless the data filtering while advancing through the classifiers does seem like it would improve things in practice.