Comments on Machined Learnings: Cost Sensitive Multi Label: An Observation

Paul Mineiro (2011-05-13 10:56):

Ok, I get it now. When training the classifiers in reverse, the regret at the first level of the tree is not zero (at the 000 vs. 001 node it is zero, but at the 100 vs. 101 node it is nonzero). Basically, by sharing the same classifier across an entire level of the tree, I make it very difficult to achieve zero regret (summed across the tree!).

Nonetheless, filtering the training data while advancing through the classifiers does seem like it would improve things in practice.

Paul Mineiro (2011-05-13 09:37):

> For 0-1 loss it seems like this cannot be. Consider a 3-element
> problem, a null example space, and representing each element of the
> power set by a 3-bit string:
>
> 000: probability 49%
> 110: probability 25.5%
> 101: probability 25.5%
>
> Independent classifiers will learn to output 100, right?
> Whereas filtering the training data for the later classifiers in the
> manner I describe on the blog will result in learning to output 110,
> which is better for 0-1 loss.

Whoops: if I train the classifiers in reverse order, I get 000, which is pretty good, but not optimal, even though each subproblem was solved optimally.

More thought needed :)

Paul Mineiro (2011-05-13 09:36):

>> (1) The regret bound for a filter tree scales as the depth. The maximum
>> depth required is log(possible outputs), implying that if there are N of K
>> partial labels, then the depth is log(K-choose-N) ~= N log(K/N). The time
>> complexity can be the same. The real problem is space complexity, which
>> could be K-choose-N.

I agree that in practice most elements of the power set need not be reached (i.e., most things have few labels).

>> (2) The K independent predictors approach can be made consistent by using
>> (logistic or squared loss) regression rather than classification and then
>> thresholding appropriately.

So you are saying that, using a proper loss, each independent classifier estimates P(element k present | x). Is this consistent only for Hamming loss, or is it consistent for any loss (e.g., 0-1 loss on the entire set)?

For 0-1 loss it seems like this cannot be. Consider a 3-element problem, a null example space, and representing each element of the power set by a 3-bit string:

000: probability 49%
110: probability 25.5%
101: probability 25.5%

Independent classifiers will learn to output 100, right?
Whereas filtering the training data for the later classifiers in the manner I describe on the blog will result in learning to output 110, which is better for 0-1 loss.

>> (3) If I understand correctly, this construction is depth K? In that case,
>> the regret bound could be relatively large.

Yes, it is depth K and the regret bound is large. I was just trying to verify that the approach is consistent.

>> (4) My favored approach is a bit different. Instead of having a single
>> classifier at each node, you would have two classifiers at each node, each
>> of which predicts whether you should descend into a subtree. This approach
>> has time complexity N log K and space complexity K. Training at the nodes
>> is slightly tricky, because it's extra important to descend when there are
>> two labels in a particular subtree. Hence, having an importance weight
>> equal to the sum of correct labels in the subtree seems good (or something
>> more complex if you have a more complex loss function than Hamming). I
>> believe a regret bound substantially better than the general cost-sensitive
>> case could be proved for the algorithm. If we are willing to accept time
>> complexity K, then we can use a representational trick along the lines of
>> what you are thinking, where we sum up the predicted function values for
>> all members of the subtree.

Yes, that does seem promising.

Paul Mineiro (2011-05-13 09:35):

Blogger's comment system was down, so John Langford emailed me the following to post:

(1) The regret bound for a filter tree scales as the depth.
The maximum depth required is log(possible outputs), implying that if there are N of K partial labels, then the depth is log(K-choose-N) ~= N log(K/N). The time complexity can be the same. The real problem is space complexity, which could be K-choose-N.

(2) The K independent predictors approach can be made consistent by using (logistic or squared loss) regression rather than classification and then thresholding appropriately.

(3) If I understand correctly, this construction is depth K? In that case, the regret bound could be relatively large.

(4) My favored approach is a bit different. Instead of having a single classifier at each node, you would have two classifiers at each node, each of which predicts whether you should descend into a subtree. This approach has time complexity N log K and space complexity K. Training at the nodes is slightly tricky, because it's extra important to descend when there are two labels in a particular subtree. Hence, having an importance weight equal to the sum of correct labels in the subtree seems good (or something more complex if you have a more complex loss function than Hamming). I believe a regret bound substantially better than the general cost-sensitive case could be proved for the algorithm. If we are willing to accept time complexity K, then we can use a representational trick along the lines of what you are thinking, where we sum up the predicted function values for all members of the subtree.
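The 3-bit example from the thread can be checked directly. The sketch below (plain Python; the joint distribution is the one from the comments, the brute-force search over all eight predictions is added here for illustration) shows why independently thresholded marginals are fine for Hamming loss but poor for 0-1 loss on the entire set:

```python
from itertools import product

# Joint distribution over 3-bit label sets, from the example in the thread.
P = {(0, 0, 0): 0.49, (1, 1, 0): 0.255, (1, 0, 1): 0.255}

# Independent per-label regressors converge to the marginals P(bit k = 1);
# thresholding each at 0.5 gives the joint prediction.
marginals = [sum(p for y, p in P.items() if y[k] == 1) for k in range(3)]
independent = tuple(int(m > 0.5) for m in marginals)

def expected_zero_one(pred):
    """Expected 0-1 loss on the whole set: 1 unless pred matches exactly."""
    return 1.0 - P.get(pred, 0.0)

def expected_hamming(pred):
    """Expected number of mismatched bits."""
    return sum(p * sum(a != b for a, b in zip(pred, y)) for y, p in P.items())

print(independent)  # (1, 0, 0): bit 0's marginal is 0.51, the others 0.255
print(expected_zero_one(independent))  # 1.0, since 100 never occurs

best_01 = min(product([0, 1], repeat=3), key=expected_zero_one)
print(best_01)  # (0, 0, 0), the mode of the joint distribution
```

So 100 is in fact optimal for expected Hamming loss (thresholded marginals are consistent for Hamming, per point (2) above), yet its expected 0-1 loss is 1 because the string 100 has probability zero, while the mode 000 achieves 0.51.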