Fortunately, Bennett's inequality also applies; applied to the zero-mean variables $(\tilde q(q, w) - Y_i)$, it gives the bound
$$
\Pr\left(\sum_i Y_i < \frac{n w}{1 + w}\right)
= \Pr\left(n \tilde q(q, w) - \sum_i Y_i > n \left(\tilde q(q, w) - \frac{w}{1 + w}\right)\right)
\leq \exp\left(-\frac{n \sigma^2}{a^2} h\!\left(\frac{a t}{\sigma^2}\right)\right),
$$
where
$$
\begin{aligned}
\sigma^2 &= E\left[\left(\tilde q(q, w) - Y_i\right)^2\right] = \tilde q(q, w) \left(1 - \tilde q(q, w)\right), \\
a &= \max\left\{ \tilde q(q, w),\, 1 - \tilde q(q, w) \right\}, \\
t &= \tilde q(q, w) - \frac{w}{1 + w}, \\
h(u) &= (1 + u) \log(1 + u) - u.
\end{aligned}
$$
This bound does a much better job of capturing the structure of the regret as a function of $w$.
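To make the bound concrete, here's a minimal Python sketch of it (a hypothetical `bennett_bound` helper, not code from this post); it assumes $\tilde q(q, w) = q w / (q w + 1 - q)$ as defined in the generalization section below, and treats the regime where $t \leq 0$ as vacuous.

```python
import math

def bennett_bound(q, w, n):
    """Bennett bound on Pr(sum_i Y_i < n w / (1 + w)) for the zero-mean
    variables (q_tilde - Y_i), following the formulas above."""
    q_tilde = q * w / (q * w + 1 - q)   # subsampled label-1 probability (defined below)
    sigma2 = q_tilde * (1 - q_tilde)    # per-sample variance
    a = max(q_tilde, 1 - q_tilde)       # almost-sure bound on |q_tilde - Y_i|
    t = q_tilde - w / (1 + w)           # per-sample deviation being bounded
    if t <= 0:
        return 1.0                      # bound is vacuous in this regime
    h = lambda u: (1 + u) * math.log(1 + u) - u
    return math.exp(-(n * sigma2 / a ** 2) * h(a * t / sigma2))
```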
Here the red and blue dots are the actual regret above and below the integral cutoff boundaries, green is the Azuma-Hoeffding bound, and orange is the Bennett bound. The vertical dashed lines mark the minimum of each bound. Azuma-Hoeffding is wildly pessimistic, while Bennett, though still overly conservative, does a much better job of tracking the actual regret. Both bounds predict an optimal $w$ given $p$ which is independent of $n$; here's how they compare to an exact computation for various $p$. (The exact computation depends upon $n$ but converges as $n \to \infty$; $n = 10^5$ appears to be close to this limit.)
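Because the exponent of the Bennett bound is linear in $n$, minimizing the bound over $w$ gives an optimum that doesn't depend on $n$. A hypothetical grid-search sketch, reusing the `bennett_bound` helper above:

```python
import numpy as np

def bennett_optimal_w(q, n=10**5, ws=np.linspace(0.01, 1.0, 1000)):
    # grid-search the w minimizing the Bennett bound; the argmin is
    # independent of n because the exponent scales linearly in n
    return ws[int(np.argmin([bennett_bound(q, w, n) for w in ws]))]
```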
Generalization
What does Bennett's inequality say about the more general problem? Suppose a distribution $D$ on $X \times Y$ and a single hypothesis $h: X \to Y$ which we are assessing according to 0-1 loss. The actual risk is $l(h) = E_{(x, y) \sim D}\left[ 1_{h(x) \neq y} \right]$. Now suppose we draw training data from $\tilde D_w$, defined by the following procedure (sketched in code after the list):
- Draw $(x, y)$ from $D$.
- If $y = 1$, reject $(x, y)$ with probability $(1 - w)$.
- Otherwise, accept $(x, y)$.
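A minimal rejection-sampling sketch of drawing from $\tilde D_w$, where `draw_from_D` is a hypothetical sampler returning $(x, y)$ pairs from $D$:

```python
import random

def draw_from_D_w(draw_from_D, w):
    """Sample from D_w: keep positives (y == 1) with probability w,
    always keep negatives."""
    while True:
        x, y = draw_from_D()
        if y == 1 and random.random() >= w:
            continue  # reject this positive with probability (1 - w)
        return x, y
```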
The instantaneous loss on the $i^{\mathrm{th}}$ sample is given by
$$
L_i(w, h) = \lambda(Y_i; w, q)\, 1_{h(X_i) \neq Y_i}
= \begin{cases}
0 & \text{with probability } \tilde q(q, w)\, r_1 + (1 - \tilde q(q, w))\, r_0, \\
\lambda(0; w, q) & \text{with probability } (1 - \tilde q(q, w)) (1 - r_0), \\
\lambda(1; w, q) & \text{with probability } \tilde q(q, w) (1 - r_1),
\end{cases}
$$
where
$$
\tilde q(q, w) = \frac{q w}{q w + 1 - q}, \qquad
r_z = E_{(x, y) \sim D}\left[ 1_{h(x) = z} \,|\, y = z \right].
$$
(Note $r_z$ is not affected by subsampling.) The sequence $\sum_i \left( L_i(w, h) - l(h) \right)$ is a martingale, so the Azuma-Hoeffding inequality holds. However, Azuma-Hoeffding is driven by the largest possible deviation between adjacent sequence members, which is smallest when $w = 1$, so it does not indicate that subsampling helps. Bennett's inequality also applies, however, and it is driven by the variance, which is sometimes better:
$$
\begin{aligned}
E\left[ L_i(w, h)^2 \right] &= (1 - \tilde q(q, w)) (1 - r_0)\, \lambda(0; w, q)^2 + \tilde q(q, w) (1 - r_1)\, \lambda(1; w, q)^2, \\
\frac{d}{d w} E\left[ L_i(w, h)^2 \right] &= -\frac{(1 - q)\, q \left(1 - r_1 - (1 - r_0) w^2\right)}{w^2}, \\
\left. \frac{d}{d w} E\left[ L_i(w, h)^2 \right] \right|_{w = 1} &= (1 - q)\, q\, (r_1 - r_0), \\
\frac{d^2}{d w^2} E\left[ L_i(w, h)^2 \right] &= \frac{2 (1 - q)\, q\, (1 - r_1)}{w^3} > 0.
\end{aligned}
$$
What this means is: if $r_1 > r_0$, there is a range of values $w < 1$ for which the variance of the estimator is lower than at $w = 1$. This suggests that some amount of subsampling of positives is beneficial whenever the hypothesis being assessed is more likely to be correct on positive instances than on negative instances, e.g., the trivial hypothesis $h(\cdot) = 1$. Interestingly, this is true even if $q < 1/2$.
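To sanity-check the claim that $r_1 > r_0$ makes some $w < 1$ better, here is a small numerical sketch of $E[L_i(w, h)^2]$. It assumes the importance weights are the density ratios $\lambda(0; w, q) = q w + 1 - q$ and $\lambda(1; w, q) = (q w + 1 - q)/w$, which is consistent with the derivative formulas above; the particular $q$, $r_0$, $r_1$ values are arbitrary.

```python
def second_moment(w, q, r0, r1):
    """E[L_i(w, h)^2] from the case analysis above, assuming the importance
    weights lambda(y; w, q) are the density ratios of D w.r.t. D_w."""
    q_tilde = q * w / (q * w + 1 - q)
    lam0 = q * w + 1 - q          # assumed weight on a y = 0 mistake
    lam1 = (q * w + 1 - q) / w    # assumed weight on a y = 1 mistake
    return (1 - q_tilde) * (1 - r0) * lam0 ** 2 + q_tilde * (1 - r1) * lam1 ** 2

# with r1 > r0, some w < 1 yields a strictly lower second moment than w = 1
# (the mean is fixed at l(h), so this is also lower variance)
q, r0, r1 = 0.9, 0.6, 0.95
ws = [i / 100 for i in range(1, 101)]
best_w = min(ws, key=lambda w: second_moment(w, q, r0, r1))
assert second_moment(best_w, q, r0, r1) < second_moment(1.0, q, r0, r1)
print(best_w)  # close to sqrt((1 - r1) / (1 - r0)), where the derivative vanishes
```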
It is not immediately clear how to use this result, because typically one wants to bound the deviation of the empirical risk over a set of hypotheses, some of which will not satisfy $r_1 > r_0$. Here's one idea: since we know $q = E_{(x, y) \sim D}[1_{y = 1}] > 1/2$, suppose we have some way (with high probability?) to only consider hypotheses such that $E_{(x, y) \sim D}[1_{h(x) = 1}] \geq \rho \geq q$. In that case the solution to
$$
\begin{aligned}
\min \quad & r_1 - r_0 \\
\text{s.t.} \quad & 0 \leq r_1 \leq 1, \\
& 0 \leq r_0 \leq 1, \\
& \tfrac{1}{2} < q \leq \rho \leq 1, \\
& E_{(x, y) \sim D}\left[ 1_{h(x) = 1} \right] = q r_1 + (1 - q)(1 - r_0) \geq \rho,
\end{aligned}
$$
is given by $(r_1 - r_0) = (\rho - q) / (1 - q) \geq 0$, i.e., such a hypothesis set would benefit from subsampling. This constraint doesn't correspond to anything I've ever seen in machine learning software; however, suppose $E_{(x, y) \sim D}[1_{h(x) = 1}] \leq \chi \leq q$. Then the solution to
$$
\begin{aligned}
\min \quad & l(h) = (1 - q)(1 - r_0) + q (1 - r_1) \\
\text{s.t.} \quad & 0 \leq r_1 \leq 1, \\
& 0 \leq r_0 \leq 1, \\
& \tfrac{1}{2} < q \leq 1, \\
& E_{(x, y) \sim D}\left[ 1_{h(x) = 1} \right] = q r_1 + (1 - q)(1 - r_0) \leq \chi, \\
& 0 \leq \chi \leq q,
\end{aligned}
$$
is given by $l(h) = q - \chi$. Since the constant hypothesis $h(\cdot) = 1$ has loss $1 - q$, it is better than any hypothesis with $\chi < 2 q - 1$. Therefore, when the data set is very imbalanced ($q \to 1$), it might be the case that the hypotheses whose estimators are degraded by subsampling are far from the minimum and therefore unlikely to confuse the empirical risk minimizer. This argument obviously needs some polish, but it does correspond intuitively to what happens with online learning on a very imbalanced data set: first the model quickly learns to say everything is the predominant class, then it starts to carve out various exceptions.
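As a quick check of the two closed-form solutions, here's a brute-force grid search over $(r_0, r_1)$ (a hypothetical `check_solutions` helper; the particular $q$, $\rho$, $\chi$ values are arbitrary):

```python
import itertools

def check_solutions(q=0.8, rho=0.9, chi=0.5, steps=200):
    """Grid-search the two small programs above and compare to the closed forms."""
    grid = [i / steps for i in range(steps + 1)]
    pairs = list(itertools.product(grid, grid))
    # minimize r1 - r0 subject to q r1 + (1 - q)(1 - r0) >= rho
    min_gap = min(r1 - r0 for r0, r1 in pairs
                  if q * r1 + (1 - q) * (1 - r0) >= rho)
    # minimize l(h) subject to q r1 + (1 - q)(1 - r0) <= chi
    min_loss = min((1 - q) * (1 - r0) + q * (1 - r1) for r0, r1 in pairs
                   if q * r1 + (1 - q) * (1 - r0) <= chi)
    return (min_gap, (rho - q) / (1 - q)), (min_loss, q - chi)

# min_gap matches (rho - q)/(1 - q) and min_loss matches q - chi, up to float error
print(check_solutions())
```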