
Friday, April 25, 2014

A Discriminative Representation Learning Technique

Nikos and I have developed a technique for learning discriminative features using numerical linear algebra; it gives good results for some problems. The basic idea is as follows. Suppose you have a multiclass problem, i.e., training data of the form $S = \{ (x, y) | x \in \mathbb{R}^d, y \in \{ 1, \ldots, k \} \}$. Here $x$ is the original representation (features) and you want to learn new features that help your classifier. In deep learning this problem is tackled by defining a multi-level parametric nonlinearity of $x$ and optimizing the parameters. Deep learning is awesome but the resulting optimization problems are challenging, especially in the distributed setting, so we were looking for something more computationally felicitous.

First consider the two class case. Imagine looking for features of the form $\phi (w^\top x)$, where $w \in \mathbb{R}^d$ is a “weight vector” and $\phi$ is some nonlinearity. What is a simple criterion for defining a good feature? One idea is for the feature to have small average value on one class and large average value on another. Assuming $\phi$ is non-negative, that suggests maximizing the ratio \[
w^* = \arg \max_w \frac{\mathbb{E}[\phi (w^\top x) | y = 1]}{\mathbb{E}[\phi (w^\top x) | y = 0]}.
\] For the specific choice of $\phi (z) = z^2$ this is tractable, as it results in a Rayleigh quotient between two class-conditional second moments, \[
w^* = \arg \max_w \frac{w^\top \mathbb{E}[x x^\top | y = 1] w}{w^\top \mathbb{E}[x x^\top | y = 0] w},
\] which can be solved via generalized eigenvalue decomposition. Generalized eigenvalue problems have been extensively studied in machine learning and elsewhere, and the above idea looks very similar to many other proposals (e.g., Fisher LDA), but it is different and empirically more effective. I'll refer you to the paper for a more thorough discussion, but I will mention that after the paper was accepted someone pointed out the similarity to CSP, which is a technique from time-series analysis (cf. Ecclesiastes 1:4-11).
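To make the two-class recipe concrete, here is a minimal Matlab sketch (Xpos and Xneg are assumed to be matrices whose rows are the examples of each class; the full script below does the analogous computation for every class pair, with Cholesky-based whitening and multiple eigenvectors per pair):
% minimal sketch of the two-class case: top direction of the Rayleigh quotient
% between the class-conditional (uncentered) second moments
% Xpos, Xneg: assumed (examples x features) matrices for the two classes
d = size(Xpos, 2);
Cpos = double(Xpos' * Xpos) / size(Xpos, 1);          % E[x x' | y = 1]
Cneg = double(Xneg' * Xneg) / size(Xneg, 1);          % E[x x' | y = 0]
gamma = 0.1;                                          % small ridge for numerical stability
[w, lambda] = eigs(Cpos, Cneg + gamma * eye(d), 1);   % generalized eigenvector
% the learned feature for an example x is then phi(w' * x) = (w' * x).^2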

The features that result from this procedure pass the smell test. For example, starting from a raw pixel representation on mnist, the weight vectors can be visualized as images; the first weight vector for discriminating 3 vs. 2 (shown as an image in the original post) looks like a pen stroke, cf. figure 1D of Ranzato et al.

We make several additional observations in the paper. The first is that multiple isolated extrema of the Rayleigh quotient are useful if the associated generalized eigenvalues are large, i.e., one can extract multiple features from a single Rayleigh quotient. The second is that, for moderate $k$, we can extract features for each class pair independently and use all the resulting features to get good results. The third is that the resulting directions have additional structure which is not completely captured by a squaring non-linearity, which motivates a (univariate) basis function expansion. The fourth is that, once the original representation has been augmented with additional features, the procedure can be repeated, which sometimes yields additional improvements. Finally, we can compose this with randomized feature maps to approximate the corresponding operations in a RKHS, which sometimes yields additional improvements. We also made a throw-away comment in the paper that computing class-conditional second moment matrices is easily done in a map-reduce style distributed framework. This was actually a major motivation for us to explore in this direction; it just didn't fit well into the exposition of the paper, so we de-emphasized it.
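To illustrate the map-reduce remark: the class-conditional second moments are plain sums over examples, so each data partition can accumulate its own partial sums and a final reduce simply adds them up. A rough single-machine sketch (chunks and labels are hypothetical placeholders for the partitioned examples and their class labels):
% sketch: accumulate class-conditional second moments over data partitions
% chunks{c}: (examples x features) matrix, labels{c}: matching label vector (assumed given)
k = 10; d = size(chunks{1}, 2);
sums = zeros(d, d, k); counts = zeros(k, 1);
for c = 1:numel(chunks)                                    % "map": one partition at a time
    for i = 1:k
        Xi = chunks{c}(labels{c} == i, :);
        sums(:, :, i) = sums(:, :, i) + double(Xi' * Xi);  % partial (unnormalized) moment
        counts(i) = counts(i) + size(Xi, 1);
    end
end
covs = zeros(d, d, k);
for i = 1:k                                                % "reduce": normalize by class counts
    covs(:, :, i) = sums(:, :, i) / counts(i);
end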

Combining the above ideas, along with Nikos' preconditioned gradient learning for multiclass described in a previous post, leads to the following Matlab script, which gets 91 test errors on (permutation invariant) mnist. Note: you'll need to download mnist_all.mat from Sam Roweis' site to run this.
function calgevsquared

more off;
clear all;
close all;

start=tic;
load('mnist_all.mat');
xxt=[train0; train1; train2; train3; train4; train5; ...
     train6; train7; train8; train9];
xxs=[test0; test1; test2; test3; test4; test5; test6; test7; test8; test9];
kt=single(xxt)/255;
ks=single(xxs)/255;
st=[size(train0,1); size(train1,1); size(train2,1); size(train3,1); ...
    size(train4,1); size(train5,1); size(train6,1); size(train7,1); ...
    size(train8,1); size(train9,1)];
ss=[size(test0,1); size(test1,1); size(test2,1); size(test3,1); ... 
    size(test4,1); size(test5,1); size(test6,1); size(test7,1); ...
    size(test8,1); size(test9,1)];
paren = @(x, varargin) x(varargin{:});
yt=zeros(60000,10);
ys=zeros(10000,10);
I10=eye(10);
lst=1;
for i=1:10; yt(lst:lst+st(i)-1,:)=repmat(I10(i,:),st(i),1); lst=lst+st(i); end
lst=1;
for i=1:10; ys(lst:lst+ss(i)-1,:)=repmat(I10(i,:),ss(i),1); lst=lst+ss(i); end

clear i st ss lst
clear xxt xxs
clear train0 train1 train2 train3 train4 train5 train6 train7 train8 train9
clear test0 test1 test2 test3 test4 test5 test6 test7 test8 test9

[n,k]=size(yt);
[m,d]=size(ks);

gamma=0.1;
top=20;
for i=1:k
    ind=find(yt(:,i)==1);
    kind=kt(ind,:);
    ni=length(ind);
    covs(:,:,i)=double(kind'*kind)/ni;
    clear ind kind;
end
filters=zeros(d,top*k*(k-1),'single');
last=0;
threshold=0;
for j=1:k
    covj=squeeze(covs(:,:,j)); l=chol(covj+gamma*eye(d))';
    for i=1:k
        if j~=i
            covi=squeeze(covs(:,:,i));
            C=l\covi/l'; CS=0.5*(C+C'); [v,L]=eigs(CS,top); V=l'\v; % whitened generalized eigenproblem for class pair (i,j)
            take=find(diag(L)>=threshold);
            batch=length(take);
            fprintf('%u,%u,%u ', i, j, batch);
            filters(:,last+1:last+batch)=V(:,take);
            last=last+batch;
        end
    end
    fprintf('\n');
end

clear covi covj covs C CS V v L

% NB: augmenting kt/ks with .^2 terms is very slow and doesn't help

filters=filters(:,1:last);
ft=kt*filters;
clear kt;
kt=[ones(n,1,'single') sqrt(1+max(ft,0))-1 sqrt(1+max(-ft,0))-1]; % split +/- parts and apply a concave basis expansion
clear ft;
fs=ks*filters;
clear ks filters;
ks=[ones(m,1,'single') sqrt(1+max(fs,0))-1 sqrt(1+max(-fs,0))-1];
clear fs;

[n,k]=size(yt);
[m,d]=size(ks);

for i=1:k
    ind=find(yt(:,i)==1);
    kind=kt(ind,:);
    ni=length(ind);
    covs(:,:,i)=double(kind'*kind)/ni;
    clear ind kind;
end

filters=zeros(d,top*k*(k-1),'single');
last=0;
threshold=7.5;
for j=1:k
    covj=squeeze(covs(:,:,j)); l=chol(covj+gamma*eye(d))';
    for i=1:k
        if j~=i
            covi=squeeze(covs(:,:,i));
            C=l\covi/l'; CS=0.5*(C+C'); [v,L]=eigs(CS,top); V=l'\v;
            take=find(diag(L)>=threshold);
            batch=length(take);
            fprintf('%u,%u,%u ', i, j, batch);
            filters(:,last+1:last+batch)=V(:,take);
            last=last+batch;
        end
    end
    fprintf('\n');
end
fprintf('gamma=%g,top=%u,threshold=%g\n',gamma,top,threshold);
fprintf('last=%u filtered=%u\n', last, size(filters,2) - last);

clear covi covj covs C CS V v L

filters=filters(:,1:last);
ft=kt*filters;
clear kt;
kt=[sqrt(1+max(ft,0))-1 sqrt(1+max(-ft,0))-1];
clear ft;
fs=ks*filters;
clear ks filters;
ks=[sqrt(1+max(fs,0))-1 sqrt(1+max(-fs,0))-1];
clear fs;

trainx=[ones(n,1,'single') kt kt.^2];
clear kt;
testx=[ones(m,1,'single') ks ks.^2];
clear ks;

% ridge regression onto the label indicators, solved via a Cholesky factorization
C=chol(0.5*(trainx'*trainx)+sqrt(n)*eye(size(trainx,2)),'lower');
w=C'\(C\(trainx'*yt));
pt=trainx*w;
ps=testx*w;

[~,trainy]=max(yt,[],2);
[~,testy]=max(ys,[],2);

for i=1:5
        xn=[pt pt.^2/2 pt.^3/6 pt.^4/24];
        xm=[ps ps.^2/2 ps.^3/6 ps.^4/24];
        c=chol(xn'*xn+sqrt(n)*eye(size(xn,2)),'lower');
        ww=c'\(c\(xn'*yt));
        ppt=SimplexProj(xn*ww);
        pps=SimplexProj(xm*ww);
        w=C'\(C\(trainx'*(yt-ppt)));
        pt=ppt+trainx*w;
        ps=pps+testx*w;

        [~,yhatt]=max(pt,[],2);
        [~,yhats]=max(ps,[],2);
        errort=sum(yhatt~=trainy)/n;
        errors=sum(yhats~=testy)/m;
        fprintf('%u,%g,%g\n',i,errort,errors)
end
fprintf('%4s\t', 'pred');
for true=1:k
        fprintf('%5u', true-1);
end
fprintf('%5s\n%4s\n', '!=', 'true');
for true=1:k
        fprintf('%4u\t', true-1);
        trueidx=find(testy==true);
        for predicted=1:k
                predidx=find(yhats(trueidx)==predicted);
                fprintf('%5u', sum(predidx>0));
        end
        predidx=find(yhats(trueidx)~=true);
        fprintf('%5u\n', sum(predidx>0));
end

toc(start)

end

% http://arxiv.org/pdf/1309.1541v1.pdf
function X = SimplexProj(Y)
  [N,D] = size(Y);
  X = sort(Y,2,'descend');
  Xtmp = bsxfun(@times,cumsum(X,2)-1,(1./(1:D)));
  X = max(bsxfun(@minus,Y,Xtmp(sub2ind([N,D],(1:N)',sum(X>Xtmp,2)))),0);
end
When I run this on my desktop machine it yields
>> calgevsquared
2,1,20 3,1,20 4,1,20 5,1,20 6,1,20 7,1,20 8,1,20 9,1,20 10,1,20 
1,2,20 3,2,20 4,2,20 5,2,20 6,2,20 7,2,20 8,2,20 9,2,20 10,2,20 
1,3,20 2,3,20 4,3,20 5,3,20 6,3,20 7,3,20 8,3,20 9,3,20 10,3,20 
1,4,20 2,4,20 3,4,20 5,4,20 6,4,20 7,4,20 8,4,20 9,4,20 10,4,20 
1,5,20 2,5,20 3,5,20 4,5,20 6,5,20 7,5,20 8,5,20 9,5,20 10,5,20 
1,6,20 2,6,20 3,6,20 4,6,20 5,6,20 7,6,20 8,6,20 9,6,20 10,6,20 
1,7,20 2,7,20 3,7,20 4,7,20 5,7,20 6,7,20 8,7,20 9,7,20 10,7,20 
1,8,20 2,8,20 3,8,20 4,8,20 5,8,20 6,8,20 7,8,20 9,8,20 10,8,20 
1,9,20 2,9,20 3,9,20 4,9,20 5,9,20 6,9,20 7,9,20 8,9,20 10,9,20 
1,10,20 2,10,20 3,10,20 4,10,20 5,10,20 6,10,20 7,10,20 8,10,20 9,10,20 
2,1,15 3,1,20 4,1,20 5,1,20 6,1,20 7,1,20 8,1,20 9,1,20 10,1,20 
1,2,20 3,2,20 4,2,20 5,2,20 6,2,20 7,2,20 8,2,20 9,2,20 10,2,20 
1,3,20 2,3,11 4,3,17 5,3,20 6,3,20 7,3,19 8,3,18 9,3,18 10,3,19 
1,4,20 2,4,12 3,4,20 5,4,20 6,4,12 7,4,20 8,4,19 9,4,15 10,4,20 
1,5,20 2,5,12 3,5,20 4,5,20 6,5,20 7,5,20 8,5,16 9,5,20 10,5,9 
1,6,18 2,6,13 3,6,20 4,6,12 5,6,20 7,6,18 8,6,20 9,6,13 10,6,18 
1,7,20 2,7,14 3,7,20 4,7,20 5,7,20 6,7,20 8,7,20 9,7,20 10,7,20 
1,8,20 2,8,14 3,8,20 4,8,20 5,8,20 6,8,20 7,8,20 9,8,20 10,8,12 
1,9,20 2,9,9 3,9,20 4,9,15 5,9,18 6,9,11 7,9,20 8,9,17 10,9,16 
1,10,20 2,10,14 3,10,20 4,10,20 5,10,14 6,10,20 7,10,20 8,10,12 9,10,20 
gamma=0.1,top=20,threshold=7.5
last=1630 filtered=170
1,0.0035,0.0097
2,0.00263333,0.0096
3,0.00191667,0.0092
4,0.00156667,0.0093
5,0.00141667,0.0091
pred        0    1    2    3    4    5    6    7    8    9   !=
true
   0      977    0    1    0    0    1    0    1    0    0    3
   1        0 1129    2    1    0    0    1    1    1    0    6
   2        1    1 1020    0    1    0    0    6    3    0   12
   3        0    0    1 1004    0    1    0    2    1    1    6
   4        0    0    0    0  972    0    4    0    2    4   10
   5        1    0    0    5    0  883    2    1    0    0    9
   6        4    2    0    0    2    2  947    0    1    0   11
   7        0    2    5    0    0    0    0 1018    1    2   10
   8        1    0    1    1    1    1    0    1  966    2    8
   9        1    1    0    2    5    2    0    4    1  993   16
Elapsed time is 186.147659 seconds.
That's a pretty good confusion matrix, comparable to state-of-the-art deep learning results on (permutation invariant) mnist. In the paper we report a slightly worse number (96 test errors) because for a paper we have to choose hyperparameters via cross-validation on the training set rather than cherry-pick them as for a blog post.

The technique as stated here is really only useful for tall-thin design matrices (i.e., lots of examples but not too many features): if the original feature dimensionality is too large (e.g., $> 10^4$) then naive use of standard generalized eigensolvers becomes slow or infeasible, and other tricks are required. Furthermore, if the number of classes is too large then solving $O (k^2)$ generalized eigenvalue problems is also not reasonable. We're working on remedying these issues, and we're also excited about extending this strategy to structured prediction. Hopefully we'll have more to say about it at the next few conferences.

Saturday, November 9, 2013

We can hash that

The lab is located in the Pacific Northwest, so it's natural to ask what machine learning primitives are as ubiquitously useful as pickling. There are two leading candidates at the moment: randomized feature maps and the hashing trick. The latter, it turns out, can be beneficially employed for randomized PCA.

Randomized PCA algorithms, as I've discussed recently, are awesome. Empirically, two (or more) pass algorithms seem necessary to get really good results. Ideally, one could just do one pass over the data, applying (structured) randomness to project down to some computationally suitable dimension, and then use exact techniques to finish it off. In practice this doesn't work very well, although the computational benefits (single pass over the data and low memory usage) sometimes justify it. Two pass algorithms use the first pass to construct an orthogonal basis, and then use that basis for the second pass. In addition to that extra data pass, two pass algorithms require storage for the basis, and an orthogonalization step. If the original feature dimensionality is $p$ and the number of desired components is $k$ then the storage requirement is $O (p k)$ and the orthogonalization step has time complexity $O (p k^2)$. If $O (p k)$ fits in main memory, this is not a problem, but otherwise, it can be a bother as essentially a distributed QR decomposition is required.

The hashing trick (more generally, structured randomness) can provide a bridge between the two extremes. The idea is to use structured randomness to reduce the feature dimensionality from $p$ to $d$, such that $O (d k)$ fits in main memory, and then use a two pass randomized algorithm. This can be seen as an interpolation between a one pass algorithm leveraging structured randomness and a traditional two-pass algorithm. Practically speaking, we're just trying to use the available space resources to get a good answer. We've found hashing to be a good form of structured randomness for sparse domains such as text or graph data, while other forms of structured randomness (e.g., subsampled Hartley transforms) are better for dense data. When using hashing, other conveniences of the hashing trick, such as not needing to know the feature cardinality of the input data a priori, are inherited by the approach.

These randomized methods should not intimidate: once you understand them, they are very simple. Here is some Matlab to do randomized PCA with hashing:
function H=makehash(d,p)
  % random sparse hashing matrix: each of the d input features is mapped
  % to one of p buckets with a random sign (the hashing trick)
  i = 1:d;                      % row index: original feature
  j = 1+mod(randperm(d),p);     % column index: hashed bucket
  s = 2*randi(2,1,d)-3;         % random sign in {-1,+1}
  H = sparse(i,j,s,d,p);
end
function [V,L]=hashpca(X,k,H)
  % two-pass randomized PCA of the hashed data X*H
  [~,p] = size(H);
  Omega = randn(p,k+5);             % random test matrix with a little oversampling
  [n,~] = size(X);
  Z = (X*H)'*((X*H)*Omega)/n;       % first pass: sketch the (uncentered) covariance
  Q = orth(Z);                      % orthonormal basis for its approximate range
  Z = (X*H)'*((X*H)*Q)/n;           % second pass: apply the covariance to the basis
  [V,Lm,~] = svd(Z,'econ');
  V = V(:,1:k);                     % top k principal directions (in hashed space)
  L = diag(Lm(1:k,1:k));            % associated eigenvalue estimates
end
which you can invoke with something like
>> H=makehash(1000000,100000); [V,L]=hashpca(sprandn(4000000,1000000,1e-5),5,H); L'

ans =

   1.0e-03 *

    0.1083    0.1082    0.1081    0.1080    0.1079
So as usual one benefit is the shock-and-awe of allowing you to achieve some computation on your commodity laptop that brings other implementations to their knees. Here's a picture that results from PCA-ing a publicly available Twitter social graph on my laptop using about 800 megabytes of memory. The space savings from hashing is only about a factor of 20, so if you had a machine with 16 gigabytes of memory you could have done this with redsvd without difficulty, but of course with larger data sets eventually memory gets expensive.

If you like this sort of thing, you can check out the arxiv paper, or you can visit the NIPS Randomized Methods for Machine Learning workshop where Nikos will be talking about it. Arun Kumar, who interned at CISL this summer, also has a poster at Biglearn regarding a distributed variant implemented on REEF.

Tuesday, December 13, 2011

Visualizing the Crowd

Roughly a year ago I read a paper by Welinder et al. titled The Multidimensional Wisdom of Crowds. At the time I was just starting to heavily leverage crowdsourcing for machine learning tasks and the paper jump-started my thoughts regarding crowdsourced data sets. So I'm happy to announce that I've added visualization support to playerpiano inspired by this paper.

I say ``inspired by'' because the model is quite a bit simpler. In particular since in my data sets there are typically very few ratings per item (e.g., 3), I continue my tradition of a simple item model (namely, a single scalar difficulty parameter $\beta$). Therefore instead of embedding items, I embed the hidden labels. Each worker is modeled as a probabilistic classifier driven by the distance from the hidden label prototype, \[
p (l_{ij} = r | \alpha, \beta, \tau, z) \propto \exp (-\beta_j \lVert \tau_{z_j} + \alpha_{i z_j} - \tau_r - \alpha_{ir} \rVert^2).
\] Here $l_{ij}$ is the label reported by worker $i$ on item $j$, $\alpha_{ir}$ is the $d$-dimensional bias vector for worker $i$ and label $r$, $\beta_j$ is the difficulty parameter for item $j$, $\tau_r$ is the $d$-dimensional prototype vector for label $r$, $z_j$ is the true hidden label for item $j$, and $d$ is the dimensionality of the embedding. Although the $\tau$ need to be randomly initialized to break symmetry, this parameterization ensures that $\alpha_{ir} = 0$ is a reasonable starting condition. The $\alpha$ are $L^2$ regularized (Gaussian prior) but the $\tau$ are not (uninformative prior). A note about invariances: $d$ symmetries are eliminated by translating and rotating the $\tau$ into canonical position ($\tau_0$ is constrained to be at the origin, $\tau_1$ is constrained to be in the subspace spanned by the first unit vector, etc.).
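For concreteness, here is a minimal sketch of the resulting distribution over reported labels for a single worker-item pair (tau, alphai, betaj, and zj are assumed given, with the shapes described above):
% sketch: distribution over reported labels for worker i on item j
% tau: (labels x d) prototypes, alphai: (labels x d) bias for worker i,
% betaj: scalar item difficulty, zj: index of the true hidden label (all assumed given)
center = tau(zj, :) + alphai(zj, :);                        % worker i's position for the true label
dist2 = sum(bsxfun(@minus, tau + alphai, center).^2, 2);    % squared distance to each candidate label
logp = -betaj * dist2;
p = exp(logp - max(logp)); p = p / sum(p);                  % normalize (softmax)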

Although my motivation was visualization (corresponding to $d = 2$ or $d = 3$), there are two other possible uses. $d = 1$ is akin to a non-monotonic ordinal constraint and might be appropriate for some problems. Larger $d$ are potentially useful since there is a reduction of per-worker parameters from $O (|L|^2)$ to $O (d |L|)$, which might be relevant for multi-label problems handled by reduction.

Inference proceeds as before (I used multinomial logistic regression for the classifier), except of course the worker model has changed. In practice this worker model is roughly 3x slower than the multinomial worker model, but since this worker model results in a reduction of per-worker parameters perhaps the fair comparison is against a low-rank approximation, which is also slower. Here is the software working through my canonical demonstration task, predicting the ethnicity of a Twitter user from their profile.
strategy = nominalembed
initial_t = 10000
eta = 1.0
rho = 0.9
n_items = 16547
n_labels = 9
n_worker_bits = 16
n_feature_bits = 18
n_dims = 2
seed = 45
test_only = false
prediction file = (no output)
data file = (stdin)
cumul    since    cumul    since      example current current current  current
avg q    last     avg ce   last       counter   label predict ratings features
-1.64616 -1.64616 -1.90946 -1.90946         2      -1       2       4       30
-1.60512 -1.56865 -1.93926 -1.95912         5      -1       2       3       32
-1.38015 -1.15517 -2.13355 -2.32784        10      -1       1       4       28
-1.11627 -0.82685 -2.08542 -2.03194        19      -1       2       3       21
-0.89318 -0.63424 -1.89668 -1.68574        36      -1       1       3       35
-0.90385 -0.91498 -1.62015 -1.31849        69      -1       8       4       27
-0.99486 -1.0903  -1.5287  -1.43162       134      -1       1       4       54
-0.93116 -0.86077 -1.42049 -1.30809       263      -1       1       4       45
-0.90436 -0.87592 -1.47783 -1.5365        520      -1       1       3       13
-0.92706 -0.95001 -1.42042 -1.36223      1033      -1       2       1       11
-0.96477 -1.00259 -1.33948 -1.25791      2058      -1       8       3       21
-0.95079 -0.93672 -1.2513  -1.16272      4107      -1       1       3       44
-0.91765 -0.88423 -1.13014 -1.0087       8204      -1       0       3       26
-0.90145 -0.88529 -0.98977 -0.84921     16397      -1       8       3       23
-0.86520 -0.82882 -0.80860 -0.62731     32782      -1       8       3       20
-0.83186 -0.79852 -0.63999 -0.47132     65551      -1       1       3       56
-0.79732 -0.76279 -0.50123 -0.36243    131088      -1       2       3       35
-0.77279 -0.74826 -0.40255 -0.30386    262161      -1       8       3       41
-0.75345 -0.73413 -0.33804 -0.27352    524306      -1       2       3       43
-0.74128 -0.72911 -0.29748 -0.25692   1048595      -1       1       4       45
-0.73829 -0.72691 -0.28774 -0.25064   1323760      -1       1       3       27
applying deferred prior updates ... finished

tau:
     \  latent dimension
      |   0       1   
label |
    0 | 0.0000  0.0000
    1 | 2.6737  0.0000
    2 | 3.5386  -1.3961
    3 | 1.3373  -1.2188
    4 | -1.5965 -1.4927
    5 | 0.0136  -2.9098
    6 | -2.4236 1.4345
    7 | -0.0450 2.2672
    8 | 2.1513  -1.5638
  447.48s user 1.28s system 97% cpu 7:38.84 total
The above process produces estimates (posterior distributions) over the hidden labels for each item as well as a classifier that will attempt to generalize to novel instances and a worker model that will attempt to generalize to novel workers. In addition several visualizable things fall out of this:
  1. The hidden label prototype vectors $\tau_r$. Being closer together suggests two labels are more likely to be confused.
  2. The per-worker noise vector $\alpha_{ir}$. These adjust the hidden label prototypes per user, leading to differences in bias and accuracy.
  3. The items can be placed into the latent space by forming a convex combination of hidden label prototype vectors via the posterior distribution over labels.
Here's how the major labels fall in a 2-dimensional embedding (the figure is in the original post). The text of the label is centered upon the value of $\tau$ for that label (for a novel worker, $\alpha_{ir} = 0$, so the $\tau$ define the default confusion matrix). The typical $\beta$ is 1 so a distance of 3 on this plot indicates the likelihood of confusion is very low.


Results are dependent upon the random seed. The most popular labels (Asian, Hispanic, Black, White and N/A) maintain their relative positions but the less popular labels move around. Here's the above plot for a different random seed: note the x-axis has shrunk, but this will be more convenient for subsequent plots.


I'll stick with this random seed for the remainder of the plots. Now I'll place a dot for each worker's prototype vector ($\tau_z + \alpha_{iz}$) on the plot.


The pattern of dots provides some intuition about the distribution of error patterns across the worker population. For instance, the dots around the Hispanic label have more horizontal than vertical spread. That suggests there is more variation in distinguishing between Whites and Hispanics versus distinguishing between Blacks and Hispanics. The distinction between Whites and Hispanics is more cultural than racial; the US Census Bureau lists White as a race, but ``Hispanic or Latino'' as an ethnicity; thus in some sense this is poor experimental design, but since advertisers care strongly about this distinction, I have to make it work.

Finally here are some profile photos embedded into the latent space according to the posterior distribution over the hidden label for the profile (the original post has a zoomable vector version of this figure).


In some cases the photos don't appear to make sense given their embedded location. Some of this is because the workers are noisy labelers. However the workers have access to and are basing their labeling decisions on the entire profile. Therefore these photos are best thought of as ``examples of the kind of profile photo that particular ethnicities choose to use'', rather than examples of pictures of people of that ethnicity per se.

The latest version of playerpiano is available from the Google code repository.

Wednesday, November 23, 2011

Ordered Logistic Regression is a Hot Mess

I've added ordinal support to playerpiano. If you want to predict whether somebody is Hot or Not, this is now the tool for you.[1] (Best line from the Wikipedia article: ``Moreover, according to these researchers, one of the basic functions of the brain is to classify images into a hot or not type categorization.'' It's clear that brain researchers have all the fun.)

Although I already had a worker model I needed a classifier to go with it. Ordered logistic regression seemed like the natural choice but I ended up not using it for computational reasons. The ordered logistic regression probability model is \[
\begin{aligned}
P (Y = j | X = x; w, \kappa) &= \frac{1}{1 + \exp (w \cdot x - \kappa_{j+1})} - \frac{1}{1 + \exp (w \cdot x - \kappa_j)},
\end{aligned}
\] where $\kappa_0 = -\infty$ and $\kappa_{n+1} = \infty$. So the first problem is that unless the constraint $i < j \implies \kappa_i < \kappa_j$ is enforced, the predicted probabilities go negative. Since I represent probabilities with their logarithms that's a problem for me. Even worse, however, is that the formula for the gradient of a class probability with respect to the weights is not very convenient computationally.

Contrast this with the Polytomous Rasch model, \[
\begin{aligned}
p (Y = 0 | X = x; w, \kappa) &\propto 1 \\
p (Y = j | X = x; w, \kappa) &\propto \exp \left(\sum_{k=1}^j (w \cdot x - \kappa_k) \right)
\end{aligned}
\] There's no particular numerical difficulty with violating $i < j \implies \kappa_i < \kappa_j$. Of course, if that does happen, it strongly suggests something is very wrong (such as, the response variable is not actually ordered the way I presume), but the point is I can do unconstrained optimization and then check for sanity at the end. In addition computing the gradient of a class probability with respect to the weights is comparatively pleasant. Therefore I went with the Polytomous Rasch functional form.
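Here is a minimal sketch of computing the Polytomous Rasch class probabilities in log space (w, x, and the unconstrained cutpoint vector kappa are assumed given as column vectors):
% sketch: Polytomous Rasch class probabilities, computed in log space
% w: weights, x: features, kappa: cutpoints (unconstrained), all assumed given
s = w' * x;
logits = [0; cumsum(s - kappa(:))];   % unnormalized log p(Y = j), j = 0..numel(kappa)
logits = logits - max(logits);        % guard against overflow
p = exp(logits) / sum(exp(logits));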

Here's an example run on a dataset trying to predict the (discretized) age of a Twitter user from their profile.
strategy = ordinal
initial_t = 10000
eta = 0.1
rho = 0.9
n_items = 11009
n_labels = 8
n_worker_bits = 16
n_feature_bits = 18
test_only = false
prediction file = (no output)
data file = (stdin)
cumul    since    cumul    since      example current current current  current
avg q    last     avg ce   last       counter   label predict ratings features
-1.15852 -1.15852 -2.20045 -2.20045         2      -1       2       3       33
-1.21748 -1.25678 -1.8308  -1.58437         5      -1       2       4       15
-1.20291 -1.1873  -1.89077 -1.95075        10      -1       2       3       34
-1.15344 -1.09367 -1.94964 -2.01505        19      -1       2       1       18
-1.21009 -1.2637  -1.99869 -2.05351        36      -1       4       1       29
-1.13031 -1.04421 -1.80028 -1.58384        69      -1       3       2       46
-1.1418  -1.15346 -1.58537 -1.35723       134      -1       3       2       35
-1.14601 -1.15028 -1.38894 -1.18489       263      -1       2       4       31
-1.1347  -1.12285 -1.14685 -0.89911       520      -1       3       2       42
-1.12211 -1.10868 -1.03302 -0.91764      1033      -1       3       3       26
-1.11483 -1.10755 -0.91798 -0.80203      2058      -1       3       3       43
-1.10963 -1.10447 -0.82174 -0.72509      4107      -1       3       4       16
-1.07422 -1.03901 -0.82659 -0.83145      8204      -1       2       4       29
-1.02829 -0.98195 -0.84504 -0.86352     16397      -1       3       2       55
-0.98414 -0.93991 -0.85516 -0.86528     32782      -1       2       1       16
-0.94415 -0.90447 -0.84898 -0.84281     65551      -1       2       4       27
-0.90247 -0.86075 -0.86127 -0.87355    131088      -1       2       4       15
-0.88474 -0.83311 -0.86997 -0.89529    176144      -1       4       3       27
applying deferred prior updates ... finished
gamma = 0.4991 1.4993 2.5001 3.5006 4.5004 5.5001 6.5001
  13.65s user 0.19s system 89% cpu 15.455 total
playerpiano is available from the Google code repository.

Footnote 1

In actuality Hot or Not is a bad example, because there is probably no universal ground truth hotness; rather it is a personalized concept, and therefore perhaps better handled by personalization algorithms such as this one applied to spam filtering. playerpiano is more appropriate for problems with an objective ground truth, such as predicting the age of a Twitter user based upon their Twitter profile. Doesn't sound as sexy, does it? Exactly. That's why it's in a footnote.

Wednesday, November 16, 2011

Logistic Regression for Crowdsourced Data

Lately I have been processing crowdsourced data with generative models to create a distribution over the ground truth label. I then convert that distribution into a cost-vector for cost-sensitive classification by taking the expectation of my classification loss function with respect to the distribution over ground truth. Because the generative models assume the typical worker is typically correct, they are consensus-driven: they will assume that a worker who consistently disagrees with peers when assigning a label is less accurate, and should therefore contribute less to the distribution over ground truth.
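Concretely, taking that expectation is just a matrix-vector product; a one-line sketch (lossmat and post are hypothetical names for the loss matrix and the posterior over ground truth):
% sketch: expected loss of each action under the posterior over ground truth
% lossmat: (actions x labels) loss matrix, post: (labels x 1) posterior (both assumed given)
cost = lossmat * post;   % cost(a) = sum_z lossmat(a, z) * post(z)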

Raykar et al. note that a classifier trained on the crowdsourced data will ultimately agree or disagree with particular crowdsourced labels. It would be nice to use this to inform the model of each worker's likely errors, but in the sequential procedure I've been using so far, there is no possibility of this: first ground truth is estimated, then the classifier is estimated. Consequently they propose to jointly estimate ground truth and the classifier to allow each to inform the other.

At this point let me offer some plate diagrams to help elucidate.


This is a plate diagram corresponding to the generative models I've been using so far. An unobserved ground truth label $z$ combines with a per-worker model parametrized by vector $\alpha$ and scalar item difficulty $\beta$ to create an observed worker label $l$ for an item. $\mu$, $\rho$, and $p$ are hyperprior parameters for the prior distributions of $\alpha$, $\beta$, and $z$ respectively. Depending upon the problem (multiclass, ordinal multiclass, or multilabel) the details of how $z$, $\alpha$, and $\beta$ produce a distribution over $l$ change but the general structure is given by the above diagram.

Raykar et al. extend the generative model to allow for observed item features.


The diagram supposes that item features $\psi$ and worker labels $l$ are emitted conditionally independently given the true label $z$. This sounds bogus, since presumably the item features drive the worker directly or at least indirectly via the scalar difficulty, unless perhaps the item features are completely inaccessible to the crowdsource worker. It might be a reasonable next step to try and enrich the above diagram to address the concern, but the truth is all generative models are convenient fictions, so I'm using the above for now. Raykar et al. provide a batch EM algorithm for the joint classification, but the above fits nicely into the online algorithm I've been using.

Here's the online procedure, for each input pair $(\psi, \{ (w_i, l_i) \})$.
  1. Using the item features $\psi$, interrogate a classifier trained using a proper scoring rule, and interpret the output as $P (z | \psi)$.
  2. Use $P (z | \psi)$ as the prior distribution for $z$ in the online algorithms previously discussed for processing the set of crowdsourced labels $\{ (w_i, l_i) \}$. This produces result $P (z | \psi, \{ (w_i, l_i ) \})$.
  3. Update the classifier using SGD on the expected proper scoring rule loss against distribution $P (z | \psi, \{ (w_i, l_i ) \})$. For instance, with log loss (multiclass logistic regression) the objective function is the cross-entropy, \[
    \sum_j P (z = j | \psi, \{ (w_i, l_i) \}) \log P (z = j | \psi),
    \] as sketched in code just after this list.
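Here is a minimal sketch of that update for multiclass logistic regression (W, psi, post, and eta are hypothetical names for the weight matrix, the item features, the posterior from step 2, and the step size); the gradient of the cross-entropy against a soft target is just the predicted distribution minus the target:
% sketch: SGD step for the classifier against the soft posterior target
% W: (features x labels), psi: (features x 1), post: (labels x 1), eta: step size (assumed given)
logits = W' * psi;
q = exp(logits - max(logits)); q = q / sum(q);   % classifier's P(z | psi)
W = W - eta * psi * (q - post)';                 % cross-entropy gradient step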
I have a diagram to assist with visualizing the online procedure.


Note if you observe ground truth $\tilde z$ for a particular instance, then the worker model is updated using $P (z = j | \psi) = 1_{j = \tilde z}$ as the prior distribution, and the classifier is updated as if $P (z = j | \psi, \{ (w_i, l_i) \}) = 1_{j = \tilde z}$. In this case the classifier update is the same as ``vanilla'' logistic regression, so this can be considered a generalization of logistic regression to crowdsourced data.

I always add the constant item feature to each input. Thus in the case where there are no item features, the algorithm is the same as before except that it is learning the prior distribution over $z$. Great, that's one less thing to specify. In the case where there are item features, however, things get more interesting. If there is a feature which is strongly indicative of the ground truth (e.g., lang=es on a Twitter profile being strongly indicative of a Hispanic ethnicity), the model can potentially identify accurate workers who happened to disagree with their peers on every item they labeled, if the worker agrees with other workers on items which share some dispositive features. This might occur if a worker happens to get unlucky and colocate on several tasks with multiple inaccurate workers. This really starts to pay off when those multiple inaccurate workers have their influence reduced on other items which are more ambiguous.

Here is a real life example. The task is prediction of the gender of a Twitter profile. Mechanical Turk workers are asked to visit a particular profile and then choose a gender: male, female, or neither. ``neither'' is mostly intended for the Twitter accounts of organizations like the Los Angeles Dodgers, not necessarily RuPaul. The item features are whatever can be obtained via GET users/lookup (note all of these features are readily apparent to the Mechanical Turk worker). Training examples end up looking like
A26E8CJMP5S4WN:2,A8H56XB9K7DB5:2,AU9LVYE38Q6S2:2,AHGJTOTIPCL8X:2 WONBOTTLES,180279525|firstname taste |restname this ? ?? |lang en |description weed girls life cool #team yoooooooo #teamblasian #teamgemini #teamcoolin #teamcowboys |utc_offset utc_offset_-18000 |profile sidebar_252429 background_1a1b1f |location spacejam'n in my jet fool
If that looks like Vowpal Wabbit, it's because I ripped off their input format again, but the label specification is enriched. In particular zero or more worker:label pairs can be specified, as well as an optional true label (just a label, no worker). Here's what multiple passes over a training set look like.
initial_t = 10000
eta = 1.0
rho = 0.9
n_items = 10130
n_labels = 3
n_worker_bits = 16
n_feature_bits = 16
test_only = false
prediction file = (no output)
data file = (stdin)
cumul    since    cumul    since      example current current current  current
avg q    last     avg ce   last       counter   label predict ratings features
-0.52730 -0.52730 -0.35304 -0.35304         2      -1       0       4        7
-0.65246 -0.73211 -0.29330 -0.25527         5      -1       0       4       23
-0.62805 -0.60364 -0.33058 -0.36786        10      -1       1       4       13
-0.73103 -0.86344 -0.29300 -0.24469        19      -1       0       4       12
-0.76983 -0.81417 -0.25648 -0.21474        36      -1       0       4       20
-0.75015 -0.72887 -0.26422 -0.27259        69      -1       2       4       12
-0.76571 -0.78134 -0.25690 -0.24956       134      -1       2       4       37
-0.76196 -0.75812 -0.24240 -0.22752       263      -1       0       4       21
-0.74378 -0.72467 -0.25171 -0.26148       520      -1       2       4       12
-0.75463 -0.76554 -0.24286 -0.23396      1033      -1       2       2       38
-0.72789 -0.70122 -0.24080 -0.23874      2058      -1       0       4       30
-0.68904 -0.65012 -0.25367 -0.26656      4107      -1       2       4       25
-0.61835 -0.54738 -0.25731 -0.26097      8204      -1       0       4       11
-0.55034 -0.48273 -0.24362 -0.23001     16397      -1       2       3       12
-0.49055 -0.43083 -0.20390 -0.16423     32782      -1       2       3       29
-0.44859 -0.40666 -0.15410 -0.10434     65551      -1       2       4       12
-0.42490 -0.40117 -0.11946 -0.08477    131088      -1       0       4        9
-0.41290 -0.40090 -0.10018 -0.08089    262161      -1       2       4        9
-0.40566 -0.39841 -0.08973 -0.07927    524306      -1       0       4       33
-0.40206 -0.39846 -0.08416 -0.07858   1048595      -1       2       4       22
-0.40087 -0.39869 -0.08206 -0.07822   1620800      -1       0       4       18
applying deferred prior updates ... finished

gamma:
     \  ground truth
      |   0       1       2
label |
    0 | -1.0000 0.0023  0.0038
    1 | 0.0038  -1.0000 0.0034
    2 | 0.0038  0.0018  -1.0000
That output takes about 3 minutes to produce on my laptop. If that looks like Vowpal Wabbit, it's because I ripped off their output format again. The first two columns are the EM auxiliary function, which is akin to a log-likelihood, so increasing numbers indicate the worker model is better able to predict the worker labels. The next two columns are the cross-entropy for the classifier, so increasing numbers indicate the classifier is better able to predict the posterior (with respect to crowdsource worker labels) over ground truth from the item features.

The above software is available from the Google code repository. It's called playerpiano, since I find the process of using crowdsource workers to provide training data for classifiers reminiscent of Vonnegut's dystopia, in which the last generation of human master craftsmen had their movements recorded onto tape before being permanently evicted from industrial production. Right now playerpiano only supports nominal problems but I've written things so hopefully it will be easy to add ordinal and multilabel into the same executable.

Friday, October 28, 2011

Online Multi Label Extraction from Crowdsourced Data

I've applied the online EM approach previously discussed to my low-rank model for nominal labels, and by reduction to my model for multi-labels. At this point it's just turning the crank with a different label emission likelihood.

Unfortunately due to the combinatorial nature of the multi-label reduction it can be very slow in practice. Here's an example application where I asked Mechanical Turkers to multi-label phrases into high level buckets like ``Politics'' and ``Entertainment''.
pmineiro@ubuntu-152% for r in 4; do rm model.${r}; time ~/src/multionlineextract/src/multionlineextract --model model.${r} --data <(./multicat 10 =(sort -R octoplevel.max3.moe.in)) --n_items $(cat octoplevel.max3.moe.in | wc -l) --n_raw_labels $(./statsfrompm n_raw_labels) --max_raw_labels 3 --rank ${r} --priorz $(./statsfrompm priorz) --predict flass.${r} --eta 0.5; done
seed = 45
initial_t = 1000
eta = 0.500000 
rho = 0.500000 
n_items = 3803
n_raw_labels = 10
max_raw_labels = 3
n_labels (induced) = 176
n_workers = 65536
rank = 4
test_only = false
prediction file = flass.4
priorz = 0.049156,0.087412,0.317253,0.012600,0.135758,0.079440,0.109094,0.016949
,0.157750,0.034519
cumul     since       example   current   current   current
avg q     last        counter     label   predict   ratings
-3.515874 -3.515874         2        -1         0         4
-3.759951 -3.922669         5        -1         0         4
-3.263854 -2.767756        10        -1         0         4
-2.999247 -2.696840        19        -1         0         3
-2.531113 -2.014788        36        -1         9         4
-2.503801 -2.474213        69        -1         3         4
-2.452015 -2.396817       134        -1         3         4
-2.214508 -1.968222       263        -1         6         3
-2.030175 -1.842252       520        -1         3         4
-1.907382 -1.783031      1033        -1         1         4
-1.728004 -1.547266      2058        -1         2         4
-1.582127 -1.435591      4107        -1         2         4
-1.460967 -1.339532      8204        -1         9         4
-1.364336 -1.267581     16397        -1         5         4
-1.281301 -1.198209     32782        -1         3         4
-1.267093 -1.178344     38030        -1         3        -1
applying deferred prior updates ... finished
gamma:  0.0010  0.0008  0.0007  0.0006
~/src/multionlineextract/src/multionlineextract --model model.${r} --data      2
717.98s user 3.46s system 99% cpu 45:26.28 total
Sadly, yes, that's 45 minutes on one core of my laptop. The good news is that while working on speeding this up, I improved the speed of ordinalonlineextract and nominallabelextract by a factor of 4. However inference is still $O (|L|^2)$ so the problem with 176 effective labels above is about 7700 times slower than a binary problem. A more restrictive assumption, such as ``all errors are equally likely'' (in the nominal case) or ``error likelihood depends only upon the edit distance from the true label'' (in the multi-label case) would admit cheaper exact inference.

multionlineextract is available from the nincompoop repository on Google code.

Friday, September 23, 2011

Online Label Extraction from Crowdsourced Data

So far I've been using batch EM to optimize the various generative models I've developed to process crowdsourced data (nominal, ordinal, and multi-label). This is despite my fondness for online techniques, but I had to crawl before I walked and my datasets were fairly modest in size. The business is happy with the results from Mechanical Turk, however, and wants to scale up from tasks involving multiples of $10^4$ items to tasks involving multiples of $10^6$ items. Although that will still fit in memory on my laptop, it seemed a good excuse to develop an online variant of the algorithm.

My previous batch EM approaches can be considered as maximizing the auxiliary function \[
F (\alpha, \beta, \gamma, q) = E_{Z \sim q}[\log L (D | \alpha, \beta, \gamma, Z)] + \log P (\alpha, \beta, \gamma) + E_{Z \sim q}[\log P (Z)] + H (q),
\] where $\alpha$ are worker-indexed parameters, $\beta$ are item-indexed parameters, $\gamma$ are global parameters, $q$ is the joint distribution over all unobserved labels, $Z$ is the set of all unobserved labels, $D$ is the data set of item-worker-label triples, $\log L (D | \alpha, \beta, \gamma, Z)$ is the log-likelihood of the data set, $P (\alpha, \beta, \gamma)$ is the prior distribution over the generative model parameters, $P (Z)$ is the prior unobserved label distribution, and $H (q)$ is the entropy of the unobserved label distribution.

The unobserved label distribution is assumed to factor over items, $q (Z) = \prod_i q_i (Z_i)$, as is the prior distribution $P (Z) = \prod_i P_i (Z_i)$. (Equivalently, only a constrained maximum of the auxiliary function is found, subject to this factorization constraint.) The data likelihood is assumed independent conditioned upon $(\alpha, \beta, Z)$, leading to \[
\begin{split}
F (\alpha, \beta, \gamma, q) &= \\
&\sum_i E_{Z_i \sim q_i} [ \log L (D_i | \alpha, \beta_i, \gamma, Z_i)] + \log P (\alpha, \beta, \gamma) \\
&+ \sum_i E_{Z_i \sim q_i} [ \log P_i (Z_i)] + \sum_i H (q_i),
\end{split}
\] where $i$ indexes items and $D_i$ is the set of data associated with item $i$. Further assuming the prior distribution is of the form $P (\alpha, \beta, \gamma) = P (\alpha, \gamma) \prod_i P (\beta_i)$ and rearranging yields \[
\begin{aligned}
F (\alpha, \beta, \gamma, q) &= \sum_i F_i (\alpha, \beta_i, \gamma, q_i), \\
F_i (\alpha, \beta_i, \gamma, q_i) &= E_{Z_i \sim q_i} [ \log L (D_i | \alpha, \beta_i, \gamma, Z_i)] + \frac{1}{|I|} \log P (\alpha, \gamma) \\
&\quad + \log P (\beta_i) + E_{Z_i \sim q_i} [ \log P_i (Z_i)] + H (q_i),
\end{aligned}
\] where $|I|$ is the total number of items. Now the objective function looks like a sum of terms where $\beta_i$ and $q_i$ only appear once. This indicates that, if the data were streamed in blocks corresponding to the same item and the optimal $\alpha$ and $\gamma$ were already known, the $\beta_i$ and $q_i$ could be individually maximized and discarded. Of course, the optimal $\alpha$ and $\gamma$ are not known, but hopefully over time as more data is encountered the estimates get increasingly good. That suggests the following procedure:
  1. Receive a block $D_i$ of item-worker-label triples corresponding to a single item.
  2. Maximize $F_i (\alpha, \beta_i, \gamma, q_i)$ with respect to $\beta_i$ and $q_i$.
    • Basically I run EM on this block of data with fixed $\alpha$ and $\gamma$.
  3. Set $\alpha \leftarrow \alpha + \eta_t \nabla_{\alpha} F_i\bigr|_{\alpha, \beta^*_i, \gamma, q^*_i}$ and $\gamma \leftarrow \gamma + \eta_t \nabla_{\gamma} F_i\bigr|_{\alpha, \beta^*_i, \gamma, q^*_i}$.
    • $\eta_t$ is a learning rate which decays over time, e.g., $\eta_t = \eta_0 (\tau_0 + t)^{-\rho}$.
    • $\eta_0 \geq 0$, $\tau_0 \geq 0$ and $\rho \geq 0$ are tuning parameters for the learning algorithm.
    • Effectively $|I|$ is also a tuning parameter which sets the relative importance of the prior.
  4. If desired (e.g., ``inference mode''), output $\beta^*_i$ and $q^*_i$.
  5. Discard $\beta^*_i$ and $q^*_i$.
  6. Return to (1).
This has very good scalability with respect to the number of items, since no per-item state is maintained across input blocks. It does require that all the labels for a particular item are aggregated; however, even in a true online crowdsourcing scenario this does not present a scalability issue. In practice, items are individually programmatically submitted for crowdsourced analysis and the number of redundant assessments is typically small (e.g., 5), so a receiving system which buffered crowdsourced data until the entire block of item labels was available would have very modest space requirements. In my case I'm actually applying this online algorithm to an offline previously collected data set, so I can easily arrange for all the labels corresponding to a particular item to be together.

Scalability with respect to the number of workers is a potential issue. This is because $\alpha$ is maintained as state, and it is indexed by worker (e.g., in nominallabelextract, $\alpha_w$ is the confusion matrix for worker $w$). To overcome this I use the hashing trick: I have a fixed finite number of $\alpha$ parameters and I hash the worker id to get the $\alpha$ for that worker. When I get a hash collision this means I treat two (or more) workers as equivalent, but it allows me to bound the space usage of the algorithm up front. In practice doing hashing tricks like this always seems to work out fabulously. In this particular context, in the limit of a very large number of workers I will model every worker with the population confusion matrix. This is a graceful way to degrade as the sample complexity overwhelms the (fixed) model complexity. (I don't actually anticipate having a large number of workers; the way crowdsourcing seems to go is, one does some small tasks to identify high-quality workers and then runs a larger version of the task restricted to those workers.)
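The worker-parameter hashing amounts to something like the following sketch (a mod of an integer worker id stands in here for a real hash function; worker_id, alpha, and n_worker_bins are assumed given):
% sketch: map a worker id onto one of a fixed number of alpha slots
bin = 1 + mod(worker_id, n_worker_bins);   % collisions mean those workers share parameters
alpha_w = alpha(:, :, bin);                % confusion matrix used for this worker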

Here's an example run involving 40 passes over a small test dataset.
% time ~/src/nincompoop/nominalonlineextract/src/nominalonlineextract --initial_t 10000 --n_items 9859 --n_labels 5 --priorz 1,1,1,1,1 --model flass --data <(./multicat 40 =(sort -R ethnicity4.noe.in)) --eta 1 --rho 0.5
initial_t = 10000
eta = 1.000000 
rho = 0.500000 
n_items = 9859
n_labels = 5
n_workers = 65536
symmetric = false
test_only = false
prediction file = (no output)
priorz = 0.199987,0.199987,0.199987,0.199987,0.199987
cumul     since       example   current   current   current
avg q     last        counter     label   predict   ratings
-1.183628 -1.183628         2        -1         0         5
-1.125888 -1.092893         5        -1         0         5
-1.145204 -1.162910        10        -1         0         5
-1.081261 -1.009520        19         0         0         5
-1.124367 -1.173712        36        -1         3         3
-1.083097 -1.039129        69        -1         0         4
-1.037481 -0.988452       134        -1         1         2
-0.929367 -0.820539       263        -1         1         5
-0.820125 -0.709057       520        -1         4         5
-0.738361 -0.653392      1033        -1         1         4
-0.658806 -0.579719      2058        -1         1         5
-0.610473 -0.562028      4107        -1         4         5
-0.566530 -0.522431      8204        -1         0         3
-0.522385 -0.478110     16397        -1         2         4
-0.487094 -0.451771     32782        -1         0         3
-0.460216 -0.433323     65551        -1         4         5
-0.441042 -0.421860    131088        -1         2         5
-0.427205 -0.413365    262161        -1         0         5
-0.420944 -0.408528    394360        -1         1        -1
~/src/nincompoop/nominalonlineextract/src/nominalonlineextract --initial_t     85.77s user 0.22s system 99% cpu 1:26.41 total
If that output format looks familiar, it's because I've jacked vowpal wabbit's output style (again). The first column is the progressively validated auxiliary function, i.e., the (averaged over items) $F_i$ function evaluated prior to updating the model parameters ($\alpha$ and $\gamma$). It is akin to a log-likelihood and if everything is working well it should get bigger as more data is consumed.

nominallabelextract, the implementation of the batch EM analog to the above, converges in about 90 seconds on this dataset and so the run times are a dead heat. For larger datasets, there is less need to do so many passes over the dataset so I would expect the online version to become increasingly advantageous. Furthermore I've been improving the performance of nominallabelextract for several months whereas I just wrote nominalonlineextract so there might be additional speed improvements in the latter. Nonetheless it appears for datasets that fit into memory batch EM is competitive.

nominalonlineextract is available from the nincompoop code repository on Google code. I'll be putting together online versions of the other algorithms in the near-term (the basic approach holds for all of them, but there are different tricks for each specific likelihood).

Wednesday, July 20, 2011

Recommendation Diversity

In recommendation systems, a predictor utilizing only the overall average rating, the average rating per user, and the average rating per item forms a powerful baseline which is surprisingly difficult to beat (in terms of raw prediction accuracy). When I asked Charles Elkan about this, he mentioned there are published papers whose reported results failed to exceed this baseline. Although it was good fun, he pointed out (paraphrasing) ``it's not just about accuracy; nobody likes the linear baseline because it makes the same recommendations to everyone.''

I decided to explore this a bit using the movielens 10m dataset. First, for each user I randomly withheld two of their ratings, then added the rest to the training set. I optimized for rating MAE using the analytical importance-aware update for quantile loss. I then computed rating MAE on the test set. I also computed the AUC per user, defined as 1 if the model correctly ordered the two withheld ratings, and 0 otherwise; averaging this quantity across users yields something I've seen referred to as the ``user AUC''. Of course, if I'm optimizing for AUC, I should reduce to pairwise classification and use hinge loss, rather than use quantile loss; however I was also interested in whether the modest gains I was seeing in MAE might result in larger gains in AUC. Here are the results: \[
\begin{array}{|c|c|c|}
\mbox{Model } &\mbox{ MAE (test set) } &\mbox{ User AUC (test set) } \\ \hline
\mbox{Best Constant } &\mbox{ 0.420 } &\mbox{ 0.5 } \\
\mbox{Linear } &\mbox{ 0.356 } &\mbox{ 0.680 } \\
\mbox{Dyadic $k=1$ } &\mbox{ 0.349 } &\mbox{ 0.692 } \\
\mbox{Dyadic $k=5$ } &\mbox{ 0.338 } &\mbox{ 0.706 } \\
\mbox{Dyadic $k=10$ } &\mbox{ 0.335 } &\mbox{ 0.709 }
\end{array}
\] As noted previously, the lion's share of predictive lift (for both metrics) comes from the linear baseline, i.e., a model of the form $\alpha_{user} + \beta_{movie} + c$. Adding an additional dyadic term of the form $a_{user}^\top b_{movie}$ with latent dimensionality $k$ slightly improves accuracy (although I didn't do cross-validation, treating the User AUC as a binomial implies a standard error of $0.004$ which suggests the dyadic lift in User AUC is barely significant).
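For reference, the functional form being fit looks like the following sketch; plain subgradient descent on absolute error is shown here rather than the importance-aware quantile-loss update actually used (u, m, r, eta, and the parameter arrays are assumed given):
% sketch: linear baseline plus rank-k dyadic term, plain subgradient step on |pred - r|
% u, m: user and movie indices; r: observed rating; eta: step size (all assumed given)
pred = c + alpha(u) + beta(m) + a(u, :) * b(m, :)';
g = sign(pred - r);                    % subgradient of the absolute error
c = c - eta * g;
alpha(u) = alpha(u) - eta * g;
beta(m) = beta(m) - eta * g;
anew = a(u, :) - eta * g * b(m, :);    % update dyadic factors using the old values
b(m, :) = b(m, :) - eta * g * a(u, :);
a(u, :) = anew;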

Next I looked at recommendation diversity. I took a random subsample of the users (which turned out to be of size 11597), exhaustively estimated their rankings for every movie in the data set, and then I computed the top three movies for each user. At each size $n= \{ 1, 2, 3 \}$ I looked at how often the most popular set of movies of that size was suggested ($\max p_n$), and I also counted the number of unique sets of movies recommended (sets, not lists: if one user has recommendations $a, b, c$ and another user has recommendations $c, b, a$ they are considered the same). \[
\begin{array}{|c|c|c|c|c|c|c|}
\mbox{Model } & \max p_1 &\mbox{ unique 1-sets } & \max p_2 &\mbox{ unique 2-sets } & \max p_3 &\mbox{ unique 3-sets } \\ \hline
\mbox{Linear } &\mbox{ 1 } &\mbox{ 1 } &\mbox{ 1 } &\mbox{ 1 } &\mbox{ 1 } &\mbox{ 1 } \\
\mbox{Dyadic $k=1$ } &\mbox{ 0.478 } &\mbox{ 6 } &\mbox{ 0.389 } &\mbox{ 10 } &\mbox{ 0.218 } &\mbox{ 18 } \\
\mbox{Dyadic $k=5$ } &\mbox{ 0.170 } &\mbox{ 112 } &\mbox{ 0.120 } &\mbox{ 614 } &\mbox{ 0.069 } &\mbox{ 1458 } \\
\mbox{Dyadic $k=10$ } &\mbox{ 0.193 } &\mbox{ 220 } &\mbox{ 0.102 } &\mbox{ 1409 } &\mbox{ 0.035 } &\mbox{ 3390}
\end{array}
\] As anticipated, there is no diversity of recommendation in the linear model. Interestingly, however, the diversity explodes with increasing latent dimensionality, even though the accuracy metrics do not improve dramatically. For $k=10$ the diversity in top-3 suggestion sets is substantial: the largest group of users that share the same top 3 suggestions is only 3.5% of the user base. (The 0.193 result for $k=10$ and $\max p_1$ is not a typo; even though there are more unique movies that are recommended as the top movie for a user, the most popular #1 movie for $k=10$ is more frequently chosen than for $k=5$. Not sure what's going on there.)

If you were to walk over to a product manager and say, "I have two recommendation models which have the same accuracy, but one makes the same recommendations to everybody and the other one makes different recommendations to different people, which do you want?" you can bet that product manager is going to say the second one. In fact, it would probably be acceptable to sacrifice some accuracy in order to achieve recommendation diversity. Given that dyadic models can both improve accuracy and improve diversity, they are a win-win.

Monday, June 13, 2011

Even Better Hashtag Similarity Visualization

Ok, I spent a good chunk of the weekend trying to improve my hashtag similarity visualization, for no good reason except that when you have an itch, you have to scratch it. Here's what ended up looking best.
  1. Compute term frequency vector for each hash tag.
    1. For each tweet X:
      1. For each hashtag $h \in X$:
        1. For each non-hashtag token $t \in X$:
          1. Increment count for $(h, t)$.
  2. Compute pairwise cosine for each hashtag pair in the top 1000 most frequent hashtags, $\cos \theta (h, h^\prime)$.
  3. Define similarity as $s (h, h^\prime) = \arccos (\cos \theta (h, h^\prime))$.
    1. I get fewer negative eigenvalues doing this versus $1.0 - \cos \theta (h, h^\prime)$. This is the difference between moving around the hypersphere and cutting across a hyperchord.
  4. Do 60-dimensional MDS on $s$.
    1. Two of the 60 eigenvalues were negative so I treated them as 0. So really 58-dimensional MDS.
    2. I was a bit surprised to get any negative eigenvalues here, since all my term vectors occupy the positive hyperquadrant of a hypersphere. Clearly my hyperintuition needs hypertuning ... or maybe I have a bug.
  5. Input resulting 60-dimensional representation into t-SNE.
    1. I used perplexity 10.
I went with t-SNE because the high-dimensional structure appeared to consist of relatively isolated clusters. I inferred that because I was defining neighborhoods as $\epsilon$-balls in the angle $\theta$ and running Floyd-Warshall on the neighborhood graph, and I noticed I had to use a really big neighborhood ($\cos \theta \geq 0.8$) before I got a large enough connected component to be interesting.
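A rough Matlab sketch of steps 2 through 5 (T is an assumed tags-by-terms frequency matrix; cmdscale and tsne come from the Statistics and Machine Learning Toolbox, the latter only in newer releases):
% sketch: cosine similarity -> arccos distance -> classical MDS -> t-SNE
% T: (tags x terms) term-frequency matrix (assumed given)
Tn = bsxfun(@rdivide, T, sqrt(sum(T.^2, 2)));   % unit-normalize rows
S = min(max(Tn * Tn', -1), 1);                  % pairwise cosines, clipped to [-1, 1]
D = acos(S);                                    % distance along the hypersphere
D(1:size(D,1)+1:end) = 0;                       % force an exactly zero diagonal
[Y, e] = cmdscale(D);                           % classical MDS; e holds the eigenvalues
Y = Y(:, 1:min(60, size(Y, 2)));                % keep up to 60 (positive-eigenvalue) dimensions
Y2 = tsne(Y, 'Perplexity', 10);                 % 2-d embedding for plotting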

Finally, when I plotted this I tried to randomize the colors to give a chance of being able to see something when all the tags are on top of each other. Really the png does not do it justice; you should get the pdf version and zoom in.

Thursday, June 9, 2011

A Hashtag Similarity Visualization

Never underestimate the power of pretty pictures. For one thing, exploratory data analysis really does help build better models. In addition, whether you want more grant money or more salary, pretty pictures are a great way to convince people that you are doing something interesting.

So I've started looking into Twitter hash tag use; the ultimate goal would be to improve our client software by automatically suggesting or correcting hash tags as tweets are being composed, explaining hash tags when reading tweets, etc. Novice users of Twitter often complain that it makes no sense, so this would be a nice value add.

Hashtags suffer from the vocabulary problem. Basically, you say #potato, and I say #spud. In general people use variations which are nearly equivalent semantically but sometimes lexically completely different. Some examples are
  • #shoutouts and #shoutout : this case appears amenable to lexical analysis.
  • #imjustsaying and #imjustsayin : these are fairly close, perhaps we could hope a ``stemming'' based strategy would work.
  • #nowplaying and #np : there are lots of examples of long hash tags being replaced by short abbreviations to economize; perhaps a custom lexical strategy could be developed around that.
  • #libya and #feb17 : not at all lexical, but the currently nearly equivalent nature of these tags is presumably unstable with time.
Instead of a lexically driven approach, it would be nice to develop a data-driven approach for suggesting, correcting, or explaining hash tags: after all, there is no shortage of tweets.

Alright, enough gabbing, here's the pretty picture. I took the 1000 most popular hash tags in our sample of tweets and computed a term frequency vector for each hash tag by summing across all tweets with that hash tag. Next, I computed cosine similarity between the hash tags, thresholded at 0.985, and fed the result to Mathematica's GraphPlot. Here's the result.
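For reference, here's a small sketch of that graph construction in Python, assuming tags is the list of the 1000 most popular hash tags and M is their term frequency matrix (both hypothetical stand-ins here; the actual plotting was done with GraphPlot).

import numpy as np

X = M / np.linalg.norm(M, axis=1, keepdims=True)   # row-normalize term frequencies
C = X @ X.T                                        # pairwise cosine similarity
i, j = np.triu_indices_from(C, k=1)
edges = [(tags[a], tags[b]) for a, b in zip(i, j) if C[a, b] >= 0.985]
# edges is the adjacency list handed to the graph plotting routine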
Depending upon how much time I want to spend on this, a reasonable next step would be to employ some dimensionality reduction techniques to project this down to 2 dimensions and look at the resulting (hopefully even more informative) picture. Even the above simple procedure, however, yields some interesting information:
  1. Generally this plot helps build intuition around some lexical transformations that are employed.
  2. Hash tags are used for following: indicating desire to be followed and intent to reciprocate.
  3. Horoscopes are apparently very important for both English and Portuguese readers; but the patterns of actual words used to describe each sign are very similar. :)
  4. There are several ways to qualify potentially inflammatory statements to indicate a truth-seeking motivation (the ``Real talk'' cluster).
  5. My favorite, the #sex and #porn cluster. On the internet, they really are the same.

Thursday, June 2, 2011

Dimensional Analysis and Gradient Descent

Consider learning a linear predictor according to a loss function $l$, \[
\begin{aligned}
p^{(t)} &= w^{(t) \top} x^{(t)}, \\
w^{(t+1)} &= w^{(t)} - \eta \frac{\partial l}{\partial p} x^{(t)}.
\end{aligned}
\] Suppose the optimal weight vector is $w^*$. If all the inputs are scaled by a constant amount $r$, then the optimal weight vector is $w^* / r$, and it would be nice if the learning algorithm produced output that respected this identity. A more general way to think along these lines is dimensional analysis.

Assume predictions have dimensions $\rho$ and inputs have dimensions of $\chi$; then weights must have dimensions of $(\rho / \chi)$ for the prediction equation to work out. Assuming loss has dimensions $\lambda$, the learning rate $\eta$ must have units of $\rho^2 / (\lambda \chi^2)$ for the update equation to work out. What that means in practice: if we had a sequence of weight vectors which converged somewhere we liked and we then doubled all the inputs, the learning rate must be quartered so that the entire sequence of generated weight vectors is halved and the entire sequence of predictions is unchanged.
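Here's a tiny numeric check of that argument (a sketch with squared loss, not Vowpal Wabbit itself): doubling all the inputs and quartering the learning rate halves every weight vector in the trajectory and leaves every prediction unchanged.

import numpy as np

def sgd_predictions(xs, ys, eta):
    # linear predictor trained by the update above with squared loss,
    # for which dl/dp = 2 (p - y)
    w = np.zeros(xs.shape[1])
    preds = []
    for x, y in zip(xs, ys):
        p = w @ x
        preds.append(p)
        w -= eta * 2.0 * (p - y) * x
    return np.array(preds)

rng = np.random.default_rng(0)
xs = rng.normal(size=(100, 5))
ys = xs @ rng.normal(size=5)

p1 = sgd_predictions(xs, ys, eta=0.05)
p2 = sgd_predictions(2 * xs, ys, eta=0.05 / 4)   # inputs doubled, eta quartered
print(np.allclose(p1, p2))                        # True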

So these ideas are already incorporated into Vowpal Wabbit (and actually that is how I became aware of them: I asked the Vowpal team to help me understand what I saw in the source code). In particular the $\eta$ specified on the command line is normalized by $x^{(t) \top} x^{(t)}$, corresponding to the $(1 / \chi^2)$ portion of the $\eta$ dimensionality, although this does not address the $\rho^2 / \lambda$ portion. (Why does that matter? Suppose I created a new loss function which was twice the output of another loss function; the learning rate specified on the command line would have to be reduced by half.)

Working on the dyadic models I had to figure out how to generalize this, so I started to think about it. Normalizing by $x^{(t) \top} x^{(t)}$ definitely adjusts for a global scaling of the input, but what if just one dimension of the input were scaled? This starts to get into the realm of pre-conditioning, which leads to the adaptive approach of Duchi, Hazan, and Singer aka DHS (also simultaneously developed in McMahan and Streeter, but I will focus on DHS). There they define a matrix $G^{(t)} = \mathrm{diag} \left( 1 / \sqrt{\sum_{s \leq t} (g^{(s)}_i)^{2}} \right)$ and use an update \[
w^{(t+1)} = w^{(t)} - \eta \frac{\partial l}{\partial p} G^{(t)} x^{(t)},
\] where $g^{(s)} = (\partial l / \partial p^{(s)}) x^{(s)}$. With this update $G^{(t)}$ has dimensions of $\rho / (\lambda \chi) \ldots$ getting closer! Unfortunately $\eta$ still is not dimensionless, having dimensions of $\rho / \chi$. Note the optimal choice of $\eta$ in Corollary 1 of DHS is proportional to $\max_t ||w_t - w^*||_{\infty}$ which has units of $(\rho / \chi)$. In other words, if all the inputs were doubled, we would still have to reduce the previously optimal learning rate by half to get optimal behaviour.
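For concreteness, here's a minimal sketch of that diagonal adaptive update (not the actual Vowpal Wabbit code), with a guard for coordinates that have not yet seen a gradient.

import numpy as np

def adaptive_step(w, g_sq, x, y, eta, loss_grad):
    # w: weights, g_sq: running sum of squared per-coordinate gradients
    p = w @ x
    dldp = loss_grad(p, y)
    g = dldp * x                                       # g^{(t)}
    g_sq += g ** 2                                     # sum_{s <= t} (g_i^{(s)})^2
    G = 1.0 / np.sqrt(np.where(g_sq > 0, g_sq, 1.0))   # diag(G^{(t)}), guarding zeros
    w -= eta * dldp * G * x
    return w, g_sq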

This leaves open the question of what is lying around with units $(\rho / \chi)$ that could be used to normalize an $\eta$ such that the parameter specified on the command line is dimensionless. Anything that varies with $t$ is outside the scope of the analysis in DHS, but I'll ignore that for now. Two things suggest themselves: $1 / ||x^{(t)}||_p$ and $1 / (x^{(t) \top} G^{(t)} x^{(t)})$. (These have units of $1/\chi$, but that's getting closer.) They have different properties.

Intuitively what gives the adaptive learning algorithms their edge is that they are more conservative on frequently occurring features and more aggressive on rarely occurring features. With $||x^{(t)}||_p$ normalization, if an example is encountered whose features have all been seen and trained extensively before, the effective learning rate will be small, and hence the change in the prediction will be small relative to if the example had been seen earlier in the training sequence. Conversely, with $x^{(t) \top} G^{(t)} x^{(t)}$ normalization, if an example is encountered whose features have all been seen and trained extensively before, the effective learning rate will be normalized to compensate, such that the change in the prediction will not be small relative to if the example had been seen earlier in the training sequence. On the other hand, for an example with a mixture of novel and frequent features that occurs later in the training sequence, the update will have a greater impact on novel feature weights than on frequent feature weights under either normalization scheme, relative to if the example had occurred earlier in the training sequence.

There are other things lying around with dimensionality $(\rho / \chi)$. One of the key insights of the adaptive learning approaches is to use information from the entire history of inputs, not just the current input, to drive the learning. One thing that is lying around that summarizes all the inputs so far is the weight vector. The optimal $\eta$ in Corollary 1 of DHS is proportional to $\max_t ||w_t - w^*||_{\infty}$, and since $w_0 = 0$, this is approximately $||w^*||_\infty$. One (potentially horrible!) approximation of $w^*$ is the current weight vector. It is an especially bad approximation at the beginning when it is zero, so I considered scaling the supplied $\eta$ by $\max (1, ||w^{(t)}||_\infty)$, where I only consider the components of $w^{(t)}$ that have corresponding non-zero values of $x^{(t)}$.
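A sketch of that scaling (a hypothetical helper, not something shipped in Vowpal Wabbit): multiply the supplied $\eta$ by $\max (z, ||w^{(t)}||_\infty)$, taking the max only over coordinates where $x^{(t)}$ is non-zero.

import numpy as np

def scaled_eta(eta, w, x, z=1.0):
    active = np.abs(w[x != 0])            # only components with non-zero input
    w_inf = active.max() if active.size else 0.0
    return eta * max(z, w_inf)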

An experiment

Lots of ideas, so I decided to run an experiment. I have a set of tens of millions of tweets that have been labelled according to the gender of the person who wrote the tweet. The input vectors have interesting norms due to the varying numbers of tokens in the tweets. I trained using a constant learning rate (which Vowpal Wabbit scales by $x^\top x$) and the DHS adaptive learning algorithm scaled by different values. I only do one pass over the data and I'm reporting progressive validation hinge loss on the training set. I used $\eta = 1$ on the command line for all these tests. \[
\begin{array}{c|c|c}
\mbox{Method } &\mbox{ Loss } &\mbox{ Comments } \\ \hline
\mbox{Best constant } &\mbox{ 0.722 } &\mbox{ More women than men on Twitter (why?) } \\
\mbox{Non-adaptive } &\mbox{ 0.651 } &\mbox{ VW normalizes by $||x^{(t)}||^2_2$ in this case } \\
\mbox{Adaptive $||x^{(t)}||_1$ normalization } &\mbox{ 0.588 } & \\
\mbox{Adaptive $||x^{(t)}||_2$ normalization } &\mbox{ 0.554 } & \\
\mbox{Adaptive $||x^{(t)}||_\infty$ normalization } &\mbox{ 0.579 } & \\
\mbox{Adaptive $x^{(t) \top} G^{(t)} x^{(t)}$ normalization } &\mbox{ 0.621 } &\mbox{ Much worse than alternatives! } \\
\mbox{Adaptive $\max (1, ||w^{(t)}||_\infty)$ scaling } &\mbox{ 0.579 } & \\
\mbox{Adaptive $\max (0.1, ||w^{(t)}||_\infty)$ scaling } &\mbox{ 0.562 } & \\
\mbox{Adaptive $\max (1, ||w^{(t)}||_\infty)$ scaling } &\mbox{ 0.579 } & \mbox{Ignoring const feature in $||w^{(t)}||_\infty$} \\
\mbox{Adaptive $\max (0.1, ||w^{(t)}||_\infty)$ scaling } &\mbox{ 0.560 } & \mbox{Ignoring const feature in $||w^{(t)}||_\infty$} \\
\end{array}
\]
In practice trying $\max (z, ||w^{(t)}||_\infty)$ for $z \leq 0.1$ yielded the same results. I thought this might be because of the weight of the constant feature (which is always present with value 1; so in a sense, it has a fixed scale which does not change with the input) so I tried not using the constant feature when computing $||w^{(t)}||_\infty$ but the results were about the same.

So this experiment suggests normalizing by the adaptive norm $x^{(t) \top} G^{(t)} x^{(t)}$ really does undermine the effectiveness of the adaptive strategy. Otherwise it's hard to really prefer one strategy over another on the basis of this one experiment. Having said that my personal favorite is the $\max (0.1, ||w^{(t)}||_\infty)$ scaling since it is directly motivated from the regret analysis and has the correct units.

Now I need to return to my original goal: figuring out how to normalize the learning rate when training a dyadic model.

Friday, May 27, 2011

A Dyadic Importance Aware Update: Part III

The next time somebody asks, "Who's Afraid of Non-Convex Loss Functions?", I'm going to raise my hand. This dyadic model is only mildly non-convex but is proving challenging to get right.

I downloaded the bookcrossing dataset. The ratings are ordinal so the standard evaluation metric is mean absolute error. Fortunately $\tau$-quantile loss also has an analytic solution to the dynamics, so I thought I was set. $\tau$-quantile loss is defined by \[
l (p, y) = \begin{cases} \tau (y - p) & y > p \\ (1 - \tau) (p - y) & y \leq p \end{cases}.
\] When $\tau = 1/2$ this is half the absolute error, so minimizing it is equivalent to minimizing mean absolute error.
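In code, the loss is a one-liner (shown here only for reference):

def quantile_loss(p, y, tau=0.5):
    # tau-quantile (pinball) loss; tau = 0.5 gives half the absolute error
    return tau * (y - p) if y > p else (1 - tau) * (p - y)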

I fired this up on the bookcrossing dataset, but it didn't work, in the following sense. As I increased the learning rate, it was possible to exhibit pathological behaviour, e.g., progressive validation loss diverging. This is despite the fact that the importance aware update cannot overrun the label. What I believe is happening (based upon printf) is that there are many ways to cause the loss on the current example to go to zero, and some of those ways involve a large dyadic parameter multiplied by a small dyadic parameter; unfortunately this combination is unlikely to generalize well.

$L^2$ regularization will favor dyadic parameter configurations of uniform magnitude, and fortunately if only the dyadic parameters are regularized there is an analytical solution for $\tau$-quantile loss (and hinge loss). The dynamics with regularization are given by \[
\begin{aligned}
a' (h) &= -\eta b (h) \frac{\partial l}{\partial p}\biggr|_{p = a (h)^\top b (h) + (w - s (h) x)^\top x} - \eta \lambda a (h), \\
b' (h) &= -\eta a (h) \frac{\partial l}{\partial p}\biggr|_{p = a (h)^\top b (h) + (w - s (h) x)^\top x} - \eta \lambda b (h), \\
s' (h) &= \eta \frac{\partial l}{\partial p}\biggr|_{p = a (h)^\top b (h) + (w - s (h) x)^\top x},
\end{aligned}
\] which for $\tau$-quantile loss has the solution \[
\begin{aligned}
a (h) &= e^{-\eta \lambda h} \left( a_0 \cosh (-s (h)) + b_0 \sinh (-s (h)) \right), \\
b (h) &= e^{-\eta \lambda h} \left( b_0 \cosh (-s (h)) + a_0 \sinh (-s (h)) \right), \\
s (h) &= -\eta h \begin{cases} \tau & y > p \\ (\tau - 1) & y \leq p \end{cases}.
\end{aligned}
\] Without regularization, once $p = y$ the dynamics halt. With regularization this is not necessarily true; if $\lambda$ is large enough the dynamics can go past the label making progress on the regularizer. Even if $p = y$ is an attractor, it's possible to trade off linear parameters (which are not regularized) for dyadic ones while maintaining a constant prediction. I ignored those weird cases and just stopped updating when $p = y$.
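Here's a sketch of that closed-form step for $\tau$-quantile loss (names are hypothetical; the early stopping at $p = y$ is handled outside this function).

import numpy as np

def regularized_dyadic_step(a0, b0, h, eta, lam, tau, label_above_pred):
    # s(h): the gradient of tau-quantile loss has constant sign while p != y
    grad = tau if label_above_pred else (tau - 1.0)
    s = -eta * h * grad
    decay = np.exp(-eta * lam * h)        # L2 regularization shrinks a and b
    a = decay * (a0 * np.cosh(-s) + b0 * np.sinh(-s))
    b = decay * (b0 * np.cosh(-s) + a0 * np.sinh(-s))
    return a, b, s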

So I tried that, and it also didn't work, but in a different way. It does not exhibit pathological behaviour with high learning rates as long as $\lambda$ is non-zero (this is good: this means reductions and active learning will not go haywire). In addition, it is possible to drive the progressive validation loss on the training set much lower with dyadic dimensions than without. However this improvement does not translate to the test set. In other words, it fit the training data better, but it didn't generalize well.

So I actually implemented the approach from Menon and Elkan in order to compare. I found that it also didn't generalize well on the bookcrossing dataset. Then I noticed that the number of ratings, users, etc. reported in their paper didn't match what I had. I'm not sure what's up with the bookcrossing dataset, so I decided to switch to the 1M movielens dataset. The Menon and Elkan approach did show some uplift here, not as large as reported in the paper, but on the other hand I didn't implement all the bells and whistles from their paper like variable regularization and reduced parameterization. Now I had a dataset for which I expected to make progress. Funny thing about machine learning: it's a lot easier to get something to work when you know it's supposed to :)

Here's how my dyadic fit to the 1M movielens ratings performs.\[
\begin{array}{c|c|c}
\mbox{Model } &\mbox{ $0.5$-quantile loss on Test Set } &\mbox{ Comments } \\ \hline
\mbox{Best Constant } &\mbox{ 0.428 } &\mbox{ Median rating is 4 } \\ \hline
\mbox{Linear } &\mbox{ 0.364 } & \mbox{Prediction is $\alpha_{rid} + \beta_{tid} + c$} \\ \hline
\mbox{Dyadic $k=1$ } &\mbox{ 0.353 } & \mbox{Prediction is $a_{rid}^\top b_{tid} + \alpha_{rid} + \beta_{tid} + c$} \\ \hline
\mbox{Dyadic $k=2$ } &\mbox{ 0.351 } &\mbox{ Like previous with $a_{rid}, b_{tid}$ as 2-vectors } \\ \hline
\mbox{Dyadic $k=5$ } &\mbox{ 0.349 } &\mbox{ Like previous with 5-vectors instead of 2-vectors}
\end{array}
\] Ok once again I say, where's the beef? Most of the predictive power over the best constant is coming from the linear components. Looking at table 3 from the Menon and Elkan paper, I can see that they also don't show dramatic lift with increasing latent dimensionality.

Friday, May 20, 2011

A Dyadic Importance Aware Update: Part II

Implementation Odds and Ends

It turns out the dyadic model I discussed in my previous post was a little simpler than is typically used. A more typical model structure is \[
p = a^\top b + a^\top 1 + 1^\top b + w^\top x
\] however with a change of variables \[
\begin{aligned}
\tilde a (h) &= a (h) + 1 \\
\tilde b (h) &= b (h) + 1 \\
\tilde p &= p + 1^\top 1
\end{aligned}
\] one recovers the original equations. This also has the effect of moving the saddle point of the gradient dynamics away from zero; however initializing all model parameters to zero still has problems because there is nothing to break the symmetry between the model parameters. (I chose to break the symmetry by applying a small perturbation to the dyadic parameters on their first update.) In what follows I'll be referring to the original equations to ease the exposition, but in practice the change of variables is employed.

Numerically computing the prediction for a particular step-size, \[
p (s) = a_0^\top b_0 \cosh (2 s) - (||a_0||^2 + ||b_0||^2) \cosh (s) \sinh (s) + x^\top (w - s x)
\] and the implicitly defined inverse $p^{-1} (y)$ turn out to be important. This form is very convenient because only the norm of $a_0$ and $b_0$ along with their inner product is required to simulate a particular step. In terms of software: the interface for a loss function needs to be augmented with two extra scalar inputs ($||a_0||^2 + ||b_0||^2$, and $a_0^\top b_0$; and with the change of variables, $1^\top 1$).

In the forward direction, $p (s)$ is a recipe for catastrophic cancellation so some care is required when computing it. Once this is addressed the reverse direction $p^{-1} (y)$ is mostly well-behaved and can be solved using Newton's method with backtracking line search. However, when $a_0^\top b_0 \approx \frac{1}{2} (||a_0||^2 + ||b_0||^2)$ and $x \approx 0$, precision issues dominate. Essentially, when there are no linear features ($x \approx 0$), the dyadic terms lie near a hyperdiagonal (e.g., $a_0 \approx \pm b_0$), and the prediction has the wrong sign, it takes a very large $s$ to change the sign of the prediction. Fortunately putting a constant linear feature in the model ensures $x^\top x \geq 1$. In this case encountering a hyperdiagonal essentially implies no learning for the dyadic model parameters, but hopefully dyads are paired ``sufficiently randomly'' in the training sequence that this is not a problem in practice.

The function $p^{-1} (y)$ is used directly to enforce the boundary for hinge loss. It is also related to the smallest importance weight that would cause the prediction to change sign, a quantity that comes up during active learning. (Well it comes up in active learning with linear models; whether the same theory applies to dyadic models is not clear, but I'm hoping sampling ``near the boundary'' and using importance weights will do something reasonable.) The smallest importance weight that causes the model output to be $x$ when the label is $y$ is given by $h_{min} (x, y) = s^{-1} (p^{-1} (x); y)$. Leveraging the separation of variables trick from Karampatziakis and Langford, the update function $s (h; y)$ can be inverted without solving the initial value problem via \[
\begin{aligned}
s' (h; y) &= \eta \frac{\partial l (p, y)}{\partial p}\biggr|_{p = p (s)} = \eta F (s; y), \\
dh &= \frac{1}{\eta F (s; y)} ds \\
s^{-1} (x; y) &= \frac{1}{\eta} \int_{0}^x \frac{d s}{F (s; y)}.
\end{aligned}
\] For hinge loss this boils down to $h_{min} (x, y) = -p^{-1} (x) / (\eta y)$; for other losses the integral has to be done numerically, but for active learning purposes presumably only a rough numerical integration need be done.
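Here's a sketch of those numerics (names and tolerances are my own, not the actual implementation): $p (s)$ rewritten in exponential form so the growing and decaying terms are grouped, Newton's method with backtracking for $p^{-1} (y)$, and the hinge-loss $h_{min}$.

import numpy as np

def prediction(s, a0b0, sqnorms, wx, xx):
    # p(s) = a0^T b0 cosh(2s) - (||a0||^2 + ||b0||^2) cosh(s) sinh(s) + x^T (w - s x),
    # regrouped as ((a0b0 - sqnorms/2) e^{2s} + (a0b0 + sqnorms/2) e^{-2s}) / 2
    # to soften the catastrophic cancellation in the cosh/sinh form
    e2, em2 = np.exp(2.0 * s), np.exp(-2.0 * s)
    dyadic = 0.5 * ((a0b0 - 0.5 * sqnorms) * e2 + (a0b0 + 0.5 * sqnorms) * em2)
    return dyadic + wx - s * xx

def inverse_prediction(y, a0b0, sqnorms, wx, xx, tol=1e-10, max_iter=50):
    # solve p(s) = y for s: Newton's method with a backtracking line search
    s = 0.0
    for _ in range(max_iter):
        f = prediction(s, a0b0, sqnorms, wx, xx) - y
        if abs(f) < tol:
            break
        e2, em2 = np.exp(2.0 * s), np.exp(-2.0 * s)
        dp = (a0b0 - 0.5 * sqnorms) * e2 - (a0b0 + 0.5 * sqnorms) * em2 - xx
        step = -f / dp
        while abs(prediction(s + step, a0b0, sqnorms, wx, xx) - y) > abs(f):
            step *= 0.5                   # backtrack until the residual shrinks
            if abs(step) < 1e-16:
                break
        s += step
    return s

def h_min_hinge(target, label, eta, a0b0, sqnorms, wx, xx):
    # smallest importance weight driving the prediction to `target` under hinge loss:
    # h_min(target, y) = -p^{-1}(target) / (eta y)
    return -inverse_prediction(target, a0b0, sqnorms, wx, xx) / (eta * label)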

Does It Work?

Well, there are many ways to answer that question :)

I had some dyadic data lying around corresponding to the results from a Mechanical Turk HIT in which I asked users to guess the ethnicity of Twitter users based upon their public profile: there are 10,000 tasks and 5 raters each, for a total of 50,000 data points. I split the data up, trained on most of it (15/16ths), and tested on the rest (1/16th). I'm only using the identifiers as features, e.g., an input line might look like
2 2|rater r_2737 |task t_89 |id r_2737 t_89
for a model with one latent dimension and
2 2|rater r_2737 r2_2737 |task t_89 t2_89 |id r_2737 t_89
for a model with two latent dimensions. If that input seems funny, that's because right now I don't emit linear features for feature spaces that are dot-producted together (in this case, "rater" and "task" are dot-producted), so the identifiers have to be repeated in the "id" namespace to get linear terms. This violates DRY so I'll probably change this. (Also, I shouldn't have to regenerate the input to change the number of latent dimensions, so that needs to change.)

The nominallabelextract model I put together to process this data is akin to a dyadic model with one latent dimension. With the above framework I can easily add additional latent dimensions, which is reminiscent of the Welinder et al. multidimensional embedding. The goal is a bit different here, however: in those cases, the idea is to extract ``ground truth'', i.e., a probability distribution over the ``true'' ethnicity for the task. Here the goal will be merely to predict the label that a rater will assign to a task. This is related to the true underlying label of the task, but also related to properties of the rater.

As an aside, to the extent that it is possible to reliably predict the test set labels from the training set labels, that would mean that I paid too much money for my labels since the ones in the test set were redundant. Using active learning techniques on this problem might mitigate the redundancy. Perversely, however, this might reward a rater answering randomly with more work.

I randomly did the training/test split, trained each model on the same training set, and tested each model on the same test set. The idea is to have a lot of twinning of individual raters and individual tasks between the two data sets (but not of pairs; each pair occurs at most once). I'm using a scoring filter tree reduction to reduce multiclass to binary, and then I'm using hinge loss for the binary classifier. Here are some results. \[
\begin{array}{c|c|c}
\mbox{Model } & \mbox{0/1 Loss on Test Set }(\pm 0.007) &\mbox{ Comments } \\ \hline
\mbox{Best Constant } &\mbox{ 0.701 } & \mbox{Most frequent ethnicity occurs }\approx 30\% \mbox{ of the time} \\ \hline
\mbox{Linear } &\mbox{ 0.214 } & \mbox{Prediction is }\alpha_{rid} + \beta_{tid} + c \\ \hline
\mbox{Dyadic }k=1 &\mbox{ 0.198 } & \mbox{Prediction is }a_{rid}^\top b_{tid} + \alpha_{rid} + \beta_{tid} + c \\ \hline
\mbox{Dyadic }k=2 &\mbox{ 0.198 } &\mbox{ Like previous with }a_{rid}, b_{tid} \mbox{ as 2-vectors } \\ \hline
\mbox{Dyadic }k=3 &\mbox{ 0.199 } &\mbox{ Like previous with 3-vectors instead of 2-vectors}
\end{array}
\] Ok so this is not blowing my socks off $\ldots$ the differences are not statistically significant (using a bogus binomial variance estimate). There seems to be a tiny bit of juice when adding a single latent dimension, and after that it poops out. In retrospect this is a poor choice of dataset for testing since absent adversarial behaviour by the raters the important variables are the rater bias and the true task label, both of which are nicely captured by the linear model. By the way I chose all the learning parameters to give the best result for the linear model, then reused them for the other models.

On the other hand, operationally things look promising. First the speed penalty is fairly minor between dyadic $k=3$ and linear (80 seconds vs 50 seconds to do 160 passes). Since an input record is bigger in the dyadic $k=3$ case, I tried training a linear model on the dyadic $k=3$ input data, in which case the difference disappears. I don't think the difference is parsing; rather I suspect more weight values are being accessed per line which is causing more cache misses. In any event, for an equal number of effective features, the overhead of computing $p^{-1} (y)$ is not noticeable. Of course, hinge loss has an analytic solution to the dynamics; for other loss functions I have to solve an initial value problem, which might slow things down considerably.

Second the results are robust with respect to the setting of the learning rate. There is definitely a best learning rate in terms of minimizing test set loss, but when I varied the learning rate over 4 orders of magnitude there was never any pathological behaviour. Perhaps in a more complicated model I would encounter trouble, or perhaps when using a different loss function (a large learning rate might complicate numerically solving the initial value problem). Also, I haven't implemented regularization, which might be required to get a more complicated model working.

So next steps are to apply this to some of the testing datasets mentioned in Menon and Elkan, e.g., the bookcrossing dataset; and to implement for other loss functions and see if I get bad results (due to approximating the multidimensional ODE with a one-dimensional ODE), instability with respect to the learning rate (arising when solving the initial value problem), or unacceptably slow throughput (due to all the additional numerics on each update).

Sunday, April 24, 2011

Semi-Supervised LDA: Part III

In my previous post I discussed the idea of semi-supervised LDA: constructing a topic model on a corpus where some of the documents have an associated label, but most do not. The goal is to extract topics that are informed by all the text in the corpus (like unsupervised LDA) but are also predictive of the observed label (like supervised LDA).

Supervised Results

In my previous post I didn't indicate how to compute $p (y | w)$, i.e., the distribution of document labels given the document text. Following the DiscLDA paper I used the harmonic mean estimator (the worst Monte Carlo method ever), but with a sample size of only 1 (the worst sample size ever). What this means in practice is (a code sketch follows the list):
  1. For each label $y$
    1. Draw $z \sim p (\cdot | w, y)$ using the Gibbs sampler conditioned upon the label.
    2. Compute $p (w | z, y) = \prod_i \frac{n^{(w_i)}_{-d, u^k_{z_i}} + \beta}{n^{(\cdot)}_{-d, u^k_{z_i}} + W \beta}.$
    3. Estimate $p (w | y) \approx p (w | z, y)$.
  2. Compute unnormalized posterior $p (y | w) \propto p (w | y) p (y)$ and normalize.
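Here's a rough sketch of that computation (sample size 1), where gibbs_sample and heldout_topic_counts are hypothetical callables standing in for the actual sampler and its held-out count arrays.

import numpy as np

def label_posterior(doc, labels, label_prior, beta, W, gibbs_sample, heldout_topic_counts):
    # doc: list of word ids; labels: candidate labels; W: vocabulary size
    log_post = []
    for y in labels:
        z = gibbs_sample(doc, y)                      # step 1.1: z ~ p(. | w, y)
        n_wt, n_t = heldout_topic_counts(doc, z, y)   # counts with this doc removed
        # step 1.2: log p(w | z, y)
        log_pw = sum(np.log((n_wt[w_i, z_i] + beta) / (n_t[z_i] + W * beta))
                     for w_i, z_i in zip(doc, z))
        log_post.append(log_pw + np.log(label_prior[y]))
    # step 2: unnormalized posterior p(y | w) prop. to p(w | y) p(y), normalized stably
    log_post = np.array(log_post)
    post = np.exp(log_post - log_post.max())
    return post / post.sum()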
On fully labeled data sets (i.e., supervised learning) this gives good results. Here's the result of a supervised run.
There's no other classifier here: 0/1 loss is being determined based upon the posterior distribution of labels given the document text as generated by the LDA procedure.

You might ask: how come the classifier doesn't drive the loss on the training set to zero? Well loss here is assessed by (temporarily) removing the document from the training set and then running the inference algorithm to get the label posterior. This is still a biased estimator of generalization performance, but it is somewhat conservative. Actual 0/1 loss on a test set was around 0.23 (i.e., 77% accurate).

Next you might ask: only 77%? As I indicated in a previous post, predicting gender from a user's set of followers is challenging, mainly because I've only labeled a small fraction of the data and the popular twitter accounts (i.e., most likely to be followed) tend not to be gender dispositive. I am able to predict the gender of a twitter user quite accurately, but I have to combine the multiple sources of information in the twitter account, of which the social graph is merely one component. In my experience 77% accuracy using only follower information is really good relative to other things I've tried. In particular doing an unsupervised LDA on all the social graph edge sets followed by supervised classification on the labeled subset (via Vowpal Wabbit) achieves 72% accuracy. Note that 72% accuracy is no straw man: this is the best result from an array of experiments where I varied the number of topics and other parameters in the LDA model.

My intuition was that supervised LDA on the labeled subset would not do as well as unsupervised LDA on the entire set followed by supervised classification on the labeled subset, because the latter technique uses all the data. Now I have to update my intuitions!

Semi-Supervised Results

Unfortunately the situation goes south when I add unsupervised data. Here's a plot of accuracy on a test set as a function of the fraction of training data which was unlabeled.
This is unfortunate, because I'd like to take the x-axis in this plot to 0.99 (i.e., only 1% of my overall data is labeled) and the trend is not my friend.

There is still a lot of tuning and bug finding to do, so I'm not necessarily sunk. However, there is no nice theoretical result driving things either: I'm not aware of any guarantees of classification improvement from adding unsupervised data to a classifier based upon a generative model, especially when the generative model is known to be incorrect (i.e., documents are not actually generated this way). So at some point, I'll have to raise the white flag.