## Wednesday, March 30, 2011

### Filter Tree Reduction: Perl Implementation

Lately I've been solving multi-classification problems with vowpal using a machine learning reduction. Ideally I would have programmed this in C using a reductions API provided by vowpal. In practice, vowpal has been in flux; therefore to isolate myself I've been treating vowpal as a black box with which I communicate via IPC. There is a penalty for this approach: I estimate my total throughput would be at least 4 times larger if I implemented the reduction within vowpal (based upon the output of top). Hopefully John and crew will provide a stable vowpal reduction API in the near future.

In the meantime, although it is a bit pokey the reduction I'm presenting here is still practical. In addition, sometimes just seeing an implementation of something can really crystallize the concepts, so I thought I'd present the reduction here.

#### The Strategy

The starting point is the Filter Tree reduction of cost-sensitive multiclass classification to importance weighted binary classification. In this reduction, class labels are arranged into a March-madness style tournament, with winners playing winners until one class label emerges victorious: that is the resulting prediction. When two class labels play each other'', what really happens is an importance weighted classifier decides who wins based upon the associated instance features $x$.

In practice I'm using a particular kind of filter tree which I call a scoring filter tree. Here the importance weighted classifier is constrained to be of the form $\Psi_{\nu} (x) = 1_{f (x; \lambda) > f (x; \phi)}.$ Here $\lambda$ and $\phi$ are the two class labels who are playing each other'' to see who advances in the tournament. What this equation says is:
1. There is a function $f$ which says how good each class label is given the instance features $x$.
2. The better class label always beats the other class label.
This implies that the winner of the tournament is the best team according to $f$. This makes $f$ look like a scoring function (like what would be obtained from argmax regression) and essentially one can ignore the tournament structure at test time. The use of the tournament at train time is critical however to obtaining good performance on noisy problems (i.e., low regret).

#### The Implementation

I'll assume that we're trying to classify between $|K|$ labels denoted by integers $\{ 1, \ldots, |K|\}$. I'll also assume an input format which is very close to vowpal's native input format, but with a cost vector instead of a label. $c_1,\ldots,c_{|K|}\; \textrm{importance}\; \textrm{tag}|\textrm{namespace}\; \textrm{feature} \ldots$ So for instance a 3 class problem input line might look like $0.7,0.2,1.3\; 0.6\; \textrm{idiocracy}|\textrm{items}\; \textrm{hotlatte}\; |\textrm{desires}\; \textrm{i}\; \textrm{like}\; \textrm{money}$ The best choice (lowest cost) class here is 2.

##### Test Time
Applying the model is easier to understand than training it, so I'll start there. Within the perl I transform this into a set of vowpal input lines where each line corresponds to a particular class label $n$, $\; \textrm{tag}|\textrm{namespace}n\; \textrm{feature} \ldots$ Essentially the cost vector and importance weight are stripped out (since there is no learning happening right now), the tag is stripped out (I handle that separately), and each namespace has the class label appended to it. Since vowpal uses the first letter to identify namespaces, options that operate on namespaces (e.g., -q, --ignore) will continue to work as expected. So for instance continuing with the above example we would generate three lines $|\textrm{items}1\; \textrm{hotlatte}\; |\textrm{desires}1\; \textrm{i}\; \textrm{like}\; \textrm{money}\; |\_1\; k$ $|\textrm{items}2\; \textrm{hotlatte}\; |\textrm{desires}2\; \textrm{i}\; \textrm{like}\; \textrm{money}\; |\_2\; k$ $|\textrm{items}3\; \textrm{hotlatte}\; |\textrm{desires}3\; \textrm{i}\; \textrm{like}\; \textrm{money}\; |\_3\; k$ Each of these lines is fed to vowpal, and the class label that has the lowest vowpal output is selected as the winner of the tournament. That last feature $k$ in the namespace _ is providing a class label localized version of the constant feature that vowpal silently provides on every example.

##### Train Time
At train time I essentially run the tournament: but since I know the actual costs, I update the classifier based upon who should have won''. The importance weight of the update is determined by the absolute difference in costs between the two teams that just played. So in the case of two class labels $i$ and $j$ there will be a training input fed to vowpal of the form, $1\; \omega\; \textrm{tag}|\textrm{namespacei:1}\; \textrm{feature} \ldots |\textrm{namespacej:-1}\; \textrm{feature} \ldots |\textrm{\_i:-1} \; k\; |\textrm{\_j:-1}\; k$ where $\omega = \textrm{importance} * \mbox{abs} (c_i - c_j)$, i.e., the original importance weight scaled by the absolute difference in the actual costs. Here I'm leveraging the ability to provide a weight on a namespace which multiplies the weights on all the features in the namespace. (What about that pesky constant feature that vowpal always provides? It's still there and really it shouldn't be. The right way to deal with this would be to patch vowpal to accept an option not to provide the constant feature. However I want to present something that works with an unpatched vowpal, so instead I feed another training input with everything negated in order to force the constant feature to stay near zero.)

When a team wins a game they should not have won, they still advance in the tournament. Intuitively, the classifier needs to recover gracefully from mistakes made previously in the tournament, so this is the right thing to do.

##### What's Missing
Here are some things I'd like to improve:
1. Implement inside vowpal instead of outside via IPC.
2. In the implementation I manually design the tournament based upon a particular number of classes. It would be better to automatically construct the tournament.
3. It would be nice to have a concise way to specify sparse cost-vectors. For example when all errors are equally bad all that is needed is the identity of the correct label.
4. The above strategy doesn't work with hinge loss, and I don't know why (it appears to work with squared and logistic loss). Probably I've made a mistake somewhere. Caveat emptor!

#### The Code

There are two pieces:
• vowpal.pm: this encapsulates the communication with vowpal. You'll need this to get it to work, but mostly this boring unix IPC stuff.
• It's not very good at detecting that the underlying vw did not start successfully (e.g., due to attempting to load a model that does not exist). However you will notice this since it just hangs.
• filter-tree: perl script where the reduction implementation actually lives. You invoke this to get going. Mostly it takes the same arguments as vw itself and just passes them through, with some exceptions:
1. You have to read data from standard input. I could intercept --data arguments and emulate them, but I don't.
2. You can't use the --passes argument because of the previous statement.
3. I do intercept the -p argument (for outputting predictions) and emulate this at the reduction level.

The output you see from filter-tree looks like the output from vw, but it not. It's actually from the perl script, and is designed to look like vw output suitably modified for the multiclass case.

Here's an example invocation:
% zcat traindata.gz | head -1000 | ./filter-tree --adaptive -l 1 -b 22 --loss_function logistic -f model.users.b22
average    since       example  example    current  current  current
loss       last        counter   weight      label  predict features
1.000000   1.000000          1      1.0     1.0000   0.0000       16
0.500000   0.000000          2      2.0     1.0000   1.0000       15
0.500000   0.500000          4      4.0     2.0000   1.0000       20
0.375000   0.250000          8      8.0     2.0000   2.0000       19
0.562500   0.750000         16     16.0     5.0000   2.0000       23
0.437500   0.312500         32     32.0     0.0000   1.0000       14
0.281250   0.125000         64     64.0     1.0000   1.0000       16
0.312500   0.343750        128    128.0     0.0000   1.0000       16
0.347656   0.382812        256    256.0     1.0000   1.0000       13
0.322266   0.296875        512    512.0     1.0000   1.0000       20

finished run
number of examples = 1000
weighted examples sum = 1000
average cost-sensitive loss = 0.287
average classification loss = 0.287
best constant for cost-sensitive = 1
best constant cost-sensitive loss = 0.542
best constant for classification = 1
best constant classification loss = 0.542
minimum possible loss = 0.000
confusion matrix
15      1       0       1       0       1       0
77      416     53      23      5       0       1
14      41      281     56      8       3       2
0       0       0       1       0       1       0
0       0       0       0       0       0       0
0       0       0       0       0       0       0
0       0       0       0       0       0       0

The -p argument outputs a tab separated set of columns. The first column is the predicted class label, the next $|K|$ columns are the scoring function values per class label, and the last column is the instance tag.

As is typical, the source code is (unfortunately) the best documentation.

#### filter-tree

#! /usr/bin/env perl

use warnings;
use strict;

use vowpal;

$SIG{INT} = sub { die "caught SIGINT"; }; # if this looks stupid it is: these used to be actual class names, # but i didn't want to release code with the actual class labels that i'm using use constant { ZERO => 0, ONE => 1, TWO => 2, THREE => 3, FOUR => 4, FIVE => 5, SIX => 6, }; sub argmin (@) { my (@list) = @_; my$argmin = 0;

foreach my $x (1 ..$#list)
{
if ($list[$x] < $list[$argmin])
{
$argmin =$x;
}
}

return $argmin; } sub tweak_line ($$) { my (suffix, rest) = @_; rest =~ s/\|(\S*)/\|{1}{suffix}/g; return rest; } sub train_node ($$$$) { my (m, la, lb, pa, pb, ca, cb, i, rest) = @_; my argmin = (ca < cb) ? -1 : 1; my absdiff = abs (ca - cb); if (absdiff > 0) { chomp rest; my w = i * absdiff; my plusone = 1; my minusone = -1; my chirp = (rand () < 0.5) ? 1 : -1; argmin *= chirp; plusone *= chirp; minusone *= chirp; m->send ("argmin w", tweak_line ("{la}:plusone", " |rest |_ k"), tweak_line ("{lb}:minusone", " |rest |_ k\n"))->recv () or die "vowpal failed to respond"; argmin *= -1; plusone *= -1; minusone *= -1; m->send ("argmin w", tweak_line ("{la}:plusone", " |rest |_ k"), tweak_line ("{lb}:minusone", " |rest |_ k\n"))->recv () or die "vowpal failed to respond"; } return pa - pb; } sub print_update ($$$$) { my (total_loss, since_last, delta_weight, example_counter, example_weight, current_label, current_predict, current_features) = @_; printf STDERR "%-10.6f %-10.6f %8lld %8.1f %s %8.4f %8lu\n", example_weight > 0 ? total_loss / example_weight : -1, delta_weight > 0 ? since_last / delta_weight : -1, example_counter, example_weight, defined (current_label) ? sprintf ("%8.4f", current_label) : " unknown", current_predict, current_features; } #--------------------------------------------------------------------- # main #--------------------------------------------------------------------- srand 69; my @my_argv; my pred_fh; while (@ARGV) { my arg = shift @ARGV; last if arg eq '--'; if (arg eq "-p") { my pred_file = shift @ARGV or die "-p argument missing"; pred_fh = new IO::File pred_file, "w" or die "pred_file: !"; } else { push @my_argv, arg; } } my model = new vowpal join " ", @my_argv; print STDERR <<EOD; average since example example current current current loss last counter weight label predict features EOD my total_loss = 0; my since_last = 0; my example_counter = 0; my example_weight = 0; my delta_weight = 0; my dump_interval = 1; my @best_constant_loss = map { 0 } (ZERO .. SIX); my @best_constant_classification_loss = map { 0 } (ZERO .. SIX); my minimum_possible_loss = 0; my classification_loss = 0; my mismatch = 0; my %confusion; while (defined (_ = <>)) { my (preline, rest) = split /\|/, _, 2; die "bad preline preline" unless preline =~ /^([\d\.]+)?\s+([\d\.]+\s+)?(\S+)?/; my label = 1; my importance = 2 ? 2 : 1; my tag = 3; my (actual_tag, @costs) = split /,/, tag; die "bad tag tag" unless @costs == 0 || @costs == 8; my @scores = map { my s = model->send (tweak_line (_, " |rest |_ k"))->recv (); chomp s; s } (ZERO .. SIX); my current_prediction = argmin @scores; if (@costs == 8) { # it turned out better for my problem to combine classes 6 and 7. # costs are already inverted and subtracted from 1, so, # have to subtract 1 when doing this my class_seven = pop @costs; costs[SIX] += class_seven - 1; # zero level my zero_one = train_node (model, ZERO, ONE, scores[ZERO], scores[ONE], costs[ZERO], costs[ONE], importance, rest) <= 0 ? ZERO : ONE; my two_three = train_node (model, TWO, THREE, scores[TWO], scores[THREE], costs[TWO], costs[THREE], importance, rest) <= 0 ? TWO : THREE; my four_five = train_node (model, FOUR, FIVE, scores[FOUR], scores[FIVE], costs[FOUR], costs[FIVE], importance, rest) <= 0 ? FOUR : FIVE; # SIX gets a pass # first level: (zero_one vs. two_three), (four_five vs. SIX) my fleft = train_node (model, zero_one, two_three, scores[zero_one], scores[two_three], costs[zero_one], costs[two_three], importance, rest) <= 0 ? zero_one : two_three; my fright = train_node (model, four_five, SIX, scores[four_five], scores[SIX], costs[four_five], costs[SIX], importance, rest) <= 0 ? four_five : SIX; # second level: fleft vs. fright my root = train_node (model, fleft, fright, scores[fleft], scores[fright], costs[fleft], costs[fright], importance, rest) <= 0 ? fleft : fright; total_loss += importance * costs[root]; since_last += importance * costs[root]; example_weight += importance; delta_weight += importance; my best_prediction = argmin @costs; foreach my c (ZERO .. SIX) { best_constant_loss[c] += importance * costs[c]; if (c != best_prediction) { best_constant_classification_loss[c] += importance; } } minimum_possible_loss += importance * costs[best_prediction]; classification_loss += (current_prediction == best_prediction) ? 0 : 1; ++confusion{"current_prediction:best_prediction"}; ++mismatch if root ne current_prediction; } print pred_fh (join "\t", current_prediction, @scores, actual_tag), "\n" if defined pred_fh; ++example_counter; if (example_counter >= dump_interval) { my @features = split /\s+/, rest; # TODO: not really print_update (total_loss, since_last, delta_weight, example_counter, example_weight, (@costs) ? (argmin @costs) : undef, current_prediction, scalar @features); dump_interval *= 2; since_last = 0; delta_weight = 0; } } my average_loss = sprintf "%.3f", example_weight > 0 ? total_loss / example_weight : -1; my best_constant = argmin @best_constant_loss; my best_constant_average_loss = sprintf "%.3f", example_weight > 0 ? best_constant_loss[best_constant] / example_weight : -1; my best_constant_classification = argmin @best_constant_classification_loss; my best_constant_classification_average_loss = sprintf "%.3f", example_weight > 0 ? best_constant_classification_loss[best_constant_classification] / example_weight : -1; my minimum_possible_average_loss = sprintf "%.3f", example_weight > 0 ? minimum_possible_loss / example_weight : -1; my classification_average_loss = sprintf "%.3f", example_weight > 0 ? classification_loss / example_weight : -1; print <<EOD; finished run number of examples = example_counter weighted examples sum = example_weight average cost-sensitive loss = average_loss average classification loss = classification_average_loss best constant for cost-sensitive = best_constant best constant cost-sensitive loss = best_constant_average_loss best constant for classification = best_constant_classification best constant classification loss = best_constant_classification_average_loss minimum possible loss = minimum_possible_average_loss confusion matrix EOD #train/test mismatch = mismatch foreach my pred (ZERO .. SIX) { print join "\t", map { confusion{"pred:_"} || 0 } (ZERO .. SIX); print "\n"; }  #### vowpal.pm # vowpal.pm package vowpal; use warnings; use strict; use POSIX qw (tmpnam mkfifo); use IO::File; use IO::Pipe; use IO::Poll; sub new ($$) { my$class = shift;
my $args = shift; my$pred_pipename = tmpnam () or die $!; my$pred_pipe = mkfifo ($pred_pipename, 0700) or die$!;
my $pred_fd = POSIX::open ($pred_pipename,
&POSIX::O_RDONLY |
&POSIX::O_NONBLOCK |
&POSIX::O_NOCTTY) or die $!; my$pred_fh = new IO::Handle;
$pred_fh->fdopen ($pred_fd, "r") or die $!; POSIX::fcntl ($pred_fh,
&POSIX::F_SETFL,
POSIX::fcntl ($pred_fh, &POSIX::F_GETFL, 0) & ~&POSIX::O_NONBLOCK); my$data_fh = new IO::Pipe or die $!; open my$oldout, ">&STDOUT" or die "Can't dup STDOUT: $!"; eval { open STDOUT, ">", "/dev/null" or die "Can't redirect STDOUT:$!";
eval
{
open my $olderr, ">&STDERR" or die "Can't dup STDERR:$!";
eval
{
open STDERR, ">", "/dev/null" or die "Can't redirect STDERR: $!";$data_fh->writer ("vw $args -p$pred_pipename --quiet") or die $!;$data_fh->autoflush (1);
};
open STDERR, ">&", $olderr or die "Can't restore STDERR:$!";
die $@ if$@;
};
open STDOUT, ">&", $oldout or die "Can't restore STDOUT:$!";
die $@ if$@;
};
die $@ if$@;

my $poll = new IO::Poll;$poll->mask ($data_fh => POLLOUT);$poll->poll ();
$poll->remove ($data_fh);
$poll->mask ($pred_fh => POLLIN);

my $self = { data_fh =>$data_fh,
pred_fh => $pred_fh, pred_file =>$pred_pipename,
poll => $poll, args =>$args };

bless $self,$class;
return $self; } sub send ($@)
{
my $self = shift;$self->{'data_fh'}->print (@_);

return $self; } sub recv ($)
{
my $self = shift;$self->{'poll'}->poll ();
return $self->{'pred_fh'}->getline (); } sub DESTROY { my$self = shift;

$self->{'data_fh'}->close ();$self->{'pred_fh'}->close ();
}

1;


## Sunday, March 27, 2011

### Regarding leveraging the Twitter social graph

Lately I have been occupied building various classifiers which take twitter profiles and predict properties of interest to advertisers: things like gender and ethnicity. Initially I focused on recent tweets as a source of features, and achieved some success. I then started incorporating other portions of the twitter profile in order to improve my classifiers.

## Quick and dirty

Two additional sources of information looked immediately promising: the bio and the social graph. Since I'm using sparse logistic regression as provided by vowpal wabbit, I started by doing the most naive encoding possible: I took the top N tokens by mutual information with my label and nominally encoded them. To clarify, a token in a bio is more or less what you would expect; whereas a token in a social graph is the numeric twitter identity of the account on the other end of the connection (I only considered following behaviour, and ignored being followed). This naive approach applied to a gender classifier led to some improvement for the bio tokens, but essentially no improvement for the social tokens.

## Semi-supervised

As has been a recent theme on this blog, an important aspect of my situation is that unlabeled profiles outnumber labeled profiles by roughly 4 orders of magnitude. I therefore took a large number of bios, used LDA to build a topic model from them, and then used the resulting features as inputs to my supervised classifier. I also took a large number of social graph edge sets, used LDA to build a topic model from them, and then used the resulting features as inputs to my supervised classifier.

Surprisingly, the bios LDA features did not do much better than the nominal encoding of bio tokens. However, the social graph LDA features did do better than the nominal encoding of social tokens.

## What's going on?

Because the social graph information was useful in the form of LDA features, this suggests the issue with the nominal encoding is some combination of sample complexity and classification technique (or more likely: I made a mistake somewhere). While I don't have a complete understanding of what's going on, this episode prompted me to look at the statistics of the social graph relative to the bio.

So exhibit 1 is the rank-ordered frequency of a single token occurring in either a bio or a social edge set. What this means:
• For bio: if you pick a random word from all the bios concatenated together (or equivalently, sample from all bios proportional to the number of words, then pick a random word in that bio), this is the probability that the Nth most frequently used token will be the one you choose.
• For social: if you pick a random edge from the (directed) graph of all "A follows B" relations (or equivalently, sample from all twitter accounts proportional to the number of accounts followed, then pick a random account which that twitter profile is following), this is the probability that the Nth most frequently followed twitter account will be the one followed for the edge selected.
Without a doubt when sampling in this fashion, you are far more likely to choose the most frequent token occurring in a bio than the most frequently followed twitter account. Viewed this way bio tokens have a larger head and shorter tail than social tokens. So I thought, aha this is the problem, nominal encoding ran into sample complexity issues due to the heavy tail of the social edge set''.

However I now believe this is incorrect. Consider exhibit 2 which is similar, but the rank-order and the frequency are in terms of accounts and not tokens. In other words:
• For bio: if you pick a random twitter account, this is the probability that at least one of the tokens in that account's bio will be the Nth most frequently occurring token.
• For social: if you pick a random twitter account, this is the probability that the account follows the Nth most frequently followed twitter account.
Arguably the above sampling procedure is more important for the nominal encoding strategy: since I have a small fraction of the total profiles labeled (e.g., 1 profile in 1000), my sample can only support nominal inference on tokens that occur above a certain rate per profile (e.g., 1 profile in 100, so that I can expect to have 10 labeled samples associated with the presence of the token). So then, exhibit 2:
Viewed in this fashion the head is more substantial for the social graph: said differently, there were about 100 words that could have been potentially useful for nominal encoding of the bio, whereas there were about 1000 twitter accounts that could have been potentially useful for the nominal encoding of the social edge set. Fundamentally the difference between exhibit 1 and exhibit 2 is that most profiles have more followers than they do words in their bio, so sampling an entire profile yields more social tokens (alternatively: the graph in exhibit 1 normalizes to 1, the graph in exhibit 2 does not).

So now what I think is happening is that within the set of sufficiently frequent bio tokens are many that are dispositive with respect to gender: obvious words like "father", "mother", "husband", and "wife"; but also less obvious things like women being more likely to say "bitch" and Indonesian men being more likely to say "saya". Meanwhile among the set of most popular accounts on twitter, the gender preference of followers of that account is comparatively not very strong. For instance, Coldplay has a lot of followers, but as far as I can tell men and women are nearly equally likely to follow Coldplay. Although Oprah sounds intuitively gender polarizing, women are only twice as likely as men to follow Oprah. Compare this to: men are 42 times more likely to use the word "sucks" in their bio, and women are 20 times more likely to use the word "girl". There are a few popular twitter accounts that are comparatively extremely gender polarizing (e.g., TheSingleWoman and Chris_Broussard) but overall the most common things people say when describing themselves appear more dispositive regarding gender than the most common twitter accounts that people choose to follow (once recent tweets have been taken into account).

So my takeaway is: to really leverage the social graph is going to require stepping up my technique. That means:
• Getting more labeled data. Active learning techniques will be critical here to prevent my mechanical turk bill from exploding.
• Leveraging the unlabeled social graph data more extensively. This includes trying more unsupervised techniques on the social graph; but also doing more explicitly semi-supervised techniques as well (instead of: unsupervised followed by supervised).

## Wednesday, March 23, 2011

### LDA on a Social Graph

In my previous post I indicated that I was faced with a variety of semi-supervised problems, and that I was hoping to utilize LDA on the social graph in order to build a feature representation that would improve my performance on various classification tasks. Now that I've actually done it, I thought I'd share some results.

## LDA on Graphs

The strategy is to treat the edge sets at each vertex of the social graph as a document and then apply LDA to the resulting document corpus, similar to Zhang et. al. Since I'm considering Twitter's social graph, the latent factors might represent interests or communities, but I don't actually care as long as the resulting features improve my supervised classifiers.

When LDA was first applied in Computer Vision, it was first applied essentially without modification with some success. Then the generative model was adapted to the problem domain to improve performance (e.g., in the case of Computer Vision, by incorporating spatial structure). Things are done in this order for a very practical reason: when you apply the standard generative model, you get to leverage someone else's optimized and correct implementation! For the same reasons I'm sticking with the original LDA here, but there are some aspects I've noticed are not a perfect fit.
• On directed social graphs (such as Twitter) there are two kinds of edges which is analogous to two different kinds of tokens being present in the document. LDA only has one token type. Possibly this can be worked around by prefixing every edge with a '+' or '-' indicating direction. In practice I sidestep this problem by only modeling the outgoing edges (i.e., the set of people that someone follows).
• An edge can only exist once in an edge set, whereas with vanilla LDA a token can occur multiple times in a text document. Taking into account this negative correlation between edge emission probabilities might improve results.

Even though I don't actually care about understanding the latent factors, it makes for entertaining blog fodder. So now for the fun. I ran a 10 topic LDA model over the edge sets from a random sample of twitter users, in order to get a broad overview of the graph structure. Here are the top 10 mostly likely twitter accounts for each topic:
So roughly speaking I see Hispanic (topic 1), Asian (topic 2), Hip Hop (topic 3), Asian with western influence (topic 4), Hip Hop with astrological influence (topic 5), News and Comedy (topic 6), North American celebrities (topic 7), Hip Hop (topic 8), Asian (topic 9), and Hip Hop (topic 10).

And yes, this data was collected prior to Charlie Sheen's meteoric rise.

## shitmydadsays is a News Site

Actually topic 6 is truly fascinating. Perhaps it is best called "Stuff News Junkies like". There is no doubt that news interest and comedy interest intersect, but the causality is unclear: does one need to watch the news to understand the jokes, or does one need the jokes to avoid severe depression after watching the news?

## The Cultural Polysemy of Justin Beiber

When using LDA to analyze text, tokens that have high emission probability for multiple topics often have multiple meanings. Here we see justinbieber has high emmission probability for topics 4 and 7, which are otherwise mostly of Asian and North American focus respectively. One interpretation is that the appeal of justinbieber cuts across both cultures.

## Thursday, March 3, 2011

### Another Reason to Enjoy Lightning Fast LDA

In my current situation I am faced with problems involving lots of unlabeled data and a smaller amount of labeled data. This puts me in the semi-supervised learning zone.

One popular semi-supervised technique is to use an unsupervised technique on the unlabeled data to learn a data representation and then use the resulting data representation with a supervised technique on the smaller labeled data. I'm looking at Twitter data and tweets are text (arguably, unnatural language'') so LDA is a natural choice here, and the recently developed super-fast implementation in vowpal wabbit is germane.

Twitter is also a social network but incorporating the social graph information in the most straightforward way (nominally encoding the identities of direct connections) is analogous to the most straightforward way of encoding text tokens. It works great when you have millions of labeled examples, but otherwise is too sparse to be very useful.

When you have a hammer, everything looks like a nail, so I thought maybe I could treat the edge set associated with a vertex as a document and then perform LDA on all edge sets. Turns out this has been done already, and the results look reasonable. From my perspective I don't care if the latent factors are communities or interests (with Twitter, probably a bit of both), only that the resulting features end up improving my supervised learner.

## Wednesday, March 2, 2011

### Sherlock Holmes

I watched the pilot episode A Study in Pink of the BBC television series Sherlock. It made me wonder if Sherlock Holmes is even theoretically possible. This is not just a fanciful question: if what Sherlock Holmes is doing can be done, but it too difficult for humans to do in practice, then eventually we can build machines that will give the police the powers of Sherlock Holmes.

The Sherlock Holmes formula, from a Bayesian perspective, consists of copious amounts of observation coupled with strong assumptions on the likelihood and occasionally strong prior assumptions to resolve an ambiguity. My question is whether the observations actually contain as much information as Sherlock says, i.e., is the likelihood (or prior) terribly misspecified?

For instance, in A Study in Pink Sherlock concludes that the owner of a cell phone is a habitual drunk on the basis of extensive scratching found near the power plug on the phone: only a habitual drunk has consistently shaky hands when plugging in their phone at night.'' But is that true? If we were to survey millions of cell phones, select the ones with extreme scratching around the power plug, and then look at the proportion of habitual drunks in the resulting owner population, what would we find relative to the proportion of habitual drunks amongst all cell phone owners?

In any event there are enough known problems with human reasoning, e.g. confirmation bias, that a future computerized police assistant will probably greatly improve detective work, even if correct extrapolations from small observations are not achievable.

Also, the show is really well done.