Quick and dirty

Two additional sources of information looked immediately promising: the bio and the social graph. Since I'm using sparse logistic regression as provided by vowpal wabbit, I started with the most naive encoding possible: I took the top N tokens by mutual information with my label and encoded them nominally. To clarify, a token in a bio is more or less what you would expect, whereas a token in a social graph is the numeric twitter identity of the account on the other end of the connection (I only considered following behaviour and ignored being followed). Applied to a gender classifier, this naive approach yielded some improvement from the bio tokens, but essentially no improvement from the social tokens.
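The naive encoding above can be sketched as follows. This is a toy illustration, not the actual pipeline: the corpus, the mutual-information scoring details, and the `to_vw` helper are all made up, though the output follows vowpal wabbit's standard input format.

```python
import math
from collections import Counter

def mutual_information(examples, token):
    """I(presence of token; label) over labeled (tokens, label) pairs."""
    n = len(examples)
    joint = Counter((token in toks, lab) for toks, lab in examples)
    marg_t = Counter(token in toks for toks, _ in examples)
    marg_l = Counter(lab for _, lab in examples)
    return sum((c / n) * math.log(c * n / (marg_t[t] * marg_l[l]))
               for (t, l), c in joint.items())

# toy labeled bios: (set of tokens, gender label)
examples = [({"father", "runner"}, 1), ({"mother", "runner"}, 0),
            ({"husband"}, 1), ({"wife", "runner"}, 0)]

# keep the top N tokens by mutual information with the label
vocab = sorted({tok for toks, _ in examples for tok in toks})
top = sorted(vocab, key=lambda t: mutual_information(examples, t),
             reverse=True)[:3]

def to_vw(tokens, label):
    # vw's logistic loss expects labels in {-1, 1}; the surviving
    # tokens become nominal (presence-only) features in one namespace
    feats = " ".join(t for t in sorted(tokens) if t in top)
    return f"{1 if label else -1} |bio {feats}"

print(to_vw({"father", "runner"}, 1))
```

The same encoding applies to social tokens; only the namespace and the token source (followed account ids instead of bio words) change.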
Semi-supervised

As has been a recent theme on this blog, an important aspect of my situation is that unlabeled profiles outnumber labeled profiles by roughly four orders of magnitude. I therefore took a large number of bios, used LDA to build a topic model from them, and fed the resulting topic features into my supervised classifier. I did the same with a large number of social graph edge sets.
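The unsupervised-then-supervised pipeline looks roughly like this. A hedged sketch: it uses scikit-learn's LDA implementation rather than whatever was actually used, and the toy "documents" (bios, or edge sets written as space-separated account ids) are illustrative.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# each "document" is either a bio, or an account's edge set
# written as a space-separated list of followed account ids
unlabeled = [
    "proud father husband runner",
    "mother wife coffee addict",
    "12 345 678",
    "12 999 678",
]

# \S+ keeps numeric account ids as tokens
counts = CountVectorizer(token_pattern=r"\S+").fit_transform(unlabeled)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# per-document topic proportions become dense features
# for the downstream supervised classifier
features = lda.transform(counts)
print(features.shape)
```

The topic proportions are then appended to each labeled example's features before supervised training.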
Surprisingly, the bio LDA features did not do much better than the nominal encoding of bio tokens. However, the social graph LDA features did do better than the nominal encoding of social tokens.
What's going on?

The fact that the social graph information was useful in the form of LDA features suggests the issue with the nominal encoding is some combination of sample complexity and classification technique (or, more likely, I made a mistake somewhere). While I don't have a complete understanding of what's going on, this episode prompted me to look at the statistics of the social graph relative to the bio.
So exhibit 1 is the rank-ordered frequency of a single token occurring in either a bio or a social edge set. What this means:
- For bio: if you pick a random word from all the bios concatenated together (or equivalently, sample from all bios proportional to the number of words, then pick a random word in that bio), this is the probability that the Nth most frequently used token will be the one you choose.
- For social: if you pick a random edge from the (directed) graph of all "A follows B" relations (or equivalently, sample from all twitter accounts proportional to the number of accounts followed, then pick a random account which that twitter profile is following), this is the probability that the Nth most frequently followed twitter account will be the one followed for the edge selected.
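Both curves in exhibit 1 can be computed the same way: pool every token occurrence (or every edge), count, rank, and normalize. A small sketch with toy bios; the edge-set version is identical with followed-account ids in place of words.

```python
from collections import Counter

# toy data: each inner list is one bio's tokens
bios = [["father", "runner"], ["mother", "runner"], ["runner"]]

# pool all token occurrences across all bios
token_counts = Counter(tok for bio in bios for tok in bio)
total = sum(token_counts.values())

# rank_freq[N-1] = probability that a uniformly random token
# occurrence is the Nth most frequent token
rank_freq = [c / total for _, c in token_counts.most_common()]
print(rank_freq)  # [0.6, 0.2, 0.2]
```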
However, I now believe this view is misleading. Consider exhibit 2, which is similar, but where the rank order and the frequency are in terms of accounts rather than tokens. In other words:
- For bio: if you pick a random twitter account, this is the probability that at least one of the tokens in that account's bio will be the Nth most frequently occurring token.
- For social: if you pick a random twitter account, this is the probability that the account follows the Nth most frequently followed twitter account.
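The account-level curve of exhibit 2 differs from exhibit 1 in one way: each token counts at most once per account, and the denominator is the number of accounts rather than the number of token occurrences. A sketch matching the bullet definitions above, again with made-up data:

```python
from collections import Counter

# each set is one account's bio tokens; duplicates within a bio
# don't matter, so sets are the natural representation
accounts = [{"father", "runner"}, {"mother", "runner"}, {"runner"}]

# document frequency: in how many accounts does each token appear?
doc_freq = Counter(tok for acct in accounts for tok in acct)

# rank_prob[N-1] = probability that a random account contains
# the Nth most frequent token
rank_prob = [c / len(accounts) for _, c in doc_freq.most_common()]
print(rank_prob[0])  # 1.0: every toy account says "runner"
```

Unlike exhibit 1's curve, these probabilities need not sum to one, since one account contributes to many ranks.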
So now what I think is happening: within the set of sufficiently frequent bio tokens are many that are dispositive with respect to gender. There are obvious words like "father", "mother", "husband", and "wife", but also less obvious signals, such as women being more likely to say "bitch" and Indonesian men being more likely to say "saya". Meanwhile, among the most popular accounts on twitter, the gender preference of an account's followers is comparatively weak. For instance, Coldplay has a lot of followers, but as far as I can tell men and women are nearly equally likely to follow Coldplay. Oprah sounds intuitively gender polarizing, yet women are only twice as likely as men to follow her. Compare that with bio tokens: men are 42 times more likely to use the word "sucks" in their bio, and women are 20 times more likely to use the word "girl". A few popular twitter accounts are comparatively extremely gender polarizing (e.g., TheSingleWoman and Chris_Broussard), but overall the most common things people say when describing themselves appear more dispositive of gender than the most common twitter accounts people choose to follow (once recent tweets have been taken into account).
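The "42 times more likely" style of statement corresponds to a simple likelihood ratio between the two classes. A hedged sketch with made-up toy counts (the real numbers above come from the full dataset), using add-one smoothing so unseen tokens don't divide by zero:

```python
def likelihood_ratio(token, male_bios, female_bios):
    """P(token in bio | male) / P(token in bio | female), add-one smoothed."""
    m = sum(token in bio for bio in male_bios) + 1
    f = sum(token in bio for bio in female_bios) + 1
    return (m / (len(male_bios) + 2)) / (f / (len(female_bios) + 2))

# toy bios represented as token sets
male = [{"sucks", "gamer"}, {"sucks"}, {"father"}]
female = [{"girl", "mother"}, {"girl"}, {"wife"}]

# ratio > 1 means the token leans male in this toy sample
print(likelihood_ratio("sucks", male, female))
```

The same ratio computed over followed-account ids instead of bio tokens gives the "twice as likely to follow Oprah" style of statement.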
So my takeaway is that really leveraging the social graph is going to require stepping up my technique. That means:
- Getting more labeled data. Active learning techniques will be critical here to prevent my mechanical turk bill from exploding.
- Leveraging the unlabeled social graph data more extensively. This includes trying more unsupervised techniques on the social graph, but also using explicitly semi-supervised techniques (instead of unsupervised followed by supervised).