Friday, January 13, 2017

Generating Text via Adversarial Training

There was a really cute paper at the GAN workshop this year, Generating Text via Adversarial Training by Zhang, Gan, and Carin. In particular, they make a couple of unusual choices that appear important. (Warning: if you are not familiar with GANs, this post will not make a lot of sense.)
  1. They use a convolutional neural network (CNN) as a discriminator, rather than an RNN. In retrospect this seems like a good choice, e.g. Tong Zhang has been crushing it in text classification with CNNs. CNNs are a bit easier to train than RNNs, so the net result is a powerful discriminator with a relatively easy optimization problem associated with it.
  2. They use a smooth approximation to the LSTM output in their generator, but actually this kind of trick appears everywhere so isn't so remarkable in isolation.
  3. They use a pure moment matching criterion for the saddle point optimization (estimated over a mini-batch). GANs started with a pointwise discrimination loss and more recent work has augmented this loss with moment matching style penalties, but here the saddle point optimization is pure moment matching. (So technically the discriminator isn't a discriminator. They actually refer to it as discriminator or encoder interchangeably in the text, this explains why.)
  4. They are very smart about initialization. In particular the discriminator is pre-trained to distinguish between a true sentence and the same sentence with two words swapped in position. (During initialization, the discriminator is trained using a pointwise classification loss). This is interesting because swapping two words preserves many of the $n$-gram statistics of the input, i.e., many of the convolutional filters will compute the exact same value. (I've had good luck recently using permuted sentences as negatives for other models, now I'm going to try swapping two words.)
  5. They update the generator more frequently than the discriminator, which is counter to the standard folklore which says you want the discriminator to move faster than the generator. Perhaps this is because the CNN optimization problem is much easier than the LSTM one; the use of a purely moment matching loss might also be relevant.


The old complaint about neural network papers was that you couldn't replicate them. Nowadays it is often easier to replicate neural network papers than other papers, because you can just fork their code on github and run the experiment. However, I still find it difficult to ascertain the relative importance of the various choices that were made. For the choices enumerated above: what is the sensitivity of the final result to these choices? Hard to say, but I've started to assume the sensitivity is high, because when I have tried to tweak a result after replicating it, it usually goes to shit. (I haven't tried to replicate this particular result yet.)

Anyway this paper has some cool ideas and hopefully it can be extended to generating realistic-looking dialog.