Sentence embeddings took off in 2017. When Google released its Universal Sentence Encoder last year, researchers took notice. Google trained its sentence encoder on a massive corpus of text: everything from Wikipedia and news articles to FAQs and forums. It then refined the accuracy by fine-tuning on the Stanford Natural Language Inference (SNLI) corpus. As with word2vec, this let NLP enthusiasts leverage Google's text-scraping and cleaning infrastructure to build their own models using transfer learning. Transfer learning is just a fancy way of saying you use one model within another. Usually you're just doing "activation" or "inference" with the pretrained model and then using its output as a feature (input) for some other model.
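Here's a minimal sketch of that pattern. The "pretrained" embedding table below is a toy stand-in (made-up vectors, not a real model like word2vec or USE): the embedding step stays frozen, and only a tiny classifier is trained on top of its output.

```python
import numpy as np

# Toy stand-in for a pretrained embedding model (hypothetical vectors;
# in practice these would come from word2vec, GloVe, or USE).
PRETRAINED = {
    "good": np.array([1.0, 0.2]),
    "great": np.array([0.9, 0.3]),
    "bad": np.array([-1.0, 0.1]),
    "awful": np.array([-0.8, 0.2]),
}

def embed(sentence):
    """The 'activation'/'inference' step: run the frozen pretrained model."""
    vecs = [PRETRAINED[w] for w in sentence.split() if w in PRETRAINED]
    return np.mean(vecs, axis=0)

# Transfer learning: the pretrained model's output is the input feature
# for a new model -- here a trivial perceptron with two training examples.
X = np.array([embed("good great"), embed("bad awful")])
y = np.array([1, -1])  # sentiment labels
w = np.zeros(2)
for _ in range(10):  # a few perceptron updates
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:
            w += yi * xi

print(np.sign(w @ embed("great")))  # prints 1.0 (positive sentiment)
```

Only the perceptron weights `w` ever get updated; the embedding table is used read-only, which is the whole point of transfer learning.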
So is this new embedding any better than the other options? John Christian Fjellestad compiled a nice summary of many of them, to which I've added the "naive" versions at the top of the lists here. These lists progress from earlier to more recent techniques and from less advanced to more advanced.
- naive word vector embedding sum
- normalized mean of word embeddings: subtract principal eigenvector
- thought vectors (LSTM-based word vector autoencoder)
- skip thought vectors (next sentence prediction)
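To make the first two approaches concrete, here's a hedged numpy sketch: the naive version just sums the word vectors, while the normalized-mean variant averages them and then subtracts each sentence vector's projection onto the first principal component (the "principal eigenvector") of the stacked sentence-vector matrix. The vectors here are random placeholders, not real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

def naive_sum(word_vecs):
    """Naive sentence embedding: just sum the word vectors."""
    return np.sum(word_vecs, axis=0)

def remove_first_pc(sentence_vecs):
    """Subtract each sentence vector's projection onto the first
    principal component of the stacked sentence-vector matrix."""
    X = np.vstack(sentence_vecs)
    u = np.linalg.svd(X, full_matrices=False)[2][0]  # top right-singular vector
    return X - np.outer(X @ u, u)

# Fake 5-dimensional word vectors for three toy "sentences"
sentences = [rng.normal(size=(n, 5)) for n in (3, 4, 2)]
means = [np.mean(s, axis=0) for s in sentences]  # normalized mean step
cleaned = remove_first_pc(means)                 # principal-component removal
```

After `remove_first_pc`, every sentence vector is orthogonal to that dominant direction, which tends to carry uninformative "common discourse" signal shared by all sentences.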
And you also have to choose a good word embedding.
To do that you need some measure of goodness for your machine learning problem. Fortunately, John Fjellestad has coded up some machine learning tasks that work well as generic benchmarks of sentence embedding quality:
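The basic idea behind any such benchmark can be sketched in a few lines: embed some labeled sentence pairs, then score how often the embedding's similarity agrees with the labels. The embedder below is a random-vector placeholder (not a real model, and not Fjellestad's code), so only the scaffolding matters here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder embedder: assign each word a random vector, average them.
TABLE = {}
def embed(sentence, dim=16):
    vecs = []
    for w in sentence.lower().split():
        if w not in TABLE:
            TABLE[w] = rng.normal(size=dim)
        vecs.append(TABLE[w])
    return np.mean(vecs, axis=0)

def score_embedding(pairs, threshold=0.5):
    """Tiny 'goodness' benchmark: fraction of labeled sentence pairs
    where cosine similarity above `threshold` matches the label."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    correct = 0
    for s1, s2, is_paraphrase in pairs:
        sim = cos(embed(s1), embed(s2))
        correct += (sim > threshold) == is_paraphrase
    return correct / len(pairs)

pairs = [
    ("the cat sat", "the cat sat down", True),
    ("stocks fell sharply", "the cat sat", False),
]
print(score_embedding(pairs))
```

Swap in any of the sentence embedding techniques above for `embed` and the score tells you which one is "better" for this particular task; a real benchmark would use a labeled dataset like SNLI or a paraphrase corpus instead of two toy pairs.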