Doc vectors using spaCy for Springboard

One of my Springboard mentees asked how she should compute document vectors from the word vectors available in a parsed Doc object returned by the spaCy parser.

The straightforward approach she came up with was to sum all the word vectors in a document. As long as you sum along the correct axis, so that each word-vector dimension is summed independently across tokens, it works:

$ python -m spacy download en_core_web_md
>>> import pandas as pd
>>> import spacy
>>> nlp = spacy.load("en_core_web_md")
>>> docs = ['Hello world!', 'Another doc, another $.', 'Goobye world...']
>>> pd.DataFrame((pd.DataFrame([w.vector for w in nlp(doc)]).sum(axis=0) for doc in docs)).round(2)
    0     1     2     3     4     5     6    ...   293   294   295   296   297   298   299
0 -0.02  0.66 -0.18 -0.26  0.78 -0.35  0.77  ... -0.23  0.68  0.06  0.40  0.06 -0.58  0.60
1 -0.94  0.83 -0.57  0.33  1.50  0.61 -0.78  ... -0.11 -1.91  0.32 -0.19 -0.38 -1.28 -0.33
2 -0.00  0.47  0.19 -0.43  0.32 -0.31  0.46  ... -0.82  0.25  0.31  0.05  0.07  0.05  0.24
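
If you prefer to skip pandas, the same axis-0 sum works directly in numpy (a minimal sketch; summing over axis 0 collapses the token dimension while keeping the 300 vector dimensions separate):

>>> import numpy as np
>>> sums = np.array([np.sum([w.vector for w in nlp(doc)], axis=0) for doc in docs])
>>> sums.shape  # one 300-dimensional vector per document
(3, 300)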

Or you can use the doc vector that spaCy computes internally, in exactly the same way:

>>> pd.DataFrame((nlp(doc).vector for doc in docs)).round(2)
    0     1     2     3     4     5     6    ...   293   294   295   296   297   298   299
0 -0.01  0.22 -0.06 -0.09  0.26 -0.12  0.26  ... -0.08  0.23  0.02  0.13  0.02 -0.19  0.20
1 -0.16  0.14 -0.10  0.05  0.25  0.10 -0.13  ... -0.02 -0.32  0.05 -0.03 -0.06 -0.21 -0.06
2 -0.00  0.16  0.06 -0.14  0.11 -0.10  0.15  ... -0.27  0.08  0.10  0.02  0.02  0.02  0.08
[3 rows x 300 columns]

But wait, those are different document vectors from the ones we computed! The first document vector returned by spaCy is a third of the magnitude of our sum in each dimension. Our sum is too big. Why does spaCy divide by 3?
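
A quick look at the tokenization hints at the answer:

>>> [t.text for t in nlp('Hello world!')]
['Hello', 'world', '!']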

It’s because “Hello world!” tokenizes into 3 tokens: ‘Hello’, ‘world’, and ‘!’. So if you replace sum() with mean(), you get exactly the same values that spaCy returns for the document vectors:

>>> pd.DataFrame((pd.DataFrame([w.vector for w in nlp(doc)]).mean(axis=0) for doc in docs)).round(2)
    0     1     2     3     4     5     6    ...   293   294   295   296   297   298   299
0 -0.01  0.22 -0.06 -0.09  0.26 -0.12  0.26  ... -0.08  0.23  0.02  0.13  0.02 -0.19  0.20
1 -0.16  0.14 -0.10  0.05  0.25  0.10 -0.13  ... -0.02 -0.32  0.05 -0.03 -0.06 -0.21 -0.06
2 -0.00  0.16  0.06 -0.14  0.11 -0.10  0.15  ... -0.27  0.08  0.10  0.02  0.02  0.02  0.08
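
You can confirm that the token-vector mean matches spaCy’s built-in doc vector to within floating-point precision (a quick sanity check):

>>> import numpy as np
>>> all(np.allclose(nlp(doc).vector,
...                 np.mean([w.vector for w in nlp(doc)], axis=0))
...     for doc in docs)
True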

Also, notice that misspelled words that don’t exist in the spaCy vocabulary, like “Goobye”, don’t crash the loop; spaCy just returns an all-zero vector for those words.
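
You can check this explicitly on the first token of the third document (assuming “Goobye” really is missing from this model’s vocabulary, as it was when this was written):

>>> oov = nlp('Goobye world...')[0]
>>> oov.text, oov.has_vector
('Goobye', False)
>>> (oov.vector == 0).all()  # every dimension is zero
True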

See the spaCy documentation for more details.

Written on July 16, 2019