Document Co-occurrence of Words

Matrix of word usage frequency between documents.


Each colored cell represents a pair of documents that used the same word. Darker cells indicate pairs of documents that used the same words more frequently. The color of a cell shows the category or grouping assigned to the documents. Co-occurrences between documents of different groups are shown in grayscale.

A typical NLP (natural language processing) data structure is an "Occurrence Matrix". Each cell in the occurrence matrix holds the number of times (or the frequency or probability) that a word occurs in a given set of words (usually a document or webpage).
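As a minimal sketch of such a matrix, the snippet below counts words in three made-up documents. The function name `occurrence_matrix`, the whitespace tokenization, and the words-as-rows layout are assumptions made for illustration, not something this page prescribes.

```python
from collections import Counter

def occurrence_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] is the count of word i in document j."""
    # Naive tokenization (assumption): lowercase and split on whitespace.
    counts = [Counter(doc.lower().split()) for doc in docs]
    # Sorted vocabulary gives every word a stable row index.
    vocab = sorted(set().union(*counts))
    # A Counter returns 0 for words a document does not contain.
    matrix = [[c[word] for c in counts] for word in vocab]
    return vocab, matrix

# Hypothetical example documents.
docs = ["the cat sat", "the dog sat down", "the cat ran"]
vocab, occ = occurrence_matrix(docs)

for word, row in zip(vocab, occ):
    print(f"{word:>5} {row}")
```

Dividing each column by its total would turn the raw counts into per-document frequencies, which is the other reading of a cell's value mentioned above.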

At a fundamental level, this occurrence matrix is really a graph (a network of connections) in which the element at row i and column j is the value of a connection (edge) from node (vertex) i to node (vertex) j. In the occurrence matrix, one set of nodes is the words and the other is the documents.
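To make the matrix-as-graph reading concrete, here is a small sketch that lists every nonzero cell as a weighted edge. The helper name `matrix_to_edges` and the sample labels are hypothetical; zero cells are treated as "no edge", which is one common convention.

```python
def matrix_to_edges(matrix, row_labels, col_labels):
    """List each nonzero cell (i, j) as an edge (row node, column node, weight)."""
    edges = []
    for i, row in enumerate(matrix):
        for j, weight in enumerate(row):
            if weight:  # a zero cell means the two nodes are not connected
                edges.append((row_labels[i], col_labels[j], weight))
    return edges

# Hypothetical occurrence matrix: rows are words, columns are documents.
words = ["cat", "sat", "the"]
doc_names = ["doc0", "doc1"]
small_occ = [
    [1, 0],  # "cat" occurs once in doc0
    [1, 1],  # "sat" occurs once in each document
    [2, 1],  # "the" occurs twice in doc0 and once in doc1
]

edges = matrix_to_edges(small_occ, words, doc_names)
for edge in edges:
    print(edge)
```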

So if you start at a word in the Occurrence Graph, traverse to its document nodes, and then reach all the word nodes connected to each of those documents individually, you can identify the words that "co-occur" in the same document. To form a new, smaller graph with nodes of a single type, you can delete all the document nodes and their edges and replace them with the equivalent word-to-word edges.
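The two-hop traversal described above can be sketched as follows. The adjacency structure `word_to_docs`, the function name `cooccurring_words`, and the sample data are all assumptions made for this example.

```python
from collections import defaultdict

def cooccurring_words(word_to_docs, start):
    """Return the words that share at least one document with `start`."""
    # Invert the word -> documents edges so we can walk back from documents.
    doc_to_words = defaultdict(set)
    for word, doc_ids in word_to_docs.items():
        for doc in doc_ids:
            doc_to_words[doc].add(word)

    found = set()
    # Hop 1: word -> its documents. Hop 2: each document -> its words.
    for doc in word_to_docs.get(start, ()):
        found |= doc_to_words[doc]
    found.discard(start)  # the starting word trivially co-occurs with itself
    return found

# Hypothetical occurrence graph: each word maps to the documents containing it.
word_to_docs = {
    "cat": {0, 2},
    "dog": {1},
    "down": {1},
    "ran": {2},
    "sat": {0, 1},
    "the": {0, 1, 2},
}

print(sorted(cooccurring_words(word_to_docs, "cat")))
```

Running the traversal from every word and keeping the resulting word-to-word pairs gives the edges of the smaller, single-type graph.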

This results in a Co-occurrence Graph, which can itself be viewed as a matrix (just like the Occurrence Matrix). Each element ij represents the value of an edge from vertex i to vertex j.
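One way to compute that matrix directly, sketched here under assumptions, is to multiply the occurrence matrix by its own transpose, so that element ij sums, over all documents, the product of the counts of word i and word j. This page does not say how its edge values are weighted, so the product weighting and the name `cooccurrence_matrix` are illustrative choices.

```python
def cooccurrence_matrix(occ):
    """Word-by-word matrix: entry [i][k] is the dot product of rows i and k of occ."""
    n = len(occ)
    return [
        [sum(a * b for a, b in zip(occ[i], occ[k])) for k in range(n)]
        for i in range(n)
    ]

# Hypothetical occurrence matrix: rows are words, columns are three documents.
words = ["cat", "dog", "sat", "the"]
occ = [
    [1, 0, 1],  # cat
    [0, 1, 0],  # dog
    [1, 1, 0],  # sat
    [1, 1, 1],  # the
]

co = cooccurrence_matrix(occ)
for word, row in zip(words, co):
    print(f"{word:>4} {row}")
```

The result is symmetric, and with 0/1 counts each off-diagonal entry is simply the number of documents in which both words appear. Applying the same product to the columns instead of the rows gives a document-by-document matrix of shared words.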

Thanks to d3.js by Mike Bostock.