Word Co-occurrence of Documents

Each cell in this heat map matrix is shaded in proportion to the product of the number of occurrences of each pair of words in the same document, summed over all the documents (the product of the occurrence matrix with its transpose).


Each colored cell represents two words that appeared in the same document (usually a webpage). Darker cells indicate words that co-occurred more frequently.

You can sort the matrix rows and columns using the pull-down menu above.

This dynamic visualization uses d3.js by Mike Bostock.

A typical NLP (natural language processing) data structure is an "Occurrence Matrix". Each cell in the occurrence matrix represents the number of times (or the frequency or probability) that a word occurs in a given set of words (usually a document or webpage).
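As a minimal sketch (using a made-up three-document corpus), an occurrence matrix of raw word counts can be built with nothing more than the Python standard library:

```python
from collections import Counter

# Toy corpus: each "document" is a short string (hypothetical example data).
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Vocabulary defines the rows; the documents define the columns.
vocabulary = sorted({word for doc in documents for word in doc.split()})

# occurrence[i][j] = number of times word i occurs in document j.
occurrence = [
    [Counter(doc.split())[word] for doc in documents]
    for word in vocabulary
]

for word, row in zip(vocabulary, occurrence):
    print(f"{word:>7}: {row}")
```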

At a fundamental level, this occurrence matrix is really a graph (a network of connections) in which the matrix element at row i and column j represents the weight of a connection (edge) from vertex i (a word) to vertex j (a document).
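To make that graph reading concrete, here is the same toy matrix from the sketch above unpacked into a weighted edge list, where every nonzero cell becomes an edge from a word node to a document node (the doc_0, doc_1, ... names are just made-up labels):

```python
# Continues the toy `vocabulary` / `occurrence` variables from the sketch above.
# Every nonzero cell (i, j) becomes a weighted edge: word node i -> document node j.
edges = [
    (word, f"doc_{j}", count)
    for word, row in zip(vocabulary, occurrence)
    for j, count in enumerate(row)
    if count > 0
]

for word, doc, weight in edges:
    print(f"{word} --{weight}--> {doc}")
```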

So if you start with a word in the Occurrence Graph, traverse to its document nodes, and then reach all the word nodes connected to each of those documents, you can identify the words that "co-occur" in the same document. To form a new, smaller graph with nodes of a single type, you can delete all the document nodes and their edges and replace them with the equivalent word-to-word edges.
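A rough sketch of that traversal, again on the toy `vocabulary` and `occurrence` data defined earlier: hop from a word to every document that contains it, then out to every other word in those documents, accumulating the product of the two counts so the totals match the product-of-counts definition used by the heat map:

```python
from collections import defaultdict

word_index = {word: i for i, word in enumerate(vocabulary)}

def co_occurring_words(word):
    """Return {other_word: co-occurrence weight} for the given word."""
    counts = defaultdict(int)
    row = occurrence[word_index[word]]
    for j, count_here in enumerate(row):
        if count_here == 0:
            continue  # the word never appears in document j, so skip it
        # Hop from document j back out to every other word it contains.
        for other, other_row in zip(vocabulary, occurrence):
            if other != word and other_row[j] > 0:
                counts[other] += count_here * other_row[j]
    return dict(counts)

print(co_occurring_words("cat"))
```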

This results in a Co-Occurrence Graph, which itself can be viewed as a matrix (just like the Occurrence Matrix). Each element ij represents the weight of an edge from word vertex i to word vertex j.
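Equivalently, you can skip the traversal entirely: the whole co-occurrence matrix falls out of a single product of the occurrence matrix with its transpose, which is the quantity shaded in the heat map above. A sketch with numpy, reusing the toy `occurrence` list:

```python
import numpy as np

O = np.array(occurrence)   # word-by-document occurrence matrix, shape (words, docs)
C = O @ O.T                # word-by-word co-occurrence matrix, shape (words, words)

# C[i, j] sums, over all documents, the product of word i's count and word j's
# count in that document; the diagonal holds each word's co-occurrence with itself.
i, j = vocabulary.index("cat"), vocabulary.index("sat")
print(f"cat / sat co-occurrence: {C[i, j]}")
```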