TF-IDF For Document Similarity

For an upcoming post about network visualization, I wanted a unique dataset to visualize. Once I started, I realized that people might want to know how I created the dataset. So here we are 🙂

The dataset I wanted was one that showed the similarity between US states. Each state would be a node, and the similarity between state A and state B would be an edge. To determine this, I chose to compute the similarity between the states’ constitutions. And to make things easier on myself, I limited the comparison to just the preamble of each constitution. The preamble is meant to communicate the guiding principles of the constitution. Though not perfect and certainly not up to date, I think it’s a pretty good start.

The Method

What is TF-IDF?

First, I needed to pick a similarity metric. A common way to determine the similarity between two pieces of text starts with a method called TF-IDF. TF-IDF is essentially a number that tells you how distinctive a word (a “term”) is for one piece of text relative to a collection of texts. Those numbers are then combined (more on that later) to determine how similar the pieces of text are to each other.

TF-IDF has two components: term frequency (TF) and inverse document frequency (IDF).

Definitions

Term frequency measures how often a word appears in a bit of text (a “document”). This is computed as the ratio between the number of times the word is in the document and the number of words in the document.
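As a minimal sketch of that ratio, the snippet below computes term frequency in plain Python. The `term_frequency` helper and its whitespace tokenization are assumptions for illustration, not how any particular library tokenizes text.

```python
from collections import Counter

def term_frequency(document):
    """Return {word: count / total words} for one document."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    words = document.lower().split()
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}

tf = term_frequency("grateful to God and to those who founded our nation")
print(tf["to"])      # "to" appears 2 times out of 10 words
print(tf["nation"])  # "nation" appears 1 time out of 10 words
```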

Inverse document frequency is based on how many documents in a collection (a “corpus”) contain a word. The “inverse” is because this value decreases as more documents use the word. We want this because similarity should be judged relative to the entire corpus: if every document has a certain word, that word doesn’t say much about one document’s similarity with another document.

The formula for inverse document frequency is a bit more complicated, and many software implementations use their own tweaks. That being said, IDF is built on the ratio between the number of documents in your corpus and the number of documents containing the word you’re evaluating, usually with a logarithm applied to that ratio.
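To make the ratio concrete, here is a small sketch in plain Python. The `inverse_document_frequency` helper is hypothetical, and taking the natural logarithm of the ratio is the common textbook variant, assumed here; libraries often add smoothing on top.

```python
import math

def inverse_document_frequency(word, corpus):
    """log(number of documents / number of documents containing the word)."""
    # Each document is reduced to its set of lowercase, whitespace-split words.
    tokenized = [set(doc.lower().split()) for doc in corpus]
    containing = sum(1 for words in tokenized if word in words)
    if containing == 0:
        raise ValueError(f"{word!r} does not appear in the corpus")
    return math.log(len(corpus) / containing)

corpus = [
    "we the people of alaska",
    "we the people of hawaii",
    "the island state",
]
idf_the = inverse_document_frequency("the", corpus)        # in 3 of 3 documents
idf_people = inverse_document_frequency("people", corpus)  # in 2 of 3 documents
idf_island = inverse_document_frequency("island", corpus)  # in 1 of 3 documents
print(idf_the, idf_people, idf_island)
```

The rarer the word, the larger the value, and a word found in every document ends up with an IDF of zero under this variant.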

The TF and IDF parts are multiplied together to get the actual TF-IDF value. This gives us a metric for how much each word makes a document in the corpus unique.
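Putting the two parts together, the sketch below multiplies them for each word of each document. The `tf_idf` function name, the whitespace tokenizer, and the plain log-ratio IDF are assumptions for illustration, so the numbers will not match a library that applies its own tweaks.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {word: tf * idf} dict per document in the corpus."""
    tokenized = [doc.lower().split() for doc in corpus]
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for words in tokenized for word in set(words))
    scores = []
    for words in tokenized:
        counts = Counter(words)
        scores.append({
            word: (count / len(words)) * math.log(len(corpus) / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

scores = tf_idf(["we the people of alaska", "we the people of hawaii"])
print(scores[0]["we"])      # shared by both documents, so the score is zero
print(scores[0]["alaska"])  # only in the first document, so the score is positive
```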

Between-Document Similarity

Typically, TF-IDF is calculated for each word within each document to produce a “document term matrix”. This is a matrix where each row represents a document and each column represents a unique word in the corpus. The benefit of this structure is that, when each row is normalized to unit length (which sklearn does by default), multiplying the matrix by its transpose gives the cosine similarity between every pair of documents. The resulting matrix has as many rows and columns as there are documents, and each value is the similarity between the two corresponding documents.

Below is an example document similarity matrix. Note the diagonal of 1’s, indicating that each document is identical to itself.

	Document 1	Document 2	Document 3
Document 1	1	0.23	0.15
Document 2	0.23	1	0.86
Document 3	0.15	0.86	1
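The matrix product described above can be sketched in plain Python. This assumes each row of the document term matrix is first scaled to unit length (L2-normalized), in which case the dot product of two rows is their cosine similarity. The tiny three-word matrix and the helper names are made up for illustration.

```python
import math

def normalize(row):
    """Scale a row vector to unit (L2) length; assumes the row is not all zeros."""
    length = math.sqrt(sum(value * value for value in row))
    return [value / length for value in row]

def similarity_matrix(matrix):
    """Multiply a row-normalized matrix by its transpose."""
    rows = [normalize(row) for row in matrix]
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) for row_j in rows]
        for row_i in rows
    ]

# Made-up TF-IDF weights: 3 documents (rows) by 3 words (columns).
document_term_matrix = [
    [1.0, 0.0, 0.0],
    [0.0, 3.0, 4.0],
    [0.0, 4.0, 3.0],
]
similarity = similarity_matrix(document_term_matrix)
for row in similarity:
    print([round(value, 2) for value in row])
```

Documents that share no words score 0, documents that weight the same words similarly score close to 1, and the result is symmetric because the similarity of A with B equals that of B with A.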

Code Example

I’m going to run through a very simple code example that can be easily expanded to handle larger datasets.

This example uses a corpus of two documents, each a snippet from a US state constitution:

corpus = [
  'We the people of Alaska, grateful to God and to those who founded our nation',
  'We, the people of Hawaii, grateful for Divine Guidance, and mindful of our Hawaiian heritage and uniqueness as an island State',
]

First we import the necessary sklearn class:

from sklearn.feature_extraction.text import TfidfVectorizer

Then we simply ask sklearn to generate the document term matrix for our corpus:

vectorizer = TfidfVectorizer()
document_term_matrix = vectorizer.fit_transform(corpus)

We can now show both the term list that was used as well as the TF-IDF matrix itself:

print(vectorizer.get_feature_names_out().tolist())
print(document_term_matrix.toarray())

['alaska', 'an', 'and', 'as', 'divine', 'for', 'founded', 'god', 'grateful', 'guidance', 'hawaii', 'hawaiian', 'heritage', 'island', 'mindful', 'nation', 'of', 'our', 'people', 'state', 'the', 'those', 'to', 'uniqueness', 'we', 'who']
[[0.27172601 0.         0.19333529 0.         0.         0.
  0.27172601 0.27172601 0.19333529 0.         0.         0.
  0.         0.         0.         0.27172601 0.19333529 0.19333529
  0.19333529 0.         0.19333529 0.27172601 0.54345202 0.
  0.19333529 0.27172601]
 [0.         0.2319869  0.33012117 0.2319869  0.2319869  0.2319869
  0.         0.         0.16506059 0.2319869  0.2319869  0.2319869
  0.2319869  0.2319869  0.2319869  0.         0.33012117 0.16506059
  0.16506059 0.2319869  0.16506059 0.         0.         0.2319869
  0.16506059 0.        ]]
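To see where these numbers come from, a few entries of the first row can be reproduced by hand. The sketch below assumes scikit-learn's documented defaults for `TfidfVectorizer`: raw counts for term frequency, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The `counts_alaska` and `shared_words` values are typed in from the two snippets rather than tokenized.

```python
import math

n_documents = 2

# Raw word counts for the Alaska snippet ("to" appears twice).
counts_alaska = {
    "alaska": 1, "and": 1, "founded": 1, "god": 1, "grateful": 1,
    "nation": 1, "of": 1, "our": 1, "people": 1, "the": 1,
    "those": 1, "to": 2, "we": 1, "who": 1,
}
# Words that also appear in the Hawaii snippet (document frequency 2).
shared_words = {"and", "grateful", "of", "our", "people", "the", "we"}

def smoothed_idf(doc_freq):
    # Assumed default: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_documents) / (1 + doc_freq)) + 1

raw = {
    word: count * smoothed_idf(2 if word in shared_words else 1)
    for word, count in counts_alaska.items()
}
# L2-normalize so the row has unit length.
length = math.sqrt(sum(value ** 2 for value in raw.values()))
weights = {word: value / length for word, value in raw.items()}

print(round(weights["and"], 8))     # a word shared by both documents
print(round(weights["alaska"], 8))  # a word unique to this document
print(round(weights["to"], 8))      # a unique word that appears twice
```

The printed values can be compared against the 'and', 'alaska', and 'to' columns of the first row above.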

Then we can generate the document similarity matrix:

pairwise_similarity = document_term_matrix @ document_term_matrix.transpose()

And display it:

print(pairwise_similarity.toarray())

[[1.         0.28720833]
 [0.28720833 1.        ]]

You can easily generate a document similarity matrix for a much larger corpus simply by adding documents to the corpus list.

In the case of the US state constitution dataset I wanted to generate, I first took the preambles (found here) and produced a JSON file containing all of the preambles, keyed by state name and abbreviation. The final document similarity matrix I generated for the preambles can be found here.
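Since each state is a node and each pairwise similarity is an edge, a similarity matrix maps directly onto an edge list. The sketch below is one way that conversion might look; the `to_edge_list` helper, the `labels` list, and the sample values are hypothetical and not the code used to build the actual dataset.

```python
from itertools import combinations

def to_edge_list(labels, similarity):
    """Yield (label_a, label_b, weight) for each unordered pair of documents."""
    # combinations() skips the diagonal and avoids emitting each pair twice.
    for i, j in combinations(range(len(labels)), 2):
        yield labels[i], labels[j], similarity[i][j]

labels = ["AK", "HI"]
similarity = [
    [1.0, 0.28720833],
    [0.28720833, 1.0],
]
edges = list(to_edge_list(labels, similarity))
print(edges)
```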

Stay tuned for my upcoming post where I use this dataset to visualize these similarities.

Full Code

Below is the full code example. You can also find a running version here that you can play with.

corpus = [
  'We the people of Alaska, grateful to God and to those who founded our nation',
  'We, the people of Hawaii, grateful for Divine Guidance, and mindful of our Hawaiian heritage and uniqueness as an island State',
]

from sklearn.feature_extraction.text import TfidfVectorizer

# Use sklearn to generate document term matrix
vectorizer = TfidfVectorizer()
document_term_matrix = vectorizer.fit_transform(corpus)

# Show the labels for the document term matrix columns
print(vectorizer.get_feature_names_out().tolist())

# Show the document term matrix
print(document_term_matrix.toarray())

# Generate document similarity matrix
pairwise_similarity = document_term_matrix @ document_term_matrix.transpose()

# Show the document similarity matrix
print(pairwise_similarity.toarray())