NLP – TF-IDF (Term Frequency-Inverse Document Frequency) model

In NLP, the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm is useful in use cases such as Search, Recommendation, and Classification. This model uses Label/Integer word encoding.

To overcome one of the limitations of the Bag of Words (BOW) methodology (i.e., results generally get dominated by common words like “is”, “the”, and “a”), a technique called TF-IDF (Term Frequency-Inverse Document Frequency) was developed.

It increases the importance of a term in a document when the term occurs multiple times in that document, and it penalizes the importance of a term that occurs very frequently across various documents in the corpus.

TF-IDF Equation

    \[ tf\_idf(t, d, D) = tf(t, d) \cdot idf(t, D) \]

Further,

    \[ tf(t, d) = \text{count of } t \text{ (term or word) in } d \]

    (increases proportionally to the number of times a word appears in the document)

    \[ idf(t, D) = \log\left(\frac{N}{df(t)}\right) \]

    (decreases as the occurrence of a word increases across multiple documents)

Where,
tf: Term Frequency of the word in a document
idf: Inverse Document Frequency, based on the number of documents in the corpus that contain the word
t: Denotes the term (word)
d: Denotes the document
D: Denotes all the documents in the corpus
N: Number of documents in the corpus
df(t): Number of documents in which the word t occurs
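
As a quick illustration of the formula, the following minimal Python sketch computes the weight of a single term. The helper name tf_idf_weight and the counts used are hypothetical, not part of the original post's code:

import math

def tf_idf_weight(term_count_in_doc, docs_containing_term, total_docs):
    # Naive tf-idf for one term: tf(t, d) * log(N / df(t))
    tf = term_count_in_doc
    idf = math.log(total_docs / docs_containing_term)
    return tf * idf

# Hypothetical numbers: the term appears 2 times in the document,
# and occurs in 1 out of 3 documents in the corpus.
print(tf_idf_weight(2, 1, 3))   # 2 * log(3/1) ≈ 2.197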

Example 1 of TF-IDF:

Let us consider that we have the following two docs in a corpus:
Doc1: “I love playing football.”
Doc2: “Indians love playing Cricket.”

Our corpus has 6 unique words (vocabulary/dictionary size); they can be represented with indices 0 to 5 as follows:

vocabulary/ dictionary (Label/ Integer Encoding of Words): {‘i’: 0, ‘love’: 1, ‘playing’: 2, ‘football’: 3, ‘indians’: 4, ‘cricket’: 5}
Number of Documents (N): 2
The term frequency (tf) of each word in each document is as follows (same as BOW):
Doc 1: [1. 1. 1. 1. 0. 0.]
Doc 2:  [0. 1. 1. 0. 1. 1.]
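
These tf counts can be reproduced with a few lines of Python (a minimal sketch using Counter and the vocabulary above; the full code is given at the end of this post):

from collections import Counter

docs = [["i", "love", "playing", "football"],
        ["indians", "love", "playing", "cricket"]]
vocab = {"i": 0, "love": 1, "playing": 2, "football": 3, "indians": 4, "cricket": 5}

for doc in docs:
    counts = Counter(doc)
    print([counts.get(word, 0) for word in vocab])
# [1, 1, 1, 1, 0, 0]
# [0, 1, 1, 0, 1, 1]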

The document frequency (df) of each term (the number of documents in which the term occurs) is as follows:
df: {‘i’: 1, ‘love’: 2, ‘playing’: 2, ‘football’: 1, ‘indians’: 1, ‘cricket’: 1}

The Inverse Document Frequency (idf) of each term in the corpus is as follows:
idf = log(N/df): {‘i’: 0.6931471805599453, ‘love’: 0.0, ‘playing’: 0.0, ‘football’: 0.6931471805599453, ‘indians’: 0.6931471805599453, ‘cricket’: 0.6931471805599453}
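
These idf values are easy to verify: with N = 2, a word that appears in only one document gets log(2/1) ≈ 0.693, while a word that appears in both documents gets log(2/2) = 0. For example:

import math
print(math.log(2/1))   # 0.693... -> 'i', 'football', 'indians', 'cricket'
print(math.log(2/2))   # 0.0      -> 'love', 'playing'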

After calculation, the TF-IDF weights of each term in each document are as follows:

tf_idf (tf * idf):

           i      love   playing   football   indians   cricket
Doc 1    0.69     0.0      0.0       0.69       0.0       0.0
Doc 2    0.0      0.0      0.0       0.0        0.69      0.69

As we can see, the weights of the words in this example follow a similar pattern to BOW, except for the keywords “love” and “playing”: since they appear in both documents, they are penalized by IDF and their weight becomes zero in both documents.
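
The same example can also be run with scikit-learn's TfidfVectorizer, though the resulting numbers will not match the naive values above: by default scikit-learn uses a smoothed idf (log((1 + N)/(1 + df)) + 1) and L2-normalizes each document vector, and its default tokenizer drops single-character tokens such as “i”. A minimal sketch, assuming scikit-learn is installed:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I love playing football.", "Indians love playing Cricket."]
vectorizer = TfidfVectorizer()             # defaults: smooth_idf=True, norm='l2'
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # learned vocabulary (sklearn >= 1.0)
print(X.toarray())                         # smoothed, L2-normalized tf-idf matrix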

Example 2 of TF-IDF:

Let us consider that we have the following three docs in a corpus:
Doc1:  “My name is John, What is your name?”
Doc2: “Bill is a very good person. He likes playing soccer.”
Doc3:  “What is your favorite game? I love Football. Football is a great game.”

This corpus has 21 unique words (vocabulary/dictionary size). The vocabulary/dictionary (Label/Integer Encoding of words) is: {‘my’: 0, ‘name’: 1, ‘is’: 2, ‘john’: 3, ‘what’: 4, ‘your’: 5, ‘bill’: 6, ‘a’: 7, ‘very’: 8, ‘good’: 9, ‘person’: 10, ‘he’: 11, ‘likes’: 12, ‘playing’: 13, ‘soccer’: 14, ‘favorite’: 15, ‘game’: 16, ‘i’: 17, ‘love’: 18, ‘football’: 19, ‘great’: 20}

Number of Documents (N): 3

The term frequency (tf) of each word in each document is as follows (same as BOW):
tf:
Doc 1:: {‘my’: 1, ‘name’: 2, ‘is’: 2, ‘john’: 1, ‘what’: 1, ‘your’: 1, ‘bill’: 0, ‘a’: 0, ‘very’: 0, ‘good’: 0, ‘person’: 0, ‘he’: 0, ‘likes’: 0, ‘playing’: 0, ‘soccer’: 0, ‘favorite’: 0, ‘game’: 0, ‘i’: 0, ‘love’: 0, ‘football’: 0, ‘great’: 0},
Doc 2: {‘my’: 0, ‘name’: 0, ‘is’: 1, ‘john’: 0, ‘what’: 0, ‘your’: 0, ‘bill’: 1, ‘a’: 1, ‘very’: 1, ‘good’: 1, ‘person’: 1, ‘he’: 1, ‘likes’: 1, ‘playing’: 1, ‘soccer’: 1, ‘favorite’: 0, ‘game’: 0, ‘i’: 0, ‘love’: 0, ‘football’: 0, ‘great’: 0},
Doc 3: {‘my’: 0, ‘name’: 0, ‘is’: 2, ‘john’: 0, ‘what’: 1, ‘your’: 1, ‘bill’: 0, ‘a’: 1, ‘very’: 0, ‘good’: 0, ‘person’: 0, ‘he’: 0, ‘likes’: 0, ‘playing’: 0, ‘soccer’: 0, ‘favorite’: 1, ‘game’: 2, ‘i’: 1, ‘love’: 1, ‘football’: 2, ‘great’: 1}

The document frequency (df) of each term (the number of documents in which the term occurs) is as follows:
df: {‘my’: 1, ‘name’: 1, ‘is’: 3, ‘john’: 1, ‘what’: 2, ‘your’: 2, ‘bill’: 1, ‘a’: 2, ‘very’: 1, ‘good’: 1, ‘person’: 1, ‘he’: 1, ‘likes’: 1, ‘playing’: 1, ‘soccer’: 1, ‘favorite’: 1, ‘game’: 1, ‘i’: 1, ‘love’: 1, ‘football’: 1, ‘great’: 1}

The Inverse Document Frequency (idf) of each term in the corpus is as follows:
idf = log(N/df):
{‘my’: 1.099, ‘name’: 1.099, ‘is’: 0.0, ‘john’: 1.099, ‘what’: 0.405, ‘your’: 0.405, ‘bill’: 1.099, ‘a’: 0.405, ‘very’: 1.099, ‘good’: 1.099, ‘person’: 1.099, ‘he’: 1.099, ‘likes’: 1.099, ‘playing’: 1.099, ‘soccer’: 1.099, ‘favorite’: 1.099, ‘game’: 1.099, ‘i’: 1.099, ‘love’: 1.099, ‘football’: 1.099, ‘great’: 1.099}

After calculation, the TF-IDF weights of each term in each document are as follows:
tf_idf (tf * idf):
Doc 1: {‘my‘: 1.099, ‘name’: 2.198, ‘is‘: 0.0, ‘john’: 1.099, ‘what‘: 0.405, ‘your‘: 0.405, ‘bill’: 0, ‘a’: 0, ‘very’: 0, ‘good’: 0, ‘person‘: 0, ‘he’: 0, ‘likes’: 0, ‘playing’: 0, ‘soccer’: 0, ‘favorite’: 0, ‘game‘: 0, ‘i’: 0, ‘love’: 0, ‘football‘: 0, ‘great’: 0},
Doc 2: {‘my‘: 0, ‘name’: 0, ‘is‘: 0.0, ‘john’: 0, ‘what‘: 0, ‘your‘: 0, ‘bill’: 1.099, ‘a’: 0.405, ‘very’: 1.099, ‘good’: 1.099, ‘person‘: 1.099, ‘he’: 1.099, ‘likes’: 1.099, ‘playing’: 1.099, ‘soccer’: 1.099, ‘favorite’: 0, ‘game‘: 0, ‘i’: 0, ‘love’: 0, ‘football‘: 0, ‘great’: 0},
Doc 3: {‘my‘: 0, ‘name’: 0, ‘is‘: 0.0, ‘john’: 0, ‘what‘: 0.405, ‘your‘: 0.405, ‘bill’: 0, ‘a’: 0.405, ‘very’: 0, ‘good’: 0, ‘person‘: 0, ‘he’: 0, ‘likes’: 0, ‘playing’: 0, ‘soccer’: 0, ‘favorite’: 1.099, ‘game‘: 2.198, ‘i’: 1.099, ‘love’: 1.099, ‘football‘: 2.198, ‘great’: 1.099}

As we can see, the weights of the words in this example again follow a similar pattern to BOW. Please note the following observations:

  • If a word doesn’t occur in a document, then its weight is zero in that document (same as in the case of BOW).
  • Weights of common words like “what” and “your” (which occur in both Doc 1 and Doc 3) are penalized in the documents where they occur, as they appear in two documents. In contrast, the weights of words like “my” (occurring only in Doc 1) and “person” (occurring only in Doc 2) are higher in the documents where they occur, as these words appear in only one document.
  • The weight of “is” is penalized to zero in all documents, as it occurs in all three documents.
  • The keywords “football” and “game” get a very high weight (2.198) in Doc 3, where each occurs twice and does not occur in any other document.
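
Once documents are represented as tf-idf vectors, they can be compared with cosine similarity, which is how TF-IDF is typically leveraged in the Search/Recommendation use cases mentioned at the beginning. A minimal sketch using numpy on the tf-idf vectors from Example 1 (the cosine_similarity helper here is illustrative, not part of the original code):

import numpy as np

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc1 = np.array([0.69, 0.0, 0.0, 0.69, 0.0, 0.0])   # tf-idf vector of Doc 1 (Example 1)
doc2 = np.array([0.0, 0.0, 0.0, 0.0, 0.69, 0.69])   # tf-idf vector of Doc 2 (Example 1)
print(cosine_similarity(doc1, doc2))                # 0.0 -> no weighted terms in common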

Further improvements to the TF-IDF algorithm:

  • Eliminate common/stop words [“a”, “is”, etc.] (see the preprocessing sketch after this list)
  • Stemming [treat “play” and “playing” as the same word]
  • Defining synonyms [define/maintain a synonym dictionary, e.g. “Soccer” is a synonym of “Football”]
  • etc.
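
As a rough sketch of the first two improvements, stop words can be removed and words stemmed before computing TF-IDF, for example with NLTK. This assumes NLTK is installed and its stop-word list has been downloaded; the preprocess helper is hypothetical and not part of the code below:

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Requires a one-time download: import nltk; nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(doc):
    # Lowercase, strip punctuation, drop stop words like "a"/"is", and stem "playing" -> "play"
    tokens = [w.lower().strip(".,?") for w in doc.split()]
    return [stemmer.stem(w) for w in tokens if w and w not in stop_words]

print(preprocess("Bill is a very good person. He likes playing soccer."))
# ['bill', 'good', 'person', 'like', 'play', 'soccer']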

Limitations

Limitations of TF-IDF are as follows:

  • It doesn’t have any way to leverage similar keywords; for example, there is no way to find out that the words “Football” and “Soccer” are similar and leverage the similarity of these two words.
  • Algorithms tried to solve this by creating synonym lists manually, but this requires a lot of manual effort and the maintenance cost is very high. Also, synonyms generally change based on context, which cannot be covered by a synonym list.

So there is a need to define word embeddings through an algorithm such that semantically similar words have similar embeddings, which can then be leveraged in our use cases.

Python code for the above examples (GitHub code location) is as follows:

'''
Author: Gyan Mittal
Corresponding Document: https://gyan-mittal.com/nlp-ai-ml/nlp-tf-idf-term-frequency-inverse-document-frequency-model/
Brief about TF-IDF (Term Frequency-Inverse Document Frequency):
To overcome the limitation of common words in the Bag of Words (BOW) methodology, a technique called TF-IDF (Term Frequency-Inverse Document Frequency) was developed.
It increases the importance of a term in a document when the term occurs multiple times in that document,
and penalizes the importance of a term that occurs very frequently across the documents in the corpus.
About Code: This code demonstrates TF-IDF (Term Frequency-Inverse Document Frequency) with two simple example corpora.
'''
from collections import Counter
import itertools
import numpy as np
import math
from util import naive_clean_text

def naive_tf_idf(corpus):
    # Tokenize each document after cleaning it (naive_clean_text is a small helper in the accompanying GitHub repo)
    split_docs_words = [naive_clean_text(doc).split() for doc in corpus]
    # Count how many times each word appears across the whole corpus
    word_counts = Counter(itertools.chain(*split_docs_words))
    #print("word_counts", "\n", word_counts)

    # Label/Integer Encoding: assign an index to each unique word in the vocabulary
    vocab_word_index = {x: i for i, x in enumerate(word_counts)}
    reverse_vocab_word_index = {i: x for i, x in enumerate(word_counts)}
    print("vocab_word_index\n", vocab_word_index)
    print("reverse_vocab_word_index\n", reverse_vocab_word_index)
    
    vocab_size = len(vocab_word_index)
    N = len(corpus)
    # Term frequency (tf): count of each vocabulary word in each document
    tf = np.zeros((N, vocab_size))
    tf_dict = {}
    
    for i, doc in enumerate(split_docs_words):
        doc_dict = { word : 0 for word in vocab_word_index }
        for word in doc:
            tf[i][vocab_word_index[word]] += 1
            doc_dict[word] = doc_dict[word] + 1
        tf_dict[i] = doc_dict

    print("N:\t", N)
    print("tf:\t", tf)
    print("tf_dict:\t", tf_dict)

    # Document frequency (df): number of documents in which each word occurs
    df = {}
    for word in vocab_word_index:
        for i, doc in enumerate(split_docs_words):
            if word in doc:
                df[word] = df[word] + 1 if word in df else 1

    print("df\t", df)
    
    # Inverse document frequency (idf): log(N / df), rounded to 3 decimal places
    idf = {}
    for i, word in enumerate(vocab_word_index):
        idf[word] = round(math.log(N/df[word]), 3)

    print("idf:\t", idf)

    # tf-idf weight of each word in each document: tf * idf
    tf_idf = np.zeros((N, vocab_size))
    tf_idf_dict = {}
    
    for i, doc in enumerate(split_docs_words):
        doc_dict = { word : 0 for word in vocab_word_index }
        for word in doc:
            tf_idf[i][vocab_word_index[word]] = tf[i][vocab_word_index[word]] * idf[word] # tf * idf
            doc_dict[word] =  tf[i][vocab_word_index[word]] * idf[word] # tf * idf
        tf_idf_dict[i] = doc_dict
    print("tf_idf:\t", np.around(tf_idf, decimals = 3))
    print("tf_idf_dict:\t", tf_idf_dict)

#corpus =  ["My name is John, What is your name?", "Bill is a very good person. He likes playing soccer.", "What is your favorite game? I love Football. Football is a great game."]
corpus = ["I love playing football.", "Indians love playing Cricket."]
naive_tf_idf(corpus)
