In NLP, the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm is useful in use cases such as Search, Recommendation, and Classification. This model uses Label/Integer word encoding.
TF-IDF (Term Frequency-Inverse Document Frequency) was developed to overcome one of the limitations of the Bag of Words (BOW) methodology, namely that results generally get dominated by common words like “is”, “the”, and “a”.
It gives importance to a term in a document if the term occurs multiple times in that document, and it penalizes the importance of a term if the term occurs very frequently across the documents of the corpus.
TF-IDF Equation
tf_idf(t, d, D) = tf(t, d) * idf(t, D), where idf(t, D) = log(N / df(t))
Where,
tf(t, d): Term frequency of the word t in document d (Term Frequency)
idf(t, D): Inverse of the number of documents in the corpus that contain the word t (Inverse Document Frequency)
t: Denotes the term
d: Denotes the document
D: Denotes all the documents in the corpus
N: Number of documents in the corpus
df(t): Number of documents in which the word t occurs
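For clarity, here is a minimal sketch of this formula in Python; the helper name tf_idf_weight is hypothetical and not part of the article's code.

import math

def tf_idf_weight(tf, df, N):
    # tf: term frequency of the term in the document
    # df: number of documents in the corpus that contain the term
    # N: total number of documents in the corpus
    return tf * math.log(N / df)

# For example, a term appearing once in a document and in 1 of 2 documents overall:
print(tf_idf_weight(1, 1, 2))   # ~0.693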
Example 1 of TF-IDF:
Let us consider that we have the following two docs in a corpus:
Doc1: “I love playing football.”
Doc2: “Indians love playing Cricket.”
Our corpus has 6 unique words (vocabulary/dictionary size). They can be represented with indices 0 to 5 as follows:
vocabulary/ dictionary (Label/ Integer Encoding of Words): {‘i’: 0, ‘love’: 1, ‘playing’: 2, ‘football’: 3, ‘indians’: 4, ‘cricket’: 5}
Number of Documents (N): 2
The term frequency (tf) of each word in each document is as follows (same as BOW):
Doc 1: [1. 1. 1. 1. 0. 0.]
Doc 2: [0. 1. 1. 0. 1. 1.]
The document frequency (df) of each term (the number of documents in which the term occurs) is as follows:
df: {‘i’: 1, ‘love’: 2, ‘playing’: 2, ‘football’: 1, ‘indians’: 1, ‘cricket’: 1}
The Inverse Document Frequency (idf) of each term in the corpus is as follows:
idf = log(N/df): {‘i’: 0.6931471805599453, ‘love’: 0.0, ‘playing’: 0.0, ‘football’: 0.6931471805599453, ‘indians’: 0.6931471805599453, ‘cricket’: 0.6931471805599453}
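These values can be verified directly with Python's math.log (a quick check, using the natural logarithm as in the code at the end of this article):

import math

N = 2
print(math.log(N / 1))   # 0.6931471805599453 -> idf of 'i', 'football', 'indians', 'cricket'
print(math.log(N / 2))   # 0.0 -> idf of 'love' and 'playing'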
After calculation, the TF-IDF weights of each term in each document are as follows:
tf_idf (tf * idf):
 | i | love | playing | football | indians | cricket |
Doc 1 | 0.69 | 0. | 0. | 0.69 | 0. | 0. |
Doc 2 | 0. | 0. | 0. | 0. | 0.69 | 0.69 |
As we can see, the weights of all words are similar to BOW in this example, except for the keywords “love” and “playing”: since they appear in both documents, they are penalized by IDF and their weight becomes zero in both documents.
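For comparison, the same corpus can be run through scikit-learn's TfidfVectorizer, as sketched below. This is only a rough equivalent, not the article's method: scikit-learn's defaults use a smoothed idf of the form log((1+N)/(1+df)) + 1, apply L2 normalization, and drop single-character tokens such as “i”, so the numbers will not match the table above exactly.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I love playing football.", "Indians love playing Cricket."]
vectorizer = TfidfVectorizer()               # default settings
tf_idf_matrix = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())    # vocabulary learned from the corpus
print(tf_idf_matrix.toarray())               # smoothed, L2-normalized TF-IDF weights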
Example 2 of TF-IDF:
Let us consider that we have the following three docs in a corpus:
Doc1: “My name is John, What is your name?”
Doc2: “Bill is a very good person. He likes playing soccer.”
Doc3: “What is your favorite game? I love Football. Football is a great game.”
This corpus has 21 unique words (vocabulary/dictionary size).
vocabulary/ dictionary (Label/ Integer Encoding): {‘my’: 0, ‘name’: 1, ‘is’: 2, ‘john’: 3, ‘what’: 4, ‘your’: 5, ‘bill’: 6, ‘a’: 7, ‘very’: 8, ‘good’: 9, ‘person’: 10, ‘he’: 11, ‘likes’: 12, ‘playing’: 13, ‘soccer’: 14, ‘favorite’: 15, ‘game’: 16, ‘i’: 17, ‘love’: 18, ‘football’: 19, ‘great’: 20}
Number of Documents (N): 3
The term frequency (tf) of each word in each document is as follows (same as BOW):
tf:
Doc 1: {‘my’: 1, ‘name’: 2, ‘is’: 2, ‘john’: 1, ‘what’: 1, ‘your’: 1, ‘bill’: 0, ‘a’: 0, ‘very’: 0, ‘good’: 0, ‘person’: 0, ‘he’: 0, ‘likes’: 0, ‘playing’: 0, ‘soccer’: 0, ‘favorite’: 0, ‘game’: 0, ‘i’: 0, ‘love’: 0, ‘football’: 0, ‘great’: 0},
Doc 2: {‘my’: 0, ‘name’: 0, ‘is’: 1, ‘john’: 0, ‘what’: 0, ‘your’: 0, ‘bill’: 1, ‘a’: 1, ‘very’: 1, ‘good’: 1, ‘person’: 1, ‘he’: 1, ‘likes’: 1, ‘playing’: 1, ‘soccer’: 1, ‘favorite’: 0, ‘game’: 0, ‘i’: 0, ‘love’: 0, ‘football’: 0, ‘great’: 0},
Doc 3: {‘my’: 0, ‘name’: 0, ‘is’: 2, ‘john’: 0, ‘what’: 1, ‘your’: 1, ‘bill’: 0, ‘a’: 1, ‘very’: 0, ‘good’: 0, ‘person’: 0, ‘he’: 0, ‘likes’: 0, ‘playing’: 0, ‘soccer’: 0, ‘favorite’: 1, ‘game’: 2, ‘i’: 1, ‘love’: 1, ‘football’: 2, ‘great’: 1}
The document frequency (df) of each term (the number of documents in which the term occurs) is as follows:
df: {‘my’: 1, ‘name’: 1, ‘is’: 3, ‘john’: 1, ‘what’: 2, ‘your’: 2, ‘bill’: 1, ‘a’: 2, ‘very’: 1, ‘good’: 1, ‘person’: 1, ‘he’: 1, ‘likes’: 1, ‘playing’: 1, ‘soccer’: 1, ‘favorite’: 1, ‘game’: 1, ‘i’: 1, ‘love’: 1, ‘football’: 1, ‘great’: 1}
The Inverse Document Frequency (idf) of each term in the corpus is as follows:
idf = log(N/df):
{‘my’: 1.099, ‘name’: 1.099, ‘is’: 0.0, ‘john’: 1.099, ‘what’: 0.405, ‘your’: 0.405, ‘bill’: 1.099, ‘a’: 0.405, ‘very’: 1.099, ‘good’: 1.099, ‘person’: 1.099, ‘he’: 1.099, ‘likes’: 1.099, ‘playing’: 1.099, ‘soccer’: 1.099, ‘favorite’: 1.099, ‘game’: 1.099, ‘i’: 1.099, ‘love’: 1.099, ‘football’: 1.099, ‘great’: 1.099}
After calculation, the TF-IDF weights of each term in each document are as follows:
tf_idf (tf * idf):
Doc 1: {‘my’: 1.099, ‘name’: 2.198, ‘is’: 0.0, ‘john’: 1.099, ‘what’: 0.405, ‘your’: 0.405, ‘bill’: 0, ‘a’: 0, ‘very’: 0, ‘good’: 0, ‘person’: 0, ‘he’: 0, ‘likes’: 0, ‘playing’: 0, ‘soccer’: 0, ‘favorite’: 0, ‘game’: 0, ‘i’: 0, ‘love’: 0, ‘football’: 0, ‘great’: 0},
Doc 2: {‘my’: 0, ‘name’: 0, ‘is’: 0.0, ‘john’: 0, ‘what’: 0, ‘your’: 0, ‘bill’: 1.099, ‘a’: 0.405, ‘very’: 1.099, ‘good’: 1.099, ‘person’: 1.099, ‘he’: 1.099, ‘likes’: 1.099, ‘playing’: 1.099, ‘soccer’: 1.099, ‘favorite’: 0, ‘game’: 0, ‘i’: 0, ‘love’: 0, ‘football’: 0, ‘great’: 0},
Doc 3: {‘my’: 0, ‘name’: 0, ‘is’: 0.0, ‘john’: 0, ‘what’: 0.405, ‘your’: 0.405, ‘bill’: 0, ‘a’: 0.405, ‘very’: 0, ‘good’: 0, ‘person’: 0, ‘he’: 0, ‘likes’: 0, ‘playing’: 0, ‘soccer’: 0, ‘favorite’: 1.099, ‘game’: 2.198, ‘i’: 1.099, ‘love’: 1.099, ‘football’: 2.198, ‘great’: 1.099}
As we can see, the weights follow a pattern similar to BOW in this example. Please note the following observations:
- If a word does not occur in a document, then its weight is zero in that document (same as in the case of BOW).
- The weights of common words like “what” and “your” (which occur in both Doc 1 and Doc 3) are penalized in the documents where they occur, as they appear in two documents. In contrast, words like “my” (Doc 1 only) and “person” (Doc 2 only) get higher weights in their documents, as they occur in only one document.
- The weight of “is” is penalized to zero in all documents, as it occurs in all three documents.
- The keywords “Football” and “game” got very high weights in Document 3 (2.198), as they occur twice in that document and in no other document. A small sketch of how such weights can serve the search use case mentioned at the start follows this list.
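As a small illustration of the search use case, the sketch below scores each document against a query by summing the TF-IDF weights of the query terms it contains. The weights are copied from the Example 2 results above; this additive scoring is a simplification for illustration, not the only way to rank documents.

query = ["football", "game"]

# TF-IDF weights from the Example 2 results above (zero entries omitted)
tf_idf = {
    "Doc 1": {"my": 1.099, "name": 2.198, "john": 1.099, "what": 0.405, "your": 0.405},
    "Doc 2": {"bill": 1.099, "a": 0.405, "very": 1.099, "good": 1.099, "person": 1.099,
              "he": 1.099, "likes": 1.099, "playing": 1.099, "soccer": 1.099},
    "Doc 3": {"what": 0.405, "your": 0.405, "a": 0.405, "favorite": 1.099, "game": 2.198,
              "i": 1.099, "love": 1.099, "football": 2.198, "great": 1.099},
}

for doc, weights in tf_idf.items():
    score = sum(weights.get(term, 0) for term in query)
    print(doc, round(score, 3))   # Doc 3 scores highest (4.396)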
Further improvements to the TF-IDF algorithm:
- Eliminate common/stop words [“a”, “is”, etc.]
- Stemming [treat “play” and “playing” as the same] (see the sketch of these first two improvements after this list)
- Defining synonyms [define/maintain a synonym dictionary, e.g. “Soccer” is a synonym of “Football”]
- etc…
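A brief sketch of the first two improvements is shown below. scikit-learn's built-in English stop-word list and NLTK's PorterStemmer are used here only as examples; the article does not prescribe specific libraries.

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import PorterStemmer

# Stemming: "play", "plays", and "playing" all map to the same stem
stemmer = PorterStemmer()
print(stemmer.stem("play"), stemmer.stem("plays"), stemmer.stem("playing"))   # play play play

# Stop-word elimination: common words like "a", "is", "what" are dropped from the vocabulary
corpus = ["My name is John, What is your name?",
          "Bill is a very good person. He likes playing soccer.",
          "What is your favorite game? I love Football. Football is a great game."]
vectorizer = TfidfVectorizer(stop_words="english")
print(vectorizer.fit_transform(corpus).shape)
print(vectorizer.get_feature_names_out())     # stop words no longer appear in the vocabulary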
Limitations
The limitations of TF-IDF are as follows:
- It has no way to leverage similar keywords; for example, there is no way to find out that the words “Football” and “Soccer” are similar and exploit that similarity.
- Algorithms have tried to solve this by creating synonym lists manually, but this requires a lot of manual effort and the maintenance cost is very high. Also, synonyms generally change based on context, which cannot be covered by a synonym list.
So there is a need to learn word embeddings through an algorithm, such that semantically similar words have similar embeddings, which can then be leveraged in our use cases.
The Python code for the above examples (Github code Location) is as follows:
'''
Author: Gyan Mittal
Corresponding Document: https://gyan-mittal.com/nlp-ai-ml/nlp-tf-idf-term-frequency-inverse-document-frequency-model/
Brief about TF-IDF (Term Frequency-Inverse Document Frequency):
To overcome the limitation of common words in the Bag of Words (BOW) methodology, a technique called
TF-IDF (Term Frequency-Inverse Document Frequency) has been developed. It gives importance to a term
in a document if it occurs multiple times in that document, and penalizes the importance of a term
if it occurs very frequently across the documents of the corpus.
About Code: This code demonstrates TF-IDF (Term Frequency-Inverse Document Frequency) with two simple example corpora.
'''
from collections import Counter
import itertools
import numpy as np
import math
from util import naive_clean_text

def naive_tf_idf(corpus):
    # Clean and tokenize each document
    split_docs_words = [naive_clean_text(doc).split() for doc in corpus]
    word_counts = Counter(itertools.chain(*split_docs_words))
    #print("word_counts", "\n", word_counts)
    # Label/Integer encoding of the vocabulary
    vocab_word_index = {x: i for i, x in enumerate(word_counts)}
    reverse_vocab_word_index = {i: x for i, x in enumerate(word_counts)}
    print("vocab_word_index\n", vocab_word_index)
    print("reverse_vocab_word_index\n", reverse_vocab_word_index)
    vocab_size = len(vocab_word_index)
    N = len(corpus)

    # Term frequency (tf) of each word in each document
    tf = np.zeros((N, vocab_size))
    tf_dict = {}
    for i, doc in enumerate(split_docs_words):
        doc_dict = {word: 0 for word in vocab_word_index}
        for word in doc:
            tf[i][vocab_word_index[word]] += 1
            doc_dict[word] = doc_dict[word] + 1
        tf_dict[i] = doc_dict
    print("N:\t", N)
    print("tf:\t", tf)
    print("tf_dict:\t", tf_dict)

    # Document frequency (df): number of documents in which each word occurs
    df = {}
    for word in vocab_word_index:
        for i, doc in enumerate(split_docs_words):
            if word in doc:
                df[word] = df[word] + 1 if word in df else 1
    print("df\t", df)

    # Inverse document frequency (idf) of each word: log(N/df)
    #idf = np.zeros(vocab_size)
    idf = {}
    for i, word in enumerate(vocab_word_index):
        idf[word] = round(math.log(N / df[word]), 3)
    print("idf:\t", idf)

    # TF-IDF weight of each word in each document: tf * idf
    tf_idf = np.zeros((N, vocab_size))
    tf_idf_dict = {}
    for i, doc in enumerate(split_docs_words):
        doc_dict = {word: 0 for word in vocab_word_index}
        for word in doc:
            tf_idf[i][vocab_word_index[word]] = tf[i][vocab_word_index[word]] * idf[word]  # tf * idf
            doc_dict[word] = tf[i][vocab_word_index[word]] * idf[word]  # tf * idf
        tf_idf_dict[i] = doc_dict
    print("tf_idf:\t", np.around(tf_idf, decimals=3))
    print("tf_idf_dict:\t", tf_idf_dict)

#corpus = ["My name is John, What is your name?", "Bill is a very good person. He likes playing soccer.", "What is your favorite game? I love Football. Football is a great game."]
corpus = ["I love playing football.", "Indians love playing Cricket."]
naive_tf_idf(corpus)