NLP – Bag-of-Words (BOW) model

In NLP, the Bag-of-Words (BOW) model is useful in use cases such as Search, Recommendation, and Classification. The model uses Label/ Integer encoding of words.

The Bag-of-Words model counts the occurrences of each word within a document. The count can be treated as the weight of a word in that document, and these weights can then drive use cases such as Search, Recommendation, and Classification.
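As a quick illustration of this counting idea, here is a minimal sketch in Python using collections.Counter (assuming the document has already been lowercased and stripped of punctuation):

from collections import Counter

# Count how often each word occurs in one (already cleaned) document.
doc = "my name is john what is your name"
word_counts = Counter(doc.split())
print(word_counts)
# Counter({'name': 2, 'is': 2, 'my': 1, 'john': 1, 'what': 1, 'your': 1})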

Example 1 of BOW

Let us consider that we have the following two docs in a corpus:
Doc1: “I love playing football.”
Doc2: “Indians love playing Cricket.”

Our corpus has 6 unique words (the vocabulary/ dictionary size), which can be indexed from 0 to 5 as follows:

vocabulary/ dictionary (Label/ Integer Encoding of Words): {‘i’: 0, ‘love’: 1, ‘playing’: 2, ‘football’: 3, ‘indians’: 4, ‘cricket’: 5}
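A minimal sketch of how such a Label/ Integer encoding can be built in Python (assuming the docs have already been lowercased and stripped of punctuation):

docs = ["i love playing football", "indians love playing cricket"]

# Assign each new word the next free index, in order of first occurrence.
vocab_word_index = {}
for doc in docs:
    for word in doc.split():
        if word not in vocab_word_index:
            vocab_word_index[word] = len(vocab_word_index)
print(vocab_word_index)
# {'i': 0, 'love': 1, 'playing': 2, 'football': 3, 'indians': 4, 'cricket': 5}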

The BOW of each doc is given below as a vector of 6 elements (the size of the vocabulary/ dictionary), where each index represents a unique word [the zeroth index represents “i”, the first index represents “love”, and so on]. Each value is the frequency, in that doc, of the word corresponding to the index.

Word Index    0    1     2        3         4        5
Words         i    love  playing  football  indians  cricket
BOW of Doc1   1    1     1        1         0        0
BOW of Doc2   0    1     1        0         1        1

BOW of Doc1 and Doc2
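The two vectors in this table can be computed directly from the vocabulary; a minimal sketch (with the cleaned docs and the vocabulary from the snippet above inlined):

import numpy as np

docs = ["i love playing football", "indians love playing cricket"]
vocab_word_index = {'i': 0, 'love': 1, 'playing': 2, 'football': 3, 'indians': 4, 'cricket': 5}

# One row per doc, one column per vocabulary word; each cell is a word frequency.
bow = np.zeros((len(docs), len(vocab_word_index)), dtype=int)
for i, doc in enumerate(docs):
    for word in doc.split():
        bow[i][vocab_word_index[word]] += 1
print(bow)
# [[1 1 1 1 0 0]
#  [0 1 1 0 1 1]]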

Example 2 of BOW

We have the following three documents in our corpus:
Doc1:  “My name is John, What is your name?”
Doc2: “Bill is a very good person. He likes playing soccer.”
Doc3:  “What is your favorite game? I love Football. Football is a great game.”

This corpus has 21 unique words (the vocabulary/ dictionary size):
vocabulary/ dictionary (Label/ Integer Encoding of Words): {‘my’: 0, ‘name’: 1, ‘is’: 2, ‘john’: 3, ‘what’: 4, ‘your’: 5, ‘bill’: 6, ‘a’: 7, ‘very’: 8, ‘good’: 9, ‘person’: 10, ‘he’: 11, ‘likes’: 12, ‘playing’: 13, ‘soccer’: 14, ‘favorite’: 15, ‘game’: 16, ‘i’: 17, ‘love’: 18, ‘football’: 19, ‘great’: 20}

As in the first example, every doc can be represented as a vector of 21 elements (the size of the vocabulary/ dictionary), where each index represents a unique word [the 0th index represents “my”, the 1st index represents “name”, and so on]. Each value is the frequency, in that doc, of the word corresponding to the index.

BOW of Doc1: [1. 2. 2. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
BOW of Doc2: [0. 0. 1. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0.]
BOW of Doc3:  [0. 0. 2. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 2. 1. 1. 2. 1.]
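For comparison, a similar matrix can be produced with scikit-learn's CountVectorizer (a sketch, not the code used for the vectors above). Note two differences from the naive approach: CountVectorizer sorts the vocabulary alphabetically rather than by first occurrence, and its default tokenizer drops one-character tokens such as “i” and “a”, so the token_pattern below is widened to keep them:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["My name is John, What is your name?",
          "Bill is a very good person. He likes playing soccer.",
          "What is your favorite game? I love Football. Football is a great game."]

# \b\w+\b keeps one-character tokens; the default pattern requires 2+ characters.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
bow = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # the 21 words, in alphabetical order
print(bow.toarray())                       # same counts, different column order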

Limitations

Limitations of this approach are as follows:

  • Real-world applications have a corpus of millions of documents containing millions of unique words; the results generally get dominated by common words like “is”, “the”, “a”, etc.
  • The sequence of words is ignored, so two sentences with the same words but different meanings get identical vectors (see the sketch after this list)
  • There is no way to leverage similar keywords; for example, there is no way to find out that the words “Football” and “Soccer” are similar and exploit that similarity
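The word-order limitation is easy to demonstrate: two sentences built from the same words get exactly the same bag of words, whatever they mean. A minimal sketch:

from collections import Counter

# Opposite meanings, identical bags of words.
doc_a = "john likes mary".split()
doc_b = "mary likes john".split()
print(Counter(doc_a) == Counter(doc_b))  # True: the BOW vectors are identical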

Python code for the above examples (GitHub code location) is as follows:

'''
Author: Gyan Mittal
Corresponding Document: https://gyan-mittal.com/nlp-ai-ml/nlp-bag-of-words-bow-model/
Brief about BOW:
In NLP, the Bag-of-Words (BOW) model counts the occurrence of each word within a document.
The word count can be considered as the weight of a word in a document.
This algorithm uses Label/ Integer word encoding.
The Bag-of-Words (BOW) algorithm is useful in Search, Recommendation, Classification, etc. use cases.
About Code: This code demonstrates the Bag-of-Words (BOW) model on two simple example corpora
'''

from collections import Counter
import itertools
import numpy as np
from util import naive_clean_text

# Naive implementation of the Bag-of-Words (BOW) algorithm
def naive_bow(corpus):
    # Clean each document (naive_clean_text, from the accompanying util module,
    # is expected to lowercase the text and strip punctuation) and split it into words.
    split_docs_words = [naive_clean_text(doc).split() for doc in corpus]
    print("split_docs_words", "\n", split_docs_words)
    # Count word occurrences across the whole corpus.
    word_counts = Counter(itertools.chain(*split_docs_words))
    print("word_counts", "\n", word_counts)

    # Label/ Integer encoding: each unique word gets an index, in order of first occurrence.
    vocab_word_index = {x: i for i, x in enumerate(word_counts)}
    print("vocabulary\n", vocab_word_index)

    vocab_size = len(vocab_word_index)
    no_docs = len(corpus)
    print(vocab_size, no_docs)
    # One row per document, one column per vocabulary word; each cell is a word frequency.
    bow = np.zeros((no_docs, vocab_size))
    for i, doc in enumerate(split_docs_words):
        for word in doc:
            bow[i][vocab_word_index[word]] += 1
    print(bow)

# Sample corpus (uncomment the next line to run Example 2 instead)
#corpus =  ["My name is John, What is your name?", "Bill is a very good person. He likes playing soccer.", "What is your favorite game? I love Football. Football is a great game."]
corpus = ["I love playing football.", "Indians love playing Cricket."]
naive_bow(corpus)
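Assuming naive_clean_text (from the util module in the same GitHub repo) lowercases the text and strips punctuation, as the Example 1 vocabulary above implies, running the script prints output along these lines:

split_docs_words
 [['i', 'love', 'playing', 'football'], ['indians', 'love', 'playing', 'cricket']]
word_counts
 Counter({'love': 2, 'playing': 2, 'i': 1, 'football': 1, 'indians': 1, 'cricket': 1})
vocabulary
 {'i': 0, 'love': 1, 'playing': 2, 'football': 3, 'indians': 4, 'cricket': 5}
6 2
[[1. 1. 1. 1. 0. 0.]
 [0. 1. 1. 0. 1. 1.]]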


About Gyan Mittal

Gyan is Engineering Head at Times Internet Limited. He is a technology leader and AI-ML expert with 22+ years of experience in hands-on architecture, design, and development of a broad range of software technologies. Gyan likes writing blogs on technology, especially on Artificial Intelligence, Machine Learning, NLP, and Data Science. In his blogs, Gyan tries to explain complex concepts in simple language, mostly supported by very simple example code (with the corresponding GitHub link) that is easy to understand. Gyan did his B.Tech. at IIT Kanpur and M.Tech. at IIT Madras, and completed the Artificial Intelligence Professional Program at Stanford University (2019-2021).