In NLP, the Bag-of-Words (BOW) model is useful in use cases such as Search, Recommendation, and Classification. This model uses Label/ Integer word encoding.
The Bag-of-Words model counts the occurrence of each word within a document. This count can be treated as the weight of a word in the document, and based on the weights of the different words in a document, use cases like Search, Recommendation, and Classification can be implemented.
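As a minimal sketch of this counting idea using Python's standard library (the sample sentence here is just for illustration):

from collections import Counter

# Count the occurrence of each word in a simple, already-cleaned doc
doc = "football is a great game and football is fun"
word_counts = Counter(doc.split())
print(word_counts)  # Counter({'football': 2, 'is': 2, 'a': 1, ...})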
Example 1 of BOW
Let us consider that we have the following two docs in a corpus:
Doc1: “I love playing football.”
Doc2: “Indians love playing Cricket.”
Our corpus has 6 unique words (the vocabulary/ dictionary size), which can be indexed from 0 to 5 as follows:
vocabulary/ dictionary (Label/ Integer Encoding of Words): {‘i’: 0, ‘love’: 1, ‘playing’: 2, ‘football’: 3, ‘indians’: 4, ‘cricket’: 5}
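This encoding simply assigns each word an index in order of first appearance; a minimal sketch (with the docs hand-cleaned and lowercased for brevity):

docs = ["i love playing football", "indians love playing cricket"]
vocab_word_index = {}
for doc in docs:
    for word in doc.split():
        if word not in vocab_word_index:
            # Next free index goes to each newly seen word
            vocab_word_index[word] = len(vocab_word_index)
print(vocab_word_index)
# {'i': 0, 'love': 1, 'playing': 2, 'football': 3, 'indians': 4, 'cricket': 5}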
The BOW of each doc is given below as a vector of 6 elements (the size of the vocabulary/ dictionary), where each index represents a unique word [the 0th index represents “i”, the 1st index represents “love”, and so on]. Each value is the frequency of the corresponding word in the doc.
| Word Index | 0 | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- | --- |
| Words | i | love | playing | football | indians | cricket |
| BOW of Doc1 | 1 | 1 | 1 | 1 | 0 | 0 |
| BOW of Doc2 | 0 | 1 | 1 | 0 | 1 | 1 |
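Given this vocabulary, the vectors in the table above can be reproduced in a few lines of Python; a minimal sketch (reusing the hand-built vocabulary):

import numpy as np

docs = ["i love playing football", "indians love playing cricket"]
vocab = {'i': 0, 'love': 1, 'playing': 2, 'football': 3, 'indians': 4, 'cricket': 5}

# One row per doc, one column per vocabulary word
bow = np.zeros((len(docs), len(vocab)))
for i, doc in enumerate(docs):
    for word in doc.split():
        bow[i][vocab[word]] += 1
print(bow)
# [[1. 1. 1. 1. 0. 0.]
#  [0. 1. 1. 0. 1. 1.]]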
Example 2 of BOW
We have the following three documents in our corpus:
Doc1: “My name is John, What is your name?”
Doc2: “Bill is a very good person. He likes playing soccer.”
Doc3: “What is your favorite game? I love Football. Football is a great game.”
This corpus has 21 unique words (the vocabulary/ dictionary size):
vocabulary/ dictionary (Label/ Integer Encoding of Words): {‘my’: 0, ‘name’: 1, ‘is’: 2, ‘john’: 3, ‘what’: 4, ‘your’: 5, ‘bill’: 6, ‘a’: 7, ‘very’: 8, ‘good’: 9, ‘person’: 10, ‘he’: 11, ‘likes’: 12, ‘playing’: 13, ‘soccer’: 14, ‘favorite’: 15, ‘game’: 16, ‘i’: 17, ‘love’: 18, ‘football’: 19, ‘great’: 20}
As in the first example, every doc can be represented as a vector of 21 elements (the size of the vocabulary/ dictionary), where each index represents a unique word [the 0th index represents “my”, the 1st index represents “name”, and so on], and each value is the frequency of the corresponding word in the doc.
BOW of Doc1: [1. 2. 2. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
BOW of Doc2: [0. 0. 1. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0.]
BOW of Doc3: [0. 0. 2. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 2. 1. 1. 2. 1.]
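For comparison, a library such as scikit-learn can produce the same kind of count matrix; this is a sketch of an alternative, not the author's code (note that CountVectorizer sorts its vocabulary alphabetically, so the column order differs from the encoding above, and a custom token_pattern is needed to keep one-letter words like “i” and “a”):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["My name is John, What is your name?",
          "Bill is a very good person. He likes playing soccer.",
          "What is your favorite game? I love Football. Football is a great game."]

# Default token_pattern drops single-character tokens; this one keeps them
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
bow = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # alphabetically sorted vocabulary
print(bow.toarray())                       # one count vector per document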
Limitations
The limitations of this approach are as follows:
- Real-world applications have corpora of millions of documents containing millions of unique words, and the results generally get dominated by common words like “is”, “the”, “a”, etc.
- The sequence of words is ignored, so two docs with the same words in a different order get identical vectors (see the sketch after this list)
- There is no way to leverage similar keywords; for example, there is no way to find out that the words “Football” and “Soccer” are similar and exploit the similarity of these two words
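To make the word-order limitation concrete, the following sketch shows two docs with opposite meanings that yield identical word counts (the example sentences are illustrative):

from collections import Counter

doc_a = "john likes mary"
doc_b = "mary likes john"
# Both docs yield the same word counts, so their BOW vectors are identical
print(Counter(doc_a.split()) == Counter(doc_b.split()))  # True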
Python Code for the above examples (GitHub code location) is as follows:
'''
Author: Gyan Mittal
Corresponding Document: https://gyan-mittal.com/nlp-ai-ml/nlp-bag-of-words-bow-model/
Brief about BOW: In NLP, the Bag-of-Words (BOW) model counts the occurrence of each word
within a document. The word count can be considered as the weightage of a word in a document.
This algorithm uses Label/ Integer word encoding.
The Bag-of-Words (BOW) algorithm is useful in Search, Recommendation, Classification, etc. use cases.
About Code: This code demonstrates the Bag-of-Words (BOW) model with two simple example corpora.
'''
from collections import Counter
import itertools
import numpy as np
from util import naive_clean_text  # repo helper that lowercases text and strips punctuation

# Naive implementation of Bag-of-Words (BOW)
def naive_bow(corpus):
    # Clean and tokenize every doc in the corpus
    split_docs_words = [naive_clean_text(doc).split() for doc in corpus]
    print("split_docs_words", "\n", split_docs_words)
    # Count the occurrence of each word across the whole corpus
    word_counts = Counter(itertools.chain(*split_docs_words))
    print("word_counts", "\n", word_counts)
    # Label/ Integer encoding: assign an index to each unique word
    vocab_word_index = {x: i for i, x in enumerate(word_counts)}
    print("vocabulary\n", vocab_word_index)
    vocab_size = len(vocab_word_index)
    no_docs = len(corpus)
    print(vocab_size, no_docs)
    # Fill the docs x vocabulary count matrix
    bow = np.zeros((no_docs, vocab_size))
    for i, doc in enumerate(split_docs_words):
        for word in doc:
            bow[i][vocab_word_index[word]] += 1
    print(bow)

# Sample Corpus
#corpus = ["My name is John, What is your name?", "Bill is a very good person. He likes playing soccer.", "What is your favorite game? I love Football. Football is a great game."]
corpus = ["I love playing football.", "Indians love playing Cricket."]
naive_bow(corpus)