NLP – Word Encoding by One-Hot Vector

In the field of NLP, AI/ML algorithms generally cannot work directly on text. We have to find some way to represent text data as numerical data.

Any text data consists of words. So if we find a way to convert (encode) words into numerical data, then the whole corpus can be converted into numerical data that AI/ML algorithms can consume.

Two of the simplest forms of word encoding in NLP are label/integer encoding and one-hot vector encoding. Here we will concentrate on one-hot vector encoding, which requires very little computing power to convert text data into one-hot encoded data and is easy to implement.

In this methodology, each word is converted into a vector of N dimensions (where N is the size of the vocabulary). The vector is filled with zeros, except for a single (hot) position that is set to one to represent the corresponding word.

The size of the vector equals the number of unique words in the corpus (many models increase the size by one to provision for unknown words). Each position corresponds to a word, and the position holding the one identifies the word that the vector represents.
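For instance, given a toy hand-built vocabulary (a minimal sketch, separate from the full code shown later), a word can be mapped to its one-hot vector like this:

# Minimal sketch: one-hot encode a single word against a small hand-built vocabulary
vocab_word_index = {'i': 0, 'love': 1, 'playing': 2, 'football': 3}

def one_hot(word, vocab_word_index):
    vector = [0] * len(vocab_word_index)    # N-dimensional vector of zeros
    vector[vocab_word_index[word]] = 1      # set the single (hot) position to one
    return vector

print(one_hot('playing', vocab_word_index))    # [0, 0, 1, 0]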

One-hot encoding has an advantage over label/integer encoding: the result is binary rather than ordinal, so it does not suffer from an undesirable ordering bias (an integer code of 5 for 'cricket' does not mean cricket is somehow "greater than" football coded as 3).

However, its immense and sparse vector representation requires a large amount of memory for computation: with a vocabulary of 100,000 words, for example, every single word becomes a 100,000-dimensional vector with only one non-zero entry.

Let us consider the following two examples to understand one-hot encoding better:

Example 1 of One Hot Vector Encoding:

Let us consider that we have the following two docs in the corpus:
Doc 1: “I love playing football.”
Doc 2: “Indians love playing Cricket.”

Our corpus has 6 unique words, which can be represented by indices 0 to 5 as follows:
vocabulary/ dictionary (Label/ Integer Encoding of Words): {‘i’: 0, ‘love’: 1, ‘playing’: 2, ‘football’: 3, ‘indians’: 4, ‘cricket’: 5}

Doc1 and Doc2 can be represented as follows in terms of word index:
Doc 1: [0, 1, 2, 3]
Doc 2: [4, 1, 2, 5]

Doc1 and Doc2 can be represented as follows in terms of one-hot vector of the dimension 6 (dictionary size):
Doc 1: [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0]]
Doc 2: [[0, 0, 0, 0, 1, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 1]]
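As a quick check, the Doc 1 vectors above can be reproduced from its word-index sequence and the dictionary size (a small sketch reusing the mapping from this example):

# Rebuild the Doc 1 one-hot vectors of Example 1 from its word-index sequence
vocab_size = 6
doc1_word_ids = [0, 1, 2, 3]    # "i love playing football"
doc1_one_hot = [[1 if i == word_id else 0 for i in range(vocab_size)] for word_id in doc1_word_ids]
print(doc1_one_hot)
# [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0]]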

Example 2 of One Hot Vector Encoding:

Let us consider that we have the following three docs in the corpus:
Doc1:  “My name is John, What is your name?”
Doc2: “Bill is a very good person. He likes playing soccer.”
Doc3:  “What is your favorite game? I love Football. Football is a great game.”

This corpus has 21 unique words (vocabulary/dictionary size):
vocabulary/ dictionary (Label/ Integer Encoding of Words): {‘my’: 0, ‘name’: 1, ‘is’: 2, ‘john’: 3, ‘what’: 4, ‘your’: 5, ‘bill’: 6, ‘a’: 7, ‘very’: 8, ‘good’: 9, ‘person’: 10, ‘he’: 11, ‘likes’: 12, ‘playing’: 13, ‘soccer’: 14, ‘favorite’: 15, ‘game’: 16, ‘i’: 17, ‘love’: 18, ‘football’: 19, ‘great’: 20}

For calculation purposes, each word is represented as a one-hot vector of dimension 21 (the dictionary size):

vocabulary_hot_vector

 {‘my’: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
‘name’: [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
‘is’: [0, 0, 1, 0, 0, 0,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
‘john’: [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
‘what’: [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
‘your’: [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
‘bill’: [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
‘a’: [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
‘very’: [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
‘good’: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
‘person’: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
‘he’: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
‘likes’: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0], 
‘playing’: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
‘soccer’: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
‘favorite’: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
‘game’: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
‘i’: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
‘love’: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
‘football’: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
‘great’: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]}

Doc1, Doc2, and Doc3 can be represented as follows in terms of word index:
Doc 1: [0, 1, 2, 3, 4, 2, 5, 1]
Doc 2: [6, 2, 7, 8, 9, 10, 11, 12, 13, 14]
Doc 3: [4, 2, 5, 15, 16, 17, 18, 19, 19, 2, 7, 20, 16]

Doc1, Doc2, and Doc3 can be represented as follows in terms of one-hot vectors of dimension 21 (dictionary size):

Doc 1: [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
Doc 2: [[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]], 
Doc 3: [[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]]]

Limitations:

  • When the vocabulary size is large, the memory requirement increases accordingly.
  • The feature space also grows with the vocabulary size, which in turn can lead to the curse of dimensionality.
  • The encoding carries no information about word meaning: even semantically similar words have orthogonal vectors, so no notion of similarity is captured (see the short sketch below).
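The last limitation is easy to verify: the dot product between the one-hot vectors of any two different words is always zero, however related the words are, so a similarity measure such as cosine similarity gives no signal. A small illustration using the Example 1 vocabulary:

# One-hot vectors of two different words are always orthogonal
football = [0, 0, 0, 1, 0, 0]   # 'football' in the Example 1 vocabulary
cricket = [0, 0, 0, 0, 0, 1]    # 'cricket' in the Example 1 vocabulary
dot_product = sum(f * c for f, c in zip(football, cricket))
print(dot_product)    # 0 -> no similarity information between 'football' and 'cricket'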

Python code for the above examples (GitHub code location) is as follows:

'''
Author: Gyan Mittal
Corresponding Document: https://gyan-mittal.com/nlp-ai-ml/nlp-word-encoding-by-one-hot-encoding/
Brief about one-hot encoding:
One of the simplest forms of word encoding to represent a word in NLP is one-hot vector encoding.
It requires very little computing power to convert text data into one-hot encoded data, and it is easy to implement.
One-hot encoding has an advantage over label/integer encoding: the result is binary rather than ordinal, so it does not suffer from undesirable bias.
About Code: This code demonstrates the concept of one-hot encoding with two simple example corpora.
'''
from collections import Counter
import itertools
from util import naive_clean_text

def naive_one_hot_vector(corpus):
    split_docs_words = [naive_clean_text(doc).split() for doc in corpus]
    print("split_docs_words", "\n", split_docs_words)
    word_counts = Counter(itertools.chain(*split_docs_words))
    print("word_counts", "\n", word_counts)

    vocab_word_index = {x: i for i, x in enumerate(word_counts)}
    print("vocabulary\n", vocab_word_index)

    # One-hot vector of each word in the vocabulary
    vocabulary_hot_vector = {word: [1 if i == idx else 0 for i in range(len(vocab_word_index))] for word, idx in vocab_word_index.items()}
    print("vocabulary_hot_vector\n", vocabulary_hot_vector)

    # Each doc in the corpus can be represented as a sequence of word ids instead of words
    doc_sequence_id = [[vocab_word_index[word] for word in sen] for sen in split_docs_words]
    print("doc_sequence_id:\n", doc_sequence_id)

    # Each doc in the corpus can be represented as a sequence of one-hot vectors instead of words or word ids
    one_hot_vector = [[[1 if i == vocab_word_index[word] else 0 for i in range(len(vocab_word_index))] for word in sen] for sen in split_docs_words]
    print("one_hot_vector:\n", one_hot_vector)

#Example 1
corpus = ["I love playing football.", "Indians love playing Cricket."]
#Example 2
#corpus =  ["My name is John, What is your name?", "Bill is a very good person. He likes playing soccer.", "What is your favorite game? I love Football. Football is a great game."]
naive_one_hot_vector(corpus)
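Note that naive_clean_text is imported from the util module of the accompanying GitHub code. To run the script standalone, a minimal stand-in could look like the sketch below (an assumption about what the helper does: lowercase the text and strip punctuation, which reproduces the vocabularies shown above):

import re

def naive_clean_text(text):
    # Hypothetical stand-in for util.naive_clean_text:
    # lowercase and keep only letters, digits, and whitespace
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()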
