In the field of NLP, AI/ML algorithms generally do not work directly on text, so we have to find some way to represent text data as numerical data.
Text data consists of words. So if we can find some way to convert (encode) words into numerical data, then our whole dataset can be converted into numerical data, which AI/ML algorithms can consume.
Two of the simplest forms of word encoding in NLP are Label/Integer encoding and One-Hot-Vector encoding. Here we will concentrate on One-Hot-Vector encoding, which requires very little computing power to convert text data into one-hot encoded data and is easy to implement.
In this methodology, each word is converted into a vector of N dimensions, where N is the size of the vocabulary, i.e., the number of unique words in the corpus (many models increase the size by one to provision for unknown words). The vector is filled with zeros except for a single (hot) position set to one; each position corresponds to one word in the vocabulary, and the position holding the one identifies the word the vector represents.
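As a minimal sketch of the idea (the toy vocabulary and the helper name below are made up for illustration, not taken from the examples later in this post), a word is encoded by placing a one at its vocabulary index:

# A minimal sketch: one-hot encode a single word given a hypothetical toy vocabulary.
toy_vocab = {"cat": 0, "dog": 1, "bird": 2}

def one_hot(word, vocab):
    vector = [0] * len(vocab)      # vector of zeros, dimension = vocabulary size
    vector[vocab[word]] = 1        # set the single "hot" position for this word
    return vector

print(one_hot("dog", toy_vocab))   # [0, 1, 0]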
One-Hot encoding has an advantage over Label/Integer encoding: the result is binary rather than ordinal, so it does not suffer from the undesirable bias of an implied ordering between words.
However, its huge and sparse vector representation requires a large amount of memory for computation.
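To get a sense of scale (the numbers here are purely illustrative), a vocabulary of 50,000 unique words means every token is encoded as a 50,000-dimensional vector in which only one entry is nonzero, so even a short document of 100 tokens is represented by 5,000,000 numbers, almost all of them zeros.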
Let us consider the following two examples to understand One-Hot encoding better:
Example 1 of One Hot Vector Encoding:
Let us consider that we have the following two docs in the corpus:
Doc 1: “I love playing football.”
Doc 2: “Indians love playing Cricket.”
Our corpus has 6 unique words, which can be represented with indices 0 to 5 as follows:
vocabulary/ dictionary (Label/ Integer Encoding of Words): {‘i’: 0, ‘love’: 1, ‘playing’: 2, ‘football’: 3, ‘indians’: 4, ‘cricket’: 5}
Doc1 and Doc2 can be represented as follows in terms of word indices:
Doc 1: [0, 1, 2, 3]
Doc 2: [4, 1, 2, 5]
Doc1 and Doc2 can be represented as follows in terms of one-hot vectors of dimension 6 (the dictionary size):
Doc 1: [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0]]
Doc 2: [[0, 0, 0, 0, 1, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 1]]
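If NumPy is available, one compact way to reproduce these document matrices (a sketch, not part of the original code at the end of this post) is to index into an identity matrix with the word-index sequences above:

import numpy as np

# Word-index sequences from Example 1 above.
doc1_ids = [0, 1, 2, 3]    # "i love playing football"
doc2_ids = [4, 1, 2, 5]    # "indians love playing cricket"

eye = np.eye(6, dtype=int)        # row i of the identity matrix is the one-hot vector for word id i
print(eye[doc1_ids].tolist())     # matches the Doc 1 vectors above
print(eye[doc2_ids].tolist())     # matches the Doc 2 vectors above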
Example 2 of One Hot Vector Encoding:
Let us consider that we have the following three docs in the corpus:
Doc1: “My name is John, What is your name?”
Doc2: “Bill is a very good person. He likes playing soccer.”
Doc3: “What is your favorite game? I love Football. Football is a great game.”
This corpus has 21 unique words (the vocabulary/dictionary size).
vocabulary/ dictionary (Label/ Integer Encoding of Words): {‘my’: 0, ‘name’: 1, ‘is’: 2, ‘john’: 3, ‘what’: 4, ‘your’: 5, ‘bill’: 6, ‘a’: 7, ‘very’: 8, ‘good’: 9, ‘person’: 10, ‘he’: 11, ‘likes’: 12, ‘playing’: 13, ‘soccer’: 14, ‘favorite’: 15, ‘game’: 16, ‘i’: 17, ‘love’: 18, ‘football’: 19, ‘great’: 20}
For calculation purposes, each word is represented as a one-hot vector of dimension 21 (the dictionary size):
vocabulary_hot_vector
{‘my’: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
‘name’: [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
‘is’: [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
‘john’: [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
‘what’: [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
‘your’: [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
‘bill’: [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
‘a’: [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
‘very’: [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
‘good’: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
‘person’: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
‘he’: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
‘likes’: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
‘playing’: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
‘soccer’: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
‘favorite’: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
‘game’: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
‘i’: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
‘love’: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
‘football’: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
‘great’: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]}
Doc1, Doc2, and Doc3 can be represented as follows in terms of word indices:
Doc 1: [0, 1, 2, 3, 4, 2, 5, 1],
Doc 2: [6, 2, 7, 8, 9, 10, 11, 12, 13, 14],
Doc 3: [4, 2, 5, 15, 16, 17, 18, 19, 19, 2, 7, 20, 16]
Doc1, Doc2, and Doc3 can be represented as follows in terms of one-hot vectors of dimension 21 (the dictionary size):
Doc 1: [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
Doc 2: [[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]],
Doc 3: [[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]]]
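The same identity-matrix trick works for Example 2; as a quick sketch (again assuming NumPy), each document becomes a matrix with one row per token and 21 columns:

import numpy as np

# Word-index sequences from Example 2 above.
doc_ids = [
    [0, 1, 2, 3, 4, 2, 5, 1],                          # Doc 1
    [6, 2, 7, 8, 9, 10, 11, 12, 13, 14],               # Doc 2
    [4, 2, 5, 15, 16, 17, 18, 19, 19, 2, 7, 20, 16],   # Doc 3
]
eye = np.eye(21, dtype=int)
for ids in doc_ids:
    print(eye[ids].shape)    # (8, 21), then (10, 21), then (13, 21)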
Limitations:
- If the vocabulary size is large, the memory requirement increases accordingly.
- The feature space also grows with the vocabulary size, which in turn can lead to the curse of dimensionality.
- The encoding carries no information about the data beyond word identity: even semantically similar words have orthogonal vectors, with no notion of similarity between them (see the short sketch after this list).
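The last point is easy to verify; in the Example 2 vocabulary, 'football' (index 19) and 'soccer' (index 14) are semantically close, yet their one-hot vectors share nothing (a small sketch assuming NumPy):

import numpy as np

eye = np.eye(21, dtype=int)
football, soccer = eye[19], eye[14]
print(np.dot(football, soccer))    # 0 -> the vectors are orthogonal, no similarity is captured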
Python code for the above examples (see the GitHub code location) is as follows:
'''
Author: Gyan Mittal
Corresponding Document: https://gyan-mittal.com/nlp-ai-ml/nlp-word-encoding-by-one-hot-encoding/
Brief about One-Hot encoding: One of the simplest forms of word encoding to represent a word in NLP is
One-Hot-Vector encoding. It requires very little computing power to convert text data into one-hot
encoded data, and it is easy to implement. One-Hot encoding has an advantage over Label/Integer
encoding: the result is binary rather than ordinal, so it does not suffer from undesirable bias.
About Code: This code demonstrates the concept of One-Hot encoding with two simple example corpora.
'''
from collections import Counter
import itertools
from util import naive_clean_text

def naive_one_hot_vector(corpus):
    split_docs_words = [naive_clean_text(doc).split() for doc in corpus]
    print("split_docs_words", "\n", split_docs_words)
    word_counts = Counter(itertools.chain(*split_docs_words))
    print("word_counts", "\n", word_counts)
    vocab_word_index = {x: i for i, x in enumerate(word_counts)}
    print("vocabulary\n", vocab_word_index)
    # One-hot vector of each word in the vocabulary
    vocabulary_hot_vector = {word: [1 if i == vocab_word_index[word] else 0 for i in range(len(vocab_word_index))] for word in vocab_word_index}
    print("vocabulary_hot_vector\n", vocabulary_hot_vector)
    # Each doc in the corpus can be represented as a sequence of word ids instead of words
    doc_sequence_id = [[vocab_word_index[word] for word in sen] for sen in split_docs_words]
    print("doc_sequence_id:\n", doc_sequence_id)
    # Each doc in the corpus can be represented as a sequence of one-hot vectors instead of words or word ids
    one_hot_vector = [[[1 if i == vocab_word_index[word] else 0 for i in range(len(vocab_word_index))] for word in sen] for sen in split_docs_words]
    print("one_hot_vector:\n", one_hot_vector)

# Example 1
corpus = ["I love playing football.", "Indians love playing Cricket."]
# Example 2
#corpus = ["My name is John, What is your name?", "Bill is a very good person. He likes playing soccer.", "What is your favorite game? I love Football. Football is a great game."]

naive_one_hot_vector(corpus)
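The code imports naive_clean_text from the author's util module in the linked GitHub repository. A minimal stand-in, assuming it only lowercases the text and strips punctuation (which is consistent with the vocabularies shown above), could look like this:

import re

def naive_clean_text(text):
    # Assumed behaviour: lowercase, then replace anything that is not a letter or whitespace,
    # so that "Cricket." becomes "cricket" and "John," becomes "john".
    return re.sub(r"[^a-z\s]", " ", text.lower())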