In the field of NLP, most AI/ ML algorithms cannot work on raw text data directly. We have to find some way to represent text data as numerical data.
Text data consists of words. If we find some way to convert (encode) words into numbers, then the whole dataset can be converted into numerical data that AI/ ML algorithms can consume.
Label/ Integer Encoding of words is one of the earliest and simplest methods used to encode words as numerical data. In this method, we assign a unique integer to every unique word in the corpus, starting from zero.
Let us consider the following two examples of label encoding:
Example 1 of Label/ Integer Encoding:
Let us consider that we have the following two docs in a corpus:
Doc1: “I love playing football.”
Doc2: “Indians love playing Cricket.”
After lowercasing and removing punctuation, our corpus has 6 unique words (the vocabulary/ dictionary size). The Integer/ Label Encoding of a corpus containing the above two docs can be as follows:
label_or_integer_encoding: {'i': 0, 'love': 1, 'playing': 2, 'football': 3, 'indians': 4, 'cricket': 5}
Example 2 of Label/ Integer Encoding:
Let us consider that we have the following three docs in a corpus:
Doc1: “My name is John, What is your name?”
Doc2: “Bill is a very good person. He likes playing soccer.”
Doc3: “What is your favorite game? I love Football. Football is a great game.”
This corpus has 21 unique words (the vocabulary/ dictionary size). The Integer/ Label Encoding of a corpus containing the above three docs can be as follows:
label_or_integer_encoding: {'my': 0, 'name': 1, 'is': 2, 'john': 3, 'what': 4, 'your': 5, 'bill': 6, 'a': 7, 'very': 8, 'good': 9, 'person': 10, 'he': 11, 'likes': 12, 'playing': 13, 'soccer': 14, 'favorite': 15, 'game': 16, 'i': 17, 'love': 18, 'football': 19, 'great': 20}
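Once the mapping exists, each document can be rewritten as a sequence of integers. The following is a minimal sketch of that step for Example 2. The helper names `tokenize` and `build_label_encoding` are illustrative assumptions, and the tokenizer (lowercase, letters only) is a simplification rather than the cleaning routine used in the author's code.

```python
import re

def tokenize(doc):
    # Assumed tokenizer: lowercase the text and keep only runs of letters.
    return re.findall(r"[a-z]+", doc.lower())

def build_label_encoding(corpus):
    # Assign the next free integer to each word the first time it appears.
    encoding = {}
    for doc in corpus:
        for word in tokenize(doc):
            encoding.setdefault(word, len(encoding))
    return encoding

corpus = [
    "My name is John, What is your name?",
    "Bill is a very good person. He likes playing soccer.",
    "What is your favorite game? I love Football. Football is a great game.",
]

encoding = build_label_encoding(corpus)

# Replace every word in every document with its integer label.
sequences = [[encoding[word] for word in tokenize(doc)] for doc in corpus]

print(len(encoding))  # 21
print(sequences[0])   # [0, 1, 2, 3, 4, 2, 5, 1]
```

Repeated words such as "is" and "name" map to the same integer each time they occur, which is why Doc1 has eight tokens but only six distinct labels.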
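Techniques such as Bag-of-Words (BOW) and TF-IDF build on label encoding: typically, the integer assigned to a word serves as that word's position in the document vector. The usage of Integer/ Label encoding in Bag-of-Words (BOW) and TF-IDF is described separately.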
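As a rough illustration of that relationship, the sketch below uses the Example 1 encoding as column indices for simple word-count vectors. The function name `bag_of_words` and the letters-only tokenizer are assumptions made for this example; it is not the implementation from the separate Bag-of-Words article.

```python
import re

# Encoding taken from Example 1 above.
encoding = {"i": 0, "love": 1, "playing": 2, "football": 3, "indians": 4, "cricket": 5}

def bag_of_words(doc, encoding):
    # One counter per vocabulary word; the integer label is the column index.
    counts = [0] * len(encoding)
    for word in re.findall(r"[a-z]+", doc.lower()):
        if word in encoding:  # words outside the vocabulary are ignored
            counts[encoding[word]] += 1
    return counts

doc1_vector = bag_of_words("I love playing football.", encoding)
doc2_vector = bag_of_words("Indians love playing Cricket.", encoding)

print(doc1_vector)  # [1, 1, 1, 1, 0, 0]
print(doc2_vector)  # [0, 1, 1, 0, 1, 1]
```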
Limitations
- One drawback of label encoding is that algorithms generally need a lot of lookups/ if-else statements to work with the labels. In particular, this approach is usually unsuitable as direct input to deep learning-based models (which use multiple layers of networks).
- Integers imply an order and a magnitude (for example, 5 > 1) that do not exist among words, so a model may pick up an undesirable bias from this artificial ordinal representation.
- It doesn’t contain any information about the relationship between two words, e.g. there is no way to find out that the words “motel” and “hotel” are similar.
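The last two limitations can be seen with the labels from Example 2. In the small sketch below, the helper `integer_gap` is a hypothetical name introduced only for illustration; it measures how far apart two labels are.

```python
# Subset of the Example 2 encoding shown above.
encoding = {"soccer": 14, "favorite": 15, "football": 19}

def integer_gap(word_a, word_b):
    # Absolute difference between two integer labels.
    return abs(encoding[word_a] - encoding[word_b])

print(integer_gap("soccer", "football"))  # 5
print(integer_gap("soccer", "favorite"))  # 1
```

By this measure "soccer" looks closer to "favorite" than to "football", simply because of the order in which the words appeared in the corpus. The numbers carry no information about meaning, yet a model that treats them as quantities could still read a pattern into them.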
Therefore, we need a more sophisticated way to represent words (One-Hot encoding or Word Embedding), which addresses some or all of the above limitations and can be used in the various matrix calculations performed in multi-layer networks.
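To give a feel for the first of these alternatives, here is a minimal sketch of how an integer label can be turned into a one-hot vector, using the Example 1 encoding. The function name `one_hot` is an assumption for this illustration, and the sketch uses plain Python lists rather than any particular library.

```python
# Encoding taken from Example 1 above.
encoding = {"i": 0, "love": 1, "playing": 2, "football": 3, "indians": 4, "cricket": 5}

def one_hot(word, encoding):
    # A vector of zeros with a single 1 at the word's integer label.
    vector = [0] * len(encoding)
    vector[encoding[word]] = 1
    return vector

love_vector = one_hot("love", encoding)
print(love_vector)  # [0, 1, 0, 0, 0, 0]
```

Every word now has the same magnitude, so the artificial ordering of the integer labels disappears, although the vectors still say nothing about word similarity.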
The Python code for the above Integer/ Label Encoding examples (GitHub code location) is as follows:
'''
Author: Gyan Mittal
Corresponding Document: https://gyan-mittal.com/nlp-ai-ml/nlp-label-integer-encoding-of-words/
Brief about Label or Integer Encoding:
In the field of NLP, generally, AI/ ML Algorithms don’t work on text data. We have to find some way to represent text data into numerical data.
Any data consists of words. So in case, we find some way to convert (Encode) words into numerical data, then our whole data could be converted into numerical data, which can be consumed by AI/ ML algorithms.
Label/ Integer Encoding of Words is one of the initial methodologies used to encode the words into numerical data. In this methodology, we assign a numerical value to every word in the corpus starting with zero.
About Code:
This code demonstrates the Label or Integer Encoding with two simple example corpus
'''
from collections import Counter
import itertools
from util import naive_clean_text

def naive_label_or_integer_encoding(corpus):
    # Clean each document and split it into a list of words.
    split_docs_words= [naive_clean_text(doc).split() for doc in corpus]
    #print("split_docs_words", "\n", split_docs_words)
    # Counter keeps words in order of first appearance across all documents.
    word_counts = Counter(itertools.chain(*split_docs_words))
    #print("word_counts", "\n", word_counts)
    # Assign each unique word its position in that order, starting from zero.
    label_or_integer_encoding = {x: i for i, x in enumerate(word_counts)}
    print("label_or_integer_encoding\n", label_or_integer_encoding)

# Example 2 corpus (uncomment to use it instead of Example 1):
#corpus = ["My name is John, What is your name?", "Bill is a very good person. He likes playing soccer.", "What is your favorite game? I love Football. Football is a great game."]
# Example 1 corpus:
corpus = ["I love playing football.", "Indians love playing Cricket."]
naive_label_or_integer_encoding(corpus)
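The listing above imports `naive_clean_text` from a `util` module that is not shown here, so it will not run on its own. The sketch below substitutes a hypothetical stand-in for that helper, assuming it lowercases the text and strips punctuation (which is what the example outputs suggest). The real implementation in the author's repository may differ.

```python
import itertools
import re
from collections import Counter

def naive_clean_text(text):
    # Hypothetical stand-in for util.naive_clean_text (an assumption):
    # lowercase, replace non-alphanumeric characters with spaces,
    # then collapse repeated whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

corpus = ["I love playing football.", "Indians love playing Cricket."]

# Same steps as the listing above: clean, split, count, enumerate.
split_docs_words = [naive_clean_text(doc).split() for doc in corpus]
word_counts = Counter(itertools.chain(*split_docs_words))
label_or_integer_encoding = {x: i for i, x in enumerate(word_counts)}

print(label_or_integer_encoding)
# {'i': 0, 'love': 1, 'playing': 2, 'football': 3, 'indians': 4, 'cricket': 5}
```

This relies on `Counter` preserving insertion order, which holds on Python 3.7 and later.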