Various word encoding methods such as Integer/Label Encoding and One-Hot Encoding have many limitations. The main one is that these encodings do not capture any semantic relationship between words, so there is no way to tell from the encodings alone that the words “Soccer” and “Football” are closely related.
Also, in One-Hot Encoding, the memory requirement grows as the vocabulary size increases, and the feature space grows with it, which leads to the curse of dimensionality. Integer/Label Encoding, on the other hand, suffers from an undesirable ordinal bias.
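To make this concrete, here is a minimal sketch (with a made-up three-word vocabulary) showing that any two distinct one-hot vectors have zero dot product and exactly the same Euclidean distance from each other, so no notion of "nearby words" can be recovered from them:

```python
import numpy as np

# Toy vocabulary; "soccer" and "football" end up just as far apart as any other pair.
vocab = {"soccer": 0, "football": 1, "cricket": 2}

def one_hot(word, vocab):
    """Return the one-hot vector of a word for the given vocabulary."""
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

soccer, football = one_hot("soccer", vocab), one_hot("football", vocab)
print(np.dot(soccer, football))           # 0.0 -> no similarity signal at all
print(np.linalg.norm(soccer - football))  # 1.414... -> identical distance for every distinct pair
```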
We could use the WordNet list of synonyms to measure similarity, but it fails due to the following issues:
- Many times, two words have the same meaning only in a particular context, and a synonym list does not contain that information.
- We can define synonyms for only a limited set of word pairs; maintaining the list for all pairs of words is not feasible. [Suppose we have 1 million words in the dictionary; we would then have to maintain 1 million * 1 million pairs. We can well imagine what would happen if we had 1 trillion words in our corpus.]
- As the vocabulary evolves, we need to update the synonym list regularly.
- It would be much more efficient and useful if we could calculate the similarity between any two words just by looking at their encodings.
So we need a way to encode words in a low-dimensional space (with a predefined number of dimensions that is independent of the vocabulary size), such that the similarity between two words can be calculated directly from their encodings.
To achieve this, the concept of embeddings was introduced. The idea is to represent each word in a fixed vector space whose size does not depend on the size of the vocabulary. These vector representations are called embeddings, and the similarity between two words can be calculated from the distance between their embedding vectors.
In word embeddings, each word is represented as an N-dimensional vector (N is typically around 300), and the similarity between two words is calculated from the distance between their vectors in this N-dimensional space. This also solves the high-dimensionality issue seen in One-Hot Encoding, where the dimension of the word encoding grows with the vocabulary size.
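As a minimal sketch of how such a similarity is usually computed, the snippet below uses cosine similarity between dense vectors. The vectors here are made-up placeholders, not real embeddings, and only illustrate the calculation:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors: close to 1 = similar, close to 0 = unrelated."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up 4-dimensional "embeddings", just to illustrate the call.
football = np.array([0.9, 0.1, 0.4, 0.0])
soccer   = np.array([0.8, 0.2, 0.5, 0.1])
cat      = np.array([0.0, 0.9, 0.0, 0.8])

print(cosine_similarity(football, soccer))  # ~0.98 -> similar words
print(cosine_similarity(football, cat))     # ~0.08 -> dissimilar words
```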
To calculate word embeddings, the intuition used is that a word's meaning should be represented by the words that frequently appear near it.
[“You shall know a word by the company it keeps” (J. R. Firth 1957: 11)]
The methodologies for generating word embeddings have evolved along the following lines:
- SVD-based methods
- Neural network / iteration-based methods
In SVD-based embedding methods, we first create a co-occurrence matrix in one of the two ways described below.
Then we reduce the dimensionality of the matrix using SVD. After applying SVD, each word in the vocabulary has an embedding in the reduced space, and the distance between two words can be calculated as the distance between their corresponding embedding vectors.
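As a minimal sketch of this reduction step (not the author's code, which appears later in this document): the count matrix M is factorised as M ≈ U * S * V^T, and keeping only the top k singular values gives a k-dimensional embedding for every word. This matches, up to sign, what scikit-learn's TruncatedSVD fit_transform returns in the code further below.

```python
import numpy as np

def svd_embeddings(M, k=2):
    """Reduce a (vocab_size x num_docs) count matrix M to (vocab_size x k) embeddings."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)  # M = U @ diag(S) @ Vt
    return U[:, :k] * S[:k]                           # keep only the top-k singular directions

# Tiny made-up 4-words x 3-docs count matrix, just to show the shapes involved.
M = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)
embeddings = svd_embeddings(M, k=2)
print(embeddings.shape)  # (4, 2): one 2-dimensional embedding per word
```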
[To get more details about SVD, please refer to the Reference links given at the end of the document.]
Let us look at both of the above-mentioned ways of building the co-occurrence matrix.
Document Term Matrix
In this approach, we count the occurrences of each word in each document. Doing this creates a large matrix of size vocabulary-size by number-of-documents. Applying SVD then reduces the matrix to a lower dimension, and each row of the reduced matrix becomes the embedding of the corresponding word.
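As a quick alternative sketch (not the author's implementation, which appears later in this document), scikit-learn's CountVectorizer can build the same kind of document-term counts; transposing its output gives the vocabulary-by-documents matrix described above. The token_pattern below is relaxed so that single-character words such as "I" are kept:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love playing both sports, Cricket and Football",
        "Indians play both sports, Cricket and Football",
        "Football is more popular sport than Cricket"]

# Keep single-character tokens such as "i" (the default pattern drops them).
vectorizer = CountVectorizer(lowercase=True, token_pattern=r"(?u)\b\w+\b")
doc_term = vectorizer.fit_transform(docs)  # shape: (num_docs, vocab_size)
term_doc = doc_term.T.toarray()            # shape: (vocab_size, num_docs)

print(vectorizer.get_feature_names_out())
print(term_doc)
```

Note that CountVectorizer orders its vocabulary alphabetically, so the word indices will differ from the order-of-appearance indices used in the table below, but the counts are the same.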
Example of Document Term Matrix based SVD:
Let us consider that we have the following docs in a corpus:
Doc1: “I love playing both sports, Cricket and Football”
Doc2: “Indians play both sports, Cricket and Football”
Doc3: “Football is more popular sport than Cricket”
The term-document matrix for these documents is as follows:
| Word Index | Word | Doc1 Occurrence | Doc2 Occurrence | Doc3 Occurrence |
|---|---|---|---|---|
| 0 | i | 1 | 0 | 0 |
| 1 | love | 1 | 0 | 0 |
| 2 | playing | 1 | 0 | 0 |
| 3 | both | 1 | 1 | 0 |
| 4 | sports | 1 | 1 | 0 |
| 5 | cricket | 1 | 1 | 1 |
| 6 | and | 1 | 1 | 0 |
| 7 | football | 1 | 1 | 1 |
| 8 | indians | 0 | 1 | 0 |
| 9 | play | 0 | 1 | 0 |
| 10 | is | 0 | 0 | 1 |
| 11 | more | 0 | 0 | 1 |
| 12 | popular | 0 | 0 | 1 |
| 13 | sport | 0 | 0 | 1 |
| 14 | than | 0 | 0 | 1 |
Please observe the following points:
- Our corpus has 15 unique words, so in the above example we get a 15*3 matrix: 15 words and 3 documents.
- The value in a cell is the number of times the word corresponding to the row occurs in the document corresponding to the column.
- The word "sports" at index 4 occurs once in Doc1 and Doc2 and zero times in Doc3. Similarly, "cricket" at index 5 occurs once in all three documents.
In our example, applying SVD transforms the 15*3 matrix into a 15*2 matrix, and we take each row of this reduced matrix as the (2-dimensional) embedding of the corresponding word.
| Word Index | Word | Embedding (dim 1) | Embedding (dim 2) |
|---|---|---|---|
| 0 | i | 0.68012243 | -0.31358878 |
| 1 | love | 0.68012243 | -0.31358878 |
| 2 | playing | 0.68012243 | -0.31358878 |
| 3 | both | 1.30218352 | -0.54500864 |
| 4 | sports | 1.30218352 | -0.54500864 |
| 5 | cricket | 1.69009263 | 0.37591857 |
| 6 | and | 1.30218352 | -0.54500864 |
| 7 | football | 1.69009263 | 0.37591857 |
| 8 | indians | 0.62206109 | -0.23141986 |
| 9 | play | 0.62206109 | -0.23141986 |
| 10 | is | 0.38790911 | 0.92092721 |
| 11 | more | 0.38790911 | 0.92092721 |
| 12 | popular | 0.38790911 | 0.92092721 |
| 13 | sport | 0.38790911 | 0.92092721 |
| 14 | than | 0.38790911 | 0.92092721 |
In a similar way, the matrix can be prepared for a very large corpus: a matrix of N*D would be created, where N is the number of unique words (the vocabulary size) and D is the number of documents in the corpus. In a typical corpus, N might be in the millions and D in the hundreds of thousands.
In a real-world scenario, the N*D matrix is typically converted to N*d using SVD, so every word gets an embedding of dimension d, where d is typically in the range of 100 to 500.
Limitations of Document Term Matrix based approach
A reasonably big document that covers a topic in depth tends to contain many unrelated terms together, which can bias our algorithm. To overcome this problem, let us evaluate a better approach, the "Window-based Co-Occurrence Matrix", in the next section.
The Python code for the above Document Term Matrix example (GitHub code location) is as follows:
```python
'''
Author: Gyan Mittal
Corresponding Document: https://gyan-mittal.com/nlp-ai-ml/nlp-word-embedding-svd-based-methods/#DocTermMatrix
Brief about Document Term Matrix based SVD:
In this, we count the occurrence of each word in each document. Doing this creates a large
matrix of the size of vocabulary into the number of documents. After applying SVD, we can
reduce the dimensionality of the matrix into reduced dimensions, which becomes the embedding
of every word.
'''
from collections import Counter
import itertools
from util import naive_clean_text  # helper from the author's repo: lower-cases and strips punctuation
import numpy as np
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt

# Reduce the dimensionality of a matrix
def reduce_to_k_dim(M, k=2):
    svd = TruncatedSVD(n_components=k, n_iter=100, random_state=456, tol=0.0)
    reduce_matrix_x = svd.fit_transform(M)
    #print(reduce_matrix_x)
    return reduce_matrix_x

def naive_doc_term_matrix(corpus):
    split_docs_words = [naive_clean_text(doc).split() for doc in corpus]
    print("split_docs_words", "\n", split_docs_words)
    word_counts = Counter(itertools.chain(*split_docs_words))
    print("word_counts", "\n", word_counts)
    vocab_word_index = {x: i for i, x in enumerate(word_counts)}
    reverse_vocab_word_index = {i: x for i, x in enumerate(word_counts)}
    print("vocabulary\n", vocab_word_index)
    print("reverse_vocabulary\n", reverse_vocab_word_index)
    vocab_size = len(vocab_word_index)
    N = len(corpus)
    matrix_x = np.zeros((vocab_size, N))
    print(matrix_x)
    # Count how many times each word occurs in each document
    for i, doc in enumerate(split_docs_words):
        for word in doc:
            matrix_x[vocab_word_index[word]][i] += 1
    print("N:\t", N)
    print("matrix_x:\n", matrix_x)
    matrix_x_dict = {}
    for row in enumerate(matrix_x):
        row_index, word_doc_vector = row
        matrix_x_dict[reverse_vocab_word_index[row_index]] = word_doc_vector
    print("Doc Term Matrix:")
    print('{:10s},{:10s}, {}'.format("word index", "word", "Doc Co Occurrence"))
    for key, value in matrix_x_dict.items():
        print('{:<10d}, {:10s}, {}'.format(vocab_word_index[key], key, value))
    return matrix_x, vocab_word_index, reverse_vocab_word_index

def plot_embeddings(reduce_matrix_x, vocabulary):
    for word, i in vocabulary.items():
        x = reduce_matrix_x[i][0]
        y = reduce_matrix_x[i][1]
        #print(word, ":\t", x, ":\t", y)
        plt.scatter(x, y)
        plt.annotate(word, (x, y))
    plt.show()

#corpus = ["I love playing cricket", "you love playing football", "We love playing cricket", "All love playing football"]  #, "You love all sports", "We love all sports"]
corpus = ["I love playing both sports, Cricket and Football",
          "Indians play both sports, Cricket and Football",
          "Football is more popular sport than Cricket"]

#matrix_x, vocabulary, reverse_vocabulary = naive_window_co_occurrence_matrix(corpus, window_size=1)
matrix_x, vocabulary, reverse_vocabulary = naive_doc_term_matrix(corpus)

# Reduce the matrix to 2 columns
reduce_matrix_x = reduce_to_k_dim(matrix_x, k=2)

reduce_matrix_x_dict = {}
for row in enumerate(reduce_matrix_x):
    row_index, word_doc_vector = row
    reduce_matrix_x_dict[reverse_vocabulary[row_index]] = word_doc_vector

print("\n\nReduced Matrix [Embeddings Matrix]:")
print('{:10s}, {:10s}, {}'.format("Word Index", "Word", "Embedding"))
for key, value in reduce_matrix_x_dict.items():
    print('{:<10d}, {:10s}, {}'.format(vocabulary[key], key, value))

plot_embeddings(reduce_matrix_x, vocabulary)
```
Window-based Co-Occurrence Matrix
In this approach, we count, for each word, how often every other word appears within a predefined window around it. The idea is that a word is known by the company it keeps, so words with similar neighbours should get similar representations. Applying this to our corpus gives a matrix of size vocabulary * vocabulary.
We can then reduce its dimensionality with SVD, just as we did for the previous methodology, and again obtain word embeddings.
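As a compact sketch of the counting step (an alternative to the author's fuller implementation shown later in this section), the loop below tallies, for every word, the words that fall within one position on either side:

```python
from collections import defaultdict

docs = ["i love playing cricket", "you love playing football", "we love playing cricket",
        "i love all sports", "you love all sports", "we love all sports"]

window_size = 1
cooc = defaultdict(int)  # (target_word, context_word) -> count

for doc in docs:
    words = doc.split()
    for i, target in enumerate(words):
        lo, hi = max(i - window_size, 0), min(i + window_size, len(words) - 1)
        for j in range(lo, hi + 1):
            if i != j:
                cooc[(target, words[j])] += 1

print(cooc[("love", "playing")])  # 3: "love playing" appears in three documents
```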
Example of Window-based Co-Occurrence Matrix based SVD:
Let us consider that we have the following docs in a corpus:
Doc1: “I love playing cricket”
Doc2: “you love playing football”
Doc3: “We love playing cricket”
Doc4: “I love all sports”
Doc5: “You love all sports”
Doc6: “We love all sports”
The Window-based Co-Occurrence Matrix of these words is as follows (with window size 1):
| Word Index | Word | i (0) | love (1) | playing (2) | cricket (3) | you (4) | football (5) | we (6) | all (7) | sports (8) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | i | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | love | 2 | 0 | 3 | 0 | 2 | 0 | 2 | 3 | 0 |
| 2 | playing | 0 | 3 | 0 | 2 | 0 | 1 | 0 | 0 | 0 |
| 3 | cricket | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | you | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | football | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | we | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | all | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 8 | sports | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 |
We get a matrix of 9*9, since this corpus has 9 unique words. Using SVD, we reduce it to the following 9*2 matrix.
| Word Index | Word | Embedding (dim 1) | Embedding (dim 2) |
|---|---|---|---|
| 0 | i | 1.85E+00 | 2.89E-15 |
| 1 | love | -9.50E-15 | 5.43E+00 |
| 2 | playing | 3.24E+00 | 5.13E-15 |
| 3 | cricket | -2.00E-15 | 1.10E+00 |
| 4 | you | 1.85E+00 | 2.89E-15 |
| 5 | football | -9.99E-16 | 5.49E-01 |
| 6 | we | 1.85E+00 | 2.89E-15 |
| 7 | all | 3.74E+00 | 6.01E-15 |
| 8 | sports | -4.66E-15 | 1.91E+00 |
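A quick sanity check on these vectors, reusing the cosine-similarity helper from earlier with values copied from the table above: "i", "you", and "we" get identical embeddings (each co-occurs only with "love"), and "cricket" and "football" point in almost the same direction even though they never appear next to each other, because they share the neighbour "playing":

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# 2-D embeddings taken from the table above.
cricket  = np.array([-2.00e-15, 1.10e+00])
football = np.array([-9.99e-16, 5.49e-01])
playing  = np.array([3.24e+00, 5.13e-15])

print(cosine_similarity(cricket, football))  # ~1.0 -> very similar
print(cosine_similarity(cricket, playing))   # ~0.0 -> dissimilar
```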

In a similar way, the matrix can be prepared for a very large corpus: a matrix of N*N would be created, where N is the number of unique words (the vocabulary size). In a typical corpus, N might be in the millions.
In a real-world scenario, the N*N matrix is typically converted to N*d using SVD, so every word gets an embedding of dimension d, where d is typically in the range of 100 to 500.
The Python code for the above Window-based Co-Occurrence Matrix example (GitHub code location) is as follows:
```python
'''
Author: Gyan Mittal
Corresponding Document: https://gyan-mittal.com/nlp-ai-ml/nlp-word-embedding-svd-based-methods/#WindowBasedCoOccurrenceMatrix
Brief about Window-based Co-Occurrence Matrix based SVD:
In this, we count the associated words mapping within a predefined window. Idea is that a word
is known by the company it keeps. After applying this to our corpus, we get a matrix of the size
of vocabulary * size of the vocabulary. After applying SVD, we can reduce the dimensionality of
the matrix into reduced dimensions, which becomes the embedding of every word.
About Code: This code demonstrates the concept of Window-based Co-Occurrence Matrix based SVD
with a simple example corpus.
'''
from collections import Counter
import itertools
from util import naive_clean_text  # helper from the author's repo: lower-cases and strips punctuation
import numpy as np
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt

def reduce_to_k_dim(M, k=2):
    svd = TruncatedSVD(n_components=k, n_iter=100, random_state=456, tol=0.0)
    reduce_matrix_x = svd.fit_transform(M)
    #print(reduce_matrix_x)
    return reduce_matrix_x

def naive_window_co_occurrence_matrix(corpus, window_size=4):
    split_docs_words = [naive_clean_text(doc).split() for doc in corpus]
    #print("split_docs_words", "\n", split_docs_words)
    word_counts = Counter(itertools.chain(*split_docs_words))
    #print("word_counts", "\n", word_counts)
    vocab_word_index = {x: i for i, x in enumerate(word_counts)}
    reverse_vocab_word_index = {i: x for i, x in enumerate(word_counts)}
    #print("vocab_word_index\n", vocab_word_index)
    #print("reverse_vocab_word_index\n", reverse_vocab_word_index)
    vocab_size = len(vocab_word_index)
    #print("vocab_size:\t", vocab_size)
    matrix_x = np.zeros((vocab_size, vocab_size))
    # For every target word, count the words appearing within the window on either side
    for line in split_docs_words:
        for i in range(len(line)):
            target = line[i]
            target_index = vocab_word_index[target]
            left = max(i - window_size, 0)
            right = min(i + window_size, len(line) - 1)
            for j in range(left, right + 1):
                window_word = line[j]
                if i != j and target_index != vocab_word_index[window_word]:
                    matrix_x[target_index][vocab_word_index[window_word]] += 1
    matrix_x_dict = {}
    for row in enumerate(matrix_x):
        row_index, word_doc_vector = row
        matrix_x_dict[reverse_vocab_word_index[row_index]] = word_doc_vector
    print("Window Co Occurrence Matrix:")
    print('{:10s}, {:10s}, {}'.format("word index", "word", "Co Occurrence value"))
    for key, value in matrix_x_dict.items():
        print('{:<10d}, {:10s}, {}'.format(vocab_word_index[key], key, value))
    return matrix_x, vocab_word_index, reverse_vocab_word_index

def plot_embeddings(reduce_matrix_x, vocabulary):
    for word, i in vocabulary.items():
        x = reduce_matrix_x[i][0]
        y = reduce_matrix_x[i][1]
        #print(word, ":\t", x, ":\t", y)
        plt.scatter(x, y)
        plt.annotate(word, (x, y))
    plt.show()

corpus = ["I love playing cricket", "you love playing football", "We love playing cricket",
          "I love all sports", "You love all sports", "We love all sports"]

matrix_x, vocabulary, reverse_vocabulary = naive_window_co_occurrence_matrix(corpus, window_size=1)

# Reduce the matrix to 2 columns
reduce_matrix_x = reduce_to_k_dim(matrix_x, k=2)

reduce_matrix_x_dict = {}
for row in enumerate(reduce_matrix_x):
    row_index, word_doc_vector = row
    reduce_matrix_x_dict[reverse_vocabulary[row_index]] = word_doc_vector

print("\n\nReduced Matrix [Embeddings Matrix]:")
print('{:10s}, {:10s}, {}'.format("Word Index", "Word", "Embedding"))
for key, value in reduce_matrix_x_dict.items():
    print('{:<10d}, {:10s}, {}'.format(vocabulary[key], key, value))

plot_embeddings(reduce_matrix_x, vocabulary)
```
Storing the resulting matrix requires a large amount of memory. For example, if our vocabulary has 1 million words, we would have to create a window co-occurrence matrix of 1 million by 1 million entries; stored as a dense matrix of 64-bit floats, that is already about 8 TB. We can imagine the memory requirement for a vocabulary of 1 trillion words.
Limitations of SVD:
- Generally, the matrix has very high dimensions, which results in a high training cost of O(mn²), where m is the size of the dictionary and n is the embedding size. So for a corpus with a large number of words, it becomes very difficult to train.
- The matrix quickly becomes imbalanced due to high-frequency words.
- The matrix dimensions change as soon as a new word is introduced.
Workarounds for some of the above problems:
- Ignore common words like “is”, “the”, etc.
- Weight the co-occurrence of two words based on the distance between them within the window, rather than using the raw count (a minimal sketch of this follows below).
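As a hedged sketch of the second workaround (a common variant, not the author's code): instead of adding 1 for every co-occurrence, add a weight that decays with the distance between the target and the context word, for example 1/distance:

```python
from collections import defaultdict

def weighted_window_counts(docs, window_size=4):
    """Co-occurrence counts weighted by 1/distance instead of raw counts."""
    cooc = defaultdict(float)
    for doc in docs:
        words = doc.lower().split()
        for i, target in enumerate(words):
            lo, hi = max(i - window_size, 0), min(i + window_size, len(words) - 1)
            for j in range(lo, hi + 1):
                if i != j:
                    cooc[(target, words[j])] += 1.0 / abs(i - j)  # closer words count more
    return cooc

counts = weighted_window_counts(["i love playing cricket"], window_size=2)
print(counts[("i", "love")])     # 1.0 (adjacent)
print(counts[("i", "playing")])  # 0.5 (two positions apart)
```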
As we will see, iteration-based models like word2vec solve these problems in a much more elegant way.
References
For more details about SVD, please refer to the following links:
- http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes01-wordvecs1.pdf
- http://web.stanford.edu/class/cs224n/assignments/a1_preview/exploring_word_vectors.html