NLP – SVD based Word Embedding

Various word encoding methods, such as Integer/Label Encoding or One-Hot Encoding, have many limitations. The main limitation is that these encodings do not capture the semantic relationship between words. So, using these encodings, we have no way to find out that the words “Soccer” and “Football” are closely related.

Also, in One-Hot Encoding, the memory requirement grows as the vocabulary size increases, and so does the feature space. A larger feature space leads to curse-of-dimensionality issues, whereas Integer/Label Encoding suffers from an undesirable bias because it imposes an arbitrary ordering on the words.

We could use the WordNet list of synonyms to measure similarity, but it fails due to the following issues:

  1. Many times, two words have the same meaning only in a particular context; a synonym list does not capture that information.
  2. We can define synonyms for only a limited set of word pairs; maintaining the list for all pairs of words is not feasible. [Suppose we have 1 million words in the dictionary; then we would have to maintain 1 million * 1 million pairs. We can well imagine what would happen if we had 1 trillion words in our corpus.]
  3. As the vocabulary evolves, we need to update the synonym list regularly.
  4. It would be much more efficient and useful if we could calculate the similarity between any two words just by looking at their encodings.

So we need a way to encode words in a low-dimensional space (with a predefined number of dimensions that is independent of the vocabulary size) such that the similarity between two words can be calculated directly from their encodings.

To achieve this, the concept of embeddings has been introduced. The idea is to represent a word as a vector in a fixed vector space whose size does not depend on the size of the vocabulary. These vector representations are called embeddings, and the similarity between two words can be calculated from the distance between their embedding vectors.

In word embeddings, each word is represented as an N-dimensional vector (commonly N = 300), and the similarity between two words is calculated from the distance between these words in the N-dimensional space. This also solves the high-dimensionality issues seen in One-Hot Encoding, where the dimension of the word encoding grows with the vocabulary size.
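To make the idea concrete, below is a minimal sketch (assuming NumPy, with made-up 4-dimensional vectors rather than real embeddings) of how the similarity between two embedding vectors can be computed using Euclidean distance and cosine similarity:

import numpy as np

# Hypothetical 4-dimensional embeddings for two related words (illustrative values only)
football = np.array([0.9, 0.1, 0.4, 0.3])
soccer = np.array([0.8, 0.2, 0.5, 0.3])

# Euclidean distance: smaller means more similar
euclidean_distance = np.linalg.norm(football - soccer)

# Cosine similarity: closer to 1 means more similar
cosine_similarity = np.dot(football, soccer) / (np.linalg.norm(football) * np.linalg.norm(soccer))

print("Euclidean distance:", euclidean_distance)
print("Cosine similarity:", cosine_similarity)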

To calculate word embeddings, the intuition used is that a word's meaning should be represented by the words that frequently appear near it.

[“You shall know a word by the company it keeps” (J. R. Firth 1957: 11)]

The methodologies for generating word embeddings have evolved as below:

  • SVD based methods
  • Neural Network/ Iteration based methods

In SVD-based embedding methods, we first create a co-occurrence matrix in one of the following two ways: as a Document Term Matrix or as a Window-based Co-Occurrence Matrix.

Then we reduce the dimensionality of this matrix using SVD. After applying SVD, each word in the vocabulary has an embedding in the reduced space. The distance between two words can now be calculated as the distance between the two corresponding embedding vectors.
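As a minimal sketch of this reduction step (assuming NumPy and a small made-up co-occurrence matrix M), a truncated SVD keeps only the top k singular values, and the corresponding columns of U scaled by those singular values serve as the word embeddings:

import numpy as np

# A tiny made-up co-occurrence matrix: each row corresponds to a word
M = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
])

k = 2  # target embedding dimension

# Full SVD: M = U * diag(S) * Vt
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# Keep only the top k singular values/vectors; each row is now a k-dimensional word embedding
embeddings = U[:, :k] * S[:k]

print(embeddings.shape)  # (number of words, k)
print(embeddings)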

[To get more details about SVD, please refer to the Reference links given at the end of the document.]

Let us look at both of the methods mentioned above for calculating the co-occurrence matrix.

Document Term Matrix

In this method, we count the occurrences of each word in each document. Doing this creates a large matrix of size vocabulary * number of documents. After applying SVD, we reduce the dimensionality of this matrix, and each row of the reduced matrix becomes the embedding of the corresponding word.

Example of Document Term Matrix based SVD:

Let us consider that we have the following docs in a corpus:
Doc1: “I love playing both sports, Cricket and Football”
Doc2: “Indians play both sports, Cricket and Football”
Doc3: “Football is more popular sport than Cricket”

The Document Term Matrix for these words across the documents is as follows:

Word Index | Word     | Doc1 Occurrence | Doc2 Occurrence | Doc3 Occurrence
0          | i        | 1               | 0               | 0
1          | love     | 1               | 0               | 0
2          | playing  | 1               | 0               | 0
3          | both     | 1               | 1               | 0
4          | sports   | 1               | 1               | 0
5          | cricket  | 1               | 1               | 1
6          | and      | 1               | 1               | 0
7          | football | 1               | 1               | 1
8          | indians  | 0               | 1               | 0
9          | play     | 0               | 1               | 0
10         | is       | 0               | 0               | 1
11         | more     | 0               | 0               | 1
12         | popular  | 0               | 0               | 1
13         | sport    | 0               | 0               | 1
14         | than     | 0               | 0               | 1
Document Term Matrix

Please observe the following points:

  • Our corpus has 15 unique words. So, in the above example, we have a 15*3 matrix, since there are 15 unique words and 3 documents in the corpus.
  • The value in any cell represents the number of times the word corresponding to the row occurs in the document corresponding to the column.
  • The word “sports” at index 4 occurs once in Doc1 and Doc2 and zero times in Doc3. Similarly, “cricket” at index 5 occurs once in all three documents.

In our example, applying SVD transforms the 15*3 matrix into a 15*2 matrix, where the values in each row become the (2-dimensional) embedding of the corresponding word.

Word Index | Word     | Embedding (2 dimensions)
0          | i        | [ 0.68012243, -0.31358878]
1          | love     | [ 0.68012243, -0.31358878]
2          | playing  | [ 0.68012243, -0.31358878]
3          | both     | [ 1.30218352, -0.54500864]
4          | sports   | [ 1.30218352, -0.54500864]
5          | cricket  | [ 1.69009263,  0.37591857]
6          | and      | [ 1.30218352, -0.54500864]
7          | football | [ 1.69009263,  0.37591857]
8          | indians  | [ 0.62206109, -0.23141986]
9          | play     | [ 0.62206109, -0.23141986]
10         | is       | [ 0.38790911,  0.92092721]
11         | more     | [ 0.38790911,  0.92092721]
12         | popular  | [ 0.38790911,  0.92092721]
13         | sport    | [ 0.38790911,  0.92092721]
14         | than     | [ 0.38790911,  0.92092721]
Embeddings in 2D space
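For readers who want to verify these numbers without the helper module used in the full listing below, here is a minimal self-contained sketch that applies TruncatedSVD directly to the hard-coded Document Term Matrix above (the values should match the table, up to possible sign flips):

import numpy as np
from sklearn.decomposition import TruncatedSVD

words = ["i", "love", "playing", "both", "sports", "cricket", "and", "football",
         "indians", "play", "is", "more", "popular", "sport", "than"]

# 15*3 Document Term Matrix from the table above (rows: words, columns: Doc1..Doc3)
doc_term_matrix = np.array([
    [1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 0],
    [1, 1, 1], [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 1, 0],
    [0, 0, 1], [0, 0, 1], [0, 0, 1], [0, 0, 1], [0, 0, 1],
])

# Reduce to 2 dimensions with the same settings as the full listing below
svd = TruncatedSVD(n_components=2, n_iter=100, random_state=456)
embeddings = svd.fit_transform(doc_term_matrix)

for word, vector in zip(words, embeddings):
    print(f"{word:10s} {vector}")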

In a similar way, the matrix can be prepared for a very large corpus. For a corpus, a matrix of N * D would be created, where N is the number of unique words in the corpus and D is the number of documents. In a typical corpus, N might be in the millions, whereas the number of documents can be in the hundreds of thousands.

In a real-world scenario, the N*D matrix is typically converted to an N*d matrix using SVD, so every word gets an embedding of d dimensions, where d is typically in the range of 100 to 500.

Limitations of Document Term Matrix based approach

A reasonably big document, which might cover a topic in depth, tends to contain many unrelated terms together, which can bias our algorithm. To overcome this problem, let us evaluate a better approach called the “Window-based Co-Occurrence Matrix” in the next section.

The Python code for the above Document Term Matrix example (GitHub code location) is as follows:

'''
Author: Gyan Mittal
Corresponding Document: https://gyan-mittal.com/nlp-ai-ml/nlp-word-embedding-svd-based-methods/#DocTermMatrix
Brief about Document Term Matrix based SVD:
In this, we count the occurrence of each word in each document.
Doing this creates a large matrix of size vocabulary * number of documents.
After applying SVD, we can reduce the dimensionality of the matrix, and each row of the reduced matrix becomes the embedding of every word.
'''

from collections import Counter
import itertools
from util import naive_clean_text
import numpy as np
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt

#Reduce the dimensionality of a matrix
def reduce_to_k_dim(M, k=2):

    svd = TruncatedSVD(n_components = k, n_iter = 100, random_state = 456, tol = 0.0)
    reduce_matrix_x = svd.fit_transform(M)
    #print(reduce_matrix_x)
    return reduce_matrix_x

def naive_doc_term_matrix(corpus):

    split_docs_words= [naive_clean_text(doc).split() for doc in corpus]
    print("split_docs_words", "\n", split_docs_words)
    word_counts = Counter(itertools.chain(*split_docs_words))
    print("word_counts", "\n", word_counts)

    vocab_word_index = {x: i for i, x in enumerate(word_counts)}
    reverse_vocab_word_index = {i: x for i, x in enumerate(word_counts)}
    print("vocabulary\n", vocab_word_index)
    print("reverse_vocabulary\n", reverse_vocab_word_index)

    vocab_size = len(vocab_word_index)
    N = len(corpus)
    matrix_x = np.zeros((vocab_size, N))
    print(matrix_x)

    for i, doc in enumerate(split_docs_words):
        for word in doc:
            matrix_x[vocab_word_index[word]][i] += 1

    print("N:\t", N)
    print("matrix_x:\n", matrix_x)

    matrix_x_dict = {}
    for row in enumerate(matrix_x):
        row_index, word_doc_vector = row
        matrix_x_dict[reverse_vocab_word_index[row_index]] = word_doc_vector

    print("Doc Term Matrix:")

    print('{:10s},{:10s}, {}'.format("word index", "word", "Doc Co Occurrence"))
    for key, value in matrix_x_dict.items():
        print('{:<10d}, {:10s}, {}'.format(vocab_word_index[key], key, value))
    
    return  matrix_x, vocab_word_index, reverse_vocab_word_index

def plot_embeddings(reduce_matrix_x, vocabulary):
    
    for word, i in vocabulary.items():
        x = reduce_matrix_x[i][0]
        y = reduce_matrix_x[i][1]
        #print(word, ":\t", x, ":\t", y)
        plt.scatter(x, y)
        plt.annotate(word, (x, y))
    plt.show()

corpus = ["I love playing both sports, Cricket and Football", "Indians play both sports, Cricket and Football", "Football is more popular sport than Cricket"]

matrix_x, vocabulary, reverse_vocabulary = naive_doc_term_matrix(corpus)

#Reduce the matrix to 2 columns
reduce_matrix_x = reduce_to_k_dim(matrix_x, k=2)

reduce_matrix_x_dict = {}
for row in enumerate(reduce_matrix_x):
    row_index, word_doc_vector = row
    reduce_matrix_x_dict[reverse_vocabulary[row_index]] = word_doc_vector

print ("\n\nReduced Matrix [Embeddings Matrix]:")
print('{:10s}, {:10s}, {}'.format("Word Index", "Word", "Embedding"))
for key, value in reduce_matrix_x_dict.items():
    print('{:<10d}, {:10s}, {}'.format(vocabulary[key], key, value))

plot_embeddings(reduce_matrix_x, vocabulary)

Window-based Co-Occurrence Matrix

In this method, we count how often pairs of words appear together within a predefined window. The idea is that words with similar meanings keep the company of similar words. After applying this to our corpus, we get a matrix of size vocabulary * vocabulary.

We can reduce its size using SVD in the same way as in the previous methodology, and again each row of the reduced matrix gives us a word embedding.

Example of Window-based Co-Occurrence Matrix based SVD:

Let us consider that we have the following docs in a corpus:
Doc1: “I love playing cricket”
Doc2: “you love playing football”
Doc3: “We love playing cricket”
Doc4: “I love all sports”
Doc5: “You love all sports”
Doc6: “We love all sports”

The Window-based Co-Occurrence Matrix of these words is as follows (with window size 1):

Word Index | Word     | i | love | playing | cricket | you | football | we | all | sports
0          | i        | 0 | 2    | 0       | 0       | 0   | 0        | 0  | 0   | 0
1          | love     | 2 | 0    | 3       | 0       | 2   | 0        | 2  | 3   | 0
2          | playing  | 0 | 3    | 0       | 2       | 0   | 1        | 0  | 0   | 0
3          | cricket  | 0 | 0    | 2       | 0       | 0   | 0        | 0  | 0   | 0
4          | you      | 0 | 2    | 0       | 0       | 0   | 0        | 0  | 0   | 0
5          | football | 0 | 0    | 1       | 0       | 0   | 0        | 0  | 0   | 0
6          | we       | 0 | 2    | 0       | 0       | 0   | 0        | 0  | 0   | 0
7          | all      | 0 | 3    | 0       | 0       | 0   | 0        | 0  | 0   | 3
8          | sports   | 0 | 0    | 0       | 0       | 0   | 0        | 0  | 3   | 0
Window-based Co-Occurrence Matrix with window size 1

We have got a matrix of 9*9, since this corpus has 9 unique words. Using SVD, we will reduce it to the following matrix of 9*2.

Word Index | Word     | Embedding (2 dimensions)
0          | i        | [ 1.85E+00,  2.89E-15]
1          | love     | [-9.50E-15,  5.43E+00]
2          | playing  | [ 3.24E+00,  5.13E-15]
3          | cricket  | [-2.00E-15,  1.10E+00]
4          | you      | [ 1.85E+00,  2.89E-15]
5          | football | [-9.99E-16,  5.49E-01]
6          | we       | [ 1.85E+00,  2.89E-15]
7          | all      | [ 3.74E+00,  6.01E-15]
8          | sports   | [-4.66E-15,  1.91E+00]
Embeddings in 2D space
[Figure: Word Embedding Plot]

In a similar way, the matrix can be prepared for a very large corpus. For a corpus, a matrix of N*N would be created, where N is the number of unique words in the corpus. In a typical corpus, N might be in the millions.

In a real-world scenario, the N*N matrix is typically converted to an N*d matrix using SVD, so every word gets an embedding of d dimensions, where d is typically in the range of 100 to 500.

The Python code for the above Window-based Co-Occurrence Matrix example (GitHub code location) is as follows:

'''
Author: Gyan Mittal
Corresponding Document: https://gyan-mittal.com/nlp-ai-ml/nlp-word-embedding-svd-based-methods/#WindowBasedCoOccurrenceMatrix
Brief about Window-based Co-Occurrence Matrix based SVD:
In this, we count how often pairs of words appear together within a predefined window.
The idea is that words with similar meanings keep the company of similar words. After applying this to our corpus, we get a matrix of size vocabulary * vocabulary.
After applying SVD, we can reduce the dimensionality of the matrix, and each row of the reduced matrix becomes the embedding of every word.
About Code: This code demonstrates the concept of Window-based Co-Occurrence Matrix based SVD with simple example corpus
'''
from collections import Counter
import itertools
from util import naive_clean_text
import numpy as np
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt

def reduce_to_k_dim(M, k=2):
    
    svd = TruncatedSVD(n_components = k, n_iter = 100, random_state = 456, tol = 0.0)
    reduce_matrix_x = svd.fit_transform(M)
    #print(reduce_matrix_x)
    return reduce_matrix_x

def naive_window_co_occurrence_matrix(corpus, window_size=4):

    split_docs_words= [naive_clean_text(doc).split() for doc in corpus]
    #print("split_docs_words", "\n", split_docs_words)
    word_counts = Counter(itertools.chain(*split_docs_words))
    #print("word_counts", "\n", word_counts)
    #words = [x for i, x in enumerate(word_counts)]
    vocab_word_index = {x: i for i, x in enumerate(word_counts)}
    reverse_vocab_word_index = {i: x for i, x in enumerate(word_counts)}
    #print("vocab_word_index\n", vocab_word_index)
    #print("reverse_vocab_word_index\n", reverse_vocab_word_index)

    vocab_size = len(vocab_word_index)
    #print("vocab_size:\t", vocab_size)
    matrix_x = np.zeros((vocab_size, vocab_size))
    for line in split_docs_words:
        for i in range(len(line)):
            #print(i, line[i])
            target = line[i]
            target_index = vocab_word_index[target]
            left = max(i - window_size, 0)
            right = min(i + window_size, len(line) - 1)
            for j in range(left, right + 1):
                window_word = line[j]
                if(i != j and target_index != vocab_word_index[window_word]):
                    matrix_x[target_index][vocab_word_index[window_word]] += 1

    matrix_x_dict = {}
    for row in enumerate(matrix_x):
        row_index, word_doc_vector = row
        matrix_x_dict[reverse_vocab_word_index[row_index]] = word_doc_vector

    print("Window Co Occurrence Matrix:")

    print('{:10s}, {:10s}, {}'.format("word index", "word", "Co Occurrence value"))
    for key, value in matrix_x_dict.items():
        print('{:<10d}, {:10s}, {}'.format(vocab_word_index[key], key, value))
    
    return  matrix_x, vocab_word_index, reverse_vocab_word_index

def plot_embeddings(reduce_matrix_x, vocabulary):
    
    for word, i in vocabulary.items():
        x = reduce_matrix_x[i][0]
        y = reduce_matrix_x[i][1]
        #print(word, ":\t", x, ":\t", y)
        plt.scatter(x, y)
        plt.annotate(word, (x, y))
    plt.show()

corpus = ["I love playing cricket", "you love playing football", "We love playing cricket", "I love all sports", "You love all sports", "We love all sports"]
matrix_x, vocabulary, reverse_vocabulary = naive_window_co_occurrence_matrix(corpus, window_size=1)

#Reduce the matrix to 2 columns
reduce_matrix_x = reduce_to_k_dim(matrix_x, k=2)

reduce_matrix_x_dict = {}
for row in enumerate(reduce_matrix_x):
    row_index, word_doc_vector = row
    reduce_matrix_x_dict[reverse_vocabulary[row_index]] = word_doc_vector

print ("\n\nReduced Matrix [Embeddings Matrix]:")
print('{:10s}, {:10s}, {}'.format("Word Index", "Word", "Embedding"))
for key, value in reduce_matrix_x_dict.items():
    print('{:<10d}, {:10s}, {}'.format(vocabulary[key], key, value))

plot_embeddings(reduce_matrix_x, vocabulary)

This approach requires a large amount of memory to store the resulting matrix. For example, if our corpus has 1 million unique words, we would have to create a Window-based Co-Occurrence Matrix of 1 million by 1 million. We can imagine the matrix size requirement in the case of a 1-trillion-word vocabulary.

Limitations of SVD:

  • Generally, the matrix is of very high dimension, which results in a high training cost of O(mn^2) for an m * n co-occurrence matrix (where m is the size of the dictionary). So, for a corpus with a large number of words, it becomes very difficult to train.
  • The matrix very quickly becomes imbalanced due to high-frequency words.
  • The matrix dimensions change as soon as a new word is introduced.

Workarounds for some of the above problems:

  • Ignore common words like “is”, “the”, etc.
  • Weight each co-occurrence of two words based on the distance between them, rather than using just the raw count (see the sketch below).
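The following is a minimal sketch of both workarounds (assuming a small hypothetical stop-word set and a simple 1/distance weighting scheme; these are illustrative choices, not a standard algorithm):

import itertools
import numpy as np

# Hypothetical stop-word list; in practice a much fuller list would be used
STOP_WORDS = {"is", "the", "and", "a", "an", "than", "more"}

def weighted_co_occurrence_matrix(corpus, window_size=4):
    # Drop stop words, then weight each co-occurrence by 1/distance instead of a raw count
    docs = [[w for w in doc.lower().split() if w not in STOP_WORDS] for doc in corpus]
    vocab = {w: i for i, w in enumerate(dict.fromkeys(itertools.chain(*docs)))}
    matrix = np.zeros((len(vocab), len(vocab)))
    for words in docs:
        for i, target in enumerate(words):
            left, right = max(i - window_size, 0), min(i + window_size, len(words) - 1)
            for j in range(left, right + 1):
                if i != j:
                    # Nearer words contribute more than distant ones
                    matrix[vocab[target]][vocab[words[j]]] += 1.0 / abs(i - j)
    return matrix, vocab

corpus = ["Football is more popular than Cricket", "I love playing both Cricket and Football"]
matrix, vocab = weighted_co_occurrence_matrix(corpus, window_size=2)
print(vocab)
print(matrix)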

As we will see, iteration-based models like word2vec solve these problems in a much more elegant way.

References

For more details about SVD, please refer to the following links:

  • http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes01-wordvecs1.pdf
  • http://web.stanford.edu/class/cs224n/assignments/a1_preview/exploring_word_vectors.html

About Gyan Mittal

Gyan is Engineering Head at Times Internet Limited. He is a technology leader and AI-ML expert with 22+ years of experience in hands-on architecture, design, and development of a broad range of software technologies. Gyan likes writing blogs on technology, especially Artificial Intelligence, Machine Learning, NLP, and Data Science. In his blogs, Gyan tries to explain complex concepts in simple language, mostly supported by very simple example code (with the corresponding GitHub link) that is easy to understand. Gyan has done his B.Tech. from IIT Kanpur, M.Tech. from IIT Madras, and the Artificial Intelligence Professional Program from Stanford University (2019-2021).