NLP – Word2Vec Word Embeddings – Neural Network / Iteration Based Embedding Model

Various word encoding methods, such as Integer/Label Encoding and One-Hot Encoding, have many limitations. The main limitation is that these encodings do not capture any semantic relationship between words. Using these encodings, we have no way to tell that the words “Soccer” and “Football” are closely related.

Also, in One-Hot Encoding, the memory requirement grows as the vocabulary size increases, and so does the feature space. A larger feature space leads to the curse of dimensionality, whereas Integer/Label Encoding introduces an undesirable ordinal bias.
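To see this limitation concretely, here is a minimal sketch (the toy vocabulary below is made up purely for illustration): any two distinct one-hot vectors are orthogonal, so their similarity is always zero.

import numpy as np

# Toy vocabulary, for illustration only.
vocab = {"soccer": 0, "football": 1, "cricket": 2}

def one_hot(word, vocab):
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1
    return v

soccer, football = one_hot("soccer", vocab), one_hot("football", vocab)
# Cosine similarity between any two distinct one-hot vectors is always 0,
# so the encoding carries no notion of "soccer" being similar to "football".
cos_sim = soccer @ football / (np.linalg.norm(soccer) * np.linalg.norm(football))
print(cos_sim)  # 0.0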

We could use the WordNet list of synonyms to measure similarity, but that approach fails due to the following issues:

  1. Often, two words have the same meaning only in a particular context, and a synonym list does not capture that information.
  2. We can define synonyms for only a limited set of word pairs; maintaining such a list for all pairs of words is not feasible. [Suppose we have 1 million words in the dictionary; we would then have to maintain 1 million × 1 million combinations. We can easily imagine what would happen if our corpus had 1 trillion words.]
  3. As the vocabulary evolves, we need to regularly update the synonym list.
  4. It would be much more efficient and useful if we could calculate the similarity between any two words directly from their encodings.

So we need a way to encode words in a low-dimensional space (with a predefined number of dimensions, independent of the vocabulary size) such that the similarity between two words can be calculated from their encodings.

To achieve this, the concept of embeddings was introduced. The idea is to represent a word in a fixed vector space whose size does not depend on the size of the vocabulary. These vector representations are called embeddings, and the similarity between two words can be calculated from the distance between their vectors.

In word embeddings, each word is represented as an N-dimensional vector (commonly N = 300), and the similarity between two words is calculated from the distance between them in this N-dimensional space. This also solves the high-dimensionality issue seen in One-Hot Encoding, where the dimension of the word encoding grows with the vocabulary size.
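As a contrast to the one-hot case, here is a minimal sketch with dense vectors (the 4-dimensional embeddings below are invented purely for illustration, not learned from data): cosine similarity over such vectors gives a graded notion of relatedness that one-hot vectors cannot.

import numpy as np

# Hypothetical 4-dimensional embeddings, chosen only to illustrate the idea.
embeddings = {
    "soccer":   np.array([0.9, 0.1, 0.4, 0.0]),
    "football": np.array([0.8, 0.2, 0.5, 0.1]),
    "banana":   np.array([0.0, 0.9, 0.1, 0.8]),
}

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["soccer"], embeddings["football"]))  # close to 1
print(cosine_similarity(embeddings["soccer"], embeddings["banana"]))    # much smaller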

To calculate word embeddings, we use the intuition that a word’s meaning should be represented by the words that frequently appear near it.

[“You shall know a word by the company it keeps” (J. R. Firth 1957: 11)]

The methods for generating word embeddings have also evolved, broadly into the following:

  • SVD based methods
  • Neural Network/ Iteration based methods

Initially, SVD-based embeddings were introduced. In this method, we first build a word co-occurrence matrix, based either on a Document-Term Matrix or on a window-based co-occurrence matrix. We then reduce the dimensionality of this matrix using SVD, which gives the embedding of each word in the reduced space.

SVD-based embeddings also suffer from several limitations: the word co-occurrence matrix quickly becomes imbalanced as the vocabulary grows, because of high-frequency words. It also has a very high training cost of O(mn²), where m is the size of the vocabulary/dictionary and n is the embedding size.
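For reference, here is a minimal sketch of the SVD-based approach described above (the window size and the number of kept dimensions are arbitrary choices for this illustration):

import numpy as np

sentences = [["i", "love", "playing", "football"],
             ["i", "love", "playing", "cricket"]]
vocab = {w: i for i, w in enumerate(sorted({w for s in sentences for w in s}))}
V, window = len(vocab), 1

# Window-based co-occurrence matrix.
co = np.zeros((V, V))
for sent in sentences:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                co[vocab[w], vocab[sent[j]]] += 1

# Truncated SVD: keep the first k singular directions as embeddings.
k = 2
U, S, Vt = np.linalg.svd(co)
embeddings = U[:, :k] * S[:k]   # one k-dimensional vector per word
print(embeddings.shape)         # (V, k)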

Instead of computing word embeddings via SVD, the following Neural Network / iteration based models have become the popular way to calculate word embeddings:

  • Word2Vec
  • GloVe
  • fastText

In this blog, we will focus on Word2Vec, which was developed before the other two models, fastText and GloVe.

Word2Vec was developed by Google in 2013, GloVe by researchers from Stanford University in 2014, and fastText by Facebook in 2015.

Word2Vec Embeddings

A team of Google researchers led by Tomas Mikolov developed, patented, and published Word2Vec in two publications in 2013 (please refer to the References at the end of the document). Word2Vec is a computationally efficient predictive model for learning word embeddings from raw text.

Word2Vec is a shallow, two-layer neural network that is trained to reconstruct the linguistic contexts of words. The idea is to design a model whose parameters are the word vectors, and then train the model on a certain objective. At every iteration we run the model, evaluate the errors, and update the parameters according to an update rule.

It takes a large corpus of text as input and produces a vector space, typically with a few hundred dimensions, in which each unique word in the corpus is assigned a corresponding vector.

Word vectors are positioned in the vector space in such a way that words with similar contexts in the corpus are close together in the space.

It comes in two model architectures, Continuous Bag-of-Words (CBOW) and Skip-Gram. We will concentrate on the Skip-Gram architecture.

Word2Vec Skip-Gram Architecture:

In the Skip-Gram architecture, the objective is to train the model in such a way that, given a center word, it can correctly predict the context words.

Meaning of symbols:

x: input layer, which is one-hot encoding of center word
W: weight between the input layer and hidden layer
W': weight between the hidden layer and output layer
V: Vocabulary Size
N: Embedding Size/ Size of hidden layer
h: the hidden layer
u: Output layer before activation
\hat{y}: Output layer after activation, predicted context words
y: Actual context words for center word x
\alpha: Learning rate
C: Number of context words for a given center word
T: Number of training examples

Forward Propagation:

    \[ h = W^T x \]

    \[ u = W'^T h \]

    \[ \hat{y} = softmax(u) \]

Loss for a single context word c, given a center word:

    \[ J = -\sum_{w \in V} y_w \log(\hat{y}_w) = -\log(\hat{y}_c) \]

Loss for all C context words of a given center word:

    \[ J = -\sum_{c=1}^{C} \log(\hat{y}_{c}) \]

With T training examples, each with C context words, the objective is to minimize the following loss function:

    \[ J = -\frac{1}{T} \sum_{t=1}^{T} \sum_{c=1}^{C} \log(\hat{y}_{t,c}) \]
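Before moving to backpropagation, here is a minimal NumPy sketch of the forward pass and the loss for a single (center word, context word) pair, using the symbols defined above (the full training code appears at the end of this blog):

import numpy as np

V, N = 6, 2                           # vocabulary size, embedding size
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (V, N))        # weights between input and hidden layer
W_prime = rng.normal(0, 0.1, (N, V))  # weights between hidden and output layer

x = np.zeros(V)
x[0] = 1                              # one-hot encoding of the center word
h = W.T @ x                           # hidden layer: simply the row of W for the center word
u = W_prime.T @ h                     # scores over the whole vocabulary
y_hat = np.exp(u) / np.exp(u).sum()   # softmax: predicted context-word distribution

context_idx = 1                       # index of one true context word
loss = -np.log(y_hat[context_idx])    # cross-entropy loss for this context word
print(loss)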

Backward Propagation:

    \[ \frac{\partial J}{\partial w'_{i,j}} = \sum_{c=1}^{C} (\hat{y}_{c,j} - y_{c,j}) \cdot h_i \]

    \[ \frac{\partial J}{\partial w_{k,i}} = \sum_{j=1}^{V} \sum_{c=1}^{C} (\hat{y}_{c,j} - y_{c,j}) \cdot w'_{i,j} \cdot x_k \]

    \[ \text{where } y_{c,j} = 1 \text{ if the } j^{th} \text{ node of the } c^{th} \text{ true context word is } 1, \text{ else } y_{c,j} = 0 \]

    \[ w_{i,j}^{'(new)} = w_{i,j}^{'(old)} - \alpha \cdot \sum_{c=1}^{C} (\hat{y}_{c,j} - y_{c,j}) \cdot h_i \]

    \[ w_{k,i}^{(new)} = w_{k,i}^{(old)} - \alpha \cdot \sum_{j=1}^{V} \sum_{c=1}^{C} (\hat{y}_{c,j} - y_{c,j}) \cdot w'_{i,j} \cdot x_k \]
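A matching sketch of a single gradient step for one (center word, context word) pair follows; the forward pass is repeated so the snippet runs on its own, and it mirrors what naive_softmax_loss_and_gradient does in the full code at the end of this blog:

import numpy as np

V, N, alpha = 6, 2, 0.001             # vocabulary size, embedding size, learning rate
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (V, N))        # weights between input and hidden layer
W_prime = rng.normal(0, 0.1, (N, V))  # weights between hidden and output layer

center_idx, context_idx = 0, 1        # indices of the center word and one context word
x = np.zeros(V)
x[center_idx] = 1
h = W.T @ x                           # forward pass
u = W_prime.T @ h
y_hat = np.exp(u) / np.exp(u).sum()

e = y_hat.copy()                      # prediction error (y_hat - y) for this context word
e[context_idx] -= 1

dW_prime = np.outer(h, e)             # gradient w.r.t. W', shape (N, V)
dW_row = W_prime @ e                  # gradient w.r.t. the center word's row of W, shape (N,)

W_prime -= alpha * dW_prime           # update rule with learning rate alpha
W[center_idx] -= alpha * dW_row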

After training the model, every row of W will correspond to the word vector embedding of the word belonging to that row. The intuition is that similar words end up with similar embeddings, since they keep the company of similar words.

To understand the skip-gram model, let us go through the following example.

Example of Word2Vec using Skip-Gram

In this example, the basic Word2Vec Skip-Gram model is explained without introducing any complex concepts.

Let us consider that we have the following sentences, from different documents in our corpus:

Sentence1: I love playing Football
Sentence2: I love playing Cricket
Sentence3: I love playing sports

We are considering the embedding size N to be 2 in our example.
The vocabulary size V for the above example sentences is 6.

With a window size of 1, the training data for our example sentences will look like the following:

Sentence | Center Word (X_center_word_train) | Context Words (y_context_words_train)
Sentence1: I love playing Football | I | love
Sentence1: I love playing Football | love | I, Playing
Sentence1: I love playing Football | playing | love, Football
Sentence1: I love playing Football | Football | playing
Sentence2: I love playing Cricket | I | love
Sentence2: I love playing Cricket | love | I, playing
Sentence2: I love playing Cricket | playing | love, Cricket
Sentence2: I love playing Cricket | Cricket | playing
Sentence3: I love playing sports | I | love
Sentence3: I love playing sports | love | I, playing
Sentence3: I love playing sports | playing | love, sports
Sentence3: I love playing sports | sports | playing

In this algorithm, we use a neural network with two layers. The objective of this neural network is to predict the context words, given the center word, within a given window.

The first layer’s weight matrix has size Vocab_size × Embedding_size, whereas the second layer’s weight matrix has size Embedding_size × Vocab_size.

In the final layer, the softmax activation function is used to predict the context word.

Once training is complete, each row in the first layer’s weight matrix becomes the embedding of the word at the corresponding index in the vocabulary.

This corpus has 6 unique words (the vocabulary/dictionary size).
Vocabulary/dictionary (Label/Integer Encoding of words): {‘i’: 0, ‘love’: 1, ‘playing’: 2, ‘football’: 3, ‘cricket’: 4, ‘sports’: 5}
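A small sketch of how this vocabulary and the one-hot training pairs can be built (it mirrors the prepare_training_data function in the full code at the end of the blog):

from collections import Counter
import itertools

sentences = [["i", "love", "playing", "football"],
             ["i", "love", "playing", "cricket"],
             ["i", "love", "playing", "sports"]]

word_counts = Counter(itertools.chain(*sentences))
vocab_word_index = {w: i for i, w in enumerate(word_counts)}
print(vocab_word_index)
# {'i': 0, 'love': 1, 'playing': 2, 'football': 3, 'cricket': 4, 'sports': 5}

window_size, V = 1, len(vocab_word_index)
X_train, y_train = [], []
for sentence in sentences:
    for i, word in enumerate(sentence):
        center = [0] * V                      # one-hot center word
        center[vocab_word_index[word]] = 1
        context = [0] * V                     # multi-hot context words within the window
        for j in range(max(0, i - window_size), min(len(sentence), i + window_size + 1)):
            if j != i:
                context[vocab_word_index[sentence[j]]] = 1
        X_train.append(center)
        y_train.append(context)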

The training data of the above table will now become the following:

X_train: [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0], [1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 1, 0], [1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 1]]


y_train: [[0, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0], [0, 1, 0, 1, 0, 0], [0, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0], [0, 1, 0, 0, 1, 0], [0, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 1], [0, 0, 1, 0, 0, 0]]

The training data in table format, along with the corresponding words, can be represented as follows:

Sentence | Center Word (X_train) | Context Words (y_train)
Sentence1: I love playing Football | I [1, 0, 0, 0, 0, 0] | love [0, 1, 0, 0, 0, 0]
Sentence1: I love playing Football | love [0, 1, 0, 0, 0, 0] | I, Playing [1, 0, 1, 0, 0, 0]
Sentence1: I love playing Football | playing [0, 0, 1, 0, 0, 0] | love, Football [0, 1, 0, 1, 0, 0]
Sentence1: I love playing Football | Football [0, 0, 0, 1, 0, 0] | playing [0, 0, 1, 0, 0, 0]
Sentence2: I love playing Cricket | I [1, 0, 0, 0, 0, 0] | love [0, 1, 0, 0, 0, 0]
Sentence2: I love playing Cricket | love [0, 1, 0, 0, 0, 0] | I, playing [1, 0, 1, 0, 0, 0]
Sentence2: I love playing Cricket | playing [0, 0, 1, 0, 0, 0] | love, Cricket [0, 1, 0, 0, 1, 0]
Sentence2: I love playing Cricket | Cricket [0, 0, 0, 0, 1, 0] | playing [0, 0, 1, 0, 0, 0]
Sentence3: I love playing sports | I [1, 0, 0, 0, 0, 0] | love [0, 1, 0, 0, 0, 0]
Sentence3: I love playing sports | love [0, 1, 0, 0, 0, 0] | I, playing [1, 0, 1, 0, 0, 0]
Sentence3: I love playing sports | playing [0, 0, 1, 0, 0, 0] | love, sports [0, 1, 0, 0, 0, 1]
Sentence3: I love playing sports | sports [0, 0, 0, 0, 0, 1] | playing [0, 0, 1, 0, 0, 0]

At the beginning, we have the following weights initialized:

W:

[[ 0.17640523 0.04001572]
[ 0.0978738 0.22408932]
[ 0.1867558 -0.09772779]
[ 0.09500884 -0.01513572]
[-0.01032189 0.04105985]
[ 0.01440436 0.14542735]]

W':

[[ 0.07610377 0.0121675 0.04438632 0.03336743 0.14940791 -0.02051583]
[ 0.03130677 -0.08540957 -0.25529898 0.06536186 0.08644362 -0.0742165 ]]

In our training example, the training data for the first iteration is as follows:

Sentence | Center Word (X_train) | Context Words (y_train)
Sentence1: I love playing Football | I [1, 0, 0, 0, 0, 0] | love [0, 1, 0, 0, 0, 0]
X: [1, 0, 0, 0, 0, 0]
y: [0, 1, 0, 0, 0, 0]

Concept to notice: h corresponds to the row of W at the index of the center word in the training example. That is, every training example picks out one row of the weight matrix W. So after training, the rows of W corresponding to similar words will have similar values, and the distance between two words can be calculated as the distance between their corresponding rows in W. Once training is complete, each row in W becomes the embedding of the word belonging to that index.
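A tiny check of this idea using the initial weights W printed above:

import numpy as np

W = np.array([[ 0.17640523,  0.04001572],
              [ 0.0978738 ,  0.22408932],
              [ 0.1867558 , -0.09772779],
              [ 0.09500884, -0.01513572],
              [-0.01032189,  0.04105985],
              [ 0.01440436,  0.14542735]])

x = np.array([1, 0, 0, 0, 0, 0])   # one-hot vector for the center word "i" (index 0)
h = W.T @ x
print(h)      # [0.17640523 0.04001572]
print(W[0])   # identical: h is simply row 0 of W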

Similarly, the training data for the second iteration is as follows:

Sentence | Center Word (X_train) | Context Words (y_train)
Sentence1: I love playing Football | love [0, 1, 0, 0, 0, 0] | I, Playing [1, 0, 1, 0, 0, 0]
X: [0, 1, 0, 0, 0, 0]
y: [1, 0, 1, 0, 0, 0]

Notice that we now have two context words.

Similarly, we train our model on the remaining training examples; this completes one epoch. In our training example, we repeated this process for 1000 epochs. A summary of the loss is as follows:

epoch 0 loss = 2.7026976169578862 learning_rate_alpha: 0.001
epoch 100 loss = 2.68797227145197 learning_rate_alpha: 0.001
epoch 200 loss = 2.6731992825801623 learning_rate_alpha: 0.001
epoch 300 loss = 2.649735624190748 learning_rate_alpha: 0.001
epoch 400 loss = 2.603958412204618 learning_rate_alpha: 0.001
epoch 500 loss = 2.5125432210561343 learning_rate_alpha: 0.001
epoch 600 loss = 2.346991127035263 learning_rate_alpha: 0.001
epoch 700 loss = 2.109007118561944 learning_rate_alpha: 0.001
epoch 800 loss = 1.8655582377416664 learning_rate_alpha: 0.001
epoch 900 loss = 1.6800931978078923 learning_rate_alpha: 0.001
epoch 999 loss = 1.5545440269879862 learning_rate_alpha: 0.001

Updated value of weight W after 1000 epochs of training:

[[ 1.149124 -0.86998417]
[-0.04047704 1.48333981]
[ 0.55466348 -0.8665634 ]
[ 0.02787687 0.25416947]
[-0.06727489 0.29595964]
[-0.03852687 0.38587039]]

After training the model, every row of W corresponds to the word vector embedding of the word belonging to that row. So our embeddings for each word after 1000 epochs look as follows:

Word Index | Word | Embedding
0 | i | [1.149123999, -0.8699841668]
1 | love | [-0.04047703988, 1.483339811]
2 | playing | [0.5546634793, -0.8665633952]
3 | football | [0.02787686912, 0.2541694707]
4 | cricket | [-0.06727488884, 0.2959596391]
5 | sports | [-0.03852687391, 0.3858703899]
The embeddings in 2-D space after 1000 epochs are as follows:

[Figure: word embeddings in 2-D space after 1000 epochs]

You can notice that, since sports, cricket, and football are used in the same context in our example, their embeddings end up close to each other compared to the other words once training is complete.
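This can be checked numerically from the learned embeddings in the table above, for example with cosine similarity (a quick sketch):

import numpy as np

emb = {
    "i":        np.array([ 1.149124  , -0.86998417]),
    "love":     np.array([-0.04047704,  1.48333981]),
    "playing":  np.array([ 0.55466348, -0.8665634 ]),
    "football": np.array([ 0.02787687,  0.25416947]),
    "cricket":  np.array([-0.06727489,  0.29595964]),
    "sports":   np.array([-0.03852687,  0.38587039]),
}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# football, cricket and sports point in nearly the same direction...
print(cos(emb["football"], emb["cricket"]))   # close to 1
print(cos(emb["football"], emb["sports"]))    # close to 1
# ...while football and playing do not.
print(cos(emb["football"], emb["playing"]))   # much smaller (negative)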

Word embeddings in 2-D space and the training loss after each epoch can be represented as follows:

[Animation: word embeddings in 2-D space and the loss curve after each epoch]

Observe in the above how the embeddings of each word converge over the epochs while the loss keeps decreasing.

Limitations

In this blog, we have covered a naive implementation of the Word2Vec Skip-Gram model. Its training time and memory requirements increase quickly as the vocabulary size grows.

To overcome these limitations, the Word2Vec papers suggest the following improvements:

  1. Using Hierarchical Softmax instead of the full softmax
  2. Using negative sampling

Also, the papers suggest subsampling of frequent words, so that very frequent words like “in”, “the”, and “a”, which usually provide less information than other words, can be removed from the training data, as sketched below.
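The second Word2Vec paper defines the probability of discarding a word w as 1 - sqrt(t / f(w)), where f(w) is the word’s frequency in the corpus and t is a small threshold (around 1e-5). A minimal sketch of this idea (the corpus frequencies below are hypothetical):

import numpy as np

# Hypothetical corpus frequencies (fraction of all tokens) -- for illustration only.
freq = {"the": 0.05, "a": 0.03, "football": 0.00001}
t = 1e-5  # subsampling threshold suggested in the paper

def discard_probability(word):
    # Probability of dropping the word from the training data.
    return max(0.0, 1 - np.sqrt(t / freq[word]))

for w in freq:
    print(w, round(discard_probability(w), 4))
# Very frequent words ("the", "a") are discarded with high probability,
# while rare words ("football") are kept.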

Further Work

We have just gone through a naive implementation of the Word2Vec Skip-Gram model in this blog. For now, we are deferring the detailed discussion of the improvements suggested for the naive Skip-Gram implementation to a later post. Similarly, we will try to cover the Word2Vec CBOW model and discuss which model is better suited to which situation.

We will also try to cover models like fastText and GloVe in detail, including their differences and improvements in comparison to other models.

Python code of the above Word2Vec Skip-Gram example (GitHub code location) is as follows:

'''
Author: Gyan Mittal
Corresponding Document: https://gyan-mittal.com/nlp-ai-ml/nlp-word2vec-skipgram-neural-network-iteration-based-methods-word-embeddings
Brief about word2vec: A team of Google researchers led by Tomas Mikolov developed, patented, and published Word2Vec in two publications in 2013.
For learning word embeddings from raw text, Word2Vec is a computationally efficient predictive model.
Word2Vec is a Neural Network / iteration based methodology for calculating word embeddings.
The Word2Vec methodology has two model architectures: the Continuous Bag-of-Words (CBOW) model and the Skip-Gram model.
About Code: This code demonstrates the basic concept of calculating word embeddings
using the Word2Vec Skip-Gram model.
'''

from collections import Counter
import itertools
import numpy as np
import re
import matplotlib.pyplot as plt
import os
import imageio

# Clean the text after converting it to lower case
def naive_clean_text(text):
    text = text.strip().lower() #Convert to lower case
    text = re.sub(r"[^A-Za-z0-9]", " ", text) #replace all the characters with space except mentioned here
    return text

def naive_softmax(x):
    exp_x = np.exp(x)
    return exp_x / exp_x.sum()

def prepare_training_data(corpus_sentences):
    window_size = 1
    split_corpus_sentences_words = [naive_clean_text(sentence).split() for sentence in corpus_sentences]

    X_center_word_train = []
    y_context_words_train = []

    word_counts = Counter(itertools.chain(*split_corpus_sentences_words))

    vocab_word_index = {x: i for i, x in enumerate(word_counts)}
    reverse_vocab_word_index = {value: key for key, value in vocab_word_index.items()}
    vocab_size = len(vocab_word_index)

    for sentence in split_corpus_sentences_words:
        for i in range(len(sentence)):
            center_word = [0 for x in range(vocab_size)]
            center_word[vocab_word_index[sentence[i]]] = 1
            context = [0 for x in range(vocab_size)]

            for j in range(i - window_size, i + window_size+1):
                #print(i, "\t", j)
                if i != j and j >= 0 and j < len(sentence):
                    context[vocab_word_index[sentence[j]]] = 1
            X_center_word_train.append(center_word)
            y_context_words_train.append(context)

    return X_center_word_train, y_context_words_train, vocab_word_index


def naive_softmax_loss_and_gradient(h, context_word_idx, W1):

    u = np.dot(W1.T, h)
    yhat = naive_softmax(u)


    loss_context_word = -np.log(yhat[context_word_idx])
    #prediction error
    e = yhat.copy()
    e[context_word_idx] -= 1

    dW_context_word = np.dot(W1, e)
    dW1_context_word = np.dot(h[:, np.newaxis], e[np.newaxis, :])

    return loss_context_word, dW_context_word, dW1_context_word

def plot_embeddings_and_loss(W, vocab_word_index, loss_log, epoch, max_loss, img_files=[]):
    fig, (ax1, ax2) = plt.subplots(1, 2, sharex=False, figsize=(10, 5))

    ax1.set_title('Word Embeddings in 2-d space for the given example')
    plt.setp(ax1, xlabel='Embedding dimension - 1', ylabel='Embedding dimension - 2')
    #ax1.set_xlim([min(W[:, 0]) - 1, max(W[:, 0]) + 1])
    #ax1.set_ylim([min(W[:, 1]) - 1, max(W[:, 1]) + 1])
    #ax1.set_xlim([-3, 3.5])
    #ax1.set_ylim([-3.5, 3])

    for word, i in vocab_word_index.items():
        x_coord = W[i][0]
        y_coord = W[i][1]
        ax1.plot(x_coord, y_coord, "cD", markersize=5)
        ax1.text(x_coord, y_coord, word, fontsize=10)
    ax2.set_title("Loss graph")
    plt.setp(ax2, xlabel='#Epochs (Log scale)', ylabel='Loss')
    ax2.set_xlim([1 , epoch * 1.1])
    ax2.set_xscale('log')
    ax2.set_ylim([0, max_loss * 1.1])

    if(len(loss_log) > 0):
        ax2.plot(1, max(loss_log), "bD")
        ax2.plot(loss_log, "b")
        ax2.set_title("Loss is " + r"\bf{" + str("{:.6f}".format(loss_log[-1])) + "}" + " after " + r"\bf{" + str(f'{len(loss_log) - 1:,}') + "}" + " epochs")

    directory = "images"
    if not os.path.exists(directory):
        os.makedirs(directory)
    filename = f'images/{len(loss_log)}.png'
    # Append the same frame several times so it stays on screen longer in the GIF
    for i in range(13):
        img_files.append(filename)
    # save frame
    plt.savefig(filename)
    plt.close()
    return img_files

def create_gif(input_image_filenames, output_gif_name):
    # build gif
    with imageio.get_writer(output_gif_name, mode='I') as writer:
        for image_file_name in input_image_filenames:
            image = imageio.imread(image_file_name)
            writer.append_data(image)
    # Remove image files
    for image_file_name in set(input_image_filenames):
        os.remove(image_file_name)

def train(x_train, y_train, vocab_word_index, embedding_dim=2, epochs=1000, learning_rate_alpha=1e-03):
    vocab_size = len(vocab_word_index)

    loss_log = []
    saved_epoch_no = 0
    np.random.seed(0)
    W = np.random.normal(0, .1, (vocab_size, embedding_dim))
    W1 = np.random.normal(0, .1, (embedding_dim, vocab_size))

    for epoch_number in range(0, epochs):
        loss = 0
        for i in range(len(x_train)):
            dW = np.zeros(W.shape)
            dW1 = np.zeros(W1.shape)
            h = np.dot(W.T, x_train[i])  # hidden layer: the row of W for the current center word
            for context_word_idx in range(vocab_size):
                if (y_train[i][context_word_idx]):
                    loss_context_word, dW_context_word, dW1_context_word = naive_softmax_loss_and_gradient(h, context_word_idx, W1)
                    loss += loss_context_word
                    current_center_word_idx = -1
                    for c_i in range(len(x_train[i])):
                        if x_train[i][c_i] == 1:
                            current_center_word_idx = c_i
                            break
                    dW[current_center_word_idx] += dW_context_word
                    dW1 += dW1_context_word

            W -= learning_rate_alpha * dW
            W1 -= learning_rate_alpha * dW1

        loss /= len(x_train)
        if (epoch_number == 0):
            image_files = plot_embeddings_and_loss(W, vocab_word_index, loss_log, epochs, loss)
            loss_log.append(loss)
        loss_log.append(loss)
        if(epoch_number%(epochs/10) == 0 or epoch_number == (epochs - 1) or epoch_number == 0):
            print("epoch ", epoch_number, " loss = ", loss, " learning_rate_alpha:\t", learning_rate_alpha)

        if ((epoch_number == 1) or np.ceil(np.log10(epoch_number + 2)) > saved_epoch_no or (epoch_number + 1) == epochs):
            image_files = plot_embeddings_and_loss(W, vocab_word_index, loss_log, epochs, max(loss_log), image_files)
            saved_epoch_no = np.ceil(np.log10(epoch_number + 2))
    return W, W1, image_files

def plot_embeddings(W, vocab_word_index):
    for word, i in vocab_word_index.items():
        x_coord = W[i][0]
        y_coord = W[i][1]
        plt.scatter(x_coord, y_coord)
        plt.annotate(word, (x_coord, y_coord))
        print(word, ":\t[", x_coord, ",", y_coord, "]")
    plt.show()

embedding_dim = 2
epochs = 2000
learning_rate_alpha = 1e-03
corpus_sentences = ["I love playing Football", "I love playing Cricket", "I love playing sports"]
#corpus_sentences = ["I love playing Football"]

x_train, y_train, vocab_word_index = prepare_training_data(corpus_sentences)

W, W1, image_files = train(x_train, y_train, vocab_word_index, embedding_dim, epochs, learning_rate_alpha)

print("W1:\t", W1, W1.shape)
print("W:\t", W, W.shape)
create_gif(image_files, 'images/word2vec_skipgram.gif')

References

1. Mikolov, T., Chen, K., Corrado, G., Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781.
2. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems 26. arXiv:1310.4546.

About Gyan Mittal

Gyan is Engineering Head at Times Internet Limited. He is a technology leader and AI-ML expert with 22+ years of experience in hands-on architecture, design, and development of a broad range of software technologies. Gyan likes writing blogs on technology, especially Artificial Intelligence, Machine Learning, NLP, and Data Science. In his blogs, Gyan tries to explain complex concepts in simple language, mostly supported by very simple example code (with the corresponding GitHub link) that is easy to understand. Gyan did his B.Tech. at IIT Kanpur, his M.Tech. at IIT Madras, and the Artificial Intelligence Professional Program at Stanford University (2019-2021).