NLP – Word Embeddings

Various word encoding methods like Integer/Label Encoding or One-Hot Encoding have many limitations. The main one is that these encodings do not capture the semantic relationship between words. Using them, we have no way to find out that the words “Soccer” and “Football” are closely related.

Also, in One-Hot Encoding, as the vocabulary size increases, the memory requirement increases and the feature space grows with it. This growth in feature space leads to the curse of dimensionality. Integer/Label Encoding, on the other hand, introduces an undesirable ordinal bias, since the model may treat larger integer labels as “greater” words.
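To make the first limitation concrete, here is a minimal sketch (using NumPy and a tiny, made-up three-word vocabulary chosen purely for illustration) showing that any two distinct one-hot vectors are orthogonal, so the similarity between “soccer” and “football” comes out as zero:

```python
import numpy as np

# Illustrative vocabulary; in practice this would be the full corpus vocabulary.
vocab = ["soccer", "football", "banana"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector of a word: all zeros except a 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Any two distinct one-hot vectors are orthogonal, so the similarity is 0
# regardless of how related the words actually are.
print(cosine_similarity(one_hot("soccer"), one_hot("football")))  # 0.0
print(cosine_similarity(one_hot("soccer"), one_hot("banana")))    # 0.0
```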

We could use the WordNet list of synonyms to measure similarity (a quick lookup is sketched after this list), but this approach fails due to the following issues:

  1. Often, two words have the same meaning only in a particular context, and a synonym list does not capture that information.
  2. We can define synonyms for only a limited set of word pairs; maintaining the list for all pairs of words is not feasible. [Suppose we have 1 million words in the dictionary; then we would have to maintain 1 million × 1 million pairs. We can well imagine what would happen if we had 1 trillion words in our corpus.]
  3. As the vocabulary evolves, we need to regularly update the synonym list.
  4. It would be much more efficient and useful if we could calculate the similarity between any two words just by looking at their encodings.
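As a quick illustration of the WordNet approach, here is a minimal sketch (assuming NLTK and its WordNet corpus are installed) that looks up synonym sets and a path-based similarity score; note that this only works for word senses WordNet already knows about:

```python
# Requires: pip install nltk, then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

# Synonym sets (synsets) for a word; each synset groups words sharing one sense.
print(wn.synsets("soccer"))     # e.g. [Synset('soccer.n.01')]
print(wn.synsets("football"))   # e.g. [Synset('football.n.01'), ...]

# Path similarity between two senses, based on their distance
# in the WordNet hierarchy (1.0 means the same position).
soccer = wn.synset("soccer.n.01")
football = wn.synset("football.n.01")
print(soccer.path_similarity(football))
```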

So we need a way to encode words in a low-dimensional space (with a predefined number of dimensions, independent of the vocabulary size) such that the similarity between two words can be calculated from their encodings.

To achieve this, the concept of embeddings was introduced. The idea is to represent each word in a fixed vector space whose dimensionality does not depend on the size of the vocabulary. These vector representations are called embeddings, and the similarity between two words can be calculated from the distance between their vectors.

In word embeddings, each word is represented as an N-dimensional vector (N is often 300), and the similarity between two words is calculated from the distance between their vectors in this N-dimensional space. This also solves the high-dimensionality issue seen in One-Hot Encoding, where the dimension of the word encoding grows with the vocabulary size.
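A minimal sketch of what this buys us, using hand-made 4-dimensional vectors purely for illustration (real embeddings are learned from a corpus and typically have around 300 dimensions): related words get a high cosine similarity, unrelated words a low one.

```python
import numpy as np

# Hypothetical, hand-made embeddings purely for illustration;
# real embeddings are learned from a large corpus.
embeddings = {
    "soccer":   np.array([0.90, 0.80, 0.10, 0.00]),
    "football": np.array([0.85, 0.75, 0.20, 0.05]),
    "banana":   np.array([0.00, 0.10, 0.90, 0.80]),
}

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["soccer"], embeddings["football"]))  # close to 1
print(cosine_similarity(embeddings["soccer"], embeddings["banana"]))    # close to 0
```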

Word embeddings are calculated using the intuition that a word’s meaning is represented by the words that frequently appear near it.

[“You shall know a word by the company it keeps” (J. R. Firth 1957: 11)]
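To make this intuition concrete, here is a minimal sketch (the toy corpus and window size are illustrative assumptions) that collects, for each word, the words appearing within a fixed context window; these co-occurrence counts are the raw material for the embedding methods described below:

```python
from collections import defaultdict

# Toy corpus; in practice this would be a very large text collection.
corpus = [
    "i watch soccer on weekends",
    "i watch football on weekends",
    "i eat a banana every morning",
]
window = 2  # number of words to look at on each side of the centre word

co_occurrence = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                co_occurrence[word][tokens[j]] += 1

# "soccer" and "football" end up with very similar context counts,
# which is exactly the signal word embeddings exploit.
print(dict(co_occurrence["soccer"]))
print(dict(co_occurrence["football"]))
```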

Generating word embeddings has evolved through various methodologies, as listed below:

  • SVD-based methods
  • Neural Network/Iteration-based methods

SVD-based methods
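A minimal sketch of the SVD-based approach, under the assumption of the same toy corpus as above: build a word-word co-occurrence matrix, apply Singular Value Decomposition, and keep only the first k dimensions as the word vectors.

```python
import numpy as np

# Toy corpus reused for illustration; real methods use very large corpora.
corpus = [
    "i watch soccer on weekends",
    "i watch football on weekends",
    "i eat a banana every morning",
]
window = 2

# Build the vocabulary and a dense co-occurrence matrix X.
tokenised = [s.split() for s in corpus]
vocab = sorted({w for tokens in tokenised for w in tokens})
index = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(vocab)))
for tokens in tokenised:
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                X[index[word], index[tokens[j]]] += 1

# SVD of the co-occurrence matrix; the first k columns of U (scaled by the
# singular values) serve as k-dimensional word embeddings.
U, S, Vt = np.linalg.svd(X)
k = 2
embeddings = U[:, :k] * S[:k]

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings[index["soccer"]], embeddings[index["football"]]))
print(cosine_similarity(embeddings[index["soccer"]], embeddings[index["banana"]]))
```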

Neural Network/Iteration-based methods
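As a minimal sketch of the neural/iteration-based family (assuming gensim 4.x is installed; the toy corpus and parameters are illustrative only), the Word2Vec skip-gram model learns embeddings by iterating over the corpus and predicting the context words around each centre word:

```python
# Requires: pip install gensim
from gensim.models import Word2Vec

# Toy tokenised corpus; real models are trained on billions of tokens.
sentences = [
    ["i", "watch", "soccer", "on", "weekends"],
    ["i", "watch", "football", "on", "weekends"],
    ["i", "eat", "a", "banana", "every", "morning"],
]

# sg=1 selects the skip-gram architecture; vector_size is the embedding dimension.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

print(model.wv.similarity("soccer", "football"))
print(model.wv.similarity("soccer", "banana"))
```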

References

  • http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes01-wordvecs1.pdf
