Word encoding methods such as Integer/Label Encoding and One-Hot Encoding have several limitations. The main one is that these encodings do not capture any semantic relationship between words: using them, we have no way to tell that the words “Soccer” and “Football” are closely related.
Also, in One-Hot Encoding the memory requirement and the feature space grow with the vocabulary size, and the larger feature space leads to curse-of-dimensionality issues. Integer/Label Encoding, on the other hand, introduces an undesirable ordinal bias between words.
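To make this concrete, here is a minimal sketch (a toy vocabulary and NumPy only, not tied to any particular library) showing that a one-hot vector is as long as the vocabulary and that any two distinct words are orthogonal, so their similarity is always zero:

```python
import numpy as np

vocab = ["soccer", "football", "banana", "court", "goal"]   # toy vocabulary
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))        # vector length equals the vocabulary size
    vec[word_to_index[word]] = 1.0
    return vec

soccer, football = one_hot("soccer"), one_hot("football")

# Any two distinct one-hot vectors are orthogonal, so their cosine similarity
# is always 0 -- the encoding carries no notion of semantic closeness.
cosine = soccer @ football / (np.linalg.norm(soccer) * np.linalg.norm(football))
print(cosine)   # 0.0
```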
We could use WordNet’s lists of synonyms to get similarity, but this fails due to the following issues (a small illustration follows the list):
- Two words often have the same meaning only in a particular context, and a synonym list does not capture that information.
- We can define synonyms for only a limited set of word pairs; maintaining the list for all pairs of words is not feasible. [Suppose we have 1 million words in the dictionary; we would then have to maintain 1 million × 1 million pairs. We can well imagine what would happen with 1 trillion words in our corpus.]
- As the vocabulary evolves, we need to update the synonym list regularly.
- It would be much more efficient and useful if we could calculate the similarity between any two words directly from their encodings.
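For illustration, here is a rough sketch using NLTK’s WordNet interface (this assumes nltk is installed and the wordnet corpus has been downloaded; the example words are arbitrary). Similarity is only defined between the curated synsets WordNet already knows about, which is exactly the coverage and maintenance problem listed above:

```python
import nltk
nltk.download("wordnet", quiet=True)   # fetch the WordNet corpus if missing
from nltk.corpus import wordnet as wn

soccer_synsets = wn.synsets("soccer")
football_synsets = wn.synsets("football")
print(soccer_synsets)      # senses WordNet happens to know about
print(football_synsets)

# Path similarity is only defined between senses that exist in WordNet ...
if soccer_synsets and football_synsets:
    print(soccer_synsets[0].path_similarity(football_synsets[0]))

# ... so new words or domain terms that are not in the lexicon simply fail.
print(wn.synsets("word2vec"))   # [] -- the vocabulary evolves faster than the list
```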
So we need a way to encode words in a low-dimensional space (a predefined number of dimensions, independent of the vocabulary size) such that the similarity between two words can be calculated directly from these encodings.
To achieve this, the concept of embeddings was introduced. The idea is to represent each word in a fixed vector space that does not depend on the size of the vocabulary. These vector representations are called embeddings, and the similarity between two words can be calculated from the distance between their vectors.
In word embeddings, each word is represented as an N-dimensional vector (N is often around 300), and the similarity between two words is computed from the distance between their vectors in that N-dimensional space. This also solves the high-dimensionality issue seen in One-Hot Encoding, where the dimension of the encoding grows with the vocabulary size.
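As a minimal sketch, the vectors below are hand-written 4-dimensional stand-ins for real embeddings (which are learned from data and typically ~300-dimensional); the point is only that similarity becomes a simple vector computation once every word lives in the same fixed-size space:

```python
import numpy as np

# Hypothetical, hand-written vectors -- real embeddings are learned, not assigned.
embeddings = {
    "soccer":   np.array([0.9, 0.8, 0.1, 0.0]),
    "football": np.array([0.8, 0.9, 0.2, 0.1]),
    "banana":   np.array([0.0, 0.1, 0.9, 0.8]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings["soccer"], embeddings["football"]))   # high: related words
print(cosine(embeddings["soccer"], embeddings["banana"]))     # low: unrelated words
```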
Word embeddings are computed using the intuition that a word’s meaning can be represented by the words that frequently appear near it.
[“You shall know a word by the company it keeps” (J. R. Firth 1957: 11)]
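A small sketch of this intuition, with a made-up three-sentence corpus and window size: count which words appear within a fixed window of each other, so that words keeping similar “company” end up with similar co-occurrence rows (such rows are the starting point for the SVD-based methods listed below):

```python
from collections import defaultdict

# Made-up corpus and window size, chosen only to show the counting step.
corpus = [
    "soccer is played with a ball",
    "football is played with a ball",
    "bananas are eaten for breakfast",
]
window = 2

cooccurrence = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                cooccurrence[word][tokens[j]] += 1

# "soccer" and "football" keep almost the same company, so their rows look alike.
print(dict(cooccurrence["soccer"]))
print(dict(cooccurrence["football"]))
```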
Approaches for generating word embeddings have also evolved through various methodologies, as below:
- SVD based methods
- Neural Network/ Iteration based methods, for example:
  - Word2Vec (a minimal training sketch follows this list)
  - FastText
  - GloVe
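As a minimal training sketch for the iteration-based family, here is an example using gensim’s Word2Vec implementation (this assumes gensim 4.x is installed; the tiny corpus is for illustration only, real embeddings need far more text):

```python
from gensim.models import Word2Vec

# Tiny, pre-tokenised toy corpus.
corpus = [
    ["soccer", "is", "played", "with", "a", "ball"],
    ["football", "is", "played", "with", "a", "ball"],
    ["bananas", "are", "eaten", "for", "breakfast"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,   # embedding dimension (often 300 in practice)
    window=5,          # context window around each word
    min_count=1,       # keep every word in this toy corpus
    sg=1,              # 1 = skip-gram, 0 = CBOW
)

print(model.wv["soccer"].shape)                    # (100,)
print(model.wv.similarity("soccer", "football"))   # similarity from learned vectors
```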
References
- http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes01-wordvecs1.pdf