Word Embeddings
Word embeddings are dense vector representations of words in a continuous vector space, where semantically similar words are mapped to nearby points. Unlike sparse representations such as bag-of-words (BoW) or TF-IDF, embeddings capture semantic relationships between words.
Core Concepts
Vector Space Model
- Words are represented as dense vectors
- Similar words have similar vectors
- Vector arithmetic captures semantic relationships, e.g. `king - man + woman ≈ queen` (see the sketch below)
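The analogy can be checked against pretrained vectors. A minimal sketch, assuming gensim's downloader API and the `glove-wiki-gigaword-50` model (downloaded on first use); any pretrained `KeyedVectors` would work the same way:

```python
import gensim.downloader as api

# Small pretrained GloVe model distributed via gensim's downloader.
wv = api.load("glove-wiki-gigaword-50")

# king - man + woman ≈ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```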
Types of Embeddings
1. Word2Vec
Two main architectures:
CBOW (Continuous Bag of Words)
Predicts a word given its context:
```python
from gensim.models import Word2Vec

def train_cbow(sentences):
    model = Word2Vec(sentences,
                     vector_size=100,
                     window=5,
                     min_count=1,
                     sg=0)  # sg=0 selects the CBOW architecture
    return model
```
Skip-gram
Predicts context words given a target word:
```python
def train_skipgram(sentences):
    model = Word2Vec(sentences,
                     vector_size=100,
                     window=5,
                     min_count=1,
                     sg=1)  # sg=1 selects the Skip-gram architecture
    return model
```
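Both helpers return a trained gensim model whose vectors live on the `.wv` attribute. A minimal usage sketch; the toy sentences below are made up for illustration, and real training needs a much larger corpus:

```python
# Toy, pre-tokenized corpus for illustration only.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

cbow_model = train_cbow(sentences)
skipgram_model = train_skipgram(sentences)

print(cbow_model.wv["cat"].shape)             # (100,)
print(skipgram_model.wv.most_similar("cat"))  # neighbours are noisy on tiny data
```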
2. GloVe (Global Vectors)
Based on global word-word co-occurrence statistics:
```python
from gensim.models import KeyedVectors

def load_glove_vectors(path):
    # Expects vectors already converted to word2vec text format
    # (raw GloVe files lack the header line this loader requires).
    return KeyedVectors.load_word2vec_format(path)
```
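If the file is a raw GloVe download without the word2vec header line, one option is the `no_header` flag available in gensim 4.x; a hedged sketch, assuming that gensim version:

```python
from gensim.models import KeyedVectors

def load_raw_glove(path):
    # gensim 4.x can read header-less GloVe text files directly.
    return KeyedVectors.load_word2vec_format(path, binary=False, no_header=True)
```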
3. FastText
Includes subword information:
```python
from gensim.models import FastText

def train_fasttext(sentences):
    model = FastText(sentences,
                     vector_size=100,
                     window=5,
                     min_count=1)
    return model
```
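Because FastText builds word vectors from character n-grams, it can produce a vector even for a word it never saw during training. A small sketch using the helper above (the corpus and the misspelled query word are made up):

```python
# Toy corpus for illustration only.
model = train_fasttext([["the", "cat", "sat", "on", "the", "mat"],
                        ["dogs", "and", "cats", "are", "pets"]])

print(model.wv["cat"].shape)   # in-vocabulary word
print(model.wv["catz"].shape)  # out-of-vocabulary word still gets a vector
```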
Implementation
Basic Usage
```python
import numpy as np
from gensim.models import Word2Vec

class WordEmbedder:
    def __init__(self, vector_size=100, window=5):
        self.vector_size = vector_size
        self.window = window
        self.model = None

    def train(self, sentences):
        self.model = Word2Vec(sentences,
                              vector_size=self.vector_size,
                              window=self.window,
                              min_count=1)

    def get_vector(self, word):
        return self.model.wv[word]

    def most_similar(self, word, n=10):
        return self.model.wv.most_similar(word, topn=n)
```
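A minimal usage sketch; the sentences are made up and pre-tokenized:

```python
embedder = WordEmbedder(vector_size=50, window=3)
embedder.train([["natural", "language", "processing", "is", "fun"],
                ["word", "embeddings", "capture", "meaning"]])

print(embedder.get_vector("language").shape)   # (50,)
print(embedder.most_similar("language", n=3))  # noisy on so little data
```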
Advanced Features
1. Document Embeddings
Combining word vectors to represent documents:
```python
def document_embedding(doc, word_vectors):
    vectors = [word_vectors[word]
               for word in doc
               if word in word_vectors]
    if not vectors:  # no in-vocabulary words: fall back to a zero vector
        return np.zeros(word_vectors.vector_size)
    return np.mean(vectors, axis=0)
```
2. Custom Training
```python
def custom_embedding_training(texts, vector_size=100):
    # Preprocess texts: simple whitespace tokenization
    sentences = [text.split() for text in texts]
    # Train model
    model = Word2Vec(sentences,
                     vector_size=vector_size,
                     window=5,
                     min_count=1,
                     workers=4)
    return model
```
Applications
- Text Classification:
  - Document categorization
  - Sentiment analysis
  - Topic modeling
- Information Retrieval:
  - Semantic search
  - Document similarity (see the sketch after this list)
  - Question answering
- Language Tasks:
  - Machine translation
  - Text summarization
  - Named entity recognition
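As an example of document similarity, averaged word vectors can be compared with cosine similarity. A minimal sketch reusing `document_embedding` from above; the tiny inline model and documents are placeholders for illustration only:

```python
import numpy as np
from gensim.models import Word2Vec

def cosine_similarity(a, b):
    # Cosine of the angle between two document vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["a", "dog", "lay", "on", "the", "rug"]]

# Tiny model purely for illustration; similarities will be noisy.
word_vectors = Word2Vec(docs, vector_size=50, window=3, min_count=1).wv

similarity = cosine_similarity(document_embedding(docs[0], word_vectors),
                               document_embedding(docs[1], word_vectors))
print(similarity)  # values closer to 1.0 indicate more similar documents
```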
Best Practices
1. Training Data
   - Use a large corpus
   - Clean and preprocess the text
   - Consider domain-specific data
2. Model Parameters
   - Choose an appropriate vector size
   - Adjust the window size
   - Set a minimum frequency threshold (min_count)
3. Evaluation
   - Use intrinsic evaluation such as analogy tests (see the sketch after this list)
   - Perform task-specific evaluation
   - Compare with baselines
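For intrinsic evaluation, gensim ships the standard "questions-words" analogy set with its test data. A hedged sketch, assuming gensim 4.x and a trained or pretrained `KeyedVectors` object `wv` (e.g. `model.wv`):

```python
from gensim.test.utils import datapath

# Overall accuracy on the analogy test set; higher is better.
score, sections = wv.evaluate_word_analogies(datapath("questions-words.txt"))
print(f"analogy accuracy: {score:.3f}")
```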
Implementation Example
```python
import numpy as np
from gensim.models import Word2Vec, KeyedVectors

class TextEmbedder:
    def __init__(self, pretrained_path=None):
        # Store KeyedVectors so trained and pretrained models share one interface.
        self.wv = None
        if pretrained_path:
            self.load_pretrained(pretrained_path)

    def train(self, sentences, vector_size=100):
        model = Word2Vec(sentences,
                         vector_size=vector_size,
                         window=5,
                         min_count=1,
                         workers=4)
        self.wv = model.wv

    def load_pretrained(self, path):
        self.wv = KeyedVectors.load_word2vec_format(path)

    def get_document_embedding(self, document):
        words = document.split()
        vectors = [self.wv[word]
                   for word in words
                   if word in self.wv]
        if not vectors:  # no in-vocabulary words: fall back to a zero vector
            return np.zeros(self.wv.vector_size)
        return np.mean(vectors, axis=0)

    def find_similar_words(self, word, n=10):
        return self.wv.most_similar(word, topn=n)

    def analogy(self, word1, word2, word3):
        # word1 is to word2 as word3 is to ?
        return self.wv.most_similar(positive=[word2, word3],
                                    negative=[word1])
```
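A minimal usage sketch with made-up sentences; results on such a tiny corpus are noisy and serve only to show the interface:

```python
embedder = TextEmbedder()
embedder.train([["machine", "learning", "is", "fun"],
                ["deep", "learning", "uses", "neural", "networks"]])

print(embedder.get_document_embedding("machine learning rocks").shape)  # (100,)
print(embedder.find_similar_words("learning", n=3))
print(embedder.analogy("machine", "learning", "neural"))
```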
Advantages
- Semantic Understanding:
  - Captures word relationships
  - Handles synonyms and analogies
  - Preserves semantic similarity
- Dimensionality:
  - Dense representation
  - Lower-dimensional space
  - Efficient computation
Limitations
- Training Requirements:
  - A large corpus is needed
  - Significant computational resources
  - Domain adaptation may be required
- Word Ambiguity:
  - A single vector per word
  - No consideration of context
  - Poor handling of polysemy (all senses of a word share one vector)
Summary
Word embeddings represent a significant advancement in text representation, offering dense, meaningful vectors that capture semantic relationships between words. Their ability to model word similarities and relationships makes them valuable for various NLP tasks, though they have limitations in handling context and ambiguity.