Word Embeddings
Word embeddings are dense vector representations of words in a continuous vector space, where semantically similar words are mapped to nearby points. Unlike sparse representations such as bag-of-words (BoW) or TF-IDF, embeddings capture semantic relationships between words.
Core Concepts
Vector Space Model
- Words are represented as dense vectors
- Similar words have similar vectors
- Vector arithmetic captures semantic relationships, e.g. `king - man + woman ≈ queen` (see the sketch below)
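The analogy can be checked against pretrained vectors. A minimal sketch, assuming gensim's downloader API and the `glove-wiki-gigaword-50` model (downloaded on first use); any pretrained `KeyedVectors` would work the same way:

```python
import gensim.downloader as api

# Small pretrained GloVe model distributed via gensim's downloader.
wv = api.load("glove-wiki-gigaword-50")

# king - man + woman ≈ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```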
Types of Embeddings
1. Word2Vec
Two main architectures:
CBOW (Continuous Bag of Words)
Predicts a word given its context:
```python
from gensim.models import Word2Vec

def train_cbow(sentences):
    model = Word2Vec(sentences,
                     vector_size=100,
                     window=5,
                     min_count=1,
                     sg=0)  # sg=0 selects the CBOW architecture
    return model
```
Skip-gram
Predicts context words given a target word:
```python
def train_skipgram(sentences):
    model = Word2Vec(sentences,
                     vector_size=100,
                     window=5,
                     min_count=1,
                     sg=1)  # sg=1 selects the Skip-gram architecture
    return model
```
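Both helpers return a trained gensim model whose vectors live on the `.wv` attribute. A minimal usage sketch; the toy sentences below are made up for illustration, and real training needs a much larger corpus:

```python
# Toy, pre-tokenized corpus for illustration only.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

cbow_model = train_cbow(sentences)
skipgram_model = train_skipgram(sentences)

print(cbow_model.wv["cat"].shape)             # (100,)
print(skipgram_model.wv.most_similar("cat"))  # neighbours are noisy on tiny data
```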
2. GloVe (Global Vectors)
Based on global word-word co-occurrence statistics:
```python
from gensim.models import KeyedVectors

def load_glove_vectors(path):
    # Expects vectors already converted to word2vec text format
    # (raw GloVe files lack the header line this loader requires).
    return KeyedVectors.load_word2vec_format(path)
```
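If the file is a raw GloVe download without the word2vec header line, one option is the `no_header` flag available in gensim 4.x; a hedged sketch, assuming that gensim version:

```python
from gensim.models import KeyedVectors

def load_raw_glove(path):
    # gensim 4.x can read header-less GloVe text files directly.
    return KeyedVectors.load_word2vec_format(path, binary=False, no_header=True)
```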
3. FastText
Includes subword information:
```python
from gensim.models import FastText

def train_fasttext(sentences):
    model = FastText(sentences,
                     vector_size=100,
                     window=5,
                     min_count=1)
    return model
```
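Because FastText builds word vectors from character n-grams, it can produce a vector even for a word it never saw during training. A small sketch using the helper above (the corpus and the misspelled query word are made up):

```python
# Toy corpus for illustration only.
model = train_fasttext([["the", "cat", "sat", "on", "the", "mat"],
                        ["dogs", "and", "cats", "are", "pets"]])

print(model.wv["cat"].shape)   # in-vocabulary word
print(model.wv["catz"].shape)  # out-of-vocabulary word still gets a vector
```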
Implementation
Basic Usage
```python
import numpy as np
from gensim.models import Word2Vec

class WordEmbedder:
    def __init__(self, vector_size=100, window=5):
        self.vector_size = vector_size
        self.window = window
        self.model = None

    def train(self, sentences):
        self.model = Word2Vec(sentences,
                              vector_size=self.vector_size,
                              window=self.window,
                              min_count=1)

    def get_vector(self, word):
        return self.model.wv[word]

    def most_similar(self, word, n=10):
        return self.model.wv.most_similar(word, topn=n)
```
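A minimal usage sketch; the sentences are made up and pre-tokenized:

```python
embedder = WordEmbedder(vector_size=50, window=3)
embedder.train([["natural", "language", "processing", "is", "fun"],
                ["word", "embeddings", "capture", "meaning"]])

print(embedder.get_vector("language").shape)   # (50,)
print(embedder.most_similar("language", n=3))  # noisy on so little data
```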
Advanced Features
1. Document Embeddings
Combining word vectors to represent documents:
```python
def document_embedding(doc, word_vectors):
    vectors = [word_vectors[word]
               for word in doc
               if word in word_vectors]
    if not vectors:  # no in-vocabulary words: fall back to a zero vector
        return np.zeros(word_vectors.vector_size)
    return np.mean(vectors, axis=0)
```
2. Custom Training
```python
def custom_embedding_training(texts, vector_size=100):
    # Preprocess texts: simple whitespace tokenization
    sentences = [text.split() for text in texts]
    # Train model
    model = Word2Vec(sentences,
                     vector_size=vector_size,
                     window=5,
                     min_count=1,
                     workers=4)
    return model
```
Applications
- Text Classification:
  - Document categorization
  - Sentiment analysis
  - Topic modeling
- Information Retrieval:
  - Semantic search
  - Document similarity (see the sketch after this list)
  - Question answering
- Language Tasks:
  - Machine translation
  - Text summarization
  - Named entity recognition
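As an example of document similarity, averaged word vectors can be compared with cosine similarity. A minimal sketch reusing `document_embedding` from above; the tiny inline model and documents are placeholders for illustration only:

```python
import numpy as np
from gensim.models import Word2Vec

def cosine_similarity(a, b):
    # Cosine of the angle between two document vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["a", "dog", "lay", "on", "the", "rug"]]

# Tiny model purely for illustration; similarities will be noisy.
word_vectors = Word2Vec(docs, vector_size=50, window=3, min_count=1).wv

similarity = cosine_similarity(document_embedding(docs[0], word_vectors),
                               document_embedding(docs[1], word_vectors))
print(similarity)  # values closer to 1.0 indicate more similar documents
```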
Best Practices
1. Training Data
   - Use a large corpus
   - Clean and preprocess the text
   - Consider domain-specific data
2. Model Parameters
   - Choose an appropriate vector size
   - Adjust the window size
   - Set a minimum frequency threshold (min_count)
3. Evaluation
   - Use intrinsic evaluation such as analogy tests (see the sketch after this list)
   - Perform task-specific evaluation
   - Compare with baselines
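For intrinsic evaluation, gensim ships the standard "questions-words" analogy set with its test data. A hedged sketch, assuming gensim 4.x and a trained or pretrained `KeyedVectors` object `wv` (e.g. `model.wv`):

```python
from gensim.test.utils import datapath

# Overall accuracy on the analogy test set; higher is better.
score, sections = wv.evaluate_word_analogies(datapath("questions-words.txt"))
print(f"analogy accuracy: {score:.3f}")
```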
Implementation Example
```python
import numpy as np
from gensim.models import Word2Vec, KeyedVectors

class TextEmbedder:
    def __init__(self, pretrained_path=None):
        # Store KeyedVectors so trained and pretrained models share one interface.
        self.wv = None
        if pretrained_path:
            self.load_pretrained(pretrained_path)

    def train(self, sentences, vector_size=100):
        model = Word2Vec(sentences,
                         vector_size=vector_size,
                         window=5,
                         min_count=1,
                         workers=4)
        self.wv = model.wv

    def load_pretrained(self, path):
        self.wv = KeyedVectors.load_word2vec_format(path)

    def get_document_embedding(self, document):
        words = document.split()
        vectors = [self.wv[word]
                   for word in words
                   if word in self.wv]
        if not vectors:  # no in-vocabulary words: fall back to a zero vector
            return np.zeros(self.wv.vector_size)
        return np.mean(vectors, axis=0)

    def find_similar_words(self, word, n=10):
        return self.wv.most_similar(word, topn=n)

    def analogy(self, word1, word2, word3):
        # word1 is to word2 as word3 is to ?
        return self.wv.most_similar(positive=[word2, word3],
                                    negative=[word1])
```
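A minimal usage sketch with made-up sentences; results on such a tiny corpus are noisy and serve only to show the interface:

```python
embedder = TextEmbedder()
embedder.train([["machine", "learning", "is", "fun"],
                ["deep", "learning", "uses", "neural", "networks"]])

print(embedder.get_document_embedding("machine learning rocks").shape)  # (100,)
print(embedder.find_similar_words("learning", n=3))
print(embedder.analogy("machine", "learning", "neural"))
```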
Advantages
- Semantic Understanding:
  - Captures word relationships
  - Handles synonyms and analogies
  - Preserves semantic similarity
- Dimensionality:
  - Dense representation
  - Lower-dimensional space
  - Efficient computation
Limitations
- Training Requirements:
  - A large corpus is needed
  - Significant computational resources
  - Domain adaptation may be required
- Word Ambiguity:
  - A single vector per word
  - No consideration of context
  - Poor handling of polysemy (all senses of a word share one vector)
Summary
Word embeddings represent a significant advancement in text representation, offering dense, meaningful vectors that capture semantic relationships between words. Their ability to model word similarities and relationships makes them valuable for various NLP tasks, though they have limitations in handling context and ambiguity.