TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents. It combines a term's local frequency within a document with its global rarity across the corpus to produce a balanced measure of term importance.
Core Concepts
Term Frequency (TF)
How often a term appears in a document, normalized by the document's length: tf(t, d) = count(t, d) / (total terms in d).
Inverse Document Frequency (IDF)
Measures how rare, and therefore how distinctive, a term is across the corpus: idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. A term that appears in every document gets a weight of zero.
TF-IDF Score
The product of the two: tfidf(t, d) = tf(t, d) × idf(t). A term scores highly when it is frequent within a document but rare across the corpus.
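For example, suppose a corpus has N = 4 documents and a term appears 3 times in a 100-word document and occurs in 2 of the 4 documents. Then tf = 3/100 = 0.03, idf = log(4/2) ≈ 0.693 (natural log), and tfidf ≈ 0.03 × 0.693 ≈ 0.021.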
Implementation
Basic TF-IDF
```python
from sklearn.feature_extraction.text import TfidfVectorizer

def compute_tfidf(documents):
    # Learn the vocabulary and compute the TF-IDF weight matrix
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(documents)
    # Column labels for the matrix (one per vocabulary term)
    feature_names = vectorizer.get_feature_names_out()
    return tfidf_matrix, feature_names
```
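As a quick usage check, a toy corpus (illustrative only) can be passed straight in:

```python
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs make good pets",
]
matrix, features = compute_tfidf(docs)
print(matrix.shape)   # (3, vocabulary size)
print(features[:5])   # first few terms in the learned vocabulary
```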
Custom TF-IDF
```python
import numpy as np

def custom_tfidf(documents):
    # Compute TF: relative frequency of each word within one document
    def compute_tf(text):
        words = text.split()
        tf = {}
        for word in words:
            tf[word] = tf.get(word, 0) + 1
        # Normalize by document length
        total_words = len(words)
        for word in tf:
            tf[word] = tf[word] / total_words
        return tf

    # Compute IDF: log of corpus size over document frequency
    def compute_idf(documents):
        N = len(documents)
        idf = {}
        for doc in documents:
            for word in set(doc.split()):
                idf[word] = idf.get(word, 0) + 1
        for word in idf:
            idf[word] = np.log(N / idf[word])
        return idf

    # Combine: tfidf(t, d) = tf(t, d) * idf(t)
    idf_scores = compute_idf(documents)
    tfidf_scores = [
        {word: tf * idf_scores[word] for word, tf in compute_tf(doc).items()}
        for doc in documents
    ]
    return tfidf_scores
```
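Note that this bare formulation differs from scikit-learn's defaults: TfidfVectorizer smooths the IDF as log((1 + N) / (1 + df)) + 1 and L2-normalizes each document vector, so the two implementations will produce different absolute scores.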
Advanced Features
1. Normalization Options
```python
def normalized_tfidf(documents, norm='l2'):
    vectorizer = TfidfVectorizer(
        norm=norm,            # 'l1', 'l2', or None
        smooth_idf=True,      # add-one smoothing of document frequencies
        sublinear_tf=True     # use 1 + log(tf) instead of raw counts
    )
    return vectorizer.fit_transform(documents)
```
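Here smooth_idf=True adds one to the document frequencies, as if an extra document contained every term exactly once, which prevents zero divisions for terms unseen at transform time; sublinear_tf=True replaces raw counts with 1 + log(tf), damping the influence of very frequent terms.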
2. N-gram Support
```python
def ngram_tfidf(documents, ngram_range=(1, 2)):
    vectorizer = TfidfVectorizer(
        ngram_range=ngram_range,   # e.g. (1, 2) keeps unigrams and bigrams
        stop_words='english'       # drop common English function words
    )
    return vectorizer.fit_transform(documents)
```
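With ngram_range=(1, 2), for example, the text "machine learning rocks" contributes the unigrams "machine", "learning", and "rocks" plus the bigrams "machine learning" and "learning rocks", letting the model weight multi-word phrases directly.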
Applications
- Information Retrieval:
  - Document search
  - Relevance ranking
  - Content recommendation
- Text Analysis:
  - Feature extraction
  - Document similarity (see the sketch after this list)
  - Keyword extraction
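As a sketch of the document-similarity application, TF-IDF row vectors can be compared with scikit-learn's cosine_similarity (the corpus below is an illustrative assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat lay on the rug",
    "stock prices fell sharply today",
]
tfidf = TfidfVectorizer().fit_transform(docs)
# Pairwise cosine similarity between row vectors of the TF-IDF matrix
sim = cosine_similarity(tfidf)
print(sim.round(2))  # docs 0 and 1 score higher with each other than with doc 2
```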
Best Practices
1. Preprocessing
   - Remove stop words
   - Apply stemming/lemmatization (sketched after this list)
   - Handle special characters
2. Parameter Tuning
   - Adjust IDF smoothing
   - Choose a normalization scheme
   - Set document-frequency thresholds
3. Feature Selection
   - Remove rare terms (min_df)
   - Filter overly common words (max_df)
   - Consider domain vocabulary
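A minimal sketch of these practices, assuming NLTK's PorterStemmer for stemming (the tokenizer and the min_df/max_df values here are illustrative choices, not requirements):

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def stem_tokenizer(text):
    # Lowercase, keep alphabetic tokens only, then stem each token
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(token) for token in tokens]

vectorizer = TfidfVectorizer(
    tokenizer=stem_tokenizer,  # custom preprocessing (stemming)
    min_df=2,                  # drop terms appearing in fewer than 2 documents
    max_df=0.9,                # drop terms appearing in over 90% of documents
)
```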
Implementation Example
```python
class TfidfAnalyzer:
    def __init__(self,
                 min_df=1,
                 max_df=1.0,
                 ngram_range=(1, 1),
                 norm='l2'):
        self.vectorizer = TfidfVectorizer(
            min_df=min_df,
            max_df=max_df,
            ngram_range=ngram_range,
            norm=norm,
            stop_words='english'
        )

    def fit_transform(self, documents):
        # Transform documents to TF-IDF representation
        tfidf_matrix = self.vectorizer.fit_transform(documents)
        # Get feature names
        features = self.vectorizer.get_feature_names_out()
        return tfidf_matrix, features

    def get_top_terms(self, tfidf_matrix, n=10):
        # Get top terms for each document
        features = self.vectorizer.get_feature_names_out()
        for doc_idx in range(tfidf_matrix.shape[0]):
            doc_terms = tfidf_matrix[doc_idx].toarray()[0]
            top_term_indices = doc_terms.argsort()[-n:][::-1]
            top_terms = [(features[i], doc_terms[i])
                         for i in top_term_indices]
            yield top_terms
```
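One possible way to exercise the class, reusing the toy docs corpus from the basic example above:

```python
analyzer = TfidfAnalyzer(min_df=1, ngram_range=(1, 2))
matrix, features = analyzer.fit_transform(docs)
for doc_idx, terms in enumerate(analyzer.get_top_terms(matrix, n=3)):
    print(f"doc {doc_idx}:", [(term, round(score, 2)) for term, score in terms])
```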
Advantages
- Term Importance:
  - Balances local and global importance
  - Penalizes common terms
  - Highlights distinctive words
- Versatility:
  - Works with various text types
  - Supports multiple languages
  - Adaptable to different domains
Limitations
- Dimensionality:
  - High-dimensional vectors
  - Sparse matrices
  - Computational overhead
- Semantic Gaps:
  - No word relationships
  - Ignores word order
  - Limited context understanding
Summary
TF-IDF is a powerful text representation technique that improves upon simple bag-of-words by considering both local and global term importance. Its effectiveness in capturing term significance makes it a popular choice for various NLP applications, despite limitations in semantic understanding.