TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents. It combines a term's local frequency within a document with its global rarity across the corpus to produce a balanced measure of term importance.
Core Concepts
Term Frequency (TF)
How often a term appears in a document, normalized by the document's length: tf(t, d) = count(t, d) / (total terms in d).
Inverse Document Frequency (IDF)
Measures how rare, and therefore how distinctive, a term is across the corpus: idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. A term that appears in every document gets a weight of zero.
TF-IDF Score
The product of the two: tfidf(t, d) = tf(t, d) × idf(t). A term scores highly when it is frequent within a document but rare across the corpus.
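For example, suppose a corpus has N = 4 documents and a term appears 3 times in a 100-word document and occurs in 2 of the 4 documents. Then tf = 3/100 = 0.03, idf = log(4/2) ≈ 0.693 (natural log), and tfidf ≈ 0.03 × 0.693 ≈ 0.021.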
Implementation
Basic TF-IDF
```python
from sklearn.feature_extraction.text import TfidfVectorizer

def compute_tfidf(documents):
    # Learn the vocabulary and compute the TF-IDF weight matrix
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(documents)
    # Column labels for the matrix (one per vocabulary term)
    feature_names = vectorizer.get_feature_names_out()
    return tfidf_matrix, feature_names
```
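As a quick usage check, a toy corpus (illustrative only) can be passed straight in:

```python
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs make good pets",
]
matrix, features = compute_tfidf(docs)
print(matrix.shape)   # (3, vocabulary size)
print(features[:5])   # first few terms in the learned vocabulary
```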
Custom TF-IDF
```python
import numpy as np

def custom_tfidf(documents):
    # Compute TF: relative frequency of each word within one document
    def compute_tf(text):
        words = text.split()
        tf = {}
        for word in words:
            tf[word] = tf.get(word, 0) + 1
        # Normalize by document length
        total_words = len(words)
        for word in tf:
            tf[word] = tf[word] / total_words
        return tf

    # Compute IDF: log of corpus size over document frequency
    def compute_idf(documents):
        N = len(documents)
        idf = {}
        for doc in documents:
            for word in set(doc.split()):
                idf[word] = idf.get(word, 0) + 1
        for word in idf:
            idf[word] = np.log(N / idf[word])
        return idf

    # Combine: tfidf(t, d) = tf(t, d) * idf(t)
    idf_scores = compute_idf(documents)
    tfidf_scores = [
        {word: tf * idf_scores[word] for word, tf in compute_tf(doc).items()}
        for doc in documents
    ]
    return tfidf_scores
```
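Note that this bare formulation differs from scikit-learn's defaults: TfidfVectorizer smooths the IDF as log((1 + N) / (1 + df)) + 1 and L2-normalizes each document vector, so the two implementations will produce different absolute scores.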
Advanced Features
1. Normalization Options
```python
def normalized_tfidf(documents, norm='l2'):
    vectorizer = TfidfVectorizer(
        norm=norm,            # 'l1', 'l2', or None
        smooth_idf=True,      # add-one smoothing of document frequencies
        sublinear_tf=True     # use 1 + log(tf) instead of raw counts
    )
    return vectorizer.fit_transform(documents)
```
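Here smooth_idf=True adds one to the document frequencies, as if an extra document contained every term exactly once, which prevents zero divisions for terms unseen at transform time; sublinear_tf=True replaces raw counts with 1 + log(tf), damping the influence of very frequent terms.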
2. N-gram Support
```python
def ngram_tfidf(documents, ngram_range=(1, 2)):
    vectorizer = TfidfVectorizer(
        ngram_range=ngram_range,   # e.g. (1, 2) keeps unigrams and bigrams
        stop_words='english'       # drop common English function words
    )
    return vectorizer.fit_transform(documents)
```
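With ngram_range=(1, 2), for example, the text "machine learning rocks" contributes the unigrams "machine", "learning", and "rocks" plus the bigrams "machine learning" and "learning rocks", letting the model weight multi-word phrases directly.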
Applications
- Information Retrieval:
  - Document search
  - Relevance ranking
  - Content recommendation
- Text Analysis:
  - Feature extraction
  - Document similarity (see the sketch after this list)
  - Keyword extraction
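As a sketch of the document-similarity application, TF-IDF row vectors can be compared with scikit-learn's cosine_similarity (the corpus below is an illustrative assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat lay on the rug",
    "stock prices fell sharply today",
]
tfidf = TfidfVectorizer().fit_transform(docs)
# Pairwise cosine similarity between row vectors of the TF-IDF matrix
sim = cosine_similarity(tfidf)
print(sim.round(2))  # docs 0 and 1 score higher with each other than with doc 2
```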
Best Practices
1. Preprocessing
   - Remove stop words
   - Apply stemming/lemmatization (sketched after this list)
   - Handle special characters
2. Parameter Tuning
   - Adjust IDF smoothing
   - Choose a normalization scheme
   - Set document-frequency thresholds
3. Feature Selection
   - Remove rare terms (min_df)
   - Filter overly common words (max_df)
   - Consider domain vocabulary
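A minimal sketch of these practices, assuming NLTK's PorterStemmer for stemming (the tokenizer and the min_df/max_df values here are illustrative choices, not requirements):

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def stem_tokenizer(text):
    # Lowercase, keep alphabetic tokens only, then stem each token
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(token) for token in tokens]

vectorizer = TfidfVectorizer(
    tokenizer=stem_tokenizer,  # custom preprocessing (stemming)
    min_df=2,                  # drop terms appearing in fewer than 2 documents
    max_df=0.9,                # drop terms appearing in over 90% of documents
)
```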
Implementation Example
```python
class TfidfAnalyzer:
    def __init__(self,
                 min_df=1,
                 max_df=1.0,
                 ngram_range=(1, 1),
                 norm='l2'):
        self.vectorizer = TfidfVectorizer(
            min_df=min_df,
            max_df=max_df,
            ngram_range=ngram_range,
            norm=norm,
            stop_words='english'
        )

    def fit_transform(self, documents):
        # Transform documents to TF-IDF representation
        tfidf_matrix = self.vectorizer.fit_transform(documents)
        # Get feature names
        features = self.vectorizer.get_feature_names_out()
        return tfidf_matrix, features

    def get_top_terms(self, tfidf_matrix, n=10):
        # Get top terms for each document
        features = self.vectorizer.get_feature_names_out()
        for doc_idx in range(tfidf_matrix.shape[0]):
            doc_terms = tfidf_matrix[doc_idx].toarray()[0]
            top_term_indices = doc_terms.argsort()[-n:][::-1]
            top_terms = [(features[i], doc_terms[i])
                         for i in top_term_indices]
            yield top_terms
```
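One possible way to exercise the class, reusing the toy docs corpus from the basic example above:

```python
analyzer = TfidfAnalyzer(min_df=1, ngram_range=(1, 2))
matrix, features = analyzer.fit_transform(docs)
for doc_idx, terms in enumerate(analyzer.get_top_terms(matrix, n=3)):
    print(f"doc {doc_idx}:", [(term, round(score, 2)) for term, score in terms])
```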
Advantages
- Term Importance:
  - Balances local and global importance
  - Penalizes common terms
  - Highlights distinctive words
- Versatility:
  - Works with various text types
  - Supports multiple languages
  - Adaptable to different domains
Limitations
- Dimensionality:
  - High-dimensional vectors
  - Sparse matrices
  - Computational overhead
- Semantic Gaps:
  - No word relationships
  - Ignores word order
  - Limited context understanding
Summary
TF-IDF is a powerful text representation technique that improves upon simple bag-of-words by considering both local and global term importance. Its effectiveness in capturing term significance makes it a popular choice for various NLP applications, despite limitations in semantic understanding.