TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a numerical statistic that reflects how important a word is to a document relative to a collection of documents (a corpus). It combines a term's local frequency within a document with its inverse document frequency across the corpus, so that words frequent in one document but rare overall receive the highest weights.

Core Concepts

Term Frequency (TF)

How often a term appears in a document, normalized by document length:

TF(t,d) = \frac{\text{count of term } t \text{ in document } d}{\text{total terms in document } d}

Inverse Document Frequency (IDF)

Measures how rare a term is across the corpus; terms that appear in few documents receive higher weights:

IDF(t) = \log\frac{\text{total number of documents}}{\text{number of documents containing term } t}

TF-IDF Score

Combines both metrics:

TF\text{-}IDF(t,d) = TF(t,d) \times IDF(t)
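
A quick worked example: in a corpus of 3 documents, if "cat" appears once in a 3-word document and in no other document, then TF(cat,d) = 1/3, IDF(cat) = log(3/1) ≈ 1.10 (natural log), and TF-IDF(cat,d) = 1/3 × 1.10 ≈ 0.37. A word like "the" that appears in all 3 documents gets IDF = log(3/3) = 0, zeroing its score regardless of how often it occurs.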

Implementation

Basic TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

def compute_tfidf(documents):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(documents)
    feature_names = vectorizer.get_feature_names_out()
    return tfidf_matrix, feature_names
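
A short usage sketch with a toy corpus; the dense conversion is for readability only, since the returned matrix is sparse:

documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
]
tfidf_matrix, feature_names = compute_tfidf(documents)
# Rows are documents, columns are vocabulary terms (alphabetical)
print(feature_names)           # ['cat' 'chased' 'dog' 'mat' 'on' 'sat' 'the']
print(tfidf_matrix.toarray())  # dense view of the sparse TF-IDF matrix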

Custom TF-IDF

import numpy as np

def custom_tfidf(documents):
    # Compute TF
    def compute_tf(text):
        words = text.split()
        tf = {}
        for word in words:
            tf[word] = tf.get(word, 0) + 1
        # Normalize
        total_words = len(words)
        for word in tf:
            tf[word] = tf[word] / total_words
        return tf
    
    # Compute IDF
    def compute_idf(documents):
        N = len(documents)
        idf = {}
        for doc in documents:
            words = set(doc.split())
            for word in words:
                idf[word] = idf.get(word, 0) + 1
        
        for word in idf:
            idf[word] = np.log(N / idf[word])
        return idf
    
    # Compute TF-IDF: multiply each document's TF by the corpus-wide IDF
    tf_scores = [compute_tf(doc) for doc in documents]
    idf_scores = compute_idf(documents)
    tfidf_scores = [
        {word: tf_val * idf_scores[word] for word, tf_val in doc_tf.items()}
        for doc_tf in tf_scores
    ]
    return tfidf_scores
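
A usage sketch; note this implementation skips the smoothing and L2 normalization that scikit-learn applies, so scores will not match TfidfVectorizer exactly:

docs = ["the cat sat", "the dog barked"]
scores = custom_tfidf(docs)
print(scores[0]["cat"])  # (1/3) * log(2/1) ≈ 0.23
print(scores[0]["the"])  # 0.0 -- "the" appears in every document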

Advanced Features

1. Normalization Options

def normalized_tfidf(documents, norm='l2'):
    vectorizer = TfidfVectorizer(
        norm=norm,
        smooth_idf=True,
        sublinear_tf=True
    )
    return vectorizer.fit_transform(documents)
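
In scikit-learn, sublinear_tf=True replaces the raw count tf with 1 + log(tf), damping the influence of very frequent terms, and smooth_idf=True adds one to every document frequency (as if one extra document contained each term once), preventing zero divisions for terms unseen at training time.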

2. N-gram Support

def ngram_tfidf(documents, ngram_range=(1, 2)):
    vectorizer = TfidfVectorizer(
        ngram_range=ngram_range,
        stop_words='english'
    )
    return vectorizer.fit_transform(documents)
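
A brief usage sketch; with ngram_range=(1, 2) the learned vocabulary mixes unigrams and bigrams:

docs = ["machine learning models", "deep learning models"]
matrix = ngram_tfidf(docs)
# Features include unigrams ('deep', 'learning', ...) and
# bigrams ('deep learning', 'machine learning', ...)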

Applications

  1. Information Retrieval:

    • Document search
    • Relevance ranking
    • Content recommendation
  2. Text Analysis:

    • Feature extraction
    • Document similarity (see the sketch after this list)
    • Keyword extraction
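
As a sketch of the document-similarity application above: TF-IDF rows are L2-normalized by default, so cosine similarity is a natural measure of document closeness.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat lay on the mat",
    "stock prices fell sharply",
]
tfidf = TfidfVectorizer().fit_transform(docs)
# Pairwise cosine similarity between all documents
sim = cosine_similarity(tfidf)
print(sim.round(2))  # docs 0 and 1 score high; doc 2 scores ~0 against both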

Best Practices

1. Preprocessing

  • Remove stop words
  • Apply stemming/lemmatization
  • Handle special characters
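
A minimal sketch of these steps; the regex is an illustrative choice, and the PorterStemmer assumes NLTK is installed:

import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def preprocess(text):
    # Lowercase and replace special characters with spaces
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    # Drop stop words, then stem the surviving tokens
    return " ".join(stemmer.stem(w) for w in text.split()
                    if w not in ENGLISH_STOP_WORDS)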

2. Parameter Tuning

  • Adjust smoothing
  • Choose normalization
  • Set frequency thresholds

3. Feature Selection

  • Remove rare terms
  • Filter common words
  • Consider domain vocabulary
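
The tuning and feature-selection points above map directly onto TfidfVectorizer parameters; the values below are illustrative starting points, not recommendations:

vectorizer = TfidfVectorizer(
    min_df=2,           # remove rare terms: ignore words in fewer than 2 documents
    max_df=0.8,         # filter common words: ignore words in over 80% of documents
    sublinear_tf=True,  # dampen high raw counts
    smooth_idf=True,    # adjust smoothing
    norm='l2'           # choose normalization
)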

Implementation Example

class TfidfAnalyzer:
    def __init__(self,
                 min_df=1,
                 max_df=1.0,
                 ngram_range=(1, 1),
                 norm='l2'):
        self.vectorizer = TfidfVectorizer(
            min_df=min_df,
            max_df=max_df,
            ngram_range=ngram_range,
            norm=norm,
            stop_words='english'
        )
        
    def fit_transform(self, documents):
        # Transform documents to TF-IDF representation
        tfidf_matrix = self.vectorizer.fit_transform(documents)
        
        # Get feature names
        features = self.vectorizer.get_feature_names_out()
        
        return tfidf_matrix, features
    
    def get_top_terms(self, tfidf_matrix, n=10):
        # Get top terms for each document
        features = self.vectorizer.get_feature_names_out()
        for doc_idx in range(tfidf_matrix.shape[0]):
            doc_terms = tfidf_matrix[doc_idx].toarray()[0]
            top_term_indices = doc_terms.argsort()[-n:][::-1]
            top_terms = [(features[i], doc_terms[i]) 
                        for i in top_term_indices]
            yield top_terms
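
A usage sketch for the class above:

analyzer = TfidfAnalyzer(min_df=1, ngram_range=(1, 2))
matrix, features = analyzer.fit_transform([
    "the cat sat on the mat",
    "the dog chased the cat across the yard",
])
# Print the three highest-weighted terms per document
for doc_idx, top_terms in enumerate(analyzer.get_top_terms(matrix, n=3)):
    print(doc_idx, top_terms)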

Advantages

  1. Term Importance:

    • Balances local and global importance
    • Penalizes common terms
    • Highlights distinctive words
  2. Versatility:

    • Works with various text types
    • Supports multiple languages
    • Adaptable to different domains

Limitations

  1. Dimensionality:

    • High-dimensional vectors
    • Sparse matrices
    • Computational overhead
  2. Semantic Gaps:

    • No word relationships
    • Ignores word order
    • Limited context understanding

Summary

TF-IDF is a powerful text representation technique that improves upon simple bag-of-words by considering both local and global term importance. Its effectiveness in capturing term significance makes it a popular choice for various NLP applications, despite limitations in semantic understanding.