Bag of Words (BoW)

Bag of Words (BoW) is a fundamental text representation technique that converts text into fixed-length vectors by counting word frequencies, disregarding grammar and word order but maintaining multiplicity.

Basic Concept

The BoW model represents text as a "bag" (multiset) of its words:

  • Each document becomes a vector
  • Each position corresponds to a word in the vocabulary
  • Values represent word frequencies
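
For example, with the vocabulary {cat, mat, on, sat, the}, the sentence "The cat sat on the mat" maps to the vector [1, 1, 1, 1, 2]: each word occurs once except "the", which occurs twice.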

Implementation

Simple BoW

from sklearn.feature_extraction.text import CountVectorizer

def simple_bow(documents):
    # Initialize vectorizer
    vectorizer = CountVectorizer()
    
    # Fit and transform documents
    X = vectorizer.fit_transform(documents)
    
    # Get feature names (vocabulary)
    vocab = vectorizer.get_feature_names_out()
    
    return X, vocab

# Example usage
documents = [
    "The cat sat on the mat",
    "The dog ran in the park"
]
X, vocab = simple_bow(documents)
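
# Inspect the learned vocabulary and the count matrix
print(vocab)
# -> ['cat' 'dog' 'in' 'mat' 'on' 'park' 'ran' 'sat' 'the']
print(X.toarray())
# -> [[1 0 0 1 1 0 0 1 2]
#     [0 1 1 0 0 1 1 0 2]]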

Advanced Features

1. N-gram Support

def ngram_bow(documents, ngram_range=(1, 2)):
    # Count unigrams and bigrams, dropping common English stop words
    vectorizer = CountVectorizer(
        ngram_range=ngram_range,
        stop_words='english'
    )
    X = vectorizer.fit_transform(documents)
    return X, vectorizer.get_feature_names_out()
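
With ngram_range=(1, 2), the feature set contains both single words and adjacent word pairs. Note that CountVectorizer removes stop words before building n-grams, so a bigram such as "sat mat" can join words that were separated only by stop words in the original text.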

2. Custom Preprocessing

def custom_preprocessor(text):
    return text.lower()  # example: lowercase the raw text

def custom_tokenizer(text):
    return text.split()  # example: split on whitespace

def custom_bow(documents):
    vectorizer = CountVectorizer(
        preprocessor=custom_preprocessor,
        tokenizer=custom_tokenizer,
        token_pattern=None  # silences sklearn's warning when a tokenizer is set
    )
    return vectorizer.fit_transform(documents)

Variations

1. Binary BoW

Only considers presence/absence of words:

def binary_bow(documents):
    vectorizer = CountVectorizer(binary=True)
    return vectorizer.fit_transform(documents)

2. Frequency-Limited BoW

Filters words based on document frequency:

def filtered_bow(documents, min_df=2, max_df=0.95):
    vectorizer = CountVectorizer(
        min_df=min_df,  # Minimum document frequency
        max_df=max_df   # Maximum document frequency
    )
    return vectorizer.fit_transform(documents)

Applications

  1. Document Classification (see the sketch below):
    • Topic categorization
    • Spam detection
    • Sentiment analysis
  2. Information Retrieval:
    • Document similarity
    • Search relevance
    • Content recommendation
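
As an illustration of the classification use case, here is a minimal sketch that feeds BoW counts into a Naive Bayes classifier; the training texts and labels are hypothetical toy data:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled data (hypothetical) in the spirit of spam detection
train_texts = [
    "win a free prize now",
    "meeting at noon tomorrow",
    "claim your free cash offer",
    "project status update attached",
]
train_labels = ["spam", "ham", "spam", "ham"]

# BoW counts feed directly into a Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["free prize offer"]))  # ['spam'] on this toy data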

Best Practices

1. Preprocessing

  • Remove stop words
  • Apply stemming/lemmatization
  • Handle case sensitivity (a combined sketch follows this list)
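
A minimal sketch combining these steps, assuming NLTK is installed for stemming (the function name preprocess_tokenize is illustrative):

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def preprocess_tokenize(text):
    # Lowercase, drop stop words, then stem the remaining tokens
    tokens = text.lower().split()
    return [stemmer.stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS]

# Stop words are removed inside the tokenizer because sklearn would
# otherwise compare its stop list against already-stemmed tokens
vectorizer = CountVectorizer(tokenizer=preprocess_tokenize, token_pattern=None)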

2. Vocabulary Management

  • Set minimum frequency threshold
  • Remove rare/common words
  • Consider domain-specific terms

3. Feature Selection

  • Remove irrelevant features
  • Use dimensionality reduction
  • Consider word importance (a dimensionality reduction sketch follows)
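
One common way to reduce dimensionality is truncated SVD (the core of latent semantic analysis) applied to the sparse count matrix; a minimal sketch:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the mat", "The dog ran in the park"]
X = CountVectorizer().fit_transform(docs)

# Project the high-dimensional sparse counts onto a few latent dimensions
svd = TruncatedSVD(n_components=2)
X_reduced = svd.fit_transform(X)  # dense array of shape (n_docs, 2)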

Implementation Example

class BowAnalyzer:
    def __init__(self, 
                 ngram_range=(1, 1),
                 min_df=1,
                 max_df=1.0,
                 binary=False):
        self.vectorizer = CountVectorizer(
            ngram_range=ngram_range,
            min_df=min_df,
            max_df=max_df,
            binary=binary,
            stop_words='english'
        )
        
    def fit_transform(self, documents):
        # Transform documents to BoW representation
        bow_matrix = self.vectorizer.fit_transform(documents)
        
        # Get feature names
        features = self.vectorizer.get_feature_names_out()
        
        return bow_matrix, features
    
    def transform(self, documents):
        return self.vectorizer.transform(documents)
    
    def get_vocabulary(self):
        return self.vectorizer.vocabulary_
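
Typical usage, reusing the documents list from the first example (the parameter choices here are illustrative):

# Fit on a corpus, then vectorize unseen text with the learned vocabulary
analyzer = BowAnalyzer(ngram_range=(1, 2), min_df=1)
bow_matrix, features = analyzer.fit_transform(documents)
new_vectors = analyzer.transform(["A cat sat in the park"])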

Advantages

  1. Simplicity:
    • Easy to implement
    • Intuitive representation
    • Fast computation
  2. Effectiveness:
    • Works well for basic tasks
    • Captures term frequency
    • Suitable for classification

Limitations

  1. Loss of Order:
    • Ignores word sequence
    • Loses grammatical structure
    • No semantic context
  2. Sparsity (quantified in the snippet below):
    • High-dimensional vectors
    • Many zero values
    • Memory intensive
  3. Semantic Loss:
    • No word relationships
    • No meaning preservation
    • Limited context understanding
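
The sparsity point is easy to quantify on any SciPy sparse BoW matrix, such as X from the first example; this toy matrix is still fairly dense, but realistic vocabularies make the matrix overwhelmingly sparse:

# Fraction of matrix cells that are non-zero
density = X.nnz / (X.shape[0] * X.shape[1])
print(f"Density: {density:.1%}")  # well under 1% for real corpora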

Summary

Bag of Words is a foundational text representation technique that remains useful for many NLP tasks. Although it discards word order and captures no relationships between words, its speed, interpretability, and effectiveness on basic tasks make it a valuable starting point for text analysis.