RFC-012: Episode Summarization Using Local Transformers

Abstract

Design and implement an episode summarization feature that generates concise summaries and key takeaways from episode transcripts using local PyTorch transformer models. Running the models locally keeps summarization private and cost-effective, requires no external API services, and preserves integration with the existing metadata generation pipeline.

Problem Statement

Issue #17 describes the need to generate concise summaries and key takeaways from episode transcripts. While API-based LLM services (OpenAI GPT) provide high-quality results, they have drawbacks:

  • Privacy concerns: Transcripts sent to external APIs may contain sensitive content
  • Cost: API usage incurs per-token costs that scale with transcript length
  • Rate limits: API providers enforce rate limits that can slow batch processing
  • Dependency: Requires internet connectivity and API key management
  • Latency: Network round-trips add latency to the processing pipeline

Local transformer models address these concerns by running entirely on-device, providing privacy, predictable costs (hardware), and no rate limits. However, they require:

  • GPU memory or sufficient CPU resources
  • Model download and caching
  • Careful prompt engineering for quality results
  • Memory management for long transcripts

Constraints & Assumptions

  • Summarization must be opt-in (default false) for backwards compatibility
  • Hardware Constraint: Must run on Apple M4 Pro with 48 GB RAM (primary development/testing platform)
  • Models must be selected and optimized to work within this memory constraint
  • Apple Silicon uses Metal Performance Shaders (MPS) backend for PyTorch, not CUDA
  • While 48 GB RAM is generous, memory efficiency is still important for concurrent operations
  • Model selection should prioritize models that fit comfortably in available memory (e.g., bart-base ~500MB, distilbart ~300MB)
  • Must support MPS backend for GPU acceleration on Apple Silicon
  • Local models are preferred over API-based solutions for privacy and cost reasons
  • Summaries are stored in metadata documents (PRD-004/RFC-011 structure)
  • Transcripts can be long (5000-20000+ words); models must handle long inputs efficiently
  • Users may have limited GPU memory; CPU fallback must be supported
  • Model downloads should be cached and reusable across runs
  • Summarization should not block transcript processing; it can run asynchronously or as a separate sequential step
  • Quality should be reasonable but may be lower than premium API services

Design & Implementation

1. Configuration

Add new configuration fields to config.Config:

```python
generate_summaries: bool = False  # Opt-in for backwards compatibility
summary_provider: Literal["local", "openai"] = "local"  # Default to local
summary_model: Optional[str] = None  # Model identifier (e.g., "facebook/bart-large-cnn")
summary_max_length: int = 150  # Max tokens for summary
summary_min_length: int = 30  # Min tokens for summary
summary_max_takeaways: int = 10  # Maximum number of key takeaways
summary_device: Optional[str] = None  # "cuda", "mps", "cpu", or None for auto-detection
summary_batch_size: int = 1  # Batch size for processing (1 = sequential)
summary_chunk_size: Optional[int] = None  # Chunk size for long transcripts (None = no chunking)
summary_cache_dir: Optional[str] = None  # Custom cache directory for models
```

Add the corresponding CLI flags:

- `--generate-summaries`: Enable summary generation
- `--summary-provider`: Choose provider (`local`, `openai`)
- `--summary-model`: Model identifier (e.g., `facebook/bart-large-cnn`)
- `--summary-max-length`: Maximum summary length in tokens
- `--summary-max-takeaways`: Maximum number of key takeaways
- `--summary-device`: Force device (`cuda`, `mps`, `cpu`, or `auto` for auto-detection)
- `--summary-chunk-size`: Chunk size for long transcripts (default: model max length)

### 2. Local Transformer Integration

#### 2.1 Dependency Management

Add dependencies to the `ml` extra under `[project.optional-dependencies]` in `pyproject.toml`:

```toml
[project.optional-dependencies]
ml = [
    "torch>=2.0.0,<3.0.0",  # PyTorch core
    "transformers>=4.30.0,<5.0.0",  # Hugging Face Transformers library
    "sentencepiece>=0.1.99,<1.0.0",  # Tokenizer dependency for some models
    "accelerate>=0.20.0,<1.0.0",  # Optional: for model loading optimizations
]
```

Optional extras, not included in the `ml` group:

- CUDA-enabled PyTorch (installed separately based on CUDA version)
- `bitsandbytes` (for 8-bit quantization to reduce memory usage)

#### 2.2 Model Selection

Recommended models for summarization:

**BART-based models** (best for abstractive summarization):

- `facebook/bart-large-cnn`: High quality, ~406M parameters, requires ~2GB GPU memory
- `facebook/bart-large-xsum`: Optimized for extreme summarization, ~406M parameters
- `facebook/bart-base`: Smaller, faster, ~140M parameters, ~500MB GPU memory

**T5-based models** (good for extractive/abstractive hybrid):

- `google/flan-t5-large`: Instruction-tuned, good for structured outputs, ~780M parameters
- `google/flan-t5-base`: Smaller alternative, ~250M parameters

**DistilBART** (lightweight option):

- `sshleifer/distilbart-cnn-12-6`: Smaller BART variant, ~306M parameters, faster inference
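
As a rough cross-check of the memory figures above, weight memory can be estimated as parameter count times bytes per parameter. The helper below is a hypothetical illustration, not part of the proposed module, and it counts weights only; activations and generation buffers add overhead on top.

```python
def estimate_weight_memory_mb(num_parameters: int, bytes_per_parameter: int = 4) -> float:
    """Approximate memory for model weights alone, in megabytes (10**6 bytes).

    bytes_per_parameter: 4 for FP32, 2 for FP16.
    """
    return num_parameters * bytes_per_parameter / 1_000_000


# bart-base at ~140M parameters
bart_base_fp32 = estimate_weight_memory_mb(140_000_000)     # 560.0 MB in FP32
bart_base_fp16 = estimate_weight_memory_mb(140_000_000, 2)  # 280.0 MB in FP16

# flan-t5-large at ~780M parameters
flan_t5_large_fp32 = estimate_weight_memory_mb(780_000_000)  # 3120.0 MB in FP32
```

The FP32 estimate for `facebook/bart-base` is in line with the ~500MB figure quoted above.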

**Default selection logic**:

```python
DEFAULT_SUMMARY_MODELS = {
    "bart-large": "facebook/bart-large-cnn",  # BART-large (best quality, ~2GB memory)
    "bart-small": "facebook/bart-base",  # BART-base (smallest, lowest memory ~500MB, recommended for M4 Pro)
    "fast": "sshleifer/distilbart-cnn-12-6",  # DistilBART (faster, lower memory ~300MB)
}


def select_summary_model(cfg: Config) -> str:
    """Select summary model based on configuration and available resources.

    Optimized for Apple M4 Pro with 48GB RAM:
    - Prefers bart-base or distilbart for memory efficiency
    - Supports MPS backend for GPU acceleration on Apple Silicon
    """
    if cfg.summary_model:
        return cfg.summary_model

    # Auto-select based on available resources.
    # For Apple Silicon (M4 Pro), prefer memory-efficient models.
    ...
```
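
One way the auto-selection could work is sketched below. The hardware-to-model mapping is an assumption for illustration: this RFC only states that bart-base is recommended for the M4 Pro and distilbart for speed. The function `pick_default_model` is hypothetical and takes availability flags as arguments so it runs without PyTorch; in the real module they would likely come from `torch.backends.mps.is_available()` and `torch.cuda.is_available()`.

```python
from typing import Optional

DEFAULT_SUMMARY_MODELS = {
    "bart-large": "facebook/bart-large-cnn",
    "bart-small": "facebook/bart-base",
    "fast": "sshleifer/distilbart-cnn-12-6",
}


def pick_default_model(configured: Optional[str], has_mps: bool, has_cuda: bool) -> str:
    """Return the configured model, or a default chosen from the available backend."""
    if configured:
        # An explicit user choice always wins.
        return configured
    if has_mps:
        # Assumption: Apple Silicon gets the memory-efficient default.
        return DEFAULT_SUMMARY_MODELS["bart-small"]
    if has_cuda:
        # Assumption: a CUDA GPU can afford the larger, higher-quality model.
        return DEFAULT_SUMMARY_MODELS["bart-large"]
    # Assumption: CPU-only machines prioritize speed.
    return DEFAULT_SUMMARY_MODELS["fast"]
```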

The model wrapper selects a device, loads the model and tokenizer, and exposes summary and takeaway generation:

```python
import logging
import os
from pathlib import Path
from typing import Optional, Dict, Any, List

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    pipeline,
    Pipeline,
)

logger = logging.getLogger(__name__)

# Hugging Face cache directory (default location for transformers >= 4.22)
HF_CACHE_DIR = Path.home() / ".cache" / "huggingface" / "hub"


class SummaryModel:
    """Wrapper for local transformer summarization model."""

    def __init__(
        self,
        model_name: str,
        device: Optional[str] = None,
        cache_dir: Optional[str] = None,
    ):
        """Initialize summary model."""
        ...

    def _detect_device(self, device: Optional[str]) -> str:
        """Detect and return appropriate device."""
        ...

    def _load_model(self) -> None:
        """Load model and tokenizer from cache or download."""
        logger.info(f"Loading summarization model: {self.model_name} on {self.device}")
        # Device argument passed to the pipeline:
        # - "cuda" -> 0 (first CUDA device)
        # - "mps" -> "mps" (Apple Silicon)
        # - "cpu" -> -1 (CPU)
        ...

    def summarize(
        self,
        text: str,
        max_length: int = 150,
        min_length: int = 30,
        do_sample: bool = False,
    ) -> str:
        """Generate summary of input text."""
        ...

    def generate_takeaways(
        self,
        text: str,
        max_takeaways: int = 10,
        max_length_per_takeaway: int = 100,
    ) -> List[str]:
        """Generate key takeaways from text."""
        # Strategy: Generate longer summary, then split into bullet points
        ...
```
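
The device handling described in the comments above could look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, the preference order (CUDA, then MPS, then CPU) is not specified by this RFC, and availability is passed in as flags so the example runs without PyTorch installed.

```python
from typing import Optional, Union


def detect_device(requested: Optional[str], has_cuda: bool, has_mps: bool) -> str:
    """Return the requested device, or auto-detect one from availability flags."""
    if requested:
        return requested
    if has_cuda:
        return "cuda"
    if has_mps:
        return "mps"
    return "cpu"


def pipeline_device_arg(device: str) -> Union[int, str]:
    """Map a device name to the value passed as `device` to a transformers pipeline."""
    mapping = {
        "cuda": 0,     # first CUDA device
        "mps": "mps",  # Apple Silicon
        "cpu": -1,     # CPU
    }
    if device not in mapping:
        raise ValueError(f"Unsupported device: {device!r}")
    return mapping[device]
```

In the wrapper itself, the flags would likely be filled from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.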

## Strategy 1: Chunking with Sliding Window

```python
def chunk_text_for_summarization(
    text: str,
    tokenizer: "PreTrainedTokenizerBase",
    chunk_size: int,
    overlap: int = 200,
) -> List[str]:
    """Split long text into overlapping chunks.

    Args:
        text: Input text
        tokenizer: Tokenizer of the summarization model, used for token counts
        chunk_size: Target chunk size in tokens
        overlap: Overlap between chunks in tokens

    Returns:
        List of text chunks
    """
    # Tokenize to get accurate token counts
    tokens = tokenizer.encode(text, add_special_tokens=False)
    ...
```
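
The sliding-window step can be illustrated on a plain token list. This is a minimal sketch: `sliding_window_chunks` is a hypothetical helper that works on any token sequence (whitespace tokens here), whereas the real function would slice tokenizer IDs and decode each window back to text.

```python
from typing import List, Sequence


def sliding_window_chunks(
    tokens: Sequence[str],
    chunk_size: int,
    overlap: int = 200,
) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far each window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


example = sliding_window_chunks("a b c d e f g h i j".split(), chunk_size=4, overlap=1)
```

Each window starts `chunk_size - overlap` tokens after the previous one, so consecutive chunks repeat the last `overlap` tokens and context is not cut at a hard boundary.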

```python
def summarize_long_text(
    model: SummaryModel,
    text: str,
    chunk_size: int = 1024,
    max_length: int = 150,
) -> str:
    """Summarize long text by chunking and combining summaries."""
    ...
```

## Strategy 2: Hierarchical Summarization

```python
def hierarchical_summarize(
    model: SummaryModel,
    text: str,
    levels: int = 2,
    max_length: int = 150,
) -> str:
    """Hierarchical summarization: summarize sections, then summarize summaries.

    Args:
        model: Summary model instance
        text: Input text
        levels: Number of summarization levels
        max_length: Final summary length

    Returns:
        Final summary
    """
    # Split into paragraphs or sections
    ...
```
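
A minimal sketch of the summarize-then-summarize-the-summaries idea follows. The function name, the `summarize` callable (standing in for `SummaryModel.summarize`), the blank-line section split, and the `group_size` parameter are all assumptions made so the example is self-contained.

```python
from typing import Callable, List


def hierarchical_summarize_sketch(
    text: str,
    summarize: Callable[[str], str],
    levels: int = 2,
    group_size: int = 4,
) -> str:
    """Summarize sections, merge the summaries, and repeat for `levels` passes."""
    # Assumption: sections are separated by blank lines.
    sections: List[str] = [p.strip() for p in text.split("\n\n") if p.strip()]
    if not sections:
        return ""

    # Each intermediate level summarizes every section, then merges
    # neighbouring summaries into larger sections for the next level.
    for _ in range(max(levels - 1, 0)):
        if len(sections) <= 1:
            break
        summaries = [summarize(section) for section in sections]
        sections = [
            " ".join(summaries[i:i + group_size])
            for i in range(0, len(summaries), group_size)
        ]

    # Final level: one summary over whatever remains.
    return summarize(" ".join(sections))
```

With `levels=2`, every section is summarized once and the concatenated section summaries are summarized a second time to produce the final result.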

## Strategy 3: Extract, Then Summarize

```python
def extract_then_summarize(  # function name assumed; the original signature line is missing
    model: SummaryModel,
    text: str,
    max_length: int = 150,
) -> str:
    """Extract key sentences first, then summarize extracted content.

    Args:
        model: Summary model instance
        text: Input text
        max_length: Summary length

    Returns:
        Summary
    """
    # Simple extraction: take first and last sentences, plus middle sentences
    sentences = [s.strip() for s in text.split(". ") if s.strip()]

    # Short inputs need no extraction step
    if len(sentences) <= 5:
        return model.summarize(text, max_length=max_length)

    ...
```

```python
def optimize_model_memory(model: SummaryModel) -> None:
    """Optimize model for memory efficiency.

    Supports both CUDA (NVIDIA) and MPS (Apple Silicon) backends.
    """
    if model.device == "cuda":
        # Enable gradient checkpointing (trades compute for memory).
        # Note: this only saves memory during training, not during inference.
        if hasattr(model.model, "gradient_checkpointing_enable"):
            model.model.gradient_checkpointing_enable()

        # Use half precision (FP16) to reduce memory
        model.model = model.model.half()

        # Clear cache
        torch.cuda.empty_cache()
    elif model.device == "mps":
        # MPS may benefit from FP16, but test performance impact
        ...
```

```python
def optimize_for_cpu(model: SummaryModel) -> None:
    """Optimize model for CPU inference."""
    # Note: INT8 quantization via `BitsAndBytesConfig` typically requires a GPU,
    # so it is not applied here. For CPU, use other model optimization techniques.

    # Set number of threads for CPU
    torch.set_num_threads(os.cpu_count() or 4)
```

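```python
def unload_model(model: SummaryModel) -> None:  # function name assumed; the original signature line is missing
    """Release model references and free device memory."""
    model.model = None
    model.tokenizer = None
    model.pipeline = None

    # Clear device-specific cache
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        if hasattr(torch.mps, "empty_cache"):
            torch.mps.empty_cache()
```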

```python
def generate_takeaways_with_prompt(
    model: SummaryModel,
    text: str,
    max_takeaways: int = 10,
) -> List[str]:
    """Generate takeaways using instruction prompt.

    Args:
        model: Summary model (preferably instruction-tuned like flan-t5)
        text: Input text
        max_takeaways: Maximum takeaways

    Returns:
        List of takeaways
    """
    # Construct prompt for instruction-tuned models
    prompt = f"""Summarize the following text and extract {max_takeaways} key takeaways.

Text:
{text}

Key Takeaways:
"""

    # For instruction-tuned models (e.g., flan-t5)
    ...

    # Parse takeaways from result
    ...
```
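
Parsing the generated text into a list could be done along these lines. This is a sketch under assumptions: `parse_takeaways` is a hypothetical helper, and the output format (one bullet or numbered item per line, or a single paragraph) depends on the model and is not specified here.

```python
import re
from typing import List

# Leading bullet ("-", "*", "•") or numbering ("1." / "1)") at the start of a line.
_ITEM_PREFIX = re.compile(r"^\s*(?:[-*•]\s*|\d+[.)]\s+)")
# Whitespace that follows sentence-ending punctuation.
_SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def parse_takeaways(generated: str, max_takeaways: int = 10) -> List[str]:
    """Turn generated text into at most max_takeaways distinct takeaways."""
    takeaways: List[str] = []
    for line in generated.splitlines():
        item = _ITEM_PREFIX.sub("", line).strip()
        if item and item not in takeaways:
            takeaways.append(item)

    # Fallback: the model returned one paragraph, so split it into sentences.
    if len(takeaways) <= 1 and generated.strip():
        takeaways = [s.strip() for s in _SENTENCE_BREAK.split(generated.strip()) if s.strip()]

    return takeaways[:max_takeaways]
```

The sentence-splitting fallback also covers the "generate a longer summary, then split into bullet points" strategy used for models that are not instruction-tuned.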

```python
# In metadata.py or summarizer.py
from datetime import datetime
from pathlib import Path
from typing import List, Optional

from pydantic import BaseModel, field_serializer


class SummaryMetadata(BaseModel):
    """Summary metadata structure."""
    short_summary: str
    key_takeaways: List[str]
    generated_at: datetime
    model_used: str
    provider: str  # "local", "openai"
    word_count: int

    @field_serializer('generated_at')
    def serialize_generated_at(self, value: datetime) -> str:
        return value.isoformat()


# Integration in episode_processor.py or workflow.py
def generate_episode_summary(
    transcript_path: Path,
    cfg: Config,
    summary_model: Optional[SummaryModel] = None,
) -> Optional[SummaryMetadata]:
    """Generate summary for episode transcript."""
    ...
    return SummaryMetadata(
        # ... remaining fields
        provider="local",
        word_count=len(transcript.split()),
    )
```

```python
def safe_summarize(
    model: SummaryModel,
    text: str,
    max_length: int = 150,
) -> str:
    """Safely generate summary with error handling.

    Returns:
        Summary text, or empty string on failure
    """
    try:
        return model.summarize(text, max_length=max_length)
    except (torch.cuda.OutOfMemoryError, RuntimeError) as e:
        # Handle both CUDA and MPS out-of-memory errors
        if "out of memory" in str(e).lower() or "mps" in str(e).lower():
            logger.error(f"Device out of memory during summarization ({model.device}): {e}")
            # Fallback: use CPU or smaller model
            ...
        return ""
```
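
The fallback mentioned in the comment could be structured as an ordered list of attempts. This is a sketch, not the proposed implementation: `summarize_with_fallback` is a hypothetical helper, and each callable stands in for a configured model (for example the GPU model first, then a CPU or smaller model).

```python
from typing import Callable, Sequence


def summarize_with_fallback(
    text: str,
    summarizers: Sequence[Callable[[str], str]],
) -> str:
    """Try each summarizer in order, moving on when one runs out of memory."""
    for summarize in summarizers:
        try:
            return summarize(text)
        except RuntimeError as exc:
            # torch.cuda.OutOfMemoryError is a RuntimeError subclass, so this
            # branch covers CUDA as well as MPS out-of-memory errors.
            if "out of memory" not in str(exc).lower():
                raise  # unrelated failure: do not hide it
    # Every attempt ran out of memory.
    return ""
```

Returning an empty string when every attempt fails matches the contract documented for `safe_summarize`.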

**Model Caching**:

- Subsequent runs load from cache (fast)

**Batch Processing**:

- Process multiple episodes sequentially (batch_size=1) to avoid memory issues
- Can parallelize across episodes if GPU memory allows

**Lazy Loading**:

- Load model only when `generate_summaries=True`
- Unload model after processing to free memory
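
Lazy loading and unloading can be captured in a small holder object. The class below is a hypothetical sketch: `LazyModel` and its `loader` callable are illustrative names, and the loader stands in for constructing a `SummaryModel`.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyModel(Generic[T]):
    """Create the model on first use and allow it to be released afterwards."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._model: Optional[T] = None

    @property
    def loaded(self) -> bool:
        return self._model is not None

    def get(self) -> T:
        # Load only when a summary is actually requested.
        if self._model is None:
            self._model = self._loader()
        return self._model

    def unload(self) -> None:
        # Drop the reference so memory can be reclaimed after processing.
        self._model = None
```

When `generate_summaries` is false, `get()` is never called, so the model is never downloaded or loaded.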

## Testing Strategy

- Unit tests for model loading and summarization
- Integration tests with sample transcripts
- Memory tests for long transcripts
- Error handling tests (missing model, OOM, etc.)
- Performance benchmarks (tokens/second, memory usage)

## Alternatives Considered

### API-Based Solutions (OpenAI)

- **Pros**: Higher quality, no local resources needed
- **Cons**: Privacy concerns, cost, rate limits, internet required
- **Decision**: Support as optional provider, but default to local

### Smaller Models (DistilBART, T5-small)

- **Pros**: Lower memory, faster inference
- **Cons**: Lower quality summaries
- **Decision**: Provide as options, auto-select based on resources

### Quantization (8-bit, 4-bit)

- **Pros**: Significant memory reduction
- **Cons**: Requires `bitsandbytes`, slight quality loss
- **Decision**: Document as advanced option, not default

## Rollout Plan

1. Create RFC-012 document (this document)
2. Implement `summarizer.py` module with local transformer support
3. Integrate with metadata generation pipeline
4. Add configuration options
5. Add tests
6. Update documentation
7. Release as opt-in feature
8. Collect user feedback on quality and performance

## Open Questions

- Should we support multiple local models with quality/speed tradeoffs? (Decision: Yes, with auto-selection based on hardware)
- How to handle very long transcripts (>20k words)? Chunking strategy? (Decision: Chunking with sliding window)
- Should summaries be cached to avoid regeneration? (Decision: No, regenerate on each run, use `--skip-existing` to prevent overwrites)
- Do we need GPU detection and automatic model selection? (Decision: Yes, with MPS support for Apple Silicon)
- Should we support model fine-tuning for podcast-specific summarization? (Future consideration)
- **Apple Silicon Optimization**: What's the best model size/configuration for M4 Pro 48GB? (Decision: bart-base recommended, distilbart for speed)

## References

- Issue #17: Generate short summary and key takeaways from each episode
- Issue #30: Create PRD and RFC for episode summary generation feature
- PRD-004: Per-Episode Metadata Document Generation
- RFC-011: Per-Episode Metadata Document Generation
- Hugging Face Transformers: <https://huggingface.co/docs/transformers/>
- BART Paper: <https://arxiv.org/abs/1910.13461>
- T5 Paper: <https://arxiv.org/abs/1910.10683>