# RFC-012: Episode Summarization Using Local Transformers

- Status: Completed
- Authors:
- Stakeholders: Maintainers, users generating episode summaries, developers integrating summarization
- Related PRDs: `docs/prd/PRD-005-episode-summarization.md` (to be created), `docs/prd/PRD-004-metadata-generation.md`
- Related ADRs:
  - ADR-009: Privacy-First Local Summarization
  - ADR-010: Hierarchical Summarization Pattern
- Related Issues: Issue #17, Issue #30
## Abstract

Design and implement an episode summarization feature that generates concise summaries and key takeaways from episode transcripts using local PyTorch transformer models. This enables privacy-preserving, cost-effective summarization without requiring external API services, while maintaining integration with the existing metadata generation pipeline.
## Problem Statement
Issue #17 describes the need to generate concise summaries and key takeaways from episode transcripts. While API-based LLM services (OpenAI GPT) provide high-quality results, they have drawbacks:
- Privacy concerns: Transcripts sent to external APIs may contain sensitive content
- Cost: API usage incurs per-token costs that scale with transcript length
- Rate limits: API providers enforce rate limits that can slow batch processing
- Dependency: Requires internet connectivity and API key management
- Latency: Network round-trips add latency to processing pipeline
Local transformer models address these concerns by running entirely on-device, providing privacy, predictable hardware-only costs, and no rate limits. However, they require:
- GPU memory or sufficient CPU resources
- Model download and caching
- Careful prompt engineering for quality results
- Memory management for long transcripts
## Constraints & Assumptions
- Summarization must be opt-in (default `false`) for backwards compatibility
- Hardware Constraint: Must run on Apple M4 Pro with 48 GB RAM (primary development/testing platform)
  - Models must be selected and optimized to work within this memory constraint
  - Apple Silicon uses the Metal Performance Shaders (MPS) backend for PyTorch, not CUDA
  - While 48 GB RAM is generous, memory efficiency is still important for concurrent operations
  - Model selection should prioritize models that fit comfortably in available memory (e.g., bart-base ~500MB, distilbart ~300MB)
  - Must support the MPS backend for GPU acceleration on Apple Silicon
- Local models are preferred over API-based solutions for privacy and cost reasons
- Summaries are stored in metadata documents (PRD-004/RFC-011 structure)
- Transcripts can be long (5000-20000+ words); models must handle long inputs efficiently
- Users may have limited GPU memory; CPU fallback must be supported
- Model downloads should be cached and reusable across runs
- Summarization should not block transcript processing; can be async or sequential
- Quality should be reasonable but may be lower than premium API services
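The MPS assumption can be verified up front with standard PyTorch checks; a minimal sketch:

```python
import torch

# Both should report True on an M-series Mac with a current PyTorch build;
# otherwise the feature falls back to CPU inference.
print("MPS built:", torch.backends.mps.is_built())
print("MPS available:", torch.backends.mps.is_available())
```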
## Design & Implementation

### 1. Configuration
Add new configuration fields to `config.Config`:

```python
generate_summaries: bool = False  # Opt-in for backwards compatibility
summary_provider: Literal["local", "openai"] = "local"  # Default to local
summary_model: Optional[str] = None  # Model identifier (e.g., "facebook/bart-large-cnn")
summary_max_length: int = 150  # Max tokens for summary
summary_min_length: int = 30  # Min tokens for summary
summary_max_takeaways: int = 10  # Maximum number of key takeaways
summary_device: Optional[str] = None  # "cuda", "mps", "cpu", or None for auto-detection
summary_batch_size: int = 1  # Batch size for processing (1 = sequential)
summary_chunk_size: Optional[int] = None  # Chunk size for long transcripts (None = no chunking)
summary_cache_dir: Optional[str] = None  # Custom cache directory for models
```

Corresponding CLI flags:
- `--generate-summaries`: Enable summary generation
- `--summary-provider`: Choose provider (`local`, `openai`)
- `--summary-model`: Model identifier (e.g., `facebook/bart-large-cnn`)
- `--summary-max-length`: Maximum summary length in tokens
- `--summary-max-takeaways`: Maximum number of key takeaways
- `--summary-device`: Force device (`cuda`, `mps`, `cpu`, or `auto` for auto-detection)
- `--summary-chunk-size`: Chunk size for long transcripts (default: model max length)
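A hedged configuration sketch (the `Config` import path and constructor call are illustrative; the field names match the additions above):

```python
from config import Config  # project config module (import path assumed)

cfg = Config(
    generate_summaries=True,             # opt in explicitly; the default is False
    summary_provider="local",
    summary_model="facebook/bart-base",  # small model recommended for the M4 Pro target
    summary_device=None,                 # auto-detect: CUDA, then MPS, then CPU
)
```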
### 2. Local Transformer Integration
#### 2.1 Dependency Management
Add dependencies to `pyproject.toml` in the `[project.optional-dependencies.ml]` section:
```toml
# pyproject.toml [project.optional-dependencies.ml]
"torch>=2.0.0,<3.0.0", # PyTorch core
"transformers>=4.30.0,<5.0.0", # Hugging Face Transformers library
"sentencepiece>=0.1.99,<1.0.0", # Tokenizer dependency for some models
"accelerate>=0.20.0,<1.0.0", # Optional: for model loading optimizations
```

Optional additions, depending on hardware:

- CUDA-enabled PyTorch (installed separately based on CUDA version)
- `bitsandbytes` (for 8-bit quantization to reduce memory usage)
#### 2.2 Model Selection
Recommended models for summarization:
**BART-based models** (best for abstractive summarization):
- `facebook/bart-large-cnn`: High quality, ~560M parameters, requires ~2GB GPU memory
- `facebook/bart-large-xsum`: Optimized for extreme summarization, ~560M parameters
- `facebook/bart-base`: Smaller, faster, ~140M parameters, ~500MB GPU memory
**T5-based models** (good for extractive/abstractive hybrid):
- `google/flan-t5-large`: Instruction-tuned, good for structured outputs, ~780M parameters
- `google/flan-t5-base`: Smaller alternative, ~250M parameters
**DistilBART** (lightweight option):
- `sshleifer/distilbart-cnn-12-6`: Smaller BART variant, ~260M parameters, faster inference
**Default selection logic**:
```python
DEFAULT_SUMMARY_MODELS = {
    "bart-large": "facebook/bart-large-cnn",  # BART-large (best quality, ~2GB memory)
    "bart-small": "facebook/bart-base",  # BART-base (smallest, ~500MB, recommended for M4 Pro)
    "fast": "sshleifer/distilbart-cnn-12-6",  # DistilBART (faster, lower memory ~300MB)
}


def select_summary_model(cfg: Config) -> str:
    """Select summary model based on configuration and available resources.

    Optimized for Apple M4 Pro with 48GB RAM:
    - Prefers bart-base or distilbart for memory efficiency
    - Supports MPS backend for GPU acceleration on Apple Silicon
    """
    if cfg.summary_model:
        return cfg.summary_model
    # Auto-select based on available resources.
    # For Apple Silicon (M4 Pro), prefer memory-efficient models.
    if torch.backends.mps.is_available():
        return DEFAULT_SUMMARY_MODELS["bart-small"]
    if torch.cuda.is_available():
        return DEFAULT_SUMMARY_MODELS["bart-large"]
    return DEFAULT_SUMMARY_MODELS["fast"]
```
#### 2.3 Summarizer Module

Core implementation (per the rollout plan, this lives in a new `summarizer.py` module):

```python
import logging
import os
from pathlib import Path
from typing import Optional, Dict, Any, List

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    pipeline,
    Pipeline,
)

logger = logging.getLogger(__name__)

# Hugging Face cache directory (standard location)
HF_CACHE_DIR = Path.home() / ".cache" / "huggingface" / "transformers"


class SummaryModel:
    """Wrapper for local transformer summarization model."""

    def __init__(
        self,
        model_name: str,
        device: Optional[str] = None,
        cache_dir: Optional[str] = None,
    ):
        """Initialize summary model and load it immediately."""
        self.model_name = model_name
        self.device = self._detect_device(device)
        self.cache_dir = cache_dir or str(HF_CACHE_DIR)
        self.tokenizer = None
        self.model = None
        self.pipeline: Optional[Pipeline] = None
        self._load_model()

    def _detect_device(self, device: Optional[str]) -> str:
        """Detect and return appropriate device (explicit choice > CUDA > MPS > CPU)."""
        if device and device != "auto":
            return device
        if torch.cuda.is_available():
            return "cuda"
        if torch.backends.mps.is_available():
            return "mps"
        return "cpu"

    def _load_model(self) -> None:
        """Load model and tokenizer from cache or download."""
        try:
            logger.info(f"Loading summarization model: {self.model_name} on {self.device}")
            self.tokenizer = AutoTokenizer.from_pretrained(self.model_name, cache_dir=self.cache_dir)
            self.model = AutoModelForSeq2SeqLM.from_pretrained(self.model_name, cache_dir=self.cache_dir)
            # Map our device string to the pipeline() device argument:
            # - "cuda" -> 0 (first CUDA device)
            # - "mps" -> "mps" (Apple Silicon)
            # - "cpu" -> -1 (CPU)
            device_arg: Any = {"cuda": 0, "mps": "mps", "cpu": -1}.get(self.device, self.device)
            self.pipeline = pipeline(
                "summarization",
                model=self.model,
                tokenizer=self.tokenizer,
                device=device_arg,
            )
        except Exception:
            logger.exception(f"Failed to load summarization model: {self.model_name}")
            raise

    def summarize(
        self,
        text: str,
        max_length: int = 150,
        min_length: int = 30,
        do_sample: bool = False,
    ) -> str:
        """Generate summary of input text."""
        result = self.pipeline(
            text,
            max_length=max_length,
            min_length=min_length,
            do_sample=do_sample,
            truncation=True,  # truncate inputs beyond the model's max length
        )
        return result[0]["summary_text"].strip()

    def generate_takeaways(
        self,
        text: str,
        max_takeaways: int = 10,
        max_length_per_takeaway: int = 100,
    ) -> List[str]:
        """Generate key takeaways from text."""
        # Strategy: Generate longer summary, then split into bullet points
        long_summary = self.summarize(
            text,
            max_length=min(512, max_takeaways * max_length_per_takeaway),
        )
        sentences = [s.strip() for s in long_summary.split(". ") if s.strip()]
        return sentences[:max_takeaways]
```
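End-to-end usage of the wrapper (model choice and transcript path are placeholders):

```python
model = SummaryModel("sshleifer/distilbart-cnn-12-6")  # lightweight option (~300MB)
transcript = Path("transcripts/episode_001.txt").read_text()  # hypothetical path
print(model.summarize(transcript, max_length=150, min_length=30))
print(model.generate_takeaways(transcript, max_takeaways=5))
```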
### 3. Long Transcript Handling

Transcripts routinely exceed model input limits (BART models accept at most 1024 input tokens), so three strategies are provided.

#### Strategy 1: Chunking with Sliding Window
```python
def chunk_text_for_summarization(
    text: str,
    chunk_size: int,
    overlap: int = 200,
    *,
    tokenizer,
) -> List[str]:
    """Split long text into overlapping chunks.

    Args:
        text: Input text
        chunk_size: Target chunk size in tokens
        overlap: Overlap between chunks in tokens
        tokenizer: Tokenizer for accurate token counts (parameter added so the
            snippet is self-contained; pass `model.tokenizer`)

    Returns:
        List of text chunks
    """
    # Tokenize to get accurate token counts
    tokens = tokenizer.encode(text, add_special_tokens=False)
    chunks: List[str] = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokenizer.decode(tokens[start:end]))
        if end == len(tokens):
            break
        start = end - overlap  # slide the window back so chunks overlap
    return chunks
```
```python
def summarize_long_text(
    model: SummaryModel,
    text: str,
    chunk_size: int = 1024,
    max_length: int = 150,
) -> str:
    """Summarize long text by chunking and combining summaries."""
    chunks = chunk_text_for_summarization(text, chunk_size, tokenizer=model.tokenizer)
    chunk_summaries = [model.summarize(chunk, max_length=max_length) for chunk in chunks]
    if len(chunk_summaries) == 1:
        return chunk_summaries[0]
    # Combine the per-chunk summaries, then summarize the combination
    return model.summarize(" ".join(chunk_summaries), max_length=max_length)
```
#### Strategy 2: Hierarchical Summarization

```python
def hierarchical_summarize(
    model: SummaryModel,
    text: str,
    levels: int = 2,
    max_length: int = 150,
) -> str:
    """Hierarchical summarization: summarize sections, then summarize summaries.

    Args:
        model: Summary model instance
        text: Input text
        levels: Number of summarization levels
        max_length: Final summary length

    Returns:
        Final summary
    """
    # Split into paragraphs or sections
    sections = [p.strip() for p in text.split("\n\n") if p.strip()]
    if len(sections) <= 1:
        return model.summarize(text, max_length=max_length)
    current = sections
    for _ in range(levels - 1):
        # Intermediate levels: summarize each piece individually
        current = [model.summarize(piece, max_length=max_length) for piece in current]
    # Final level: summarize the combined intermediate summaries
    return model.summarize(" ".join(current), max_length=max_length)
```
#### Strategy 3: Extract-then-Summarize

```python
def extract_then_summarize(
    model: SummaryModel,
    text: str,
    max_length: int = 150,
) -> str:
    """Extract key sentences first, then summarize extracted content.

    Args:
        model: Summary model instance
        text: Input text
        max_length: Summary length

    Returns:
        Summary
    """
    # Simple extraction: take first and last sentences, plus middle sentences
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    if len(sentences) <= 5:
        return model.summarize(text, max_length=max_length)
    middle = len(sentences) // 2
    extracted = sentences[:2] + sentences[middle - 1 : middle + 2] + sentences[-2:]
    return model.summarize(". ".join(extracted) + ".", max_length=max_length)
```
### 4. Memory Management

```python
def optimize_model_memory(model: SummaryModel) -> None:
    """Optimize model for memory efficiency.

    Supports both CUDA (NVIDIA) and MPS (Apple Silicon) backends.
    """
    if model.device == "cuda":
        # Enable gradient checkpointing (trades compute for memory)
        if hasattr(model.model, "gradient_checkpointing_enable"):
            model.model.gradient_checkpointing_enable()
        # Use half precision (FP16) to reduce memory
        model.model = model.model.half()
        # Clear cache
        torch.cuda.empty_cache()
    elif model.device == "mps":
        # MPS may benefit from FP16, but test performance impact
        model.model = model.model.half()
        if hasattr(torch.mps, "empty_cache"):
            torch.mps.empty_cache()
```
```python
def optimize_for_cpu(model: SummaryModel) -> None:
    """Optimize model for CPU inference."""
    # Use INT8 quantization if available
    try:
        from transformers import BitsAndBytesConfig
        # Note: INT8 quantization typically requires GPU
        # For CPU, use model optimization techniques
        pass
    except ImportError:
        pass
    # Set number of threads for CPU
    torch.set_num_threads(os.cpu_count() or 4)
```
```python
def unload_model(model: SummaryModel) -> None:
    """Unload model and free device memory."""
    model.model = None
    model.tokenizer = None
    model.pipeline = None
    # Clear device-specific cache
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        if hasattr(torch.mps, "empty_cache"):
            torch.mps.empty_cache()
```
### 5. Takeaway Generation with Instruction Prompts

```python
def generate_takeaways_with_prompt(
    model: SummaryModel,
    text: str,
    max_takeaways: int = 10,
) -> List[str]:
    """Generate takeaways using instruction prompt.

    Args:
        model: Summary model (preferably instruction-tuned like flan-t5)
        text: Input text
        max_takeaways: Maximum takeaways

    Returns:
        List of takeaways
    """
    # Construct prompt for instruction-tuned models
    prompt = f"""Summarize the following text and extract {max_takeaways} key takeaways.
Text:
{text}
Key Takeaways:
"""
    # For instruction-tuned models (e.g., flan-t5), run the prompt through the pipeline
    result = model.pipeline(prompt, max_length=512, truncation=True)[0]["summary_text"]
    # Parse takeaways from result: one takeaway per line or bullet
    takeaways = [line.lstrip("-*• ").strip() for line in result.splitlines() if line.strip()]
    return takeaways[:max_takeaways]
```
### 6. Metadata Integration

```python
# In metadata.py or summarizer.py
from datetime import datetime
from pydantic import BaseModel, field_serializer


class SummaryMetadata(BaseModel):
    """Summary metadata structure."""

    short_summary: str
    key_takeaways: List[str]
    generated_at: datetime
    model_used: str
    provider: str  # "local", "openai"
    word_count: int

    @field_serializer('generated_at')
    def serialize_generated_at(self, value: datetime) -> str:
        return value.isoformat()


# Integration in episode_processor.py or workflow.py
def generate_episode_summary(
    transcript_path: Path,
    cfg: Config,
    summary_model: Optional[SummaryModel] = None,
) -> Optional[SummaryMetadata]:
    """Generate summary for episode transcript."""
    if not cfg.generate_summaries:
        return None
    transcript = transcript_path.read_text()
    model = summary_model or SummaryModel(
        select_summary_model(cfg),
        device=cfg.summary_device,
        cache_dir=cfg.summary_cache_dir,
    )
    return SummaryMetadata(
        short_summary=model.summarize(
            transcript,
            max_length=cfg.summary_max_length,
            min_length=cfg.summary_min_length,
        ),
        key_takeaways=model.generate_takeaways(transcript, max_takeaways=cfg.summary_max_takeaways),
        generated_at=datetime.utcnow(),
        model_used=model.model_name,
        provider="local",
        word_count=len(transcript.split()),
    )
```
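The resulting structure serializes straight into the metadata document; a minimal sketch using pydantic v2's JSON dump (the transcript path is a placeholder):

```python
meta = generate_episode_summary(Path("transcripts/episode_001.txt"), cfg)
if meta is not None:
    print(meta.model_dump_json(indent=2))  # generated_at rendered via the field_serializer above
```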
### 7. Error Handling

```python
def safe_summarize(
    model: SummaryModel,
    text: str,
    max_length: int = 150,
) -> str:
    """Safely generate summary with error handling.

    Returns:
        Summary text, or empty string on failure
    """
    try:
        return model.summarize(text, max_length=max_length)
    except (torch.cuda.OutOfMemoryError, RuntimeError) as e:
        # Handle both CUDA and MPS out-of-memory errors
        if "out of memory" in str(e).lower() or "mps" in str(e).lower():
            logger.error(f"Device out of memory during summarization ({model.device}): {e}")
            # Fallback: use CPU or smaller model
            try:
                cpu_model = SummaryModel(model.model_name, device="cpu")
                return cpu_model.summarize(text, max_length=max_length)
            except Exception:
                return ""
        logger.error(f"Summarization failed: {e}")
        return ""
```
## Performance Considerations

**Model Caching**:

- Models are downloaded once to the Hugging Face cache directory
- Subsequent runs load from cache (fast)

**Batch Processing**:

- Process multiple episodes sequentially (batch_size=1) to avoid memory issues
- Can parallelize across episodes if GPU memory allows

**Lazy Loading**:

- Load model only when `generate_summaries=True`
- Unload model after processing to free memory (see the sketch below)
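A minimal lazy-loading sketch (the context-manager shape is an illustration, not a decided API):

```python
from contextlib import contextmanager

@contextmanager
def summary_model_session(cfg: Config):
    """Load the model only when summaries are enabled; always unload afterwards."""
    if not cfg.generate_summaries:
        yield None
        return
    model = SummaryModel(select_summary_model(cfg), device=cfg.summary_device)
    try:
        yield model
    finally:
        unload_model(model)  # free device memory between batches
```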
## Testing Strategy
- Unit tests for model loading and summarization
- Integration tests with sample transcripts
- Memory tests for long transcripts
- Error handling tests (missing model, OOM, etc.)
- Performance benchmarks (tokens/second, memory usage)
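A hedged unit-test sketch for the first bullet (the stubbed pipeline avoids any model download; the module path is assumed):

```python
from summarizer import SummaryModel  # assumed module path

def test_summarize_returns_trimmed_text():
    model = SummaryModel.__new__(SummaryModel)  # bypass __init__: no real model load
    model.device = "cpu"
    model.pipeline = lambda text, **kwargs: [{"summary_text": "  A short summary.  "}]
    assert model.summarize("a long transcript ...") == "A short summary."
```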
## Alternatives Considered
### API-Based Solutions (OpenAI)
- **Pros**: Higher quality, no local resources needed
- **Cons**: Privacy concerns, cost, rate limits, internet required
- **Decision**: Support as optional provider, but default to local
### Smaller Models (DistilBART, T5-small)
- **Pros**: Lower memory, faster inference
- **Cons**: Lower quality summaries
- **Decision**: Provide as options, auto-select based on resources
### Quantization (8-bit, 4-bit)
- **Pros**: Significant memory reduction
- **Cons**: Requires `bitsandbytes`, slight quality loss
- **Decision**: Document as advanced option, not default
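For reference, a hedged 8-bit loading sketch (requires `bitsandbytes` and a CUDA device; not the default path):

```python
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

# Roughly halves GPU memory versus FP16, with a small quality penalty
model_8bit = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/bart-large-cnn",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```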
## Rollout Plan
1. Create RFC-012 document (this document)
2. Implement `summarizer.py` module with local transformer support
3. Integrate with metadata generation pipeline
4. Add configuration options
5. Add tests
6. Update documentation
7. Release as opt-in feature
8. Collect user feedback on quality and performance
## Open Questions
- Should we support multiple local models with quality/speed tradeoffs? (Decision: Yes, with auto-selection based on hardware)
- How to handle very long transcripts (>20k words)? Chunking strategy? (Decision: Chunking with sliding window)
- Should summaries be cached to avoid regeneration? (Decision: No, regenerate on each run, use `--skip-existing` to prevent overwrites)
- Do we need GPU detection and automatic model selection? (Decision: Yes, with MPS support for Apple Silicon)
- Should we support model fine-tuning for podcast-specific summarization? (Future consideration)
- **Apple Silicon Optimization**: What's the best model size/configuration for M4 Pro 48GB? (Decision: bart-base recommended, distilbart for speed)
## References
- Issue #17: Generate short summary and key takeaways from each episode
- Issue #30: Create PRD and RFC for episode summary generation feature
- PRD-004: Per-Episode Metadata Document Generation
- RFC-011: Per-Episode Metadata Document Generation
- Hugging Face Transformers: <https://huggingface.co/docs/transformers/>
- BART Paper: <https://arxiv.org/abs/1910.13461>
- T5 Paper: <https://arxiv.org/abs/1910.10683>