PRD-004: Per-Episode Metadata Document Generation

Status: ✅ Implemented (v2.2.0)
Related RFCs: RFC-011, RFC-012

Summary

Generate comprehensive metadata documents for each episode alongside transcripts, capturing feed-level and episode-level information for search, analytics, integration, and archival use cases.

Background & Context

Currently, podcast_scraper focuses on downloading transcripts but doesn't systematically capture and persist rich metadata about feeds and episodes. This metadata is valuable for:

Search and discovery: Finding episodes by guest names, topics, dates
Analytics: Understanding feed patterns, guest frequency, publication schedules
Integration: Enabling other tools to consume structured episode data
Database ingestion: Loading metadata directly into databases (PostgreSQL, MongoDB, Elasticsearch, ClickHouse) without transformation code
Archival: Preserving complete episode context alongside transcripts
Future features: Episode categorization, recommendation systems, summarization (PRD-005)

Goals

Generate structured metadata documents (JSON/YAML) for each processed episode
Capture comprehensive feed and episode information in a machine-readable format
Integrate with existing pipeline without disrupting current workflows
Enable downstream tools to consume episode metadata programmatically
Database-friendly format: Design schema for direct ingestion into PostgreSQL (JSONB), MongoDB, Elasticsearch, and ClickHouse without transformation code
Preserve metadata alongside transcripts for complete episode records

Non-Goals

Real-time metadata API (future consideration)
Metadata versioning/migration (handled via schema versioning)
Metadata search indexing (consumers can build their own indexes)
Metadata editing/updates after generation (one-way generation)

Personas

Archivist Ava: Needs complete episode records with metadata for compliance and legal review
Researcher Riley: Wants structured data for NLP analysis and pattern detection across podcast series
Developer Devin: Building tools that consume episode metadata for search, recommendations, or analytics
Analyst Alex: Analyzing podcast patterns, guest frequency, publication schedules

User Stories

As Archivist Ava, I can generate metadata documents alongside transcripts to create complete episode records.
As Researcher Riley, I can consume structured metadata JSON to analyze patterns across multiple podcast feeds.
As Developer Devin, I can integrate episode metadata into my application without parsing RSS feeds directly.
As Analyst Alex, I can aggregate metadata from multiple runs to understand feed evolution over time.
As Database Developer Dana, I can load metadata JSON files directly into PostgreSQL JSONB columns, MongoDB collections, Elasticsearch indices, or ClickHouse tables without writing transformation code.
As any operator, I can opt in to metadata generation via a configuration flag.
As any operator, I can choose JSON or YAML format based on my preference (JSON recommended for database ingestion).

Functional Requirements

FR1: Metadata Generation Control

FR1.1: Add generate_metadata config field (default false for backwards compatibility)
FR1.2: Add --generate-metadata CLI flag
FR1.3: Add metadata_format config field ("json" or "yaml", default "json")
FR1.4: Metadata generation respects --dry-run mode (logs planned metadata without writing files)
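
The FR1 controls could be wired together roughly as follows. This is a minimal sketch, not the actual podcast_scraper configuration model: the `MetadataOptions` class and the `parse_options` helper are hypothetical, and the `--metadata-format` CLI flag is an assumption (the PRD defines only the `metadata_format` config field, plus the `--generate-metadata` and `--dry-run` flags).

```python
import argparse
from dataclasses import dataclass

ALLOWED_FORMATS = ("json", "yaml")


@dataclass(frozen=True)
class MetadataOptions:
    """Hypothetical container for the FR1 settings."""

    generate_metadata: bool = False  # FR1.1: opt-in, off by default
    metadata_format: str = "json"    # FR1.3: "json" or "yaml"
    dry_run: bool = False            # FR1.4: log planned output, write nothing

    def __post_init__(self):
        # Reject unknown formats early instead of failing at write time.
        if self.metadata_format not in ALLOWED_FORMATS:
            raise ValueError(f"metadata_format must be one of {ALLOWED_FORMATS}")


def parse_options(argv):
    """Parse CLI arguments into a MetadataOptions instance."""
    parser = argparse.ArgumentParser(prog="podcast_scraper")
    parser.add_argument("--generate-metadata", action="store_true")
    # Assumed flag: the PRD names only the config field for the format.
    parser.add_argument("--metadata-format", choices=ALLOWED_FORMATS, default="json")
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args(argv)
    return MetadataOptions(
        generate_metadata=args.generate_metadata,
        metadata_format=args.metadata_format,
        dry_run=args.dry_run,
    )
```

With no arguments, `parse_options([])` yields the backwards-compatible defaults: generation disabled and JSON selected.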

FR2: Feed-Level Metadata

FR2.1: Capture feed title
FR2.2: Capture feed URL
FR2.3: Capture feed description (if available)
FR2.4: Capture feed language (from config or detected)
FR2.5: Capture feed authors (from RSS author tags)
FR2.6: Capture feed image/logo URL (if available)
FR2.7: Capture feed last updated date (if available)

FR3: Episode-Level Metadata

FR3.1: Capture episode title
FR3.2: Capture episode description/summary (HTML-stripped)
FR3.3: Capture episode published date (parsed and normalized)
FR3.4: Capture episode GUID/ID (if available)
FR3.5: Capture episode link/URL (if available)
FR3.6: Capture episode duration (if available)
FR3.7: Capture episode number/sequence (if available)
FR3.8: Capture episode image/artwork URL (if available)
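
FR3.2 and FR3.3 call for HTML-stripped descriptions and normalized publication dates. One way to do both with only the standard library is sketched below. The helper names `strip_html` and `normalize_pub_date` are hypothetical, and normalizing to UTC is an assumption; the PRD requires only that dates be parsed and normalized.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "br", "div", "li"}


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops all markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Keep a word boundary where a block-level element begins.
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw):
    """Return the text content of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join("".join(parser.parts).split())


def normalize_pub_date(raw):
    """Convert an RSS (RFC 2822 style) date to an ISO 8601 UTC string, or None."""
    if not raw:
        return None
    try:
        parsed = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        return None  # unparseable dates are treated as unavailable
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return parsed.astimezone(timezone.utc).isoformat()
```

For example, `normalize_pub_date("Tue, 05 Mar 2024 14:30:00 +0100")` returns `"2024-03-05T13:30:00+00:00"`.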

FR4: Content Metadata

FR4.1: Capture transcript URLs with types/formats (from Podcasting 2.0 tags)
FR4.2: Capture media URL (enclosure)
FR4.3: Capture media type/MIME type
FR4.4: Capture detected guest names (from RFC-010 speaker detection)
FR4.5: Capture detected host names (from RFC-010 speaker detection)
FR4.6: Capture transcript source ("direct_download" or "whisper_transcription")
FR4.7: Capture Whisper model used (if applicable)
FR4.8: Capture transcript file path (relative to output directory)

FR5: Processing Metadata

FR5.1: Capture processing timestamp (ISO 8601 format)
FR5.2: Capture output directory path
FR5.3: Capture run ID (if applicable)
FR5.4: Capture processing configuration snapshot (selected config fields)
FR5.5: Capture schema version for future compatibility
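
The processing block described by FR5 might be assembled as in the sketch below. All field names (`processing_timestamp`, `output_directory`, `run_id`, `config_snapshot`, `schema_version`) and the `SNAPSHOT_FIELDS` selection are illustrative assumptions; the authoritative names belong to the schema defined in RFC-011.

```python
from datetime import datetime, timezone

SCHEMA_VERSION = "1.0.0"  # initial version per the Schema Versioning decision

# Hypothetical allow-list: only these config fields are copied into the snapshot.
SNAPSHOT_FIELDS = ("metadata_format", "whisper_model", "language")


def build_processing_metadata(config, output_dir, run_id=None, now=None):
    """Build the processing section of a metadata document (FR5.1 to FR5.5)."""
    now = now or datetime.now(timezone.utc)
    return {
        "processing_timestamp": now.isoformat(),  # FR5.1: ISO 8601 string
        "output_directory": str(output_dir),      # FR5.2
        "run_id": run_id,                         # FR5.3: None when not applicable
        # FR5.4: selected fields only, so unrelated or sensitive settings stay out.
        "config_snapshot": {k: config[k] for k in SNAPSHOT_FIELDS if k in config},
        "schema_version": SCHEMA_VERSION,         # FR5.5
    }
```

Passing `now` explicitly keeps the function deterministic in tests; in normal use it defaults to the current UTC time.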

FR6: File Storage

FR6.1: Store metadata files in same directory as transcripts (by default)
FR6.2: Use naming convention: <episode_number> - <title>.metadata.json or .metadata.yaml
FR6.3: Respect --skip-existing semantics (skip metadata generation if file exists)
FR6.4: Support optional separate metadata/ subdirectory (configurable)
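
The storage rules above can be sketched as two small helpers. Both function names are hypothetical. The sketch assumes `episode_number` and `title` arrive already formatted and filesystem-safe, as they would be for the transcript file name, and that dry-run mode suppresses all writes.

```python
from pathlib import Path


def metadata_path(output_dir, episode_number, title, fmt="json", subdir=None):
    """Return the metadata file path following the FR6.2 naming convention."""
    # FR6.1 / FR6.4: same directory as transcripts unless a subdirectory is configured.
    base = Path(output_dir) / subdir if subdir else Path(output_dir)
    return base / f"{episode_number} - {title}.metadata.{fmt}"


def should_write(path, skip_existing, dry_run):
    """Decide whether a metadata file should be written to disk."""
    if dry_run:
        return False  # FR1.4: log the planned file instead of writing it
    if skip_existing and Path(path).exists():
        return False  # FR6.3: leave existing metadata untouched
    return True
```

For instance, `metadata_path("out", "0001", "Pilot", fmt="yaml", subdir="metadata")` resolves to `out/metadata/0001 - Pilot.metadata.yaml`.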

FR7: Schema & Format

FR7.1: Define JSON Schema for metadata structure
FR7.2: Support JSON format (machine-readable, default, database-friendly)
FR7.3: Support YAML format (human-readable, optional)
FR7.4: Include schema version field for future evolution
FR7.5: Validate metadata structure before writing
FR7.6: Use database-friendly conventions:
Field names in snake_case (compatible with SQL and NoSQL databases)
Date/time fields as ISO 8601 strings (not datetime objects)
Consistent data types across all fields
Arrays for multi-value fields (hosts, guests, transcript URLs)
Nested objects for logical grouping (feed, episode, content, processing)
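
To make the FR7.6 conventions concrete, the sketch below shows a small example document and a hand-rolled structural check that runs before writing (FR7.5). It is not the JSON Schema called for in FR7.1, only a standard-library stand-in. The individual field names, and the placement of `schema_version` under `processing`, are assumptions for illustration.

```python
from datetime import datetime

REQUIRED_SECTIONS = ("feed", "episode", "content", "processing")
ARRAY_FIELDS = ("hosts", "guests", "transcript_urls")

# Illustrative document: snake_case keys, ISO 8601 strings, arrays, nested groups.
example = {
    "feed": {"title": "Example Feed", "url": "https://example.com/feed.xml"},
    "episode": {"title": "Pilot", "published_date": "2024-03-05T13:30:00+00:00"},
    "content": {
        "hosts": ["Host Name"],
        "guests": [],
        "transcript_urls": [],
        "transcript_source": "direct_download",
    },
    "processing": {"schema_version": "1.0.0"},
}


def validate_metadata(doc):
    """Return a list of problems; an empty list means the document passed."""
    errors = []
    for section in REQUIRED_SECTIONS:
        if not isinstance(doc.get(section), dict):
            errors.append(f"missing or non-object section: {section}")
    if errors:
        return errors  # field-level checks need all sections present
    for key in ARRAY_FIELDS:
        if not isinstance(doc["content"].get(key, []), list):
            errors.append(f"content.{key} must be an array")
    published = doc["episode"].get("published_date")
    if published is not None:
        try:
            datetime.fromisoformat(published)
        except (TypeError, ValueError):
            errors.append("episode.published_date must be an ISO 8601 string")
    if not isinstance(doc["processing"].get("schema_version"), str):
        errors.append("processing.schema_version must be a string")
    return errors
```

Returning a list of problems rather than raising on the first one lets the caller log every issue for an episode in a single pass.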

FR8: Integration Points

FR8.1: Generate metadata during episode processing workflow
FR8.2: Integrate with RFC-010 speaker detection (populate host/guest names)
FR8.3: Integrate with RFC-004 filesystem layout (use same output directory structure)
FR8.4: Integrate with PRD-001 transcript pipeline (capture transcript URLs)
FR8.5: Integrate with PRD-002 Whisper fallback (capture Whisper model info)

Success Metrics

Metadata files generated for 100% of processed episodes when feature enabled
Metadata schema validates successfully for all generated files
Zero impact on existing transcript download/transcription workflows when disabled
Metadata files consumable by standard JSON/YAML parsers
Database ingestion: Metadata files can be loaded directly into PostgreSQL JSONB, MongoDB, Elasticsearch, and ClickHouse without transformation code
Processing time increase <5% when metadata generation enabled

Dependencies

RFC-010: Automatic Speaker Name Detection (populates host/guest names)
RFC-004: Filesystem Layout & Run Management (output directory structure)
PRD-001: Transcript Acquisition Pipeline (transcript URLs)
PRD-002: Whisper Fallback Transcription (Whisper model info)
Current models: podcast_scraper.models package (Episode, RssFeed)

Design Considerations

Format Selection

JSON: Machine-readable, widely supported, smaller file size
YAML: Human-readable, easier to edit manually, larger file size
Decision: Support both; default to JSON for its smaller file size and suitability for database ingestion

Storage Location

Same directory as transcripts: Simple, keeps related files together
Separate metadata/ subdirectory: Cleaner separation, easier to exclude from searches
Decision: Default to same directory, allow configurable subdirectory

Optional vs Required

Opt-in (default false): Backwards compatible, doesn't affect existing users
Opt-out (default true): More useful by default, but changes behavior
Decision: Opt-in for backwards compatibility

Schema Versioning

Version field: Enables future schema evolution without breaking consumers
Semantic versioning: Major.minor.patch format
Decision: Start with version 1.0.0, increment major for breaking changes
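
A consumer could apply the versioning decision with a check along these lines. This is a sketch under the stated rule that only a major-version change is breaking; `parse_version` and `is_compatible` are hypothetical helper names.

```python
SUPPORTED_MAJOR = 1  # this consumer understands schema 1.x.y


def parse_version(version):
    """Split a major.minor.patch string into a tuple of three integers."""
    parts = version.split(".")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError(f"not a major.minor.patch version: {version!r}")
    return tuple(int(part) for part in parts)


def is_compatible(version, supported_major=SUPPORTED_MAJOR):
    """Minor and patch bumps are non-breaking; a different major version is not."""
    major, _minor, _patch = parse_version(version)
    return major == supported_major
```

Under this rule a reader built for 1.0.0 accepts a 1.4.0 document and rejects a 2.0.0 document.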

Database Integration Design

Unified JSON format: Single JSON schema works across all target databases (PostgreSQL, MongoDB, Elasticsearch, ClickHouse) without format variations
Database-friendly conventions: Schema design ensures direct ingestion capability (see RFC-011 for technical details)
No transformation required: Metadata files can be loaded directly into target databases without custom code

Database Integration

Metadata files are designed to be directly loadable into common databases (PostgreSQL, MongoDB, Elasticsearch, ClickHouse) without transformation code. The unified JSON format with snake_case field names and ISO 8601 date serialization ensures compatibility across all target databases.

For detailed technical examples and database-specific loading instructions, see RFC-011.
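
As a rough illustration of loading documents unchanged, the sketch below stores each metadata document as-is in a single column and reads it back. It uses SQLite from the Python standard library purely as a stand-in so the example is self-contained; it is not one of the target databases, and the table name and helper functions are hypothetical. The same shape (one document per row) would map onto a PostgreSQL JSONB column, for which RFC-011 remains the reference.

```python
import json
import sqlite3


def load_documents(conn, documents):
    """Insert each metadata document unchanged as one row of JSON text."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS episode_metadata ("
        "id INTEGER PRIMARY KEY, doc TEXT NOT NULL)"
    )
    rows = [(json.dumps(doc),) for doc in documents]
    conn.executemany("INSERT INTO episode_metadata (doc) VALUES (?)", rows)
    conn.commit()
    return len(rows)


def fetch_documents(conn):
    """Read all stored documents back in insertion order."""
    cursor = conn.execute("SELECT doc FROM episode_metadata ORDER BY id")
    return [json.loads(row[0]) for row in cursor]
```

In practice the documents would come from `json.load` over the generated `.metadata.json` files; no field renaming or date conversion is applied on the way in.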

Open Questions

Should metadata include full transcript text or just references? (Decision: References only, transcripts are separate files)
Should metadata be generated for episodes without transcripts? (Decision: Yes, metadata is independent of transcript availability)
How should metadata updates for existing episodes be handled? (Decision: Regenerate on each run; use --skip-existing to prevent overwrites)
Should metadata include checksums/hashes? (Future consideration)
Do we need database-specific format variations? (Decision: No, unified JSON with snake_case and ISO 8601 dates works universally)

Related Documents

RFC-010: Automatic Speaker Name Detection (will populate guest/host names)
RFC-004: Filesystem Layout & Run Management (output directory structure)
PRD-001: Transcript Acquisition Pipeline (current transcript workflow)
PRD-002: Whisper Fallback Transcription (Whisper integration)
PRD-005: Episode Summarization (future use case for metadata)

Release Checklist

[ ] PRD reviewed and approved
[ ] RFC-011 created with technical design
[ ] Implementation completed
[ ] Tests cover metadata generation, validation, format options
[ ] Documentation updated (README, config examples)
[ ] Schema versioning strategy documented