
PRD-012: Google Gemini Provider Integration

  • Status: ✅ Implemented (v2.5.0)
  • Revision: 2
  • Date: 2026-02-04
  • Related RFCs: RFC-035 (Revised)
  • Related PRDs: PRD-006 (OpenAI), PRD-010 (Mistral)

Summary

Add Google Gemini as an optional provider for transcription, speaker detection, and summarization. Gemini is unique among the supported providers in offering native multimodal audio understanding and a 2 million token context window. Like the OpenAI integration, the Gemini integration follows the unified provider pattern: a single GeminiProvider class implements all three capability protocols. This builds on the existing modularization architecture to provide seamless provider switching.

Background & Context

The podcast scraper already supports several providers for these capabilities (local models plus OpenAI, Anthropic, Mistral, and DeepSeek; see the capability matrix below). Google Gemini offers several compelling advantages:

  • Native Audio Understanding: Gemini can process audio directly without separate transcription API
  • 2M Token Context Window: Process entire podcast seasons in one request
  • Multimodal: Can analyze audio, images, and text together
  • Competitive Pricing: Free tier available, paid tier very competitive
  • Google Integration: Works well with Google Cloud ecosystem

This PRD addresses adding Gemini as a full-featured provider option.

Goals

  • Add Google Gemini as provider option for transcription, speaker detection, and summarization
  • Maintain 100% backward compatibility with existing providers
  • Follow unified provider pattern (like OpenAI) - single class implementing all three protocols
  • Provide secure API key management via environment variables
  • Support both Config-based and experiment-based factory modes from the start
  • Leverage Gemini's native audio understanding for transcription (file upload and inline data)
  • Utilize massive context window for full transcript processing
  • Use environment-based model defaults (test vs production)
  • Create Gemini-specific prompt templates

Gemini Model Selection and Cost Analysis

Configuration Fields

Add to config.py when implementing the Gemini provider (following the OpenAI pattern):

# Gemini API Configuration
gemini_api_key: Optional[str] = Field(
    default=None,
    alias="gemini_api_key",
    description="Google AI API key (prefer GEMINI_API_KEY env var or .env file)"
)

gemini_api_base: Optional[str] = Field(
    default=None,
    alias="gemini_api_base",
    description="Gemini API base URL (for E2E testing with mock servers)"
)

# Gemini Model Selection (environment-based defaults, like OpenAI)
gemini_transcription_model: str = Field(
    default_factory=_get_default_gemini_transcription_model,
    alias="gemini_transcription_model",
    description="Gemini model for transcription (default: environment-based)"
)

gemini_speaker_model: str = Field(
    default_factory=_get_default_gemini_speaker_model,
    alias="gemini_speaker_model",
    description="Gemini model for speaker detection (default: environment-based)"
)

gemini_summary_model: str = Field(
    default_factory=_get_default_gemini_summary_model,
    alias="gemini_summary_model",
    description="Gemini model for summarization (default: environment-based)"
)

# Shared settings (like OpenAI)
gemini_temperature: float = Field(
    default=0.3,
    alias="gemini_temperature",
    description="Temperature for Gemini generation (0.0-2.0, lower = more deterministic)"
)

gemini_max_tokens: Optional[int] = Field(
    default=None,
    alias="gemini_max_tokens",
    description="Max tokens for Gemini generation (None = model default)"
)

# Gemini Prompt Configuration (following OpenAI pattern)
gemini_summary_system_prompt: Optional[str] = Field(
    default=None,
    alias="gemini_summary_system_prompt",
    description="Gemini system prompt for summarization (default: gemini/summarization/system_v1)"
)

gemini_summary_user_prompt: str = Field(
    default="gemini/summarization/long_v1",
    alias="gemini_summary_user_prompt",
    description="Gemini user prompt for summarization"
)

gemini_speaker_system_prompt: Optional[str] = Field(
    default=None,
    alias="gemini_speaker_system_prompt",
    description="Gemini system prompt for speaker detection (default: gemini/ner/system_ner_v1)"
)

gemini_speaker_user_prompt: str = Field(
    default="gemini/ner/guest_host_v1",
    alias="gemini_speaker_user_prompt",
    description="Gemini user prompt for speaker detection"
)

Environment-based defaults (see the sketch below):

  • Test environment: gemini-2.0-flash (free tier, fast)
  • Production environment: gemini-1.5-pro (best quality, 2M context)
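
A minimal sketch of the three default factories referenced by the config fields above. How config.py detects the environment is an assumption here (an ENVIRONMENT variable); the real implementation should reuse whatever mechanism the existing OpenAI defaults rely on.

# Sketch only: the environment-detection helper is an assumption, not the
# shipped implementation; reuse the project's existing test/prod flag.
import os

def _is_test_environment() -> bool:
    return os.getenv("ENVIRONMENT", "production").lower() in ("test", "dev")

def _get_default_gemini_transcription_model() -> str:
    return "gemini-2.0-flash" if _is_test_environment() else "gemini-1.5-pro"

def _get_default_gemini_speaker_model() -> str:
    return "gemini-2.0-flash" if _is_test_environment() else "gemini-1.5-pro"

def _get_default_gemini_summary_model() -> str:
    return "gemini-2.0-flash" if _is_test_environment() else "gemini-1.5-pro"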

Model Options and Pricing

| Model | Input Cost | Output Cost | Context Window | Best For |
| ----- | ---------- | ----------- | -------------- | -------- |
| gemini-2.0-flash | $0.10 / 1M tokens | $0.40 / 1M tokens | 1M tokens | Dev/Test (fast, cheap) |
| gemini-1.5-pro | $1.25 / 1M tokens | $5.00 / 1M tokens | 2M tokens | Production (best quality) |
| gemini-1.5-flash | $0.075 / 1M tokens | $0.30 / 1M tokens | 1M tokens | Budget production |

Audio Pricing:

  • Audio input: ~$0.00025 per second (~$0.90 per hour)

Note: Prices subject to change. Check Google AI Pricing for current rates.

Free Tier

Google offers a generous free tier (RPM = requests per minute, TPM = tokens per minute, RPD = requests per day):

  • Gemini 2.0 Flash: 15 RPM, 1M TPM, 1500 RPD
  • Gemini 1.5 Flash: 15 RPM, 1M TPM, 1500 RPD
  • Gemini 1.5 Pro: 2 RPM, 32K TPM, 50 RPD

Dev/Test vs Production Model Selection

| Environment | Transcription Model | Speaker Model | Summary Model | Rationale |
| ----------- | ------------------- | ------------- | ------------- | --------- |
| Dev/Test | gemini-2.0-flash | gemini-2.0-flash | gemini-2.0-flash | Free tier, fast |
| Production | gemini-1.5-pro | gemini-1.5-pro | gemini-1.5-pro | Best quality, 2M context |

Cost Comparison: All Providers (Per 100 Episodes)

| Component | OpenAI (gpt-4o-mini) | Mistral (small) | DeepSeek (chat) | Gemini (flash) |
| --------- | -------------------- | --------------- | --------------- | -------------- |
| Transcription | $0.60 | TBD | ❌ N/A | $0.90 |
| Speaker Detection | $0.14 | $0.03 | $0.004 | $0.01 |
| Summarization | $0.41 | $0.08 | $0.012 | $0.04 |
| Total | $1.15 | $0.11+ | $0.016 | $0.95 |

Note: Gemini audio transcription costs more than OpenAI Whisper for the transcription step ($0.90 vs $0.60 per 100 episodes), but offers native multimodal understanding.

Non-Goals

  • Vision/image capabilities beyond audio (future consideration)
  • Changing default behavior (local providers remain default)
  • Modifying existing provider implementations
  • Google Cloud Vertex AI integration (different SDK, future PRD)

Personas

  • Quality Seeker Quinn: Wants 2M context for full season analysis
  • Google Cloud Gary: Already uses Google Cloud, prefers ecosystem
  • Free Tier Fiona: Wants to use generous free tier for development
  • Multimodal Mary: Wants native audio understanding without separate transcription

User Stories

  • As Quality Seeker Quinn, I can use Gemini 1.5 Pro with 2M context to analyze entire seasons.
  • As Google Cloud Gary, I can use my existing Google AI API key for podcast processing.
  • As Free Tier Fiona, I can develop and test using Gemini's free tier.
  • As Multimodal Mary, I can send audio directly to Gemini without separate transcription.
  • As any operator, I can see which provider was used in logs and metadata.

Functional Requirements

FR1: Provider Selection

  • FR1.1: Add "gemini" as valid value for transcription_provider config field
  • FR1.2: Add "gemini" as valid value for speaker_detector_provider config field
  • FR1.3: Add "gemini" as valid value for summary_provider config field
  • FR1.4: Provider selection is independent per capability
  • FR1.5: Default values maintain current behavior (local providers)
  • FR1.6: Support both Config-based and experiment-based factory modes from the start
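
Because selection is independent per capability (FR1.4), Gemini can be mixed with other providers. A hedged illustration, assuming the project's Config class accepts these fields as keyword arguments; the import path and construction style are placeholders:

# Illustrative only: field names come from FR1.1-FR1.3; the import path is a
# placeholder for the project's actual config module.
from podcast_scraper.config import Config

config = Config(
    transcription_provider="local",      # keep the local default for transcription
    speaker_detector_provider="gemini",  # use Gemini for speaker detection
    summary_provider="gemini",           # use Gemini for summarization
)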

FR2: API Key Management

  • FR2.1: Support GEMINI_API_KEY environment variable (like OPENAI_API_KEY)
  • FR2.2: Support .env file for configuration
  • FR2.3: API key is never stored in source code
  • FR2.4: Missing API key results in clear error message
  • FR2.5: Support GEMINI_API_BASE environment variable for E2E testing (like OPENAI_API_BASE)
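
A minimal sketch of the FR2 behavior, assuming python-dotenv is used for .env support; the error type and wording are placeholders for whatever the existing providers raise:

# Sketch: resolve the key from the environment (optionally loaded from .env)
# and fail fast with a clear message when it is missing (FR2.3/FR2.4).
import os

from dotenv import load_dotenv  # only if python-dotenv is part of the project

load_dotenv()  # makes GEMINI_API_KEY from a .env file visible in os.environ

api_key = os.getenv("GEMINI_API_KEY")
if not api_key:
    raise ValueError(
        "Gemini provider selected but no API key found. "
        "Set GEMINI_API_KEY in the environment or in your .env file."
    )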

FR3: Transcription with Gemini

  • FR3.1: Gemini provider uses native audio understanding for transcription
  • FR3.2: Maintains same interface as other providers (protocol-compliant)
  • FR3.3: Supports both audio file upload and inline data (if available in SDK)
  • FR3.4: Returns results in same format as other providers
  • FR3.5: Handles large audio files (2M context window eliminates need for chunking)
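
A hedged sketch of FR3 using the google-genai SDK named in TR2; the exact call signatures (notably files.upload, whose keyword has changed across SDK versions) should be verified against the installed version, and the model name and prompt are illustrative:

# Sketch only: verify files.upload() / generate_content() signatures against
# the installed google-genai version (older releases used path= instead of
# file=); model name and prompt text are illustrative.
import os

from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

# Option A (FR3.3): upload the audio via the Files API and reference it.
audio_file = client.files.upload(file="episode_001.mp3")
response = client.models.generate_content(
    model="gemini-1.5-pro",
    contents=[audio_file, "Transcribe this podcast episode verbatim."],
)
transcript = response.text  # FR3.4: map into the common provider result format

# Option B (FR3.3): send smaller files as inline data instead of uploading.
with open("episode_001.mp3", "rb") as f:
    audio_part = types.Part.from_bytes(data=f.read(), mime_type="audio/mp3")
response = client.models.generate_content(
    model="gemini-1.5-pro",
    contents=[audio_part, "Transcribe this podcast episode verbatim."],
)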

FR4: Speaker Detection with Gemini

  • FR4.1: Gemini provider uses chat models for entity extraction
  • FR4.2: Maintains same interface as other providers
  • FR4.3: Uses Gemini-specific prompt templates

FR5: Summarization with Gemini

  • FR5.1: Gemini provider uses chat models for summarization
  • FR5.2: Leverages massive context window (up to 2M tokens)
  • FR5.3: Can process entire transcripts without chunking
  • FR5.4: Uses Gemini-specific prompt templates
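
A hedged sketch of FR5 (FR4 speaker detection follows the same call shape with the ner prompts), assuming the google-genai GenerateContentConfig carries the system prompt and generation settings; prompt-template loading is elided and the literal strings stand in for the configured gemini/summarization/* templates:

# Sketch: a single call over the full transcript (FR5.3), no chunking needed.
import os

from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

full_transcript = open("episode_001_transcript.txt", encoding="utf-8").read()
system_prompt = "You are a podcast summarization assistant."  # placeholder for gemini/summarization/system_v1

response = client.models.generate_content(
    model="gemini-1.5-pro",
    contents=f"{full_transcript}\n\nSummarize the episode above.",  # placeholder for gemini/summarization/long_v1
    config=types.GenerateContentConfig(
        system_instruction=system_prompt,
        temperature=0.3,          # maps to gemini_temperature
        max_output_tokens=None,   # maps to gemini_max_tokens (None = model default)
    ),
)
summary = response.text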

FR6: Prompt Management

  • FR6.1: Create Gemini-specific prompt templates in prompts/gemini/ folder
  • FR6.2: Prompts optimized for Gemini models

FR7: Logging and Observability

  • FR7.1: Log which provider is used for each capability
  • FR7.2: Include provider information in metadata documents

Technical Requirements

TR1: Architecture

  • TR1.1: Follow unified provider pattern (like OpenAI) - single class implementing all three protocols
  • TR1.2: Create providers/gemini/gemini_provider.py with unified GeminiProvider class
  • TR1.3: GeminiProvider implements TranscriptionProvider, SpeakerDetector, and SummarizationProvider protocols
  • TR1.4: Update all factories to include Gemini option with support for both Config-based and experiment-based modes
  • TR1.5: Create prompts/gemini/ directory with provider-specific prompt templates
  • TR1.6: Follow OpenAI provider architecture exactly for consistency
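
A structural sketch of TR1.2/TR1.3; the method names and signatures below are placeholders and should mirror the real TranscriptionProvider, SpeakerDetector, and SummarizationProvider protocols:

# providers/gemini/gemini_provider.py (skeleton; method names are assumed).
class GeminiProvider:
    """Unified provider implementing all three capability protocols (TR1.3)."""

    def __init__(self, config):
        self._config = config
        self._client = None  # created lazily on first use (see TR2.2)

    def transcribe(self, audio_path: str) -> str:
        """TranscriptionProvider: native Gemini audio understanding (FR3)."""
        ...

    def detect_speakers(self, transcript: str) -> list[str]:
        """SpeakerDetector: guest/host extraction via chat models (FR4)."""
        ...

    def summarize(self, transcript: str) -> str:
        """SummarizationProvider: full-transcript summarization (FR5)."""
        ...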

TR2: Dependencies

  • TR2.1: Add google-genai Python package as optional dependency (migrated from google-generativeai in Issue #415)
  • TR2.2: Lazy import when provider is selected (ImportError with helpful message if not installed)
  • TR2.3: Add to pyproject.toml optional dependencies: gemini = ["google-genai>=0.1.0,<1.0.0"]
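
A minimal sketch of the TR2.2 lazy import; the install hint assumes the optional dependency group from TR2.3, and the distribution name in the message is hypothetical:

# Sketch: import the SDK only when the Gemini provider is actually selected,
# and point at the optional dependency group if it is missing (TR2.2/TR2.3).
def _import_genai():
    try:
        from google import genai
    except ImportError as exc:
        raise ImportError(
            "Gemini provider selected but google-genai is not installed. "
            "Install it with: pip install 'podcast-scraper[gemini]'"  # distribution name assumed
        ) from exc
    return genai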

TR3: Testing

  • TR3.1: Unit tests with mocked API
  • TR3.2: Integration tests with E2E server mock
  • TR3.3: E2E tests for complete workflow

TR4: E2E Server Extensions

  • TR4.1: Add Gemini mock endpoints to E2E HTTP server
  • TR4.2: Add gemini_api_base() helper to E2EServerURLs
  • TR4.3: Support gemini_api_base config field for custom base URL (like openai_api_base)

Success Criteria

  • ✅ Users can select Gemini for all three capabilities via unified provider
  • ✅ Native audio transcription works via Gemini (file upload and inline data)
  • ✅ Default behavior unchanged (local providers remain default)
  • ✅ API keys managed securely via GEMINI_API_KEY environment variable
  • ✅ Environment-based model defaults (test vs production)
  • ✅ Both Config-based and experiment-based factory modes supported
  • ✅ E2E tests pass with mock server support
  • ✅ Follows OpenAI provider pattern exactly for consistency

Provider Capability Matrix (Updated)

| Capability | Local | OpenAI | Anthropic | Mistral | DeepSeek | Gemini | Grok | Ollama |
| ---------- | ----- | ------ | --------- | ------- | -------- | ------ | ---- | ------ |
| Transcription | ✅ | ✅ | ❌ | ✅ | ❌ | ✅ | | |
| Speaker Detection | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | |
| Summarization | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | |
| Max Context | N/A | 128k | 200k | 256k | 64k | 2M | | |

Future Considerations

  • Vertex AI integration for enterprise deployments
  • Multimodal analysis (audio + images for show notes)
  • Gemini's native grounding with Google Search
  • Context caching for cost optimization