PRD-012: Google Gemini Provider Integration¶
- Status: ✅ Implemented (v2.5.0)
- Revision: 2
- Date: 2026-02-04
- Related RFCs: RFC-035 (Revised)
- Related PRDs: PRD-006 (OpenAI), PRD-010 (Mistral)
Summary¶
Add Google Gemini as an optional provider for transcription, speaker detection, and summarization capabilities. Gemini is unique in offering native multimodal support with audio understanding and an industry-leading 2 million token context window. Like OpenAI, Gemini uses a unified provider pattern where a single GeminiProvider class implements all three capabilities. This builds on the existing modularization architecture to provide seamless provider switching.
Background & Context¶
Currently, the podcast scraper supports multiple providers. Google Gemini offers several compelling advantages:
- Native Audio Understanding: Gemini can process audio directly without separate transcription API
- 2M Token Context Window: Process entire podcast seasons in one request
- Multimodal: Can analyze audio, images, and text together
- Competitive Pricing: Free tier available, paid tier very competitive
- Google Integration: Works well with Google Cloud ecosystem
This PRD addresses adding Gemini as a full-featured provider option.
Goals¶
- Add Google Gemini as provider option for transcription, speaker detection, and summarization
- Maintain 100% backward compatibility with existing providers
- Follow unified provider pattern (like OpenAI) - single class implementing all three protocols
- Provide secure API key management via environment variables
- Support both Config-based and experiment-based factory modes from the start
- Leverage Gemini's native audio understanding for transcription (file upload and inline data)
- Utilize massive context window for full transcript processing
- Use environment-based model defaults (test vs production)
- Create Gemini-specific prompt templates
Gemini Model Selection and Cost Analysis¶
Configuration Fields¶
Add to config.py when implementing Gemini providers (following OpenAI pattern):
# Gemini API Configuration
gemini_api_key: Optional[str] = Field(
default=None,
alias="gemini_api_key",
description="Google AI API key (prefer GEMINI_API_KEY env var or .env file)"
)
gemini_api_base: Optional[str] = Field(
default=None,
alias="gemini_api_base",
description="Gemini API base URL (for E2E testing with mock servers)"
)
# Gemini Model Selection (environment-based defaults, like OpenAI)
gemini_transcription_model: str = Field(
default_factory=_get_default_gemini_transcription_model,
alias="gemini_transcription_model",
description="Gemini model for transcription (default: environment-based)"
)
gemini_speaker_model: str = Field(
default_factory=_get_default_gemini_speaker_model,
alias="gemini_speaker_model",
description="Gemini model for speaker detection (default: environment-based)"
)
gemini_summary_model: str = Field(
default_factory=_get_default_gemini_summary_model,
alias="gemini_summary_model",
description="Gemini model for summarization (default: environment-based)"
)
# Shared settings (like OpenAI)
gemini_temperature: float = Field(
default=0.3,
alias="gemini_temperature",
description="Temperature for Gemini generation (0.0-2.0, lower = more deterministic)"
)
gemini_max_tokens: Optional[int] = Field(
default=None,
alias="gemini_max_tokens",
description="Max tokens for Gemini generation (None = model default)"
)
# Gemini Prompt Configuration (following OpenAI pattern)
gemini_summary_system_prompt: Optional[str] = Field(
default=None,
alias="gemini_summary_system_prompt",
description="Gemini system prompt for summarization (default: gemini/summarization/system_v1)"
)
gemini_summary_user_prompt: str = Field(
default="gemini/summarization/long_v1",
alias="gemini_summary_user_prompt",
description="Gemini user prompt for summarization"
)
gemini_speaker_system_prompt: Optional[str] = Field(
default=None,
alias="gemini_speaker_system_prompt",
description="Gemini system prompt for speaker detection (default: gemini/ner/system_ner_v1)"
)
gemini_speaker_user_prompt: str = Field(
default="gemini/ner/guest_host_v1",
alias="gemini_speaker_user_prompt",
description="Gemini user prompt for speaker detection"
)
Environment-based defaults:
- Test environment:
gemini-2.0-flash(free tier, fast) - Production environment:
gemini-1.5-pro(best quality, 2M context)
Model Options and Pricing¶
| Model | Input Cost | Output Cost | Context Window | Best For |
|---|---|---|---|---|
| gemini-2.0-flash | $0.10 / 1M tokens | $0.40 / 1M tokens | 1M tokens | Dev/Test (fast, cheap) |
| gemini-1.5-pro | $1.25 / 1M tokens | $5.00 / 1M tokens | 2M tokens | Production (best quality) |
| gemini-1.5-flash | $0.075 / 1M tokens | $0.30 / 1M tokens | 1M tokens | Budget production |
Audio Pricing:
- Audio input: ~$0.00025 per second (~$0.90 per hour)
Note: Prices subject to change. Check Google AI Pricing for current rates.
Free Tier¶
Google offers a generous free tier:
- Gemini 2.0 Flash: 15 RPM, 1M TPM, 1500 RPD
- Gemini 1.5 Flash: 15 RPM, 1M TPM, 1500 RPD
- Gemini 1.5 Pro: 2 RPM, 32K TPM, 50 RPD
Dev/Test vs Production Model Selection¶
| Environment | Transcription | Speaker Model | Summary Model | Rationale |
|---|---|---|---|---|
| Dev/Test | gemini-2.0-flash |
gemini-2.0-flash |
gemini-2.0-flash |
Free tier, fast |
| Production | gemini-1.5-pro |
gemini-1.5-pro |
gemini-1.5-pro |
Best quality, 2M context |
Cost Comparison: All Providers (Per 100 Episodes)¶
| Component | OpenAI (gpt-4o-mini) | Mistral (small) | DeepSeek (chat) | Gemini (flash) |
|---|---|---|---|---|
| Transcription | $0.60 | TBD | ❌ N/A | $0.90 |
| Speaker Detection | $0.14 | $0.03 | $0.004 | $0.01 |
| Summarization | $0.41 | $0.08 | $0.012 | $0.04 |
| Total | $1.15 | $0.11+ | $0.016 | $0.95 |
Note: Gemini audio transcription is slightly more expensive than Whisper, but offers native multimodal understanding.
Non-Goals¶
- Vision/image capabilities beyond audio (future consideration)
- Changing default behavior (local providers remain default)
- Modifying existing provider implementations
- Google Cloud Vertex AI integration (different SDK, future PRD)
Personas¶
- Quality Seeker Quinn: Wants 2M context for full season analysis
- Google Cloud Gary: Already uses Google Cloud, prefers ecosystem
- Free Tier Fiona: Wants to use generous free tier for development
- Multimodal Mary: Wants native audio understanding without separate transcription
User Stories¶
- As Quality Seeker Quinn, I can use Gemini 1.5 Pro with 2M context to analyze entire seasons.
- As Google Cloud Gary, I can use my existing Google AI API key for podcast processing.
- As Free Tier Fiona, I can develop and test using Gemini's free tier.
- As Multimodal Mary, I can send audio directly to Gemini without separate transcription.
- As any operator, I can see which provider was used in logs and metadata.
Functional Requirements¶
FR1: Provider Selection¶
- FR1.1: Add
"gemini"as valid value fortranscription_providerconfig field - FR1.2: Add
"gemini"as valid value forspeaker_detector_providerconfig field - FR1.3: Add
"gemini"as valid value forsummary_providerconfig field - FR1.4: Provider selection is independent per capability
- FR1.5: Default values maintain current behavior (local providers)
- FR1.6: Support both Config-based and experiment-based factory modes from the start
FR2: API Key Management¶
- FR2.1: Support
GEMINI_API_KEYenvironment variable (likeOPENAI_API_KEY) - FR2.2: Support
.envfile for configuration - FR2.3: API key is never stored in source code
- FR2.4: Missing API key results in clear error message
- FR2.5: Support
GEMINI_API_BASEenvironment variable for E2E testing (likeOPENAI_API_BASE)
FR3: Transcription with Gemini¶
- FR3.1: Gemini provider uses native audio understanding for transcription
- FR3.2: Maintains same interface as other providers (protocol-compliant)
- FR3.3: Supports both audio file upload and inline data (if available in SDK)
- FR3.4: Returns results in same format as other providers
- FR3.5: Handles large audio files (2M context window eliminates need for chunking)
FR4: Speaker Detection with Gemini¶
- FR4.1: Gemini provider uses chat models for entity extraction
- FR4.2: Maintains same interface as other providers
- FR4.3: Uses Gemini-specific prompt templates
FR5: Summarization with Gemini¶
- FR5.1: Gemini provider uses chat models for summarization
- FR5.2: Leverages massive context window (up to 2M tokens)
- FR5.3: Can process entire transcripts without chunking
- FR5.4: Uses Gemini-specific prompt templates
FR6: Prompt Management¶
- FR6.1: Create Gemini-specific prompt templates in
prompts/gemini/folder - FR6.2: Prompts optimized for Gemini models
FR7: Logging and Observability¶
- FR7.1: Log which provider is used for each capability
- FR7.2: Include provider information in metadata documents
Technical Requirements¶
TR1: Architecture¶
- TR1.1: Follow unified provider pattern (like OpenAI) - single class implementing all three protocols
- TR1.2: Create
providers/gemini/gemini_provider.pywith unifiedGeminiProviderclass - TR1.3:
GeminiProviderimplementsTranscriptionProvider,SpeakerDetector, andSummarizationProviderprotocols - TR1.4: Update all factories to include Gemini option with support for both Config-based and experiment-based modes
- TR1.5: Create
prompts/gemini/directory with provider-specific prompt templates - TR1.6: Follow OpenAI provider architecture exactly for consistency
TR2: Dependencies¶
- TR2.1: Add
google-genaiPython package as optional dependency (migrated fromgoogle-generativeaiin Issue #415) - TR2.2: Lazy import when provider is selected (ImportError with helpful message if not installed)
- TR2.3: Add to
pyproject.tomloptional dependencies:gemini = ["google-genai>=0.1.0,<1.0.0"]
TR3: Testing¶
- TR3.1: Unit tests with mocked API
- TR3.2: Integration tests with E2E server mock
- TR3.3: E2E tests for complete workflow
TR4: E2E Server Extensions¶
- TR4.1: Add Gemini mock endpoints to E2E HTTP server
- TR4.2: Add
gemini_api_base()helper toE2EServerURLs - TR4.3: Support
gemini_api_baseconfig field for custom base URL (likeopenai_api_base)
Success Criteria¶
- ✅ Users can select Gemini for all three capabilities via unified provider
- ✅ Native audio transcription works via Gemini (file upload and inline data)
- ✅ Default behavior unchanged (local providers remain default)
- ✅ API keys managed securely via
GEMINI_API_KEYenvironment variable - ✅ Environment-based model defaults (test vs production)
- ✅ Both Config-based and experiment-based factory modes supported
- ✅ E2E tests pass with mock server support
- ✅ Follows OpenAI provider pattern exactly for consistency
Provider Capability Matrix (Updated)¶
| Capability | Local | OpenAI | Anthropic | Mistral | DeepSeek | Gemini | Grok | Ollama | | ---------- | ----- | ------ | --------- | ------- | -------- | ------ | | Transcription | ✅ | ✅ | ❌ | ✅ | ❌ | ✅ | | Speaker Detection | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | Summarization | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | Max Context | N/A | 128k | 200k | 256k | 64k | 2M |
Future Considerations¶
- Vertex AI integration for enterprise deployments
- Multimodal analysis (audio + images for show notes)
- Gemini's native grounding with Google Search
- Context caching for cost optimization