# Dependencies Guide

This guide documents the key third-party dependencies used in the podcast scraper project, including the rationale for their selection, alternatives considered, and the dependency management philosophy.

For high-level architectural decisions, see Architecture. For general development practices, see the Development Guide.

## Overview

The project uses a layered dependency approach: core dependencies (always required) provide essential functionality, while ML dependencies (optional) enable advanced features like transcription and summarization.

## Core Dependencies (Always Required)
### requests (>=2.31.0)

- Purpose: HTTP client for downloading RSS feeds, transcripts, and media files.
- Why chosen: Industry-standard library with excellent session management, connection pooling, and retry capabilities. Used throughout `rss/downloader.py` and `rss_parser.py` for all network operations.
- Key features utilized: Session pooling with custom retry adapters (`LoggingRetry`), streaming downloads for large media files, configurable timeouts and headers.
- Alternatives considered: `urllib3` (too low-level), `httpx` (less mature at project start).
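The session-with-retries setup described above can be sketched as follows. This is a minimal illustration using urllib3's stock `Retry` class; the project's actual `LoggingRetry` subclass and timeout/header defaults may differ.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(total_retries: int = 3, backoff: float = 0.5) -> requests.Session:
    """Create a Session with connection pooling and retries on transient failures."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff,
        status_forcelist=(429, 500, 502, 503, 504),  # retry transient HTTP statuses
        allowed_methods=("GET", "HEAD"),             # only retry idempotent methods
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


session = build_session()
```

Mounting the adapter on both URL prefixes means every request through the session gets pooling and retry behavior without per-call configuration.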
### pydantic (>=2.6.0)

- Purpose: Data validation, serialization, and configuration management via the `Config` model.
- Why chosen: Provides immutable, type-safe configuration with automatic validation, JSON/YAML parsing, and excellent error messages. Central to the architecture's "typed, immutable configuration" design principle.
- Key features utilized: Frozen models, field validators, JSON/YAML serialization, nested model validation, type coercion.
- Alternatives considered: `dataclasses` (no validation), `attrs` (fewer validation features), `marshmallow` (more verbose).
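A minimal sketch of the frozen, validated configuration style described above. `FeedConfig` and its fields are illustrative, not the project's actual `Config` schema.

```python
from pydantic import BaseModel, ConfigDict, field_validator


class FeedConfig(BaseModel):
    """Illustrative frozen config model: validated on construction, immutable after."""
    model_config = ConfigDict(frozen=True)

    url: str
    max_episodes: int = 10

    @field_validator("url")
    @classmethod
    def url_must_be_http(cls, v: str) -> str:
        if not v.startswith(("http://", "https://")):
            raise ValueError("url must start with http:// or https://")
        return v


# Type coercion: the string "25" is converted to int during validation.
cfg = FeedConfig(url="https://example.com/feed.xml", max_episodes="25")
```

Attempting to assign to a field of a frozen model raises a `ValidationError`, which is what makes the "typed, immutable configuration" guarantee enforceable at runtime.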
### defusedxml (>=0.7.1)

- Purpose: Safe XML/RSS parsing that prevents XML bomb attacks and entity-expansion vulnerabilities.
- Why chosen: Security-first RSS parsing is critical when processing untrusted feeds from the internet. Drop-in replacement for stdlib XML parsers with security hardening.
- Key features utilized: Safe ElementTree parsing in `rss_parser.py`, automatic protection against XXE attacks.
- Alternatives considered: Standard library `xml.etree` (vulnerable), `lxml` (heavier dependency).
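The drop-in nature of defusedxml can be sketched like this. The feed XML here is illustrative; the stdlib fallback exists only so the sketch runs anywhere, whereas production code should require defusedxml unconditionally.

```python
# defusedxml mirrors the stdlib ElementTree API, so swapping it in is one import.
try:
    from defusedxml import ElementTree as ET  # hardened parser: rejects entity bombs/XXE
except ImportError:
    # Fallback for illustration only; do NOT do this when parsing untrusted feeds.
    import xml.etree.ElementTree as ET

RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Podcast</title>
    <item><title>Episode 1</title></item>
  </channel>
</rss>"""

root = ET.fromstring(RSS)
titles = [item.findtext("title") for item in root.iter("item")]
```

With defusedxml active, the same `fromstring` call raises an exception on malicious constructs (billion-laughs expansion, external entity references) instead of processing them.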
### tqdm (>=4.66.0)

- Purpose: Progress bars for long-running operations (downloads, transcription).
- Why chosen: Rich, customizable progress visualization with a minimal API surface. Integrates cleanly via the `progress.py` pluggable factory pattern.
- Key features utilized: Multi-level progress tracking, dynamic updates, thread-safe counters, custom formatting.
- Alternatives considered: `click.progressbar` (less flexible), `rich.progress` (heavier dependency).
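A hypothetical sketch of the pluggable-factory idea: callers ask a helper for a progress-wrapped iterable and never import tqdm directly, so the dependency stays swappable. The function name is illustrative; the actual `progress.py` interface may differ.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def progress_iter(items: Iterable[T], description: str = "") -> Iterator[T]:
    """Wrap an iterable in a tqdm bar when available; degrade to a plain iterator otherwise."""
    try:
        from tqdm import tqdm  # imported lazily so callers need not depend on tqdm
        return iter(tqdm(items, desc=description))
    except ImportError:
        return iter(items)


results = list(progress_iter(range(3), "downloading"))
```

Because the wrapper degrades gracefully, the same call sites work in minimal installs, test environments, and non-TTY contexts.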
### platformdirs (>=3.11.0)

- Purpose: Cross-platform directory path resolution for output, cache, and configuration files.
- Why chosen: Handles OS-specific conventions (Linux XDG, macOS Application Support, Windows AppData) transparently. Essential for determining safe output roots and validating user-provided paths.
- Key features utilized: User data directory resolution, cache directory paths.
- Alternatives considered: Manual path construction (not portable), `appdirs` (unmaintained).
### PyYAML (>=6.0)

- Purpose: YAML configuration file parsing alongside JSON support.
- Why chosen: YAML provides a more human-friendly configuration syntax than JSON, with comments and multi-line string support. Widely adopted in operations/DevOps contexts.
- Key features utilized: Safe YAML loading, round-tripping with Pydantic models.
- Alternatives considered: JSON-only (less user-friendly), TOML (less mature Python support).
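The safe-loading step can be sketched as follows; the config keys are illustrative, not the project's actual schema. `yaml.safe_load` is the important detail: unlike `yaml.load` with an unsafe loader, it refuses to construct arbitrary Python objects from untrusted input.

```python
import yaml

CONFIG_YAML = """
# Comments are allowed here, unlike in JSON.
feed_url: https://example.com/feed.xml
max_episodes: 5
"""

data = yaml.safe_load(CONFIG_YAML)  # returns plain dicts/lists/scalars only
```

The resulting dict can then be handed to a Pydantic model (e.g. `Config.model_validate(data)`) for validation, which is the round-trip the bullet above refers to.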
## ML Dependencies (Optional, Install via `pip install -e .[ml]`)
### openai-whisper (>=20231117)

- Purpose: Automatic speech recognition for podcast transcription fallback.
- Why chosen: State-of-the-art open-source ASR with multiple model sizes, multilingual support, and screenplay formatting. A local-first approach ensures privacy and avoids API costs.
- Key features utilized: Model selection (tiny→large), language detection, speaker diarization hints, `.en` model variants for English optimization.
- Alternatives considered: Google Speech-to-Text (API costs), Azure Speech (API costs), Vosk (less accurate), Mozilla DeepSpeech (deprecated).
- Lazy loading: Imported conditionally in `providers/ml/whisper_utils.py` to avoid a hard dependency.
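The conditional-import pattern mentioned above can be sketched like this. The helper name and message text are hypothetical; only the try/except shape reflects the documented approach.

```python
from types import ModuleType
from typing import Optional


def load_whisper() -> Optional[ModuleType]:
    """Return the whisper module if the [ml] extra is installed, else None."""
    try:
        import whisper  # heavy import deferred until transcription is actually requested
    except ImportError:
        return None
    return whisper


whisper_mod = load_whisper()
if whisper_mod is None:
    status = "transcription unavailable: install with pip install -e '.[ml]'"
else:
    status = "transcription available"
```

Keeping the import inside a function means a core-only install never pays the import cost (or the install requirement) unless the Whisper fallback is actually exercised.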
### spacy (>=3.7.0) and English pipeline models (en-core-web-sm, en-core-web-trf)

- Purpose: Named Entity Recognition (NER) for automatic speaker detection from episode metadata.
- Why chosen: Production-ready NLP library with pre-trained models for person-name extraction. Fast, accurate, and supports multiple languages via consistent model naming (en_core_web_sm, es_core_news_sm, etc.).
- Key features utilized: PERSON entity extraction, language-aware model selection, efficient batch processing.
- Alternatives considered: `transformers` NER (overkill for this use case), regex patterns (too brittle), `nltk` (less accurate).
- Model installation: The English models `en-core-web-sm` (small, ~13 MB) and `en-core-web-trf` (transformer, ~457 MB) are installed as direct URL dependencies from GitHub releases (the same approach spaCy documents for large assets). Pins live in pyproject.toml under [project.optional-dependencies] → ml. Versions must stay compatible with spaCy 3.7.x (see spaCy models).
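The PERSON-extraction step can be sketched as a small function over a spaCy `Doc`. The function name is hypothetical; it relies only on the `Doc.ents` / `ent.label_` / `ent.text` interface, so it works with any loaded pipeline.

```python
from typing import List


def extract_speakers(doc) -> List[str]:
    """Collect unique PERSON entity texts from a spaCy Doc, preserving first-seen order."""
    seen: List[str] = []
    for ent in doc.ents:
        if ent.label_ == "PERSON" and ent.text not in seen:
            seen.append(ent.text)
    return seen
```

With a model installed, usage would look like `nlp = spacy.load("en_core_web_sm")` followed by `extract_speakers(nlp("Jane Doe interviews John Smith."))`; keeping the extraction logic model-agnostic is what makes the language-aware model selection above possible.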
#### Optional local wheel cache for spaCy models

Pip's normal cache helps on repeat installs, but new virtualenvs or workflows that clear the pip cache (for example, some CI jobs) can still re-download the large .whl files from GitHub.

Optional workflow:

- Download wheels into `wheels/spacy/` (gitignored `*.whl`; see `wheels/README.md`): `make download-spacy-wheels`
- Install with `make init` (recommended): if `wheels/spacy/*.whl` exists, the Makefile sets `PIP_FIND_LINKS` for pip automatically (unless you already exported it). For a manual pip install without Make, point pip at that directory:

    ```shell
    export PIP_FIND_LINKS="$(pwd)/wheels/spacy"
    pip install -e ".[dev,ml,llm]"
    ```

    On Windows (cmd): `set PIP_FIND_LINKS=%CD%\wheels\spacy` before `pip install`.

- CI / Docker: GitHub Actions already uses `actions/setup-python` pip caching; a local wheel directory is most useful for developers, Snyk jobs that run `pip cache purge`, or custom images. Prefer a narrow cache key if you add `actions/cache` for `wheels/spacy` (for example, hash `scripts/spacy_model_wheels_requirements.txt`).
When spaCy model pins change in pyproject.toml:

- Update `scripts/spacy_model_wheels_requirements.txt` so its two PEP 508 lines match the new `en-core-web-* @ https://...` entries in pyproject.toml (the unit test `test_spacy_model_wheels_requirements_sync` enforces this).
- Re-run `make download-spacy-wheels` (or `pip download -r scripts/spacy_model_wheels_requirements.txt -d wheels/spacy`).
- Re-install `[ml]` so the new wheels are used.
- Version compatibility: spaCy 3.7.x requires matching 3.7.x pipeline packages for these English models. When updating spaCy or the model URLs, align both and refresh the requirements file above.
### torch (>=2.0.0) and transformers (>=4.30.0)

- Purpose: Deep learning framework (torch) and pre-trained transformer models (transformers) for episode summarization.
- Why chosen: `transformers` provides access to production-ready summarization models (BART, PEGASUS, LED) with automatic caching and hardware acceleration. `torch` is the de facto standard for deep learning in Python, with excellent MPS (Apple Silicon) and CUDA (NVIDIA) support.
- Key features utilized:
  - `torch`: Device detection (MPS/CUDA/CPU), memory-efficient inference, gradient-free execution
  - `transformers`: Model auto-loading, tokenization, generation with beam search, automatic model caching (`~/.cache/huggingface/`)
- Models used: BART-large (map phase), LED/long-fast (reduce phase), PEGASUS (alternative).
- Alternatives considered: OpenAI API (costs/privacy), spaCy summarization (less sophisticated).
- Lazy loading: Imported conditionally in `providers/ml/summarizer.py` to avoid a hard dependency when summarization is disabled.
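The runtime device detection mentioned above can be sketched as follows. This is a hypothetical helper, not the project's actual implementation; it also folds in the lazy-import fallback so a torch-free install simply reports CPU.

```python
def pick_device() -> str:
    """Prefer Apple MPS, then CUDA, then CPU; fall back to CPU if torch is absent."""
    try:
        import torch  # deferred so core installs never require torch
    except ImportError:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)  # guard for older torch builds
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
```

The returned string can be passed straight to `transformers` pipelines or `tensor.to(device)`, so the rest of the summarization code stays hardware-agnostic.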
### sentencepiece (>=0.1.99)

- Purpose: Tokenization for certain transformer models (required by PEGASUS and others).
- Why chosen: Required dependency for models using SentencePiece tokenization. Lightweight and efficient.
- Key features utilized: Automatic integration via the `transformers` library.
### accelerate (>=0.20.0)

- Purpose: Optimized model loading and inference acceleration for large transformer models.
- Why chosen: Reduces model loading time and memory usage, especially for 16-bit inference and device mapping. Official Hugging Face library for production deployments.
- Key features utilized: Fast model initialization, memory optimization for limited-RAM systems.
### protobuf (>=3.20.0)

- Purpose: Protocol buffer serialization (required by some transformer model configurations).
- Why chosen: Transitive dependency for certain model formats. Pinned to avoid version conflicts.
### openai (>=1.0.0)

- Purpose: OpenAI Python SDK for API-based providers (transcription, speaker detection, summarization).
- Why chosen: Official OpenAI Python SDK with excellent type hints, async support, and comprehensive API coverage. Used by the OpenAI providers (`OpenAITranscriptionProvider`, `OpenAISpeakerDetector`, `OpenAISummarizationProvider`) for cloud-based capabilities.
- Key features utilized: Chat Completions API (summarization, speaker detection), Audio Transcriptions API (transcription), custom base URL support (for E2E testing with mock servers).
- Alternatives considered: Direct `requests` calls (less maintainable, no type hints), `anthropic` SDK (not implemented).
- E2E testing: Providers support `openai_api_base` configuration for custom endpoints, allowing E2E tests to use mock servers instead of real API calls. See the Provider Implementation Guide for details.
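A hypothetical sketch of how a provider might honor the `openai_api_base` override: build the client kwargs once, setting `base_url` only when a custom endpoint is configured. The helper name and the localhost URL are illustrative.

```python
from typing import Any, Dict, Optional


def client_kwargs(api_key: str, api_base: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for openai.OpenAI(); base_url is set only when overridden."""
    kwargs: Dict[str, Any] = {"api_key": api_key}
    if api_base:  # e.g. a local mock server during E2E tests
        kwargs["base_url"] = api_base
    return kwargs


kwargs = client_kwargs("sk-test", api_base="http://localhost:8080/v1")
# client = openai.OpenAI(**kwargs)   # then e.g. client.chat.completions.create(...)
```

Leaving `base_url` unset in the default path means production runs hit the official API, while E2E tests inject a mock endpoint through configuration alone.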
## Dependency Management Philosophy

- Core vs optional: Core dependencies are minimal and stable. Heavy ML dependencies are optional (`pip install -e .[ml]`) to avoid forcing users to install GB-sized packages when only transcript downloading is needed.
- Version pinning: Minimum versions are specified, but upper bounds are avoided so users can upgrade independently. Major version changes (e.g., Pydantic v2) are tracked carefully.
- Security: Security-focused libraries (`defusedxml`) are preferred. Regular scans via `pip-audit` and `bandit` (dev tools) ensure vulnerability detection.
- Lazy loading: Optional dependencies (`openai-whisper`, `torch`, `transformers`, `spacy`) are imported lazily, with graceful fallbacks when unavailable.
- Platform compatibility: All dependencies support Linux, macOS, and Windows. Platform-specific optimizations (MPS, CUDA) are detected at runtime.
## Development Dependencies (Optional, Install via `pip install -e .[dev]`)

### pytest-json-report (>=1.5.0,<2.0.0)

- Purpose: Generate structured JSON reports from pytest test runs for metrics collection and analysis.
- Why chosen: Provides machine-readable test metrics (pass/fail counts, durations, flaky-test detection) that integrate with our metrics collection system (RFC-025). Used in the nightly workflow for comprehensive test-metrics tracking.
- Key features utilized: JSON report generation (`--json-report`), test outcome tracking, rerun detection for flaky tests.
- Alternatives considered: Custom pytest plugins (more maintenance), JUnit XML only (less structured data).
- Usage: Automatically used in the nightly workflow via `--json-report --json-report-file=reports/pytest.json`.
## Installation

### Core Dependencies Only

```shell
pip install -e .
```

### With ML Dependencies

```shell
pip install -e .[ml]
```
### With API Provider Dependencies (OpenAI)

Note: The openai package is already included in the core dependencies (see pyproject.toml). You do not need to install it separately. OpenAI providers work with just the core package:

```shell
# For LLM-only users (no ML dependencies needed)
pip install -e .
```

If you want to use both OpenAI and local ML providers:

```shell
# For users who want both OpenAI and local ML options
pip install -e ".[ml]"
```
### With API Provider Dependencies (Gemini, Anthropic, Mistral, Ollama)

Note: LLM API providers (OpenAI, Gemini, Anthropic, Mistral, Ollama/httpx) are grouped in the [llm] extra. There is no separate [gemini] or [ollama] extra.

```shell
# For LLM API provider users (no ML dependencies needed)
pip install -e ".[llm]"
```

If you want to use both LLM API providers and local ML providers:

```shell
# For users who want both LLM API providers and local ML options
pip install -e ".[ml,llm]"
```
### With Ollama (Local LLM Server)

Note: The httpx package required for Ollama health checks is included in the [llm] extra (there is no separate [ollama] extra). Install with `pip install -e ".[llm]"`.

Prerequisites:

- Install the Ollama server: `brew install ollama` (macOS) or download
- Start the Ollama server: `ollama serve` (keep it running)
- Pull models: `ollama pull llama3.3:latest`

Installation:

```shell
# Ollama support is part of the [llm] extra
pip install -e ".[llm]"
```

If you want to use both Ollama and local ML providers:

```shell
# For users who want both LLM API providers and local ML options
pip install -e ".[ml,llm]"
```

Important notes:

- Ollama is a local, self-hosted solution; no API key is required
- The Ollama server must be running (`ollama serve`) before use
- Models must be pulled (`ollama pull <model-name>`) before use
- See the Ollama Provider Guide for detailed setup and troubleshooting
### With Development Dependencies

```shell
pip install -e .[dev]
```
## Known Dependency Issues & Tracking

This section documents known issues, warnings, and compatibility problems with dependencies that require monitoring or upstream fixes.
### thinc/spaCy FutureWarning: torch.cuda.amp.autocast Deprecation

- Status: Tracked (Issue #416)
- Severity: Low (cosmetic warning, no functional impact)
- First Observed: 2026-02-08
- Last Updated: 2026-02-08
Description:

A FutureWarning is emitted from the thinc library (used by spaCy) about deprecated PyTorch API usage:

```
FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
```
Root cause:

- The warning originates from `thinc/shims/pytorch.py:114`
- Known compatibility issue between `thinc` and newer PyTorch versions
- Not actionable by users; the fix must come from upstream `thinc` maintainers
Current mitigation:

- The warning is suppressed in `src/podcast_scraper/cli.py` (lines 1681-1689) to prevent log noise
- Suppression uses a specific module filter (`module="thinc.*"`) to avoid hiding other warnings
Action items:

- [ ] Monitor `thinc` releases for a fix (check thinc releases)
- [ ] Update the `thinc` dependency when a fix is available (via the `spacy` dependency)
- [ ] Remove the warning suppression in `cli.py` once fixed upstream
- [ ] Verify no functionality impact after the update
Related dependencies:

- `spacy>=3.7.0,<4.0.0` (transitive dependency: `thinc`)
- `torch>=2.0.0,<3.0.0` (PyTorch)
References:

- Issue: #416
- Warning source: `thinc/shims/pytorch.py:114`
- PyTorch deprecation: `torch.cuda.amp.autocast(args...)` → `torch.amp.autocast('cuda', args...)`
## Related Documentation

- Architecture - High-level system design and dependency overview
- Development Guide - General development practices
- ML Provider Reference - Details on ML dependencies
- Provider Implementation Guide - Provider testing with dependencies