# Dependencies Guide

This guide documents the key third-party dependencies used in the podcast scraper project, including the rationale for their selection, alternatives considered, and the dependency management philosophy.

For high-level architectural decisions, see Architecture. For general development practices, see the Development Guide.

## Overview

The project uses a layered dependency approach: core dependencies (always required) provide essential functionality, while ML dependencies (optional) enable advanced features like transcription and summarization.

## Core Dependencies (Always Required)
### requests (>=2.31.0)

- Purpose: HTTP client for downloading RSS feeds, transcripts, and media files.
- Why chosen: Industry-standard library with excellent session management, connection pooling, and retry capabilities. Used throughout `rss/downloader.py` and `rss_parser.py` for all network operations.
- Key features utilized: Session pooling with custom retry adapters (`LoggingRetry`), streaming downloads for large media files, configurable timeouts and headers.
- Alternatives considered: `urllib3` (too low-level), `httpx` (less mature at project start).
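The session-with-retries setup described above can be sketched as follows. This is a minimal illustration using urllib3's stock `Retry` class; the project's actual `LoggingRetry` subclass and timeout/header defaults may differ.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(total_retries: int = 3, backoff: float = 0.5) -> requests.Session:
    """Create a Session with connection pooling and retries on transient failures."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff,
        status_forcelist=(429, 500, 502, 503, 504),  # retry transient HTTP statuses
        allowed_methods=("GET", "HEAD"),             # only retry idempotent methods
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


session = build_session()
```

Mounting the adapter on both URL prefixes means every request through the session gets pooling and retry behavior without per-call configuration.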
### pydantic (>=2.6.0)

- Purpose: Data validation, serialization, and configuration management via the `Config` model.
- Why chosen: Provides immutable, type-safe configuration with automatic validation, JSON/YAML parsing, and excellent error messages. Central to the architecture's "typed, immutable configuration" design principle.
- Key features utilized: Frozen models, field validators, JSON/YAML serialization, nested model validation, type coercion.
- Alternatives considered: `dataclasses` (no validation), `attrs` (fewer validation features), `marshmallow` (more verbose).
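A minimal sketch of the frozen, validated configuration style described above. `FeedConfig` and its fields are illustrative, not the project's actual `Config` schema.

```python
from pydantic import BaseModel, ConfigDict, field_validator


class FeedConfig(BaseModel):
    """Illustrative frozen config model: validated on construction, immutable after."""
    model_config = ConfigDict(frozen=True)

    url: str
    max_episodes: int = 10

    @field_validator("url")
    @classmethod
    def url_must_be_http(cls, v: str) -> str:
        if not v.startswith(("http://", "https://")):
            raise ValueError("url must start with http:// or https://")
        return v


# Type coercion: the string "25" is converted to int during validation.
cfg = FeedConfig(url="https://example.com/feed.xml", max_episodes="25")
```

Attempting to assign to a field of a frozen model raises a `ValidationError`, which is what makes the "typed, immutable configuration" guarantee enforceable at runtime.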
### defusedxml (>=0.7.1)

- Purpose: Safe XML/RSS parsing that prevents XML bomb attacks and entity-expansion vulnerabilities.
- Why chosen: Security-first RSS parsing is critical when processing untrusted feeds from the internet. Drop-in replacement for stdlib XML parsers with security hardening.
- Key features utilized: Safe ElementTree parsing in `rss_parser.py`, automatic protection against XXE attacks.
- Alternatives considered: Standard library `xml.etree` (vulnerable), `lxml` (heavier dependency).
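The drop-in nature of defusedxml can be sketched like this. The feed XML here is illustrative; the stdlib fallback exists only so the sketch runs anywhere, whereas production code should require defusedxml unconditionally.

```python
# defusedxml mirrors the stdlib ElementTree API, so swapping it in is one import.
try:
    from defusedxml import ElementTree as ET  # hardened parser: rejects entity bombs/XXE
except ImportError:
    # Fallback for illustration only; do NOT do this when parsing untrusted feeds.
    import xml.etree.ElementTree as ET

RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Podcast</title>
    <item><title>Episode 1</title></item>
  </channel>
</rss>"""

root = ET.fromstring(RSS)
titles = [item.findtext("title") for item in root.iter("item")]
```

With defusedxml active, the same `fromstring` call raises an exception on malicious constructs (billion-laughs expansion, external entity references) instead of processing them.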
### tqdm (>=4.66.0)

- Purpose: Progress bars for long-running operations (downloads, transcription).
- Why chosen: Rich, customizable progress visualization with a minimal API surface. Integrates cleanly via the `progress.py` pluggable factory pattern.
- Key features utilized: Multi-level progress tracking, dynamic updates, thread-safe counters, custom formatting.
- Alternatives considered: `click.progressbar` (less flexible), `rich.progress` (heavier dependency).
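A hypothetical sketch of the pluggable-factory idea: callers ask a helper for a progress-wrapped iterable and never import tqdm directly, so the dependency stays swappable. The function name is illustrative; the actual `progress.py` interface may differ.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def progress_iter(items: Iterable[T], description: str = "") -> Iterator[T]:
    """Wrap an iterable in a tqdm bar when available; degrade to a plain iterator otherwise."""
    try:
        from tqdm import tqdm  # imported lazily so callers need not depend on tqdm
        return iter(tqdm(items, desc=description))
    except ImportError:
        return iter(items)


results = list(progress_iter(range(3), "downloading"))
```

Because the wrapper degrades gracefully, the same call sites work in minimal installs, test environments, and non-TTY contexts.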
### platformdirs (>=3.11.0)

- Purpose: Cross-platform directory path resolution for output, cache, and configuration files.
- Why chosen: Handles OS-specific conventions (Linux XDG, macOS Application Support, Windows AppData) transparently. Essential for determining safe output roots and validating user-provided paths.
- Key features utilized: User data directory resolution, cache directory paths.
- Alternatives considered: Manual path construction (not portable), `appdirs` (unmaintained).
### PyYAML (>=6.0)

- Purpose: YAML configuration file parsing alongside JSON support.
- Why chosen: YAML provides a more human-friendly configuration syntax than JSON, with comments and multi-line string support. Widely adopted in operations/DevOps contexts.
- Key features utilized: Safe YAML loading, round-tripping with Pydantic models.
- Alternatives considered: JSON-only (less user-friendly), TOML (less mature Python support).
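The safe-loading step can be sketched as follows; the config keys are illustrative, not the project's actual schema. `yaml.safe_load` is the important detail: unlike `yaml.load` with an unsafe loader, it refuses to construct arbitrary Python objects from untrusted input.

```python
import yaml

CONFIG_YAML = """
# Comments are allowed here, unlike in JSON.
feed_url: https://example.com/feed.xml
max_episodes: 5
"""

data = yaml.safe_load(CONFIG_YAML)  # returns plain dicts/lists/scalars only
```

The resulting dict can then be handed to a Pydantic model (e.g. `Config.model_validate(data)`) for validation, which is the round-trip the bullet above refers to.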
## ML Dependencies (Optional, Install via `pip install -e .[ml]`)
### openai-whisper (>=20231117)

- Purpose: Automatic speech recognition for podcast transcription fallback.
- Why chosen: State-of-the-art open-source ASR with multiple model sizes, multilingual support, and screenplay formatting. A local-first approach ensures privacy and avoids API costs.
- Key features utilized: Model selection (tiny→large), language detection, speaker diarization hints, `.en` model variants for English optimization.
- Alternatives considered: Google Speech-to-Text (API costs), Azure Speech (API costs), Vosk (less accurate), Mozilla DeepSpeech (deprecated).
- Lazy loading: Imported conditionally in `providers/ml/whisper_utils.py` to avoid a hard dependency.
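The conditional-import pattern mentioned above can be sketched like this. The helper name and message text are hypothetical; only the try/except shape reflects the documented approach.

```python
from types import ModuleType
from typing import Optional


def load_whisper() -> Optional[ModuleType]:
    """Return the whisper module if the [ml] extra is installed, else None."""
    try:
        import whisper  # heavy import deferred until transcription is actually requested
    except ImportError:
        return None
    return whisper


whisper_mod = load_whisper()
if whisper_mod is None:
    status = "transcription unavailable: install with pip install -e '.[ml]'"
else:
    status = "transcription available"
```

Keeping the import inside a function means a core-only install never pays the import cost (or the install requirement) unless the Whisper fallback is actually exercised.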
### spacy (>=3.7.0) and English pipeline models (en-core-web-sm, en-core-web-trf)

- Purpose: Named Entity Recognition (NER) for automatic speaker detection from episode metadata.
- Why chosen: Production-ready NLP library with pre-trained models for person-name extraction. Fast, accurate, and supports multiple languages via consistent model naming (en_core_web_sm, es_core_news_sm, etc.).
- Key features utilized: PERSON entity extraction, language-aware model selection, efficient batch processing.
- Alternatives considered: `transformers` NER (overkill for this use case), regex patterns (too brittle), `nltk` (less accurate).
- Model installation: The English models `en-core-web-sm` (small, ~13 MB) and `en-core-web-trf` (transformer, ~457 MB) are installed as direct URL dependencies from GitHub releases (the same approach spaCy documents for large assets). Pins live in pyproject.toml under [project.optional-dependencies] → ml. Versions must stay compatible with spaCy 3.7.x (see spaCy models).
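The PERSON-extraction step can be sketched as a small function over a spaCy `Doc`. The function name is hypothetical; it relies only on the `Doc.ents` / `ent.label_` / `ent.text` interface, so it works with any loaded pipeline.

```python
from typing import List


def extract_speakers(doc) -> List[str]:
    """Collect unique PERSON entity texts from a spaCy Doc, preserving first-seen order."""
    seen: List[str] = []
    for ent in doc.ents:
        if ent.label_ == "PERSON" and ent.text not in seen:
            seen.append(ent.text)
    return seen
```

With a model installed, usage would look like `nlp = spacy.load("en_core_web_sm")` followed by `extract_speakers(nlp("Jane Doe interviews John Smith."))`; keeping the extraction logic model-agnostic is what makes the language-aware model selection above possible.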
#### Optional local wheel cache for spaCy models

Pip's normal cache helps on repeat installs, but new virtualenvs or workflows that clear the pip cache (for example, some CI jobs) can still re-download the large .whl files from GitHub.

Optional workflow:

- Download wheels into `wheels/spacy/` (gitignored `*.whl`; see `wheels/README.md`): `make download-spacy-wheels`
- Install with `make init` (recommended): if `wheels/spacy/*.whl` exists, the Makefile sets `PIP_FIND_LINKS` for pip automatically (unless you already exported it). For a manual pip install without Make, point pip at that directory:

    ```shell
    export PIP_FIND_LINKS="$(pwd)/wheels/spacy"
    pip install -e ".[dev,ml,llm]"
    ```

    On Windows (cmd): `set PIP_FIND_LINKS=%CD%\wheels\spacy` before `pip install`.

- CI / Docker: GitHub Actions already uses `actions/setup-python` pip caching; a local wheel directory is most useful for developers, Snyk jobs that run `pip cache purge`, or custom images. Prefer a narrow cache key if you add `actions/cache` for `wheels/spacy` (for example, hash `scripts/spacy_model_wheels_requirements.txt`).
When spaCy model pins change in pyproject.toml:

- Update `scripts/spacy_model_wheels_requirements.txt` so its two PEP 508 lines match the new `en-core-web-* @ https://...` entries in pyproject.toml (the unit test `test_spacy_model_wheels_requirements_sync` enforces this).
- Re-run `make download-spacy-wheels` (or `pip download -r scripts/spacy_model_wheels_requirements.txt -d wheels/spacy`).
- Re-install `[ml]` so the new wheels are used.
- Version compatibility: spaCy 3.7.x requires matching 3.7.x pipeline packages for these English models. When updating spaCy or the model URLs, align both and refresh the requirements file above.
### torch (>=2.0.0) and transformers (>=4.30.0)

- Purpose: Deep learning framework (torch) and pre-trained transformer models (transformers) for episode summarization.
- Why chosen: `transformers` provides access to production-ready summarization models (BART, PEGASUS, LED) with automatic caching and hardware acceleration. `torch` is the de facto standard for deep learning in Python, with excellent MPS (Apple Silicon) and CUDA (NVIDIA) support.
- Key features utilized:
  - `torch`: Device detection (MPS/CUDA/CPU), memory-efficient inference, gradient-free execution
  - `transformers`: Model auto-loading, tokenization, generation with beam search, automatic model caching (`~/.cache/huggingface/`)
- Models used: BART-large (map phase), LED/long-fast (reduce phase), PEGASUS (alternative).
- Alternatives considered: OpenAI API (costs/privacy), spaCy summarization (less sophisticated).
- Lazy loading: Imported conditionally in `providers/ml/summarizer.py` to avoid a hard dependency when summarization is disabled.
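The runtime device detection mentioned above can be sketched as follows. This is a hypothetical helper, not the project's actual implementation; it also folds in the lazy-import fallback so a torch-free install simply reports CPU.

```python
def pick_device() -> str:
    """Prefer Apple MPS, then CUDA, then CPU; fall back to CPU if torch is absent."""
    try:
        import torch  # deferred so core installs never require torch
    except ImportError:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)  # guard for older torch builds
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
```

The returned string can be passed straight to `transformers` pipelines or `tensor.to(device)`, so the rest of the summarization code stays hardware-agnostic.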
### sentencepiece (>=0.1.99)

- Purpose: Tokenization for certain transformer models (required by PEGASUS and others).
- Why chosen: Required dependency for models using SentencePiece tokenization. Lightweight and efficient.
- Key features utilized: Automatic integration via the `transformers` library.
### accelerate (>=0.20.0)

- Purpose: Optimized model loading and inference acceleration for large transformer models.
- Why chosen: Reduces model loading time and memory usage, especially for 16-bit inference and device mapping. Official Hugging Face library for production deployments.
- Key features utilized: Fast model initialization, memory optimization for limited-RAM systems.
### protobuf (>=3.20.0)

- Purpose: Protocol buffer serialization (required by some transformer model configurations).
- Why chosen: Transitive dependency for certain model formats. Pinned to avoid version conflicts.
### openai (>=1.0.0)

- Purpose: OpenAI Python SDK for API-based providers (transcription, speaker detection, summarization).
- Why chosen: Official OpenAI Python SDK with excellent type hints, async support, and comprehensive API coverage. Used by the OpenAI providers (`OpenAITranscriptionProvider`, `OpenAISpeakerDetector`, `OpenAISummarizationProvider`) for cloud-based capabilities.
- Key features utilized: Chat Completions API (summarization, speaker detection), Audio Transcriptions API (transcription), custom base URL support (for E2E testing with mock servers).
- Alternatives considered: Direct `requests` calls (less maintainable, no type hints), `anthropic` SDK (not implemented).
- E2E testing: Providers support `openai_api_base` configuration for custom endpoints, allowing E2E tests to use mock servers instead of real API calls. See the Provider Implementation Guide for details.
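A hypothetical sketch of how a provider might honor the `openai_api_base` override: build the client kwargs once, setting `base_url` only when a custom endpoint is configured. The helper name and the localhost URL are illustrative.

```python
from typing import Any, Dict, Optional


def client_kwargs(api_key: str, api_base: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for openai.OpenAI(); base_url is set only when overridden."""
    kwargs: Dict[str, Any] = {"api_key": api_key}
    if api_base:  # e.g. a local mock server during E2E tests
        kwargs["base_url"] = api_base
    return kwargs


kwargs = client_kwargs("sk-test", api_base="http://localhost:8080/v1")
# client = openai.OpenAI(**kwargs)   # then e.g. client.chat.completions.create(...)
```

Leaving `base_url` unset in the default path means production runs hit the official API, while E2E tests inject a mock endpoint through configuration alone.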
## Dependency Management Philosophy

- Core vs optional: Core dependencies are minimal and stable. Heavy ML dependencies are optional (`pip install -e .[ml]`) to avoid forcing users to install GB-sized packages when only transcript downloading is needed.
- Version pinning: Minimum versions are specified, but upper bounds are avoided so users can upgrade independently. Major version changes (e.g., Pydantic v2) are tracked carefully.
- Security: Security-focused libraries (`defusedxml`) are preferred. Regular scans via `pip-audit` and `bandit` (dev tools) ensure vulnerability detection.
- Lazy loading: Optional dependencies (`openai-whisper`, `torch`, `transformers`, `spacy`) are imported lazily, with graceful fallbacks when unavailable.
- Platform compatibility: All dependencies support Linux, macOS, and Windows. Platform-specific optimizations (MPS, CUDA) are detected at runtime.
## Development Dependencies (Optional, Install via `pip install -e .[dev]`)

### pytest-json-report (>=1.5.0,<2.0.0)

- Purpose: Generate structured JSON reports from pytest test runs for metrics collection and analysis.
- Why chosen: Provides machine-readable test metrics (pass/fail counts, durations, flaky-test detection) that integrate with our metrics collection system (RFC-025). Used in the nightly workflow for comprehensive test-metrics tracking.
- Key features utilized: JSON report generation (`--json-report`), test outcome tracking, rerun detection for flaky tests.
- Alternatives considered: Custom pytest plugins (more maintenance), JUnit XML only (less structured data).
- Usage: Automatically used in the nightly workflow via `--json-report --json-report-file=reports/pytest.json`.
## Installation

### Core Dependencies Only

```shell
pip install -e .
```

### With ML Dependencies

```shell
pip install -e .[ml]
```
### With API Provider Dependencies (OpenAI)

Note: The openai package is already included in the core dependencies (see pyproject.toml). You do not need to install it separately. OpenAI providers work with just the core package:

```shell
# For LLM-only users (no ML dependencies needed)
pip install -e .
```

If you want to use both OpenAI and local ML providers:

```shell
# For users who want both OpenAI and local ML options
pip install -e ".[ml]"
```
### With API Provider Dependencies (Gemini, Anthropic, Mistral, Ollama)

Note: LLM API providers (OpenAI, Gemini, Anthropic, Mistral, Ollama/httpx) are grouped in the [llm] extra. There is no separate [gemini] or [ollama] extra.

```shell
# For LLM API provider users (no ML dependencies needed)
pip install -e ".[llm]"
```

If you want to use both LLM API providers and local ML providers:

```shell
# For users who want both LLM API providers and local ML options
pip install -e ".[ml,llm]"
```
### With Ollama (Local LLM Server)

Note: The httpx package required for Ollama health checks is included in the [llm] extra (there is no separate [ollama] extra). Install with `pip install -e ".[llm]"`.

Prerequisites:

- Install the Ollama server: `brew install ollama` (macOS) or download
- Start the Ollama server: `ollama serve` (keep it running)
- Pull models: `ollama pull llama3.3:latest`

Installation:

```shell
# Ollama support is part of the [llm] extra
pip install -e ".[llm]"
```

If you want to use both Ollama and local ML providers:

```shell
# For users who want both LLM API providers and local ML options
pip install -e ".[ml,llm]"
```

Important notes:

- Ollama is a local, self-hosted solution; no API key is required
- The Ollama server must be running (`ollama serve`) before use
- Models must be pulled (`ollama pull <model-name>`) before use
- See the Ollama Provider Guide for detailed setup and troubleshooting
### With Development Dependencies

```shell
pip install -e .[dev]
```
## Known Dependency Issues & Tracking

This section documents known issues, warnings, and compatibility problems with dependencies that require monitoring or upstream fixes.
### thinc/spaCy FutureWarning: torch.cuda.amp.autocast Deprecation

- Status: Tracked (Issue #416)
- Severity: Low (cosmetic warning, no functional impact)
- First Observed: 2026-02-08
- Last Updated: 2026-02-08
Description:

A FutureWarning is emitted from the thinc library (used by spaCy) about deprecated PyTorch API usage:

```
FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
```
Root cause:

- The warning originates from `thinc/shims/pytorch.py:114`
- Known compatibility issue between `thinc` and newer PyTorch versions
- Not actionable by users; the fix must come from upstream `thinc` maintainers
Current mitigation:

- The warning is suppressed in `src/podcast_scraper/cli.py` (lines 1681-1689) to prevent log noise
- Suppression uses a specific module filter (`module="thinc.*"`) to avoid hiding other warnings
Action items:

- [ ] Monitor `thinc` releases for a fix (check thinc releases)
- [ ] Update the `thinc` dependency when a fix is available (via the `spacy` dependency)
- [ ] Remove the warning suppression in `cli.py` once fixed upstream
- [ ] Verify no functionality impact after the update
Related dependencies:

- `spacy>=3.7.0,<4.0.0` (transitive dependency: `thinc`)
- `torch>=2.0.0,<3.0.0` (PyTorch)
References:

- Issue: #416
- Warning source: `thinc/shims/pytorch.py:114`
- PyTorch deprecation: `torch.cuda.amp.autocast(args...)` → `torch.amp.autocast('cuda', args...)`
## Related Documentation

- Architecture - High-level system design and dependency overview
- Development Guide - General development practices
- ML Provider Reference - Details on ML dependencies
- Provider Implementation Guide - Provider testing with dependencies