Glossary¶
Key terms and concepts used throughout the podcast_scraper codebase and documentation.
Architecture Terms¶
Provider¶
A modular implementation of a specific capability (transcription, speaker detection, summarization). Providers implement protocols and can be swapped without changing calling code.
Examples: WhisperTranscriptionProvider, OpenAITranscriptionProvider
See: Provider Implementation Guide
Protocol¶
A Python typing.Protocol that defines the interface a provider must implement.
Enables duck typing and dependency injection.
Examples: TranscriptionProvider, SpeakerDetectionProvider, SummarizationProvider
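The sketch below illustrates how a protocol lets calling code stay independent of any concrete provider. Only the name TranscriptionProvider comes from this glossary; the transcribe method, its signature, FakeTranscriptionProvider, and transcribe_episode are assumptions made for illustration and may not match the real interface.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class TranscriptionProvider(Protocol):
    """Interface sketch; the real protocol's methods may differ (assumption)."""

    def transcribe(self, audio_path: str) -> str:
        ...


class FakeTranscriptionProvider:
    """Hypothetical provider: satisfies the protocol without inheriting from it."""

    def transcribe(self, audio_path: str) -> str:
        return f"transcript of {audio_path}"


def transcribe_episode(provider: TranscriptionProvider, audio_path: str) -> str:
    # The caller depends only on the protocol, so any conforming provider
    # can be injected without changing this function.
    return provider.transcribe(audio_path)
```

Because the match is structural, FakeTranscriptionProvider needs no base class; any object with a compatible transcribe method can be passed in.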
Pipeline¶
The orchestrated sequence of operations that processes podcast episodes: RSS parsing → transcript download → transcription → speaker detection → summarization → metadata.
Entry point: run_pipeline() in src/podcast_scraper/__init__.py
Config¶
Pydantic model that holds all configuration options. Can be loaded from CLI arguments, YAML/JSON files, or environment variables.
See: Configuration Guide
Testing Terms¶
Critical Path¶
The minimal set of tests that verify core functionality. Used for fast CI feedback.
Marked with @pytest.mark.critical_path.
See: Critical Path Testing Guide
E2E Test¶
End-to-end test that exercises the full system with real HTTP calls and actual ML models. External APIs are not contacted: requests go to the local E2E test server, which mocks them.
Marker: @pytest.mark.e2e
See: E2E Testing Guide
E2E Server¶
A FastAPI test server (tests/e2e/e2e_server.py) that mocks external APIs (OpenAI, RSS feeds) for E2E testing without hitting real endpoints.
Integration Test¶
Test that verifies component interactions with mocked external dependencies. Uses real internal code but mocks network, filesystem, and ML models.
Marker: @pytest.mark.integration
See: Integration Testing Guide
Unit Test¶
Test that verifies a single function or class in isolation. All dependencies are mocked. Fast and deterministic.
Marker: @pytest.mark.unit
See: Unit Testing Guide
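As a minimal sketch of "all dependencies are mocked", the tests below replace a provider with unittest.mock.Mock. The function summarize_episode and the summarize method are hypothetical and exist only for this example. In the project such tests would also carry @pytest.mark.unit; the marker is left out here so the sketch runs with the standard library alone.

```python
from unittest.mock import Mock


def summarize_episode(transcript: str, summarizer) -> str:
    """Hypothetical function under test: delegates to an injected summarizer."""
    if not transcript.strip():
        return ""
    return summarizer.summarize(transcript)


def test_summarize_episode_uses_injected_provider():
    # The provider is a Mock, so no model is loaded and no network is touched.
    summarizer = Mock()
    summarizer.summarize.return_value = "short summary"

    result = summarize_episode("full transcript text", summarizer)

    assert result == "short summary"
    summarizer.summarize.assert_called_once_with("full transcript text")


def test_summarize_episode_skips_empty_transcript():
    # Whitespace-only input must short-circuit before the provider is called.
    summarizer = Mock()

    assert summarize_episode("   ", summarizer) == ""
    summarizer.summarize.assert_not_called()
```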
Serial Marker¶
@pytest.mark.serial forces a test to run sequentially rather than in parallel. Used for tests that consume significant memory or share state.
Test Fixture¶
Pytest fixture that provides test data or setup. Defined in conftest.py files.
Examples: sample_episode, mock_config, temp_output_dir
ML Terms¶
Model Preloading¶
Process of downloading and caching ML models before running tests. Prevents network calls during test execution.
Command: make preload-ml-models
See: RFC-028
Whisper¶
OpenAI's speech-to-text model used for transcription when podcasts don't provide transcripts.
Models: tiny, base, small, medium, large
spaCy¶
NLP library used for Named Entity Recognition (NER) in speaker detection.
Model: en_core_web_sm
BART / LED¶
Transformer models used for summarization.
- BART: facebook/bart-large-cnn - Short document summarization
- LED: allenai/led-base-16384 - Long document summarization
CI/CD Terms¶
Two-Tier Testing¶
CI strategy where PRs run the fast critical-path tests and the main branch runs the full test suite.
See: CI/CD
Path Filtering¶
GitHub Actions feature that only triggers workflows when specific files change. Reduces unnecessary CI runs.
Pre-commit Hook¶
Git hook that runs checks (formatting, linting) before allowing commits.
Config: .pre-commit-config.yaml
File/Directory Terms¶
.test_outputs/¶
Directory where test-generated files are written. Gitignored.
.build/¶
Directory for build artifacts (docs site, distributions, coverage reports). Gitignored.
.cache/¶
ML model cache directory within the project. Alternative to ~/.cache/.
conftest.py¶
Pytest configuration file containing fixtures and hooks. Can exist at multiple levels (root, per-directory).
Configuration Terms¶
RSS Feed¶
XML feed that lists podcast episodes with metadata and media URLs.
Podcast 2.0 Transcript Tag¶
RSS extension that provides transcript URLs directly in the feed XML.
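For illustration, the following sketch reads the podcast:transcript tag from a feed using only the standard library. The sample feed, its URLs, and the helper transcript_urls are made up for this example and are not taken from the podcast_scraper codebase; the namespace URI is the one published for the Podcast 2.0 namespace.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the Podcast 2.0 RSS extension.
PODCAST_NS = {"podcast": "https://podcastindex.org/namespace/1.0"}

# Made-up feed: Episode 1 advertises a transcript, Episode 2 does not.
SAMPLE_FEED = """\
<rss version="2.0" xmlns:podcast="https://podcastindex.org/namespace/1.0">
  <channel>
    <title>Example Show</title>
    <item>
      <title>Episode 1</title>
      <enclosure url="https://example.com/ep1.mp3" type="audio/mpeg" length="0"/>
      <podcast:transcript url="https://example.com/ep1.vtt" type="text/vtt"/>
    </item>
    <item>
      <title>Episode 2</title>
      <enclosure url="https://example.com/ep2.mp3" type="audio/mpeg" length="0"/>
    </item>
  </channel>
</rss>
"""


def transcript_urls(feed_xml):
    """Map each episode title to the transcript URLs listed in the feed."""
    root = ET.fromstring(feed_xml)
    result = {}
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        tags = item.findall("podcast:transcript", PODCAST_NS)
        result[title] = [tag.get("url") for tag in tags if tag.get("url")]
    return result


print(transcript_urls(SAMPLE_FEED))
```

An episode with an empty list, like Episode 2 here, is the case where no transcript can be downloaded and audio transcription is needed instead.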
Output Directory¶
Where podcast_scraper writes downloaded transcripts, metadata, and summaries.
CLI: --output-dir
Config: output_dir
Skip Existing¶
Option to skip processing episodes that already have output files.
CLI: --skip-existing
Config: skip_existing: true
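A rough sketch of what skipping existing output could look like follows. The helper episodes_to_process and the one-file-per-episode naming scheme (episode ID plus ".txt") are assumptions for illustration only; the actual output layout and check in podcast_scraper may differ.

```python
from pathlib import Path


def episodes_to_process(episode_ids, output_dir, skip_existing=True):
    """Return the episode IDs that still need processing.

    Assumes (hypothetically) one '<episode_id>.txt' file per processed episode.
    """
    out = Path(output_dir)
    pending = []
    for episode_id in episode_ids:
        # With skip_existing enabled, an existing output file means the
        # episode is left alone.
        if skip_existing and (out / f"{episode_id}.txt").exists():
            continue
        pending.append(episode_id)
    return pending
```

With skip_existing set to False every episode is returned, so existing files would be regenerated.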
Code Style Terms¶
Black¶
Python code formatter. Enforces consistent style.
Line length: 100 characters
isort¶
Python import sorter. Groups and orders imports.
flake8¶
Python linter for style and error checking.
mypy¶
Static type checker for Python.
markdownlint¶
Linter for Markdown files. Config in .markdownlint.json.
Related Documentation¶
- Architecture - System design overview
- Development Guide - Development workflow
- Testing Guide - Test execution reference