Skip to content

Glossary

Key terms and concepts used throughout the podcast_scraper codebase and documentation.


Architecture Terms

Provider

A modular implementation of a specific capability (transcription, speaker detection, summarization). Providers implement protocols and can be swapped without changing calling code.

Examples: WhisperTranscriptionProvider, OpenAITranscriptionProvider

See: Provider Implementation Guide

Protocol

A Python typing.Protocol that defines the interface a provider must implement. Enables duck typing and dependency injection.

Examples: TranscriptionProvider, SpeakerDetectionProvider, SummarizationProvider

See: Protocol Extension Guide

Pipeline

The orchestrated sequence of operations that processes podcast episodes: RSS parsing → transcript download → transcription → speaker detection → summarization → metadata.

Entry point: run_pipeline() in src/podcast_scraper/__init__.py

Config

Pydantic model that holds all configuration options. Can be loaded from CLI arguments, YAML/JSON files, or environment variables.

See: Configuration Guide


Testing Terms

Critical Path

The minimal set of tests that verify core functionality. Used for fast CI feedback. Marked with @pytest.mark.critical_path.

See: Critical Path Testing Guide

E2E Test

End-to-end test that exercises the full system with real HTTP calls and actual ML models. Uses the E2E test server for mocking external APIs.

Marker: @pytest.mark.e2e

See: E2E Testing Guide

E2E Server

A FastAPI test server (tests/e2e/e2e_server.py) that mocks external APIs (OpenAI, RSS feeds) for E2E testing without hitting real endpoints.

Integration Test

Test that verifies component interactions with mocked external dependencies. Uses real internal code but mocks network, filesystem, and ML models.

Marker: @pytest.mark.integration

See: Integration Testing Guide

Unit Test

Test that verifies a single function or class in isolation. All dependencies are mocked. Fast and deterministic.

Marker: @pytest.mark.unit

See: Unit Testing Guide

Serial Marker

@pytest.mark.serial - Forces test to run sequentially (not in parallel). Used for tests that consume significant memory or have shared state.

Test Fixture

Pytest fixture that provides test data or setup. Defined in conftest.py files.

Examples: sample_episode, mock_config, temp_output_dir


ML Terms

Model Preloading

Process of downloading and caching ML models before running tests. Prevents network calls during test execution.

Command: make preload-ml-models

See: RFC-028

Whisper

OpenAI's speech-to-text model used for transcription when podcasts don't provide transcripts.

Models: tiny, base, small, medium, large

spaCy

NLP library used for Named Entity Recognition (NER) in speaker detection.

Model: en_core_web_sm

BART / LED

Transformer models used for summarization.

  • BART: facebook/bart-large-cnn - Short document summarization
  • LED: allenai/led-base-16384 - Long document summarization

CI/CD Terms

Two-Tier Testing

CI strategy where PRs run fast critical-path tests, and main branch runs full test suite.

See: CI/CD

Path Filtering

GitHub Actions feature that only triggers workflows when specific files change. Reduces unnecessary CI runs.

Pre-commit Hook

Git hook that runs checks (formatting, linting) before allowing commits.

Config: .pre-commit-config.yaml


File/Directory Terms

.test_outputs/

Directory where test-generated files are written. Gitignored.

.build/

Directory for build artifacts (docs site, distributions, coverage reports). Gitignored.

.cache/

ML model cache directory within the project. Alternative to ~/.cache/.

conftest.py

Pytest configuration file containing fixtures and hooks. Can exist at multiple levels (root, per-directory).


Configuration Terms

RSS Feed

XML feed that lists podcast episodes with metadata and media URLs.

Podcast 2.0 Transcript Tag

RSS extension that provides transcript URLs directly in the feed XML.

Output Directory

Where podcast_scraper writes downloaded transcripts, metadata, and summaries.

CLI: --output-dir

Config: output_dir

Skip Existing

Option to skip processing episodes that already have output files.

CLI: --skip-existing

Config: skip_existing: true


Code Style Terms

Black

Python code formatter. Enforces consistent style.

Line length: 100 characters

isort

Python import sorter. Groups and orders imports.

flake8

Python linter for style and error checking.

mypy

Static type checker for Python.

markdownlint

Linter for Markdown files. Config in .markdownlint.json.