
Core API

This is the primary public API for podcast_scraper. Use these functions for programmatic access.

Quick Start

import podcast_scraper

# Create configuration
cfg = podcast_scraper.Config(
    rss="https://example.com/feed.xml",
    output_dir="./transcripts",
    max_episodes=10
)

# Run the pipeline
count, summary = podcast_scraper.run_pipeline(cfg)
print(f"Downloaded {count} transcripts: {summary}")

API Reference

run_pipeline

run_pipeline(cfg: Config) -> Tuple[int, str]

Execute the main podcast scraping pipeline.

This is the primary entry point for programmatic use of podcast_scraper. It orchestrates the complete workflow from RSS feed fetching to transcript generation and optional metadata/summarization.

The pipeline executes the following stages:

  1. Setup output directory (with optional run ID subdirectory)
  2. Fetch and parse RSS feed
  3. Detect speakers (if auto-detection enabled)
  4. Process episodes concurrently:
     • Download published transcripts, or
     • Queue media for Whisper transcription
  5. Transcribe queued media files sequentially (if Whisper enabled)
  6. Generate metadata documents (if enabled)
  7. Generate episode summaries (if enabled)
  8. Clean up temporary files
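The staged structure above — run stages in order, free resources even on failure — can be sketched in miniature. This is a hedged illustration of the orchestration pattern only; the stage functions here are hypothetical placeholders, not the real pipeline internals:

```python
from typing import Callable, Dict, List, Tuple


def run_stages(
    stages: List[Callable[[Dict], None]],
    cleanup: Callable[[Dict], None],
) -> Tuple[int, str]:
    """Run each stage against a shared context; always run cleanup, even on failure."""
    ctx: Dict = {"saved": 0}
    try:
        for stage in stages:
            stage(ctx)
    finally:
        cleanup(ctx)  # mirrors the pipeline's try/finally provider cleanup
    return ctx["saved"], f"Processed {ctx['saved']} episodes"


# Hypothetical stages standing in for fetch/process/transcribe:
def fetch(ctx):
    ctx["episodes"] = ["ep1", "ep2"]


def process(ctx):
    ctx["saved"] = len(ctx["episodes"])


count, summary = run_stages([fetch, process], cleanup=lambda ctx: ctx.pop("episodes", None))
```

The real `run_pipeline` follows the same shape: its episode processing is wrapped in `try`/`finally` so that model unloading and emitter shutdown happen even when a stage raises.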

Parameters:

  • cfg (Config, required): Configuration object with all pipeline settings. See Config for available options.

Returns:

  Tuple[int, str]: A tuple containing:

  • count (int): Number of episodes processed (transcripts saved or planned)
  • summary (str): Human-readable summary message describing the run

Raises:

  • RuntimeError: If output directory cleanup fails when clean_output=True
  • ValueError: If the RSS URL is invalid or the feed cannot be parsed
  • FileNotFoundError: If the configuration file references missing files
  • OSError: If file system operations fail
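Because run_pipeline signals failure by raising rather than by returning error codes, callers often wrap it. A minimal sketch of that pattern — the run_safely helper is hypothetical, not part of the package, and the stub pipeline below stands in for a real run_pipeline call:

```python
from typing import Callable, Tuple


def run_safely(pipeline: Callable[[], Tuple[int, str]]) -> Tuple[int, str]:
    """Map the documented exceptions to a (0, message) result instead of raising."""
    try:
        return pipeline()
    except ValueError as exc:               # invalid RSS URL or unparseable feed
        return 0, f"feed error: {exc}"
    except FileNotFoundError as exc:        # config references missing files
        return 0, f"missing file: {exc}"
    except (RuntimeError, OSError) as exc:  # cleanup or filesystem failure
        return 0, f"I/O error: {exc}"


def broken_pipeline() -> Tuple[int, str]:
    # Stand-in for `lambda: run_pipeline(cfg)` with a bad feed
    raise ValueError("feed cannot be parsed")


count, summary = run_safely(broken_pipeline)
```

In real use you would pass `lambda: run_pipeline(cfg)`; for structured error handling without a wrapper, see the service.run() note below.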

Example

from podcast_scraper import Config, run_pipeline

cfg = Config(
    rss="https://example.com/feed.xml",
    output_dir="./transcripts",
    max_episodes=10
)
count, summary = run_pipeline(cfg)
print(f"Downloaded {count} transcripts: {summary}")
# Downloaded 10 transcripts: Processed 10/50 episodes

Example with Whisper transcription

cfg = Config(
    rss="https://example.com/feed.xml",
    transcribe_missing=True,
    whisper_model="base",
    screenplay=True,
    num_speakers=2
)
count, summary = run_pipeline(cfg)

Note

For non-interactive use (daemons, services), consider using the service.run() function instead, which provides structured error handling and return values.

See Also
  • Config: Configuration model with all available options
  • service.run(): Service API with structured error handling
  • load_config_file(): Load configuration from JSON/YAML file
Source code in src/podcast_scraper/workflow/orchestration.py
def run_pipeline(cfg: config.Config) -> Tuple[int, str]:
    """Execute the main podcast scraping pipeline.

    This is the primary entry point for programmatic use of podcast_scraper. It orchestrates
    the complete workflow from RSS feed fetching to transcript generation and optional
    metadata/summarization.

    The pipeline executes the following stages:

    1. Setup output directory (with optional run ID subdirectory)
    2. Fetch and parse RSS feed
    3. Detect speakers (if auto-detection enabled)
    4. Process episodes concurrently:
       - Download published transcripts
       - Or queue media for Whisper transcription
    5. Transcribe queued media files sequentially (if Whisper enabled)
    6. Generate metadata documents (if enabled)
    7. Generate episode summaries (if enabled)
    8. Clean up temporary files

    Args:
        cfg: Configuration object with all pipeline settings. See `Config` for available options.

    Returns:
        Tuple[int, str]: A tuple containing:

            - count (int): Number of episodes processed (transcripts saved or planned)
            - summary (str): Human-readable summary message describing the run

    Raises:
        RuntimeError: If output directory cleanup fails when `clean_output=True`
        ValueError: If RSS URL is invalid or feed cannot be parsed
        FileNotFoundError: If configuration file references missing files
        OSError: If file system operations fail

    Example:
        >>> from podcast_scraper import Config, run_pipeline
        >>>
        >>> cfg = Config(
        ...     rss="https://example.com/feed.xml",
        ...     output_dir="./transcripts",
        ...     max_episodes=10
        ... )
        >>> count, summary = run_pipeline(cfg)
        >>> print(f"Downloaded {count} transcripts: {summary}")
        Downloaded 10 transcripts: Processed 10/50 episodes

    Example with Whisper transcription:
        >>> cfg = Config(
        ...     rss="https://example.com/feed.xml",
        ...     transcribe_missing=True,
        ...     whisper_model="base",
        ...     screenplay=True,
        ...     num_speakers=2
        ... )
        >>> count, summary = run_pipeline(cfg)

    Example with metadata and summaries:
        >>> cfg = Config(
        ...     rss="https://example.com/feed.xml",
        ...     generate_metadata=True,
        ...     generate_summaries=True
        ... )
        >>> count, summary = run_pipeline(cfg)

    Note:
        For non-interactive use (daemons, services), consider using the `service.run()`
        function instead, which provides structured error handling and return values.

    See Also:
        - `Config`: Configuration model with all available options
        - `service.run()`: Service API with structured error handling
        - `load_config_file()`: Load configuration from JSON/YAML file
    """
    # Step 1: Setup pipeline environment
    effective_output_dir, run_suffix, full_config_string, pipeline_metrics = (
        _setup_pipeline_environment(cfg)
    )

    from ..gi.deps import validate_gil_grounding_dependencies

    validate_gil_grounding_dependencies(cfg)

    # Initialize JSONL emitter if enabled
    jsonl_emitter = _setup_jsonl_emitter(cfg, effective_output_dir, pipeline_metrics)

    # Step 1.5: Preload ML models if configured
    wf_stages.setup.preload_ml_models_if_needed(cfg)

    # Step 1.6: Create all providers once (singleton pattern per run)
    # Providers are created here and passed to stages to avoid redundant initialization
    transcription_provider, speaker_detector, summary_provider = _create_all_providers(cfg)

    # Step 1.7-1.8: Setup logging and device tracking
    _setup_logging_and_devices(
        cfg, transcription_provider, speaker_detector, summary_provider, pipeline_metrics
    )

    # Step 1.5: Create run manifest
    run_manifest = _create_run_manifest(cfg, effective_output_dir)

    # Step 2-4: Fetch and prepare episodes
    feed, rss_bytes, feed_metadata, episodes = _fetch_and_prepare_episodes(cfg, pipeline_metrics)

    # Step 5-6.5: Setup pipeline resources
    normalizing_start, host_detection_result, transcription_resources, processing_resources = (
        _setup_pipeline_resources(
            cfg,
            feed,
            episodes,
            effective_output_dir,
            transcription_provider,
            speaker_detector,
            pipeline_metrics,
        )
    )

    # Wrap all processing in try-finally to ensure cleanup always happens
    # This prevents memory leaks if exceptions occur during processing
    try:
        saved = _process_episodes_with_threading(
            cfg=cfg,
            episodes=episodes,
            feed=feed,
            effective_output_dir=effective_output_dir,
            run_suffix=run_suffix,
            feed_metadata=feed_metadata,
            host_detection_result=host_detection_result,
            transcription_resources=transcription_resources,
            processing_resources=processing_resources,
            pipeline_metrics=pipeline_metrics,
            summary_provider=summary_provider,
            transcription_provider=transcription_provider,
            normalizing_start=normalizing_start,
        )

    finally:
        # Step 9.5: Unload models to free memory
        # This runs even if exceptions occur above, preventing memory leaks
        _cleanup_providers(transcription_resources, summary_provider)
        if jsonl_emitter is not None:
            try:
                jsonl_emitter.__exit__(None, None, None)
            except Exception:
                pass

    # Step 10-15: Finalize pipeline (cleanup, save metrics, generate reports)
    return _finalize_pipeline(
        cfg=cfg,
        saved=saved,
        transcription_resources=transcription_resources,
        effective_output_dir=effective_output_dir,
        run_suffix=run_suffix,
        pipeline_metrics=pipeline_metrics,
        episodes=episodes,
        jsonl_emitter=jsonl_emitter,
        run_manifest=run_manifest,
        summary_provider=summary_provider,
        transcription_provider=transcription_provider,
    )

load_config_file

load_config_file(path: str) -> Dict[str, Any]

Load configuration from a JSON or YAML file.

This function reads a configuration file and returns a dictionary of configuration values. The file format is auto-detected from the file extension (.json, .yaml, or .yml).

The returned dictionary can be unpacked into the Config constructor to create a configuration object.

Parameters:

  • path (str, required): Path to the configuration file (JSON or YAML). Supports tilde expansion for the home directory (e.g., "~/config.yaml").

Returns:

  Dict[str, Any]: Dictionary containing configuration values from the file. Keys correspond to Config field names (using aliases where applicable).

Raises:

  • ValueError: If any of the following occur:
      • Config path is empty
      • Config file does not exist
      • File format is invalid (not JSON or YAML)
      • JSON or YAML parsing fails
  • OSError: If the file cannot be read due to permissions or I/O errors

Example

from podcast_scraper import Config, load_config_file, run_pipeline

# Load from YAML file
config_dict = load_config_file("config.yaml")
cfg = Config(**config_dict)
count, summary = run_pipeline(cfg)

Example with JSON

config_dict = load_config_file("config.json")
cfg = Config(**config_dict)

Example with direct usage

from podcast_scraper import load_config_file, service

# Service API provides load_config_file convenience
result = service.run_from_config_file("config.yaml")

Supported Formats

JSON (.json):

{
  "rss": "https://example.com/feed.xml",
  "output_dir": "./transcripts",
  "max_episodes": 50
}

YAML (.yaml, .yml):

rss: https://example.com/feed.xml
output_dir: ./transcripts
max_episodes: 50
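The extension-based dispatch described above can be mimicked with the standard library alone. A minimal sketch of the JSON branch — load_json_config is a hypothetical helper for illustration, and the file name and values are made up:

```python
import json
import tempfile
from pathlib import Path


def load_json_config(path: str) -> dict:
    """Read a .json config file and require a top-level mapping, as load_config_file does."""
    resolved = Path(path).expanduser().resolve()
    if resolved.suffix.lower() != ".json":
        raise ValueError(f"Unsupported config file type: {resolved.suffix}")
    data = json.loads(resolved.read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError("Config file must contain a mapping/object at the top level")
    return data


with tempfile.TemporaryDirectory() as tmp:
    cfg_path = Path(tmp) / "config.json"
    cfg_path.write_text('{"rss": "https://example.com/feed.xml", "max_episodes": 50}')
    cfg = load_json_config(str(cfg_path))
```

The real load_config_file additionally handles the `.yaml`/`.yml` branch via `yaml.safe_load` and wraps read and parse failures in ValueError, as shown in the source listing below.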
Note

  • Field aliases are supported (e.g., both "rss" and "rss_url" work)
  • See Config documentation for all available configuration options
  • Configuration files should not contain sensitive data (API keys, passwords)

See Also

  • Config: Configuration model and field documentation
  • service.run_from_config_file(): Direct service API from config file
  • Configuration examples: config/examples/config.example.json, config/examples/config.example.yaml
Source code in src/podcast_scraper/config.py
def load_config_file(
    path: str,
) -> Dict[str, Any]:  # noqa: C901 - file parsing handles multiple formats
    """Load configuration from a JSON or YAML file.

    This function reads a configuration file and returns a dictionary of configuration values.
    The file format is auto-detected from the file extension (`.json`, `.yaml`, or `.yml`).

    The returned dictionary can be unpacked into the `Config` constructor to create a
    configuration object.

    Args:
        path: Path to configuration file (JSON or YAML). Supports tilde expansion for
              home directory (e.g., "~/config.yaml").

    Returns:
        Dict[str, Any]: Dictionary containing configuration values from the file.
            Keys correspond to `Config` field names (using aliases where applicable).

    Raises:
        ValueError: If any of the following occur:

            - Config path is empty
            - Config file does not exist
            - File format is invalid (not JSON or YAML)
            - JSON parsing fails
            - YAML parsing fails

        OSError: If file cannot be read due to permissions or I/O errors

    Example:
        >>> from podcast_scraper import Config, load_config_file, run_pipeline
        >>>
        >>> # Load from YAML file
        >>> config_dict = load_config_file("config.yaml")
        >>> cfg = Config(**config_dict)
        >>> count, summary = run_pipeline(cfg)

    Example with JSON:
        >>> config_dict = load_config_file("config.json")
        >>> cfg = Config(**config_dict)

    Example with direct usage:
        >>> from podcast_scraper import load_config_file, service
        >>>
        >>> # Service API provides load_config_file convenience
        >>> result = service.run_from_config_file("config.yaml")

    Supported Formats:
        **JSON** (`.json`):

            {
              "rss": "https://example.com/feed.xml",
              "output_dir": "./transcripts",
              "max_episodes": 50
            }

        **YAML** (`.yaml`, `.yml`):

            rss: https://example.com/feed.xml
            output_dir: ./transcripts
            max_episodes: 50

    Note:
        - Field aliases are supported (e.g., both "rss" and "rss_url" work)
        - See `Config` documentation for all available configuration options
        - Configuration files should not contain sensitive data (API keys, passwords)

    See Also:
        - `Config`: Configuration model and field documentation
        - `service.run_from_config_file()`: Direct service API from config file
        - Configuration examples: `config/examples/config.example.json`,
          `config/examples/config.example.yaml`
    """
    if not path:
        raise ValueError("Config path cannot be empty")

    cfg_path = Path(path).expanduser()
    try:
        resolved = cfg_path.resolve()
    except (OSError, RuntimeError) as exc:
        raise ValueError(f"Invalid config path: {path} ({exc})") from exc

    if not resolved.exists():
        raise ValueError(f"Config file not found: {resolved}")

    suffix = resolved.suffix.lower()
    try:
        text = resolved.read_text(encoding="utf-8")
    except OSError as exc:
        raise ValueError(f"Failed to read config file {resolved}: {exc}") from exc

    if suffix == ".json":
        try:
            data = json.loads(text)
        except json.JSONDecodeError as exc:
            raise ValueError(f"Invalid JSON config file {resolved}: {exc}") from exc
    elif suffix in (".yaml", ".yml"):
        try:
            data = yaml.safe_load(text)
        except yaml.YAMLError as exc:  # type: ignore[attr-defined]
            raise ValueError(f"Invalid YAML config file {resolved}: {exc}") from exc
    else:
        raise ValueError(f"Unsupported config file type: {resolved.suffix}")

    if not isinstance(data, dict):
        raise ValueError("Config file must contain a mapping/object at the top level")

    return data

Package Information

Versioning

__version__ module-attribute

__version__ = '2.4.0'

__api_version__ module-attribute

__api_version__ = __version__