
Core API

This is the primary public API for podcast_scraper. Use these functions for programmatic access.

Quick Start

import podcast_scraper

# Create configuration
cfg = podcast_scraper.Config(
    rss="https://example.com/feed.xml",
    output_dir="./transcripts",
    max_episodes=10
)

# Run the pipeline
count, summary = podcast_scraper.run_pipeline(cfg)
print(f"Downloaded {count} transcripts: {summary}")

API Reference

run_pipeline

run_pipeline(cfg: Config) -> Tuple[int, str]

Execute the main podcast scraping pipeline.

This is the primary entry point for programmatic use of podcast_scraper. It orchestrates the complete workflow from RSS feed fetching to transcript generation and optional metadata/summarization.

The pipeline executes the following stages:

  1. Setup output directory (with optional run ID subdirectory)
  2. Fetch and parse RSS feed
  3. Detect speakers (if auto-detection enabled)
  4. Process episodes concurrently:
     • Download published transcripts, or
     • Queue media for Whisper transcription
  5. Transcribe queued media files sequentially (if Whisper enabled)
  6. Generate metadata documents (if enabled)
  7. Generate episode summaries (if enabled)
  8. Clean up temporary files
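The staged structure above — run stages in order, free resources even on failure — can be sketched in miniature. This is a hedged illustration of the orchestration pattern only; the stage functions here are hypothetical placeholders, not the real pipeline internals:

```python
from typing import Callable, Dict, List, Tuple


def run_stages(
    stages: List[Callable[[Dict], None]],
    cleanup: Callable[[Dict], None],
) -> Tuple[int, str]:
    """Run each stage against a shared context; always run cleanup, even on failure."""
    ctx: Dict = {"saved": 0}
    try:
        for stage in stages:
            stage(ctx)
    finally:
        cleanup(ctx)  # mirrors the pipeline's try/finally provider cleanup
    return ctx["saved"], f"Processed {ctx['saved']} episodes"


# Hypothetical stages standing in for fetch/process/transcribe:
def fetch(ctx):
    ctx["episodes"] = ["ep1", "ep2"]


def process(ctx):
    ctx["saved"] = len(ctx["episodes"])


count, summary = run_stages([fetch, process], cleanup=lambda ctx: ctx.pop("episodes", None))
```

The real `run_pipeline` follows the same shape: its episode processing is wrapped in `try`/`finally` so that model unloading and emitter shutdown happen even when a stage raises.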

Parameters:

  • cfg (Config, required): Configuration object with all pipeline settings. See Config for available options.

Returns:

  Tuple[int, str]: A tuple containing:

  • count (int): Number of episodes processed (transcripts saved or planned)
  • summary (str): Human-readable summary message describing the run

Raises:

  • RuntimeError: If output directory cleanup fails when clean_output=True
  • ValueError: If the RSS URL is invalid or the feed cannot be parsed
  • FileNotFoundError: If the configuration file references missing files
  • OSError: If file system operations fail
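Because run_pipeline signals failure by raising rather than by returning error codes, callers often wrap it. A minimal sketch of that pattern — the run_safely helper is hypothetical, not part of the package, and the stub pipeline below stands in for a real run_pipeline call:

```python
from typing import Callable, Tuple


def run_safely(pipeline: Callable[[], Tuple[int, str]]) -> Tuple[int, str]:
    """Map the documented exceptions to a (0, message) result instead of raising."""
    try:
        return pipeline()
    except ValueError as exc:               # invalid RSS URL or unparseable feed
        return 0, f"feed error: {exc}"
    except FileNotFoundError as exc:        # config references missing files
        return 0, f"missing file: {exc}"
    except (RuntimeError, OSError) as exc:  # cleanup or filesystem failure
        return 0, f"I/O error: {exc}"


def broken_pipeline() -> Tuple[int, str]:
    # Stand-in for `lambda: run_pipeline(cfg)` with a bad feed
    raise ValueError("feed cannot be parsed")


count, summary = run_safely(broken_pipeline)
```

In real use you would pass `lambda: run_pipeline(cfg)`; for structured error handling without a wrapper, see the service.run() note below.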

Example

from podcast_scraper import Config, run_pipeline

cfg = Config(
    rss="https://example.com/feed.xml",
    output_dir="./transcripts",
    max_episodes=10
)
count, summary = run_pipeline(cfg)
print(f"Downloaded {count} transcripts: {summary}")
# Downloaded 10 transcripts: Processed 10/50 episodes

Example with Whisper transcription

cfg = Config(
    rss="https://example.com/feed.xml",
    transcribe_missing=True,
    whisper_model="base",
    screenplay=True,
    num_speakers=2
)
count, summary = run_pipeline(cfg)

Note

For non-interactive use (daemons, services), consider using the service.run() function instead, which provides structured error handling and return values.

See Also
  • Config: Configuration model with all available options
  • service.run(): Service API with structured error handling
  • load_config_file(): Load configuration from JSON/YAML file
Source code in src/podcast_scraper/workflow/orchestration.py
def run_pipeline(cfg: config.Config) -> Tuple[int, str]:
    """Execute the main podcast scraping pipeline.

    This is the primary entry point for programmatic use of podcast_scraper. It orchestrates
    the complete workflow from RSS feed fetching to transcript generation and optional
    metadata/summarization.

    The pipeline executes the following stages:

    1. Setup output directory (with optional run ID subdirectory)
    2. Fetch and parse RSS feed
    3. Detect speakers (if auto-detection enabled)
    4. Process episodes concurrently:
       - Download published transcripts
       - Or queue media for Whisper transcription
    5. Transcribe queued media files sequentially (if Whisper enabled)
    6. Generate metadata documents (if enabled)
    7. Generate episode summaries (if enabled)
    8. Clean up temporary files

    Args:
        cfg: Configuration object with all pipeline settings. See `Config` for available options.

    Returns:
        Tuple[int, str]: A tuple containing:

            - count (int): Number of episodes processed (transcripts saved or planned)
            - summary (str): Human-readable summary message describing the run

    Raises:
        RuntimeError: If output directory cleanup fails when `clean_output=True`
        ValueError: If RSS URL is invalid or feed cannot be parsed
        FileNotFoundError: If configuration file references missing files
        OSError: If file system operations fail

    Example:
        >>> from podcast_scraper import Config, run_pipeline
        >>>
        >>> cfg = Config(
        ...     rss="https://example.com/feed.xml",
        ...     output_dir="./transcripts",
        ...     max_episodes=10
        ... )
        >>> count, summary = run_pipeline(cfg)
        >>> print(f"Downloaded {count} transcripts: {summary}")
        Downloaded 10 transcripts: Processed 10/50 episodes

    Example with Whisper transcription:
        >>> cfg = Config(
        ...     rss="https://example.com/feed.xml",
        ...     transcribe_missing=True,
        ...     whisper_model="base",
        ...     screenplay=True,
        ...     num_speakers=2
        ... )
        >>> count, summary = run_pipeline(cfg)

    Example with metadata and summaries:
        >>> cfg = Config(
        ...     rss="https://example.com/feed.xml",
        ...     generate_metadata=True,
        ...     generate_summaries=True
        ... )
        >>> count, summary = run_pipeline(cfg)

    Note:
        For non-interactive use (daemons, services), consider using the `service.run()`
        function instead, which provides structured error handling and return values.

    See Also:
        - `Config`: Configuration model with all available options
        - `service.run()`: Service API with structured error handling
        - `load_config_file()`: Load configuration from JSON/YAML file
    """
    # Step 1: Setup pipeline environment
    effective_output_dir, run_suffix, full_config_string, pipeline_metrics = (
        _setup_pipeline_environment(cfg)
    )

    from ..gi.deps import validate_gil_grounding_dependencies

    validate_gil_grounding_dependencies(cfg)

    # Initialize JSONL emitter if enabled
    jsonl_emitter = _setup_jsonl_emitter(cfg, effective_output_dir, pipeline_metrics)

    # Step 1.5: Preload ML models if configured
    wf_stages.setup.preload_ml_models_if_needed(cfg)

    # Step 1.6: Create all providers once (singleton pattern per run)
    # Providers are created here and passed to stages to avoid redundant initialization
    transcription_provider, speaker_detector, summary_provider = _create_all_providers(cfg)

    # Step 1.7-1.8: Setup logging and device tracking
    _setup_logging_and_devices(
        cfg, transcription_provider, speaker_detector, summary_provider, pipeline_metrics
    )

    # Step 1.5: Create run manifest
    run_manifest = _create_run_manifest(cfg, effective_output_dir)

    # Step 2-4: Fetch and prepare episodes
    feed, rss_bytes, feed_metadata, episodes = _fetch_and_prepare_episodes(cfg, pipeline_metrics)

    # Step 5-6.5: Setup pipeline resources
    normalizing_start, host_detection_result, transcription_resources, processing_resources = (
        _setup_pipeline_resources(
            cfg,
            feed,
            episodes,
            effective_output_dir,
            transcription_provider,
            speaker_detector,
            pipeline_metrics,
        )
    )

    # Wrap all processing in try-finally to ensure cleanup always happens
    # This prevents memory leaks if exceptions occur during processing
    try:
        saved = _process_episodes_with_threading(
            cfg=cfg,
            episodes=episodes,
            feed=feed,
            effective_output_dir=effective_output_dir,
            run_suffix=run_suffix,
            feed_metadata=feed_metadata,
            host_detection_result=host_detection_result,
            transcription_resources=transcription_resources,
            processing_resources=processing_resources,
            pipeline_metrics=pipeline_metrics,
            summary_provider=summary_provider,
            transcription_provider=transcription_provider,
            normalizing_start=normalizing_start,
        )

    finally:
        # Step 9.5: Unload models to free memory
        # This runs even if exceptions occur above, preventing memory leaks
        _cleanup_providers(transcription_resources, summary_provider)
        if jsonl_emitter is not None:
            try:
                jsonl_emitter.__exit__(None, None, None)
            except Exception:
                pass

    # Step 10-15: Finalize pipeline (cleanup, save metrics, generate reports)
    return _finalize_pipeline(
        cfg=cfg,
        saved=saved,
        transcription_resources=transcription_resources,
        effective_output_dir=effective_output_dir,
        run_suffix=run_suffix,
        pipeline_metrics=pipeline_metrics,
        episodes=episodes,
        jsonl_emitter=jsonl_emitter,
        run_manifest=run_manifest,
        summary_provider=summary_provider,
        transcription_provider=transcription_provider,
    )

load_config_file

load_config_file(path: str) -> Dict[str, Any]

Load configuration from a JSON or YAML file.

This function reads a configuration file and returns a dictionary of configuration values. The file format is auto-detected from the file extension (.json, .yaml, or .yml).

The returned dictionary can be unpacked into the Config constructor to create a configuration object.

Parameters:

  • path (str, required): Path to the configuration file (JSON or YAML). Supports tilde expansion for the home directory (e.g., "~/config.yaml").

Returns:

  Dict[str, Any]: Dictionary containing configuration values from the file. Keys correspond to Config field names (using aliases where applicable).

Raises:

  • ValueError: If any of the following occur:
      • Config path is empty
      • Config file does not exist
      • File format is invalid (not JSON or YAML)
      • JSON or YAML parsing fails
  • OSError: If the file cannot be read due to permissions or I/O errors

Example

from podcast_scraper import Config, load_config_file, run_pipeline

# Load from YAML file
config_dict = load_config_file("config.yaml")
cfg = Config(**config_dict)
count, summary = run_pipeline(cfg)

Example with JSON

config_dict = load_config_file("config.json")
cfg = Config(**config_dict)

Example with direct usage

from podcast_scraper import load_config_file, service

# Service API provides load_config_file convenience
result = service.run_from_config_file("config.yaml")

Supported Formats

JSON (.json):

{
  "rss": "https://example.com/feed.xml",
  "output_dir": "./transcripts",
  "max_episodes": 50
}

YAML (.yaml, .yml):

rss: https://example.com/feed.xml
output_dir: ./transcripts
max_episodes: 50
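The extension-based dispatch described above can be mimicked with the standard library alone. A minimal sketch of the JSON branch — load_json_config is a hypothetical helper for illustration, and the file name and values are made up:

```python
import json
import tempfile
from pathlib import Path


def load_json_config(path: str) -> dict:
    """Read a .json config file and require a top-level mapping, as load_config_file does."""
    resolved = Path(path).expanduser().resolve()
    if resolved.suffix.lower() != ".json":
        raise ValueError(f"Unsupported config file type: {resolved.suffix}")
    data = json.loads(resolved.read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError("Config file must contain a mapping/object at the top level")
    return data


with tempfile.TemporaryDirectory() as tmp:
    cfg_path = Path(tmp) / "config.json"
    cfg_path.write_text('{"rss": "https://example.com/feed.xml", "max_episodes": 50}')
    cfg = load_json_config(str(cfg_path))
```

The real load_config_file additionally handles the `.yaml`/`.yml` branch via `yaml.safe_load` and wraps read and parse failures in ValueError, as shown in the source listing below.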
Note

  • Field aliases are supported (e.g., both "rss" and "rss_url" work)
  • See Config documentation for all available configuration options
  • Configuration files should not contain sensitive data (API keys, passwords)

See Also

  • Config: Configuration model and field documentation
  • service.run_from_config_file(): Direct service API from config file
  • Configuration examples: config/examples/config.example.json, config/examples/config.example.yaml
Source code in src/podcast_scraper/config.py
def load_config_file(
    path: str,
) -> Dict[str, Any]:  # noqa: C901 - file parsing handles multiple formats
    """Load configuration from a JSON or YAML file.

    This function reads a configuration file and returns a dictionary of configuration values.
    The file format is auto-detected from the file extension (`.json`, `.yaml`, or `.yml`).

    The returned dictionary can be unpacked into the `Config` constructor to create a
    configuration object.

    Args:
        path: Path to configuration file (JSON or YAML). Supports tilde expansion for
              home directory (e.g., "~/config.yaml").

    Returns:
        Dict[str, Any]: Dictionary containing configuration values from the file.
            Keys correspond to `Config` field names (using aliases where applicable).

    Raises:
        ValueError: If any of the following occur:

            - Config path is empty
            - Config file does not exist
            - File format is invalid (not JSON or YAML)
            - JSON parsing fails
            - YAML parsing fails

        OSError: If file cannot be read due to permissions or I/O errors

    Example:
        >>> from podcast_scraper import Config, load_config_file, run_pipeline
        >>>
        >>> # Load from YAML file
        >>> config_dict = load_config_file("config.yaml")
        >>> cfg = Config(**config_dict)
        >>> count, summary = run_pipeline(cfg)

    Example with JSON:
        >>> config_dict = load_config_file("config.json")
        >>> cfg = Config(**config_dict)

    Example with direct usage:
        >>> from podcast_scraper import load_config_file, service
        >>>
        >>> # Service API provides load_config_file convenience
        >>> result = service.run_from_config_file("config.yaml")

    Supported Formats:
        **JSON** (`.json`):

            {
              "rss": "https://example.com/feed.xml",
              "output_dir": "./transcripts",
              "max_episodes": 50
            }

        **YAML** (`.yaml`, `.yml`):

            rss: https://example.com/feed.xml
            output_dir: ./transcripts
            max_episodes: 50

    Note:
        - Field aliases are supported (e.g., both "rss" and "rss_url" work)
        - See `Config` documentation for all available configuration options
        - Configuration files should not contain sensitive data (API keys, passwords)

    See Also:
        - `Config`: Configuration model and field documentation
        - `service.run_from_config_file()`: Direct service API from config file
        - Configuration examples: `config/examples/config.example.json`,
          `config/examples/config.example.yaml`
    """
    if not path:
        raise ValueError("Config path cannot be empty")

    cfg_path = Path(path).expanduser()
    try:
        resolved = cfg_path.resolve()
    except (OSError, RuntimeError) as exc:
        raise ValueError(f"Invalid config path: {path} ({exc})") from exc

    if not resolved.exists():
        raise ValueError(f"Config file not found: {resolved}")

    suffix = resolved.suffix.lower()
    try:
        text = resolved.read_text(encoding="utf-8")
    except OSError as exc:
        raise ValueError(f"Failed to read config file {resolved}: {exc}") from exc

    if suffix == ".json":
        try:
            data = json.loads(text)
        except json.JSONDecodeError as exc:
            raise ValueError(f"Invalid JSON config file {resolved}: {exc}") from exc
    elif suffix in (".yaml", ".yml"):
        try:
            data = yaml.safe_load(text)
        except yaml.YAMLError as exc:  # type: ignore[attr-defined]
            raise ValueError(f"Invalid YAML config file {resolved}: {exc}") from exc
    else:
        raise ValueError(f"Unsupported config file type: {resolved.suffix}")

    if not isinstance(data, dict):
        raise ValueError("Config file must contain a mapping/object at the top level")

    return data

Package Information

Versioning

__version__ module-attribute

__version__ = '2.4.0'

__api_version__ module-attribute

__api_version__ = __version__