Service API

The Service API provides a clean, programmatic interface optimized for non-interactive use, such as running as a daemon or service (e.g., with supervisor, systemd).

Overview

The service API is designed to:

Work exclusively with configuration files (no CLI arguments)
Provide structured return values and error handling
Be suitable for process management tools
Maintain clean separation from CLI concerns
Use the same validated Config model as the CLI: service.run_from_config_file builds Config(**config_dict) from the merged configuration and passes it to service.run. There is no separate allowlist of keys in the service layer, so documented fields such as preprocessing_mp3_bitrate_kbps are accepted whenever they are valid on Config (GitHub #561).

Quick Start

from podcast_scraper import service, Config

# Option 1: From Config object
cfg = Config(
    rss="https://example.com/feed.xml",
    output_dir="./transcripts"
)
result = service.run(cfg)

if result.success:
    print(f"Processed {result.episodes_processed} episodes")
    print(f"Summary: {result.summary}")
else:
    print(f"Error: {result.error}")

# Option 2: From config file
result = service.run_from_config_file("config.yaml")

Multi-feed (GitHub #440): If the loaded config has two or more feed entries in rss_urls (from YAML feeds / rss_urls, a promoted rss list, or objects with url plus optional per-feed overrides), service.run and run_from_config_file run one pipeline per feed under <output_dir>/feeds/<stable_feed_id>/, matching the CLI. output_dir must be set in that case. After the batch, corpus_manifest.json and corpus_run_summary.json are written at the corpus parent (GitHub #506); with vector_search and FAISS, one <output_dir>/search index is built for the whole corpus (GitHub #505). The return value’s multi_feed_summary field holds the same JSON-shaped dict as corpus_run_summary.json (or None on single-feed runs), including batch_incidents and per-feed episode_incidents_unique (schema 1.1.0). For field tables, see CORPUS_MULTI_FEED_ARTIFACTS.md. See also the RSS and multi-feed section of CONFIGURATION.md.
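The multi-feed layout described above can be sketched with pathlib. This is an illustration only: the helper names feed_dir and corpus_artifacts are hypothetical and not part of podcast_scraper, and the feed identifier is taken as an argument because the way <stable_feed_id> is derived is not covered on this page.

```python
from pathlib import Path


def feed_dir(output_dir: str, stable_feed_id: str) -> Path:
    """Per-feed pipeline directory: <output_dir>/feeds/<stable_feed_id>/."""
    return Path(output_dir) / "feeds" / stable_feed_id


def corpus_artifacts(output_dir: str) -> dict[str, Path]:
    """Batch-level files written at the corpus parent after a multi-feed run."""
    parent = Path(output_dir)
    return {
        "manifest": parent / "corpus_manifest.json",
        "run_summary": parent / "corpus_run_summary.json",
    }


if __name__ == "__main__":
    # "example-feed" is a made-up identifier used only for illustration.
    print(feed_dir("./corpus", "example-feed"))
    print(corpus_artifacts("./corpus")["run_summary"])
```

A monitoring script could use paths like these to locate corpus_run_summary.json after a batch instead of relying on the in-process return value.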

Soft-only multi-feed success (GitHub #559): multi_feed_strict defaults to false (lenient). A multi-feed run can then return success=True with error=None if every failed feed is classified as soft (same rules as in CONFIGURATION.md — RSS and multi-feed). In that case the aggregated per-feed messages are on ServiceResult.soft_failures (non-empty string). If success is false because of a hard failure or strict mode (multi_feed_strict: true), soft_failures stays None. multi_feed_summary / corpus_run_summary.json still report overall_ok: false when any feed failed. In Python, pass multi_feed_strict= into Config; deprecated YAML-only keys are documented in the same CONFIGURATION section.
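The three outcomes described above (hard failure, success with soft-only feed failures, clean success) can be told apart from the result fields alone. The sketch below uses a stand-in ServiceResult dataclass that mirrors the documented fields so it runs without the library; classify and any_feed_failed are hypothetical helper names.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ServiceResult:
    """Stand-in mirroring the documented fields; not imported from podcast_scraper."""

    episodes_processed: int
    summary: str
    success: bool = True
    error: Optional[str] = None
    multi_feed_summary: Optional[Dict[str, Any]] = None
    soft_failures: Optional[str] = None


def classify(result: ServiceResult) -> str:
    """Return 'failed', 'ok_with_soft_failures', or 'ok' for a finished run."""
    if not result.success:
        return "failed"  # hard failure or strict mode; details are in result.error
    if result.soft_failures:
        return "ok_with_soft_failures"  # lenient mode: some feeds failed softly
    return "ok"


def any_feed_failed(result: ServiceResult) -> bool:
    """True when the multi-feed summary reports overall_ok: false."""
    summary = result.multi_feed_summary or {}
    return summary.get("overall_ok") is False
```

Checking overall_ok separately matters in lenient mode, where success can be True even though the summary still records that a feed failed.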

Episode selection (GitHub #521): The same episode_order, episode_since, episode_until, episode_offset, and max_episodes fields in YAML/JSON apply to each inner single-feed run. See CONFIGURATION.md — Episode selection.

Append / resume (GitHub #444): If Config.append is true, each inner run uses a stable run_append_* directory and skips episodes that are already complete on disk (metadata episode_id + required artifacts). Incompatible with clean_output. See CONFIGURATION.md — Append / resume.

Corpus lock (multi-feed): While two or more feeds are processed, service.run acquires an advisory exclusive lock file .podcast_scraper.lock under the corpus parent (output_dir) using filelock. If another process already holds the lock, the call returns immediately with success=False, episodes_processed=0, and error describing the lock conflict. Disable locking with environment variable PODCAST_SCRAPER_CORPUS_LOCK=0 (tests, advanced workflows). Single-feed service.run does not use this lock.
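For tests that need the corpus lock switched off, one option is to set the environment variable only for the duration of a call. The context manager below is a minimal sketch using the standard library; corpus_lock_disabled is a hypothetical helper name, and it assumes the variable is read when service.run is invoked rather than at import time, which this page does not state.

```python
import os
from contextlib import contextmanager
from typing import Iterator

LOCK_ENV_VAR = "PODCAST_SCRAPER_CORPUS_LOCK"


@contextmanager
def corpus_lock_disabled() -> Iterator[None]:
    """Temporarily set PODCAST_SCRAPER_CORPUS_LOCK=0, restoring the old value on exit."""
    previous = os.environ.get(LOCK_ENV_VAR)
    os.environ[LOCK_ENV_VAR] = "0"
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop(LOCK_ENV_VAR, None)  # variable was unset before
        else:
            os.environ[LOCK_ENV_VAR] = previous


# Intended use (requires podcast_scraper, so it is left as a comment):
#     with corpus_lock_disabled():
#         result = service.run(cfg)
```

Restoring the previous value in a finally block keeps one test from leaking the disabled lock into later tests in the same process.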

API Reference

run

run(cfg: Config) -> ServiceResult

Run the podcast scraping pipeline with the given configuration.

This is the main entry point for programmatic use. It executes the full pipeline and returns a structured result suitable for service/daemon use.

When cfg.rss_urls contains two or more URLs (e.g. from YAML feeds:), runs one pipeline per feed under output_dir/feeds/<stable_name>/ (GitHub #440), same layout as the multi-feed CLI.

Parameters:

Name	Type	Description	Default
`cfg`	`Config`	Configuration object (can be created from Config() or Config(**load_config_file()))	required

Returns:

Type	Description
`ServiceResult`	ServiceResult with processing results

Example

>>> from podcast_scraper import service, config
>>> cfg = config.Config(rss_url="https://example.com/feed.xml")
>>> result = service.run(cfg)
>>> if result.success:
...     print(f"Success: {result.summary}")
... else:
...     print(f"Error: {result.error}")

Source code in src/podcast_scraper/service.py

def run(cfg: config.Config) -> ServiceResult:
    """Run the podcast scraping pipeline with the given configuration.

    This is the main entry point for programmatic use. It executes the full pipeline
    and returns a structured result suitable for service/daemon use.

    When ``cfg.rss_urls`` contains two or more URLs (e.g. from YAML ``feeds:``), runs one
    pipeline per feed under ``output_dir/feeds/<stable_name>/`` (GitHub #440), same layout as
    the multi-feed CLI.

    Args:
        cfg: Configuration object (can be created from Config() or Config(**load_config_file()))

    Returns:
        ServiceResult with processing results

    Example:
        >>> from podcast_scraper import service, config
        >>> cfg = config.Config(rss_url="https://example.com/feed.xml")
        >>> result = service.run(cfg)
        >>> if result.success:
        ...     print(f"Success: {result.summary}")
        ... else:
        ...     print(f"Error: {result.error}")
    """
    try:
        # Apply logging configuration if specified
        if cfg.log_file or cfg.log_level:
            resolved_log = workflow.resolve_log_file_path(cfg.log_file, cfg.output_dir)
            workflow.apply_log_level(
                level=cfg.log_level or "INFO",
                log_file=resolved_log,
            )

        multi_entries = list(cfg.rss_urls or [])
        if len(multi_entries) >= 2:
            return _run_multi_feed(cfg, multi_entries)

        # Single-feed path. Opt-in ``feeds/<slug>/`` wrapping (#644) is applied
        # at Config construction via the ``_apply_single_feed_corpus_layout``
        # post-validator, so ``cfg.output_dir`` is already in its final form
        # here — no extra wrapping needed in this hot path.
        count, summary = workflow.run_pipeline(cfg)

        return ServiceResult(
            episodes_processed=count,
            summary=summary,
            success=True,
            error=None,
        )
    except Exception as e:
        error_safe = redact_for_log(str(e))
        logger.error("Pipeline execution failed: %s", error_safe, exc_info=True)
        return ServiceResult(
            episodes_processed=0,
            summary="",
            success=False,
            error=error_safe,
        )

run_from_config_file

run_from_config_file(config_path: str | Path) -> ServiceResult

Run the pipeline from a configuration file.

Convenience function that loads a config file and runs the pipeline. This is the recommended entry point for service/daemon usage.

Parameters:

Name	Type	Description	Default
`config_path`	`str \| Path`	Path to configuration file (JSON or YAML)	required

Returns:

Type	Description
`ServiceResult`	ServiceResult with processing results

Raises:

Type	Description
`FileNotFoundError`	If config file doesn't exist
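`ValueError`	If config file is invalid

Note: in the implementation shown below, both conditions are caught inside run_from_config_file and reported as a ServiceResult with success=False and a redacted error message, so callers should check result.success rather than rely on these exceptions propagating.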

Example

>>> import sys
>>> from podcast_scraper import service
>>> result = service.run_from_config_file("config.yaml")
>>> if not result.success:
...     sys.exit(1)

Source code in src/podcast_scraper/service.py

def run_from_config_file(config_path: str | Path) -> ServiceResult:
    """Run the pipeline from a configuration file.

    Convenience function that loads a config file and runs the pipeline.
    This is the recommended entry point for service/daemon usage.

    Args:
        config_path: Path to configuration file (JSON or YAML)

    Returns:
        ServiceResult with processing results

    Raises:
        FileNotFoundError: If config file doesn't exist
        ValueError: If config file is invalid

    Example:
        >>> from podcast_scraper import service
        >>> result = service.run_from_config_file("config.yaml")
        >>> if not result.success:
        ...     sys.exit(1)
    """
    try:
        config_dict = config.load_config_file(str(config_path))
        cfg = config.Config(**config_dict)
    except FileNotFoundError:
        error_msg = f"Configuration file not found: {config_path}"
        error_safe = redact_for_log(error_msg)
        logger.error("%s", error_safe)
        return ServiceResult(
            episodes_processed=0,
            summary="",
            success=False,
            error=error_safe,
        )
    except Exception as exc:
        error_safe = redact_for_log(f"Failed to load configuration file: {exc}")
        logger.error("%s", error_safe)
        return ServiceResult(
            episodes_processed=0,
            summary="",
            success=False,
            error=error_safe,
        )

    from .monitor.memray_util import maybe_reexec_memray_service

    memray_err = maybe_reexec_memray_service(
        memray=bool(cfg.memray),
        output_dir=cfg.output_dir,
        memray_output=cfg.memray_output,
    )
    if memray_err:
        return ServiceResult(
            episodes_processed=0,
            summary="",
            success=False,
            error=memray_err,
        )

    return run(cfg)

main

main() -> int

Main entry point for service mode (CLI-like but config-file only).

This function is designed to be called as a script entry point: python -m podcast_scraper.service --config config.yaml

It accepts a --config argument (optional if PODCAST_SCRAPER_CONFIG env var is set) and is optimized for non-interactive use.

Config file resolution order:

1. --config argument (if provided)
2. PODCAST_SCRAPER_CONFIG environment variable
3. Default: /app/config.yaml (for Docker/service usage)
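The resolution order above reduces to a few lines of logic. The following is a standalone sketch that mirrors the expressions in the source listing for main (an os.getenv default followed by `args.config or default_config`); resolve_config_path is a hypothetical name, and the environ parameter is added only to make the function easy to exercise.

```python
import os
from typing import Mapping, Optional

DEFAULT_CONFIG_PATH = "/app/config.yaml"  # documented default for Docker/service usage


def resolve_config_path(
    cli_value: Optional[str],
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick the config path: --config value, then env var, then the default."""
    env = os.environ if environ is None else environ
    # The environment variable (if set) replaces the built-in default.
    fallback = env.get("PODCAST_SCRAPER_CONFIG", DEFAULT_CONFIG_PATH)
    # An explicit --config value always wins over the fallback.
    return cli_value or fallback


if __name__ == "__main__":
    print(resolve_config_path(None, {}))
    print(resolve_config_path("config.yaml", {"PODCAST_SCRAPER_CONFIG": "/etc/ps.yaml"}))
```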

Returns:

Type	Description
`int`	Exit code (0 for success, 1 for failure)

Source code in src/podcast_scraper/service.py

def main() -> int:
    """Main entry point for service mode (CLI-like but config-file only).

    This function is designed to be called as a script entry point:
    python -m podcast_scraper.service --config config.yaml

    It accepts a --config argument (optional if PODCAST_SCRAPER_CONFIG env var is set)
    and is optimized for non-interactive use.

    Config file resolution order:
    1. --config argument (if provided)
    2. PODCAST_SCRAPER_CONFIG environment variable
    3. Default: /app/config.yaml (for Docker/service usage)

    Returns:
        Exit code (0 for success, 1 for failure)
    """
    import argparse
    import os

    # Initialize ML environment variables early (before any ML imports)
    setup.initialize_ml_environment()

    # Default config path (for Docker/service usage)
    default_config = os.getenv("PODCAST_SCRAPER_CONFIG", "/app/config.yaml")

    parser = argparse.ArgumentParser(
        description="Podcast Scraper Service - Run pipeline from configuration file",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Run with config file
  python -m podcast_scraper.service --config config.yaml

  # Run with environment variable
  PODCAST_SCRAPER_CONFIG=/path/to/config.yaml python -m podcast_scraper.service

  # Run with default path (Docker/service mode)
  python -m podcast_scraper.service

  # For supervisor/systemd usage
  [program:podcast_scraper]
  command=python -m podcast_scraper.service --config /path/to/config.yaml
  autostart=true
  autorestart=true
        """,
    )
    parser.add_argument(
        "--config",
        default=None,
        help=(
            "Path to configuration file (JSON or YAML). "
            "If not provided, uses PODCAST_SCRAPER_CONFIG environment variable "
            f"or default: {default_config}"
        ),
    )
    parser.add_argument(
        "--version",
        action="version",
        version=f"podcast_scraper {__version__}",
    )

    args = parser.parse_args()

    # Resolve config file path
    config_path = args.config or default_config

    # Run the service
    result = run_from_config_file(config_path)

    # Print results
    if result.success:
        print(result.summary)
        return 0
    else:
        print(f"Error: {result.error}", file=sys.stderr)
        return 1

ServiceResult Class

ServiceResult `dataclass`

ServiceResult(episodes_processed: int, summary: str, success: bool = True, error: Optional[str] = None, multi_feed_summary: Optional[Dict[str, Any]] = None, soft_failures: Optional[str] = None)

Result of a service run.

Attributes:

Name	Type	Description
`episodes_processed`	`int`	Number of episodes processed (transcripts saved/planned)
`summary`	`str`	Human-readable summary message
`success`	`bool`	Whether the run completed successfully
`error`	`Optional[str]`	Error message if success is False, None otherwise
`multi_feed_summary`	`Optional[Dict[str, Any]]`	When `run` used multi-feed mode (two or more `rss_urls`), the same JSON-shaped dict as `corpus_run_summary.json` at the corpus parent (GitHub #506). `None` for single-feed runs and for top-level failures before multi-feed finalize.
`soft_failures`	`Optional[str]`	When `success` is True but some feeds failed with soft classifications only (GitHub #559; default `multi_feed_strict` is False), holds the same aggregated detail that would have been in `error` under strict mode (`multi_feed_strict=True`). `None` when there were no soft-only failures.
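Because ServiceResult is a dataclass, its fields convert naturally into a structured log record. The sketch below defines a stand-in dataclass with the documented fields so it runs without the library; to_log_line is a hypothetical helper, and passing default=str to json.dumps is a precaution for any non-JSON values inside multi_feed_summary.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class ServiceResult:
    """Stand-in mirroring the documented fields; not imported from podcast_scraper."""

    episodes_processed: int
    summary: str
    success: bool = True
    error: Optional[str] = None
    multi_feed_summary: Optional[Dict[str, Any]] = None
    soft_failures: Optional[str] = None


def to_log_line(result: ServiceResult) -> str:
    """Serialize every field of the result as one JSON object on a single line."""
    return json.dumps(asdict(result), sort_keys=True, default=str)


if __name__ == "__main__":
    print(to_log_line(ServiceResult(episodes_processed=2, summary="2 episodes")))
```

A single JSON line per run works well with the log files written by supervisor or journald, since each record can be parsed independently.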

Daemon Usage

Systemd Service

[Unit]
Description=Podcast Scraper Service
After=network.target

[Service]
Type=simple
User=podcast
WorkingDirectory=/opt/podcast-scraper
ExecStart=/usr/bin/python3 -m podcast_scraper.service --config /etc/podcast-scraper/config.yaml
Restart=on-failure
RestartSec=30

[Install]
WantedBy=multi-user.target

Supervisor Configuration

[program:podcast_scraper]
command=/usr/bin/python3 -m podcast_scraper.service --config /etc/podcast-scraper/config.yaml
directory=/opt/podcast-scraper
user=podcast
autostart=true
autorestart=true
redirect_stderr=true
stdout_logfile=/var/log/podcast-scraper.log

Programmatic Error Handling

import sys
from podcast_scraper import service

result = service.run_from_config_file("config.yaml")

if not result.success:
    # Log error and exit with appropriate code
    print(f"Service failed: {result.error}", file=sys.stderr)
    sys.exit(1)

# success is True; multi-feed may still have soft-classified feed failures
if result.soft_failures:
    print(f"Warning (soft-only feed failures): {result.soft_failures}", file=sys.stderr)

print(f"Success: {result.summary}")
sys.exit(0)

Docker Usage

For Docker-based deployments, see the Docker Service Guide, which covers:

Service-oriented Docker execution
Environment variables and volume mounts
Supervisor integration
Docker Compose examples
Troubleshooting
