Service API

The Service API provides a clean, programmatic interface optimized for non-interactive use, such as running as a daemon or service (e.g., under supervisor or systemd).

Overview

The service API is designed to:

  • Work exclusively with configuration files (no CLI arguments)
  • Provide structured return values and error handling
  • Be suitable for process management tools
  • Maintain clean separation from CLI concerns
  • Use the same validated Config model as the CLI: service.run builds Config(**config_dict) from the merged configuration. There is no separate allowlist of keys in the service layer, so documented fields such as preprocessing_mp3_bitrate_kbps are accepted whenever they are valid on Config (GitHub #561).

Quick Start

from podcast_scraper import service, Config

# Option 1: From Config object
cfg = Config(
    rss="https://example.com/feed.xml",
    output_dir="./transcripts"
)
result = service.run(cfg)

if result.success:
    print(f"Processed {result.episodes_processed} episodes")
    print(f"Summary: {result.summary}")
else:
    print(f"Error: {result.error}")

# Option 2: From config file
result = service.run_from_config_file("config.yaml")

Multi-feed (GitHub #440): If the loaded config has two or more feed entries in rss_urls (from YAML feeds / rss_urls, a promoted rss list, or objects with url plus optional per-feed overrides), service.run / run_from_config_file runs one pipeline per feed under <output_dir>/feeds/<stable_feed_id>/, matching the CLI. output_dir must be set in that case. After the batch, #506 writes corpus_manifest.json and corpus_run_summary.json at the corpus parent; with vector_search and FAISS, #505 builds one <output_dir>/search index. The return value’s multi_feed_summary field holds the same JSON-shaped dict as corpus_run_summary.json (or None on single-feed runs), including batch_incidents and per-feed episode_incidents_unique (schema 1.1.0). Field tables: CORPUS_MULTI_FEED_ARTIFACTS.md. See also CONFIGURATION.md — RSS and multi-feed.
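The on-disk layout described above can be sketched with a few path helpers. This is an illustrative sketch, not part of podcast_scraper: corpus_artifacts, feed_dir, and read_overall_ok are hypothetical names, the feed ID must be supplied by the caller (how the library derives <stable_feed_id> is not shown here), and the sketch assumes overall_ok is a top-level key of corpus_run_summary.json.

```python
import json
from pathlib import Path
from typing import Dict, Optional


def corpus_artifacts(output_dir: str) -> Dict[str, Path]:
    """Batch-level files written at the corpus parent (output_dir)."""
    root = Path(output_dir)
    return {
        "manifest": root / "corpus_manifest.json",
        "run_summary": root / "corpus_run_summary.json",
    }


def feed_dir(output_dir: str, stable_feed_id: str) -> Path:
    """Per-feed pipeline directory: <output_dir>/feeds/<stable_feed_id>/."""
    return Path(output_dir) / "feeds" / stable_feed_id


def read_overall_ok(output_dir: str) -> Optional[bool]:
    """Return overall_ok from corpus_run_summary.json, or None if the file is absent.

    Assumption: overall_ok is a top-level key of the summary document.
    """
    summary_path = corpus_artifacts(output_dir)["run_summary"]
    if not summary_path.is_file():
        return None
    data = json.loads(summary_path.read_text(encoding="utf-8"))
    return data.get("overall_ok")
```

A monitoring script could call read_overall_ok("./transcripts") after a batch to decide whether to raise an alert, without importing the pipeline itself.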

Soft-only multi-feed success (GitHub #559): multi_feed_strict defaults to false (lenient). A multi-feed run can then return success=True with error=None if every failed feed is classified as soft (same rules as in CONFIGURATION.md — RSS and multi-feed). In that case the aggregated per-feed messages are on ServiceResult.soft_failures (non-empty string). If success is false because of a hard failure or strict mode (multi_feed_strict: true), soft_failures stays None. multi_feed_summary / corpus_run_summary.json still report overall_ok: false when any feed failed. In Python, pass multi_feed_strict= into Config; deprecated YAML-only keys are documented in the same CONFIGURATION section.
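The three outcomes described above (hard failure, success with soft-only feed failures, clean success) can be told apart from the result fields alone. The following is a minimal sketch: ResultStandIn is a hypothetical stand-in that mirrors the documented ServiceResult fields so the example runs without the library, and classify is an illustrative helper, not a podcast_scraper function.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ResultStandIn:
    """Hypothetical stand-in mirroring the documented ServiceResult fields."""

    episodes_processed: int
    summary: str
    success: bool = True
    error: Optional[str] = None
    multi_feed_summary: Optional[Dict[str, Any]] = None
    soft_failures: Optional[str] = None


def classify(result) -> str:
    """Return 'failed', 'degraded', or 'ok' for a ServiceResult-like object."""
    if not result.success:
        # Hard failure or strict mode: details are in result.error.
        return "failed"
    if result.soft_failures:
        # Lenient mode: the run succeeded but some feeds failed softly.
        return "degraded"
    return "ok"
```

In a real caller, the value returned by service.run would be passed to classify in place of the stand-in.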

Episode selection (GitHub #521): The same episode_order, episode_since, episode_until, episode_offset, and max_episodes fields in YAML/JSON apply to each inner single-feed run. See CONFIGURATION.md — Episode selection.

Append / resume (GitHub #444): If Config.append is true, each inner run uses a stable run_append_* directory and skips episodes that are already complete on disk (metadata episode_id + required artifacts). Incompatible with clean_output. See CONFIGURATION.md — Append / resume.
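Because append and clean_output are incompatible, a caller that assembles configuration dictionaries may want to catch the combination early. This is a hedged sketch of such a pre-check on a plain dict; check_append_options is a hypothetical helper, and it does not describe how Config itself reports the conflict.

```python
from typing import Any, Mapping


def check_append_options(config: Mapping[str, Any]) -> None:
    """Raise ValueError if both append and clean_output are enabled.

    Illustrative pre-check only; the keys mirror the documented Config fields.
    """
    if config.get("append") and config.get("clean_output"):
        raise ValueError("append and clean_output cannot be combined")
```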

Corpus lock (multi-feed): While two or more feeds are processed, service.run acquires an advisory exclusive lock file .podcast_scraper.lock under the corpus parent (output_dir) using filelock. If another process already holds the lock, the call returns immediately with success=False, episodes_processed=0, and error describing the lock conflict. Disable locking with environment variable PODCAST_SCRAPER_CORPUS_LOCK=0 (tests, advanced workflows). Single-feed service.run does not use this lock.
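For tests, the lock can be switched off for a limited scope instead of process-wide. The sketch below uses only the standard library; corpus_lock_disabled is a hypothetical helper, and it assumes the library reads PODCAST_SCRAPER_CORPUS_LOCK when service.run is called rather than at import time.

```python
import os
from contextlib import contextmanager
from typing import Iterator

LOCK_ENV_VAR = "PODCAST_SCRAPER_CORPUS_LOCK"


@contextmanager
def corpus_lock_disabled() -> Iterator[None]:
    """Set PODCAST_SCRAPER_CORPUS_LOCK=0 inside the block, then restore it."""
    previous = os.environ.get(LOCK_ENV_VAR)
    os.environ[LOCK_ENV_VAR] = "0"
    try:
        yield
    finally:
        # Restore the prior state exactly, including "unset".
        if previous is None:
            os.environ.pop(LOCK_ENV_VAR, None)
        else:
            os.environ[LOCK_ENV_VAR] = previous
```

A test would wrap its service.run call in "with corpus_lock_disabled():" so that other tests in the same process still see the default locking behavior.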

API Reference

run

run(cfg: Config) -> ServiceResult

Run the podcast scraping pipeline with the given configuration.

This is the main entry point for programmatic use. It executes the full pipeline and returns a structured result suitable for service/daemon use.

When cfg.rss_urls contains two or more URLs (e.g. from YAML feeds:), runs one pipeline per feed under output_dir/feeds/<stable_name>/ (GitHub #440), same layout as the multi-feed CLI.

Parameters:

Name   Type     Description                                                                           Default
cfg    Config   Configuration object (can be created from Config() or Config(**load_config_file()))   required

Returns:

Type            Description
ServiceResult   ServiceResult with processing results

Example

from podcast_scraper import service, config

cfg = config.Config(rss_url="https://example.com/feed.xml")
result = service.run(cfg)
if result.success:
    print(f"Success: {result.summary}")
else:
    print(f"Error: {result.error}")

Source code in src/podcast_scraper/service.py
def run(cfg: config.Config) -> ServiceResult:
    """Run the podcast scraping pipeline with the given configuration.

    This is the main entry point for programmatic use. It executes the full pipeline
    and returns a structured result suitable for service/daemon use.

    When ``cfg.rss_urls`` contains two or more URLs (e.g. from YAML ``feeds:``), runs one
    pipeline per feed under ``output_dir/feeds/<stable_name>/`` (GitHub #440), same layout as
    the multi-feed CLI.

    Args:
        cfg: Configuration object (can be created from Config() or Config(**load_config_file()))

    Returns:
        ServiceResult with processing results

    Example:
        >>> from podcast_scraper import service, config
        >>> cfg = config.Config(rss_url="https://example.com/feed.xml")
        >>> result = service.run(cfg)
        >>> if result.success:
        ...     print(f"Success: {result.summary}")
        ... else:
        ...     print(f"Error: {result.error}")
    """
    try:
        # Apply logging configuration if specified
        if cfg.log_file or cfg.log_level:
            resolved_log = workflow.resolve_log_file_path(cfg.log_file, cfg.output_dir)
            workflow.apply_log_level(
                level=cfg.log_level or "INFO",
                log_file=resolved_log,
            )

        multi_entries = list(cfg.rss_urls or [])
        if len(multi_entries) >= 2:
            return _run_multi_feed(cfg, multi_entries)

        # Single-feed path. Opt-in ``feeds/<slug>/`` wrapping (#644) is applied
        # at Config construction via the ``_apply_single_feed_corpus_layout``
        # post-validator, so ``cfg.output_dir`` is already in its final form
        # here — no extra wrapping needed in this hot path.
        count, summary = workflow.run_pipeline(cfg)

        return ServiceResult(
            episodes_processed=count,
            summary=summary,
            success=True,
            error=None,
        )
    except Exception as e:
        error_safe = redact_for_log(str(e))
        logger.error("Pipeline execution failed: %s", error_safe, exc_info=True)
        return ServiceResult(
            episodes_processed=0,
            summary="",
            success=False,
            error=error_safe,
        )

run_from_config_file

run_from_config_file(config_path: str | Path) -> ServiceResult

Run the pipeline from a configuration file.

Convenience function that loads a config file and runs the pipeline. This is the recommended entry point for service/daemon usage.

Parameters:

Name          Type         Description                                 Default
config_path   str | Path   Path to configuration file (JSON or YAML)   required

Returns:

Type            Description
ServiceResult   ServiceResult with processing results

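Errors:

The docstring lists FileNotFoundError and ValueError under "Raises", but the source shown for this function catches load failures and reports them through the returned ServiceResult (success=False, episodes_processed=0) instead of propagating them:

Condition                                        ServiceResult.error (redacted for logging)
Config file doesn't exist (FileNotFoundError)    "Configuration file not found: <config_path>"
Config file is invalid (any other exception)     "Failed to load configuration file: <exception text>"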

Example

import sys

from podcast_scraper import service

result = service.run_from_config_file("config.yaml")
if not result.success:
    sys.exit(1)

Source code in src/podcast_scraper/service.py
def run_from_config_file(config_path: str | Path) -> ServiceResult:
    """Run the pipeline from a configuration file.

    Convenience function that loads a config file and runs the pipeline.
    This is the recommended entry point for service/daemon usage.

    Args:
        config_path: Path to configuration file (JSON or YAML)

    Returns:
        ServiceResult with processing results

    Raises:
        FileNotFoundError: If config file doesn't exist
        ValueError: If config file is invalid

    Example:
        >>> from podcast_scraper import service
        >>> result = service.run_from_config_file("config.yaml")
        >>> if not result.success:
        ...     sys.exit(1)
    """
    try:
        config_dict = config.load_config_file(str(config_path))
        cfg = config.Config(**config_dict)
    except FileNotFoundError:
        error_msg = f"Configuration file not found: {config_path}"
        error_safe = redact_for_log(error_msg)
        logger.error("%s", error_safe)
        return ServiceResult(
            episodes_processed=0,
            summary="",
            success=False,
            error=error_safe,
        )
    except Exception as exc:
        error_safe = redact_for_log(f"Failed to load configuration file: {exc}")
        logger.error("%s", error_safe)
        return ServiceResult(
            episodes_processed=0,
            summary="",
            success=False,
            error=error_safe,
        )

    from .monitor.memray_util import maybe_reexec_memray_service

    memray_err = maybe_reexec_memray_service(
        memray=bool(cfg.memray),
        output_dir=cfg.output_dir,
        memray_output=cfg.memray_output,
    )
    if memray_err:
        return ServiceResult(
            episodes_processed=0,
            summary="",
            success=False,
            error=memray_err,
        )

    return run(cfg)

main

main() -> int

Main entry point for service mode (CLI-like but config-file only).

This function is designed to be called as a script entry point:

python -m podcast_scraper.service --config config.yaml

It accepts a --config argument (optional if PODCAST_SCRAPER_CONFIG env var is set) and is optimized for non-interactive use.

Config file resolution order:

  1. --config argument (if provided)
  2. PODCAST_SCRAPER_CONFIG environment variable
  3. Default: /app/config.yaml (for Docker/service usage)
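The resolution order can be expressed as a small pure function, which is convenient for unit tests. This sketch mirrors the two lines in main that compute the path (os.getenv with a default, then "args.config or default_config"); resolve_config_path and its environ parameter are hypothetical names introduced for illustration.

```python
import os
from typing import Mapping, Optional

DEFAULT_CONFIG_PATH = "/app/config.yaml"


def resolve_config_path(
    cli_value: Optional[str],
    environ: Mapping[str, str] = os.environ,
) -> str:
    """Resolve the config path: --config, then env var, then the default."""
    # The environment variable replaces the built-in default when it is set.
    env_or_default = environ.get("PODCAST_SCRAPER_CONFIG", DEFAULT_CONFIG_PATH)
    # An explicit --config value takes precedence over both.
    return cli_value or env_or_default
```

Passing environ explicitly keeps the function free of global state, so each branch of the order can be checked with a plain dict.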

Returns:

Type   Description
int    Exit code (0 for success, 1 for failure)

Source code in src/podcast_scraper/service.py
def main() -> int:
    """Main entry point for service mode (CLI-like but config-file only).

    This function is designed to be called as a script entry point:
    python -m podcast_scraper.service --config config.yaml

    It accepts a --config argument (optional if PODCAST_SCRAPER_CONFIG env var is set)
    and is optimized for non-interactive use.

    Config file resolution order:
    1. --config argument (if provided)
    2. PODCAST_SCRAPER_CONFIG environment variable
    3. Default: /app/config.yaml (for Docker/service usage)

    Returns:
        Exit code (0 for success, 1 for failure)
    """
    import argparse
    import os

    # Initialize ML environment variables early (before any ML imports)
    setup.initialize_ml_environment()

    # Default config path (for Docker/service usage)
    default_config = os.getenv("PODCAST_SCRAPER_CONFIG", "/app/config.yaml")

    parser = argparse.ArgumentParser(
        description="Podcast Scraper Service - Run pipeline from configuration file",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Run with config file
  python -m podcast_scraper.service --config config.yaml

  # Run with environment variable
  PODCAST_SCRAPER_CONFIG=/path/to/config.yaml python -m podcast_scraper.service

  # Run with default path (Docker/service mode)
  python -m podcast_scraper.service

  # For supervisor/systemd usage
  [program:podcast_scraper]
  command=python -m podcast_scraper.service --config /path/to/config.yaml
  autostart=true
  autorestart=true
        """,
    )
    parser.add_argument(
        "--config",
        default=None,
        help=(
            "Path to configuration file (JSON or YAML). "
            "If not provided, uses PODCAST_SCRAPER_CONFIG environment variable "
            f"or default: {default_config}"
        ),
    )
    parser.add_argument(
        "--version",
        action="version",
        version=f"podcast_scraper {__version__}",
    )

    args = parser.parse_args()

    # Resolve config file path
    config_path = args.config or default_config

    # Run the service
    result = run_from_config_file(config_path)

    # Print results
    if result.success:
        print(result.summary)
        return 0
    else:
        print(f"Error: {result.error}", file=sys.stderr)
        return 1

ServiceResult Class

ServiceResult dataclass

ServiceResult(episodes_processed: int, summary: str, success: bool = True, error: Optional[str] = None, multi_feed_summary: Optional[Dict[str, Any]] = None, soft_failures: Optional[str] = None)

Result of a service run.

Attributes:

Name                 Type                       Description
episodes_processed   int                        Number of episodes processed (transcripts saved/planned)
summary              str                        Human-readable summary message
success              bool                       Whether the run completed successfully
error                Optional[str]              Error message if success is False, None otherwise
multi_feed_summary   Optional[Dict[str, Any]]   When run used multi-feed mode (two or more rss_urls), the same JSON-shaped dict as corpus_run_summary.json at the corpus parent (GitHub #506). None for single-feed runs and for top-level failures before multi-feed finalize.
soft_failures        Optional[str]              When success is True but some feeds failed with soft classifications only (GitHub #559; default multi_feed_strict is False), holds the same aggregated detail that would have been in error under strict mode (multi_feed_strict=True). None when there were no soft-only failures.
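Since ServiceResult is a dataclass, a daemon wrapper can turn it into a single structured log line. The sketch below is illustrative: ServiceResultStandIn is a hypothetical stand-in with the documented fields (so the example runs without the library), and to_log_line is not a podcast_scraper function. It replaces the potentially large multi_feed_summary dict with a boolean flag.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class ServiceResultStandIn:
    """Hypothetical stand-in mirroring the documented ServiceResult fields."""

    episodes_processed: int
    summary: str
    success: bool = True
    error: Optional[str] = None
    multi_feed_summary: Optional[Dict[str, Any]] = None
    soft_failures: Optional[str] = None


def to_log_line(result) -> str:
    """Serialize a ServiceResult-like dataclass to one compact JSON line."""
    payload = asdict(result)
    # Keep the log line small: record only whether a multi-feed summary exists.
    summary = payload.pop("multi_feed_summary")
    payload["multi_feed"] = summary is not None
    return json.dumps(payload, sort_keys=True)
```

The full multi-feed detail remains available in corpus_run_summary.json, so dropping it from the log line loses no information.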

Daemon Usage

Systemd Service

[Unit]
Description=Podcast Scraper Service
After=network.target

[Service]
Type=simple
User=podcast
WorkingDirectory=/opt/podcast-scraper
ExecStart=/usr/bin/python3 -m podcast_scraper.service --config /etc/podcast-scraper/config.yaml
Restart=on-failure
RestartSec=30

[Install]
WantedBy=multi-user.target

Supervisor Configuration

[program:podcast_scraper]
command=/usr/bin/python3 -m podcast_scraper.service --config /etc/podcast-scraper/config.yaml
directory=/opt/podcast-scraper
user=podcast
autostart=true
autorestart=true
redirect_stderr=true
stdout_logfile=/var/log/podcast-scraper.log

Programmatic Error Handling

import sys
from podcast_scraper import service

result = service.run_from_config_file("config.yaml")

if not result.success:
    # Log error and exit with appropriate code
    print(f"Service failed: {result.error}", file=sys.stderr)
    sys.exit(1)

# success is True; multi-feed may still have soft-classified feed failures
if result.soft_failures:
    print(f"Warning (soft-only feed failures): {result.soft_failures}", file=sys.stderr)

print(f"Success: {result.summary}")
sys.exit(0)

Docker Usage

For Docker-based deployments, see the Docker Service Guide, which covers:

  • Service-oriented Docker execution
  • Environment variables and volume mounts
  • Supervisor integration
  • Docker Compose examples
  • Troubleshooting

See Also