Docker Service Guide¶

This guide explains how to use podcast_scraper as a service-oriented Docker container, suitable for daemon/service usage with supervisor, systemd, or as a long-running container.

Overview¶

The Docker container is designed to run headless without requiring CLI arguments. It automatically reads configuration from a config file and can be managed with supervisor for advanced process management.

Quick Start¶

Simple Service Mode (Direct)¶

Mount a config file and run:

docker run -v /host/config.yaml:/app/config.yaml \
           -v /host/output:/app/output \
           podcast_scraper

The container will automatically:

Read config from /app/config.yaml (default location)
Start the service without requiring --config argument
Process episodes according to your configuration

Service Mode with Custom Config Path¶

Use environment variable to specify a custom config path:

docker run -e PODCAST_SCRAPER_CONFIG=/custom/path/config.yaml \
           -v /host/config.yaml:/custom/path/config.yaml \
           -v /host/output:/app/output \
           podcast_scraper

Supervisor Mode (Advanced)¶

For advanced process management with automatic restarts and logging:

docker run -v /host/config.yaml:/app/config.yaml \
           -v /host/supervisor.conf:/etc/supervisor/conf.d/podcast_scraper.conf \
           -v /host/output:/app/output \
           -v /host/logs:/var/log/podcast_scraper \
           podcast_scraper

Environment Variables¶

Variable	Default	Description
`PODCAST_SCRAPER_CONFIG`	`/app/config.yaml`	Path to configuration file
`PODCAST_SCRAPER_WORK_DIR`	`/app`	Working directory for service

Setting Environment Variables¶

Using -e flag:

docker run -e PODCAST_SCRAPER_CONFIG=/custom/path/config.yaml \
           podcast_scraper

Using Docker Compose:

services:
  podcast_scraper:
    image: podcast_scraper:latest
    environment:
      - PODCAST_SCRAPER_CONFIG=/app/config.yaml
      - PODCAST_SCRAPER_WORK_DIR=/app

Volume Mounts¶

Required Volumes¶

Config file: Mount your configuration file to /app/config.yaml (or path specified by PODCAST_SCRAPER_CONFIG):

-v /host/config.yaml:/app/config.yaml

Output directory: Mount output directory to location specified in your config file:

-v /host/output:/app/output

Optional Volumes¶

Supervisor config: Mount supervisor configuration to enable supervisor mode:

-v /host/supervisor.conf:/etc/supervisor/conf.d/podcast_scraper.conf

Logs: Mount log directory for supervisor-managed logging:

-v /host/logs:/var/log/podcast_scraper

Model cache directories (ML variant only, recommended for persistence):

Mount model cache directories to persist downloaded models across container restarts. This significantly speeds up container startup after the first run:

# Whisper model cache
-v /host/whisper-cache:/opt/whisper-cache

# Transformers/Hugging Face model cache
-v /host/huggingface-cache:/home/podcast/.cache/huggingface

Benefits:

Faster startup: Models don't need to be re-downloaded on container restart
Persistent storage: Models survive container recreation
Bandwidth savings: Models downloaded once, reused across containers
Offline capability: Models available even if download fails

Example with all recommended volumes (ML variant):

docker run -v /host/config.yaml:/app/config.yaml \
           -v /host/output:/app/output \
           -v /host/whisper-cache:/opt/whisper-cache \
           -v /host/huggingface-cache:/home/podcast/.cache/huggingface \
           -v /host/logs:/var/log/podcast_scraper \
           podcast-scraper:ml

Supervisor Integration¶

Supervisor provides advanced process management with automatic restarts, logging, and monitoring.

Enabling Supervisor Mode¶

Create supervisor config file (see docker/supervisor.conf.example in the repository):

[supervisord]
nodaemon=true
logfile=/var/log/supervisor/supervisord.log
pidfile=/var/run/supervisord.pid

[program:podcast_scraper]
command=python -m podcast_scraper.service
directory=/app
autostart=true
autorestart=true
startretries=3
startsecs=10
stopwaitsecs=30
stdout_logfile=/var/log/podcast_scraper/stdout.log
stderr_logfile=/var/log/podcast_scraper/stderr.log
stdout_logfile_maxbytes=10MB
stderr_logfile_maxbytes=10MB
stdout_logfile_backups=5
stderr_logfile_backups=5
environment=PYTHONUNBUFFERED="1",PODCAST_SCRAPER_CONFIG="/app/config.yaml"

Mount supervisor config:

docker run -v /host/config.yaml:/app/config.yaml \
           -v /host/supervisor.conf:/etc/supervisor/conf.d/podcast_scraper.conf \
           podcast_scraper

Container automatically detects supervisor config and starts supervisor instead of running service directly.

Supervisor Benefits¶

Automatic restarts: Service restarts automatically on failure
Logging: Structured logging to files with rotation
Monitoring: Supervisor provides status and control commands
Process management: Supervisor manages the service lifecycle

Docker Compose Examples¶

Quick Start with Provided Files¶

The repository includes ready-to-use Docker Compose files:

ML-enabled variant:

# Use the default docker-compose.yml
docker-compose up -d

LLM-only variant:

# Use the LLM-only specific compose file
docker-compose -f docker-compose.llm-only.yml up -d

See docker-compose.yml and docker-compose.llm-only.yml in the repository root for complete examples with resource limits, volume mounts, and environment variables.

Basic Service¶

version: '3.8'

services:
  podcast_scraper:
    image: podcast_scraper:latest
    volumes:
      - ./config.yaml:/app/config.yaml:ro
      - ./output:/app/output
    environment:
      - PODCAST_SCRAPER_CONFIG=/app/config.yaml
    restart: unless-stopped

Service with Supervisor¶

version: '3.8'

services:
  podcast_scraper:
    image: podcast_scraper:latest
    volumes:
      - ./config.yaml:/app/config.yaml:ro
      - ./output:/app/output
      - ./supervisor.conf:/etc/supervisor/conf.d/podcast_scraper.conf:ro
      - ./logs:/var/log/podcast_scraper
    environment:
      - PODCAST_SCRAPER_CONFIG=/app/config.yaml
    restart: unless-stopped

Service with Model Cache Persistence¶

version: '3.8'

services:
  podcast_scraper:
    image: podcast_scraper:ml
    volumes:
      - ./config.yaml:/app/config.yaml:ro
      - ./output:/app/output
      # Model cache persistence (recommended)
      - ./whisper-cache:/opt/whisper-cache
      - ./huggingface-cache:/home/podcast/.cache/huggingface
    environment:
      - PODCAST_SCRAPER_CONFIG=/app/config.yaml
    restart: unless-stopped

Service with Custom Config Path¶

version: '3.8'

services:
  podcast_scraper:
    image: podcast_scraper:latest
    volumes:
      - ./config.yaml:/custom/path/config.yaml:ro
      - ./output:/app/output
    environment:
      - PODCAST_SCRAPER_CONFIG=/custom/path/config.yaml
    restart: unless-stopped

Configuration File¶

The service reads configuration from a JSON or YAML file. See Configuration API for complete configuration options.

Example config.yaml:

rss: https://example.com/feed.xml
output_dir: /app/output
max_episodes: 50
transcribe_missing: true
whisper_model: base.en
generate_metadata: true
generate_summaries: true
summary_provider: transformers
summary_model: facebook/bart-base

Error Handling¶

Missing Config File¶

If the config file is not found, the container will exit with an error:

Error: Config file not found: /app/config.yaml
Please mount a config file or set PODCAST_SCRAPER_CONFIG environment variable

Solution: Mount a config file or set PODCAST_SCRAPER_CONFIG environment variable.

Invalid Config File¶

If the config file is invalid, the service will log an error and exit:

Error: Failed to load configuration file: <error details>

Solution: Validate your config file format (JSON or YAML) and ensure all required fields are present.

Backwards Compatibility¶

The service maintains backwards compatibility with CLI-style usage:

# Still works: explicit --config argument
docker run -v /host/config.yaml:/app/config.yaml \
           podcast_scraper --config /app/config.yaml

However, the recommended approach is to use the default config path or environment variable:

# Recommended: use default path or environment variable
docker run -v /host/config.yaml:/app/config.yaml \
           podcast_scraper

Troubleshooting¶

Container Exits Immediately¶

Check config file:

Ensure config file is mounted correctly
Verify config file path matches PODCAST_SCRAPER_CONFIG (or default /app/config.yaml)
Check config file permissions (must be readable)

Check logs:

docker logs <container_id>

Supervisor Not Starting¶

Check supervisor config:

Ensure supervisor config is mounted to /etc/supervisor/conf.d/podcast_scraper.conf
Validate supervisor config syntax (INI format)
Check supervisor logs: /var/log/supervisor/supervisord.log

Service Fails to Start¶

Check service logs:

Direct mode: Check container logs with docker logs
Supervisor mode: Check /var/log/podcast_scraper/stdout.log and /var/log/podcast_scraper/stderr.log

Common issues:

Invalid config file format
Missing required config fields
Network issues (RSS feed unreachable)
Permission issues (output directory not writable)

Performance Optimization¶

Model Cache Persistence¶

For ML-enabled variants, mount model cache directories to avoid re-downloading models on each container restart:

-v /host/whisper-cache:/opt/whisper-cache \
-v /host/huggingface-cache:/home/podcast/.cache/huggingface

Performance impact:

First run: Models downloaded (~1-2GB, 5-10 minutes)
Subsequent runs: Models loaded from cache (~10-30 seconds)
Without cache: Models re-downloaded on every container restart

Example with model cache persistence:

docker run -v /host/config.yaml:/app/config.yaml \
           -v /host/output:/app/output \
           -v /host/whisper-cache:/opt/whisper-cache \
           -v /host/huggingface-cache:/home/podcast/.cache/huggingface \
           podcast-scraper:ml

Resource Limits (Production)¶

For production deployments, consider setting resource limits:

services:
  podcast_scraper:
    image: podcast-scraper:ml
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G
        reservations:
          cpus: '1'
          memory: 2G

Recommendations:

LLM-only variant: 1 CPU, 512MB-1GB RAM
ML-enabled variant: 2-4 CPUs, 2-4GB RAM (depending on model size)
GPU support: Not included in base image, requires custom build

Thread Optimization¶

The container sets thread limits to prevent resource contention:

OMP_NUM_THREADS=1
MKL_NUM_THREADS=1
NUMEXPR_NUM_THREADS=1

These are optimal for containerized environments. Override only if you have specific performance requirements.

Build Optimization¶

Faster builds:

Use --build-arg PRELOAD_ML_MODELS=false to skip model preloading during build
Models will be downloaded on first container run instead
Reduces build time from ~10-15 min to ~5-7 min

Smaller images:

Use LLM-only variant (INSTALL_EXTRAS="") if you only need API providers
Reduces image size from ~1-3GB to ~200-300MB

Security Best Practices¶

Secrets Management¶

Never commit secrets to Docker images or config files:

# ❌ BAD: Hardcoded in config file
openai_api_key: sk-abc123...

# ✅ GOOD: Use environment variables
docker run -e OPENAI_API_KEY=sk-abc123... podcast-scraper

# ✅ GOOD: Use .env file (not committed)
docker run --env-file .env podcast-scraper

Docker Compose with secrets:

services:
  podcast_scraper:
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}  # From .env file or host environment

Non-Root User¶

The container runs as non-root user (podcast, UID 1000) for security:

Reduced attack surface: Limited privileges if container is compromised
File permissions: Proper ownership for application directories
Industry standard: Follows Docker security best practices

Image Security¶

Regular updates:

Rebuild images regularly to get security patches
Use versioned tags instead of :latest in production
Monitor for vulnerabilities with docker scan or Snyk

Minimal base image:

Uses python:3.12-slim (Debian-based, minimal packages)
Multi-stage build removes build tools from final image
Only runtime dependencies included

Network Security¶

Limit network exposure:

Container doesn't expose ports (no web server)
Only outbound connections (RSS feeds, API calls)
Use Docker networks for service isolation

File Permissions¶

Config file permissions:

Mount config files as read-only (:ro flag)
Use proper file permissions on host (e.g., chmod 600 config.yaml)

# Read-only mount for security
docker run -v ./config.yaml:/app/config.yaml:ro podcast-scraper

Resource Limits (Docker Compose)¶

Set resource limits to prevent resource exhaustion:

deploy:
  resources:
    limits:
      cpus: '2'
      memory: 4G

Security Scanning¶

Scan images for vulnerabilities:

# Using Docker Scout (built-in)
docker scout cves podcast-scraper:latest

# Using Snyk
snyk test --docker podcast-scraper:latest

CI/CD integration:

Snyk workflow scans Docker images automatically
Results available in GitHub Security tab

Best Practices Summary¶

✅ DO:

Use environment variables for secrets
Mount config files as read-only
Use versioned image tags in production
Set resource limits
Scan images regularly for vulnerabilities
Use non-root user (already configured)

❌ DON'T:

Commit API keys to config files
Use :latest tag in production
Run as root user
Expose unnecessary ports
Skip security scanning
Hardcode secrets in Dockerfiles

Service API - Service mode API reference
Configuration API - Config file format and options
Docker Variants Guide - LLM-only vs ML-enabled variants
Supervisor Example: config/examples/supervisor.conf.example (in repository root)
Docker Supervisor Config: docker/supervisor.conf.example (in repository root)