Docker Variants Guide

This guide explains how to use different Docker image variants based on your needs: LLM-only (small, fast) or ML-enabled (full features).

Overview

Podcast Scraper provides two Docker image variants:

| Variant | Tag | Size | Use Case | Dependencies |
| --- | --- | --- | --- | --- |
| LLM-only | :llm-only or :latest-llm | ~200-300MB | OpenAI/API providers only | Core + OpenAI SDK |
| ML-enabled | :ml or :latest | ~1-3GB | Local ML models (Whisper, spaCy, Transformers) | Core + ML dependencies |

Quick Start

LLM-Only Variant (Small, Fast)

For users who only use OpenAI or other LLM API providers:

# Build LLM-only image
docker build --build-arg INSTALL_EXTRAS="" -t podcast-scraper:llm-only .

# Run LLM-only image
docker run -v /host/config.yaml:/app/config.yaml \
           -e OPENAI_API_KEY=sk-your-key \
           podcast-scraper:llm-only

Benefits:

  • Smaller image: ~200-300MB vs ~1-3GB (roughly 70-90% smaller, depending on the ML build)
  • Faster builds: No ML dependencies to download/compile
  • Faster startup: No model loading overhead
  • Lower resource usage: No GPU/CPU-intensive ML libraries

Limitations:

  • Cannot use local Whisper transcription
  • Cannot use local spaCy speaker detection
  • Cannot use local Transformers summarization
  • Requires API keys for OpenAI providers

ML-Enabled Variant (Full Features)

For users who want local ML models:

# Build ML-enabled image (default)
docker build --build-arg INSTALL_EXTRAS=ml -t podcast-scraper:ml .

# Or omit the build arg (ML is the default, for backwards compatibility)
docker build -t podcast-scraper:ml .

# Run ML-enabled image
docker run -v /host/config.yaml:/app/config.yaml \
           podcast-scraper:ml

Benefits:

  • Full features: All providers available (local + API)
  • Privacy: Local processing, no API calls needed
  • Cost-effective: No API costs for transcription/summarization
  • Offline capable: Works without internet (after models downloaded)

Limitations:

  • Larger image: ~1-3GB (includes ML models)
  • Slower builds: ML dependencies take time to download/compile
  • Higher resource usage: Requires CPU/GPU for ML inference

Build Arguments

INSTALL_EXTRAS

Controls which optional dependencies to install:

  • "" (empty) = Core only (LLM-only variant)
  • "ml" = Core + ML dependencies (ML-enabled variant, default)

Example:

# LLM-only
docker build --build-arg INSTALL_EXTRAS="" -t podcast-scraper:llm-only .

# ML-enabled
docker build --build-arg INSTALL_EXTRAS=ml -t podcast-scraper:ml .

PRELOAD_ML_MODELS

Only applies when INSTALL_EXTRAS=ml. Controls ML model preloading:

  • "true" = Preload models during build (default)
  • "false" = Skip model preloading (faster builds, models loaded at runtime)

Example:

# Build ML variant without preloading models (faster build)
docker build --build-arg INSTALL_EXTRAS=ml --build-arg PRELOAD_ML_MODELS=false -t podcast-scraper:ml .

ML preloading details

When INSTALL_EXTRAS=ml and PRELOAD_ML_MODELS=true, the image build runs scripts/cache/preload_ml_models.py --production (the same bundle as make preload-ml-models-production). That bundle covers Whisper (tiny.en and base.en), the production and test Transformers models, hybrid LongT5/FLAN, and the GIL evidence models. There are no separate WHISPER_MODELS or SKIP_TRANSFORMERS Docker build args; to get a smaller image that downloads models at runtime, set PRELOAD_ML_MODELS=false.
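The two build arguments can be combined in a small wrapper. The sketch below is only an illustration: build_cmd is a hypothetical helper name, not part of the project, and it assumes the Dockerfile sits in the current directory as in the examples above. It prints the docker build command instead of running it, so the result can be reviewed first.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the project): print the docker build
# command for a variant rather than executing it.
build_cmd() {
  local variant="$1" preload="${2:-true}"
  case "$variant" in
    llm-only)
      # An empty INSTALL_EXTRAS selects the core-only image;
      # PRELOAD_ML_MODELS does not apply here.
      printf 'docker build --build-arg INSTALL_EXTRAS= -t podcast-scraper:llm-only .\n'
      ;;
    ml)
      printf 'docker build --build-arg INSTALL_EXTRAS=ml --build-arg PRELOAD_ML_MODELS=%s -t podcast-scraper:ml .\n' "$preload"
      ;;
    *)
      printf 'unknown variant: %s\n' "$variant" >&2
      return 1
      ;;
  esac
}

# ML variant without preloading models (faster build)
build_cmd ml false
```

To execute the printed command, pipe it to a shell once it looks right, for example build_cmd ml false | sh.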

Tagging Strategy

LLM-only variant:

  • podcast-scraper:llm-only - LLM-only variant
  • podcast-scraper:latest-llm - Latest LLM-only variant
  • podcast-scraper:2.5.0-llm - Versioned LLM-only variant

ML-enabled variant:

  • podcast-scraper:ml - ML-enabled variant
  • podcast-scraper:latest - Latest ML-enabled variant (default)
  • podcast-scraper:2.5.0 - Versioned ML-enabled variant

Industry Best Practices

This follows common Docker tagging patterns:

  1. :latest points to the most common variant (ML-enabled in this case)
  2. Variant-specific tags (:llm-only, :ml) for explicit selection
  3. Versioned tags (:2.5.0, :2.5.0-llm) for reproducibility
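The tagging scheme above can be expressed as a small script. This is a sketch under assumptions: tags_for and tag_flags are hypothetical helper names, and the version string is passed in by the caller. It relies only on the fact that docker build accepts several -t flags in one invocation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the tags described above for one variant and version.
tags_for() {
  local variant="$1" version="$2"
  case "$variant" in
    llm-only)
      printf '%s\n' "podcast-scraper:llm-only" "podcast-scraper:latest-llm" "podcast-scraper:${version}-llm"
      ;;
    ml)
      printf '%s\n' "podcast-scraper:ml" "podcast-scraper:latest" "podcast-scraper:${version}"
      ;;
    *)
      printf 'unknown variant: %s\n' "$variant" >&2
      return 1
      ;;
  esac
}

# Hypothetical helper: turn the tag list into "-t tag" flags for docker build.
tag_flags() {
  local flags="" t
  for t in $(tags_for "$1" "$2"); do
    flags="$flags -t $t"
  done
  printf '%s\n' "${flags# }"
}

tag_flags llm-only 2.5.0
# prints: -t podcast-scraper:llm-only -t podcast-scraper:latest-llm -t podcast-scraper:2.5.0-llm
```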

Where the variants are used

The variants are defined as build arguments in docker/pipeline/Dockerfile. They show up in two places in the compose stack:

| Compose service | INSTALL_EXTRAS | Built by |
| --- | --- | --- |
| pipeline | ml | make stack-test-build (or docker compose --profile pipeline build pipeline) |
| pipeline-llm | llm | make stack-test-build-cloud (or docker compose --profile pipeline-llm build pipeline-llm) |

When the API spawns a pipeline job, viewer_operator.yaml's pipeline_install_extras field selects which service the factory targets: ml → pipeline (ML image), llm → pipeline-llm (LLM image). See the Docker Compose guide for the operator UI flow that drives this selection.
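The mapping from pipeline_install_extras to a compose service can be mirrored in a shell helper, for example to rebuild the matching image by hand. This is a sketch, not the factory's actual code: service_for_extras and compose_build_cmd are hypothetical names, and the profile names are taken from the table above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a pipeline_install_extras value to its compose service.
service_for_extras() {
  case "$1" in
    ml)  echo pipeline ;;
    llm) echo pipeline-llm ;;
    *)
      printf 'unsupported pipeline_install_extras: %s\n' "$1" >&2
      return 1
      ;;
  esac
}

# Print (do not run) the compose build command for that service.
# In the table above the profile name matches the service name.
compose_build_cmd() {
  local svc
  svc=$(service_for_extras "$1") || return 1
  printf 'docker compose --profile %s build %s\n' "$svc" "$svc"
}

compose_build_cmd llm
# prints: docker compose --profile pipeline-llm build pipeline-llm
```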

Single-container docker run examples

If you want to run the pipeline as a one-shot or scheduler-driven container without the full compose stack:

LLM-only:

docker build --build-arg INSTALL_EXTRAS=llm -f docker/pipeline/Dockerfile -t podcast-scraper:llm .
docker run -v ./config.yaml:/app/config.yaml \
           -v ./output:/app/output \
           -e OPENAI_API_KEY=sk-your-key \
           podcast-scraper:llm

ML-enabled:

docker build --build-arg INSTALL_EXTRAS=ml -f docker/pipeline/Dockerfile -t podcast-scraper:ml .
# The two cache mounts are optional: they persist downloaded models (avoids redownload on rebuild)
docker run -v ./config.yaml:/app/config.yaml \
           -v ./output:/app/output \
           -v ./whisper-cache:/opt/whisper-cache \
           -v ./huggingface-cache:/home/podcast/.cache/huggingface \
           podcast-scraper:ml

The single-container flow uses podcast_scraper.service (the entrypoint fallback), which reads /app/config.yaml. See the Docker Service guide for env vars, supervisor mode, and security hardening.

Configuration Compatibility

Both variants use the same configuration format. The difference is which providers are available:

LLM-Only Configuration

rss: https://example.com/feed.xml
output_dir: /app/output

# Only API providers available
transcription_provider: openai
speaker_detector_provider: openai
summary_provider: openai

# OpenAI API key required
openai_api_key: ${OPENAI_API_KEY}

ML-Enabled Configuration

rss: https://example.com/feed.xml
output_dir: /app/output

# All providers available (local + API)
transcription_provider: whisper  # or openai
speaker_detector_provider: spacy  # or openai
summary_provider: transformers  # or openai

# Optional: OpenAI API key for API providers
openai_api_key: ${OPENAI_API_KEY}
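Because the provider keys determine which image a config can run on, a quick check before choosing a tag can help. The following is a minimal sketch, assuming the three provider keys appear one per line as in the examples above; required_variant is a hypothetical helper, not a project command.

```shell
#!/usr/bin/env bash
# Hypothetical check (not part of the project): report which image variant
# a config file needs, based on the provider keys shown above.
required_variant() {
  # A provider key set to a local backend means the ML-enabled image is needed.
  if grep -Eq '^[[:space:]]*(transcription_provider|speaker_detector_provider|summary_provider):[[:space:]]*(whisper|spacy|transformers)([[:space:]]|#|$)' "$1"; then
    echo ml
  else
    echo llm-only
  fi
}

# Demo against a throwaway config file.
demo_config=$(mktemp)
printf 'transcription_provider: whisper  # or openai\nsummary_provider: openai\n' > "$demo_config"
required_variant "$demo_config"   # prints: ml
rm -f "$demo_config"
```

The check only looks at these three keys; a config that uses API providers everywhere reports llm-only, though it also runs on the ML-enabled image.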

Migration Guide

Switching from ML to LLM-Only

  1. Rebuild image:

docker build --build-arg INSTALL_EXTRAS="" -t podcast-scraper:llm-only .

  2. Update configuration:
     • Change transcription_provider: whisper → transcription_provider: openai
     • Change speaker_detector_provider: spacy → speaker_detector_provider: openai
     • Change summary_provider: transformers → summary_provider: openai

  3. Add API keys:
     • Set OPENAI_API_KEY environment variable
     • Or add openai_api_key to config file

  4. Update Docker Compose:
     • Change image: podcast-scraper:ml → image: podcast-scraper:llm-only
     • Add OPENAI_API_KEY environment variable
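The configuration update in this migration can be scripted. The sketch below assumes a flat YAML config with one provider key per line, as in the examples in this guide; to_llm_only_config is a hypothetical helper. It writes the converted config to standard output and leaves the original file untouched.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a copy of a config with the local providers
# (whisper, spacy, transformers) replaced by openai.
to_llm_only_config() {
  sed -E \
    -e 's/^([[:space:]]*transcription_provider:[[:space:]]*)whisper/\1openai/' \
    -e 's/^([[:space:]]*speaker_detector_provider:[[:space:]]*)spacy/\1openai/' \
    -e 's/^([[:space:]]*summary_provider:[[:space:]]*)transformers/\1openai/' \
    "$1"
}

# Demo against a throwaway config file.
demo_config=$(mktemp)
printf 'transcription_provider: whisper\nspeaker_detector_provider: spacy\nsummary_provider: transformers\n' > "$demo_config"
to_llm_only_config "$demo_config"
rm -f "$demo_config"
```

A typical use would be to_llm_only_config config.yaml > config.llm.yaml, followed by a manual review of the result before mounting it into the container.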

Switching from LLM-Only to ML

  1. Rebuild image:

docker build --build-arg INSTALL_EXTRAS=ml -t podcast-scraper:ml .

  2. Update configuration:
     • Change providers to local ML providers (optional, can still use API)
     • Remove openai_api_key if using only local providers

  3. Update Docker Compose:
     • Change image: podcast-scraper:llm-only → image: podcast-scraper:ml

CI/CD Integration

GitHub Actions Example

jobs:
  build-variants:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        variant:
          - name: llm-only
            args: INSTALL_EXTRAS=
            tag: podcast-scraper:llm-only
          - name: ml
            args: INSTALL_EXTRAS=ml
            tag: podcast-scraper:ml
    steps:
      # Check out the repository so the Dockerfile is available to the build
      - uses: actions/checkout@v4
      - name: Build ${{ matrix.variant.name }} variant
        run: |
          docker build \
            --build-arg ${{ matrix.variant.args }} \
            -t ${{ matrix.variant.tag }} .

Size Comparison

| Variant | Base Image | Dependencies | Models | Total |
| --- | --- | --- | --- | --- |
| LLM-only | ~150MB | ~50MB | 0MB | ~200MB |
| ML-enabled | ~150MB | ~500MB | ~1-2GB | ~1.5-2.5GB |

Note: Actual sizes vary based on:

  • Base image version
  • Dependency versions
  • Model preloading settings
  • Build optimizations
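Since the figures above are estimates, it can be worth measuring the images you actually built. The sketch below assumes both variants are tagged as in this guide; bytes_to_mb and report_sizes are hypothetical helper names. It reads the size in bytes from docker image inspect and falls back to a notice when Docker or the image is not available.

```shell
#!/usr/bin/env bash
# Convert a byte count to whole megabytes (1 MB = 1,000,000 bytes).
bytes_to_mb() {
  echo $(( $1 / 1000000 ))
}

# Hypothetical helper: print the size of each locally built variant.
report_sizes() {
  local tag size
  for tag in podcast-scraper:llm-only podcast-scraper:ml; do
    if command -v docker >/dev/null 2>&1 \
       && size=$(docker image inspect --format '{{.Size}}' "$tag" 2>/dev/null); then
      echo "$tag: $(bytes_to_mb "$size") MB"
    else
      echo "$tag: not built locally"
    fi
  done
}

report_sizes
```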

Performance Comparison

| Metric | LLM-only | ML-enabled |
| --- | --- | --- |
| Build time | ~2-3 min | ~10-15 min |
| Startup | <1 sec | ~5-10 sec |
| Memory usage | ~100MB | ~500MB-2GB |
| CPU usage | Low | High (during inference) |
| Network required | Yes (API calls) | No (after models loaded) |

Decision Guide

Choose LLM-only if:

  • You only use OpenAI/API providers
  • You want smaller images and faster builds
  • You have reliable internet for API calls
  • You prefer API-based processing

Choose ML-enabled if:

  • You want local Whisper transcription
  • You want local spaCy speaker detection
  • You want local Transformers summarization
  • You need offline capability
  • You want to avoid API costs
  • You prioritize privacy (local processing)