Docker Variants Guide¶

This guide explains how to use different Docker image variants based on your needs: LLM-only (small, fast) or ML-enabled (full features).

Overview¶

Podcast Scraper provides two Docker image variants:

| Variant | Tag | Size | Use Case | Dependencies |
| --- | --- | --- | --- | --- |
| LLM-only | `:llm-only` or `:latest-llm` | ~200-300MB | OpenAI/API providers only | Core + OpenAI SDK |
| ML-enabled | `:ml` or `:latest` | ~1-3GB | Local ML models (Whisper, spaCy, Transformers) | Core + ML dependencies |

Quick Start¶

LLM-Only Variant (Recommended for API Users)¶

For users who only use OpenAI or other LLM API providers:

# Build LLM-only image
docker build --build-arg INSTALL_EXTRAS="" -t podcast-scraper:llm-only .

# Run LLM-only image
docker run -v /host/config.yaml:/app/config.yaml \
           -e OPENAI_API_KEY=sk-your-key \
           podcast-scraper:llm-only

Benefits:

Smaller image: ~200-300MB vs ~1-3GB (roughly 70-90% smaller)
Faster builds: No ML dependencies to download/compile
Faster startup: No model loading overhead
Lower resource usage: No GPU/CPU-intensive ML libraries

Limitations:

Cannot use local Whisper transcription
Cannot use local spaCy speaker detection
Cannot use local Transformers summarization
Requires API keys for OpenAI providers
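Because the LLM-only image depends on an API key at runtime, a small wrapper can fail fast when the key is missing. The sketch below is illustrative only: `require_api_key` and `print_llm_run_command` are hypothetical names, not part of Podcast Scraper. The image tag and mount path are the ones used in the run command above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Podcast Scraper): fail fast when the
# API key needed by the LLM-only image is missing from the environment.
require_api_key() {
    # ${OPENAI_API_KEY:-} expands to an empty string when the variable is unset.
    if [ -z "${OPENAI_API_KEY:-}" ]; then
        echo "OPENAI_API_KEY is not set; the LLM-only image needs it" >&2
        return 1
    fi
    return 0
}

# Print (rather than execute) the run command so it can be reviewed first.
# "-e OPENAI_API_KEY" without a value forwards the host variable, which
# keeps the key itself out of the printed command line.
print_llm_run_command() {
    require_api_key || return 1
    echo "docker run -v /host/config.yaml:/app/config.yaml -e OPENAI_API_KEY podcast-scraper:llm-only"
}
```

Calling `print_llm_run_command` with the key exported prints the command; without the key it prints an error and returns a non-zero status.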

ML-Enabled Variant (Full Features)¶

For users who want local ML models:

# Build ML-enabled image (default)
docker build --build-arg INSTALL_EXTRAS=ml -t podcast-scraper:ml .

# Or rely on the default (ML is the default for backward compatibility)
docker build -t podcast-scraper:ml .

# Run ML-enabled image
docker run -v /host/config.yaml:/app/config.yaml \
           podcast-scraper:ml

Benefits:

Full features: All providers available (local + API)
Privacy: Local processing, no API calls needed
Cost-effective: No API costs for transcription/summarization
Offline capable: Works without internet (after models downloaded)

Limitations:

Larger image: ~1-3GB (includes ML models)
Slower builds: ML dependencies take time to download/compile
Higher resource usage: Requires CPU/GPU for ML inference

Build Arguments¶

`INSTALL_EXTRAS`¶

Controls which optional dependencies to install:

"" (empty) = Core only (LLM-only variant)
"ml" = Core + ML dependencies (ML-enabled variant, default)

Example:

# LLM-only
docker build --build-arg INSTALL_EXTRAS="" -t podcast-scraper:llm-only .

# ML-enabled
docker build --build-arg INSTALL_EXTRAS=ml -t podcast-scraper:ml .
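The mapping from variant to `INSTALL_EXTRAS` value can be captured in a small helper, which is handy in build scripts. This is a minimal sketch, assuming the two values listed in this section; the function names are hypothetical, and the command is printed rather than executed.

```shell
#!/bin/sh
# Hypothetical helper: map a variant name to its INSTALL_EXTRAS value,
# using the values from this section ("" for LLM-only, "ml" for ML-enabled).
extras_for_variant() {
    case "$1" in
        llm-only) echo "" ;;
        ml)       echo "ml" ;;
        *)        echo "unknown variant: $1" >&2; return 1 ;;
    esac
}

# Print the docker build command for a variant without running it.
# "INSTALL_EXTRAS=" (nothing after the equals sign) passes an empty value,
# which is equivalent to INSTALL_EXTRAS="".
build_command() {
    extras=$(extras_for_variant "$1") || return 1
    echo "docker build --build-arg INSTALL_EXTRAS=$extras -t podcast-scraper:$1 ."
}
```

For example, `build_command ml` prints the ML-enabled build command shown above, and an unrecognized variant name returns a non-zero status instead of building the wrong image.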

`PRELOAD_ML_MODELS`¶

Only applies when INSTALL_EXTRAS=ml. Controls ML model preloading:

"true" = Preload models during build (default)
"false" = Skip model preloading (faster builds, models loaded at runtime)

Example:

# Build ML variant without preloading models (faster build)
docker build --build-arg INSTALL_EXTRAS=ml --build-arg PRELOAD_ML_MODELS=false -t podcast-scraper:ml .

ML preloading details¶

When INSTALL_EXTRAS=ml and PRELOAD_ML_MODELS=true, the image build runs scripts/cache/preload_ml_models.py --production (the same bundle as make preload-ml-models-production). This preloads Whisper (tiny.en and base.en), the production and test Transformers models, hybrid LongT5/FLAN, and the GIL evidence models. There are no separate WHISPER_MODELS or SKIP_TRANSFORMERS Docker build args; for a smaller image that downloads models at runtime, set PRELOAD_ML_MODELS=false.
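Since PRELOAD_ML_MODELS only applies to the ML variant, a build script can drop the flag elsewhere and reject unexpected values. The following is a sketch under those assumptions; `build_args` is a hypothetical helper, not something the repository provides.

```shell
#!/bin/sh
# Hypothetical helper: assemble --build-arg flags, keeping PRELOAD_ML_MODELS
# only where the guide says it applies (INSTALL_EXTRAS=ml).
build_args() {
    extras=$1
    preload=${2:-true}   # "true" is the documented default
    case "$preload" in
        true|false) ;;
        *) echo "PRELOAD_ML_MODELS must be true or false" >&2; return 1 ;;
    esac
    if [ "$extras" = "ml" ]; then
        echo "--build-arg INSTALL_EXTRAS=ml --build-arg PRELOAD_ML_MODELS=$preload"
    else
        # Preloading has no effect without the ML extras, so the flag is omitted.
        echo "--build-arg INSTALL_EXTRAS=$extras"
    fi
}
```

The output is meant to be spliced into a `docker build` command line, for example `docker build $(build_args ml false) -t podcast-scraper:ml .`.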

Tagging Strategy¶

Recommended Tags¶

LLM-only variant:

podcast-scraper:llm-only - LLM-only variant
podcast-scraper:latest-llm - Latest LLM-only variant
podcast-scraper:2.5.0-llm - Versioned LLM-only variant

ML-enabled variant:

podcast-scraper:ml - ML-enabled variant
podcast-scraper:latest - Latest ML-enabled variant (default)
podcast-scraper:2.5.0 - Versioned ML-enabled variant

Industry Best Practices¶

This follows common Docker tagging patterns:

:latest points to the most common variant (ML-enabled in this case)
Variant-specific tags (:llm-only, :ml) for explicit selection
Versioned tags (:2.5.0, :2.5.0-llm) for reproducibility
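The tag scheme above can be generated from a variant name and a version number. This sketch assumes the image was first built under its variant tag (`podcast-scraper:llm-only` or `podcast-scraper:ml`); the helper names are hypothetical, and the `docker tag` commands are printed rather than executed.

```shell
#!/bin/sh
# Hypothetical helper: list the tags this guide recommends for one variant
# and version, one tag per line.
tags_for() {
    variant=$1
    version=$2
    case "$variant" in
        llm-only)
            echo "podcast-scraper:llm-only"
            echo "podcast-scraper:latest-llm"
            echo "podcast-scraper:${version}-llm"
            ;;
        ml)
            echo "podcast-scraper:ml"
            echo "podcast-scraper:latest"
            echo "podcast-scraper:${version}"
            ;;
        *)
            echo "unknown variant: $variant" >&2
            return 1
            ;;
    esac
}

# Print one "docker tag" command for each tag other than the variant tag
# the image was built under.
print_tag_commands() {
    tags=$(tags_for "$1" "$2") || return 1
    source_tag="podcast-scraper:$1"
    for tag in $tags; do
        [ "$tag" = "$source_tag" ] && continue
        echo "docker tag $source_tag $tag"
    done
}
```

Running `print_tag_commands ml 2.5.0` prints the commands that add the `:latest` and `:2.5.0` tags to an image built as `podcast-scraper:ml`.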

Where the variants are used¶

The variants are selected through build arguments in docker/pipeline/Dockerfile. They appear in two places in the compose stack:

| Compose service | `INSTALL_EXTRAS` | Built by |
| --- | --- | --- |
| `pipeline` | `ml` | `make stack-test-build` (or `docker compose --profile pipeline build pipeline`) |
| `pipeline-llm` | `llm` | `make stack-test-build-cloud` (or `docker compose --profile pipeline-llm build pipeline-llm`) |

When the API spawns a pipeline job, the pipeline_install_extras field in viewer_operator.yaml selects which service the factory targets: ml targets pipeline (the ML image), and llm targets pipeline-llm (the LLM image). See Docker Compose guide for the operator UI flow that drives this selection.
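The selection rule can be written out as a lookup. This is only an illustration of the mapping described here, not the factory's actual code; the function names are hypothetical, and the printed command is the `docker compose` form from the table.

```shell
#!/bin/sh
# Hypothetical helper mirroring the mapping described above:
# pipeline_install_extras value -> compose service name.
service_for_extras() {
    case "$1" in
        ml)  echo "pipeline" ;;
        llm) echo "pipeline-llm" ;;
        *)   echo "unsupported pipeline_install_extras: $1" >&2; return 1 ;;
    esac
}

# Print the matching compose build command. In the table above the profile
# name and the service name are the same for both variants.
compose_build_command() {
    service=$(service_for_extras "$1") || return 1
    echo "docker compose --profile $service build $service"
}
```

An unsupported value returns a non-zero status, which makes a typo in the field visible before any build starts.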

Single-container `docker run` examples¶

If you want to run the pipeline as a one-shot or scheduler-driven container without the full compose stack:

LLM-only:

docker build --build-arg INSTALL_EXTRAS=llm -f docker/pipeline/Dockerfile -t podcast-scraper:llm .
docker run -v ./config.yaml:/app/config.yaml \
           -v ./output:/app/output \
           -e OPENAI_API_KEY=sk-your-key \
           podcast-scraper:llm

ML-enabled:

docker build --build-arg INSTALL_EXTRAS=ml -f docker/pipeline/Dockerfile -t podcast-scraper:ml .

# The two cache mounts are optional: they persist downloaded models
# (avoids redownload on rebuild)
docker run -v ./config.yaml:/app/config.yaml \
           -v ./output:/app/output \
           -v ./whisper-cache:/opt/whisper-cache \
           -v ./huggingface-cache:/home/podcast/.cache/huggingface \
           podcast-scraper:ml

The single-container flow uses podcast_scraper.service (the entrypoint fallback), which reads /app/config.yaml. See Docker Service guide for env vars, supervisor mode, and security hardening.
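Bind mounts given with `-v` have a known pitfall: when the host path does not exist, Docker creates it as a directory, so a missing config.yaml becomes a directory named config.yaml. A short preflight step avoids this. The sketch below is an assumption-laden example, with `prepare_mounts` as a hypothetical helper and the directory names taken from the run commands above.

```shell
#!/bin/sh
# Hypothetical preflight: prepare host paths before bind-mounting them.
prepare_mounts() {
    base=${1:-.}
    # The config must already exist as a regular file; otherwise Docker
    # would create a directory in its place when the container starts.
    if [ ! -f "$base/config.yaml" ]; then
        echo "missing config file: $base/config.yaml" >&2
        return 1
    fi
    # Create the output and optional cache directories up front so they are
    # owned by the current user rather than created by the Docker daemon.
    mkdir -p "$base/output" "$base/whisper-cache" "$base/huggingface-cache"
}
```

Run `prepare_mounts .` in the directory that holds config.yaml before the `docker run` command.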

Configuration Compatibility¶

Both variants use the same configuration format. The difference is which providers are available:

LLM-Only Configuration¶

rss: https://example.com/feed.xml
output_dir: /app/output

# Only API providers available
transcription_provider: openai
speaker_detector_provider: openai
summary_provider: openai

# OpenAI API key required
openai_api_key: ${OPENAI_API_KEY}

ML-Enabled Configuration¶

rss: https://example.com/feed.xml
output_dir: /app/output

# All providers available (local + API)
transcription_provider: whisper  # or openai
speaker_detector_provider: spacy  # or openai
summary_provider: transformers  # or openai

# Optional: OpenAI API key for API providers
openai_api_key: ${OPENAI_API_KEY}
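A config written for the ML-enabled image will name providers that the LLM-only image cannot load. The check below is a minimal sketch for catching that before a run; `check_llm_only_config` is a hypothetical helper, and it assumes the provider keys appear at the start of a line, as in the examples above.

```shell
#!/bin/sh
# Hypothetical check: report provider settings in a config file that need
# the ML-enabled image (whisper, spacy, transformers).
# Exit status 0 means only providers available in LLM-only are configured.
check_llm_only_config() {
    config=$1
    if [ ! -f "$config" ]; then
        echo "config file not found: $config" >&2
        return 2
    fi
    # Match the three provider keys when set to a local provider; trailing
    # comments such as "# or openai" are allowed. Matching lines are printed.
    if grep -E '^(transcription_provider|speaker_detector_provider|summary_provider):[[:space:]]*(whisper|spacy|transformers)([[:space:]]|$)' "$config"; then
        echo "the settings above require the ML-enabled image" >&2
        return 1
    fi
    return 0
}
```

Running it against the ML-enabled example prints the three local provider lines and returns 1, while the LLM-only example passes silently.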

Migration Guide¶

Switching from ML to LLM-Only¶

Rebuild image:

docker build --build-arg INSTALL_EXTRAS="" -t podcast-scraper:llm-only .

Update configuration:
  Change transcription_provider: whisper → transcription_provider: openai
  Change speaker_detector_provider: spacy → speaker_detector_provider: openai
  Change summary_provider: transformers → summary_provider: openai
Add API keys:
  Set OPENAI_API_KEY environment variable
  Or add openai_api_key to config file
Update Docker Compose:
  Change image: podcast-scraper:ml → image: podcast-scraper:llm-only
  Add OPENAI_API_KEY environment variable
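The three provider changes in the configuration step can be applied mechanically. This is a sketch using `sed`, assuming each provider key sits at the start of its line; `to_llm_only_config` is a hypothetical helper, and it writes the result to standard output instead of editing the file in place.

```shell
#!/bin/sh
# Hypothetical helper: print a copy of a config with the three local
# providers switched to openai, as listed in the steps above.
# The original file is left untouched; redirect the output to save it.
to_llm_only_config() {
    sed -E \
        -e 's/^(transcription_provider:[[:space:]]*)whisper/\1openai/' \
        -e 's/^(speaker_detector_provider:[[:space:]]*)spacy/\1openai/' \
        -e 's/^(summary_provider:[[:space:]]*)transformers/\1openai/' \
        "$1"
}
```

For example, `to_llm_only_config config.yaml > config.llm.yaml` leaves the ML configuration intact, so switching back only means mounting the original file again.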

Switching from LLM-Only to ML¶

Rebuild image:

docker build --build-arg INSTALL_EXTRAS=ml -t podcast-scraper:ml .

Update configuration:
  Change providers to local ML providers (optional, can still use API)
  Remove openai_api_key if using only local providers
Update Docker Compose:
  Change image: podcast-scraper:llm-only → image: podcast-scraper:ml

CI/CD Integration¶

GitHub Actions Example¶

jobs:
  build-variants:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        variant:
          - name: llm-only
            args: INSTALL_EXTRAS=
            tag: podcast-scraper:llm-only
          - name: ml
            args: INSTALL_EXTRAS=ml
            tag: podcast-scraper:ml
    steps:
      # Check out the repository so the Dockerfile and build context exist
      - uses: actions/checkout@v4
      - name: Build ${{ matrix.variant.name }} variant
        run: |
          docker build \
            --build-arg ${{ matrix.variant.args }} \
            -t ${{ matrix.variant.tag }} .

Size Comparison¶

| Variant | Base Image | Dependencies | Models | Total |
| --- | --- | --- | --- | --- |
| LLM-only | ~150MB | ~50MB | 0MB | ~200MB |
| ML-enabled | ~150MB | ~500MB | ~1-2GB | ~1.5-2.5GB |

Note: Actual sizes vary based on:

Base image version
Dependency versions
Model preloading settings
Build optimizations
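The relative saving follows from the totals in the table: 200MB against 1500MB to 2500MB. The helper below works through that arithmetic. It is a sketch with a hypothetical function name, and the inputs are the approximate table figures rather than measured sizes.

```shell
#!/bin/sh
# Percentage size reduction of a small image relative to a large one,
# using integer arithmetic (the result is truncated, not rounded).
reduction_pct() {
    small_mb=$1
    large_mb=$2
    echo $(( (large_mb - small_mb) * 100 / large_mb ))
}

reduction_pct 200 1500   # prints 86
reduction_pct 200 2500   # prints 92
```

To compare real builds, substitute the sizes that `docker image ls` reports for the two tags.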

Performance Comparison¶

| Metric | LLM-only | ML-enabled |
| --- | --- | --- |
| Build time | ~2-3 min | ~10-15 min |
| Startup | <1 sec | ~5-10 sec |
| Memory usage | ~100MB | ~500MB-2GB |
| CPU usage | Low | High (during inference) |
| Network required | Yes (API calls) | No (after models loaded) |

Decision Guide¶

Choose LLM-only if:

You only use OpenAI/API providers
You want smaller images and faster builds
You have reliable internet for API calls
You prefer API-based processing

Choose ML-enabled if:

You want local Whisper transcription
You want local spaCy speaker detection
You want local Transformers summarization
You need offline capability
You want to avoid API costs
You prioritize privacy (local processing)

Related Documentation¶

Docker Service Guide - Service-oriented Docker usage
Development Guide - Local installation and setup
Provider Configuration - Provider setup
AI Provider Comparison - Compare providers