ADR-044: Local LLM Backend Abstraction¶
- Status: Accepted
- Date: 2026-01-11
- Authors: Podcast Scraper Team
- Related RFCs: RFC-042
Context & Problem Statement¶
Local LLM inference requires different backends depending on the hardware (PyTorch for NVIDIA GPUs, llama.cpp for Apple Silicon). Hardcoding one library would limit the project's portability.
Decision¶
We implement a Local LLM Backend Abstraction.
- The
hybrid_mlprovider interacts with backends via a simpleInferenceBackendprotocol. - Supported backends include:
transformers: For standard PyTorch/GPU execution.llama_cpp: Optimized for CPU and Apple Silicon (Metal).ollama: For users who prefer a managed local inference server.
Rationale¶
- Hardware Portability: Allows the scraper to run with peak performance on both Mac laptops and Linux servers.
- Optimization: We can leverage GGUF/quantized models via
llama.cppto run 7B-14B models in <16GB of RAM. - Future-Proofing: Easy to add new backends (e.g., vLLM, TensorRT-LLM) without touching the summarization logic.
Alternatives Considered¶
- PyTorch-Only: Rejected as it is inefficient for LLM inference on consumer-grade Apple hardware.
Consequences¶
- Positive: Broad hardware support; minimal memory footprint; flexible deployment options.
- Negative: Requires managing multiple optional system dependencies.