RFC-048: Evaluation ↔ Application Tightening & Alignment

Status: Completed
Authors: Marko Dragoljevic
Stakeholders: ML evaluation team, Core developers
Related RFCs:
docs/rfc/RFC-015-ai-experiment-pipeline.md (AI Experiment Pipeline)
docs/rfc/RFC-041-podcast-ml-benchmarking-framework.md (Benchmarking Framework)
docs/rfc/RFC-045-ml-model-optimization-guide.md (ML Model Optimization)
docs/rfc/RFC-046-materialization-architecture.md (Materialization Architecture)

Abstract

Over multiple iterations, we identified that evaluation runs and application (prod/dev) runs can diverge in behavior despite using the same underlying models. This RFC documents the concrete alignment rules, decisions made, and remaining optional improvements, so the system can ship cleanly now while leaving a clear path for future evaluation hardening.

This RFC explicitly separates:

What is required to ship now
What is optional and deferred (eval-only enhancements)

Architecture Alignment: This RFC ensures that evaluation results are representative of application behavior by enforcing a single code path, explicit parameter configuration, and comprehensive fingerprinting. It aligns with RFC-015 (AI Experiment Pipeline) and RFC-041 (Benchmarking Framework) by establishing clear contracts between evaluation and production execution.

Problem Statement

Evaluation runs and application runs can diverge in behavior despite using the same underlying models. This divergence creates several problems:

Hidden parameter drift: Implicit defaults in eval configs differ from app configs
Unrepresentative results: Evaluation metrics don't reflect actual production behavior
Debugging difficulty: When production behavior differs from eval, root cause is unclear
Silent behavior changes: Defaults and implicit logic can change without being tracked

Key Findings:

Eval vs App Differences Are Normal — But Must Be Controlled

Differences observed were caused by:

- Implicit defaults vs explicit config values
- Eval-only scoring logic (gates, caps, scope filters)
- Different preprocessing profiles applied implicitly
- Reduce-stage parameter overrides triggered dynamically

Conclusion: Evaluation and application must share the same execution path. Any difference must be:

- Explicit in config
- Logged clearly
- Optional (never implicit)

Use Cases:

Production deployment: Ensure that what was evaluated is what gets deployed
Regression detection: Identify when eval and prod behavior diverge
Reproducibility: Recreate production behavior from eval artifacts
Debugging: Trace production issues back to eval runs

Goals

Ensure evaluation results are representative of application behavior
Avoid hidden parameter drift between eval configs and app configs
Make all summarization / NER behavior explainable from logs + config
Ship a stable production version without introducing a new eval framework chapter

Non-goals

Building dashboards
Introducing new eval modes or presets
Changing baseline model choices

Design & Implementation

Alignment Rules (Authoritative)

Rule 1: One Code Path

Requirement: Eval runs must call the same providers and pipeline stages as the app.

Implementation:

- No duplicated logic ("eval summarizer", "prod summarizer")
- Shared provider initialization
- Shared preprocessing pipeline
- Shared generation logic

Allowed differences:

- Scorers (eval-only, read-only observers)
- Reference loading (eval-only)
- Gating thresholds (eval-only validation)
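The shape this rule implies can be sketched as follows. This is a minimal illustration, not the project's actual API: `run_pipeline`, `run_app`, `run_eval`, `exact_match`, and the `max_chars` setting are all hypothetical stand-ins. The point is only that the eval entry point calls the same function as the app and adds a scorer on top.

```python
from typing import Callable, Dict


def run_pipeline(text: str, config: Dict[str, int]) -> str:
    """Hypothetical shared execution path used by both app and eval."""
    cleaned = " ".join(text.split())        # shared preprocessing (whitespace cleanup)
    return cleaned[: config["max_chars"]]   # stand-in for shared generation logic


def run_app(text: str, config: Dict[str, int]) -> str:
    """Application entry point: just the shared pipeline."""
    return run_pipeline(text, config)


def run_eval(
    text: str,
    config: Dict[str, int],
    reference: str,
    scorer: Callable[[str, str], float],
) -> Dict[str, object]:
    """Eval entry point: same pipeline call, plus eval-only scoring."""
    prediction = run_pipeline(text, config)  # identical call to the app's
    score = scorer(prediction, reference)    # scorer only observes the output
    return {"prediction": prediction, "score": score}


def exact_match(prediction: str, reference: str) -> float:
    """Toy eval-only scorer (hypothetical)."""
    return 1.0 if prediction == reference else 0.0
```

Because `run_eval` never builds its own copy of the pipeline, any change to preprocessing or generation reaches both entry points at once.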

Rule 2: Explicit Parameters Only

Requirement: All ML behavior must be driven by config.

Required config sections:

- map_params
- reduce_params
- tokenization
- chunking
- preprocessing_profile

No silent defaults for:

- early_stopping
- min_new_tokens
- max_new_tokens
- no_repeat_ngram_size

Implementation:

- All parameters must appear in config and fingerprint
- If a value matters, it must be explicit
- Defaults are only allowed for non-behavioral settings (e.g., logging level)
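One way to enforce this rule is a check that runs before any model is loaded. The sketch below uses the section and parameter names listed above; the function name `find_implicit_settings` is hypothetical, and it assumes the four generation parameters live under both `map_params` and `reduce_params`, which the RFC does not state.

```python
from typing import Any, Dict, List

REQUIRED_SECTIONS = (
    "map_params",
    "reduce_params",
    "tokenization",
    "chunking",
    "preprocessing_profile",
)
NO_SILENT_DEFAULTS = (
    "early_stopping",
    "min_new_tokens",
    "max_new_tokens",
    "no_repeat_ngram_size",
)


def find_implicit_settings(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means everything is explicit."""
    problems: List[str] = []
    for section in REQUIRED_SECTIONS:
        if section not in config:
            problems.append(f"missing section: {section}")
    # Assumption: generation parameters are set per stage (map and reduce).
    for stage in ("map_params", "reduce_params"):
        params = config.get(stage)
        if not isinstance(params, dict):
            continue  # a missing stage is already reported above
        for key in NO_SILENT_DEFAULTS:
            if key not in params:
                problems.append(f"{stage}.{key} is not set explicitly")
    return problems
```

A run would refuse to start when the returned list is non-empty, rather than falling back to a library default.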

Rule 3: Preprocessing Is Part of the Model

Requirement: Preprocessing materially affects output quality and must be treated as part of the model contract.

Decisions:

- Preprocessing profile (e.g. cleaning_v4) is treated as part of the model contract
- Profile must be:
  - Explicit in config
  - Logged during execution
  - Stored in fingerprints

Implementation:

- Eval and app must use the same preprocessing profile unless intentionally testing differences
- Profile version is included in fingerprint
- Profile changes are tracked as model changes

Rule 4: Dynamic Safeguards Are Allowed (and Required)

Requirement: Dynamic controls that protect model behavior are not eval-only hacks. They are core runtime safety.

Examples:

- Capping max_new_tokens based on input size
- Forcing min_new_tokens=0 to prevent expansion
- Switching reduce strategy based on combined input size

Implementation:

- These safeguards:
  - Run in both eval and app
  - Are logged with clear reasoning
  - Appear in validation logs
- Safeguard logic is part of the provider, not the scorer
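A provider-side safeguard of this kind might look like the sketch below. The capping policy is an assumption made for illustration (output may not exceed the input token count, and `min_new_tokens` is forced to 0 whenever the cap fires); the RFC does not specify the actual thresholds. The function name and logger name are also hypothetical.

```python
import logging
from typing import Dict

logger = logging.getLogger("provider.safeguards")  # hypothetical logger name


def apply_safeguards(params: Dict[str, int], input_tokens: int) -> Dict[str, int]:
    """Return the effective generation params after runtime caps.

    The configured params are copied, never mutated, so config and
    effective values can both be recorded.
    """
    effective = dict(params)
    cap = max(1, input_tokens)  # assumed policy: never generate more than the input
    if effective["max_new_tokens"] > cap:
        logger.info(
            "capping max_new_tokens %d -> %d (input has %d tokens)",
            effective["max_new_tokens"], cap, input_tokens,
        )
        effective["max_new_tokens"] = cap
        if effective["min_new_tokens"] != 0:
            logger.info(
                "forcing min_new_tokens %d -> 0 to prevent expansion",
                effective["min_new_tokens"],
            )
            effective["min_new_tokens"] = 0
    return effective
```

Since the function lives with the provider, eval and app runs go through it identically, and the returned values are what a fingerprint would record as effective parameters.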

Rule 5: Scoring Never Mutates Behavior

Requirement: Scorers are read-only observers.

Implementation:

- Scorers must:
  - Never change generation parameters
  - Never filter or rewrite the predictions that a run emits
  - Never affect chunking or reduce decisions
- All filtering (e.g. NER scope filtering) happens inside the scorer only and affects only what is scored
- Scorers operate on predictions after they are generated
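As an illustration of a read-only scorer, the sketch below filters predictions into a new list and leaves the caller's list untouched. Several details are assumptions, not the project's scorer: entities are modelled as `(text, label)` pairs, "scope" is taken to mean the set of labels present in the gold data, and matching is by exact pair equality.

```python
from typing import Dict, List, Tuple

Entity = Tuple[str, str]  # (text, label); a simplified, hypothetical shape


def score_entities(predictions: List[Entity], gold: List[Entity]) -> Dict[str, float]:
    """Score predictions against gold without modifying either input."""
    scope = {label for _, label in gold}  # assumed definition of scope
    # Build a filtered view; the original predictions list is not changed.
    in_scope = [p for p in predictions if p[1] in scope]
    pred_set, gold_set = set(in_scope), set(gold)
    tp = len(pred_set & gold_set)
    fp = len(pred_set - gold_set)
    fn = len(gold_set - pred_set)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "ignored_out_of_scope": float(len(predictions) - len(in_scope)),
    }
```

The out-of-scope count is reported rather than silently dropped, so the metrics report still shows how much the scorer ignored.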

Summarization Alignment (Final)

What Is Locked In

MAP/REDUCE pipeline
Dynamic reduce capping logic
Expansion prevention via runtime caps
Cleaning profile included in fingerprint

What Is Explicitly Deferred

Eval-only "no expansion ever" policies
Special eval presets
Length-normalized eval scoring

Note: These may be added later without changing prod behavior.

NER Alignment (Final)

Key Decisions

Gold entities define scope, not positions: A position mismatch counts as a false positive (FP) by design
Scope-aware filtering: Ignores only out-of-scope predictions
Entity identity matters more than exact span offsets: Matches real-world knowledge graph (KG) usage

Alignment Rule

NER in eval and app:

- Uses identical spaCy pipeline
- Uses identical preprocessing input
- Differs only in scorer and reference loading

Artifacts Per Run (Required)

Requirement: Every run (eval or app) must produce:

predictions.jsonl
metrics.json
metrics_report.md
fingerprint.json
run.log

Implementation:

- If any are missing, the run is incomplete
- All artifacts must be generated by the same code path
- Artifacts must be validatable against schemas
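The completeness check can be expressed as a short helper. The file names come from the list above; the function name `missing_artifacts` and the assumption that all artifacts sit directly in one run directory are illustrative only. Schema validation is out of scope for this sketch.

```python
from pathlib import Path
from typing import List

REQUIRED_ARTIFACTS = (
    "predictions.jsonl",
    "metrics.json",
    "metrics_report.md",
    "fingerprint.json",
    "run.log",
)


def missing_artifacts(run_dir: Path) -> List[str]:
    """Return the required artifact names that are absent from run_dir."""
    return [name for name in REQUIRED_ARTIFACTS if not (run_dir / name).is_file()]
```

A run is treated as incomplete whenever this returns a non-empty list.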

Fingerprinting Contract

Requirement: Fingerprints must include:

Model IDs (raw HF IDs)
Transformers version
Preprocessing profile
Effective generation parameters (after caps)
Chunking strategy

Implementation:

- Fingerprints are the source of truth for explaining behavior
- Fingerprints must be deterministic and reproducible
- Fingerprint mismatches indicate behavior differences
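Determinism can be obtained by hashing a canonical serialization of the fingerprint. The sketch below is one possible approach, not the project's implementation: the function names are hypothetical, and any field names passed in (for example `preprocessing_profile` or `generation_params`) are illustrative rather than the real fingerprint.json schema.

```python
import hashlib
import json
from typing import Any, Dict, List


def fingerprint_digest(fingerprint: Dict[str, Any]) -> str:
    """Hash a fingerprint so that key order and whitespace do not matter."""
    canonical = json.dumps(fingerprint, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_fingerprints(a: Dict[str, Any], b: Dict[str, Any]) -> List[str]:
    """Return the top-level keys whose values differ between two fingerprints."""
    keys = sorted(set(a) | set(b))
    return [key for key in keys if a.get(key) != b.get(key)]
```

Comparing the digest of an eval run with that of an app run gives a quick equality check, and `diff_fingerprints` points at the field responsible when they disagree.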

Key Decisions

Single Code Path
Decision: Eval and app use identical execution path
Rationale: Ensures evaluation results are representative of production behavior
Explicit Parameters
Decision: All behavioral parameters must be explicit in config
Rationale: Prevents hidden drift and makes behavior explainable
Preprocessing as Model Contract
Decision: Preprocessing profile is part of the model fingerprint
Rationale: Preprocessing materially affects output quality
Dynamic Safeguards in Provider
Decision: Runtime safety logic runs in provider, not scorer
Rationale: Safeguards are part of production behavior, not eval-only
Scorers Are Read-Only
Decision: Scorers never mutate behavior
Rationale: Ensures scoring doesn't affect production-equivalent behavior
NER Scope-Based Evaluation
Decision: Gold entities define scope; position mismatches count as false positives (FPs)
Rationale: Matches real-world knowledge graph usage patterns

Shipping Readiness Checklist

You are ready to ship when:

✅ Eval and app use identical code path
✅ All parameters are explicit in config
✅ Preprocessing profile is in fingerprint
✅ Dynamic safeguards are logged
✅ Scorers are read-only
✅ All required artifacts are generated
✅ Fingerprints are complete and deterministic

Deferred Work (Next Chapter)

Explicitly out of scope for this release:

Eval presets (eval-fast, eval-strict)
Dashboarding
Benchmark dataset integration for NER
LLM-based NER fallback

Note: These can be layered later without breaking shipped behavior.

Benefits

Representative Evaluation: Eval results accurately reflect production behavior
No Hidden Drift: All differences between eval and app are explicit and tracked
Explainable Behavior: All behavior can be explained from logs and config
Stable Production: Production version ships without eval framework churn
Clear Path Forward: Deferred work is clearly documented and can be added later

Final Note

This alignment work ensures that:

What you evaluate is what you ship.
No hidden behavior. No silent drift. No eval illusions.

References

Related RFC: docs/rfc/RFC-015-ai-experiment-pipeline.md
Related RFC: docs/rfc/RFC-041-podcast-ml-benchmarking-framework.md
Related RFC: docs/rfc/RFC-045-ml-model-optimization-guide.md
Related RFC: docs/rfc/RFC-046-materialization-architecture.md
Source Code: src/podcast_scraper/evaluation/, src/podcast_scraper/providers/ml/