
RFC-004: Filesystem Layout & Run Management

Abstract

Define the rules governing output directories, filename sanitization, temporary storage, and run suffix semantics to guarantee deterministic, safe, and resumable filesystem interactions.

Problem Statement

Without consistent naming and directory policies, transcript archives become hard to diff, risk clobbering existing data, and may interact poorly with user environments. The system requires cross-platform-safe filenames, predictable run folders, and cleanup of temporary artifacts.

Constraints & Assumptions

  • Target environments include macOS, Linux, and Windows; filenames must avoid reserved characters.
  • Users may override output directories, but we encourage safe locations (home directory, platform data/cache roots).
  • The Whisper fallback stores intermediate media files that must be cleaned up to conserve disk space.

Design & Implementation

  1. Base output directory
     • Default: output/rss_<sanitized_host>_<hash>, where <hash> is the first 8 characters of the SHA-1 digest of the RSS URL.
     • filesystem.derive_output_dir handles overrides and invokes validation.
  2. Validation
     • filesystem.validate_and_normalize_output_dir resolves paths, verifies that they fall under safe roots (cwd, home, platform dirs), and warns otherwise.
  3. Run suffixes
     • filesystem.setup_output_directory derives an optional run_suffix from cfg.run_id or Whisper usage.
     • The effective output path is <output_dir>/run_<run_suffix> when a suffix is present.
  4. Filename sanitization
     • filesystem.sanitize_filename strips control characters, collapses whitespace, and replaces unsafe characters with _.
     • Episode filenames follow <idx:04d> - <title_safe>[ _<run_suffix>].<ext>.
  5. Whisper outputs
     • filesystem.build_whisper_output_name truncates titles (32 chars) and appends the run suffix when available.
     • Temporary media is stored in <effective_output_dir>/.tmp_media/ and removed after transcription.
  6. Run tracking files (Issue #379)
     • run.json - Top-level run summary combining the run manifest and pipeline metrics (created at pipeline end)
     • index.json - Episode index listing all processed episodes with status, paths, and error information (created at pipeline end)
     • run_manifest.json - Comprehensive run manifest capturing system state for reproducibility (created at pipeline start, saved at pipeline end)
     • metrics.json - Pipeline metrics including episode statuses, stage timings, and performance data (created at pipeline end)
     • All files are written to <effective_output_dir>/ and include schema versioning for forward compatibility.
  7. Cleanup semantics
     • --clean-output triggers deletion of the existing output directory unless the run is in dry-run mode.
     • Temp directories are removed on a best-effort basis even on errors (a warning is emitted if removal fails).
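
The cleanup semantics can be illustrated with a minimal Python sketch. The helper names clean_output and temp_media_dir, their signatures, and the use of the logging module are assumptions made for illustration; they are not taken from podcast_scraper/filesystem.py. Only the .tmp_media folder name, the dry-run guard, and the warn-on-failure behavior come from this RFC.

```python
import logging
import shutil
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator

logger = logging.getLogger(__name__)


def clean_output(output_dir: Path, dry_run: bool) -> bool:
    """Delete an existing output directory unless this is a dry run.

    Hypothetical helper. Returns True only when the directory was removed.
    """
    if not output_dir.exists():
        return False
    if dry_run:
        logger.info("Dry run: would remove %s", output_dir)
        return False
    shutil.rmtree(output_dir)
    return True


@contextmanager
def temp_media_dir(effective_output_dir: Path) -> Iterator[Path]:
    """Yield <effective_output_dir>/.tmp_media and remove it afterwards.

    Hypothetical helper. Removal is best-effort: it runs on success and on
    error, and a failure to remove is logged as a warning instead of raised.
    """
    tmp = effective_output_dir / ".tmp_media"
    tmp.mkdir(parents=True, exist_ok=True)
    try:
        yield tmp
    finally:
        try:
            shutil.rmtree(tmp)
        except OSError as exc:
            logger.warning("Could not remove %s: %s", tmp, exc)
```

Wrapping the transcription step in `with temp_media_dir(run_dir) as tmp:` keeps the removal in one place, so an exception raised during transcription still propagates to the caller after the temp directory has been cleaned up.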

Key Decisions

  • The hash-based directory suffix avoids collisions between feeds hosted on the same domain but at different paths.
  • Suffix semantics provide provenance (run ID, Whisper model) within output directories without complicating base naming.
  • Sanitization policy prioritizes readability while remaining filesystem-safe across OSes.
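
A rough Python sketch of these naming decisions follows. It is not the code in podcast_scraper/filesystem.py: the set of unsafe characters, the "unknown" host fallback, the helper names default_output_dir, effective_output_dir, and episode_filename, and the choice to append the optional suffix as _<run_suffix> are all assumptions. Windows-specific rules such as reserved device names and trailing dots are deliberately not handled here.

```python
import hashlib
import re
from typing import Optional
from urllib.parse import urlparse

# Assumption: an illustrative unsafe set (characters reserved on Windows, plus "/").
# The RFC does not enumerate the characters it replaces.
_UNSAFE = re.compile(r'[<>:"/\\|?*]')
_CONTROL = re.compile(r"[\x00-\x1f\x7f]")


def sanitize_filename(name: str) -> str:
    """Strip control characters, collapse whitespace, replace unsafe chars with '_'."""
    name = re.sub(r"\s", " ", name)  # tabs/newlines become spaces so words stay separated
    name = _CONTROL.sub("", name)    # drop any remaining control characters
    name = " ".join(name.split())    # collapse runs of spaces and trim both ends
    return _UNSAFE.sub("_", name)


def default_output_dir(rss_url: str) -> str:
    """Build output/rss_<sanitized_host>_<hash> from the feed URL."""
    host = urlparse(rss_url).hostname or "unknown"  # fallback value is an assumption
    digest = hashlib.sha1(rss_url.encode("utf-8")).hexdigest()[:8]
    return f"output/rss_{sanitize_filename(host)}_{digest}"


def effective_output_dir(output_dir: str, run_suffix: Optional[str] = None) -> str:
    """Append run_<run_suffix> only when a suffix is present."""
    return f"{output_dir}/run_{run_suffix}" if run_suffix else output_dir


def episode_filename(idx: int, title: str, ext: str,
                     run_suffix: Optional[str] = None) -> str:
    """Format <idx:04d> - <title_safe>, an optional _<run_suffix>, then .<ext>."""
    suffix = f"_{run_suffix}" if run_suffix else ""
    return f"{idx:04d} - {sanitize_filename(title)}{suffix}.{ext}"


# Two feeds on the same host but different paths get different directories,
# because the hash covers the full URL rather than only the host.
dir_a = default_output_dir("https://example.com/show-a/feed.xml")
dir_b = default_output_dir("https://example.com/show-b/feed.xml")
print(dir_a != dir_b)

print(episode_filename(7, "Intro: Part 1", "txt"))  # 0007 - Intro_ Part 1.txt
```

Because every function is a pure function of its inputs, re-running against the same feed URL and run suffix yields the same paths, which is what makes the archive layout deterministic and diffable.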

Alternatives Considered

  • Timestamped root directories: Rejected in favor of deterministic names; use --run-id auto when unique runs are desired.
  • Per-episode subdirectories: Rejected to keep archives flat and easy to diff.

Testing Strategy

  • Unit tests cover sanitization edge cases and output derivation logic.
  • Integration tests ensure --clean-output, --skip-existing, and Whisper workflows interact correctly with directory management.

Rollout & Monitoring

  • Warnings are emitted when users choose "unsafe" directories (outside the recommended roots), which makes such choices observable in logs.
  • Future enhancements (checksums, subdirectory partitioning) can extend this RFC while remaining backward compatible.
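
The unsafe-directory warning could look roughly like the Python sketch below. The function names, the extra_roots parameter (standing in for the platform data/cache roots), and the log message are assumptions for illustration; the actual check lives in filesystem.validate_and_normalize_output_dir and may differ.

```python
import logging
from pathlib import Path
from typing import Iterable

logger = logging.getLogger(__name__)


def is_under(path: Path, root: Path) -> bool:
    """Return True when path equals root or lies somewhere beneath it."""
    try:
        path.relative_to(root)
        return True
    except ValueError:
        return False


def validate_output_dir(raw: str, extra_roots: Iterable[Path] = ()) -> Path:
    """Resolve an output directory and warn when it is outside the safe roots.

    Hypothetical sketch: the safe roots are the current working directory,
    the home directory, and any caller-supplied platform directories.
    """
    resolved = Path(raw).expanduser().resolve()
    roots = [Path.cwd().resolve(), Path.home().resolve()]
    roots.extend(Path(r).resolve() for r in extra_roots)
    if not any(is_under(resolved, root) for root in roots):
        # Warn but do not fail: users may override the output directory.
        logger.warning("Output directory %s is outside the recommended roots", resolved)
    return resolved
```

Returning the resolved path in both cases matches the policy of encouraging safe locations without blocking overrides; the warning is the only signal that the chosen directory falls outside the recommended roots.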

References

  • Source: podcast_scraper/filesystem.py
  • Orchestrator usage: docs/rfc/RFC-001-workflow-orchestration.md
  • Whisper specifics: docs/rfc/RFC-005-whisper-integration.md