Data Models
Data models used throughout the podcast_scraper codebase.
Overview
The models module defines core data structures:
- RssFeed - Parsed RSS feed representation
- Episode - Individual podcast episode
- TranscriptionJob - Whisper transcription job
API Reference
RssFeed (dataclass)
RssFeed(title: str, items: List[Element], base_url: str, authors: List[str] = list())
Represents a parsed RSS feed with metadata and episode items.
This dataclass holds the parsed RSS feed information including the feed title, all episode items as XML elements, the base URL for resolving relative links, and a list of detected authors.
Attributes:
| Name | Type | Description |
|---|---|---|
| `title` | `str` | The podcast feed title (from the RSS `<title>` element). |
| `items` | `List[Element]` | List of XML elements representing individual episodes (`<item>` elements). |
| `base_url` | `str` | Base URL of the RSS feed, used for resolving relative URLs. |
| `authors` | `List[str]` | List of author names extracted from the feed metadata. |
Example
feed = RssFeed(
    title="My Podcast",
    items=[item1, item2],
    base_url="https://example.com/feed.xml",
    authors=["John Doe"],
)
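To make the role of `items` and `base_url` more concrete, here is a minimal sketch of collecting the feed title and `<item>` elements from an RSS document with the standard library. The `RssFeed` class below is a local stand-in that mirrors the documented signature so the snippet runs on its own, and `feed_from_xml` is a hypothetical helper, not part of podcast_scraper; the real parser may work differently.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class RssFeed:
    """Local stand-in mirroring the documented RssFeed signature."""
    title: str
    items: List[ET.Element]
    base_url: str
    authors: List[str] = field(default_factory=list)


RSS_XML = """<rss version="2.0">
  <channel>
    <title>My Podcast</title>
    <item><title>Episode 1</title></item>
    <item><title>Episode 2</title></item>
  </channel>
</rss>"""


def feed_from_xml(xml_text: str, base_url: str) -> RssFeed:
    """Hypothetical helper: gather the channel title and its <item> elements."""
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    if channel is None:
        raise ValueError("no <channel> element found")
    title = (channel.findtext("title") or "").strip()
    return RssFeed(title=title, items=channel.findall("item"), base_url=base_url)


feed = feed_from_xml(RSS_XML, "https://example.com/feed.xml")
print(feed.title, len(feed.items), feed.authors)
```

Because `authors` defaults to an empty list, it can be left out when no author metadata is available.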
Episode (dataclass)
Episode(idx: int, title: str, title_safe: str, item: Element, transcript_urls: List[Tuple[str, Optional[str]]], media_url: Optional[str] = None, media_type: Optional[str] = None)
Represents a podcast episode with metadata and content URLs.
This dataclass encapsulates all information about a single podcast episode, including its position in the feed, title information, transcript URLs, and media file details.
Attributes:
| Name | Type | Description |
|---|---|---|
| `idx` | `int` | Episode index in the feed (0-based, starting from most recent). |
| `title` | `str` | Original episode title from RSS feed. |
| `title_safe` | `str` | Filesystem-safe version of the title for use in filenames. |
| `item` | `Element` | Original XML element from the RSS feed containing all episode data. |
| `transcript_urls` | `List[Tuple[str, Optional[str]]]` | List of (url, mime_type) tuples for available transcripts. |
| `media_url` | `Optional[str]` | URL of the podcast media file (audio/video). None if not available. |
| `media_type` | `Optional[str]` | MIME type of the media file (e.g., "audio/mpeg"). None if not available. |
Example
episode = Episode(
    idx=0,
    title="Episode 1: Introduction",
    title_safe="episode-1-introduction",
    item=xml_element,
    transcript_urls=[("https://example.com/transcript.vtt", "text/vtt")],
    media_url="https://example.com/audio.mp3",
    media_type="audio/mpeg",
)
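The following sketch illustrates where values such as `title_safe`, `transcript_urls`, `media_url`, and `media_type` could come from in a single `<item>` element. It assumes the feed uses the RSS `<enclosure>` tag for media and the Podcasting 2.0 `<podcast:transcript>` tag for transcripts; `make_title_safe` and `transcript_urls` are hypothetical helpers written for illustration, and the library's own extraction and sanitization rules may differ.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Podcasting 2.0 namespace, used by <podcast:transcript> tags.
PODCAST_NS = "https://podcastindex.org/namespace/1.0"

ITEM_XML = f"""<item xmlns:podcast="{PODCAST_NS}">
  <title>Episode 1: Introduction</title>
  <enclosure url="https://example.com/audio.mp3" type="audio/mpeg" length="0"/>
  <podcast:transcript url="https://example.com/transcript.vtt" type="text/vtt"/>
</item>"""


def make_title_safe(title: str) -> str:
    """Hypothetical slug helper: lowercase, runs of other characters become hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def transcript_urls(item: ET.Element) -> List[Tuple[str, Optional[str]]]:
    """Collect (url, mime_type) pairs; mime_type is None when the attribute is absent."""
    tag = f"{{{PODCAST_NS}}}transcript"
    pairs: List[Tuple[str, Optional[str]]] = []
    for el in item.findall(tag):
        url = el.get("url")
        if url:
            pairs.append((url, el.get("type")))
    return pairs


item = ET.fromstring(ITEM_XML)
title = item.findtext("title") or ""

# Media details live on the <enclosure> element when one is present.
enclosure = item.find("enclosure")
media_url = enclosure.get("url") if enclosure is not None else None
media_type = enclosure.get("type") if enclosure is not None else None

print(make_title_safe(title))
print(transcript_urls(item))
print(media_url, media_type)
```

Keeping the MIME type optional in each tuple matches the documented `Optional[str]` second element, since a transcript tag may omit its `type` attribute.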
TranscriptionJob (dataclass)
TranscriptionJob(idx: int, ep_title: str, ep_title_safe: str, temp_media: str, detected_speaker_names: Optional[List[str]] = None, episode: Optional[Episode] = None, media_download_elapsed: Optional[float] = None)
Represents a media transcription job for Whisper.
This dataclass tracks information needed to transcribe a podcast episode's media file using Whisper. It includes episode metadata and paths to temporary media files, along with any detected speaker names for diarization.
Attributes:
| Name | Type | Description |
|---|---|---|
| `idx` | `int` | Episode index in the feed (0-based, starting from most recent). |
| `ep_title` | `str` | Original episode title from RSS feed. |
| `ep_title_safe` | `str` | Filesystem-safe version of the title for output filenames. |
| `temp_media` | `str` | Path to the temporary downloaded media file to transcribe. |
| `detected_speaker_names` | `Optional[List[str]]` | Optional list of speaker names detected from episode metadata or show notes. Used for screenplay formatting if available. |
| `episode` | `Optional[Episode]` | Optional reference to the source Episode (for metrics and stable IDs). |
Example
job = TranscriptionJob(
    idx=0,
    ep_title="Episode 1: Introduction",
    ep_title_safe="episode-1-introduction",
    temp_media="/tmp/episode-1.mp3",
    detected_speaker_names=["Alice", "Bob"],
)
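As a rough illustration of how a job relates to its source episode, the sketch below copies episode metadata into a `TranscriptionJob` and keeps a back-reference in `episode`. Both dataclasses are local stand-ins that mirror the documented signatures, and `job_for_episode` is a hypothetical helper. The meaning of `media_download_elapsed` is not documented here; the sketch assumes it holds the download duration in seconds.

```python
import os
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Episode:
    """Local stand-in mirroring the documented Episode signature."""
    idx: int
    title: str
    title_safe: str
    item: ET.Element
    transcript_urls: List[Tuple[str, Optional[str]]]
    media_url: Optional[str] = None
    media_type: Optional[str] = None


@dataclass
class TranscriptionJob:
    """Local stand-in mirroring the documented TranscriptionJob signature."""
    idx: int
    ep_title: str
    ep_title_safe: str
    temp_media: str
    detected_speaker_names: Optional[List[str]] = None
    episode: Optional[Episode] = None
    media_download_elapsed: Optional[float] = None


def job_for_episode(episode: Episode, temp_dir: str, elapsed: float) -> TranscriptionJob:
    """Hypothetical helper: build a job from an episode and a download directory."""
    # The ".mp3" suffix is an assumption for this example only.
    temp_media = os.path.join(temp_dir, f"{episode.title_safe}.mp3")
    return TranscriptionJob(
        idx=episode.idx,
        ep_title=episode.title,
        ep_title_safe=episode.title_safe,
        temp_media=temp_media,
        episode=episode,
        media_download_elapsed=elapsed,  # assumed to be seconds
    )


episode = Episode(
    idx=0,
    title="Episode 1: Introduction",
    title_safe="episode-1-introduction",
    item=ET.Element("item"),
    transcript_urls=[],
    media_url="https://example.com/audio.mp3",
    media_type="audio/mpeg",
)
job = job_for_episode(episode, temp_dir="/tmp", elapsed=1.5)
print(job.idx, job.ep_title_safe, os.path.basename(job.temp_media))
```

An empty `transcript_urls` list is used here on the assumption that transcription is only needed when no published transcript exists.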
Usage Examples
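Working with Episodes
import xml.etree.ElementTree as ET
from podcast_scraper.models import Episode
# Placeholder <item> element; in practice this comes from the parsed feed
item = ET.Element("item")
episode = Episode(
    idx=0,
    title="Example Episode",
    title_safe="example-episode",
    item=item,
    transcript_urls=[("https://example.com/transcript.txt", "text/plain")],
    media_url="https://example.com/audio.mp3",
    media_type="audio/mpeg",
)
print(f"Episode {episode.idx}: {episode.title}")
# Each transcript entry is a (url, mime_type) tuple
for url, mime_type in episode.transcript_urls:
    print(f"Transcript: {url} ({mime_type})")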
Working with Feeds
import xml.etree.ElementTree as ET
from podcast_scraper.models import RssFeed
# Placeholder <item> elements; in practice the RSS parser supplies these
item1 = ET.Element("item")
item2 = ET.Element("item")
# Create feed
feed = RssFeed(
    title="Example Podcast",
    items=[item1, item2],
    base_url="https://example.com/feed.xml",
    authors=["John Doe"],
)
print(f"Feed: {feed.title}")
print(f"Episodes: {len(feed.items)}")
See Also
- RSS Parser - How these models are populated
- Core API - How to run the pipeline using these models