Core API
This is the primary public API for podcast_scraper. Use these functions for programmatic access.
Quick Start
import podcast_scraper
# Create configuration
cfg = podcast_scraper.Config(
    rss="https://example.com/feed.xml",
    output_dir="./transcripts",
    max_episodes=10
)
# Run the pipeline
count, summary = podcast_scraper.run_pipeline(cfg)
print(f"Downloaded {count} transcripts: {summary}")
API Reference
run_pipeline
run_pipeline(cfg: Config) -> Tuple[int, str]
Execute the main podcast scraping pipeline.
This is the primary entry point for programmatic use of podcast_scraper. It orchestrates the complete workflow from RSS feed fetching to transcript generation and optional metadata/summarization.
The pipeline executes the following stages:
- Setup output directory (with optional run ID subdirectory)
- Fetch and parse RSS feed
- Detect speakers (if auto-detection enabled)
- Process episodes concurrently:
  - Download published transcripts
  - Or queue media for Whisper transcription
- Transcribe queued media files sequentially (if Whisper enabled)
- Generate metadata documents (if enabled)
- Generate episode summaries (if enabled)
- Clean up temporary files
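The split between concurrent episode processing and sequential transcription can be sketched as follows. This is only an illustration of the pattern described in the stage list, not the library's implementation: `Episode`, `process_episode`, and `run_stages` are hypothetical names, and the download and transcription steps are replaced by string stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Episode:
    """Hypothetical episode record; not the library's data model."""
    title: str
    transcript_url: Optional[str]  # None when the feed publishes no transcript
    media_url: str


def process_episode(ep: Episode) -> Tuple[str, str]:
    """Return (kind, payload): a 'downloaded' transcript or 'queued' media."""
    if ep.transcript_url is not None:
        # Stand-in for an HTTP download of the published transcript.
        return "downloaded", f"transcript for {ep.title}"
    # No published transcript: hand the media URL to the transcription queue.
    return "queued", ep.media_url


def run_stages(episodes: List[Episode], workers: int = 4) -> Tuple[int, List[str]]:
    # Process episodes concurrently; pool.map preserves input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process_episode, episodes))

    downloaded = sum(1 for kind, _ in results if kind == "downloaded")
    queue = [payload for kind, payload in results if kind == "queued"]

    # Drain the transcription queue one item at a time (sequential stage).
    transcribed = [f"transcribed {media}" for media in queue]
    return downloaded, transcribed
```

Threads fit the first stage because downloads are I/O-bound, while keeping transcription sequential avoids running several model instances at once.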
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cfg | Config | Configuration object with all pipeline settings. | required |
Returns:
| Type | Description |
|---|---|
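| Tuple[int, str] | A tuple (count, summary). In the examples on this page, count is reported as the number of transcripts and summary is a status string such as "Processed 10/50 episodes". |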
Raises:
| Type | Description |
|---|---|
| RuntimeError | If output directory cleanup fails when … |
| ValueError | If RSS URL is invalid or feed cannot be parsed |
| FileNotFoundError | If configuration file references missing files |
| OSError | If file system operations fail |
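A caller that wants to handle the exceptions listed above might wrap the call as in the sketch below. `run_guarded` is a hypothetical helper, not part of podcast_scraper; it takes the pipeline function as an argument so the pattern can be read and exercised without the package installed. The messages are illustrative only.

```python
from typing import Any, Callable, Optional, Tuple


def run_guarded(
    run: Callable[[Any], Tuple[int, str]], cfg: Any
) -> Tuple[Optional[int], str]:
    """Call run(cfg) and turn the documented exceptions into a message."""
    try:
        count, summary = run(cfg)
        return count, summary
    except FileNotFoundError as exc:
        # Must come before OSError: FileNotFoundError is a subclass of it.
        return None, f"missing file: {exc}"
    except ValueError as exc:
        return None, f"invalid RSS URL or unparseable feed: {exc}"
    except OSError as exc:
        return None, f"file system error: {exc}"
    except RuntimeError as exc:
        return None, f"runtime failure: {exc}"
```

With the package installed, this would be called as `run_guarded(podcast_scraper.run_pipeline, cfg)`.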
Example
from podcast_scraper import Config, run_pipeline
cfg = Config(
    rss="https://example.com/feed.xml",
    output_dir="./transcripts",
    max_episodes=10
)
count, summary = run_pipeline(cfg)
print(f"Downloaded {count} transcripts: {summary}")
# Output: Downloaded 10 transcripts: Processed 10/50 episodes
Example with Whisper transcription
cfg = Config(
    rss="https://example.com/feed.xml",
    transcribe_missing=True,
    whisper_model="base",
    screenplay=True,
    num_speakers=2
)
count, summary = run_pipeline(cfg)
Note
For non-interactive use (daemons, services), consider using the service.run()
function instead, which provides structured error handling and return values.
See Also
- Config: Configuration model with all available options
- service.run(): Service API with structured error handling
- load_config_file(): Load configuration from JSON/YAML file
Source code in src/podcast_scraper/workflow/orchestration.py
load_config_file
load_config_file(path: str) -> Dict[str, Any]
Load configuration from a JSON or YAML file.
This function reads a configuration file and returns a dictionary of configuration values.
The file format is auto-detected from the file extension (.json, .yaml, or .yml).
The returned dictionary can be unpacked into the Config constructor to create a
configuration object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to configuration file (JSON or YAML). Supports tilde expansion for home directory (e.g., "~/config.yaml"). | required |
Returns:
| Type | Description |
|---|---|
| Dict[str, Any] | Dictionary containing configuration values from the file. Keys correspond to … |
Raises:
| Type | Description |
|---|---|
| ValueError | If any of the following occur: … |
| OSError | If file cannot be read due to permissions or I/O errors |
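The behavior described for this function (tilde expansion, format detection from the extension, `ValueError` for bad input, `OSError` for read failures) can be approximated with the sketch below. It is an assumption-laden illustration, not the code in `config.py`: `load_config_sketch` is a hypothetical name, the specific `ValueError` conditions are guesses, and the YAML branch assumes the third-party PyYAML package is available.

```python
import json
import os
from pathlib import Path
from typing import Any, Dict

SUPPORTED_SUFFIXES = (".json", ".yaml", ".yml")


def load_config_sketch(path: str) -> Dict[str, Any]:
    """Illustrative loader; not the library's implementation."""
    resolved = Path(os.path.expanduser(path))  # "~/config.yaml" -> home dir
    suffix = resolved.suffix.lower()
    if suffix not in SUPPORTED_SUFFIXES:
        raise ValueError(f"unsupported config extension: {suffix!r}")

    # Read failures (missing file, permissions) surface as OSError subclasses.
    text = resolved.read_text(encoding="utf-8")

    if suffix == ".json":
        try:
            data = json.loads(text)
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid JSON in {resolved}: {exc}") from exc
    else:
        import yaml  # PyYAML, imported lazily so JSON-only use needs no extra dependency

        try:
            data = yaml.safe_load(text)
        except yaml.YAMLError as exc:
            raise ValueError(f"invalid YAML in {resolved}: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("top-level config value must be a mapping")
    return data
```

Checking the extension before touching the file means an unsupported format is reported even when the path does not exist.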
Example
from podcast_scraper import Config, load_config_file, run_pipeline
# Load from YAML file
config_dict = load_config_file("config.yaml")
cfg = Config(**config_dict)
count, summary = run_pipeline(cfg)
Example with JSON
config_dict = load_config_file("config.json")
cfg = Config(**config_dict)
Example with direct usage
from podcast_scraper import load_config_file, service
# Service API provides load_config_file convenience
result = service.run_from_config_file("config.yaml")
Supported Formats
JSON (.json):
{
"rss": "https://example.com/feed.xml",
"output_dir": "./transcripts",
"max_episodes": 50
}
YAML (.yaml, .yml):
rss: https://example.com/feed.xml
output_dir: ./transcripts
max_episodes: 50
Note
- Field aliases are supported (e.g., both "rss" and "rss_url" work)
- See the Config documentation for all available configuration options
- Configuration files should not contain sensitive data (API keys, passwords)
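To make the alias note concrete, the sketch below shows one way alias keys could be folded onto a single name before the dictionary is used. This is not how podcast_scraper resolves aliases. `ALIASES` and `normalize_keys` are hypothetical, only the "rss"/"rss_url" pair comes from this page, and the choice of "rss" as the target name is arbitrary.

```python
from typing import Any, Dict

# Hypothetical alias table: only the "rss"/"rss_url" pair is mentioned on
# this page, and which of the two is the canonical name is an assumption.
ALIASES = {"rss_url": "rss"}


def normalize_keys(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Map alias keys onto one name, rejecting conflicting duplicates."""
    out: Dict[str, Any] = {}
    for key, value in raw.items():
        canonical = ALIASES.get(key, key)
        if canonical in out and out[canonical] != value:
            raise ValueError(f"conflicting values for {canonical!r}")
        out[canonical] = value
    return out
```

Rejecting conflicting duplicates keeps a file that sets both spellings to different values from silently picking one.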
See Also
- Config: Configuration model and field documentation
- service.run_from_config_file(): Direct service API from config file
- Configuration examples: config/examples/config.example.json, config/examples/config.example.yaml
Source code in src/podcast_scraper/config.py
Package Information
Versioning
Related
- Configuration - Detailed configuration options
- Service API - Non-interactive service interface
- CLI Interface - Command-line interface