podcast_llm.extractors.base
Base content extractor interface for podcast generation.
This module provides the abstract base class that defines the interface for extracting content from different source types like PDFs, web articles, YouTube videos, etc. Concrete implementations handle the specifics of extracting from each source type.
The module defines:
- BaseSourceDocument abstract base class
- Common interface for content extraction
- Conversion to LangChain Document format
- Standard metadata fields
Example
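>>> from podcast_llm.extractors.base import BaseSourceDocument
>>> class PDFSourceDocument(BaseSourceDocument):
...     def extract(self) -> str:
...         # PDF-specific extraction logic goes here; a placeholder stands in for it
...         extracted_text = "text extracted from the PDF"
...         return extracted_text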
The base class ensures consistent handling of different source types while allowing specialized extraction logic in the concrete implementations. This enables modular addition of new source types while maintaining a uniform interface.
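The uniform interface described above can be illustrated with a minimal, self-contained sketch. It does not import podcast_llm and is not the library's implementation: `SketchSourceDocument` and `InMemoryTextSource` are hypothetical stand-ins, and the constructor signature, the default attribute values, and the choice to store the result on `content` inside `extract` are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Optional


class SketchSourceDocument(ABC):
    """Hypothetical stand-in for BaseSourceDocument (not the library's code)."""

    def __init__(self, src: str) -> None:
        # Attribute names follow the documented fields; defaults are assumed.
        self.src = src
        self.src_type = "Generic"
        self.title = src
        self.content: Optional[str] = None

    @abstractmethod
    def extract(self) -> str:
        """Return the extracted text for this source."""


class InMemoryTextSource(SketchSourceDocument):
    """Hypothetical source type that 'extracts' from a string held in memory."""

    def __init__(self, src: str, text: str) -> None:
        super().__init__(src)
        self.src_type = "In-memory text"
        self._text = text

    def extract(self) -> str:
        # Source-specific logic: collapse runs of whitespace into single spaces.
        self.content = " ".join(self._text.split())
        return self.content


source = InMemoryTextSource("notes.txt", "  Hello,\n  podcast   world.  ")
text = source.extract()
```

Because `extract` is abstract, the stand-in base class cannot be instantiated directly; only subclasses that supply their own extraction logic can. Code that consumes sources then depends only on `extract` and the shared attributes, which is what allows new source types to be added without touching the callers.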
- class podcast_llm.extractors.base.BaseSourceDocument
- Bases: ABC
- Abstract base class for source document content extractors.
- This class defines the interface for extracting content from different source types like PDFs, web articles, YouTube videos, etc. Concrete implementations handle the specifics of extracting from each source type.
- Example
  >>> class PDFSourceDocument(BaseSourceDocument):
  ...     def extract(self) -> str:
  ...         # PDF-specific extraction logic
  ...         return extracted_text
- src
  - Path or URL to the source media
  - Type: str
- src_type
  - Type of source (e.g. ‘PDF File’, ‘Website’, ‘YouTube video’)
  - Type: str
- title
  - Title describing the source
  - Type: str
- content
  - The extracted content text
  - Type: Optional[str]
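
The four attributes above are the inputs to the conversion into LangChain Document format mentioned in the module description. The sketch below shows one way such a mapping might look. It is an assumption-laden illustration rather than the library's method: `SketchDocument` is a hypothetical stand-in that mirrors only the `page_content` and `metadata` fields of a LangChain Document, `to_sketch_document` is a hypothetical helper, the metadata key names are invented, and treating `content is None` as "extraction has not run yet" is an assumed reading of the Optional[str] type.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SketchDocument:
    """Stand-in with the same two fields as a LangChain Document."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def to_sketch_document(
    src: str, src_type: str, title: str, content: Optional[str]
) -> SketchDocument:
    """Bundle the four documented attributes into one document object."""
    if content is None:
        # Assumption: None means no text has been extracted yet.
        raise ValueError("content is None; run extraction first")
    return SketchDocument(
        page_content=content,
        # Metadata key names are illustrative, not taken from the library.
        metadata={"source": src, "source_type": src_type, "title": title},
    )


doc = to_sketch_document("paper.pdf", "PDF File", "Example paper", "Extracted text.")
```

Keeping the extracted text in `page_content` and the descriptive fields in `metadata` means downstream components can read any source the same way, regardless of which extractor produced it.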