podcast_llm.extractors.base
Base content extractor interface for podcast generation.
This module provides the abstract base class that defines the interface for extracting content from different source types, such as PDFs, web articles, and YouTube videos. Concrete implementations handle the specifics of extracting from each source type.
The module defines:
- BaseSourceDocument abstract base class
- Common interface for content extraction
- Conversion to LangChain Document format
- Standard metadata fields
Example
>>> class PDFSourceDocument(BaseSourceDocument):
...     def extract(self) -> str:
...         # PDF-specific extraction logic
...         return extracted_text
The base class ensures consistent handling of different source types while allowing specialized extraction logic in the concrete implementations. This enables modular addition of new source types while maintaining a uniform interface.
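The pattern described above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: `SketchSourceDocument` and `InlineTextSourceDocument` are hypothetical names, and the constructor signature is an assumption based only on the attributes documented for the class (src, src_type, title, content).

```python
from abc import ABC, abstractmethod
from typing import Optional


class SketchSourceDocument(ABC):
    """Hypothetical stand-in for BaseSourceDocument (not the library's code)."""

    def __init__(self, src: str, src_type: str, title: str) -> None:
        self.src = src
        self.src_type = src_type
        self.title = title
        # No content until extract() has been called.
        self.content: Optional[str] = None

    @abstractmethod
    def extract(self) -> str:
        """Return the extracted text for this source."""


class InlineTextSourceDocument(SketchSourceDocument):
    """Hypothetical subclass whose 'source' is a literal string."""

    def __init__(self, text: str, title: str) -> None:
        super().__init__(src="<inline>", src_type="Inline text", title=title)
        self._text = text

    def extract(self) -> str:
        # Collapse runs of whitespace and cache the result on the instance.
        self.content = " ".join(self._text.split())
        return self.content


doc = InlineTextSourceDocument("  Hello,\n  podcast   world. ", title="Demo")
extracted = doc.extract()
```

Because `extract` is declared abstract, instantiating the base class directly raises `TypeError`, so every new source type has to supply its own extraction logic while callers rely on the same method name.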
- class podcast_llm.extractors.base.BaseSourceDocument
Bases: ABC
Abstract base class for source document content extractors.
This class defines the interface for extracting content from different source types, such as PDFs, web articles, and YouTube videos. Concrete implementations handle the specifics of extracting from each source type.
Example
>>> class PDFSourceDocument(BaseSourceDocument):
...     def extract(self) -> str:
...         # PDF-specific extraction logic
...         return extracted_text
- src
Path or URL to the source media
- Type: str
- src_type
Type of source (e.g. ‘PDF File’, ‘Website’, ‘YouTube video’)
- Type: str
- title
Title describing the source
- Type: str
- content
The extracted content text
- Type: Optional[str]
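As a rough sketch of how these four attributes could map onto a LangChain-style document: the `DocumentLike` dataclass below is a stand-in that mirrors the `page_content` and `metadata` fields of LangChain's `Document`, so the example runs without LangChain installed. The helper name `to_document_like` and the metadata keys (`source`, `source_type`, `title`) are assumptions for illustration; the keys the library actually uses are not stated here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DocumentLike:
    """Stand-in with the same two fields as LangChain's Document."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def to_document_like(
    src: str, src_type: str, title: str, content: Optional[str]
) -> DocumentLike:
    """Hypothetical helper: pack the documented attributes into one object."""
    if content is None:
        # content is Optional[str]; it stays None until extraction has run.
        raise ValueError("content is None; run extraction first")
    return DocumentLike(
        page_content=content,
        # Assumed metadata keys, chosen for illustration only.
        metadata={"source": src, "source_type": src_type, "title": title},
    )


converted = to_document_like(
    src="notes.pdf", src_type="PDF File", title="Notes", content="Some text"
)
```

Checking for `None` before converting reflects the `Optional[str]` type of `content`: a source that has not been extracted yet has no text to hand on.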