podcast_llm.extractors.base

Base content extractor interface for podcast generation.

This module provides the abstract base class that defines the interface for extracting content from different source types, such as PDFs, web articles, and YouTube videos. Concrete implementations handle the specifics of extracting from each source type.

The module defines:

- BaseSourceDocument abstract base class
- Common interface for content extraction
- Conversion to LangChain Document format
- Standard metadata fields

Example

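>>> from podcast_llm.extractors.base import BaseSourceDocument
>>> class PDFSourceDocument(BaseSourceDocument):
...     def extract(self) -> str:
...         # PDF-specific extraction logic goes here; empty placeholder for now
...         extracted_text = ''
...         return extracted_text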

The base class ensures consistent handling of different source types while allowing specialized extraction logic in the concrete implementations. This enables modular addition of new source types while maintaining a uniform interface.
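As a rough illustration of adding a new source type, the sketch below defines a plain-text extractor. It is self-contained, so `BaseSourceDocument` here is a local stand-in: its constructor (taking a single `source` string and storing it in `src`) and its default attribute values are assumptions, since this reference does not show the real constructor. `TextFileSourceDocument` is a hypothetical class and not part of podcast_llm.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Optional


class BaseSourceDocument(ABC):
    """Local stand-in for the real base class (constructor is an assumption)."""

    def __init__(self, source: str) -> None:
        self.src: str = source
        self.src_type: str = "Base Document"
        self.title: str = "Untitled"
        self.content: Optional[str] = None

    @abstractmethod
    def extract(self) -> str:
        """Return the extracted text of the source."""


class TextFileSourceDocument(BaseSourceDocument):
    """Hypothetical extractor for plain-text files."""

    def __init__(self, source: str) -> None:
        super().__init__(source)
        self.src_type = "Text File"
        self.title = Path(source).name

    def extract(self) -> str:
        # Read the file and keep the text on the instance for later use.
        self.content = Path(self.src).read_text(encoding="utf-8")
        return self.content
```

Only `extract()` and the descriptive attributes differ per source type in this sketch; everything else would come from the shared base class.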

class podcast_llm.extractors.base.BaseSourceDocument

Bases: ABC

Abstract base class for source document content extractors.

This class defines the interface for extracting content from different source types, such as PDFs, web articles, and YouTube videos. Concrete implementations handle the specifics of extracting from each source type.

Example

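>>> from podcast_llm.extractors.base import BaseSourceDocument
>>> class PDFSourceDocument(BaseSourceDocument):
...     def extract(self) -> str:
...         # PDF-specific extraction logic goes here; empty placeholder for now
...         extracted_text = ''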
...         return extracted_text

src

Path or URL to the source media

Type:

str

src_type

Type of source (e.g. ‘PDF File’, ‘Website’, ‘YouTube video’)

Type:

str

title

Title describing the source

Type:

str

content

The extracted content text

Type:

Optional[str]

as_langchain_document() → Document

Convert the source document to a LangChain Document format.

Returns:

A LangChain Document containing the content and metadata

Return type:

Document
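The conversion can be pictured roughly as follows. This is a sketch under stated assumptions, not the library's implementation: `Document` is a local dataclass standing in for LangChain's `Document` (which carries `page_content` and `metadata`), `to_document` is a hypothetical helper, and the metadata keys and the guard against missing content are illustrative choices, since this reference does not list them.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Document:
    """Local stand-in mirroring the two main fields of LangChain's Document."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def to_document(src: str, src_type: str, title: str,
                content: Optional[str]) -> Document:
    """Hypothetical helper: pack extracted text and metadata into a Document."""
    if content is None:
        # Nothing has been extracted yet, so there is no text to wrap.
        raise ValueError("content is empty; call extract() first")
    return Document(
        page_content=content,
        # Key names are assumptions chosen for this example.
        metadata={"source": src, "source_type": src_type, "title": title},
    )


doc = to_document("paper.pdf", "PDF File", "Example Paper", "Extracted text")
print(doc.metadata["title"])  # prints: Example Paper
```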

abstract extract() → str

Extract content from the source.

This method takes no arguments. The location of the source media is given by the src attribute (a path or URL).

Returns:

The extracted content as a string

Return type:

str
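Because `extract()` is abstract, Python will not instantiate a class that leaves it unimplemented. The sketch below demonstrates this with a local stand-in for the base class; the constructor signature, the `EchoSourceDocument` subclass, and the `can_instantiate` helper are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Optional


class BaseSourceDocument(ABC):
    """Local stand-in base class; the constructor shape is an assumption."""

    def __init__(self, source: str) -> None:
        self.src = source
        self.content: Optional[str] = None

    @abstractmethod
    def extract(self) -> str:
        ...


class EchoSourceDocument(BaseSourceDocument):
    """Hypothetical subclass whose 'extraction' just describes its source."""

    def extract(self) -> str:
        self.content = f"content of {self.src}"
        return self.content


def can_instantiate(cls) -> bool:
    """Return True if cls can be constructed, False if it is still abstract."""
    try:
        cls("example.txt")
    except TypeError:
        # ABCs raise TypeError when abstract methods remain unimplemented.
        return False
    return True
```

With these definitions, `can_instantiate(BaseSourceDocument)` is false and `can_instantiate(EchoSourceDocument)` is true, which is the guarantee the abstract method provides to callers.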