podcast_llm.extractors.web

Web article content extractor for podcast generation.

This module extracts article content from web URLs using the newspaper3k library. It downloads and parses web pages to extract the main article text while filtering out navigation, ads, and other non-content elements.

Example

>>> from podcast_llm.extractors.web import WebSourceDocument
>>> extractor = WebSourceDocument('https://example.com/article')
>>> content = extractor.extract()
>>> print(content)
The main article text content...

The module supports:

- Downloading and parsing web article content
- Intelligent extraction of main article text
- Filtering out non-content elements like navigation and ads
- Error handling for failed downloads or parsing

The extracted article text is returned as plain text and can be used as source material for podcast episode generation. The module handles errors gracefully if articles fail to download or parse properly.
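The exact failure behavior is not specified here, so callers may want to guard the call themselves. The following is a minimal sketch, assuming that a failed download or parse either raises an exception or yields empty text. `safe_extract` is a hypothetical helper, not part of podcast_llm.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def safe_extract(extractor) -> Optional[str]:
    """Call extractor.extract() and return None instead of raising.

    Hypothetical helper: works with any object exposing an extract()
    method that returns a string, such as WebSourceDocument.
    """
    try:
        text = extractor.extract()
    except Exception as exc:  # assumption: failures surface as exceptions
        logger.warning("Extraction failed: %s", exc)
        return None
    # Treat empty or whitespace-only text as a failed extraction too.
    return text if text and text.strip() else None
```

A caller can then skip a source when `safe_extract(...)` returns `None` rather than aborting the whole podcast generation run.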

class podcast_llm.extractors.web.WebSourceDocument(source: str)

Bases: BaseSourceDocument

Extracts text content from web articles using the newspaper3k library.

This class handles extracting article content from web URLs by downloading and parsing the page HTML. It uses newspaper3k to intelligently identify and extract the main article text while filtering out navigation, ads, and other non-content elements.
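The download-and-parse flow described above can be sketched with newspaper3k's `Article` class (`download()`, `parse()`, then the `title` and `text` attributes). This is an illustration under assumptions, not the class's actual implementation: `fetch_article` and its `article_factory` parameter are hypothetical names, and the factory hook exists only so the flow can be exercised without network access.

```python
from typing import Callable, Optional, Tuple


def fetch_article(
    url: str, article_factory: Optional[Callable] = None
) -> Tuple[str, str]:
    """Return (title, text) for the article at url."""
    if article_factory is None:
        # Imported lazily so the sketch can be loaded without newspaper3k.
        from newspaper import Article

        article_factory = Article

    article = article_factory(url)
    article.download()  # fetch the raw HTML
    article.parse()     # extract the title and main body text
    return article.title, article.text
```

With newspaper3k installed, `fetch_article('https://example.com/article')` would use the real `Article` class; passing a fake factory keeps tests offline.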

Example

>>> extractor = WebSourceDocument('https://example.com/article')
>>> content = extractor.extract()
>>> print(content)
The main article text content...

src

The web article URL

Type:

str

src_type

Always ‘Website’

Type:

str

title

The extracted article title

Type:

str

content

The extracted article text

Type:

Optional[str]

extract() -> str

Extract content from the source.

Parameters:

source – Path or URL to the source media (supplied to the constructor; extract() itself takes no arguments)

Returns:

The extracted content as a string
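When testing code that consumes an extractor, a stand-in with the same documented attributes can avoid network calls. The class below is a hypothetical test double, not part of podcast_llm. It assumes only the attributes and the `extract()` signature listed above, and that `content` stays unset until `extract()` runs. The `canned_text` field is an illustrative addition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeWebSourceDocument:
    """Hypothetical test double mirroring the documented attributes."""

    src: str
    canned_text: str = "The main article text content..."  # illustrative payload
    src_type: str = "Website"
    title: str = ""
    content: Optional[str] = None

    def extract(self) -> str:
        # Assumption: extract() both stores the text on `content`
        # and returns it.
        self.content = self.canned_text
        return self.content
```

Usage follows the same pattern as the real class: construct with a URL, call `extract()`, then read `content`.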