podcast_llm.extractors.web
Web article content extractor for podcast generation.
This module provides functionality to extract article content from web URLs using the newspaper3k library. It handles downloading and parsing web pages to extract the main article text while filtering out navigation, ads, and other non-content.
Example
>>> from podcast_llm.extractors.web import WebSourceDocument
>>> extractor = WebSourceDocument('https://example.com/article')
>>> content = extractor.extract()
>>> print(content)
'The main article text content...'
The module supports: - Downloading and parsing web article content - Intelligent extraction of main article text - Filtering out non-content elements like navigation and ads - Error handling for failed downloads or parsing
The extracted article text is returned as plain text and can be used as source material for podcast episode generation. The module handles errors gracefully if articles fail to download or parse properly.
- class podcast_llm.extractors.web.WebSourceDocument(source: str)[source]
Bases:
BaseSourceDocument
Extracts text content from web articles using the newspaper3k library.
This class handles extracting article content from web URLs by downloading and parsing the page HTML. It uses newspaper3k to intelligently identify and extract the main article text while filtering out navigation, ads, and other non-content elements.
Example
>>> extractor = WebSourceDocument('https://example.com/article') >>> content = extractor.extract() >>> print(content) 'The main article text content...'
- src
The web article URL
- Type:
str
- src_type
Always ‘Website’
- Type:
str
- title
The extracted article title
- Type:
str
- content
The extracted article text
- Type:
Optional[str]