podcast_llm.extractors.utils
Utility functions for extracting content from various source types.
This module provides helper functions for extracting text content from different source types such as YouTube videos, web pages, PDFs, and audio files. It detects each source's type and dispatches to the appropriate extractor.
Example
>>> from podcast_llm.extractors.utils import extract_content_from_sources
>>> sources = ['https://youtube.com/watch?v=123', 'article.pdf']
>>> content = extract_content_from_sources(sources)
>>> print(len(content))
2
The module supports:

- Automatic source type detection based on URL/file extension
- Extraction from YouTube videos, web pages, PDFs, and audio files
- Error handling for failed extractions
- Converting extracted content to LangChain document format
The extracted content is returned as a list of LangChain documents that can be used for further processing. Failed extractions are logged but do not halt processing of remaining sources.
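As an illustration of the detection and error-handling behavior described above, the sketch below shows how dispatch by URL or file extension and fault-tolerant iteration might look. The helper names (detect_source_type, extract_all) and the extension lists are assumptions for illustration only, not part of the module's actual API.

import logging
from pathlib import Path
from urllib.parse import urlparse

logger = logging.getLogger(__name__)

def detect_source_type(source: str) -> str:
    # Hypothetical classifier: URLs are treated as 'youtube' or 'web';
    # local paths are classified by file extension.
    parsed = urlparse(source)
    if parsed.scheme in ('http', 'https'):
        if 'youtube.com' in parsed.netloc or 'youtu.be' in parsed.netloc:
            return 'youtube'
        return 'web'
    suffix = Path(source).suffix.lower()
    if suffix == '.pdf':
        return 'pdf'
    if suffix in ('.mp3', '.wav', '.m4a'):
        return 'audio'
    if suffix in ('.doc', '.docx'):
        return 'word'
    return 'text'

def extract_all(sources, extractors):
    # Failed extractions are logged and skipped so that the remaining
    # sources are still processed.
    documents = []
    for source in sources:
        try:
            extractor = extractors[detect_source_type(source)]
            documents.extend(extractor(source))
        except Exception:
            logger.exception("Extraction failed for %s; skipping.", source)
    return documents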
- podcast_llm.extractors.utils.extract_content_from_sources(sources: List) → List
Extract content from a list of source URLs/files.
Takes a list of source URLs or file paths and extracts text content from each using the appropriate extractor based on source type. Supports YouTube videos, web pages, PDFs, audio files, Word documents, and plain text files.
- Parameters:
sources (List) – List of source URLs or file paths to extract content from
- Returns:
List of extracted content as LangChain documents
- Return type:
List
Example
>>> sources = ['document.docx', 'article.pdf']
>>> content = extract_content_from_sources(sources)
>>> print(len(content))
2
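Because the returned items follow the LangChain document format, downstream code can read their text directly; a small usage sketch, assuming the standard langchain Document attributes (page_content, metadata):

>>> for doc in content:
...     print(doc.page_content[:80])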