podcast_llm.extractors.youtube
YouTube content extractor for podcast generation.
This module provides functionality to extract transcript content from YouTube videos using the YouTubeTranscriptApi. It handles parsing various YouTube URL formats and retrieving closed captions/subtitles.
Example
>>> from podcast_llm.extractors.youtube import YouTubeSourceDocument
>>> extractor = YouTubeSourceDocument('https://www.youtube.com/watch?v=dQw4w9WgXcQ')
>>> content = extractor.extract()
>>> print(content)
We're no strangers to love You know the rules and so do I...
The module supports:
- Standard youtube.com URLs (https://www.youtube.com/watch?v=VIDEO_ID)
- Short youtu.be URLs (https://youtu.be/VIDEO_ID)
- Embedded URLs (https://www.youtube.com/embed/VIDEO_ID)
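As a rough illustration of how the three URL formats listed above can be reduced to a video ID, here is a minimal sketch using only the standard library. The function name `parse_video_id` is hypothetical and is not taken from this module; the module's own parsing logic may differ.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def parse_video_id(url: str) -> Optional[str]:
    """Return the video ID from a YouTube URL, or None if it cannot be parsed.

    Hypothetical helper for illustration; not part of podcast_llm.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    if host == "youtu.be":
        # Short form: the ID is the first path segment.
        video_id = parsed.path.lstrip("/").split("/")[0]
        return video_id or None

    if host == "youtube.com":
        if parsed.path == "/watch":
            # Standard form: the ID is the "v" query parameter.
            values = parse_qs(parsed.query).get("v")
            return values[0] if values else None
        if parsed.path.startswith("/embed/"):
            # Embedded form: the ID follows "/embed/".
            video_id = parsed.path[len("/embed/"):].split("/")[0]
            return video_id or None

    # Unrecognized host or path: signal failure instead of raising.
    return None
```

Returning None for unrecognized input, rather than raising, is one possible design choice; it lets the caller decide how to report an unparseable URL.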
The extracted transcripts are returned as plain text and can be used as source material for podcast episode generation. The module handles errors gracefully if transcripts are unavailable or the video ID cannot be parsed.
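The sketch below shows one way caption segments could be flattened into plain text, with failures reported as None rather than raised. It assumes segments arrive as a list of dicts with a "text" key (the shape older releases of youtube-transcript-api return). Both `transcript_to_text` and `safe_extract` are hypothetical names, and `fetch` is a stand-in for whatever call actually retrieves the transcript; none of these are confirmed to match this module's implementation.

```python
from typing import Callable, Dict, List, Optional

Segment = Dict[str, object]


def transcript_to_text(segments: List[Segment]) -> str:
    """Join caption segments into one plain-text string.

    Assumes each segment is a dict with a "text" key (an assumption).
    """
    parts = []
    for segment in segments:
        # Collapse internal newlines and repeated spaces within a caption.
        text = " ".join(str(segment.get("text", "")).split())
        if text:
            parts.append(text)
    return " ".join(parts)


def safe_extract(
    video_id: str,
    fetch: Callable[[str], List[Segment]],
) -> Optional[str]:
    """Fetch and flatten a transcript, returning None on any failure.

    `fetch` is a hypothetical callable standing in for the real API call.
    """
    try:
        return transcript_to_text(fetch(video_id))
    except Exception:
        # Transcript unavailable, captions disabled, network error, etc.
        return None
```

Passing the fetch function in as a parameter keeps the sketch independent of any particular library version and makes it easy to exercise without network access.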
- class podcast_llm.extractors.youtube.YouTubeSourceDocument(source: str)
Bases: BaseSourceDocument
Extracts transcript content from YouTube videos using YouTubeTranscriptApi.
This class handles extracting closed caption/subtitle content from YouTube videos by parsing various URL formats to get the video ID and retrieving the transcript. Supports standard youtube.com URLs, youtu.be short URLs, and embedded URLs.
Example
>>> extractor = YouTubeSourceDocument('https://www.youtube.com/watch?v=dQw4w9WgXcQ')
>>> content = extractor.extract()
>>> print(content)
We're no strangers to love You know the rules and so do I...
- src
The YouTube video URL or ID
- Type:
str
- src_type
Always ‘YouTube video’
- Type:
str
- title
A descriptive title combining src_type and source
- Type:
str
- content
The extracted transcript text
- Type:
Optional[str]
- video_id
The parsed YouTube video ID
- Type:
str