podcast_llm.extractors.youtube

YouTube content extractor for podcast generation.

This module provides functionality to extract transcript content from YouTube videos using the YouTubeTranscriptApi. It handles parsing various YouTube URL formats and retrieving closed captions/subtitles.

Example

>>> from podcast_llm.extractors.youtube import YouTubeSourceDocument
>>> extractor = YouTubeSourceDocument('https://www.youtube.com/watch?v=dQw4w9WgXcQ')
>>> content = extractor.extract()
>>> print(content)
We're no strangers to love You know the rules and so do I...

The module supports:

- Standard youtube.com URLs (https://www.youtube.com/watch?v=VIDEO_ID)
- Short youtu.be URLs (https://youtu.be/VIDEO_ID)
- Embedded URLs (https://www.youtube.com/embed/VIDEO_ID)
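The three URL shapes listed above can be told apart with the standard library alone. The following is a minimal sketch of that idea, not the module's actual parsing code: `parse_video_id` is a hypothetical helper name, and returning `None` for unrecognized input is an assumption made for illustration.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def parse_video_id(url: str) -> Optional[str]:
    """Hypothetical helper: return the video ID from a YouTube URL, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    if host == "youtu.be":
        # Short form: the ID is the first path segment.
        candidate = parsed.path.lstrip("/").split("/")[0]
        return candidate or None

    if host == "youtube.com":
        if parsed.path == "/watch":
            # Standard form: the ID is the `v` query parameter.
            values = parse_qs(parsed.query).get("v")
            return values[0] if values else None
        if parsed.path.startswith("/embed/"):
            # Embedded form: the ID follows the /embed/ prefix.
            candidate = parsed.path[len("/embed/"):].split("/")[0]
            return candidate or None

    # Anything else is treated as unparseable in this sketch.
    return None
```

Returning `None` keeps the sketch simple; a real extractor might raise or log instead, and the source does not say which approach the module takes.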

The extracted transcripts are returned as plain text and can be used as source material for podcast episode generation. The module handles errors gracefully if transcripts are unavailable or the video ID cannot be parsed.
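To illustrate what "returned as plain text" can mean in practice, here is a hedged sketch of flattening caption segments into one string. It assumes each segment is a dictionary with a `text` key, which is the shape that some versions of the youtube-transcript-api package return; `segments_to_text` is a hypothetical name, the timing values are made up, and the module's own joining logic may differ.

```python
from typing import Any, Dict, List


def segments_to_text(segments: List[Dict[str, Any]]) -> str:
    """Hypothetical helper: join caption segments into one plain-text string."""
    parts = []
    for segment in segments:
        # Collapse internal newlines and repeated spaces within a segment.
        text = " ".join(str(segment.get("text", "")).split())
        if text:
            parts.append(text)
    return " ".join(parts)


# Illustrative data only; the timings are not taken from a real transcript.
example_segments = [
    {"text": "We're no strangers to love", "start": 0.0, "duration": 1.0},
    {"text": "You know the rules\nand so do I", "start": 1.0, "duration": 1.0},
]
plain_text = segments_to_text(example_segments)
```

Empty segments are skipped so that missing captions do not introduce doubled spaces in the result.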

class podcast_llm.extractors.youtube.YouTubeSourceDocument(source: str)

Bases: BaseSourceDocument

Extracts transcript content from YouTube videos using YouTubeTranscriptApi.

This class extracts closed caption/subtitle content from YouTube videos by parsing the URL to obtain the video ID and then retrieving the transcript. It supports standard youtube.com URLs, youtu.be short URLs, and embedded URLs.

Example

>>> extractor = YouTubeSourceDocument('https://www.youtube.com/watch?v=dQw4w9WgXcQ')
>>> content = extractor.extract()
>>> print(content)
We're no strangers to love You know the rules and so do I...

src

The YouTube video URL or ID

Type: str

src_type

Always ‘YouTube video’

Type: str

title

A descriptive title combining src_type and source

Type: str

content

The extracted transcript text

Type: Optional[str]

video_id

The parsed YouTube video ID

Type: str

extract() → str

Extract content from the source.

Parameters: source – Path or URL to the source media

Returns: The extracted content as a string