podcast_llm.research
Research module for podcast generation.
This module provides functionality to gather background research and information for podcast episode generation. It handles retrieving content from various sources like Wikipedia and search engines.
Example
>>> from podcast_llm.research import suggest_wikipedia_articles
>>> from podcast_llm.models import WikipediaPages
>>> config = PodcastConfig()
>>> articles: WikipediaPages = suggest_wikipedia_articles(config, "Artificial Intelligence")
>>> print(articles.pages[0].name)
Artificial intelligence
The research process includes:
- Suggesting relevant Wikipedia articles via LangChain and GPT-4
- Downloading Wikipedia article content
- Performing targeted web searches with Tavily
- Extracting key information from web articles
- Organizing research into structured formats using Pydantic models
The module uses various APIs and services to gather comprehensive background information while maintaining rate limits and handling errors gracefully.
- podcast_llm.research.download_page_content(urls: List[str]) → List[Document]
Download and parse content from a list of URLs.
Uses the newspaper3k library to download and extract clean text content from web pages. Handles errors gracefully and logs success/failure for each URL. Filters out articles with no text content.
- Parameters:
urls (list) – List of URLs to download and parse
- Returns:
List of dictionaries containing the downloaded articles, with structure:
{
    'url': str,    # Original URL
    'title': str,  # Article title
    'text': str    # Cleaned article text content
}
- Return type:
list
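The download-and-filter behavior described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the names `download_pages` and `fetch_with_newspaper` are hypothetical, and the fetch step is passed in as a parameter (an assumption made here) so the loop can be exercised without network access. The `Article` calls are the standard newspaper3k ones, and the dictionary shape follows the return structure documented above.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger(__name__)


def fetch_with_newspaper(url: str) -> Dict[str, str]:
    """Download and parse a single page with newspaper3k."""
    # Imported lazily so the rest of the sketch runs without newspaper3k.
    from newspaper import Article

    article = Article(url)
    article.download()
    article.parse()
    return {"url": url, "title": article.title, "text": article.text}


def download_pages(
    urls: List[str],
    fetch: Callable[[str], Dict[str, str]] = fetch_with_newspaper,
) -> List[Dict[str, str]]:
    """Fetch each URL, skipping failures and pages with no text."""
    articles = []
    for url in urls:
        try:
            article = fetch(url)
        except Exception as exc:
            # One bad URL should not abort the whole batch.
            logger.warning("Failed to download %s: %s", url, exc)
            continue
        if not article["text"].strip():
            logger.info("Skipping %s: no text content", url)
            continue
        logger.info("Downloaded %s", url)
        articles.append(article)
    return articles
```

Catching the exception per URL rather than around the whole loop is what lets one failed download be logged and skipped while the remaining URLs are still processed.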
- podcast_llm.research.download_wikipedia_articles(suggestions: WikipediaPages) → list
Download Wikipedia articles based on suggested page titles.
Takes a structured list of Wikipedia page suggestions and downloads the full content of each article using the WikipediaRetriever. Handles errors gracefully if any articles fail to download.
- Parameters:
suggestions (WikipediaPages) – Structured list of suggested Wikipedia page titles
- Returns:
List of retrieved Wikipedia document objects containing page content and metadata
- Return type:
list
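As a rough sketch of the retrieval loop described above, the snippet below calls a retriever once per suggested title and skips titles that fail. It assumes LangChain's `WikipediaRetriever` from `langchain_community` and its standard `invoke` method. The function name `download_articles`, the plain list of titles, and the choice to keep every document returned for a title are illustrative assumptions, not the module's confirmed behavior.

```python
import logging
from typing import Any, List, Optional

logger = logging.getLogger(__name__)


def download_articles(titles: List[str], retriever: Optional[Any] = None) -> List[Any]:
    """Retrieve Wikipedia documents for each title, skipping failures."""
    if retriever is None:
        # Imported lazily so a stub retriever can be used without LangChain.
        from langchain_community.retrievers import WikipediaRetriever

        retriever = WikipediaRetriever()

    documents: List[Any] = []
    for title in titles:
        try:
            # invoke() returns a list of Document objects for the query.
            documents.extend(retriever.invoke(title))
        except Exception as exc:
            logger.warning("Could not retrieve %r: %s", title, exc)
    return documents
```

With the real model, the titles would come from the `pages` entries of a `WikipediaPages` object, for example `[page.name for page in suggestions.pages]`.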
- podcast_llm.research.perform_tavily_queries(config: PodcastConfig, queries: SearchQueries) → list
Execute search queries using the Tavily API.
Performs web searches for each provided query using the Tavily search API, filtering out certain domains and PDF files. Handles API interaction and result processing to extract relevant URLs for further content scraping.
- Parameters:
config (PodcastConfig) – Podcast configuration
queries (SearchQueries) – Structured list of search queries to execute
- Returns:
List of URLs from search results, excluding PDFs and filtered domains
- Return type:
list
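The result-filtering step can be illustrated without calling Tavily at all. The sketch below takes URLs that a search has already returned and drops PDFs and excluded domains. The function name `filter_result_urls` and the example domain list are hypothetical. The module's actual exclusion list and matching rules are not documented here.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def filter_result_urls(urls: Iterable[str], excluded_domains: Iterable[str]) -> List[str]:
    """Keep URLs that are not PDFs and not hosted on an excluded domain."""
    excluded = {domain.lower() for domain in excluded_domains}
    kept = []
    for url in urls:
        parsed = urlparse(url)
        host = (parsed.hostname or "").lower()
        # Check the path, not the full URL, so query strings do not matter.
        if parsed.path.lower().endswith(".pdf"):
            continue
        # Match the domain itself and any subdomain of it.
        if any(host == d or host.endswith("." + d) for d in excluded):
            continue
        kept.append(url)
    return kept


urls = [
    "https://example.com/post",
    "https://example.com/paper.PDF",
    "https://www.youtube.com/watch?v=1",
    "https://notyoutube.com/a",
]
print(filter_result_urls(urls, ["youtube.com"]))
# ['https://example.com/post', 'https://notyoutube.com/a']
```

Comparing against the parsed hostname, rather than doing a substring search on the whole URL, avoids rejecting unrelated sites whose names merely contain an excluded domain.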
- podcast_llm.research.research_background_info(config: PodcastConfig, topic: str) → list
Research background information for a podcast topic.
Coordinates the research process by first suggesting relevant Wikipedia articles based on the topic, then downloading the full content of those articles. Acts as the main orchestration function for gathering background research material.
- Parameters:
config (PodcastConfig) – Podcast configuration
topic (str) – The podcast topic to research
- Returns:
List of retrieved Wikipedia document objects containing article content and metadata
- Return type:
list
- podcast_llm.research.research_discussion_topics(config: PodcastConfig, topic: str, outline: PodcastOutline) → list
Research in-depth content for podcast discussion topics.
Takes a podcast topic and outline, then uses LangChain and GPT-4 to generate targeted search queries. These queries are used to find relevant articles via Tavily search. The articles are then downloaded and processed to provide detailed research material for each section of the podcast.
- Parameters:
config (PodcastConfig) – Podcast configuration
topic (str) – The main topic for the podcast episode
outline (PodcastOutline) – Structured outline containing sections and subsections
- Returns:
- List of dictionaries containing downloaded article content with structure:
- {
‘url’: str, # Source URL ‘title’: str, # Article title ‘text’: str # Article content
}
- Return type:
list
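The three stages described above (generate queries, search, download) compose into a simple pipeline. The sketch below shows only that control flow. Each stage is passed in as a callable, so the LLM, Tavily, and scraping steps are all stand-ins. The name `research_topics`, the list of section titles standing in for a `PodcastOutline`, and the URL de-duplication are assumptions for illustration.

```python
from typing import Callable, Dict, List


def research_topics(
    topic: str,
    section_titles: List[str],
    generate_queries: Callable[[str, List[str]], List[str]],
    search: Callable[[str], List[str]],
    download: Callable[[List[str]], List[Dict[str, str]]],
) -> List[Dict[str, str]]:
    """Run query generation, search, and download in sequence."""
    queries = generate_queries(topic, section_titles)

    urls: List[str] = []
    for query in queries:
        for url in search(query):
            # Assumption: the same page is not downloaded twice.
            if url not in urls:
                urls.append(url)

    return download(urls)
```

Passing the stages as parameters keeps the orchestration logic testable with stubs, independent of any API keys or network access.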
- podcast_llm.research.suggest_wikipedia_articles(config: PodcastConfig, topic: str) → WikipediaPages
Suggest relevant Wikipedia articles for a given topic.
Uses LangChain and GPT-4 to intelligently suggest Wikipedia articles that would provide good background research for a podcast episode on the given topic.
- Parameters:
config (PodcastConfig) – Podcast configuration
topic (str) – The podcast topic to research
- Returns:
A structured list of suggested Wikipedia article titles
- Return type: