podcast_llm.text_to_speech

Text-to-speech conversion module for podcast generation.

This module handles the conversion of text scripts into natural-sounding speech using multiple TTS providers (Google Cloud TTS and ElevenLabs). It includes functionality for:

  • Rate limiting API requests to stay within provider quotas

  • Exponential backoff retry logic for API resilience

  • Processing individual conversation lines with appropriate voices

  • Merging multiple audio segments into a complete podcast

  • Managing temporary audio file storage and cleanup

The module supports different voices for interviewer/interviewee to create natural conversational flow and allows configuration of voice settings and audio effects through the PodcastConfig system.
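The rate limiting and exponential backoff listed above can be sketched as two decorators. This is a minimal illustration and not the module's actual implementation: the names rate_limit_sketch and retry_with_backoff_sketch, their parameters, and the default values are all assumptions made for this example.

```python
import functools
import time


def rate_limit_sketch(min_interval_s):
    """Hypothetical decorator: keep calls at least min_interval_s apart."""
    def decorator(func):
        last_call = [None]  # mutable cell holding the previous call time

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if last_call[0] is not None:
                wait = min_interval_s - (time.monotonic() - last_call[0])
                if wait > 0:
                    time.sleep(wait)
            last_call[0] = time.monotonic()
            return func(*args, **kwargs)
        return wrapper
    return decorator


def retry_with_backoff_sketch(max_retries=3, base_delay_s=1.0):
    """Hypothetical decorator: retry on any exception, doubling the delay."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_retries:
                        raise  # retries exhausted, surface the last error
                    # Delays grow as base, 2*base, 4*base, ...
                    time.sleep(base_delay_s * (2 ** attempt))
        return wrapper
    return decorator
```

Stacking both decorators on a per-line TTS call would space out requests and retry transient failures; production code would usually catch only the provider's retryable error types rather than every Exception.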

Typical usage:

    config = PodcastConfig()
    convert_to_speech(
        config, conversation_script, 'output.mp3', '.temp_audio/', 'mp3'
    )

podcast_llm.text_to_speech.clean_text_for_tts(lines: List) → List

Clean text lines for text-to-speech processing by removing special characters.

Takes a list of dictionaries containing speaker and text information and removes characters that may interfere with text-to-speech synthesis, such as asterisks, underscores, and em dashes.

Parameters:

lines (List[dict]) – List of dictionaries with structure:

    {
        'speaker': str,  # Speaker identifier
        'text': str      # Text to be cleaned
    }

Returns:

List of dictionaries with cleaned text and same structure as input

Return type:

List[dict]
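A minimal sketch of the cleaning step described above, assuming the three characters named in the description are simply deleted. The helper name clean_lines_sketch and the exact character set are assumptions; the real function may strip more characters or substitute them differently.

```python
import re

# Asterisks, underscores, and the em dash (U+2014), as named in the description.
_TTS_UNFRIENDLY = re.compile("[*_\u2014]")


def clean_lines_sketch(lines):
    """Hypothetical helper: return new dicts with TTS-unfriendly characters removed."""
    return [
        {"speaker": line["speaker"], "text": _TTS_UNFRIENDLY.sub("", line["text"])}
        for line in lines
    ]
```

The sketch builds new dictionaries instead of mutating the input, so the original script stays available for transcripts or logging.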

podcast_llm.text_to_speech.combine_consecutive_speaker_chunks(chunks: List[dict]) → List[dict]

Combine consecutive chunks from the same speaker into single chunks.

Parameters:

chunks (List[dict]) – List of dictionaries containing conversation chunks with structure:

    {
        'speaker': str,  # Speaker identifier
        'text': str      # Text content
    }

Returns:

List of combined chunks where consecutive entries from the same speaker are merged into single chunks

Return type:

List[dict]
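The merging behavior can be illustrated with a short sketch. It assumes merged texts are joined with a single space, which the documentation does not specify; combine_chunks_sketch is a hypothetical name, not the module's function.

```python
def combine_chunks_sketch(chunks):
    """Hypothetical helper: merge adjacent chunks that share a speaker."""
    combined = []
    for chunk in chunks:
        if combined and combined[-1]["speaker"] == chunk["speaker"]:
            # Same speaker as the previous entry: append the text.
            # Joining with one space is an assumption of this sketch.
            combined[-1]["text"] += " " + chunk["text"]
        else:
            # Copy the dict so the caller's input is left untouched.
            combined.append({"speaker": chunk["speaker"], "text": chunk["text"]})
    return combined
```

Only adjacent entries are merged, so a speaker who returns after the other speaker's turn starts a new chunk.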

podcast_llm.text_to_speech.convert_to_speech(config: PodcastConfig, conversation: str, output_file: str, temp_audio_dir: str, audio_format: str) → None

Convert a conversation script to speech audio using Google Text-to-Speech API.

Takes a conversation script consisting of speaker/text pairs and generates audio files for each line using Google’s TTS service. The individual audio files are then merged into a single output file. Uses different voices for different speakers to create a natural conversational feel.

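Parameters:
  • config (PodcastConfig) – Configuration object for the podcast, including voice settings

  • conversation (str) – Conversation lines to convert. The signature annotates this parameter as str, but the expected value is described as a list of dictionaries with structure:

        {
            'speaker': str,  # Speaker identifier ('Interviewer' or 'Interviewee')
            'text': str      # Line content to convert to speech
        }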

  • output_file (str) – Path where the final merged audio file should be saved

  • temp_audio_dir (str) – Directory path for temporary audio file storage

  • audio_format (str) – Format of the audio files (e.g. ‘mp3’)

Raises:

Exception – If any errors occur during TTS conversion or file operations
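The per-line stage of this workflow (synthesize each line, write it to the temporary directory, keep the paths in order for merging) might look like the sketch below. The synthesize callable stands in for a provider call such as process_line_google. The helper name, the zero-padded file naming, and the injected callable are assumptions made for illustration.

```python
import os


def synthesize_segments_sketch(conversation, temp_audio_dir, audio_format, synthesize):
    """Hypothetical helper: write one audio file per line, return paths in order.

    synthesize is any callable taking (text, speaker) and returning bytes.
    """
    os.makedirs(temp_audio_dir, exist_ok=True)
    paths = []
    for index, line in enumerate(conversation):
        audio_bytes = synthesize(line["text"], line["speaker"])
        # Zero-padded index keeps files sorted in conversation order.
        path = os.path.join(temp_audio_dir, f"{index:03d}.{audio_format}")
        with open(path, "wb") as handle:
            handle.write(audio_bytes)
        paths.append(path)
    return paths
```

The returned list can then be handed to a merge step, and the temporary directory removed once the final file has been written.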

podcast_llm.text_to_speech.generate_audio(config: PodcastConfig, final_script: list, output_file: str) → str

Generate audio from a podcast script using text-to-speech.

Takes a final script consisting of speaker/text pairs and generates a single audio file using Google’s Text-to-Speech service. The script is first cleaned and processed to be TTS-friendly, then converted to speech with different voices for different speakers.

Parameters:
  • config (PodcastConfig) – Configuration object for the podcast, including voice settings

  • final_script (list) – List of dictionaries containing script lines with structure:

        {
            'speaker': str,  # Speaker identifier ('Interviewer' or 'Interviewee')
            'text': str      # Line content to convert to speech
        }

  • output_file (str) – Path where the final audio file should be saved

Returns:

Path to the generated audio file

Return type:

str

Raises:

Exception – If any errors occur during TTS conversion or file operations

podcast_llm.text_to_speech.merge_audio_files(audio_files: List, output_file: str, audio_format: str) → None

Merge multiple audio files into a single output file.

Takes a list of audio files and combines them in the provided order into a single output file. Handles any audio format supported by pydub.

Parameters:
  • audio_files (list) – List of paths to audio files to merge

  • output_file (str) – Path where merged audio file should be saved

  • audio_format (str) – Format of input/output audio files (e.g. ‘mp3’, ‘wav’)

Returns:

None

Raises:

Exception – If there are any errors during the merging process
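The documented function relies on pydub. To keep the example dependency-free, the sketch below swaps in the standard-library wave module and only shows order-preserving concatenation of WAV files that share the same channel count, sample width, and frame rate. The helper name merge_wav_files_sketch is hypothetical.

```python
import wave


def merge_wav_files_sketch(audio_files, output_file):
    """Hypothetical helper: concatenate WAV files in the order given."""
    if not audio_files:
        raise ValueError("audio_files must not be empty")
    params = None
    frames = []
    for path in audio_files:
        with wave.open(path, "rb") as src:
            current = (src.getnchannels(), src.getsampwidth(), src.getframerate())
            if params is None:
                params = current
            elif current != params:
                # Raw concatenation is only valid for identical formats.
                raise ValueError(f"{path} has different audio parameters")
            frames.append(src.readframes(src.getnframes()))
    with wave.open(output_file, "wb") as dst:
        dst.setnchannels(params[0])
        dst.setsampwidth(params[1])
        dst.setframerate(params[2])
        for chunk in frames:
            dst.writeframes(chunk)
```

Compressed formats such as mp3 cannot be joined this way, which is one reason a library like pydub is the more practical choice for the real function.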

podcast_llm.text_to_speech.process_line_elevenlabs(config: PodcastConfig, text: str, speaker: str)

Process a line of text into speech using ElevenLabs TTS service.

Takes a line of text and speaker identifier and generates synthesized speech using ElevenLabs’ TTS service. Uses different voices based on the speaker to create natural conversation flow.

Parameters:
  • config (PodcastConfig) – Configuration object containing API keys and settings

  • text (str) – The text content to convert to speech

  • speaker (str) – Speaker identifier to determine voice selection

Returns:

Raw audio data in bytes format containing the synthesized speech

Return type:

bytes

podcast_llm.text_to_speech.process_line_google(config: PodcastConfig, text: str, speaker: str)

Process a single line of text using Google Text-to-Speech API.

Takes a line of text and speaker identifier and generates synthesized speech using Google’s TTS service. Uses different voices based on the speaker to create natural conversation flow.

Parameters:
  • config (PodcastConfig) – Configuration object for the podcast, including voice settings

  • text (str) – The text content to convert to speech

  • speaker (str) – Speaker identifier to determine voice selection

Returns:

Raw audio data in bytes format containing the synthesized speech

Return type:

bytes
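Both per-line functions choose a voice from the speaker identifier. How the module stores that mapping is not documented here, so the sketch below assumes a plain dictionary lookup. The voice names are placeholders rather than real provider voice IDs, and select_voice_sketch is a hypothetical helper.

```python
# Placeholder voice identifiers; real values would come from PodcastConfig
# or the provider's voice catalogue (an assumption of this sketch).
HYPOTHETICAL_VOICES = {
    "Interviewer": "placeholder-voice-a",
    "Interviewee": "placeholder-voice-b",
}


def select_voice_sketch(speaker, voices=None):
    """Hypothetical helper: map a speaker identifier to a voice name."""
    mapping = HYPOTHETICAL_VOICES if voices is None else voices
    try:
        return mapping[speaker]
    except KeyError:
        raise ValueError(f"No voice configured for speaker {speaker!r}") from None
```

Failing loudly on an unknown speaker avoids silently rendering a line in the wrong voice.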

podcast_llm.text_to_speech.process_lines_google_multispeaker(config: PodcastConfig, chunks: List)

Process multiple lines of text into speech using Google’s multi-speaker TTS service.

Takes a chunk of conversation lines and generates synthesized speech using Google’s multi-speaker TTS service. Handles up to 6 turns of conversation at once for more natural conversational flow.

Parameters:
  • config (PodcastConfig) – Configuration object containing API keys and settings

  • chunks (List) – List of dictionaries containing conversation lines with structure:

        {
            'speaker': str,  # Speaker identifier
            'text': str      # Line content to convert to speech
        }

Returns:

Raw audio data in bytes format containing the synthesized speech

Return type:

bytes
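Because the multi-speaker call handles up to 6 turns at once, a longer script has to be split into batches before it is sent. A minimal sketch of that batching, where batch_turns_sketch and its max_turns parameter are hypothetical names and the module may split scripts differently:

```python
def batch_turns_sketch(chunks, max_turns=6):
    """Hypothetical helper: split a script into lists of at most max_turns turns."""
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    return [chunks[i:i + max_turns] for i in range(0, len(chunks), max_turns)]
```

Each batch could then be passed to the multi-speaker function in turn, with the returned audio segments merged in order.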