podcast_llm.extractors.pdf
PDF file extraction module.
This module provides functionality for extracting text content from PDF files using LangChain’s PyPDFLoader. It handles loading PDFs, extracting text from each page, and combining the content into a single document.
The module includes:
- PDFSourceDocument class for handling PDF file extraction
- Page-by-page text extraction
- Combining pages with appropriate spacing
- Conversion to LangChain Document format
Example
>>> from podcast_llm.extractors.pdf import PDFSourceDocument
>>> extractor = PDFSourceDocument('document.pdf')
>>> extractor.extract()
>>> print(extractor.content)
Text content from PDF pages...
The extraction process:
1. Loads the PDF using PyPDFLoader
2. Extracts text from each page
3. Combines pages with double newlines between them
4. Returns the complete text content
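The four steps above can be sketched roughly as follows. This is a minimal illustration, not the module's actual implementation: `combine_pages`, `extract_pdf_text`, and `HasPageContent` are hypothetical names, and the sketch assumes `PyPDFLoader` is imported from `langchain_community.document_loaders` with `pypdf` installed.

```python
from typing import Iterable, Protocol


class HasPageContent(Protocol):
    """Anything exposing page text the way a LangChain Document does."""

    page_content: str


def combine_pages(pages: Iterable[HasPageContent]) -> str:
    """Join per-page text with a double newline between pages."""
    return "\n\n".join(page.page_content for page in pages)


def extract_pdf_text(path: str) -> str:
    """Hypothetical helper: load a PDF and return its combined text."""
    # Imported lazily so combine_pages can be used without LangChain installed.
    from langchain_community.document_loaders import PyPDFLoader

    pages = PyPDFLoader(path).load()  # one Document per PDF page
    return combine_pages(pages)
```

Keeping the join step separate from the loader makes the spacing rule easy to check on its own, without needing a real PDF file.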
The module integrates with the BaseSourceDocument interface to provide consistent handling of PDF files alongside other source types like audio and web content.
- class podcast_llm.extractors.pdf.PDFSourceDocument(source: str)
Bases: BaseSourceDocument
A document extractor for PDF files.
This class extracts text content from PDF files using LangChain's PyPDFLoader. It loads the PDF, extracts the text of each page, and combines the pages into a single document, separated by double newlines.
- src
Path to the source PDF file
- Type:
str
- src_type
Type of source document (‘PDF File’)
- Type:
str
- title
Title combining source type and filename
- Type:
str
- content
Extracted text content after processing
- Type:
Optional[str]
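To make the attribute list concrete, here is a small stand-in showing how the four attributes might relate to one another. It is a sketch under assumptions, not the library's class: `SketchPDFSource` is a hypothetical name, the exact title format is a guess (the description only says the title combines source type and filename), and `content` starting as `None` is inferred from its `Optional[str]` type.

```python
import os
from typing import Optional


class SketchPDFSource:
    """Hypothetical stand-in mirroring the documented attributes."""

    def __init__(self, source: str) -> None:
        self.src: str = source            # path to the source PDF file
        self.src_type: str = "PDF File"   # documented source type label
        # Assumed format: source type, a colon, then the file's base name.
        self.title: str = f"{self.src_type}: {os.path.basename(source)}"
        # Assumed to stay None until extraction has run.
        self.content: Optional[str] = None


doc = SketchPDFSource("reports/document.pdf")
print(doc.title)
print(doc.content)
```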
Example
>>> extractor = PDFSourceDocument('document.pdf')
>>> extractor.extract()
>>> print(extractor.content)
Text content from PDF pages...