Reader Providers
Overview
The GraphRAG Toolkit provides a unified, extensible system for reading documents from a wide variety of sources. Reader providers abstract the details of document ingestion, allowing you to work with files, databases, APIs, cloud storage, and more using a consistent interface.
Architecture
Core Abstractions
- ReaderProvider: The abstract base class for all document readers. Every concrete reader implements the `read(input_source)` method, returning a list of `Document` objects.
- BaseReaderProvider: Implements both the GraphRAG `ReaderProvider` and LlamaIndex `BaseReader` interfaces, providing compatibility and a standard pattern for new readers.
- LlamaIndexReaderProviderBase: A simple wrapper for LlamaIndex readers, making it easy to adapt existing LlamaIndex readers to the GraphRAG system.
- ValidatedReaderProviderBase: Extends `LlamaIndexReaderProviderBase` with input, output, and configuration validation.
Configuration Classes
Each reader provider is paired with a configuration class (e.g., `PDFReaderConfig`, `WebReaderConfig`). These classes define the parameters required for each data source and use Python dataclasses for validation.
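As a rough illustration, a configuration class of this kind could look like the sketch below. `ExampleReaderConfig` and its fields are hypothetical and not part of the toolkit. The sketch assumes that value checks live in `__post_init__`, since a plain dataclass declares typed fields and defaults but does not check values on its own.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class ExampleReaderConfig:
    """Hypothetical config class, shown only to illustrate the pattern."""

    chunk_size: int = 1000
    encoding: str = "utf-8"
    metadata_fn: Optional[Callable[[str], Dict[str, Any]]] = None

    def __post_init__(self) -> None:
        # Dataclasses do not validate values automatically,
        # so explicit checks run right after field assignment.
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be a positive integer")
        if self.metadata_fn is not None and not callable(self.metadata_fn):
            raise TypeError("metadata_fn must be callable or None")


# A valid config is created normally; an invalid one fails early.
config = ExampleReaderConfig(chunk_size=500)
```

Failing at construction time keeps configuration mistakes out of the read path, where they would be harder to trace.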
How to Use
- Choose a provider and config for your data source
- Instantiate the config with the required parameters
- Create the provider with the config
- Call `.read(input_source)` to extract documents

    from graphrag_toolkit.lexical_graph.indexing.load.readers import PDFReaderProvider, PDFReaderConfig

    config = PDFReaderConfig(
        return_full_document=False,
        metadata_fn=lambda path: {'source': 'pdf', 'file_path': path}
    )
    reader = PDFReaderProvider(config)
    documents = reader.read("/path/to/file.pdf")

Using Metadata with Readers
Many reader providers support attaching custom metadata to each document via the `metadata_fn` parameter in the configuration class. The function should accept the reader's input (for example, a file path or URL) and return a dictionary of metadata.

    def custom_metadata(path):
        return {
            "source": path,
            "document_type": "technical_doc",
            "project": "GraphRAG"
        }

    config = PDFReaderConfig(
        return_full_document=False,
        metadata_fn=custom_metadata
    )

Built-in Providers
Document Readers
| Provider | Config | Description | Dependencies |
|---|---|---|---|
| PDFReaderProvider | PDFReaderConfig | PDF documents | pymupdf, llama-index-readers-file |
| DocxReaderProvider | DocxReaderConfig | Word documents | python-docx |
| PPTXReaderProvider | PPTXReaderConfig | PowerPoint files | python-pptx |
| MarkdownReaderProvider | MarkdownReaderConfig | Markdown files | Built-in |
| CSVReaderProvider | CSVReaderConfig | CSV files | Built-in |
| JSONReaderProvider | JSONReaderConfig | JSON/JSONL files | Built-in |
| StreamingJSONLReaderProvider | StreamingJSONLReaderConfig | Memory-efficient JSONL processing | Built-in |
| StructuredDataReaderProvider | StructuredDataReaderConfig | CSV/Excel files with streaming | pandas, openpyxl, llama-index-readers-structured-data |
Web and Knowledge Base Readers
| Provider | Config | Description | Dependencies |
|---|---|---|---|
| WebReaderProvider | WebReaderConfig | Web pages | requests, beautifulsoup4 |
| WikipediaReaderProvider | WikipediaReaderConfig | Wikipedia articles | wikipedia |
| YouTubeReaderProvider | YouTubeReaderConfig | YouTube transcripts | youtube-transcript-api |
Cloud Storage Readers
| Provider | Config | Description | Dependencies |
|---|---|---|---|
| S3DirectoryReaderProvider | S3DirectoryReaderConfig | AWS S3 buckets | boto3 |
| DirectoryReaderProvider | DirectoryReaderConfig | Local directories | Built-in |
Database Readers
| Provider | Config | Description | Dependencies |
|---|---|---|---|
| DatabaseReaderProvider | DatabaseReaderConfig | SQL databases | Database-specific drivers |
Code and Repository Readers
| Provider | Config | Description | Dependencies |
|---|---|---|---|
| GitHubReaderProvider | GitHubReaderConfig | GitHub repositories | PyGithub |
Specialized Readers
| Provider | Config | Description | Dependencies |
|---|---|---|---|
| DocumentGraphReaderProvider | DocumentGraphReaderConfig | Document graphs | Built-in |
S3 Support
The GraphRAG Toolkit provides two approaches for S3 integration:
1. S3DirectoryReaderProvider (Recommended)
A modern S3 reader that uses LlamaIndex’s `S3Reader` for direct S3 access:

    from graphrag_toolkit.lexical_graph.indexing.load.readers import S3DirectoryReaderProvider, S3DirectoryReaderConfig

    # For a single file
    config = S3DirectoryReaderConfig(
        bucket="my-bucket",
        key="documents/file.pdf",  # Use 'key' for single file
        metadata_fn=lambda path: {'source': 's3'}
    )

    # For a directory/prefix
    config = S3DirectoryReaderConfig(
        bucket="my-bucket",
        prefix="documents/",  # Use 'prefix' for directory
        metadata_fn=lambda path: {'source': 's3'}
    )

    # Note: Use either 'key' OR 'prefix', not both
    reader = S3DirectoryReaderProvider(config)
    docs = reader.read()

2. Legacy S3BasedDocs
Legacy system for S3 document storage and retrieval (still supported):

    from graphrag_toolkit.lexical_graph.indexing.load import S3BasedDocs

    s3_docs = S3BasedDocs(
        region="us-east-1",
        bucket_name="my-bucket",
        key_prefix="documents/",
        collection_id="my-collection"
    )

    # Iterate through stored documents
    for doc in s3_docs:
        # Process document
        pass

S3 Authentication
S3 access uses `GraphRAGConfig.session` for AWS credentials. Configure via:
- AWS credentials file (`~/.aws/credentials`)
- Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
- IAM roles (when running on AWS)
- AWS SSO profiles
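When it is unclear which of these sources is active on a machine, a quick local check can help. The helper below is a hypothetical, standard-library-only sketch; `detect_credential_sources` is not part of the toolkit or of boto3. It only inspects environment variables and the credentials file, so it cannot see IAM roles or confirm that any credentials are valid.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def detect_credential_sources(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> List[str]:
    """Report which locally visible credential sources appear configured.

    Hypothetical helper: it reads local state only and never contacts AWS.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else home

    found: List[str] = []
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        found.append("environment variables")
    if (home / ".aws" / "credentials").is_file():
        found.append("credentials file")
    if env.get("AWS_PROFILE"):
        # AWS_PROFILE selects a named profile, which may be an SSO profile.
        found.append("named profile: " + env["AWS_PROFILE"])
    return found


print(detect_credential_sources())
```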
S3 Streaming for Large Files
The `StructuredDataReaderProvider` supports streaming large S3 files to avoid downloading them in full:

    config = StructuredDataReaderConfig(
        stream_s3=True,  # Enable streaming
        stream_threshold_mb=100,  # Stream files > 100MB
        pandas_config={"sep": ","}
    )

Streaming JSONL Processing
The `StreamingJSONLReaderProvider` is designed for memory-efficient processing of large JSONL files. Unlike the standard `JSONReaderProvider`, which loads all lines into memory, this provider processes files line by line with constant memory usage.
Key Features
- Memory Efficient: Processes files line by line without loading the entire file into memory
- Batch Processing: Yields documents in configurable batches for efficient downstream processing
- S3 Support: Works seamlessly with both local files and S3 URIs
- Flexible Text Extraction: Extract text from a specific field or use the entire JSON object
- Error Handling: Configurable strict mode for validation or graceful error skipping
- Progress Logging: Built-in progress tracking for long-running operations
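The features above can be illustrated with a small standard-library sketch of line-by-line batching. This is not the provider's implementation: `iter_jsonl_batches` is a hypothetical function, it yields plain dictionaries rather than `Document` objects, and its parameters only mirror the documented option names.

```python
import io
import json
from typing import Any, Dict, Iterator, List, Optional, TextIO


def iter_jsonl_batches(
    lines: TextIO,
    batch_size: int = 100,
    text_field: Optional[str] = "text",
    strict_mode: bool = False,
) -> Iterator[List[Dict[str, Any]]]:
    """Yield batches of records while holding at most one batch in memory."""
    batch: List[Dict[str, Any]] = []
    for line_number, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # blank lines carry no record
        try:
            record = json.loads(raw)
            # With text_field=None the whole JSON line becomes the text.
            text = raw if text_field is None else record[text_field]
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            if strict_mode:
                raise ValueError(f"line {line_number}: {exc}") from exc
            continue  # non-strict mode skips bad lines
        batch.append({"text": text, "line_number": line_number})
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


sample = io.StringIO(
    '{"text": "a"}\nnot json\n{"other": 1}\n{"text": "b"}\n{"text": "c"}\n'
)
batches = list(iter_jsonl_batches(sample, batch_size=2))
```

Because the function is a generator, a caller that processes each batch before requesting the next never holds more than `batch_size` records at once.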
Configuration Options
    StreamingJSONLReaderConfig(
        batch_size=100,       # Number of documents per batch (default: 100)
        text_field="text",    # JSON field to extract as text (default: "text")
                              # Set to None to use entire JSON as text
        strict_mode=False,    # If True, raise errors on invalid JSON/missing fields
                              # If False, skip invalid lines and continue (default)
        log_interval=10000,   # Log progress every N lines (default: 10000)
        metadata_fn=None      # Optional function to add custom metadata
    )

Usage Examples
Basic Usage
    from graphrag_toolkit.lexical_graph.indexing.load.readers import (
        StreamingJSONLReaderProvider,
        StreamingJSONLReaderConfig
    )

    config = StreamingJSONLReaderConfig()
    reader = StreamingJSONLReaderProvider(config)
    docs = reader.read('data.jsonl')

Custom Text Field
    # Extract text from a different field
    config = StreamingJSONLReaderConfig(
        text_field="content",  # Use "content" field instead of "text"
    )
    reader = StreamingJSONLReaderProvider(config)
    docs = reader.read('data.jsonl')

Use Entire JSON as Text
    # Use the entire JSON object as document text
    config = StreamingJSONLReaderConfig(
        text_field=None  # None means use full JSON
    )
    reader = StreamingJSONLReaderProvider(config)
    docs = reader.read('data.jsonl')

Lazy Loading for Large Files
    # Process large files in batches without loading all into memory
    config = StreamingJSONLReaderConfig(
        batch_size=50,
        log_interval=5000
    )
    reader = StreamingJSONLReaderProvider(config)

    for batch in reader.lazy_load_data('large-file.jsonl'):
        # Process each batch of up to 50 documents
        for doc in batch:
            # Process individual document
            print(f"Line {doc.metadata['line_number']}: {doc.text[:100]}...")

S3 Files
    # Works seamlessly with S3 URIs
    config = StreamingJSONLReaderConfig(
        batch_size=100,
        metadata_fn=lambda path: {'bucket': path.split('/')[2]}
    )
    reader = StreamingJSONLReaderProvider(config)
    docs = reader.read('s3://my-bucket/data/large-file.jsonl')

Strict Mode for Validation
    import json

    # Raise errors on invalid JSON or missing fields
    config = StreamingJSONLReaderConfig(
        text_field="required_field",
        strict_mode=True  # Will raise exception on first error
    )
    reader = StreamingJSONLReaderProvider(config)

    try:
        docs = reader.read('data.jsonl')
    except (json.JSONDecodeError, ValueError) as e:
        print(f"Validation failed: {e}")

Metadata
Each document includes the following metadata:
- `file_path`: Original source path (local or S3)
- `source`: Either `"local_file"` or `"s3"`
- `line_number`: 1-based line number in the file
- `document_type`: Always `"jsonl"`
- Any additional fields from `metadata_fn`
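To make the shape of this metadata concrete, here is a minimal sketch of how such a dictionary could be assembled. `build_metadata` is a hypothetical helper, not the provider's code. Two assumptions are baked in: the source is inferred from an `s3://` prefix, and fields returned by `metadata_fn` override built-in keys with the same name.

```python
from typing import Any, Callable, Dict, Optional


def build_metadata(
    path: str,
    line_number: int,
    metadata_fn: Optional[Callable[[str], Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Assemble per-document metadata in the documented shape (sketch)."""
    metadata: Dict[str, Any] = {
        "file_path": path,
        # Assumption: an s3:// prefix is what marks a path as S3.
        "source": "s3" if path.startswith("s3://") else "local_file",
        "line_number": line_number,  # 1-based
        "document_type": "jsonl",
    }
    if metadata_fn is not None:
        # Assumption: custom fields are merged last and win on key clashes.
        metadata.update(metadata_fn(path))
    return metadata


meta = build_metadata(
    "s3://my-bucket/data/large-file.jsonl",
    line_number=3,
    metadata_fn=lambda path: {"bucket": path.split("/")[2]},
)
```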
Configuration Examples
PDF Reader

    from graphrag_toolkit.lexical_graph.indexing.load.readers import PDFReaderProvider, PDFReaderConfig

    config = PDFReaderConfig(
        return_full_document=False,
        metadata_fn=lambda path: {'source': 'pdf', 'file_path': path}
    )
    reader = PDFReaderProvider(config)
    docs = reader.read('document.pdf')

Web Reader
    from graphrag_toolkit.lexical_graph.indexing.load.readers import WebReaderProvider, WebReaderConfig

    config = WebReaderConfig(
        html_to_text=True,
        metadata_fn=lambda url: {'source': 'web', 'url': url}
    )
    reader = WebReaderProvider(config)
    docs = reader.read('https://example.com')

YouTube Reader
    from graphrag_toolkit.lexical_graph.indexing.load.readers import YouTubeReaderProvider, YouTubeReaderConfig

    config = YouTubeReaderConfig(
        language="en",
        metadata_fn=lambda url: {'source': 'youtube', 'url': url}
    )
    reader = YouTubeReaderProvider(config)
    docs = reader.read('https://www.youtube.com/watch?v=VIDEO_ID')

Proxy Support
For environments requiring HTTP/HTTPS proxies (corporate networks, containerized deployments):

    # Option 1: Configure via YouTubeReaderConfig
    config = YouTubeReaderConfig(
        language="en",
        proxy_url="http://proxy.example.com:8080",  # HTTP/HTTPS proxy
        metadata_fn=lambda url: {'source': 'youtube', 'url': url}
    )
    reader = YouTubeReaderProvider(config)
    docs = reader.read('https://www.youtube.com/watch?v=VIDEO_ID')

    # Option 2: Configure via environment variable
    # export YOUTUBE_PROXY_URL=http://proxy.example.com:8080
    config = YouTubeReaderConfig(language="en")
    reader = YouTubeReaderProvider(config)  # Automatically uses YOUTUBE_PROXY_URL
    docs = reader.read('https://www.youtube.com/watch?v=VIDEO_ID')

The proxy URL should be in the format `http://proxy.example.com:port` or `https://proxy.example.com:port`. The same proxy is used for both HTTP and HTTPS requests.
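The proxy rules described here can be sketched with the standard library alone. `resolve_proxies` is a hypothetical helper, not the provider's code. It assumes an explicit `proxy_url` takes precedence over `YOUTUBE_PROXY_URL`, and it returns a mapping in the shape the `requests` library accepts for its `proxies` argument.

```python
import os
from typing import Dict, Mapping, Optional
from urllib.parse import urlparse


def resolve_proxies(
    proxy_url: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a proxies mapping, or an empty dict when no proxy is set."""
    env = os.environ if env is None else env
    # Assumption: the explicit setting wins over the environment variable.
    url = proxy_url or env.get("YOUTUBE_PROXY_URL")
    if not url:
        return {}
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"unsupported proxy URL: {url!r}")
    # The same proxy serves both HTTP and HTTPS requests.
    return {"http": url, "https": url}


proxies = resolve_proxies(env={"YOUTUBE_PROXY_URL": "http://proxy.example.com:8080"})
```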
Structured Data Reader (CSV/Excel)
    from graphrag_toolkit.lexical_graph.indexing.load.readers import StructuredDataReaderProvider, StructuredDataReaderConfig

    config = StructuredDataReaderConfig(
        col_index=0,  # Column to use as index
        col_joiner=', ',  # How to join columns
        pandas_config={"sep": ","},  # Pandas options
        stream_s3=True,  # Enable S3 streaming
        stream_threshold_mb=50,  # Stream files > 50MB
        metadata_fn=lambda path: {'source': 'structured', 'file': path}
    )
    reader = StructuredDataReaderProvider(config)

    # Works with local and S3 files
    docs = reader.read(['data.csv', 's3://bucket/large-file.xlsx'])

S3 Directory Reader
    from graphrag_toolkit.lexical_graph.indexing.load.readers import S3DirectoryReaderProvider, S3DirectoryReaderConfig

    # Reading from a directory/prefix
    config = S3DirectoryReaderConfig(
        bucket="my-bucket",
        prefix="documents/",  # For directory access
        metadata_fn=lambda path: {'source': 's3', 'path': path}
    )
    reader = S3DirectoryReaderProvider(config)
    docs = reader.read()  # No parameter needed

    # Reading a single file
    config = S3DirectoryReaderConfig(
        bucket="my-bucket",
        key="documents/specific-file.pdf",  # For single file
        metadata_fn=lambda path: {'source': 's3', 'path': path}
    )
    reader = S3DirectoryReaderProvider(config)
    docs = reader.read()  # No parameter needed

Streaming JSONL Reader
    from graphrag_toolkit.lexical_graph.indexing.load.readers import StreamingJSONLReaderProvider, StreamingJSONLReaderConfig

    # Memory-efficient processing of large JSONL files
    config = StreamingJSONLReaderConfig(
        batch_size=100,  # Process in batches
        text_field="text",  # Field to extract as document text
        strict_mode=False,  # Skip invalid lines instead of raising errors
        log_interval=10000,  # Log progress every N lines
        metadata_fn=lambda path: {'source': 'jsonl', 'file': path}
    )
    reader = StreamingJSONLReaderProvider(config)

    # Works with local and S3 files
    docs = reader.read('data.jsonl')
    docs = reader.read('s3://bucket/large-file.jsonl')

    # Or use lazy loading for streaming
    for batch in reader.lazy_load_data('large-file.jsonl'):
        # Process each batch of documents
        for doc in batch:
            print(doc.text)

Database Reader
    from graphrag_toolkit.lexical_graph.indexing.load.readers import DatabaseReaderProvider, DatabaseReaderConfig

    config = DatabaseReaderConfig(
        connection_string="postgresql://user:pass@localhost/db",
        query="SELECT id, content FROM documents",
        metadata_fn=lambda row: {'source': 'database', 'id': row.get('id')}
    )
    reader = DatabaseReaderProvider(config)
    docs = reader.read(config.query)

Installation Requirements
Different readers require different dependencies. Install as needed:

    # PDF processing
    pip install pymupdf llama-index-readers-file

    # Web scraping
    pip install requests beautifulsoup4 llama-index-readers-web

    # YouTube transcripts
    pip install youtube-transcript-api

    # AWS services
    pip install boto3

    # Structured data processing
    pip install pandas openpyxl llama-index-readers-structured-data

    # Office documents
    pip install python-docx python-pptx

    # GitHub integration
    pip install PyGithub

    # Notion integration
    pip install notion-client

    # Wikipedia
    pip install wikipedia

Extending: Writing a Custom Reader
To add a new data source:

- Create a config class as a dataclass:

    from dataclasses import dataclass
    from typing import Optional, Callable, Dict, Any
    from .reader_provider_config_base import ReaderProviderConfig

    @dataclass
    class MyReaderConfig(ReaderProviderConfig):
        api_key: str = ""
        metadata_fn: Optional[Callable[[str], Dict[str, Any]]] = None

- Subclass a base provider:

    from .base_reader_provider import BaseReaderProvider

    class MyReaderProvider(BaseReaderProvider):
        def __init__(self, config: MyReaderConfig):
            self.config = config

        def read(self, input_source):
            # Implement your reading logic
            documents = []
            # ... process input_source ...
            return documents

- Register in `__init__.py` for easy importing.
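For a version of this pattern that runs outside the package, the sketch below swaps the toolkit's base classes for small stand-ins. `Document`, `ReaderProvider`, `TextFileReaderConfig`, and `TextFileReaderProvider` are all defined here for illustration only; the real classes may differ in fields and behavior.

```python
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Document:
    """Stand-in for the document type returned by readers."""
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class ReaderProvider(ABC):
    """Stand-in for the abstract base class described under Core Abstractions."""

    @abstractmethod
    def read(self, input_source: Any) -> List[Document]:
        ...


@dataclass
class TextFileReaderConfig:
    """Hypothetical config for a plain-text reader."""
    encoding: str = "utf-8"
    metadata_fn: Optional[Callable[[str], Dict[str, Any]]] = None


class TextFileReaderProvider(ReaderProvider):
    """Hypothetical reader that returns one document per text file."""

    def __init__(self, config: TextFileReaderConfig):
        self.config = config

    def read(self, input_source: str) -> List[Document]:
        text = Path(input_source).read_text(encoding=self.config.encoding)
        metadata: Dict[str, Any] = {"file_path": str(input_source)}
        if self.config.metadata_fn is not None:
            metadata.update(self.config.metadata_fn(str(input_source)))
        return [Document(text=text, metadata=metadata)]


# Usage: write a temporary file, then read it back through the provider.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "note.txt"
    sample.write_text("hello reader", encoding="utf-8")
    reader = TextFileReaderProvider(
        TextFileReaderConfig(metadata_fn=lambda path: {"source": "text"})
    )
    docs = reader.read(str(sample))
```

The provider follows the same flow as the built-in examples: build a config, pass it to the provider, then call `read` with the input source.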