Data Readers Module
The readers module provides a protocol-based system for loading documents from various sources into the knowledge base.
Quick Start
from praisonaiagents.knowledge.readers import (
    Document,
    ReaderProtocol,
    get_reader_registry,
    detect_source_kind,
    get_file_extension,
)
# Detect source type
kind = detect_source_kind("document.pdf") # Returns "file"
kind = detect_source_kind("https://example.com") # Returns "url"
kind = detect_source_kind("./docs/") # Returns "directory"
# Get file extension
ext = get_file_extension("report.pdf") # Returns "pdf"
Classes
Document
A lightweight dataclass representing a loaded document.
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass
class Document:
    text: str
    # default_factory gives every instance its own empty dict
    metadata: Dict[str, Any] = field(default_factory=dict)
    doc_id: Optional[str] = None
Fields:
text - The document content as plain text
metadata - Optional metadata dictionary (source, title, etc.)
doc_id - Optional unique identifier
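To illustrate the fields above, here is a minimal sketch. It redefines an equivalent dataclass locally as a stand-in so the snippet runs without the library installed. The metadata keys (`source`, `title`) and the `doc_id` value are illustrative only, not a fixed schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


# Local stand-in mirroring the Document dataclass shown above, so this
# snippet runs without praisonaiagents installed.
@dataclass
class Document:
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    doc_id: Optional[str] = None


# Only `text` is required; metadata defaults to a fresh empty dict.
minimal = Document(text="Hello, world")

# Illustrative metadata keys; the library does not mandate these names.
annotated = Document(
    text="Quarterly report body",
    metadata={"source": "report.pdf", "title": "Q3 Report"},
    doc_id="report-q3",
)

# Each instance owns its metadata dict, so mutating one document's
# metadata does not leak into another.
minimal.metadata["source"] = "inline"
other = Document(text="Another")
```

Because `metadata` uses `default_factory`, `other.metadata` is still empty after `minimal.metadata` was modified.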
ReaderProtocol
Protocol defining the interface for document readers.
from typing import Any, Dict, List, Optional, Protocol

class ReaderProtocol(Protocol):
    name: str
    supported_extensions: List[str]

    def load(
        self,
        source: str,
        metadata: Optional[Dict[str, Any]] = None,
        **kwargs
    ) -> List[Document]:
        """Load documents from a source."""
        ...
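Because the interface is a `Protocol`, a reader only needs the right attributes and a `load` method; it does not have to inherit from anything. The sketch below shows this with local stand-ins. Two assumptions: `@runtime_checkable` is added here only so `isinstance()` works in the demo (the source does not say the real protocol is decorated this way), and `UpperCaseReader` is a hypothetical reader that treats `source` as literal text rather than a path.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Protocol, runtime_checkable


@dataclass
class Document:  # local stand-in for the library's Document
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    doc_id: Optional[str] = None


# Assumption: @runtime_checkable is used only to enable isinstance() here.
@runtime_checkable
class ReaderProtocol(Protocol):
    name: str
    supported_extensions: List[str]

    def load(
        self,
        source: str,
        metadata: Optional[Dict[str, Any]] = None,
        **kwargs,
    ) -> List[Document]:
        ...


class UpperCaseReader:
    """Hypothetical reader; note that it does not inherit from ReaderProtocol."""

    name = "upper"
    supported_extensions = ["up"]

    def load(self, source, metadata=None, **kwargs):
        # For the demo, `source` is treated as the text itself, not a file path.
        return [Document(text=source.upper(), metadata=dict(metadata or {}))]


reader = UpperCaseReader()
# isinstance() on a runtime-checkable protocol only checks that the members
# exist; it does not validate their types or the method signature.
conforms = isinstance(reader, ReaderProtocol)
docs = reader.load("hello", metadata={"source": "inline"})
```

Static type checkers such as mypy perform the stricter signature check that `isinstance()` skips.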
ReaderRegistry
Registry for managing and discovering readers.
from praisonaiagents.knowledge.readers import get_reader_registry
registry = get_reader_registry()
# List registered readers
readers = registry.list_readers() # ['text', 'markdown', 'json', ...]
# Get reader for extension
reader = registry.get_for_extension("pdf")
# Register a custom reader (MyCustomReader is defined under "Creating Custom Readers")
registry.register(MyCustomReader())
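To make the registry's role concrete, the following is a minimal sketch of how an extension-to-reader lookup could work. `SimpleReaderRegistry`, `FakePDFReader`, and `FakeTextReader` are hypothetical stand-ins, not the library's classes; the extension normalisation and the `None` return for unknown extensions are assumptions of this sketch.

```python
from typing import Dict, List


class SimpleReaderRegistry:
    """Hypothetical stand-in; not the library's ReaderRegistry."""

    def __init__(self) -> None:
        self._by_name: Dict[str, object] = {}
        self._by_ext: Dict[str, object] = {}

    def register(self, reader) -> None:
        self._by_name[reader.name] = reader
        for ext in reader.supported_extensions:
            # Normalise so "PDF", ".pdf" and "pdf" all map to the same key.
            self._by_ext[ext.lower().lstrip(".")] = reader

    def list_readers(self) -> List[str]:
        return sorted(self._by_name)

    def get_for_extension(self, ext: str):
        # Assumption: unknown extensions yield None rather than raising.
        return self._by_ext.get(ext.lower().lstrip("."))


class FakePDFReader:
    name = "pdf"
    supported_extensions = ["pdf"]


class FakeTextReader:
    name = "text"
    supported_extensions = ["txt", "text"]


registry = SimpleReaderRegistry()
registry.register(FakePDFReader())
registry.register(FakeTextReader())

names = registry.list_readers()
pdf_reader = registry.get_for_extension(".PDF")
missing = registry.get_for_extension("xyz")
```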
Utility Functions
detect_source_kind
Detect the type of source without importing heavy libraries.
from praisonaiagents.knowledge.readers import detect_source_kind
detect_source_kind("file.pdf") # "file"
detect_source_kind("https://example.com") # "url"
detect_source_kind("./docs/") # "directory"
detect_source_kind("*.pdf") # "glob"
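The classification above needs nothing beyond the standard library. As a rough sketch of how such detection could be written, `sketch_detect_source_kind` below is a hypothetical re-implementation; the library's actual rules and their order may differ.

```python
import os


def sketch_detect_source_kind(source: str) -> str:
    """Hypothetical re-implementation; the library's real rules may differ."""
    # Check URLs first so "?" in a query string is not mistaken for a glob.
    if source.startswith(("http://", "https://")):
        return "url"
    # Glob metacharacters mark a pattern rather than a concrete path.
    if any(ch in source for ch in "*?["):
        return "glob"
    # A trailing separator or an existing directory counts as a directory.
    if source.endswith(("/", os.sep)) or os.path.isdir(source):
        return "directory"
    # Everything else is assumed to be a single file path.
    return "file"


kinds = {
    s: sketch_detect_source_kind(s)
    for s in ["file.pdf", "https://example.com", "./docs/", "*.pdf"]
}
```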
get_file_extension
Extract file extension from a path or URL.
from praisonaiagents.knowledge.readers import get_file_extension
get_file_extension("report.pdf") # "pdf"
get_file_extension("https://example.com/doc.html") # "html"
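For intuition, extension extraction for both paths and URLs can be sketched with `os.path.splitext` and `urllib.parse`. `sketch_get_file_extension` is a hypothetical re-implementation; lower-casing the result and returning an empty string when there is no extension are assumptions, not documented behaviour.

```python
import os
from urllib.parse import urlparse


def sketch_get_file_extension(source: str) -> str:
    """Hypothetical re-implementation; the library's real rules may differ."""
    # For URLs, keep only the path so query strings and fragments are ignored.
    if source.startswith(("http://", "https://")):
        source = urlparse(source).path
    _, ext = os.path.splitext(source)
    # Drop the leading dot and lower-case (assumption) to match the "pdf" style.
    return ext.lstrip(".").lower()


plain = sketch_get_file_extension("report.pdf")
from_url = sketch_get_file_extension("https://example.com/doc.html?x=1#top")
no_ext = sketch_get_file_extension("README")
```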
Supported File Types
| Extension | Reader | Description |
|---|---|---|
| txt, text | TextReader | Plain text files |
| md, markdown | MarkdownReader | Markdown documents |
| json, jsonl | JSONReader | JSON and JSON Lines |
| csv, tsv | CSVReader | Tabular data |
| html, htm | HTMLReader | Web pages |
| pdf | PDFReader | PDF documents |
| docx, doc | DocxReader | Word documents |
| xlsx, xls | ExcelReader | Spreadsheets |
| pptx, ppt | PowerPointReader | Presentations |
Creating Custom Readers
from praisonaiagents.knowledge.readers import Document, get_reader_registry
from typing import List, Dict, Any, Optional
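class MyCustomReader:
    name = "custom"
    supported_extensions = ["custom", "cst"]

    def load(
        self,
        source: str,
        metadata: Optional[Dict[str, Any]] = None,
        **kwargs
    ) -> List[Document]:
        # Read the whole file as text; an explicit encoding avoids
        # platform-dependent defaults
        with open(source, 'r', encoding='utf-8') as f:
            content = f.read()
        # Caller-supplied metadata is merged over the default "source" key
        return [Document(
            text=content,
            metadata={"source": source, **(metadata or {})}
        )]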
# Register the reader
registry = get_reader_registry()
registry.register(MyCustomReader())
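To see the custom reader in action without the library installed, the sketch below pairs it with a local stand-in for `Document` and loads a temporary file. The file name `notes.custom` and the `title` metadata key are illustrative assumptions.

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Document:  # local stand-in for the library's Document
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    doc_id: Optional[str] = None


class MyCustomReader:
    name = "custom"
    supported_extensions = ["custom", "cst"]

    def load(
        self,
        source: str,
        metadata: Optional[Dict[str, Any]] = None,
        **kwargs,
    ) -> List[Document]:
        with open(source, "r", encoding="utf-8") as f:
            content = f.read()
        # Caller-supplied metadata is merged over the default "source" key.
        return [Document(text=content, metadata={"source": source, **(metadata or {})})]


# Write a throwaway file, load it, and keep the results for inspection.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "notes.custom")
    with open(path, "w", encoding="utf-8") as f:
        f.write("first line\nsecond line\n")
    docs = MyCustomReader().load(path, metadata={"title": "Notes"})
    source_matches = docs[0].metadata["source"] == path
```

The reader returns a single `Document` per file; a reader for a multi-record format could just as well return one `Document` per record.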
- Zero heavy imports at module level
- Readers are lazy-loaded when first accessed
- No chromadb, torch, or sentence_transformers dependencies
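The lazy-loading point above can be sketched as a registry that stores factories and only builds a reader on first access. This is an illustrative pattern under stated assumptions, not the library's implementation: `LazyReaderRegistry`, `register_lazy`, `get`, and `make_csv_reader` are all hypothetical names.

```python
from typing import Callable, Dict, List

calls: List[str] = []  # records when the factory actually runs


class LazyReaderRegistry:
    """Hypothetical sketch of lazy reader creation; not the library's code."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], object]] = {}
        self._instances: Dict[str, object] = {}

    def register_lazy(self, name: str, factory: Callable[[], object]) -> None:
        # Registration stores only the factory; nothing is imported or built yet.
        self._factories[name] = factory

    def get(self, name: str):
        # Build on first access, then reuse the cached instance.
        if name not in self._instances:
            self._instances[name] = self._factories[name]()
        return self._instances[name]


def make_csv_reader():
    # The import lives inside the factory, so it runs only on first use.
    import csv

    calls.append("built")

    class CSVRows:
        name = "csv"

        def parse(self, text: str):
            return list(csv.reader(text.splitlines()))

    return CSVRows()


registry = LazyReaderRegistry()
registry.register_lazy("csv", make_csv_reader)
built_before_access = len(calls)

reader = registry.get("csv")
same_instance = registry.get("csv") is reader
rows = reader.parse("a,b\n1,2")
```

Deferring the import to the factory keeps module import time low, which is the same motivation as avoiding heavy dependencies at module level.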