
Data Readers Module

The readers module provides a protocol-based system for loading documents from various sources into the knowledge base.

Quick Start

from praisonaiagents.knowledge.readers import (
    Document,
    ReaderProtocol,
    get_reader_registry,
    detect_source_kind,
    get_file_extension
)

# Detect source type
kind = detect_source_kind("document.pdf")  # Returns "file"
kind = detect_source_kind("https://example.com")  # Returns "url"
kind = detect_source_kind("./docs/")  # Returns "directory"

# Get file extension
ext = get_file_extension("report.pdf")  # Returns "pdf"

Classes

Document

A lightweight dataclass representing a loaded document.
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass
class Document:
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    doc_id: Optional[str] = None
Fields:
  • text - The document content as plain text
  • metadata - Optional metadata dictionary (source, title, etc.)
  • doc_id - Optional unique identifier
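
The defaults above can be illustrated with a small sketch. It redefines Document locally as a stand-in (so the snippet runs without the package installed) and shows that default_factory gives each instance its own metadata dictionary. The sample values ("notes.txt", "doc-2") are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Document:
    # Local stand-in that mirrors the fields documented above.
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    doc_id: Optional[str] = None


# Only text is required; metadata and doc_id fall back to their defaults.
a = Document(text="first")
b = Document(text="second", metadata={"source": "notes.txt"}, doc_id="doc-2")

# Mutating one document's metadata does not leak into another,
# because default_factory builds a fresh dict per instance.
a.metadata["title"] = "Intro"
fresh = Document(text="third")
```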

ReaderProtocol

Protocol defining the interface for document readers.
from typing import Any, Dict, List, Optional, Protocol

class ReaderProtocol(Protocol):
    name: str
    supported_extensions: List[str]
    
    def load(
        self,
        source: str,
        metadata: Optional[Dict[str, Any]] = None,
        **kwargs
    ) -> List[Document]:
        """Load documents from a source."""
        ...
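
Because the interface is a typing.Protocol, a reader only needs the right attributes and a load method; it does not have to inherit from anything. The sketch below uses local stand-ins for Document and ReaderProtocol so it runs on its own. The @runtime_checkable decorator and the InlineTextReader class are assumptions added for this illustration; the documentation above does not say whether the library's protocol supports isinstance checks.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Protocol, runtime_checkable


@dataclass
class Document:
    # Local stand-in mirroring the documented fields.
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    doc_id: Optional[str] = None


@runtime_checkable  # assumption: added only so isinstance() works in this sketch
class ReaderProtocol(Protocol):
    name: str
    supported_extensions: List[str]

    def load(
        self,
        source: str,
        metadata: Optional[Dict[str, Any]] = None,
        **kwargs,
    ) -> List[Document]:
        ...


class InlineTextReader:
    """Hypothetical reader that treats `source` as literal text, not a path."""

    name = "inline"
    supported_extensions = ["inline"]

    def load(self, source, metadata=None, **kwargs):
        return [Document(text=source.upper(), metadata=dict(metadata or {}))]


reader = InlineTextReader()
# Structural check: passes because the attributes and method are present.
is_reader = isinstance(reader, ReaderProtocol)
docs = reader.load("hello", metadata={"origin": "inline"})
```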

ReaderRegistry

Registry for managing and discovering readers.
from praisonaiagents.knowledge.readers import get_reader_registry

registry = get_reader_registry()

# List registered readers
readers = registry.list_readers()  # ['text', 'markdown', 'json', ...]

# Get reader for extension
reader = registry.get_for_extension("pdf")

# Register a custom reader (MyCustomReader is defined under "Creating Custom Readers")
registry.register(MyCustomReader())
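
To make the registry calls above more concrete, here is a minimal sketch of how a registry of this shape could map extensions to readers. SimpleRegistry and StubReader are hypothetical stand-ins written for this example; they are not the library's ReaderRegistry, and details such as extension normalisation are assumptions.

```python
from typing import Dict, List


class SimpleRegistry:
    """Hypothetical stand-in, not the library's ReaderRegistry."""

    def __init__(self) -> None:
        self._by_name: Dict[str, object] = {}
        self._by_ext: Dict[str, object] = {}

    def register(self, reader) -> None:
        self._by_name[reader.name] = reader
        for ext in reader.supported_extensions:
            # Normalise so "PDF", ".pdf" and "pdf" resolve to the same key.
            self._by_ext[ext.lower().lstrip(".")] = reader

    def list_readers(self) -> List[str]:
        return sorted(self._by_name)

    def get_for_extension(self, ext: str):
        # Returns None when no reader claims the extension.
        return self._by_ext.get(ext.lower().lstrip("."))


class StubReader:
    """Hypothetical reader used only to exercise the registry."""

    name = "stub"
    supported_extensions = ["stb", ".STUB"]

    def load(self, source, metadata=None, **kwargs):
        return []


registry = SimpleRegistry()
registry.register(StubReader())
names = registry.list_readers()
found = registry.get_for_extension(".STB")
missing = registry.get_for_extension("pdf")
```

Keeping a second dictionary keyed by extension makes lookup a single dictionary access, at the cost of the last registered reader winning when two readers claim the same extension.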

Utility Functions

detect_source_kind

Detect the type of source without importing heavy libraries.
from praisonaiagents.knowledge.readers import detect_source_kind

detect_source_kind("file.pdf")           # "file"
detect_source_kind("https://example.com") # "url"
detect_source_kind("./docs/")            # "directory"
detect_source_kind("*.pdf")              # "glob"
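
One way to get this classification with only the standard library is sketched below. sketch_detect_source_kind is a hypothetical function written for illustration; the order of the checks and the exact rules are assumptions, not the library's implementation.

```python
import os
import re

# A URL is assumed to start with "scheme://".
_URL_SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")
_GLOB_CHARS = set("*?[")


def sketch_detect_source_kind(source: str) -> str:
    """Hypothetical classifier mirroring the four kinds listed above."""
    # Check URLs first, since query strings may contain "?".
    if _URL_SCHEME.match(source):
        return "url"
    if any(ch in _GLOB_CHARS for ch in source):
        return "glob"
    # A trailing separator or an existing directory counts as a directory.
    if source.endswith(("/", os.sep)) or os.path.isdir(source):
        return "directory"
    return "file"


kinds = [
    sketch_detect_source_kind(s)
    for s in ("file.pdf", "https://example.com", "./docs/", "*.pdf")
]
```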

get_file_extension

Extract file extension from a path or URL.
from praisonaiagents.knowledge.readers import get_file_extension

get_file_extension("report.pdf")                    # "pdf"
get_file_extension("https://example.com/doc.html")  # "html"
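
A comparable result can be obtained with urllib.parse and os.path, as in the sketch below. sketch_get_file_extension is a hypothetical helper; lower-casing the result, ignoring query strings, and returning an empty string when there is no extension are assumptions about reasonable behaviour, not documented guarantees of get_file_extension.

```python
import os
from urllib.parse import urlparse


def sketch_get_file_extension(source: str) -> str:
    """Hypothetical helper: extension without the leading dot, lower-cased."""
    parsed = urlparse(source)
    # For web URLs, look only at the path so query strings are ignored.
    path = parsed.path if parsed.scheme in ("http", "https") else source
    _, ext = os.path.splitext(path)
    return ext.lstrip(".").lower()


ext_local = sketch_get_file_extension("report.pdf")
ext_url = sketch_get_file_extension("https://example.com/doc.html?lang=en")
ext_none = sketch_get_file_extension("https://example.com")
```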

Supported File Types

Extension      Reader            Description
txt, text      TextReader        Plain text files
md, markdown   MarkdownReader    Markdown documents
json, jsonl    JSONReader        JSON and JSON Lines
csv, tsv       CSVReader         Tabular data
html, htm      HTMLReader        Web pages
pdf            PDFReader         PDF documents
docx, doc      DocxReader        Word documents
xlsx, xls      ExcelReader       Spreadsheets
pptx, ppt      PowerPointReader  Presentations

Creating Custom Readers

from praisonaiagents.knowledge.readers import Document, get_reader_registry
from typing import List, Dict, Any, Optional

class MyCustomReader:
    name = "custom"
    supported_extensions = ["custom", "cst"]
    
    def load(
        self,
        source: str,
        metadata: Optional[Dict[str, Any]] = None,
        **kwargs
    ) -> List[Document]:
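        # Read the whole file as text; an explicit encoding avoids platform-dependent defaults
        with open(source, "r", encoding="utf-8") as f:
            content = f.read()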
        
        return [Document(
            text=content,
            metadata={"source": source, **(metadata or {})}
        )]

# Register the reader
registry = get_reader_registry()
registry.register(MyCustomReader())

Performance

  • Zero heavy imports at module level
  • Readers are lazy-loaded when first accessed
  • No chromadb, torch, or sentence_transformers dependencies
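
The lazy-loading idea in the list above can be sketched as follows. LazyReaderTable and make_csv_reader are hypothetical names for illustration only; the sketch shows the general pattern of deferring an import until a reader is first requested, and is not the library's actual mechanism. The standard-library csv module stands in for a heavy dependency.

```python
import importlib
from typing import Callable, Dict, List


class LazyReaderTable:
    """Hypothetical table that builds each reader on first access."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], object]] = {}
        self._instances: Dict[str, object] = {}

    def register_lazy(self, name: str, factory: Callable[[], object]) -> None:
        # Store only the factory; nothing is imported or built yet.
        self._factories[name] = factory

    def get(self, name: str) -> object:
        if name not in self._instances:
            self._instances[name] = self._factories[name]()
        return self._instances[name]


calls: List[str] = []


def make_csv_reader() -> object:
    calls.append("csv")
    # The import happens here, on first use, not at module import time.
    csv = importlib.import_module("csv")
    return csv.reader


table = LazyReaderTable()
table.register_lazy("csv", make_csv_reader)
calls_before = len(calls)
first = table.get("csv")
second = table.get("csv")
```

Registration stays cheap because the factory runs only once, on the first get call; later calls return the cached object.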