Readers Module

The Readers module provides concrete implementations for loading documents from various sources into the knowledge base.

Import

from praisonai.adapters import AutoReader, TextReader, MarkItDownReader, DirectoryReader

Quick Example

from praisonai.adapters import AutoReader

# AutoReader automatically detects source type
reader = AutoReader()

# Load from file
docs = reader.load("document.pdf")

# Load from directory
docs = reader.load("./docs/")

# Load from URL
docs = reader.load("https://example.com/page.html")

Features

  • Automatic source type detection and routing
  • Multiple reader implementations (Text, MarkItDown, Directory, URL, Glob)
  • Lazy loading of optional dependencies
  • Metadata preservation for loaded documents
  • Recursive directory traversal with exclusion patterns
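
Lazy loading of optional dependencies usually means deferring an import until the reader that needs it is first used. The following is a minimal sketch of that general pattern, not the module's actual code; the helper name lazy_import and its install_hint parameter are hypothetical.

```python
import importlib
from types import ModuleType


def lazy_import(module_name: str, install_hint: str) -> ModuleType:
    """Import a module on demand and raise a helpful error if it is missing.

    Hypothetical helper: the name and signature are illustrative only.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this reader. "
            f"Install it with: {install_hint}"
        ) from exc


# json ships with Python, so this call succeeds without extra installs.
json_mod = lazy_import("json", "n/a (standard library)")
```

With this approach, importing the package stays cheap, and a missing optional package only surfaces as an error when the corresponding reader is actually used.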

Classes

AutoReader

Automatic reader that detects source type and routes to the appropriate reader.
from praisonai.adapters import AutoReader

reader = AutoReader()

# Handles files, directories, URLs, and glob patterns
docs = reader.load("report.pdf")
docs = reader.load("./documents/")
docs = reader.load("https://example.com")
docs = reader.load("*.md")
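
To illustrate what source type detection could look like, here is a small sketch that classifies a source string as a URL, glob pattern, directory, or file. The function classify_source and its rules are assumptions made for illustration; AutoReader's real detection logic may differ.

```python
import os
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Return 'url', 'glob', 'directory', or 'file' for a source string.

    Hypothetical helper, not part of praisonai.
    """
    if urlparse(source).scheme in ("http", "https"):
        return "url"
    # Shell-style wildcard characters suggest a glob pattern.
    if any(ch in source for ch in "*?["):
        return "glob"
    # Treat existing directories and paths with a trailing separator as directories.
    if os.path.isdir(source) or source.endswith(("/", os.sep)):
        return "directory"
    return "file"


for src in ("report.pdf", "./documents/", "https://example.com", "*.md"):
    print(src, "->", classify_source(src))
```

A router built on such a classifier would then hand the source to the matching reader.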

TextReader

Simple text file reader for plain text files.
from praisonai.adapters import TextReader

reader = TextReader()
docs = reader.load("notes.txt")
Supported Extensions: .txt, .text, .log

MarkItDownReader

Document reader using markitdown for rich document conversion.
from praisonai.adapters import MarkItDownReader

reader = MarkItDownReader()
docs = reader.load("report.pdf")
Supported Extensions: .pdf, .doc, .docx, .ppt, .pptx, .xls, .xlsx, .html, .htm, .md, .markdown, .csv, .json, .xml, .jpg, .jpeg, .png, .gif, .bmp, .tiff, .webp, .mp3, .wav, .ogg, .m4a, .flac
Requires the markitdown package: pip install markitdown

DirectoryReader

Recursively reads all files in a directory.
from praisonai.adapters import DirectoryReader

reader = DirectoryReader(
    recursive=True,
    exclude_patterns=["*.pyc", "__pycache__", ".git", "node_modules"]
)
docs = reader.load("./project/")
Parameters:
Parameter         Type       Default    Description
recursive         bool       True       Recursively traverse subdirectories
exclude_patterns  List[str]  See below  Glob patterns to exclude
Default Exclusions: *.pyc, __pycache__, .git, .svn, node_modules, *.egg-info, .env, .venv, venv
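
The sketch below shows one way glob-style exclusion patterns can be applied during directory traversal, using only the standard library. It assumes patterns are matched against individual file and directory names; how DirectoryReader actually matches them is not specified here, and the function names are hypothetical.

```python
import fnmatch
import os
from typing import Iterator, List, Optional

# Copied from the default exclusions listed above.
DEFAULT_EXCLUDES = [
    "*.pyc", "__pycache__", ".git", ".svn", "node_modules",
    "*.egg-info", ".env", ".venv", "venv",
]


def is_excluded(name: str, patterns: List[str]) -> bool:
    """Return True if a file or directory name matches any pattern."""
    return any(fnmatch.fnmatch(name, pattern) for pattern in patterns)


def iter_files(
    root: str,
    recursive: bool = True,
    exclude_patterns: Optional[List[str]] = None,
) -> Iterator[str]:
    """Yield paths of files under root that do not match an exclusion pattern."""
    patterns = DEFAULT_EXCLUDES if exclude_patterns is None else exclude_patterns
    for dirpath, dirnames, filenames in os.walk(root):
        if recursive:
            # Prune in place so os.walk never descends into excluded directories.
            dirnames[:] = [d for d in dirnames if not is_excluded(d, patterns)]
        else:
            # Clearing the list stops os.walk after the top-level directory.
            dirnames[:] = []
        for filename in sorted(filenames):
            if not is_excluded(filename, patterns):
                yield os.path.join(dirpath, filename)
```

Pruning directory names before descending avoids walking large trees such as node_modules or .git at all.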

Methods

load(source, metadata=None)

Load documents from a source.
Parameters:
  • source (str): File path, directory, URL, or glob pattern
  • metadata (dict, optional): Additional metadata to attach to documents
Returns: List[Document] - List of loaded documents

can_handle(source)

Check if the reader can handle the given source.
Parameters:
  • source (str): Source to check
Returns: bool - True if the reader can handle this source
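
To make the load and can_handle contract concrete, here is a self-contained sketch of a reader that follows the same two-method shape. The Document dataclass and the SimpleTextReader class are hypothetical stand-ins written for this example. They are not the library's own classes, and the metadata merge order shown is an assumption.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Optional


@dataclass
class Document:
    """Hypothetical stand-in with the content and metadata attributes used on this page."""
    content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class SimpleTextReader:
    """Illustrative reader exposing can_handle(source) and load(source, metadata=None)."""

    supported_extensions = {".txt", ".text", ".log"}

    def can_handle(self, source: str) -> bool:
        # Decide by file extension, compared case-insensitively.
        return Path(source).suffix.lower() in self.supported_extensions

    def load(self, source: str, metadata: Optional[Dict[str, Any]] = None) -> List[Document]:
        if not self.can_handle(source):
            raise ValueError(f"Unsupported source: {source}")
        text = Path(source).read_text(encoding="utf-8")
        # Assumption: caller metadata is merged first, then "source" is always set.
        merged = {**(metadata or {}), "source": source}
        return [Document(content=text, metadata=merged)]
```

Calling can_handle before load lets a router try several readers in turn and pick the first one that accepts the source.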

Example: Custom Metadata

from praisonai.adapters import AutoReader

reader = AutoReader()

# Add custom metadata to loaded documents
docs = reader.load(
    "technical_docs/",
    metadata={
        "category": "technical",
        "version": "2.0",
        "author": "engineering"
    }
)

for doc in docs:
    print(f"Source: {doc.metadata['source']}")
    print(f"Category: {doc.metadata['category']}")

Example: URL Reading

from praisonai.adapters.readers import URLReader

reader = URLReader()
docs = reader.load("https://docs.python.org/3/tutorial/index.html")

# Content is automatically extracted from HTML
print(docs[0].content[:500])
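
For readers curious about what extracting content from HTML involves, the following standard-library sketch strips tags and skips script and style blocks. It is a generic illustration under that assumption, not a description of how URLReader works internally; the names TextExtractor and html_to_text are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script and style elements."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
print(html_to_text(page))
```

Real pages often need more care (navigation menus, encodings, malformed markup), which is why dedicated conversion libraries are commonly used for this step.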

CLI Usage

praisonai knowledge add <source>
Examples:
# Add a single file
praisonai knowledge add document.pdf

# Add all files in a directory
praisonai knowledge add ./docs/

# Add files matching a pattern
praisonai knowledge add "*.pdf"

# Add from URL
praisonai knowledge add https://example.com/page.html