
Incremental Indexing

Incremental indexing tracks file changes using content hashes and modification times, so that only new or modified files are re-indexed across sessions.

Overview

The incremental indexing system provides:
  • File hash tracking for accurate change detection
  • Modification time monitoring as a fast-path check
  • Persistent state across indexing sessions
  • Ignore patterns via .praisonignore files

Quick Start

from praisonaiagents.knowledge import Knowledge

knowledge = Knowledge()

# First index - indexes all files
result1 = knowledge.index(
    "./docs",
    memory={"user_id": "my_user"},
    incremental=True,
)
print(f"Indexed: {result1.files_indexed}, Skipped: {result1.files_skipped}")

# Second index - skips unchanged files
result2 = knowledge.index(
    "./docs",
    memory={"user_id": "my_user"},
    incremental=True,
)
print(f"Indexed: {result2.files_indexed}, Skipped: {result2.files_skipped}")

How It Works

File Tracking

The FileTracker class maintains state about indexed files:
from praisonaiagents.knowledge.indexing import FileTracker

tracker = FileTracker(state_file=".praison/.index_state.json")
tracker.load()

# Check if file has changed
if tracker.has_changed("./docs/readme.md"):
    print("File needs re-indexing")
else:
    print("File unchanged, skipping")
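
The combination of a modification-time fast path and a content hash can be sketched as follows. This is a minimal illustration, not FileTracker's actual implementation: the helper names `file_fingerprint` and `has_changed`, the choice of SHA-256, and the shape of the `recorded` dict (mirroring the `hash`, `mtime`, and `size` fields of the state file) are all assumptions.

```python
import hashlib
import os


def file_fingerprint(path):
    """Return (sha256 hex digest, mtime, size) for a file (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in blocks so large files are not loaded into memory at once
        for block in iter(lambda: fh.read(65536), b""):
            digest.update(block)
    stat = os.stat(path)
    return digest.hexdigest(), stat.st_mtime, stat.st_size


def has_changed(path, recorded):
    """Compare a file with its recorded state; None means never indexed."""
    if recorded is None:
        return True
    stat = os.stat(path)
    # Fast path: identical mtime and size, assume unchanged without hashing
    if stat.st_mtime == recorded["mtime"] and stat.st_size == recorded["size"]:
        return False
    # Slow path: metadata differs, so confirm with a content hash
    file_hash, _, _ = file_fingerprint(path)
    return file_hash != recorded["hash"]
```

With this design, a file whose timestamp changed but whose content did not (for example after a fresh checkout) is hashed once and still reported as unchanged.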

State Persistence

State is automatically saved to .praison/.index_state.json:
{
  "/path/to/file1.txt": {
    "path": "/path/to/file1.txt",
    "hash": "abc123...",
    "mtime": 1704067200.0,
    "size": 1024
  }
}
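
Reading and writing a state file of this shape might look like the sketch below. The function names `load_state` and `save_state` are hypothetical, and the write-then-rename step is a design choice of this example rather than something the library is documented to do.

```python
import json
import os


def load_state(state_file):
    """Return the saved state dict, or an empty dict if no state exists yet."""
    if not os.path.exists(state_file):
        return {}
    with open(state_file, "r", encoding="utf-8") as fh:
        return json.load(fh)


def save_state(state_file, state):
    """Write state to a temporary file, then replace the original in one step."""
    directory = os.path.dirname(state_file)
    if directory:
        os.makedirs(directory, exist_ok=True)
    tmp_path = state_file + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(state, fh, indent=2)
    # os.replace swaps the file in place, so a crash mid-write
    # leaves the previous state file intact
    os.replace(tmp_path, state_file)
```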

Ignore Patterns

Using .praisonignore

Create a .praisonignore file in your corpus directory:
# Ignore log files
*.log

# Ignore test directories
test/
tests/
__pycache__/

# Ignore specific files
secrets.txt
.env
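
To illustrate how patterns like these could be evaluated, here is a simplified sketch built on the standard-library `fnmatch` module. It is an assumption-laden approximation, not the IgnoreMatcher implementation: `parse_ignore_lines` and `should_ignore` are hypothetical names, and full gitignore semantics (negation with `!`, anchored patterns, `**`) are not handled.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def parse_ignore_lines(text):
    """Extract patterns from ignore-file text, skipping blanks and comments."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def should_ignore(relative_path, patterns):
    """Return True if a POSIX-style relative path matches any pattern."""
    parts = PurePosixPath(relative_path).parts
    if not parts:
        return False
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory pattern: match against any parent directory name
            name = pattern.rstrip("/")
            if any(fnmatchcase(part, name) for part in parts[:-1]):
                return True
        elif fnmatchcase(parts[-1], pattern):
            # File pattern: match against the file name only
            return True
    return False
```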

Programmatic Exclusion

result = knowledge.index(
    "./docs",
    memory={"user_id": "my_user"},
    exclude_glob=["*.log", "test_*", "*.tmp"],
    include_glob=["*.md", "*.txt", "*.py"],
)
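
The source does not state how include and exclude patterns interact. One plausible reading, sketched below with a hypothetical `select_files` helper, is that a file must match at least one include pattern and no exclude pattern, with exclusion taking precedence. Treat this as an assumption to verify against the library's behavior.

```python
import posixpath
from fnmatch import fnmatchcase


def select_files(paths, include_glob=None, exclude_glob=None):
    """Filter paths by file name; exclusion wins over inclusion (assumed)."""
    selected = []
    for path in paths:
        name = posixpath.basename(path)
        # With include patterns present, the name must match at least one
        if include_glob and not any(fnmatchcase(name, p) for p in include_glob):
            continue
        # Any matching exclude pattern removes the file
        if exclude_glob and any(fnmatchcase(name, p) for p in exclude_glob):
            continue
        selected.append(path)
    return selected
```

Under this reading, a file such as `test_utils.py` would be dropped even though it matches `*.py`, because it also matches `test_*`.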

CLI Usage

# Incremental index (default)
praisonai knowledge index ./docs --user-id myuser

# Force full re-index
praisonai knowledge index ./docs --user-id myuser --full

# With include/exclude patterns
praisonai knowledge index ./docs -i "*.md,*.txt" -e "*.log,test_*"

# Verbose output showing skipped files
praisonai knowledge index ./docs --verbose

Index Results

The IndexResult dataclass provides detailed statistics:
from praisonaiagents.knowledge.indexing import IndexResult

result = knowledge.index("./docs", memory={"user_id": "my_user"})

print(f"Files indexed: {result.files_indexed}")
print(f"Files skipped: {result.files_skipped}")
print(f"Chunks created: {result.chunks_created}")
print(f"Duration: {result.duration_seconds:.2f}s")
print(f"Errors: {result.errors}")

Corpus Statistics

Get statistics about your indexed corpus:
from praisonaiagents.knowledge.indexing import CorpusStats

# From directory scan
stats = CorpusStats.from_directory("./docs")

print(f"Files: {stats.file_count}")
print(f"Estimated tokens: {stats.total_tokens}")
print(f"Recommended strategy: {stats.strategy_recommendation}")
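
The source does not describe how the token estimate is computed. As a rough illustration only, a directory scan could count files and approximate tokens with the common heuristic of about four characters per token for English text. The function `estimate_corpus` and its `chars_per_token` parameter are hypothetical and are not part of CorpusStats.

```python
import os


def estimate_corpus(directory, chars_per_token=4):
    """Count readable files and approximate tokens from character counts."""
    file_count = 0
    total_chars = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                # Undecodable bytes are dropped rather than raising an error
                with open(path, "r", encoding="utf-8", errors="ignore") as fh:
                    total_chars += len(fh.read())
            except OSError:
                # Skip files that cannot be opened
                continue
            file_count += 1
    return {
        "file_count": file_count,
        "total_tokens": total_chars // chars_per_token,
    }
```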

Integration with Agents

from praisonaiagents import Agent

# Agent with knowledge automatically uses incremental indexing
agent = Agent(
    name="DocExpert",
    instructions="Answer questions using the knowledge base.",
    knowledge=["./docs"],
    memory={"user_id": "my_user"},  # Enables user-scoped indexing
)

# First chat triggers indexing
response = agent.chat("What are the key features?")

# Subsequent chats use cached index (incremental)
response = agent.chat("Tell me more about authentication")

Best Practices

  1. Use incremental mode - Default behavior, saves time on large corpora
  2. Set up ignore patterns - Exclude logs, tests, and temporary files
  3. Monitor index results - Check for errors and unexpected skips
  4. Periodic full re-index - Use --full occasionally to ensure consistency

API Reference

IndexResult

@dataclass
class IndexResult:
    files_indexed: int = 0
    files_skipped: int = 0
    chunks_created: int = 0
    duration_seconds: float = 0.0
    errors: List[str] = field(default_factory=list)
    corpus_stats: Optional[CorpusStats] = None

FileTracker

class FileTracker:
    def __init__(self, state_file: Optional[str] = None):
        """Initialize tracker with optional state file."""
    
    def has_changed(self, filepath: str) -> bool:
        """Check if file has changed since last index."""
    
    def mark_indexed(self, filepath: str, info: Dict) -> None:
        """Mark file as indexed."""
    
    def save(self) -> None:
        """Save state to file."""
    
    def load(self) -> None:
        """Load state from file."""

IgnoreMatcher

class IgnoreMatcher:
    def __init__(self, patterns: Optional[List[str]] = None):
        """Initialize with ignore patterns."""
    
    def should_ignore(self, path: str) -> bool:
        """Check if path should be ignored."""
    
    @classmethod
    def from_directory(cls, directory: str) -> "IgnoreMatcher":
        """Load patterns from .praisonignore and .gitignore."""