Module praisonaiagents.knowledge

The knowledge module provides comprehensive knowledge management capabilities including document processing, vector storage, advanced chunking strategies, and semantic search functionality.

Classes

Knowledge

The main class for managing knowledge bases with vector storage and retrieval capabilities.

Parameters

config: Optional[Dict[str, Any]] = None - Configuration for vector store and processing
verbose: int = 0 - Verbosity level (0-5+) for logging output

Properties

memory - Returns a CustomMemory instance for knowledge storage
markdown - Returns MarkItDown instance for document processing
chunker - Returns a Chunking instance with configured strategy

Methods

store(content, user_id="user", agent_id=None, run_id=None, metadata=None) - Store raw text or file content
add(file_path, user_id="user", agent_id=None, run_id=None, metadata=None) - Process and add files to knowledge base
search(query, user_id=None, agent_id=None, run_id=None, rerank=True, **kwargs) - Search knowledge with optional reranking
get(memory_id) - Retrieve specific memory by ID
get_all(user_id=None, agent_id=None, run_id=None) - Retrieve all memories with filtering
update(memory_id, data) - Update existing memory
delete(memory_id) - Delete specific memory
delete_all(user_id=None, agent_id=None, run_id=None) - Batch delete with filtering
reset() - Clear all memories
history(memory_id) - Get change history for a memory

CustomMemory

A specialized memory class that bypasses LLM usage for simple fact storage.

Chunking

Unified interface for various text chunking strategies using the chonkie library.

Parameters

chunker_type: str = 'recursive' - Type of chunking strategy
chunk_size: int = 512 - Maximum size of each chunk
chunk_overlap: int = 50 - Overlap between chunks
tokenizer: Optional[Any] = None - Custom tokenizer (defaults to GPT-2)
embedding_model: Optional[Any] = None - Embedding model for semantic chunking

Methods

chunk(text: str) → List[Chunk] - Split text into chunks using configured strategy

Configuration

Vector Store Configuration

config = {
    "vector_store": {
        "provider": "chroma",  # Options: chroma, qdrant, pinecone
        "config": {
            "collection_name": "my_knowledge",
            "path": ".praison",  # For local storage
            # Additional provider-specific options
        }
    },
    "embedder": {
        "provider": "openai",
        "config": {
            "model": "text-embedding-3-small"
        }
    }
}

Chunking Strategies

1. Token Chunker (`'token'`)

Splits text by token count with overlapping windows.

chunker = Chunking(
    chunker_type='token',
    chunk_size=512,
    chunk_overlap=50
)

2. Sentence Chunker (`'sentence'`)

Splits text by sentences while respecting chunk size.

chunker = Chunking(
    chunker_type='sentence',
    chunk_size=512,
    chunk_overlap=50
)

3. Recursive Chunker (`'recursive'`) - Default

Hierarchical splitting with multiple separators.

chunker = Chunking(
    chunker_type='recursive',
    chunk_size=512
)

4. Semantic Chunker (`'semantic'`)

Groups semantically similar content together.

chunker = Chunking(
    chunker_type='semantic',
    chunk_size=512,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2"
)

5. SDPM Chunker (`'sdpm'`)

Semantic Double-Pass Merge for optimal chunking.

chunker = Chunking(
    chunker_type='sdpm',
    chunk_size=512,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2"
)

6. Late Chunker (`'late'`)

Optimized for retrieval performance with late interaction.

chunker = Chunking(
    chunker_type='late',
    chunk_size=512,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2"
)

Usage Examples

Basic Knowledge Management

from praisonaiagents import Knowledge

# Initialize knowledge base
knowledge = Knowledge(verbose=2)

# Add documents
knowledge.add("documents/manual.pdf")
knowledge.add("data/notes.txt")

# Store raw text
knowledge.store("Important fact: The API key is XYZ123")

# Search knowledge
results = knowledge.search("API key", limit=5)
for result in results:
    print(f"Score: {result['score']}, Content: {result['memory']}")

Agent Integration

from praisonaiagents import Agent

agent = Agent(
    name="Research Assistant",
    role="Knowledge expert",
    goal="Answer questions using knowledge base",
    knowledge=["research.pdf", "notes.txt"],  # Files to process
    knowledge_config={
        "vector_store": {
            "provider": "chroma",
            "config": {
                "collection_name": "research_kb"
            }
        }
    }
)

Advanced Configuration

# Configure with custom embeddings and chunking
config = {
    "vector_store": {
        "provider": "chroma",
        "config": {
            "collection_name": "advanced_kb",
            "path": "./knowledge_store"
        }
    },
    "embedder": {
        "provider": "openai",
        "config": {
            "model": "text-embedding-3-large"
        }
    }
}

knowledge = Knowledge(config=config, verbose=3)

# Use semantic chunking for better retrieval
knowledge.chunker = Chunking(
    chunker_type='semantic',
    chunk_size=1024,
    embedding_model="BAAI/bge-small-en-v1.5"
)

Scoped Knowledge Retrieval

# User-specific knowledge
user_results = knowledge.search(
    "project requirements",
    user_id="user123"
)

# Agent-specific knowledge
agent_results = knowledge.search(
    "conversation history",
    agent_id="support_agent"
)

# Session-specific knowledge
session_results = knowledge.search(
    "current task",
    run_id="session_456"
)

Supported File Types

Documents: PDF, DOC, DOCX, PPT, PPTX, XLS, XLSX
Text: TXT, MD, CSV, JSON, XML, HTML
Images: JPG, PNG, GIF, BMP, SVG
Audio: MP3, WAV, M4A (transcription support)
Archives: ZIP (planned)

Performance Optimization

Batch Processing - Add multiple files in one call for efficiency
Chunk Size - Larger chunks for narrative content, smaller for technical
Reranking - Disable for faster search when precision isn’t critical
Embedding Cache - Reuse embeddings for duplicate content

Best Practices

Choose Appropriate Chunking - Semantic for varied content, recursive for structured
Set Meaningful Metadata - Use metadata for filtering and organization
Regular Cleanup - Delete outdated knowledge to maintain relevance
Monitor Storage - Check vector store size for large knowledge bases
Test Retrieval Quality - Verify search results match expectations ======= title: “Knowledge” sidebarTitle: “Knowledge” description: “Knowledge base management and vector storage for RAG applications” icon: “book”

Overview

The Knowledge module provides powerful knowledge base management and vector storage capabilities for building RAG (Retrieval-Augmented Generation) applications. It supports multiple file formats, various chunking strategies, and semantic search with optional reranking.

Quick Start

Install praisonaiagents

pip install praisonaiagents

Import and initialize Knowledge

from praisonaiagents import Knowledge

# Initialize knowledge base
knowledge = Knowledge()

# Add knowledge from various sources
knowledge.add("path/to/document.pdf")
knowledge.add("https://example.com/article")
knowledge.add("Direct text content about AI")

# Search the knowledge base
results = knowledge.search("What is machine learning?")

for result in results:
    print(f"Content: {result['text']}")
    print(f"Source: {result['source']}")
    print(f"Score: {result['score']}")

Advanced configuration

from praisonaiagents import Knowledge

# Configure with custom settings
knowledge = Knowledge(
    collection_name="technical_docs",
    storage_path="./my_knowledge_base",
    chunk_size=512,
    chunk_overlap=50,
    chunking_strategy="semantic",
    embedding_model="all-MiniLM-L6-v2",
    use_reranking=True,
    reranking_model="ms-marco-MiniLM-L-6-v2"
)

# Add multiple documents
docs = ["doc1.pdf", "doc2.md", "doc3.txt"]
for doc in docs:
    knowledge.add(doc)

# Search with custom limit
results = knowledge.search(
    "explain neural networks",
    limit=10
)

Key Concepts

API Reference

Constructor

Knowledge(
    collection_name: str = "knowledge_base",
    storage_path: Optional[str] = None,
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    chunking_strategy: str = "recursive",
    embedding_model: str = "all-MiniLM-L6-v2",
    use_reranking: bool = False,
    reranking_model: str = "ms-marco-MiniLM-L-6-v2"
)

Parameters

collection_name

str

default:"knowledge_base"

Name of the ChromaDB collection

storage_path

Optional[str]

Path for persistent storage (defaults to .praison/chroma_db)

chunk_size

int

default:"1000"

Size of text chunks for processing

chunk_overlap

int

default:"200"

Overlap between consecutive chunks

chunking_strategy

str

default:"recursive"

Strategy for splitting text (see Chunking Strategies section)

embedding_model

str

default:"all-MiniLM-L6-v2"

Model for generating embeddings

use_reranking

bool

default:"False"

Whether to use reranking for search results

reranking_model

str

default:"ms-marco-MiniLM-L-6-v2"

Model for reranking search results

Methods

add()

Add content to the knowledge base from various sources.

add(source: Union[str, Path]) -> bool

Parameters:

source - File path, URL, or direct text content

Returns:

bool - Success status

Supported formats:

Documents: PDF, DOCX, PPTX
Spreadsheets: XLSX, XLS, CSV
Images: PNG, JPG, JPEG
Web: HTML, URLs
Text: TXT, MD, Python, JavaScript, etc.

search()

Search the knowledge base for relevant content.

search(query: str, limit: int = 5) -> List[Dict[str, Any]]

Parameters:

query - Search query
limit - Maximum number of results

Returns:

List of dictionaries containing:
- text - Content chunk
- source - Original source
- score - Relevance score
- metadata - Additional metadata

get_context()

Get formatted context for a query (useful for agents).

get_context(query: str, max_results: int = 3) -> str

Parameters:

query - Search query
max_results - Maximum results to include

Returns:

Formatted string with relevant context

clear()

Clear all content from the knowledge base.

clear() -> None

get_stats()

Get statistics about the knowledge base.

get_stats() -> Dict[str, Any]

Returns:

Dictionary with:
- total_chunks - Number of stored chunks
- sources - List of unique sources
- collection_name - Name of the collection
- storage_path - Path to storage

Chunking Strategies

The knowledge module supports multiple chunking strategies for different use cases:

knowledge = Knowledge(chunking_strategy="token")

Splits text based on token count. Best for:

Consistent chunk sizes for LLM processing
Language model token limit management

Integration Examples

With Agents

from praisonaiagents import Agent, Knowledge

# Create knowledge base
knowledge = Knowledge()
knowledge.add("company_handbook.pdf")
knowledge.add("product_documentation.md")

# Create agent with knowledge
agent = Agent(
    name="DocAssistant",
    role="Documentation Expert",
    goal="Answer questions using the knowledge base",
    knowledge=knowledge
)

# Agent automatically uses knowledge for responses
response = agent.chat("What is our refund policy?")

RAG Application

from praisonaiagents import Agent, Task, Knowledge

# Build comprehensive knowledge base
knowledge = Knowledge(
    collection_name="tech_support",
    use_reranking=True
)

# Add multiple sources
sources = [
    "troubleshooting_guide.pdf",
    "faq.md",
    "https://docs.example.com/api",
    "Common issues:\n1. Login problems\n2. Performance issues"
]

for source in sources:
    knowledge.add(source)

# Create RAG agent
rag_agent = Agent(
    name="SupportBot",
    role="Technical Support AI",
    instructions=f"""You are a helpful support agent.
    Always search the knowledge base before answering.
    Cite your sources when providing information.""",
    knowledge=knowledge
)

# Create support task
task = Task(
    description="Help user with: {query}",
    agent=rag_agent,
    expected_output="Clear solution with source citations"
)

from praisonaiagents import Knowledge, Agent, PraisonAIAgents

# Shared knowledge base
shared_kb = Knowledge(collection_name="shared_research")

# Add research papers
papers = ["paper1.pdf", "paper2.pdf", "review.pdf"]
for paper in papers:
    shared_kb.add(paper)

# Create multiple agents sharing knowledge
researcher = Agent(
    name="Researcher",
    role="Research Analyst",
    knowledge=shared_kb
)

writer = Agent(
    name="Writer",
    role="Technical Writer",
    knowledge=shared_kb
)

reviewer = Agent(
    name="Reviewer",
    role="Peer Reviewer",
    knowledge=shared_kb
)

# All agents can access the same knowledge

Custom Processing Pipeline

from praisonaiagents import Knowledge
import logging

class CustomKnowledge(Knowledge):
    def process_before_adding(self, text: str, source: str) -> str:
        """Custom preprocessing"""
        # Clean text
        text = text.strip()
        
        # Add source header
        text = f"Source: {source}\n\n{text}"
        
        # Log processing
        logging.info(f"Processing {source}")
        
        return text
    
    def filter_search_results(self, results: list) -> list:
        """Custom result filtering"""
        # Filter out low-score results
        filtered = [r for r in results if r['score'] > 0.7]
        
        # Sort by score
        filtered.sort(key=lambda x: x['score'], reverse=True)
        
        return filtered

# Use custom knowledge
kb = CustomKnowledge(
    chunk_size=500,
    use_reranking=True
)

Best Practices

Document Preparation

Clean documents before adding (remove headers/footers if needed)
Use appropriate formats - PDF for formatted docs, MD for technical docs
Structure content with clear headings and sections
Include metadata in document names or content

Chunking Strategy

Token-based: When working with token-limited LLMs
Sentence-based: For Q&A systems needing complete thoughts
Recursive: General purpose, good default choice
Semantic: For documents with multiple distinct topics
SDPM: For academic or highly structured content
Late chunking: When retrieval accuracy is critical

Search Optimization

Use reranking for better relevance in large knowledge bases
Tune chunk size - smaller for precise retrieval, larger for context
Optimize queries - use clear, specific search terms
Limit results appropriately to balance relevance and coverage

Storage Management

Use meaningful collection names for different knowledge domains
Implement cleanup strategies for growing knowledge bases
Monitor storage size and implement archival if needed
Backup important collections regularly

Performance Considerations

Embedding Models

The choice of embedding model affects both quality and performance:

Reranking Impact

# Without reranking - faster
knowledge = Knowledge(use_reranking=False)
results = knowledge.search("query")  # ~50ms

# With reranking - more accurate
knowledge = Knowledge(use_reranking=True)
results = knowledge.search("query")  # ~200ms

Scaling Considerations

For large knowledge bases:

Use appropriate chunk sizes - Larger chunks reduce total count
Implement batch processing for adding multiple documents
Consider sharding collections by domain or time period
Monitor memory usage with embedding models
Use persistent storage to avoid reprocessing

Troubleshooting

Common issues and solutions:

Complete Example

from praisonaiagents import Knowledge, Agent, Task, PraisonAIAgents
import os

# Initialize knowledge base with optimal settings
knowledge = Knowledge(
    collection_name="customer_support",
    storage_path="./support_kb",
    chunk_size=512,
    chunk_overlap=50,
    chunking_strategy="recursive",
    use_reranking=True
)

# Add various knowledge sources
print("Building knowledge base...")

# Documentation
for doc in os.listdir("./docs"):
    if doc.endswith(('.pdf', '.md', '.txt')):
        knowledge.add(f"./docs/{doc}")
        print(f"Added: {doc}")

# FAQs from web
knowledge.add("https://example.com/faq")

# Direct knowledge
knowledge.add("""
Customer Support Best Practices:
1. Always greet customers warmly
2. Listen actively to understand the issue
3. Search knowledge base before responding
4. Provide clear, step-by-step solutions
5. Follow up to ensure resolution
""")

# Create support agent
support_agent = Agent(
    name="CustomerSupport",
    role="Senior Support Specialist",
    goal="Resolve customer issues using knowledge base",
    backstory="Expert support agent with deep product knowledge",
    knowledge=knowledge,
    verbose=True
)

# Create escalation agent
escalation_agent = Agent(
    name="EscalationSupport",
    role="Support Manager",
    goal="Handle complex issues that need escalation",
    knowledge=knowledge
)

# Define tasks
initial_support = Task(
    description="Help customer with: {customer_query}",
    agent=support_agent,
    expected_output="Solution or escalation decision"
)

escalation_task = Task(
    description="Handle escalated issue: {issue_details}",
    agent=escalation_agent,
    expected_output="Advanced solution or workaround"
)

# Create workflow
workflow = PraisonAIAgents(
    agents=[support_agent, escalation_agent],
    tasks=[initial_support, escalation_task],
    process="sequential"
)

# Test the system
test_queries = [
    "How do I reset my password?",
    "The app crashes when I upload large files",
    "Can I get a refund for my subscription?"
]

for query in test_queries:
    print(f"\nQuery: {query}")
    
    # Search knowledge directly
    results = knowledge.search(query, limit=3)
    print(f"Found {len(results)} relevant articles")
    
    # Get agent response
    response = support_agent.chat(query)
    print(f"Response: {response}")

# Get knowledge base statistics
stats = knowledge.get_stats()
print(f"\nKnowledge Base Stats:")
print(f"Total chunks: {stats['total_chunks']}")
print(f"Sources: {len(stats['sources'])}")

Getting Started

Core Concepts

Workflows

Features

Models

Tools

Other Features

Monitoring

Developers

Getting Started (No Code)

API Reference

​Module praisonaiagents.knowledge

​Classes

​Knowledge

​Parameters

​Properties

​Methods

​CustomMemory

​Chunking

​Parameters

​Methods

​Configuration

​Vector Store Configuration

​Chunking Strategies

​1. Token Chunker ('token')

​2. Sentence Chunker ('sentence')

​3. Recursive Chunker ('recursive') - Default

​4. Semantic Chunker ('semantic')

​5. SDPM Chunker ('sdpm')

​6. Late Chunker ('late')

​Usage Examples

​Basic Knowledge Management

​Agent Integration

​Advanced Configuration

​Scoped Knowledge Retrieval

​Supported File Types

​Performance Optimization

​Best Practices

​Overview

​Quick Start

​Key Concepts

​API Reference

​Constructor

​Parameters

​Methods

​add()

​search()

​get_context()

​clear()

​get_stats()

​Chunking Strategies

​Integration Examples

​With Agents

​RAG Application

​Multi-Agent Knowledge Sharing

​Custom Processing Pipeline

​Best Practices

Document Preparation

Chunking Strategy

Search Optimization

Storage Management

​Performance Considerations

​Embedding Models

​Reranking Impact

​Scaling Considerations

​Troubleshooting

​Complete Example

Module praisonaiagents.knowledge

Classes

Knowledge

Parameters

Properties

Methods

CustomMemory

Chunking

Parameters

Methods

Configuration

Vector Store Configuration

Chunking Strategies

1. Token Chunker (`'token'`)

2. Sentence Chunker (`'sentence'`)

3. Recursive Chunker (`'recursive'`) - Default

4. Semantic Chunker (`'semantic'`)

5. SDPM Chunker (`'sdpm'`)

6. Late Chunker (`'late'`)

Usage Examples

Basic Knowledge Management

Agent Integration

Advanced Configuration

Scoped Knowledge Retrieval

Supported File Types

Performance Optimization

Best Practices

Overview

Quick Start

Key Concepts

API Reference

Constructor

Parameters

Methods

add()

search()

get_context()

clear()

get_stats()

Chunking Strategies

Integration Examples

With Agents

RAG Application

Multi-Agent Knowledge Sharing

Custom Processing Pipeline

Best Practices

Performance Considerations

Embedding Models

Reranking Impact

Scaling Considerations

Troubleshooting

Complete Example