Semantic chunking uses embeddings to split text at semantic boundaries, grouping related content together for better retrieval.
Quick Start
Agent with Semantic Chunking
from praisonaiagents import Agent

agent = Agent(
    instructions="Answer questions from research papers.",
    knowledge={
        "sources": ["papers/"],
        "chunker": {
            "type": "semantic",
            "chunk_size": 512,
            "embedding_model": "all-MiniLM-L6-v2"
        }
    }
)

response = agent.start("What methodology did they use?")
When to Use
Good For
Research papers
Topic-dense content
Multi-subject documents
Quality over speed
Consider Alternatives
Speed-critical pipelines
Uniform chunk sizes needed
Simple structured content
Very short documents
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `chunk_size` | int | 512 | Max tokens per chunk |
| `embedding_model` | str | auto | Model for semantic similarity |
Examples
Research Analysis
agent = Agent(
    instructions="Analyze academic papers.",
    knowledge={
        "sources": ["research/"],
        "chunker": {
            "type": "semantic",
            "chunk_size": 512,
            "embedding_model": "all-MiniLM-L6-v2"
        }
    }
)
Knowledge Base
agent = Agent(
    instructions="Answer from knowledge base.",
    knowledge={
        "sources": ["wiki/", "faq.txt"],
        "chunker": {
            "type": "semantic",
            "chunk_size": 256  # Smaller for precise retrieval
        }
    }
)
How It Works
Semantic chunking proceeds in four steps:
1. Splits the document into sentences
2. Generates embeddings for each sentence
3. Groups consecutive similar sentences
4. Creates a new chunk when the topic changes
Semantic chunking requires computing embeddings and is slower than token/sentence chunking. Use for quality-sensitive applications where retrieval accuracy matters more than speed.
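The steps above can be sketched in plain Python. This is an illustration only, not the library's implementation: it uses toy bag-of-words vectors in place of a real embedding model, and the sentence splitter, `threshold` value, and example text are all assumptions for demonstration.

```python
# Minimal sketch of semantic chunking: split into sentences, embed each,
# and start a new chunk when similarity to the previous sentence drops.
# Toy bag-of-words "embeddings" stand in for a real sentence-transformer.
import math
import re
from collections import Counter

def embed(sentence):
    # Toy embedding: lowercase word counts (a real model returns dense vectors).
    return Counter(re.findall(r"\w+", sentence.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(text, threshold=0.2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        # New chunk when the topic shifts (similarity falls below threshold).
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append([cur])
        else:
            chunks[-1].append(cur)
    return [" ".join(c) for c in chunks]

text = ("Transformers use attention. Attention layers weigh token pairs. "
        "The dataset has 10k images. Images were resized to 224 pixels.")
print(semantic_chunks(text, threshold=0.1))
```

With the sample text above, the two attention sentences share vocabulary and land in one chunk, while the dataset sentences form a second chunk. A production chunker additionally caps each chunk at `chunk_size` tokens.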
Embedding Models
The default embedding model is all-MiniLM-L6-v2. You can use any model supported by the chonkie library:
knowledge = {
    "sources": ["docs/"],
    "chunker": {
        "type": "semantic",
        "embedding_model": "all-MiniLM-L6-v2"  # Default
        # Or: "text-embedding-3-small", etc.
    }
}