Document Summarizer with Citations
Summarize documents with proper citations, key points, and source references.
Problem Statement
Who: Researchers, analysts, legal teams, content curators
Why: Long documents need concise summaries with verifiable citations to maintain accuracy and trust.
What You’ll Build
A recipe that reads documents, extracts key information, and produces summaries with proper citations.
Input/Output Contract
| Input | Type | Required | Description |
|---|---|---|---|
| document_path | string | Yes | Path to document (PDF/DOCX/TXT) |
| audience | string | No | Target audience: general, technical, executive, academic (default: general) |
| length | string | No | Summary length: brief, standard, detailed (default: standard) |

| Output | Type | Description |
|---|---|---|
| summary | string | Document summary with inline citations |
| key_points | array | Key points extracted from document |
| ok | boolean | Success indicator |
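The contract can be enforced before any agent runs. The following is a minimal sketch, assuming a hypothetical helper named `validate_input` (it is not part of PraisonAI); the allowed values mirror the `enum` lists in TEMPLATE.yaml.

```python
# Hypothetical helper for illustration; not part of the PraisonAI API.
ALLOWED_AUDIENCES = ("general", "technical", "executive", "academic")
ALLOWED_LENGTHS = ("brief", "standard", "detailed")


def validate_input(input_data: dict) -> dict:
    """Return a normalized copy of input_data, or raise ValueError."""
    document_path = input_data.get("document_path")
    if not isinstance(document_path, str) or not document_path:
        raise ValueError("document_path is required and must be a non-empty string")

    # Optional fields fall back to the defaults declared in the contract.
    audience = input_data.get("audience", "general")
    if audience not in ALLOWED_AUDIENCES:
        raise ValueError(f"audience must be one of {ALLOWED_AUDIENCES}, got {audience!r}")

    length = input_data.get("length", "standard")
    if length not in ALLOWED_LENGTHS:
        raise ValueError(f"length must be one of {ALLOWED_LENGTHS}, got {length!r}")

    return {"document_path": document_path, "audience": audience, "length": length}
```

Calling a check like this at the top of `run` would let an unsupported audience or length be reported as an input problem rather than surfacing later as a generic processing error.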
Prerequisites
export OPENAI_API_KEY=your_key_here
pip install praisonaiagents
Citation Accuracy: This recipe generates citations based on document content. Always verify citations against the original source before publishing. AI-generated summaries should be reviewed for accuracy.
Step-by-Step Build
1. Create Recipe Directory
mkdir -p ~/.praison/templates/document-summarizer-with-citations
cd ~/.praison/templates/document-summarizer-with-citations
2. Create TEMPLATE.yaml
name: document-summarizer-with-citations
version: "1.0.0"
description: "Summarize documents with proper citations"
author: "PraisonAI"
license: "MIT"
tags:
  - documents
  - summarization
  - citations
  - research
requires:
  env:
    - OPENAI_API_KEY
  packages:
    - praisonaiagents
inputs:
  document_path:
    type: string
    description: "Path to document (PDF, DOCX, or TXT)"
    required: true
  audience:
    type: string
    description: "Target audience for the summary"
    required: false
    default: "general"
    enum:
      - general
      - technical
      - executive
      - academic
  length:
    type: string
    description: "Summary length"
    required: false
    default: "standard"
    enum:
      - brief
      - standard
      - detailed
outputs:
  summary:
    type: string
    description: "Document summary with citations"
  key_points:
    type: array
    description: "Key points from the document"
  ok:
    type: boolean
    description: "Success indicator"
cli:
  command: "praison recipes run document-summarizer-with-citations"
  examples:
    - 'praison recipes run document-summarizer-with-citations --input ''{"document_path": "report.pdf"}'''
safety:
  dry_run_default: false
  requires_consent: false
  overwrites_files: false
  network_access: true
  pii_handling: true
3. Create recipe.py
# recipe.py
import os
from pathlib import Path

from praisonaiagents import Agent, Task, PraisonAIAgents


def run(input_data: dict, config: dict = None) -> dict:
    """Summarize document with citations."""
    document_path = input_data.get("document_path")
    audience = input_data.get("audience", "general")
    length = input_data.get("length", "standard")

    if not document_path:
        return {"ok": False, "error": {"code": "MISSING_INPUT", "message": "document_path is required"}}
    if not os.path.exists(document_path):
        return {"ok": False, "error": {"code": "FILE_NOT_FOUND", "message": f"Document not found: {document_path}"}}

    try:
        # Read document
        content = read_document(document_path)
        if not content:
            return {"ok": False, "error": {"code": "EMPTY_DOCUMENT", "message": "Document is empty or unreadable"}}

        length_guidelines = {
            "brief": "2-3 paragraphs, ~150 words",
            "standard": "4-6 paragraphs, ~300 words",
            "detailed": "8-10 paragraphs, ~600 words"
        }
        audience_guidelines = {
            "general": "Clear, accessible language avoiding jargon",
            "technical": "Include technical details and terminology",
            "executive": "Focus on business impact and decisions",
            "academic": "Formal tone with proper academic structure"
        }

        # Create analyzer agent
        analyzer = Agent(
            name="Document Analyst",
            role="Research Analyst",
            goal="Extract key information and identify citable content",
            instructions="""
            You are a research analyst.
            - Identify main themes and arguments
            - Note specific claims with page/section references
            - Extract statistics and data points
            - Find quotable passages
            - Track source sections for citations
            """,
        )

        # Create summarizer agent
        # An audience or length outside the dictionaries above raises KeyError here,
        # which the except block below reports as PROCESSING_ERROR.
        summarizer = Agent(
            name="Summary Writer",
            role="Technical Writer",
            goal=f"Write a {length} summary for {audience} audience",
            instructions=f"""
            You are a technical writer creating summaries.
            Audience: {audience} - {audience_guidelines[audience]}
            Length: {length} - {length_guidelines[length]}
            Guidelines:
            - Include inline citations [Section X] or [Page Y]
            - Maintain factual accuracy
            - Don't add information not in the source
            - Highlight key findings
            """,
        )

        # Create key points extractor
        extractor = Agent(
            name="Key Points Extractor",
            role="Information Specialist",
            goal="Extract actionable key points",
            instructions="""
            You are an information specialist.
            - Extract 5-10 key points
            - Each point should be self-contained
            - Include relevant citations
            - Prioritize by importance
            """,
        )

        # Define tasks
        # Only the first 10,000 characters of the document are sent for analysis.
        analyze_task = Task(
            name="analyze_document",
            description=f"Analyze this document and identify key content:\n\n{content[:10000]}",
            expected_output="Document analysis with citable sections",
            agent=analyzer,
        )
        summarize_task = Task(
            name="create_summary",
            description=f"Create a {length} summary for {audience} audience with citations",
            expected_output="Summary with inline citations",
            agent=summarizer,
            context=[analyze_task],
        )
        extract_task = Task(
            name="extract_key_points",
            description="Extract key points with citations",
            expected_output="List of key points",
            agent=extractor,
            context=[analyze_task],
        )

        # Execute
        agents = PraisonAIAgents(
            agents=[analyzer, summarizer, extractor],
            tasks=[analyze_task, summarize_task, extract_task],
        )
        result = agents.start()

        # Parse key points: one per non-empty line longer than 10 characters.
        # This code assumes result is a mapping keyed by task name.
        key_points_text = result.get("extract_key_points", "")
        key_points = [kp.strip() for kp in key_points_text.split("\n") if kp.strip() and len(kp.strip()) > 10]

        return {
            "ok": True,
            "summary": result.get("create_summary", ""),
            "key_points": key_points[:10],
            "artifacts": [],
            "warnings": ["Always verify citations against original document"],
        }
    except Exception as e:
        return {"ok": False, "error": {"code": "PROCESSING_ERROR", "message": str(e)}}


def read_document(path: str) -> str:
    """Read document content based on file type."""
    ext = Path(path).suffix.lower()
    if ext == ".txt":
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    elif ext == ".pdf":
        try:
            import PyPDF2
            with open(path, "rb") as f:
                reader = PyPDF2.PdfReader(f)
                text = []
                for i, page in enumerate(reader.pages):
                    # Prefix each page with a marker so citations can refer to [Page N].
                    # Guard against pages with no extractable text.
                    text.append(f"[Page {i+1}]\n{page.extract_text() or ''}")
                return "\n\n".join(text)
        except ImportError:
            # Fallback: treat as text
            with open(path, "r", encoding="utf-8", errors="ignore") as f:
                return f.read()
    elif ext in [".docx", ".doc"]:
        try:
            from docx import Document
            doc = Document(path)
            return "\n\n".join([p.text for p in doc.paragraphs])
        except ImportError:
            with open(path, "r", encoding="utf-8", errors="ignore") as f:
                return f.read()
    else:
        with open(path, "r", encoding="utf-8", errors="ignore") as f:
            return f.read()
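Because recipe.py passes only `content[:10000]` to the analysis task, anything beyond the first 10,000 characters is never seen by the agents. One possible way to cover longer documents is to split the text into chunks first. The sketch below assumes a hypothetical helper `chunk_text` (not part of the recipe or of PraisonAI) and prefers to break on the blank lines that `read_document` places between pages and paragraphs.

```python
# Hypothetical helper for illustration; not part of recipe.py or PraisonAI.
def chunk_text(content: str, max_chars: int = 10000) -> list[str]:
    """Split content into chunks of at most max_chars, preferring paragraph breaks."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks: list[str] = []
    current = ""
    for paragraph in content.split("\n\n"):
        # A single paragraph longer than the limit is hard-split.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]

        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate

    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be analyzed by its own task, with the per-chunk analyses passed as context to the summary task. How that is wired up depends on the agent framework and is not shown here.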
4. Create test_recipe.py
# test_recipe.py
import pytest
import tempfile
import os

from recipe import run, read_document


def test_missing_document_path():
    result = run({})
    assert result["ok"] is False
    assert result["error"]["code"] == "MISSING_INPUT"


def test_file_not_found():
    result = run({"document_path": "/nonexistent.pdf"})
    assert result["ok"] is False
    assert result["error"]["code"] == "FILE_NOT_FOUND"


def test_read_txt_document():
    with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
        f.write("This is test content.")
        temp_path = f.name
    try:
        content = read_document(temp_path)
        assert "test content" in content
    finally:
        os.unlink(temp_path)


def test_audience_options():
    valid_audiences = ["general", "technical", "executive", "academic"]
    for audience in valid_audiences:
        assert audience in valid_audiences


# Calls the live LLM API; requires OPENAI_API_KEY.
@pytest.mark.integration
def test_end_to_end():
    with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
        f.write("""
        Executive Summary
        This report analyzes Q4 performance. Revenue increased 15% year-over-year.
        Key findings include improved customer retention and expanded market share.
        Section 1: Financial Performance
        Revenue reached $10M in Q4, up from $8.7M in Q3.
        Section 2: Customer Metrics
        Customer satisfaction scores improved to 4.5/5.
        """)
        temp_path = f.name
    try:
        result = run({
            "document_path": temp_path,
            "audience": "executive",
            "length": "brief"
        })
        assert result["ok"] is True
        assert len(result["summary"]) > 50
        assert len(result["key_points"]) > 0
    finally:
        os.unlink(temp_path)
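The end-to-end test is tagged with a custom `integration` marker, which pytest reports as unknown unless it is registered. A minimal sketch of a `conftest.py` placed next to test_recipe.py, using pytest's standard `pytest_configure` hook; the marker description text is an assumption:

```python
# conftest.py
def pytest_configure(config):
    # Register the custom marker used by test_end_to_end so pytest
    # does not warn about an unknown mark.
    config.addinivalue_line(
        "markers",
        "integration: tests that call the live LLM API and need OPENAI_API_KEY",
    )
```

With the marker registered, `pytest -m "not integration"` runs only the offline tests and `pytest -m integration` runs only the end-to-end test.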
Run Locally
# Basic usage
praison recipes run document-summarizer-with-citations \
--input '{"document_path": "report.pdf"}'
# Executive summary
praison recipes run document-summarizer-with-citations \
--input '{"document_path": "analysis.docx", "audience": "executive", "length": "brief"}'
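When the command is launched from another program, hand-writing the quoted JSON becomes error-prone (for example, a file name containing an apostrophe breaks the single-quoted shell string). The sketch below assumes a hypothetical helper `build_command` and uses only the `--input` flag shown above; passing the arguments as a list avoids shell quoting entirely.

```python
import json

RECIPE_NAME = "document-summarizer-with-citations"


# Hypothetical helper for illustration; not part of the praison CLI.
def build_command(input_data: dict) -> list[str]:
    """Build the argument list for `praison recipes run` with a JSON --input payload."""
    return [
        "praison", "recipes", "run", RECIPE_NAME,
        "--input", json.dumps(input_data),
    ]
```

The resulting list can be passed directly to `subprocess.run(build_command({...}), check=True)` without `shell=True`.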
Deploy & Integrate: 6 Integration Models
Model 1: Embedded SDK
from praisonai import recipe

result = recipe.run(
    "document-summarizer-with-citations",
    input={
        "document_path": "research_paper.pdf",
        "audience": "academic",
        "length": "detailed"
    }
)
if result.ok:
    print(result.output["summary"])
    print("\nKey Points:")
    for point in result.output["key_points"]:
        print(f" • {point}")
Safety: Documents may contain confidential information. Process securely.
Model 2: CLI Invocation
praison recipes run document-summarizer-with-citations \
  --input '{"document_path": "doc.pdf", "length": "brief"}' \
  --json | jq '.summary'
Model 3: Plugin Mode
class SummarizerPlugin:
    def summarize(self, doc_path, audience="general"):
        from praisonai import recipe
        return recipe.run(
            "document-summarizer-with-citations",
            input={"document_path": doc_path, "audience": audience}
        )
Model 4: Local HTTP Sidecar
const response = await fetch('http://localhost:8765/recipes/document-summarizer-with-citations/run', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    document_path: '/uploads/report.pdf',
    audience: 'executive'
  })
});
Model 5: Remote Managed Runner
import requests

# api_key: bearer token for the remote runner, supplied by the caller
response = requests.post(
    "https://api.doc-tools.com/summarize",
    headers={"Authorization": f"Bearer {api_key}"},
    json={"document_url": "https://cdn.example.com/report.pdf"}
)
Model 6: Event-Driven
# queue: a message-queue client provided by your application
def on_document_uploaded(event):
    queue.send({
        "recipe": "document-summarizer-with-citations",
        "input": {"document_path": event['file_path']},
        "callback_url": f"https://api.example.com/docs/{event['doc_id']}/summary"
    })
Troubleshooting
PDF extraction fails
Install PyPDF2: pip install PyPDF2. Without it, recipe.py falls back to reading the PDF as plain text, which does not yield usable content. For complex or scanned PDFs, consider pre-processing with OCR.
Citations seem incorrect
AI-generated citations are approximations. Always verify against the original document before use.
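Part of that verification can be automated for PDFs, because `read_document` prefixes each page with a `[Page N]` marker and the summarizer is instructed to cite in the same form. The sketch below assumes a hypothetical helper `find_unsupported_page_citations`; it only checks that every cited page number exists in the extracted text, not that the cited page actually supports the claim.

```python
import re

PAGE_MARKER = re.compile(r"\[Page (\d+)\]")


# Hypothetical helper for illustration; not part of recipe.py or PraisonAI.
def find_unsupported_page_citations(summary: str, document_text: str) -> list[str]:
    """Return page citations in summary that have no matching marker in document_text."""
    available = set(PAGE_MARKER.findall(document_text))
    missing: list[str] = []
    for page in PAGE_MARKER.findall(summary):
        citation = f"[Page {page}]"
        # Report each unknown page once, in order of first appearance.
        if page not in available and citation not in missing:
            missing.append(citation)
    return missing
```

An empty result means every `[Page N]` citation points at a page that exists; a human reviewer still needs to confirm that the cited passage says what the summary claims. `[Section X]` citations are not covered by this check.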
Next Steps
- Meeting Minutes Action Items - Summarize meeting transcripts
- Customer Support Reply Drafter - Draft responses based on documentation

