Document Summarizer with Citations

Summarize documents with proper citations, key points, and source references.

Problem Statement

Who: Researchers, analysts, legal teams, content curators
Why: Long documents need concise summaries with verifiable citations to maintain accuracy and trust.

What You’ll Build

A recipe that reads documents, extracts key information, and produces summaries with proper citations.

Input/Output Contract

| Input | Type | Required | Description |
|---|---|---|---|
| document_path | string | Yes | Path to document (PDF/DOCX/TXT) |
| audience | string | No | Target audience (default: general) |
| length | string | No | Summary length: brief, standard, detailed |

| Output | Type | Description |
|---|---|---|
| summary | string | Document summary with inline citations |
| key_points | array | Key points extracted from document |
| ok | boolean | Success indicator |
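
As a rough illustration of the contract above, the input and output shapes can be written as Python TypedDicts. The class names `SummarizerInput` and `SummarizerOutput` are hypothetical, and the example values are placeholders rather than real recipe output.

```python
from typing import List, TypedDict


class SummarizerInput(TypedDict, total=False):
    """Hypothetical name; mirrors the input table."""
    document_path: str  # required
    audience: str       # optional, default "general"
    length: str         # optional: "brief", "standard", or "detailed"


class SummarizerOutput(TypedDict, total=False):
    """Hypothetical name; mirrors the output table."""
    summary: str
    key_points: List[str]
    ok: bool


# Placeholder values, shown only to illustrate the shapes.
example_input: SummarizerInput = {"document_path": "report.pdf", "audience": "executive"}
example_output: SummarizerOutput = {
    "ok": True,
    "summary": "Placeholder summary text [Section 1].",
    "key_points": ["Placeholder key point [Section 1]"],
}
```

TypedDicts add no runtime checks; they only let a type checker flag misspelled keys.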

Prerequisites

export OPENAI_API_KEY=your_key_here
pip install praisonaiagents
Citation Accuracy: This recipe generates citations based on document content. Always verify citations against the original source before publishing. AI-generated summaries should be reviewed for accuracy.
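
The environment variable and package listed in the prerequisites can be checked before the first run. The helper below is a minimal sketch; `check_prerequisites` is a hypothetical name and not part of PraisonAI.

```python
import importlib.util
import os


def check_prerequisites(env=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    # find_spec returns None when a top-level package cannot be imported
    if importlib.util.find_spec("praisonaiagents") is None:
        problems.append("praisonaiagents is not installed (pip install praisonaiagents)")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(f"Missing prerequisite: {problem}")
```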

Step-by-Step Build

1. Create Recipe Directory

mkdir -p ~/.praison/templates/document-summarizer-with-citations
cd ~/.praison/templates/document-summarizer-with-citations

2. Create TEMPLATE.yaml

name: document-summarizer-with-citations
version: "1.0.0"
description: "Summarize documents with proper citations"
author: "PraisonAI"
license: "MIT"

tags:
  - documents
  - summarization
  - citations
  - research

requires:
  env:
    - OPENAI_API_KEY
  packages:
    - praisonaiagents

inputs:
  document_path:
    type: string
    description: "Path to document (PDF, DOCX, or TXT)"
    required: true
  audience:
    type: string
    description: "Target audience for the summary"
    required: false
    default: "general"
    enum:
      - general
      - technical
      - executive
      - academic
  length:
    type: string
    description: "Summary length"
    required: false
    default: "standard"
    enum:
      - brief
      - standard
      - detailed

outputs:
  summary:
    type: string
    description: "Document summary with citations"
  key_points:
    type: array
    description: "Key points from the document"
  ok:
    type: boolean
    description: "Success indicator"

cli:
  command: "praison recipes run document-summarizer-with-citations"
  examples:
    - 'praison recipes run document-summarizer-with-citations --input ''{"document_path": "report.pdf"}'''

safety:
  dry_run_default: false
  requires_consent: false
  overwrites_files: false
  network_access: true
  pii_handling: true
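
How the PraisonAI runner enforces `required`, `default`, and `enum` is not shown on this page. The sketch below is one plausible reading of the `inputs` section, written against a plain dict that mirrors the YAML so no parser is needed. `INPUT_SCHEMA` and `normalize_input` are hypothetical names.

```python
# Mirrors the `inputs` section of TEMPLATE.yaml as a plain dict (assumption:
# the runner applies defaults first, then checks required and enum rules).
INPUT_SCHEMA = {
    "document_path": {"required": True},
    "audience": {"required": False, "default": "general",
                 "enum": ["general", "technical", "executive", "academic"]},
    "length": {"required": False, "default": "standard",
               "enum": ["brief", "standard", "detailed"]},
}


def normalize_input(raw: dict) -> dict:
    """Apply defaults and check required/enum rules; raise ValueError on violations."""
    result = {}
    for name, spec in INPUT_SCHEMA.items():
        value = raw.get(name, spec.get("default"))
        if value is None:
            if spec["required"]:
                raise ValueError(f"{name} is required")
            continue
        allowed = spec.get("enum")
        if allowed and value not in allowed:
            raise ValueError(f"{name} must be one of {allowed}, got {value!r}")
        result[name] = value
    return result
```

Calling `normalize_input({"document_path": "report.pdf"})` fills in `audience` and `length` with the defaults declared in the template.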

3. Create recipe.py

# recipe.py
import os
from pathlib import Path
from praisonaiagents import Agent, Task, PraisonAIAgents

def run(input_data: dict, config: dict = None) -> dict:
    """Summarize document with citations."""
    document_path = input_data.get("document_path")
    audience = input_data.get("audience", "general")
    length = input_data.get("length", "standard")
    
    if not document_path:
        return {"ok": False, "error": {"code": "MISSING_INPUT", "message": "document_path is required"}}
    
    if not os.path.exists(document_path):
        return {"ok": False, "error": {"code": "FILE_NOT_FOUND", "message": f"Document not found: {document_path}"}}
    
    try:
        # Read document
        content = read_document(document_path)
        if not content:
            return {"ok": False, "error": {"code": "EMPTY_DOCUMENT", "message": "Document is empty or unreadable"}}
        
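        length_guidelines = {
            "brief": "2-3 paragraphs, ~150 words",
            "standard": "4-6 paragraphs, ~300 words",
            "detailed": "8-10 paragraphs, ~600 words"
        }
        
        audience_guidelines = {
            "general": "Clear, accessible language avoiding jargon",
            "technical": "Include technical details and terminology",
            "executive": "Focus on business impact and decisions",
            "academic": "Formal tone with proper academic structure"
        }
        
        # Reject values outside the enums declared in TEMPLATE.yaml instead of
        # failing later with a KeyError when the prompts are built.
        if audience not in audience_guidelines or length not in length_guidelines:
            return {"ok": False, "error": {"code": "INVALID_INPUT", "message": f"Unsupported audience or length: {audience!r}, {length!r}"}}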
        
        # Create analyzer agent
        analyzer = Agent(
            name="Document Analyst",
            role="Research Analyst",
            goal="Extract key information and identify citable content",
            instructions="""
            You are a research analyst.
            - Identify main themes and arguments
            - Note specific claims with page/section references
            - Extract statistics and data points
            - Find quotable passages
            - Track source sections for citations
            """,
        )
        
        # Create summarizer agent
        summarizer = Agent(
            name="Summary Writer",
            role="Technical Writer",
            goal=f"Write a {length} summary for {audience} audience",
            instructions=f"""
            You are a technical writer creating summaries.
            Audience: {audience} - {audience_guidelines[audience]}
            Length: {length} - {length_guidelines[length]}
            
            Guidelines:
            - Include inline citations [Section X] or [Page Y]
            - Maintain factual accuracy
            - Don't add information not in the source
            - Highlight key findings
            """,
        )
        
        # Create key points extractor
        extractor = Agent(
            name="Key Points Extractor",
            role="Information Specialist",
            goal="Extract actionable key points",
            instructions="""
            You are an information specialist.
            - Extract 5-10 key points
            - Each point should be self-contained
            - Include relevant citations
            - Prioritize by importance
            """,
        )
        
        # Define tasks
        analyze_task = Task(
            name="analyze_document",
            description=f"Analyze this document and identify key content:\n\n{content[:10000]}",
            expected_output="Document analysis with citable sections",
            agent=analyzer,
        )
        
        summarize_task = Task(
            name="create_summary",
            description=f"Create a {length} summary for {audience} audience with citations",
            expected_output="Summary with inline citations",
            agent=summarizer,
            context=[analyze_task],
        )
        
        extract_task = Task(
            name="extract_key_points",
            description="Extract key points with citations",
            expected_output="List of key points",
            agent=extractor,
            context=[analyze_task],
        )
        
        # Execute
        agents = PraisonAIAgents(
            agents=[analyzer, summarizer, extractor],
            tasks=[analyze_task, summarize_task, extract_task],
        )
        
        result = agents.start()
        
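        # Depending on the praisonaiagents version, start() may return a dict or only
        # the final task's text, so prefer each task's own result when it is set.
        def task_output(task, key):
            task_result = getattr(task, "result", None)
            if task_result is not None:
                return getattr(task_result, "raw", None) or str(task_result)
            return result.get(key, "") if isinstance(result, dict) else ""
        
        # Parse key points: one per line, skipping blank and very short lines
        key_points_text = task_output(extract_task, "extract_key_points")
        key_points = [kp.strip() for kp in key_points_text.split("\n") if len(kp.strip()) > 10]
        
        return {
            "ok": True,
            "summary": task_output(summarize_task, "create_summary"),
            "key_points": key_points[:10],
            "artifacts": [],
            "warnings": ["Always verify citations against original document"],
        }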
        
    except Exception as e:
        return {"ok": False, "error": {"code": "PROCESSING_ERROR", "message": str(e)}}


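def read_document(path: str) -> str:
    """Read document content based on file type."""
    ext = Path(path).suffix.lower()
    
    if ext == ".txt":
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    
    if ext == ".pdf":
        try:
            import PyPDF2
        except ImportError as exc:
            # Decoding a binary PDF as text yields garbage, so fail with a clear message.
            raise RuntimeError("PyPDF2 is required to read PDF files: pip install PyPDF2") from exc
        with open(path, "rb") as f:
            reader = PyPDF2.PdfReader(f)
            # Tag each page so the agents can cite [Page N]; extract_text() can
            # come back empty for image-only pages.
            pages = [f"[Page {i + 1}]\n{page.extract_text() or ''}" for i, page in enumerate(reader.pages)]
        return "\n\n".join(pages)
    
    if ext == ".docx":
        try:
            from docx import Document  # provided by the python-docx package
        except ImportError as exc:
            raise RuntimeError("python-docx is required to read DOCX files: pip install python-docx") from exc
        return "\n\n".join(p.text for p in Document(path).paragraphs)
    
    if ext == ".doc":
        # python-docx only reads .docx; legacy binary .doc files must be converted first.
        raise ValueError("Legacy .doc files are not supported; convert to .docx")
    
    # Any other extension: read as UTF-8 text, ignoring undecodable bytes
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        return f.read()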
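
The key-point parsing in `run()` keeps whatever list markers the model emits ("- ", "1. ", and so on). A slightly more forgiving parser could strip them first. This is a sketch only; `parse_key_points` is a hypothetical helper and is not called by the recipe above.

```python
import re

# Leading list markers such as "- ", "* ", "• ", "1. " or "2) "
_MARKER = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")


def parse_key_points(text: str, limit: int = 10, min_length: int = 10) -> list:
    """Split model output into key points, dropping list markers and short lines."""
    points = []
    for line in text.splitlines():
        cleaned = _MARKER.sub("", line).strip()
        # Same threshold as run(): keep only lines longer than min_length characters
        if len(cleaned) > min_length:
            points.append(cleaned)
    return points[:limit]
```

The marker pattern requires whitespace after the marker, so a point that begins with a figure such as "4.5/5" is left intact.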

4. Create test_recipe.py

# test_recipe.py
import pytest
import tempfile
import os
from recipe import run, read_document

def test_missing_document_path():
    result = run({})
    assert result["ok"] is False
    assert result["error"]["code"] == "MISSING_INPUT"

def test_file_not_found():
    result = run({"document_path": "/nonexistent.pdf"})
    assert result["ok"] is False
    assert result["error"]["code"] == "FILE_NOT_FOUND"

def test_read_txt_document():
    with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
        f.write("This is test content.")
        temp_path = f.name
    
    try:
        content = read_document(temp_path)
        assert "test content" in content
    finally:
        os.unlink(temp_path)

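def test_invalid_audience_rejected():
    # An audience outside the TEMPLATE.yaml enum must not produce a summary
    with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
        f.write("This is test content.")
        temp_path = f.name
    
    try:
        result = run({"document_path": temp_path, "audience": "unknown"})
        assert result["ok"] is False
    finally:
        os.unlink(temp_path)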

@pytest.mark.integration
def test_end_to_end():
    with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
        f.write("""
        Executive Summary
        
        This report analyzes Q4 performance. Revenue increased 15% year-over-year.
        Key findings include improved customer retention and expanded market share.
        
        Section 1: Financial Performance
        Revenue reached $10M in Q4, up from $8.7M in Q3.
        
        Section 2: Customer Metrics
        Customer satisfaction scores improved to 4.5/5.
        """)
        temp_path = f.name
    
    try:
        result = run({
            "document_path": temp_path,
            "audience": "executive",
            "length": "brief"
        })
        
        assert result["ok"] is True
        assert len(result["summary"]) > 50
        assert len(result["key_points"]) > 0
    finally:
        os.unlink(temp_path)
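
The `integration` marker used on `test_end_to_end` is a custom pytest marker, and pytest warns about markers it does not know. One way to register it is a small `conftest.py` next to `test_recipe.py`. The file and the marker description below are assumptions, not part of the original recipe.

```python
# conftest.py (hypothetical file placed next to test_recipe.py)


def pytest_configure(config):
    """Register the custom marker so pytest does not warn about it."""
    config.addinivalue_line(
        "markers", "integration: tests that call the OpenAI API and need OPENAI_API_KEY"
    )
```

With the marker registered, `pytest -m "not integration"` runs only the offline tests, and `pytest -m integration` runs the end-to-end test.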

Run Locally

# Basic usage
praison recipes run document-summarizer-with-citations \
  --input '{"document_path": "report.pdf"}'

# Executive summary
praison recipes run document-summarizer-with-citations \
  --input '{"document_path": "analysis.docx", "audience": "executive", "length": "brief"}'
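
Hand-writing JSON inside shell quotes gets error-prone once paths contain spaces or quotes. The sketch below builds the same command from Python using only the CLI form shown above. `build_command` is a hypothetical helper.

```python
import json
import shlex


def build_command(document_path, audience=None, length=None):
    """Return the argv list for `praison recipes run` with a JSON --input payload."""
    payload = {"document_path": document_path}
    if audience:
        payload["audience"] = audience
    if length:
        payload["length"] = length
    return [
        "praison", "recipes", "run", "document-summarizer-with-citations",
        "--input", json.dumps(payload),
    ]


argv = build_command("analysis.docx", audience="executive", length="brief")
print(shlex.join(argv))  # shell-quoted form, suitable for copy and paste
```

The list can be passed directly to `subprocess.run(argv, check=True)`, which avoids shell quoting altogether.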

Deploy & Integrate: 6 Integration Models

from praisonai import recipe

result = recipe.run(
    "document-summarizer-with-citations",
    input={
        "document_path": "research_paper.pdf",
        "audience": "academic",
        "length": "detailed"
    }
)

if result.ok:
    print(result.output["summary"])
    print("\nKey Points:")
    for point in result.output["key_points"]:
        print(f"  • {point}")
Safety: Documents may contain confidential information. Process securely.
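
When `run()` from recipe.py is called directly, it returns a plain dict that carries either the summary or an `error` entry with `code` and `message`. The helper below is a sketch for rendering both cases; `describe_result` is a hypothetical name, and how the `recipe.run` result object exposes errors is not shown on this page.

```python
def describe_result(result: dict) -> str:
    """Render the dict returned by run() as text, covering success and failure."""
    if not result.get("ok"):
        error = result.get("error", {})
        return f"Failed [{error.get('code', 'UNKNOWN')}]: {error.get('message', 'no message')}"
    lines = [result.get("summary", ""), "", "Key Points:"]
    lines += [f"  • {point}" for point in result.get("key_points", [])]
    lines += [f"Warning: {warning}" for warning in result.get("warnings", [])]
    return "\n".join(lines)
```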

Troubleshooting

PDF text is missing or garbled: install PyPDF2 with pip install PyPDF2. For scanned or complex PDFs, consider pre-processing with OCR.
Citations look inaccurate: AI-generated citations are approximations. Always verify them against the original document before use.
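
Part of that verification can be automated for PDFs, because `read_document` tags each page with a `[Page N]` marker and the summarizer is asked to cite in the same form. The sketch below flags cited pages that do not exist in the source text. `find_unknown_page_citations` is a hypothetical helper, and it does not check whether the cited page actually supports the claim.

```python
import re

PAGE_MARKER = re.compile(r"\[Page (\d+)\]")


def find_unknown_page_citations(summary: str, document_text: str) -> list:
    """Return cited page numbers that have no matching [Page N] marker in the source."""
    known = {int(n) for n in PAGE_MARKER.findall(document_text)}
    cited = {int(n) for n in PAGE_MARKER.findall(summary)}
    return sorted(cited - known)
```

An empty list means every cited page exists; manual review of the claims themselves is still needed.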

Next Steps