# Document Summarizer with Citations

> Summarize documents with proper citations and key point extraction

Summarize documents with proper citations, key points, and source references.

## Problem Statement

**Who:** Researchers, analysts, legal teams, content curators\
**Why:** Long documents need concise summaries with verifiable citations to maintain accuracy and trust.

## What You'll Build

A recipe that reads documents, extracts key information, and produces summaries with proper citations.

```mermaid theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
graph LR
    Input[📄 Document] --> Parse[Parse Content]
    Parse --> Analyze[Analyze & Extract]
    Analyze --> Summarize[Summarize]
    Summarize --> Cite[Add Citations]
    Cite --> Output[📝 Summary + Citations]

    classDef input fill:#8B0000,stroke:#7C90A0,color:#fff
    classDef process fill:#189AB4,stroke:#7C90A0,color:#fff

    class Input,Output input
    class Parse,Analyze,Summarize,Cite process
```

### Input/Output Contract

| Input           | Type   | Required | Description                                     |
| --------------- | ------ | -------- | ----------------------------------------------- |
| `document_path` | string | Yes      | Path to document (PDF/DOCX/TXT)                 |
| `audience`      | string | No       | Target audience (default: `general`)            |
| `length`        | string | No       | Summary length: `brief`, `standard`, `detailed` |

| Output       | Type    | Description                            |
| ------------ | ------- | -------------------------------------- |
| `summary`    | string  | Document summary with inline citations |
| `key_points` | array   | Key points extracted from document     |
| `ok`         | boolean | Success indicator                      |
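
The contract above can be checked before any model call is made. The following is a minimal sketch, not part of the recipe as published: `validate_input` is a hypothetical helper, and the allowed values and defaults are assumed to match the `enum` and `default` entries in `TEMPLATE.yaml`.

```python
from __future__ import annotations

# Hypothetical helper: checks a recipe input dict against the contract table.
# Allowed values and defaults mirror the enum/default entries in TEMPLATE.yaml.
ALLOWED = {
    "audience": ("general", "technical", "executive", "academic"),
    "length": ("brief", "standard", "detailed"),
}
DEFAULTS = {"audience": "general", "length": "standard"}


def validate_input(input_data: dict) -> tuple[dict, list[str]]:
    """Return (normalized_input, errors); an empty error list means the input is valid."""
    errors = []
    normalized = {**DEFAULTS, **input_data}  # fill in optional fields

    path = normalized.get("document_path")
    if not isinstance(path, str) or not path:
        errors.append("document_path is required and must be a non-empty string")

    for field, allowed in ALLOWED.items():
        if normalized[field] not in allowed:
            errors.append(f"{field} must be one of {list(allowed)}")

    return normalized, errors
```

Returning a list of errors rather than raising on the first one lets a caller report every problem with the input in a single response.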

## Prerequisites

```bash theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
export OPENAI_API_KEY=your_key_here
pip install praisonaiagents
# PDF and DOCX parsers imported by read_document in recipe.py
pip install PyPDF2 python-docx
```

<Warning>
  **Citation Accuracy:** This recipe generates citations based on document content. Always verify citations against the original source before publishing. AI-generated summaries should be reviewed for accuracy.
</Warning>
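
Part of that verification can be automated. The sketch below assumes the summary uses the `[Page Y]` marker format that the summarizer agent is instructed to produce, and that the page count of the source is known. `find_invalid_page_citations` is a hypothetical helper, not part of PraisonAI. It only flags page numbers that cannot exist; it cannot confirm that a cited page actually supports the claim.

```python
from __future__ import annotations

import re

# Matches "[Page 12]" style markers (case-insensitive).
PAGE_CITATION = re.compile(r"\[Page\s+(\d+)\]", re.IGNORECASE)


def find_invalid_page_citations(summary: str, page_count: int) -> list[int]:
    """Return cited page numbers outside 1..page_count, sorted and de-duplicated."""
    cited = {int(m.group(1)) for m in PAGE_CITATION.finditer(summary)}
    return sorted(p for p in cited if p < 1 or p > page_count)


summary = "Revenue grew 15% [Page 2]. Retention improved [Page 9]."
print(find_invalid_page_citations(summary, page_count=4))  # prints [9]
```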

## Step-by-Step Build

<Steps>
  <Step title="Create Recipe Directory">
    ```bash theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
    mkdir -p ~/.praisonai/templates/document-summarizer-with-citations
    cd ~/.praisonai/templates/document-summarizer-with-citations
    ```
  </Step>

  <Step title="Create TEMPLATE.yaml">
    ```yaml theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
    name: document-summarizer-with-citations
    version: "1.0.0"
    description: "Summarize documents with proper citations"
    author: "PraisonAI"
    license: "MIT"

    tags:
      - documents
      - summarization
      - citations
      - research

    requires:
      env:
        - OPENAI_API_KEY
      packages:
        - praisonaiagents

    inputs:
      document_path:
        type: string
        description: "Path to document (PDF, DOCX, or TXT)"
        required: true
      audience:
        type: string
        description: "Target audience for the summary"
        required: false
        default: "general"
        enum:
          - general
          - technical
          - executive
          - academic
      length:
        type: string
        description: "Summary length"
        required: false
        default: "standard"
        enum:
          - brief
          - standard
          - detailed

    outputs:
      summary:
        type: string
        description: "Document summary with citations"
      key_points:
        type: array
        description: "Key points from the document"
      ok:
        type: boolean
        description: "Success indicator"

    cli:
      command: "praison recipes run document-summarizer-with-citations"
      examples:
        - 'praison recipes run document-summarizer-with-citations --input ''{"document_path": "report.pdf"}'''

    safety:
      dry_run_default: false
      requires_consent: false
      overwrites_files: false
      network_access: true
      pii_handling: true
    ```
  </Step>

  <Step title="Create recipe.py">
    ```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
    # recipe.py
    import os
    from pathlib import Path
    from praisonaiagents import Agent, Task, AgentTeam

    def run(input_data: dict, config: dict = None) -> dict:
        """Summarize document with citations."""
        document_path = input_data.get("document_path")
        audience = input_data.get("audience", "general")
        length = input_data.get("length", "standard")
        
        if not document_path:
            return {"ok": False, "error": {"code": "MISSING_INPUT", "message": "document_path is required"}}
        
        if not os.path.exists(document_path):
            return {"ok": False, "error": {"code": "FILE_NOT_FOUND", "message": f"Document not found: {document_path}"}}
        
        try:
            # Read document
            content = read_document(document_path)
            if not content:
                return {"ok": False, "error": {"code": "EMPTY_DOCUMENT", "message": "Document is empty or unreadable"}}
            
            length_guidelines = {
                "brief": "2-3 paragraphs, ~150 words",
                "standard": "4-6 paragraphs, ~300 words",
                "detailed": "8-10 paragraphs, ~600 words"
            }
            
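            audience_guidelines = {
                "general": "Clear, accessible language avoiding jargon",
                "technical": "Include technical details and terminology",
                "executive": "Focus on business impact and decisions",
                "academic": "Formal tone with proper academic structure"
            }

            # Reject values outside the enums declared in TEMPLATE.yaml instead of failing later with a KeyError
            if audience not in audience_guidelines or length not in length_guidelines:
                return {"ok": False, "error": {"code": "INVALID_INPUT", "message": f"Unsupported audience or length: {audience!r}, {length!r}"}}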
            
            # Create analyzer agent
            analyzer = Agent(
                name="Document Analyst",
                role="Research Analyst",
                goal="Extract key information and identify citable content",
                instructions="""
                You are a research analyst.
                - Identify main themes and arguments
                - Note specific claims with page/section references
                - Extract statistics and data points
                - Find quotable passages
                - Track source sections for citations
                """,
            )
            
            # Create summarizer agent
            summarizer = Agent(
                name="Summary Writer",
                role="Technical Writer",
                goal=f"Write a {length} summary for {audience} audience",
                instructions=f"""
                You are a technical writer creating summaries.
                Audience: {audience} - {audience_guidelines[audience]}
                Length: {length} - {length_guidelines[length]}
                
                Guidelines:
                - Include inline citations [Section X] or [Page Y]
                - Maintain factual accuracy
                - Don't add information not in the source
                - Highlight key findings
                """,
            )
            
            # Create key points extractor
            extractor = Agent(
                name="Key Points Extractor",
                role="Information Specialist",
                goal="Extract actionable key points",
                instructions="""
                You are an information specialist.
                - Extract 5-10 key points
                - Each point should be self-contained
                - Include relevant citations
                - Prioritize by importance
                """,
            )
            
            # Define tasks (only the first 10,000 characters of the document are sent for analysis)
            analyze_task = Task(
                name="analyze_document",
                description=f"Analyze this document and identify key content:\n\n{content[:10000]}",
                expected_output="Document analysis with citable sections",
                agent=analyzer,
            )
            
            summarize_task = Task(
                name="create_summary",
                description=f"Create a {length} summary for {audience} audience with citations",
                expected_output="Summary with inline citations",
                agent=summarizer,
                context=[analyze_task],
            )
            
            extract_task = Task(
                name="extract_key_points",
                description="Extract key points with citations",
                expected_output="List of key points",
                agent=extractor,
                context=[analyze_task],
            )
            
            # Execute
            agents = AgentTeam(
                agents=[analyzer, summarizer, extractor],
                tasks=[analyze_task, summarize_task, extract_task],
            )
            
            result = agents.start()
            
            # Parse key points
            key_points_text = result.get("extract_key_points", "")
            key_points = [kp.strip() for kp in key_points_text.split("\n") if kp.strip() and len(kp.strip()) > 10]
            
            return {
                "ok": True,
                "summary": result.get("create_summary", ""),
                "key_points": key_points[:10],
                "artifacts": [],
                "warnings": ["Always verify citations against original document"],
            }
            
        except Exception as e:
            return {"ok": False, "error": {"code": "PROCESSING_ERROR", "message": str(e)}}


    def read_document(path: str) -> str:
        """Read document content based on file type."""
        ext = Path(path).suffix.lower()
        
        if ext == ".txt":
            with open(path, "r", encoding="utf-8") as f:
                return f.read()
        
        elif ext == ".pdf":
            try:
                import PyPDF2
                with open(path, "rb") as f:
                    reader = PyPDF2.PdfReader(f)
                    text = []
                    for i, page in enumerate(reader.pages):
                        text.append(f"[Page {i+1}]\n{page.extract_text()}")
                    return "\n\n".join(text)
            except ImportError as exc:
                # A PDF is binary, so reading it as plain text would yield garbage
                raise RuntimeError("PyPDF2 is required for PDF input: pip install PyPDF2") from exc
        
        elif ext in [".docx", ".doc"]:
            try:
                from docx import Document
                doc = Document(path)
                return "\n\n".join([p.text for p in doc.paragraphs])
            except ImportError as exc:
                # DOCX is a zip archive, so a plain-text read is not meaningful either
                raise RuntimeError("python-docx is required for DOCX input: pip install python-docx") from exc
        
        else:
            with open(path, "r", encoding="utf-8", errors="ignore") as f:
                return f.read()
    ```
  </Step>

  <Step title="Create test_recipe.py">
    ```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
    # test_recipe.py
    import pytest
    import tempfile
    import os
    from recipe import run, read_document

    def test_missing_document_path():
        result = run({})
        assert result["ok"] is False
        assert result["error"]["code"] == "MISSING_INPUT"

    def test_file_not_found():
        result = run({"document_path": "/nonexistent.pdf"})
        assert result["ok"] is False
        assert result["error"]["code"] == "FILE_NOT_FOUND"

    def test_read_txt_document():
        with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
            f.write("This is test content.")
            temp_path = f.name
        
        try:
            content = read_document(temp_path)
            assert "test content" in content
        finally:
            os.unlink(temp_path)

    def test_audience_options():
        valid_audiences = ["general", "technical", "executive", "academic"]
        for audience in valid_audiences:
            assert audience in valid_audiences

    @pytest.mark.integration
    def test_end_to_end():
        with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
            f.write("""
            Executive Summary
            
            This report analyzes Q4 performance. Revenue increased 15% year-over-year.
            Key findings include improved customer retention and expanded market share.
            
            Section 1: Financial Performance
            Revenue reached $10M in Q4, up from $8.7M in Q3.
            
            Section 2: Customer Metrics
            Customer satisfaction scores improved to 4.5/5.
            """)
            temp_path = f.name
        
        try:
            result = run({
                "document_path": temp_path,
                "audience": "executive",
                "length": "brief"
            })
            
            assert result["ok"] is True
            assert len(result["summary"]) > 50
            assert len(result["key_points"]) > 0
        finally:
            os.unlink(temp_path)
    ```
  </Step>
</Steps>
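
The key-point parsing in `recipe.py` keeps each sufficiently long line as-is, so any list markers the model emits (such as `1.` or `-`) stay in the output. A minimal sketch of a stricter parser, assuming the extractor returns one point per line; `parse_key_points` is a hypothetical helper name, and the length and limit defaults mirror the values used in `recipe.py`.

```python
from __future__ import annotations

import re

# Leading list markers: "-", "*", "•", or a number followed by "." or ")".
LIST_MARKER = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")


def parse_key_points(text: str, limit: int = 10, min_length: int = 10) -> list[str]:
    """Split model output into key points, dropping list markers and short lines."""
    points = []
    for line in text.splitlines():
        point = LIST_MARKER.sub("", line).strip()
        if len(point) > min_length:
            points.append(point)
    return points[:limit]
```

Because the marker pattern requires whitespace after the marker, a line that starts with Markdown bold (`**Revenue** ...`) is left untouched.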

## Run Locally

```bash theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
# Basic usage
praison recipes run document-summarizer-with-citations \
  --input '{"document_path": "report.pdf"}'

# Executive summary
praison recipes run document-summarizer-with-citations \
  --input '{"document_path": "analysis.docx", "audience": "executive", "length": "brief"}'
```
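
When the command is assembled from a script, quoting the JSON by hand is error-prone. A minimal sketch that builds the same command line with `json.dumps` and `shlex.join`; `build_command` is a hypothetical helper, and only the `praison recipes run ... --input` form shown above is assumed.

```python
from __future__ import annotations

import json
import shlex


def build_command(recipe: str, **inputs: str) -> list[str]:
    """Build the argv list for `praison recipes run <recipe> --input <json>`."""
    return ["praison", "recipes", "run", recipe, "--input", json.dumps(inputs)]


argv = build_command("document-summarizer-with-citations", document_path="report.pdf")
# shlex.join renders the argv list as a shell-safe string for logging or copy-paste
print(shlex.join(argv))
```

Passing the `argv` list directly to `subprocess.run(argv)` avoids shell quoting altogether.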

## Deploy & Integrate: 6 Integration Models

<Tabs>
  <Tab title="Model 1: Embedded SDK">
    ```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
    from praisonai import recipe

    result = recipe.run(
        "document-summarizer-with-citations",
        input={
            "document_path": "research_paper.pdf",
            "audience": "academic",
            "length": "detailed"
        }
    )

    if result.ok:
        print(result.output["summary"])
        print("\nKey Points:")
        for point in result.output["key_points"]:
            print(f"  • {point}")
    ```

    <Warning>**Safety:** Documents may contain confidential information. Process securely.</Warning>
  </Tab>

  <Tab title="Model 2: CLI Invocation">
    ```bash theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
    praison recipes run document-summarizer-with-citations \
      --input '{"document_path": "doc.pdf", "length": "brief"}' \
      --json | jq '.summary'
    ```
  </Tab>

  <Tab title="Model 3: Plugin Mode">
    ```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
    class SummarizerPlugin:
        def summarize(self, doc_path, audience="general"):
            from praisonai import recipe
            return recipe.run(
                "document-summarizer-with-citations",
                input={"document_path": doc_path, "audience": audience}
            )
    ```
  </Tab>

  <Tab title="Model 4: Local HTTP Sidecar">
    ```javascript theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
    const response = await fetch('http://localhost:8765/recipes/document-summarizer-with-citations/run', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        document_path: '/uploads/report.pdf',
        audience: 'executive'
      })
    });
    ```
  </Tab>

  <Tab title="Model 5: Remote Managed Runner">
    ```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
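    import os
    import requests

    api_key = os.environ["DOC_TOOLS_API_KEY"]  # hypothetical variable name for the runner's API key
    response = requests.post(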
        "https://api.doc-tools.com/summarize",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"document_url": "https://cdn.example.com/report.pdf"}
    )
    ```
  </Tab>

  <Tab title="Model 6: Event-Driven">
    ```python theme={"theme":{"light":"vitesse-light","dark":"vitesse-dark"}}
    import queue

    # Module-level queue so jobs outlive a single event; replace with SQS/RabbitMQ in production
    job_queue = queue.Queue()

    def on_document_uploaded(event):
        job_queue.put({
            "recipe": "document-summarizer-with-citations",
            "input": {"document_path": event['file_path']},
            "callback_url": f"https://api.example.com/docs/{event['doc_id']}/summary"
        })
    ```
  </Tab>
</Tabs>

## Troubleshooting

<AccordionGroup>
  <Accordion title="PDF extraction fails">
    Install PyPDF2 with `pip install PyPDF2`. Scanned or image-only PDFs have no text layer to extract, so run them through OCR before summarizing.
  </Accordion>

  <Accordion title="Citations seem incorrect">
    AI-generated citations are approximations. Always verify against the original document before use.
  </Accordion>
</AccordionGroup>
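
For the PDF extraction issue above, a quick way to tell whether a file needs OCR is to check how much text each page yielded. A minimal sketch, assuming the per-page texts have already been extracted (for example with `page.extract_text()` as in `read_document`); `pages_needing_ocr` and the 20-character threshold are illustrative choices, not part of the recipe.

```python
from __future__ import annotations


def pages_needing_ocr(page_texts: list[str | None], min_chars: int = 20) -> list[int]:
    """Return 1-based page numbers whose extracted text is missing or nearly empty."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]
```

If most pages are flagged, the document is likely a scan and the summary would be built on little or no source text.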

## Next Steps

* **[Meeting Minutes Action Items](/docs/examples/recipe-examples/meeting-minutes-action-items)** - Summarize meeting transcripts
* **[Customer Support Reply Drafter](/docs/examples/recipe-examples/customer-support-reply-drafter)** - Draft responses based on documentation
