Video Caption Generator

Generate captions from video files with automatic language detection and support for SRT/VTT output formats.

Problem Statement

Who: Content creators, video editors, accessibility teams
Why: Manual captioning is time-consuming and expensive. Automated captions improve accessibility and SEO.

What You’ll Build

A recipe that extracts audio from video, transcribes it, and generates properly formatted caption files.

Input/Output Contract

Input	Type	Required	Description
`video_path`	string	Yes	Path to the video file
`language`	string	No	Language code (auto-detect if omitted)
`output_format`	string	No	`srt` or `vtt` (default: `srt`)

Output	Type	Description
`captions_file`	string	Path to generated caption file
`summary`	string	Brief summary of the video content
`ok`	boolean	Success indicator

Prerequisites

Required: OPENAI_API_KEY environment variable must be set.

# Set your API key
export OPENAI_API_KEY=your_key_here

# Install required packages
pip install praisonaiagents

# Optional: Install ffmpeg for audio extraction
# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg
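
The prerequisites above can also be checked from Python before running the recipe. The following is a minimal sketch, not part of the recipe itself: the function name `preflight` and its parameters are hypothetical, and it only checks that the variable is set and that an `ffmpeg` executable is on the PATH.

```python
import os
import shutil


def preflight(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    # The API key is required by the recipe
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    # ffmpeg is optional, so treat a missing binary as a warning
    if which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH (optional, used for audio extraction)")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print(f"warning: {problem}")
```

The `env` and `which` parameters exist only so the check can be exercised in tests without touching the real environment.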

Step-by-Step Build

Create Recipe Directory

mkdir -p ~/.praisonai/templates/video-caption-generator
cd ~/.praisonai/templates/video-caption-generator

Create TEMPLATE.yaml

Create the recipe metadata file:

# TEMPLATE.yaml
name: video-caption-generator
version: "1.0.0"
description: "Generate captions from video files with language detection"
author: "PraisonAI"
license: "MIT"

tags:
  - video
  - captions
  - accessibility
  - transcription

requires:
  env:
    - OPENAI_API_KEY
  packages:
    - praisonaiagents
  optional_env:
    - ANTHROPIC_API_KEY
  external:
    - ffmpeg

inputs:
  video_path:
    type: string
    description: "Path to the video file to caption"
    required: true
  language:
    type: string
    description: "Language code (e.g., 'en', 'es', 'fr'). Auto-detect if omitted."
    required: false
    default: "auto"
  output_format:
    type: string
    description: "Caption output format"
    required: false
    default: "srt"
    enum:
      - srt
      - vtt

outputs:
  captions_file:
    type: string
    description: "Path to the generated caption file"
  summary:
    type: string
    description: "Brief summary of the video content"
  ok:
    type: boolean
    description: "Whether the operation succeeded"

cli:
  command: "praison recipes run video-caption-generator"
  examples:
    - 'praison recipes run video-caption-generator --input ''{"video_path": "video.mp4"}'''
    - 'praison recipes run video-caption-generator --input ''{"video_path": "video.mp4", "language": "en", "output_format": "vtt"}'''

safety:
  dry_run_default: false
  requires_consent: false
  overwrites_files: true
  network_access: true
  pii_handling: false
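
To make the `required`, `default`, and `enum` fields of the `inputs` section concrete, here is a small sketch of how they could be applied to caller-supplied input. How the PraisonAI runtime actually interprets TEMPLATE.yaml is not shown in this guide, so `INPUT_SCHEMA` and `resolve_inputs` are hypothetical names used only to illustrate the intended semantics.

```python
# Hand-written mirror of the `inputs` section of TEMPLATE.yaml (assumption:
# a real loader would parse the YAML file instead of using a literal dict)
INPUT_SCHEMA = {
    "video_path": {"type": "string", "required": True},
    "language": {"type": "string", "required": False, "default": "auto"},
    "output_format": {
        "type": "string",
        "required": False,
        "default": "srt",
        "enum": ["srt", "vtt"],
    },
}


def resolve_inputs(schema, supplied):
    """Fill in defaults, then enforce required fields and enum constraints."""
    resolved = {}
    for name, spec in schema.items():
        if name in supplied:
            value = supplied[name]
        elif spec.get("required"):
            raise ValueError(f"missing required input: {name}")
        else:
            value = spec.get("default")
        if "enum" in spec and value not in spec["enum"]:
            raise ValueError(f"{name} must be one of {spec['enum']}, got {value!r}")
        resolved[name] = value
    return resolved


print(resolve_inputs(INPUT_SCHEMA, {"video_path": "video.mp4"}))
# {'video_path': 'video.mp4', 'language': 'auto', 'output_format': 'srt'}
```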

Create recipe.py

Implement the main recipe logic:

# recipe.py
import os
import subprocess
import tempfile
from pathlib import Path
from praisonaiagents import Agent, Task, AgentTeam

def run(input_data: dict, config: dict = None) -> dict:
    """
    Generate captions from a video file.
    
    Args:
        input_data: Contains video_path, language, output_format
        config: Optional configuration overrides
        
    Returns:
        Dict with captions_file, summary, and ok status
    """
    # Validate required inputs
    video_path = input_data.get("video_path")
    if not video_path:
        return {
            "ok": False,
            "error": {"code": "MISSING_INPUT", "message": "video_path is required"},
            "captions_file": None,
            "summary": None,
        }
    
    if not os.path.exists(video_path):
        return {
            "ok": False,
            "error": {"code": "FILE_NOT_FOUND", "message": f"Video file not found: {video_path}"},
            "captions_file": None,
            "summary": None,
        }
    
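    language = input_data.get("language") or "auto"
    output_format = (input_data.get("output_format") or "srt").lower()
    
    # Reject formats outside the enum declared in TEMPLATE.yaml
    if output_format not in ("srt", "vtt"):
        return {
            "ok": False,
            "error": {"code": "INVALID_INPUT", "message": f"Unsupported output_format: {output_format}"},
            "captions_file": None,
            "summary": None,
        }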
    
    try:
        # Extract audio from video
        audio_path = extract_audio(video_path)
        
        # Create transcription agent
        transcriber = Agent(
            name="Transcription Specialist",
            role="Audio Transcription Expert",
            goal="Accurately transcribe audio content with timestamps",
            instructions="""
            You are an expert transcriptionist.
            - Transcribe audio accurately with proper punctuation
            - Include timestamps for each segment
            - Identify speaker changes when possible
            - Handle multiple languages if detected
            """,
        )
        
        # Create caption formatting agent
        formatter = Agent(
            name="Caption Formatter",
            role="Caption File Specialist",
            goal=f"Format transcription into {output_format.upper()} format",
            instructions=f"""
            You are a caption formatting expert.
            - Convert transcriptions to {output_format.upper()} format
            - Ensure proper timestamp formatting
            - Keep caption lines under 42 characters
            - Split long sentences appropriately
            """,
        )
        
        # Create summarizer agent
        summarizer = Agent(
            name="Content Summarizer",
            role="Video Content Analyst",
            goal="Provide a brief summary of the video content",
            instructions="""
            You are a content analyst.
            - Summarize the main topics discussed
            - Keep summary under 100 words
            - Highlight key points
            """,
        )
        
        # Define tasks
        transcribe_task = Task(
            name="transcribe_audio",
            description=f"""
            Transcribe the audio from: {audio_path}
            Language: {language if language != 'auto' else 'auto-detect'}
            
            Provide timestamped transcription segments.
            """,
            expected_output="Timestamped transcription with segments",
            agent=transcriber,
        )
        
        format_task = Task(
            name="format_captions",
            description=f"""
            Format the transcription into {output_format.upper()} caption format.
            
            For SRT format:
            1
            00:00:00,000 --> 00:00:02,500
            Caption text here
            
            For VTT format:
            WEBVTT
            
            00:00:00.000 --> 00:00:02.500
            Caption text here
            """,
            expected_output=f"Properly formatted {output_format.upper()} captions",
            agent=formatter,
            context=[transcribe_task],
        )
        
        summarize_task = Task(
            name="summarize_content",
            description="Summarize the video content based on the transcription.",
            expected_output="Brief summary of video content",
            agent=summarizer,
            context=[transcribe_task],
        )
        
        # Execute agents
        agents = AgentTeam(
            agents=[transcriber, formatter, summarizer],
            tasks=[transcribe_task, format_task, summarize_task],
        )
        
        result = agents.start()
        
        # Save captions to file
        video_name = Path(video_path).stem
        captions_file = f"{video_name}.{output_format}"
        
        with open(captions_file, "w", encoding="utf-8") as f:
            f.write(result.get("format_captions", ""))
        
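        # Cleanup temp audio file (never delete the source video, which
        # extract_audio returns unchanged when ffmpeg is unavailable)
        if audio_path and audio_path != video_path and os.path.exists(audio_path):
            os.remove(audio_path)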
        
        return {
            "ok": True,
            "captions_file": captions_file,
            "summary": result.get("summarize_content", ""),
            "artifacts": [
                {"path": captions_file, "type": "text", "size_bytes": os.path.getsize(captions_file)}
            ],
            "warnings": [],
        }
        
    except Exception as e:
        return {
            "ok": False,
            "error": {"code": "PROCESSING_ERROR", "message": str(e)},
            "captions_file": None,
            "summary": None,
        }


def extract_audio(video_path: str) -> str:
    """Extract 16 kHz mono WAV audio from a video file using ffmpeg.

    Returns the path to a temporary WAV file, or the original video path
    if ffmpeg is not installed.
    """
    # Create temp file for audio
    temp_audio = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    temp_audio.close()
    
    # Extract audio using ffmpeg
    cmd = [
        "ffmpeg", "-i", video_path,
        "-vn", "-acodec", "pcm_s16le",
        "-ar", "16000", "-ac", "1",
        "-y", temp_audio.name
    ]
    
    try:
        subprocess.run(cmd, check=True, capture_output=True)
        return temp_audio.name
    except FileNotFoundError:
        # ffmpeg not installed - remove the unused temp file and
        # return video path for direct processing
        os.remove(temp_audio.name)
        return video_path
    except subprocess.CalledProcessError as e:
        # Do not leave a partial temp file behind on failure
        os.remove(temp_audio.name)
        raise RuntimeError(
            f"Audio extraction failed: {e.stderr.decode(errors='replace')}"
        ) from e
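
The recipe above delegates caption formatting to an LLM agent, guided by the SRT and VTT samples in the task description. For comparison, the same layout can be produced deterministically once timestamped segments are available. The sketch below is an alternative technique, not the recipe's method: it assumes segments arrive as `(start_seconds, end_seconds, text)` tuples, and the names `format_timestamp` and `render_captions` are hypothetical. It wraps text at 42 characters per line, following the formatter agent's instruction.

```python
import textwrap


def format_timestamp(seconds: float, fmt: str = "srt") -> str:
    """Render seconds as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (VTT)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    sep = "," if fmt == "srt" else "."
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{sep}{millis:03d}"


def render_captions(segments, fmt: str = "srt", max_chars: int = 42) -> str:
    """Build caption text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = f"{format_timestamp(start, fmt)} --> {format_timestamp(end, fmt)}"
        # Wrap long captions so no line exceeds max_chars characters
        body = "\n".join(textwrap.wrap(text, width=max_chars))
        if fmt == "srt":
            # SRT cues start with a numeric index
            blocks.append(f"{index}\n{timing}\n{body}")
        else:
            blocks.append(f"{timing}\n{body}")
    header = "WEBVTT\n\n" if fmt == "vtt" else ""
    return header + "\n\n".join(blocks) + "\n"


print(render_captions([(0.0, 2.5, "Caption text here")], "srt"))
```

Formatting in code keeps timestamps exact and leaves only the transcription step dependent on model output.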

Create test_recipe.py

Write tests for the recipe:

# test_recipe.py
import pytest
import os
import tempfile
from recipe import run

def test_missing_video_path():
    """Test error handling for missing video_path."""
    result = run({})
    assert result["ok"] is False
    assert result["error"]["code"] == "MISSING_INPUT"

def test_file_not_found():
    """Test error handling for non-existent file."""
    result = run({"video_path": "/nonexistent/video.mp4"})
    assert result["ok"] is False
    assert result["error"]["code"] == "FILE_NOT_FOUND"

def test_output_format_default():
    """Test that default output format is SRT."""
    # This test would need a mock video file
    pass

def test_valid_output_formats():
    """Test that both SRT and VTT formats are accepted."""
    # Validation only - actual processing requires video file
    valid_formats = ["srt", "vtt"]
    for fmt in valid_formats:
        input_data = {
            "video_path": "test.mp4",
            "output_format": fmt
        }
        # Would need mock file for full test
        assert fmt in valid_formats

@pytest.mark.integration
def test_end_to_end():
    """Full integration test with real video file."""
    # Skip if no test video available
    test_video = os.environ.get("TEST_VIDEO_PATH")
    if not test_video or not os.path.exists(test_video):
        pytest.skip("No test video available")
    
    result = run({
        "video_path": test_video,
        "language": "en",
        "output_format": "srt"
    })
    
    assert result["ok"] is True
    assert result["captions_file"] is not None
    assert os.path.exists(result["captions_file"])
    
    # Cleanup
    if os.path.exists(result["captions_file"]):
        os.remove(result["captions_file"])

@pytest.mark.integration
def test_vtt_format():
    """Test VTT output format."""
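    test_video = os.environ.get("TEST_VIDEO_PATH")
    if not test_video or not os.path.exists(test_video):
        pytest.skip("No test video available")
    
    result = run({
        "video_path": test_video,
        "output_format": "vtt"
    })
    
    assert result["ok"] is True
    assert result["captions_file"].endswith(".vtt")
    
    # Cleanup
    if os.path.exists(result["captions_file"]):
        os.remove(result["captions_file"])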
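
The integration tests above only check that a caption file exists. A structural check on its contents can catch malformed output from the formatting agent. The helper below is a sketch under stated assumptions: `check_srt` is a hypothetical name, it expects cues separated by a single blank line, and it validates only index order, timing syntax, and that each cue ends after it starts.

```python
import re

# One SRT timing line: HH:MM:SS,mmm --> HH:MM:SS,mmm
SRT_TIMING = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})$"
)


def _to_ms(h, m, s, ms):
    """Convert timestamp components (as strings) to milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def check_srt(text: str) -> list:
    """Return a list of problems found in SRT text; empty means well-formed."""
    blocks = [b for b in text.strip().split("\n\n") if b.strip()]
    if not blocks:
        return ["no cues found"]
    problems = []
    for expected_index, block in enumerate(blocks, start=1):
        lines = block.splitlines()
        if len(lines) < 3:
            problems.append(f"cue {expected_index}: expected index, timing, and text lines")
            continue
        if lines[0].strip() != str(expected_index):
            problems.append(f"cue {expected_index}: unexpected index {lines[0]!r}")
        match = SRT_TIMING.match(lines[1].strip())
        if not match:
            problems.append(f"cue {expected_index}: malformed timing line {lines[1]!r}")
            continue
        parts = match.groups()
        if _to_ms(*parts[4:]) <= _to_ms(*parts[:4]):
            problems.append(f"cue {expected_index}: end time is not after start time")
    return problems
```

In `test_end_to_end`, an assertion such as `assert check_srt(open(result["captions_file"], encoding="utf-8").read()) == []` would then fail on badly structured captions.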

Create README.md

Document the recipe:

# Video Caption Generator

Generate captions from video files with automatic language detection.

## Quick Start

```bash
praison recipes run video-caption-generator --input '{"video_path": "my-video.mp4"}'
```

Inputs

Field	Type	Required	Default	Description
video_path	string	Yes	-	Path to video file
language	string	No	auto	Language code (en, es, fr, etc.)
output_format	string	No	srt	Output format: srt or vtt

Outputs

Field	Type	Description
captions_file	string	Path to generated caption file
summary	string	Brief content summary
ok	boolean	Success indicator

Requirements

OPENAI_API_KEY environment variable
ffmpeg (optional, for audio extraction)
praisonaiagents package

Examples

Basic Usage

praison recipes run video-caption-generator \
  --input '{"video_path": "presentation.mp4"}'

Specify Language and Format

praison recipes run video-caption-generator \
  --input '{"video_path": "video.mp4", "language": "es", "output_format": "vtt"}'

Troubleshooting

Issue	Solution
"ffmpeg not found"	Install ffmpeg: `brew install ffmpeg` or `apt install ffmpeg`
"API key missing"	Set `export OPENAI_API_KEY=your_key`
Poor transcription	Try specifying the language explicitly

Verify Recipe Structure

# Check directory structure
ls -la ~/.praisonai/templates/video-caption-generator/

# Expected output:
# TEMPLATE.yaml
# recipe.py
# test_recipe.py
# README.md

# Verify recipe is discovered
praison recipes list | grep video-caption

Run Locally

Using CLI

# Basic run
praison recipes run video-caption-generator \
  --input '{"video_path": "my-video.mp4"}'

# With all options
praison recipes run video-caption-generator \
  --input '{"video_path": "video.mp4", "language": "en", "output_format": "vtt"}'

# Dry run (see what would happen)
praison recipes run video-caption-generator \
  --input '{"video_path": "video.mp4"}' \
  --dry-run

Using Python SDK

from praisonai import recipe

result = recipe.run(
    "video-caption-generator",
    input={
        "video_path": "my-video.mp4",
        "language": "en",
        "output_format": "srt"
    }
)

if result.ok:
    print(f"Captions saved to: {result.output['captions_file']}")
    print(f"Summary: {result.output['summary']}")
else:
    print(f"Error: {result.error}")

Deploy & Integrate: 6 Integration Models

Python SDK (embedded)

When to use: Python applications, Jupyter notebooks, direct integration

from praisonai import recipe

# Synchronous execution
result = recipe.run(
    "video-caption-generator",
    input={"video_path": "video.mp4", "output_format": "srt"}
)

# Access results
if result.ok:
    captions_path = result.output["captions_file"]
    summary = result.output["summary"]

Deployment note: Runs in-process, lowest latency, requires Python environment.

Safety: Ensure video files are from trusted sources. Recipe writes to local filesystem.

CLI invocation

When to use: Shell scripts, CI/CD pipelines, language-agnostic integration

# From any language via subprocess
praison recipes run video-caption-generator \
  --input '{"video_path": "video.mp4"}' \
  --json

Node.js example:

const { execFileSync } = require('child_process');

// Pass arguments as an array so the JSON is not re-parsed by a shell
const input = JSON.stringify({ video_path: 'video.mp4' });
const result = execFileSync('praison', [
  'recipes', 'run', 'video-caption-generator',
  '--input', input,
  '--json',
]);
const output = JSON.parse(result.toString());

Deployment note: Requires praisonai CLI installed on the system.

Safety: Validate file paths before passing to CLI to prevent path traversal.
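
One way to apply the path check mentioned above is to resolve the user-supplied path against an allowed directory before building the CLI command. This is a sketch, not a PraisonAI feature: `resolve_video_path` and `build_command` are hypothetical helpers, the allowed directory is an assumption, and `Path.is_relative_to` requires Python 3.9 or later. The CLI flags are the ones shown in this section.

```python
import json
from pathlib import Path


def resolve_video_path(user_path: str, allowed_dir: str) -> Path:
    """Resolve user_path and make sure it stays inside allowed_dir."""
    base = Path(allowed_dir).resolve()
    # Joining with an absolute user_path discards base, so the check
    # below also rejects absolute paths outside the allowed directory
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes {base}: {user_path}")
    return candidate


def build_command(video_path: Path) -> list:
    """Build the argument list for subprocess.run (no shell involved)."""
    payload = json.dumps({"video_path": str(video_path)})
    return [
        "praison", "recipes", "run", "video-caption-generator",
        "--input", payload,
        "--json",
    ]
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, check=True)`, which avoids shell quoting issues with the JSON payload.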

Plugin

When to use: IDE extensions, CMS plugins, chat applications

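# Plugin wrapper example (for instance, the Python backend of an IDE extension)
class CaptionGeneratorPlugin:
    def __init__(self):
        # Imported here so that loading this module does not import praisonai
        from praisonai import recipe
        self.recipe = recipe
    
    def generate_captions(self, video_path: str):
        return self.recipe.run(
            "video-caption-generator",
            input={"video_path": video_path}
        )

# Instantiate the plugin; how it is registered depends on the host application
plugin = CaptionGeneratorPlugin()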

Deployment note: Embed in plugin architecture, handle UI callbacks.

Safety: Respect IDE sandbox permissions for file access.

Local HTTP sidecar

When to use: Microservices, polyglot environments, local API

# Start the sidecar server
praison recipes serve --port 8765

# Call from any HTTP client (Python shown here, run as a separate script)
import requests

response = requests.post(
    "http://localhost:8765/recipes/video-caption-generator/run",
    json={"video_path": "video.mp4", "output_format": "srt"}
)
result = response.json()

Deployment note: Run as a local service, configure port and auth as needed.

Safety: Bind to localhost only. Use authentication for non-local access.

Remote runner

When to use: Multi-tenant SaaS, cloud deployments, authenticated access

import requests

# Call remote runner with auth
response = requests.post(
    "https://api.your-service.com/recipes/video-caption-generator/run",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "video_path": "s3://bucket/video.mp4",
        "output_format": "srt"
    }
)
result = response.json()

Deployment note: Deploy behind API gateway, implement rate limiting and auth.

Safety: Use signed URLs for file access. Implement tenant isolation.
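
The sidecar and remote runner examples differ only in base URL and authentication, so a single client helper can serve both. The sketch below uses the standard library rather than `requests` so that it has no dependencies. `build_run_request` and `run_recipe` are hypothetical names, the endpoint path is the one used in the examples above, and the assumption that the server replies with a JSON body is not confirmed by this guide.

```python
import json
import urllib.request


def build_run_request(base_url, recipe_name, payload, token=None):
    """Build a POST request for <base_url>/recipes/<recipe_name>/run."""
    url = f"{base_url.rstrip('/')}/recipes/{recipe_name}/run"
    headers = {"Content-Type": "application/json"}
    if token:
        # Only the remote runner example uses bearer authentication
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def run_recipe(request, timeout=300):
    """Send the request and decode the JSON response (assumed format)."""
    # Transcription can be slow, so use a generous explicit timeout
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For the local sidecar, call `build_run_request("http://localhost:8765", "video-caption-generator", {...})`. For a remote runner, pass its base URL and a `token`.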

Event-driven (queue)

When to use: Batch processing, async workflows, queue-based systems

# Producer: Submit job to queue
import json
import queue  # In production, replace with an SQS/RabbitMQ/Redis client

import requests

job_queue = queue.Queue()
job = {
    "recipe": "video-caption-generator",
    "input": {"video_path": "s3://bucket/video.mp4"},
    "callback_url": "https://your-app.com/webhook"
}
job_queue.put(json.dumps(job))

# Consumer: Process from queue
def process_job(message):
    from praisonai import recipe
    job = json.loads(message)
    result = recipe.run(job["recipe"], input=job["input"])
    
    # Send result to callback
    requests.post(job["callback_url"], json=result.to_dict())

Deployment note: Use SQS, RabbitMQ, or Redis for queue. Handle retries.

Safety: Validate callback URLs. Implement job timeout and dead-letter queues.
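
The retry, callback validation, and dead-letter advice above can be sketched with the same in-memory queue used in the producer example. Everything here is illustrative: `ALLOWED_CALLBACK_HOSTS`, `MAX_ATTEMPTS`, `is_allowed_callback`, and `drain` are hypothetical names, the dead-letter "queue" is a plain list, and job timeouts are not covered. A managed queue such as SQS provides retries and dead-letter queues natively.

```python
import json
import queue
from urllib.parse import urlparse

ALLOWED_CALLBACK_HOSTS = {"your-app.com"}  # hypothetical allowlist
MAX_ATTEMPTS = 3


def is_allowed_callback(url: str) -> bool:
    """Accept only HTTPS callbacks to hosts on the allowlist."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in ALLOWED_CALLBACK_HOSTS


def drain(jobs: queue.Queue, dead_letters: list, handler) -> None:
    """Process every queued job, retrying failures up to MAX_ATTEMPTS."""
    while True:
        try:
            message = jobs.get_nowait()
        except queue.Empty:
            return
        job = json.loads(message)
        if not is_allowed_callback(job.get("callback_url", "")):
            # Never call the handler for jobs with untrusted callbacks
            dead_letters.append({"job": job, "reason": "callback_url not allowed"})
            continue
        job["attempts"] = job.get("attempts", 0) + 1
        try:
            handler(job)
        except Exception as exc:
            if job["attempts"] >= MAX_ATTEMPTS:
                dead_letters.append({"job": job, "reason": str(exc)})
            else:
                # Put the job back on the queue for another attempt
                jobs.put(json.dumps(job))
```

Here `handler` stands in for a function like `process_job` that accepts the decoded job dictionary.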

Troubleshooting

ffmpeg not found

Symptom: Error message about ffmpeg not being installed.

Solution:

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg

# Windows
choco install ffmpeg

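Without ffmpeg, the recipe falls back to passing the original video file to the transcription step instead of an extracted audio track, which may reduce transcription quality.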

API key not set

Symptom: Authentication error from OpenAI.

Solution:

export OPENAI_API_KEY=your_key_here

# Verify it's set
echo $OPENAI_API_KEY

Poor transcription quality

Symptom: Captions contain errors or miss words.

Solutions:

Specify the language explicitly instead of auto-detect
Ensure audio quality is good (reduce background noise)
Try a different model by setting OPENAI_MODEL=gpt-4o

Large file processing timeout

Symptom: Recipe times out on long videos.

Solutions:

Split video into smaller segments
Use async/event-driven integration model
Increase timeout in config
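
For the first option, splitting can be planned in Python and executed with ffmpeg. The sketch below only builds the commands. `segment_commands` is a hypothetical helper, the 10-minute default chunk length and the `segment_NNN.mp4` output names are assumptions, and the video duration must be supplied by the caller (for example from `ffprobe`). Because `-c copy` cuts without re-encoding, segment boundaries may shift to the nearest keyframe.

```python
def segment_commands(video_path: str, duration_s: float, chunk_s: float = 600.0):
    """Return (start_offset_seconds, ffmpeg_command) pairs covering the video."""
    if chunk_s <= 0:
        raise ValueError("chunk_s must be positive")
    commands = []
    start = 0.0
    index = 0
    while start < duration_s:
        length = min(chunk_s, duration_s - start)
        output = f"segment_{index:03d}.mp4"
        commands.append((
            start,
            [
                "ffmpeg", "-y",
                "-ss", f"{start:.3f}",   # seek to the segment start
                "-i", video_path,
                "-t", f"{length:.3f}",   # keep this many seconds
                "-c", "copy",            # copy streams without re-encoding
                output,
            ],
        ))
        start += chunk_s
        index += 1
    return commands


for offset, cmd in segment_commands("long-video.mp4", duration_s=1500, chunk_s=600):
    print(offset, " ".join(cmd))
```

Each segment can then be captioned separately. When merging the results, add the segment's start offset to every caption timestamp so that cues line up with the original video.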

Next Steps

Podcast Transcription Cleaner - Similar recipe for audio files
Multilingual Subtitle Translator - Translate your generated captions
Integration Models - Deep dive into deployment options
