Video Caption Generator

Generate captions from video files with automatic language detection and support for SRT/VTT output formats.

Problem Statement

Who: Content creators, video editors, accessibility teams
Why: Manual captioning is time-consuming and expensive. Automated captions improve accessibility and SEO.

What You’ll Build

A recipe that extracts audio from video, transcribes it, and generates properly formatted caption files.

Input/Output Contract

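| Input | Type | Required | Description |
|-------|------|----------|-------------|
| video_path | string | Yes | Path to the video file |
| language | string | No | Language code (auto-detect if omitted) |
| output_format | string | No | srt or vtt (default: srt) |

| Output | Type | Description |
|--------|------|-------------|
| captions_file | string | Path to generated caption file |
| summary | string | Brief summary of the video content |
| ok | boolean | Success indicator |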
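
The input side of this contract can be enforced before any processing starts. The following is a minimal sketch, not part of PraisonAI: `normalize_input` and `ALLOWED_FORMATS` are hypothetical names introduced here for illustration. It applies the documented defaults and rejects formats other than srt and vtt.

```python
ALLOWED_FORMATS = ("srt", "vtt")  # hypothetical constant mirroring the contract


def normalize_input(input_data: dict) -> dict:
    """Apply contract defaults and validate values (hypothetical helper)."""
    video_path = input_data.get("video_path")
    if not isinstance(video_path, str) or not video_path:
        raise ValueError("video_path is required and must be a non-empty string")

    # Accept "SRT"/"VTT" as well, then check against the allowed set
    output_format = str(input_data.get("output_format", "srt")).lower()
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"output_format must be one of {ALLOWED_FORMATS}")

    return {
        "video_path": video_path,
        # "auto" stands for auto-detection when no language code is given
        "language": input_data.get("language") or "auto",
        "output_format": output_format,
    }
```

Calling `normalize_input({"video_path": "a.mp4"})` fills in `language="auto"` and `output_format="srt"`, so later code can rely on all three keys being present.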

Prerequisites

Required: OPENAI_API_KEY environment variable must be set.
# Set your API key
export OPENAI_API_KEY=your_key_here

# Install required packages
pip install praisonaiagents

# Optional: Install ffmpeg for audio extraction
# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg
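
These prerequisites can also be checked from Python before running the recipe. This is a rough sketch under the assumption that a missing API key is a blocker and a missing ffmpeg is only worth a note; `check_prerequisites` is a hypothetical helper, not a PraisonAI function.

```python
import os
import shutil


def check_prerequisites(env=None) -> list:
    """Return a list of human-readable problems; an empty list means ready.

    Hypothetical helper: `env` defaults to os.environ and can be
    replaced with a plain dict in tests.
    """
    env = os.environ if env is None else env
    problems = []
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    if shutil.which("ffmpeg") is None:
        # ffmpeg is optional, so this is reported as a note, not a blocker
        problems.append("ffmpeg not found on PATH (optional)")
    return problems
```

Printing each entry of `check_prerequisites()` before the first run gives a quick readiness report without contacting any API.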

Step-by-Step Build

1

Create Recipe Directory

mkdir -p ~/.praison/templates/video-caption-generator
cd ~/.praison/templates/video-caption-generator
2

Create TEMPLATE.yaml

Create the recipe metadata file:
# TEMPLATE.yaml
name: video-caption-generator
version: "1.0.0"
description: "Generate captions from video files with language detection"
author: "PraisonAI"
license: "MIT"

tags:
  - video
  - captions
  - accessibility
  - transcription

requires:
  env:
    - OPENAI_API_KEY
  packages:
    - praisonaiagents
  optional_env:
    - ANTHROPIC_API_KEY
  external:
    - ffmpeg

inputs:
  video_path:
    type: string
    description: "Path to the video file to caption"
    required: true
  language:
    type: string
    description: "Language code (e.g., 'en', 'es', 'fr'). Auto-detect if omitted."
    required: false
    default: "auto"
  output_format:
    type: string
    description: "Caption output format"
    required: false
    default: "srt"
    enum:
      - srt
      - vtt

outputs:
  captions_file:
    type: string
    description: "Path to the generated caption file"
  summary:
    type: string
    description: "Brief summary of the video content"
  ok:
    type: boolean
    description: "Whether the operation succeeded"

cli:
  command: "praison recipes run video-caption-generator"
  examples:
    - 'praison recipes run video-caption-generator --input ''{"video_path": "video.mp4"}'''
    - 'praison recipes run video-caption-generator --input ''{"video_path": "video.mp4", "language": "en", "output_format": "vtt"}'''

safety:
  dry_run_default: false
  requires_consent: false
  overwrites_files: true
  network_access: true
  pii_handling: false
3

Create recipe.py

Implement the main recipe logic:
# recipe.py
import os
import subprocess
import tempfile
from pathlib import Path
from praisonaiagents import Agent, Task, PraisonAIAgents

def run(input_data: dict, config: dict = None) -> dict:
    """
    Generate captions from a video file.
    
    Args:
        input_data: Contains video_path, language, output_format
        config: Optional configuration overrides
        
    Returns:
        Dict with captions_file, summary, and ok status
    """
    # Validate required inputs
    video_path = input_data.get("video_path")
    if not video_path:
        return {
            "ok": False,
            "error": {"code": "MISSING_INPUT", "message": "video_path is required"},
            "captions_file": None,
            "summary": None,
        }
    
    if not os.path.exists(video_path):
        return {
            "ok": False,
            "error": {"code": "FILE_NOT_FOUND", "message": f"Video file not found: {video_path}"},
            "captions_file": None,
            "summary": None,
        }
    
    language = input_data.get("language", "auto")
    output_format = input_data.get("output_format", "srt")
    
    try:
        # Extract audio from video
        audio_path = extract_audio(video_path)
        
        # Create transcription agent
        transcriber = Agent(
            name="Transcription Specialist",
            role="Audio Transcription Expert",
            goal="Accurately transcribe audio content with timestamps",
            instructions="""
            You are an expert transcriptionist.
            - Transcribe audio accurately with proper punctuation
            - Include timestamps for each segment
            - Identify speaker changes when possible
            - Handle multiple languages if detected
            """,
        )
        
        # Create caption formatting agent
        formatter = Agent(
            name="Caption Formatter",
            role="Caption File Specialist",
            goal=f"Format transcription into {output_format.upper()} format",
            instructions=f"""
            You are a caption formatting expert.
            - Convert transcriptions to {output_format.upper()} format
            - Ensure proper timestamp formatting
            - Keep caption lines under 42 characters
            - Split long sentences appropriately
            """,
        )
        
        # Create summarizer agent
        summarizer = Agent(
            name="Content Summarizer",
            role="Video Content Analyst",
            goal="Provide a brief summary of the video content",
            instructions="""
            You are a content analyst.
            - Summarize the main topics discussed
            - Keep summary under 100 words
            - Highlight key points
            """,
        )
        
        # Define tasks
        transcribe_task = Task(
            name="transcribe_audio",
            description=f"""
            Transcribe the audio from: {audio_path}
            Language: {language if language != 'auto' else 'auto-detect'}
            
            Provide timestamped transcription segments.
            """,
            expected_output="Timestamped transcription with segments",
            agent=transcriber,
        )
        
        format_task = Task(
            name="format_captions",
            description=f"""
            Format the transcription into {output_format.upper()} caption format.
            
            For SRT format:
            1
            00:00:00,000 --> 00:00:02,500
            Caption text here
            
            For VTT format:
            WEBVTT
            
            00:00:00.000 --> 00:00:02.500
            Caption text here
            """,
            expected_output=f"Properly formatted {output_format.upper()} captions",
            agent=formatter,
            context=[transcribe_task],
        )
        
        summarize_task = Task(
            name="summarize_content",
            description="Summarize the video content based on the transcription.",
            expected_output="Brief summary of video content",
            agent=summarizer,
            context=[transcribe_task],
        )
        
        # Execute agents
        agents = PraisonAIAgents(
            agents=[transcriber, formatter, summarizer],
            tasks=[transcribe_task, format_task, summarize_task],
        )
        
        result = agents.start()
        
        # Save captions to file
        video_name = Path(video_path).stem
        captions_file = f"{video_name}.{output_format}"
        
        with open(captions_file, "w", encoding="utf-8") as f:
            f.write(result.get("format_captions", ""))
        
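        # Cleanup temp audio file. extract_audio returns the original video
        # path when ffmpeg is unavailable, so never delete that file.
        if audio_path and audio_path != video_path and os.path.exists(audio_path):
            os.remove(audio_path)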
        
        return {
            "ok": True,
            "captions_file": captions_file,
            "summary": result.get("summarize_content", ""),
            "artifacts": [
                {"path": captions_file, "type": "text", "size_bytes": os.path.getsize(captions_file)}
            ],
            "warnings": [],
        }
        
    except Exception as e:
        return {
            "ok": False,
            "error": {"code": "PROCESSING_ERROR", "message": str(e)},
            "captions_file": None,
            "summary": None,
        }


def extract_audio(video_path: str) -> str:
    """Extract audio from video file using ffmpeg."""
    # Create temp file for audio
    temp_audio = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    temp_audio.close()

    # Extract 16 kHz mono 16-bit PCM audio using ffmpeg
    cmd = [
        "ffmpeg", "-i", video_path,
        "-vn", "-acodec", "pcm_s16le",
        "-ar", "16000", "-ac", "1",
        "-y", temp_audio.name
    ]

    try:
        subprocess.run(cmd, check=True, capture_output=True)
        return temp_audio.name
    except FileNotFoundError:
        # ffmpeg not installed - drop the unused temp file and
        # return video path for direct processing
        os.remove(temp_audio.name)
        return video_path
    except subprocess.CalledProcessError as e:
        # Do not leave a partial temp file behind on failure
        os.remove(temp_audio.name)
        raise RuntimeError(f"Audio extraction failed: {e.stderr.decode()}") from e
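
The recipe delegates caption formatting to an LLM agent, but the SRT and VTT layouts shown in the task prompt are regular enough to render deterministically. The sketch below assumes the transcription is available as `(start_seconds, end_seconds, text)` tuples; that shape is an assumption, not something the recipe produces as written, and `format_timestamp` and `render_captions` are hypothetical names.

```python
import textwrap


def format_timestamp(seconds: float, output_format: str = "srt") -> str:
    """Render seconds as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (VTT)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    sep = "," if output_format == "srt" else "."
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{sep}{millis:03d}"


def render_captions(segments, output_format: str = "srt", width: int = 42) -> str:
    """Build a caption file body from (start, end, text) segments."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = (f"{format_timestamp(start, output_format)} --> "
                  f"{format_timestamp(end, output_format)}")
        # Wrap long text so each caption line stays within `width` characters
        body = "\n".join(textwrap.wrap(text, width=width))
        cue = f"{timing}\n{body}"
        if output_format == "srt":
            cue = f"{index}\n{cue}"  # SRT cues carry a sequence number
        blocks.append(cue)
    header = "WEBVTT\n\n" if output_format == "vtt" else ""
    return header + "\n\n".join(blocks) + "\n"
```

Rendering in code keeps timestamps and the 42-character line limit exact, leaving the model responsible only for the transcription text.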
4

Create test_recipe.py

Write tests for the recipe:
# test_recipe.py
import pytest
import os
import tempfile
from recipe import run

def test_missing_video_path():
    """Test error handling for missing video_path."""
    result = run({})
    assert result["ok"] is False
    assert result["error"]["code"] == "MISSING_INPUT"

def test_file_not_found():
    """Test error handling for non-existent file."""
    result = run({"video_path": "/nonexistent/video.mp4"})
    assert result["ok"] is False
    assert result["error"]["code"] == "FILE_NOT_FOUND"

def test_output_format_default():
    """Test that default output format is SRT."""
    # This test would need a mock video file; skip rather than report a false pass
    pytest.skip("Requires a mock video file")

def test_valid_output_formats():
    """Test that both SRT and VTT formats are accepted."""
    # Validation only - actual processing requires video file
    valid_formats = ["srt", "vtt"]
    for fmt in valid_formats:
        input_data = {
            "video_path": "test.mp4",
            "output_format": fmt
        }
        # Would need mock file for full test
        assert fmt in valid_formats

@pytest.mark.integration
def test_end_to_end():
    """Full integration test with real video file."""
    # Skip if no test video available
    test_video = os.environ.get("TEST_VIDEO_PATH")
    if not test_video or not os.path.exists(test_video):
        pytest.skip("No test video available")
    
    result = run({
        "video_path": test_video,
        "language": "en",
        "output_format": "srt"
    })
    
    assert result["ok"] is True
    assert result["captions_file"] is not None
    assert os.path.exists(result["captions_file"])
    
    # Cleanup
    if os.path.exists(result["captions_file"]):
        os.remove(result["captions_file"])

@pytest.mark.integration
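def test_vtt_format():
    """Test VTT output format."""
    # Skip if no test video available
    test_video = os.environ.get("TEST_VIDEO_PATH")
    if not test_video or not os.path.exists(test_video):
        pytest.skip("No test video available")
    
    result = run({
        "video_path": test_video,
        "output_format": "vtt"
    })
    
    # Fail loudly instead of passing silently when the run fails
    assert result["ok"] is True
    assert result["captions_file"].endswith(".vtt")
    
    # Cleanup
    if os.path.exists(result["captions_file"]):
        os.remove(result["captions_file"])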
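
The integration tests confirm that a caption file is produced, not that its contents are well formed. A stricter check could look like the sketch below; `looks_like_srt` and `SRT_TIMING` are hypothetical names, and the check only covers the basic SRT cue layout (sequence number, timing line, at least one text line).

```python
import re

# Timing line of an SRT cue, e.g. "00:00:00,000 --> 00:00:02,500"
SRT_TIMING = re.compile(
    r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$"
)


def looks_like_srt(text: str) -> bool:
    """Return True if every blank-line-separated cue has index, timing, text."""
    cues = [c for c in re.split(r"\n\s*\n", text.strip()) if c]
    if not cues:
        return False
    for expected_index, cue in enumerate(cues, start=1):
        lines = cue.splitlines()
        if len(lines) < 3:
            return False
        if lines[0].strip() != str(expected_index):
            return False
        if not SRT_TIMING.match(lines[1].strip()):
            return False
    return True
```

An end-to-end test could read the generated file and assert `looks_like_srt(contents)` before cleaning up.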
5

Create README.md

Document the recipe:
# Video Caption Generator

Generate captions from video files with automatic language detection.

## Quick Start

```bash
praison recipes run video-caption-generator --input '{"video_path": "my-video.mp4"}'
```

## Inputs

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| video_path | string | Yes | - | Path to video file |
| language | string | No | auto | Language code (en, es, fr, etc.) |
| output_format | string | No | srt | Output format: srt or vtt |

## Outputs

| Field | Type | Description |
|-------|------|-------------|
| captions_file | string | Path to generated caption file |
| summary | string | Brief content summary |
| ok | boolean | Success indicator |

## Requirements

- OPENAI_API_KEY environment variable
- ffmpeg (optional, for audio extraction)
- praisonaiagents package

## Examples

### Basic Usage

```bash
praison recipes run video-caption-generator \
  --input '{"video_path": "presentation.mp4"}'
```

### Specify Language and Format

```bash
praison recipes run video-caption-generator \
  --input '{"video_path": "video.mp4", "language": "es", "output_format": "vtt"}'
```

## Troubleshooting

| Issue | Solution |
|-------|----------|
| "ffmpeg not found" | Install ffmpeg: brew install ffmpeg or apt install ffmpeg |
| "API key missing" | Set export OPENAI_API_KEY=your_key |
| Poor transcription | Try specifying the language explicitly |
6

Verify Recipe Structure

```bash
# Check directory structure
ls -la ~/.praison/templates/video-caption-generator/

# Expected output:
# TEMPLATE.yaml
# recipe.py
# test_recipe.py
# README.md

# Verify recipe is discovered
praison recipes list | grep video-caption
```

Run Locally

Using CLI

# Basic run
praison recipes run video-caption-generator \
  --input '{"video_path": "my-video.mp4"}'

# With all options
praison recipes run video-caption-generator \
  --input '{"video_path": "video.mp4", "language": "en", "output_format": "vtt"}'

# Dry run (see what would happen)
praison recipes run video-caption-generator \
  --input '{"video_path": "video.mp4"}' \
  --dry-run

Using Python SDK

from praisonai import recipe

result = recipe.run(
    "video-caption-generator",
    input={
        "video_path": "my-video.mp4",
        "language": "en",
        "output_format": "srt"
    }
)

if result.ok:
    print(f"Captions saved to: {result.output['captions_file']}")
    print(f"Summary: {result.output['summary']}")
else:
    print(f"Error: {result.error}")

Deploy & Integrate: 6 Integration Models

When to use: Python applications, Jupyter notebooks, direct integration
from praisonai import recipe

# Synchronous execution
result = recipe.run(
    "video-caption-generator",
    input={"video_path": "video.mp4", "output_format": "srt"}
)

# Access results
if result.ok:
    captions_path = result.output["captions_file"]
    summary = result.output["summary"]
Deployment note: Runs in-process, lowest latency, requires Python environment.
Safety: Ensure video files are from trusted sources. Recipe writes to local filesystem.
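
Because the recipe writes the caption file next to the working directory under the video's base name, and TEMPLATE.yaml declares `overwrites_files: true`, an existing caption file with the same name is replaced. One way to avoid that is sketched below; `unique_output_path` is a hypothetical helper, and the `-1`, `-2` suffix scheme is an illustrative choice.

```python
from pathlib import Path


def unique_output_path(video_path: str, output_format: str, out_dir: str = ".") -> Path:
    """Return a caption path in out_dir that does not exist yet (hypothetical helper)."""
    stem = Path(video_path).stem
    candidate = Path(out_dir) / f"{stem}.{output_format}"
    counter = 1
    while candidate.exists():
        # e.g. talk.srt -> talk-1.srt -> talk-2.srt
        candidate = Path(out_dir) / f"{stem}-{counter}.{output_format}"
        counter += 1
    return candidate
```

The existence check and the later write are separate steps, so this is only suitable when a single process writes to the directory.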

Troubleshooting

Symptom: Error message about ffmpeg not being installed.
Solution:
# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg

# Windows
choco install ffmpeg
Without ffmpeg, the recipe skips audio extraction and passes the video path to the transcription step directly, so results may be less accurate.
Symptom: Authentication error from OpenAI.
Solution:
export OPENAI_API_KEY=your_key_here

# Verify it's set
echo $OPENAI_API_KEY
Symptom: Captions contain errors or miss words.
Solutions:
  • Specify the language explicitly instead of auto-detect
  • Ensure audio quality is good (reduce background noise)
  • Try a different model by setting OPENAI_MODEL=gpt-4o
Symptom: Recipe times out on long videos.
Solutions:
  • Split video into smaller segments
  • Use async/event-driven integration model
  • Increase timeout in config
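
For the first suggestion, splitting can be planned in code before any ffmpeg process is started. The sketch below assumes the video duration in seconds is already known (for example from ffprobe), uses a 10-minute chunk size as an arbitrary default, and builds the commands without running them. `segment_windows`, `split_commands`, and the `segment_NNN.mp4` naming are hypothetical.

```python
def segment_windows(duration: float, chunk: float = 600.0):
    """Yield (start, length) pairs in seconds that cover [0, duration)."""
    start = 0.0
    while start < duration:
        yield start, min(chunk, duration - start)
        start += chunk


def split_commands(video_path: str, duration: float, chunk: float = 600.0):
    """Build one ffmpeg command per segment without running anything."""
    commands = []
    for i, (start, length) in enumerate(segment_windows(duration, chunk)):
        commands.append([
            "ffmpeg", "-y",
            "-ss", f"{start:.3f}",    # seek to the segment start
            "-i", video_path,
            "-t", f"{length:.3f}",    # segment length
            "-c", "copy",             # copy streams without re-encoding
            f"segment_{i:03d}.mp4",   # hypothetical output naming
        ])
    return commands
```

Each command can be passed to `subprocess.run(cmd, check=True)`. With stream copy, cuts land on keyframes, so segment boundaries may be approximate, and captions generated per segment need their timestamps shifted by that segment's start offset before being merged.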

Next Steps