Prerequisites
  • Python 3.10 or higher
  • PraisonAI Agents package installed
  • crawl4ai package installed and set up

Crawl4AI provides powerful async web crawling with JavaScript rendering, content extraction, and LLM-based data extraction. PraisonAI includes built-in Crawl4AI tools for easy integration.

Installation

pip install praisonaiagents crawl4ai
crawl4ai-setup

Setup

export OPENAI_API_KEY=your_openai_api_key

Built-in Crawl4AI Tool

PraisonAI provides built-in crawl4ai functions that you can use directly:
import asyncio
from praisonaiagents.tools import crawl4ai

async def main():
    result = await crawl4ai("https://example.com")
    print(result["markdown"])

asyncio.run(main())
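
Because these built-in functions are plain callables, they can also be registered as tools on an agent. A minimal sketch, assuming the Agent class accepts async callables in its tools list (the instructions and prompt below are hypothetical):
from praisonaiagents import Agent
from praisonaiagents.tools import crawl4ai

# Register the crawl function as a tool; the agent calls it when needed.
agent = Agent(
    instructions="You are a web research assistant. Crawl pages and summarize them.",
    tools=[crawl4ai]
)

agent.start("Summarize the content of https://example.com")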

Available Functions

Function                 Description
crawl4ai                 Async crawl a URL and get markdown
crawl4ai_many            Crawl multiple URLs concurrently
crawl4ai_extract         Extract data using CSS selectors
crawl4ai_llm_extract     Extract data using LLM
crawl4ai_sync            Synchronous version of crawl4ai
crawl4ai_extract_sync    Synchronous CSS extraction

Basic Usage

Simple Crawl

import asyncio
from praisonaiagents.tools import crawl4ai

async def main():
    result = await crawl4ai("https://example.com")
    
    if result["success"]:
        print(f"URL: {result['url']}")
        print(f"Markdown: {result['markdown'][:500]}...")
        print(f"Links: {len(result['links'].get('internal', []))}")
    else:
        print(f"Error: {result['error']}")

asyncio.run(main())

Crawl with Options

import asyncio
from praisonaiagents.tools import crawl4ai

async def main():
    result = await crawl4ai(
        url="https://example.com",
        css_selector="main.content",  # Focus on specific content
        js_code="window.scrollTo(0, document.body.scrollHeight);",  # Execute JS
        wait_for="css:.loaded",  # Wait for element
        screenshot=True  # Capture screenshot
    )
    
    if result["success"]:
        print(result["markdown"])
        if result.get("screenshot"):
            print("Screenshot captured!")

asyncio.run(main())

Crawl Multiple URLs

import asyncio
from praisonaiagents.tools import crawl4ai_many

async def main():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ]
    
    results = await crawl4ai_many(urls)
    
    for result in results:
        if result["success"]:
            print(f"✓ {result['url']}: {len(result['markdown'])} chars")
        else:
            print(f"✗ {result['url']}: {result['error']}")

asyncio.run(main())
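
crawl4ai_many manages concurrency for you. If you need to cap parallelism yourself (for example, to avoid hammering a single host), a small asyncio.Semaphore wrapper around the single-URL function works; this is a sketch, not part of the built-in API:
import asyncio
from praisonaiagents.tools import crawl4ai

async def crawl_limited(urls, max_concurrent=3):
    # Cap the number of in-flight requests with a semaphore.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def crawl_one(url):
        async with semaphore:
            return await crawl4ai(url)

    return await asyncio.gather(*(crawl_one(u) for u in urls))

results = asyncio.run(crawl_limited([
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]))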

Extract with CSS Selectors

import asyncio
from praisonaiagents.tools import crawl4ai_extract

async def main():
    schema = {
        "name": "Products",
        "baseSelector": "div.product",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "price", "selector": ".price", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }
    
    result = await crawl4ai_extract(
        url="https://example.com/products",
        schema=schema
    )
    
    if result["success"]:
        print(f"Extracted {result['count']} items")
        for item in result["data"]:
            print(f"  - {item['title']}: {item['price']}")

asyncio.run(main())

Extract with LLM

import asyncio
from praisonaiagents.tools import crawl4ai_llm_extract

async def main():
    result = await crawl4ai_llm_extract(
        url="https://openai.com/api/pricing/",
        instruction="Extract all model names with their input and output token prices",
        provider="openai/gpt-4o-mini"
    )
    
    if result["success"]:
        print("Extracted data:", result["data"])

asyncio.run(main())

Using Crawl4AITools Class

For more control, use the Crawl4AITools class directly:
import asyncio
from praisonaiagents.tools import Crawl4AITools

async def main():
    tools = Crawl4AITools(headless=True, verbose=False)
    
    try:
        # Basic crawl
        result = await tools.crawl("https://example.com")
        print(result["markdown"][:500])
        
        # CSS extraction
        schema = {
            "name": "Articles",
            "baseSelector": "article",
            "fields": [
                {"name": "title", "selector": "h2", "type": "text"},
                {"name": "summary", "selector": "p", "type": "text"}
            ]
        }
        result = await tools.extract_css("https://example.com/blog", schema)
        print(result["data"])
        
    finally:
        await tools.close()

asyncio.run(main())

Synchronous Usage

For non-async code, use the sync versions:
from praisonaiagents.tools import crawl4ai_sync, crawl4ai_extract_sync

# Simple crawl
result = crawl4ai_sync("https://example.com")
print(result["markdown"][:500])

# CSS extraction
schema = {
    "name": "Items",
    "baseSelector": ".item",
    "fields": [{"name": "title", "selector": "h3", "type": "text"}]
}
result = crawl4ai_extract_sync("https://example.com/items", schema)
print(result["data"])

Schema Reference

CSS Extraction Schema

schema = {
    "name": "Schema Name",
    "baseSelector": "div.item",  # CSS selector for each item
    "fields": [
        {
            "name": "field_name",
            "selector": "h2",  # CSS selector within item
            "type": "text"  # text, attribute, html, nested, list, nested_list
        },
        {
            "name": "link",
            "selector": "a",
            "type": "attribute",
            "attribute": "href"
        },
        {
            "name": "details",
            "selector": ".details",
            "type": "nested",
            "fields": [
                {"name": "brand", "selector": ".brand", "type": "text"}
            ]
        }
    ]
}

Field Types

Type           Description
text           Extract text content
attribute      Extract an HTML attribute (specify the attribute key)
html           Extract raw HTML
nested         Single nested object
list           List of simple items
nested_list    List of complex objects
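
The simple types appear in the schema above. For the list types, here is a sketch with hypothetical selectors, following the same schema layout (the sub-fields mirror Crawl4AI's extraction-strategy convention):
schema = {
    "name": "Products",
    "baseSelector": "div.product",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {
            # List of simple values: one entry per matched element
            "name": "tags",
            "selector": ".tag",
            "type": "list",
            "fields": [{"name": "tag", "type": "text"}]
        },
        {
            # List of objects: each .review yields a dict of sub-fields
            "name": "reviews",
            "selector": ".review",
            "type": "nested_list",
            "fields": [
                {"name": "reviewer", "selector": ".name", "type": "text"},
                {"name": "rating", "selector": ".rating", "type": "text"}
            ]
        }
    ]
}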

JavaScript Execution

Execute JavaScript before crawling:
result = await crawl4ai(
    url="https://example.com",
    js_code="""
        // Scroll to load lazy content
        window.scrollTo(0, document.body.scrollHeight);
        
        // Click a button
        document.querySelector('.load-more')?.click();
    """,
    wait_for="css:.loaded-content"  # Wait for content to appear
)

Wait Conditions

# Wait for CSS selector
wait_for="css:.content-loaded"

# Wait for JavaScript condition
wait_for="js:() => document.querySelectorAll('.item').length > 10"

Key Points

  • Async by default: Use await for all crawl functions
  • JavaScript rendering: Full browser support for dynamic content
  • CSS extraction: Fast, no-LLM structured data extraction
  • LLM extraction: AI-powered extraction for complex content
  • Multi-URL: Efficient concurrent crawling