

AI News Crawler

Crawl AI news from multiple sources including HackerNews, Reddit, arXiv, and GitHub trending repositories.

CLI Quickstart

# Install
pip install praisonai praisonai-tools

# Run the crawler
praisonai recipe run ai-news-crawler \
  --input '{"sources": ["hackernews", "reddit", "arxiv"], "max_articles": 20}' \
  --json

# Read input from a file and write results to an output directory
praisonai recipe run ai-news-crawler \
  --input-file config.json \
  --out-dir ./output
Output:
{
  "ok": true,
  "run_id": "run_abc123",
  "recipe": "ai-news-crawler",
  "output": {
    "articles": [
      {"title": "...", "url": "...", "source": "hackernews", "score": 100}
    ],
    "total": 20
  }
}
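
The --input-file flag reads the input payload from a JSON file instead of the command line. A minimal config.json, using fields from the input schema below (values illustrative):

{
  "sources": ["hackernews", "reddit", "arxiv"],
  "max_articles": 20,
  "time_window_hours": 24
}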

Use in Your App (SDK)

from praisonai.recipe import run, run_stream

# Basic usage
result = run(
    "ai-news-crawler",
    input={
        "sources": ["hackernews", "reddit", "arxiv"],
        "max_articles": 20,
        "time_window_hours": 24
    }
)

print(f"Crawled {len(result.output['articles'])} articles")
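
# run_stream, imported above, is the streaming counterpart of run().
# Its event format is not documented on this page, so this loop is a
# sketch (assumption): it just prints whatever the generator yields.
for event in run_stream("ai-news-crawler", input={"sources": ["hackernews"]}):
    print(event)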

# Direct tool usage
import sys
sys.path.insert(0, 'agent_recipes/templates/ai-news-crawler')
from tools import crawl_hackernews, crawl_reddit_ai, crawl_arxiv

# Crawl HackerNews
hn_articles = crawl_hackernews(max_articles=10, time_window_hours=24)

# Crawl Reddit
reddit_articles = crawl_reddit_ai(subreddits=["MachineLearning", "artificial"])

# Crawl arXiv
arxiv_articles = crawl_arxiv(categories=["cs.AI", "cs.LG"], max_results=10)
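
The three helpers return independent lists. A common follow-up is merging them and deduplicating by URL; this sketch assumes each article is a dict with a "url" key, matching the article fields in the output schema below.

# Merge per-source results and drop duplicate URLs
seen = set()
articles = []
for article in hn_articles + reddit_articles + arxiv_articles:
    if article["url"] not in seen:
        seen.add(article["url"])
        articles.append(article)

print(f"{len(articles)} unique articles")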

Use as HTTP Server

Start Server

praisonai serve recipe --port 8080

Invoke via curl

curl -X POST http://localhost:8080/v1/recipes/run \
  -H "Content-Type: application/json" \
  -d '{
    "recipe": "ai-news-crawler",
    "input": {
      "sources": ["hackernews", "reddit"],
      "max_articles": 10
    }
  }'
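
The same endpoint can be called from Python with the requests library; the payload mirrors the curl example above.

import requests

# POST the recipe name and input to the running server
response = requests.post(
    "http://localhost:8080/v1/recipes/run",
    json={
        "recipe": "ai-news-crawler",
        "input": {"sources": ["hackernews", "reddit"], "max_articles": 10},
    },
)
response.raise_for_status()
print(response.json())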

Input Schema

{
  "type": "object",
  "properties": {
    "sources": {
      "type": "array",
      "items": {"type": "string"},
      "description": "News sources: hackernews, reddit, arxiv, github, web"
    },
    "max_articles": {
      "type": "integer",
      "default": 50
    },
    "time_window_hours": {
      "type": "integer",
      "default": 24
    },
    "keywords": {
      "type": "array",
      "description": "Filter keywords"
    }
  }
}
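
For example, this input validates against the schema (values illustrative):

{
  "sources": ["hackernews", "arxiv"],
  "max_articles": 10,
  "time_window_hours": 12,
  "keywords": ["agents", "LLM"]
}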

Output Schema

{
  "articles": [
    {
      "title": "string",
      "url": "string",
      "source": "string",
      "score": "number",
      "timestamp": "string",
      "content": "string"
    }
  ],
  "total": "number",
  "sources_crawled": ["string"]
}

Configuration

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| sources | array | ["hackernews"] | News sources to crawl |
| max_articles | int | 50 | Maximum articles per source |
| time_window_hours | int | 24 | Time window for filtering |
| keywords | array | ["AI", "ML"] | Filter keywords |

Dependencies

pip install requests feedparser beautifulsoup4
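
requests fetches pages and APIs, feedparser parses RSS/Atom feeds, and beautifulsoup4 parses HTML. As a standalone illustration of the feed path (not the recipe's own code), arXiv's public Atom API can be read directly with feedparser:

import feedparser

# Query the public arXiv API for recent cs.AI papers (illustrative only)
feed = feedparser.parse(
    "http://export.arxiv.org/api/query?search_query=cat:cs.AI&max_results=5"
)
for entry in feed.entries:
    print(entry.title, entry.link)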

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| TAVILY_API_KEY | Optional | For the web search source |
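
Export it in your shell before running the recipe or server:

export TAVILY_API_KEY=your-key-here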