AI News Crawler

Crawl AI news from multiple sources including HackerNews, Reddit, arXiv, and GitHub trending repositories.

CLI Quickstart

# Install
pip install praisonai praisonai-tools

# Run the crawler
praisonai recipe run ai-news-crawler \
  --input '{"sources": ["hackernews", "reddit", "arxiv"], "max_articles": 20}' \
  --json

# Read input from a file and write results to an output directory
praisonai recipe run ai-news-crawler \
  --input-file config.json \
  --out-dir ./output
Output:
{
  "ok": true,
  "run_id": "run_abc123",
  "recipe": "ai-news-crawler",
  "output": {
    "articles": [
      {"title": "...", "url": "...", "source": "hackernews", "score": 100}
    ],
    "total": 20
  }
}
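
For the --input-file variant above, config.json carries the same payload you would pass inline with --input. A plausible file, using only fields from the input schema below (the keyword values are illustrative):

{
  "sources": ["hackernews", "reddit", "arxiv"],
  "max_articles": 20,
  "time_window_hours": 24,
  "keywords": ["LLM", "agents"]
}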

Use in Your App (SDK)

from praisonai.recipes import run_recipe

# Basic usage
result = run_recipe(
    "ai-news-crawler",
    input={
        "sources": ["hackernews", "reddit", "arxiv"],
        "max_articles": 20,
        "time_window_hours": 24
    }
)

print(f"Crawled {len(result['output']['articles'])} articles")
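
If run_recipe returns the same envelope the CLI prints with --json (ok, run_id, output) — an assumption worth verifying — it is prudent to check ok before reading the payload. A minimal sketch under that assumption:

from praisonai.recipes import run_recipe

result = run_recipe("ai-news-crawler", input={"sources": ["hackernews"]})

# Guard against failed runs before touching the payload
# (assumes the SDK mirrors the CLI's {"ok": ..., "output": ...} envelope).
if not result.get("ok", False):
    raise RuntimeError(f"Recipe run {result.get('run_id', 'unknown')} failed")

for article in result["output"]["articles"]:
    print(f"[{article['source']}] {article['title']} -> {article['url']}")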

# Direct tool usage
import sys
sys.path.insert(0, 'agent_recipes/templates/ai-news-crawler')
from tools import crawl_hackernews, crawl_reddit_ai, crawl_arxiv

# Crawl HackerNews
hn_articles = crawl_hackernews(max_articles=10, time_window_hours=24)

# Crawl Reddit
reddit_articles = crawl_reddit_ai(subreddits=["MachineLearning", "artificial"])

# Crawl arXiv
arxiv_articles = crawl_arxiv(categories=["cs.AI", "cs.LG"], max_results=10)
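
Assuming each crawler returns a list of dicts shaped like the articles in the output schema below (worth confirming against the tool signatures in tools.py), the per-source results can be merged, de-duplicated by URL, and ranked by score:

# Merge per-source results, drop duplicate URLs, rank by score.
all_articles = hn_articles + reddit_articles + arxiv_articles

seen_urls = set()
unique = []
for article in all_articles:
    if article["url"] not in seen_urls:
        seen_urls.add(article["url"])
        unique.append(article)

# Highest-scored articles first; missing scores sort last.
top = sorted(unique, key=lambda a: a.get("score", 0), reverse=True)[:10]
for a in top:
    print(f"{a.get('score', 0):>6}  {a['title']}")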

Use as HTTP Server

Start Server

praisonai recipe serve --port 8080

Invoke via curl

curl -X POST http://localhost:8080/v1/recipes/run \
  -H "Content-Type: application/json" \
  -d '{
    "recipe": "ai-news-crawler",
    "input": {
      "sources": ["hackernews", "reddit"],
      "max_articles": 10
    }
  }'
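
The same endpoint works from any HTTP client. An equivalent sketch in Python with requests, assuming the server responds with the JSON envelope shown in the CLI quickstart:

import requests

# Same payload as the curl example above.
resp = requests.post(
    "http://localhost:8080/v1/recipes/run",
    json={
        "recipe": "ai-news-crawler",
        "input": {"sources": ["hackernews", "reddit"], "max_articles": 10},
    },
    timeout=120,  # crawling multiple sources can be slow
)
resp.raise_for_status()

data = resp.json()
print(data["output"]["total"], "articles from", data["output"]["sources_crawled"])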

Input Schema

{
  "type": "object",
  "properties": {
    "sources": {
      "type": "array",
      "items": {"type": "string"},
      "description": "News sources: hackernews, reddit, arxiv, github, web"
    },
    "max_articles": {
      "type": "integer",
      "default": 50
    },
    "time_window_hours": {
      "type": "integer",
      "default": 24
    },
    "keywords": {
      "type": "array",
      "description": "Filter keywords"
    }
  }
}
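
Since this is standard JSON Schema, a payload can be validated client-side before a run. A sketch using the third-party jsonschema package (not among this recipe's dependencies):

from jsonschema import validate, ValidationError

INPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "sources": {"type": "array", "items": {"type": "string"}},
        "max_articles": {"type": "integer", "default": 50},
        "time_window_hours": {"type": "integer", "default": 24},
        "keywords": {"type": "array"},
    },
}

payload = {"sources": ["hackernews", "arxiv"], "max_articles": 15}

try:
    validate(instance=payload, schema=INPUT_SCHEMA)  # raises on mismatch
except ValidationError as err:
    raise SystemExit(f"Invalid input: {err.message}")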

Output Schema

{
  "articles": [
    {
      "title": "string",
      "url": "string",
      "source": "string",
      "score": "number",
      "timestamp": "string",
      "content": "string"
    }
  ],
  "total": "number",
  "sources_crawled": ["string"]
}
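
In typed code the output shape translates directly into TypedDicts. A sketch, mapping the schema's "number" to float:

from typing import List, TypedDict

class Article(TypedDict):
    title: str
    url: str
    source: str
    score: float
    timestamp: str
    content: str

class CrawlOutput(TypedDict):
    articles: List[Article]
    total: float
    sources_crawled: List[str]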

Configuration

Option             Type    Default          Description
sources            array   ["hackernews"]   News sources to crawl
max_articles       int     50               Maximum articles per source
time_window_hours  int     24               Time window for filtering
keywords           array   ["AI", "ML"]     Filter keywords

Dependencies

pip install requests feedparser beautifulsoup4
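
To see what these libraries are for, here is a minimal stand-alone example in the same spirit as crawl_arxiv, fetching recent cs.AI submissions from the public arXiv Atom API with feedparser (a sketch of the general approach, not the recipe's actual implementation):

import feedparser

# The public arXiv API returns an Atom feed that feedparser handles directly.
url = (
    "http://export.arxiv.org/api/query"
    "?search_query=cat:cs.AI"
    "&sortBy=submittedDate&sortOrder=descending&max_results=5"
)
feed = feedparser.parse(url)

for entry in feed.entries:
    print(entry.title, "->", entry.link)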

Environment Variables

Variable         Required   Description
TAVILY_API_KEY   Optional   For web search source
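
The key is only consulted when "web" appears in sources. Set it in the environment before serving or calling the SDK; a sketch (the key value is a placeholder):

import os

# Required only when "web" is among the requested sources.
os.environ["TAVILY_API_KEY"] = "tvly-..."  # placeholder, not a real key

from praisonai.recipes import run_recipe

result = run_recipe("ai-news-crawler", input={"sources": ["web"], "max_articles": 5})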