AI News Crawler

Crawl AI news from multiple sources including HackerNews, Reddit, arXiv, and GitHub trending repositories.

CLI Quickstart

# Install
pip install praisonai praisonai-tools

# Run the crawler
praisonai recipe run ai-news-crawler \
  --input '{"sources": ["hackernews", "reddit", "arxiv"], "max_articles": 20}' \
  --json

# Read input from a file and write results to an output directory
praisonai recipe run ai-news-crawler \
  --input-file config.json \
  --out-dir ./output
Output:
{
  "ok": true,
  "run_id": "run_abc123",
  "recipe": "ai-news-crawler",
  "output": {
    "articles": [
      {"title": "...", "url": "...", "source": "hackernews", "score": 100}
    ],
    "total": 20
  }
}
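
For the --input-file variant above, config.json carries the same payload you would pass inline with --input. A plausible file, using only fields from the input schema below (the keyword values are illustrative):

{
  "sources": ["hackernews", "reddit", "arxiv"],
  "max_articles": 20,
  "time_window_hours": 24,
  "keywords": ["LLM", "agents"]
}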

Use in Your App (SDK)

from praisonai.recipes import run_recipe

# Basic usage
result = run_recipe(
    "ai-news-crawler",
    input={
        "sources": ["hackernews", "reddit", "arxiv"],
        "max_articles": 20,
        "time_window_hours": 24
    }
)

print(f"Crawled {len(result['output']['articles'])} articles")
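
If run_recipe returns the same envelope the CLI prints with --json (ok, run_id, output) — an assumption worth verifying — it is prudent to check ok before reading the payload. A minimal sketch under that assumption:

from praisonai.recipes import run_recipe

result = run_recipe("ai-news-crawler", input={"sources": ["hackernews"]})

# Guard against failed runs before touching the payload
# (assumes the SDK mirrors the CLI's {"ok": ..., "output": ...} envelope).
if not result.get("ok", False):
    raise RuntimeError(f"Recipe run {result.get('run_id', 'unknown')} failed")

for article in result["output"]["articles"]:
    print(f"[{article['source']}] {article['title']} -> {article['url']}")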

# Direct tool usage
import sys
sys.path.insert(0, 'agent_recipes/templates/ai-news-crawler')
from tools import crawl_hackernews, crawl_reddit_ai, crawl_arxiv

# Crawl HackerNews
hn_articles = crawl_hackernews(max_articles=10, time_window_hours=24)

# Crawl Reddit
reddit_articles = crawl_reddit_ai(subreddits=["MachineLearning", "artificial"])

# Crawl arXiv
arxiv_articles = crawl_arxiv(categories=["cs.AI", "cs.LG"], max_results=10)
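
Assuming each crawler returns a list of dicts shaped like the articles in the output schema below (worth confirming against the tool signatures in tools.py), the per-source results can be merged, de-duplicated by URL, and ranked by score:

# Merge per-source results, drop duplicate URLs, rank by score.
all_articles = hn_articles + reddit_articles + arxiv_articles

seen_urls = set()
unique = []
for article in all_articles:
    if article["url"] not in seen_urls:
        seen_urls.add(article["url"])
        unique.append(article)

# Highest-scored articles first; missing scores sort last.
top = sorted(unique, key=lambda a: a.get("score", 0), reverse=True)[:10]
for a in top:
    print(f"{a.get('score', 0):>6}  {a['title']}")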

Use as HTTP Server

Start Server

praisonai recipe serve --port 8080

Invoke via curl

curl -X POST http://localhost:8080/v1/recipes/run \
  -H "Content-Type: application/json" \
  -d '{
    "recipe": "ai-news-crawler",
    "input": {
      "sources": ["hackernews", "reddit"],
      "max_articles": 10
    }
  }'
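
The same endpoint works from any HTTP client. An equivalent sketch in Python with requests, assuming the server responds with the JSON envelope shown in the CLI quickstart:

import requests

# Same payload as the curl example above.
resp = requests.post(
    "http://localhost:8080/v1/recipes/run",
    json={
        "recipe": "ai-news-crawler",
        "input": {"sources": ["hackernews", "reddit"], "max_articles": 10},
    },
    timeout=120,  # crawling multiple sources can be slow
)
resp.raise_for_status()

data = resp.json()
print(data["output"]["total"], "articles from", data["output"]["sources_crawled"])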

Input Schema

{
  "type": "object",
  "properties": {
    "sources": {
      "type": "array",
      "items": {"type": "string"},
      "description": "News sources: hackernews, reddit, arxiv, github, web"
    },
    "max_articles": {
      "type": "integer",
      "default": 50
    },
    "time_window_hours": {
      "type": "integer",
      "default": 24
    },
    "keywords": {
      "type": "array",
      "description": "Filter keywords"
    }
  }
}
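
Since this is standard JSON Schema, a payload can be validated client-side before a run. A sketch using the third-party jsonschema package (not among this recipe's dependencies):

from jsonschema import validate, ValidationError

INPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "sources": {"type": "array", "items": {"type": "string"}},
        "max_articles": {"type": "integer", "default": 50},
        "time_window_hours": {"type": "integer", "default": 24},
        "keywords": {"type": "array"},
    },
}

payload = {"sources": ["hackernews", "arxiv"], "max_articles": 15}

try:
    validate(instance=payload, schema=INPUT_SCHEMA)  # raises on mismatch
except ValidationError as err:
    raise SystemExit(f"Invalid input: {err.message}")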

Output Schema

{
  "articles": [
    {
      "title": "string",
      "url": "string",
      "source": "string",
      "score": "number",
      "timestamp": "string",
      "content": "string"
    }
  ],
  "total": "number",
  "sources_crawled": ["string"]
}
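
In typed code the output shape translates directly into TypedDicts. A sketch, mapping the schema's "number" to float:

from typing import List, TypedDict

class Article(TypedDict):
    title: str
    url: str
    source: str
    score: float
    timestamp: str
    content: str

class CrawlOutput(TypedDict):
    articles: List[Article]
    total: float
    sources_crawled: List[str]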

Configuration

Option             Type    Default          Description
sources            array   ["hackernews"]   News sources to crawl
max_articles       int     50               Maximum articles per source
time_window_hours  int     24               Time window for filtering
keywords           array   ["AI", "ML"]     Filter keywords

Dependencies

pip install requests feedparser beautifulsoup4
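
To see what these libraries are for, here is a minimal stand-alone example in the same spirit as crawl_arxiv, fetching recent cs.AI submissions from the public arXiv Atom API with feedparser (a sketch of the general approach, not the recipe's actual implementation):

import feedparser

# The public arXiv API returns an Atom feed that feedparser handles directly.
url = (
    "http://export.arxiv.org/api/query"
    "?search_query=cat:cs.AI"
    "&sortBy=submittedDate&sortOrder=descending&max_results=5"
)
feed = feedparser.parse(url)

for entry in feed.entries:
    print(entry.title, "->", entry.link)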

Environment Variables

Variable         Required   Description
TAVILY_API_KEY   Optional   For web search source
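
The key is only consulted when "web" appears in sources. Set it in the environment before serving or calling the SDK; a sketch (the key value is a placeholder):

import os

# Required only when "web" is among the requested sources.
os.environ["TAVILY_API_KEY"] = "tvly-..."  # placeholder, not a real key

from praisonai.recipes import run_recipe

result = run_recipe("ai-news-crawler", input={"sources": ["web"], "max_articles": 5})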