Firecrawl

Firecrawl handles the grubby parts of web ingestion: JavaScript rendering, sitemap following, rate limiting, and content extraction. It emits LLM-ready Markdown (or JSON with a schema), making it a drop-in source for RAG corpora, agent browsing tools, and continuous knowledge sync.

Framework facts

Category: rag
Language: TypeScript / Python
License: AGPL-3.0
Repository: https://github.com/mendableai/firecrawl

Install

pip install firecrawl-py
# or
npm install @mendable/firecrawl-js

Quickstart

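from firecrawl import FirecrawlApp

# Replace 'fc-...' with your own API key.
app = FirecrawlApp(api_key='fc-...')
# Crawl at most 50 pages; assumes an SDK version whose crawl_url returns a dict.
job = app.crawl_url('https://engineering.vips.edu', params={'limit': 50})
for page in job['data']:
    meta = page.get('metadata', {})
    # The page URL may be stored as 'sourceURL' or 'url' depending on the version.
    print(meta.get('sourceURL') or meta.get('url'), len(page.get('markdown') or ''))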
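
Since the crawl output is meant to feed a RAG corpus, a next step might look like the sketch below. It assumes a result shaped like the Quickstart's `job` (a dict with a `data` list of pages, each carrying `markdown` and `metadata`). `iter_chunks` and `max_chars` are hypothetical names, and paragraph-based splitting is only one simple chunking choice, not something Firecrawl prescribes.

```python
from typing import Iterator


def iter_chunks(crawl_result: dict, max_chars: int = 1000) -> Iterator[dict]:
    """Yield {'url', 'text'} chunks from a crawl result shaped like the Quickstart's `job`.

    Hypothetical helper: packs whole paragraphs into chunks of roughly
    max_chars characters. A single paragraph longer than max_chars is
    emitted as its own oversized chunk rather than being cut mid-sentence.
    """
    for page in crawl_result.get("data", []):
        markdown = page.get("markdown") or ""
        meta = page.get("metadata") or {}
        # Assumption: the page URL lives under 'sourceURL' or 'url'.
        url = meta.get("sourceURL") or meta.get("url")
        buf = ""
        for para in markdown.split("\n\n"):
            para = para.strip()
            if not para:
                continue
            # Flush the buffer when adding this paragraph would overflow it.
            if buf and len(buf) + len(para) + 2 > max_chars:
                yield {"url": url, "text": buf}
                buf = para
            else:
                buf = f"{buf}\n\n{para}" if buf else para
        if buf:
            yield {"url": url, "text": buf}
```

Each chunk keeps its source URL, so retrieved passages can be cited back to the page they came from.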

Alternatives

  • Jina Reader — single-URL markdownification
  • Trafilatura — local HTML extraction
  • Crawl4AI — async crawler for RAG
  • ScrapingBee — commercial rendering API

Frequently asked questions

Cloud or self-host?

The cloud API is the fastest way to start and scales headless browsers for you. Self-host when you need private network access, data residency, or crawl volume that isn't capped by a cloud plan.
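
One way to keep the cloud-versus-self-host choice out of application code is to read it from the environment, as in this sketch. The environment variable names, the `client_settings` helper, and the idea of passing the result as `FirecrawlApp(**client_settings())` are assumptions for illustration; check that your installed SDK version accepts an `api_url` argument before relying on it.

```python
import os

# Public cloud endpoint; a self-hosted instance would use its own base URL.
CLOUD_API_URL = "https://api.firecrawl.dev"


def client_settings(env=None) -> dict:
    """Return keyword arguments for a Firecrawl client, defaulting to the cloud API.

    Hypothetical helper. Assumed variables:
      FIRECRAWL_API_URL  base URL of a self-hosted instance (optional)
      FIRECRAWL_API_KEY  API key (required for the cloud API)
    """
    env = os.environ if env is None else env
    api_url = env.get("FIRECRAWL_API_URL", CLOUD_API_URL).rstrip("/")
    settings = {"api_url": api_url}
    api_key = env.get("FIRECRAWL_API_KEY")
    if api_key:
        settings["api_key"] = api_key
    elif api_url == CLOUD_API_URL:
        # Fail early instead of sending unauthenticated requests to the cloud.
        raise ValueError("The cloud API needs FIRECRAWL_API_KEY")
    return settings
```

With this in place, switching a deployment from cloud to self-hosted is a configuration change rather than a code change.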

Does Firecrawl respect robots.txt?

Yes, by default. On self-hosted deployments you can override this, but you are then responsible for legal and ethical compliance.
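
If you do override the default, you may still want your own check before queuing URLs. The sketch below uses Python's standard `urllib.robotparser` and is independent of Firecrawl; `allowed_urls` is a hypothetical helper, and it takes the robots.txt text as a string so that fetching it stays under your control.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt: str, urls: list[str], user_agent: str = "*") -> list[str]:
    """Return the subset of urls that robots_txt permits user_agent to fetch."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]


robots = "User-agent: *\nDisallow: /private/\n"
candidates = ["https://example.com/docs", "https://example.com/private/x"]
print(allowed_urls(robots, candidates))
```

The rules in one robots.txt apply only to the host that served it, so a multi-site crawl needs one parser per host.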

Sources

  1. Firecrawl — GitHub — accessed 2026-04-20
  2. Firecrawl — docs — accessed 2026-04-20