Trafilatura

Trafilatura is a Python library and command-line tool for turning raw HTML into clean text, Markdown, or XML. It combines boilerplate removal and metadata extraction with optional language filtering, and it is used in both academic and industrial crawling projects. Extraction runs locally on the HTML you give it: no hosted API and no browser are required, and network access is needed only when you use its downloader (for example fetch_url).

Framework facts

Category: rag
Language: Python
License: Apache-2.0
Repository: https://github.com/adbar/trafilatura

Install

pip install trafilatura

Quickstart

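import trafilatura

# fetch_url returns None when the download fails.
html = trafilatura.fetch_url('https://engineering.vips.edu/about')
if html is None:
    raise SystemExit('download failed')
# extract returns None when no main content is found.
text = trafilatura.extract(html, output_format='markdown', with_metadata=True)
print(text[:500] if text else 'no content extracted')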
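
Extraction also works on HTML that is already in memory, with no download step. The sketch below runs one document through the text, Markdown, and XML output formats. SAMPLE_HTML, extract_in_formats, and the injectable extract argument are illustrative names, not part of Trafilatura; the only library call assumed is trafilatura.extract with its output_format parameter.

```python
from typing import Callable, Dict, Optional, Sequence

# Hypothetical sample page held in memory, so no download is involved.
SAMPLE_HTML = """<html><head><title>Sample page</title></head>
<body><nav>Home | About</nav>
<article><h1>Sample page</h1>
<p>Trafilatura keeps the main text of a page and drops navigation and footers.</p>
<p>The same input can be serialized as plain text, Markdown, or XML.</p>
</article><footer>Copyright notice</footer></body></html>"""


def _trafilatura_extract(html: str, **options) -> Optional[str]:
    # Imported lazily so this helper loads even where the library is absent.
    import trafilatura
    return trafilatura.extract(html, **options)


def extract_in_formats(
    html: str,
    formats: Sequence[str] = ("txt", "markdown", "xml"),
    extract: Callable[..., Optional[str]] = _trafilatura_extract,
) -> Dict[str, Optional[str]]:
    """Run the extractor once per output format; a value is None when nothing is found."""
    return {fmt: extract(html, output_format=fmt) for fmt in formats}


def main() -> None:
    # Uses the real library; very short pages may yield no content.
    for fmt, output in extract_in_formats(SAMPLE_HTML).items():
        print(f"--- {fmt} ---")
        print(output if output is not None else "(no content extracted)")
```

Calling main() prints one section per format. The extractor is passed in as an argument only so the helper can be exercised with a stub; in normal use the default is enough.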

Alternatives

  • Readability.js / python-readability
  • Newspaper3k — article-focused
  • Goose3 — HTML to article text
  • Jina Reader — hosted Markdown API

Frequently asked questions

Does Trafilatura render JavaScript?

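No. It parses the HTML it is given and does not execute scripts, so content injected client-side is missing unless the page is rendered first. Pair it with Playwright or a rendering service for JS-heavy sites.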
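
One way to pair the two, as a minimal sketch: let Playwright load the page in headless Chromium, then hand the rendered DOM to Trafilatura. render_html and extract_rendered are hypothetical helper names; the sketch assumes Playwright's sync API is installed along with its Chromium build, and that waiting for "networkidle" is a reasonable signal that the page has finished loading, which will not hold for every site.

```python
from typing import Callable, Optional


def render_html(url: str, timeout_ms: int = 30_000) -> str:
    """Load `url` in headless Chromium and return the DOM after scripts have run."""
    # Imported lazily so the module loads without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Assumption: network idle means client-side rendering is done.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()


def extract_rendered(
    url: str,
    render: Callable[[str], str] = render_html,
    extract: Optional[Callable[..., Optional[str]]] = None,
) -> Optional[str]:
    """Render the page, then extract Markdown; returns None if nothing is found."""
    if extract is None:
        import trafilatura
        extract = trafilatura.extract
    html = render(url)
    if not html:
        return None
    # Passing the URL lets the extractor record it as the document source.
    return extract(html, url=url, output_format="markdown")
```

Rendering is far slower than plain downloading, so it is usually worth trying the static HTML first and falling back to a browser only when extraction comes back empty.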

Is it suitable for large-scale crawls?

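Yes, with caveats. Trafilatura itself is written in Python; the HTML parsing it relies on is done by lxml, which wraps the C libraries libxml2 and libxslt. Throughput depends heavily on page size and the options used, so benchmark on a sample of your own pages rather than assuming a fixed pages-per-second figure.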
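
Single-core throughput is easy to check on your own data. The harness below is a rough sketch, not an official benchmark: pages_per_second and main are hypothetical names, the HTML files are assumed to be saved locally already, and the timing covers extraction only, not downloading or disk reads.

```python
import time
from pathlib import Path
from typing import Callable, Iterable, Optional, Sequence, Tuple


def pages_per_second(
    docs: Iterable[str],
    extract: Callable[[str], Optional[str]],
) -> Tuple[int, int, float]:
    """Return (pages processed, pages with content, pages per second)."""
    processed = with_content = 0
    start = time.perf_counter()
    for html in docs:
        if extract(html) is not None:
            with_content += 1
        processed += 1
    elapsed = time.perf_counter() - start
    rate = processed / elapsed if elapsed > 0 else float("inf")
    return processed, with_content, rate


def main(paths: Sequence[str]) -> None:
    # Imported lazily so the harness itself has no third-party dependency.
    import trafilatura

    # Read everything up front so disk I/O stays out of the measurement.
    docs = [Path(p).read_text(encoding="utf-8", errors="replace") for p in paths]
    total, ok, rate = pages_per_second(docs, trafilatura.extract)
    print(f"{total} pages, {ok} with content, {rate:.1f} pages/s on one core")
```

Run main on a few hundred representative pages to get a per-core figure for your own corpus; scaling across cores is then a matter of running several worker processes over separate batches.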

Sources

  1. Trafilatura — GitHub — accessed 2026-04-20
  2. Trafilatura — docs — accessed 2026-04-20