Trafilatura
Trafilatura is a Python library for turning raw HTML into clean text, Markdown, or XML. It combines boilerplate removal, metadata extraction, and optional language detection in an extraction pipeline that is widely used in academic and industrial crawlers. Extraction runs locally, with no hosted API and no browser; only downloading pages requires network access.
Framework facts
- Category
- rag
- Language
- Python
- License
- Apache-2.0
- Repository
- https://github.com/adbar/trafilatura
Install
pip install trafilatura
Quickstart
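import trafilatura
# fetch_url returns None when the download fails
html = trafilatura.fetch_url('https://engineering.vips.edu/about')
# extract returns None when no main content is found
text = trafilatura.extract(html, output_format='markdown', with_metadata=True) if html else None
print(text[:500] if text else 'No content extracted')
Alternatives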
- Readability.js / python-readability
- Newspaper3k — article-focused
- Goose3 — HTML to article text
- Jina Reader — hosted Markdown API
Frequently asked questions
Does Trafilatura render JavaScript?
No. It parses server-rendered HTML only. Pair it with Playwright or a rendering service for JS-heavy sites.
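The pairing with Playwright described above could look roughly like the following sketch. The helper names (`render_with_playwright`, `extract_markdown`, `fetch_js_page`), the 30-second timeout, and the `networkidle` wait condition are assumptions chosen for illustration, not part of Trafilatura. It assumes Playwright and a Chromium build are installed separately.

```python
from typing import Callable, Optional


def render_with_playwright(url: str, timeout_ms: int = 30_000) -> str:
    """Load the page in headless Chromium and return the rendered DOM as HTML."""
    # Imported lazily so this module still loads where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # "networkidle" is an assumed wait condition; some sites need a selector wait instead.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()


def extract_markdown(html: str) -> Optional[str]:
    """Run Trafilatura on an HTML string; returns None if no main content is found."""
    import trafilatura

    return trafilatura.extract(html, output_format="markdown")


def fetch_js_page(
    url: str,
    render: Callable[[str], str] = render_with_playwright,
    extract: Callable[[str], Optional[str]] = extract_markdown,
) -> Optional[str]:
    """Render first, then extract. Both steps are injectable (hypothetical design)."""
    html = render(url)
    if not html:
        return None
    return extract(html)
```

Keeping the renderer as a parameter means a hosted rendering service can be swapped in by passing any function that maps a URL to an HTML string.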
Is it suitable for large-scale crawls?
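Yes. Trafilatura is a Python package that delegates HTML parsing to lxml, which wraps the C libraries libxml2 and libxslt. Throughput depends on page size, extraction options, and hardware, so benchmark on your own data and scale out across processes or machines as needed.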
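For larger batches, one common approach is to spread extraction over worker processes. The sketch below is a minimal illustration under stated assumptions: `extract_text` and `extract_many` are hypothetical helper names, and the worker count and chunk size are arbitrary defaults to tune. It assumes the HTML has already been downloaded.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, Iterable, List, Optional


def extract_text(html: str) -> Optional[str]:
    """Extract main text from one HTML string; None if nothing is found."""
    # Lazy import: each worker process loads trafilatura on first use.
    import trafilatura

    return trafilatura.extract(html)


def extract_many(
    pages: Iterable[str],
    workers: int = 4,
    extract: Callable[[str], Optional[str]] = extract_text,
    chunksize: int = 16,
) -> List[Optional[str]]:
    """Apply `extract` to every page, preserving input order."""
    pages = list(pages)
    if workers <= 1:
        # Inline path: handy for debugging and for small batches.
        return [extract(page) for page in pages]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize batches pages per task to reduce inter-process overhead.
        return list(pool.map(extract, pages, chunksize=chunksize))
```

On platforms that start workers with the spawn method (Windows, macOS), call `extract_many` from under an `if __name__ == "__main__":` guard so worker processes can import the module cleanly.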
Sources
- Trafilatura — GitHub — accessed 2026-04-20
- Trafilatura — docs — accessed 2026-04-20