Docling

App in the BluixApps catalog

What it is

Docling is IBM's document conversion library. It transforms PDF, DOCX, PPTX, and HTML files into structured Markdown or JSON, with layout-aware OCR, table detection, image extraction, and formula recognition. It is built for RAG preprocessing, where document structure matters.

The MIT-licensed open-source release is the same engine IBM uses in its enterprise AI offerings. Its output captures semantic structure, not just plain text.

What it's for

  • RAG preprocessing — convert your PDF library into clean Markdown for embedding
  • Document digitization — OCR scanned documents with layout preserved
  • Knowledge base ingestion — extract structured content from messy enterprise docs
  • Compliance archival — convert physical documents to searchable format
  • Content migration — DOCX → Markdown for static site generators
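
As an illustration of the RAG preprocessing case, here is a minimal sketch that converts a folder of PDFs into Markdown files with Docling's Python API. The folder names and the helpers markdown_path_for and convert_folder are hypothetical; only DocumentConverter, convert, and export_to_markdown come from Docling itself. The docling import sits inside the function so the path helper stays usable where the library is not installed.

```python
from pathlib import Path


def markdown_path_for(pdf_path: Path, out_dir: Path) -> Path:
    """Map library/report.pdf to out_dir/report.md (hypothetical naming scheme)."""
    return out_dir / (pdf_path.stem + ".md")


def convert_folder(src_dir: Path, out_dir: Path) -> list[Path]:
    """Convert every *.pdf directly under src_dir and write one .md file per PDF."""
    # Imported lazily: requires `pip install docling`.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()  # reuse one converter so models load once
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(src_dir.glob("*.pdf")):
        result = converter.convert(pdf_path)
        target = markdown_path_for(pdf_path, out_dir)
        target.write_text(result.document.export_to_markdown(), encoding="utf-8")
        written.append(target)
    return written
```

Calling convert_folder(Path("library"), Path("markdown")) would leave one Markdown file per PDF, ready for chunking and embedding.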

Who it's for

  • AI engineers building RAG pipelines over real-world PDF corpora
  • Knowledge management teams digitizing legacy document archives
  • Legal & compliance converting contract PDFs into searchable Markdown
  • Researchers extracting structured data from scientific papers
  • Tech writers migrating documentation from Word/PDF to Markdown

Why teams pick Docling over alternatives

  • Layout-aware — preserves table structure, headers, lists (vs simple text extraction)
  • OCR built-in — handles scanned PDFs with Tesseract integration
  • Formula recognition — STEM papers with equations stay intact
  • MIT-licensed: IBM-backed but fully open
  • Python-first — clean API, easy to integrate
  • Output flexibility — Markdown, JSON, with optional structured metadata

Integrations

  • Python API — primary interface; pip install and go
  • HTTP API mode — Docling-Serve wrapper exposes REST endpoint
  • OCR engines — Tesseract, EasyOCR pluggable
  • PDF parsers: docling-parse and pypdfium2 backends
  • LLM frameworks — LangChain document loader available
  • Output formats: Markdown, JSON, DocTags structured format
  • Embedded image handling — extract or inline as base64
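
For non-Python clients, the HTTP mode can be called with any HTTP library. The sketch below uses only the Python standard library and rests on several assumptions that should be checked against the deployed Docling-Serve version: the endpoint path /v1/convert/source, the request body shape, and the X-Api-Key header name. The host docling.example.com is a placeholder. The function returns the parsed JSON response as-is instead of guessing its fields.

```python
import json
import urllib.request


def build_convert_request(base_url, document_url, api_key,
                          path="/v1/convert/source",      # assumed endpoint path
                          api_key_header="X-Api-Key"):     # assumed header name
    """Build a POST request asking the server to convert a document by URL."""
    payload = {
        # Assumed body shape; older Docling-Serve releases used a different schema.
        "sources": [{"kind": "http", "url": document_url}],
        "options": {"to_formats": ["md"]},
    }
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", api_key_header: api_key},
        method="POST",
    )


def convert_remote(base_url, document_url, api_key):
    """Send the request and return the decoded JSON response unchanged."""
    req = build_convert_request(base_url, document_url, api_key)
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping request construction separate from the network call makes it easy to adjust the path or header once the actual server's API documentation has been consulted.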

Notable users & community

  • 20k+ GitHub stars
  • Backed by IBM Research with active engineering team
  • Featured in IBM's enterprise AI stack
  • Strong adoption in research / academic RAG pipelines
  • Growing community around document AI use cases

Tips & operations

  • Use HTTP mode for multi-language stacks: the embedded library only suits Python apps, while REST works for any client
  • Pre-warm models: the first request downloads several hundred MB of model weights, so bake them into the image
  • OCR vs text extraction: disable OCR for born-digital PDFs; this can cut processing time by roughly 10×
  • Batch processing — Docling can handle multiple docs per request; batch when possible
  • GPU acceleration — optional but significantly speeds OCR on scanned doc archives
  • Output cleanup — Docling Markdown can need light post-processing for LLM ingestion
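
Two of the tips above can be sketched in Python. build_pdf_converter turns OCR off through Docling's PDF pipeline options, which is the setting to use for born-digital PDFs. tidy_markdown is a hypothetical post-processing helper, not part of Docling; it assumes image placeholders appear as <!-- image --> comments and that collapsing blank-line runs is acceptable for the downstream LLM.

```python
import re


def build_pdf_converter(do_ocr: bool = False):
    """Return a DocumentConverter whose PDF pipeline has OCR on or off."""
    # Imported lazily: requires `pip install docling`.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = do_ocr  # False skips OCR for born-digital PDFs
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )


def tidy_markdown(md: str) -> str:
    """Light cleanup before LLM ingestion (hypothetical helper)."""
    md = md.replace("<!-- image -->", "")            # assumed placeholder text
    lines = [line.rstrip() for line in md.splitlines()]  # drop trailing spaces
    text = "\n".join(lines)
    text = re.sub(r"\n{3,}", "\n\n", text)           # at most one blank line in a row
    return text.strip() + "\n"
```

A typical use would be tidy_markdown(build_pdf_converter(do_ocr=False).convert("report.pdf").document.export_to_markdown()), where report.pdf stands in for a born-digital file.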

What we ship in BluixApps

  • Docker compose: Docling-Serve HTTP wrapper
  • Pinned quay.io/ds4sd/docling-serve:latest (release-tagged)
  • HTTPS via Let's Encrypt; API key auth enabled
  • OCR enabled by default with Tesseract
  • Persistent model cache volume to avoid re-download on restart
  • API rate limiting configured for fair use
  • Backup not needed (stateless service)

Get this app — pick a BluixApps plan

The catalog is the same; plans scale tenant isolation, white-label, and support tier.

Tier        Tenants    Catalog            Support          White-label  Monthly
Stacks      1          19 curated stacks  Standard         -            $19/mo
Starter     10         Full catalog       Standard         +$15–25/mo   $49/mo
Pro         25         Full catalog       Priority bugfix  +$15–25/mo   $149/mo
Growth      100        Full catalog       Priority bugfix  +$15–25/mo   $349/mo
Scale       500        Full catalog       7-day window     +$15–25/mo   $799/mo
Enterprise  Unlimited  Full catalog       Priority 7-day   Bundled      $1,499/mo
