TL;DR
- Open-source OCR system created by Vik Paruchuri, first released in 2024 and developed alongside the related Marker PDF-to-Markdown project.
- Document-understanding pipeline: text detection, recognition, layout analysis, reading-order prediction, and table recognition from a single Python package.
- Strong multilingual coverage (90+ languages) with competitive accuracy on document-level benchmarks against PaddleOCR and Tesseract.
- Free for research and non-commercial use; commercial use depends on revenue and is governed by the Surya licence terms.
What Surya Offers
Surya is a document OCR system rather than just a text recogniser. The default pipeline returns not only the transcribed text and bounding boxes but also a layout decomposition (title, paragraph, list, figure, table), an inferred reading order, and — for tables — a structured cell grid. That makes it a natural input for downstream document-understanding workloads: RAG ingestion, structured-data extraction, accessibility tooling.
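To make the shape of that output concrete, here is a minimal sketch of a page result and how it might be flattened into chunks for RAG ingestion. The class and field names (`TextLine`, `Region`, `PageResult`, `order`, `cells`) are illustrative assumptions, not Surya's actual result classes.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical data model for illustration only; not Surya's real classes.
BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


@dataclass
class TextLine:
    text: str
    bbox: BBox


@dataclass
class Region:
    label: str  # e.g. "title", "paragraph", "list", "figure", "table"
    bbox: BBox
    order: int  # position in the inferred reading order
    lines: List[TextLine] = field(default_factory=list)
    cells: Optional[List[List[str]]] = None  # row-major grid, tables only


@dataclass
class PageResult:
    regions: List[Region]

    def to_chunks(self) -> List[str]:
        """Return non-empty region texts in reading order."""
        chunks = []
        for region in sorted(self.regions, key=lambda r: r.order):
            if region.cells is not None:
                # Flatten a table into one pipe-separated line per row.
                text = "\n".join(" | ".join(row) for row in region.cells)
            else:
                text = " ".join(line.text for line in region.lines)
            if text:
                chunks.append(text)
        return chunks
```

Keeping the reading-order index on each region means downstream code can sort once and ignore page geometry afterwards.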
It is developed in tandem with Marker, a PDF-to-Markdown converter that uses Surya for the OCR layer. The two projects share infrastructure and are commonly used together for document ingestion pipelines.
Components
- Detection model — text-line and text-block detection for arbitrary orientations.
- Recognition model — transformer-based encoder-decoder text recogniser.
- Layout model — segments pages into semantic regions (text, title, list, table, figure, caption, header, footer).
- Order model — predicts reading order across detected regions, important for multi-column or magazine layouts.
- Table recognition — locates and reconstructs table structure into row/column cells.
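Because detection and layout are separate components, their outputs have to be joined somewhere. One simple way to do that, sketched below, is to assign each detected text line to the layout region that contains its centre point. This is an illustrative approach under assumed `(x0, y0, x1, y1)` boxes, not a description of how Surya combines the two internally.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def center(box: BBox) -> Tuple[float, float]:
    """Centre point of a bounding box."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def contains(box: BBox, point: Tuple[float, float]) -> bool:
    """True if the point lies inside the box (edges inclusive)."""
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]


def assign_lines(line_boxes: List[BBox], region_boxes: List[BBox]) -> Dict[int, List[int]]:
    """Map region index -> indices of lines whose centre falls inside it.

    Lines that fall inside no region are collected under key -1.
    """
    assignment: Dict[int, List[int]] = {i: [] for i in range(len(region_boxes))}
    assignment[-1] = []
    for line_idx, line_box in enumerate(line_boxes):
        point = center(line_box)
        for region_idx, region_box in enumerate(region_boxes):
            if contains(region_box, point):
                assignment[region_idx].append(line_idx)
                break
        else:
            assignment[-1].append(line_idx)
    return assignment
```

A centre-point test is cheap and tolerant of line boxes that slightly overhang a region; overlapping regions would need a tie-break such as largest intersection area.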
Multilingual Coverage
Surya supports OCR for 90+ languages out of the box. The text recogniser is trained jointly across languages, so script switches within a document are handled without per-language model swapping. Layout and order models are largely script-agnostic and apply across the language set.
For mixed-script documents — academic papers with English captions over Devanagari body text, or multilingual government forms — Surya handles the joint vocabulary more cleanly than running PaddleOCR with two language models.
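When the languages in a document are not known in advance, a rough script census can suggest which language hints to pass. The sketch below uses only the standard library; the `SCRIPT_TO_LANG` table is a hypothetical, deliberately tiny mapping, and a script does not identify a language on its own (Latin could be English, French, and so on).

```python
import unicodedata
from collections import Counter
from typing import List

# Hypothetical mapping for illustration; extend it for the scripts you expect.
SCRIPT_TO_LANG = {"LATIN": "en", "DEVANAGARI": "hi"}


def script_counts(text: str) -> Counter:
    """Count letters per script, using the first word of each Unicode name."""
    counts: Counter = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace, combining marks
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split(" ")[0]] += 1
    return counts


def suggest_langs(text: str) -> List[str]:
    """Language hints for the scripts seen, most frequent script first."""
    langs = []
    for script, _ in script_counts(text).most_common():
        lang = SCRIPT_TO_LANG.get(script)
        if lang and lang not in langs:
            langs.append(lang)
    return langs
```

For example, `suggest_langs` applied to a previously transcribed sample of the document could produce a hint list like the `["en", "hi"]` passed to the recogniser.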
Deployment
Surya runs on PyTorch and benefits from FP16 inference on GPUs such as the NVIDIA L4 or L40S. For high-throughput document ingestion, batched inference across pages is the dominant optimisation; layout and order models are small enough that they rarely become the bottleneck.
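Batching across pages can be sketched with a small helper that feeds fixed-size groups of pages to an OCR callable. Here `ocr_fn` is a hypothetical stand-in for whatever function runs the models on a list of images; the right `batch_size` depends on GPU memory and page resolution and has to be found empirically.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

Page = TypeVar("Page")
Result = TypeVar("Result")


def batched(pages: Sequence[Page], batch_size: int) -> Iterator[List[Page]]:
    """Yield consecutive slices of at most batch_size pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(pages), batch_size):
        yield list(pages[start:start + batch_size])


def run_in_batches(
    pages: Sequence[Page],
    batch_size: int,
    ocr_fn: Callable[[List[Page]], List[Result]],  # hypothetical OCR callable
) -> List[Result]:
    """Run ocr_fn over the pages batch by batch, preserving page order."""
    results: List[Result] = []
    for batch in batched(pages, batch_size):
        results.extend(ocr_fn(batch))
    return results
```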
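A minimal OCR call on a single image looks like this. The import layout matches earlier Surya releases; newer versions may organise these modules differently, so check the README for the installed version.

```python
from PIL import Image

from surya.ocr import run_ocr
from surya.model.detection.model import load_model as load_det_model, load_processor as load_det_processor
from surya.model.recognition.model import load_model as load_rec_model
from surya.model.recognition.processor import load_processor as load_rec_processor

image = Image.open("document.png")

# Load each model and its processor once and reuse them across calls.
det_model, det_processor = load_det_model(), load_det_processor()
rec_model, rec_processor = load_rec_model(), load_rec_processor()

# One list of language codes per input image.
predictions = run_ocr(
    [image],
    [["en", "hi"]],
    det_model, det_processor,
    rec_model, rec_processor,
)
```

Licensing and Practical Notes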
Surya is freely available for research and personal use. Commercial use is permitted under the licence below a revenue threshold; above that, a commercial agreement with the author is required. Check the LICENCE file on GitHub before deploying in a commercial context — the terms have been refined over the project's lifetime.
- Best paired with Marker when the input is PDFs rather than raster images.
- Reading-order model is the standout feature for multi-column ingestion — the alternative is hand-tuned heuristics.
- Table recognition output is structured enough to be parsed directly into pandas DataFrames.
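As an illustration of the last point, the sketch below turns a list of recognised cells into a row-major grid. The `(row, column, text)` tuple format is an assumption made for this example, not Surya's actual table output schema; adapt the unpacking to whatever fields the installed version returns.

```python
from typing import Iterable, List, Tuple

# Assumed cell format for illustration: (row index, column index, text).
Cell = Tuple[int, int, str]


def cells_to_grid(cells: Iterable[Cell]) -> List[List[str]]:
    """Build a rectangular row-major grid; missing cells become empty strings."""
    cells = list(cells)
    if not cells:
        return []
    n_rows = max(row for row, _, _ in cells) + 1
    n_cols = max(col for _, col, _ in cells) + 1
    grid = [[""] * n_cols for _ in range(n_rows)]
    for row, col, text in cells:
        grid[row][col] = text
    return grid
```

If the first row holds the headers, the grid can then be handed to pandas with `pandas.DataFrame(grid[1:], columns=grid[0])`.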
References
- Surya (GitHub repository)
- Marker (GitHub repository)