TrOCR

TL;DR

Introduced by Li et al. at Microsoft in 'TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models' (arXiv:2109.10282, September 2021).
End-to-end transformer encoder-decoder: a Vision Transformer encodes the input image and a text decoder (RoBERTa- or MiniLM-initialised) generates the transcription token by token.
One of the first competitive OCR systems with no convolutional backbone and no CRNN-style sequence decoder: the model is a transformer throughout.
Pre-trained checkpoints for printed text, handwritten text, and scene text are available on Hugging Face under MIT licence.

Architecture

TrOCR is conceptually one of the simplest OCR architectures in practical use. A Vision Transformer encoder splits the input image into fixed-size patches, linearly projects them into embeddings, and processes them with standard transformer self-attention. A text decoder, initialised from a pre-trained language model, then cross-attends to the encoded patches and autoregressively generates the transcription one wordpiece token at a time.
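To make the patch step concrete, here is a minimal sketch of how a fixed-size image becomes a sequence of patch vectors. The 384×384 input and 16×16 patch size are the values reported in the TrOCR paper; the `patchify` helper and the nested-list "image" are illustrative assumptions, not the library's implementation.

```python
def patchify(image, patch):
    """Split a 2-D grid (list of rows) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch x patch block into a single vector.
            block = [image[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            patches.append(block)
    return patches


# A blank single-channel 384x384 "image" (sizes as assumed above).
image = [[0] * 384 for _ in range(384)]
patches = patchify(image, 16)
print(len(patches), len(patches[0]))  # 576 256
```

Each of the 576 patch vectors would then be linearly projected to the encoder's hidden size and given a position embedding before self-attention is applied.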

The encoder is typically initialised from BEiT or DeiT pretraining; the decoder from RoBERTa or MiniLM. This warm-start lets TrOCR converge on OCR data with far less labelled supervision than a from-scratch model.
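A warm start of this kind can be sketched with the Hugging Face `VisionEncoderDecoderModel.from_encoder_decoder_pretrained` helper. This is not the authors' training recipe: the checkpoint IDs in `WARM_START` and the function name `build_warm_started_model` are assumptions chosen for illustration.

```python
# Assumed checkpoint IDs, for illustration only (not the paper's exact recipe).
WARM_START = {
    "encoder": "microsoft/beit-base-patch16-384",
    "decoder": "roberta-base",
}


def build_warm_started_model(encoder_id, decoder_id):
    """Pair a pre-trained vision encoder with a pre-trained text decoder."""
    # Imported lazily so this file can be loaded without transformers installed.
    from transformers import AutoTokenizer, VisionEncoderDecoderModel

    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
        encoder_id, decoder_id
    )
    tokenizer = AutoTokenizer.from_pretrained(decoder_id)
    # generate() needs to know how sequences start and how they are padded.
    model.config.decoder_start_token_id = tokenizer.cls_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    return model, tokenizer
```

Calling `build_warm_started_model(**WARM_START)` downloads both checkpoints. The decoder's cross-attention weights have no pre-trained counterpart and start from random initialisation, so the combined model still has to be trained on OCR data before it produces useful text.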

Variants

Variant	Encoder	Decoder	Target text
TrOCR-base-printed	BEiT-base	RoBERTa-large	Printed English
TrOCR-large-printed	BEiT-large	RoBERTa-large	Printed English (highest accuracy)
TrOCR-base-handwritten	BEiT-base	RoBERTa-large	IAM-style handwriting
TrOCR-base-stage1	BEiT-base	RoBERTa-large	Pre-training checkpoint
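The variants above map onto Hugging Face Hub identifiers of the form `microsoft/trocr-<size>-<target>`. A small lookup such as the following keeps the choice explicit in code; the `checkpoint_for` helper is hypothetical and covers only the rows listed here.

```python
# Hub IDs for the variants in the table above.
CHECKPOINTS = {
    ("base", "printed"): "microsoft/trocr-base-printed",
    ("large", "printed"): "microsoft/trocr-large-printed",
    ("base", "handwritten"): "microsoft/trocr-base-handwritten",
    ("base", "stage1"): "microsoft/trocr-base-stage1",
}


def checkpoint_for(size, target):
    """Return the Hub ID for a (size, target) pair, or raise ValueError."""
    try:
        return CHECKPOINTS[(size, target)]
    except KeyError:
        raise ValueError(f"no checkpoint listed for {size!r}/{target!r}") from None


print(checkpoint_for("base", "handwritten"))  # microsoft/trocr-base-handwritten
```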

Strengths and Limits

Strength — handwritten OCR. The autoregressive decoder handles cursive ligatures and informal letterforms better than CRNN systems.
Strength — multilingual fine-tuning. Swap in a multilingual decoder (e.g., XLM-R) and TrOCR can be retargeted to non-Latin scripts.
Limit — single-line input. Standard TrOCR expects pre-cropped text lines; you need a separate detector to find them.
Limit — autoregressive decoding latency. Throughput trails CRNN-style recognisers on simple printed text.
Limit — no built-in layout or reading order, unlike Surya or PaddleOCR's PP-Structure.
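The latency limit follows from how autoregressive decoding works: every output token needs its own decoder pass, and each pass depends on the tokens before it. The toy loop below illustrates this with a stand-in `toy_step` function in place of a real decoder; all names here are illustrative assumptions.

```python
def greedy_decode(step, bos, eos, max_len=32):
    """Generate tokens one at a time; each step sees all previous tokens."""
    tokens = [bos]
    calls = 0
    while len(tokens) < max_len:
        next_token = step(tokens)  # stands in for one decoder forward pass
        calls += 1
        tokens.append(next_token)
        if next_token == eos:
            break
    return tokens, calls


# Toy stand-in for the decoder: spells out a fixed string, then emits EOS.
TARGET = list("hello") + ["</s>"]


def toy_step(prefix):
    return TARGET[len(prefix) - 1]


tokens, calls = greedy_decode(toy_step, "<s>", "</s>")
print(tokens[1:-1], calls)  # ['h', 'e', 'l', 'l', 'o'] 6
```

Five output characters cost six sequential passes (five tokens plus the end marker), whereas a CTC-style CRNN recogniser emits its whole output sequence from a single forward pass.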

TrOCR is almost always paired with a separate detector — PaddleOCR's DBNet, a YOLO line-detector, or DocTR's detection module — that crops text lines for TrOCR to recognise.
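The detector-plus-recogniser pairing can be sketched as a small glue function. It assumes the detector returns `(left, top, right, bottom)` boxes and that the page object has a PIL-style `crop` method; `recognise_page` and `recognise_line` are hypothetical names, and the top-to-bottom sort is only a naive reading order.

```python
def recognise_page(page, boxes, recognise_line):
    """Crop each detected line from `page` and transcribe it top to bottom.

    page:            any object with a PIL-style .crop((left, top, right, bottom))
    boxes:           line boxes produced by a separate detector
    recognise_line:  callable mapping one cropped line image to a string
    """
    # Naive reading order: sort by top edge, then left edge. Multi-column
    # pages and tables need a proper layout model instead.
    ordered = sorted(boxes, key=lambda box: (box[1], box[0]))
    return [recognise_line(page.crop(box)) for box in ordered]
```

In practice `page` would be a `PIL.Image.Image` and `recognise_line` a thin wrapper around the TrOCR processor and model.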

When to Use TrOCR

Pick TrOCR when handwritten text is in scope, when transcription accuracy matters more than throughput, or when existing transformer tooling (Hugging Face Transformers, Optimum, Accelerate) makes deployment easy. Skip TrOCR for high-throughput printed-text pipelines: PaddleOCR's CRNN-based recogniser will usually be faster at comparable accuracy on clean print.

Deployment

Production deployments typically convert TrOCR to ONNX or TensorRT and serve via Triton. Beam search at inference improves accuracy on noisy handwriting at the cost of throughput; greedy decoding is usually adequate for clean printed text.

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# The processor bundles the image preprocessor and the tokenizer.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# TrOCR expects a single cropped text line as an RGB image.
image = Image.open("handwriting_line.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Generate token IDs with the default decoding settings, then decode to a string.
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
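The snippet above relies on the default decoding settings. A sketch of switching between greedy decoding and beam search through standard `generate()` arguments follows; the beam width of 4, the 64-token cap, and the helper names `decoding_kwargs` and `transcribe` are assumptions for illustration, not recommended values.

```python
def decoding_kwargs(noisy_handwriting):
    """Pick generate() settings: beam search for hard inputs, greedy otherwise."""
    if noisy_handwriting:
        return {"num_beams": 4, "early_stopping": True, "max_new_tokens": 64}
    return {"num_beams": 1, "max_new_tokens": 64}


def transcribe(image_path, noisy_handwriting=False,
               model_id="microsoft/trocr-base-handwritten"):
    """Transcribe one cropped text line from an image file."""
    # Imported lazily so the helper above can be used without these packages.
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained(model_id)
    model = VisionEncoderDecoderModel.from_pretrained(model_id)
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    ids = model.generate(pixel_values, **decoding_kwargs(noisy_handwriting))
    return processor.batch_decode(ids, skip_special_tokens=True)[0]
```

Wider beams multiply the work done per decoding step, so it is worth measuring the accuracy gain on representative data before enabling them in a throughput-sensitive service. A real service would also load the processor and model once rather than on every call.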
