TL;DR
- Introduced by Lv et al. at Baidu in 'DETRs Beat YOLOs on Real-time Object Detection' (arXiv:2304.08069, April 2023).
- First DETR variant to match YOLO-class real-time inference latency on T4-class hardware while keeping the end-to-end set-prediction formulation.
- Uses an efficient hybrid encoder that decouples intra-scale interaction from cross-scale fusion, cutting encoder compute relative to Deformable DETR.
- The original PaddleDetection implementation is licensed Apache 2.0, a permissive alternative to the AGPL-licensed Ultralytics YOLO models for closed-source commercial deployment.
Positioning
Vanilla DETR proved transformers could detect objects end-to-end, but it was slow to train and slow to run. Deformable DETR fixed convergence. Subsequent variants (DAB-DETR, DINO) raised accuracy. None of them ran fast enough to displace YOLO at the latency-sensitive end of the market.
RT-DETR is the variant that closed that gap. On T4 GPUs in TensorRT FP16, RT-DETR-L and RT-DETR-X land in the same throughput band as YOLOv8-L and YOLOv8-X at competitive or better COCO mAP. The win is not just performance — RT-DETR's set-prediction head needs no NMS, simplifying the deployment pipeline.
Architectural Contributions
- Efficient Hybrid Encoder: intra-scale self-attention (AIFI, Attention-based Intra-scale Feature Interaction) is applied only to the highest-level (S5) feature map, and cross-scale interaction is delegated to a lightweight CCFM (CNN-based Cross-scale Feature Fusion) module. Self-attention is spent on the smallest, most semantic feature map instead of on every scale.
- IoU-aware Query Selection: instead of fully learned content queries, RT-DETR picks its initial queries from the top-scoring encoder features, and training pushes the classification score to agree with IoU so that high-scoring features are also well localized. Better starting queries accelerate convergence.
- Standard DETR decoder with auxiliary prediction heads.
- No NMS at inference: the set-prediction formulation is inherited from DETR.
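To see why restricting self-attention to S5 saves compute, the back-of-the-envelope sketch below counts attention score pairs for a 640×640 input. It assumes the common backbone strides of 8, 16 and 32 for S3, S4 and S5, and it compares against dense self-attention over all three scales concatenated, which is a simplification and not the exact baseline used in the paper. The function names are illustrative.

```python
# Token counts per feature level for a 640x640 input.
# Assumption: S3, S4 and S5 use strides 8, 16 and 32.
INPUT_SIZE = 640
STRIDES = {"S3": 8, "S4": 16, "S5": 32}


def tokens_per_level(input_size, strides):
    """Number of spatial positions (tokens) on each feature map."""
    return {name: (input_size // s) ** 2 for name, s in strides.items()}


def attention_pairs(num_tokens):
    """Dense self-attention scores every token against every token."""
    return num_tokens * num_tokens


tokens = tokens_per_level(INPUT_SIZE, STRIDES)

# Dense attention over S3 + S4 + S5 concatenated into one sequence.
all_scales = attention_pairs(sum(tokens.values()))

# Dense attention over S5 alone, as in the hybrid encoder's intra-scale step.
s5_only = attention_pairs(tokens["S5"])

print(tokens)                  # {'S3': 6400, 'S4': 1600, 'S5': 400}
print(all_scales // s5_only)   # 441
```

Pair count is only a proxy for cost. It ignores projections, feed-forward layers and the convolutional fusion path, so it should not be read as an end-to-end speedup.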
Variants and Reported Performance
Numbers from the original paper and PaddleDetection release notes at 640×640 input. Treat as broad guidance — actual deployed mAP depends on the exact checkpoint, precision, and TensorRT version.
| Variant | Backbone | COCO mAP50-95 |
|---|---|---|
| RT-DETR-R18 | ResNet-18 | ~46.5 |
| RT-DETR-R34 | ResNet-34 | ~48.9 |
| RT-DETR-R50 | ResNet-50 | ~53.1 |
| RT-DETR-R101 | ResNet-101 | ~54.3 |
| RT-DETR-L | HGNetv2-L | ~53.0 |
| RT-DETR-X | HGNetv2-X | ~54.8 |
Deployment
RT-DETR is available in two production-ready forms: the original PaddlePaddle implementation in PaddleDetection, and an Ultralytics-integrated port that uses the standard `ultralytics` CLI and dataset format. The Ultralytics integration is the easier on-ramp for teams already running YOLOv8/v11 — model strings change to `rtdetr-l.pt` or `rtdetr-x.pt` and the rest of the pipeline stays put.
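A minimal sketch of the Ultralytics route, assuming the `ultralytics` package is installed and that the `rtdetr-l.pt` checkpoint can be found or downloaded. The helper name `detect_with_rtdetr` and the default confidence threshold are illustrative choices, not part of the library.

```python
def detect_with_rtdetr(image_path, weights="rtdetr-l.pt", conf=0.25):
    """Run one image through an Ultralytics RT-DETR checkpoint.

    Returns a list of (class_id, confidence, [x1, y1, x2, y2]) tuples.
    The helper name and the default threshold are hypothetical.
    """
    # Imported lazily so this module loads even without ultralytics installed.
    from ultralytics import RTDETR

    model = RTDETR(weights)
    results = model.predict(image_path, conf=conf)

    detections = []
    # One Results object per input image; we passed a single image.
    for box in results[0].boxes:
        detections.append(
            (int(box.cls.item()), float(box.conf.item()), box.xyxy[0].tolist())
        )
    return detections
```

Calling `detect_with_rtdetr("street.jpg")` (a placeholder path) would return boxes in pixel coordinates of the input image. Swapping `weights` for `rtdetr-x.pt` selects the larger variant.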
Export to TensorRT is the standard production deployment on NVIDIA hardware. Because there is no NMS, exported engines are simpler than the YOLO equivalents and the post-processing step is just a top-K filter on scored predictions.
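The top-K step can be sketched in plain Python as follows. This assumes the exported engine returns raw per-query class logits and boxes as normalized (cx, cy, w, h), and that scores are sigmoid-activated and ranked over all (query, class) pairs. The actual output layout depends on how the model was exported, so treat the tensor shapes and the function name as assumptions.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def postprocess_topk(logits, boxes, image_w, image_h, k=300):
    """Keep the k best (query, class) pairs; no NMS involved.

    logits: [num_queries][num_classes] raw scores (assumed layout).
    boxes:  [num_queries][4] normalized (cx, cy, w, h) (assumed layout).
    Returns (class_id, score, (x1, y1, x2, y2)) tuples, best first.
    """
    candidates = []
    for q, row in enumerate(logits):
        for c, logit in enumerate(row):
            candidates.append((sigmoid(logit), q, c))
    candidates.sort(key=lambda t: t[0], reverse=True)

    detections = []
    for score, q, c in candidates[:k]:
        cx, cy, w, h = boxes[q]
        # Convert normalized center format to pixel corner format.
        x1 = (cx - w / 2) * image_w
        y1 = (cy - h / 2) * image_h
        x2 = (cx + w / 2) * image_w
        y2 = (cy + h / 2) * image_h
        detections.append((c, score, (x1, y1, x2, y2)))
    return detections


# Two queries, two classes, on a 640x480 image.
example = postprocess_topk(
    logits=[[2.0, -1.0], [-3.0, 0.5]],
    boxes=[[0.5, 0.5, 0.2, 0.4], [0.25, 0.25, 0.1, 0.1]],
    image_w=640,
    image_h=480,
    k=2,
)
for det in example:
    print(det)
```

In practice a score threshold is usually applied after the top-K cut, since most of the retained queries carry low confidence.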
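If licensing is a concern for closed-source SaaS, the Apache 2.0 licence of the original implementation is the standout reason to choose RT-DETR over YOLOv8/v11. The `ultralytics` package itself is distributed under AGPL-3.0, so the permissive terms apply when deploying from the PaddleDetection code path, not from the Ultralytics port. Accuracy and throughput are competitive enough that licensing often tips the decision.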
When to Choose RT-DETR over YOLO
- Closed-source commercial product where AGPL-3.0 is a non-starter and an Ultralytics Enterprise licence is undesirable.
- Crowded scenes with overlapping objects where NMS tuning has caused production headaches.
- Pipelines that benefit from cleaner export — no NMS, no anchor decoding, just top-K.
- Teams that prefer the transformer detector lineage for downstream fine-tuning into open-vocabulary or grounded detection.
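The crowded-scene point can be made concrete with a small sketch of greedy NMS, the step that YOLO-style pipelines need and RT-DETR skips. The boxes, scores and thresholds are made-up values chosen for illustration: two genuinely distinct but overlapping objects survive or vanish depending on the IoU threshold.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two adjacent objects whose boxes overlap with IoU = 7000 / 13000, about 0.54.
boxes = [(0, 0, 100, 100), (30, 0, 130, 100)]
scores = [0.9, 0.8]

print(greedy_nms(boxes, scores, 0.5))  # [0]
print(greedy_nms(boxes, scores, 0.6))  # [0, 1]
```

A threshold that is right for one scene density can drop true objects in another, which is the tuning burden an NMS-free detector avoids.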
References
- DETRs Beat YOLOs on Real-time Object Detection (Lv et al., 2023) · arXiv:2304.08069
- PaddleDetection RT-DETR · GitHub
- Ultralytics RT-DETR Documentation · Ultralytics Docs