TL;DR
- Open-source inference engine from Microsoft, first released December 2018, that runs models in the ONNX (Open Neural Network Exchange) format.
- Cross-platform — Windows, Linux, macOS, Android, iOS, web — and cross-hardware — CPU, NVIDIA, AMD, Intel, Qualcomm, Apple Neural Engine.
- Built around 'execution providers' that route operators to backend kernels: CUDA, TensorRT, ROCm, DirectML, OpenVINO, Core ML, NNAPI, QNN.
- Particularly strong on classical ML and computer-vision deployment; LLM support via the ONNX Runtime GenAI library.
Overview#
ONNX Runtime is Microsoft's general-purpose inference engine for ONNX models. ONNX itself is an open exchange format with broad support from PyTorch, TensorFlow, scikit-learn and most major frameworks, which makes the runtime a natural choice when models must move between training and deployment frameworks or when one engine must cover many model types and hardware targets.
The runtime has been in production at scale inside Microsoft (Office, Bing, Azure) since 2019 and is broadly used in third-party deployments where portability is a higher priority than peak NVIDIA throughput.
Execution Providers#
- CPU — default, with AVX-512 and ARM NEON paths.
- CUDA — NVIDIA GPU, general path.
- TensorRT — NVIDIA GPU with TensorRT engine compilation for max performance.
- ROCm — AMD GPU.
- DirectML — Windows GPU abstraction, covers NVIDIA, AMD, Intel and Qualcomm GPUs.
- OpenVINO — Intel CPU and GPU.
- Core ML — Apple devices.
- NNAPI — Android.
- QNN — Qualcomm Hexagon NPU.
- WebAssembly + WebGPU — in-browser execution.
LLM Support via GenAI#
Traditional ONNX Runtime was a generic graph executor; LLM-specific features (KV cache, paged attention, continuous batching) lived elsewhere. ONNX Runtime GenAI fills the gap with a higher-level API that handles the LLM control loop, KV cache management and tokenisation, while delegating the core matmuls to the standard runtime. It is the recommended path for LLM inference inside ONNX Runtime as of 2026.
import onnxruntime_genai as og
model = og.Model("./phi-3-mini-onnx")
tokenizer = og.Tokenizer(model)
params = og.GeneratorParams(model)
params.set_search_options(max_length=512, temperature=0.7)
params.input_ids = tokenizer.encode("Explain ONNX Runtime in one sentence.")
generator = og.Generator(model, params)
while not generator.is_done():
generator.compute_logits()
generator.generate_next_token()
print(tokenizer.decode(generator.get_sequence(0)))When to Use#
Use ONNX Runtime when you need broad hardware portability, when serving heterogeneous models (vision + speech + tabular + LLM) and the deployment story should be uniform, or when the target is a non-server form factor (mobile, browser, Windows desktop, Apple devices).
For cloud-side LLM throughput on NVIDIA, vLLM and TensorRT-LLM typically lead. For everything else, ONNX Runtime is often the most pragmatic choice.
Yobitel Context#
ONNX Runtime is a common runtime in Yobitel edge-AI deployments where the same model artifact must run across NVIDIA Jetson, x86 servers and Windows-based control systems. The portability advantage outweighs the small throughput gap versus the NVIDIA-native stack in those contexts.
References
- ONNX Runtime · Microsoft / ONNX
- ONNX Runtime on GitHub · GitHub (Microsoft)
- ONNX Runtime GenAI · GitHub (Microsoft)