TL;DR
- Inference on devices located close to the data source — gateways, edge servers, phones, embedded systems.
- Motivated by latency (round-trip to cloud unacceptable), bandwidth (uploading raw sensor data too expensive), privacy (data must not leave the device) or connectivity (intermittent or offline operation).
- Toolchain ranges from heavyweight (NVIDIA Jetson with TensorRT) through middleware (ONNX Runtime, OpenVINO) to lightweight (llama.cpp, MLC-LLM, Apple Core ML).
- Quantisation and pruning are essential; model selection biases toward smaller architectures with strong quality-per-parameter ratios.
Overview
Edge inference covers any deployment where model execution happens close to the data source rather than in a centralised data centre. The motivations are familiar: tight latency budgets that round-tripping to the cloud cannot meet, data-volume economics where uploading raw video or audio is impractical, regulatory or contractual requirements that data not leave the premises, and operational scenarios where connectivity is unreliable.
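The data-volume argument can be made concrete with a little arithmetic. The sketch below estimates the bitrate of a single uncompressed video stream; the resolution, frame rate and 3-bytes-per-pixel RGB format are illustrative assumptions, and real deployments would normally compress before upload, which lowers the figure considerably.

```python
def raw_video_bitrate_mbps(width: int, height: int, fps: float,
                           bytes_per_pixel: int = 3) -> float:
    """Return the bitrate of uncompressed video in Mbit/s (1 Mbit = 1e6 bits)."""
    bits_per_second = width * height * bytes_per_pixel * 8 * fps
    return bits_per_second / 1e6


if __name__ == "__main__":
    # Assumed example stream: one 1080p camera at 30 frames per second.
    mbps = raw_video_bitrate_mbps(1920, 1080, 30)
    print(f"Uncompressed 1080p30 RGB: {mbps:.0f} Mbit/s per camera")
```

Multiplying the per-camera figure by the number of cameras on a site gives a rough upper bound on the uplink that cloud-side inference on raw frames would need.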
The category is broad. A factory-floor server inspecting product images, a Jetson-powered camera detecting intruders, an iPhone running on-device speech recognition, a vehicle-control system running a vision model — all are edge inference. Each has different memory, latency, power and thermal envelopes.
Hardware Spectrum
- Edge data-centre — racks of L4, L40S or Jetson AGX Orin near factories, retail or branch offices.
- Gateway servers — small fanless boxes with discrete or integrated GPUs.
- Mobile devices — Apple Neural Engine (iPhone), Qualcomm Hexagon (Snapdragon), Google Tensor (Pixel).
- Embedded NPUs — Hailo, Coral Edge TPU, Ambarella for cameras and IoT.
- Microcontrollers — TinyML on Cortex-M class chips with TensorFlow Lite Micro.
Runtimes
Runtime choice tracks the hardware tier: TensorRT on NVIDIA Jetson; ONNX Runtime as a portable middle ground across CPU, mobile GPU and integrated accelerators; OpenVINO on Intel hardware; Core ML on Apple devices; llama.cpp and MLC-LLM for portable on-device LLMs; and vendor-specific SDKs (Qualcomm AI Engine Direct, Hailo SDK) for specialised NPUs.
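With ONNX Runtime, the "portable middle ground" usually comes down to choosing execution providers per device. The following is a minimal sketch of that selection, not a recommended policy: the preference order, the helper name `pick_providers` and the `model.onnx` path are assumptions made for illustration, while the provider strings are the names ONNX Runtime uses for its TensorRT, CUDA, OpenVINO, Core ML, QNN and CPU backends.

```python
# Illustrative preference order (an assumption); tune it per device class.
PREFERRED_PROVIDERS = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "OpenVINOExecutionProvider",
    "CoreMLExecutionProvider",
    "QNNExecutionProvider",
    "CPUExecutionProvider",
]


def pick_providers(available, preferred=PREFERRED_PROVIDERS):
    """Keep the preferred providers that this build offers, in order.

    Falls back to the CPU provider if nothing else matches.
    """
    chosen = [p for p in preferred if p in available]
    return chosen or ["CPUExecutionProvider"]


if __name__ == "__main__":
    try:
        import onnxruntime as ort  # optional dependency for this sketch
        available = ort.get_available_providers()
    except ImportError:
        available = ["CPUExecutionProvider"]
    providers = pick_providers(available)
    print("Would create the session with:", providers)
    # Hypothetical model path:
    # session = ort.InferenceSession("model.onnx", providers=providers)
```

Keeping the selection in one small function means the same application code can ship to a Jetson, an Intel gateway and a CPU-only box, with only the installed ONNX Runtime build differing.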
Model Selection
Edge models are smaller by necessity. Vision: YOLOv8n, MobileNet, EfficientNet-Lite. Speech: Whisper Tiny or Small, or faster-whisper with int8 quantisation. LLMs: Phi-3.5 Mini, Llama 3.2 1B and 3B, Qwen 2.5 1.5B. The selection bias is toward architectures with strong quality-per-parameter ratios, because every byte of weights costs memory, bandwidth and power.
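A rough way to see why parameter count and precision dominate model selection is to estimate weight memory as parameters times bytes per weight. The sketch below does only that; the round parameter counts are nominal assumptions rather than exact figures for any named model, and the estimate ignores activations, KV cache and the scale/zero-point metadata that quantised formats carry.

```python
BITS_PER_WEIGHT = {"fp16": 16, "int8": 8, "int4": 4}


def weight_memory_gib(n_params: float, precision: str) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * BITS_PER_WEIGHT[precision] / 8 / 2**30


if __name__ == "__main__":
    # Nominal sizes chosen for illustration only.
    for label, n_params in [("1B", 1e9), ("3B", 3e9)]:
        row = ", ".join(
            f"{precision}: {weight_memory_gib(n_params, precision):.2f} GiB"
            for precision in BITS_PER_WEIGHT
        )
        print(f"{label} parameters -> {row}")
```

Comparing the result against the device's free RAM (after the OS and application take their share) is a quick first filter before any quality evaluation.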
Quality benchmarks for edge models do not always reflect real-world behaviour. Always evaluate on actual deployment data — sensor characteristics, lighting, accents and codecs differ from benchmark distributions.
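One simple way to act on this is to score the same model on a benchmark set and on a labelled sample captured from the deployment site, then track the difference. The sketch below assumes a classification-style task; `predict`, the sample format and the toy threshold model are hypothetical stand-ins for a real model and dataset.

```python
from typing import Any, Callable, Iterable, Tuple

Sample = Tuple[Any, Any]  # (input, expected label)


def accuracy(predict: Callable[[Any], Any], samples: Iterable[Sample]) -> float:
    """Fraction of samples for which predict(input) equals the label."""
    samples = list(samples)
    if not samples:
        raise ValueError("need at least one labelled sample")
    correct = sum(1 for x, y in samples if predict(x) == y)
    return correct / len(samples)


def accuracy_gap(predict, benchmark, deployment) -> float:
    """Benchmark accuracy minus deployment accuracy (positive = worse in the field)."""
    return accuracy(predict, benchmark) - accuracy(predict, deployment)


if __name__ == "__main__":
    # Toy stand-in model: classify a score as positive when it is >= 0.5.
    toy_predict = lambda score: score >= 0.5
    benchmark = [(0.9, True), (0.1, False)]
    deployment = [(0.6, True), (0.55, False), (0.4, True), (0.2, False)]
    print("accuracy gap:", accuracy_gap(toy_predict, benchmark, deployment))
```

For detection or speech models the metric would change (mAP, word error rate), but the comparison between benchmark and field data stays the same.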
Operations
Edge fleets are operationally distinct from cloud fleets. OTA model updates, signed model artefacts, A/B rollouts gated by connectivity, and telemetry sampling all need first-class support. Tools like NVIDIA Fleet Command, Azure IoT Edge, AWS Greengrass and ROS-based fleet management address parts of this stack.
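Two of these pieces, artefact verification and staged rollout, can be sketched in a few lines. This is an illustration under stated assumptions, not how any of the named tools implement it: HMAC-SHA256 with a shared key stands in for a real signature scheme (production fleets would more likely use asymmetric signatures), and the function names, device IDs and version strings are hypothetical.

```python
import hashlib
import hmac


def verify_artefact(blob: bytes, signature_hex: str, key: bytes) -> bool:
    """Check a model blob against its HMAC-SHA256 tag before loading it.

    Stand-in for signature verification; assumes a shared secret key.
    """
    expected = hmac.new(key, blob, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)


def in_rollout(device_id: str, model_version: str, percent: int) -> bool:
    """Deterministically place a device in the first `percent` of 100 buckets.

    Hashing version and device ID together gives each release a different,
    but stable, subset of the fleet.
    """
    digest = hashlib.sha256(f"{model_version}:{device_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


if __name__ == "__main__":
    key = b"demo-key"            # hypothetical shared key
    blob = b"fake model bytes"   # placeholder for the downloaded artefact
    tag = hmac.new(key, blob, hashlib.sha256).hexdigest()

    if verify_artefact(blob, tag, key) and in_rollout("device-001", "v2", 10):
        print("device-001 would activate model v2")
    else:
        print("device-001 stays on the current model")
```

Because the bucket depends only on the device ID and version, a device that was offline during the rollout reaches the same decision when it reconnects, which suits rollouts gated by connectivity.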
Yobitel Context
Edge-AI deployments are a core piece of Yobitel's stack alongside data-centre GPU clouds, with telco-friendly form factors for FTTH cabinets, branch sites and customer-premise inference appliances. Most workloads use ONNX Runtime or TensorRT, with llama.cpp where on-device LLM inference is in scope.