Edge Inference

TL;DR

Inference on devices located close to the data source — gateways, edge servers, phones, embedded systems.
Motivated by latency (round-trip to cloud unacceptable), bandwidth (uploading raw sensor data too expensive), privacy (data must not leave the device) or connectivity (intermittent or offline operation).
Toolchains range from heavyweight (TensorRT on NVIDIA Jetson) through portable runtimes (ONNX Runtime, OpenVINO) to lightweight (llama.cpp, MLC-LLM, Apple Core ML).
Quantisation and pruning are essential; model selection biases toward smaller architectures with strong quality-per-parameter ratios.

Overview

Edge inference covers any deployment where model execution happens close to the data source rather than in a centralised data centre. The motivations are familiar: tight latency budgets that round-tripping to the cloud cannot meet, data-volume economics where uploading raw video or audio is impractical, regulatory or contractual requirements that data not leave the premises, and operational scenarios where connectivity is unreliable.
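
To put a rough number on the data-volume argument, the following sketch computes the bitrate of uncompressed video and how many compressed streams fit in an uplink. The 100 Mbit/s uplink and the 4 Mbit/s per-camera compressed bitrate are illustrative assumptions, not figures from this entry.

```python
def raw_video_bitrate_mbps(width: int, height: int, fps: float,
                           bytes_per_pixel: int = 3) -> float:
    """Uncompressed video bitrate in megabits per second (1 Mbit = 1e6 bits)."""
    bits_per_frame = width * height * bytes_per_pixel * 8
    return bits_per_frame * fps / 1e6


def streams_per_uplink(uplink_mbps: float, stream_mbps: float) -> int:
    """Number of whole streams of the given bitrate that fit in the uplink."""
    return int(uplink_mbps // stream_mbps)


if __name__ == "__main__":
    raw = raw_video_bitrate_mbps(1920, 1080, 30)  # 8-bit RGB, 1080p at 30 fps
    print(f"raw 1080p30: {raw:.0f} Mbit/s")       # prints "raw 1080p30: 1493 Mbit/s"
    # Assumed values: 100 Mbit/s site uplink, 4 Mbit/s per compressed camera stream.
    print("compressed streams per uplink:", streams_per_uplink(100, 4))  # prints 25
```

Even with compression, a site with a few dozen cameras can saturate a modest uplink, which is one reason to run the model next to the sensor and send only results upstream.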

The category is broad. A factory-floor server inspecting product images, a Jetson-powered camera detecting intruders, an iPhone running on-device speech recognition, a vehicle-control system running a vision model — all are edge inference. Each has different memory, latency, power and thermal envelopes.

Hardware Spectrum

Edge data-centre — racks of L4, L40S or Jetson AGX Orin near factories, retail or branch offices.
Gateway servers — small fanless boxes with discrete or integrated GPUs.
Mobile devices — iPhone Neural Engine, Snapdragon Hexagon, Google Tensor TPU.
Embedded NPUs — Hailo, Coral Edge TPU, Ambarella for cameras and IoT.
Microcontrollers — TinyML on Cortex-M class chips with TensorFlow Lite Micro.

Runtimes

Runtime choice tracks the hardware tier. TensorRT targets NVIDIA Jetson; ONNX Runtime serves as a portable middle ground across CPU, mobile GPU and integrated accelerators; OpenVINO targets Intel hardware; Core ML targets Apple devices; llama.cpp and MLC-LLM cover portable on-device LLMs; and vendor-specific SDKs (Qualcomm AI Engine Direct, Hailo SDK) cover specialised NPUs.
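
As one way to express "runtime tracks hardware" in code, the sketch below picks ONNX Runtime execution providers from whatever the installed build exposes. The preference order and the `pick_providers` helper are assumptions made for illustration; the provider names and `get_available_providers()` are part of ONNX Runtime's Python API.

```python
# Preference order is an illustrative assumption: most specialised first,
# CPU last as the universal fallback.
PREFERRED_PROVIDERS = [
    "TensorrtExecutionProvider",  # NVIDIA, via TensorRT
    "CUDAExecutionProvider",      # NVIDIA, via CUDA
    "OpenVINOExecutionProvider",  # Intel hardware
    "CoreMLExecutionProvider",    # Apple devices
    "QNNExecutionProvider",       # Qualcomm
    "CPUExecutionProvider",
]


def pick_providers(available, preferred=PREFERRED_PROVIDERS):
    """Return the preferred providers that are available, always ending in CPU."""
    chosen = [p for p in preferred if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


if __name__ == "__main__":
    try:
        import onnxruntime as ort
        available = ort.get_available_providers()
    except ImportError:
        # Keeps the sketch runnable on machines without ONNX Runtime installed.
        available = ["CPUExecutionProvider"]
    print(pick_providers(available))
    # With a real model file (the path here is hypothetical):
    # session = ort.InferenceSession("model.onnx", providers=pick_providers(available))
```

The same application code can then ship to several hardware tiers, with the provider list deciding where the graph actually runs.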

Model Selection

Edge models are smaller by necessity. Vision: YOLOv8n, MobileNet, EfficientNet-Lite. Speech: Whisper Tiny or Small, often served through faster-whisper with INT8 quantisation. LLMs: Phi-3.5 Mini, Llama 3.2 1B and 3B, Qwen 2.5 1.5B. The selection bias is toward architectures with strong quality-per-parameter ratios because every byte of parameters costs memory, bandwidth and power.
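
The link between parameter count, quantisation and memory can be estimated with simple arithmetic. The sketch below covers weights only and uses nominal parameter counts; activations, KV cache and runtime overhead are excluded, and real quantised formats store scale metadata, so their effective bits per weight are somewhat higher than the nominal figure.

```python
def weight_footprint_gib(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB (2**30 bytes)."""
    return params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    # Nominal parameter counts for illustration, not exact model sizes.
    models = {"1B": 1e9, "3B": 3e9}
    for name, params in models.items():
        for bits in (16, 8, 4):
            size = weight_footprint_gib(params, bits)
            print(f"{name} model at {bits}-bit weights: {size:.2f} GiB")
```

Halving the bit width halves the bytes that must be held in memory and streamed on every forward pass, which is why quantisation matters so much on bandwidth-limited edge hardware.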

Tip: Quality benchmarks for edge models do not always reflect real-world behaviour. Always evaluate on actual deployment data — sensor characteristics, lighting, accents and codecs differ from benchmark distributions.

Operations

Edge fleets are operationally distinct from cloud fleets. Over-the-air (OTA) model updates, signed model artefacts, A/B rollouts gated by connectivity, and telemetry sampling all need first-class support. Tools like NVIDIA Fleet Command, Azure IoT Edge, AWS Greengrass and ROS-based fleet management address parts of this stack.
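
Two of these pieces, artefact verification and staged rollout, can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: an HMAC with a shared key stands in for a real signature scheme (production fleets typically use asymmetric signatures so devices hold only a public key), and all function names, device IDs and the key are hypothetical.

```python
import hashlib
import hmac


def artefact_tag(model_bytes: bytes, key: bytes) -> str:
    """HMAC-SHA256 tag over the model file, as a hex string."""
    return hmac.new(key, model_bytes, hashlib.sha256).hexdigest()


def verify_artefact(model_bytes: bytes, key: bytes, expected_tag: str) -> bool:
    """Constant-time comparison of the computed tag with the published one."""
    return hmac.compare_digest(artefact_tag(model_bytes, key), expected_tag)


def in_rollout(device_id: str, model_version: str, percent: int) -> bool:
    """Deterministically place a device in the first `percent` of 100 buckets."""
    digest = hashlib.sha256(f"{model_version}:{device_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def should_install(model_bytes, key, expected_tag, device_id, model_version, percent):
    """Install only if the device is in the cohort and the artefact verifies."""
    return in_rollout(device_id, model_version, percent) and verify_artefact(
        model_bytes, key, expected_tag
    )


if __name__ == "__main__":
    key = b"demo-key-not-for-production"   # hypothetical shared key
    model = b"\x00fake model weights"      # stand-in for a real model file
    tag = artefact_tag(model, key)
    print(verify_artefact(model, key, tag))         # True
    print(verify_artefact(model + b"!", key, tag))  # False: payload was altered
    cohort = sum(in_rollout(f"cam-{i}", "v2", 10) for i in range(1000))
    print(cohort, "of 1000 devices fall in the 10% cohort")
```

Hashing the version together with the device ID keeps cohort membership stable across reconnects, so a device that drops offline mid-rollout rejoins the same group, while a new model version reshuffles which devices go first.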

Yobitel Context

Edge-AI deployments are a core piece of Yobitel's stack alongside data-centre GPU clouds, with telco-friendly form factors for FTTH cabinets, branch sites and customer-premise inference appliances. Most workloads use ONNX Runtime or TensorRT, with llama.cpp where on-device LLM is in scope.

