TL;DR
- Apple's machine learning framework for running models on Apple silicon — Neural Engine, GPU and CPU.
- Introduced at WWDC 2017; today the default path for on-device inference on iPhone, iPad, Mac, Apple Watch and Vision Pro.
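- Models are packaged as single-file `.mlmodel` documents or `.mlpackage` bundles. `coremltools` converts from PyTorch and TensorFlow; recent releases no longer include a direct ONNX converter, so ONNX models are usually converted from their source framework.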
- Powers Apple Intelligence on-device LLMs, on-device Vision and Speech APIs, and many third-party apps that prefer on-device privacy.
Overview
Core ML is Apple's first-party framework for running models on Apple silicon. It chooses among the Neural Engine, GPU and CPU based on the model and the device, and is integrated tightly with iOS and macOS — Vision, Natural Language, Speech and Sound Analysis all use it under the hood.
Apple Intelligence, introduced with iOS 18, has raised the framework's profile: its on-device stack uses Core ML and a fine-tuned ~3B-parameter foundation model to run summarisation, rewriting and notification triage entirely on the device. Requests that need a larger model are routed to Apple's Private Cloud Compute servers.
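Conversion Pipeline
- Train or download a model in PyTorch or TensorFlow. A model that exists only as ONNX generally has to be converted from its source framework, since recent `coremltools` releases no longer ship an ONNX converter.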
- Use `coremltools` to convert to the Core ML format, optionally with quantisation (palettisation, INT8, INT4) and graph optimisation.
- Ship the `.mlpackage` in the application bundle or as an on-demand resource.
- Load and run the model through the `MLModel` API or the higher-level Vision and Natural Language APIs.
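The conversion and compression steps above can be sketched with `coremltools` in Python. This is a minimal sketch, assuming a PyTorch model that has already been traced with `torch.jit.trace` and a `coremltools` release that includes the `coremltools.optimize.coreml` module. The function name `convert_to_coreml`, the input name, the output path and the 4-bit setting are illustrative choices rather than anything prescribed here. The import sits inside the function so the file still loads on machines without the package.

```python
def convert_to_coreml(traced_model, example_input, out_path="Model.mlpackage", nbits=4):
    """Convert a traced PyTorch model to an .mlpackage, then palettise its weights.

    Assumptions: `traced_model` comes from torch.jit.trace and `example_input`
    is the tensor it was traced with. All names here are illustrative.
    """
    import coremltools as ct
    import coremltools.optimize.coreml as cto

    # Convert to the ML Program format that .mlpackage bundles use.
    mlmodel = ct.convert(
        traced_model,
        inputs=[ct.TensorType(name="input", shape=tuple(example_input.shape))],
        convert_to="mlprogram",
    )

    # Optional compression: k-means palettisation of the weights.
    config = cto.OptimizationConfig(
        global_config=cto.OpPalettizerConfig(mode="kmeans", nbits=nbits)
    )
    compressed = cto.palettize_weights(mlmodel, config)

    # Write the bundle that would ship with the application.
    compressed.save(out_path)
    return out_path
```

Calling `convert_to_coreml(traced, example)` would leave a `Model.mlpackage` directory on disk that can be added to an Xcode project. Whether 4-bit palettisation keeps accuracy acceptable depends on the model and should be checked against a validation set.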
Quantisation
Core ML supports several compression modes: weight palettisation (each weight is replaced by an index into a small codebook of values, common for LLMs), linear quantisation (INT8, INT4) and pruning. The Apple Neural Engine has fixed-function support for several quantised formats and chooses kernels accordingly. Mixed precision is the norm: embeddings and output projections often stay in FP16 while bulk matmuls run in 4-bit palettised mode.
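To make the codebook idea concrete, here is a toy palettisation in plain Python. It is a sketch of the general technique (1-D k-means over a flat list of weights), not a description of how Core ML or `coremltools` implement it; the function names and the sample weights are made up for illustration.

```python
def palettize(weights, nbits=2, iters=20):
    """Return (codebook, indices) for a flat list of float weights.

    Toy 1-D k-means (Lloyd's algorithm). Real toolchains work per tensor
    or per group of weights and use better initialisation.
    """
    k = 2 ** nbits
    lo, hi = min(weights), max(weights)
    # Start with k centroids spread evenly over the weight range.
    codebook = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        indices = [min(range(k), key=lambda j: abs(w - codebook[j])) for w in weights]
        # Move each centroid to the mean of the weights assigned to it.
        for j in range(k):
            members = [w for w, i in zip(weights, indices) if i == j]
            if members:
                codebook[j] = sum(members) / len(members)
    # Final assignment against the converged codebook.
    indices = [min(range(k), key=lambda j: abs(w - codebook[j])) for w in weights]
    return codebook, indices


def depalettize(codebook, indices):
    """Rebuild approximate weights by looking each index up in the codebook."""
    return [codebook[i] for i in indices]


# Made-up weights that fall into four obvious clusters.
weights = [-0.91, -0.88, -0.12, -0.10, 0.09, 0.11, 0.87, 0.93]
codebook, indices = palettize(weights, nbits=2)
restored = depalettize(codebook, indices)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

With `nbits=2` each weight is stored as a 2-bit index plus a shared four-entry codebook, which is where the size reduction comes from; `max_error` shows what that costs in reconstruction accuracy.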
Hardware Routing
Each model is loaded with a compute-units preference (`all`, `cpuAndGPU`, `cpuOnly`, `cpuAndNeuralEngine`), which the app sets through `MLModelConfiguration.computeUnits`. Within that constraint, Apple's runtime decides at load time which engine each operator runs on; operators the Neural Engine does not support fall back to the GPU or CPU. Developers can use Core ML performance reports in Xcode to identify operators that drop off the Neural Engine.
A `cpuAndNeuralEngine` preference is usually the right choice for LLMs and vision models: the Neural Engine handles the bulk of the work efficiently while the CPU picks up unsupported operators.
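The same preference can be set from Python when loading a model through `coremltools` on a Mac, which is convenient for comparing configurations before writing any app code. The sketch below assumes `coremltools` is installed; the helper name `load_with_compute_units` and the lookup table are illustrative, while the enum member names are those of `coremltools.ComputeUnit`.

```python
# Maps the preference names used above to the member names of
# the coremltools ComputeUnit enum.
COMPUTE_UNIT_NAMES = {
    "all": "ALL",
    "cpuAndGPU": "CPU_AND_GPU",
    "cpuOnly": "CPU_ONLY",
    "cpuAndNeuralEngine": "CPU_AND_NE",
}


def load_with_compute_units(model_path, preference="cpuAndNeuralEngine"):
    """Load a Core ML model restricted to the requested compute units.

    Hypothetical helper: `model_path` points at an .mlpackage or .mlmodel.
    Running predictions on the loaded model requires macOS.
    """
    import coremltools as ct  # imported lazily so this file loads without the package

    unit = getattr(ct.ComputeUnit, COMPUTE_UNIT_NAMES[preference])
    return ct.models.MLModel(model_path, compute_units=unit)
```

Loading the same model once per preference and timing `predict` calls gives a rough picture of how much the Neural Engine contributes, although the Xcode performance report remains the way to see per-operator placement.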
LLM Path
Apple has invested heavily in LLM support. Core ML's stateful prediction APIs (`MLState`, available from iOS 18) keep the KV cache alive across calls, so an autoregressive model can generate token by token without the app passing the whole cache in and out on every step; the framework also supports paged attention. Apple's own foundation models and projects like MLX (Apple's open-source array framework for machine learning research) push the on-device LLM frontier.
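The reason a KV cache benefits from being kept as state can be shown with a toy single-head attention loop in plain Python. This is a conceptual sketch only: the `KVCache` class, the `attend` helper and the token values are invented for illustration and do not mirror the Core ML API.

```python
import math


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[d] for e, v in zip(exps, values)) for d in range(dim)]


class KVCache:
    """State kept between decode steps: one key and one value per past token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Append only the new token's key/value, then attend over the whole cache.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


# Toy per-token (query, key, value) projections; the numbers are made up.
tokens = [
    ([1.0, 0.0], [0.5, 0.1], [1.0, 2.0]),
    ([0.0, 1.0], [0.2, 0.9], [3.0, 4.0]),
    ([1.0, 1.0], [0.7, 0.3], [5.0, 6.0]),
]

cache = KVCache()
cached_outputs = [cache.step(q, k, v) for q, k, v in tokens]

# Stateless baseline: rebuild every key and value from scratch at each step.
stateless_outputs = [
    attend(
        tokens[t][0],
        [k for _, k, _ in tokens[: t + 1]],
        [v for _, _, v in tokens[: t + 1]],
    )
    for t in range(len(tokens))
]
```

Both paths produce the same outputs, but the cached path touches only the new token's key and value at each step. In a real model those keys and values are large tensors, so keeping them inside the runtime rather than copying them across the API boundary on every call is where a stateful design pays off.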
When to Use
Use Core ML for any iOS, iPadOS, macOS, watchOS, tvOS or visionOS deployment where on-device inference is desired. The combination of fixed-function Neural Engine, low power and tight OS integration makes it the right path on Apple platforms; using ONNX Runtime or PyTorch Mobile on the same hardware sacrifices the Neural Engine advantage in most cases.
References
- Apple Core ML Documentation · Apple Developer
- coremltools · GitHub (Apple)
- Apple Intelligence Foundation Language Models · Apple Machine Learning Research