TL;DR
- Multimodal models extend foundation-model ideas to any-to-any combinations of text, image, audio, video and structured data.
- Three architectural strategies dominate: separate-then-fuse (CLIP-style), encoder-connector (LLaVA-style), and natively multimodal (Gemini, GPT-4o).
- Frontier multimodal models in 2026 — Gemini 2.x, GPT-4o/5, Claude 4 (vision+text), Qwen2-VL, Llama 3.2 — cover the major modalities with varying audio/video depth.
- Audio uses Whisper-style encoders or audio tokenisers; video uses sparse-frame ViT or learned temporal tokenisation.
What Counts as Multimodal
Strictly, a multimodal model is any model that ingests or produces more than one modality. In practice the term is used for models that combine natural-language understanding with vision, audio, video, structured data or robotics observations. The defining property is a shared representation across modalities, not separate models stitched together at the API layer.
Architecture Strategies
Three patterns recur:
- Separate encoders, contrastive alignment (CLIP, ALIGN, ImageBind): each modality has its own encoder, and the embeddings are aligned with a contrastive loss on paired data such as image-caption pairs. Great for retrieval, weaker for generation.
- Encoder-connector-LLM (LLaVA, BLIP-2, Qwen-VL): a small connector maps the outputs of a pretrained modality encoder into the LLM's token-embedding space. Easy to assemble from existing strong components.
- Natively multimodal (Gemini, GPT-4o): a single Transformer trained on tokenised inputs from every modality from the start. Hardest to train but tightest integration.
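The contrastive alignment used by the first pattern can be sketched in a few lines. This is a minimal pure-Python illustration of a symmetric InfoNCE-style loss, not the training code of any of the models named above. The function names, the toy two-dimensional embeddings and the temperature of 0.07 are assumptions chosen for illustration.

```python
import math

def l2_normalise(vec):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    peak = max(row)
    lse = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - lse for x in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i is (image_embs[i], text_embs[i])."""
    imgs = [l2_normalise(v) for v in image_embs]
    txts = [l2_normalise(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: temperature-scaled cosine similarity of image i and text j.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: each row should favour its diagonal entry.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: the same objective over columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch (hypothetical values): matched pairs versus the same pairs swapped.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
print(f"aligned={aligned:.6f} shuffled={shuffled:.3f}")
```

The loss is close to zero when each image is most similar to its own caption and grows when the pairing is wrong, which is why this objective suits retrieval. It says nothing about how to generate an output in either modality.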
Tokenising Non-Text Modalities
For text, tokens are subword units, typically from a BPE tokeniser. Other modalities need their own tokenisation:
- Images: ViT patches (typically 14×14 or 16×16 pixels), or VQ-VAE / RQ-VAE discrete codes for generation.
- Audio: log-mel spectrograms (Whisper-style, continuous) or neural audio codecs (SoundStream, EnCodec, DAC) for discrete tokens.
- Video: sparse-frame ViT patches, or learned spatio-temporal codecs.
- 3D: voxel grids, point cloud patches, or implicit neural representations.
Discrete tokens fit a Transformer's existing vocabulary mechanism: an audio codec produces tokens drawn from a fixed codebook, just as BPE produces text tokens from a fixed vocabulary. Continuous embeddings require a projection step but preserve more information.
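To make the continuous and discrete routes concrete, here is a rough pure-Python sketch for a single-channel image. It cuts the image into patches, projects each patch into a small embedding space (the continuous route), and then snaps each embedding to its nearest codebook entry (the discrete route). The helper names, the 28×28 image, the embedding width of 8 and the 16-entry random codebook are all hypothetical. Real systems learn the projection and the codebook, and usually quantise encoder outputs, not raw projected patches.

```python
import random

def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened, non-overlapping tiles."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must divide evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append(
                [image[r][c] for r in range(top, top + patch)
                 for c in range(left, left + patch)]
            )
    return patches

def project(vec, weight):
    """Linear connector: weight holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def quantise(vec, codebook):
    """Index of the nearest codebook entry by squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(vec, codebook[k])),
    )

rng = random.Random(0)  # fixed seed so the toy run is repeatable
image = [[rng.random() for _ in range(28)] for _ in range(28)]

patches = patchify(image, 14)  # 28x28 image, 14x14 patches
d_model = 8                    # hypothetical embedding width
connector = [[rng.gauss(0, 0.1) for _ in range(14 * 14)] for _ in range(d_model)]
embeddings = [project(p, connector) for p in patches]   # continuous route

codebook = [[rng.gauss(0, 0.5) for _ in range(d_model)] for _ in range(16)]
tokens = [quantise(e, codebook) for e in embeddings]    # discrete route
print(len(patches), len(embeddings[0]), tokens)
```

The continuous embeddings can be fed to a Transformer alongside text embeddings. The integer tokens can share a vocabulary table with text, at the cost of the information lost in the nearest-neighbour lookup.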
Frontier Coverage in 2026
| Model | Text | Image in | Image out | Audio in | Audio out | Video in |
|---|---|---|---|---|---|---|
| Gemini 2.5 Ultra | Yes | Yes | Native | Yes | Yes | Yes (long) |
| GPT-5 / 4o | Yes | Yes | Native | Yes | Yes (real-time) | Yes |
| Claude 4.7 Opus | Yes | Yes | No | No | No | No |
| Llama 4 / 3.2 Vision | Yes | Yes | No | No | No | No |
| Qwen2-VL 72B | Yes | Yes | No | No | No | Sparse frames |
| DeepSeek-V3 | Yes | No (separate DeepSeek-VL2 model) | No | No | No | No |
Real-Time Multimodality
GPT-4o's launch demo (May 2024) showed sub-second speech-in, speech-out conversation with image input. This was a step change in conversational latency over earlier pipelines that chained Whisper, GPT and TTS sequentially.
The 2026 frontier in this direction is conversational voice agents (OpenAI Voice Mode, Gemini Live, Claude Voice), real-time video understanding (Gemini Live with camera input), and embodied AI (robotics models like RT-2 and OpenVLA).
In real-time multimodal inference, latency is dominated by audio-token generation, not text generation. Codec choices (Mimi, as used in Moshi, or EnCodec) and inference-engine tuning (SGLang, vLLM with audio extensions) are first-order concerns.
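A back-of-the-envelope calculation shows why the codec matters so much. The sketch below assumes a Mimi-like codec at 12.5 frames per second and an EnCodec-like codec at 75 frames per second, both with 8 codebooks, plus a hypothetical 20 ms decoder step that emits one full frame. These figures are illustrative assumptions, not measurements, and the class and function names are invented for this example.

```python
from dataclasses import dataclass

@dataclass
class CodecConfig:
    name: str
    frame_rate_hz: float  # codec frames per second of audio (assumed)
    codebooks: int        # tokens emitted per frame (assumed)

    def tokens_per_second(self) -> float:
        return self.frame_rate_hz * self.codebooks

    def frame_budget_ms(self) -> float:
        """Wall-clock time available per frame to keep up with playback."""
        return 1000.0 / self.frame_rate_hz

def real_time_factor(codec: CodecConfig, ms_per_step: float, steps_per_frame: int = 1) -> float:
    """Generation time divided by audio duration; below 1.0 is real-time."""
    return (ms_per_step * steps_per_frame) / codec.frame_budget_ms()

# Assumed configurations for illustration only.
mimi_like = CodecConfig("Mimi-like", frame_rate_hz=12.5, codebooks=8)
encodec_like = CodecConfig("EnCodec-like", frame_rate_hz=75.0, codebooks=8)

for codec in (mimi_like, encodec_like):
    rtf = real_time_factor(codec, ms_per_step=20.0)
    print(
        f"{codec.name}: {codec.tokens_per_second():.0f} tokens/s, "
        f"{codec.frame_budget_ms():.1f} ms per frame, RTF {rtf:.2f}"
    )
```

Under these assumptions the same decoder is comfortably real-time with the low-frame-rate codec and falls behind with the high-frame-rate one, before any network or text-generation latency is counted.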
Open Challenges
- Long video understanding — even Gemini 1.5/2.x's 1M+ token context starts to strain on movie-length input.
- Audio-visual synchronisation — combining lip-sync, speech, gesture and scene context in one stream.
- Embodied multimodality — closing the loop with proprioception and motor control for robotics.
- Evaluation — benchmarks lag the capability frontier and rarely capture cross-modal reasoning well.
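The long-video item in the list above comes down to simple token arithmetic. The sketch below assumes sampling at 1 frame per second, 258 tokens per frame and 32 audio tokens per second. These are illustrative values only and should be replaced with the figures for whichever model is being used. The function names are hypothetical.

```python
def video_tokens(duration_s, fps, tokens_per_frame, audio_tokens_per_s=0):
    """Total tokens for a clip: sampled frames plus an optional audio stream."""
    frames = int(duration_s * fps)
    return frames * tokens_per_frame + int(duration_s * audio_tokens_per_s)

def max_duration_s(context_tokens, fps, tokens_per_frame, audio_tokens_per_s=0):
    """Longest clip, in seconds, that fits into the given context window."""
    tokens_per_second = fps * tokens_per_frame + audio_tokens_per_s
    return context_tokens / tokens_per_second

# Assumed sampling parameters (illustrative, not vendor-confirmed).
FPS, TOKENS_PER_FRAME, AUDIO_TOKENS_PER_S = 1, 258, 32

movie = video_tokens(2 * 60 * 60, FPS, TOKENS_PER_FRAME, AUDIO_TOKENS_PER_S)
limit = max_duration_s(1_000_000, FPS, TOKENS_PER_FRAME, AUDIO_TOKENS_PER_S)
print(f"2-hour movie: {movie:,} tokens")
print(f"1M-token context holds about {limit / 60:.1f} minutes")
```

With these assumed rates a two-hour film needs roughly twice a 1M-token window even at one frame per second, so longer inputs force sparser sampling, heavier compression or retrieval over segments.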