Multimodal Models

TL;DR

Multimodal models extend foundation-model ideas to any-to-any combinations of text, image, audio, video and structured data.
Three architectural strategies dominate: separate-then-fuse (CLIP-style), encoder-connector (LLaVA-style), and natively multimodal (Gemini, GPT-4o).
Frontier multimodal models in 2026 — Gemini 2.x, GPT-4o/5, Claude 4 (vision+text), Qwen2-VL, Llama 3.2 — cover the major modalities with varying audio/video depth.
Audio uses Whisper-style encoders or audio tokenisers; video uses sparse-frame ViT or learned temporal tokenisation.

What Counts as Multimodal

Strictly, any model that ingests or produces more than one modality. In practice the term is used for models that combine natural-language understanding with vision, audio, video, structured data or robotics observations. The defining property is shared representation across modalities, not separate models stitched together at the API layer.

Architecture Strategies

Three patterns recur:

Separate encoders, contrastive alignment (CLIP, ALIGN, ImageBind) — each modality has its own encoder; embeddings are aligned via contrastive loss on paired data. Great for retrieval, weaker for generation.
Encoder-connector-LLM (LLaVA, BLIP-2, Qwen-VL) — pretrained modality encoder maps to LLM token space via a small connector. Easy to assemble from existing strong components.
Natively multimodal (Gemini, GPT-4o) — a single Transformer trained on tokenised inputs from every modality from the start. Hardest to train but tightest integration.
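The contrastive alignment used by the first pattern can be sketched in a few lines. This is a minimal, pure-Python illustration of a symmetric contrastive loss over paired embeddings, not the training code of any particular model. The function names and the temperature value of 0.07 are assumptions chosen for the example.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over similarities of N paired embeddings.

    Pair i (image i, text i) is the positive; every other pairing in the
    batch acts as a negative. The temperature is an illustrative assumption.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity between image i and text j
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text direction: row i should select column i
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should select row j
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Correctly paired toy embeddings versus deliberately mismatched ones
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The aligned batch yields a loss near zero while the swapped batch is heavily penalised. That is the signal which pulls matching image and text embeddings together and explains why this family is strong at retrieval.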

Tokenising Non-Text Modalities

For text, tokens are subword units from a BPE tokeniser. Other modalities need their own tokenisation, which comes in two forms. Discrete tokens fit a Transformer's existing vocabulary mechanism: an audio codec produces tokens drawn from a fixed vocabulary, just as BPE produces text tokens. Continuous embeddings require a projection step but preserve more information. Common choices per modality:

Images — ViT patches (typically 14×14 pixels), or VQ-VAE / RQ-VAE discrete codes for generation.
Audio — log-mel spectrograms (Whisper-style continuous) or neural audio codecs (SoundStream, EnCodec, DAC) for discrete tokens.
Video — sparse-frame ViT patches, or learned spatio-temporal codecs.
3D — voxel grids, point cloud patches, or implicit neural representations.
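To make the two tokenisation routes concrete, here is a small pure-Python sketch. It is an illustration under stated assumptions, not any model's actual preprocessing. `patchify` produces the continuous patch vectors a ViT-style encoder would project, and `quantize` performs the nearest-codebook lookup that VQ-style tokenisers use to emit discrete ids. Real VQ-VAE models quantise learned encoder outputs rather than raw pixels. The toy image, codebook, and function names are hypothetical.

```python
def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened patches, row-major."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
            for v in vectors]

# Toy 4x4 single-channel image: dark left half, bright right half
image = [[0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9]]

patches = patchify(image, patch=2)           # continuous route: 4 vectors of length 4
codebook = [[0, 0, 0, 0], [9, 9, 9, 9]]      # hypothetical 2-entry codebook
token_ids = quantize(patches, codebook)      # discrete route: one id per patch

# Patch count for an assumed 224x224 input with 14x14-pixel patches
num_patches = (224 // 14) ** 2
```

Here `token_ids` comes out as `[0, 1, 0, 1]`, and a 224×224 image at 14×14 pixels per patch gives 256 patches. The discrete ids can index an embedding table directly, whereas the raw patch vectors would still need a learned projection into the model's hidden size.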

Frontier Coverage in 2026

Model	Text	Image in	Image out	Audio in	Audio out	Video in
Gemini 2.5 Ultra	Yes	Yes	Native	Yes	Yes	Yes (long)
GPT-5 / 4o	Yes	Yes	Native	Yes	Yes (real-time)	Yes
Claude 4.7 Opus	Yes	Yes	No	No	No	No
Llama 4 / 3.2 Vision	Yes	Yes	No	No	No	No
Qwen2-VL 72B	Yes	Yes	No	No	No	Sparse frames
DeepSeek-V3	Yes	Limited (V2.5)	No	No	No	No

Real-Time Multimodality

GPT-4o's launch demo (May 2024) showed sub-second speech-in, speech-out conversation with image input, a step change in conversational latency over earlier pipelines that chained Whisper, GPT and TTS sequentially.

The 2026 frontier in this direction spans conversational voice agents (OpenAI Voice Mode, Gemini Live, Claude Voice), real-time video understanding (Gemini Live with camera input), and embodied AI (robotics models like RT-2 and OpenVLA).

Latency in real-time multimodal inference is dominated by audio token generation, not text generation. Codec choices (Mimi, the codec used by Moshi, or EnCodec) and inference engine tuning (SGLang, vLLM with audio extensions) are first-order concerns.
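A back-of-the-envelope calculation shows why the codec matters so much. The sketch below assumes a codec that emits a fixed number of frames per second with several codebook tokens per frame, and it counts every codebook token as one decode step. That is an upper bound, since some architectures predict codebooks in parallel. The frame rates, codebook counts, and decode throughput are illustrative assumptions, not measurements of any named codec or engine.

```python
def audio_tokens_per_second(frame_rate_hz, num_codebooks):
    """Tokens needed to represent one second of audio."""
    return frame_rate_hz * num_codebooks

def real_time_factor(decode_tokens_per_second, frame_rate_hz, num_codebooks):
    """Ratio of generation speed to playback speed.

    Above 1.0 the model produces audio faster than it is played back;
    below 1.0 playback would stall.
    """
    needed = audio_tokens_per_second(frame_rate_hz, num_codebooks)
    return decode_tokens_per_second / needed

# Hypothetical decode throughput of the serving stack, in tokens per second
decode_speed = 250.0

# Assumed low-frame-rate codec: 12.5 frames/s with 8 codebooks
low_rate_need = audio_tokens_per_second(12.5, 8)
low_rate_rtf = real_time_factor(decode_speed, 12.5, 8)

# Assumed high-frame-rate codec: 75 frames/s with 8 codebooks
high_rate_need = audio_tokens_per_second(75, 8)
high_rate_rtf = real_time_factor(decode_speed, 75, 8)
```

Under these assumptions the low-frame-rate codec needs 100 tokens per second and runs at 2.5 times real time, while the high-frame-rate codec needs 600 tokens per second and falls below real time at the same decode speed.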

Open Challenges

Long video understanding — even Gemini 1.5/2.x's 1M+ token context starts to strain on movie-length input.
Audio-visual synchronisation — combining lip-sync, speech, gesture and scene context in one stream.
Embodied multimodality — closing the loop with proprioception and motor control for robotics.
Evaluation — benchmarks lag the capability frontier and rarely capture cross-modal reasoning well.
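The long-video strain can be estimated with simple arithmetic. The sketch below assumes sparse-frame sampling at a fixed rate and a fixed token cost per frame. The 256 tokens per frame and the 1 frame-per-second sampling rate are hypothetical values chosen for illustration, not the documented behaviour of any specific model.

```python
def video_token_count(duration_s, frames_per_second, tokens_per_frame):
    """Total visual tokens for a clip sampled at a fixed frame rate."""
    return int(duration_s * frames_per_second) * tokens_per_frame

def max_sampling_rate(duration_s, tokens_per_frame, context_tokens):
    """Highest frames-per-second rate whose tokens still fit in the context."""
    return context_tokens / (duration_s * tokens_per_frame)

movie_seconds = 2 * 60 * 60      # a two-hour film
tokens_per_frame = 256           # hypothetical per-frame token cost
context_tokens = 1_000_000       # context window size mentioned above

tokens = video_token_count(movie_seconds, 1.0, tokens_per_frame)
fits = tokens <= context_tokens
fps_budget = max_sampling_rate(movie_seconds, tokens_per_frame, context_tokens)
```

With these numbers a two-hour film at one frame per second costs 1,843,200 visual tokens, which exceeds a 1M-token window before any audio or text is added. Fitting it would mean sampling at roughly one frame every two seconds or compressing each frame into fewer tokens.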
