Llama 3.2 11B Vision Instruct

Meta · Multimodal · Released Sep 2024

An 11-billion-parameter, instruction-tuned vision model from Meta that processes both text and images, with a 131k-token context window.

Strengths: Handles multimodal tasks combining text and image understanding in a single forward pass, with sufficient scale to manage complex visual reasoning.
Best for: Applications requiring both image and text interpretation in a lightweight form factor, such as document analysis or visual question-answering on consumer hardware or edge deployments.
Limitations: Smaller than the 70B instruction-tuned variants, so it may struggle with complex reasoning tasks that benefit from additional model capacity. Newer agentic models like Muse Spark 1.1 are purpose-built for tool use and multimodal reasoning at scale.

Input / 1M: $0.345
Output / 1M: $0.345
Cached input / 1M: not listed
Context window: 131k tokens

Price history

Effective	Input	Output	Cached in	Note	Source
11 Jun 2026	$0.345	$0.345	—	Imported from OpenRouter	openrouter.ai
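
To make the per-million-token rates concrete, here is a minimal Python sketch of a request-cost estimate. It assumes simple linear pricing at the rates listed above, with no cached-input discount and no separate charge for image tokens; the helper name `estimate_cost` and the token counts are hypothetical and not part of any provider API.

```python
from decimal import Decimal

# Rates listed on this page for Llama 3.2 11B Vision Instruct (USD per 1M tokens).
INPUT_PER_M = Decimal("0.345")
OUTPUT_PER_M = Decimal("0.345")


def estimate_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Hypothetical helper: linear token pricing, assuming no caching
    discount and no extra charge for image tokens."""
    million = Decimal(1_000_000)
    total = Decimal(input_tokens) * INPUT_PER_M + Decimal(output_tokens) * OUTPUT_PER_M
    return total / million


# Example: a request with 100,000 input tokens and 20,000 output tokens.
print(estimate_cost(100_000, 20_000))
```

Decimal arithmetic is used here so that small per-token amounts do not pick up binary floating-point rounding noise; actual billing may differ depending on how a provider counts image inputs.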

Muse Spark 1.1: in $1.25 · out $4.25
Llama Guard 4 12B: in $0.18 · out $0.18
Llama 4 Maverick: in $0.15 · out $0.6
Llama 4 Scout: in $0.08 · out $0.3

Data updated Jul 17, 2026