Optimize multimodal performance with Opik

Multimodal agents often juggle text instructions, images, audio, video, and structured outputs. Opik’s optimizers can work with any model that LiteLLM supports for non-text modalities (GPT-4o, Gemini, Claude 3.5 Sonnet vision, etc.). Make sure that both the optimizer’s model and your ChatPrompt.model accept the modality you plan to optimize. Otherwise, the run will fail or silently ignore the media.

Optimizer multimodal support

Use optimizers that can forward OpenAI-style content parts (string or an array of structured parts like { type: "text" | "image_url" }) to a multimodal LLM. Current support:

Optimizer	Multimodal (text+image)	Notes
HRPO (Hierarchical Reflective Prompt Optimizer)	✓	Ensure both optimizer `model` and `ChatPrompt.model` are multimodal-capable.
MetaPrompt Optimizer	✓	Uses content parts; requires a multimodal-capable model for evaluation.
Evolutionary Optimizer	✓	Uses content parts; requires a multimodal-capable model for evaluation.
Few-shot Bayesian Optimizer	✓	Uses content parts; requires a multimodal-capable model for evaluation.
Parameter Optimizer	✓	Tunes parameters only, but supports multimodal prompts for evaluation.
GEPA Optimizer	✓	Uses content parts; requires a multimodal-capable model for evaluation.

See also: "Evaluate multimodal" for model-family guidance and the content block format.

Dataset design

Store image, audio, or video references as signed URLs in your dataset items (for example metadata["image_url"], metadata["audio_url"], or metadata["video_url"]).
Include textual descriptions alongside assets so metrics can run without downloading large files when possible.
Tag rows with modality info (metadata["modality"] = "image+text") to filter during analysis.
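The dataset guidelines above can be sketched as plain Python dictionaries. This is a minimal illustration, not Opik's required schema: the field names `question`, `expected_answer`, and `image_description`, the helper names, and the example URL are assumptions made for this sketch. Only the `metadata` keys come from the guidelines above.

```python
from typing import Any, Dict, List


def make_image_item(question: str, expected_answer: str,
                    image_url: str, description: str) -> Dict[str, Any]:
    """Build one dataset row (hypothetical layout for illustration)."""
    return {
        "question": question,
        "expected_answer": expected_answer,
        # Textual description lets text-only metrics run without fetching the asset.
        "image_description": description,
        "metadata": {
            "image_url": image_url,      # signed URL to the asset
            "modality": "image+text",    # tag used for filtering during analysis
        },
    }


def filter_by_modality(items: List[Dict[str, Any]], modality: str) -> List[Dict[str, Any]]:
    """Return only the rows tagged with the given modality."""
    return [item for item in items
            if item.get("metadata", {}).get("modality") == modality]


items = [
    make_image_item(
        question="What is the highest bar in the chart?",
        expected_answer="Q3",
        image_url="https://example.com/signed/chart.png",  # placeholder URL
        description="Bar chart of quarterly revenue; Q3 is the tallest bar.",
    ),
    # A text-only row, tagged so it can be separated later.
    {"question": "What is 2 + 2?", "expected_answer": "4",
     "metadata": {"modality": "text"}},
]

image_rows = filter_by_modality(items, "image+text")
```

Rows built this way can then be inserted into an Opik dataset with whatever insertion method your SDK version provides.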

Prompt structure

from opik_optimizer import ChatPrompt

prompt = ChatPrompt(
    messages=[
        {"role": "system", "content": "Analyze the provided image and answer the question."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Question: {question}"},
                {"type": "image_url", "image_url": {"url": "{image_url}"}},
            ],
        },
    ],
    model="openai/gpt-4o-mini"
)

Describe the expected output schema (JSON, markdown table, etc.) to reduce ambiguity.
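One way to act on this is to state the schema in the system message and pair it with a deterministic parser. The sketch below assumes a two-key JSON answer (`answer`, `confidence`); that schema, the message wording, and the `parse_answer` helper are illustrative choices, not part of Opik. The schema is described in words rather than with literal braces, on the assumption that braces could be mistaken for template placeholders such as `{question}`.

```python
import json
from typing import Any, Dict, Optional

# Schema described without literal braces to avoid clashing with placeholders.
SYSTEM_MESSAGE = (
    "Analyze the provided image and answer the question. "
    'Respond with a single JSON object containing two keys: "answer" (a string) '
    'and "confidence" (a number between 0 and 1). Do not add any other text.'
)


def parse_answer(raw: str) -> Optional[Dict[str, Any]]:
    """Return the parsed answer if it matches the schema, otherwise None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    answer = data.get("answer")
    confidence = data.get("confidence")
    if not isinstance(answer, str):
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0 <= confidence <= 1:
        return None
    return {"answer": answer, "confidence": float(confidence)}
```

A metric can call `parse_answer` first and score unparseable outputs as zero, with a reason that says the schema was not followed.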

Metrics

Reuse existing text metrics when possible by comparing textual descriptions.
For vision-specific scoring, call external models from your metric function, but cache results to control cost.
Record reasons that mention the modality: “Image not described” or “Chart incorrectly transcribed”.
When possible, augment automated metrics with lightweight human review or deterministic checks, because LLM-as-a-judge signals can be noisy for multimodal tasks.
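The points above can be combined in a single metric function. The sketch below is a minimal illustration under several assumptions: `Score` is a stand-in for whatever score-result type your Opik version expects, the dataset fields `expected_answer` and `metadata["image_url"]` are hypothetical, and `scorer` is a caller-supplied function that wraps the external vision model. It tries a cheap text comparison first, falls back to the external scorer only when needed, caches that call, and writes modality-specific reasons.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Tuple


@dataclass
class Score:
    """Stand-in result type for this sketch (not Opik's own class)."""
    name: str
    value: float
    reason: str


# Cache keyed by (image URL, answer) so repeated trials do not re-call the model.
_score_cache: Dict[Tuple[str, str], float] = {}


def cached_vision_score(image_url: str, answer: str,
                        scorer: Callable[[str, str], float]) -> float:
    key = (image_url, answer)
    if key not in _score_cache:
        _score_cache[key] = scorer(image_url, answer)
    return _score_cache[key]


def image_answer_metric(dataset_item: Dict[str, Any], llm_output: str,
                        scorer: Optional[Callable[[str, str], float]] = None) -> Score:
    output = llm_output.strip()
    if not output:
        return Score("image_answer", 0.0, "Image not described: the model returned no text")
    expected = dataset_item["expected_answer"].strip().lower()
    # Cheap deterministic check first: reuse the textual ground truth.
    if expected in output.lower():
        return Score("image_answer", 1.0, "Expected answer found in the output")
    if scorer is None:
        return Score("image_answer", 0.0, f"Image answer mismatch: expected {expected!r}")
    # Fall back to the (expensive) external vision model, cached per image and answer.
    value = cached_vision_score(dataset_item["metadata"]["image_url"], output, scorer)
    return Score("image_answer", value, "Scored by external vision model (cached)")
```

The in-memory dictionary only lasts for one process; a persistent cache would be needed to share results across runs.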

Running optimizations

Start with MetaPrompt for wording improvements. For cold-start exploration, run the Evolutionary optimizer followed by the Few-shot Bayesian optimizer to uncover new structures and example choices.
Use HRPO to catch recurring multimodal failures (e.g., missing chart descriptions) and highlight which dataset rows are problematic.
Monitor token usage because multimodal prompts send larger payloads; pick models like gpt-4o-mini when budgets are tight.
All prompt optimizers can forward multimodal content parts, but evaluation will fail if the chosen models do not support the modality. Use multimodal-capable models for both optimizer generation and prompt evaluation.
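Because a modality mismatch only surfaces once evaluation starts, a pre-flight check before launching a run can save time. The sketch below is an assumption-laden illustration: `VISION_CAPABLE` is a hand-maintained allowlist invented for this example, and `preflight` is a hypothetical helper, not an Opik function. If your installed LiteLLM version provides a capability lookup such as `supports_vision`, that could replace the allowlist.

```python
from typing import Any, Dict, List, Set

# Assumption: a hand-maintained allowlist; extend it for the models you use.
VISION_CAPABLE: Set[str] = {"openai/gpt-4o", "openai/gpt-4o-mini"}


def uses_images(messages: List[Dict[str, Any]]) -> bool:
    """True if any message carries an OpenAI-style image_url content part."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            if any(part.get("type") == "image_url" for part in content):
                return True
    return False


def preflight(messages: List[Dict[str, Any]], optimizer_model: str,
              prompt_model: str, vision_capable: Set[str] = VISION_CAPABLE) -> List[str]:
    """Return a list of problems; an empty list means the check passed."""
    problems: List[str] = []
    if uses_images(messages):
        for role, model in (("optimizer", optimizer_model), ("prompt", prompt_model)):
            if model not in vision_capable:
                problems.append(f"{role} model {model!r} is not listed as vision-capable")
    return problems


messages = [
    {"role": "system", "content": "Analyze the provided image and answer the question."},
    {"role": "user", "content": [
        {"type": "text", "text": "Question: {question}"},
        {"type": "image_url", "image_url": {"url": "{image_url}"}},
    ]},
]

problems = preflight(messages, optimizer_model="openai/gpt-3.5-turbo",
                     prompt_model="openai/gpt-4o-mini")
```

Here `problems` contains one entry, for the optimizer model, which is the signal to switch it before starting the run.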

Validation

Spot-check generated outputs with the associated media in the dashboard.
Confirm that dataset asset URLs remain valid for the duration of the optimization.
When sharing results, include thumbnails or sample outputs so reviewers understand the changes.
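Checking URL validity can be partly automated when the assets use signed URLs that carry their own expiry. The sketch below assumes S3-style SigV4 presigned URLs, which include `X-Amz-Date` and `X-Amz-Expires` query parameters; other storage providers use different parameters, and the helper names here are hypothetical. It flags URLs that would expire before a planned end time without making any network requests.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional
from urllib.parse import parse_qs, urlparse


def presigned_expiry(url: str) -> Optional[datetime]:
    """Return the expiry time of an S3-style presigned URL, or None if unknown."""
    query = parse_qs(urlparse(url).query)
    try:
        signed_at = datetime.strptime(
            query["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ"
        ).replace(tzinfo=timezone.utc)
        lifetime_seconds = int(query["X-Amz-Expires"][0])
    except (KeyError, ValueError):
        # Not a recognized presigned URL; expiry cannot be determined here.
        return None
    return signed_at + timedelta(seconds=lifetime_seconds)


def urls_expiring_before(urls: List[str], deadline: datetime) -> List[str]:
    """Return URLs whose known expiry falls before the deadline."""
    flagged = []
    for url in urls:
        expiry = presigned_expiry(url)
        if expiry is not None and expiry < deadline:
            flagged.append(url)
    return flagged
```

Pass the expected end time of the optimization as `deadline` (a timezone-aware UTC `datetime`). URLs for which no expiry can be parsed are not flagged, so they still need a manual check.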