Multimodal Agent Optimization Tutorial

Tutorial example inspired by a self-driving car vision agent

This tutorial outlines how to optimize a multimodal agent (vision + text) and links to the full walkthrough for a self-driving car scenario. The SDK already includes a working example script and dataset you can run locally.

What is the multimodal optimizer example?

The SDK includes a complete example that optimizes a vision agent on a driving hazard dataset. It demonstrates how to pass image content parts through ChatPrompt, score the outputs, and compare trials in the Opik UI.
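To give a sense of what that looks like, here is a minimal sketch of a multimodal ChatPrompt. It assumes OpenAI-style content parts and a dataset field named image_url behind the placeholder; the instruction wording and field name are illustrative, not the exact values used in multimodal_example.py.

```python
from opik_optimizer import ChatPrompt

# Minimal sketch: the user turn pairs a text part with an OpenAI-style
# image_url content part. The {image_url} placeholder assumes the dataset
# exposes a field with that name; the wording is illustrative.
prompt = ChatPrompt(
    messages=[
        {
            "role": "system",
            "content": "You are a driving-safety assistant. Name the primary hazard in the scene.",
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Identify the main hazard in this image."},
                {"type": "image_url", "image_url": {"url": "{image_url}"}},
            ],
        },
    ]
)
```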

Why use optimizers here?

Multimodal prompts are sensitive to phrasing and output structure. Running HRPO or MetaPrompt helps you converge on safer, more consistent outputs without rewriting prompts manually.

How the SDK example works

  1. multimodal_example.py loads the driving hazard dataset (images plus hazard labels).
  2. A multimodal ChatPrompt places an image URL content part next to the textual instruction (as in the sketch above).
  3. The metric (Levenshtein ratio) scores the predicted hazard text against the expected label.
  4. HRPO optimizes the prompt on the training split, using a small validation split to rank candidates (see the sketch after this list).
  5. Results appear in the Opik UI (Optimization runs and trial details).
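To make steps 3 and 4 concrete, here is a sketch of the usual opik_optimizer call shape: a metric callable wrapping LevenshteinRatio, followed by an optimize_prompt run. It assumes prompt is the multimodal ChatPrompt sketched earlier, dataset is the driving-hazard Opik dataset loaded by the example script, and that the reference text lives in a field named expected_hazard; it also uses MetaPromptOptimizer for illustration, so check multimodal_example.py for the exact HRPO class name, constructor arguments, and split handling.

```python
from opik.evaluation.metrics import LevenshteinRatio
from opik_optimizer import MetaPromptOptimizer

def hazard_similarity(dataset_item, llm_output):
    # Compare the predicted hazard text with the expected label.
    # "expected_hazard" is an assumed dataset field name.
    return LevenshteinRatio().score(
        reference=dataset_item["expected_hazard"],
        output=llm_output,
    )

# MetaPromptOptimizer stands in for the HRPO optimizer used by the script;
# the optimize_prompt call shape is shared across optimizers.
optimizer = MetaPromptOptimizer(model="openai/gpt-4o-mini")
result = optimizer.optimize_prompt(
    prompt=prompt,            # multimodal ChatPrompt from the earlier sketch
    dataset=dataset,          # driving-hazard dataset loaded by multimodal_example.py
    metric=hazard_similarity,
    n_samples=50,             # evaluate on a subset of items per round
)
result.display()              # summary in the terminal; full details in the Opik UI
```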

Screenshot placeholder: multimodal trial comparisons and failure analysis.

Next steps

  • Explore the full SDK script and adapt the dataset to your own vision tasks.
  • Use pass@k evaluation (the n parameter) to reduce the impact of stochastic failures; see the conceptual sketch after this list.
  • Read the full external guide for the complete workflow and visuals.
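As a conceptual illustration of pass@k (not a description of the SDK's built-in mechanism), the sketch below treats an item as passed if any of k sampled completions clears a score threshold; the generate and score callables and the threshold value are assumptions.

```python
from typing import Callable

def pass_at_k(
    generate: Callable[[], str],    # samples one completion per call (assumed interface)
    score: Callable[[str], float],  # similarity to the reference, e.g. Levenshtein ratio
    k: int = 5,
    threshold: float = 0.8,
) -> bool:
    """Return True if any of k sampled completions scores at or above the threshold."""
    return any(score(generate()) >= threshold for _ in range(k))
```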