Not all GPT judges think alike. In our vision-language evals, each model showed a distinct “personality”:
• GPT-4o-mini — the calm, calibrated evaluator. It applies steady standards, checks every box, and keeps judgment consistent across criteria. Great when you need reliable, systematic coverage.
• GPT-4o — the sharp quality controller. It’s especially good at spotting what’s off while still keeping a balanced view of the whole. Reach for this when error-hunting matters most.
• GPT-5 — the ultra-cautious gatekeeper. It’s very tough on anything that smells like a hallucination, but its standards can swing from case to case. Use when maximum caution beats breadth.
Across the board, these judges lean more toward catching mistakes than celebrating correct details—so model choice should match your goal: consistency, error-hunting, or maximum caution. (2/2)