Recorded at CDFAM Computational Design Symposium, Amsterdam, 2025
https://cdfam.com/amsterdam-2025/
Organization: MIT
Presenter: Kristen Edwards
Presentation Abstract
The subjective evaluation of early-stage engineering designs, such as conceptual sketches, traditionally relies on human experts. However, expert evaluations are time-consuming, expensive, and sometimes inconsistent. Recent advances in vision-language models (VLMs) offer the potential to automate design assessments, but it is crucial to ensure that these AI “judges” perform on par with human experts. Yet no existing framework assesses expert equivalence. This research introduces a rigorous statistical framework to determine whether an AI judge’s ratings match those of human experts. We propose statistical metrics spanning the following assessment areas: interrater reliability, agreement, error metrics, correlation and relative rank assessment, distribution-similarity analysis, and equivalence tests. We apply this framework in a case study evaluating four VLM-based judges on key design metrics (uniqueness, creativity, usefulness, and drawing quality). These AI judges employ various in-context learning (ICL) techniques, including uni- vs. multimodal prompts and inference-time reasoning. The same statistical framework is used to assess three trained novices for expert equivalence. Results show that the top-performing AI judge, using text- and image-based ICL with reasoning, achieves expert-level agreement for uniqueness and drawing quality and outperforms or matches trained novices across all metrics. This has implications for scaling design evaluation in education and practice, and provides a general statistical framework for validating AI judges in other domains requiring subjective content evaluation.
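The abstract does not include an implementation; as an illustration only, the following Python sketch shows what the listed kinds of checks might look like when comparing an AI judge's ratings against an expert's. The synthetic data, the 0.5-point equivalence margin, and the specific library choices (scikit-learn, SciPy, statsmodels) are assumptions for this sketch, not the authors' actual method.

```python
# Illustrative sketch of agreement, error, correlation, distribution-similarity,
# and equivalence checks between an expert and an AI judge (synthetic ratings).
import numpy as np
from scipy.stats import spearmanr, ks_2samp
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.weightstats import ttost_paired

rng = np.random.default_rng(0)
expert = rng.integers(1, 6, size=100)                              # expert ratings, 1-5 scale
ai_judge = np.clip(expert + rng.integers(-1, 2, size=100), 1, 5)   # simulated AI judge ratings

# Agreement: quadratic-weighted Cohen's kappa
kappa = cohen_kappa_score(expert, ai_judge, weights="quadratic")

# Error metric: mean absolute error of the AI judge against the expert
mae = np.mean(np.abs(expert - ai_judge))

# Correlation / relative rank: Spearman rank correlation
rho, _ = spearmanr(expert, ai_judge)

# Distribution similarity: two-sample Kolmogorov-Smirnov test
_, ks_p = ks_2samp(expert, ai_judge)

# Equivalence: paired two one-sided tests (TOST) with an assumed +/-0.5-point margin
tost_p, _, _ = ttost_paired(ai_judge.astype(float), expert.astype(float), -0.5, 0.5)

print(f"kappa={kappa:.2f}  MAE={mae:.2f}  rho={rho:.2f}  KS p={ks_p:.3f}  TOST p={tost_p:.3f}")
```

In this kind of setup, a high kappa and rank correlation, a low error, a non-significant distribution-difference test, and a significant TOST result would together support treating the AI judge as expert-equivalent on that metric; the thresholds used here are placeholders.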