Can AI Judge the Quality of AI-Generated Design?
Interview with Kristen Edwards – MIT Decode Lab
As generative AI systems become more capable of producing vast volumes of design concepts, a new question emerges: can those same systems reliably judge the quality of what they generate?
Kristen Edwards, a researcher at MIT’s Decode Lab, returns to CDFAM Amsterdam to explore this question through her latest research on vision-language models (VLMs) as design evaluators, or AI-judges, in engineering contexts.
The full interview on CDFAM.COM offers a preview of her upcoming talk, covering recent developments in multimodal reasoning, the statistical tools used to compare AI and expert evaluations, and how small, structured datasets can still support meaningful assessments. Edwards also reflects on the real-world implications for early-stage design workflows and what the computational design community can contribute to this rapidly evolving field.
The following are some excerpts.
You presented at CDFAM Berlin last year on Concept to Manufacturing: Evaluating Vision-Language Models for Engineering Design. What have you been working on since then, and how have the tools or models available for this type of design evaluation evolved?
My research is focused on utilizing multimodal machine learning models for two main tasks: design exploration and design evaluation.
Since my presentation at CDFAM last year, I have been working on generating and validating 3D meshes directly from sketches during the conceptual design stage, in order to enhance design exploration. Here is a link to this work.
Most recently, however, my primary focus has been on how AI might serve as an evaluator in engineering design, especially important now that generative AI has made it easier than ever to produce large volumes of design concepts.
In the past year, there have been huge algorithmic strides in multimodal models, like large pre-trained vision-language models (VLMs), and big changes in the way AI is being utilized in engineering and manufacturing workflows. To name a few of the advancements and trends from the past year, we’ve seen:
Agentic AI systems: systems that can act autonomously, making decisions and taking actions to achieve goals based on their own reasoning rather than on direct human intervention.
More powerful open-source models: Llama 3, DeepSeek-R1, Qwen 2.5, and Mistral, to name a few.
Powerful reasoning models, including open-source ones, that show improved performance on many tasks using chain-of-thought answering.
Improved multimodal models: for example, GPT-4o and gpt-image-1 have brought significant performance improvements on tasks that require understanding both image and text inputs.
Continued buzz around generative AI
But now there’s not just generation; there is also evaluation. We’re seeing increased attention on how to systematically evaluate these outputs to guide downstream decisions.
LLM-as-a-Judge: Over the past year, there’s been a rise in research and real-world experimentation with large models, primarily language-only models, as evaluators. These include both benchmarking tools and in-the-loop evaluators for human- or AI-generated content. I’m interested in how multimodal models can serve as evaluators in engineering design, and in how to assess their performance as judges.
That’s what I’ll be discussing at this year’s CDFAM: a statistical perspective on measuring whether AI-judges’ evaluations align with experts’ evaluations.
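To give a rough sense of what such a statistical comparison can involve, below is a minimal sketch that scores ten hypothetical design concepts on a 1–5 scale from both an AI-judge and an expert, then computes quadratic-weighted Cohen’s kappa (chance-corrected agreement on an ordinal scale) and Spearman’s rank correlation (whether the two raters order the concepts similarly). The data, scale, and choice of metrics are illustrative assumptions, not the specific statistics from Edwards’ work.

```python
# Illustrative only: comparing hypothetical AI-judge scores with expert scores
# for the same design concepts, rated on a 1-5 ordinal scale.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings for 10 design concepts (1 = poor, 5 = excellent).
expert_scores = np.array([4, 2, 5, 3, 3, 1, 4, 5, 2, 3])
ai_scores     = np.array([4, 3, 5, 3, 2, 1, 4, 4, 2, 3])

# Quadratic-weighted Cohen's kappa: chance-corrected agreement that
# penalizes large disagreements on an ordinal scale more heavily.
kappa = cohen_kappa_score(expert_scores, ai_scores, weights="quadratic")

# Spearman's rank correlation: do the AI-judge and the expert rank
# the concepts in roughly the same order?
rho, p_value = spearmanr(expert_scores, ai_scores)

print(f"Weighted Cohen's kappa: {kappa:.2f}")
print(f"Spearman's rho: {rho:.2f} (p = {p_value:.3f})")
```

Even with a small, structured dataset like this, agreement measures of this kind give a quick read on whether an AI-judge is a credible proxy for the expert before it is used at scale.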
What kinds of tools or capabilities would you like to see become available to support this kind of AI-driven design evaluation? And in the meantime, what can engineers do to get the most out of the tools we have today?
Design evaluation varies widely depending on the stage of the design process and the nature of the metrics involved. In early-stage conceptual design, representations may be rough sketches, and evaluations tend to focus on subjective criteria such as creativity or novelty. Later in the process, during detailed design, evaluations may rely on rich 3D CAD models and focus on objective metrics like manufacturability or structural performance via simulation tools like CFD or FEA.
To support AI-driven design evaluation across this spectrum, I’d like to see tools that:
Intake a variety of design representations, from sketches and natural language descriptions to full CAD assemblies
Provide reliable evaluations of both subjective and objective metrics, ideally validated against expert or real-world outcomes
Enable lightweight customization or fine-tuning, so engineers can align evaluations with their specific domain or design context without massive datasets
Explain their reasoning, especially for subjective or qualitative metrics, to foster trust and insight
Integrate with existing metrics and benchmarks to easily assess their results against a “ground-truth” evaluator, like a human expert.
There are already impressive AI tools in this space — for instance, SimScale (co-founded by David Heiny) allows engineers to run simulations in the cloud with AI-assisted preprocessing and setup. As more tools emerge, they’ll help streamline performance-driven evaluation. But for subjective or creative metrics, the AI tooling is still developing.
In the meantime, engineers can get the most out of existing AI tools by:
Validating AI evaluators before relying on them. Ensure LLMs or VLMs are accurate by comparing their outputs to expert judgments or known benchmarks
Using in-context learning to align pre-trained models with “ground truth” examples, which is often more feasible than training a model from scratch (a minimal sketch follows this list)
Treating AI as a collaborator, not a final rater. AI models can surface ideas, rank options, or provide second opinions, but human oversight remains critical, especially in high-stakes or novel design contexts
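To make the in-context learning point concrete, here is a minimal, provider-agnostic sketch of how a few expert-rated “ground truth” examples might be packed into a few-shot prompt for a VLM evaluator. The rubric, scoring scale, image URLs, and message structure are all hypothetical placeholders; the exact payload format depends on the multimodal API being used.

```python
# Illustrative sketch: building a few-shot evaluation prompt for a VLM judge.
# The rubric, scale, and examples below are hypothetical placeholders.

RUBRIC = (
    "Rate the design concept's novelty on a 1-5 scale, where 1 is a routine "
    "variation of existing designs and 5 is a genuinely new approach."
)

# A handful of expert-rated examples used for in-context alignment.
# Each entry pairs an image reference with the expert's score and rationale.
few_shot_examples = [
    {"image_url": "https://example.com/concept_a.png", "score": 2,
     "rationale": "Minor restyling of a common bracket geometry."},
    {"image_url": "https://example.com/concept_b.png", "score": 5,
     "rationale": "Unconventional load path enabled by a lattice structure."},
]

def build_messages(new_concept_url: str) -> list[dict]:
    """Assemble a chat-style message list: rubric, expert-rated examples,
    then the unrated concept the VLM should score."""
    messages = [{"role": "system", "content": RUBRIC}]
    for ex in few_shot_examples:
        messages.append({
            "role": "user",
            "content": f"Concept image: {ex['image_url']}\nPlease score this concept.",
        })
        messages.append({
            "role": "assistant",
            "content": f"Score: {ex['score']}. Rationale: {ex['rationale']}",
        })
    messages.append({
        "role": "user",
        "content": f"Concept image: {new_concept_url}\nPlease score this concept.",
    })
    return messages

# The resulting list can be sent to whichever multimodal chat API is in use,
# typically with each image passed as an image input rather than a bare URL string.
prompt_messages = build_messages("https://example.com/concept_new.png")
```

A few-shot-aligned judge can then be checked against held-out expert ratings, for example with agreement statistics like those sketched earlier, before being trusted on new concepts.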
AI-based evaluators have the potential to reduce the evaluative load on experts and to enable design evaluation at scale. But they should be employed as part of a human-AI team, with measures in place to ensure that the AI’s evaluations align with those of a ground-truth evaluator.