Evaluating Vision-Language Models for Engineering Design
Interview with Kristen M. Edwards, MIT DeCoDE Lab
Kristen M. Edwards is a researcher at the MIT DeCoDE Lab exploring generative AI for product design, engineering, and manufacturing using Vision-Language Models (VLMs) and Large Language Models (LLMs).
The following interview, conducted ahead of her presentation at CDFAM Berlin, gives some background on her research and the focus of her talk.
RESEARCH OVERVIEW AND DIFFERENTIATION FROM LLMS
Could you provide an overview of your research at MIT focusing on Vision-Language Models (VLMs)? How do VLMs diverge from traditional Large Language Models (LLMs) in terms of approach and functionality?
My research in the MIT DeCoDE Lab focuses on the intersection of artificial intelligence (AI) and engineering design. I’m particularly interested in how AI agents can assist engineers throughout the design process, from design conception to manufacturing.
Most recently, I’ve been working with vision-language models (VLMs), which are multimodal models that can take in both text and images. Many of them build on successes and strides made in the field of large language models (LLMs), but while LLMs deal only with language input, VLMs are trained on large image and text datasets.
Functionally, this means VLMs are well positioned for tasks in engineering which require some sort of visual input – tasks like understanding engineering drawings, choosing the best concept out of a series of sketches, and generating 3D models from 2D inputs.
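To make the input format concrete, the minimal sketch below shows what such a multimodal (text plus image) call can look like; it assumes the OpenAI Python SDK and the GPT-4 Turbo model discussed later in this interview, and the prompt and image URL are illustrative placeholders rather than anything from the research described here.

```python
# Minimal sketch of a multimodal (text + image) request to a VLM.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# the image URL and prompt are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "This is a hand-drawn concept sketch with annotations. "
                            "Explain how the design is intended to function.",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/concept_sketch.png"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```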
An example of my recent work includes Sketch2Prototype, which is a framework that uses multimodal generative AI to generate text, diverse 2D images, and 3D models from a hand-drawn sketch.
EXPECTATIONS VS. REALITY
When starting your research into VLMs, what were your expectations for performance in engineering tasks?
When I began my research with vision-language models (VLMs), people had just begun publishing research which evaluated large language models (LLMs) for tasks like exam completion and CAD script generation. I had a sense of how LLMs were performing at various tasks, and this informed some of my expectations for VLMs.
I expected VLMs to have certain strengths – a sense of global knowledge from their vast training data, the versatility to perform well at many different tasks, and the ability to improve their responses after iterative feedback. I also went in expecting some weaknesses – a tendency to hallucinate (or say something coherent but not factual), a need for iteration, and poor analytical skills. Since I was working mostly with GPT-4 Turbo, which is a closed-source model, I did not have access to what was happening under the hood. However, I did have some guesses that the training data included common computer vision datasets and a plethora of text. These datasets often pair images with text, so I suspected GPT-4 Turbo would be able to recognize and classify images well.
Were there any aspects that exceeded your expectations, and in which areas did the models underperform?
I was impressed with VLMs’ performance at optical character recognition (OCR) in low-detail sketches, especially those with handwritten annotations. I found that VLMs could use OCR and general image understanding not only to interpret a sketch of a design, but also to extrapolate from the annotations how a design might function. I was also impressed with some of the domain-specific knowledge, like what aspects to look for in a visual of a CFD (computational fluid dynamics) simulation, or what a Pugh Chart is and how one might fill it out during concept selection.
OVERCOMING SPATIAL REASONING AND CAD FUNCTIONALITY CHALLENGES
As you mention, VLMs demonstrate limitations in spatial reasoning and CAD functionality. How might these challenges be addressed?
I see a few ways to address the limitations in spatial reasoning, and I think they vary by whether you are working with an open-source or closed-source model. If you are working with an open-source model, then you have the opportunity to change the model architecture. Here, researchers can change the training set and pretraining objectives to address specific gaps in a model’s performance, or even incorporate new computer vision techniques to try to improve spatial reasoning.
On the other hand, for both open- and closed-source models, there is potential to incorporate code interpretation to improve spatial reasoning. I see this approach as “I have accepted that my current VLM is not great for this task, BUT I know another tool that is and that interfaces with my VLM, so I will use that.” For example, GPT-4 Turbo has useful code interpretation skills, and I have personally built Custom GPTs for which I write and provide Python scripts that can perform tasks like determining the cross-sectional area of a part and its area moment of inertia. This also means that you can utilize Python packages like Shapely that are already well suited to geometry-intensive tasks.
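A minimal sketch of that idea, assuming Shapely and an illustrative rectangular cross-section (it is not one of the scripts referenced above), shows the kind of helper a Custom GPT can call: Shapely supplies the area and centroid, and the standard polygon formula supplies the second moment of area.

```python
# A minimal sketch of a geometry helper a Custom GPT could call.
# Assumes Shapely is installed; the cross-section and dimensions are illustrative.
from shapely.geometry import Polygon
from shapely.geometry.polygon import orient


def section_properties(vertices):
    """Return (area, I_xc) for a closed polygonal cross-section.

    vertices: list of (x, y) tuples describing the outer boundary.
    I_xc is the second moment of area about the horizontal axis
    through the section's centroid.
    """
    poly = orient(Polygon(vertices))      # force counter-clockwise winding
    area = poly.area                      # Shapely computes the shoelace area
    cy = poly.centroid.y

    # Second moment of area about the global x-axis via the standard
    # polygon formula, then shifted to the centroid (parallel-axis theorem).
    coords = list(poly.exterior.coords)   # closed ring: last point == first
    ix = 0.0
    for (x0, y0), (x1, y1) in zip(coords[:-1], coords[1:]):
        cross = x0 * y1 - x1 * y0
        ix += (y0 * y0 + y0 * y1 + y1 * y1) * cross
    ix /= 12.0
    return area, ix - area * cy * cy


# Example: a 20 mm x 40 mm rectangular section.
# Expected: area = 800 mm^2, I_xc = 20 * 40**3 / 12 ≈ 106,666.7 mm^4
area, i_xc = section_properties([(0, 0), (20, 0), (20, 40), (0, 40)])
print(f"A = {area:.1f} mm^2, I_xc = {i_xc:.1f} mm^4")
```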
UTILITY IN CONCEPTUAL DESIGN
In our conversations, you mentioned four main failure modes in VLMs’ ability to process and interpret engineering designs. Could you elaborate on these modes and their implications?
The four main failure modes that my collaborators and I have found when moving from 2D images to 3D models are:
Unwanted holes
Incorrectly filled holes (like making a cup a solid cylinder instead of a hollow one with a base)
Floating material
Uneven texture
The image below demonstrates some of these issues in the 3D rendering columns. Moving forward, work on generating 3D models that are actually manufacturable will be necessary to make these tools more useful to engineers.
A: 3D models generated from varying input images (an original sketch, an image made via ControlNet, and an image made via Sketch2Prototype). The Sketch2Prototype image yields more diverse and manufacturable designs. B: The intermediate text modality allows for user control. Users can easily edit the text description that is used to create a 2D image. We show this by appending text to the original prompt to generate different designs.
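As a rough illustration (and not part of the Sketch2Prototype pipeline itself), two of these failure modes can be flagged automatically once a generated model has been exported to a mesh; the sketch below assumes the open-source trimesh package and a hypothetical output file name.

```python
# A minimal sketch for screening generated 3D models, assuming the open-source
# trimesh package; "generated_prototype.stl" is a hypothetical exported file.
import trimesh

mesh = trimesh.load_mesh("generated_prototype.stl")

# Unwanted holes usually show up as a non-watertight mesh.
if not mesh.is_watertight:
    print("Warning: mesh is not watertight (possible unwanted holes).")

# Floating material shows up as more than one disconnected body.
bodies = mesh.split(only_watertight=False)
if len(bodies) > 1:
    print(f"Warning: {len(bodies)} disconnected bodies (possible floating material).")
```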
KEY TAKEAWAYS AND CDFAM BERLIN PRESENTATION
Looking ahead to the CDFAM Symposium in Berlin, what’s the key insight you aim to share with your audience about your research?
From my research, I have identified key areas in engineering design where vision-language models shine, and key areas where they fail. I aim for my audience to take away that VLMs currently shine in sketch understanding (including OCR with handwriting), in creative generation of detailed design images from sketch and text inputs, and in high-level knowledge of most engineering disciplines. These are areas where we can immediately think of applications and ways that VLMs can assist engineers.
On the other hand, there is still a lot of progress to be made in spatial understanding, in turning 2D inputs into 3D models of actually manufacturable parts, and in specialized domain knowledge.
We might begin tackling these limitations by incorporating code interpretation, creating Custom GPTs, or turning to open-source VLMs that we can train with different datasets and objectives. Nonetheless, these areas represent exciting opportunities for future work.
Finally, what do you hope to achieve or discover through your involvement in the CDFAM Symposium?
I am incredibly excited to hear what other researchers, industry leaders, and AI-users have discovered when they have applied these models to their specific fields. In my experience, cross-field conversations provide some of the most valuable insights – highlighting application areas that I had not considered before, as well as successes and learnings that are transferable to my area of expertise.
I want to understand which areas are getting people across industry, research, and government the most excited, and which areas may be untapped.
Lastly, I will count my involvement as a success if I discuss the future of security and sustainability as it pertains to AI in engineering and manufacturing with those leading the charge forward.