Benchmarking Frontier LLMs for AEC Image Understanding

Aug 19, 2025

What’s measured: accuracy of component identification from images across three AEC domains:

  • MEP (mechanical, electrical, and plumbing)

  • Structural engineering

  • Building science / envelope


TL;DR

  • MEP: GPT-5 leads at 62.5%, followed by GPT-4.1 (55.0%).

  • Structural: Grok 4 tops the set at 38.89%, ahead of Gemini 2.5 Pro (33.33%) and GPT-5 (27.78%).

  • Building science: GPT-5 and GPT-4.1 are tied at 60.0% with Gemini 2.5 Pro close behind (59.48%).

Results:

  Domain              Top performers (component identification accuracy)
  MEP                 GPT-5 62.5%, GPT-4.1 55.0%
  Structural          Grok 4 38.89%, Gemini 2.5 Pro 33.33%, GPT-5 27.78%
  Building science    GPT-5 60.0%, GPT-4.1 60.0%, Gemini 2.5 Pro 59.48%

What these numbers reflect

Component identification means: “given a photo, plan detail, or section, name the primary component(s) shown and classify them into a controlled vocabulary.” Think air handler, VAV box, or condensing unit (MEP); W-section column, PT slab tendon, or shear wall (structural); or AVB membrane, mineral wool, or masonry veneer tie (building science).
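
The exact harness behind these numbers isn’t published here, but a minimal sketch of a controlled-vocabulary check could look like the Python below. The vocabulary entries and normalization rules are illustrative assumptions, not the benchmark’s actual scoring code.

```python
# Minimal sketch of checking a model's component label against a controlled
# vocabulary. The vocabulary and normalization below are illustrative
# assumptions, not this benchmark's harness.

MEP_VOCAB = {"air handler", "vav box", "condensing unit", "fan coil", "unit heater"}

def normalize(label: str) -> str:
    """Lowercase, drop hyphens, and collapse whitespace so 'VAV Box' matches 'vav box'."""
    return " ".join(label.lower().replace("-", " ").split())

def is_correct(prediction: str, ground_truth: str, vocab: set[str]) -> bool:
    """Exact match within the controlled vocabulary; anything outside it scores zero."""
    pred, truth = normalize(prediction), normalize(ground_truth)
    return pred in vocab and pred == truth

print(is_correct("VAV Box", "vav box", MEP_VOCAB))        # True
print(is_correct("terminal unit", "vav box", MEP_VOCAB))  # False: outside the vocabulary
```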

The three domains stress different capabilities:

  1. MEP favors recognition of equipment with distinctive silhouettes, labels, gauges, and manufacturer patterns. Language models that “read” incidental text and infer function do better.
    Why GPT-5 leads: strong multimodal text+pattern reading, better grounding across similar SKUs.

  2. Structural images are harder. Rebar chairs, embeds, and connections present subtle geometry with few labels. Small visual cues matter, and lighting or concrete dust hides edges.
    Why Grok 4 wins here: it appears to pick up low-level geometric cues more reliably on frames, connections, and reinforcement.

  3. Building science sits between the two. Envelope layers are often repeated and labeled on details, so models that follow a taxonomy and parse annotations score well.
    Why GPT-5/GPT-4.1 lead: good at mapping notes and hatch patterns to a known assembly stack.

Error patterns we typically see

  • Taxonomy drift: “fan coil” vs “unit heater,” “RTU” vs “AHU.”

  • Over-generalization: calling all reinforcement “rebar” without bar size or role.

  • Context miss: mistaking a parapet AVB for below-grade waterproofing when cropping hides the roof.

  • Confusion among similar hardware: condensate pumps vs small circulators; beam clamps vs hanger clips.

  • Multi-component scenes: giving credit only for the dominant item can suppress scores for models that list all visible parts (see the scoring sketch after this list).
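
A simple way to blunt taxonomy drift and the multi-component penalty is to score against an alias map and grant per-scene partial credit. The sketch below shows the idea; the alias table and credit rule are assumptions, not how the numbers above were scored.

```python
# Sketch: alias-aware, multi-label scoring. The alias map and the partial-credit
# rule are illustrative assumptions, not the scoring behind the results above.

ALIASES = {
    "rtu": "rooftop unit",
    "ahu": "air handler",
    "air handling unit": "air handler",
}

def canon(label: str) -> str:
    """Map a raw label to its canonical vocabulary term."""
    key = label.strip().lower()
    return ALIASES.get(key, key)

def scene_score(predicted: list[str], expected: list[str]) -> float:
    """Fraction of expected components named, after canonicalization.
    Extra components are not penalized in this sketch."""
    pred = {canon(p) for p in predicted}
    exp = {canon(e) for e in expected}
    return len(pred & exp) / len(exp) if exp else 0.0

# A scene showing an air handler plus a condensate pump; the model lists three items.
print(scene_score(["AHU", "condensate pump", "beam clamp"],
                  ["air handler", "condensate pump"]))  # 1.0
```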

Interpreting the spread by domain

  • Structural is uniformly lower. The visual distinctions are smaller, the class set is more granular, and imagery often lacks labels. Expect to pair automated ID with a short reviewer checklist.

  • Building science is the most “compressible.” Clear layer stacks and annotations let language priors shine. You can get close to 60% without fine-tuning by enforcing taxonomy and giving a one-page legend (see the prompt sketch after this list).

  • MEP rewards text+vision fusion. When the model can read a data plate or stenciled duct tag, identification jumps. Encourage photos that keep manufacturer plates visible.
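
To make the “enforce taxonomy plus a one-page legend” point concrete, here is a hedged sketch of prompt construction for envelope details. The class list, legend text, and wording are illustration-only assumptions, not the prompts used in this benchmark.

```python
# Sketch of a taxonomy-enforcing prompt for envelope details. The class list,
# legend, and wording are assumptions for illustration only.

ENVELOPE_CLASSES = [
    "AVB membrane",
    "mineral wool",
    "masonry veneer tie",
    "below-grade waterproofing",
]

LEGEND = (
    "AVB membrane: air/vapor barrier applied over the exterior sheathing.\n"
    "Mineral wool: semi-rigid exterior insulation boards.\n"
    "Masonry veneer tie: metal tie anchoring brick veneer to the backup wall.\n"
    "Below-grade waterproofing: membrane on foundation walls below grade."
)

def build_prompt(detail_caption: str) -> str:
    """Assemble a prompt that restricts answers to the controlled vocabulary."""
    return (
        "Identify the envelope components shown in the attached detail.\n"
        f"Answer ONLY with terms from this list: {', '.join(ENVELOPE_CLASSES)}.\n"
        "Use this legend to disambiguate similar layers:\n"
        f"{LEGEND}\n"
        f"Detail: {detail_caption}"
    )

print(build_prompt("Parapet section at the roof-to-wall transition"))
```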


Grok 4 has been touted as fine-tuned for the real world. That may explain its lead on structural imagery, but it did not keep pace with GPT-5 on MEP or building science.

Now we know that 38% isn’t enough for use in the real world, which is why at Guild AI we hit accuracy rates that are much higher by using your firm’s own institutional knowledge. Get more information here.