Case study · Construction AI

On-site checklists.

A 2,000-item construction checklist, 30+ PDFs, on-premise only. We didn't promise full automation — we built a validation gate over an LLM and cut the load by ~6×.

May 5, 2026 · ~10 min read

§ 01 The problem: 2,000 items, 30+ PDFs, 90+ person‑days a month

There are processes in construction that look simple from the outside: take the project documentation, walk through a checklist, fill in the right fields. In practice, it can be a huge amount of manual work.

A construction company came to us with a per‑site checklist of 2,000+ items. To fill it in, their specialists rely on a full set of project documentation: dozens of PDF files, text sections, tables, plans, drawings, and appendices.

Before automation, the process looked like this:

| Parameter | Before |
|---|---|
| Specialists involved | 3 |
| Time per checklist | 10–14 working days |
| Cadence | ~3 times per month |
| Monthly load | 90–126 person‑days |

Every month, several specialists were spending most of their working time hunting for information across PDFs, reconciling wording, transferring data, and double‑checking it by hand.

The client's request sounded simple:

We want to fully automate this checklist.

At first glance the task is a perfect fit for AI: there are documents, there are questions, the model just needs to find the answers. After the first analysis pass, though, it became obvious that full automation here would be too risky. That's where the project actually started.

§ 02 Why we didn't promise full automation

We could've said yes. "Sure, we'll build an AI that fills the whole checklist on its own." But in a real production process, that would almost certainly have led to problems.

In construction documentation, the cost of a wrong answer is high. If the system confidently returns a bad value, the specialist will end up rechecking everything by hand, and trust in the tool evaporates within a release or two.

So we didn't start with code. We started with a risk assessment.

The main conclusion was this:

Fully automating this from day one would be too risky. But we can cut the manual load drastically if we sort answers by reliability and keep the human exactly where the human is needed.

Instead of replacing the specialist 100%, we proposed an AI service that:

The client agreed with the approach, and we got to work.

§ 03 Where it gets hard

3.1. The documents aren't standardised

The checklist is filled out from 30+ PDF documents. The average document runs to about 120 pages. And on every project, the documents are produced by different contractors. That means:

So we couldn't write a rule‑based system that says "look for the answer in document X, section Y, page Z". The system had to handle poorly standardised documentation and survive constant changes in the shape of the input.

3.2. Not every answer lives in the text

Some questions can be answered directly from the body text — a specific characteristic, parameter, or value that's explicitly written down. Those are the easy ones to automate reliably.

But not every question is like that. Some items read more like:

Height from floor to ceiling, including the upper partition.

Sometimes that information is in a text section. Sometimes it only exists on a plan, an elevation, or a drawing.

Construction plans are often huge PDF pages packed with small details, callouts, footnotes, tables, and visual elements. Feeding entire pages straight into a vision‑language model would be slow, expensive, and not always accurate. Trying to brute‑force‑parse every drawing didn't make sense either.

For those cases, we had to design a separate flow: locate relevant regions, do smart crops, pick the right fragments, and only then push them further down the pipeline.
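The cropping step can be illustrated with a minimal sketch. It is not the production flow: it assumes page sizes and region boxes are given in pixels, and the helper names `padded_crop` and `tile_boxes`, the tile size, the overlap, and the margin are all hypothetical. It shows two simple building blocks: padding a located region so nearby callouts stay in frame, and covering a large page with overlapping tiles when no region is known yet.

```python
from typing import Iterator, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def padded_crop(box: Box, page_w: int, page_h: int, margin: int = 50) -> Box:
    """Grow a located region by `margin` pixels, clamped to the page bounds."""
    left, top, right, bottom = box
    return (
        max(0, left - margin),
        max(0, top - margin),
        min(page_w, right + margin),
        min(page_h, bottom + margin),
    )


def tile_boxes(width: int, height: int, tile: int = 1024, overlap: int = 128) -> Iterator[Box]:
    """Yield overlapping crop boxes that together cover a width x height page."""
    if tile <= overlap:
        raise ValueError("tile must be larger than overlap")
    step = tile - overlap
    top = 0
    while True:
        bottom = min(top + tile, height)
        left = 0
        while True:
            right = min(left + tile, width)
            yield (left, top, right, bottom)
            if right >= width:
                break
            left += step
        if bottom >= height:
            break
        top += step
```

The overlap is there so that a detail sitting on a tile border still appears whole in at least one crop. Only the crops judged relevant would then go further down the pipeline.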

3.3. On‑premise only, on a shared GPU node

One more important constraint: the system had to run entirely inside the client's infrastructure. No external APIs. No project documentation leaving for a cloud LLM. No sensitive data crossing the perimeter.

The client already had a shared 2×H100 GPU node on their own infrastructure. For this project, we were allocated 65 GB of VRAM out of that node — it wasn't a dedicated machine reserved for us. Our pipeline had to fit inside that 65 GB budget while sharing the node with other workloads.

Practically speaking, the task wasn't just "build an AI service" — it was "fit an AI service into a fixed set of constraints":

The final service runs inside the client's infrastructure and does its work inside that GPU allocation.

§ 04 The core engineering idea: a validation gate, not model self‑confidence

We didn't try to make the LLM answer every question at any cost. We built the pipeline around a different idea:

The system should not just produce answers. It should know where an answer is reliably backed by the documentation, where a quick human check is needed, and where it should honestly say "I don't know".

This is the important distinction.

We don't use the model's own confidence as the final reliability signal. An LLM can sound confident when it's wrong. So the high / medium / low tiers aren't decided by the model — they're decided by a separate validation gate.

The validation gate looks at:

Only after that does the system assign a tier.

The result isn't just an answer generator. It's a controlled AI pipeline with an explicit boundary of trust.

§ 05 How the service works in practice

For the user, the solution looks like a web service. A specialist uploads:

  1. the checklist for a specific site;
  2. the project documentation set;
  3. any extra reference materials, if needed.

The pipeline then runs over the documents. It analyses the corpus, retrieves relevant fragments, extracts candidate answers, runs each one through the validation gate, and returns an annotated checklist.
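As a rough sketch of that flow, the stages can be wired together as plain functions. Everything named here is an assumption for illustration: the `AnnotatedItem` fields, the `run_pipeline` signature, and the three injected callables (`retrieve`, `extract`, `gate`) stand in for the real retrieval, extraction, and validation components, which are not described at this level of detail.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class AnnotatedItem:
    question: str
    answer: Optional[str]           # None when the gate assigns "low"
    tier: str                       # "high", "medium" or "low"
    sources: List[str] = field(default_factory=list)


def run_pipeline(
    questions: List[str],
    retrieve: Callable[[str], List[str]],
    extract: Callable[[str, List[str]], Optional[str]],
    gate: Callable[[str, Optional[str], List[str]], str],
) -> List[AnnotatedItem]:
    """Run retrieve -> extract -> gate for each checklist question."""
    results = []
    for question in questions:
        fragments = retrieve(question)             # relevant document fragments
        candidate = extract(question, fragments)   # candidate answer, or None
        tier = gate(question, candidate, fragments)
        # A "low" item never carries an answer, only the sources that were found.
        answer = candidate if tier != "low" else None
        results.append(AnnotatedItem(question, answer, tier, fragments))
    return results
```

The point of the shape is that the tier comes from `gate`, a separate step, and never from whatever produced the candidate answer.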

A few hours later, the specialist receives a result where every item has:

End‑to‑end processing takes around 4 hours and runs entirely on‑premise.

§ 06 Three confidence tiers

6.1. high

high is the class of questions where the answer is, with very high probability, written directly in the body text of the documentation.

Honestly: this is not the hardest part of the checklist. Most of these are well‑formalised, fairly simple questions — parameters, values, characteristics, the kind of thing that's usually stated explicitly in the docs.

But these were exactly the items creating most of the manual grind. Even when the answer is simple, the specialist still has to:

That's the repeatable part we automated.

An answer only lands in high if it passes the strict validation gate: a direct match was found in the documentation, and the answer has been confirmed to match the question.
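One way to picture the "direct match" part of that check is a normalised verbatim lookup. This is a simplified stand-in, not the gate itself: the function names `normalise` and `has_direct_match` are hypothetical, and a real gate would also need the second check (that the answer actually matches the question), which is not shown here.

```python
import re


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def has_direct_match(answer: str, fragment: str) -> bool:
    """True if the normalised answer appears in the fragment as a whole token run."""
    needle = normalise(answer)
    if not needle:
        return False
    # Lookarounds stop "20 mm" from matching inside "120 mm".
    pattern = r"(?<!\w)" + re.escape(needle) + r"(?!\w)"
    return re.search(pattern, normalise(fragment)) is not None
```

For example, `has_direct_match("200  mm", "Wall thickness:\n200 mm (external)")` is true despite the line break and the doubled space, while `has_direct_match("20 mm", "Wall thickness: 120 mm")` is false.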

Per the validation results, high covers around 30% of checklist items at very high accuracy — about 99.9%.

For the business, that means a meaningful chunk of the manual grind disappears with almost no loss of reliability.

6.2. medium

medium is the class where the system found relevant material and can propose an answer, but the validation gate doesn't consider the support strong enough for high. That can happen for several reasons:

For medium, answer accuracy sits around 70%.

But the value of this tier isn't only the answer itself. If the relevant information exists in the documentation at all, the system surfaces it as a useful source in roughly 95% of cases, usually as fewer than ten links or fragments the specialist can scan quickly.

So the specialist no longer starts the search from scratch. Instead of skimming dozens of PDFs by hand, they get a short list of locations where the answer is most likely to be.
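To make the "short list of locations" concrete, here is a toy ranking sketch. It swaps in plain token overlap for whatever retrieval the real system uses, so treat it as an illustration of the output shape only. The function names, the `(location, text)` fragment format, and the cap of nine results are assumptions.

```python
import re
from typing import List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_sources(question: str, fragments: List[Tuple[str, str]], k: int = 9) -> List[str]:
    """Return up to k fragment locations, ranked by token overlap with the question."""
    question_tokens = tokens(question)
    scored = []
    for location, text in fragments:
        overlap = len(question_tokens & tokens(text))
        if overlap > 0:  # drop fragments that share nothing with the question
            scored.append((overlap, location))
    # Highest overlap first; ties broken by location for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [location for _, location in scored[:k]]
```

With a question like "Height from floor to ceiling", a fragment that mentions floor, ceiling, and height ranks above one that only mentions a ceiling finish, and a fragment about something unrelated is dropped entirely.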

6.3. low

low is the class where the validation gate didn't find solid enough support. Here the system doesn't try to guess and doesn't push the model to make up "the most likely" answer. It just says, honestly:

I don't know the answer to this question.

That mattered to us. In construction documentation, it's better not to give an answer than to deliver a polished, confident, unsupported one.

low isn't a system failure. It's part of the system's reliability. The system has to know not only where it can help, but also where to leave the decision to the specialist.
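Put together, the three tiers can be read as a small decision rule. The sketch below is one plausible reading of the descriptions above, not the actual gate: the four boolean signals and the names `assign_tier` and `present` are assumptions, and the exact boundary between medium and low is not spelled out in this write-up.

```python
from typing import Optional

LOW_TIER_MESSAGE = "I don't know the answer to this question."


def assign_tier(
    has_candidate: bool,      # an answer could be extracted at all
    has_sources: bool,        # relevant material was found in the documents
    direct_match: bool,       # the answer appears directly in the documentation
    answers_question: bool,   # the answer was confirmed to match the question
) -> str:
    """Map validation signals to a tier; the model's own confidence is not an input."""
    if not (has_candidate and has_sources):
        return "low"
    if direct_match and answers_question:
        return "high"
    return "medium"


def present(tier: str, candidate: Optional[str]) -> str:
    """What the specialist sees: the answer, or an explicit 'don't know' for low."""
    if tier == "low" or candidate is None:
        return LOW_TIER_MESSAGE
    return candidate
```

Note that nothing in `assign_tier` asks the model how sure it is. A fluent but unsupported answer falls through to low and is replaced by the explicit message.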

§ 07 Results in numbers

Before:

| Parameter | Before automation |
|---|---|
| Specialists | 3 |
| Time per checklist | 10–14 working days |
| Cadence | 3× per month |
| Monthly load | 90–126 person‑days |

After:

| Parameter | After automation |
|---|---|
| Specialists | 1 |
| Time per checklist | 5–7 working days |
| Cadence | 3× per month |
| Monthly load | 15–21 person‑days |

The monthly load dropped from roughly 90–126 to 15–21 person‑days. That's about a 6× reduction in raw labour at both ends of the range (90 → 15 and 126 → 21).
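The arithmetic behind those figures can be checked in a few lines. The sketch assumes monthly load is simply specialists × working days per checklist × checklists per month, which is consistent with both tables; the function name `monthly_load` is ours.

```python
from typing import Tuple


def monthly_load(specialists: int, days_per_checklist: Tuple[int, int],
                 checklists_per_month: int) -> Tuple[int, int]:
    """Return the (min, max) monthly load in person-days."""
    low, high = days_per_checklist
    return (
        specialists * low * checklists_per_month,
        specialists * high * checklists_per_month,
    )


before = monthly_load(3, (10, 14), 3)   # (90, 126)
after = monthly_load(1, (5, 7), 3)      # (15, 21)

# Reduction factor at the low and high end of the range.
ratio = (before[0] / after[0], before[1] / after[1])   # (6.0, 6.0)
```

Both ends of the range give the same factor because the headcount fell by 3× and the time per checklist halved, and 3 × 2 = 6.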

And we didn't try to replace the expert outright. The specialist stays in the decision loop — they just stop spending their day chasing PDFs and start spending it on the disputed items and final approval.

§ 08 What changed for the specialist

Before, the specialist would open the documentation set and dig through dozens of PDFs by hand looking for answers.

After, they open a pre‑annotated checklist:

The specialist's job changed from:

Find every answer from scratch.

to:

Check the disputed ones and approve.

That shift is where most of the gain came from.

§ 09 An unexpected bonus

One nice surprise: the high tier ended up exposing errors in old checklists. So the system didn't just speed up the new process — it also flagged inconsistencies in already‑completed documents.

For the client, that was extra confirmation that the pipeline isn't just generating plausible text. It's actually working with the documentation, and it can find useful facts.

§ 10 Why the client was happy

The client started out wanting full automation. We could've said yes, disappeared into development for several months, and come back with a system whose quality wasn't stable enough for a real business process.

Instead, we took a more grown‑up route:

The client didn't get an experimental AI demo. They got a working tool that helps them do a hard job faster, more safely, and more predictably.

§ 11 The main takeaway

A successful AI project isn't always about replacing the human 100%. In our case, the best result came not from full automation but from controlled partial automation.

We split the task into three zones:

  1. Simple, well‑supported questions — the system handles them almost entirely.
  2. Disputed questions — the system helps with a quick verification through surfaced sources.
  3. Questions without solid support — the system honestly leaves them to a human.

That's what made the solution production‑ready.

Not the largest possible number of polished AI answers — but a reliable pipeline that knows its own limits, shows its sources, and amplifies the specialist where it actually creates business value.