Construction checklists – case study

§ 01 The problem: 2,000 items, 30+ PDFs, 90+ person‑days a month

There are processes in construction that look simple from the outside: take the project documentation, walk through a checklist, fill in the right fields. In practice, it can be a huge amount of manual work.

A construction company came to us with a per‑site checklist of 2,000+ items. To fill it in, their specialists rely on a full set of project documentation: dozens of PDF files, text sections, tables, plans, drawings, and appendices.

Before automation, the process looked like this:

Parameter	Before
Specialists involved	3
Time per checklist	10–14 working days
Cadence	~3 times per month
Monthly load	90–126 person‑days

Every month, several specialists were spending most of their working time hunting for information across PDFs, reconciling wording, transferring data, and double‑checking it by hand.

The client's request sounded simple:

We want to fully automate this checklist.

At first glance the task is a perfect fit for AI: there are documents, there are questions, the model just needs to find the answers. After the first analysis pass, though, it became obvious that full automation here would be too risky. That's where the project actually started.

§ 02 Why we didn't promise full automation

We could've said yes. "Sure, we'll build an AI that fills the whole checklist on its own." But in a real production process, that would almost certainly have led to problems.

In construction documentation, the cost of a wrong answer is high. If the system confidently returns a bad value, the specialist will end up rechecking everything by hand, and trust in the tool evaporates within a release or two.

So we didn't start with code. We started with a risk assessment.

The main conclusion was this:

Fully automating this from day one would be too risky. But we can cut the manual load drastically if we sort answers by reliability and keep the human exactly where the human is needed.

Instead of replacing the specialist 100%, we proposed an AI service that:

finds answers in the project documentation;
fills part of the checklist automatically;
shows its sources;
splits results by reliability tier;
and honestly says "I don't know" when the evidence isn't strong enough.

The client agreed with the approach, and we got to work.

§ 03 Where it gets hard

3.1. The documents aren't standardised

The checklist is filled out from 30+ PDF documents. The average document runs to about 120 pages. And on every project, the documents are produced by different contractors. That means:

file names change every time;
the structure of the documents differs;
the same parameter can live in different sections;
information ordering is not guaranteed;
some data sits in the body text, some in tables, some on plans;
a year from now the client may have new contractors with completely different formatting conventions.

So we couldn't write a rule‑based system that says "look for the answer in document X, section Y, page Z". The system had to handle poorly standardised documentation and survive constant changes in the shape of the input.

3.2. Not every answer lives in the text

Some questions can be answered directly from the body text — a specific characteristic, parameter, or value that's explicitly written down. Those are the easy ones to automate reliably.

But not every question is like that. Some items read more like:

Height from floor to ceiling, including the upper partition.

Sometimes that information is in a text section. Sometimes it only exists on a plan, an elevation, or a drawing.

Construction plans are often huge PDF pages packed with small details, callouts, footnotes, tables, and visual elements. Feeding entire pages straight into a vision‑language model would be slow, expensive, and not always accurate. Trying to brute‑force‑parse every drawing didn't make sense either.

For those cases, we had to design a separate flow: locate relevant regions, do smart crops, pick the right fragments, and only then push them further down the pipeline.
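The crop step can be sketched roughly as follows. This is a minimal illustration, not the production flow: it assumes an upstream step (OCR or the PDF text layer) has already produced word boxes in page pixel coordinates, and the names `Box`, `Word`, `find_anchor_boxes`, and `crop_around`, as well as the sample coordinates, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page pixel coordinates."""
    x0: int
    y0: int
    x1: int
    y1: int


@dataclass(frozen=True)
class Word:
    """One recognised word and its position on the page."""
    text: str
    box: Box


def find_anchor_boxes(words: list[Word], keywords: set[str]) -> list[Box]:
    """Return the boxes of words whose lowercased text is one of the keywords."""
    return [w.box for w in words if w.text.lower() in keywords]


def crop_around(anchor: Box, page_w: int, page_h: int, margin: int) -> Box:
    """Grow an anchor box by `margin` pixels on each side, clamped to the page."""
    return Box(
        max(0, anchor.x0 - margin),
        max(0, anchor.y0 - margin),
        min(page_w, anchor.x1 + margin),
        min(page_h, anchor.y1 + margin),
    )


# Made-up word boxes on a 9000 x 6500 px drawing.
words = [
    Word("Ceiling", Box(4000, 150, 4200, 200)),
    Word("height", Box(4210, 150, 4380, 200)),
    Word("Stair", Box(100, 6000, 220, 6050)),
]
anchors = find_anchor_boxes(words, {"ceiling", "height"})
crops = [crop_around(a, page_w=9000, page_h=6500, margin=300) for a in anchors]
print(crops[0])  # Box(x0=3700, y0=0, x1=4500, y1=500)
```

Only the small crops, rather than the whole page, would then be passed further down the pipeline.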

3.3. On‑premise only, on a shared GPU node

One more important constraint: the system had to run entirely inside the client's infrastructure. No external APIs. No project documentation leaving for a cloud LLM. No sensitive data crossing the perimeter.

The client already had a shared 2×H100 GPU node on their own infrastructure. For this project, we were allocated 65 GB of VRAM out of that node — it wasn't a dedicated machine reserved for us. Our pipeline had to fit inside that 65 GB budget while sharing the node with other workloads.

Practically speaking, the task wasn't just "build an AI service" — it was "fit an AI service into a fixed set of constraints":

run on‑premise;
use only the models available inside their perimeter;
stay within the allocated GPU budget;
handle large document sets reliably;
finish processing in a usable amount of time.

The final service runs inside the client's infrastructure and does its work inside that GPU allocation.

§ 04 The core engineering idea: a validation gate, not model self‑confidence

We didn't try to make the LLM answer every question at any cost. We built the pipeline around a different idea:

The system should not just produce answers. It should know where an answer is reliably backed by the documentation, where a quick human check is needed, and where it should honestly say "I don't know".

This is the important distinction.

We don't use the model's own confidence as the final reliability signal. An LLM can sound confident when it's wrong. So the high / medium / low tiers aren't decided by the model — they're decided by a separate validation gate.

The validation gate looks at:

the sources that were retrieved;
the strength of the textual support;
whether the answer actually matches the question;
the quality of the extracted fragment;
how directly the answer can be backed by the documentation;
signs that the answer might refer to a different object, section, or zone than the one being asked about.

Only after that does the system assign a tier.
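To make the idea concrete, here is a minimal sketch of a gate that maps evidence signals to a tier. It is an illustration only: the `Evidence` fields, the `assign_tier` function, and both thresholds are assumptions made for this example, not the rules used in the real service. The one property it is meant to show is that the model's own confidence is not an input.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Signals gathered for one candidate answer (all fields are illustrative)."""
    has_direct_match: bool   # the answer value appears directly in a retrieved fragment
    answers_question: bool   # a separate check confirms the answer fits the question
    support_score: float     # 0..1, strength of the textual support
    scope_mismatch: bool     # the fragment may describe another object, section, or zone


def assign_tier(ev: Evidence, high_min: float = 0.9, medium_min: float = 0.5) -> str:
    """Map evidence to 'high', 'medium', or 'low' using hypothetical thresholds."""
    if ev.scope_mismatch:
        # A possible wrong object or zone can never be high, at best medium.
        return "medium" if ev.support_score >= medium_min else "low"
    if ev.has_direct_match and ev.answers_question and ev.support_score >= high_min:
        return "high"
    if ev.support_score >= medium_min:
        return "medium"
    return "low"


print(assign_tier(Evidence(True, True, 0.95, False)))   # high
print(assign_tier(Evidence(True, True, 0.95, True)))    # medium
print(assign_tier(Evidence(False, False, 0.20, False))) # low
```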

The result isn't just an answer generator. It's a controlled AI pipeline with an explicit boundary of trust.

§ 05 How the service works in practice

For the user, the solution looks like a web service. A specialist uploads:

the checklist for a specific site;
the project documentation set;
any extra reference materials, if needed.

The pipeline then runs over the documents. It analyses the corpus, retrieves relevant fragments, extracts candidate answers, runs each one through the validation gate, and returns an annotated checklist.
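The flow described above can be sketched as a small orchestration loop. This is a simplified outline under stated assumptions: the three stages are passed in as plain functions, the `annotate` name and the dictionary keys are hypothetical, and the real stages (retrieval, extraction, validation) are far more involved than the stubs in the usage example.

```python
from typing import Callable, Optional

# Hypothetical stage signatures for this sketch.
Retrieve = Callable[[str], list[str]]                    # question -> fragments
Extract = Callable[[str, list[str]], Optional[str]]      # question, fragments -> candidate
Gate = Callable[[str, Optional[str], list[str]], str]    # -> "high" | "medium" | "low"


def annotate(questions: list[str], retrieve: Retrieve, extract: Extract, gate: Gate) -> list[dict]:
    """Run every checklist question through retrieve -> extract -> gate."""
    rows = []
    for question in questions:
        fragments = retrieve(question)
        candidate = extract(question, fragments)
        tier = gate(question, candidate, fragments)
        rows.append({
            "question": question,
            # A low-tier candidate is dropped instead of being shown as a guess.
            "answer": candidate if tier != "low" else None,
            "tier": tier,
            "sources": fragments,
        })
    return rows


# Usage with toy stand-ins for the three stages.
rows = annotate(
    ["Wall material?"],
    retrieve=lambda q: ["Walls: brick"],
    extract=lambda q, frags: "brick",
    gate=lambda q, cand, frags: "high",
)
print(rows[0]["answer"], rows[0]["tier"])  # brick high
```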

A few hours later, the specialist receives a result where every item has:

a proposed answer;
a reliability tier — high, medium, or low;
links to the source locations in the documentation;
relevant fragments to verify against;
a clear next step: accept, double‑check, or "not enough evidence".
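One way to picture a single annotated item is as a small record. The sketch below is an assumption about shape, not the service's actual schema: `ItemResult`, `SourceRef`, the field names, and the sample values are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

# Next step shown to the specialist for each tier.
NEXT_STEP = {
    "high": "accept",
    "medium": "double-check",
    "low": "not enough evidence",
}


@dataclass
class SourceRef:
    """Pointer to a location in the documentation set."""
    document: str
    page: int
    fragment: str


@dataclass
class ItemResult:
    """One checklist item as returned to the specialist."""
    item_id: str
    tier: str                      # "high", "medium", or "low"
    answer: Optional[str] = None   # None when the system abstains
    sources: list[SourceRef] = field(default_factory=list)

    @property
    def next_step(self) -> str:
        return NEXT_STEP[self.tier]


result = ItemResult(
    item_id="4.12",
    tier="medium",
    answer="2.7 m",
    sources=[SourceRef("section_AR.pdf", 48, "clear height 2.7 m")],
)
print(result.next_step)  # double-check
```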

End‑to‑end processing takes around 4 hours and runs entirely on‑premise.

§ 06 Three confidence tiers

6.1. `high`

high is the class of questions where the answer is, with very high probability, written directly in the body text of the documentation.

Honestly: this is not the hardest part of the checklist. Most of these are well‑formalised, fairly simple questions — parameters, values, characteristics, the kind of thing that's usually stated explicitly in the docs.

But these were exactly the items creating most of the manual grind. Even when the answer is simple, the specialist still has to:

find the right PDF;
open the right section;
verify the wording;
locate the value;
copy it into the checklist;
and confirm it actually refers to the right object.

That's the repeatable part we automated.

An answer only lands in high if it passes the strict validation gate: a direct match was found in the documentation, and the answer has been confirmed to match the question.

Per the validation results, high covers around 30% of checklist items at very high accuracy — about 99.9%.

For the business, that means a meaningful chunk of the manual grind disappears with almost no loss of reliability.
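As a back-of-envelope check of what those two figures imply, the arithmetic below takes the checklist size as exactly 2,000 items (the source says "2,000+") and treats 30% and 99.9% as exact. It is an estimate derived from the stated numbers, not a measured result.

```python
items = 2000            # checklist size, taken as exactly 2,000 for this estimate
high_share = 0.30       # share of items that land in high
high_accuracy = 0.999   # accuracy within high

high_items = items * high_share
expected_wrong = high_items * (1 - high_accuracy)

# About 600 items auto-filled, with fewer than one expected error per checklist.
print(round(high_items), round(expected_wrong, 2))  # 600 0.6
```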

6.2. `medium`

medium is the class where the system found relevant material and can propose an answer, but the validation gate doesn't consider the support strong enough for high. That can happen for several reasons:

the wording in the docs differs from the wording in the checklist;
the information is spread across several places;
the answer requires interpretation;
there's a risk the retrieved fragment refers to the wrong section or zone;
part of the data lives on a plan or drawing;
the sources are useful but don't give a direct enough confirmation.

For medium, answer accuracy sits around 70%.

But the value of this tier isn't only the answer itself. If the relevant information exists in the documentation at all, the system surfaces it as a useful source in roughly 95% of cases, usually as fewer than ten links or fragments the specialist can scan quickly.

So the specialist no longer starts the search from scratch. Instead of skimming dozens of PDFs by hand, they get a short list of locations where the answer is most likely to be.
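The "short list of locations" idea can be illustrated with a deliberately naive ranking. The sketch below scores fragments by word overlap with the question and keeps at most nine; this scoring is a stand-in chosen for brevity, not the retrieval method used in the service, and the function names and sample fragments are hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shortlist(question: str, fragments: list[tuple[str, str]], limit: int = 9) -> list[str]:
    """Rank (location, text) fragments by token overlap with the question.

    Fragments with no overlap are dropped; at most `limit` locations are returned.
    """
    q = tokens(question)
    scored = [(len(q & tokens(text)), location) for location, text in fragments]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [location for _, location in ranked[:limit]]


fragments = [
    ("doc_A.pdf p.12", "Floor to ceiling height in the lobby is 3.2 m"),
    ("doc_B.pdf p.80", "Fire extinguishers are placed near exits"),
    ("doc_C.pdf p.5", "Ceiling finish: suspended panels"),
]
question = "Height from floor to ceiling, including the upper partition"
print(shortlist(question, fragments))  # ['doc_A.pdf p.12', 'doc_C.pdf p.5']
```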

6.3. `low`

low is the class where the validation gate didn't find solid enough support. Here the system doesn't try to guess and doesn't push the model to make up "the most likely" answer. It just says, honestly:

I don't know the answer to this question.

That mattered to us. In construction documentation, it's better not to give an answer than to deliver a polished, confident, unsupported one.

low isn't a system failure. It's part of the system's reliability. The system has to know not only where it can help, but also where to leave the decision to the specialist.

§ 07 Results in numbers

Before:

Parameter	Before automation
Specialists	3
Time per checklist	10–14 working days
Cadence	3× per month
Monthly load	90–126 person‑days

After:

Parameter	After automation
Specialists	1
Time per checklist	5–7 working days
Cadence	3× per month
Monthly load	15–21 person‑days

The monthly load dropped from 90–126 to 15–21 person‑days. That's a 6× reduction in raw labour at both ends of the range.
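The figures in both tables follow from one multiplication: specialists × working days per checklist × checklists per month. A short check, with `monthly_load` as a helper name introduced only for this example:

```python
def monthly_load(specialists: int, days_per_checklist: tuple[int, int], per_month: int) -> tuple[int, int]:
    """Return the (min, max) monthly load in person-days."""
    low, high = days_per_checklist
    return (specialists * low * per_month, specialists * high * per_month)


before = monthly_load(3, (10, 14), 3)
after = monthly_load(1, (5, 7), 3)
ratio = (before[0] / after[0], before[1] / after[1])

print(before, after, ratio)  # (90, 126) (15, 21) (6.0, 6.0)
```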

And we didn't try to replace the expert outright. The specialist stays in the decision loop — they just stop spending their day chasing PDFs and start spending it on the disputed items and final approval.

§ 08 What changed for the specialist

Before, the specialist would open the documentation set and dig through dozens of PDFs by hand looking for answers.

After, they open a pre‑annotated checklist:

high can be accepted without the usual manual recheck;
medium can be verified quickly against the surfaced sources;
low is immediately flagged as something to handle manually.

The specialist's job changed from:

Find every answer from scratch.

to:

Check the disputed ones and approve.

That shift is where most of the gain came from.

§ 09 An unexpected bonus

One nice surprise: the high tier ended up exposing errors in old checklists. So the system didn't just speed up the new process — it also flagged inconsistencies in already‑completed documents.
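A check of this kind can be as simple as comparing an old checklist against the new high answers. The sketch below is one plausible way to do it, not a description of how the discrepancies were actually found; `find_discrepancies`, `normalize`, and the sample values are hypothetical.

```python
def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(value.lower().split())


def find_discrepancies(old: dict[str, str], high_answers: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {item_id: (old_value, new_value)} where a high answer disagrees with the old checklist."""
    return {
        item_id: (old[item_id], new_value)
        for item_id, new_value in high_answers.items()
        if item_id in old and normalize(old[item_id]) != normalize(new_value)
    }


old_checklist = {"1.1": "Concrete C25", "1.2": "2.7 m"}
high_answers = {"1.1": "concrete  c25", "1.2": "3.0 m", "1.3": "yes"}
print(find_discrepancies(old_checklist, high_answers))  # {'1.2': ('2.7 m', '3.0 m')}
```

Each flagged item would still go to a specialist, since the disagreement alone does not say which side is wrong.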

For the client, that was extra confirmation that the pipeline isn't just generating plausible text. It's actually working with the documentation, and it can find useful facts.

§ 10 Why the client was happy

The client started out wanting full automation. We could've said yes, disappeared into development for several months, and come back with a system whose quality wasn't stable enough for a real business process.

Instead, we took a more grown‑up route:

we were honest about the risks of full automation;
we proposed partial automation with an explicit trust boundary;
we didn't lean on the model's self‑confidence;
we built a separate validation gate;
we kept the specialist in the loop where it actually matters;
we returned transparent sources alongside every answer;
we fit into the client's on‑premise infrastructure;
and we cut the labour cost by roughly 6×.

The client didn't get an experimental AI demo. They got a working tool that helps them do a hard job faster, more safely, and more predictably.

§ 11 The main takeaway

A successful AI project isn't always about replacing the human 100%. In our case, the best result came not from full automation but from controlled partial automation.

We split the task into three zones:

Simple, well‑supported questions — the system handles them almost entirely.
Disputed questions — the system helps with a quick verification through surfaced sources.
Questions without solid support — the system honestly leaves them to a human.

That's what made the solution production‑ready.

Not the largest possible number of polished AI answers — but a reliable pipeline that knows its own limits, shows its sources, and amplifies the specialist where it actually creates business value.
