The Model is Not the Deliverable: The Other 85%

Marijana Vukovic · 7 min read

Demos have gotten very good at hiding what isn't ready. A model that answers questions, flags anomalies, and classifies images with boardroom-grade confidence can look like a finished product right up until someone asks what happens when it touches a real clinical workflow, a real fleet, a real supply chain decision. And the answer turns out to be "we haven't gotten there yet." That gap between a working model and a deployable system is where most regulated AI initiatives stall, and it almost never has anything to do with the model itself.

That gap has widened considerably in the past two years, and somewhat counterintuitively, more capable models are part of the reason. As foundation models do more and agentic systems operate with greater autonomy, the surface area that needs to be validated, governed, and monitored has grown faster than most organizations anticipated. The model is no longer the hard part, which means the hard part is everything else, and teams that keep treating model selection as the central question are optimizing for the wrong variable.

Buying "an AI model" is the wrong mental model

When executives talk about procuring AI, the conversation centers on which foundation model, which vendor, what benchmark, because that's the part that maps cleanly onto a procurement decision. But in regulated industries, nobody is actually buying a prediction function. They're buying a change in how consequential work gets done, and that's a meaningfully different thing to scope, build, and deliver. 

A recent example: an AI system that predicts individual patient recovery trajectories post-procedure and automatically surfaces deviations from the expected curve. The point was to flip a reactive workflow into a proactive one. Instead of clinicians waiting for the next scheduled checkup or a patient calling in with a problem, the system flags patients drifting off-trajectory while the care team can still intervene.

The model was roughly 15% of the engineering effort. The system around it (the integrations, the evidence layer, the operating model, the handoff) was everything else. Organizations that internalize this early stop asking what the model can do and start asking what the system around it can survive: bad data, a staff change, a regulatory update, an edge case the pilot never encountered.

The gap a demo doesn't show

The first layer of any AI deployment (the model, the workflow integration, the data interfaces, the user experience) is the part most teams recognize and know how to build. It's also the part that can make a system look finished when it isn't. 

What a working interface doesn't surface are the questions sitting underneath it: how was this validated, what assumptions is it making about the data it receives, who is accountable when it fails in a way the pilot never anticipated. In conventional software, you can often ship and let real usage answer those questions. In regulated industries, that's not an option: the questions have to be resolved before the system touches a real patient, a real vehicle, a real supply chain decision. Teams that treat a functioning interface as the finish line tend to learn how costly that is later than they should.

If the system can't be inspected, it can't be approved

A working workflow isn't the same as a defensible one. Sooner or later a medical director, safety engineer, or compliance officer asks how the system was tested, what its known limits are, and what happens when it's wrong. If nobody has a clear answer, that is usually where executive confidence starts to erode. It happens earlier than most teams expect, and the credibility lost in that moment is hard to rebuild.

What closes that gap is an evidence layer: an evaluation suite tested against real-world scenarios, a validation package capturing intended use and known failure modes, documentation of what data the system has and hasn't seen, and a change history that's actually auditable. In regulated settings, this isn't supplementary documentation but part of the product. A system that performs well but can't be reviewed can't be approved, and it won't be.
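One way to make "part of the product" concrete is to treat the evidence layer as something a release check can fail on. The sketch below is illustrative only: the artifact names mirror the four items listed above, and the class, field, and file names are hypothetical rather than a description of any particular toolchain.

```python
from dataclasses import dataclass, field

# Hypothetical artifact keys, one per evidence-layer item named in the text.
REQUIRED_ARTIFACTS = (
    "evaluation_suite",    # results against real-world scenarios
    "validation_package",  # intended use and known failure modes
    "data_coverage",       # what data the system has and hasn't seen
    "change_history",      # auditable record of changes
)


@dataclass
class EvidencePackage:
    # Maps artifact name to wherever the document or report lives (assumed layout).
    artifacts: dict[str, str] = field(default_factory=dict)

    def missing(self) -> list[str]:
        """Return required artifacts that are absent or have no location."""
        return [name for name in REQUIRED_ARTIFACTS if not self.artifacts.get(name)]

    def ready_for_review(self) -> bool:
        """True only when every required artifact is present."""
        return not self.missing()


package = EvidencePackage(
    artifacts={
        "evaluation_suite": "evals/report.md",
        "validation_package": "docs/validation.md",
    }
)
print(package.ready_for_review(), package.missing())
```

A check like this says nothing about the quality of the evidence, only that it exists and can be located, which is the minimum a reviewer needs before the real inspection starts.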

Production is where the real test begins

There is a common assumption that once a system is in production, the hard work is done. With AI, the opposite is closer to the truth. Production is where the system starts accumulating conditions no pilot ever replicates: data drift, model behavior shift, regulatory updates, feedback loops between the system and the humans using it, and the gradual erosion of the assumptions the original build was designed around.

A system that performs reliably on day one can become quietly untrustworthy by month three if nobody designed the operating layer: what gets monitored, who is alerted when behavior changes, what constitutes a failure serious enough to trigger rollback, who owns the system when the implementation team is gone. The specifics vary (escalation rules and audit logs in clinical environments, telemetry quality checks in mobility, exception handling in logistics) but the requirement doesn't. If nobody owns the operating layer, nobody actually owns the AI system.
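The operating-layer questions above (what gets monitored, who is alerted, what triggers rollback, who owns it) can be written down as data rather than left as tribal knowledge. What follows is a minimal sketch under stated assumptions: the metric name, thresholds, and owner are invented for illustration, and a real deployment would likely need more than a single threshold per metric.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    OK = "ok"
    ALERT = "alert"
    ROLLBACK = "rollback"


@dataclass(frozen=True)
class MonitorRule:
    metric: str            # what gets monitored
    alert_above: float     # level at which the owner is alerted
    rollback_above: float  # level serious enough to trigger rollback
    owner: str             # who owns the system after the build team leaves

    def __post_init__(self) -> None:
        # A rollback threshold below the alert threshold would skip the alert stage.
        if self.rollback_above < self.alert_above:
            raise ValueError("rollback_above must be >= alert_above")


def evaluate(rule: MonitorRule, observed: float) -> Action:
    """Map an observed metric value to the action the rule prescribes."""
    if observed > rule.rollback_above:
        return Action.ROLLBACK
    if observed > rule.alert_above:
        return Action.ALERT
    return Action.OK


# Hypothetical rule: a drift score for incoming data, owned by a named team.
drift_rule = MonitorRule(
    metric="input_drift_score",
    alert_above=0.10,
    rollback_above=0.25,
    owner="clinical-ml-oncall",
)
print(evaluate(drift_rule, 0.15))
```

The point of the `owner` field is that a rule without one cannot be constructed, which is a small way of enforcing the requirement that somebody actually owns the operating layer.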

Human oversight is a design problem, not a checkbox

Regulated AI plans almost universally include human oversight and treat it as settled once the phrase is in the documentation. A checkbox gets ticked, and the assumption is that having a person somewhere near the system constitutes meaningful control. It doesn't.

Oversight only works when the workflow makes explicit what the human is reviewing, what information they need, when to intervene, and how their decision gets recorded. Without that design, you get alert fatigue, rubber-stamping, and unclear accountability, which in high-stakes environments can be more dangerous than no automation. The human needs to be a participant in a workflow built to support them, not a label applied to make the system look controlled.
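As a rough illustration of what "designed" oversight might look like, the sketch below refuses to record a review unless it captures what the reviewer was shown, what they decided, and why. All names here (the decision values, fields, and identifiers) are hypothetical, chosen for the example rather than drawn from any specific system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical set of decisions a reviewer can make on a flagged case.
ALLOWED_DECISIONS = {"accept", "override", "escalate"}


@dataclass(frozen=True)
class ReviewRecord:
    case_id: str
    reviewer: str
    evidence_shown: tuple[str, ...]  # what the human actually saw
    decision: str
    rationale: str
    recorded_at: datetime


def record_review(
    case_id: str,
    reviewer: str,
    evidence_shown: tuple[str, ...],
    decision: str,
    rationale: str,
) -> ReviewRecord:
    """Build an auditable record, rejecting reviews that lack the basics."""
    if decision not in ALLOWED_DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    if not evidence_shown:
        raise ValueError("a review with no evidence shown is a rubber stamp")
    if decision != "accept" and not rationale.strip():
        raise ValueError("overrides and escalations need a rationale")
    return ReviewRecord(
        case_id=case_id,
        reviewer=reviewer,
        evidence_shown=tuple(evidence_shown),
        decision=decision,
        rationale=rationale,
        recorded_at=datetime.now(timezone.utc),
    )


record = record_review(
    case_id="case-001",
    reviewer="reviewer-a",
    evidence_shown=("recovery_curve", "deviation_summary"),
    decision="escalate",
    rationale="Deviation persists across two consecutive check-ins.",
)
print(record.decision, record.recorded_at.isoformat())
```

Validation like this doesn't prevent alert fatigue on its own, but it does make rubber-stamping visible in the record instead of invisible in the workflow.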

Where most vendor engagements fail

Technical delivery and genuine capability transfer are not the same thing, and the gap between them is where a surprising number of AI engagements end. The code exists, the contract is fulfilled, and then the internal team discovers they don't understand the evaluation suite, can't locate the evidence, and have no clear owner across business, product, engineering, and compliance.

That's not a deployment. It's dependency with a launch date.

A real handoff leaves the organization with runbooks, an ownership map across every function that touches the system, documentation someone who wasn't in the original build can act on, and a roadmap specific enough to execute. The goal is to leave a capability they can operate, explain, and evolve independently. 

The question worth asking before you sign

The most common question in AI procurement ("which model are you using?") rarely determines whether an initiative reaches production. The better question is: what exactly will we own when this engagement is over?

The answer should name the workflow, integration points, evidence package, evaluation methodology, monitoring approach, human oversight model, operating runbook, and first post-launch milestone. If it's vague, the risk is both technical and commercial. Six months proving that AI is possible, without producing anything the business can safely use, is a failure mode that's easy to avoid and surprisingly common. 

At SpiceFactory, we scope regulated AI as a shipped capability from the start: the system, the evidence layer, the operating model, and the transfer of ownership designed together, not assembled after the fact. Because a deployment the client can't explain, govern, and improve on their own isn't really a deployment at all.

If you're moving an AI pilot toward production, start with that conversation.
