By Aron Racho
May 18, 2026

Designing Deterministic AI Chatbots

Many organizations are currently exploring how Generative AI chatbots could improve customer service, patient outreach, employee communication, and other operational workflows. On the surface, these systems appear relatively straightforward to build. Modern LLMs can already hold conversations, answer questions, and generate human-like responses with surprisingly little setup. But once chatbots are introduced into real production workflows, the problem becomes much more complicated.

As customers ourselves, most of us have interacted with chatbots that felt frustratingly limited or unreliable. Some fail because they cannot handle normal conversational variation. Others appear conversational, but produce inconsistent behavior once conversations move outside ideal scenarios. At Olio Apps, we recently worked on a healthcare patient outreach chatbot that needed to balance two competing requirements:
  • maintain a natural conversational experience
  • consistently follow structured business workflows
That tension is one of the central challenges in production chatbot systems built on LLMs. In this post, we'll look at why chatbot systems that must follow business rules are difficult to build, where LLMs perform well, where they struggle, and how architectural design becomes critical for creating reliable production systems.

Business Logic Didn't Start with LLMs

Before AI chatbots existed, customer service organizations already relied heavily on structured communication systems. Support representatives often followed scripts like:
  • If the customer says X, respond with Y
  • If the customer becomes frustrated, escalate to a manager
  • If the issue involves billing, follow a separate workflow
These are examples of classical decision systems. In real-world organizations, these workflows can become extremely complex. There may be dozens of possible conversation paths depending on:
  • customer sentiment
  • escalation conditions
  • compliance requirements
  • service category
  • workflow state
Traditional support systems are often built around these classical decision systems because they provide operational consistency.
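To make the idea concrete, here is a minimal sketch of a support script expressed as an ordered rule table. The rule names, categories, and actions are hypothetical and only mirror the example script above; a real organization's rules would be far larger.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScriptRule:
    """One row of a support script: conditions to match and the action to take."""
    category: Optional[str]     # None means "any service category"
    frustrated: Optional[bool]  # None means "any sentiment"
    action: str


# Hypothetical rules, checked top to bottom; the first match wins.
SCRIPT = [
    ScriptRule(category=None, frustrated=True, action="escalate_to_manager"),
    ScriptRule(category="billing", frustrated=None, action="billing_workflow"),
    ScriptRule(category=None, frustrated=None, action="standard_reply"),
]


def next_action(category: str, frustrated: bool) -> str:
    """Return the action of the first rule whose conditions all match."""
    for rule in SCRIPT:
        if rule.category not in (None, category):
            continue
        if rule.frustrated not in (None, frustrated):
            continue
        return rule.action
    raise LookupError("no rule matched")
```

Because the table is plain data, the same inputs always select the same row, and adding a new conversation path means adding a row rather than rewording instructions.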

What Classical Decision Systems Do Best

Classical decision systems are designed to behave predictably. They are deterministic systems: systems that follow defined rules and produce consistent outcomes based on known inputs. Given the same information, the system should behave the same way every time. This predictability is critical in production environments because it makes systems easier to test, monitor, and audit, and ultimately easier to trust operationally. Classical decision systems are especially important in industries like healthcare, finance, and enterprise SaaS, where workflows often involve compliance requirements, approvals, escalation paths, and operational safeguards. However, classical decision systems also have limitations. They tend to struggle with natural conversation and ambiguity, because human conversations rarely follow perfectly structured patterns.

What LLMs Are Good At and Where They Fall Short

Large language models are extremely good at extracting meaning from messy, unstructured language. This ability to turn conversational language into structured meaning is one of the biggest strengths of modern AI systems. For example, a user might say: "I've been charged twice and nobody is responding." An LLM can infer:
  • this is likely a billing issue
  • the user is frustrated
  • the interaction may require escalation
LLMs are also very good at:
  • handling conversational variation
  • sounding natural and human-like
  • adapting responses in a dynamic way
  • interpreting context from incomplete language
However, LLMs are not naturally reliable at following large sets of business rules consistently. As complexity grows, predictability suffers. When overloaded with instructions, workflows, edge cases, and exceptions, they may begin producing inconsistent outputs or inventing behavior that was never explicitly defined. Inventing behavior in this way is commonly referred to as hallucination. For sensitive workflows, that becomes a serious operational problem. In a healthcare outreach workflow, for example, we do not want the system inventing escalation paths or improvising business rules.

Deterministic vs Non-Deterministic Systems

This creates a fundamental tension in production chatbot systems. Classical decision systems are deterministic:
  • predictable
  • structured
  • controlled
LLMs are non-deterministic:
  • flexible
  • conversational
  • adaptive
  • variable
Each approach solves a different problem well. The challenge is that production chatbot systems often need both simultaneously. They need to feel conversational and flexible for users, while also remaining operationally predictable for the business. That combination is where many first-generation chatbot systems fail.

The Initial Solution That Didn't Work

The most common initial approach is straightforward: put everything into one large prompt. The prompt attempts to define:
  • tone
  • workflow behavior
  • escalation logic
  • business rules
  • edge cases
  • response formatting
  • conversational guidance
We initially approached the problem this way as well. This was reasonably successful for our early prototypes. But production variability exposed the weaknesses quickly. Some conversations followed workflows correctly but drifted in tone. Others sounded polished while violating important business rules. Small prompt changes intended to fix one scenario often introduced problems elsewhere. The larger the prompt became, the more difficult the system became to control. The issue was not necessarily that the prompt was poorly written. The issue was that too many responsibilities were concentrated into a single layer.

The Architectural Solution That Worked

What ultimately worked better was designing the system around the strengths of two different approaches:
  • LLMs for language understanding
  • Classical decision systems for operational control
Instead of asking one prompt to do everything, we separated the system into multiple stages with narrower responsibilities.
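As a rough sketch of that separation, the function below wires three stages together. The name `handle_message` and its three callable parameters are our own illustration, not the production code; the point is only that the middle stage is plain code with no model call.

```python
from typing import Callable, Dict

Facts = Dict[str, object]


def handle_message(
    message: str,
    extract_facts: Callable[[str], Facts],    # Stage 1: LLM-backed interpretation
    decide: Callable[[Facts], str],           # Stage 2: plain rules, no LLM
    phrase_reply: Callable[[str, str], str],  # Stage 3: LLM-backed wording
) -> str:
    """Run one user message through the three stages in order."""
    facts = extract_facts(message)
    step = decide(facts)
    return phrase_reply(step, message)
```

Passing the stages in as functions keeps each one replaceable: the LLM-backed stages can be swapped for canned stubs in tests while the decision stage runs unchanged.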

Stage 1: Interpretation and Fact Extraction

The first stage uses an LLM to interpret the conversation and extract structured information. For example: "I've been charged twice and nobody is getting back to me." The system extracts:
  • intent: billing issue
  • sentiment: frustrated
  • priority: high
  • escalation candidate: yes
At this stage, the LLM is doing what it does best: it is turning unstructured language into structured data.
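One way to keep this stage honest is to validate the model's output before anything downstream sees it. The sketch below assumes the model has been asked to reply with a JSON object; the field names follow the example above, while the allowed values and the `parse_facts` helper are hypothetical.

```python
import json
from dataclasses import dataclass

# Hypothetical vocabularies; a real deployment would define its own.
ALLOWED = {
    "intent": ("billing_issue", "scheduling", "other"),
    "sentiment": ("frustrated", "neutral", "positive"),
    "priority": ("low", "normal", "high"),
}


@dataclass(frozen=True)
class ExtractedFacts:
    intent: str
    sentiment: str
    priority: str
    escalation_candidate: bool


def parse_facts(raw_llm_output: str) -> ExtractedFacts:
    """Turn the model's JSON reply into validated facts, or raise ValueError."""
    data = json.loads(raw_llm_output)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, allowed in ALLOWED.items():
        if data.get(field) not in allowed:
            raise ValueError(f"unexpected {field}: {data.get(field)!r}")
    if not isinstance(data.get("escalation_candidate"), bool):
        raise ValueError("escalation_candidate must be true or false")
    return ExtractedFacts(
        intent=data["intent"],
        sentiment=data["sentiment"],
        priority=data["priority"],
        escalation_candidate=data["escalation_candidate"],
    )
```

Rejecting anything outside the known vocabulary means an invented intent or a stray sentence from the model fails loudly here instead of leaking into the workflow.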

Stage 2: Classical Decision Logic

The second stage removes the LLM from business decision-making. A classical decision-system layer processes the extracted facts and determines the next workflow step. For example:
  • billing issue + escalation candidate → trigger billing escalation workflow
This layer remains predictable, testable, and governed by explicit operational rules.
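A minimal sketch of such a layer might look like the following. Only the first rule comes from the example above; the other rules, the step names, and the `decide_next_step` function are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Facts:
    intent: str
    escalation_candidate: bool


def decide_next_step(facts: Facts) -> str:
    """Map extracted facts to a workflow step using explicit, ordered rules."""
    # Rule from the example: billing issue + escalation candidate.
    if facts.intent == "billing_issue" and facts.escalation_candidate:
        return "billing_escalation"
    # Hypothetical additional rules, included only to show the shape.
    if facts.intent == "billing_issue":
        return "billing_standard"
    if facts.escalation_candidate:
        return "general_escalation"
    return "standard_reply"
```

Because this is an ordinary pure function, every rule can be covered by a plain unit test, and the same facts yield the same workflow step on every call.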

Stage 3: Conversational Response Generation

Only after the workflow decision has already been made does the system return to the LLM. At this point, the model is responsible only for phrasing the response naturally for the user. The business logic itself has already been determined elsewhere. This separation substantially improved reliability because each layer was responsible for a smaller and more focused task. We also found that smaller, focused prompts consistently performed better than large "do everything" prompts in production environments. The general engineering principle of single responsibility holds up well in the age of AI.
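To illustrate how narrow the final prompt can be, the sketch below builds it from a decision that has already been made. The step names, instruction text, and `build_reply_prompt` helper are hypothetical, and the call to the model itself is left out.

```python
# Hypothetical mapping from workflow step to what the reply must communicate.
STEP_INSTRUCTIONS = {
    "billing_escalation": (
        "Tell the user their billing issue has been escalated "
        "and that a specialist will follow up."
    ),
    "standard_reply": "Acknowledge the message and ask how you can help.",
}


def build_reply_prompt(step: str, user_message: str) -> str:
    """Build a prompt that asks the model only to phrase a decided outcome."""
    instruction = STEP_INSTRUCTIONS[step]  # unknown steps raise KeyError on purpose
    return (
        "You are writing one short, friendly reply to a user.\n"
        f"What the reply must communicate: {instruction}\n"
        "Do not promise anything beyond that.\n"
        f"User message: {user_message}"
    )
```

The prompt carries no workflow rules at all, only the outcome to communicate, so changes to business logic never require editing the wording prompt.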

Summary and What's Next

One of the biggest lessons from this work was to treat the LLM as one component within a well-architected system, not the system itself. LLMs are extremely powerful for interpreting language and generating natural conversation. Classical decision systems remain extremely valuable for enforcing operational control and business rules. The most effective production chatbot systems combine both approaches. At Olio Apps, we've found that separating interpretation, decision-making, and response generation creates systems that are more reliable, easier to test, and easier to evolve over time. In this architecture, classical decision systems provide the structured layer that ensures operational consistency. In upcoming posts, we'll dive deeper into some of the architectural patterns behind these systems, including:
  • prompt decomposition strategies
  • regression testing for conversational workflows
  • orchestration patterns for production LLM systems
  • balancing conversational flexibility with deterministic business logic
If you are exploring production AI systems or conversational workflows with similar operational requirements, we'd be happy to talk through architectural approaches and lessons learned from our work in this space.
Aron Racho
CTO

Aron has lived in the Pacific Northwest since the turn of the century. He joined Scott at Olio Apps in late 2015 and helped scale the team to its present size. He is a family man and also has many hobbies, which include music composition, game programming, and reading. Aron’s areas of expertise are project management, dev team management, application architecture, and full stack engineering in Java, Golang, and React.