Frugal AI and agent tokenomics · Prepared for a global pharmaceutical leader

Optimized performance meets economic efficiency: the formula for sustainable scale.

An agentic system's economics are decided before the agent runs, in how the data is modeled. A semantic model built to cost the least to query lets everything above it scale.

Context for this engagement. The data-layer readiness assessment is complete and a Snowflake Cortex build is live in an initial market. The foundation is in motion. What follows is the approach to optimize it, across its technical, architectural, and economic dimensions.

01 Architectural

The decision is in the data model

The token bill is mostly written at the data layer. A well-structured semantic model produces an output at the lowest token cost and the best quality. A thin or unmodeled layer makes the agent work harder, and pay more, on every call. This is a data engineering decision first.

It helps to hold two ideas apart. How the data is modeled, the shape of the tables, is one layer. How an agent works a problem is another. The two are easy to run together, and doing so is what makes the architecture conversation harder than it needs to be. This piece stays on the layer that controls cost.

A comparison between an older approach and a newer one shows which behaves better. It does not show which is more economical. Cost is a separate axis, and it is the one this methodology is built around.

02 Economic

The missing component: token consumption

A comparison of two approaches usually weighs output quality and latency. The component most often left out is token consumption, and it is the one that decides which approach stays affordable as usage grows. Measuring it is what turns a question of which behaves better into a question of which scales.

The reason it matters is in the shape of the bill.

5 to 30x

more tokens per agentic task than a single model call (Gartner)

~85%

of enterprise AI budgets is inference, not training (FinOps Foundation, 2026)

~1,000x

drop in unit token price over three years (Stanford HAI AI Index)

20x+

projected rise in token consumption by 2030 (Goldman Sachs Research)

The unit price falls, the volume rises, the bill grows. The metric that matters is not total token spend but cost per output, the price of each answer or generated result, and that number is set almost entirely by how the data is modeled before the agent runs.
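The arithmetic behind that paragraph can be sketched in a few lines. This is a minimal illustration, not a measurement: every figure below is hypothetical and chosen only to show how a falling unit price and a rising bill can coexist, and how tokens per output is the lever the data model controls.

```python
def cost_per_output(tokens_per_output: float, price_per_million_tokens: float) -> float:
    """Price of one answer: tokens consumed to produce it times the unit token price."""
    return tokens_per_output * price_per_million_tokens / 1_000_000


def total_bill(outputs: int, tokens_per_output: float, price_per_million_tokens: float) -> float:
    """Total spend for a period: number of outputs times cost per output."""
    return outputs * tokens_per_output * price_per_million_tokens / 1_000_000


# Hypothetical scenario 1: single model calls at an older, higher unit price.
before = total_bill(outputs=10_000, tokens_per_output=2_000, price_per_million_tokens=10.0)

# Hypothetical scenario 2: the unit price halves, but an agentic workflow uses
# 10x the tokens per output and usage doubles. The bill grows anyway.
after = total_bill(outputs=20_000, tokens_per_output=20_000, price_per_million_tokens=5.0)

# Hypothetical scenario 3: same usage and price as scenario 2, but a better data
# model cuts the interpretation work to a quarter of the tokens per output.
modeled = total_bill(outputs=20_000, tokens_per_output=5_000, price_per_million_tokens=5.0)

print(f"before={before:.2f} after={after:.2f} modeled={modeled:.2f}")
```

In this sketch the only input that changes between the second and third scenarios is tokens per output, which is why the methodology treats it as the number to engineer.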

Market figures cited from public 2026 sources for context. They are not Hakkoda or IBM measurements.

03 Technological and architectural

The methodology: four layers, in order

Two principles sit underneath the work. First, do not use AI for everything: the parts of the system that should be deterministic stay in code, where they are cheaper, faster, and safer. Second, model the data so it costs the least to query. From there, cost per output is set in four layers, and the order matters. Each layer removes work the agent would otherwise pay to do at runtime.

Model the data first

The semantic model is the foundation. It serves whatever agent approach runs on top, at the lowest token cost and the best output quality of any choice available. A bare model forces the agent to interpret it from scratch on every call, and that interpretation is where the tokens go. Get this layer right and every layer above it gets cheaper.

Build a skills library

Skills are prebuilt capabilities attached to specific kinds of work. With a skill in place, the agent stops reasoning through how to read the model for that task and runs a known-good procedure instead. Tokens per call drop and reliability rises. Modular skill architectures have been reported to cut token cost by 60 to 90 percent with no loss of output quality. Skills also compound: each one built for one engagement is a reusable accelerator for the next.
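One way to picture a skills library is as a registry that maps a kind of task to a tested procedure, with agent reasoning as the fallback only when no skill matches. The sketch below is illustrative: the registry, the `volume_by_region` skill, and the `sales_daily` table and its columns are all hypothetical names, not part of any delivered build.

```python
from typing import Callable, Dict, Tuple

# A skill takes task parameters and returns a SQL template plus its bind values.
SkillResult = Tuple[str, dict]
SkillFn = Callable[[dict], SkillResult]

SKILLS: Dict[str, SkillFn] = {}


def skill(task_kind: str) -> Callable[[SkillFn], SkillFn]:
    """Decorator that registers a known-good procedure for one kind of task."""
    def register(fn: SkillFn) -> SkillFn:
        SKILLS[task_kind] = fn
        return fn
    return register


@skill("volume_by_region")
def volume_by_region(params: dict) -> SkillResult:
    # Hypothetical table and column names; the date is passed as a bind value,
    # never interpolated into the SQL text.
    sql = (
        "SELECT region, SUM(volume) AS total_volume "
        "FROM sales_daily WHERE sale_date = %(date)s GROUP BY region"
    )
    return sql, {"date": params["date"]}


def resolve(task_kind: str, params: dict, fallback: Callable[[str, dict], SkillResult]) -> SkillResult:
    """Run the registered skill if one exists; otherwise pay for agent reasoning."""
    fn = SKILLS.get(task_kind)
    if fn is not None:
        return fn(params)
    return fallback(task_kind, params)
```

The design choice is that the lookup is plain code: a matched task spends no tokens working out how to read the model.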

Build a prompt repository and a model context protocol layer

Above the skills library sit locked, tested prompts, built against the specific model and data, not generic templates. A Model Context Protocol (MCP) layer defines how the agent connects to tools, passes context, and hands off between steps, so it does not improvise its connections on every call. Together they cut spend, cut latency, and leave less room for an expensive wrong turn.

Route to the right model

With the data modeled and the skills, prompts, and protocols in place, the last layer sends each task to the right model: the most economical one that clears the bar, capable enough to be reliably correct and no costlier than that requires. Snowflake orchestrates across cost tiers out of the box, on one principle: the shallowest sufficient capability per query. Simple retrieval does not pay frontier prices; complex synthesis earns them. Running a single top-tier model for everything overspends by an estimated 40 to 85 percent.
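The "shallowest sufficient capability" principle reduces to a small selection rule: among the models that clear the bar for a task, take the cheapest. The sketch below is a generic illustration and does not represent Snowflake's or any vendor's routing API; the tier names, capability scores, and prices are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ModelTier:
    name: str
    capability: int                 # hypothetical ordinal score, higher is more capable
    cost_per_million_tokens: float  # hypothetical price


# Hypothetical tiers for illustration only.
TIERS = (
    ModelTier("small", capability=1, cost_per_million_tokens=0.5),
    ModelTier("mid", capability=2, cost_per_million_tokens=3.0),
    ModelTier("frontier", capability=3, cost_per_million_tokens=15.0),
)


def pick_model(required_capability: int, tiers: Sequence[ModelTier] = TIERS) -> ModelTier:
    """Return the cheapest tier whose capability meets the task's requirement."""
    eligible = [t for t in tiers if t.capability >= required_capability]
    if not eligible:
        raise ValueError(f"no tier meets capability {required_capability}")
    return min(eligible, key=lambda t: t.cost_per_million_tokens)
```

For example, `pick_model(1)` selects the small tier for simple retrieval, while `pick_model(3)` reserves the frontier tier for complex synthesis.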

04 Technological and architectural

How the system runs once it is stood up

Stood up, the methodology becomes a running system. It is not a fixed pipeline: the router sends each request down the lightest path that fits, retrieval fans out in parallel when the question needs it, and only the hard cases run the full loop.

The router is the first economic control

Every request enters a lightweight router, a shallow reason-then-act step that picks the lightest mode that resolves it reliably. Most routine traffic resolves here, without an expensive reasoning loop, and a large share of the token bill is decided at this one step.

Direct

One grounded generate-and-execute step for simple retrieval, such as yesterday's volumes by region.

Validated

Generate, execute, inspect, review. For moderately complex questions that need a check.

Reasoned

Plan, run multiple queries, reconcile definitions, build a supported conclusion. Reserved for the small share where added reasoning changes the output.
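The three modes can be read as the output of a cheap classification step. The sketch below assumes the router has already extracted a few signals from the request; the signal names and thresholds are hypothetical and would be tuned against real traffic.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    DIRECT = "direct"        # one grounded generate-and-execute step
    VALIDATED = "validated"  # generate, execute, inspect, review
    REASONED = "reasoned"    # plan, multiple queries, reconcile, conclude


@dataclass(frozen=True)
class RequestSignals:
    # Hypothetical signals a lightweight router might extract from a request.
    metrics: int                   # number of distinct measures asked for
    joins: int                     # number of entities that must be joined
    conflicting_definitions: bool  # the same term is defined differently across sources


def route(signals: RequestSignals) -> Mode:
    """Pick the lightest mode expected to resolve the request reliably."""
    if signals.conflicting_definitions or signals.metrics > 3:
        return Mode.REASONED
    if signals.joins > 0 or signals.metrics > 1:
        return Mode.VALIDATED
    return Mode.DIRECT
```

The checks run from most to least expensive mode so that a request is escalated only when a signal demands it, and everything else falls through to the direct path.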

Retrieve in parallel, then reason on what survives

When a question needs real context, the system retrieves in parallel: many tool calls against the semantic model and its skills, deduplicated by source and ranked by evidence quality, then reduced to the strongest set. Public enterprise-agent research calls this schedule explore then exploit: broad first and focused after. The parallel calls also cross-check one another, so a single unreliable result is caught rather than carried forward. Reasoning then runs on that filtered context only, and the agent offloads intermediate state to an external store rather than carrying it in its own history, which keeps a long task from inflating the context window and the bill.
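The retrieve-then-reduce step can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: each retriever is a hypothetical callable that returns evidence items carrying a `source` label and a `quality` score, and how that score is computed is left open.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Evidence:
    source: str     # where the item came from, used for deduplication
    text: str
    quality: float  # hypothetical evidence-quality score, higher is better


Retriever = Callable[[str], List[Evidence]]


def retrieve_then_reduce(question: str, retrievers: Sequence[Retriever], keep: int = 3) -> List[Evidence]:
    """Fan out to all retrievers in parallel, dedupe by source, keep the strongest items."""
    with ThreadPoolExecutor(max_workers=max(1, len(retrievers))) as pool:
        batches = list(pool.map(lambda r: r(question), retrievers))

    # Deduplicate by source, keeping the highest-quality item for each one.
    best: Dict[str, Evidence] = {}
    for batch in batches:
        for item in batch:
            current = best.get(item.source)
            if current is None or item.quality > current.quality:
                best[item.source] = item

    # Rank by evidence quality and reduce to the strongest set.
    ranked = sorted(best.values(), key=lambda e: e.quality, reverse=True)
    return ranked[:keep]
```

Only the list returned here would be passed to the reasoning step, so the context it pays for is bounded by `keep` rather than by how many tools were called.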

The roles that hold it together

Roles are bounded, which keeps the system testable and stops it from spawning work without end. In a data system they map directly to the data path.

Manager

Reads the business question, breaks it into data tasks, and assembles the result. Delegates, and holds no query tools of its own.

Specialist

Owns one part of the path: schema discovery, semantic resolution, SQL generation, or result validation, each with only the tools that part needs. Cannot delegate, which prevents loops.

Worker

Runs one atomic operation against the warehouse: execute a query, call a function, return rows.

Access follows the same discipline. Read and search sit with retrieval, write and execution with tightly governed workers, and a pure reasoning step may hold no tools at all. Least privilege keeps the blast radius of any one component small.
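Bounded roles and least privilege can both be enforced with a small amount of plain code. In the sketch below the role names follow the text, while the tool names (`read_semantic_model`, `execute_query`) are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Role:
    name: str
    tools: FrozenSet[str]  # the only tools this role may call
    can_delegate: bool


# Tool names are hypothetical; the shape follows the manager/specialist/worker split.
MANAGER = Role("manager", frozenset(), can_delegate=True)
SQL_SPECIALIST = Role("sql_generation", frozenset({"read_semantic_model"}), can_delegate=False)
WORKER = Role("worker", frozenset({"execute_query"}), can_delegate=False)


def authorize(role: Role, tool: str) -> None:
    """Reject any tool call outside the role's allow-list."""
    if tool not in role.tools:
        raise PermissionError(f"{role.name} may not call {tool}")


def check_delegation(from_role: Role) -> None:
    """Only roles allowed to delegate may hand work to another role."""
    if not from_role.can_delegate:
        raise PermissionError(f"{from_role.name} may not delegate")
```

Because the manager's tool set is empty and specialists cannot delegate, the two failure modes the text names, a manager querying the warehouse directly and a specialist starting a delegation loop, are both rejected before any tokens are spent.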

05 Operational

Where the rubber meets the road

An architecture that works in a notebook holds in production only with the operating discipline around it. Day to day, the work is governance, guardrails, and observability, and none of it is model judgment.

Guardrails are deterministic, not agents

The controls that protect cost and correctness are enforced in code, which is cheaper, faster, and safer than asking an agent to enforce them.

Tools are strict contracts. Every tool call and every hand-off back to a manager passes schema validation, a structured output the next step can trust. A failure triggers a retry with backoff rather than letting a partial result compound silently into the steps that follow.

Recursion is bounded. Each branch and worker runs under a hard budget: maximum tokens, maximum tool calls, a wall-clock timeout, and a maximum delegation depth. The loop is guaranteed a termination point.

Humans gate the high-impact actions. Before a side effect that is hard to undo, such as a database write, a payment, or an outbound message, the workflow stops at a checkpoint and escalates to a person, with the trace and the proposed action attached.
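The first two guardrails above can be sketched as deterministic code. This is a minimal illustration, not a production implementation: the schema check is a simple key-and-type test standing in for a full schema validator, and the limits and delays are hypothetical defaults. The human checkpoint is left out of the sketch.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


class BudgetExceeded(RuntimeError):
    """Raised when a branch or worker runs past one of its hard limits."""


@dataclass
class Budget:
    max_tokens: int
    max_tool_calls: int
    max_seconds: float
    max_depth: int
    tokens: int = 0
    tool_calls: int = 0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int, depth: int) -> None:
        """Record one tool call and stop the branch if any limit is crossed."""
        self.tokens += tokens
        self.tool_calls += 1
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token budget exhausted")
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool-call budget exhausted")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("wall-clock timeout")
        if depth > self.max_depth:
            raise BudgetExceeded("delegation depth exceeded")


def validate(payload: dict, schema: Dict[str, type]) -> dict:
    """Simplified contract check: every required key is present with the right type."""
    for key, expected in schema.items():
        if not isinstance(payload.get(key), expected):
            raise ValueError(f"field {key!r} missing or not {expected.__name__}")
    return payload


def call_with_retry(tool: Callable[[], dict], schema: Dict[str, type], attempts: int = 3,
                    base_delay: float = 0.5, sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call a tool, validate its output, and retry with exponential backoff on failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return validate(tool(), schema)
        except ValueError:
            if attempt == attempts - 1:
                raise  # out of retries: fail loudly instead of passing a partial result on
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the backoff can be tested without waiting, and every failure path raises rather than returning a partial result.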

Everything is observable

The system emits a step-by-step execution trace, and operators watch agent success rates, tool-call failure rates, time to first token, and tokens consumed by agent type. Without traces, the system is a box that works until it does not. This is also where data-layer weakness shows up first, as latency in the data agent and cost that rises without an obvious cause.
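Two of the metrics named above, tokens consumed by agent type and tool-call failure rate, fall straight out of a step-level trace. The sketch assumes a hypothetical trace record with four fields; a real trace would carry more, such as timestamps for time to first token.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Step:
    # Hypothetical trace record emitted once per execution step.
    agent_type: str
    tokens: int
    tool_call: bool  # True if this step invoked a tool
    ok: bool         # True if the step succeeded


def summarize(trace: Iterable[Step]) -> dict:
    """Aggregate a trace into tokens by agent type and the tool-call failure rate."""
    tokens_by_agent: Dict[str, int] = defaultdict(int)
    calls = 0
    failures = 0
    for step in trace:
        tokens_by_agent[step.agent_type] += step.tokens
        if step.tool_call:
            calls += 1
            if not step.ok:
                failures += 1
    return {
        "tokens_by_agent_type": dict(tokens_by_agent),
        "tool_call_failure_rate": failures / calls if calls else 0.0,
    }
```

A data agent whose token total climbs while its request count stays flat is the kind of signal the text describes as data-layer weakness showing up first.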

The architecture and operating patterns in these two sections describe current public framing for enterprise agent systems. They are the reference design this methodology applies, not a specific delivered build or a Hakkoda or IBM measurement.

06 Architectural

How do you know your data is AI-ready?

Readiness is a property of the data model. The question to ask is whether the data is modeled for how agents will actually query it. Two markers answer it quickly.

Data quality runs as events, not on a schedule. Event-based checks through Snowflake Data Metric Functions catch an issue the moment it appears, where scheduled checks drift out of step with the data and let bad inputs reach the agent before anyone notices.
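As a rough sketch of what event-based checking looks like in practice, the helper below builds the two statements that would attach a Snowflake system Data Metric Function to a table and set it to run when the data changes. The table and column names are hypothetical, and the statement shapes reflect Snowflake's documented DMF syntax as understood at the time of writing; confirm them against the current Snowflake documentation before use.

```python
from typing import List


def dmf_statements(table: str, column: str) -> List[str]:
    """Build SQL that runs a null-count check on a column whenever the table changes.

    Assumption: `table` and `column` are trusted identifiers supplied by the
    data engineer, not user input.
    """
    return [
        # Run attached metric functions on data changes rather than on a schedule.
        f"ALTER TABLE {table} SET DATA_METRIC_SCHEDULE = 'TRIGGER_ON_CHANGES'",
        # Attach the built-in null-count metric to the column.
        f"ALTER TABLE {table} ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.NULL_COUNT ON ({column})",
    ]


# Hypothetical table and column, for illustration only.
for statement in dmf_statements("analytics.sales_daily", "region"):
    print(statement)
```

The point of the trigger setting is the one the paragraph makes: the check fires with the change itself, so a bad load is flagged before an agent queries it.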

The semantic model is clean, because its quality is not cosmetic. Gaps in the semantic layer do not stay there. They surface downstream as latency in the data agent and as cost on the observability dashboard. Resolving them in the model is cheaper than paying for them on every call.

Read this way, the choice between an older approach and a newer one is made on the axis that matters: which is modeled so the data costs the least to query, at the quality the work requires.

07 Economic

The discipline underneath: frugal AI

None of this is a new category. It is an existing efficiency discipline applied to the data and model layer.

The contextual frame

Frugal AI, achieving high impact with minimal resources, was named in the Allen Institute's 2019 Green AI work and advanced through French national AI policy and Cambridge Judge Business School's Frugal AI Hub, which evaluates AI systems on return on investment and total cost of ownership rather than raw capability. The overkill reflex it warns against is not only running a frontier model on a simple question. It is running any agent on data it should not have to interpret at runtime. Modeling that interpretation away is the frugal move.

Read against this discipline, the methodology is one continuous idea: spend tokens where they change the decision, and engineer the data layer so the rest is close to free.

08 Proof

What we have built

The pattern is not theoretical. Hakkoda builds the data foundation: governed semantic models on Snowflake, grounded text-to-SQL, and the data engineering that makes agentic systems reliable. IBM builds the orchestration and routing layer on watsonx, with the governance to keep production systems accountable.

Hakkoda-delivered

Mitsui USA · Hakkoda

Clinical trial pre-screening on Snowflake and Azure, natural-language and large language model reasoning over a defined clinical domain. 2x faster pre-screening, 4x faster build than industry standard.

Edwards Lifesciences · Hakkoda

Snowflake and dbt stack ingesting clinical device telemetry and Fast Healthcare Interoperability Resources (FHIR) data on a reusable model. 97 percent data-efficiency gains, pipelines cut from hours to five minutes.

Keck Medicine of USC · Hakkoda

Governed machine-learning operations and model governance on a clean data foundation. 70 percent reduction in the manual review behind an informed patient-transfer decision.

Under Armour · Hakkoda

Cortex Analyst and Snowflake Intelligence over product data, grounded text-to-SQL on a defined domain. The source reports thousands of hours saved annually; treat this figure as directional.

Medtronic · Hakkoda

Supplier intelligence on Snowflake Cortex Analyst with three connected semantic models, grounded reasoning over governed definitions. Structural proof, no hard metric cited.

IBM-delivered

Global life sciences leader · IBM Consulting

Conversational generative AI assistant over governed product and support data, client blinded. 90 percent reduction in cost per query, 4x conversion lift, 60 to 70 percent of product inquiries resolved automatically.

UFC Insights Engine · IBM Build Engineering

watsonx Orchestrate fans out to parallel specialist retrieval feeding a grounded insights agent. IBM-estimated gains of about 3x insight volume and roughly 40 percent less query-generation time. Illustrative only, actual results will vary.

IBM Consulting Advantage · IBM

watsonx Orchestrate with cost-aware routing across Granite, Llama, Haiku, and frontier models. The shallowest-sufficient-model pattern in production. Capability proof.

Fortune 500 pharmaceutical · IBM Consulting

Intelligent supply-chain platform with Watson as the intelligence engine and a generative AI plugin that turns unstructured risk questions into backend queries, client blinded. Projected 55 percent reduction in decision-cycle time and $8 to 10 million in avoided disruption. Projected, not realized.

Bottom line

The data modeling layer is the first decision and the most consequential one. A semantic model built to minimize token cost, paired with a skills library, a tested prompt repository, and a clean protocol layer, is what makes any agent architecture economical. The choices above it are tools on that foundation, not substitutes for it. So the question to carry into a build is simple: is the data modeled so each output costs the fewest tokens, and are the skills and protocols in place to remove interpretation overhead at runtime? When the answer is yes, the rest of the architecture gets cheaper to settle, and the choice between an older approach and a newer one is made on performance and cost at once.

Draft for internal review

First pass for review. Confirm all client-specific framing and every proof point before this goes to the client. Client metrics labeled directional, estimated, projected, or structural are not confirmed realized results, and IBM-estimated figures carry illustrative-only qualification. Frugal AI sources: Cambridge Judge Business School Frugal AI Hub (frugalai.org); Allen Institute, Green AI, 2019; French national AI policy. Frugal AI is an existing efficiency discipline, not a new category.