Most dev teams aren't sure which code is driving their AI API spend. That's not a criticism; it's just the reality of how token-based billing works. You ship a feature, it calls a model, the model charges by the token, and somewhere downstream a bill arrives that's larger than expected. By then, the code has moved on and so has your team.
Traditional infrastructure monitoring wasn't built for this. It tracks CPU, memory, I/O. Useful stuff, but it can't tell you which function is sending a bloated context window to GPT-4 on every request when a smaller model would do fine. The result is what the industry has taken to calling "bill shock," which is a polite way of saying you had no idea this was happening.
The fix isn't complicated in principle. Treat AI cost like any other application concern: trace each call, attribute the spend to the code path that triggered it, and figure out what to actually change. That's the core idea behind Application Cost Engineering, which is just a fancy name for connecting engineering decisions to dollars.
A common thread among teams that get this under control: they use OpenTelemetry to capture per-request behavior and layer cost attribution on top. OTel is vendor-neutral, well-supported, and lets you build spans that align with real application events. That's your foundation.
So, what tools actually help? Here's an honest look at the landscape.
Before getting into the list, here are a few things worth having in any tool you're evaluating:
Code-level attribution is the big one. "Your AI spend went up 40% this month" is not actionable. "This function in checkout-service is resending the full conversation history on every call" is. You want the latter.
Real-time, per-call costing matters because token pricing varies by model and provider. Monthly rollups hide the spikes. You want to see cost at request time.
OTel-native integration keeps your instrumentation overhead manageable and connects cost spans to traces you already have.
And finally: recommendations that go somewhere. Visibility is necessary, but a dashboard that tells you there's a problem and then stops is just expensive homework.
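To make the per-call costing criterion concrete, here is a minimal sketch of computing cost at request time from the token counts a provider returns. The model names and prices are hypothetical placeholders, not real rates, and `Usage` and `call_cost_usd` are illustrative names rather than any vendor's API.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1M tokens. Replace with your providers' current rates.
PRICES_PER_MILLION = {
    "large-model": {"input": 10.00, "output": 30.00},
    "small-model": {"input": 0.50, "output": 1.50},
}


@dataclass(frozen=True)
class Usage:
    """Token counts for a single API call, as reported by the provider."""
    model: str
    input_tokens: int
    output_tokens: int


def call_cost_usd(usage: Usage) -> float:
    """Return the cost of one call, pricing input and output tokens separately."""
    price = PRICES_PER_MILLION[usage.model]
    total = (
        usage.input_tokens * price["input"]
        + usage.output_tokens * price["output"]
    )
    return total / 1_000_000


# The same request priced against two models shows why per-call visibility matters.
large = call_cost_usd(Usage("large-model", input_tokens=8000, output_tokens=500))
small = call_cost_usd(Usage("small-model", input_tokens=8000, output_tokens=500))
```

Because the price table lives in your code, it has to be updated whenever a provider changes its rates, which is part of the maintenance burden of rolling this yourself.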
Tools in this category are built around the idea that cloud waste stems from application code, not just infrastructure. They trace AI API calls and map spend back to the specific function, file, and commit that triggered it. Cost is computed per call across OpenAI, Anthropic, and other providers, so you're not waiting for a monthly summary to find out something went sideways.
The piece that's genuinely different here is what happens after detection. A good application cost engineering platform doesn't just surface a problem. It proposes concrete changes: trim redundant context, add prompt caching, pick a smaller model where quality holds, batch calls that don't need to be individual. These come as ready-to-review PRs, which means the gap between "we found a problem" and "we fixed a problem" is a lot smaller.
Frugal is built specifically for this. It connects cloud billing data, observability data, and source code to trace AI API spend down to the function and commit responsible, then generates Frugal Fixes: ready-to-merge PRs with quantified savings attached.
Worth considering if you're getting unpredictable AI invoices, need to connect spend to owners in the codebase, or want to close the loop between finding waste and shipping the fix.
These tools unify cloud and AI spend so finance and engineering are looking at the same numbers. CloudZero, for example, is strong at mapping cost to business units (cost per customer, cost per feature) using tag-based allocation even when tagging is inconsistent. Well-suited for chargebacks and showbacks at an organizational level.
The distinction that matters here: CloudZero maps cost to business unit. Frugal maps cost to source code. Both are useful, but they answer different questions. "Feature A costs $50k a month" is a financial fact. "This function in feature A is resending the full conversation history on every GPT-4 call" is an engineering fix. You need the second one to actually reduce the bill.
Best for teams where the primary need is consolidated reporting across cloud, SaaS, and AI usage. You'll probably still need something code-level to close the loop on actual fixes.
FinOps platforms are built for budgeting, forecasting, and bill reconciliation. They've expanded to cover AI APIs alongside infrastructure, and they're genuinely good at aligning finance and engineering on what's being spent and where it's trending.
The limitation is that their strength is governance and dashboarding, not code-path attribution. They'll tell you a service is expensive. They're less equipped to tell you why, at the code level, or to help you fix it. The typical output is a Jira ticket assigned to an engineering team. The typical result is that ticket sitting in the backlog while the bill keeps climbing.
Best for finance-led teams that need trustworthy rollups and budget guardrails. Pair with something code-native when developers need to actually optimize prompts, context windows, or call patterns.
If your organization is already standardized on Datadog or a similar broad observability stack, you can instrument AI API calls manually: emit spans with token counts, latency, approximate cost. It works, and it leverages dashboards and alerting infrastructure you've already built.
Datadog can tell you that your AI spend correlated with a traffic spike. What it can't do natively is tell you that the spike was caused by a specific function resending full conversation context on every request, or generate a fix for it. Correlation isn't causation, and visibility isn't remediation.
The honest downside is maintenance. You're responsible for defining the attributes, keeping token counting accurate as provider pricing changes, and building whatever analysis layer sits on top. There are no built-in AI cost heuristics. Every optimization insight is something your team has to derive and act on.
Best for teams deeply invested in one observability platform and willing to do the build-and-maintain work. Not unreasonable if you have the bandwidth. Just go in clear-eyed about what it takes.
LangSmith, Langfuse, and Helicone are what you reach for when you're debugging prompt behavior, evaluating chain quality, or tracking how a RAG pipeline is performing. They give you detailed visibility into prompts, tool calls, tokens, and completion behavior. For understanding what your LLM application is doing, they're genuinely good.
The distinction is that they observe rather than optimize. LangSmith surfaces traces. Helicone logs requests and responses with token counts. Langfuse gives you evaluation and experiment tracking. None of them is designed to map that telemetry to a code path, attribute the cost to a specific function, or generate a fix. That's fine, because it isn't what they're built for.
If your objective is reducing dollars per request at the function level, you'll need an additional layer on top of whatever LLM observability tool you're running.
Infracost and tools like it scan your Terraform or infrastructure definitions to flag cost implications before deployment. They're useful for catching expensive provisioning decisions early (choosing the wrong instance type, forgetting a reserved capacity discount) and they integrate neatly into CI/CD pipelines.
The catch: Infracost operates at the provisioning layer. It sees what you're buying: the instance type, the region, the service tier. It has no visibility into runtime behavior. It can't tell you that the code running on that correctly sized instance is calling GPT-4o when Claude Haiku would do the job at a fraction of the cost. The provisioning decision and the consumption decision are different problems.
Best for platform teams managing infrastructure configuration at scale. Add a code-level profiler when the goal shifts from "right-size the infra" to "right-size the application behavior inside it."

| Tool type | Code-level attribution | Real-time per-call costing | OTel-native | Automated fix PRs |
|---|---|---|---|---|
| Application cost engineering (Frugal) | Yes, to function, file, commit | Yes, token-aware per call | Yes | Yes |
| Cloud cost intelligence (CloudZero) | Partial, to team or feature | Partial | Varies | No |
| FinOps platforms | Limited, service or tag level | Partial | Varies | No |
| Observability + custom (Datadog) | Possible with significant build | Possible if maintained | Yes | No |
| LLM observability (LangSmith, Helicone) | Limited, prompt/session focus | Token metrics, not $ by function | Often | No |
| Infra-as-code scanners (Infracost) | Provisioning only, not runtime | N/A for runtime API calls | Often | No |

If you're getting bill shock and need to pinpoint expensive lines of code now, start with an application cost engineering platform. That's the problem it's built for, and it's the only category that closes the loop from detection to fix.
If you need enterprise-wide chargebacks and unified reporting across cloud, SaaS, and AI, a FinOps or cloud intelligence platform like CloudZero is the right anchor. Complement it with code-level tooling when you're ready to go deeper than "Feature A is expensive."
If your focus is prompt quality, chain debugging, or RAG evaluation, LangSmith, Langfuse, or Helicone are the right fit. Add a cost engineering layer when reducing dollars per request becomes the goal alongside quality.
If you're deeply invested in Datadog or a similar observability stack and want to roll your own AI cost instrumentation, that's a valid path. Just budget for the ongoing maintenance it requires.
Instrument once with OTel and capture AI-specific attributes: prompt size, tokens in and out, model, temperature, cache hit or miss. Use attributes that match how your teams actually think about features and user journeys.
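As a rough illustration of that attribute set, the sketch below builds a flat dictionary for one model call. The `gen_ai.*` keys are modeled on OpenTelemetry's generative AI semantic conventions, which are still evolving, so check them against the current spec. The `app.*` keys and the function name `ai_span_attributes` are assumptions made up for this example.

```python
def ai_span_attributes(
    model: str,
    input_tokens: int,
    output_tokens: int,
    temperature: float,
    prompt_chars: int,
    cache_hit: bool,
    feature: str,
) -> dict:
    """Build a flat attribute dict describing one model call.

    Values are limited to str, bool, int, and float so they are valid
    OpenTelemetry attribute values.
    """
    return {
        # Keys modeled on the OTel GenAI semantic conventions (verify before use).
        "gen_ai.request.model": model,
        "gen_ai.request.temperature": temperature,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Hypothetical application-specific keys.
        "app.ai.prompt_chars": prompt_chars,
        "app.ai.cache_hit": cache_hit,
        "app.feature": feature,
    }


attrs = ai_span_attributes(
    model="small-model",
    input_tokens=1200,
    output_tokens=150,
    temperature=0.2,
    prompt_chars=5400,
    cache_hit=False,
    feature="checkout-assistant",
)
```

With the OpenTelemetry Python API, a dictionary like this could be attached to the active span with `span.set_attributes(attrs)`. Keeping the builder as a plain function makes the attribute names easy to test and change in one place.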
Link spans to functions, files, commits, and services so problems land with the right owners. This is what cuts down the time between "we found something" and "we fixed it."
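One lightweight way to approximate that linkage, sketched here with only the standard library, is to record the calling function and file at the point where the model call is made. The attribute names and the `GIT_COMMIT` environment variable are assumptions for illustration; your build system may expose the commit differently.

```python
import inspect
import os


def caller_attributes(depth: int = 1) -> dict:
    """Describe the code location that invoked this helper.

    depth=1 means the immediate caller. inspect.stack() has some overhead,
    so in a hot path you may want to sample or cache the result.
    """
    frame = inspect.stack()[depth]
    return {
        "code.function": frame.function,
        "code.filepath": frame.filename,
        "code.lineno": frame.lineno,
        # Hypothetical: assumes CI or the deploy step exports the commit SHA.
        "vcs.commit": os.environ.get("GIT_COMMIT", "unknown"),
    }


def summarize_ticket(text: str) -> dict:
    """Hypothetical feature function that would call a model."""
    location = caller_attributes()
    # ... the model call would go here, with `location` attached to its span ...
    return location


location = summarize_ticket("Customer cannot log in")
```

The returned dictionary can be merged into the span attributes for the call, so cost rolls up by function and commit rather than only by service.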
Look for the common patterns: repeated context resends, unnecessary use of high-end models, unbatched calls, missing cache layers. These show up constantly and can usually be resolved without any quality loss.
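For the first of those patterns, a simple heuristic is to measure how much of each prompt was already sent on the previous call in the same conversation. This is a sketch of one possible check, not how any particular tool detects it; `resend_ratio` is a hypothetical helper and it uses character counts as a stand-in for tokens.

```python
from typing import List


def resend_ratio(previous_messages: List[str], current_messages: List[str]) -> float:
    """Fraction of the current prompt (by characters) that repeats the previous call.

    A ratio close to 1.0 across many calls suggests the full history is being
    resent each time and may be a candidate for trimming or summarizing.
    """
    total = sum(len(message) for message in current_messages)
    if total == 0:
        return 0.0
    seen = set(previous_messages)
    repeated = sum(len(message) for message in current_messages if message in seen)
    return repeated / total


previous = ["aaaa", "bbbb"]
current = ["aaaa", "bbbb", "cc"]
ratio = resend_ratio(previous, current)
```

Here 8 of the 10 characters in the current prompt were already sent, so the ratio is 0.8. Whether that is wasteful depends on the feature: some tasks need the full history, which is why a flag like this should prompt a review rather than an automatic change.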
Automate the fix loop where you can. Prioritize by cost impact, ship PRs that right-size model selection, trim context, add caching, or batch requests. The goal is making this part of what teams normally ship, not a separate initiative.
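As one example of the kind of change such a PR might contain, here is a minimal application-level response cache. It is a sketch under assumptions: `call_model` stands in for whatever client function your code uses, and this is an in-process cache for identical requests, not a provider's server-side prompt caching feature.

```python
import hashlib
import json
from typing import Callable, Dict


class ResponseCache:
    """Return a stored response when the same model and prompt are requested again."""

    def __init__(self, call_model: Callable[[str, str], str]) -> None:
        self._call_model = call_model
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def complete(self, model: str, prompt: str) -> str:
        # Hash model and prompt together so different models never share entries.
        key = hashlib.sha256(json.dumps([model, prompt]).encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = self._call_model(model, prompt)
        self._store[key] = response
        return response


calls = []


def fake_model(model: str, prompt: str) -> str:
    """Stand-in for a real API client, used only to demonstrate the cache."""
    calls.append(prompt)
    return prompt.upper()


cache = ResponseCache(fake_model)
first = cache.complete("small-model", "classify: refund request")
second = cache.complete("small-model", "classify: refund request")
```

The second call is served from the cache, so the underlying client is invoked only once. This only makes sense where a repeated answer is acceptable, such as deterministic classification, and a production version would need eviction and expiry.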
Then move beyond monthly totals to the unit economics that actually matter: cost per conversation, per agent action, per RAG query, per document processed.
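Once each call carries a cost and a unit identifier, these unit economics reduce to a group-by. The sketch below assumes each record is a `(unit_id, cost_usd)` pair, for example a conversation ID with the cost of one call; the record shape and function name are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def cost_per_unit(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum per-call costs by unit, such as conversation, agent action, or RAG query."""
    totals: Dict[str, float] = defaultdict(float)
    for unit_id, cost_usd in records:
        totals[unit_id] += cost_usd
    return dict(totals)


records = [
    ("conv-1", 0.02),
    ("conv-2", 0.01),
    ("conv-1", 0.03),
]
totals = cost_per_unit(records)
average = sum(totals.values()) / len(totals)
```

The same function works for any unit you can tag on the span, so switching from cost per conversation to cost per document processed is a matter of changing which identifier you pass in.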
AI-enabled applications don't have the same cost profile as traditional software. The bill is set by runtime code behavior, and waiting for the invoice to discover a problem is an expensive habit. Generic infrastructure dashboards, whether that's a FinOps platform showing you cost-by-service or an observability tool showing you correlated metrics, don't reveal which functions, prompts, and chains are responsible.
Code-level AI cost profiling is how you close that gap: real-time per-call costing mapped to functions and commits, instrumented with open standards, paired with fixes that actually get merged. That's the recipe.
If you want a path from "we think this feature is expensive" to "we've measured it, fixed it, and verified the savings," start with a platform designed for Application Cost Engineering.