Cost Engineering
Anthropic
The Frugal Approach to Anthropic Claude API Costs
Craig Conboy

With usage-billed AI services like Anthropic Claude, cost optimization starts with your code. Applications don't just generate API calls; they spend money with every one. Every model selection, every prompt, and every generated token contributes to your bill.

Taking an application- and code-centric approach to cost reduction means understanding what your code does before and after each API call. Here's a practical walk-through of what to look for and how to optimize Claude API costs at the code level.

Cost Trap | Efficiency Pattern
Using expensive models for simple tasks | Route tasks to cheapest capable model (Haiku 3 / Haiku 4.5 / Sonnet 4.5)
Not monitoring thinking tokens (extended thinking mode) | Monitor thinking token usage when using extended thinking
Verbose natural language prompts | Compress prompts to minimize input tokens
Sending full conversation history | Truncate or summarize older messages
No preprocessing of input | Filter irrelevant content before sending to API
Not using prompt caching | Structure prompts to leverage caching (90% discount)
No deduplication or application caching | Cache responses for identical/similar requests
No output token limits | Set explicit max_tokens for every request
Requesting unstructured text responses | Use tool use for structured outputs
Processing batch workloads synchronously | Use Message Batches API for 50% discount
Including excessive examples in prompts | Test minimum effective example count

Attribute the Costs

Start with your bill. Break down costs by endpoint, feature, and request type to understand where spend concentrates. Your bill reveals whether you're burning budget on input tokens (large prompts, excessive conversation history), thinking tokens (extended thinking mode on Sonnet or Opus), or output tokens (unbounded generation, verbose responses). It also shows if you're using expensive models (Opus, Sonnet) for tasks that cheaper models (Haiku) could handle.

This attribution guides which of the 11 efficiency patterns below deliver the highest impact. If 70% of costs come from output tokens, focus on patterns #8-10. If you're using Opus everywhere, pattern #1 is your priority. If extended thinking costs dominate, pattern #2 matters most.
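If your code doesn't already tag API calls by feature, a thin wrapper around messages.create can do the attribution for you. The sketch below is illustrative: the tracked_call helper and the feature labels are hypothetical, and the per-million-token prices are the figures quoted in this article. It reads actual token counts from the usage field of each response.

import anthropic
from collections import defaultdict

client = anthropic.Anthropic()

# Illustrative prices per 1M tokens (input, output), taken from the figures in this article
PRICES = {
    "claude-3-haiku-20240307": (0.25, 1.25),
    "claude-haiku-4-5": (1.00, 5.00),
    "claude-sonnet-4-5": (3.00, 15.00),
    "claude-opus-4-1": (15.00, 75.00),
}

spend_by_feature = defaultdict(float)  # USD attributed to each feature label

def tracked_call(feature, **kwargs):
    """Call the Messages API and attribute the request's cost to a feature label."""
    response = client.messages.create(**kwargs)
    input_price, output_price = PRICES[kwargs["model"]]
    usage = response.usage  # token counts reported by the API
    cost = (usage.input_tokens * input_price + usage.output_tokens * output_price) / 1_000_000
    spend_by_feature[feature] += cost
    return response

# Example: attribute spend to the feature that triggered the call
# tracked_call("support_chat", model="claude-sonnet-4-5", max_tokens=200,
#              messages=[{"role": "user", "content": "How do I reset my password?"}])

Cached and batched requests bill at discounted rates, so treat a wrapper like this as a first-pass attribution rather than an exact invoice.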

1. Route tasks to cheapest capable model

The fastest way to reduce costs: use Claude 4.5 Sonnet as your default, not Claude 4.1 Opus. Claude 4.5 Sonnet is Anthropic's current recommended model for most use cases at $3/$15, maintaining the same price point as Claude 3.5 Sonnet.

Claude 4.1 Opus is a specialty tool for extreme reasoning tasks and costs 5x more ($15 input / $75 output vs $3 / $15). Both models support extended thinking mode, which generates thinking tokens billed at output rates ($15 per 1M for Sonnet 4.5, $75 per 1M for Opus 4.1).

Tier your tasks by actual requirements:

def select_claude_model(task_type):
    """Route tasks to the cheapest model that can handle them."""

    # Simple: classification, extraction, formatting, simple Q&A
    if task_type in ['classify', 'extract', 'format', 'simple_qa']:
        return "claude-3-haiku-20240307"  # $0.25 per 1M input tokens
        # Or claude-3-5-haiku for slightly better quality at $0.80

    # Medium: summarization, translation, simple content generation
    elif task_type in ['summarize', 'translate', 'simple_write']:
        return "claude-haiku-4-5"  # $1 per 1M input, $5 output

    # Most tasks: coding, analysis, reasoning, creative writing, vision tasks
    # This is your default - Claude 4.5 Sonnet handles 95%+ of use cases
    elif task_type in ['code', 'analyze', 'reason', 'write', 'vision', 'complex_qa']:
        return "claude-sonnet-4-5"  # $3 per 1M input, $15 output

    # Rare edge cases only: extremely complex multi-step reasoning
    # Note: Test Sonnet 4.5 first - it handles nearly everything
    # Opus costs 5x more, and its thinking tokens bill at the higher output rate ($75 vs $15 per 1M)
    elif task_type in ['extreme_reasoning', 'multi_step_research']:
        return "claude-opus-4-1"  # $15 per 1M input, $75 output
        # Extended thinking tokens bill at the output rate: $75 per 1M for Opus vs $15 for Sonnet

    else:
        # Default to Sonnet 4.5 for unknown tasks (not Opus!)
        return "claude-sonnet-4-5"

Benchmarks show Claude 4.5 Sonnet performs well on coding and reasoning tasks at one-fifth the cost of Claude 4.1 Opus.

Implement tiered retry logic for structured tasks:

For tasks with deterministic validation (classification, extraction, formatting), start cheap and escalate only when validation fails:

import json
import anthropic

client = anthropic.Anthropic()

def classify_sentiment_with_fallback(text):
    """Sentiment classification: start with Haiku, escalate if needed."""

    models = [
        "claude-3-haiku-20240307",  # $0.25 per 1M input
        "claude-haiku-4-5",          # $1.00 per 1M input
        "claude-sonnet-4-5"          # $3.00 per 1M input
    ]

    prompt = f"Classify sentiment as positive/negative/neutral. Return only valid JSON.\nText: {text}"

    for model in models:
        response = client.messages.create(
            model=model,
            max_tokens=50,
            messages=[{"role": "user", "content": prompt}]
        )

        # Deterministic validation: check if response is valid JSON with required field
        try:
            result = json.loads(response.content[0].text)
            if 'sentiment' in result and result['sentiment'] in ['positive', 'negative', 'neutral']:
                return result, model  # Success - don't try more expensive model
        except (json.JSONDecodeError, KeyError):
            continue  # Failed validation - try next tier

    raise Exception("All models failed to produce valid output")

# Result: Haiku 3 handles 85% of requests at $0.25/1M
# Haiku 4.5 handles 10% at $1/1M
# Sonnet 4.5 handles the 5% of edge cases at $3/1M
# Blended input cost: ~$0.46 per 1M (before retry overhead) vs. $3/1M if always using Sonnet
# Savings: ~85%

This pattern works for any task with objective quality criteria: code that must compile, SQL that must parse, structured data with required fields, regex patterns, etc.
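For instance, a deterministic validator for "code that must compile" can be as simple as attempting to compile the generated Python before accepting the cheap model's answer. This is a sketch; generated_code is a placeholder for the model's output.

def python_compiles(generated_code):
    """Deterministic check: does the generated Python at least parse/compile?"""
    try:
        compile(generated_code, "<generated>", "exec")
        return True
    except SyntaxError:
        return False

# Plug this in as the validation step of the tiered retry loop above:
# accept the cheapest model whose output passes, otherwise escalate.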

2. Monitor thinking token usage

When using extended thinking mode on Claude 4.5 Sonnet or Claude 4.1 Opus, the model generates invisible reasoning tokens (thinking tokens) before producing its response. These tokens are billed at output token rates: $15 per 1M for Sonnet 4.5, $75 per 1M for Opus 4.1.

Extended thinking improves performance on complex reasoning tasks but multiplies costs. A request that returns 500 visible output tokens might also generate 5,000 thinking tokens, ten times the visible output. Thinking tokens are counted in the output_tokens figure in the API response's usage field, so track thinking-enabled requests separately and compare their billed output against the visible answer. Reserve extended thinking for tasks that require it, and use standard mode for routine operations.
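As a sketch of what that looks like in code (assuming the extended thinking request shape with a thinking token budget; complex_task_prompt is a placeholder), enable thinking only where it's needed and log the billed overhead:

# Hypothetical placeholder for a task that genuinely needs extended thinking
complex_task_prompt = "Plan a phased migration strategy for ..."

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=4000,                                      # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 2000},  # cap on reasoning tokens
    messages=[{"role": "user", "content": complex_task_prompt}],
)

# Thinking tokens are billed at the output rate and counted in output_tokens,
# so compare total billed output against the visible answer to see the overhead.
visible_text = "".join(
    block.text for block in response.content if block.type == "text"
)
print("billed output tokens (includes thinking):", response.usage.output_tokens)
print("visible answer length (characters):", len(visible_text))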

3. Compress prompts to minimize input tokens

Every input token costs money. The goal: send exactly what's needed, nothing more.

Compress your prompts. Every unnecessary word is wasted spend:

# Verbose: 27 tokens
user_message = "Please carefully analyze the sentiment expressed in the following text and provide a classification of whether it is positive, negative, or neutral:"

# Compressed: 6 tokens
user_message = "Sentiment classification (positive/negative/neutral):"

# Even more efficient: 3 tokens + structured output
user_message = "Sentiment:"

At 1M API calls, that 24-token difference is 24M tokens = $72 in Sonnet 4.5 input costs (or $6 in Haiku 3 costs).

4. Truncate older conversation messages

Conversational applications accumulate context over time. Sending the entire conversation history on every request multiplies input costs with each turn. Most tasks don't need full history—recent context is usually sufficient. Implement strategic truncation to keep only relevant messages within a token budget.

def count_tokens(text):
    """Rough estimate (~4 characters per token); swap in a real tokenizer if you have one."""
    return len(text) // 4

def prepare_messages(conversation_history, new_message, max_tokens=8000):
    """Keep only relevant recent context within a token budget."""

    messages = []
    token_count = 0

    # Always include the system message if present
    # (for the Claude API, pass this via the separate `system` parameter)
    system_msg = None
    if conversation_history and conversation_history[0]['role'] == 'system':
        system_msg = conversation_history[0]
        messages.append(system_msg)
        token_count += count_tokens(system_msg['content'])

    # Add recent messages newest-first until the budget is exhausted;
    # inserting just after the system message preserves chronological order
    for msg in reversed(conversation_history[-20:]):  # Last 20 turns
        if msg['role'] == 'system':
            continue

        msg_tokens = count_tokens(msg['content'])

        if token_count + msg_tokens <= max_tokens:
            messages.insert(1 if system_msg else 0, msg)
            token_count += msg_tokens
        else:
            break

    # Add the new message
    messages.append(new_message)

    return messages

A 30-turn conversation without management: 50K tokens. With truncation: 8K tokens. That's 84% input cost reduction.

5. Filter irrelevant content before sending to API

Raw inputs often contain bloat: HTML boilerplate, navigation elements, repeated footers, excessive whitespace. This noise inflates token counts without adding value. Preprocess inputs to extract signal and discard irrelevant content before sending to the API.

def clean_document(text):
    """Extract signal, discard noise."""
    from bs4 import BeautifulSoup
    import re

    # If HTML, extract text
    if '<' in text and '>' in text:
        soup = BeautifulSoup(text, 'html.parser')

        # Remove non-content elements
        for tag in soup(['script', 'style', 'nav', 'footer', 'aside', 'header']):
            tag.decompose()

        text = soup.get_text(separator=' ')

    # Remove repeated boilerplate (before whitespace is collapsed, so the match stays line-bound)
    text = re.sub(r'Copyright \d{4}[^\n]*', '', text)

    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

Preprocessing can reduce "web page as text" from 20K tokens to 3K tokens—an 85% reduction.

6. Structure prompts for prompt caching

Prompt caching can reduce input costs by 90% for repeated content.

How it works:

  • Structure your prompt with a static prefix (system instructions, examples, context)
  • Minimum cacheable size: 1024 tokens
  • Cache write cost: 25% premium over normal input price on the first (uncached) call
  • Cache read cost: ~10% of normal input cost (90% discount)
  • Cache TTL: 5 minutes, refreshed each time the cached prefix is reused

Basic implementation:

For high-volume applications with large static system prompts, cache the prompt once and pay 90% less on subsequent calls within the 5-minute cache window.

import anthropic

client = anthropic.Anthropic()

# System instructions (5000 tokens) - this gets cached
system_prompt = """
You are a specialized customer service assistant with the following guidelines:
[... 5000 tokens of detailed instructions, examples, policies ...]
"""

def call_with_caching(user_question):
    message = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=500,
        system=[
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"}  # Mark for caching
            }
        ],
        messages=[
            {"role": "user", "content": user_question}
        ]
    )
    return message

# First call: writes the 5000-token system prompt to the cache at a 25% premium (~$0.019)
# Subsequent calls (within 5 min): read it at 10% of the normal input price (~$0.0015)
# Savings: ~90% on input costs for the system prompt portion after the first call

Advanced: Multi-level caching with context:

For Q&A applications where users ask multiple questions about the same document, cache both system instructions and document content in separate layers to maximize cache hit rates.

def call_with_layered_cache(user_question, document_context):
    """Cache both system prompt and document context."""

    message = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1000,
        system=[
            {
                "type": "text",
                "text": system_instructions,  # 3000 tokens
                "cache_control": {"type": "ephemeral"}
            },
            {
                "type": "text",
                "text": f"Document context:\n{document_context}",  # 10000 tokens
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[
            {"role": "user", "content": user_question}  # 50 tokens
        ]
    )
    return message

# First call: writes 13,000 tokens to the cache at a 25% premium (~$0.049 input)
# Subsequent calls on the same document: 90% discount on the 13,000 cached tokens
# Effective input cost: ~$0.004 per call instead of ~$0.039 uncached
# Cost reduction: roughly 90% on input for every call after the first

For applications with large context (documentation, long articles, code repositories), prompt caching is the difference between affordable and unsustainable.

When to use caching:

Cache static, reusable content ≥1024 tokens: system instructions, few-shot examples, document context, code repositories, or style guides. Skip caching for highly variable user content, frequently changing context, or prompts under 1024 tokens.
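One way to encode those rules is a small helper that only attaches cache_control when a block is plausibly large enough to cache. This is a sketch under assumptions: the system_block helper, the rough 4-characters-per-token estimate, and the example variables (style_guide_text, todays_date_banner) are all illustrative.

def system_block(text, min_cache_tokens=1024):
    """Build a system content block, marking it for caching only when it is
    large enough to clear the minimum cacheable size (rough token estimate)."""
    block = {"type": "text", "text": text}
    if len(text) // 4 >= min_cache_tokens:  # ~4 characters per token
        block["cache_control"] = {"type": "ephemeral"}
    return block

# Large, static instructions get cached; a short, frequently changing preamble does not
system = [
    system_block(style_guide_text),
    system_block(todays_date_banner),
]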

7. Cache application responses for identical requests

Cache full API responses by request signature (messages + model + max_tokens) to avoid redundant API calls—30-50% cache hit rates on FAQs or repeated classifications translate directly to 30-50% cost reduction.

import hashlib
import json

cache = {}  # Use Redis/Memcached in production

def cached_claude_call(messages, model, max_tokens):
    """Cache by request signature."""

    cache_key = hashlib.sha256(
        json.dumps({
            'messages': messages,
            'model': model,
            'max_tokens': max_tokens
        }, sort_keys=True).encode()
    ).hexdigest()

    if cache_key in cache:
        return cache[cache_key]

    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        messages=messages
    )

    # Store the response (set a TTL, e.g., 1 hour, when using Redis/Memcached)
    cache[cache_key] = response

    return response

For applications with repeated queries (FAQs, common classifications, identical document analyses), 30-50% cache hit rates mean 30-50% cost reduction.

8. Set explicit max_tokens for every request

Output tokens are expensive (5x the cost of input tokens). The Messages API already requires max_tokens on every request; the cost trap is setting it reflexively high instead of sizing it to the task:

message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=200,  # Tight limit sized to the expected response
    messages=[{"role": "user", "content": prompt}]
)

# vs

message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=4096,  # Reflexive high cap - effectively unbounded, invites verbose output
    messages=[{"role": "user", "content": prompt}]
)

With output tokens at 5x input cost, an unconstrained 2000-token response vs a constrained 200-token response is a 10x cost difference.

9. Use tool use for structured outputs

Tool use forces structured, machine-readable output instead of verbose natural language, reducing output tokens by 50-80%.

tools = [
    {
        "name": "get_customer_info",
        "description": "Retrieve customer information",
        "input_schema": {
            "type": "object",
            "properties": {
                "customer_id": {
                    "type": "string",
                    "description": "Customer ID"
                },
                "fields": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Fields to retrieve"
                }
            },
            "required": ["customer_id"]
        }
    }
]

message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1000,
    tools=tools,
    tool_choice={"type": "tool", "name": "get_customer_info"},  # Force the structured tool call
    messages=[
        {"role": "user", "content": "Get email and phone for customer C12345"}
    ]
)

# Response is structured tool use, not verbose natural language
# Minimal output tokens, no fluff

Tool use produces concise, parseable output—typically 50-80% fewer tokens than free-form text responses.

10. Use Message Batches API for 50% discount

For work that can wait up to 24 hours, use the Message Batches API for a 50% discount:

import time

# Create batch of requests
requests = [
    {
        "custom_id": f"request-{i}",
        "params": {
            "model": "claude-3-5-sonnet-20241022",
            "max_tokens": 1000,
            "messages": [
                {"role": "user", "content": task}
            ]
        }
    }
    for i, task in enumerate(bulk_tasks)
]

# Submit batch
batch = client.messages.batches.create(requests=requests)

# Poll for completion
while batch.processing_status != "ended":
    time.sleep(60)
    batch = client.messages.batches.retrieve(batch.id)

# Retrieve results
results = client.messages.batches.results(batch.id)

# Same output quality, 50% cost, 24-hour window

Perfect for: content generation, data analysis, bulk classification, document processing—anything asynchronous.

11. Test minimum effective example count

Test example counts systematically (0, 1, 3, 5, 10, 20) to find the minimum needed for target accuracy—typically 3 examples achieve 92% accuracy while 10 examples cost 3x more for just 2% improvement.

def test_example_count(task, example_pool, test_set):
    """Find the minimum number of examples needed for target accuracy.
    evaluate_on_test_set and count_tokens are application-provided helpers."""

    results = {}
    prev_accuracy = None

    for n_examples in [0, 1, 3, 5, 10, 20]:
        examples = example_pool[:n_examples]

        # Format as conversation history
        example_messages = []
        for ex in examples:
            example_messages.extend([
                {"role": "user", "content": ex['input']},
                {"role": "assistant", "content": ex['output']}
            ])

        accuracy = evaluate_on_test_set(task, example_messages, test_set)
        input_tokens_per_call = sum(count_tokens(msg['content']) for msg in example_messages)

        results[n_examples] = {
            'accuracy': accuracy,
            'tokens': input_tokens_per_call
        }

        # Stop at diminishing returns (compare against the previous tier tested)
        if prev_accuracy is not None and n_examples > 3 and accuracy - prev_accuracy < 0.01:
            break
        prev_accuracy = accuracy

    return results

# Typical finding: 3 examples = 92% accuracy, 10 examples = 94% accuracy
# Those extra 7 examples cost 3x more for 2% gain

Closing Thoughts

Careful model selection is table stakes—use Claude 4.5 Sonnet ($3/$15) as your default, reserve Claude 4.1 Opus ($15/$75) for rare edge cases, and route simple tasks to Haiku models. This baseline discipline prevents the worst cost traps.

Getting to the next level of savings requires two steps. First, let observed costs guide what needs optimization. Attribute your bill to specific application behaviors—which endpoints, features, and request types drive spend. This reveals which of the 11 efficiency patterns above matter most for your application. Second, apply the optimizations that address your cost concentrations. These patterns reduce tokens without sacrificing functionality: prompt caching cuts input costs by 90%, conversation truncation eliminates redundant context, output limits prevent unbounded generation, and tool use replaces verbose natural language with structured responses.

You don't lose capability. You gain precision. Cost-effective AI is an engineering discipline—treat each API call as a budget allocation and optimize based on measured impact.

Looking for help with cost optimizations like these? Sign up for Early Access to Frugal. Frugal attributes Anthropic Claude API costs to your code, finds inefficient usage, provides Frugal Fixes that reduce your bill, and helps keep you out of cost traps for new and changing code.
