With usage-billed AI services like OpenAI, cost optimization starts with your code. Applications don't just generate API calls; they spend money with every one. Every model selection, every prompt, every generated token contributes to your bill.
Taking an application- and code-centric approach to cost reduction means understanding what your code does before and after each API call. Here's a practical walk-through of what to look for and how to optimize OpenAI API costs at the code level.
| Cost Trap | Efficiency Pattern |
|---|---|
| Using expensive models for simple tasks | Route tasks to cheapest capable model (GPT-5-nano/GPT-5-mini/GPT-5) |
| Not monitoring reasoning token usage | Monitor GPT-5 reasoning tokens via usage API (can 5x costs) |
| Verbose natural language prompts | Compress prompts to minimize input tokens |
| Sending full conversation history | Truncate or summarize older messages |
| No preprocessing of input | Filter irrelevant content before sending to API |
| Not structuring prompts for automatic caching | Structure prompts to leverage automatic caching (90% discount on GPT-5) |
| No deduplication or application caching | Cache responses for identical/similar requests |
| No output token limits | Set explicit max_tokens for every request |
| Requesting unstructured text responses | Use JSON mode or function calling for structured outputs |
| Processing batch workloads synchronously | Use Batch API for 50% discount on async work |
| Including excessive few-shot examples | Test minimum effective example count |
Attribute the Costs
Start with your bill. Break down costs by endpoint, feature, and request type to understand where spend concentrates. Your bill reveals whether you're burning budget on input tokens (large prompts, excessive conversation history), reasoning tokens (GPT-5's invisible thinking), or output tokens (unbounded generation, verbose responses). It also shows if you're using expensive models (GPT-5, o4-mini, o3) for tasks that cheaper models (GPT-5-mini, GPT-5-nano) could handle.
This attribution guides which of the 11 efficiency patterns below deliver the highest impact. If 70% of costs come from reasoning tokens on simple tasks, patterns #1 and #2 are your priority. If output tokens dominate, focus on patterns #8-10. If you're using GPT-5 everywhere, pattern #1 matters most.
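There's no single switch in the API for this, so attribution usually means tagging each call in your own code and aggregating usage per tag. Below is a minimal sketch under that assumption; the `feature` labels, `PRICES` table, and in-memory accumulator are illustrative, not an OpenAI feature:

```python
import openai
from collections import defaultdict

client = openai.OpenAI()

# Illustrative per-1M-token prices (input, output); keep in sync with OpenAI's pricing page
PRICES = {"gpt-5": (1.25, 10.00), "gpt-5-mini": (0.25, 2.00), "gpt-5-nano": (0.05, 0.40)}

spend_by_feature = defaultdict(float)

def tracked_completion(feature, model, messages, **kwargs):
    """Make a chat completion call and attribute its cost to a feature label."""
    response = client.chat.completions.create(model=model, messages=messages, **kwargs)
    usage = response.usage
    input_rate, output_rate = PRICES[model]
    cost = (usage.prompt_tokens * input_rate + usage.completion_tokens * output_rate) / 1_000_000
    spend_by_feature[feature] += cost
    return response

# After a day of traffic, spend_by_feature shows which features drive the bill,
# e.g. {"chat_support": 41.20, "summarize_ticket": 3.75} (values illustrative)
```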
1. Route tasks to cheapest capable model
The biggest cost reduction comes from not over-provisioning intelligence. GPT-5 is the current flagship (released August 2025), but most tasks don't need it.
OpenAI now offers three model series:
- GPT-5 series: General-purpose models with advanced reasoning (gpt-5, gpt-5-mini, gpt-5-nano)
- o-series: Specialized reasoning models for complex problems (o3, o4-mini)
- Previous generation: Still available (GPT-4o, GPT-4o mini, GPT-4, GPT-3.5 Turbo)
Tier your tasks by actual requirements:
```python
def select_model(task_type):
    """Route tasks to the cheapest model that can handle them."""
    # Simple: classification, extraction, formatting, simple Q&A
    if task_type in ['sentiment', 'classify', 'extract_entities', 'simple_qa']:
        # Cheapest option, 25x less than GPT-5
        return "gpt-5-nano"  # $0.05 per 1M input, $0.40 output

    # Medium: summarization, translation, structured extraction
    elif task_type in ['summarize', 'translate', 'extract_structured', 'simple_write']:
        # 5x cheaper than GPT-5, handles most routine tasks
        return "gpt-5-mini"  # $0.25 per 1M input, $2 output

    # Complex: creative writing, coding, analysis, instruction-following
    # This is your default for complex work - GPT-5 is the current flagship
    elif task_type in ['analyze', 'code_generation', 'creative_write', 'complex_qa']:
        return "gpt-5"  # $1.25 per 1M input, $10 output

    # Specialized reasoning: multi-step math, scientific reasoning, complex logic
    # Note: only use if GPT-5 has proven insufficient
    elif task_type in ['advanced_math', 'scientific_reasoning', 'multi_step_logic']:
        # Cheaper than o3 ($10/$40) but still powerful
        return "o4-mini"  # $1.10 per 1M input, $4.40 output

    else:
        # Default to gpt-5-mini for unknown tasks
        return "gpt-5-mini"
```
Benchmarks show sentiment analysis with o3 vs GPT-5-nano: same accuracy, 200x price difference ($10 vs $0.05 per 1M input tokens). At scale, that's the difference between a $100 bill and a $20,000 bill.
Important note on previous generation models: GPT-4o ($2.50/$10) and GPT-4o mini ($0.15/$0.60) are still available and may be appropriate for applications not yet ready to upgrade to GPT-5. However, GPT-5-nano ($0.05/$0.40) is now cheaper than GPT-4o mini for simple tasks.
Test before you commit. Don't assume you need the most expensive model. Run A/B tests:
```python
def evaluate_model_performance(task, test_cases):
    """Test multiple models on the same task to find the cheapest acceptable option."""
    # Test from cheapest to most expensive
    models = ['gpt-5-nano', 'gpt-5-mini', 'gpt-5', 'o4-mini']
    results = {}
    for model in models:
        accuracy = run_test_suite(model, task, test_cases)
        cost_per_1k = get_model_cost(model)
        results[model] = {
            'accuracy': accuracy,
            'cost': cost_per_1k,
            'value': accuracy / cost_per_1k  # Accuracy per dollar
        }
    return results
```
If GPT-5-nano gets you 92% accuracy and GPT-5 gets you 94% accuracy, is that extra 2% worth 25x the cost ($0.05 vs $1.25 per 1M input tokens)? Sometimes yes, usually no.
2. Monitor reasoning token usage
GPT-5 generates invisible "reasoning tokens" before producing visible output. These are charged at the output token rate ($10 per 1M) but don't appear in your response—only in your bill.
The problem: reasoning tokens can be 5x your visible output tokens. A 1000-token response might have 5000 reasoning tokens behind it, turning the $0.01 of output cost you expected into $0.06.
Monitor reasoning token usage:
```python
import openai

client = openai.OpenAI()

# Make API call
response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "Solve this complex problem..."}],
    max_tokens=500
)

# Check actual token usage, including reasoning tokens
usage = response.usage
reasoning_tokens = usage.completion_tokens_details.reasoning_tokens

# Cost calculation for GPT-5
# completion_tokens already includes reasoning tokens, so don't count them twice
input_cost = usage.prompt_tokens * 0.00000125     # $1.25 per 1M
output_cost = usage.completion_tokens * 0.00001   # $10 per 1M (visible output + reasoning)
reasoning_cost = reasoning_tokens * 0.00001       # the share of output cost spent on reasoning
total_cost = input_cost + output_cost

print(f"Input tokens: {usage.prompt_tokens}")
print(f"Output tokens: {usage.completion_tokens}")
print(f"Reasoning tokens: {reasoning_tokens}")
print(f"Total cost: ${total_cost:.6f}")
```
Strategies to control reasoning costs:
- Use GPT-5-mini or GPT-5-nano for simple tasks - They generate fewer reasoning tokens
- Monitor reasoning token ratio - If reasoning tokens consistently exceed 3x output, consider if the task needs that level of reasoning
- Use o4-mini for actual reasoning tasks - It's designed for reasoning and may be more cost-effective than GPT-5 for complex logic
- Set budgets and alerts - Track reasoning token costs separately in your monitoring
```python
def call_with_reasoning_monitoring(prompt, max_reasoning_multiplier=5):
    """Call GPT-5 with reasoning token monitoring."""
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500
    )
    usage = response.usage
    reasoning_tokens = usage.completion_tokens_details.reasoning_tokens or 0
    visible_output_tokens = usage.completion_tokens - reasoning_tokens

    # Alert if reasoning tokens are excessive relative to visible output
    if reasoning_tokens > visible_output_tokens * max_reasoning_multiplier:
        print(f"⚠️ High reasoning token usage: {reasoning_tokens} reasoning vs {visible_output_tokens} output")
        print("Consider using gpt-5-mini or o4-mini for this task")
    return response
```
3. Compress prompts to minimize input tokens
Every token you send costs money. The goal: send exactly what's needed, nothing more. Every word in your prompt is tokens you're paying for:
```python
# Verbose: 23 tokens
prompt = "Please analyze the sentiment of the following text and tell me whether it is positive, negative, or neutral:"

# Compressed: 7 tokens
prompt = "Sentiment (positive/negative/neutral):"

# Even better with structured output: 3 tokens + JSON mode
prompt = "Sentiment:"
```
At 1M API calls, that 20-token difference is 20M tokens = $25 in GPT-5 input costs (or $1 in GPT-5-nano costs).
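To verify savings like this before shipping a compressed prompt, count tokens locally. Here's a minimal sketch using the tiktoken library; the `o200k_base` encoding is an assumption for GPT-5-era models, so treat the counts as estimates:

```python
import tiktoken

# o200k_base is assumed here; swap in your model's encoding if tiktoken ships an exact mapping
enc = tiktoken.get_encoding("o200k_base")

verbose = "Please analyze the sentiment of the following text and tell me whether it is positive, negative, or neutral:"
compressed = "Sentiment (positive/negative/neutral):"

print(len(enc.encode(verbose)))     # token count of the verbose prompt
print(len(enc.encode(compressed)))  # token count of the compressed prompt
```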
4. Truncate older conversation messages
Conversational applications accumulate context over time. Sending the entire conversation history on every request multiplies input costs with each turn. Most tasks don't need full history—recent context is usually sufficient. Implement strategic truncation to keep only relevant messages within a token budget:
```python
def prepare_conversation(messages, max_history_tokens=2000):
    """Keep only recent, relevant conversation context."""
    # Always include the system message
    system_msg = next(m for m in messages if m['role'] == 'system')
    result = [system_msg]

    # Add recent non-system messages until we hit the token budget
    recent = [m for m in messages if m['role'] != 'system'][-10:]  # Last 10 turns
    total_tokens = count_tokens([system_msg])  # count_tokens: tiktoken-based helper (see pattern 3)
    for msg in reversed(recent):
        msg_tokens = count_tokens([msg])
        if total_tokens + msg_tokens <= max_history_tokens:
            result.insert(1, msg)  # Insert after the system message, preserving chronological order
            total_tokens += msg_tokens
        else:
            break
    return result
```
A 20-turn conversation without truncation might send 30K tokens on the 20th turn. With truncation: 3K tokens. That's 90% cost reduction on input.
5. Filter irrelevant content before sending to API
Raw inputs often contain bloat: HTML boilerplate, navigation elements, repeated footers, excessive whitespace. This noise inflates token counts without adding value. Preprocess inputs to extract signal and discard irrelevant content before sending to the API:
```python
from bs4 import BeautifulSoup

def preprocess_document(text):
    """Extract relevant content, remove boilerplate."""
    # Strip HTML if present
    soup = BeautifulSoup(text, 'html.parser')

    # Remove script, style, nav elements
    for element in soup(['script', 'style', 'nav', 'footer', 'aside']):
        element.decompose()

    # Extract main content, falling back to the whole document if no landmark tags exist
    main_content = soup.find('main') or soup.find('article') or soup.body or soup
    text = main_content.get_text(separator=' ', strip=True)

    # Remove excessive whitespace
    text = ' '.join(text.split())
    return text
```
This can reduce input from 15K tokens (full HTML page) to 2K tokens (article content)—an 85% reduction.
6. Structure prompts for automatic caching
Since GPT-5's release, automatic prompt caching provides a 90% discount on cached input tokens. Structure prompts with static prefixes (system instructions, examples, context) to maximize cache hits:
```python
# Structure prompts with a static prefix for automatic caching
system_instructions = """
You are a specialized assistant for... [5000 tokens of instructions]
"""

def create_cached_completion(user_query):
    """Static system message gets cached automatically - 90% discount."""
    return client.chat.completions.create(
        model="gpt-5",
        messages=[
            {"role": "system", "content": system_instructions},  # Cached: $0.125 per 1M (vs $1.25)
            {"role": "user", "content": user_query}               # Dynamic: $1.25 per 1M
        ],
        max_tokens=200
    )

# First call: full price ($1.25 per 1M tokens)
# Subsequent calls (within a few minutes): 90% discount ($0.125 per 1M tokens)
# For a 5000-token system message: $0.00625 → $0.000625 per call (90% savings)
```
For a 50-message conversation with a 5000-token system prompt, costs drop from $5 to under $1 with caching. This is automatic—no configuration required.
7. Cache responses for identical/similar requests
Don't re-process identical requests. Cache full API responses by request signature (messages + model + parameters) to avoid redundant API calls—30-50% cache hit rates translate directly to 30-50% cost reduction:
```python
import hashlib
import json

import openai
import redis

cache = redis.Redis()
client = openai.OpenAI()

def cached_completion(messages, model, **kwargs):
    """Cache responses by request hash."""
    # Create cache key from request parameters
    cache_key = hashlib.md5(
        json.dumps({
            'messages': messages,
            'model': model,
            **kwargs
        }, sort_keys=True).encode()
    ).hexdigest()

    # Check cache
    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)

    # Make API call
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        **kwargs
    )

    # Cache for 1 hour (store a plain dict so cache hits and misses return the same shape)
    result = response.model_dump()
    cache.setex(cache_key, 3600, json.dumps(result))
    return result
```
For applications with repeated queries (FAQs, common classifications), cache hit rates of 30-50% eliminate that spend entirely.
8. Set explicit max_tokens for every request
Output tokens are expensive (8x the cost of input tokens for GPT-5), and reasoning tokens multiply the cost further. Setting explicit limits is mandatory:
```python
# Bad: no token limit, model generates until it decides to stop
response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "Classify this text: ..."}]
)
# Model might generate 500 tokens of explanation when you needed 1 word
# Plus 2500 reasoning tokens you can't see = 3000 total tokens charged

# Good: strict token limit for the task
response = client.chat.completions.create(
    model="gpt-5-nano",
    messages=[{"role": "user", "content": "Classify this text: ..."}],
    max_tokens=10  # Just enough for "positive" or "negative"
)
# Limited output = limited reasoning tokens too
```
Output tokens are 8x more expensive than input for GPT-5. An unconstrained response generating 2000 tokens instead of 200 tokens costs 10x more. Plus reasoning tokens multiply that further.
9. Use JSON mode or function calling for structured outputs
JSON mode and function calling produce shorter, more reliable responses, eliminating verbose natural language wrappers and reducing reasoning token overhead:
```python
# Bad: free-form text that requires parsing
response = client.chat.completions.create(
    model="gpt-5",
    messages=[{
        "role": "user",
        "content": "Extract the name, email, and phone from this text. Format as JSON."
    }]
)
# Model generates: "Here's the extracted information:\n{\n  \"name\": \"John\",...
# ~100 output tokens + ~200 reasoning tokens = 300 total

# Good: JSON mode, no explanation
# (JSON mode requires the word "JSON" to appear somewhere in the messages)
response = client.chat.completions.create(
    model="gpt-5-nano",
    messages=[{
        "role": "user",
        "content": "Extract name, email, phone from this text as JSON."
    }],
    response_format={"type": "json_object"}
)
# Model generates: {"name":"John","email":"j@ex.com","phone":"555-1234"}
# ~20 output tokens + minimal reasoning = 25 total
# 90% cost reduction + cheaper model (25x less)
```
The difference: 80% fewer output tokens, minimal reasoning overhead, more reliable parsing, lower costs.
Use function calling for actions:
```python
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="gpt-5-nano",
    messages=[{"role": "user", "content": "What's the weather in SF?"}],
    tools=tools,
    tool_choice="auto"
)
```
Function calling produces minimal output tokens—just the function call arguments, no natural language wrapper. And it minimizes reasoning tokens since the task is well-defined.
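Reading the result is just as compact: the model returns a function name plus a small JSON arguments string. A minimal sketch continuing the request above (the printed argument value is illustrative):

```python
import json

# The entire "answer" is the function name plus its JSON arguments
tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)  # e.g. {"location": "San Francisco"}
print(tool_call.function.name, args)
```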
10. Use Batch API for 50% discount on async work
For asynchronous workloads that can wait up to 24 hours, use the Batch API for a 50% discount on all tokens:
```python
import json

# Create the requests for bulk processing
batch_requests = [
    {
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-5",
            "messages": [{"role": "user", "content": task}],
            "max_tokens": 500
        }
    }
    for i, task in enumerate(bulk_tasks)
]

# Write the requests to a JSONL file (one JSON object per line)
with open("batch_requests.jsonl", "w") as f:
    for request in batch_requests:
        f.write(json.dumps(request) + "\n")

# Upload and submit the batch (50% cost reduction on all tokens)
batch_file = client.files.create(file=open("batch_requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)
```
Suitable for data processing, content generation, bulk classification—anything that doesn't need real-time results.
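Batches complete asynchronously, so poll for status and pull results when the job finishes. A minimal sketch continuing the example above; it assumes chat completion output lines keyed by `custom_id`:

```python
import json
import time

# Poll until the batch reaches a terminal state
while True:
    batch = client.batches.retrieve(batch.id)
    if batch.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)

# Each line of the output file is one JSON result, matched to its request by custom_id
if batch.status == "completed":
    output = client.files.content(batch.output_file_id)
    for line in output.text.splitlines():
        result = json.loads(line)
        print(result["custom_id"], result["response"]["body"]["choices"][0]["message"]["content"])
```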
11. Test minimum effective example count
More few-shot examples don't always mean better performance. Test example counts systematically (0, 1, 3, 5, 10) to find the minimum needed for target accuracy—often 3 examples achieve 94% accuracy while 10 examples cost 3x more for just 1% improvement:
```python
# Test different example counts
def find_optimal_examples(task, example_pool, test_set):
    """Find the minimum number of few-shot examples needed for acceptable accuracy."""
    results = {}
    prev_n = None
    for n in [0, 1, 3, 5, 10]:
        examples = example_pool[:n]
        accuracy = evaluate_with_examples(task, examples, test_set)
        cost_per_call = n * avg_tokens_per_example * input_token_cost
        results[n] = {'accuracy': accuracy, 'cost': cost_per_call}

        # Stop if we hit diminishing returns versus the previous example count
        if prev_n is not None and accuracy - results[prev_n]['accuracy'] < 0.02:
            print(f"Diminishing returns after {prev_n} examples")
            break
        prev_n = n
    return results

# Often: 3 examples = 94% accuracy, 10 examples = 95% accuracy
# That 1% costs 3x more. Not worth it.
```
Closing Thoughts
Careful model selection is table stakes—use GPT-5-mini ($0.25/$2) or GPT-5-nano ($0.05/$0.40) as your default for most tasks, reserve GPT-5 ($1.25/$10) for complex work, and route specialized reasoning to o4-mini. Monitor reasoning tokens on GPT-5 to catch the hidden cost multiplier. This baseline discipline prevents the worst cost traps.
Getting to the next level of savings requires two steps. First, let observed costs guide what needs optimization. Attribute your bill to specific application behaviors—which endpoints, features, and request types drive spend. This reveals which of the 11 efficiency patterns above matter most for your application. Second, apply the optimizations that address your cost concentrations. These patterns reduce tokens without sacrificing functionality: automatic prompt caching cuts input costs by 90%, conversation truncation eliminates redundant context, output limits prevent unbounded generation, JSON mode and function calling replace verbose responses with structured outputs, and the Batch API cuts async workload costs in half.
You don't lose capability. You gain precision. Cost-effective AI is an engineering discipline—treat each API call as a budget allocation and optimize based on measured impact.
Looking for help with cost optimizations like these? Sign up for Early Access to Frugal. Frugal attributes OpenAI API costs to your code, finds inefficient usage, provides Frugal Fixes that reduce your bill, and helps keep you out of cost traps for new and changing code.