
LLM Cost Optimization: Cut Your API Bill by 80%


The $15K Wake-Up Call

Month 3 of our AI feature: OpenAI bill hits $15,000. CEO asks questions. We optimize. Month 4: $3,000 for the same workload.

Here's exactly what we did.

The Cost Breakdown

Where LLM costs come from: input tokens, output tokens, and which model handles the request, multiplied across every call you make.

The math:

GPT-4: $30/1M input tokens, $60/1M output tokens
Claude Sonnet: $3/1M input, $15/1M output  
GPT-3.5: $0.50/1M input, $1.50/1M output

Small changes compound at scale.
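To see how per-token prices turn into a monthly bill, here's a minimal sketch. Only the prices come from the table above; the request volume and token counts are hypothetical placeholders.

# Rough monthly cost estimate from the per-token prices above.
# Request volume and token counts are hypothetical placeholders.
PRICES = {  # $ per 1M tokens: (input, output)
    "gpt-4": (30.00, 60.00),
    "claude-sonnet": (3.00, 15.00),
    "gpt-3.5-turbo": (0.50, 1.50),
}

def monthly_cost(model, requests_per_day, input_tokens, output_tokens, days=30):
    in_price, out_price = PRICES[model]
    per_request = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return per_request * requests_per_day * days

# Example: 10K requests/day, 500 input + 300 output tokens each
print(monthly_cost("gpt-4", 10_000, 500, 300))          # $9,900/month
print(monthly_cost("gpt-3.5-turbo", 10_000, 500, 300))  # $210/month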

1. Prompt Compression: The Quick Win

Before (350 tokens):

You are a highly experienced customer service representative with deep knowledge of our product... [250 tokens of unnecessary preamble]

Customer question: {question}
Please provide a helpful response.

After (80 tokens):

Answer customer questions accurately using the provided context.

Context: {relevant_context}
Question: {question}
Answer:

Savings: 77% fewer input tokens

Key tactics: cut the role-play preamble, drop filler like "please provide a helpful response," and pass only the context the model actually needs. You can verify the savings by counting tokens, as in the sketch below.
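A minimal way to check what compression buys you, assuming the tiktoken library; the prompt strings are stand-ins for your own:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

# Stand-ins for your real prompts
before = "You are a highly experienced customer service representative with deep knowledge of our product..."
after = "Answer customer questions accurately using the provided context.\n\nContext: {relevant_context}\nQuestion: {question}\nAnswer:"

before_tokens = len(enc.encode(before))
after_tokens = len(enc.encode(after))
print(f"{before_tokens} -> {after_tokens} tokens ({1 - after_tokens / before_tokens:.0%} fewer)")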

2. Model Routing: Right Tool, Right Job

Not every task needs GPT-4.

Our routing logic:

def route_to_model(task_type, complexity):
    if task_type == "simple_classification":
        return "gpt-3.5-turbo"  # $0.50/M input
    
    elif task_type == "extraction" and complexity == "low":
        return "gpt-3.5-turbo"
    
    elif task_type == "reasoning" and complexity == "high":
        return "gpt-4"  # $30/M input
    
    elif task_type == "long_context":
        return "claude-sonnet"  # Better $/performance for long context
    
    else:
        return "gpt-3.5-turbo"  # Default to cheap

Result: 60% of our tasks ran on GPT-3.5 instead of GPT-4.

Savings: ~$7K/month

3. Response Truncation: Stop When Done

Models keep generating until they finish naturally, hit max_tokens, or emit a stop sequence, and without tight limits they tend to over-explain.

Bad:

response = openai.chat.completions.create(
    model="gpt-4",
    max_tokens=1000,  # Often generates 800+ unnecessary tokens
    messages=[...]
)

Good:

response = openai.chat.completions.create(
    model="gpt-4",
    max_tokens=200,  # Tight constraint
    stop=["###", "---"],  # Stop early if model signals done
    messages=[...]
)

Savings: 60-70% fewer output tokens
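Stop sequences only help if the model knows to emit them. A small sketch of wiring the instruction to the stop parameter; the model choice, the ### marker, and the sample question are illustrative:

from openai import OpenAI

client = OpenAI()

user_question = "How do I reset my API key?"  # Placeholder input

response = client.chat.completions.create(
    model="gpt-4",
    max_tokens=200,  # Hard ceiling on output
    stop=["###"],    # Generation halts as soon as the model emits the marker
    messages=[
        {"role": "system", "content": "Answer concisely, then print ### on its own line."},
        {"role": "user", "content": user_question},
    ],
)
print(response.choices[0].message.content)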

4. Batch Processing: Group API Calls

Before: 1 API call per item

for item in items:
    result = llm.process(item)

Cost: N API calls

After: Batch items per request

import json

def batch_process(items, batch_size=10):
    results = []
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        prompt = f"Process these items and return a JSON array, one result per item:\n{json.dumps(batch)}"
        results.extend(json.loads(llm.process(prompt)))
    return results

Savings: the fixed prompt overhead (system message, instructions, examples) is paid once per batch instead of once per item, and you make 10x fewer API calls.

Pro tip: Balance batch size with latency needs.

5. Caching: Don't Recompute

Anthropic's Prompt Caching:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,  # Cached across requests
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": user_query}]
)

Cost: cache writes are billed at 25% more than normal input tokens, but cache reads cost 90% less, so a long cached system prompt pays for itself after roughly one reuse.

Our savings: $2K/month on system message repetition alone.

6. Smart Sampling: Don't Call LLM When You Don't Need To

Before: Every user action triggered LLM

def handle_user_input(input):
    return llm.generate_response(input)

After: Filter first

def handle_user_input(input):
    # Quick heuristics
    if is_faq(input):
        return cached_faq_response(input)
    
    if is_simple_query(input):
        return rule_based_response(input)
    
    # Only complex queries hit LLM
    return llm.generate_response(input)

Result: 35% of requests handled without LLM calls.

Savings: ~$4K/month
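The filters don't need to be clever. A minimal sketch of the kind of heuristics is_faq, cached_faq_response, and is_simple_query might use; the FAQ dictionary and patterns here are hypothetical:

import re

# Hypothetical FAQ store: normalized question -> canned answer
FAQ_ANSWERS = {
    "what are your business hours": "We're open 9am-6pm ET, Monday to Friday.",
    "how do i reset my password": "Use the 'Forgot password' link on the login page.",
}

SIMPLE_PATTERNS = [
    re.compile(r"^(hi|hello|hey)\b", re.I),
    re.compile(r"\b(order status|tracking number)\b", re.I),
]

def normalize(text):
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def is_faq(user_input):
    return normalize(user_input) in FAQ_ANSWERS

def cached_faq_response(user_input):
    return FAQ_ANSWERS[normalize(user_input)]

def is_simple_query(user_input):
    return any(p.search(user_input) for p in SIMPLE_PATTERNS)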

7. Output Constraints: Force Brevity

Bad:

"Explain this concept to the user."

Average output: 300 tokens

Good:

"Explain this concept in exactly 2 sentences."

Average output: 50 tokens

Savings: 83% fewer output tokens

Tactics: give exact sentence or word counts, ask for bullets instead of paragraphs, request structured output with only the fields you need, and back the instruction up with a max_tokens limit (see the sketch below).
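One concrete way to combine these: request structured JSON with only the fields you need, plus a max_tokens backstop. A sketch assuming OpenAI's JSON mode; the schema and sample question are illustrative:

from openai import OpenAI

client = OpenAI()

user_question = "What does prompt caching do?"  # Placeholder input

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    max_tokens=100,  # Backstop in case the model ignores the length instruction
    response_format={"type": "json_object"},  # JSON mode: no prose padding around the answer
    messages=[
        {"role": "system", "content": "Return JSON with exactly two keys: answer (max 2 sentences) and confidence (0 to 1)."},
        {"role": "user", "content": user_question},
    ],
)
print(response.choices[0].message.content)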

8. Model Cascading: Try Cheap First

Pattern:

def generate_with_cascade(prompt):
    # Try cheap model first
    response = gpt_35(prompt)
    
    # Check if response is good enough
    if quality_check(response) > 0.8:
        return response
    
    # Fall back to expensive model only if needed
    return gpt_4(prompt)

Result: 70% of requests satisfied by cheap model.

Savings: Massive (only pay for GPT-4 when necessary)
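The cascade lives or dies on quality_check, which the snippet leaves undefined. A minimal sketch of a cheap heuristic score; the thresholds and rules are assumptions, and a rubric-based or model-based evaluator is the natural upgrade:

def quality_check(response: str) -> float:
    """Cheap heuristic score in [0, 1]."""
    score = 1.0
    text = response.strip()
    if len(text) < 20:                      # Suspiciously short
        score -= 0.5
    if "as an ai" in text.lower():          # Deflection boilerplate
        score -= 0.3
    if text.endswith(("...", ",")):         # Looks truncated
        score -= 0.3
    return max(score, 0.0)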

9. Fine-Tuning: Long-Term Optimization

When it makes sense: a narrow, stable, high-volume task where a fine-tuned smaller model can match the bigger model's quality at a fraction of the per-token price.

Cost comparison (1M requests):

Break-even: ~50K requests
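To run the break-even math for your own workload, a rough sketch; every number below is a placeholder, not our actual pricing:

def breakeven_requests(base_cost_per_request, ft_cost_per_request, ft_fixed_cost):
    """Requests needed before per-request savings cover the one-time fine-tuning cost."""
    savings_per_request = base_cost_per_request - ft_cost_per_request
    if savings_per_request <= 0:
        return float("inf")  # Fine-tuned model isn't actually cheaper per request
    return ft_fixed_cost / savings_per_request

# Placeholder numbers purely for illustration:
# base model $0.012/request, fine-tuned model $0.002/request, $500 training cost
print(breakeven_requests(0.012, 0.002, 500))  # 50,000 requests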

Our fine-tuning wins:

10. Monitoring: Track Cost Per Feature

Essential metrics:

{
  "feature_name": "content_generation",
  "daily_requests": 1200,
  "avg_input_tokens": 250,
  "avg_output_tokens": 180,
  "model": "gpt-4",
  "daily_cost": "$45",
  "monthly_projection": "$1350"
}

We built a simple dashboard on top of these per-feature logs (a roll-up sketch follows below).

Result: Identified that 80% of costs came from 20% of features. Optimized those first.
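A minimal sketch of rolling per-request logs up into the per-feature view above; the entry fields follow the log_cost call in the Tools & Code section, and the dict-based log format is an assumption:

from collections import defaultdict

def summarize_costs(log_entries, days=1):
    """Aggregate per-request entries ({'feature', 'input_tokens', 'output_tokens', 'cost'}) by feature."""
    by_feature = defaultdict(lambda: {"requests": 0, "input_tokens": 0, "output_tokens": 0, "cost": 0.0})
    for entry in log_entries:
        agg = by_feature[entry["feature"]]
        agg["requests"] += 1
        agg["input_tokens"] += entry["input_tokens"]
        agg["output_tokens"] += entry["output_tokens"]
        agg["cost"] += entry["cost"]
    return {
        feature: {
            "daily_requests": agg["requests"] // days,
            "avg_input_tokens": agg["input_tokens"] // agg["requests"],
            "avg_output_tokens": agg["output_tokens"] // agg["requests"],
            "daily_cost": round(agg["cost"] / days, 2),
            "monthly_projection": round(agg["cost"] / days * 30, 2),
        }
        for feature, agg in by_feature.items()
    }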

11. Open-Source Models: The Nuclear Option

When to consider: very high, predictable volume; strict data privacy or residency requirements; tasks you've verified a smaller open model handles well.

Cost comparison (1M requests):

Break-even: ~15K requests/day

Trade-offs: you take on GPU provisioning, serving infrastructure, upgrades, and quality evaluation yourself, and out-of-the-box quality usually trails the frontier APIs.

When it's worth it: Mature products with stable, high-volume use cases.

12. The Latency-Cost Trade-off

Fast & expensive:

response = gpt_4(prompt, max_tokens=1000)

Slow & cheap:

# Use cheaper model
response = gpt_35(prompt, max_tokens=200)

Smart middle ground:

# Stream response, stop early if user is satisfied
for chunk in gpt_4_stream(prompt):
    yield chunk
    if user_stopped_reading():
        break  # Don't generate remaining tokens

Savings: 20-30% on output tokens
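With the OpenAI SDK, the early-stop pattern looks roughly like this; should_stop stands in for whatever abandonment signal your UI gives you, and closing the stream should end generation server-side so you stop paying for tokens no one will read:

from openai import OpenAI

client = OpenAI()

def stream_until_abandoned(messages, should_stop):
    stream = client.chat.completions.create(
        model="gpt-4",
        messages=messages,
        max_tokens=500,
        stream=True,
    )
    for chunk in stream:
        yield chunk.choices[0].delta.content or ""
        if should_stop():
            stream.close()  # Stop generation early instead of paying for the rest
            break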

Real Numbers: Our Optimization Journey

Month 1 (baseline): roughly $15,000/month in API spend.

Month 2 (after optimizations): roughly $3,000/month for the same workload.

What we did (each lever's approximate share of the total savings):

  1. Prompt compression → 30% savings
  2. Model routing → 45% savings
  3. Caching → 15% savings
  4. Batching → 10% savings

Total: ~80% cost reduction ($15K → $3K for the same workload)

The Optimization Framework

Step 1: Measure. Instrument every LLM call with per-feature cost tracking before you change anything.

Step 2: Low-Hanging Fruit. Compress prompts, tighten max_tokens, constrain output length, and turn on prompt caching.

Step 3: Architecture. Add model routing, cascading, batching, and pre-LLM filtering so cheap paths handle most of the traffic.

Step 4: Long-Term. Once volume and use cases stabilize, evaluate fine-tuning and self-hosted open-source models.

Gotchas to Avoid

Don't: swap in a cheaper model or slash max_tokens without checking output quality, cache content that varies per user, or compress prompts so aggressively that accuracy drops and you pay for retries.

Do: track quality alongside cost, roll model changes out behind A/B tests, and re-run the numbers periodically, since provider pricing changes often.

Tools & Code

Cost tracking:

import anthropic
from functools import wraps

client = anthropic.Anthropic()

# $ per 1M tokens for the model being called (Claude Sonnet pricing from the table above)
MODEL_INPUT_COST = 3.00
MODEL_OUTPUT_COST = 15.00

def track_cost(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        response = func(*args, **kwargs)
        
        input_cost = response.usage.input_tokens / 1_000_000 * MODEL_INPUT_COST
        output_cost = response.usage.output_tokens / 1_000_000 * MODEL_OUTPUT_COST
        
        log_cost(  # Your own logging/metrics sink
            feature=func.__name__,
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
            cost=input_cost + output_cost
        )
        
        return response
    return wrapper

@track_cost
def generate_content(prompt):
    return client.messages.create(...)

Model router:

class ModelRouter:
    def __init__(self):
        self.costs = {  # $ per 1M input tokens, for reference
            "gpt-4": 30,
            "gpt-3.5": 0.5,
            "claude-sonnet": 3
        }
    
    def route(self, task_type, budget):
        if budget == "low":
            return "gpt-3.5"
        elif task_type == "simple":
            return "gpt-3.5"
        elif task_type == "long_context":
            return "claude-sonnet"
        else:
            return "gpt-4"

Start Here

  1. Track costs: Implement per-feature cost monitoring
  2. Compress prompts: Remove unnecessary tokens
  3. Route smartly: Use GPT-3.5 where possible
  4. Set output limits: Add max_tokens constraints
  5. Cache aggressively: Use prompt caching

You'll see 40-60% savings in the first week.


How much are you spending? Let me know your optimization wins on Twitter or email.
