
Optimizing Prompts to Reduce Token Usage and Costs

Learn practical techniques to write more efficient prompts, reduce API token consumption by 50-80%, and lower your LLM costs without sacrificing output quality.

By Inventive HQ Team

Every token counts when you're processing thousands of API requests. Yet many developers write prompts without considering token efficiency, leaving 50-80% potential savings on the table. This guide shows you practical techniques to reduce token usage dramatically while maintaining or even improving output quality.

The Cost of Inefficient Prompts

Before diving into optimization, let's understand the impact. Consider this common chatbot prompt:

Inefficient prompt (847 tokens):

You are a helpful, friendly, and knowledgeable customer service assistant
for TechCorp, a leading technology company that specializes in innovative
software solutions. Your role is to assist customers with their questions
and concerns in a professional and courteous manner. You should always
strive to provide accurate, helpful, and comprehensive responses to all
customer inquiries. Please remember to be patient, understanding, and
empathetic when dealing with customer issues. If you don't know the answer
to a question, please let the customer know that you will find out and
get back to them. Always maintain a positive and helpful attitude.

The customer has the following question about their account:

{customer_question}

Please provide a helpful and informative response to address the
customer's concerns. Make sure your response is clear, concise, and
addresses all aspects of their question. If additional information
is needed, please ask clarifying questions.

Optimized prompt (156 tokens):

You are TechCorp's support agent. Be helpful, accurate, and concise.

Customer question: {customer_question}

Respond directly to their question. Ask for clarification only if essential.

At 100,000 monthly requests with GPT-4o:

  • Inefficient: ~$450/month
  • Optimized: ~$85/month
  • Savings: $365/month (81%)

Principle 1: Eliminate Redundancy

Remove Filler Words

LLMs understand instructions without pleasantries:

| Verbose | Concise | Savings |
| --- | --- | --- |
| "I would like you to please summarize" | "Summarize" | 7 tokens |
| "Can you help me by analyzing" | "Analyze" | 6 tokens |
| "It would be great if you could" | (just state the task) | 8 tokens |
| "Please make sure to always" | "Always" | 5 tokens |

Consolidate Repetitive Instructions

Before (redundant):

Be concise in your response.
Keep your answer brief.
Don't write lengthy explanations.
Provide short, to-the-point responses.

After:

Be concise (2-3 sentences max).

Remove Obvious Context

Don't tell the model what it already knows:

Remove:

  • "As an AI language model..."
  • "Based on the text I was given..."
  • "Using my knowledge and capabilities..."

The model knows what it is. Focus instructions on what it should do.

Principle 2: Optimize System Prompts

System prompts accompany every request in a conversation, so a 500-token system prompt adds 500 input tokens to every single message you send.
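To make that overhead concrete, here is a rough back-of-the-envelope calculation; the per-token price is an assumed illustrative rate, so plug in your provider's current pricing:

# Rough cost of re-sending a 500-token system prompt on every turn.
# PRICE_PER_1M_INPUT is an assumed illustrative rate, not a quoted price.
SYSTEM_PROMPT_TOKENS = 500
PRICE_PER_1M_INPUT = 2.50          # $/1M input tokens (assumed)
TURNS_PER_CONVERSATION = 20
CONVERSATIONS_PER_MONTH = 100_000

total_tokens = SYSTEM_PROMPT_TOKENS * TURNS_PER_CONVERSATION * CONVERSATIONS_PER_MONTH
print(f"System prompt overhead: ~${total_tokens / 1_000_000 * PRICE_PER_1M_INPUT:,.0f}/month")
# -> System prompt overhead: ~$2,500/month at these assumptions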

Keep System Prompts Minimal

Verbose system prompt (312 tokens):

You are an expert software developer with extensive experience in Python,
JavaScript, TypeScript, and various other programming languages. You have
deep knowledge of software design patterns, best practices, clean code
principles, and modern development methodologies including Agile and DevOps.
You should provide helpful, accurate, and well-thought-out responses to
programming questions. When writing code, always include comments explaining
what the code does. Follow industry best practices and conventions. If you're
unsure about something, acknowledge your uncertainty. Be friendly and
professional in your interactions.

Optimized system prompt (47 tokens):

Expert developer. Python/JS/TS. Write clean, commented code.
Acknowledge uncertainty. Be concise.

Use Role Shorthand

Instead of lengthy role descriptions:

Role: Senior security analyst
Expertise: Penetration testing, vulnerability assessment
Tone: Technical, precise

Principle 3: Structure Output Efficiently

Request Structured Formats

Natural language responses use more tokens than structured data:

Natural language output (89 tokens):

After analyzing the customer review, I've determined that the overall
sentiment is positive. The customer expresses satisfaction with the
product quality and delivery speed. The confidence level for this
sentiment analysis is approximately 92 percent. The main positive
aspects mentioned are quality and shipping, while no negative aspects
were identified.

JSON output (34 tokens):

{"sentiment": "positive", "confidence": 0.92, "positive": ["quality", "shipping"], "negative": []}

Use Abbreviated Keys

For high-volume processing, shorter keys save tokens:

// Verbose: 45 tokens
{"customer_name": "John", "order_status": "shipped", "estimated_delivery": "2026-01-20"}

// Abbreviated: 29 tokens
{"n": "John", "s": "shipped", "d": "2026-01-20"}

Limit Output Length

Explicitly constrain response length:

Summarize in exactly 3 bullet points.
Answer in one sentence.
List top 5 only.
Max 100 words.
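Pair the prompt-level constraint with a hard cap on output tokens so a runaway response can't inflate costs. A sketch assuming the openai Python SDK (model name and limit are illustrative):

from openai import OpenAI

client = OpenAI()
document = "..."  # placeholder: the text to summarize

response = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=150,  # hard ceiling on billed output tokens
                     # (some newer models use max_completion_tokens instead)
    messages=[
        {"role": "user", "content": f"Summarize in exactly 3 bullet points:\n{document}"},
    ],
)
print(response.choices[0].message.content)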

Principle 4: Reduce Few-Shot Examples

Zero-Shot First

Many tasks work well without examples:

5-shot prompt (400+ tokens):

Classify sentiment. Examples:
"Great product!" -> positive
"Terrible experience" -> negative
"It's okay" -> neutral
"Love it!" -> positive
"Waste of money" -> negative

Classify: "The delivery was fast but packaging was damaged"

Zero-shot prompt (32 tokens):

Classify sentiment as positive, negative, or neutral.
Text: "The delivery was fast but packaging was damaged"

Minimal Examples When Needed

If zero-shot doesn't work, use 1-2 examples instead of 5+:

Classify sentiment (positive/negative/neutral).
Example: "Great but expensive" -> neutral
Text: "{input}"

Consider Fine-Tuning

For production workloads with many examples, fine-tuning can be more cost-effective:

| Approach | Setup Cost | Per-Request Cost |
| --- | --- | --- |
| 5-shot prompts | $0 | High (extra 300+ tokens) |
| Fine-tuned model | $50-500 | Low (zero-shot works) |

Break-even typically occurs around 50,000-200,000 requests.
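The break-even point is simple arithmetic. A rough sketch with assumed numbers (the fine-tuning cost and token price are illustrative, and it ignores any per-token price difference for the fine-tuned model):

# When does a one-time fine-tune beat paying for few-shot examples forever?
setup_cost = 100.0            # assumed one-time fine-tuning cost ($)
extra_prompt_tokens = 300     # few-shot examples sent with every request
price_per_1m_input = 2.50     # assumed $/1M input tokens

cost_per_request = extra_prompt_tokens / 1_000_000 * price_per_1m_input
print(f"Break-even after ~{setup_cost / cost_per_request:,.0f} requests")
# -> ~133,333 requests at these assumptions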

Principle 5: Optimize Input Context

Summarize Long Documents

Instead of passing entire documents:

Naive approach:

Here's a 10,000-word document: {full_document}
Answer: What are the key financial metrics?

Optimized approach:

Document summary: {500_word_summary}
Full metrics section: {relevant_section}
Answer: What are the key financial metrics?
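When several questions target the same document, summarize once and reuse the summary so you pay the full-document cost a single time. A sketch where llm.complete and full_document are placeholders for your client and source text:

# Summarize once, then answer many questions against the short summary
# instead of resending the full document each time.
summary = llm.complete(f"Summarize the key points in under 500 words:\n{full_document}")

def ask(question: str) -> str:
    return llm.complete(f"Document summary:\n{summary}\n\nQuestion: {question}")

ask("What are the key financial metrics?")
ask("Who are the main competitors mentioned?")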

Use Retrieval Wisely

In RAG systems, retrieve only what's needed:

# Inefficient: retrieve 10 large chunks
chunks = retriever.get_top_k(query, k=10, chunk_size=1000)

# Efficient: retrieve fewer, smaller, more relevant chunks
chunks = retriever.get_top_k(query, k=3, chunk_size=300)

Truncate Strategically

For very long inputs, identify what matters:

def optimize_context(document: str, max_tokens: int = 2000) -> str:
    # Keep the beginning (usually has key info) and the end (often has
    # conclusions); summarize the middle. Character offsets are a rough
    # proxy for tokens, and summarize() is your own summarization call.
    beginning = document[:1000]
    end = document[-500:]
    middle_summary = summarize(document[1000:-500])
    return f"{beginning}\n[Summary: {middle_summary}]\n{end}"

Principle 6: Batch Requests

Inefficient (3 separate calls):

sentiment1 = analyze_sentiment("Review 1...")
sentiment2 = analyze_sentiment("Review 2...")
sentiment3 = analyze_sentiment("Review 3...")

Efficient (1 batched call):

results = analyze_sentiment_batch([
    "Review 1...",
    "Review 2...",
    "Review 3..."
])
# Returns: [{"text": "Review 1...", "sentiment": "positive"}, ...]

Batch Processing Prompt

Analyze sentiment for each review. Return JSON array.

Reviews:
1. "{review1}"
2. "{review2}"
3. "{review3}"

Format: [{"id": 1, "sentiment": "positive/negative/neutral"}, ...]

This saves system prompt repetition and reduces per-item overhead.

Principle 7: Cache Aggressively

Semantic Caching

Cache similar queries, not just exact matches. A cheap first step is normalizing prompts so trivially different phrasings share a cache entry (sketch below; llm is a placeholder client). A true semantic cache that matches on embedding similarity is sketched after it:

from functools import lru_cache

def normalize(prompt: str) -> str:
    # Collapse case and surrounding whitespace so near-identical
    # prompts map to the same cache entry
    return " ".join(prompt.lower().split())

@lru_cache(maxsize=10000)
def cached_completion(normalized_prompt: str) -> str:
    # Only cache misses reach the LLM
    return llm.complete(normalized_prompt)

def complete(prompt: str) -> str:
    return cached_completion(normalize(prompt))
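For genuinely semantic matching, compare query embeddings instead of strings. A sketch where embed() and llm.complete() stand in for your embedding model and LLM client, and the 0.92 threshold is illustrative:

import numpy as np

# (embedding, response) pairs; use a vector store for anything non-trivial
semantic_cache: list[tuple[np.ndarray, str]] = []

def semantic_complete(query: str, threshold: float = 0.92) -> str:
    q = embed(query)  # placeholder: your embedding model
    for emb, cached_response in semantic_cache:
        similarity = float(np.dot(q, emb) / (np.linalg.norm(q) * np.linalg.norm(emb)))
        if similarity >= threshold:
            return cached_response          # cache hit: no LLM call
    response = llm.complete(query)          # cache miss: full LLM call
    semantic_cache.append((q, response))
    return response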

Cache Common Patterns

Identify frequently asked questions and pre-generate responses:

COMMON_RESPONSES = {
    "reset_password": "To reset your password, go to Settings > Security...",
    "refund_policy": "Our refund policy allows returns within 30 days...",
    "shipping_time": "Standard shipping takes 3-5 business days..."
}

def handle_query(query: str) -> str:
    intent = classify_intent(query)  # Cheap classification
    if intent in COMMON_RESPONSES:
        return COMMON_RESPONSES[intent]  # No LLM call
    return llm.generate(query)  # Full LLM call only when needed

Practical Optimization Workflow

Step 1: Measure Current Usage

def log_token_usage(prompt: str, response: str, model: str):
    input_tokens = count_tokens(prompt, model)
    output_tokens = count_tokens(response, model)

    metrics.histogram("llm.input_tokens", input_tokens)
    metrics.histogram("llm.output_tokens", output_tokens)

    print(f"Input: {input_tokens}, Output: {output_tokens}")

Step 2: Identify High-Volume Prompts

SELECT
    prompt_template,
    COUNT(*) as request_count,
    AVG(input_tokens) as avg_input,
    AVG(output_tokens) as avg_output,
    SUM(cost) as total_cost
FROM llm_requests
GROUP BY prompt_template
ORDER BY total_cost DESC
LIMIT 10;

Step 3: Optimize Top Offenders

Focus on the prompts with highest total cost (volume × cost per request).

Step 4: A/B Test Changes

import random

def get_prompt(variant: str, context: dict) -> str:
    if variant == "control":
        return verbose_prompt.format(**context)    # existing template
    return optimized_prompt.format(**context)      # candidate template

# Randomly assign a variant and track cost and quality for each
variant = random.choice(["control", "optimized"])
prompt = get_prompt(variant, context)
response = llm.complete(prompt)

# cost and quality_score come from your own billing and eval pipeline
metrics.track("llm_cost", cost, tags={"variant": variant})
metrics.track("response_quality", quality_score, tags={"variant": variant})

Step 5: Monitor Quality

Ensure optimizations don't degrade output:

def evaluate_response(response: str, expected: str) -> float:
    # Placeholder: score semantic similarity, factual accuracy, etc.
    # with your preferred eval method and return a value in [0, 1]
    ...

# Alert if quality drops below your agreed threshold
if quality_score < threshold:
    alert("Response quality degraded after prompt optimization")

Token-Saving Checklist

Before deploying any prompt, verify:

  • Removed filler words and pleasantries
  • Eliminated redundant instructions
  • System prompt under 200 tokens
  • Using structured output format
  • Minimal or no few-shot examples
  • Context limited to essential information
  • Output length explicitly constrained
  • Caching implemented for common queries
  • Batching used where possible

Real-World Results

Here's what companies typically see after systematic optimization:

| Company Type | Before | After | Savings |
| --- | --- | --- | --- |
| SaaS Chatbot | $8,000/mo | $2,400/mo | 70% |
| Document Processing | $15,000/mo | $4,500/mo | 70% |
| Code Assistant | $5,000/mo | $1,750/mo | 65% |
| Customer Support | $12,000/mo | $3,000/mo | 75% |

Conclusion

Token optimization isn't about writing the shortest possible prompts—it's about eliminating waste while preserving the information the model needs. The techniques in this guide can cut your LLM costs by 50-80% without sacrificing output quality.

Start by measuring your current usage, identify the highest-cost prompts, and systematically apply these principles. Use our LLM Token Counter to compare prompt variations before deploying to production.

Remember: the best prompt is one that achieves your goal with the minimum necessary tokens.

Frequently Asked Questions

How much can prompt optimization reduce my token usage and costs?

Well-optimized prompts typically reduce token usage by 40-70% compared to naive implementations. Combined with output format optimization, total savings can reach 50-80%. For a company spending $10,000/month on LLM APIs, this translates to $5,000-$8,000 in monthly savings.
