Where AI infrastructure waste actually lives
Cost optimisation for AI workloads is different from traditional cloud optimisation. Rightsizing EC2 instances and deleting idle resources still matter, but they are a small fraction of the savings available. The larger opportunities are in model API usage patterns, retrieval efficiency, caching strategy and GPU utilisation — areas that traditional FinOps tooling does not surface.
This guide covers the five waste categories most common in enterprise AI infrastructure and the specific actions that address each one.
Category 1: Idle and over-provisioned GPU capacity
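GPU instances are the most expensive resource in a self-hosted AI stack. On-demand pricing varies by provider, region and instance size, but as a rough guide an A10G instance costs $1.50–3.50/hour and an A100 costs $3–8/hour. A cluster of these sitting at 20% utilisation means roughly 80% of that spend is buying idle capacity, a figure that shocks teams when they first audit it.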
Common causes:
- Model serving instances sized for peak load and left running at off-peak baseline
- Development and experimentation environments left running overnight and on weekends
- Multiple model versions running simultaneously when traffic has already migrated to the new version
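Fixes: implement GPU utilisation dashboards with 7-day trend views; scale serving capacity down during off-peak hours; set up auto-shutdown for dev instances on a nightly schedule; decommission old model versions once traffic has migrated; use spot/preemptible instances for batch inference jobs (typically a 50–70% cost reduction, provided jobs handle interruption with appropriate retries).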
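To put a number on idle capacity before building dashboards, a rough estimate can be derived from an hourly rate and a few utilisation samples. The sketch below is illustrative only: the function name `idle_gpu_cost`, the sample values and the $4/hour rate are assumptions, not measurements from any real cluster.

```python
from statistics import mean


def idle_gpu_cost(hourly_rate: float, utilisation_samples: list[float], hours: float) -> dict:
    """Estimate how much of a GPU bill paid for idle capacity.

    utilisation_samples are fractions in [0, 1], for example hourly
    averages exported from a monitoring system (hypothetical input).
    """
    if not utilisation_samples:
        raise ValueError("need at least one utilisation sample")
    avg_util = mean(utilisation_samples)
    total = hourly_rate * hours
    idle = total * (1.0 - avg_util)
    return {"avg_utilisation": avg_util, "total_cost": total, "idle_cost": idle}


# Assumed example: 4 GPUs at $4/hour each, running for one week at about 20% utilisation.
report = idle_gpu_cost(hourly_rate=4 * 4.0, utilisation_samples=[0.15, 0.25, 0.20], hours=24 * 7)
print(f"{report['idle_cost']:.2f} of {report['total_cost']:.2f} USD paid for idle capacity")
```

Running the same calculation per instance group, rather than per cluster, usually makes it clearer which of the causes above is responsible.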
Category 2: Redundant and unoptimised vector database queries
Many managed vector databases bill per query or per read unit. In a RAG system, it is easy to accumulate five to ten queries per user interaction when a single well-designed query would suffice. Common antipatterns:
- Running separate queries for each document type rather than using metadata filters on a single query
- Re-querying the vector store for context that was already retrieved earlier in the conversation
- Setting top-k too high (retrieving 20 chunks when 5 are needed) and paying for result processing that gets truncated before the LLM sees it
Every unnecessary vector query incurs two costs: the query fee itself and the LLM tokens spent processing retrieved content that did not need to be in the prompt.
Audit your retrieval code for these patterns. Add a query counter metric per user session. Implement a conversation-scoped retrieval cache so that context retrieved for question one is available for question two without re-querying.
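A conversation-scoped cache with a built-in query counter might look like the following minimal sketch. It assumes a hypothetical `search_fn(query, top_k)` wrapper around whatever vector store is in use, and it only catches repeated queries after case and whitespace normalisation; matching follow-up questions that are merely similar would need additional logic.

```python
from typing import Callable


class ConversationRetrievalCache:
    """Caches retrieval results for one conversation and counts real queries."""

    def __init__(self, search_fn: Callable[[str, int], list[str]]):
        self._search_fn = search_fn  # hypothetical wrapper around the vector store
        self._cache: dict[tuple[str, int], list[str]] = {}
        self.query_count = 0  # queries that actually reached the vector store

    def retrieve(self, query: str, top_k: int = 5) -> list[str]:
        # Normalise case and whitespace so trivial rephrasings share a key.
        key = (" ".join(query.lower().split()), top_k)
        if key not in self._cache:
            self.query_count += 1
            self._cache[key] = self._search_fn(query, top_k)
        return self._cache[key]


def fake_search(query: str, top_k: int) -> list[str]:
    # Stand-in for a real vector store call.
    return [f"chunk-{i} for {query!r}" for i in range(top_k)]


session = ConversationRetrievalCache(fake_search)
first = session.retrieve("What is the refund policy?")
second = session.retrieve("what is the  refund policy?")  # served from the cache
print(session.query_count)  # prints 1
```

Creating one instance per conversation keeps the cache scoped correctly, and `query_count` can be emitted as the per-session metric when the conversation ends.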
Category 3: Uncached model API calls
Model API calls are expensive and many of them are identical or near-identical. High-frequency patterns that should be cached:
- FAQ and support queries: A large percentage of user questions in support use cases are semantically identical even if phrased differently. Semantic caching (matching queries by embedding similarity rather than exact string match) can cache 30–50% of calls in these scenarios.
- Repeated document analysis: If the same document is analysed by multiple users or workflows, cache the analysis output and serve it without re-calling the model.
- Classification and routing steps: A model used to classify a query into a category before routing it to a specialised workflow is making an expensive API call for a simple task. Replace it with a fine-tuned smaller model or a lightweight classifier.
Implement Redis-based semantic caching with a configurable similarity threshold. Start with a threshold of 0.95 (high similarity required) and loosen it as you validate cache quality. Track cache hit rate as a metric and set a target — 20–40% hit rate is achievable for most support and FAQ use cases.
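The core lookup logic of a semantic cache can be sketched without any infrastructure. The example below is an in-memory stand-in, not a Redis implementation: it uses a plain list and a linear scan, and the two-dimensional vectors are toy values in place of real embeddings. The class name `SemanticCache` and its methods are assumptions made for illustration.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """In-memory semantic cache with a configurable similarity threshold."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self._entries: list[tuple[list[float], str]] = []
        self.hits = 0
        self.misses = 0

    def lookup(self, embedding: list[float]) -> Optional[str]:
        # Linear scan for the most similar cached query (fine for a sketch).
        best_score, best_response = 0.0, None
        for cached_embedding, response in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.threshold:
            self.hits += 1
            return best_response
        self.misses += 1
        return None

    def store(self, embedding: list[float], response: str) -> None:
        self._entries.append((embedding, response))

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = SemanticCache(threshold=0.95)
cache.store([1.0, 0.0], "cached answer")
near = cache.lookup([0.99, 0.05])  # nearly identical direction: a hit
far = cache.lookup([0.0, 1.0])     # orthogonal vector: a miss
print(near, far, cache.hit_rate)
```

In production the linear scan would be replaced by a vector index, and the threshold would be tuned against a sample of real queries, since a value that is too loose returns wrong answers rather than just missing the cache.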
Category 4: Oversized prompts and context windows
Input token cost scales linearly with prompt size. Teams often add context to prompts without removing old context, leading to system prompts and retrieved chunks that are far larger than needed.
- Audit average prompt token count per workflow monthly — it tends to grow as prompts are edited and never trimmed
- For retrieved context, use re-ranking to select the top 3–5 most relevant chunks rather than passing all retrieved results to the LLM
- Use smaller context windows for classification and routing steps; reserve large context for generation tasks that actually need it
- For long documents, summarise sections before including them in prompts rather than passing raw text
A 30% reduction in average prompt token count, achievable with a focused optimisation pass, translates directly to a 30% reduction in input-token spend. The reduction in total model API spend is somewhat smaller, because output tokens are billed separately, and output quality is unaffected as long as the removed tokens were not contributing to it.
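How much of the total bill a prompt trim saves depends on the split between input and output tokens. The arithmetic below works through one case; every number in it (request volume, token counts and per-million-token prices) is a made-up assumption chosen only to make the calculation easy to follow.

```python
def api_spend(requests: int, input_tokens: int, output_tokens: int,
              price_in_per_m: float, price_out_per_m: float) -> tuple[float, float]:
    """Return (input_cost, output_cost) in USD for a month of traffic."""
    input_cost = requests * input_tokens / 1_000_000 * price_in_per_m
    output_cost = requests * output_tokens / 1_000_000 * price_out_per_m
    return input_cost, output_cost


# Hypothetical workload: 1M requests per month, 2,000 prompt tokens and
# 300 output tokens per request, at assumed prices of $3 and $15 per 1M tokens.
before_in, before_out = api_spend(1_000_000, 2_000, 300, 3.0, 15.0)
# Same workload after trimming the average prompt by 30% (2,000 -> 1,400 tokens).
after_in, after_out = api_spend(1_000_000, 1_400, 300, 3.0, 15.0)

input_saving = 1 - after_in / before_in
total_saving = 1 - (after_in + after_out) / (before_in + before_out)
print(f"input spend down {input_saving:.0%}, total spend down {total_saving:.0%}")
```

The more a workflow is dominated by prompt tokens, as is typical for RAG and classification steps, the closer the total saving gets to the input saving.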
Category 5: Data egress and inter-service transfer
Data egress charges are the quiet cost that accumulates in the background. For AI workloads, the most common egress sources:
- Storage to compute across availability zones or regions (co-locate data and compute where possible)
- Cloud A to Cloud B transfers when data and model infrastructure are on different providers
- Internet egress for vector database queries when using a managed service in a different cloud than the application
- Log export to a third-party observability platform without compression or sampling
Audit your network topology against your cloud billing console. Look for egress line items and trace them to specific service pairs. In many cases, co-locating services in the same region or using VPC endpoints removes a large share of the egress cost.
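Tracing egress to service pairs is mostly an aggregation exercise once billing data is exported. The sketch below assumes a hypothetical export format in which each line item is a dict with `usage_type`, `source`, `destination` and `cost_usd` fields; real billing exports use different column names and usually need tagging or flow logs to identify the two endpoints.

```python
from collections import defaultdict


def egress_by_service_pair(line_items: list[dict]) -> list[tuple[tuple[str, str], float]]:
    """Sum egress cost per (source, destination) pair, most expensive first."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for item in line_items:
        if item["usage_type"] != "egress":
            continue  # ignore compute, storage and other non-transfer charges
        totals[(item["source"], item["destination"])] += item["cost_usd"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Made-up line items in the assumed export format.
rows = [
    {"usage_type": "egress", "source": "object-storage", "destination": "gpu-inference", "cost_usd": 420.0},
    {"usage_type": "egress", "source": "app", "destination": "vector-db", "cost_usd": 130.5},
    {"usage_type": "compute", "source": "gpu-inference", "destination": "", "cost_usd": 9000.0},
    {"usage_type": "egress", "source": "object-storage", "destination": "gpu-inference", "cost_usd": 80.0},
]

ranked = egress_by_service_pair(rows)
for (source, destination), cost in ranked:
    print(f"{source} -> {destination}: {cost:.2f} USD")
```

The pairs at the top of the ranking are the first candidates for co-location.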
Building an ongoing cost optimisation practice
One-time cost audits help, but sustainable cost management requires a continuous practice:
- Weekly cost review meeting (15 minutes) looking at spend by service and trend
- Monthly efficiency metric review: token cost per transaction, cache hit rate, GPU utilisation %
- Cost targets per workflow embedded in engineering team OKRs
- Architectural review checklist that includes a cost modelling step before any new workload goes to production
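The monthly efficiency metrics are simple ratios, which makes them easy to compute from numbers most teams already collect. The following sketch assumes a hypothetical `WorkflowStats` record per workflow; the field names and sample values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorkflowStats:
    model_spend_usd: float        # model API spend for the period
    transactions: int             # completed user transactions
    cache_hits: int
    cache_lookups: int
    gpu_busy_hours: float
    gpu_provisioned_hours: float


def efficiency_metrics(s: WorkflowStats) -> dict[str, float]:
    """Compute the three review metrics, returning 0.0 when a denominator is zero."""
    return {
        "cost_per_transaction_usd": s.model_spend_usd / s.transactions if s.transactions else 0.0,
        "cache_hit_rate": s.cache_hits / s.cache_lookups if s.cache_lookups else 0.0,
        "gpu_utilisation": s.gpu_busy_hours / s.gpu_provisioned_hours if s.gpu_provisioned_hours else 0.0,
    }


# Made-up monthly figures for a single workflow.
stats = WorkflowStats(1200.0, 40_000, 9_000, 30_000, 300.0, 720.0)
metrics = efficiency_metrics(stats)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Tracking these per workflow, rather than for the platform as a whole, is what makes per-workflow cost targets measurable.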
The teams that keep AI infrastructure costs under control are not the ones who run the biggest optimisation projects — they are the ones who treat cost as a continuous engineering metric alongside latency and reliability.