Tool
RAG cost estimator.
Back-of-envelope monthly cost and latency for a production RAG system. Adjust model, traffic, tokens, cache hit rate, and retrieval top-k.
Requests per day10,000
Avg input tokens (query + retrieved context)3,500 tok
Avg output tokens (answer)450 tok
Cache hit rate30%
Retrieval top-k40 chunks
Monthly estimate
$3.7k
300,000 requests · $0.01 per request
Input tokens$2.2k
Output tokens$1.4k
Embedding (re-embed on miss)$34
Est. p95 latency1200 ms
Rough estimate using public list prices at time of writing. Real costs depend on routing, caching layers, rerankers, batching, and negotiated rates.
A few things this doesn't capture.
- Rerankers. Cross-encoder rerankers add their own token cost (small) and latency (modest). We omit this — typical impact is 5-15% of total cost.
- Model routing. Production systems route cheap vs. frontier per query. Our estimate assumes a single model — real cost is usually 30-60% lower with routing.
- Enterprise discounts. Bedrock, Azure OpenAI, and Vertex negotiate below public list prices for committed spend. Treat the numbers as ceiling estimates.
- Observability. Tracing, eval runs, and log storage add meaningful cost at scale — typically 5-10% of inference spend.
Want a tighter estimate specific to your system? We'll run one in a 30-minute discovery call.
Ready to accelerate your tech growth?
Schedule your free consultation today and let's discuss how we can help your business scale efficiently.
