Tool

RAG cost estimator.

Back-of-envelope monthly cost and latency for a production RAG system. Adjust model, traffic, tokens, cache hit rate, and retrieval top-k.

Model

Requests per day10,000

Avg input tokens (query + retrieved context)3,500 tok

Avg output tokens (answer)450 tok

Cache hit rate30%

Retrieval top-k40 chunks

Monthly estimate

$3.7k

300,000 requests · $0.01 per request

Input tokens$2.2k

Output tokens$1.4k

Embedding (re-embed on miss)$34

Est. p95 latency1200 ms

Rough estimate using public list prices at time of writing. Real costs depend on routing, caching layers, rerankers, batching, and negotiated rates.

A few things this doesn't capture.

Rerankers. Cross-encoder rerankers add their own token cost (small) and latency (modest). We omit this — typical impact is 5-15% of total cost.
Model routing. Production systems route cheap vs. frontier per query. Our estimate assumes a single model — real cost is usually 30-60% lower with routing.
Enterprise discounts. Bedrock, Azure OpenAI, and Vertex negotiate below public list prices for committed spend. Treat the numbers as ceiling estimates.
Observability. Tracing, eval runs, and log storage add meaningful cost at scale — typically 5-10% of inference spend.

Want a tighter estimate specific to your system? We'll run one in a 30-minute discovery call.

Schedule your free consultation today and let's discuss how we can help your business scale efficiently.