Show HN: CacheLens – Local-first cost tracking proxy for LLM APIs

github.com · stephenlthorn · 15 hours ago · view on HN · tool

I built CacheLens because I was burning through $200+/month on Claude API calls and had no idea where it was going.

It's a local HTTP proxy that sits between your app and the AI provider (Anthropic, OpenAI, Google). Every request flows through it, and it records token usage, cost, cache hit rates, latency — everything. Then there's a dashboard to visualize it all.
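Under the hood, the logging side is about as simple as it sounds: every proxied call becomes one row in SQLite, and the dashboard aggregates over that. A stripped-down sketch (column and function names here are illustrative, not the real schema):

```python
import sqlite3
import time

# In-memory DB for the example; the real thing is file-backed.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE calls (
        ts REAL, provider TEXT, model TEXT,
        input_tokens INTEGER, output_tokens INTEGER,
        cost_usd REAL, latency_ms REAL
    )
""")

def record_call(provider, model, input_tokens, output_tokens, cost_usd, latency_ms):
    # One row per proxied request; everything else is queries over this table.
    conn.execute(
        "INSERT INTO calls VALUES (?, ?, ?, ?, ?, ?, ?)",
        (time.time(), provider, model, input_tokens, output_tokens, cost_usd, latency_ms),
    )
    conn.commit()

record_call("anthropic", "claude-sonnet", 1200, 300, 0.0081, 950.0)
total_cost = conn.execute("SELECT SUM(cost_usd) FROM calls").fetchone()[0]
```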

What makes it different from just checking your provider dashboard:

- It's real-time (WebSocket live feed of every call as it happens)
- It works across all three major providers in one view
- It runs 100% locally — your prompts never leave your machine
- It has budget caps that actually block requests before you overspend
- It identifies optimization opportunities (cache misses, model downgrades, repeated prompts)

Tech stack: Python, FastAPI, SQLite, vanilla JS. No React, no build step, no external dependencies beyond pip. The whole thing is ~3K lines of Python.
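To give a feel for the budget caps: the check runs before the request is forwarded, so a blocked call never reaches the provider. Conceptually it's just this (the function and exception names are made up for the example; the real code hooks into the proxy path):

```python
class BudgetExceeded(Exception):
    pass

def check_budget(spent_usd: float, estimated_cost_usd: float, cap_usd: float) -> None:
    # Reject up front if the projected total would cross the cap.
    if spent_usd + estimated_cost_usd > cap_usd:
        raise BudgetExceeded(
            f"${spent_usd + estimated_cost_usd:.2f} would exceed cap ${cap_usd:.2f}"
        )

check_budget(spent_usd=4.90, estimated_cost_usd=0.05, cap_usd=5.00)  # fine, forwarded
blocked = False
try:
    check_budget(spent_usd=4.90, estimated_cost_usd=0.20, cap_usd=5.00)
except BudgetExceeded:
    blocked = True  # request never leaves the proxy
```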

Interesting technical decisions:

- The proxy captures streaming responses without buffering — it tees the byte stream so the client sees zero added latency
- Cost calculation uses a built-in pricing table with override support (providers change rates constantly)
- There's a Prometheus /metrics endpoint so you can plug it into existing monitoring
- Cacheability analysis uses diff-based detection across multiple API calls to identify what's actually static vs dynamic in your prompts

Limitations I'm honest about:

- The cacheability scorer is heuristic-based — solid for multi-call traces (~85% accurate), rougher for single prompts (~65%)
- Token counting uses cl100k_base for everything, which drifts ~10% for non-OpenAI models
- Three features (smart routing, scheduled reports, multi-user auth) are on the roadmap but not shipped yet

Would love feedback, especially from anyone managing LLM costs at scale.
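For anyone curious about the streaming tee mentioned under technical decisions: conceptually it's just a generator that forwards each chunk the moment it arrives while keeping a copy for usage/cost parsing afterwards. A heavily simplified version (the production code works on the raw SSE byte stream):

```python
def tee_stream(upstream, captured: list):
    # Forward chunks to the client as they arrive; no buffering delay.
    for chunk in upstream:
        captured.append(chunk)  # side copy, parsed for tokens/cost after the stream ends
        yield chunk

captured = []
fake_upstream = iter([b"data: {", b'"tokens": 42', b"}"])
client_view = list(tee_stream(fake_upstream, captured))
```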