Wrap your LLM SDK. KostAI records every call, scores it for waste
across nine categories, and — in shadow mode — runs a cheaper or
local path in parallel so you can see, per call, what each
optimized route would have saved. Nothing leaves the machine.
One-line wrappers for Anthropic, OpenAI, Google, Ollama, LM
Studio, and any OpenAI-compatible endpoint. Append-only JSONL
event store you can cat. Optional SQLite backend.
Optional Elasticsearch sink with an ECS-shaped document.
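"A file you can cat" is literal: one JSON object per line, appended and never rewritten. A minimal sketch of the pattern (the field names here are illustrative, not KostAI's actual schema):

```typescript
import { appendFileSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Illustrative event shape; the real store carries many more fields.
interface LlmEvent {
  ts: string;
  provider: string;
  model: string;
  cost_usd: number;
}

const store = join(tmpdir(), "kostai-demo.jsonl");

function record(event: LlmEvent): void {
  // Append-only: one JSON object per line, so `cat` and `jq` just work.
  appendFileSync(store, JSON.stringify(event) + "\n");
}

record({ ts: new Date().toISOString(), provider: "openai", model: "gpt-4o", cost_usd: 0.0123 });

const lines = readFileSync(store, "utf8").trim().split("\n");
console.log(JSON.parse(lines[lines.length - 1]).model); // "gpt-4o"
```

Append-only JSONL keeps writes crash-safe and makes every downstream consumer (SQLite import, Elasticsearch sink) a pure reader.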
Score waste, per call
Nine categories — oversized context, redundant history, over-long
outputs where a short one would do, over-model for the task
class, missed cache hits, and more. Every call gets an
efficiency score and an avoidable-cost estimate in USD.
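The arithmetic behind an avoidable-cost estimate is simple; a sketch under assumed names and a deliberately flat scoring rule (KostAI's actual model is more nuanced):

```typescript
// Illustrative: cost of context the task did not need, at input pricing.
function avoidableContextCostUsd(
  promptTokens: number,
  neededTokens: number, // estimated minimum context for the task
  usdPerMTokIn: number  // input price per million tokens
): number {
  const wasted = Math.max(0, promptTokens - neededTokens);
  return (wasted / 1_000_000) * usdPerMTokIn;
}

// Illustrative 0–100 score: what fraction of the context was needed.
function efficiencyScore(promptTokens: number, neededTokens: number): number {
  if (promptTokens === 0) return 100;
  return Math.round(100 * Math.min(1, neededTokens / promptTokens));
}

// A 40k-token prompt where ~8k would have done, at $3 / MTok input:
console.log(avoidableContextCostUsd(40_000, 8_000, 3)); // ≈ 0.096
console.log(efficiencyScore(40_000, 8_000));            // 20
```

Summed across a day of traffic, even sub-cent per-call waste like this is what the rollups surface.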
Shadow mode A/B
For a route you flag, run the frontier call and a
cheaper/local one in parallel. The user always sees the
frontier result. KostAI records both, grades them with a
quality evaluator, and shows you exactly which optimized path
would have saved money without quality regression.
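The shadow pattern itself is a few lines. A minimal sketch (the `callFrontier`/`callCheap`/`record` names are stand-ins, not KostAI's API):

```typescript
type Completion = { text: string; costUsd: number };

// The user always gets the frontier answer; the cheap path runs
// in parallel purely for comparison and offline grading.
async function shadow(
  prompt: string,
  callFrontier: (p: string) => Promise<Completion>,
  callCheap: (p: string) => Promise<Completion>,
  record: (frontier: Completion, cheap: Completion) => void
): Promise<Completion> {
  const [frontier, cheap] = await Promise.all([callFrontier(prompt), callCheap(prompt)]);
  record(frontier, cheap); // both answers logged; evaluator grades later
  return frontier;         // only the frontier result reaches the user
}

// Demo with fake providers:
shadow(
  "hi",
  async () => ({ text: "frontier answer", costUsd: 0.01 }),
  async () => ({ text: "cheap answer", costUsd: 0.001 }),
  (f, c) => console.log("saved if routed cheap:", (f.costUsd - c.costUsd).toFixed(3))
).then((r) => console.log(r.text)); // "frontier answer"
```

Because both calls run concurrently, shadow mode adds no user-visible latency; it only adds the cheap call's cost for the routes you flag.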
Route
Pure function. Given a call, it classifies the task, checks the
model, and emits one of four decisions — local sufficient,
cheaper API sufficient, frontier required, or
cache hit — with a USD-denominated savings estimate
per decision.
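A pure router over those four decisions can be sketched like this; the task classes, thresholds, and savings math are illustrative assumptions, not KostAI's actual policy:

```typescript
type Decision =
  | "local_sufficient"
  | "cheaper_api_sufficient"
  | "frontier_required"
  | "cache_hit";

interface CallInfo {
  task: "extraction" | "summarization" | "reasoning"; // assumed task classes
  promptHash: string;
  model: string;
  estCostUsd: number;
}

// Pure: same call + same cache view in, same decision out. No I/O.
function route(call: CallInfo, cache: Set<string>): { decision: Decision; savingsUsd: number } {
  if (cache.has(call.promptHash))
    return { decision: "cache_hit", savingsUsd: call.estCostUsd };
  if (call.task === "extraction")
    return { decision: "local_sufficient", savingsUsd: call.estCostUsd }; // local is ~free
  if (call.task === "summarization")
    return { decision: "cheaper_api_sufficient", savingsUsd: call.estCostUsd * 0.9 };
  return { decision: "frontier_required", savingsUsd: 0 };
}

const call: CallInfo = { task: "reasoning", promptHash: "xyz", model: "gpt-4o", estCostUsd: 0.02 };
console.log(route(call, new Set()).decision); // "frontier_required"
```

Keeping the router pure means you can replay yesterday's event log through a new policy and diff the savings before changing anything in production.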
MCP server
Twenty tools over both stdio (Claude Desktop / Claude Code)
and an HTTP+SSE bridge. Local↔frontier handoff, cheap-API
routing, local preprocessing, durable task queue with
exponential backoff.
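The queue's retry policy is plain exponential backoff; a generic sketch (the base delay, cap, and attempt count here are assumptions, not KostAI's shipped defaults):

```typescript
// Retry a flaky async task with exponentially growing delays.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseMs = 500
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err; // out of attempts: surface the error
      const delay = Math.min(30_000, baseMs * 2 ** attempt); // 500, 1000, 2000, ... capped
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Fails twice, then succeeds:
let tries = 0;
withBackoff(async () => {
  tries++;
  if (tries < 3) throw new Error("transient");
  return "ok";
}, 5, 1).then((v) => console.log(v, tries)); // "ok 3"
```

Pairing backoff with a durable queue is what lets a task survive a peer going offline: the task persists, the retries resume.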
Dashboard
Eight tabs on :3674 — Overview, Shadow Mode,
Router, Local LLMs, Bridge, Queue, Calls, Trends. Live-syncs
over SSE. /health, /ready,
/metrics (Prometheus) on the side.
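The /metrics endpoint speaks the Prometheus text exposition format. A reduced sketch of what serving it looks like; the metric names are illustrative, not KostAI's actual series:

```typescript
import { createServer } from "node:http";

// Render counters in Prometheus text format: `# TYPE` line, then samples.
function renderMetrics(totals: { calls: number; spendUsd: number; avoidableUsd: number }): string {
  return [
    "# TYPE kostai_calls_total counter",
    `kostai_calls_total ${totals.calls}`,
    "# TYPE kostai_spend_usd_total counter",
    `kostai_spend_usd_total ${totals.spendUsd}`,
    "# TYPE kostai_avoidable_usd_total counter",
    `kostai_avoidable_usd_total ${totals.avoidableUsd}`,
  ].join("\n") + "\n";
}

const server = createServer((req, res) => {
  if (req.url === "/metrics") {
    res.writeHead(200, { "content-type": "text/plain; version=0.0.4" });
    res.end(renderMetrics({ calls: 42, spendUsd: 1.23, avoidableUsd: 0.4 }));
  } else if (req.url === "/health") {
    res.writeHead(200);
    res.end("ok");
  } else {
    res.writeHead(404);
    res.end();
  }
});
// Demo only: bind an ephemeral port, then shut down immediately.
server.listen(0, () => server.close());
console.log(renderMetrics({ calls: 1, spendUsd: 0.01, avoidableUsd: 0 }).startsWith("# TYPE")); // true
```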
Prefer zero-code adoption? Run npx ai-cost proxy --mode observe
and set OPENAI_BASE_URL=http://localhost:4311/v1.
The nine waste categories
Oversized context
Redundant history
Over-long output
Over-model for task
Missed cache hit
Retry burn
Tool-call fan-out
System-prompt bloat
Unnecessary streaming
Every event carries llm.efficiency_score (0–100) and
llm.avoidable_context_cost_usd. Roll up by route,
model, app, or workflow. Ship to Elasticsearch and grade your
whole fleet.
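A rollup over the event stream is a short fold; the event shape below is a simplified stand-in for the full record:

```typescript
// Illustrative subset of the fields each event carries.
interface Event { model: string; cost_usd: number; avoidable_usd: number }

// Group spend and avoidable spend by model (swap the key for route/app/workflow).
function rollup(events: Event[]): Map<string, { spend: number; avoidable: number }> {
  const byModel = new Map<string, { spend: number; avoidable: number }>();
  for (const e of events) {
    const agg = byModel.get(e.model) ?? { spend: 0, avoidable: 0 };
    agg.spend += e.cost_usd;
    agg.avoidable += e.avoidable_usd;
    byModel.set(e.model, agg);
  }
  return byModel;
}

const events: Event[] = [
  { model: "gpt-4o", cost_usd: 0.02, avoidable_usd: 0.005 },
  { model: "gpt-4o", cost_usd: 0.03, avoidable_usd: 0.01 },
  { model: "haiku", cost_usd: 0.001, avoidable_usd: 0 },
];
console.log(rollup(events).get("gpt-4o")); // spend ≈ 0.05, avoidable ≈ 0.015
```

The same fold keyed on `route` or `labels.app` is how the dashboard's Trends tab and the Kibana panels are built.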
Two-machine bridge
Run a MacBook as your frontier node and a Mac Mini as a local-LLM
workhorse. HTTP + SSE transport with bearer-token auth, a
durable 24-hour task queue, exponential backoff, and a live
dashboard that syncs across both machines.
✓ Escalate — local node asks a frontier-role peer to run a prompt.
✓ Delegate — frontier asks a local peer; records the savings versus what the frontier call would have cost.
✓ Cheap-API route — defer to a Haiku-class peer before escalating.
✓ Tailscale auto-detect — prefers MagicDNS URLs when available, falls back to LAN.
Built for Elastic-shaped observability
KostAI ships an Elasticsearch sink out of the box. Events
flow through _bulk as ECS 8.11 documents with a
dedicated llm.* namespace — cost, tokens,
avoidable cost, efficiency, route, model, provider — plus
the standard event.category,
event.action, and labels.* you
already index.
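Concretely, one indexed document might look like the sketch below. The `llm.*` field names mirror the ones named here; the rest of the envelope is an illustrative ECS-style shape, not a verbatim KostAI document:

```typescript
// Hypothetical example of a single _bulk-indexed event document.
const doc = {
  "@timestamp": "2025-01-15T12:00:00.000Z",
  event: { category: ["web"], action: "llm-call" }, // assumed action name
  labels: { app: "support-bot" },
  llm: {
    provider: "anthropic",
    model: "claude-sonnet",
    route: "summarize-ticket",
    cost_usd: 0.012,
    avoidable_context_cost_usd: 0.004,
    efficiency_score: 67,
    tokens: { input: 4000, output: 350 },
  },
};
console.log(JSON.stringify(doc).includes("avoidable_context_cost_usd")); // true
```

Because the custom fields live under a single `llm.*` namespace, they coexist with existing ECS mappings without collisions.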
A starter Kibana dashboard (kibana/dashboards/ai-cost-overview.ndjson)
lands a five-panel overview in one import: total spend,
avoidable spend, average efficiency, spend by model over
time, top routes by spend.
PII redaction is on by default: email, US phone, SSN, IPv4/v6,
credit card, plus GitHub/Slack/GitLab tokens. Fail-soft buffered
sink — a network blip never loses events, never blocks a call.
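Redaction of this kind is ordered regex substitution before anything is written. A reduced sketch covering two of the listed PII classes (the shipped redactor covers all of them, with stricter patterns):

```typescript
// Illustrative rules: email and US SSN. Order matters if patterns overlap.
const RULES: Array<[RegExp, string]> = [
  [/[\w.+-]+@[\w-]+\.[\w.-]+/g, "[EMAIL]"],
  [/\b\d{3}-\d{2}-\d{4}\b/g, "[SSN]"],
];

// Apply every rule in sequence; unmatched text passes through untouched.
function redact(text: string): string {
  return RULES.reduce((t, [re, mask]) => t.replace(re, mask), text);
}

console.log(redact("mail alice@example.com, ssn 123-45-6789"));
// "mail [EMAIL], ssn [SSN]"
```

Running redaction before the buffered sink means raw PII never sits in the retry buffer, even transiently.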
Not a SaaS. Not a proxy you have to trust.
Everything runs on your machine. The JSONL store is a file you
can read. The dashboard listens on localhost. The Elasticsearch
sink is opt-in. KostAI is a library and a CLI — not a service
that charges per event.