LLM Observability: Beyond Logging Prompts and Responses
Logging Is Not Observability
Most teams instrument their LLM applications by logging prompts and responses to a database. This feels like observability. It isn't.
True observability means you can answer any question about system behavior from your data. With logs, you can answer "what did the model say?" With real LLM observability, you can answer: why did session costs spike on Tuesday? Which agent is degrading? Which prompts produce the highest error rates?
The Dimensions That Matter
Cost attribution. Total cost is meaningless. You need cost per agent, per session, per operation, and per model. Only then can you optimize.
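In practice, cost attribution reduces to a group-by over per-call records. A minimal sketch, where the record fields and dollar figures are illustrative assumptions, not a real schema:

```python
from collections import defaultdict

# Hypothetical per-call records; fields and costs are illustrative.
calls = [
    {"agent": "support-bot", "model": "claude-opus-4-5", "op": "summarize", "cost": 0.012},
    {"agent": "support-bot", "model": "claude-opus-4-5", "op": "reply", "cost": 0.034},
    {"agent": "triage", "model": "claude-opus-4-5", "op": "classify", "cost": 0.002},
]

def cost_by(calls, key):
    """Roll total cost up along one attribution dimension."""
    totals = defaultdict(float)
    for c in calls:
        totals[c[key]] += c["cost"]
    return dict(totals)

print({k: round(v, 3) for k, v in cost_by(calls, "agent").items()})
# {'support-bot': 0.046, 'triade' is wrong; actual: 'triage': 0.002}
```

The same function answers per-model and per-operation questions by changing the key, which is exactly why calls need all of those fields stamped on them at capture time.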
Token accounting. Input vs output tokens tracked separately. Output tokens cost more on most models. Prompt templates that add 200 tokens per call add up to thousands of dollars at scale.
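The arithmetic behind that claim, with assumed call volume and an assumed input-token price:

```python
# Illustrative only: volume and price are assumptions, not real rates.
extra_tokens_per_call = 200
calls_per_day = 500_000
input_price_per_mtok = 3.00  # dollars per million input tokens, assumed

daily_cost = extra_tokens_per_call * calls_per_day / 1_000_000 * input_price_per_mtok
annual_cost = daily_cost * 365
print(f"${daily_cost:.2f}/day, ${annual_cost:,.0f}/year")  # $300.00/day, $109,500/year
```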
Latency percentiles. Average latency lies. p50, p95, p99 tell the real story. A p99 latency of 30 seconds means 1% of your users are waiting half a minute.
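A nearest-rank percentile over raw latency samples is enough to see how much the average hides. The sample values below are made up:

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest value covering p% of samples."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

latencies_ms = [120, 180, 200, 210, 250, 300, 340, 900, 4200, 31000]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

With these ten samples the mean is 3,770 ms, but p50 is 250 ms: the tail, not the average, is what your slowest users actually feel.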
Session coherence. Multi-turn conversations tracked as unified sessions. Cost and latency rolled up per session, not just per call.
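Session rollup is another group-by, this time keyed on a session ID that every call must carry. The record shape here is an assumption:

```python
from collections import defaultdict

# Hypothetical per-call records carrying a session_id.
calls = [
    {"session": "s1", "cost": 0.01, "latency_ms": 800},
    {"session": "s1", "cost": 0.03, "latency_ms": 1200},
    {"session": "s2", "cost": 0.02, "latency_ms": 600},
]

sessions = defaultdict(lambda: {"calls": 0, "cost": 0.0, "latency_ms": 0})
for c in calls:
    s = sessions[c["session"]]
    s["calls"] += 1
    s["cost"] += c["cost"]
    s["latency_ms"] += c["latency_ms"]

s1 = sessions["s1"]
print(f"s1: {s1['calls']} calls, ${s1['cost']:.2f}, {s1['latency_ms']} ms total")
```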
PII exposure surface. Which sessions contained PII in inputs? Which outputs included data that shouldn't have been there? Queryable.
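Making that surface queryable starts with tagging sessions at ingest. A toy detector with deliberately simplistic patterns (real PII detection needs far more than two regexes):

```python
import re

# Toy patterns: email addresses and US-style SSNs. Illustrative, not production-grade.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def pii_kinds(text):
    """Return the sorted list of PII categories found in a string."""
    return sorted(k for k, rx in PII_PATTERNS.items() if rx.search(text))

print(pii_kinds("Contact jane@example.com, SSN 123-45-6789"))  # ['email', 'ssn']
print(pii_kinds("No PII here"))  # []
```

Store the resulting tags alongside each session and "which sessions contained PII?" becomes a simple filter instead of a re-scan of raw logs.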
The Correlation Problem
The real value of LLM observability is correlation. When anomaly detection fires on a cost spike, you need to immediately drill into which agent, which sessions, and what changed. This requires shared identifiers (agent ID, session ID, timestamp) stamped on every cost, latency, and anomaly event so they join cleanly.
Without this correlation, MTTR for LLM incidents is measured in hours. With it, minutes.
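That drill-down is only possible when every event carries those identifiers. A sketch with hypothetical event records:

```python
from collections import defaultdict

# Hypothetical events; every one carries agent, session, and timestamp.
events = [
    {"ts": 100, "agent": "retriever", "session": "s1", "cost": 0.01},
    {"ts": 105, "agent": "writer",    "session": "s2", "cost": 0.40},
    {"ts": 107, "agent": "writer",    "session": "s3", "cost": 0.55},
]

def drill_down(events, start, end):
    """Within an anomaly window, rank agents by cost and list their sessions."""
    by_agent = defaultdict(lambda: {"cost": 0.0, "sessions": set()})
    for e in events:
        if start <= e["ts"] <= end:
            by_agent[e["agent"]]["cost"] += e["cost"]
            by_agent[e["agent"]]["sessions"].add(e["session"])
    return sorted(by_agent.items(), key=lambda kv: kv[1]["cost"], reverse=True)

top_agent, detail = drill_down(events, 104, 110)[0]
print(top_agent, sorted(detail["sessions"]))  # writer ['s2', 's3']
```

One query goes from "cost spiked in this window" to "this agent, these sessions"; without the shared identifiers, the same answer means grepping logs by hand.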
Instrumentation Patterns
Wrap, don't replace
The best instrumentation sits transparently around your existing LLM calls:
```python
# Before
response = client.messages.create(model="claude-opus-4-5", ...)

# After — zero logic change
with shield.track_call(agent_id="my-agent") as call:
    response = client.messages.create(model="claude-opus-4-5", ...)
    call.record(response)
```

Semantic conventions
Adopt OpenTelemetry-compatible attribute names from the start: gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.model. This makes your data portable and queryable with standard tooling.
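A toy tracker showing those attribute names in use. The wrapper itself is illustrative, not AgentShield's real implementation, and a production setup would emit spans through the OpenTelemetry SDK rather than collect dicts:

```python
import time
from contextlib import contextmanager

@contextmanager
def track_call(agent_id, spans):
    """Toy tracker: times the wrapped call and stores OTel-named attributes."""
    span = {"agent.id": agent_id}
    start = time.monotonic()
    try:
        yield span
    finally:
        span["duration_ms"] = (time.monotonic() - start) * 1000
        spans.append(span)

spans = []
with track_call("my-agent", spans) as span:
    # client.messages.create(...) would run here; usage numbers below are made up.
    span["gen_ai.response.model"] = "claude-opus-4-5"
    span["gen_ai.usage.input_tokens"] = 1120
    span["gen_ai.usage.output_tokens"] = 356

print(spans[0]["gen_ai.response.model"])  # claude-opus-4-5
```

Because the keys match the convention, any OTel-aware backend can aggregate these spans without custom mapping.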
What Good Looks Like
A mature LLM observability setup gives you per-agent and per-session cost attribution, separate input and output token accounting, latency percentiles, a queryable PII surface, and drill-down from any anomaly to the agent and sessions behind it.
Logging prompts to Postgres is not this. Build toward it.
Ready to monitor your AI agents?
Set up AgentShield in 5 minutes. Free plan available.
Start for Free →