LLM Observability · OpenTelemetry

LLM Observability: Beyond Logging Prompts and Responses

2025-03-22 · 8 min read · Nova — @NovaShips

Logging Is Not Observability

Most teams instrument their LLM applications by logging prompts and responses to a database. This feels like observability. It isn't.

True observability means you can answer any question about system behavior from your data. With logs, you can answer "what did the model say?" With real LLM observability, you can answer: why did session costs spike on Tuesday? Which agent is degrading? Which prompts produce the highest error rates?

The Dimensions That Matter

Cost attribution. Total cost is meaningless. You need cost per agent, per session, per operation, and per model. Only then can you optimize.

Token accounting. Input vs output tokens tracked separately. Output tokens cost more on most models. Prompt templates that add 200 tokens per call add up to thousands of dollars at scale.
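To make the template overhead concrete, here is a back-of-the-envelope sketch; the per-token price is an illustrative assumption, not any provider's published rate:

```python
# Illustrative assumption: $3 per million input tokens (not a real price list).
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000

def template_overhead_cost(extra_tokens_per_call: int, calls: int) -> float:
    """Cost added by a prompt template that pads every call with extra tokens."""
    return extra_tokens_per_call * calls * PRICE_PER_INPUT_TOKEN

# 200 extra tokens on 10 million calls: 2 billion tokens of pure padding.
cost = template_overhead_cost(200, 10_000_000)
print(f"${cost:,.0f}")
```

Two hundred tokens per call looks free in a single trace; only separate input-token accounting makes the aggregate visible.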

Latency percentiles. Average latency lies. p50, p95, p99 tell the real story. A p99 latency of 30 seconds means 1% of your requests take at least half a minute.
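A minimal nearest-rank percentile sketch shows why the average misleads; the sample latencies are invented for illustration:

```python
import math

def percentile(latencies: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    ordered = sorted(latencies)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Nine fast requests and one 30-second straggler (invented numbers).
samples = [0.4, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9, 1.1, 2.0, 30.0]
mean = sum(samples) / len(samples)  # ~3.75 s: looks merely slow
p50 = percentile(samples, 50)       # 0.7 s: the typical request
p99 = percentile(samples, 99)       # 30.0 s: the tail users actually feel
```

The mean sits at a latency no real request experienced; the percentiles describe both the typical case and the tail.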

Session coherence. Multi-turn conversations tracked as unified sessions. Cost and latency rolled up per session, not just per call.
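One way to get that roll-up, sketched here with invented record fields (session_id, cost, and latency are assumptions about your call log's schema):

```python
from collections import defaultdict

def rollup_sessions(calls: list[dict]) -> dict:
    """Aggregate per-call records into per-session cost, latency, and turn counts."""
    sessions: dict = defaultdict(lambda: {"cost": 0.0, "latency": 0.0, "turns": 0})
    for call in calls:
        session = sessions[call["session_id"]]
        session["cost"] += call["cost"]
        session["latency"] += call["latency"]
        session["turns"] += 1
    return dict(sessions)

calls = [
    {"session_id": "s1", "cost": 0.02, "latency": 1.2},
    {"session_id": "s1", "cost": 0.03, "latency": 0.8},
    {"session_id": "s2", "cost": 0.01, "latency": 0.5},
]
totals = rollup_sessions(calls)  # s1: two turns rolled into one record
```

The key for grouping is the session, not the request: a cheap agent that takes twelve turns to finish is more expensive than an expensive agent that finishes in two.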

PII exposure surface. Which sessions contained PII in inputs? Which outputs included data that shouldn't have been there? Queryable.

The Correlation Problem

The real value of LLM observability is correlation. When anomaly detection fires on a cost spike, you need to immediately drill into: which agent, which sessions, what changed. This requires:

  • Agent-level cost histograms
  • Session replay for any flagged session
  • Timeline view: cost over time overlaid with deploys and prompt changes

Without this correlation, MTTR for LLM incidents is measured in hours. With it, minutes.
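The first drill-down can be very simple: attribute cost inside the anomalous window to each agent. A sketch with invented records:

```python
from collections import Counter

def cost_by_agent(calls: list[dict], start: float, end: float) -> Counter:
    """Attribute cost within the window [start, end) to each agent."""
    totals: Counter = Counter()
    for call in calls:
        if start <= call["ts"] < end:
            totals[call["agent_id"]] += call["cost"]
    return totals

calls = [
    {"ts": 100, "agent_id": "support-bot", "cost": 0.02},
    {"ts": 105, "agent_id": "summarizer", "cost": 0.40},
    {"ts": 110, "agent_id": "summarizer", "cost": 0.35},
    {"ts": 300, "agent_id": "support-bot", "cost": 0.02},
]
spike = cost_by_agent(calls, 100, 200)
culprit, _ = spike.most_common(1)[0]  # the agent driving the spike
```

From there, the flagged agent's sessions in the same window are the candidates for replay.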

Instrumentation Patterns

Wrap, don't replace

The best instrumentation sits transparently around your existing LLM calls:

    # Before
    response = client.messages.create(model="claude-opus-4-5", ...)
    
    # After — zero logic change
    with shield.track_call(agent_id="my-agent") as call:
        response = client.messages.create(model="claude-opus-4-5", ...)
        call.record(response)
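shield.track_call belongs to the vendor SDK; if you are rolling your own, the same shape falls out of a plain context manager. A minimal sketch, where every name and record field is an illustrative assumption:

```python
import time
from contextlib import contextmanager

@contextmanager
def track_call(agent_id: str, sink=print):
    """Time one LLM call and emit a single structured record to `sink`."""
    record = {"agent_id": agent_id}

    class _Call:
        def record(self, response):
            # Usage field names vary by SDK, so read them defensively.
            usage = getattr(response, "usage", None)
            if usage is not None:
                record["input_tokens"] = getattr(usage, "input_tokens", None)
                record["output_tokens"] = getattr(usage, "output_tokens", None)

    start = time.perf_counter()
    try:
        yield _Call()
    finally:
        record["latency_s"] = time.perf_counter() - start
        sink(record)
```

Because emission happens in the finally block, failed calls get recorded too, and failures are exactly the calls you most want in your data.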

Semantic conventions

Adopt OpenTelemetry-compatible attribute names from the start: gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.model. This makes your data portable and queryable with standard tooling.
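Since a span's attributes are just a string-keyed map, adopting the convention can start as a single mapping function using the gen_ai.* names above; the response shape below mirrors a typical provider SDK and is an assumption:

```python
from types import SimpleNamespace

def genai_span_attributes(response) -> dict:
    """Map a provider response onto OpenTelemetry GenAI semantic-convention keys."""
    return {
        "gen_ai.response.model": response.model,
        "gen_ai.usage.input_tokens": response.usage.input_tokens,
        "gen_ai.usage.output_tokens": response.usage.output_tokens,
    }

# Fake response for illustration; real SDK objects expose similar fields.
fake = SimpleNamespace(
    model="claude-opus-4-5",
    usage=SimpleNamespace(input_tokens=120, output_tokens=40),
)
attrs = genai_span_attributes(fake)
```

Whether these land on a tracing span or a structured log line, any OpenTelemetry-aware backend can then query them by the same keys.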

What Good Looks Like

A mature LLM observability setup gives you:

  • Real-time cost dashboard with per-agent breakdown
  • Alert within 60 seconds of an anomalous session
  • Full replay of any session within 90 days
  • PII exposure report on demand
  • Weekly cost trend per model

Logging prompts to Postgres is not this. Build toward it.
