GenAI Observability: What to Measure When Your Product Uses LLMs

After two quarters embedded with a Fortune 10 company’s Applied Machine Learning teams instrumenting GenAI workloads, I’ve seen the core mistake: most organizations instrument LLM applications like they’re REST APIs. They’re not.
Traditional observability (latency, error rate, throughput) tells you that something broke. GenAI observability tells you why your AI is failing to deliver value.
What Makes GenAI Different
When your API returns a 500 error, that’s unambiguous. When your LLM returns a response, you have no idea if it’s:
- Correct
- Hallucinated
- Off-topic
- Biased
- Too expensive
- Too slow for the user’s patience
You need different instrumentation.
The Four Layers of GenAI Observability
Layer 1: Infrastructure Metrics (Table Stakes)
These are your traditional observability signals, adapted for LLM workloads:
Latency Metrics:
- TTFT (Time to First Token): How long before the user sees something? This determines perceived performance.
- Tokens per Second: Throughput rate during generation. Affects user patience.
- Total Request Duration: End-to-end latency including prompt processing.
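A minimal sketch of measuring these three numbers around a streaming response. It assumes the client exposes the response as an iterable of text chunks; `StreamTiming` and `timed_stream` are hypothetical names, and chunks are only a proxy for tokens unless your provider streams exactly one token per chunk.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class StreamTiming:
    ttft_s: Optional[float] = None  # time to first chunk, in seconds
    total_s: float = 0.0            # end-to-end duration, in seconds
    output_chunks: int = 0          # chunks received (a proxy for tokens)

    @property
    def chunks_per_second(self) -> Optional[float]:
        """Generation rate measured after the first chunk arrived."""
        if self.ttft_s is None or self.output_chunks < 2:
            return None
        gen_time = self.total_s - self.ttft_s
        return (self.output_chunks - 1) / gen_time if gen_time > 0 else None


def timed_stream(
    chunks: Iterable[str],
    timing: StreamTiming,
    clock: Callable[[], float] = time.perf_counter,
    started_at: Optional[float] = None,
) -> Iterator[str]:
    """Yield chunks unchanged while recording timing into `timing`.

    Pass `started_at` (taken just before the request was sent) so that
    TTFT includes network and prompt-processing time.
    """
    start = clock() if started_at is None else started_at
    for chunk in chunks:
        now = clock()
        if timing.ttft_s is None:
            timing.ttft_s = now - start
        timing.output_chunks += 1
        timing.total_s = now - start
        yield chunk
    timing.total_s = clock() - start
```

Wrapping the stream rather than the whole call keeps the measurement on the path the user actually experiences: the UI can render each chunk as it arrives while the timing fills in alongside.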
Cost Metrics:
- Cost per Request: Input tokens × input price + output tokens × output price. Track cached and uncached input separately, since prompt caching bills cached tokens at a fraction of the uncached rate.
- Cost per User Session: Aggregated across multi-turn conversations
- Cost per Feature: Which parts of your product are burning money?
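The cost arithmetic above can be sketched as follows. The `Pricing` fields and the per-million-token unit are assumptions for illustration; the numbers in the usage note are placeholders, not any provider's real prices.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens (placeholder values, not real prices)."""
    input_per_mtok: float
    cached_input_per_mtok: float
    output_per_mtok: float


def request_cost(
    pricing: Pricing,
    uncached_input_tokens: int,
    cached_input_tokens: int,
    output_tokens: int,
) -> float:
    """Cost of one request in USD, with cached input billed at its own rate."""
    return (
        uncached_input_tokens * pricing.input_per_mtok
        + cached_input_tokens * pricing.cached_input_per_mtok
        + output_tokens * pricing.output_per_mtok
    ) / 1_000_000


def cost_by_feature(requests: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum per-request costs by the feature tag attached to each request."""
    totals: Dict[str, float] = defaultdict(float)
    for feature, cost in requests:
        totals[feature] += cost
    return dict(totals)
```

For example, with `Pricing(2.0, 0.5, 8.0)`, a request with 1,000 uncached input tokens, 4,000 cached input tokens, and 500 output tokens costs (2,000 + 2,000 + 4,000) / 1,000,000 = $0.008. Tagging every request with a feature name at call time is what makes the per-feature rollup possible later.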
Throughput:
- Requests per Second: Standard, but important for capacity planning
- Concurrent Requests: How many LLM calls are in-flight?
- Queue Depth: Are you throttling before you hit provider rate limits?
Layer 2: LLM-Specific Signals
This is where GenAI observability diverges from traditional monitoring:
Token Metrics:
- Input Token Count: How much context are you sending? Larger = slower + more expensive
- Output Token Count: How verbose is your model? Can affect UX and cost
- Cache Hit Rate (if using prompt caching): Are you paying for redundant processing?
Model Behavior:
- Temperature: Are you using consistent sampling parameters?
- Model Version: Track which model version generated each response (for A/B testing and rollback)
- Retry Count: How often are you retrying failed requests?
- Fallback Triggers: When did you fall back from your primary model to a cheaper or backup one, and why?
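One way to capture retry counts and fallback triggers is to record them in the same wrapper that performs the retries. This is a sketch under stated assumptions: `call_with_fallback` and `CallRecord` are hypothetical names, each model is represented by a plain callable, and a real client would catch narrower exception types and add backoff.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class CallRecord:
    model: Optional[str] = None            # model that finally answered
    failed_attempts: int = 0               # attempts that raised before success
    fallback_reason: Optional[str] = None  # why the most recent fallback happened


def call_with_fallback(
    prompt: str,
    models: Sequence[Tuple[str, Callable[[str], str]]],
    max_attempts: int = 2,
) -> Tuple[str, CallRecord]:
    """Try each model in order, retrying up to `max_attempts` times per model."""
    record = CallRecord()
    last_error: Optional[Exception] = None
    for index, (name, call) in enumerate(models):
        for _ in range(max_attempts):
            try:
                text = call(prompt)
            except Exception as exc:  # a real client would catch narrower errors
                record.failed_attempts += 1
                last_error = exc
            else:
                record.model = name
                return text, record
        # Every attempt on this model failed; note why we are moving on.
        if index + 1 < len(models):
            record.fallback_reason = f"{name}: {type(last_error).__name__}"
    raise RuntimeError("all models failed") from last_error
```

Returning the record alongside the response means the caller can attach it to the same log line or span as the token counts, so "which model actually answered, and after how many failures" is never a separate lookup.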
Rate Limiting:
- Rate Limit Hits: How often are you throttled by your provider?
- Quota Exhaustion: Are you hitting daily/monthly spending caps?
Layer 3: Quality Signals
Infrastructure can be perfect while your AI delivers garbage. You need quality metrics:
Response Quality (Automated):
- Toxicity Score: Are you generating harmful content?
- Relevance Score: Does the response match the prompt intent?
- Hallucination Detection: Is the model making things up? (This is hard; more below)
- PII Leakage: Are you exposing sensitive data without realizing it?
Response Quality (User Signals):
- Thumbs Up/Down Ratios: The simplest signal
- User Edits: Did the user have to fix the output?
- Retry Rate: Did the user regenerate the response?
- Abandonment: Did they give up and close the feature?
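These four signals reduce to simple rates once each response has an outcome record. A minimal sketch, assuming one `ResponseOutcome` per generated response (the class and field names are hypothetical):

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class ResponseOutcome:
    thumbs: Optional[bool] = None  # True = up, False = down, None = no vote
    edited: bool = False           # user changed the output before using it
    regenerated: bool = False      # user asked for another response
    abandoned: bool = False        # user left the feature without using the output


def summarize(outcomes: Iterable[ResponseOutcome]) -> Dict[str, Optional[float]]:
    """Aggregate outcome records into rates; None means no data for that rate."""
    items = list(outcomes)
    total = len(items)
    votes = [o.thumbs for o in items if o.thumbs is not None]

    def rate(count: int, denom: int) -> Optional[float]:
        return count / denom if denom else None

    return {
        "thumbs_up_ratio": rate(sum(votes), len(votes)),
        "vote_coverage": rate(len(votes), total),
        "edit_rate": rate(sum(o.edited for o in items), total),
        "regenerate_rate": rate(sum(o.regenerated for o in items), total),
        "abandon_rate": rate(sum(o.abandoned for o in items), total),
    }
```

Reporting vote coverage next to the thumbs ratio matters: a 90% thumbs-up ratio from 2% of responses says much less than the edit and regenerate rates, which are observed on every response.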
Prompt Engineering Effectiveness:
- Prompt Version: Track which prompt template was used
- Few-Shot Example Count: How many examples are you including?
- RAG Context Size: How much retrieved context are you injecting?
Layer 4: Business Impact
The reason you’re building with LLMs is business value. Measure it:
User Engagement:
- Feature Adoption: Are users using your AI features at all?
- Session Length: Does AI make users stick around longer?
- Churn Impact: Do AI users churn less?
Conversion:
- AI-Assisted Conversions: Did the LLM help close a sale?
- Content Generation Volume: For content- or code-generation products, output volume is often the closest available proxy for revenue
Cost-Benefit:
- Revenue per Dollar Spent on LLMs: Your AI P&L
- Cost per Value Delivered: What are the unit economics?
What You Can Instrument Today vs. What’s Hard
Easy Wins (Implement These First)
- TTFT and Tokens/s: With streaming enabled, both come down to a few timestamps around the response stream. Log them.
- Cost Tracking: Token counts × pricing. Track by user, by feature, by model.
- Model Version & Parameters: Log which model you called and with what settings.
- User Feedback: Add thumbs up/down buttons. You’d be shocked how few products do this.
Medium Difficulty
- Prompt Versioning: Treat prompts like code. Version them, deploy them, track which version served each request.
- RAG Observability: If you’re doing retrieval, log what you retrieved, how relevant it was, and whether it made it into the final response.
- Trace Context: Use OpenTelemetry to connect your LLM call to the upstream request. When a user complains, you can trace back through your entire stack.
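A sketch of what prompt versioning and RAG logging can look like at the call site. It assumes a content hash is an acceptable version id (a registry or git SHA would work equally well) and that retrieval returns `(doc_id, score)` pairs; `prompt_version` and `build_log_record` are hypothetical helpers.

```python
import hashlib
from typing import Dict, List, Tuple


def prompt_version(template: str) -> str:
    """Short content hash: any edit to the template yields a new version id."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


def build_log_record(
    prompt_name: str,
    template: str,
    variables: Dict[str, str],
    retrieved: List[Tuple[str, float]],
) -> dict:
    """Assemble the fields worth logging next to each LLM call."""
    return {
        "prompt_name": prompt_name,
        "prompt_version": prompt_version(template),
        "rendered_prompt_chars": len(template.format(**variables)),
        # Keep ids and scores rather than full text to limit log size and PII.
        "retrieved": [{"doc_id": d, "score": s} for d, s in retrieved],
    }
```

Hashing the template rather than the rendered prompt keeps the version stable across requests, so a bad response can be grouped with every other response the same template produced.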
Hard Problems
Hallucination Detection: There’s no silver bullet. You need:
- Fact-checking against known sources (expensive)
- Consistency checks across multiple generations (slow)
- Human labeling of a sample (doesn’t scale)
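To make the consistency-check idea concrete, here is a deliberately crude sketch: sample the same prompt several times and measure how much the answers agree. It uses word-set Jaccard overlap as the agreement measure, which is an assumption chosen for simplicity; production systems typically swap in embeddings or an entailment model, and the function names are hypothetical.

```python
import re
from itertools import combinations
from typing import Sequence, Set


def _words(text: str) -> Set[str]:
    """Lowercased alphanumeric words, ignoring punctuation and order."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two texts, from 0.0 (disjoint) to 1.0 (same)."""
    wa, wb = _words(a), _words(b)
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def consistency_score(generations: Sequence[str]) -> float:
    """Mean pairwise overlap across several generations of the same prompt."""
    pairs = list(combinations(generations, 2))
    if not pairs:
        raise ValueError("need at least two generations")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A low score does not prove a hallucination, and a high score does not rule one out (a model can be consistently wrong). It is a cheap triage signal for deciding which responses to send to the more expensive checks.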
Semantic Quality: “Is this response helpful?” is subjective. You’ll need:
- LLM-as-judge (use a second model to grade the first; yes, really)
- Human eval loops (label a sample, train a classifier)
- User behavior proxies (did they edit it? regenerate? abandon?)
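The LLM-as-judge pattern is mostly prompt plus parsing. A minimal sketch, assuming the judge model is reachable through a plain callable that takes a prompt and returns text; the template wording, the 1-5 scale, and the `SCORE:` reply format are all illustrative choices, not a standard.

```python
import re
from typing import Callable, Optional

JUDGE_TEMPLATE = (
    "Rate how helpful the answer is to the question on a 1-5 scale.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Explain briefly, then end with a line of the form 'SCORE: <n>'."
)


def parse_score(judge_reply: str) -> Optional[int]:
    """Extract the 1-5 score, or None if the judge did not follow the format."""
    match = re.search(r"SCORE:\s*([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None


def judge(question: str, answer: str, judge_model: Callable[[str], str]) -> Optional[int]:
    """Ask a second model to grade an answer; returns None on unparseable output."""
    prompt = JUDGE_TEMPLATE.format(question=question, answer=answer)
    return parse_score(judge_model(prompt))
```

Treating an unparseable reply as `None` rather than a default score keeps judge failures visible as their own metric instead of silently skewing the quality average.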
OpenLLMetry: The Standard That’s Emerging
OpenLLMetry is to LLM observability what OpenTelemetry is to traditional observability: a vendor-neutral instrumentation standard.
It’s built on top of OpenTelemetry and adds semantic conventions for:
- LLM provider and model name/version
- Token counts (prompt, completion, and cached)
- Cost
- Prompt and completion payloads (with PII redaction)
Worth knowing where this is heading: OpenLLMetry pioneered these conventions, and the OpenTelemetry GenAI working group is now standardizing the same ground upstream as gen_ai.* semantic conventions. Instrumenting on OTEL today keeps you aligned with where the ecosystem is converging, not locked to one vendor’s schema.
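For a sense of what the gen_ai.* conventions look like in practice, here is a sketch that builds a span attribute dictionary by hand. The attribute names below are taken from the OpenTelemetry GenAI semantic conventions as I understand them; those conventions are still evolving, so check the current spec before depending on any name. The helper itself is hypothetical, and in practice an instrumentation SDK sets these for you.

```python
from typing import Dict, Optional, Union

AttrValue = Union[str, int, float]


def genai_span_attributes(
    request_model: str,
    response_model: str,
    input_tokens: int,
    output_tokens: int,
    temperature: Optional[float] = None,
) -> Dict[str, AttrValue]:
    """Build span attributes for one chat-style LLM call."""
    attrs: Dict[str, AttrValue] = {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": request_model,    # what you asked for
        "gen_ai.response.model": response_model,  # what actually served it
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    if temperature is not None:
        attrs["gen_ai.request.temperature"] = temperature
    return attrs
```

A dictionary like this can be attached to an OpenTelemetry span with `span.set_attributes(...)`. Recording request and response model separately is what lets you spot silent model upgrades behind an alias.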
If you’re starting from scratch, use OpenLLMetry. It gives you:
- Vendor portability (swap observability backends without reinstrumenting)
- Ecosystem compatibility (works with Datadog, Honeycomb, Grafana, etc.)
- Future-proofing (as the space matures, tooling will standardize on OTEL)
Real-World Implementation: What I Built at Fortune 10 Scale
I can’t share specifics, but the architecture was:
- OpenLLMetry SDK wrapping LLM provider calls
- OTEL Collector for aggregation, sampling, and enrichment
- Trace Backend (vendor withheld) for storage and visualization
- Custom Dashboards showing:
  - TTFT P50/P95/P99 by model and feature
  - Cost burn rate and projections
  - Token usage patterns (prompt size vs. output size)
  - Model version distribution
- Alerting on:
  - TTFT degradation (user experience)
  - Cost spikes (budget protection)
  - Error rate increases (availability)
  - Rate limit hits (capacity planning)
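This is not the system described above, only an illustration of the percentile-and-threshold logic behind a TTFT dashboard and alert. It assumes raw TTFT samples in milliseconds are available in memory; a real backend computes percentiles from histograms, and the function names and thresholds here are hypothetical.

```python
import statistics
from typing import Dict, List, Sequence


def ttft_percentiles(samples_ms: Sequence[float]) -> Dict[str, float]:
    """P50/P95/P99 of TTFT samples, using linear interpolation between points."""
    # quantiles(n=100) returns the 99 cut points P1..P99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def breached(percentiles: Dict[str, float], thresholds_ms: Dict[str, float]) -> List[str]:
    """Names of the percentiles that exceed their alert threshold."""
    return sorted(
        name for name, limit in thresholds_ms.items() if percentiles[name] > limit
    )
```

Alerting on P95 or P99 rather than the mean is the point: TTFT regressions usually show up in the tail first, while the average still looks healthy.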
Result: Engineering teams could diagnose production AI issues in minutes instead of days, and we caught a $50K/month cost leak from a prompt that was including full document context on every call.
Advice for Teams Starting Out
Start simple:
- Log TTFT and cost for every LLM call
- Add thumbs up/down feedback
- Set a budget alert
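A budget alert can start as a straight-line projection of month-to-date spend. A minimal sketch, assuming you can query total LLM spend so far this month; the function names are hypothetical, and the linear extrapolation ignores weekday/weekend patterns and growth.

```python
import calendar
from datetime import date


def projected_month_spend(spend_to_date: float, today: date) -> float:
    """Extrapolate month-to-date spend linearly to the end of the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def over_budget(spend_to_date: float, today: date, monthly_budget: float) -> bool:
    """True if the current run rate would exceed the monthly budget."""
    return projected_month_spend(spend_to_date, today) > monthly_budget
```

For example, $4,000 spent by June 10 projects to 4,000 / 10 × 30 = $12,000 for the month, which trips a $10,000 budget well before the money is actually gone.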
Iterate toward quality:
- Version your prompts
- A/B test prompt variations
- Sample and label responses for quality
Invest in infra when it pays:
- If you’re spending $10K/month on LLMs, you can afford manual tracking
- If you’re spending $100K/month, you need automated observability
- If you’re spending $1M/month, you need a dedicated AI observability platform
The Tooling Landscape
Open Source:
- OpenLLMetry: OTEL-native instrumentation SDK
- Langfuse: Tracing, evals, and prompt management (self-hostable)
- Phoenix (Arize): LLM tracing and evaluation
Commercial:
- LangSmith: Tracing and eval platform from the LangChain team (works beyond LangChain)
- Braintrust: Eval-focused, strong for LLM-as-judge and regression testing
- Honeycomb: General-purpose tracing with LLM support
- Datadog: APM expanding into LLM observability
- Arize AI: Purpose-built for ML/LLM monitoring
- Helicone: Cost tracking and prompt management
Avoid building from scratch unless you’re Netflix-scale. The tooling is maturing fast.
Bottom Line
GenAI observability is how you keep AI products from bleeding money or delivering garbage. Start with infrastructure metrics, add quality signals, then tie it to business impact.
And for the love of all that’s holy, instrument your prompts. If you don’t know which prompt template generated a bad response, you can’t fix it.
Building an LLM-powered product and not sure what to instrument? I’ve done this at Fortune 10 scale and scrappy startup scale. Let’s talk.