GenAI Observability: What to Measure When Your Product Uses LLMs | Endurance Consulting | Fractional CTO & Platform Engineering Leadership

After two quarters embedded with a Fortune 10 company’s Applied Machine Learning teams instrumenting GenAI workloads, I’ve seen the core mistake: most organizations instrument LLM applications like they’re REST APIs. They’re not.

Traditional observability (latency, error rate, throughput) tells you that something broke. GenAI observability tells you why your AI is failing to deliver value.

What Makes GenAI Different

When your API returns a 500 error, that’s unambiguous. When your LLM returns a 200 with a fluent, confident answer, the status code tells you nothing about whether that answer is:

Correct
Hallucinated
Off-topic
Biased
Too expensive
Too slow for the user’s patience

You need different instrumentation.

The Four Layers of GenAI Observability

Layer 1: Infrastructure Metrics (Table Stakes)

These are your traditional observability signals, adapted for LLM workloads:

Latency Metrics:

TTFT (Time to First Token): How long before the user sees something? This determines perceived performance.
Tokens per Second: Throughput rate during generation. Affects user patience.
Total Request Duration: End-to-end latency including prompt processing.
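Here is a minimal sketch of how the three latency metrics above can be measured client-side around a streaming response. The names measure_stream and StreamTiming are hypothetical, and the sketch assumes each streamed chunk is one token, which is only an approximation for real providers.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamTiming:
    ttft_s: Optional[float]        # None if the stream produced nothing
    total_s: float                 # end-to-end request duration
    output_tokens: int             # assumption: one chunk == one token
    tokens_per_s: Optional[float]  # generation rate after the first token


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Consume a stream of chunks and record when they arrived."""
    start = clock()
    first = None
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now  # time to first token is measured here
        count += 1
    end = clock()

    ttft = None if first is None else first - start
    rate = None
    if first is not None and count > 1 and end > first:
        # Exclude the first token so prompt processing time
        # does not drag down the generation rate.
        rate = (count - 1) / (end - first)
    return StreamTiming(ttft, end - start, count, rate)
```

The injectable clock is a design choice for testability; in production you would leave the default and pass the provider's streaming iterator as chunks.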

Cost Metrics:

Cost per Request: Input tokens × input price + output tokens × output price (the two rates usually differ). Track cached and uncached input separately, since prompt caching charges cached tokens at a fraction of the rate.
Cost per User Session: Aggregated across multi-turn conversations
Cost per Feature: Which parts of your product are burning money?
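The cost arithmetic above is simple enough to sketch. The Pricing fields and the idea of quoting prices per million tokens are assumptions for illustration; plug in your provider's actual rates.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Pricing:
    # Hypothetical prices in dollars per million tokens.
    input_per_mtok: float
    cached_input_per_mtok: float
    output_per_mtok: float


def request_cost(p: Pricing, uncached_input: int, cached_input: int, output: int) -> float:
    """Dollar cost of one request, with cached input billed at its own rate."""
    total = (
        uncached_input * p.input_per_mtok
        + cached_input * p.cached_input_per_mtok
        + output * p.output_per_mtok
    )
    return total / 1_000_000


def cost_by_feature(costs: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum per-request costs under the feature tag attached to each request."""
    totals: Dict[str, float] = defaultdict(float)
    for feature, cost in costs:
        totals[feature] += cost
    return dict(totals)
```

The same aggregation keyed by session ID instead of feature gives cost per user session.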

Throughput:

Requests per Second: Standard, but important for capacity planning
Concurrent Requests: How many LLM calls are in-flight?
Queue Depth: Are you throttling before you hit provider rate limits?
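Concurrent requests and queue depth can both be read off the same client-side limiter. This is a rough sketch using only the standard library; InFlightGauge is a hypothetical name, and a real service would export these counters to its metrics backend rather than read them directly.

```python
import threading
from contextlib import contextmanager


class InFlightGauge:
    """Caps concurrent LLM calls and exposes in-flight and queued counts."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.Semaphore(max_concurrent)
        self._lock = threading.Lock()
        self.in_flight = 0  # calls currently holding a slot
        self.queued = 0     # calls waiting for a slot (queue depth)

    @contextmanager
    def slot(self):
        with self._lock:
            self.queued += 1
        self._slots.acquire()  # blocks while max_concurrent calls are running
        with self._lock:
            self.queued -= 1
            self.in_flight += 1
        try:
            yield
        finally:
            with self._lock:
                self.in_flight -= 1
            self._slots.release()
```

Wrapping each provider call in `with gauge.slot():` throttles on your side of the wire, so a rising queued value warns you before the provider starts returning rate-limit errors.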

Layer 2: LLM-Specific Signals

This is where GenAI observability diverges from traditional monitoring:

Token Metrics:

Input Token Count: How much context are you sending? Larger = slower + more expensive
Output Token Count: How verbose is your model? Can affect UX and cost
Cache Hit Rate (if using prompt caching): Are you paying for redundant processing?

Model Behavior:

Temperature: Are you using consistent sampling parameters?
Model Version: Track which model version generated each response (for A/B testing and rollback)
Retry Count: How often are you retrying failed requests?
Fallback Triggers: When did you fall back from your primary model to a cheaper or backup one, and why?
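Retry counts and fallback triggers are easiest to capture in the wrapper that performs them. The sketch below assumes a hypothetical call(model) function that raises on failure; it omits backoff and error classification, which real code would need.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class CallRecord:
    model: Optional[str] = None            # model that finally answered
    failed_attempts: int = 0               # total failures across all models
    fallback_reason: Optional[str] = None  # why we left the previous model


def call_with_fallback(
    models: Sequence[str],
    call: Callable[[str], str],
    max_attempts: int = 2,
) -> Tuple[str, CallRecord]:
    """Try each model in order, recording retries and the fallback cause."""
    record = CallRecord()
    last_error: Optional[Exception] = None
    for model in models:
        for _ in range(max_attempts):
            try:
                text = call(model)
            except Exception as exc:  # sketch only: narrow this in real code
                last_error = exc
                record.failed_attempts += 1
                continue
            record.model = model
            return text, record
        # This model used up its attempts; note why we are moving on.
        record.fallback_reason = f"{model}: {type(last_error).__name__}"
    raise RuntimeError("all models failed") from last_error
```

Attach the returned CallRecord to the request's log line or span so fallbacks show up next to latency and cost.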

Rate Limiting:

Rate Limit Hits: How often are you throttled by your provider?
Quota Exhaustion: Are you hitting daily/monthly spending caps?

Layer 3: Quality Signals

Infrastructure can be perfect while your AI delivers garbage. You need quality metrics:

Response Quality (Automated):

Toxicity Score: Are you generating harmful content?
Relevance Score: Does the response match the prompt intent?
Hallucination Detection: Is the model making things up? (This is hard; more below)
PII Leakage: Are you exposing sensitive data without realizing it?
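Of the automated signals above, PII leakage is the one you can approximate with no model at all. This is a deliberately crude sketch: the two regex patterns are illustrative assumptions and will miss plenty, so treat the counts as a floor, not a guarantee.

```python
import re
from typing import Dict, Tuple

# Illustrative patterns only; real PII detection needs far broader coverage.
_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> Tuple[str, Dict[str, int]]:
    """Replace matches with a label and count how many of each were found."""
    counts: Dict[str, int] = {}
    for label, pattern in _PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        counts[label] = n
    return text, counts
```

Running this on both prompts and completions before they are logged gives you a leakage metric (the counts) and a safer payload to store.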

Response Quality (User Feedback):

Thumbs Up/Down Ratios: The simplest signal
User Edits: Did the user have to fix the output?
Regeneration Rate: Did the user regenerate the response? (Distinct from the request-level Retry Count in Layer 2.)
Abandonment: Did they give up and close the feature?
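The user signals above reduce to a handful of ratios over an event stream. The event names here are hypothetical; use whatever your product analytics already emits.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def feedback_rates(events: Iterable[str]) -> Dict[str, Optional[float]]:
    """Turn raw UI events into per-response quality ratios."""
    counts = Counter(events)
    shown = counts["response_shown"]
    if shown == 0:
        raise ValueError("no responses were shown")
    votes = counts["thumbs_up"] + counts["thumbs_down"]
    return {
        # None when nobody voted, so a dashboard can show "no data".
        "thumbs_up_ratio": counts["thumbs_up"] / votes if votes else None,
        "edit_rate": counts["edited"] / shown,
        "regeneration_rate": counts["regenerated"] / shown,
        "abandonment_rate": counts["abandoned"] / shown,
    }
```

Explicit votes are sparse in practice, which is why the implicit rates (edits, regenerations, abandonment) are normalized by responses shown and not by votes.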

Prompt Engineering Effectiveness:

Prompt Version: Track which prompt template was used
Few-Shot Example Count: How many examples are you including?
RAG Context Size: How much retrieved context are you injecting?
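One lightweight way to get a prompt version is to hash the template text, so any edit produces a new identifier automatically. This is a sketch under that assumption; the PromptTemplate class and the prompt.* attribute names are hypothetical, not part of any standard.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Sequence, Union


@dataclass(frozen=True)
class PromptTemplate:
    name: str
    text: str

    @property
    def version(self) -> str:
        # Content hash: the same text always maps to the same version.
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()[:12]


def prompt_metadata(
    template: PromptTemplate,
    few_shot_examples: Sequence[str],
    rag_chunks: Sequence[str],
) -> Dict[str, Union[str, int]]:
    """Fields to log alongside every LLM call that used this template."""
    return {
        "prompt.name": template.name,
        "prompt.version": template.version,
        "prompt.few_shot_count": len(few_shot_examples),
        "prompt.rag_context_chars": sum(len(c) for c in rag_chunks),
    }
```

Character counts stand in for RAG context size here; if you already have token counts from the provider response, log those instead.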

Layer 4: Business Impact

The reason you’re building with LLMs is business value. Measure it:

User Engagement:

Feature Adoption: Are users using your AI features at all?
Session Length: Does AI make users stick around longer?
Churn Impact: Do AI users churn less?

Conversion:

AI-Assisted Conversions: Did the LLM help close a sale?
Content Generation Volume: For content- or code-generation products, output volume is often a direct proxy for revenue

Cost-Benefit:

Revenue per Dollar Spent on LLMs: Your AI P&L
Cost per Value Delivered: What are the unit economics?

What You Can Instrument Today vs. What’s Hard

Easy Wins (Implement These First)

TTFT and Tokens/s: Stream responses and timestamp the first and last chunks client-side. Most providers also return token counts with the response. Log both.
Cost Tracking: Token counts × pricing. Track by user, by feature, by model.
Model Version & Parameters: Log which model you called and with what settings.
User Feedback: Add thumbs up/down buttons. You’d be shocked how few products do this.

Medium Difficulty

Prompt Versioning: Treat prompts like code. Version them, deploy them, track which version served each request.
RAG Observability: If you’re doing retrieval, log what you retrieved, how relevant it was, and whether it made it into the final response.
Trace Context: Use OpenTelemetry to connect your LLM call to the upstream request. When a user complains, you can trace back through your entire stack.

Hard Problems

Hallucination Detection: There’s no silver bullet. You need:
- Fact-checking against known sources (expensive)
- Consistency checks across multiple generations (slow)
- Human labeling of a sample (doesn’t scale)
Semantic Quality: “Is this response helpful?” is subjective. You’ll need:
- LLM-as-judge (use a second model to grade the first; yes, really)
- Human eval loops (label a sample, train a classifier)
- User behavior proxies (did they edit it? regenerate? abandon?)
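The consistency check listed under hallucination detection can be sketched in a few lines: sample the same prompt several times and measure how much the answers agree. This version uses word-set overlap (Jaccard similarity) as a stand-in; that is my simplification, and a production check would more likely compare embeddings or use an entailment model.

```python
from itertools import combinations
from typing import Sequence, Set


def _tokens(text: str) -> Set[str]:
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two strings, from 0.0 to 1.0."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0  # two empty answers agree trivially
    return len(ta & tb) / len(ta | tb)


def consistency_score(generations: Sequence[str]) -> float:
    """Mean pairwise similarity across repeated generations of one prompt."""
    pairs = list(combinations(generations, 2))
    if not pairs:
        raise ValueError("need at least two generations")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A low score flags prompts where the model is unstable, which is a hint worth routing to human review and not proof of hallucination. It also multiplies your generation cost by the number of samples, which is why it belongs in the "slow" column.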

OpenLLMetry: The Standard That’s Emerging

OpenLLMetry is to LLM observability what OpenTelemetry is to traditional observability: a vendor-neutral instrumentation standard.

It’s built on top of OpenTelemetry and adds semantic conventions for:

LLM provider and model name/version
Token counts (prompt, completion, and cached)
Cost
Prompt and completion payloads (with PII redaction)

Worth knowing where this is heading: OpenLLMetry pioneered these conventions, and the OpenTelemetry GenAI working group is now standardizing the same ground upstream as gen_ai.* semantic conventions. Instrumenting on OTEL today keeps you aligned with where the ecosystem is converging, not locked to one vendor’s schema.
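To make the gen_ai.* conventions concrete, here is a sketch that builds the attribute set for one chat call as a plain dictionary. The attribute keys are taken from the OpenTelemetry GenAI semantic conventions as I understand them; those conventions are still evolving, so verify the names against the current spec. The function name and example values are hypothetical.

```python
from typing import Dict, Optional, Union

AttrValue = Union[str, int, float]


def gen_ai_attributes(
    request_model: str,
    response_model: str,
    input_tokens: int,
    output_tokens: int,
    temperature: Optional[float] = None,
) -> Dict[str, AttrValue]:
    """Span attributes for one LLM call, keyed by gen_ai.* names."""
    attrs: Dict[str, AttrValue] = {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    if temperature is not None:
        attrs["gen_ai.request.temperature"] = temperature
    return attrs
```

If you are using the OpenTelemetry Python API, a dictionary like this can be handed to a span's set_attributes method; an instrumentation SDK such as OpenLLMetry populates equivalent fields for you.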

If you’re starting from scratch, use OpenLLMetry. It gives you:

Vendor portability (swap observability backends without reinstrumenting)
Ecosystem compatibility (works with Datadog, Honeycomb, Grafana, etc.)
Future-proofing (as the space matures, tooling will standardize on OTEL)

Real-World Implementation: What I Built at Fortune 10 Scale

I can’t share specifics, but the architecture was:

OpenLLMetry SDK wrapping LLM provider calls
OTEL Collector for aggregation, sampling, and enrichment
Trace Backend (vendor withheld) for storage and visualization
Custom Dashboards showing:
- TTFT P50/P95/P99 by model and feature
- Cost burn rate and projections
- Token usage patterns (prompt size vs. output size)
- Model version distribution
Alerting on:
- TTFT degradation (user experience)
- Cost spikes (budget protection)
- Error rate increases (availability)
- Rate limit hits (capacity planning)
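The TTFT percentiles and the degradation alert from that setup can be illustrated with a small sketch. This is not the code we ran; it uses the nearest-rank percentile method, and the 1.5× tolerance is an arbitrary assumption you would tune.

```python
import math
from typing import Dict, Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def ttft_summary(samples: Sequence[float]) -> Dict[str, float]:
    """The three percentiles worth putting on a dashboard."""
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}


def ttft_degraded(
    samples: Sequence[float],
    baseline_p95: float,
    tolerance: float = 1.5,
) -> bool:
    """Alert when the current P95 exceeds the baseline by the tolerance factor."""
    return percentile(samples, 95) > baseline_p95 * tolerance
```

In practice your metrics backend computes percentiles from histograms; the point is that the alert compares a tail percentile to a baseline, grouped by model and feature, and does not watch the mean.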

Result: Engineering teams could diagnose production AI issues in minutes instead of days, and we caught a $50K/month cost leak from a prompt that was including full document context on every call.

Advice for Teams Starting Out

Start simple:

Log TTFT and cost for every LLM call
Add thumbs up/down feedback
Set a budget alert
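A budget alert does not need a platform. A minimal sketch, assuming you can query month-to-date spend: project it linearly to month end and compare against the budget. The function names are hypothetical and the linear projection ignores weekday and launch effects.

```python
import calendar
from datetime import date


def projected_month_spend(spend_to_date: float, today: date) -> float:
    """Linear projection of month-to-date spend to the end of the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def budget_alert(spend_to_date: float, monthly_budget: float, today: date) -> bool:
    """True when the current burn rate would exceed the monthly budget."""
    return projected_month_spend(spend_to_date, today) > monthly_budget
```

For example, $4,000 spent by April 10 projects to $12,000 for the month, which would trip an alert on a $10,000 budget well before the money is gone.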

Iterate toward quality:

Version your prompts
A/B test prompt variations
Sample and label responses for quality
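Both A/B testing and response sampling work well with deterministic hashing, so the same user always sees the same prompt variant and the same request is always in or out of the labeling sample. A sketch under that assumption; the function names and key formats are hypothetical.

```python
import hashlib
from typing import Sequence

_BUCKETS = 10_000


def _bucket(key: str) -> int:
    """Map a string to a stable bucket in [0, _BUCKETS)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % _BUCKETS


def assign_variant(user_id: str, experiment: str, variants: Sequence[str]) -> str:
    """Stable prompt-variant assignment for a user within one experiment."""
    return variants[_bucket(f"{experiment}:{user_id}") % len(variants)]


def sampled_for_labeling(request_id: str, rate: float) -> bool:
    """Select roughly `rate` of requests for human quality labeling."""
    return _bucket(f"label:{request_id}") < rate * _BUCKETS
```

Including the experiment name in the hash key keeps assignments independent across experiments. Log the chosen variant with the prompt version on every request, or the A/B results cannot be joined to quality signals later.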

Invest in infra when it pays:

If you’re spending $10K/month on LLMs, you can afford manual tracking
If you’re spending $100K/month, you need automated observability
If you’re spending $1M/month, you need a dedicated AI observability platform

The Tooling Landscape

Open Source:

OpenLLMetry: OTEL-native instrumentation SDK
Langfuse: Tracing, evals, and prompt management (self-hostable)
Phoenix (Arize): LLM tracing and evaluation

Commercial:

LangSmith: Tracing and eval platform from the LangChain team (works beyond LangChain)
Braintrust: Eval-focused, strong for LLM-as-judge and regression testing
Honeycomb: General-purpose tracing with LLM support
Datadog: APM expanding into LLM observability
Arize AI: Purpose-built for ML/LLM monitoring
Helicone: Cost tracking and prompt management

Avoid building from scratch unless you’re Netflix-scale. The tooling is maturing fast.

Bottom Line

GenAI observability is how you keep AI products from bleeding money or delivering garbage. Start with infrastructure metrics, add quality signals, then tie it to business impact.

And for the love of all that’s holy, instrument your prompts. If you don’t know which prompt template generated a bad response, you can’t fix it.

Building an LLM-powered product and not sure what to instrument? I’ve done this at Fortune 10 scale and scrappy startup scale. Let’s talk.