<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>AI | Endurance Consulting | Fractional CTO &amp; Platform Engineering Leadership</title><link>https://enduranceconsulting.com/tags/ai/</link><atom:link href="https://enduranceconsulting.com/tags/ai/index.xml" rel="self" type="application/rss+xml"/><description>AI</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Tue, 28 Apr 2026 00:00:00 +0000</lastBuildDate><image><url>https://enduranceconsulting.com/media/logo_hu_98515048530bb274.png</url><title>AI</title><link>https://enduranceconsulting.com/tags/ai/</link></image><item><title>GenAI Observability: What to Measure When Your Product Uses LLMs</title><link>https://enduranceconsulting.com/blog/genai-observability-getting-started/</link><pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate><guid>https://enduranceconsulting.com/blog/genai-observability-getting-started/</guid><description>&lt;p&gt;After two quarters embedded with a Fortune 10 company&amp;rsquo;s Applied Machine Learning teams instrumenting GenAI workloads, I&amp;rsquo;ve seen the core mistake: &lt;strong&gt;most organizations instrument LLM applications like they&amp;rsquo;re REST APIs&lt;/strong&gt;. They&amp;rsquo;re not.&lt;/p&gt;
&lt;p&gt;Traditional observability (latency, error rate, throughput) tells you &lt;em&gt;that&lt;/em&gt; something broke. GenAI observability tells you &lt;em&gt;why your AI is failing to deliver value&lt;/em&gt;.&lt;/p&gt;
&lt;h2 id="what-makes-genai-different"&gt;What Makes GenAI Different&lt;/h2&gt;
&lt;p&gt;When your API returns a 500 error, that&amp;rsquo;s unambiguous. When your LLM returns a response, you have no idea if it&amp;rsquo;s:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Correct&lt;/li&gt;
&lt;li&gt;Hallucinated&lt;/li&gt;
&lt;li&gt;Off-topic&lt;/li&gt;
&lt;li&gt;Biased&lt;/li&gt;
&lt;li&gt;Too expensive&lt;/li&gt;
&lt;li&gt;Too slow for the user&amp;rsquo;s patience&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You need different instrumentation.&lt;/p&gt;
&lt;h2 id="the-four-layers-of-genai-observability"&gt;The Four Layers of GenAI Observability&lt;/h2&gt;
&lt;h3 id="layer-1-infrastructure-metrics-table-stakes"&gt;Layer 1: Infrastructure Metrics (Table Stakes)&lt;/h3&gt;
&lt;p&gt;These are your traditional observability signals, adapted for LLM workloads:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Latency Metrics:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;TTFT (Time to First Token)&lt;/strong&gt;: How long before the user sees &lt;em&gt;something&lt;/em&gt;? This determines perceived performance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tokens per Second&lt;/strong&gt;: Generation rate once output starts. This determines how long the user waits for the complete answer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total Request Duration&lt;/strong&gt;: End-to-end latency including prompt processing.&lt;/li&gt;
&lt;/ul&gt;
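&lt;p&gt;As a minimal sketch of the latency metrics above, the following measures TTFT, total duration, and tokens per second on the client while consuming a streamed response. &lt;code&gt;measure_stream&lt;/code&gt; and &lt;code&gt;StreamTiming&lt;/code&gt; are illustrative names rather than part of any provider SDK, and each streamed chunk is assumed to be roughly one token.&lt;/p&gt;

```python
import time
from dataclasses import dataclass
from typing import Iterable


@dataclass
class StreamTiming:
    ttft_s: float         # request start to first chunk
    total_s: float        # request start to last chunk
    output_tokens: int    # chunks counted as tokens (an approximation)
    tokens_per_s: float   # generation rate after the first chunk


def measure_stream(chunks: Iterable[str], clock=time.perf_counter):
    """Consume a token stream and record latency metrics."""
    start = clock()
    first = None
    count = 0
    for _ in chunks:
        if first is None:
            first = clock()  # timestamp of the first visible output
        count += 1
    end = clock()
    if first is None:
        # Empty stream: no first token was ever seen.
        return StreamTiming(end - start, end - start, 0, 0.0)
    generation_s = end - first
    rate = (count - 1) / generation_s if generation_s else 0.0
    return StreamTiming(first - start, end - start, count, rate)


# Demo with a scripted clock so the numbers are reproducible.
ticks = iter([0.0, 0.5, 2.5])
timing = measure_stream(["Hel", "lo", ",", " wor", "ld"], clock=lambda: next(ticks))
print(timing)
```

&lt;p&gt;In production the default &lt;code&gt;time.perf_counter&lt;/code&gt; clock would be used and the chunks would come from the provider stream; the injected clock is there only to make the sketch testable.&lt;/p&gt;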
&lt;p&gt;&lt;strong&gt;Cost Metrics:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cost per Request&lt;/strong&gt;: Input tokens × input price + output tokens × output price. Track cached and uncached input separately, since prompt caching bills cached tokens at a fraction of the uncached rate&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost per User Session&lt;/strong&gt;: Aggregated across multi-turn conversations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost per Feature&lt;/strong&gt;: Which parts of your product are burning money?&lt;/li&gt;
&lt;/ul&gt;
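&lt;p&gt;The cost arithmetic above can be sketched as follows. The prices are placeholders, not any provider&amp;rsquo;s real rates, and the &lt;code&gt;Pricing&lt;/code&gt;, &lt;code&gt;request_cost&lt;/code&gt;, and &lt;code&gt;cost_by&lt;/code&gt; names are hypothetical. The sketch assumes the provider reports the cached portion of the input tokens separately.&lt;/p&gt;

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens. All values used below are placeholders."""
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float


def request_cost(pricing, input_tokens, cached_tokens, output_tokens):
    """Return the USD cost of one request.

    input_tokens is the total prompt size; cached_tokens is the part of it
    that was served from the prompt cache.
    """
    uncached = input_tokens - cached_tokens
    return (
        uncached * pricing.input_per_m
        + cached_tokens * pricing.cached_input_per_m
        + output_tokens * pricing.output_per_m
    ) / 1_000_000


def cost_by(records, key):
    """Sum cost_usd per value of `key` (for example "feature" or "session")."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["cost_usd"]
    return dict(totals)


HYPOTHETICAL = Pricing(input_per_m=3.00, cached_input_per_m=0.25, output_per_m=15.00)
cost = request_cost(HYPOTHETICAL, input_tokens=10_000, cached_tokens=8_000, output_tokens=500)

records = [
    {"feature": "summarize", "session": "s1", "cost_usd": cost},
    {"feature": "summarize", "session": "s2", "cost_usd": cost},
    {"feature": "chat", "session": "s1", "cost_usd": 0.002},
]
print(cost, cost_by(records, "feature"))
```

&lt;p&gt;The same per-request records roll up by session or by feature just by changing the key, which is why it helps to attach both to every logged call.&lt;/p&gt;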
&lt;p&gt;&lt;strong&gt;Throughput:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Requests per Second&lt;/strong&gt;: Standard, but important for capacity planning&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Concurrent Requests&lt;/strong&gt;: How many LLM calls are in-flight?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Queue Depth&lt;/strong&gt;: Are you throttling before you hit provider rate limits?&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="layer-2-llm-specific-signals"&gt;Layer 2: LLM-Specific Signals&lt;/h3&gt;
&lt;p&gt;This is where GenAI observability diverges from traditional monitoring:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Token Metrics:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Input Token Count&lt;/strong&gt;: How much context are you sending? Larger = slower + more expensive&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Output Token Count&lt;/strong&gt;: How verbose is your model? Can affect UX and cost&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cache Hit Rate&lt;/strong&gt; (if using prompt caching): Are you paying for redundant processing?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Model Behavior:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Temperature&lt;/strong&gt;: Are you using consistent sampling parameters?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model Version&lt;/strong&gt;: Track which model version generated each response (for A/B testing and rollback)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retry Count&lt;/strong&gt;: How often are you retrying failed requests?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fallback Triggers&lt;/strong&gt;: When did you fall back from your primary model to a cheaper or backup one, and why?&lt;/li&gt;
&lt;/ul&gt;
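&lt;p&gt;One way to capture retry counts and fallback triggers is to have the calling wrapper return a record alongside the response. This is a sketch under assumptions: &lt;code&gt;call(model, prompt)&lt;/code&gt; stands in for whatever client function you use, and the model names and &lt;code&gt;CallRecord&lt;/code&gt; fields are made up for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class CallRecord:
    model: str = ""              # model that produced the answer
    failed_attempts: int = 0     # attempts that raised before success
    fallback_reason: str = ""    # empty when the primary model answered
    errors: list = field(default_factory=list)


def call_with_fallback(prompt, models, call, max_attempts=2):
    """Try each model in order, with up to max_attempts tries per model.

    `call(model, prompt)` is a hypothetical client function that returns
    text or raises an exception. max_attempts is assumed to be at least 1.
    """
    record = CallRecord()
    for model in models:
        for _ in range(max_attempts):
            try:
                text = call(model, prompt)
            except Exception as exc:
                record.failed_attempts += 1
                record.errors.append(f"{model}: {exc}")
            else:
                record.model = model
                return text, record
        # Every attempt on this model failed. Keep the first such reason,
        # which explains why the primary model was abandoned.
        record.fallback_reason = record.fallback_reason or record.errors[-1]
    raise RuntimeError(f"all models failed: {record.errors}")


def fake_call(model, prompt):
    # Stand-in client: the primary always times out, the backup answers.
    if model == "primary-model":
        raise TimeoutError("timed out")
    return f"[{model}] ok"


text, record = call_with_fallback("hi", ["primary-model", "backup-model"], fake_call)
print(text, record)
```

&lt;p&gt;Logging the record fields with each request makes it possible to chart retry rates per model and to see which error types actually trigger fallbacks.&lt;/p&gt;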
&lt;p&gt;&lt;strong&gt;Rate Limiting:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Rate Limit Hits&lt;/strong&gt;: How often are you throttled by your provider?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quota Exhaustion&lt;/strong&gt;: Are you hitting daily/monthly spending caps?&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="layer-3-quality-signals"&gt;Layer 3: Quality Signals&lt;/h3&gt;
&lt;p&gt;Infrastructure can be perfect while your AI delivers garbage. You need quality metrics:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Response Quality (Automated):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Toxicity Score&lt;/strong&gt;: Are you generating harmful content?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Relevance Score&lt;/strong&gt;: Does the response match the prompt intent?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hallucination Detection&lt;/strong&gt;: Is the model making things up? (This is hard; more below)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PII Leakage&lt;/strong&gt;: Are you exposing sensitive data without realizing it?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Response Quality (User Signals):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Thumbs Up/Down Ratios&lt;/strong&gt;: The simplest signal&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User Edits&lt;/strong&gt;: Did the user have to fix the output?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retry Rate&lt;/strong&gt;: Did the user regenerate the response?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Abandonment&lt;/strong&gt;: Did they give up and close the feature?&lt;/li&gt;
&lt;/ul&gt;
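&lt;p&gt;These signals become more useful once they are turned into rates and grouped by something you can change, such as the prompt version. A minimal sketch, assuming a hypothetical event stream of &lt;code&gt;(prompt_version, signal)&lt;/code&gt; pairs in which every response shown emits a &lt;code&gt;served&lt;/code&gt; event:&lt;/p&gt;

```python
from collections import Counter, defaultdict


def feedback_rates(events):
    """Summarize feedback per prompt version.

    events: iterable of (prompt_version, signal) pairs. Every response shown
    emits "served"; later user actions emit "thumbs_up", "thumbs_down",
    "edited", "regenerated", or "abandoned".
    """
    counts = defaultdict(Counter)
    for version, signal in events:
        counts[version][signal] += 1

    def rate(numerator, denominator):
        # None means "no data yet", which is different from a rate of zero.
        return numerator / denominator if denominator else None

    report = {}
    for version, c in counts.items():
        report[version] = {
            "thumbs_up_ratio": rate(c["thumbs_up"], c["thumbs_up"] + c["thumbs_down"]),
            "edit_rate": rate(c["edited"], c["served"]),
            "regenerate_rate": rate(c["regenerated"], c["served"]),
            "abandon_rate": rate(c["abandoned"], c["served"]),
        }
    return report


events = (
    [("v1", "served")] * 4
    + [("v1", "thumbs_up"), ("v1", "thumbs_down"), ("v1", "edited"),
       ("v1", "edited"), ("v1", "regenerated")]
    + [("v2", "served")] * 2
    + [("v2", "thumbs_up"), ("v2", "thumbs_up")]
)
report = feedback_rates(events)
print(report)
```

&lt;p&gt;Explicit votes are divided by the number of votes, while the implicit signals are divided by responses served, since most users never vote at all.&lt;/p&gt;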
&lt;p&gt;&lt;strong&gt;Prompt Engineering Effectiveness:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Prompt Version&lt;/strong&gt;: Track which prompt template was used&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Few-Shot Example Count&lt;/strong&gt;: How many examples are you including?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RAG Context Size&lt;/strong&gt;: How much retrieved context are you injecting?&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="layer-4-business-impact"&gt;Layer 4: Business Impact&lt;/h3&gt;
&lt;p&gt;The reason you&amp;rsquo;re building with LLMs is business value. Measure it:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;User Engagement:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Feature Adoption&lt;/strong&gt;: Are users using your AI features at all?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Session Length&lt;/strong&gt;: Does AI make users stick around longer?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Churn Impact&lt;/strong&gt;: Do AI users churn less?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Conversion:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;AI-Assisted Conversions&lt;/strong&gt;: Did the LLM help close a sale?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Content Generation Volume&lt;/strong&gt;: For content- or code-generation products, output volume maps directly to revenue&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Cost-Benefit:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Revenue per Dollar Spent on LLMs&lt;/strong&gt;: Your AI P&amp;amp;L&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost per Value Delivered&lt;/strong&gt;: What are the unit economics?&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="what-you-can-instrument-today-vs-whats-hard"&gt;What You Can Instrument Today vs. What&amp;rsquo;s Hard&lt;/h2&gt;
&lt;h3 id="easy-wins-implement-these-first"&gt;Easy Wins (Implement These First)&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;TTFT and Tokens/s&lt;/strong&gt;: With a streaming API you can measure both on the client: timestamp the request, the first chunk, and the last chunk. Log all three.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost Tracking&lt;/strong&gt;: Token counts × pricing. Track by user, by feature, by model.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model Version &amp;amp; Parameters&lt;/strong&gt;: Log which model you called and with what settings.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User Feedback&lt;/strong&gt;: Add thumbs up/down buttons. You&amp;rsquo;d be shocked how few products do this.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id="medium-difficulty"&gt;Medium Difficulty&lt;/h3&gt;
&lt;ol start="5"&gt;
&lt;li&gt;&lt;strong&gt;Prompt Versioning&lt;/strong&gt;: Treat prompts like code. Version them, deploy them, track which version served each request.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RAG Observability&lt;/strong&gt;: If you&amp;rsquo;re doing retrieval, log what you retrieved, how relevant it was, and whether it made it into the final response.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trace Context&lt;/strong&gt;: Use OpenTelemetry to connect your LLM call to the upstream request. When a user complains, you can trace back through your entire stack.&lt;/li&gt;
&lt;/ol&gt;
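&lt;p&gt;For prompt versioning, one low-effort approach is to derive the version from the template text itself, so that any edit to the template produces a new ID without anyone remembering to bump a number. This is a sketch of that idea, not the only way to do it; the &lt;code&gt;prompt.version&lt;/code&gt; and &lt;code&gt;prompt.variables&lt;/code&gt; keys are hypothetical names for whatever you attach to the request log or span.&lt;/p&gt;

```python
import hashlib
from string import Template


def prompt_version(template_text):
    """Short, stable ID derived from the template text itself."""
    return hashlib.sha256(template_text.encode("utf-8")).hexdigest()[:12]


def render(template_text, **variables):
    """Render a prompt and return it with metadata to log alongside the call."""
    prompt = Template(template_text).substitute(**variables)
    metadata = {
        "prompt.version": prompt_version(template_text),
        # Log variable names only; the values may contain user data.
        "prompt.variables": sorted(variables),
    }
    return prompt, metadata


SUMMARIZE_V1 = "Summarize the following $doc_type in $n bullet points:\n$body"
SUMMARIZE_V2 = "Summarize the following $doc_type in at most $n bullet points:\n$body"

prompt, metadata = render(SUMMARIZE_V1, doc_type="report", n=3, body="...")
print(metadata)
```

&lt;p&gt;Because the ID is a content hash, two deployments running the same template report the same version, and a one-word change shows up as a different one in your dashboards.&lt;/p&gt;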
&lt;h3 id="hard-problems"&gt;Hard Problems&lt;/h3&gt;
&lt;ol start="8"&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Hallucination Detection&lt;/strong&gt;: There&amp;rsquo;s no silver bullet. The available techniques each carry a cost:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fact-checking against known sources (expensive)&lt;/li&gt;
&lt;li&gt;Consistency checks across multiple generations (slow)&lt;/li&gt;
&lt;li&gt;Human labeling of a sample (doesn&amp;rsquo;t scale)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Semantic Quality&lt;/strong&gt;: &amp;ldquo;Is this response helpful?&amp;rdquo; is subjective. You&amp;rsquo;ll need:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;LLM-as-judge (use a second model to grade the first; yes, really)&lt;/li&gt;
&lt;li&gt;Human eval loops (label a sample, train a classifier)&lt;/li&gt;
&lt;li&gt;User behavior proxies (did they edit it? regenerate? abandon?)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
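&lt;p&gt;To make the consistency-check idea concrete, here is a deliberately crude sketch: sample the same prompt several times and score how much the answers agree. Word-set overlap (Jaccard similarity) is used here only because it needs no dependencies; it is a stand-in for a stronger comparison such as embeddings or an entailment model, and a low score signals instability rather than proving a hallucination.&lt;/p&gt;

```python
from itertools import combinations


def token_set(text):
    """Lowercased whitespace tokens; a crude stand-in for real tokenization."""
    return set(text.lower().split())


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 1.0


def consistency_score(generations):
    """Mean pairwise Jaccard similarity across samples, from 0.0 to 1.0."""
    pairs = list(combinations(generations, 2))
    if not pairs:
        raise ValueError("need at least two generations")
    total = sum(jaccard(token_set(a), token_set(b)) for a, b in pairs)
    return total / len(pairs)


stable = ["Paris is the capital", "Paris is the capital", "paris is the capital"]
unstable = ["The answer is 42", "It was 1987", "Probably around 7 percent"]

print(consistency_score(stable), consistency_score(unstable))
```

&lt;p&gt;Each extra sample is another full LLM call, which is why this approach is slow and usually reserved for a sampled fraction of traffic.&lt;/p&gt;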
&lt;h2 id="openllmetry-the-standard-thats-emerging"&gt;OpenLLMetry: The Standard That&amp;rsquo;s Emerging&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://github.com/traceloop/openllmetry" target="_blank" rel="noopener"&gt;OpenLLMetry&lt;/a&gt; is to LLM observability what OpenTelemetry is to traditional observability: a vendor-neutral instrumentation standard.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s built on top of OpenTelemetry and adds semantic conventions for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;LLM provider and model name/version&lt;/li&gt;
&lt;li&gt;Token counts (prompt, completion, and cached)&lt;/li&gt;
&lt;li&gt;Cost&lt;/li&gt;
&lt;li&gt;Prompt and completion payloads (with PII redaction)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Worth knowing where this is heading: OpenLLMetry pioneered these conventions, and the OpenTelemetry GenAI working group is now standardizing the same ground upstream as &lt;code&gt;gen_ai.*&lt;/code&gt; semantic conventions. Instrumenting on OTEL today keeps you aligned with where the ecosystem is converging, not locked to one vendor&amp;rsquo;s schema.&lt;/p&gt;
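&lt;p&gt;To give a feel for what these conventions look like on a span, here is a plain-Python sketch that builds the attribute dictionary for one chat call. The &lt;code&gt;gen_ai.*&lt;/code&gt; names are believed to match the OpenTelemetry GenAI conventions at the time of writing, but those conventions are still evolving, so check the current spec before depending on them. The &lt;code&gt;app.*&lt;/code&gt; keys, the function name, and the model names are hypothetical.&lt;/p&gt;

```python
def llm_span_attributes(model, response_model, input_tokens, output_tokens,
                        temperature, feature, prompt_version):
    """Build a flat attribute dict for one chat completion span."""
    return {
        # Believed to follow the OpenTelemetry gen_ai.* conventions;
        # verify against the current spec, since the names have changed before.
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.response.model": response_model,
        "gen_ai.request.temperature": temperature,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Hypothetical application-level attributes, not part of any standard.
        "app.feature": feature,
        "app.prompt.version": prompt_version,
    }


attrs = llm_span_attributes(
    model="example-model",
    response_model="example-model-2026-01",
    input_tokens=1200,
    output_tokens=150,
    temperature=0.2,
    feature="summarize",
    prompt_version="a1b2c3",
)
for key, value in sorted(attrs.items()):
    print(key, "=", value)
```

&lt;p&gt;An instrumentation SDK normally sets the standard attributes for you; the part you own is the application context such as feature and prompt version.&lt;/p&gt;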
&lt;p&gt;If you&amp;rsquo;re starting from scratch, &lt;strong&gt;use OpenLLMetry&lt;/strong&gt;. It gives you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Vendor portability (swap observability backends without reinstrumenting)&lt;/li&gt;
&lt;li&gt;Ecosystem compatibility (works with Datadog, Honeycomb, Grafana, etc.)&lt;/li&gt;
&lt;li&gt;Future-proofing (as the space matures, tooling will standardize on OTEL)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="real-world-implementation-what-i-built-at-fortune-10-scale"&gt;Real-World Implementation: What I Built at Fortune 10 Scale&lt;/h2&gt;
&lt;p&gt;I can&amp;rsquo;t share specifics, but the architecture was:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;OpenLLMetry SDK&lt;/strong&gt; wrapping LLM provider calls&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OTEL Collector&lt;/strong&gt; for aggregation, sampling, and enrichment&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trace Backend&lt;/strong&gt; (vendor withheld) for storage and visualization&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom Dashboards&lt;/strong&gt; showing:
&lt;ul&gt;
&lt;li&gt;TTFT P50/P95/P99 by model and feature&lt;/li&gt;
&lt;li&gt;Cost burn rate and projections&lt;/li&gt;
&lt;li&gt;Token usage patterns (prompt size vs. output size)&lt;/li&gt;
&lt;li&gt;Model version distribution&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Alerting&lt;/strong&gt; on:
&lt;ul&gt;
&lt;li&gt;TTFT degradation (user experience)&lt;/li&gt;
&lt;li&gt;Cost spikes (budget protection)&lt;/li&gt;
&lt;li&gt;Error rate increases (availability)&lt;/li&gt;
&lt;li&gt;Rate limit hits (capacity planning)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Result: &lt;strong&gt;Engineering teams could diagnose production AI issues in minutes instead of days&lt;/strong&gt;, and we caught a $50K/month cost leak from a prompt that was including full document context on every call.&lt;/p&gt;
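&lt;p&gt;The dashboard and alert logic for TTFT is simple enough to sketch without any backend. The following is an illustration only, not the implementation described above: it groups samples by model and feature, computes nearest-rank percentiles, and flags groups whose P95 exceeds a budget. Function names, model names, and the 0.8 second budget are all made up.&lt;/p&gt;

```python
import math
from collections import defaultdict


def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def ttft_report(samples):
    """samples: iterable of (model, feature, ttft_seconds) tuples."""
    groups = defaultdict(list)
    for model, feature, ttft in samples:
        groups[(model, feature)].append(ttft)
    return {
        key: {f"p{p}": percentile(vals, p) for p in (50, 95, 99)}
        for key, vals in groups.items()
    }


def ttft_alerts(report, p95_budget_s):
    """Return the (model, feature) groups whose P95 exceeds the budget."""
    return sorted(key for key, stats in report.items() if stats["p95"] > p95_budget_s)


samples = [("model-a", "chat", i / 100) for i in range(1, 101)]
samples += [("model-a", "search", 0.2)] * 20

report = ttft_report(samples)
alerts = ttft_alerts(report, p95_budget_s=0.8)
print(report, alerts)
```

&lt;p&gt;In practice the trace backend computes the percentiles; the point of the sketch is that alerts should key on the tail per model and feature, not on a global average.&lt;/p&gt;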
&lt;h2 id="advice-for-teams-starting-out"&gt;Advice for Teams Starting Out&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Start simple:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Log TTFT and cost for every LLM call&lt;/li&gt;
&lt;li&gt;Add thumbs up/down feedback&lt;/li&gt;
&lt;li&gt;Set a budget alert&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Iterate toward quality:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Version your prompts&lt;/li&gt;
&lt;li&gt;A/B test prompt variations&lt;/li&gt;
&lt;li&gt;Sample and label responses for quality&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Invest in infra when it pays:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you&amp;rsquo;re spending $10K/month on LLMs, you can afford manual tracking&lt;/li&gt;
&lt;li&gt;If you&amp;rsquo;re spending $100K/month, you need automated observability&lt;/li&gt;
&lt;li&gt;If you&amp;rsquo;re spending $1M/month, you need a dedicated AI observability platform&lt;/li&gt;
&lt;/ul&gt;
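&lt;p&gt;The budget alert from the first list does not need a platform either. A minimal sketch, assuming you already have a month-to-date spend figure and that a straight-line projection is good enough; the function names and dollar amounts are illustrative:&lt;/p&gt;

```python
import calendar
from datetime import date


def projected_month_spend(spend_to_date, today):
    """Linear projection: average daily spend so far times days in the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def budget_alert(spend_to_date, monthly_budget, today):
    """Report the projected month-end spend and whether it breaks the budget."""
    projected = projected_month_spend(spend_to_date, today)
    return {
        "projected_usd": round(projected, 2),
        "over_budget": projected > monthly_budget,
    }


status = budget_alert(spend_to_date=4_000, monthly_budget=10_000, today=date(2026, 4, 10))
print(status)
```

&lt;p&gt;A linear projection overreacts early in the month and ignores weekly seasonality, so treat it as a tripwire that prompts a closer look, not as a forecast.&lt;/p&gt;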
&lt;h2 id="the-tooling-landscape"&gt;The Tooling Landscape&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Open Source:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;OpenLLMetry&lt;/strong&gt;: OTEL-native instrumentation SDK&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Langfuse&lt;/strong&gt;: Tracing, evals, and prompt management (self-hostable)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phoenix&lt;/strong&gt; (Arize): LLM tracing and evaluation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Commercial:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;LangSmith&lt;/strong&gt;: Tracing and eval platform from the LangChain team (works beyond LangChain)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Braintrust&lt;/strong&gt;: Eval-focused, strong for LLM-as-judge and regression testing&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Honeycomb&lt;/strong&gt;: General-purpose tracing with LLM support&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Datadog&lt;/strong&gt;: APM expanding into LLM observability&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Arize AI&lt;/strong&gt;: Purpose-built for ML/LLM monitoring&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Helicone&lt;/strong&gt;: Cost tracking and prompt management&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Avoid building from scratch&lt;/strong&gt; unless you&amp;rsquo;re Netflix-scale. The tooling is maturing fast.&lt;/p&gt;
&lt;h2 id="bottom-line"&gt;Bottom Line&lt;/h2&gt;
&lt;p&gt;GenAI observability is how you keep AI products from bleeding money or delivering garbage. Start with infrastructure metrics, add quality signals, then tie it to business impact.&lt;/p&gt;
&lt;p&gt;And for the love of all that&amp;rsquo;s holy, &lt;strong&gt;instrument your prompts&lt;/strong&gt;. If you don&amp;rsquo;t know which prompt template generated a bad response, you can&amp;rsquo;t fix it.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Building an LLM-powered product and not sure what to instrument?&lt;/strong&gt; I&amp;rsquo;ve done this at Fortune 10 scale and scrappy startup scale. &lt;a href="https://enduranceconsulting.com/#contact"&gt;Let&amp;rsquo;s talk&lt;/a&gt;.&lt;/p&gt;</description></item></channel></rss>