Every new computing paradigm takes about 18 months before the industry realizes it needs new tools to see what’s happening inside. Microservices gave us Datadog. Containers gave us Kubernetes observability. AI agents are now forcing the same shift.
The numbers are clear. LangChain’s LangSmith platform, launched in July 2023, now traces agent executions for Klarna, Uber, LinkedIn, Nvidia, and Coinbase. Arize AI’s Phoenix project has become the open-source observability standard for AI, with integrations spanning LlamaIndex, Haystack, and LangChain. Promptfoo, a CLI eval tool that started as a side project, hit 21,800 GitHub stars before OpenAI acquired it. And in March 2025, Braintrust — founded by Ankur Goyal, formerly of Figma — released Brainstore, a purpose-built database for AI traces that they claim is 10x faster than existing solutions for agent workloads.
This is not a feature. This is a category forming.
Why Traditional APM Breaks on Agents
Traditional APM tools like Datadog, New Relic, and Grafana were built for deterministic request/response patterns. A web request comes in, a server processes it, a response goes out. You measure p95 latency, error rates, and throughput. The trace is a straight line.
Agent workloads are different. A single user request spawns multiple LLM calls, each with token-by-token streaming. Those calls trigger tool invocations — a web search, a code execution, a database query — each with its own latency profile. The agent might loop, retry, backtrack, or hand off to another agent. A single conversation with a coding agent like Cursor or GitHub Copilot can generate megabytes of trace data across dozens of sub-calls, each with multi-kilobyte payloads of prompts and responses.
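To make that shape concrete, here is a minimal sketch of an agent trace as a tree of spans. The `Span` class, its field names, and the payload sizes are illustrative assumptions, not the schema of any particular platform; the point is only that a trace is a nested structure whose weight sits in the prompt and response payloads.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Span:
    """One step in an agent run (hypothetical schema, for illustration only)."""
    name: str
    kind: str  # "agent", "llm", or "tool"
    payload: str = ""  # prompt/response text or tool output
    children: list[Span] = field(default_factory=list)


def walk(span, depth=1):
    """Yield every span in the tree together with its nesting depth."""
    yield span, depth
    for child in span.children:
        yield from walk(child, depth + 1)


def summarize(root):
    """Count spans, measure nesting depth, and total the payload size."""
    spans = list(walk(root))
    return {
        "span_count": len(spans),
        "max_depth": max(depth for _, depth in spans),
        "payload_bytes": sum(len(s.payload.encode("utf-8")) for s, _ in spans),
    }


# A toy request: the agent plans, searches, summarizes, then retries the plan.
trace = Span("handle_request", "agent", children=[
    Span("plan", "llm", payload="p" * 2048),
    Span("web_search", "tool", payload="r" * 512, children=[
        Span("summarize_results", "llm", payload="s" * 4096),
    ]),
    Span("retry_plan", "llm", payload="p" * 2048),
])

summary = summarize(trace)
print(summary)
```

Even this toy run carries several kilobytes across five spans and three levels of nesting. A real coding-agent session multiplies every one of those dimensions.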
“Agent traces are deeply nested with heavy payloads. General-purpose databases can store trace data, but weren’t designed for the way teams query it.”
— LangChain’s team, introducing SmithDB
The numbers bear this out. SmithDB delivers 15x faster full-text search, 12x faster random-access trace queries, and 9x faster thread queries compared to standard Postgres. Gaps that large reflect a fundamental architectural mismatch, not something incremental tuning would close.
The Three Pillars of Agent Observability
The companies building in this space have converged on a remarkably consistent architecture. Every major platform — LangSmith, Braintrust, Arize Phoenix — organizes around three capabilities.
- Tracing. Following an agent’s decision chain from user input to final output, capturing every LLM call, tool invocation, and state transition along the way. This is the foundation. Without it, debugging an agent that hallucinates on the third of five reasoning steps is guesswork.
- Evaluation. Scoring outputs automatically — using LLM-as-judge, code assertions, or human review — and continuously improving those scores. Braintrust’s evaluation layer lets teams run hundreds or thousands of experiments in parallel. Replit’s CTO said Braintrust “helped us identify several patterns that we wouldn’t have found.”
- Monitoring. Real-time visibility into agent behavior in production — cost per run, latency distributions, tool call frequencies, failure modes. LangSmith’s dashboards track p50 and p99 latency, token usage, and error rates, with webhook and PagerDuty alerts when thresholds are crossed.
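The monitoring pillar is the easiest of the three to sketch. The example below computes p50 and p99 latency, error rate, and cost per run over a batch of agent runs, then checks them against thresholds. The `Run` record, the threshold values, and the alert names are assumptions made for illustration; this is not how LangSmith or any other platform implements its dashboards or alerts.

```python
import math
from dataclasses import dataclass


@dataclass
class Run:
    """One completed agent run (hypothetical record, for illustration only)."""
    latency_s: float
    cost_usd: float
    failed: bool = False


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(runs):
    """Aggregate the metrics a monitoring dashboard would chart."""
    latencies = [r.latency_s for r in runs]
    return {
        "p50_s": percentile(latencies, 50),
        "p99_s": percentile(latencies, 99),
        "error_rate": sum(r.failed for r in runs) / len(runs),
        "cost_per_run_usd": sum(r.cost_usd for r in runs) / len(runs),
    }


def check_thresholds(stats, p99_limit_s, error_rate_limit):
    """Return the names of the alerts that would fire for these stats."""
    fired = []
    if stats["p99_s"] > p99_limit_s:
        fired.append("p99_latency")
    if stats["error_rate"] > error_rate_limit:
        fired.append("error_rate")
    return fired


# Nine ordinary runs plus one slow, expensive failure (e.g. an agent stuck in a loop).
latencies = [1.2, 0.8, 2.5, 1.0, 1.1, 0.9, 1.3, 1.4, 1.0]
runs = [Run(latency_s=t, cost_usd=0.02) for t in latencies]
runs.append(Run(latency_s=30.0, cost_usd=0.50, failed=True))

stats = summarize(runs)
fired = check_thresholds(stats, p99_limit_s=10.0, error_rate_limit=0.05)
print(stats, fired)
```

The median barely registers the one runaway run, which is why tail percentiles and cost per run matter more for agents than averages do. In a real deployment the list of fired alerts would feed a webhook or paging integration instead of a `print`.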
None of these companies are competing on architectural philosophy. They’re all building the same stack, differentiated by execution, integrations, and go-to-market.
The Database Problem
The hardest technical challenge in agent observability isn’t the AI — it’s the storage layer. Agent traces have different characteristics than anything observability platforms have handled before.
A typical web trace: a few hundred bytes, a handful of spans, milliseconds of processing time. An agent trace: megabytes across dozens of spans, each containing multi-kilobyte prompt and response payloads, with deeply nested parent-child relationships that span minutes or hours. The query patterns are different too — you search across traces for semantic patterns, not just individual lookups. “Find all traces where the agent called the code interpreter after generating a syntax error.”
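That example query can be sketched as a scan over ordered spans. Everything here is a simplifying assumption: traces are plain lists of dicts, a "syntax error" is detected by a substring match on span output, and the tool is named `code_interpreter`. It shows the logical shape of the query, not how any trace database executes it.

```python
def called_tool_after_error(spans, tool_name, error_marker):
    """True if a call to tool_name occurs after a span whose output contains error_marker."""
    seen_error = False
    for span in spans:
        # Check the tool call first so a span cannot match its own output.
        if seen_error and span["kind"] == "tool" and span["name"] == tool_name:
            return True
        if error_marker in span.get("output", ""):
            seen_error = True
    return False


# Hypothetical traces: each is an ordered list of spans.
traces = {
    "t1": [  # error, then code interpreter: matches
        {"kind": "llm", "name": "write_code", "output": "SyntaxError: invalid syntax"},
        {"kind": "tool", "name": "code_interpreter", "output": "exit 1"},
    ],
    "t2": [  # code interpreter ran before the error: no match
        {"kind": "tool", "name": "code_interpreter", "output": "ok"},
        {"kind": "llm", "name": "write_code", "output": "SyntaxError: invalid syntax"},
    ],
    "t3": [  # error followed by a different tool: no match
        {"kind": "llm", "name": "write_code", "output": "SyntaxError: invalid syntax"},
        {"kind": "tool", "name": "web_search", "output": "3 results"},
    ],
}

matches = [
    trace_id
    for trace_id, spans in traces.items()
    if called_tool_after_error(spans, "code_interpreter", "SyntaxError")
]
print(matches)
```

A linear scan like this touches every payload in every trace. That is tolerable for three traces and untenable for a million, which is where full-text indexing over span payloads starts to matter.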
Both LangChain and Braintrust built custom databases. SmithDB is “three stateless components on object storage and Postgres” that fits inside customer VPCs. Brainstore handles full-text search up to 23.9x faster than a popular open-source data warehouse.
This mirrors early APM. Datadog built its own time-series database because existing storage couldn’t handle metric cardinality at scale. The same pattern is repeating: when existing infrastructure can’t handle the data shape, the new category builds its own foundation.
Who Wins and Who Loses
The incumbents are not standing still. Datadog added LLM observability to its APM suite in 2024. Honeycomb followed. But their products are optimized for the old data model. Adding LLM spans to a trace designed for microservice calls works, but you’re fighting the original architecture.
The native players face a different challenge: convincing teams to adopt yet another observability tool. Sarah Sachs, AI Lead at Notion:
“There are some problems we wouldn’t know were problems without Braintrust.”
That’s the pitch — not better dashboards, but problems you don’t know exist.
The likely outcome is a split. Incumbents handle the “good enough” use case for simple RAG pipelines or single-LLM-call agents. Native platforms win complex, multi-step agent workloads where dedicated storage and query infrastructure actually matters.
What This Means
Agent observability becoming its own category signals something real about AI adoption: it’s graduated from experimentation to production. Nvidia, Uber, and Klarna are paying for LangSmith. Coursera reports 45x more feedback with AI grading through Braintrust. We’re past the threshold.
Datadog won not because it was the best monitoring tool, but because it was built for the microservices world that was emerging. The incumbents then — Nagios, Zabbix, SolarWinds — were built for monoliths. The same dynamic is playing out now between Datadog/Honeycomb and LangSmith/Braintrust.
If your team is building production agents, you need agent-native observability. General-purpose databases handle the first wave of debugging. But when you’re searching across a million traces to understand why your coding agent keeps introducing race conditions, and you need the answer in seconds, not minutes, you’ll want the tool built for the job.
The next 12 months will determine which of these native companies becomes the Datadog of agents. What’s already clear: the category exists, it’s growing fast, and it’s not going back into the APM monolith.
Further Reading
- LangSmith Observability Platform — LangChain’s production-grade tracing, monitoring, and evaluation platform for AI agents. The industry leader with customers like Klarna, Uber, and Nvidia.
- Braintrust — The AI Observability Platform — Braintrust’s full-stack platform combining observability, evaluation, and automation. Used by Notion, Coursera, and Dropbox to ship quality AI at scale.
- Promptfoo Docs — The open-source CLI and library for evaluating and red-teaming LLM applications. Acquired by OpenAI in 2026.
- How Coursera Builds Next-Generation Learning Tools — Case study on how Coursera achieved 45x more feedback with AI grading using Braintrust’s evaluation infrastructure.
- The Three Pillars of AI Observability — Braintrust’s breakdown of tracing, evaluation, and monitoring as the foundational capabilities for production AI systems.