RAMPART vs LangSmith vs Braintrust vs Arize: Agent Safety
RAMPART vs LangSmith vs Braintrust vs Arize: Agent Safety (May 2026)
Microsoft open-sourced RAMPART on May 20, 2026 as the first widely-distributed agent CI safety testing framework. It’s not a competitor to LangSmith, Braintrust, or Arize Phoenix — it’s complementary. Here’s exactly where each fits in a 2026 production agent stack.
Last verified: May 25, 2026.
TL;DR table
| Microsoft RAMPART | LangSmith | Braintrust | Arize Phoenix | |
|---|---|---|---|---|
| Primary focus | Adversarial safety regression | Quality evals + LangChain observability | Quality evals + custom evaluators | Tracing + LLM observability |
| License | Open source (May 20, 2026) | Commercial (LangChain) | Commercial (Braintrust) | Open source + cloud |
| CI-native | Yes (designed for CI) | Yes (via SDK) | Yes (via SDK) | Yes (via SDK) |
| Adversarial scenario library | Yes (built-in starter set) | No (write your own) | No (write your own) | No (write your own) |
| Prompt injection coverage | First-class | Custom evaluator | Custom evaluator | Custom evaluator |
| Indirect injection coverage | First-class | Custom evaluator | Custom evaluator | Custom evaluator |
| Quality evals (accuracy, latency) | Limited | Strong | Strong | Strong |
| Tracing / observability | Limited | Strong | Strong | Strongest |
| Framework integration | Model-agnostic | LangChain / LangGraph first | Framework-agnostic | OpenTelemetry-native |
| Best for | Safety regression in CI | LangChain shops | Polished SaaS evals | Self-hosted observability |
What each tool is actually optimized for
Microsoft RAMPART (May 20, 2026)
Agent test framework for encoding adversarial and benign scenarios as repeatable CI tests. The defining feature: it ships with a starter library of adversarial scenarios — prompt injection, indirect injection, tool-call abuse, data exfiltration — that you can extend with your own threat model. Tests fail the build when the agent regresses on a previously-passing safety case.
Pick it for: Production agents that take real-world actions (write to systems, send emails, transact, browse the web). Mandatory for regulated workloads.
LangSmith
LangChain’s hosted evals and observability platform. Tight integration with LangChain and LangGraph. Strong quality eval primitives (LLM-as-judge, accuracy scoring, regression detection), comprehensive tracing.
Pick it for: Teams already on LangChain/LangGraph who want the platform with the deepest framework integration.
Braintrust
Framework-agnostic evals platform with a polished SaaS experience. Strength is the evaluator authoring experience — fast iteration on custom graders, clean diff views, regression detection across model versions.
Pick it for: Teams that want best-in-class quality evals without framework lock-in.
Arize Phoenix
Open-source LLM tracing + observability. OpenTelemetry-native. Self-hostable. Best in the category for tracing depth and integration with existing observability stacks.
Pick it for: Teams that want open-source observability they can self-host and own.
The combined production stack
The pattern that dominates production agent stacks in May 2026:
| Stage | Tool |
|---|---|
| Design intent (assumption audit) | Microsoft Clarity (companion to RAMPART) |
| Safety regression (CI) | RAMPART |
| Quality evals (CI + offline) | LangSmith / Braintrust |
| Production tracing + observability | Arize Phoenix / LangSmith / Braintrust |
| Incident replay | Tracing tool (same as observability) |
Most teams pick one of LangSmith/Braintrust/Arize for the quality + observability story and add RAMPART for safety regression. The two layers don’t overlap meaningfully — running both is the norm, not over-investment.
Adversarial scenarios: where RAMPART pulls ahead
The clearest differentiation is the adversarial coverage:
| Scenario | RAMPART | LangSmith | Braintrust | Arize |
|---|---|---|---|---|
| Direct prompt injection | Built-in scenario | Custom evaluator | Custom evaluator | Custom evaluator |
| Indirect prompt injection (web page) | Built-in scenario | Custom evaluator | Custom evaluator | Custom evaluator |
| Indirect prompt injection (document) | Built-in scenario | Custom evaluator | Custom evaluator | Custom evaluator |
| Tool-call abuse | Built-in scenario | Custom evaluator | Custom evaluator | Custom evaluator |
| Data exfiltration | Built-in scenario | Custom evaluator | Custom evaluator | Custom evaluator |
| Jailbreak persistence (multi-turn) | Built-in scenario | Manual setup | Manual setup | Manual setup |
| Authorization escape | Built-in scenario | Custom evaluator | Custom evaluator | Custom evaluator |
You can build this coverage on top of any of the evals platforms with enough work. RAMPART’s value is the starter library + the framework designed for it — you don’t reinvent prompt-injection test cases for the hundredth time.
Quality evals: where LangSmith / Braintrust / Arize win
Conversely, RAMPART is thin on quality evals:
| Capability | RAMPART | LangSmith | Braintrust | Arize |
|---|---|---|---|---|
| LLM-as-judge accuracy scoring | Limited | Strong | Strong | Strong |
| Retrieval relevance evals (RAG) | Limited | Strong | Strong | Strong |
| Latency + cost regression | Limited | Strong | Strong | Strong |
| Hallucination detection | Limited | Strong | Strong | Strong |
| Production trace replay | Limited | Strong | Strong | Strongest |
| Dataset versioning | Limited | Strong | Strong | Strong |
| A/B testing across model versions | Limited | Strong | Strong | Medium |
If your top pain is “is the agent producing accurate, fast, on-task responses,” you need an evals platform — RAMPART isn’t the right tool.
Decision framework
You should run RAMPART if:
- Your agent takes real-world actions (writes to systems, sends emails, transacts, browses the web).
- You ship in regulated industries (finance, healthcare, government, legal).
- You’ve had a prompt-injection incident (or you don’t want to find out you should have).
- You want safety regression coverage every commit, not just at launch.
You should run LangSmith if:
- You’re already on LangChain or LangGraph and want the platform with native framework integration.
- Your top eval needs are quality, retrieval, and trace replay.
You should run Braintrust if:
- You want the cleanest SaaS evaluator-authoring experience and aren’t tied to a specific agent framework.
- Your team iterates rapidly on custom graders and wants a polished diff/regression experience.
You should run Arize Phoenix if:
- You want open-source observability you can self-host and own.
- Your existing observability stack is OpenTelemetry-native and you want LLM tracing to fit in cleanly.
Most production teams run RAMPART + one of LangSmith/Braintrust/Arize. The combination covers safety regression + quality + observability without duplication.
Caveats
- RAMPART is new (May 20, 2026). Starter scenarios are good; the long-tail authoring experience and ecosystem will mature through Q3 2026.
- Multimodal coverage is text-first. Vision-based attacks, audio injection, document-embedded payloads get expanded coverage post-launch.
- LangSmith / Braintrust have hosted-only paths. If self-hosted is a hard requirement, Arize Phoenix is the obvious open-source choice.
- M365-bound agents have a Microsoft-shop alternative — M365 Copilot Evaluations Tool — that may be the path of least resistance inside Foundry.
Verdict
- Best for adversarial safety regression in CI: Microsoft RAMPART.
- Best for LangChain / LangGraph quality evals: LangSmith.
- Best for framework-agnostic polished evals SaaS: Braintrust.
- Best for self-hosted open-source observability: Arize Phoenix.
- Best combined stack: RAMPART for safety + one of the three for quality + observability.
The story for May 2026: safety regression and quality evals are now two distinct disciplines with distinct tooling. The teams that run both will ship more reliable agents than the teams that run only one.