AI agents · OpenClaw · self-hosting · automation

Quick Answer

RAMPART vs LangSmith vs Braintrust vs Arize: Agent Safety

Published:

RAMPART vs LangSmith vs Braintrust vs Arize: Agent Safety (May 2026)

Microsoft open-sourced RAMPART on May 20, 2026 as the first widely-distributed agent CI safety testing framework. It’s not a competitor to LangSmith, Braintrust, or Arize Phoenix — it’s complementary. Here’s exactly where each fits in a 2026 production agent stack.

Last verified: May 25, 2026.

TL;DR table

Microsoft RAMPARTLangSmithBraintrustArize Phoenix
Primary focusAdversarial safety regressionQuality evals + LangChain observabilityQuality evals + custom evaluatorsTracing + LLM observability
LicenseOpen source (May 20, 2026)Commercial (LangChain)Commercial (Braintrust)Open source + cloud
CI-nativeYes (designed for CI)Yes (via SDK)Yes (via SDK)Yes (via SDK)
Adversarial scenario libraryYes (built-in starter set)No (write your own)No (write your own)No (write your own)
Prompt injection coverageFirst-classCustom evaluatorCustom evaluatorCustom evaluator
Indirect injection coverageFirst-classCustom evaluatorCustom evaluatorCustom evaluator
Quality evals (accuracy, latency)LimitedStrongStrongStrong
Tracing / observabilityLimitedStrongStrongStrongest
Framework integrationModel-agnosticLangChain / LangGraph firstFramework-agnosticOpenTelemetry-native
Best forSafety regression in CILangChain shopsPolished SaaS evalsSelf-hosted observability

What each tool is actually optimized for

Microsoft RAMPART (May 20, 2026)

Agent test framework for encoding adversarial and benign scenarios as repeatable CI tests. The defining feature: it ships with a starter library of adversarial scenarios — prompt injection, indirect injection, tool-call abuse, data exfiltration — that you can extend with your own threat model. Tests fail the build when the agent regresses on a previously-passing safety case.

Pick it for: Production agents that take real-world actions (write to systems, send emails, transact, browse the web). Mandatory for regulated workloads.

LangSmith

LangChain’s hosted evals and observability platform. Tight integration with LangChain and LangGraph. Strong quality eval primitives (LLM-as-judge, accuracy scoring, regression detection), comprehensive tracing.

Pick it for: Teams already on LangChain/LangGraph who want the platform with the deepest framework integration.

Braintrust

Framework-agnostic evals platform with a polished SaaS experience. Strength is the evaluator authoring experience — fast iteration on custom graders, clean diff views, regression detection across model versions.

Pick it for: Teams that want best-in-class quality evals without framework lock-in.

Arize Phoenix

Open-source LLM tracing + observability. OpenTelemetry-native. Self-hostable. Best in the category for tracing depth and integration with existing observability stacks.

Pick it for: Teams that want open-source observability they can self-host and own.

The combined production stack

The pattern that dominates production agent stacks in May 2026:

StageTool
Design intent (assumption audit)Microsoft Clarity (companion to RAMPART)
Safety regression (CI)RAMPART
Quality evals (CI + offline)LangSmith / Braintrust
Production tracing + observabilityArize Phoenix / LangSmith / Braintrust
Incident replayTracing tool (same as observability)

Most teams pick one of LangSmith/Braintrust/Arize for the quality + observability story and add RAMPART for safety regression. The two layers don’t overlap meaningfully — running both is the norm, not over-investment.

Adversarial scenarios: where RAMPART pulls ahead

The clearest differentiation is the adversarial coverage:

ScenarioRAMPARTLangSmithBraintrustArize
Direct prompt injectionBuilt-in scenarioCustom evaluatorCustom evaluatorCustom evaluator
Indirect prompt injection (web page)Built-in scenarioCustom evaluatorCustom evaluatorCustom evaluator
Indirect prompt injection (document)Built-in scenarioCustom evaluatorCustom evaluatorCustom evaluator
Tool-call abuseBuilt-in scenarioCustom evaluatorCustom evaluatorCustom evaluator
Data exfiltrationBuilt-in scenarioCustom evaluatorCustom evaluatorCustom evaluator
Jailbreak persistence (multi-turn)Built-in scenarioManual setupManual setupManual setup
Authorization escapeBuilt-in scenarioCustom evaluatorCustom evaluatorCustom evaluator

You can build this coverage on top of any of the evals platforms with enough work. RAMPART’s value is the starter library + the framework designed for it — you don’t reinvent prompt-injection test cases for the hundredth time.

Quality evals: where LangSmith / Braintrust / Arize win

Conversely, RAMPART is thin on quality evals:

CapabilityRAMPARTLangSmithBraintrustArize
LLM-as-judge accuracy scoringLimitedStrongStrongStrong
Retrieval relevance evals (RAG)LimitedStrongStrongStrong
Latency + cost regressionLimitedStrongStrongStrong
Hallucination detectionLimitedStrongStrongStrong
Production trace replayLimitedStrongStrongStrongest
Dataset versioningLimitedStrongStrongStrong
A/B testing across model versionsLimitedStrongStrongMedium

If your top pain is “is the agent producing accurate, fast, on-task responses,” you need an evals platform — RAMPART isn’t the right tool.

Decision framework

You should run RAMPART if:

  • Your agent takes real-world actions (writes to systems, sends emails, transacts, browses the web).
  • You ship in regulated industries (finance, healthcare, government, legal).
  • You’ve had a prompt-injection incident (or you don’t want to find out you should have).
  • You want safety regression coverage every commit, not just at launch.

You should run LangSmith if:

  • You’re already on LangChain or LangGraph and want the platform with native framework integration.
  • Your top eval needs are quality, retrieval, and trace replay.

You should run Braintrust if:

  • You want the cleanest SaaS evaluator-authoring experience and aren’t tied to a specific agent framework.
  • Your team iterates rapidly on custom graders and wants a polished diff/regression experience.

You should run Arize Phoenix if:

  • You want open-source observability you can self-host and own.
  • Your existing observability stack is OpenTelemetry-native and you want LLM tracing to fit in cleanly.

Most production teams run RAMPART + one of LangSmith/Braintrust/Arize. The combination covers safety regression + quality + observability without duplication.

Caveats

  • RAMPART is new (May 20, 2026). Starter scenarios are good; the long-tail authoring experience and ecosystem will mature through Q3 2026.
  • Multimodal coverage is text-first. Vision-based attacks, audio injection, document-embedded payloads get expanded coverage post-launch.
  • LangSmith / Braintrust have hosted-only paths. If self-hosted is a hard requirement, Arize Phoenix is the obvious open-source choice.
  • M365-bound agents have a Microsoft-shop alternative — M365 Copilot Evaluations Tool — that may be the path of least resistance inside Foundry.

Verdict

  • Best for adversarial safety regression in CI: Microsoft RAMPART.
  • Best for LangChain / LangGraph quality evals: LangSmith.
  • Best for framework-agnostic polished evals SaaS: Braintrust.
  • Best for self-hosted open-source observability: Arize Phoenix.
  • Best combined stack: RAMPART for safety + one of the three for quality + observability.

The story for May 2026: safety regression and quality evals are now two distinct disciplines with distinct tooling. The teams that run both will ship more reliable agents than the teams that run only one.