How do you run an evaluation?

Install the m365-copilot-eval CLI (Node.js 24.12+), authenticate against your Microsoft 365 tenant with a Copilot license, point it at your deployed declarative agent, supply prompts (CLI list, JSON file, or interactive), and run an evaluation. The tool outputs colorized console reports, JSON, CSV, and auto-opening HTML. You need an Azure OpenAI endpoint for the LLM-judge evaluators. Windows is the supported OS in preview.

What can it actually measure?

Two default metrics — Relevance and Coherence (LLM-judged on a 1-5 scale). Optional metrics: Groundedness (does the answer use the provided context?), Similarity (compare to a gold answer), Citations (are citations present and correct?), ExactMatch and PartialMatch (string-based). It supports both single-turn and multi-turn conversations, so you can test context retention and end-to-end task completion.

How is it different from LangSmith or Arize Phoenix?

LangSmith and Arize Phoenix are framework-agnostic agent eval platforms. The M365 Copilot Agent Evaluations CLI is Microsoft-specific — it knows the Agents Toolkit format, plugs into M365 tenants, and uses Azure OpenAI as judge. Use it if you ship declarative agents for M365 Copilot. Use LangSmith/Phoenix/Braintrust if your agents live outside the Microsoft stack.

Quick Answer

What is M365 Copilot Agent Evaluations Tool? (May 2026)

Q: What is the Microsoft 365 Copilot Agent Evaluations tool?

It's a CLI tool, in public preview as of May 2026, for measuring the quality of agents built for Microsoft 365 Copilot. It sends prompts to a deployed agent, captures responses, and scores them using Azure OpenAI as the LLM judge across seven evaluators (Relevance, Coherence, Groundedness, Similarity, Citations, ExactMatch, PartialMatch). It produces structured reports for developer workflows and CI/CD pipelines.

Published: May 18, 2026

What is Microsoft 365 Copilot Agent Evaluations Tool? (May 2026)

Microsoft 365 Copilot Agent Evaluations is the new CLI tool, in public preview, that gives M365 Copilot agent developers a real evaluation loop — Azure OpenAI as judge, seven metrics, CI/CD-ready reports. Here’s what it does in May 2026.

Last verified: May 18, 2026

The one-paragraph answer

The Microsoft 365 Copilot Agent Evaluations tool is a command-line interface (CLI) for measuring and improving the quality of declarative agents built for Microsoft 365 Copilot. Announced in public preview around Build 2026 prep (May 2026), the tool sends prompts to a deployed agent, captures the responses, and scores them using Azure OpenAI as the LLM judge across seven evaluator types — Relevance, Coherence, Groundedness, Similarity, Citations, ExactMatch, and PartialMatch. The tool produces structured reports in console (colorized), JSON, CSV, and auto-opening HTML so developers can integrate it into CI/CD pipelines for declarative-agent quality gates. It is part of the Microsoft 365 Agents Toolkit and is the first Microsoft-supplied eval layer for the M365 Copilot agent ecosystem.

What it does

Developer
  │
  ▼
m365-copilot-eval CLI
  │
  ├─► Send prompts to deployed agent (M365 tenant)
  │
  ├─► Capture agent response (single- or multi-turn)
  │
  ├─► Score with Azure OpenAI LLM judge (Relevance, Coherence, ...)
  │
  └─► Emit report (console / JSON / CSV / HTML)
            │
            └─► CI/CD pipeline (quality gate)

The full loop fits into a CI workflow. Push agent change → run eval → block PR if Relevance/Coherence drops below threshold.

The seven evaluators

Evaluator	Type	Scale	What it measures
Relevance	LLM-judge	1-5	Is the response on-topic? (default)
Coherence	LLM-judge	1-5	Is the response well-structured? (default)
Groundedness	LLM-judge	1-5	Does the answer use provided context faithfully?
Similarity	LLM-judge	1-5	Compare against a gold answer
Citations	LLM-judge	1-5	Are citations present and correct?
ExactMatch	String	0/1	Exact string match
PartialMatch	String	0/1	Partial / fuzzy match

By default, only Relevance and Coherence run. Turn on the rest per scenario.

Single-turn and multi-turn

Single-turn: classic prompt → response → score.
Multi-turn: simulate a conversation. Important for testing context retention, follow-up handling, and end-to-end task completion. This is the real reason to use it for M365 Copilot agents, which usually live in extended chat sessions.

How to use it

Prerequisites:

Microsoft 365 Copilot license
Declarative agent deployed to your tenant
Node.js 24.12.0 or later
Admin consent to run the tool
Azure OpenAI endpoint (for LLM-judge evaluators)
Windows OS (preview-supported)

Basic flow:

# Install (preview)
npm install -g @microsoft/m365-copilot-eval

# Authenticate against your tenant
m365-copilot-eval auth login

# Run eval against a deployed agent
m365-copilot-eval run \
  --agent <agent-id> \
  --prompts ./prompts.json \
  --evaluators relevance,coherence,groundedness \
  --azure-openai-endpoint https://<your-resource>.openai.azure.com \
  --output html

You can also use the interactive selector to pick an agent without knowing its ID.

Where it fits in the M365 Copilot agent ecosystem

Microsoft 365 Agents Toolkit — the dev kit. Eval CLI plugs into it.
Copilot Studio — low-code builder. A2A communication, governance, intelligent workflows.
Microsoft Agent 365 — control plane for managing agents across the org.
Agent Evaluations CLI — quality measurement layer.

You build the agent in Agents Toolkit or Copilot Studio, manage it via Agent 365, and measure it with the Evaluations CLI.

Why it matters

Until now, M365 Copilot agent developers had no first-party way to measure quality at scale. You either:

Eye-balled outputs (doesn’t scale)
Built a custom harness (reinvent the wheel)
Used third-party tools like LangSmith or Braintrust (works but not M365-native)

The Evaluations CLI closes that gap. It also signals Microsoft’s recognition that eval is the production bottleneck for agents in 2026 — not the model, not the framework, but knowing whether your agent is getting better or worse on each change.

How it compares

	M365 Copilot Agent Evaluations	LangSmith	Arize Phoenix	Braintrust
Vendor	Microsoft	LangChain	Arize AI	Braintrust
M365 Copilot integration	✅ Native	Manual	Manual	Manual
LLM judge	Azure OpenAI	Any	Any	Any
Multi-turn	✅	✅	✅	✅
CI/CD reports	✅	✅	✅	✅
Pricing	Included with M365 Copilot license + Azure OpenAI usage	SaaS subscription	Open source + cloud	SaaS subscription
Best for	M365 Copilot agents	LangGraph / LangChain agents	Any LLM app	Production LLM ops

Strengths

First-party — built by Microsoft for Microsoft 365 Copilot agents.
Free with M365 Copilot license — only pay Azure OpenAI judge costs.
CI/CD ready out of the box.
Multi-turn support matches how M365 Copilot agents actually behave.

Weaknesses

Windows-only in preview — other OS support promised.
Azure OpenAI required for the LLM judges (no BYO judge yet).
Microsoft-specific — useless for agents outside M365 Copilot.
No grading dataset versioning yet — bring your own data discipline.

What’s next

Cross-OS support beyond Windows.
Custom evaluator plug-ins (bring your own scoring rubric).
Integration with Microsoft Agent 365’s monitoring stack.
Dataset/version management.

TL;DR

The Microsoft 365 Copilot Agent Evaluations CLI is Microsoft’s first-party eval harness for declarative M365 Copilot agents — seven metrics, Azure OpenAI judge, CI/CD reports. If you ship agents into a Microsoft tenant, you want this in your pipeline before Build 2026 GA.

Sources: Microsoft 365 Dev Blog “Announcing the public preview of the M365 Copilot Agent Evaluations tool” (devblogs.microsoft.com), microsoft/m365-copilot-eval GitHub repo, Microsoft Learn evaluations-cli docs — May 2026.