What is Microsoft 365 Copilot Agent Evaluations Tool? (May 2026)

Microsoft 365 Copilot Agent Evaluations is the new CLI tool, in public preview, that gives M365 Copilot agent developers a real evaluation loop — Azure OpenAI as judge, seven metrics, CI/CD-ready reports. Here’s what it does in May 2026.

Last verified: May 18, 2026

The one-paragraph answer

The Microsoft 365 Copilot Agent Evaluations tool is a command-line interface (CLI) for measuring and improving the quality of declarative agents built for Microsoft 365 Copilot. Announced in public preview in May 2026, in the run-up to Build 2026, the tool sends prompts to a deployed agent, captures the responses, and scores them with seven evaluator types: five judged by an Azure OpenAI model (Relevance, Coherence, Groundedness, Similarity, Citations) and two based on string comparison (ExactMatch, PartialMatch). It produces structured reports as colorized console output, JSON, CSV, and auto-opening HTML, so developers can wire it into CI/CD pipelines as a quality gate for declarative agents. It is part of the Microsoft 365 Agents Toolkit and is the first Microsoft-supplied eval layer for the M365 Copilot agent ecosystem.

What it does

Developer
   │
   ▼
m365-copilot-eval CLI
  ├─► Send prompts to deployed agent (M365 tenant)
  ├─► Capture agent response (single- or multi-turn)
  ├─► Score with Azure OpenAI LLM judge (Relevance, Coherence, ...)
  └─► Emit report (console / JSON / CSV / HTML)
            └─► CI/CD pipeline (quality gate)

The full loop fits into a CI workflow. Push agent change → run eval → block PR if Relevance/Coherence drops below threshold.
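A quality gate of this kind can be sketched in Python. This is a minimal sketch, not the tool's own API: the report shape (a list of per-prompt results, each with a `scores` mapping), the function names, and the threshold values are all assumptions made for illustration, so adapt the field names to whatever the JSON report actually contains.

```python
import json
import sys
from statistics import mean

# Hypothetical thresholds on the 1-5 LLM-judge scale; tune per agent.
THRESHOLDS = {"relevance": 4.0, "coherence": 4.0}


def failing_metrics(results, thresholds):
    """Return {metric: mean_score} for every metric whose mean is below its threshold.

    `results` is assumed to look like [{"scores": {"relevance": 5, ...}}, ...];
    that shape is an illustration, not the tool's documented schema.
    """
    failures = {}
    for metric, minimum in thresholds.items():
        values = [r["scores"][metric] for r in results if metric in r.get("scores", {})]
        if not values:
            continue  # metric was not run, so there is nothing to gate on
        average = mean(values)
        if average < minimum:
            failures[metric] = average
    return failures


def main(report_path):
    """Load a JSON report and return a process exit code (1 blocks the PR)."""
    with open(report_path, encoding="utf-8") as handle:
        results = json.load(handle)
    failures = failing_metrics(results, THRESHOLDS)
    for metric, average in failures.items():
        print(f"FAIL {metric}: mean {average:.2f} < {THRESHOLDS[metric]}")
    return 1 if failures else 0


# Only runs when invoked as a script with a report path argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1]))
```

A CI job would run the eval with JSON output, then call this script on the report; a non-zero exit code fails the step and blocks the merge.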

The seven evaluators

| Evaluator    | Type      | Scale | What it measures                                  |
|--------------|-----------|-------|---------------------------------------------------|
| Relevance    | LLM-judge | 1-5   | Is the response on-topic? (default)               |
| Coherence    | LLM-judge | 1-5   | Is the response well-structured? (default)        |
| Groundedness | LLM-judge | 1-5   | Does the answer use provided context faithfully?  |
| Similarity   | LLM-judge | 1-5   | Compare against a gold answer                     |
| Citations    | LLM-judge | 1-5   | Are citations present and correct?                |
| ExactMatch   | String    | 0/1   | Exact string match                                |
| PartialMatch | String    | 0/1   | Partial / fuzzy match                             |

By default, only Relevance and Coherence run. Turn on the rest per scenario.
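To make the difference between the two string evaluators concrete, here is a minimal Python sketch. It is an illustration only: the source does not say how the tool normalizes text or defines a partial match, so the case and whitespace normalization and the substring rule below are assumptions, and the function names are hypothetical.

```python
def _normalize(text: str) -> str:
    # Assumed normalization: collapse whitespace and ignore case.
    return " ".join(text.split()).casefold()


def exact_match(response: str, expected: str) -> int:
    """Return 1 if the normalized strings are identical, else 0."""
    return int(_normalize(response) == _normalize(expected))


def partial_match(response: str, expected: str) -> int:
    """Return 1 if the normalized expected text appears inside the response, else 0."""
    needle = _normalize(expected)
    return int(bool(needle) and needle in _normalize(response))
```

Exact matching suits answers with one canonical form (an ID, a date, a yes/no), while a partial match tolerates the surrounding sentence an agent usually wraps around the key fact.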

Single-turn and multi-turn

  • Single-turn: classic prompt → response → score.
  • Multi-turn: simulate a conversation. Important for testing context retention, follow-up handling, and end-to-end task completion. This is the mode that matters most for M365 Copilot agents, which usually live in extended chat sessions.
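The multi-turn idea can be sketched as replaying a scripted conversation while carrying the history forward. This is a hedged illustration, not the tool's scenario format: `Turn`, `run_conversation`, and the `agent` callable are hypothetical names, and the stand-in agent below replaces a real call to a deployed agent.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (user prompt, agent reply) pairs so far


@dataclass
class Turn:
    prompt: str
    expected_substring: str  # hypothetical per-turn check


def run_conversation(agent: Callable[[History, str], str], turns: List[Turn]) -> List[bool]:
    """Send each turn with the accumulated history and record whether it passed."""
    history: History = []
    outcomes = []
    for turn in turns:
        reply = agent(history, turn.prompt)
        outcomes.append(turn.expected_substring.casefold() in reply.casefold())
        history.append((turn.prompt, reply))
    return outcomes


def echo_agent(history: History, prompt: str) -> str:
    """Stand-in agent: recalls the first prompt, to exercise context retention."""
    if history:
        return f"Earlier you asked about: {history[0][0]}"
    return f"You asked: {prompt}"
```

The second turn only passes if the agent still knows what the first turn was about, which is exactly the failure a single-turn test cannot catch.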

How to use it

Prerequisites:

  • Microsoft 365 Copilot license
  • Declarative agent deployed to your tenant
  • Node.js 24.12.0 or later
  • Admin consent to run the tool
  • Azure OpenAI endpoint (for LLM-judge evaluators)
  • Windows OS (preview-supported)
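Two of these prerequisites can be checked mechanically before a run. The following Python sketch is an illustration under stated assumptions: it covers only the Node.js version and operating system listed above, the helper names are made up for this example, and it expects `node --version` to print a plain release string such as `v24.12.0`.

```python
import platform
import subprocess
from typing import List, Tuple

MIN_NODE = (24, 12, 0)  # minimum Node.js version listed in the prerequisites


def parse_node_version(raw: str) -> Tuple[int, ...]:
    """Turn the output of `node --version` (e.g. 'v24.12.0') into a tuple of ints."""
    return tuple(int(part) for part in raw.strip().lstrip("v").split("."))


def node_is_new_enough(raw: str, minimum: Tuple[int, ...] = MIN_NODE) -> bool:
    """Compare versions numerically, so 24.9.0 is correctly older than 24.12.0."""
    return parse_node_version(raw) >= minimum


def preflight() -> List[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if platform.system() != "Windows":
        problems.append("preview is supported on Windows only")
    try:
        raw = subprocess.run(
            ["node", "--version"], capture_output=True, text=True, check=True
        ).stdout
        if not node_is_new_enough(raw):
            problems.append(f"Node.js {raw.strip()} is older than 24.12.0")
    except (OSError, subprocess.CalledProcessError):
        problems.append("Node.js not found on PATH")
    return problems
```

The license, admin consent, and Azure OpenAI endpoint still have to be confirmed by hand; this check only catches the local-machine issues early.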

Basic flow:

# Install (preview)
npm install -g @microsoft/m365-copilot-eval

# Authenticate against your tenant
m365-copilot-eval auth login

# Run eval against a deployed agent
m365-copilot-eval run \
  --agent <agent-id> \
  --prompts ./prompts.json \
  --evaluators relevance,coherence,groundedness \
  --azure-openai-endpoint https://<your-resource>.openai.azure.com \
  --output html

You can also use the interactive selector to pick an agent without knowing its ID.
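For scripted runs, the same invocation can be assembled programmatically. This Python sketch only mirrors the flags shown in the example above and has not been checked against the tool's reference; `build_eval_command` is a hypothetical helper, and the lowercase spellings of the evaluator names (for example `exactmatch`) are an assumption extrapolated from that example.

```python
from typing import List, Sequence

# Assumed lowercase forms of the seven evaluators named in this article.
KNOWN_EVALUATORS = {
    "relevance", "coherence", "groundedness", "similarity",
    "citations", "exactmatch", "partialmatch",
}


def build_eval_command(agent_id: str, prompts_path: str, evaluators: Sequence[str],
                       endpoint: str, output: str = "json") -> List[str]:
    """Build the argv list for the `run` command as written in the example above."""
    unknown = [name for name in evaluators if name.lower() not in KNOWN_EVALUATORS]
    if unknown:
        raise ValueError(f"unknown evaluators: {', '.join(unknown)}")
    return [
        "m365-copilot-eval", "run",
        "--agent", agent_id,
        "--prompts", prompts_path,
        "--evaluators", ",".join(name.lower() for name in evaluators),
        "--azure-openai-endpoint", endpoint,
        "--output", output,
    ]
```

Passing the resulting list to `subprocess.run(cmd, check=True)` avoids shell quoting issues and fails fast on a typo in an evaluator name before any judge calls are made.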

Where it fits in the M365 Copilot agent ecosystem

  • Microsoft 365 Agents Toolkit: the developer kit. The Evaluations CLI plugs into it.
  • Copilot Studio: the low-code builder, covering agent-to-agent (A2A) communication, governance, and intelligent workflows.
  • Microsoft Agent 365: the control plane for managing agents across the organization.
  • Agent Evaluations CLI: the quality measurement layer.

You build the agent in Agents Toolkit or Copilot Studio, manage it via Agent 365, and measure it with the Evaluations CLI.

Why it matters

Until now, M365 Copilot agent developers had no first-party way to measure quality at scale. The options were:

  • Eyeballing outputs (doesn’t scale)
  • Building a custom harness (reinventing the wheel)
  • Using third-party tools like LangSmith or Braintrust (works, but not M365-native)

The Evaluations CLI closes that gap. It also signals Microsoft’s recognition that eval is the production bottleneck for agents in 2026 — not the model, not the framework, but knowing whether your agent is getting better or worse on each change.

How it compares

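|                          | M365 Copilot Agent Evaluations                          | LangSmith                    | Arize Phoenix       | Braintrust         |
|--------------------------|---------------------------------------------------------|------------------------------|---------------------|--------------------|
| Vendor                   | Microsoft                                               | LangChain                    | Arize AI            | Braintrust         |
| M365 Copilot integration | ✅ Native                                               | Manual                       | Manual              | Manual             |
| LLM judge                | Azure OpenAI                                            | Any                          | Any                 | Any                |
| Multi-turn               | Yes                                                     |                              |                     |                    |
| CI/CD reports            | Yes                                                     |                              |                     |                    |
| Pricing                  | Included with M365 Copilot license + Azure OpenAI usage | SaaS subscription            | Open source + cloud | SaaS subscription  |
| Best for                 | M365 Copilot agents                                     | LangGraph / LangChain agents | Any LLM app         | Production LLM ops |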

Strengths

  • First-party — built by Microsoft for Microsoft 365 Copilot agents.
  • Free with M365 Copilot license — only pay Azure OpenAI judge costs.
  • CI/CD ready out of the box.
  • Multi-turn support matches how M365 Copilot agents actually behave.

Weaknesses

  • Windows-only in preview — other OS support promised.
  • Azure OpenAI required for the LLM judges (no BYO judge yet).
  • Microsoft-specific — useless for agents outside M365 Copilot.
  • No grading dataset versioning yet — bring your own data discipline.

What’s next

  • Cross-OS support beyond Windows.
  • Custom evaluator plug-ins (bring your own scoring rubric).
  • Integration with Microsoft Agent 365’s monitoring stack.
  • Dataset/version management.

TL;DR

The Microsoft 365 Copilot Agent Evaluations CLI is Microsoft’s first-party eval harness for declarative M365 Copilot agents — seven metrics, Azure OpenAI judge, CI/CD reports. If you ship agents into a Microsoft tenant, you want this in your pipeline before Build 2026 GA.


Sources: Microsoft 365 Dev Blog “Announcing the public preview of the M365 Copilot Agent Evaluations tool” (devblogs.microsoft.com), microsoft/m365-copilot-eval GitHub repo, Microsoft Learn evaluations-cli docs — May 2026.