Real-server evaluation harness for the classifier and analyzer. Goal: turn lexicon and rule tuning from gut-feel into measurable precision and recall against hand-labeled ground truth, per the calibration plan in docs/capability-classifier.md.
The fastest way to add a new target — three commands plus one round of hand-labeling:
# 1. Capture the server's tools/list:
mcp-witness-capture --server-cmd python --server-arg -m --server-arg some_server \
-o /tmp/captured.json
# 2. Scaffold a ground-truth skeleton:
mcp-witness-scaffold-gt /tmp/captured.json --name some_server --language python \
--source https://github.com/.../some_server \
-o ground_truth/some_server.yaml
# 3. Hand-edit ground_truth/some_server.yaml — populate `capabilities`,
# `parameter_roles`, and `known_vulns` for each tool. Names,
# descriptions, and input schemas are already filled in by step 2.
# 4. Run the eval:
mcp-witness-eval-calibration some_server
# Or against the whole corpus:
mcp-witness-eval-calibration --all
The judgment calls — what capability each tool has, what role each parameter plays, which static rules should fire — stay with the human auditor. The scaffolder removes the rote transcription work.
For targets you cannot run locally (transport mismatch, missing credentials, paid service), transcribe the tools/list shape into a JSON file manually and skip step 1.
mcp-witness-eval-calibration --all for cross-target precision/recall. The per-tool diffs list the actionable gaps.reports/ alongside the lexicon change so the diff is reviewable.For each tool, fill in:
capabilities — capability tags the tool genuinely has (auditor judgment, not implementation behavior). A tool whose description says “deletes records from a table” is db_write regardless of how its body is coded.parameter_roles — semantic role of each interesting parameter (path, url, command, query, host, content). Skip params that fall into the catch-all text / id buckets unless documenting one matters.known_vulns — static-rule IDs that should fire on this tool (e.g. [MCP-S-006] if it has a known path-traversal bug). Used by the upcoming analyzer-side eval.target_name: <slug> # must match the filename basename
source: <URL or filesystem path>
language: python | typescript | rust | other
mcp_spec_version: "2025-06-18"
notes: <free text>
tools:
- name: <tool_name>
description: <description as a user/auditor sees it>
input_schema:
type: object
properties:
<param>: { type: string, ... }
capabilities: [<tag>, ...] # auditor's labels
parameter_roles:
<param>: <role>
known_vulns: [<rule_id>, ...] # e.g. [MCP-S-006]
A tool may have zero capabilities — that’s a meaningful label (the auditor saw nothing dangerous). Empty parameter_roles is also fine if every parameter is uninteresting.
Before any classifier rule or analyzer rule promotes from experimental to stable:
labeled status (mix of Python and TypeScript)high-confidence classifier outputsreports/The included example_server is the analyzer’s own fixture — useful for verifying the workflow but not sufficient on its own.
| Path | Purpose |
|---|---|
targets.yaml |
Planned corpus, each with a status |
ground_truth/*.yaml |
Hand-labeled tool definitions, one file per target |
reports/*.json |
Eval output, committed for diff-tracking lexicon changes |
eval.py |
The evaluation script and CLI |
tests/ |
Smoke tests against the example_server target |
Target: example_server (9 tools)
Per-tag capability metrics:
Tag TP FP FN Prec Recl
exec 4 0 0 1.00 1.00
fs_read 3 0 1 1.00 0.75
Parameter role accuracy: 8/9 (88.89%)
Per-tool diffs (predicted vs ground truth):
vulnerable_desc_injection:
missing: ['fs_read']
[fs_read, net_egress] contributes to two tags).MCP-S-005 (overbroad surface) acts on high-confidence only, so this is the most important number for that rule.missing line suggests a lexicon improvement, each spurious line suggests a tightening.