mcp-witness

Calibration corpus

Real-server evaluation harness for the classifier and analyzer. Goal: turn lexicon and rule tuning from gut-feel into measurable precision and recall against hand-labeled ground truth, per the calibration plan in docs/capability-classifier.md.

Workflow

The fastest way to add a new target — three commands plus one round of hand-labeling:

# 1. Capture the server's tools/list:
mcp-witness-capture --server-cmd python --server-arg -m --server-arg some_server \
    -o /tmp/captured.json

# 2. Scaffold a ground-truth skeleton:
mcp-witness-scaffold-gt /tmp/captured.json --name some_server --language python \
    --source https://github.com/.../some_server \
    -o ground_truth/some_server.yaml

# 3. Hand-edit ground_truth/some_server.yaml — populate `capabilities`,
#    `parameter_roles`, and `known_vulns` for each tool. Names,
#    descriptions, and input schemas are already filled in by step 2.

# 4. Run the eval:
mcp-witness-eval-calibration some_server
# Or against the whole corpus:
mcp-witness-eval-calibration --all

The judgment calls — what capability each tool has, what role each parameter plays, which static rules should fire — stay with the human auditor. The scaffolder removes the rote transcription work.

For targets you cannot run locally (transport mismatch, missing credentials, paid service), transcribe the tools/list shape into a JSON file manually and skip step 1.

Aggregation, tuning, and “stable” promotion

  1. Aggregate. Once 3+ targets are labeled, run mcp-witness-eval-calibration --all for cross-target precision/recall. The per-tool diffs list the actionable gaps.
  2. Tune. Where the report shows gaps, edit classifier/lexicons.py (or the analyzer rules) and re-run. Commit the updated JSON report under reports/ alongside the lexicon change so the diff is reviewable.

Hand-labeling rubric

For each tool, fill in:

Ground-truth schema

target_name: <slug>                 # must match the filename basename
source: <URL or filesystem path>
language: python | typescript | rust | other
mcp_spec_version: "2025-06-18"
notes: <free text>

tools:
  - name: <tool_name>
    description: <description as a user/auditor sees it>
    input_schema:
      type: object
      properties:
        <param>: { type: string, ... }
    capabilities: [<tag>, ...]      # auditor's labels
    parameter_roles:
      <param>: <role>
    known_vulns: [<rule_id>, ...]   # e.g. [MCP-S-006]

A tool may have zero capabilities — that’s a meaningful label (the auditor saw nothing dangerous). Empty parameter_roles is also fine if every parameter is uninteresting.

Acceptance criteria for “v0.1 stable”

Before any classifier rule or analyzer rule promotes from experimental to stable:

The included example_server is the analyzer’s own fixture — useful for verifying the workflow but not sufficient on its own.

Layout

Path Purpose
targets.yaml Planned corpus, each with a status
ground_truth/*.yaml Hand-labeled tool definitions, one file per target
reports/*.json Eval output, committed for diff-tracking lexicon changes
eval.py The evaluation script and CLI
tests/ Smoke tests against the example_server target

Reading a report

Target: example_server  (9 tools)

Per-tag capability metrics:
  Tag                    TP   FP   FN    Prec    Recl
  exec                    4    0    0    1.00    1.00
  fs_read                 3    0    1    1.00    0.75

Parameter role accuracy: 8/9 (88.89%)

Per-tool diffs (predicted vs ground truth):
  vulnerable_desc_injection:
    missing:  ['fs_read']