Instrumenting and comparing Agent Harnesses with RawTree
by Rafa Moreno
Vercel recently added HarnessAgent to AI SDK 7: a single API for integrating established agent harnesses into your code.
An agent harness is the runtime around the model: the sandbox it can use, the session it runs in, the skills and sub-agents it can call, and the permission, compaction, and configuration rules that shape the run. The AI SDK puts those moving parts behind one interface, so the same application code can run a task through Claude Code, Codex, or any other supported harness.
Claude Code and Codex can both attempt the same incident-triage prompt, but they do not work the same way. One might call fewer tools. One might burn fewer tokens. One might finish faster but miss the exact mitigation your rubric expects.
You can compare the results by evaluating what each run returns, but the right telemetry also tells you how the agents got there.
So I ran a small benchmark:
- same incident-triage task
- same sandbox runtime; the sample below uses Vercel Sandbox
- same grading rubric
- two harnesses: Claude Code and Codex
- all spans exported to RawTree through native OTLP
The result was not a dashboard or a hand-built eval system. It was one RawTree table and one query.
For the benchmark I used RawTree's TypeScript SDK, which includes @rawtree/otel to register the OpenTelemetry exporter in the Node process.
I did not have to create a single table or define a schema to do it. AI SDK emits spans, and @rawtree/otel sends them to RawTree's native OpenTelemetry endpoints, where they are stored as one queryable row per span.
Run the same task through each harness
The setup is simple: keep the sandbox fixed, swap the harness, and turn on AI SDK telemetry:
import { HarnessAgent } from "@ai-sdk/harness/agent";
import { createClaudeCode } from "@ai-sdk/harness-claude-code";
import { createCodex } from "@ai-sdk/harness-codex";
import { createVercelSandbox } from "@ai-sdk/sandbox-vercel";

const sandbox = createVercelSandbox({ runtime: "node24" });

for (const [name, harness] of [
  ["claude-code", createClaudeCode()],
  ["codex", createCodex()],
] as const) {
  const agent = new HarnessAgent({
    id: name,
    harness,
    sandbox,
    telemetry: { recordInputs: true, recordOutputs: true, functionId: name },
  });
  const session = await agent.createSession();
  try {
    const result = await agent.generate({ session, prompt: TASK });
    // Grade result.text and record benchmark attributes.
  } finally {
    await session.destroy();
  }
}

Register RawTree once at startup:
import { registerOTel, aiSdkIntegration } from "@rawtree/otel";

const rawtree = registerOTel({
  apiKey: process.env.RAWTREE_API_KEY!,
  serviceName: "harness-bench",
  environment: process.env.NODE_ENV ?? "development",
  integrations: [aiSdkIntegration()],
});

That is the pipeline. AI SDK emits OpenTelemetry spans, @rawtree/otel exports them to RawTree's native /otlp/v1/traces endpoint, and RawTree applies the otlp-traces transform on insert. The important bit for the benchmark is the flattening: you get one queryable row per span in traces, not a nested OTLP batch you have to unpack later.
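To make the flattening concrete, here is a minimal sketch of what turning a nested OTLP/JSON trace payload into one row per span can look like. It is not RawTree's otlp-traces transform, whose implementation and column names are not shown in this post. The function name flattenTraces, the SpanRow type, and the output columns trace_id, span_id, and span_duration_ms are assumptions for illustration. Only the resourceSpans / scopeSpans / spans nesting and the key/value attribute encoding come from the OTLP JSON format.

```typescript
// Minimal subset of the OTLP/JSON trace shape: resourceSpans -> scopeSpans -> spans.
type AnyValue = {
  stringValue?: string;
  intValue?: string | number; // int64 values are string-encoded in OTLP/JSON
  doubleValue?: number;
  boolValue?: boolean;
};
type KeyValue = { key: string; value: AnyValue };

interface OtlpSpan {
  traceId: string;
  spanId: string;
  name: string;
  startTimeUnixNano: string;
  endTimeUnixNano: string;
  attributes?: KeyValue[];
}

interface OtlpTraces {
  resourceSpans: {
    resource?: { attributes?: KeyValue[] };
    scopeSpans: { spans: OtlpSpan[] }[];
  }[];
}

type Scalar = string | number | boolean | null;
type SpanRow = Record<string, Scalar>; // hypothetical row shape

// Unwrap the tagged OTLP value into a plain scalar (arrays and maps are ignored here).
function scalar(v: AnyValue): Scalar {
  if (v.stringValue !== undefined) return v.stringValue;
  if (v.intValue !== undefined) return Number(v.intValue);
  if (v.doubleValue !== undefined) return v.doubleValue;
  if (v.boolValue !== undefined) return v.boolValue;
  return null;
}

// Walk the three nesting levels and emit one flat row per span.
function flattenTraces(payload: OtlpTraces): SpanRow[] {
  const rows: SpanRow[] = [];
  for (const rs of payload.resourceSpans) {
    // Resource attributes (for example service.name) are copied onto every row.
    const resourceAttrs: SpanRow = {};
    for (const kv of rs.resource?.attributes ?? []) {
      resourceAttrs[kv.key] = scalar(kv.value);
    }
    for (const ss of rs.scopeSpans) {
      for (const span of ss.spans) {
        const row: SpanRow = {
          ...resourceAttrs,
          trace_id: span.traceId,
          span_id: span.spanId,
          name: span.name,
          // Number() loses sub-microsecond precision on nanosecond timestamps,
          // which is acceptable for a millisecond duration.
          span_duration_ms:
            (Number(span.endTimeUnixNano) - Number(span.startTimeUnixNano)) / 1e6,
        };
        // Span attributes become top-level keys, so SQL can address them directly.
        for (const kv of span.attributes ?? []) {
          row[kv.key] = scalar(kv.value);
        }
        rows.push(row);
      }
    }
  }
  return rows;
}
```

With rows in this shape, a filter such as name = 'eval.grade' is a plain predicate on one key rather than a walk through nested arrays.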
What RawTree captures
Every harness emits the same span shapes through AI SDK instrumentation:
| Span | What it tells you |
|---|---|
| invoke_agent <model> | A complete agent run |
| step N | One reasoning step |
| chat <model> | One model call |
| execute_tool <name> | One tool call |
The benchmark also adds one custom span, eval.grade, after each run. That row carries the product metrics you actually want to compare:
| Attribute | Meaning |
|---|---|
| harness | claude-code or codex |
| model_calls | number of model calls in the run |
| tool_calls | number of tool calls in the run |
| input_tokens / output_tokens | token usage reported by the harness |
| duration_ms | wall-clock run duration |
| passed | whether the run met the rubric |
That custom span is the benchmarking move. The harness tells you what happened. Your grader tells you whether it was good.
It can be as small as this:
import { SpanStatusCode, trace } from "@opentelemetry/api";

const tracer = trace.getTracer("harness-bench");

// AgentResult and rubric() come from your own benchmark code.
async function grade(harness: string, run: () => Promise<AgentResult>) {
  return tracer.startActiveSpan("eval.grade", async (span) => {
    span.setAttribute("harness", harness);
    try {
      const result = await run();
      const passed = rubric(result.text);
      span.setAttribute("passed", passed);
      span.setAttribute("model_calls", result.modelCalls);
      span.setAttribute("tool_calls", result.toolCalls);
      span.setAttribute("input_tokens", result.inputTokens);
      span.setAttribute("output_tokens", result.outputTokens);
      span.setAttribute("duration_ms", result.durationMs);
      span.setStatus({ code: passed ? SpanStatusCode.OK : SpanStatusCode.ERROR });
      return { ...result, passed };
    } catch (err) {
      // A crashed run should surface as an error row, not a span with no status.
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}

Query the benchmark result
Because RawTree transformed the OTLP payload into flat span rows, the comparison is direct SQL over columns. No resourceSpans, no arrayJoin, no JSON unpacking in every question:
SELECT
  harness,
  model_calls,
  tool_calls,
  input_tokens,
  output_tokens,
  round(duration_ms / 1000, 1) AS wall_seconds,
  passed
FROM traces
WHERE name = 'eval.grade'
ORDER BY wall_seconds ASC

For the incident-triage run, the result was:
| harness | model calls | tool calls | input tokens | output tokens | wall seconds | passed |
|---|---|---|---|---|---|---|
| codex | 1 | 4 | 19,541 | 773 | 11.7 | ❌ |
| claude-code | 1 | 5 | 73,006 | 1,152 | 28.9 | ✅ |
Both harnesses identified deployment v2.14.1 and the inventory retry/backoff behavior. They diverged on the mitigation. Claude Code recommended an immediate rollback and passed the rubric. Codex recommended reducing or disabling retries, which is directionally useful but failed the stricter rollback/circuit-breaker criterion.
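A rubric like that can be a plain predicate over the answer text. The sketch below is hypothetical: the post does not show the real rubric, so the function body, the regular expressions, and the choice to require the deployment version are assumptions that only mirror the criterion described above.

```typescript
// Hypothetical rubric: pass only if the answer names the culprit deployment
// and recommends a rollback or a circuit breaker as the mitigation.
// Keyword matching is deliberately naive: it does not understand negation,
// so "do not roll back" would still pass. A real grader needs more care.
function rubric(answer: string): boolean {
  const text = answer.toLowerCase();
  const namesDeployment = /v2\.14\.1/.test(text);
  // Matches "rollback", "roll back", "roll-back", "rolling back", "circuit breaker".
  const recommendsMitigation =
    /\broll(?:ing)?[\s-]?back\b|\bcircuit[\s-]?breaker\b/.test(text);
  return namesDeployment && recommendsMitigation;
}
```

Keeping the rubric a pure function of the text means the same check can grade every harness, and a stricter version can be replayed over recorded outputs later.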
That is exactly why this is a benchmark instead of a vibe check. Codex made one fewer tool call, used far fewer tokens, and finished more than 17 seconds faster. Claude Code spent more and took longer, but produced the answer the rubric accepted. Run the task ten times and these become distributions. Run ten tasks and you learn which harness fits which class of work.
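Once there are repeated runs, each harness gets a pass rate and medians instead of a single row. The helper below is a minimal sketch of that aggregation over eval.grade rows that have already been fetched. The GradeRow and HarnessSummary types and the summarize function are hypothetical names, and the same rollup could equally be written as a GROUP BY in the query.

```typescript
// Hypothetical shape of one eval.grade row after it has been queried.
interface GradeRow {
  harness: string;
  durationMs: number;
  inputTokens: number;
  passed: boolean;
}

interface HarnessSummary {
  runs: number;
  passRate: number; // fraction of runs that met the rubric, 0..1
  medianDurationMs: number;
  medianInputTokens: number;
}

// Median of a list of numbers; NaN for an empty list.
function median(values: number[]): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const upper = sorted[mid] ?? NaN;
  const lower = sorted[mid - 1] ?? upper;
  return sorted.length % 2 === 1 ? upper : (lower + upper) / 2;
}

// Group rows by harness, then reduce each group to a small summary.
function summarize(rows: GradeRow[]): Map<string, HarnessSummary> {
  const groups = new Map<string, GradeRow[]>();
  for (const row of rows) {
    const group = groups.get(row.harness) ?? [];
    group.push(row);
    groups.set(row.harness, group);
  }
  const summaries = new Map<string, HarnessSummary>();
  groups.forEach((group, harness) => {
    summaries.set(harness, {
      runs: group.length,
      passRate: group.filter((r) => r.passed).length / group.length,
      medianDurationMs: median(group.map((r) => r.durationMs)),
      medianInputTokens: median(group.map((r) => r.inputTokens)),
    });
  });
  return summaries;
}
```

Medians are used rather than means on the assumption that a single slow or token-heavy run should not dominate a small sample.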
Why use current harness defaults?
This is a harness-as-shipped benchmark, not a controlled same-model test. In production you usually do not buy "Claude Code with my hand-picked model" or "Codex with my hand-picked model" as identical wrappers. You choose a harness because of the total system: model choice, tool loop, permissions, compaction, sandbox behavior, and defaults.
So the benchmark used current defaults at the time of the run: Claude Code with anthropic/claude-sonnet-4.6, and Codex with gpt-5.3-codex.
If you want to isolate the model variable, run a second benchmark. The point is that RawTree gives you the same flattened table either way.
The workflow scales
The useful part is that the benchmark output is ordinary telemetry:
- register OpenTelemetry in the process
- run each harness with the same task and sandbox
- write one eval.grade span with pass/fail and run-level counters
- query the traces table
After that, adding a harness is just another harness value. Adding a metric is just another span attribute. Since RawTree flattens OTLP on ingest, the next question is SQL, not an eval-dashboard project.
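As a rough illustration of "another span attribute", the sketch below collects the run-level counters in one place and adds a derived metric next to them. The names AttributeSink, RunMetrics, recordRunMetrics, and the output_tokens_per_second attribute are all hypothetical. AttributeSink is a structural stand-in that an OpenTelemetry span should satisfy through its setAttribute method, which keeps the example free of external imports.

```typescript
// Structural stand-in for an OpenTelemetry span: only the method this sketch needs.
interface AttributeSink {
  setAttribute(key: string, value: string | number | boolean): unknown;
}

// Hypothetical run-level counters collected by the benchmark.
interface RunMetrics {
  harness: string;
  modelCalls: number;
  toolCalls: number;
  inputTokens: number;
  outputTokens: number;
  durationMs: number;
  passed: boolean;
}

// Write every counter as a flat span attribute.
function recordRunMetrics(span: AttributeSink, m: RunMetrics): void {
  const attributes: Record<string, string | number | boolean> = {
    harness: m.harness,
    model_calls: m.modelCalls,
    tool_calls: m.toolCalls,
    input_tokens: m.inputTokens,
    output_tokens: m.outputTokens,
    duration_ms: m.durationMs,
    passed: m.passed,
    // New metric (assumed name): one more key here is one more attribute on the span.
    output_tokens_per_second:
      m.durationMs > 0 ? m.outputTokens / (m.durationMs / 1000) : 0,
  };
  for (const key of Object.keys(attributes)) {
    const value = attributes[key];
    if (value !== undefined) span.setAttribute(key, value);
  }
}
```

Inside the grade function, this would replace the individual setAttribute calls with a single recordRunMetrics(span, metrics) call.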
Run the benchmark on your own task, then compare the rows in RawTree. Sign up for the private beta.
The harness packages and HarnessAgent API ship with AI SDK 7, currently on the canary release and still moving. Check the AI SDK docs for the current surface.
The @rawtree/otel SDK is experimental too, and its API may change before a stable release.