Instrumenting and comparing Agent Harnesses with RawTree

by Rafa Moreno

Vercel recently added HarnessAgent to AI SDK 7: a single API for integrating established agent harnesses into your code.

An agent harness is the runtime around the model: the sandbox it can use, the session it runs in, the skills and sub-agents it can call, and the permission, compaction, and configuration rules that shape the run. The AI SDK puts those moving parts behind one interface, so the same application code can run a task through Claude Code, Codex, or any other supported harness.

Claude Code and Codex can both attempt the same incident-triage prompt, but they do not work the same way. One might call fewer tools. One might burn fewer tokens. One might finish faster but miss the exact mitigation your rubric expects.

You can compare the results by evaluating the final output, but only the right telemetry will tell you how the agents got there.

So I ran a small benchmark:

  • same incident-triage task
  • same sandbox runtime; the sample below uses Vercel Sandbox
  • same grading rubric
  • two harnesses: Claude Code and Codex
  • all spans exported to RawTree through native OTLP

The result was not a dashboard or a hand-built eval system. It was one RawTree table and one query.

For the benchmark I used RawTree's TypeScript SDK, which includes @rawtree/otel to register the OpenTelemetry exporter in the Node process.

I didn't have to create a single table or define a schema to do it. AI SDK emits spans, and @rawtree/otel sends them to RawTree's native OpenTelemetry endpoints, where they are stored as one queryable row per span.

Run the same task through each harness

The setup is simple: keep the sandbox fixed, swap the harness, and turn on AI SDK telemetry:

import { HarnessAgent } from "@ai-sdk/harness/agent";
import { createClaudeCode } from "@ai-sdk/harness-claude-code";
import { createCodex } from "@ai-sdk/harness-codex";
import { createVercelSandbox } from "@ai-sdk/sandbox-vercel";

const sandbox = createVercelSandbox({ runtime: "node24" });

for (const [name, harness] of [
  ["claude-code", createClaudeCode()],
  ["codex", createCodex()],
] as const) {
  const agent = new HarnessAgent({
    id: name,
    harness,
    sandbox,
    telemetry: { recordInputs: true, recordOutputs: true, functionId: name },
  });

  const session = await agent.createSession();
  try {
    // TASK is the incident-triage prompt string, defined elsewhere.
    const result = await agent.generate({ session, prompt: TASK });
    // Grade result.text and record benchmark attributes.
  } finally {
    await session.destroy();
  }
}

Register RawTree once at startup:

import { registerOTel, aiSdkIntegration } from "@rawtree/otel";

const rawtree = registerOTel({
  apiKey: process.env.RAWTREE_API_KEY!,
  serviceName: "harness-bench",
  environment: process.env.NODE_ENV ?? "development",
  integrations: [aiSdkIntegration()],
});

That is the pipeline. AI SDK emits OpenTelemetry spans, @rawtree/otel exports them to RawTree's native /otlp/v1/traces endpoint, and RawTree applies the otlp-traces transform on insert. The important bit for the benchmark is the flattening: you get one queryable row per span in traces, not a nested OTLP batch you have to unpack later.
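
To make the flattening concrete, here is a minimal sketch of what turning a nested OTLP/JSON batch into one row per span could look like. It is an illustration only, not RawTree's actual otlp-traces transform. The function name flattenTraces, the column names (trace_id, span_id, start_time_unix_nano, end_time_unix_nano), and the choice to promote every attribute to a top-level column are assumptions. Only the resourceSpans, scopeSpans, and spans nesting comes from the OTLP format itself.

```typescript
// Minimal slice of the OTLP/JSON trace payload. Field names follow the OTLP
// format; everything else in this sketch is hypothetical.
type AnyValue = {
  stringValue?: string;
  intValue?: string | number; // OTLP/JSON encodes int64 values as strings
  doubleValue?: number;
  boolValue?: boolean;
};
type KeyValue = { key: string; value: AnyValue };
type OtlpSpan = {
  traceId: string;
  spanId: string;
  name: string;
  startTimeUnixNano: string;
  endTimeUnixNano: string;
  attributes?: KeyValue[];
};
type OtlpTraces = {
  resourceSpans: {
    resource?: { attributes?: KeyValue[] };
    scopeSpans: { spans: OtlpSpan[] }[];
  }[];
};

// One flat row per span: fixed columns plus one column per attribute.
type SpanRow = Record<string, string | number | boolean>;

// Unwrap the typed OTLP value envelope into a plain scalar.
function unwrap(v: AnyValue): string | number | boolean {
  if (v.stringValue !== undefined) return v.stringValue;
  if (v.intValue !== undefined) return Number(v.intValue);
  if (v.doubleValue !== undefined) return v.doubleValue;
  if (v.boolValue !== undefined) return v.boolValue;
  return ""; // arrays, maps, and bytes are out of scope for this sketch
}

function flattenTraces(payload: OtlpTraces): SpanRow[] {
  const rows: SpanRow[] = [];
  for (const rs of payload.resourceSpans) {
    const resourceAttrs = rs.resource?.attributes ?? [];
    for (const ss of rs.scopeSpans) {
      for (const span of ss.spans) {
        const row: SpanRow = {
          trace_id: span.traceId,
          span_id: span.spanId,
          name: span.name,
          start_time_unix_nano: span.startTimeUnixNano,
          end_time_unix_nano: span.endTimeUnixNano,
        };
        // Resource attributes are copied onto every span row first,
        // so span attributes win on a key collision.
        for (const kv of resourceAttrs) row[kv.key] = unwrap(kv.value);
        for (const kv of span.attributes ?? []) row[kv.key] = unwrap(kv.value);
        rows.push(row);
      }
    }
  }
  return rows;
}
```

With rows shaped like this, a span attribute such as harness or tool_calls is addressable as a plain column, which is what the benchmark query relies on.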

What RawTree captures

Every harness emits the same span shapes through AI SDK instrumentation:

| Span | What it tells you |
| --- | --- |
| invoke_agent <model> | A complete agent run |
| step N | One reasoning step |
| chat <model> | One model call |
| execute_tool <name> | One tool call |
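
Because the span names follow fixed prefixes, run-level counters such as model calls and tool calls can be derived from the names alone. The sketch below assumes the four name shapes listed above. classifySpan and countCalls are hypothetical helpers, not part of AI SDK or @rawtree/otel, and the model and tool names in the usage comment are placeholders.

```typescript
type SpanKind = "agent_run" | "step" | "model_call" | "tool_call" | "other";

// Map a span name to its kind by prefix, following the span table above.
function classifySpan(name: string): SpanKind {
  if (/^invoke_agent /.test(name)) return "agent_run";
  if (/^step /.test(name)) return "step";
  if (/^chat /.test(name)) return "model_call";
  if (/^execute_tool /.test(name)) return "tool_call";
  return "other";
}

// Count model and tool calls across the span names of a single run.
function countCalls(spanNames: string[]): { modelCalls: number; toolCalls: number } {
  let modelCalls = 0;
  let toolCalls = 0;
  for (const name of spanNames) {
    const kind = classifySpan(name);
    if (kind === "model_call") modelCalls += 1;
    if (kind === "tool_call") toolCalls += 1;
  }
  return { modelCalls, toolCalls };
}

// Usage (placeholder names):
// countCalls(["invoke_agent some-model", "step 1", "chat some-model", "execute_tool read_file"])
```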

The benchmark also adds one custom span, eval.grade, after each run. That row carries the product metrics you actually want to compare:

| Attribute | Meaning |
| --- | --- |
| harness | claude-code or codex |
| model_calls | number of model calls in the run |
| tool_calls | number of tool calls in the run |
| input_tokens / output_tokens | token usage reported by the harness |
| duration_ms | wall-clock run duration |
| passed | whether the run met the rubric |

That custom span is the benchmarking move. The harness tells you what happened. Your grader tells you whether it was good.

It can be as small as this:

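import { SpanStatusCode, trace } from "@opentelemetry/api";

const tracer = trace.getTracer("harness-bench");

// AgentResult (result.text plus the run-level counters) and rubric() are
// defined elsewhere in the benchmark.
async function grade(harness: string, run: () => Promise<AgentResult>) {
  return tracer.startActiveSpan("eval.grade", async (span) => {
    span.setAttribute("harness", harness);
    try {
      const result = await run();
      const passed = rubric(result.text);

      // Run-level metrics become span attributes, one column each in RawTree.
      span.setAttribute("passed", passed);
      span.setAttribute("model_calls", result.modelCalls);
      span.setAttribute("tool_calls", result.toolCalls);
      span.setAttribute("input_tokens", result.inputTokens);
      span.setAttribute("output_tokens", result.outputTokens);
      span.setAttribute("duration_ms", result.durationMs);
      // A rubric failure is reported as an ERROR status on the span.
      span.setStatus({ code: passed ? SpanStatusCode.OK : SpanStatusCode.ERROR });

      return { ...result, passed };
    } catch (err) {
      // A run that throws is recorded on the span and rethrown, not graded.
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}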
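
The rubric itself is not shown in this post. As a rough illustration of the kind of function grade() calls, here is a hypothetical keyword-based version. The three checks are assumptions loosely modeled on the incident described later (deployment v2.14.1, retry/backoff behavior, and a rollback or circuit-breaker mitigation), and CHECKS and failedChecks are made-up names. Keyword matching is crude: a sentence like "do not roll back" would still match, so treat this as a sketch rather than a real grader.

```typescript
type RubricCheck = { label: string; pattern: RegExp };

// Hypothetical checks; the real benchmark rubric is not published here.
const CHECKS: RubricCheck[] = [
  { label: "names the deployment", pattern: /v2\.14\.1/ },
  { label: "identifies retry/backoff behavior", pattern: /retr(y|ies)|backoff/i },
  { label: "recommends rollback or circuit breaker", pattern: /roll(ing|ed)?\s*back|circuit[- ]?breaker/i },
];

// Labels of the checks a response fails; useful as an extra span attribute.
function failedChecks(text: string): string[] {
  return CHECKS.filter((c) => !c.pattern.test(text)).map((c) => c.label);
}

// A response passes only when every check matches.
function rubric(text: string): boolean {
  return failedChecks(text).length === 0;
}
```

Returning the failed labels alongside the boolean makes it possible to record why a run failed, not only that it failed.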

Query the benchmark result

Because RawTree transformed the OTLP payload into flat span rows, the comparison is direct SQL over columns. No resourceSpans, no arrayJoin, no JSON unpacking in every query:

SELECT
  harness,
  model_calls,
  tool_calls,
  input_tokens,
  output_tokens,
  round(duration_ms / 1000, 1) AS wall_seconds,
  passed
FROM traces
WHERE name = 'eval.grade'
ORDER BY wall_seconds ASC

For the incident-triage run, the result was:

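| harness | model calls | tool calls | input tokens | output tokens | wall seconds | passed |
| --- | --- | --- | --- | --- | --- | --- |
| codex | 1 | 4 | 19,541 | 773 | 11.7 | no |
| claude-code | 1 | 5 | 73,006 | 1,152 | 28.9 | yes |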

Both harnesses identified deployment v2.14.1 and the inventory retry/backoff behavior. They diverged on the mitigation. Claude Code recommended an immediate rollback and passed the rubric. Codex recommended reducing or disabling retries, which is directionally useful but failed the stricter rollback/circuit-breaker criterion.

That is exactly why this is a benchmark instead of a vibe check. Codex made one fewer tool call, used far fewer tokens, and finished more than 17 seconds faster. Claude Code spent more and took longer, but produced the answer the rubric accepted. Run the task ten times and these become distributions. Run ten tasks and you learn which harness fits which class of work.
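
Once there are repeated runs, the eval.grade rows can be reduced to a pass rate and a median per harness. The sketch below assumes the rows have already been fetched as plain objects with the harness, wall_seconds, and passed columns from the query above. GradeRow, Summary, median, and summarize are hypothetical names, and the same aggregation could equally be written in SQL.

```typescript
type GradeRow = { harness: string; wall_seconds: number; passed: boolean };
type Summary = { runs: number; passRate: number; medianWallSeconds: number };

// Median of a list of numbers; NaN for an empty list.
function median(values: number[]): number {
  if (values.length === 0) return NaN;
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Group eval.grade rows by harness and summarize each group.
function summarize(rows: GradeRow[]): Record<string, Summary> {
  const byHarness: Record<string, GradeRow[]> = {};
  for (const row of rows) {
    const bucket = byHarness[row.harness] ?? [];
    bucket.push(row);
    byHarness[row.harness] = bucket;
  }

  const out: Record<string, Summary> = {};
  for (const harness of Object.keys(byHarness)) {
    const runs = byHarness[harness];
    out[harness] = {
      runs: runs.length,
      passRate: runs.filter((r) => r.passed).length / runs.length,
      medianWallSeconds: median(runs.map((r) => r.wall_seconds)),
    };
  }
  return out;
}
```

The median is used here instead of the mean because a single slow run would otherwise dominate a small sample.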

Why use current harness defaults?

This is a harness-as-shipped benchmark, not a controlled same-model test. In production you usually do not buy "Claude Code with my hand-picked model" or "Codex with my hand-picked model" as identical wrappers. You choose a harness because of the total system: model choice, tool loop, permissions, compaction, sandbox behavior, and defaults.

So the benchmark used current defaults at the time of the run: Claude Code with anthropic/claude-sonnet-4.6, and Codex with gpt-5.3-codex.

If you want to isolate the model variable, run a second benchmark. The point is that RawTree gives you the same flattened table either way.

The workflow scales

The useful part is that the benchmark output is ordinary telemetry:

  1. register OpenTelemetry in the process
  2. run each harness with the same task and sandbox
  3. write one eval.grade span with pass/fail and run-level counters
  4. query the traces table

After that, adding a harness is just another harness value. Adding a metric is just another span attribute. Since RawTree flattens OTLP on ingest, the next question is SQL, not an eval-dashboard project.

Run the benchmark on your own task, then compare the rows in RawTree. Sign up for the private beta.


The harness packages and HarnessAgent API ship with AI SDK 7, currently on the canary release and still moving. Check the AI SDK docs for the current surface.

The @rawtree/otel SDK is experimental too, and its API may change before a stable release.