GLYPH

A verifiable RL environment for a Rust tool-use agent

I built the full post-training stack for a Rust tool-use coding agent (Qwen3-4B) end to end: synthetic data, SFT, RLVR on PRIME-RL, and a strict held-out eval. SFT taught the model the agent contract. The first RLVR run came out flat; I traced that to missing gradient on the hard tail and fixed it with a dense partial-credit reward, which gave a small, reproducible pass@8 lift across three seeds rather than in a single lucky run. I then A/B'd a more "principled" Rust-compiler-aware reward against it, and it lost: a real result about reward design, not just another win.


What Glyph is

A verifiers / PRIME-RL environment. Each task hands the model a real Rust crate and a tool-use job: patch until cargo_test passes, patch until cargo_run prints exact stdout, or run an already-correct crate. The model emits CALL tool {…}, tools execute against real cargo, and it must end with a clean FINAL. The reward is verifiable — cargo actually compiles and runs. rl/task_trace.py exposes load_environment() -> vf.Environment, the Environments-Hub-standard shape.

CALL read_file {"id":"c1","file_path":"src/lib.rs"}
RESULT <source>
CALL apply_patch {"id":"c2","file_path":"src/lib.rs","find":"…","replace":"…"}
RESULT ok
CALL cargo_test {"id":"c3","project_path":"."}
RESULT status: success
FINAL: fixed the precedence bug in merge()

Success is strict valid_trace: terminal cargo success + one clean FINAL after it + exact CALL syntax + no tool use after success. Cargo passing mid-trace is not success if the trace is unusable. Held-out eval: 150 unseen crates, disjoint from SFT/RL data (split by case_id, 0 leakage).
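The four conditions above can be sketched as a single pass over the trace. This is a minimal illustration, not the repo's checker: the line grammar is inferred from the example trace, one line per event is assumed, and the names `is_valid_trace`, `CALL_RE`, and `CARGO_TOOLS` are hypothetical.

```python
import json
import re

# Hypothetical grammar inferred from the example trace; the real checker may differ.
CALL_RE = re.compile(r"^CALL (\w+) (\{.*\})$")
CARGO_TOOLS = {"cargo_test", "cargo_run"}


def is_valid_trace(lines):
    """Return True only if the trace meets all four strict conditions."""
    last_cargo_ok = False  # did the most recent cargo call succeed?
    pending_tool = None    # tool whose RESULT line we are waiting for
    finals = 0
    for line in lines:
        if line.startswith("CALL "):
            m = CALL_RE.match(line)
            # Reject malformed CALLs and any tool use after success or FINAL.
            if m is None or finals or last_cargo_ok:
                return False
            try:
                args = json.loads(m.group(2))
            except json.JSONDecodeError:
                return False
            if not isinstance(args, dict) or "id" not in args:
                return False
            pending_tool = m.group(1)
        elif line.startswith("RESULT "):
            if pending_tool in CARGO_TOOLS:
                last_cargo_ok = "status: success" in line
            pending_tool = None
        elif line.startswith("FINAL"):
            finals += 1
            if not last_cargo_ok:
                return False  # FINAL without a terminal cargo success
        else:
            return False  # assumption: one line per event, nothing else allowed
    return finals == 1
```

Under this sketch, a trace where cargo passes and the model then issues another CALL is rejected even though the crate is fixed, which is the "cargo passing mid-trace is not success" rule.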

SFT built the agent

SFT (Qwen3-4B-Base → SFT_HALF_A_V8) learned the protocol and a useful repair prior. Greedy strict pass@1: 74/150.

RLVR with a sparse reward was flat

LoRA RLVR via PRIME-RL (on-policy distillation anchor + verifiable reward). Reward: +10 only for a clean verifier success, penalties otherwise. Strict pass@1, same harness: 74 → 72 / 150. Flat-to-down.

Why it was flat (the useful part)

I analyzed the failures instead of just the score:

Structure is 100% SFT-saturated. Across all 150, exact CALL syntax, call IDs, and cargo paths are 150/150. RLVR has zero headroom on format.
Failures are capability, not hygiene. 75 of 76 failures never reach a terminal cargo success — the model formats perfectly and writes the wrong Rust.
72/150 fail in both SFT and RLVR — a shared hard tail; RLVR only reshuffled ~10 borderline prompts.
The hard tail produces no gradient. RLVR learns from reward variance within each group of 8 rollouts. When all 8 fail, a binary reward scores them identically → zero advantage → the group is filtered → no gradient. (At step 0 of the sparse run, the orchestrator filtered the entire first batch.)
But ~half the hard tail is partially correct: of the 75 cargo-failures, 52% compile and 44% pass ≥1 test (median 50%). Under a binary reward these score identically to never-compiles.
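The zero-advantage argument can be checked with a few lines of arithmetic. The sketch below assumes a mean-centered group baseline, which is a common group-relative choice and not necessarily what the trainer used; the failure penalty of -1.0 and the names `group_advantages` and `has_signal` are hypothetical.

```python
from statistics import mean


def group_advantages(rewards):
    """Mean-centered advantages within one rollout group (assumed baseline)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def has_signal(rewards, eps=1e-9):
    """A group contributes gradient only if some advantage is non-zero."""
    return any(abs(a) > eps for a in group_advantages(rewards))


# Hypothetical binary reward: every failure gets the same penalty.
all_fail = [-1.0] * 8           # hard-tail prompt: all 8 rollouts fail
one_win = [-1.0] * 7 + [10.0]   # borderline prompt: one rollout succeeds

filtered = not has_signal(all_fail)  # identical scores, so the group is dropped
kept = has_signal(one_win)           # variance within the group, so it is kept
```

A rollout that compiles and passes half its tests lands in `all_fail` with exactly the same score as one that never compiles, so the group it belongs to carries no learning signal.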

The fix: a dense partial-credit reward

Give graded credit in the no-success region only: a small bonus for compiling, plus a bonus scaled by the test-pass fraction. Both are fixed by the task (unhackable) and capped well below the +10 success bonus: with the values used here, the partial credit totals at most 0.5 + 2.0 = 2.5. Now 8 failing rollouts get different scores → non-zero advantage → gradient on exactly the prompts that were dead weight. (Implemented in rl/reward.py, off by default; the run here used --progress-compile-bonus 0.5 --progress-test-frac-bonus 2.0.)
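A minimal sketch of that reward shape, assuming the two bonuses simply add and that the test-fraction bonus applies only once the crate compiles. The function name, its signature, and `failure_base` (a stand-in for whatever penalties the real reward applies) are hypothetical; only the 0.5, 2.0, and +10 values come from the text above.

```python
def dense_reward(success, compiled, tests_passed, tests_total,
                 compile_bonus=0.5, test_frac_bonus=2.0,
                 success_bonus=10.0, failure_base=0.0):
    """Partial credit in the no-success region; full bonus only on success."""
    if success:
        return success_bonus
    reward = failure_base  # hypothetical placeholder for the failure penalties
    if compiled:
        reward += compile_bonus
        if tests_total > 0:
            reward += test_frac_bonus * tests_passed / tests_total
    return reward


# Three failing rollouts on the same prompt now score differently.
never_compiles = dense_reward(False, False, 0, 4)
compiles_only = dense_reward(False, True, 0, 4)
half_tests = dense_reward(False, True, 2, 4)
```

Because the maximum partial credit (2.5 with these defaults) stays far below the success bonus, a rollout cannot prefer "almost passing" over actually passing.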

Measuring honestly

Greedy pass@1 is too noisy to detect a small effect, so I used pass@8 (T=0.8) with 3-seed replication per model.

model	valid@8 per seed	mean
SFT_HALF_A_V8	95, 97, 100	97.3
+ dense RLVR (step 10)	102, 102, 99	101.0

Δ ≈ +3.7 valid@8, seed-level t-test p ≈ 0.06. Small and borderline, but reproducible: dense-RLVR never drops below 99, SFT never exceeds 100. Stability (8/8) is flat. A single run had shown +7 — replication revealed SFT alone swings 95–100, so the +7 was partly seed luck. The honest effect is ~+4, not +7. The sparse run was flat, so the lift is attributable to the reward change.
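The seed-level comparison can be reproduced from the table with the standard library alone. The sketch assumes a Welch (unequal-variance) t-test, since the text does not say which variant was used. It stops at the t statistic and degrees of freedom, because converting those to a p-value needs the t-distribution CDF, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom for b vs a."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(b) - mean(a)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


sft = [95, 97, 100]      # valid@8 per seed, from the table above
dense = [102, 102, 99]

t, df = welch_t(sft, dense)  # t is about 2.08 with about 3.5 degrees of freedom
delta = mean(dense) - mean(sft)
```

With only three seeds per arm the degrees of freedom are very low, which is why a consistent gap of this size still sits at the edge of significance.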

Where the method's limit is

The hard tail splits in two. The reachable part — problems the model solves sometimes at pass@8 — is where RL has trajectories to reinforce, and the dense reward captured a sliver of it. The rest (~49/150 fail even at pass@8) is a 4B-base Rust capability ceiling: the model essentially never produces a correct trajectory, so no reward shaping can move it. Raising that needs richer compiler feedback to the agent or a stronger base, not a fancier reward.
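The split can be made concrete by bucketing each prompt on how many of its 8 samples are valid. This sketch assumes that pass@8 counts a prompt as solved if any sample is valid; the bucket names and the toy `results` data are hypothetical, not taken from the eval.

```python
from collections import Counter


def bucket(samples):
    """Classify one prompt from its per-sample validity flags."""
    solved = sum(samples)
    if solved == 0:
        return "never"      # capability ceiling: nothing for RL to reinforce
    if solved == len(samples):
        return "stable"     # solved on every sample
    return "reachable"      # solved sometimes: the region RL can move


# Toy data: three prompts with 8 sampled rollouts each.
results = {
    "crate_a": [False] * 8,
    "crate_b": [False] * 6 + [True] * 2,
    "crate_c": [True] * 8,
}
counts = Counter(bucket(s) for s in results.values())
```

Only the "reachable" bucket contains both successes and failures within a group, so it is the only place where a reward change can alter which trajectories get reinforced.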

Testing that prediction: a compiler-aware reward (it lost)

The obvious next move is to make the dense reward Rust-specific instead of generic. rustc fails in a fixed phase order (parse → type/resolve → borrow/lifetime → compiles), and reaching a later phase requires passing every earlier one. The furthest phase a failed rollout reaches is therefore a principled, monotone distance-to-compiling, and the model can't game it by deleting code to shrink its error count (--progress-error-ladder-bonus, see rl/tests/test_reward_progress.py). Same base model, data, steps, and hyperparameters as the dense run; only the reward shape changed.
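One way such a ladder could be scored is sketched below. The phase labels, the linear spacing between rungs, and the `ladder_bonus` name are assumptions for illustration; the repo's actual mapping is not shown in this post. The default of 2.5 is the coefficient reported for this run.

```python
# Hypothetical phase labels: each names the stage at which rustc stopped,
# with "compiles" meaning no compile errors at all.
PHASES = ["parse", "type_resolve", "borrow_lifetime", "compiles"]


def ladder_bonus(furthest_phase, coefficient=2.5):
    """Bonus that grows with the furthest phase reached (linear spacing assumed)."""
    rank = PHASES.index(furthest_phase)
    return coefficient * rank / (len(PHASES) - 1)


bonuses = [ladder_bonus(p) for p in PHASES]
```

The bonus depends only on which phase was reached, never on how many errors were reported, so deleting code to reduce the error count within a phase leaves the score unchanged.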

Per-step reward and zero-advantage filter rate, dense vs compiler-aware

It worked exactly as designed at the mechanism level: step 0 retained 32/96 rollouts after zero-advantage filtering (the sparse run had filtered all 96), so the ladder did restore gradient on the hard tail, same as the dense reward.

valid@8 per seed: SFT base vs dense reward vs compiler-aware reward

model	valid@8 per seed	mean
SFT_HALF_A_V8	95, 97, 100	97.3
+ dense reward (step 10)	102, 102, 99	101.0
+ compiler-aware reward (step 10)	95, 96, 94	95.0

But the outcome was worse, not better: −6.0 valid@8 vs the dense reward (p ≈ 0.012), and even slightly below the SFT baseline. cargo@8 (ignoring trace hygiene) tells the same story, so it's a real solve-rate drop, not a formatting artifact. The likely cause is Goodhart, not noise: "reached a later compiler phase" is a proxy for progress, not progress itself, and it's a step further from the true objective (tests passing) than the dense reward's own compile/test-fraction signal. Optimizing the proxy pulled the model toward churning on borrow-checker errors instead of toward working code.

This is one coefficient (2.5) at one checkpoint (step 10), not an ablation — so the honest claim is narrower than "compiler-aware rewards don't work": this specific shaping, at this strength, underperformed a reward that stays closer to the actual success criterion.

A verifier gap I found by inspection, not by a metric

Reading individual rollouts instead of just trusting scores surfaced something valid_trace can't see. On a config-merge crate the RLVR model's patch correctly fixed port precedence, but it also flipped tls from direct-first to profile-first — the opposite of the stated spec ("direct values must take precedence over profile values"):

  let tls = direct
      .and_then(|c| c.tls)
      .or_else(|| profile.and_then(|c| c.tls))   // before: spec-correct
      .unwrap_or(defaults.tls);

  let tls = profile
      .and_then(|c| c.tls)
      .or_else(|| direct.and_then(|c| c.tls))    // after the patch: violates the spec
      .unwrap_or(defaults.tls);

None of the crate's 3 tests set conflicting direct.tls and profile.tls values, so the regression is invisible to them: cargo_test still reports 3/3 passing, the trace ends with a clean FINAL, and strict valid_trace scores it a success. The model's own closing summary ("tls, which was already correct") is true of the original code, not of what it just shipped — it doesn't disclose the regression its own edit introduced.

This is the sharp edge of verifiable reward: the reward is only as good as the test suite behind it. cargo_test passing means the spec as tested held, not the spec as written. A binary or dense reward built on this verifier would have no way to penalize this trace — it looks identical to a fully correct one. I didn't catch it from a number; I caught it from reading a trace.
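The coverage gap is easy to reproduce in miniature. The sketch below re-expresses the two Rust snippets in Python with `None` standing in for a missing value. The test inputs are hypothetical, since the crate's three real tests are not shown here; they are only chosen so that none sets conflicting direct and profile values.

```python
def merge_tls_spec(direct, profile, default):
    """Direct-first precedence, as the stated spec requires."""
    if direct is not None:
        return direct
    if profile is not None:
        return profile
    return default


def merge_tls_patched(direct, profile, default):
    """Profile-first precedence, as in the patched code."""
    if profile is not None:
        return profile
    if direct is not None:
        return direct
    return default


# Hypothetical suite with no conflicting values: both versions agree on all of it.
non_conflicting = [(True, None), (None, False), (None, None)]
suite_agrees = all(
    merge_tls_spec(d, p, False) == merge_tls_patched(d, p, False)
    for d, p in non_conflicting
)

# A single conflicting case is enough to expose the regression.
spec_value = merge_tls_spec(True, False, False)        # direct wins
patched_value = merge_tls_patched(True, False, False)  # profile wins
```

Any verifier built on the non-conflicting suite scores both functions identically; one added conflicting case would have turned the regression into a test failure the reward could see.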

What I claim / don't

Claim: a dense, unhackable partial-credit reward turns a flat RLVR result into a small, reproducible pass@8 lift, by restoring gradient on the reachable part of the hard tail.
Claim: a more "principled" Rust-specific reward (compiler-phase ladder) is not automatically better. At the tested coefficient it underperformed the simpler dense reward by 6.0 valid@8, despite also restoring gradient. Reward proxies further from the true objective can hurt even when they're well-motivated and harder to game.
Don't claim: a large RLVR win, or significance at p<0.05 for the dense-vs-SFT result. Don't claim compiler-aware rewards are categorically worse — only that this one configuration was.
Don't claim: valid_trace success means the patch is fully spec-correct — only that it passes the test suite the task ships with. Per-task test coverage gaps exist (see above); the metric inherits them.

Reusable lessons

Verifier RL only works if the verifier matches the full behavior you want — for agents, the contract is the whole trace, not just "cargo passed."
"Tests pass" is not "spec correct" — it's only as strong as the test suite. Read individual traces; a per-task coverage gap (see above) is invisible to every metric in this post.
A binary verifiable reward silently discards the hard tail via zero-advantage filtering. Dense, unhackable partial credit recovers it.
Greedy pass@1 hides small effects in noise; pass@k + seed replication is the minimum honest bar (a single-run +7 here was partly seed luck).
Export the served policy: run_default/broadcasts/step_N as a PEFT adapter, not weights/step_N.

Repro commands: github.com/JayZenith/GLYPH (README)