GLYPH
A verifiable RL environment for a Rust tool-use agent
I built the full post-training stack for a Rust tool-use coding agent (Qwen3-4B) end to end: synthetic data, SFT, RLVR on PRIME-RL, and a strict held-out eval. SFT taught the agent contract. The first RLVR run came out flat; I traced that to missing gradient on the hard tail and fixed it with a dense partial-credit reward, which gave a small, reproducible pass@8 lift that held up under 3-seed replication rather than resting on a single lucky run. I then A/B'd a more "principled" Rust-compiler-aware reward against it, and it lost: a real result about reward design, not just another win.
What Glyph is
A verifiers / PRIME-RL environment. Each task hands the model a real Rust
crate and a tool-use job: patch until cargo_test passes, patch until
cargo_run prints exact stdout, or run an already-correct crate. The model
emits CALL tool {…}, tools execute against real cargo, and it must end with a
clean FINAL. The reward is verifiable — cargo actually compiles and runs.
rl/task_trace.py exposes load_environment() -> vf.Environment,
the Environments-Hub-standard shape.
CALL read_file {"id":"c1","file_path":"src/lib.rs"}
RESULT <source>
CALL apply_patch {"id":"c2","file_path":"src/lib.rs","find":"…","replace":"…"}
RESULT ok
CALL cargo_test {"id":"c3","project_path":"."}
RESULT status: success
FINAL: fixed the precedence bug in merge()
Success is strict valid_trace: terminal cargo success + one clean
FINAL after it + exact CALL syntax + no tool use after success.
Cargo passing mid-trace is not success if the trace is unusable. Held-out eval: 150 unseen
crates, disjoint from SFT/RL data (split by case_id, 0 leakage).
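As a rough illustration of the strict rule, here is a minimal sketch of a checker. It assumes a trace is a list of event strings in the CALL / RESULT / FINAL shape shown above, that a cargo success is marked by `status: success` in the RESULT, and that any CALL after a cargo success invalidates the trace. The name `is_valid_trace`, the regex, and the event layout are illustrative assumptions, not the project's actual verifier.

```python
import json
import re

# Hypothetical event grammar: `CALL <tool> <json-args>`, `RESULT <text>`, `FINAL: <text>`.
CALL_RE = re.compile(r"^CALL (\w+) (\{.*\})$")
CARGO_TOOLS = {"cargo_test", "cargo_run"}


def is_valid_trace(events: list[str]) -> bool:
    """Return True only if the trace meets all four strict conditions."""
    last_cargo_ok = None  # index of the successful cargo RESULT, if any
    pending_tool = None   # tool named by the CALL still awaiting its RESULT
    finals = []           # indices of FINAL events

    for i, event in enumerate(events):
        if event.startswith("CALL "):
            m = CALL_RE.match(event)
            if m is None:
                return False  # malformed CALL syntax
            try:
                args = json.loads(m.group(2))
            except json.JSONDecodeError:
                return False  # arguments are not valid JSON
            if not isinstance(args, dict) or "id" not in args:
                return False  # every call must carry an id
            if last_cargo_ok is not None:
                return False  # tool use after success
            pending_tool = m.group(1)
        elif event.startswith("RESULT"):
            if pending_tool in CARGO_TOOLS and "status: success" in event:
                last_cargo_ok = i
            pending_tool = None
        elif event.startswith("FINAL:"):
            finals.append(i)
        else:
            return False  # unrecognized event

    # Exactly one FINAL, it closes the trace, and it follows a cargo success.
    return (
        last_cargo_ok is not None
        and len(finals) == 1
        and finals[0] == len(events) - 1
        and finals[0] > last_cargo_ok
    )
```

Under these assumptions, a trace where cargo passes but the model then calls another tool is rejected even though the crate was fixed, which is the "cargo passing mid-trace is not success" case.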
SFT built the agent
SFT (Qwen3-4B-Base → SFT_HALF_A_V8) learned the protocol and a
useful repair prior. Greedy strict pass@1: 74/150.
RLVR with a sparse reward was flat
LoRA RLVR via PRIME-RL (on-policy distillation anchor + verifiable reward). Reward:
+10 only for a clean verifier success, penalties otherwise. Strict pass@1,
same harness: 74 → 72 / 150. Flat-to-down.
Why it was flat (the useful part)
I analyzed the failures instead of just the score:
- Structure is 100% SFT-saturated. Across all 150, exact CALL syntax, call IDs, and cargo paths are 150/150. RLVR has zero headroom on format.
- Failures are capability, not hygiene. 75 of 76 failures never reach a terminal cargo success — the model formats perfectly and writes the wrong Rust.
- 72/150 fail in both SFT and RLVR — a shared hard tail; RLVR only reshuffled ~10 borderline prompts.
- The hard tail produces no gradient. RLVR learns from reward variance within each group of 8 rollouts. When all 8 fail, a binary reward scores them identically → zero advantage → the group is filtered → no gradient. (At step 0 of the sparse run, the orchestrator filtered the entire first batch.)
- But ~half the hard tail is partially correct: of the 75 cargo-failures, 52% compile and 44% pass ≥1 test (median 50%). Under a binary reward these score identically to never-compiles.
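The zero-gradient mechanism in the bullets above can be reproduced in a few lines. This sketch assumes a group-mean baseline (advantage = reward minus the group's mean reward); the exact normalization PRIME-RL applies is not stated here, and the penalty of -1.0 and the partial-credit values are hypothetical.

```python
from statistics import mean


def group_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each rollout relative to its group's mean reward."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def is_filtered(rewards: list[float], eps: float = 1e-8) -> bool:
    """A group with no reward spread carries no learning signal."""
    return all(abs(a) < eps for a in group_advantages(rewards))


# Eight failing rollouts under a binary reward (hypothetical penalty of -1.0 each).
binary_group = [-1.0] * 8
# The same eight failures with hypothetical partial credit for compiling or passing tests.
dense_group = [-1.0, -1.0, -0.5, -0.5, 0.5, -1.0, 1.0, -0.5]

print(is_filtered(binary_group))  # True: identical scores, zero advantage
print(is_filtered(dense_group))   # False: scores differ, so some gradient survives
```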
The fix: a dense partial-credit reward
Give graded credit in the no-success region only: a small bonus for compiling, plus a bonus
scaled by the test-pass fraction. Both are fixed by the task (unhackable) and capped well
below the +10 success bonus. Now 8 failing rollouts get different
scores → non-zero advantage → gradient on exactly the prompts that were dead weight.
(Implemented in rl/reward.py and off by default; this run used
--progress-compile-bonus 0.5 --progress-test-frac-bonus 2.0.)
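A minimal sketch of this reward shape, for illustration only. The +10 success bonus and the 0.5 / 2.0 coefficients come from the text; the function name `shaped_reward`, its signature, and the -1.0 failure penalty are assumptions, since the actual penalties in rl/reward.py are not given here.

```python
def shaped_reward(
    success: bool,
    compiled: bool,
    tests_passed: int,
    tests_total: int,
    success_bonus: float = 10.0,    # from the text
    compile_bonus: float = 0.5,     # --progress-compile-bonus
    test_frac_bonus: float = 2.0,   # --progress-test-frac-bonus
    failure_penalty: float = -1.0,  # hypothetical; the real penalties are not stated
) -> float:
    """Sparse reward on success, graded partial credit in the no-success region."""
    if success:
        return success_bonus  # partial credit never stacks on top of success
    reward = failure_penalty
    if compiled:
        reward += compile_bonus
        if tests_total > 0:
            # Scale by the fraction of the task's own tests that pass.
            reward += test_frac_bonus * (tests_passed / tests_total)
    return reward
```

With these defaults the best possible failing score is 1.5, far below the +10 for a clean success, so the shaping can rank failures against each other without competing with the real objective.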
Measuring honestly
Greedy pass@1 is too noisy to detect a small effect, so I used pass@8 (T=0.8) with 3-seed replication per model.
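A sketch of the metric, assuming valid@8 counts tasks with at least one strictly valid trace among 8 samples and that stability counts tasks where all 8 samples are valid. The function names and the three example tasks are hypothetical.

```python
def valid_at_k(results: dict[str, list[bool]]) -> int:
    """Number of tasks with at least one strictly valid rollout among their samples."""
    return sum(any(samples) for samples in results.values())


def stable_at_k(results: dict[str, list[bool]]) -> int:
    """Number of tasks where every sampled rollout is strictly valid."""
    return sum(all(samples) for samples in results.values())


# Hypothetical per-task outcomes for 8 samples each (True = strict valid_trace).
example = {
    "case_a": [True] * 8,            # always solved
    "case_b": [False] * 7 + [True],  # solved once: counts for valid@8 only
    "case_c": [False] * 8,           # never solved
}

print(valid_at_k(example), stable_at_k(example))  # 2 1
```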
| model | valid@8 per seed | mean |
|---|---|---|
| SFT_HALF_A_V8 | 95, 97, 100 | 97.3 |
| + dense RLVR (step 10) | 102, 102, 99 | 101.0 |
Δ ≈ +3.7 valid@8, seed-level t-test p ≈ 0.06. Small and borderline, but reproducible: dense-RLVR never drops below 99, SFT never exceeds 100. Stability (8/8) is flat. A single run had shown +7 — replication revealed SFT alone swings 95–100, so the +7 was partly seed luck. The honest effect is ~+4, not +7. The sparse run was flat, so the lift is attributable to the reward change.
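The seed-level comparison can be recomputed from the table with the standard library alone. This sketch uses Welch's unequal-variance t statistic, which is an assumption: the text does not say which t-test variant was run, or whether it was one- or two-sided.

```python
from math import sqrt
from statistics import mean, variance

# Per-seed valid@8 counts from the table above.
sft = [95, 97, 100]
dense = [102, 102, 99]


def welch_t(a, b):
    """Welch's t statistic for mean(b) - mean(a), with Welch-Satterthwaite df."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(b) - mean(a)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


delta = mean(dense) - mean(sft)
t, df = welch_t(sft, dense)
print(f"delta={delta:.2f}  t={t:.2f}  df={df:.2f}")
```

With three seeds per arm, the degrees of freedom are small, so the p-value is sensitive to the choice of test variant and sidedness. That is one more reason to read the result as borderline.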
Where the method's limit is
The hard tail splits in two. The reachable part — problems the model solves sometimes at pass@8 — is where RL has trajectories to reinforce, and the dense reward captured a sliver of it. The rest (~49/150 fail even at pass@8) is a 4B-base Rust capability ceiling: the model essentially never produces a correct trajectory, so no reward shaping can move it. Raising that needs richer compiler feedback to the agent or a stronger base, not a fancier reward.
Testing that prediction: a compiler-aware reward (it lost)
The obvious next move is to make the dense reward Rust-specific instead of generic.
rustc fails in a fixed phase order — parse → type/resolve → borrow/lifetime →
compiles — and reaching a later phase requires passing every earlier one, so the furthest
phase a failed rollout reaches is a principled, monotone distance-to-compiling that the
model can't game by deleting code to shrink its error count
(--progress-error-ladder-bonus, see rl/tests/test_reward_progress.py).
Same base model, data, steps, and hyperparameters as the dense run — only the reward shape
changed.
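To make the ladder concrete, here is a sketch of one way to score the phase a failed build reached. The phase order and the 2.5 coefficient come from this writeup; the error-code table, the handling of uncoded errors, the linear scaling, and the function names are illustrative assumptions, not the logic behind --progress-error-ladder-bonus.

```python
import re

# Phase order from the text; the index is how far the rollout got before failing.
PHASES = ["parse", "type_resolve", "borrow_lifetime", "compiles"]

# Hypothetical, deliberately incomplete mapping from rustc error codes to phases.
CODE_PHASE = {
    "E0425": "type_resolve",     # cannot find value in this scope
    "E0308": "type_resolve",     # mismatched types
    "E0382": "borrow_lifetime",  # use of moved value
    "E0499": "borrow_lifetime",  # more than one mutable borrow at a time
    "E0502": "borrow_lifetime",  # mutable borrow while immutably borrowed
    "E0597": "borrow_lifetime",  # borrowed value does not live long enough
}

ERROR_CODE_RE = re.compile(r"error\[(E\d{4})\]")


def furthest_phase(stderr: str, compiled: bool) -> int:
    """Index in PHASES of the phase where compilation stopped."""
    if compiled:
        return PHASES.index("compiles")
    codes = ERROR_CODE_RE.findall(stderr)
    phases = [PHASES.index(CODE_PHASE[c]) for c in codes if c in CODE_PHASE]
    if not phases:
        # Assumption: uncoded or unknown errors are treated as parse failures.
        return PHASES.index("parse")
    # Assumption: the earliest failing phase is the one that blocked progress.
    return min(phases)


def ladder_bonus(stderr: str, compiled: bool, coeff: float = 2.5) -> float:
    """Bonus that grows monotonically with the phase reached, capped at coeff."""
    return coeff * furthest_phase(stderr, compiled) / (len(PHASES) - 1)
```

Note what the score depends on: which phase failed, not how many errors were printed. Deleting code to shrink the error count therefore does not raise it.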
It worked exactly as designed at the mechanism level: step 0 retained 32/96 rollouts after zero-advantage filtering (the sparse run had filtered all 96), so the ladder did restore gradient on the hard tail, same as the dense reward.
| model | valid@8 per seed | mean |
|---|---|---|
| SFT_HALF_A_V8 | 95, 97, 100 | 97.3 |
| + dense reward (step 10) | 102, 102, 99 | 101.0 |
| + compiler-aware reward (step 10) | 95, 96, 94 | 95.0 |
But the outcome was worse, not better: −6.0 valid@8 vs the dense reward (p ≈ 0.012), and even slightly below the SFT baseline. cargo@8 (ignoring trace hygiene) tells the same story, so it's a real solve-rate drop, not a formatting artifact. The likely cause is Goodhart, not noise: "reached a later compiler phase" is a proxy for progress, not progress itself, and it's a step further from the true objective (tests passing) than the dense reward's own compile/test-fraction signal. Optimizing the proxy pulled the model toward churning on borrow-checker errors instead of toward working code.
This is one coefficient (2.5) at one checkpoint (step 10), not an ablation — so the honest claim is narrower than "compiler-aware rewards don't work": this specific shaping, at this strength, underperformed a reward that stays closer to the actual success criterion.
A verifier gap I found by inspection, not by a metric
Reading individual rollouts instead of just trusting scores surfaced something
valid_trace can't see. On a config-merge crate the RLVR model's patch correctly
fixed port precedence, but it also flipped tls from direct-first to
profile-first — the opposite of the stated spec ("direct values must take precedence over
profile values"):
let tls = direct
    .and_then(|c| c.tls)
    .or_else(|| profile.and_then(|c| c.tls)) // before: spec-correct
    .unwrap_or(defaults.tls);

let tls = profile
    .and_then(|c| c.tls)
    .or_else(|| direct.and_then(|c| c.tls)) // after the patch: violates the spec
    .unwrap_or(defaults.tls);
None of the crate's 3 tests set conflicting direct.tls and profile.tls
values, so the regression is invisible to them: cargo_test still reports 3/3
passing, the trace ends with a clean FINAL, and strict valid_trace
scores it a success. The model's own closing summary ("tls, which was already correct") is
true of the original code, not of what it just shipped — it doesn't disclose the
regression its own edit introduced.
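The coverage gap is easy to see in a small model of the two expressions. This is a Python rendering of the precedence logic only, assuming `tls` is an optional boolean. The input cases are hypothetical and are not the crate's real tests.

```python
from typing import Optional


def tls_before(direct: Optional[bool], profile: Optional[bool], default: bool) -> bool:
    """Original logic: direct value first, then profile, then the default."""
    if direct is not None:
        return direct
    if profile is not None:
        return profile
    return default


def tls_after(direct: Optional[bool], profile: Optional[bool], default: bool) -> bool:
    """Patched logic: profile value first, which contradicts the stated spec."""
    if profile is not None:
        return profile
    if direct is not None:
        return direct
    return default


# Hypothetical inputs where direct and profile never disagree.
non_conflicting = [(None, None), (True, None), (None, False), (True, True)]
# The one kind of input that separates the two versions.
conflicting = (True, False)

# Both versions agree on every non-conflicting case, so such tests cannot tell them apart.
print(all(tls_before(d, p, False) == tls_after(d, p, False) for d, p in non_conflicting))  # True
# A single conflicting case exposes the regression.
print(tls_before(*conflicting, False), tls_after(*conflicting, False))  # True False
```

One test that sets both values to different settings would have turned this silent regression into a failing cargo_test.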
This is the sharp edge of verifiable reward: the reward is only as good as the test suite
behind it. cargo_test passing means the spec as tested held, not the
spec as written. A binary or dense reward built on this verifier would have no way to
penalize this trace — it looks identical to a fully correct one. I didn't catch it from a
number; I caught it from reading a trace.
What I claim / don't
- Claim: a dense, unhackable partial-credit reward turns a flat RLVR result into a small, reproducible pass@8 lift, by restoring gradient on the reachable part of the hard tail.
- Claim: a more "principled" Rust-specific reward (compiler-phase ladder) is not automatically better — at the tested coefficient it underperformed the simpler dense reward by −6.0 valid@8, despite also restoring gradient. Reward proxies further from the true objective can hurt even when they're well-motivated and harder to game.
- Don't claim: a large RLVR win, or significance at p<0.05 for the dense-vs-SFT result. Don't claim compiler-aware rewards are categorically worse — only that this one configuration was.
- Don't claim: valid_trace success means the patch is fully spec-correct. It means only that the patch passes the test suite the task ships with. Per-task test coverage gaps exist (see above); the metric inherits them.
Reusable lessons
- Verifier RL only works if the verifier matches the full behavior you want — for agents, the contract is the whole trace, not just "cargo passed."
- "Tests pass" is not "spec correct" — it's only as strong as the test suite. Read individual traces; a per-task coverage gap (see above) is invisible to every metric in this post.
- A binary verifiable reward silently discards the hard tail via zero-advantage filtering. Dense, unhackable partial credit recovers it.
- Greedy pass@1 hides small effects in noise; pass@k + seed replication is the minimum honest bar (a single-run +7 here was seed luck).
- Export the served policy: run_default/broadcasts/step_N as a PEFT adapter, not weights/step_N.