GLYPH

A verifiable RL environment for a Rust tool-use agent

I built the full post-training stack for a Rust tool-use coding agent (Qwen3-4B) end to end: synthetic data, SFT, RLVR (RL with verifiable rewards) on PRIME-RL, and a strict held-out eval. SFT taught the agent contract. The first RLVR run came out flat; I traced that to missing gradient on the hard tail and fixed it with a dense partial-credit reward, which gave a small, reproducible pass@8 lift confirmed by seed replication rather than a single lucky run. I then A/B'd a more "principled" Rust-compiler-aware reward against it, and it lost: a real result about reward design, not just another win.

What Glyph is

A verifiers / PRIME-RL environment. Each task hands the model a real Rust crate and a tool-use job: patch until cargo_test passes, patch until cargo_run prints exact stdout, or run an already-correct crate. The model emits CALL tool {…}, tools execute against real cargo, and it must end with a clean FINAL. The reward is verifiable — cargo actually compiles and runs. rl/task_trace.py exposes load_environment() -> vf.Environment, the Environments-Hub-standard shape.

CALL read_file {"id":"c1","file_path":"src/lib.rs"}
RESULT <source>
CALL apply_patch {"id":"c2","file_path":"src/lib.rs","find":"…","replace":"…"}
RESULT ok
CALL cargo_test {"id":"c3","project_path":"."}
RESULT status: success
FINAL: fixed the precedence bug in merge()

Success is strict valid_trace: terminal cargo success + one clean FINAL after it + exact CALL syntax + no tool use after success. Cargo passing mid-trace is not success if the trace is unusable. Held-out eval: 150 unseen crates, disjoint from SFT/RL data (split by case_id, 0 leakage).
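
The four conditions above can be sketched as a single pass over the trace. This is a minimal illustration, not the repo's checker: is_valid_trace, CALL_RE, and CARGO_TOOLS are hypothetical names, and it assumes each CALL / RESULT / FINAL record fits on one line and that a successful cargo result reads exactly "RESULT status: success", as in the example trace.

```python
import json
import re

# Assumed grammar, inferred from the example trace: CALL <tool> <json-object>
CALL_RE = re.compile(r"^CALL (\w+) (\{.*\})$")
CARGO_TOOLS = {"cargo_test", "cargo_run"}

def is_valid_trace(lines):
    """Return True if the trace satisfies the strict conditions (hypothetical helper)."""
    succeeded = False    # a cargo tool has reported success
    pending_tool = None  # tool named by the most recent CALL
    finals = 0
    for line in lines:
        if finals:                       # nothing may follow FINAL
            return False
        if line.startswith("CALL"):
            m = CALL_RE.match(line)
            if m is None or succeeded:   # bad syntax, or tool use after success
                return False
            try:
                args = json.loads(m.group(2))
            except json.JSONDecodeError:
                return False
            if not isinstance(args, dict):
                return False
            pending_tool = m.group(1)
        elif line.startswith("RESULT"):
            if pending_tool in CARGO_TOOLS and line == "RESULT status: success":
                succeeded = True
            pending_tool = None
        elif line.startswith("FINAL:"):
            if not succeeded:            # FINAL before any cargo success
                return False
            finals += 1
        else:                            # unrecognized record
            return False
    return succeeded and finals == 1
```

Under these assumptions the example trace above passes, while the same trace with an extra CALL after the cargo_test success, or with no FINAL, is rejected even though cargo passed.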

SFT built the agent

SFT (Qwen3-4B-Base → SFT_HALF_A_V8) learned the protocol and a useful repair prior. Greedy strict pass@1: 74/150.

RLVR with a sparse reward was flat

LoRA RLVR via PRIME-RL (on-policy distillation anchor + verifiable reward). Reward: +10 only for a clean verifier success, penalties otherwise. Strict pass@1, same harness: 74 → 72 / 150. Flat-to-down.

Why it was flat (the useful part)

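I analyzed the failures instead of just the score. On hard-tail prompts every rollout in the group of 8 failed and received the same reward, so the group advantage was zero and the zero-advantage filter dropped those rollouts: the prompts that most needed learning contributed no gradient.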

The fix: a dense partial-credit reward

Give graded credit in the no-success region only: a small bonus for compiling, plus a bonus scaled by the test-pass fraction. Both are fixed by the task (unhackable) and capped well below the +10 success bonus. Now 8 failing rollouts get different scores → non-zero advantage → gradient on exactly the prompts that were dead weight. (rl/reward.py, off by default, --progress-compile-bonus 0.5 --progress-test-frac-bonus 2.0.)
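
A small sketch of why this restores gradient. The +10 success reward and the 0.5 / 2.0 bonuses come from the text; FAIL_PENALTY, the reward() signature, and the mean-centered group advantage are assumptions for illustration and are not taken from rl/reward.py or PRIME-RL.

```python
from statistics import mean

SUCCESS_REWARD = 10.0    # from the text
COMPILE_BONUS = 0.5      # --progress-compile-bonus
TEST_FRAC_BONUS = 2.0    # --progress-test-frac-bonus
FAIL_PENALTY = -1.0      # hypothetical: the text only says "penalties otherwise"

def reward(success, compiled, tests_passed, tests_total, dense):
    """Sparse reward, optionally with partial credit in the no-success region."""
    if success:
        return SUCCESS_REWARD
    r = FAIL_PENALTY
    if dense:
        if compiled:
            r += COMPILE_BONUS
        if tests_total:
            r += TEST_FRAC_BONUS * tests_passed / tests_total
    return r

def advantages(rewards):
    """Mean-centered group advantage (an assumed, GRPO-style baseline)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

# Eight failing rollouts for one hard prompt: (compiled, tests_passed, tests_total)
group = [(False, 0, 4), (False, 0, 4), (True, 0, 4), (True, 1, 4),
         (True, 2, 4), (False, 0, 4), (True, 3, 4), (True, 0, 4)]

sparse_rewards = [reward(False, c, p, t, dense=False) for c, p, t in group]
dense_rewards = [reward(False, c, p, t, dense=True) for c, p, t in group]

print(advantages(sparse_rewards))  # all zeros: the whole group would be filtered
print(advantages(dense_rewards))   # spread out: rollouts that got further rank higher
```

With these numbers the best failing rollout scores 1.0, still far below the +10 for a real success, so partial credit cannot outweigh solving the task.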

Measuring honestly

Greedy pass@1 is too noisy to detect a small effect, so I used pass@8 (T=0.8) with 3-seed replication per model.
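
For concreteness, a sketch of the counting behind the metric, assuming valid@8 means the number of held-out tasks where at least one of the 8 sampled traces is a strict valid_trace, and that the stricter 8/8 figure counts tasks where all 8 are valid. Both function names and the input shape are hypothetical.

```python
def valid_at_k(results):
    """results maps case_id -> list of k booleans (one per sampled trace).

    Counts tasks with at least one valid trace among the k samples.
    """
    return sum(any(samples) for samples in results.values())

def all_valid_at_k(results):
    """Counts tasks where every one of the k samples is a valid trace."""
    return sum(all(samples) for samples in results.values())

# Toy data with k=8 for three tasks (illustrative values only)
toy = {
    "case_a": [True] * 8,            # solved every time
    "case_b": [False] * 7 + [True],  # solved once: counts for valid@8 only
    "case_c": [False] * 8,           # never solved
}
print(valid_at_k(toy), all_valid_at_k(toy))
```

Tasks like case_b are the ones a pass@8 metric can see and greedy pass@1 mostly cannot.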

model                    valid@8 per seed   mean
SFT_HALF_A_V8            95, 97, 100        97.3
+ dense RLVR (step 10)   102, 102, 99       101.0

Δ ≈ +3.7 valid@8, seed-level t-test p ≈ 0.06. Small and borderline, but reproducible: dense-RLVR never drops below 99, SFT never exceeds 100. Stability (8/8) is flat. A single run had shown +7 — replication revealed SFT alone swings 95–100, so the +7 was partly seed luck. The honest effect is ~+4, not +7. The sparse run was flat, so the lift is attributable to the reward change.
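
The seed-level comparison can be recomputed from the six numbers above. The sketch below uses only the standard library and reports the mean difference, a Welch t statistic, and the Welch-Satterthwaite degrees of freedom. The text does not say which t-test variant or sidedness was used, so Welch is an assumption here and the p-value is not recomputed.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return (mean difference b - a, Welch t statistic, Welch-Satterthwaite df)."""
    va = variance(a) / len(a)   # squared standard error of each sample mean
    vb = variance(b) / len(b)
    diff = mean(b) - mean(a)
    t = diff / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return diff, t, df

sft = [95, 97, 100]      # valid@8 per seed, SFT_HALF_A_V8
dense = [102, 102, 99]   # valid@8 per seed, dense RLVR (step 10)

diff, t, df = welch_t(sft, dense)
print(f"delta={diff:.1f}  t={t:.2f}  df={df:.1f}")
```

With three seeds per arm the degrees of freedom are tiny, which is why a gap of this size lands only at borderline significance.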

Where the method's limit is

The hard tail splits in two. The reachable part — problems the model solves sometimes at pass@8 — is where RL has trajectories to reinforce, and the dense reward captured a sliver of it. The rest (~49/150 fail even at pass@8) is a 4B-base Rust capability ceiling: the model essentially never produces a correct trajectory, so no reward shaping can move it. Raising that needs richer compiler feedback to the agent or a stronger base, not a fancier reward.

Testing that prediction: a compiler-aware reward (it lost)

The obvious next move is to make the dense reward Rust-specific instead of generic. rustc fails in a fixed phase order — parse → type/resolve → borrow/lifetime → compiles — and reaching a later phase requires passing every earlier one, so the furthest phase a failed rollout reaches is a principled, monotone distance-to-compiling that the model can't game by deleting code to shrink its error count (--progress-error-ladder-bonus, see rl/tests/test_reward_progress.py). Same base model, data, steps, and hyperparameters as the dense run — only the reward shape changed.
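
A sketch of the ladder idea, under stated assumptions. The phase order and the 2.5 coefficient come from the text; the error-code sets, the linear scaling, and the ladder_bonus signature are illustrative and are not taken from rl/reward.py. Input is the list of rustc error codes already extracted from a failed build, with None standing for a code-less syntax diagnostic.

```python
# Phase order from the text: parse -> type/resolve -> borrow/lifetime -> compiles
PHASES = ["parse", "type_resolve", "borrow", "compiles"]

# Assumed, non-exhaustive set of borrow-check codes; anything else with a code
# is treated conservatively as a type/resolve error.
BORROW = {"E0382", "E0499", "E0502", "E0505", "E0597"}

def reached_phase(compiled, error_codes):
    """Index into PHASES of the phase where compilation stopped."""
    if compiled:
        return 3
    if not error_codes or any(code is None for code in error_codes):
        return 0                     # syntax error, or nothing recognizable
    if all(code in BORROW for code in error_codes):
        return 2                     # only borrow/lifetime errors remain
    return 1                         # at least one type/resolve error

def ladder_bonus(compiled, error_codes, coeff=2.5):
    """Bonus scaled linearly by the phase reached (linear scaling is assumed)."""
    return coeff * reached_phase(compiled, error_codes) / (len(PHASES) - 1)

# The bonus depends on the phase reached, not on how many errors are reported,
# so trimming code to shrink the error count does not raise it.
print(ladder_bonus(False, ["E0308"] * 5), ladder_bonus(False, ["E0308"]))
```

Because the earliest failing phase decides the score, a rollout with one type error and one borrow error is scored at the type/resolve rung, not the borrow rung.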

[Figure: per-step reward and zero-advantage filter rate, dense vs compiler-aware]

It worked exactly as designed at the mechanism level: step 0 retained 32/96 rollouts after zero-advantage filtering (the sparse run had filtered all 96), so the ladder did restore gradient on the hard tail, same as the dense reward.

valid@8 per seed: SFT base vs dense reward vs compiler-aware reward
model                               valid@8 per seed   mean
SFT_HALF_A_V8                       95, 97, 100        97.3
+ dense reward (step 10)            102, 102, 99       101.0
+ compiler-aware reward (step 10)   95, 96, 94         95.0

But the outcome was worse, not better: −6.0 valid@8 vs the dense reward (p ≈ 0.012), and even slightly below the SFT baseline. cargo@8 (ignoring trace hygiene) tells the same story, so it's a real solve-rate drop, not a formatting artifact. The likely cause is Goodhart, not noise: "reached a later compiler phase" is a proxy for progress, not progress itself, and it's a step further from the true objective (tests passing) than the dense reward's own compile/test-fraction signal. Optimizing the proxy pulled the model toward churning on borrow-checker errors instead of toward working code.

This is one coefficient (2.5) at one checkpoint (step 10), not an ablation — so the honest claim is narrower than "compiler-aware rewards don't work": this specific shaping, at this strength, underperformed a reward that stays closer to the actual success criterion.

A verifier gap I found by inspection, not by a metric

Reading individual rollouts instead of just trusting scores surfaced something valid_trace can't see. On a config-merge crate the RLVR model's patch correctly fixed port precedence, but it also flipped tls from direct-first to profile-first — the opposite of the stated spec ("direct values must take precedence over profile values"):

  let tls = direct
      .and_then(|c| c.tls)
      .or_else(|| profile.and_then(|c| c.tls))   // before: spec-correct
      .unwrap_or(defaults.tls);

  let tls = profile
      .and_then(|c| c.tls)
      .or_else(|| direct.and_then(|c| c.tls))    // after the patch: violates the spec
      .unwrap_or(defaults.tls);

None of the crate's 3 tests set conflicting direct.tls and profile.tls values, so the regression is invisible to them: cargo_test still reports 3/3 passing, the trace ends with a clean FINAL, and strict valid_trace scores it a success. The model's own closing summary ("tls, which was already correct") is true of the original code, not of what it just shipped — it doesn't disclose the regression its own edit introduced.

This is the sharp edge of verifiable reward: the reward is only as good as the test suite behind it. cargo_test passing means the spec as tested held, not the spec as written. A binary or dense reward built on this verifier would have no way to penalize this trace — it looks identical to a fully correct one. I didn't catch it from a number; I caught it from reading a trace.

What I claim / don't

Reusable lessons

Repro commands: github.com/JayZenith/GLYPH (README)