I found a small performance issue in llama.cpp that sits directly on the critical path of generation: token sampling.
It was a redundant O(vocab) heap allocation on every call.
While studying llama_sampler_sample(), I noticed a TODO:
“do not allocate each time”
The code showed why: for each generated token, it rebuilds a std::vector<llama_token_data> of size n_vocab:
// llama-sampling.cpp (Before)
llama_token llama_sampler_sample(struct llama_sampler * smpl, struct llama_context * ctx, int32_t idx) {
    // ...
    std::vector<llama_token_data> cur;
    cur.reserve(n_vocab); // Redundant O(vocab) allocation
    // ...
}
That means that for a ~150k vocab model, we allocate and fill a ~150k-element vector once per token generated. If you generate hundreds or thousands of tokens, this becomes a steady stream of heap traffic in one of the hottest loops in the system.
Sampling is inherently O(vocab), since every logit has to be processed. But repeatedly allocating and freeing that buffer is pure waste: it adds allocator traffic on every token, can contribute to fragmentation, and gives up the cache locality of a buffer that stays in one place.
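As a rough sketch of the arithmetic behind that waste: the `token_data` struct below is an assumed stand-in for llama_token_data (an id, a logit, and a probability), and both helper functions are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed stand-in for llama_token_data: token id, logit, probability.
struct token_data {
    int32_t id;
    float   logit;
    float   p;
};

// The figures quoted below assume a 12-byte entry.
static_assert(sizeof(token_data) == 12, "sketch assumes a 12-byte entry");

// Bytes of element storage for one candidate buffer of n_vocab entries.
constexpr std::size_t candidate_bytes(std::size_t n_vocab) {
    return n_vocab * sizeof(token_data);
}

// Cumulative bytes requested when the buffer is rebuilt once per generated token.
// The result is 64-bit so large token counts do not overflow on 32-bit targets.
constexpr std::uint64_t bytes_requested(std::size_t n_vocab, std::uint64_t n_tokens) {
    return static_cast<std::uint64_t>(candidate_bytes(n_vocab)) * n_tokens;
}
```

With 12-byte entries, `candidate_bytes(152064)` is about 1.8 MB per call, so 1,000 generated tokens request roughly 1.8 GB from the allocator in total, even though only one buffer is live at any moment.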
I found this by reading the sampling pipeline end-to-end. The model produces logits, and sampling wraps them into a token array and runs a configured sampler chain (temperature/top-k/top-p/penalties/etc.). The key detail: the samplers only need a contiguous array view during apply(); they don’t need a freshly allocated vector each step.
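The "view, not vector" point can be illustrated with a minimal sketch. Everything here is hypothetical and simplified: `token_view` stands in for llama_token_data_array, and `apply_temperature` is a toy stage, not the real sampler. The stage only sees a pointer and a length, so it works the same whether the storage is a heap vector or a fixed array.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed stand-in for llama_token_data.
struct token_data {
    int32_t id;
    float   logit;
    float   p;
};

// Non-owning view over contiguous candidates (simplified stand-in for llama_token_data_array).
struct token_view {
    token_data * data;
    std::size_t  size;
};

// Toy stage: divide every logit by a temperature, in place.
void apply_temperature(token_view v, float temp) {
    for (std::size_t i = 0; i < v.size; ++i) {
        v.data[i].logit /= temp;
    }
}

// Index of the largest logit; expects a non-empty view.
std::size_t argmax(token_view v) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < v.size; ++i) {
        if (v.data[i].logit > v.data[best].logit) {
            best = i;
        }
    }
    return best;
}
```

Calling `apply_temperature({buf.data(), buf.size()}, 2.0f)` behaves identically for a `std::vector<token_data>` and a `std::array<token_data, N>`, which is why the owner of the storage can change without touching the stages.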
The sampler used by the CLI is typically a sampler chain: a pipeline of sampling stages applied in order on each token. GDB confirms that the chain dispatches the apply operation through its interface struct of function pointers (in effect, a hand-rolled vtable):
(gdb) print *smpl->iface
$4 = {
... apply = 0x7ffff77de758 <llama_sampler_chain_apply(llama_sampler*, llama_token_data_array*)>,
...
}
The chain simply iterates through the sampling stages:
static void llama_sampler_chain_apply(struct llama_sampler * smpl, llama_token_data_array * cur_p) {
    auto * chain = (llama_sampler_chain *) smpl->ctx;
    for (auto * smpl : chain->samplers) {
        llama_sampler_apply(smpl, cur_p);
    }
}
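The dispatch pattern visible in the GDB output and the chain loop can be sketched in isolation. This is a simplified model, not llama.cpp's code: every `toy_` name, `logit_view`, and the two stages are hypothetical, and the view carries bare floats to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified view over logits (hypothetical; the real array holds token entries).
struct logit_view {
    float *     logits;
    std::size_t size;
};

struct toy_sampler;

// Table of function pointers, playing the role of the iface struct.
struct toy_sampler_iface {
    void (*apply)(toy_sampler * smpl, logit_view * cur_p);
};

struct toy_sampler {
    const toy_sampler_iface * iface;
    void *                    ctx;   // stage-specific state
};

// Dispatch through the function-pointer table.
void toy_sampler_apply(toy_sampler * smpl, logit_view * cur_p) {
    smpl->iface->apply(smpl, cur_p);
}

// Stage 1: multiply every logit by a factor.
struct scale_ctx { float factor; };
void scale_apply(toy_sampler * smpl, logit_view * cur_p) {
    const auto * ctx = static_cast<const scale_ctx *>(smpl->ctx);
    for (std::size_t i = 0; i < cur_p->size; ++i) cur_p->logits[i] *= ctx->factor;
}
const toy_sampler_iface scale_iface = { scale_apply };

// Stage 2: add a constant to every logit.
struct offset_ctx { float delta; };
void offset_apply(toy_sampler * smpl, logit_view * cur_p) {
    const auto * ctx = static_cast<const offset_ctx *>(smpl->ctx);
    for (std::size_t i = 0; i < cur_p->size; ++i) cur_p->logits[i] += ctx->delta;
}
const toy_sampler_iface offset_iface = { offset_apply };

// The chain is itself a sampler whose state is the ordered list of stages.
struct chain_ctx { std::vector<toy_sampler *> samplers; };
void chain_apply(toy_sampler * smpl, logit_view * cur_p) {
    auto * chain = static_cast<chain_ctx *>(smpl->ctx);
    for (auto * stage : chain->samplers) {
        toy_sampler_apply(stage, cur_p);
    }
}
const toy_sampler_iface chain_iface = { chain_apply };
```

Because the chain is just another sampler with its own state, it is a natural place to hang data that should live as long as the generation session.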
Because the chain has a stable lifetime for the entire generation session, I moved ownership of the buffer to that object.
I added a reusable buffer to the chain state:
// inside struct llama_sampler_chain
std::vector<llama_token_data> cur;
Then, I updated the sampling call to use the existing buffer from the chain instead of allocating a local one:
// llama-sampling.cpp (After)
llama_token llama_sampler_sample(struct llama_sampler * smpl, struct llama_context * ctx, int32_t idx) {
    auto * chain = (llama_sampler_chain *) smpl->ctx;
    // Reuse the persistent buffer from the chain; resize() allocates only when the buffer must grow
    chain->cur.resize(n_vocab);
    llama_token_data_array cur_p = { chain->cur.data(), chain->cur.size(), /* selected */ -1, /* sorted */ false };
    // ... logic continues using the stable pointer
}
Now, the samplers get their contiguous array view via the same reused pointer, avoiding the allocation “tax”.
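The reuse pattern can be checked in a self-contained sketch. The names `toy_chain` and `toy_sample` are hypothetical, and a greedy pick stands in for the real sampler stages. The point is only that once the buffer has reached its full size, later calls leave the storage where it is.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed stand-in for llama_token_data.
struct token_data {
    int32_t id;
    float   logit;
    float   p;
};

// Long-lived object that owns the candidate buffer (stand-in for the chain state).
struct toy_chain {
    std::vector<token_data> cur;
};

// Fill the chain's buffer from the logits and return the id with the largest logit.
// Returns -1 for an empty vocabulary.
int32_t toy_sample(toy_chain & chain, const std::vector<float> & logits) {
    // resize() allocates on the first call (or if the vocabulary grows);
    // with an unchanged size it leaves the existing storage in place.
    chain.cur.resize(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        chain.cur[i] = token_data{ static_cast<int32_t>(i), logits[i], 0.0f };
    }
    if (chain.cur.empty()) {
        return -1;
    }
    std::size_t best = 0;
    for (std::size_t i = 1; i < chain.cur.size(); ++i) {
        if (chain.cur[i].logit > chain.cur[best].logit) {
            best = i;
        }
    }
    return chain.cur[best].id;
}
```

Calling `toy_sample` twice on the same `toy_chain` with equally sized logit vectors leaves `chain.cur.data()` unchanged between calls, which is the stable pointer the samplers rely on.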
To isolate the impact, I ran a micro-benchmark across common vocabulary sizes. Each figure is the total time for 10,000 sampling iterations.
| Vocab Size | Old (alloc/call) | New (reuse buf) | Speedup |
|---|---|---|---|
| 32,000 | 426.97 ms | 193.59 ms | 2.21x |
| 65,536 | 826.57 ms | 386.59 ms | 2.14x |
| 128,000 | 1671.90 ms | 825.43 ms | 2.03x |
| 152,064 | 1978.95 ms | 1039.35 ms | 1.90x |
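A comparison of this kind could be set up along the following lines. This is a sketch under assumptions, not the harness that produced the table: the function names, the `bench_result` struct, and the checksum trick are all hypothetical, and the timings it yields depend on the compiler, allocator, and machine.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed stand-in for llama_token_data.
struct token_data {
    int32_t id;
    float   logit;
    float   p;
};

struct bench_result {
    double ms;        // total wall-clock time for all iterations
    double checksum;  // derived from the buffer to discourage the compiler from discarding the work
};

// Old pattern: construct, reserve, and fill a fresh vector on every iteration.
bench_result bench_alloc_per_call(const std::vector<float> & logits, int iters) {
    const auto t0 = std::chrono::steady_clock::now();
    double checksum = 0.0;
    for (int it = 0; it < iters; ++it) {
        std::vector<token_data> cur;
        cur.reserve(logits.size());
        for (std::size_t i = 0; i < logits.size(); ++i) {
            cur.push_back(token_data{ static_cast<int32_t>(i), logits[i], 0.0f });
        }
        if (!cur.empty()) checksum += cur.front().logit + cur.back().logit;
    }
    const auto t1 = std::chrono::steady_clock::now();
    return { std::chrono::duration<double, std::milli>(t1 - t0).count(), checksum };
}

// New pattern: one buffer that outlives the loop and is refilled in place.
bench_result bench_reuse(const std::vector<float> & logits, int iters) {
    const auto t0 = std::chrono::steady_clock::now();
    double checksum = 0.0;
    std::vector<token_data> cur;
    for (int it = 0; it < iters; ++it) {
        cur.resize(logits.size());
        for (std::size_t i = 0; i < logits.size(); ++i) {
            cur[i] = token_data{ static_cast<int32_t>(i), logits[i], 0.0f };
        }
        if (!cur.empty()) checksum += cur.front().logit + cur.back().logit;
    }
    const auto t1 = std::chrono::steady_clock::now();
    return { std::chrono::duration<double, std::milli>(t1 - t0).count(), checksum };
}
```

Running both functions with the same logits and iteration count, for example a vector of 152,064 floats and 10,000 iterations, gives two totals to compare. Matching checksums confirm both paths did the same work.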
The results show a consistent speedup of roughly 1.9x to 2.2x in the sampling logic. The absolute time saved per token is only tens of microseconds (e.g., ~94 μs for Qwen 2.5, the 152,064 row), but the change removes an overhead that grows linearly with vocabulary size.
For long generation runs (thousands of tokens), this compounds into meaningful wall-clock savings.
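The per-token figure and its accumulation over a run follow directly from the table. A small sketch of that arithmetic, with hypothetical helper names and the assumption that the micro-benchmark cost carries over unchanged to real generation:

```cpp
#include <cassert>
#include <cmath>

// Per-call saving in microseconds, given two total timings in milliseconds.
constexpr double saving_us_per_call(double old_ms, double new_ms, int iters) {
    return (old_ms - new_ms) * 1000.0 / iters;
}

// Total saving in milliseconds over n_tokens sampled tokens.
constexpr double saving_ms_total(double old_ms, double new_ms, int iters, int n_tokens) {
    return saving_us_per_call(old_ms, new_ms, iters) * n_tokens / 1000.0;
}
```

For the 152,064 row, `saving_us_per_call(1978.95, 1039.35, 10000)` comes to about 94 μs, and `saving_ms_total(1978.95, 1039.35, 10000, 1000)` to about 94 ms per 1,000 generated tokens.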
Optimization is often just about not doing unnecessary work.
Jay Zenith | jayzenith248@gmail.com | github.com/JayZenith