
Prefill/Decode Disaggregation on 4× RTX 3090 (PCIe): A Simple A/B

I wanted to answer one practical question: on a single 4-GPU PCIe machine, does prefill/decode (PD) disaggregation beat a colocated engine? This is not a "PD is good/bad in general" post — just a grounded single-node result.

Code: github.com/JayZenith/pd-disagg

Setup

Configs

Colocated

Disaggregated (PD)

Workload + Measurement

Important: these latencies are end-to-end request times (queueing + compute + overhead).
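To make "end-to-end" concrete, here is a minimal stdlib sketch of a closed-loop client timer. It is not the Locust setup used for the runs; `send_request` is a hypothetical callable standing in for whatever blocks until the full response arrives. The clock starts when the client issues the request and stops when the response is complete, so any time spent queued on the server lands inside the number.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def timed_call(send_request):
    """Wall-clock milliseconds for one full request, as the client sees it."""
    start = time.perf_counter()
    send_request()  # hypothetical: blocks until the whole response has arrived
    return (time.perf_counter() - start) * 1000.0


def run_closed_loop(send_request, concurrency, requests_per_worker):
    """Each worker sends its requests back to back; every latency is collected."""
    def worker(_):
        return [timed_call(send_request) for _ in range(requests_per_worker)]

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        per_worker = list(pool.map(worker, range(concurrency)))
    return [ms for latencies in per_worker for ms in latencies]


def p50_p95(latencies_ms):
    """Median and 95th percentile (needs at least two samples)."""
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]
```

Because the timer wraps the whole call, a slow p95 here cannot tell you on its own whether the time went to queueing, prefill, decode, or transport.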

Results

From the Locust stats CSV (one run per concurrency):

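| Concurrency | Colocated RPS | Coloc p50 (ms) | Coloc p95 (ms) | PD RPS | PD p50 (ms) | PD p95 (ms) | PD Δ RPS |
|---|---|---|---|---|---|---|---|
| 1 | 0.068 | 14000 | 14000 | 0.088 | 14000 | 14000 | +29.3% |
| 4 | 0.352 | 12000 | 13000 | 0.355 | 12000 | 15000 | +0.7% |
| 8 | 0.615 | 13000 | 13000 | 0.662 | 12000 | 12000 | +7.6% |
| 16 | 1.230 | 13000 | 13000 | 1.268 | 12000 | 12000 | +3.1% |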
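For anyone pulling the same columns out of their own run, here is a small sketch of reading the aggregate row from a Locust stats CSV. It assumes the file carries `Name`, `Requests/s`, `50%`, and `95%` columns and an `Aggregated` summary row, which is how recent Locust versions write it; check the header of your own file. The sample text is a trimmed, illustrative stand-in that reuses the colocated concurrency-8 numbers from the table, not a real export.

```python
import csv
import io


def aggregated_row(stats_csv_text):
    """Return RPS, p50, and p95 from the 'Aggregated' row of a stats CSV."""
    reader = csv.DictReader(io.StringIO(stats_csv_text))
    for row in reader:
        if row["Name"] == "Aggregated":
            return {
                "rps": float(row["Requests/s"]),
                "p50_ms": float(row["50%"]),
                "p95_ms": float(row["95%"]),
            }
    raise ValueError("no Aggregated row found")


# Illustrative sample only: column subset assumed, values copied from the table.
SAMPLE = (
    "Type,Name,Requests/s,50%,95%\n"
    "POST,/generate,0.615,13000,13000\n"
    ",Aggregated,0.615,13000,13000\n"
)

summary = aggregated_row(SAMPLE)
print(summary)
```

The `/generate` name in the sample is a placeholder, not the endpoint used in the runs.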

Read this like an engineer:

So the honest summary is:

On this single-node PCIe setup with a fixed 128-token workload, PD (prefill TP=2, decode TP=2) is near parity with colocated TP=4: throughput is sometimes slightly higher, and tail latency is similar.
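The throughput deltas behind that summary can be recomputed from the RPS columns. The sketch below uses the rounded values printed in the table, so it lands within a few tenths of a percentage point of the reported figures; my assumption is that the reported deltas were computed from unrounded RPS in the CSV.

```python
# (concurrency, colocated RPS, PD RPS, reported PD delta in percent), from the table
ROWS = [
    (1, 0.068, 0.088, 29.3),
    (4, 0.352, 0.355, 0.7),
    (8, 0.615, 0.662, 7.6),
    (16, 1.230, 1.268, 3.1),
]


def pd_delta_percent(colocated_rps, pd_rps):
    """Relative throughput change of PD versus colocated, in percent."""
    return (pd_rps / colocated_rps - 1.0) * 100.0


for concurrency, coloc, pd, reported in ROWS:
    delta = pd_delta_percent(coloc, pd)
    absolute = pd - coloc
    print(f"c={concurrency:>2}  delta={delta:+.1f}%  "
          f"(reported {reported:+.1f}%)  absolute={absolute:+.3f} RPS")
```

Printing the absolute difference next to the percentage helps keep the concurrency-1 row in proportion: a large relative change there corresponds to a very small change in requests per second.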

Constraints / What Didn't Work

I tried a decode-heavier split (prefill TP=1, decode TP=3) and it failed:

So TP=2/2 is the realistic PD split for this machine.

What This Does (and Doesn't) Prove

This does not prove "PD is always good" or "PD is always bad." It shows something narrower:

What I'd Test Next (If I Extend This)

  1. Mixed prompt lengths (short + very long prompts) to see if PD helps tail latency under interference
  2. Long prompt / RAG-like inputs (prefill-heavy workloads)
  3. Multi-node PD (where separation/scaling is actually the point)
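As a starting point for item 1, here is a minimal sketch of a reproducible mixed-length request plan. Everything in it is hypothetical: the function name, the 4096-token long prompt, and the 20% long fraction are placeholders, and only the 128-token short case echoes the fixed workload used above. It produces target lengths only; turning a length into an actual prompt depends on the tokenizer.

```python
import random


def mixed_prompt_lengths(n_requests, short_tokens=128, long_tokens=4096,
                         long_fraction=0.2, seed=0):
    """Return a seeded list of target prompt lengths, mixing short and long."""
    rng = random.Random(seed)  # fixed seed so both configs see the same mix
    return [
        long_tokens if rng.random() < long_fraction else short_tokens
        for _ in range(n_requests)
    ]


plan = mixed_prompt_lengths(1000)
print(f"{plan.count(4096)} long and {plan.count(128)} short prompts")
```

Using the same seed for the colocated and PD runs keeps the A/B fair: both engines would see the identical sequence of short and long prompts.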

Repro (High Level)