Setup
- Hardware: 4× RTX 3090 (24GB), PCIe (single machine)
- Model: DeepSeek-V2-Lite-Chat (~15.7B, bf16)
- Serving: Ray Serve LLM + vLLM
- KV transfer (PD only): NixlConnector
- gpu_memory_utilization: 0.97 on both configs
Configs
Colocated
- 1 engine
- TP=4 (all 4 GPUs)
Disaggregated (PD)
- Prefill engine: TP=2
- Decode engine: TP=2
- Prefill produces KV → decode consumes KV (via NIXL)
Workload + Measurement
- Non-streaming only
- max_tokens = 128
- Short fixed prompt set
- Measured with Locust:
- RPS = requests/sec
- p50/p95 latency = end-to-end response time (ms)
Important: these latencies are end-to-end request times (queueing + compute + overhead).
Results
From the Locust stats CSV (one run per concurrency):
| Concurrency | Colocated RPS | Coloc p50 (ms) | Coloc p95 (ms) | PD RPS | PD p50 (ms) | PD p95 (ms) | PD Δ RPS |
|---|---|---|---|---|---|---|---|
| 1 | 0.068 | 14000 | 14000 | 0.088 | 14000 | 14000 | +29.3% |
| 4 | 0.352 | 12000 | 13000 | 0.355 | 12000 | 15000 | +0.7% |
| 8 | 0.615 | 13000 | 13000 | 0.662 | 12000 | 12000 | +7.6% |
| 16 | 1.230 | 13000 | 13000 | 1.268 | 12000 | 12000 | +3.1% |
Read this like an engineer:
- At c=1, there were only 4 total requests in 60s, so don't over-interpret the +29%.
- At c=8–16 (the only part I actually care about), PD is slightly ahead in RPS in this run, and p95 is in the same ballpark (~12–13s).
- At c=4, throughput is basically even, and PD p95 was worse in this run (15s vs 13s).
So the honest summary is:
On this single-node PCIe setup and fixed 128-token workload, PD (TP=2/2) is near-parity with colocated TP=4, sometimes slightly better in throughput, with similar tail latency.
Constraints / What Didn't Work
I tried a decode-heavier split (prefill TP=1, decode TP=3) and it failed:
- TP=3 is invalid for this model because attention heads aren't divisible by 3.
- TP=1 prefill OOM'd on a single 3090 for this model/config.
So TP=2/2 is the realistic PD split for this machine.
What This Does (and Doesn't) Prove
This does not prove "PD is always good" or "PD is always bad." It shows something narrower:
- On a single 4×3090 PCIe node, PD disaggregation is not automatically a win, but it can be competitive for fixed-length generation.
- The "big PD advantages" likely require a different regime:
- long prompts (prefill is expensive)
- mixed prompt lengths (prefill interference is real)
- multi-node / hetero hardware where you can scale prefill and decode separately
What I'd Test Next (If I Extend This)
- Mixed prompt lengths (short + very long prompts) to see if PD helps tail latency under interference
- Long prompt / RAG-like inputs (prefill-heavy workloads)
- Multi-node PD (where separation/scaling is actually the point)
Repro (High Level)
- Run colocated app (TP=4)
- Run PD app (prefill TP=2, decode TP=2)
- Run Locust at concurrencies 1/4/8/16 for 60s each
- Compare RPS + p50/p95 from Locust stats