I built an AI workstation to experiment with LLMs. I originally ran it with 4x RTX 3090s and have now upgraded it to 6x RTX 3090s.
I ran experiments to test various combinations of power limits, GPU counts, and NVLink.
I used vLLM, which, from my experimentation, is the best-performing software for serving LLMs at scale to multiple users. I have also tried TabbyAPI (exllamaV2) and llama-server (llama.cpp). They each have their pros and cons, but that discussion is beyond the scope of this post.
I used vLLM's client-side benchmarking script, benchmark_serving.py.
Experiment #1 (Power vs Throughput):
vllm command:
NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,4,3,5 vllm serve Qwen/QwQ-32B --quantization="fp8" --tensor-parallel 4 --max-model-len 4096 --gpu-memory-utilization 0.95 --max-num-seqs 1
Client command:
python benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --model Qwen/QwQ-32B --seed 12345 --dataset-name=random --num-prompts=10
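For each power level tested, the limit can be applied per GPU with nvidia-smi before launching the server. A minimal sketch, using the 220W level and the four GPU indices from the command above as illustrative values (nvidia-smi indexing may differ from CUDA_VISIBLE_DEVICES ordering):

sudo nvidia-smi -pm 1  # enable persistence mode so the limit sticks between runs
for i in 0 3 4 5; do sudo nvidia-smi -i "$i" -pl 220; done  # cap each GPU at 220 W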
Results:
Experiment #2 (NVLINK and GPU topology):
I have 6 RTX 3090s. However, when using tensor parallel, the number of GPUs has to evenly divide the number of attention heads in the model, as well as the token vocabulary size. Most models can be split across 2, 4, or 8 GPUs; unfortunately, 6 GPUs usually does not work. As a result, I left 2 of my GPUs idle and tested with either 2 or 4 GPUs active.
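A quick way to check which tensor-parallel sizes a model supports is to read the attention-head count and vocabulary size from its Hugging Face config. A minimal sketch using the transformers library (QwQ-32B is just the example model here):

python -c "from transformers import AutoConfig; \
cfg = AutoConfig.from_pretrained('Qwen/QwQ-32B'); \
print('attention heads:', cfg.num_attention_heads); \
print('vocab size:', cfg.vocab_size); \
print('TP sizes that divide the heads evenly:', [tp for tp in (2, 3, 4, 6, 8) if cfg.num_attention_heads % tp == 0])"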
NVLink on the RTX 3090 can only connect pairs of GPUs. If a GPU in one pair needs to communicate with a GPU in another pair, the traffic has to go through PCIe. I ran all the cards at PCIe Gen4 x8.
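The NVLink pairing and the PCIe paths between the other GPUs can be confirmed with nvidia-smi's topology matrix; NVLink-bridged pairs show up as NV# entries, while everything else reports a PCIe or host path (the exact labels depend on the motherboard):

nvidia-smi topo -m  # print the GPU interconnect matrix (NV# = NVLink, PIX/PXB/PHB/NODE/SYS = PCIe/host paths)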
For this experiment, I fixed the power limit of each GPU at 220W.
vllm command:
NCCL_P2P_DISABLE=[0 or 1] CUDA_VISIBLE_DEVICES=[3,5 OR 0,4,3,5] vllm serve Qwen/Qwen2.5-7B-Instruct-1M --tensor-parallel [2 OR 4] --gpu-memory-utilization 0.9 --max-model-len 32768
Client command:
python benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --model Qwen/Qwen2.5-7B-Instruct-1M --seed 12345 --dataset-name=random --num-prompts=200
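For example, the 2-GPU configuration with peer-to-peer (and therefore NVLink) enabled expands the server template above to:

NCCL_P2P_DISABLE=0 CUDA_VISIBLE_DEVICES=3,5 vllm serve Qwen/Qwen2.5-7B-Instruct-1M --tensor-parallel 2 --gpu-memory-utilization 0.9 --max-model-len 32768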
Conclusions:
The sweet spot for efficiency with an RTX 3090 is a power limit of around 220W.
NVLink improves inference performance (in tensor parallel) by about 50% when using 2x 3090s, and by about 10% when using 4x 3090s. This makes sense: with 4x 3090s, half of the inter-GPU communication still has to go through PCIe, since NVLink only bridges each pair.
I was surprised that inference performance improved by a whopping 50% with NVLink when using a pair of GPUs. Common wisdom holds that inference is not sensitive to inter-GPU bandwidth, but this experiment shows otherwise.