I built an AI workstation to experiment with LLMs. I originally ran it with 4x RTX 3090s and have now upgraded it to 6x RTX 3090s.
I ran experiments to test various combinations of power limits, GPU counts, and NVLink.
I used vLLM, which, from my experimentation, is the best-performing software for serving LLMs at scale to multiple users. I have also tried TabbyAPI (exllamaV2) and llama-server (llama.cpp). They each have their pros and cons, but that discussion is beyond the scope of this post.
I used vLLM's client-side benchmarking script, benchmark_serving.py.
Experiment #1 (Power vs Throughput):
vllm command:
NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,4,3,5 vllm serve Qwen/QwQ-32B --quantization="fp8" --tensor-parallel 4 --max-model-len 4096 --gpu-memory-utilization 0.95 --max-num-seqs 1
Client command:
python benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --model Qwen/QwQ-32B --seed 12345 --dataset-name=random --num-prompts=10
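For each power level tested, the limit can be applied per GPU with nvidia-smi before launching the server. A minimal sketch, using the 220W level and the four GPU indices from the command above as illustrative values (nvidia-smi indexing may differ from CUDA_VISIBLE_DEVICES ordering):

sudo nvidia-smi -pm 1  # enable persistence mode so the limit sticks between runs
for i in 0 3 4 5; do sudo nvidia-smi -i "$i" -pl 220; done  # cap each GPU at 220 W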
Results:
Experiment #2 (NVLINK and GPU topology):
I have 6 RTX 3090s. However, when using tensor parallel, the number of GPUs has to evenly divide the number of attention heads in the model, as well as the token vocabulary size. Most models can be split across 2, 4, or 8 GPUs; unfortunately, 6 GPUs usually does not work. As a result, I left 2 of my GPUs idle and tested with either 2 or 4 GPUs active.
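A quick way to check which tensor-parallel sizes a model supports is to read the attention-head count and vocabulary size from its Hugging Face config. A minimal sketch using the transformers library (QwQ-32B is just the example model here):

python -c "from transformers import AutoConfig; \
cfg = AutoConfig.from_pretrained('Qwen/QwQ-32B'); \
print('attention heads:', cfg.num_attention_heads); \
print('vocab size:', cfg.vocab_size); \
print('TP sizes that divide the heads evenly:', [tp for tp in (2, 3, 4, 6, 8) if cfg.num_attention_heads % tp == 0])"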
NVLink on the RTX 3090 can only connect pairs of GPUs. If a GPU in one pair needs to communicate with a GPU in another pair, the traffic has to go through PCIe. I ran all the cards at PCIe Gen4 x8.
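The NVLink pairing and the PCIe paths between the other GPUs can be confirmed with nvidia-smi's topology matrix; NVLink-bridged pairs show up as NV# entries, while everything else reports a PCIe or host path (the exact labels depend on the motherboard):

nvidia-smi topo -m  # print the GPU interconnect matrix (NV# = NVLink, PIX/PXB/PHB/NODE/SYS = PCIe/host paths)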
For this experiment, I fixed the power limit of each GPU at 220W.
vllm command:
NCCL_P2P_DISABLE=[0 or 1] CUDA_VISIBLE_DEVICES=[3,5 OR 0,4,3,5] vllm serve Qwen/Qwen2.5-7B-Instruct-1M --tensor-parallel [2 OR 4] --gpu-memory-utilization 0.9 --max-model-len 32768
Client command:
python benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --model Qwen/Qwen2.5-7B-Instruct-1M --seed 12345 --dataset-name=random --num-prompts=200
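For example, the 2-GPU configuration with peer-to-peer (and therefore NVLink) enabled expands the server template above to:

NCCL_P2P_DISABLE=0 CUDA_VISIBLE_DEVICES=3,5 vllm serve Qwen/Qwen2.5-7B-Instruct-1M --tensor-parallel 2 --gpu-memory-utilization 0.9 --max-model-len 32768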
Conclusions:
The sweet spot for efficiency with an RTX 3090 is a power limit of around 220W.
NVLink improves inference performance (in tensor parallel) by about 50% when using 2x 3090s, and by about 10% when using 4x 3090s. This makes sense: with 4x 3090s, half of the inter-GPU communication still has to go through PCIe, since NVLink only bridges each pair.
I was surprised that inference performance improved by a whopping 50% with NVLink when using a pair of GPUs. Common wisdom holds that inference is not sensitive to inter-GPU bandwidth, but this experiment shows otherwise.