An infuriating issue when using vLLM to self-host a model on my 6x3090 machine was 100% CPU usage on multiple cores even when vLLM was idle. The number of cores pegged at 100% matched the number of GPUs used for tensor parallelism.
This issue was resolved by PR #16226, which adds the environment variable VLLM_SLEEP_WHEN_IDLE. Setting it to 1 eliminates the constant polling and drops idle CPU usage to essentially zero, with minimal performance impact, especially in a self-hosting environment.
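Enabling it is just a matter of setting the variable in the environment of the vllm serve process. A minimal sketch (the full command I actually use is further down):

export VLLM_SLEEP_WHEN_IDLE=1
vllm serve openai/gpt-oss-120b -tp 4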
I am now very happy with my self-hosted setup, which uses Open-WebUI as my front-end and vLLM as my backend, serving gpt-oss-120b on 4x3090s. As a result, I can reserve two of my 3090s for diffusion tasks.
Getting gpt-oss-120b running also required a bit of finagling; however, vLLM has a good Quick Start Guide for gpt-oss-120b:
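# create an isolated environment and install the gpt-oss preview build of vLLM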
uv venv
source .venv/bin/activate
uv pip install --pre vllm==0.10.1+gptoss \
--extra-index-url https://wheels.vllm.ai/gpt-oss/ \
--extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
--index-strategy unsafe-best-match
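# serve gpt-oss-120b across four GPUs with tensor parallelism (the GPU indices are specific to my machine)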
VLLM_SLEEP_WHEN_IDLE=1 \
VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 \
CUDA_VISIBLE_DEVICES=3,5,0,4 \
vllm serve openai/gpt-oss-120b -tp 4 \
--max-num-batched-tokens 4096 \
--gpu-memory-utilization 0.85 \
--max-num-seqs 8 \
--async-scheduling \
--max-model-len 32k
Note the VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 environment variable and the --async-scheduling flag, which are respectively required and useful on Ampere-based GPUs.
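Once the server is up, vLLM exposes an OpenAI-compatible API (on port 8000 by default), so a quick smoke test looks something like this (the prompt is just an example):

# send a single chat completion request to the local server
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'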
I have since switched from Open-WebUI to Hollama as my front-end. It is lightweight and runs entirely in the browser.