Himesh's Blog

Monday, November 24, 2025

Intel Arc B580 vs RTX 3090 - Llama.cpp Benchmark

I recently purchases an Intel ARC B580 for $320 CAD (~$225 USD). I received also a free game (Battlefield 6) valued at $75 CAD (~$53 USD). This is an absolutely insane deal for a very reasonable GPU.

I began testing it. The first test is with llama.cpp and model Qwen3-8B-Q4_K.

The ARC B580 uses the SYCL backend on Windows, whereas the RTX 3090 uses CUDA (cuBLAS) in Ubuntu Linux

> .\llama-bench.exe -m ..\Qwen3-8B-Q4_K_M.gguf -ngl 99 --threads 8 -p 512,1024,2048 -n 128,256,512 -sm none -mg 0

Device: Intel ARC B580 | test | t/s | | pp512 | 794.17 ± 2.06 | | pp1024 | 769.64 ± 0.78 | | pp2048 | 743.25 ± 0.21 | | tg128 | 48.41 ± 0.15 | | tg256 | 48.32 ± 0.07 | | tg512 | 48.17 ± 0.08 | Device: RTX 3090 (Power Limit 190W) | test | t/s | | --------------: | -------------------: | | pp512 | 2517.93 ± 209.39 | | pp1024 | 2375.25 ± 53.74 | | pp2048 | 2180.69 ± 12.87 | | tg128 | 68.85 ± 0.27 | | tg256 | 67.07 ± 0.60 | | tg512 | 65.14 ± 0.68 | Device: RTX 3090 (Power Limit 300W) | test | t/s | | --------------: | -------------------: | | pp512 | 4248.31 ± 15.44 | | pp1024 | 3965.97 ± 31.79 | | pp2048 | 3702.01 ± 7.88 | | tg128 | 129.63 ± 0.22 | | tg256 | 126.23 ± 0.14 | | tg512 | 121.94 ± 0.29 |

I would argue that the ARC B580 is quite respectable. More testing to follow!

Thursday, September 18, 2025

InternVL3.5-38B Multimodal Model on a single GPU!

I have been testing VLLMs or "Vision Large Language Models" for the past couple years, since the days of LLaVA. They have improved quite a bit over the 2 years.

Most recently, InternLM-VL3.5 series of models was released in multiple sizes (241B-A28B, 38B, 30B-A30B) with mean scores of 72.6, 70.4, 67.6 on a suite of benchmarks tested.

I wanted to try and fit the 38B flavor on a single RTX 3090 with 24GB of VRAM. I was able to squeeze it in using the IQ4_XS GGUF quantization with Q8_0 for the mmproj (vison encoder in GGUF format).

The full command is as follows:

MTMD_BACKEND_DEVICE=2 CUDA_VISIBLE_DEVICES=2 ./build/bin/llama-server -m ~/models/InternVL_3_5-38B-GGUF/OpenGVLab_InternVL3_5-38B-IQ4_XS.gguf --mmproj ~/models/InternVL_3_5-38B-GGUF/mmproj-OpenGVLab_InternVL3_5-38B-Q8_0.gguf --n-gpu-layers 100 --port 8000 --host 0.0.0.0 --no-mmap --flash-attn on --ctx-size 2048

I get about 38 tokens/s with generation. I would call that a success! The next goal is to fit the 241B version on 6x3090s.

Monday, August 11, 2025

vLLM Tensor Parallel and 100% CPU usage when idle

An infuriating issue when using vLLM to self-host a model on my 6x3090 machine was 100% CPU usage on multiple cores even when vLLM was idle. The number of cores that would be at 100% correlated to the number of GPUs utilized with tensor parallelism.

This issue was resolved with PR #16226 . The PR adds an environment variable "VLLM_SLEEP_WHEN_IDLE=1" which will essentially eliminate constant polling and reduce CPU usage to essentially 0 when idle. This has a minimum performance impact especially in self-hosting environments.

I am now very happy with me self-hosted setup which uses Open-WebUI as my front-end and vLLM as my backend serving gpt-oss-120b on 4x3090s. I can save 2 of my 3090s for diffusion tasks as a result.

Running gpt-oss-120b working also required a bit of finagling, however vLLM has a good Quick Start Guide for gpt-oss-120b

uv venv
source .venv/bin/activate

uv pip install --pre vllm==0.10.1+gptoss \
--extra-index-url https://wheels.vllm.ai/gpt-oss/ \
--extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
--index-strategy unsafe-best-match

VLLM_SLEEP_WHEN_IDLE=1 VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 CUDA_VISIBLE_DEVICES=3,5,0,4 vllm serve openai/gpt-oss-120b -tp 4 --max-num-batched-tokens 4096 --gpu-memory-utilization 0.85 --max-num-seqs 8 --async-scheduling --max-model-len 32k

Note the "VLLM_ATTENTION_BACKEND=TRITON" and --async-scheduling environment variable and input parameters respectively that are required and useful for Ampere based GPUs.

I also now use Hollama instead of open-webui as my front-end. It is lightweight and runs entirely in the browser.

Monday, March 10, 2025

VLLM Performance Benchmarks 4x RTX 3090 (Power Limits, and NVLINK)

I built an AI workstation to experiment with LLMs. I ran it with 4x RTX3090 and now have upgraded it to 6x RTX3090s.

I ran experiments to test various permutations of power limits, number of GPUs and NVLINK.

I used vLLM, which from my experimentation, is the best performing software to serve LLM models at scale and to multiple users. I have also tried TabbyAPI (exllamaV2), and llama-server (llama.cpp). They each have their pros and cons, but that discussion is beyond the scope of this post.

I used vLLMs client side benchmarking script "benchmark_serving.py"

Experiment #1 (Power vs Throughput):

vllm command:

NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,4,3,5 vllm serve Qwen/QwQ-32B --quantization="fp8" --tensor-parallel 4 --max-model-len 4096 --gpu-memory-utilization 0.95 --max-num-seqs 1

client command:

python benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --model Qwen/QwQ-32B --seed 12345 --dataset-name=random --num-prompts=10

Results:

Power Limit (W)	Output (t/s)	Throughput (t/s)	Output (t/joule)	Throughput (t/joule)
200	32	287	0.16	1.44
220	39	353	0.18	1.60
275	43	392	0.16	1.43
300	44	400	0.15	1.33

Experiment #2 (NVLINK and GPU topology):

I have 6 RTX 3090s. However, when using Tensor Parallel, the number of GPUs have to evenly divide the number of attention heads in the model, as well as the token vocabulary. Most models can be divided into 2, 4, or 8 GPUs. Unfortunately 6 GPUs does not usually work. As a result, I left 2 of my GPUs idle and tested with either 2 or 4 GPUs active.

NVLINK for RTX 3090 can only allow connect pairs of GPUs. If a GPU in 1 pair needs to communicate with a GPU in another pair, it has to go through PCIE. I ran all the cards at PCIE Gen4 x8.

For this experiment, I fixed the power limit of each GPU to 220W.

vllm command:

NCCL_P2P_DISABLE=[0 or 1] CUDA_VISIBLE_DEVICES=[3,5 OR 0,4,3,5] vllm serve Qwen/Qwen2.5-7B-Instruct-1M --tensor-parallel [2 OR 4] --gpu-memory-utilization 0.9 --max-model-len 32768

Client command:

python benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --model Qwen/Qwen2.5-7B-Instruct-1M --seed 12345 --dataset-name=random --num-prompts=200

Num GPUs	Nvlink	Output (t/s)	Throughput (t/s)
2	yes	715	6790
2	no	483	4583
4	yes	535	5093
4	no	490	4669

Conclusions:

The sweet spot for efficiency with an RTX3090 is around a power limit of 220Watts

NVLInk improves inference performance (in tensor parallel) by 50% if using 2x3090s, and by 10% if using 4x3090s. This makes sense. If you have 4x3090s, half of inter-GPU communication will be through PCIE.

I was surprised that inference performance improved by a whopping 50% with NVLINK when using a pair of GPUs. Common wisdom was that inference is not affected by inter GPU bandwidth, but this experiment proves otherwise.

Friday, August 10, 2018

Resizing partitions within an image file

I wanted to backup a 32GB SD Card to a 16GB SD Card. The 32 GB SD Card only contained 10GB of data so it should be possible.

I started by creating an image file of the 32GB SD Card

$ dd if=/dev/sdb of=SDCard.img

To complicate things, the original SD Card had 2 partitions, a FAT32 boot partition and and ext4 root partition. The key to making this work would be to shrink the root partition. I used this guide to accomplish the task.

Create a new loopback device

$ sudo losetup -f

Attach the image file to the new loopback device

$ sudo losetup /dev/loop9 SDCard.img

Run GParted on the attached loopback device

$ sudo gparted /dev/loop0

Use GParted GUI to resize the partitions to suit your need

Then disconnect the loopback device

$ sudo losetup -d /dev/loop0

Use Fdisk to findout the last used sector on the image

$ fdisk -l SDCard.img

Disk SDCardBackup.img: 29.6 GiB, 31724666880 bytes, 61962240 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x32e07f87

Device Boot Start End Sectors Size Id Type
SDCardBackup.img1 8192 93236 85045 41.5M c W95 FAT32 (LBA)
SDCardBackup.img2 94208 20574207 20480000 9.8G 83 Linux

Calculate the last byte and use truncate to chop the file down to size!

$ truncate --size=$[(20574207+1)*512] SDCardBackup.img

Hope this helps others in the same predicament!

Wednesday, August 1, 2018

Fast boot with Raspberry Pi

I am hoping to have a raspberry pi power a wildlife camera. This camera will have to rely on battery and solar power. As a result, it would be beneficial if the camera was off when no wildlife is present. To aid in this regard, I hope to use a motion sensor that can trigger the raspberry pi to turn on and take a picture. For this to work, the time from motion detection to picture snap is heavily influenced by the boot time of the raspberry pi. Here is a video of what I've been able to accomplish:

I am starting with the stock Raspbian Stretch Lite distribution on a Pi 3B. Boot times out of the box are on the order of 1 minute. Boot time is influenced by the following:

1. Hardware
2. Bootloader
3. Kernel
4. Userspace

The Raspberry Pi hardware and bootloader are essentially out of my control. There was an effort to open source the boot loader, however the proprietary binary blob is the only reasonable option at this point. The Hardware and bootloader take approximately a minimum of 1.5-2 seconds to run. This is explained in an excellent post on the Raspberry Pi Forums. The author tested boot times with various minimal boot loaders. The fastest any code could be run on the ARM processor was around 1.5 seconds.

I was able to get the kernel and userspace boot times down to about 0.6 second and 0.8 seconds respectively. As a result my total boot time is on the order of 3.5 to 4 seconds (from power on to picture taken).

To be able to control the Raspberry Pi without SSH, I used serial (UART) communications. See my previous post to learn how.

I reduced the kernel and userspace boot times by doing the following (in order highest yield to lowest yield):

1. Editing the /boot/config.txt with the following changes:

# Disable the rainbow splash screen
disable_splash=1

# Disable bluetooth
dtoverlay=pi3-disable-bt

#Disable Wifi
dtoverlay=pi3-disable-wifi

# Overclock the SD Card from 50 to 100MHz
# This can only be done with at least a UHS Class 1 card
dtoverlay=sdtweak,overclock_50=100

# Set the bootloader delay to 0 seconds. The default is 1s if not specified.
boot_delay=0

# Overclock the raspberry pi. This voids its warranty. Make sure you have a good power supply.
force_turbo=1

2. Make the kernel output less verbose by adding the "quiet" flag to the kernel command line in file /boot/cmdline.txt

dwc_otg.lpm_enable=0 console=serial0,115200 console=tty1 root=PARTUUID=32e07f87-02 rootfstype=ext4 elevator=deadline fsck.repair=yes quiet rootwait

3. Use systemd-analyze blame, systemd-analyze critical-chain to disable services I didn't need

sudo systemctl disable dhcpcd.service
sudo systemctl disable networking.service
sudo systemctl disable ssh.service
sudo systemctl disable ntp.service
sudo systemctl disable dphys-swapfile.service
sudo systemctl disable keyboard-setup.service
sudo systemctl disable apt-daily.service
sudo systemctl disable wifi-country.service
sudo systemctl disable hciuart.service
sudo systemctl disable raspi-config.service
sudo systemctl disable avahi-daemon.service
sudo systemctl disable triggerhappy.service

See the references below to learn about a primer on systemd and the new linux init system to learn about how to interpret and write the above services.

4. Add a service that runs the code you would like to run as fast as possible. For example if you wanted to add a service called "1ylapse", create the following file: /etc/systemd/system/1ylapse.service

[Unit]
Description=Starts 1 Year Lapse Service

[Service]
ExecStart=/home/pi/foo.sh
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=piservice
User=pi
Group=pi
WorkingDirectory=/root/1ylapse/

[Install]
WantedBy=basic.target

5. Analyze the kernel for unnecessary work being done at boot.

To do this you need to compile your kernel with "CONFIG_PRINTK_TIME" and "CONFIG_KALLSYMS". This should be enabled on the default raspberry pi kernel. This allows you to add "initcall_debug" to the kernel command line. The kernel will now output start and end time information for every init call. You can use "bootgraph.pl" which is included with the linux kernel to analyze the output of dmesg.

On the raspberry pi:

$ dmesg > boot.log

On the cross-compile host:

$ linux/scripts/bootgraph.pl boot.log > boot.sv

This will output an graph of what is taking the most time when initializing the kernel. I noticed that a routine used by the USB driver was taking around 0.3s. I don't need USB for my project so I disabled USB support when re-compiling the kernel (see below). This saved around 0.3s.

6. Re-compile the Linux kernel

Remove stuff that is wasting time during initialization. I used the guide from the Raspberry Pi Foundation to learn how to re-compile the kernel.

7. Use LZO compression for kernel

When compiling the Linux kernel, select "LZO" compression instead of "GZip". This saved around 0.3s.

8. Don't re-mount the /boot partition

Edit the /etc/fstab file and comment out the line that re-mounts the /boot partition. This saved around 0.2s.

The final systemd-analyze shows:

Startup finished in 669ms (kernel) + 1.225s (userspace) = 1.894s

It should be noted that my camera service starts before systemd is finished initializing. You can find out when your service starts by using systemd-analyze crritical-chain. You can see below that my service starts at 836ms after the kernel is finished initializing, rather than the total of 1.225s.

$ systemd-analyze critical-chain 1ylapse.service
1ylapse.service @836ms
└─basic.target @832ms
└─sockets.target @832ms
    └─dbus.socket @831ms
      └─sysinit.target @826ms
        └─systemd-update-utmp.service @784ms +41ms
          └─systemd-tmpfiles-setup.service @748ms +33ms
            └─systemd-journal-flush.service @658ms +87ms
              └─systemd-remount-fs.service @585ms +64ms
                └─systemd-fsck-root.service @444ms +137ms
                  └─systemd-journald.socket @433ms
                    └─-.slice @376ms

9. Remove plymouth to disable systemd init messages

sudo apt-get purge --remove plymouth

I haven't seen anyone boot a raspberry pi faster than this using full Raspbian. Bare metal is obviously faster however. However having full Raspbian available at this boot up speed is a good compromise.

Things that failed to improve boot time included making the root partition read only.

Hopefully this helps others in my predicament.

References:

Presentation by Jan Altenberg on booting linux in less than 1 second. Powerpoint here. Youtube of presentation here.
Excellent powerpoint on boot time optimization using a beagle bone as a prototype here.
Excellent powerpoint on speeding up raspberry pi boot time here.
Excellent primer on systemd-anzlyze.
Good stackoverflow question on using sytemd-analyze.

Tuesday, July 3, 2018

Serial Communications with Raspberry Pi

Running a headless raspberry pi can be challenging. Until now I've been using SSH to control my raspberry pi. This works well if your raspberry pi has wifi (namely the 3 and Zero W). However, i'm hoping to use the plain Raspbery Pi Zero for my current project which has no wifi built-in. Thus I needed a means to control and debug my Pi without wifi. This is where serial communication is beneficial.

I purchased a USB<->Serial adapter/cable from Buyapi.ca. It is based on the PL2303HX chipset. It uses 3.3v to drive the RX and TX lines which is compatible with the Raspberry Pi.

The connections are:

Red - GPIO2 (5V)
Black - GPIO6 (Ground)
White (RX into USB) - GPIO8 (TXD from Raspberry Pi)
Green (TX out of USB) - GPIO10 (RXD to Raspberry Pi)

Please note that the above connection will cause the raspberry pi to draw power from the USB<->Serial cable. This is usually enough for a Pi Zero, however will cause the Pi 3 to brownout. To handle this, supply the Pi 3 with an external power supply and disconnect the red (5V) wire. Make sure to keep ground (black) connected, however, to prevent ground loops.

My Raspberry Pi did not have the default Raspbian Linux console (the console that prints on a screen if you have one) broadcasting on the serial interface. To enable it you can run:

sudo raspi-config

Look for "Interfacing options", then option P6, Serial, and select "Yes". To use the serial console for other purposes you can set it to "No". For more information see the documentation on the raspberry pi website.

To use the console, fire up a terminal in Ubuntu and type:

sudo screen /dev/ttyUSB0 115200

The device "/dev/ttyUSB0" maybe different depending on your host kernel. Just look for something in "/dev" that looks similar. You can double check if it is correct by removing the PL2303HX device and see if the device you suspect disappears from the list in "/dev".

Monday, November 24, 2025

Intel Arc B580 vs RTX 3090 - Llama.cpp Benchmark

Thursday, September 18, 2025

InternVL3.5-38B Multimodal Model on a single GPU!

Monday, August 11, 2025

vLLM Tensor Parallel and 100% CPU usage when idle

Monday, March 10, 2025

VLLM Performance Benchmarks 4x RTX 3090 (Power Limits, and NVLINK)

Experiment #1 (Power vs Throughput):

vllm command:

client command:

<span style="font-weight: normal;">td {border: 1px solid #cccccc;}br {mso-data-placement:same-cell;}</span>python benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --model Qwen/QwQ-32B --seed 12345 --dataset-name=random --num-prompts=10

Results:

td {border: 1px solid #cccccc;}br {mso-data-placement:same-cell;}

Experiment #2 (NVLINK and GPU topology):

vllm command:

Client command:

Conclusions:

Friday, August 10, 2018

Resizing partitions within an image file

Wednesday, August 1, 2018

Fast boot with Raspberry Pi

1. Editing the /boot/config.txt with the following changes:

2. Make the kernel output less verbose by adding the "quiet" flag to the kernel command line in file /boot/cmdline.txt

3. Use systemd-analyze blame, systemd-analyze critical-chain to disable services I didn't need

4. Add a service that runs the code you would like to run as fast as possible. For example if you wanted to add a service called "1ylapse", create the following file: /etc/systemd/system/1ylapse.service

5. Analyze the kernel for unnecessary work being done at boot.

6. Re-compile the Linux kernel

7. Use LZO compression for kernel

8. Don't re-mount the /boot partition

9. Remove plymouth to disable systemd init messages

Tuesday, July 3, 2018

Serial Communications with Raspberry Pi

python benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --model Qwen/QwQ-32B --seed 12345 --dataset-name=random --num-prompts=10