Thursday, September 18, 2025

InternVL3.5-38B Multimodal Model on a single GPU!

I have been testing VLLMs, or "vision large language models," for the past couple of years, going back to the days of LLaVA. They have improved quite a bit in that time.

Most recently, the InternVL3.5 series of models was released in multiple sizes (including 241B-A28B, 38B, and 30B-A3B), with mean scores of 72.6, 70.4, and 67.6 respectively on a suite of benchmarks.

I wanted to try to fit the 38B flavor on a single RTX 3090 with 24GB of VRAM. I was able to squeeze it in using the IQ4_XS GGUF quantization for the language model and Q8_0 for the mmproj (the vision encoder in GGUF format).
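As a rough back-of-envelope check on why this combination fits, here is a small Python sketch. The numbers are assumptions, not measurements: it treats the main GGUF as roughly 32B language-model weights and the mmproj as roughly 6B vision-encoder weights, with IQ4_XS at about 4.25 bits per weight and Q8_0 at about 8.5.

# Rough VRAM estimate for the quantized weights (approximate, not measured).
# Assumes ~32B language-model params in the main GGUF and ~6B vision-encoder
# params in the mmproj; IQ4_XS ~4.25 bits/weight, Q8_0 ~8.5 bits/weight.
def gguf_gib(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

llm_gib = gguf_gib(32, 4.25)    # ~15.8 GiB for the IQ4_XS language model
mmproj_gib = gguf_gib(6, 8.5)   # ~5.9 GiB for the Q8_0 vision encoder
print(f"weights only: ~{llm_gib + mmproj_gib:.1f} GiB")  # ~21.8 GiB

That leaves only a couple of gigabytes for the KV cache and compute buffers on a 24GB card, which is why the context size in the command below is kept small.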

The full command is as follows:

MTMD_BACKEND_DEVICE=2 CUDA_VISIBLE_DEVICES=2 ./build/bin/llama-server \
  -m ~/models/InternVL_3_5-38B-GGUF/OpenGVLab_InternVL3_5-38B-IQ4_XS.gguf \
  --mmproj ~/models/InternVL_3_5-38B-GGUF/mmproj-OpenGVLab_InternVL3_5-38B-Q8_0.gguf \
  --n-gpu-layers 100 --port 8000 --host 0.0.0.0 \
  --no-mmap --flash-attn on --ctx-size 2048
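Once the server is up, it exposes llama.cpp's OpenAI-compatible API on port 8000, so you can send an image-plus-text prompt with a few lines of Python. This is just a sketch: the image path and prompt are placeholders, and it assumes the endpoint accepts OpenAI-style image_url content parts, as recent llama-server builds with an --mmproj loaded do.

import base64, requests

# "test.jpg" is a placeholder; point this at any local image.
with open("test.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        "max_tokens": 256,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])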

I get about 38 tokens/s during generation. I would call that a success! The next goal is to fit the 241B version across 6x 3090s.
