## Description

### Expected Behavior
I am trying to load the multimodal model `bartowski/google_gemma-3-4b-it-qat-GGUF` using the `Llama.from_pretrained` method. The script is configured to use the `Llama3VisionAlphaChatHandler` with the appropriate `mmproj` file.
I expect the library to successfully load both the multimodal projector and the main language model onto the GPU (using `n_gpu_layers=-1`) and become ready for inference without crashing.
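
For reference, once loading succeeds I would expect a multimodal chat completion in the usual chat-handler style to work, along these lines (a sketch only; the image URL and prompt are placeholders, and the message shape follows the library's LLaVA examples):

```python
# Sketch of the inference call I expect to work after loading.
# `llm` is the model returned by Llama.from_pretrained in the script below.
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/test.png"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
)
print(response["choices"][0]["message"]["content"])
```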
### Current Behavior
The library successfully loads and initializes the `mmproj-google_gemma-3-4b-it-qat-f16.gguf` file, detects the CUDA device, and loads the CLIP model onto the CUDA backend. However, immediately after the CLIP model is loaded, and before the main language model is fully loaded, the process terminates with a Windows C++ exception.

The script fails with the error: `An error occurred during model operation: [WinError -529697949] Windows Error 0xe06d7363`. Note that `0xE06D7363` is the Microsoft Visual C++ exception code, i.e. an unhandled C++ exception thrown inside the native DLL is surfacing across the ctypes boundary; `-529697949` is the same value interpreted as a signed 32-bit integer.
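
To narrow down where the crash happens, a minimal isolation test could be the following (a sketch; as far as I understand, the handler's constructor is what loads the `mmproj`/CLIP model, so this runs before the main model is ever touched):

```python
# Sketch: construct only the chat handler, which loads the CLIP/mmproj model,
# to check whether the crash occurs independently of the main language model.
from llama_cpp.llama_chat_format import Llama3VisionAlphaChatHandler

handler = Llama3VisionAlphaChatHandler(
    clip_model_path="models/mmproj-google_gemma-3-4b-it-qat-f16.gguf",
    verbose=True,
)
print("CLIP model loaded without crashing")
```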
## Environment and Context
- Hardware: NVIDIA GeForce RTX 3060 (Compute Capability 8.6)
- Operating System: Windows 11 24H2
- SDK Versions:
  - Python: 3.10.5
  - CUDA Toolkit: 12.4
  - `llama-cpp-python`: 0.3.9 (built from source with CUDA 12.4 support, as shown below)
  - `torch`: 2.5.1+cu124
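
For completeness, these versions can be double-checked with a quick snippet like:

```python
# Quick version/device check (standard attributes of llama_cpp and torch).
import llama_cpp
import torch

print("llama-cpp-python:", llama_cpp.__version__)
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0))
```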
## Steps to Reproduce

1. Set up a Python virtual environment on Windows with CUDA 12.4 installed.

2. Install `llama-cpp-python` from source with CUDA support using PowerShell:

   ```powershell
   $env:FORCE_CMAKE=1
   $env:CMAKE_ARGS='-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=native'
   uv pip install --upgrade --force-reinstall --no-cache-dir --no-binary :all: llama-cpp-python
   ```

3. Install the other dependencies: `torch`, `transformers`, and `huggingface-hub`.

4. Run the following Python script, which attempts to load the `bartowski/google_gemma-3-4b-it-qat-GGUF` model with full GPU offload (a sketch for fetching the `mmproj` file first follows this list):

   ```python
   from pathlib import Path

   from llama_cpp import Llama
   from llama_cpp.llama_chat_format import Llama3VisionAlphaChatHandler


   def load_gemma_model(mmproj_path: Path):
       print("Attempting to load Gemma 3 model...")
       chat_handler = Llama3VisionAlphaChatHandler(clip_model_path=str(mmproj_path))
       llm = Llama.from_pretrained(
           repo_id="bartowski/google_gemma-3-4b-it-qat-GGUF",
           filename="google_gemma-3-4b-it-qat-IQ2_M.gguf",  # Also fails with Q4_K_M and other quants
           chat_handler=chat_handler,
           n_ctx=2048,
           n_gpu_layers=-1,  # Fails with full offload
           verbose=True,
       )
       return llm


   if __name__ == "__main__":
       model_dir = Path("./models")
       model_dir.mkdir(parents=True, exist_ok=True)
       mmproj_filename = "mmproj-google_gemma-3-4b-it-qat-f16.gguf"
       mmproj_path = model_dir / mmproj_filename
       # (I assume the mmproj file is already downloaded and present at mmproj_path)
       try:
           if mmproj_path.exists():
               model = load_gemma_model(mmproj_path)
               print("Model loaded successfully!")
           else:
               print(f"Error: mmproj file not found at {mmproj_path}")
       except Exception as e:
           print(f"\nAn error occurred during model operation: {e}")
   ```
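
As noted in the script, the `mmproj` file is assumed to already be present under `./models`. It can be fetched beforehand with `huggingface_hub` (a sketch; the repo and filename match the ones used above):

```python
# Sketch: download the mmproj file used by the script into ./models.
from huggingface_hub import hf_hub_download

mmproj_path = hf_hub_download(
    repo_id="bartowski/google_gemma-3-4b-it-qat-GGUF",
    filename="mmproj-google_gemma-3-4b-it-qat-f16.gguf",
    local_dir="./models",
)
print("mmproj downloaded to", mmproj_path)
```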
## Failure Logs

The following log is produced when running the script. The crash occurs after `clip_model_load` completes and before the main `Llama` model object is returned.
```
clip_model_load: loaded meta data with 16 key-value pairs and 439 tensors from models\mmproj-google_gemma-3-4b-it-qat-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.projector_type str = gemma3
clip_model_load: - kv 2: clip.has_text_encoder bool = false
clip_model_load: - kv 3: clip.has_vision_encoder bool = true
clip_model_load: - kv 4: clip.has_llava_projector bool = false
clip_model_load: - kv 5: clip.vision.image_size u32 = 896
clip_model_load: - kv 6: clip.vision.patch_size u32 = 14
clip_model_load: - kv 7: clip.vision.embedding_length u32 = 1152
clip_model_load: - kv 8: clip.vision.feed_forward_length u32 = 4304
clip_model_load: - kv 9: clip.vision.projection_dim u32 = 2560
clip_model_load: - kv 10: clip.vision.block_count u32 = 27
clip_model_load: - kv 11: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 12: clip.vision.attention.layer_norm_epsilon f32 = 0.000001
clip_model_load: - kv 13: clip.vision.image_mean arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 14: clip.vision.image_std arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 15: clip.use_gelu bool = true
clip_model_load: - type f32: 276 tensors
clip_model_load: - type f16: 163 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
clip_model_load: CLIP using CUDA backend
clip_model_load: params backend buffer size = 811.79 MB (439 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file

An error occurred during model operation: [WinError -529697949] Windows Error 0xe06d7363
```