## Description

### Expected Behavior
I am trying to load the multimodal model `bartowski/google_gemma-3-4b-it-qat-GGUF` using the `Llama.from_pretrained` method. The script is configured to use the `Llama3VisionAlphaChatHandler` with the appropriate `mmproj` file.
I expect the library to successfully load both the multimodal projector and the main language model onto the GPU (using `n_gpu_layers=-1`) and become ready for inference without crashing.
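
For reference, once loading succeeds I would expect a multimodal chat completion in the usual chat-handler style to work, along these lines (a sketch only; the image URL and prompt are placeholders, and the message shape follows the library's LLaVA examples):

```python
# Sketch of the inference call I expect to work after loading.
# `llm` is the model returned by Llama.from_pretrained in the script below.
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/test.png"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
)
print(response["choices"][0]["message"]["content"])
```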
### Current Behavior
The library successfully loads and initializes the `mmproj-google_gemma-3-4b-it-qat-f16.gguf` file, detects the CUDA device, and loads the CLIP model onto the CUDA backend. However, immediately after the CLIP model is loaded, and before the main language model is fully loaded, the process terminates with a Windows C++ exception.

The script fails with the error: `An error occurred during model operation: [WinError -529697949] Windows Error 0xe06d7363`. Note that `0xE06D7363` is the Microsoft Visual C++ exception code, i.e. an unhandled C++ exception thrown inside the native DLL is surfacing across the ctypes boundary; `-529697949` is the same value interpreted as a signed 32-bit integer.
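
To narrow down where the crash happens, a minimal isolation test could be the following (a sketch; as far as I understand, the handler's constructor is what loads the `mmproj`/CLIP model, so this runs before the main model is ever touched):

```python
# Sketch: construct only the chat handler, which loads the CLIP/mmproj model,
# to check whether the crash occurs independently of the main language model.
from llama_cpp.llama_chat_format import Llama3VisionAlphaChatHandler

handler = Llama3VisionAlphaChatHandler(
    clip_model_path="models/mmproj-google_gemma-3-4b-it-qat-f16.gguf",
    verbose=True,
)
print("CLIP model loaded without crashing")
```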
## Environment and Context
- Hardware: NVIDIA GeForce RTX 3060 (Compute Capability 8.6)
- Operating System: Windows 11 24H2
- SDK Versions:
  - Python: 3.10.5
  - CUDA Toolkit: 12.4
  - `llama-cpp-python`: 0.3.9 (built from source with CUDA 12.4 support, as shown below)
  - `torch`: 2.5.1+cu124
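
For completeness, these versions can be double-checked with a quick snippet like:

```python
# Quick version/device check (standard attributes of llama_cpp and torch).
import llama_cpp
import torch

print("llama-cpp-python:", llama_cpp.__version__)
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0))
```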
## Steps to Reproduce

1. Set up a Python virtual environment on Windows with CUDA 12.4 installed.

2. Install `llama-cpp-python` from source with CUDA support using PowerShell:

   ```powershell
   $env:FORCE_CMAKE=1
   $env:CMAKE_ARGS='-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=native'
   uv pip install --upgrade --force-reinstall --no-cache-dir --no-binary :all: llama-cpp-python
   ```

3. Install the other dependencies: `torch`, `transformers`, and `huggingface-hub`.

4. Run the following Python script, which attempts to load the `bartowski/google_gemma-3-4b-it-qat-GGUF` model with full GPU offload (a sketch for fetching the `mmproj` file first follows this list):

   ```python
   from pathlib import Path

   from llama_cpp import Llama
   from llama_cpp.llama_chat_format import Llama3VisionAlphaChatHandler


   def load_gemma_model(mmproj_path: Path):
       print("Attempting to load Gemma 3 model...")
       chat_handler = Llama3VisionAlphaChatHandler(clip_model_path=str(mmproj_path))
       llm = Llama.from_pretrained(
           repo_id="bartowski/google_gemma-3-4b-it-qat-GGUF",
           filename="google_gemma-3-4b-it-qat-IQ2_M.gguf",  # Also fails with Q4_K_M and other quants
           chat_handler=chat_handler,
           n_ctx=2048,
           n_gpu_layers=-1,  # Fails with full offload
           verbose=True,
       )
       return llm


   if __name__ == "__main__":
       model_dir = Path("./models")
       model_dir.mkdir(parents=True, exist_ok=True)
       mmproj_filename = "mmproj-google_gemma-3-4b-it-qat-f16.gguf"
       mmproj_path = model_dir / mmproj_filename
       # (I assume the mmproj file is already downloaded and present at mmproj_path)
       try:
           if mmproj_path.exists():
               model = load_gemma_model(mmproj_path)
               print("Model loaded successfully!")
           else:
               print(f"Error: mmproj file not found at {mmproj_path}")
       except Exception as e:
           print(f"\nAn error occurred during model operation: {e}")
   ```
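
As noted in the script, the `mmproj` file is assumed to already be present under `./models`. It can be fetched beforehand with `huggingface_hub` (a sketch; the repo and filename match the ones used above):

```python
# Sketch: download the mmproj file used by the script into ./models.
from huggingface_hub import hf_hub_download

mmproj_path = hf_hub_download(
    repo_id="bartowski/google_gemma-3-4b-it-qat-GGUF",
    filename="mmproj-google_gemma-3-4b-it-qat-f16.gguf",
    local_dir="./models",
)
print("mmproj downloaded to", mmproj_path)
```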
## Failure Logs

The following log is produced when running the script. The crash occurs after `clip_model_load` completes and before the main `Llama` model object is returned.
```
clip_model_load: loaded meta data with 16 key-value pairs and 439 tensors from models\mmproj-google_gemma-3-4b-it-qat-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.projector_type str = gemma3
clip_model_load: - kv 2: clip.has_text_encoder bool = false
clip_model_load: - kv 3: clip.has_vision_encoder bool = true
clip_model_load: - kv 4: clip.has_llava_projector bool = false
clip_model_load: - kv 5: clip.vision.image_size u32 = 896
clip_model_load: - kv 6: clip.vision.patch_size u32 = 14
clip_model_load: - kv 7: clip.vision.embedding_length u32 = 1152
clip_model_load: - kv 8: clip.vision.feed_forward_length u32 = 4304
clip_model_load: - kv 9: clip.vision.projection_dim u32 = 2560
clip_model_load: - kv 10: clip.vision.block_count u32 = 27
clip_model_load: - kv 11: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 12: clip.vision.attention.layer_norm_epsilon f32 = 0.000001
clip_model_load: - kv 13: clip.vision.image_mean arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 14: clip.vision.image_std arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 15: clip.use_gelu bool = true
clip_model_load: - type f32: 276 tensors
clip_model_load: - type f16: 163 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
clip_model_load: CLIP using CUDA backend
clip_model_load: params backend buffer size = 811.79 MB (439 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file

An error occurred during model operation: [WinError -529697949] Windows Error 0xe06d7363
```