Conversation


@Readon Readon commented Sep 18, 2025

This commit introduces a new parameter `num_moe_offload` to the Modelfile, allowing users to offload Mixture-of-Experts (MoE) weights to the CPU to reduce VRAM usage.

The `num_moe_offload` parameter can be set to:

- A positive integer `N` to offload the first `N` MoE layers.
- `-1` to offload all MoE layers.
- `0` (default) to disable offloading.
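
With this parameter, a Modelfile might look like the sketch below. The `PARAMETER` directive is standard Modelfile syntax; the model name and value are only illustrations, and `num_moe_offload` is the parameter proposed in this PR, not one currently in Ollama.

```
FROM some-moe-model
# Hypothetical: pin the expert weights of the first 8 MoE layers to the CPU
PARAMETER num_moe_offload 8
```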

This is implemented by passing tensor override rules to the underlying `llama.cpp` library, which already supports this functionality. The documentation for the new parameter has also been updated.
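
The tensor override rules mentioned above could be generated roughly as follows. This is a hypothetical sketch, not the PR's actual code: `moeOverridePattern` is an invented helper, and the rule shape (`blk\.<i>\.ffn_.*_exps=CPU`) follows the regex-to-buffer-type syntax of llama.cpp's `--override-tensor` option, under the assumption that expert tensors are named `blk.<layer>.ffn_*_exps.*`.

```go
package main

import (
	"fmt"
	"strings"
)

// moeOverridePattern builds a llama.cpp tensor-override rule pinning the
// expert (MoE) weights of the first n layers to the CPU buffer.
// n < 0 means all layers; n == 0 disables offloading (empty rule).
func moeOverridePattern(n, numLayers int) string {
	if n == 0 {
		return ""
	}
	if n < 0 || n > numLayers {
		n = numLayers
	}
	layers := make([]string, n)
	for i := 0; i < n; i++ {
		layers[i] = fmt.Sprintf("%d", i)
	}
	// e.g. for n=3: blk\.(0|1|2)\.ffn_.*_exps=CPU
	return fmt.Sprintf(`blk\.(%s)\.ffn_.*_exps=CPU`, strings.Join(layers, "|"))
}

func main() {
	fmt.Println(moeOverridePattern(3, 32))
	fmt.Println(moeOverridePattern(-1, 4))
}
```

A single regex alternation keeps the override to one rule regardless of `n`, rather than emitting one rule per layer.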

An attempt to use Jules to solve #11772.

@jessegross (Contributor)

Thank you but we want to be able to configure this automatically based on available memory rather than making the user configure it, similar to how the rest of Ollama works. I also don't think it handles multiple GPUs properly.

In addition, generally we are not adding new features to the old llama engine.
