
Context

Distributing and storing GGUFs is difficult for 70B+ models, especially in F16. Many issues can happen during file transfers, for example:

  • temporary disk full
  • network interruption

Typically, GGUFs need to be transferred from Hugging Face to internal storage such as S3, MinIO, Git LFS, Nexus, or Artifactory, then downloaded by the inference server and stored locally (or on a Kubernetes PVC, for example).

Storage solutions and filesystems handle large GGUF files poorly: Hugging Face, for example, does not accept files larger than 50 GB, and similar limits exist on Artifactory.

Solution

We recently introduced the gguf-split CLI and added support for loading sharded GGUF models in llama.cpp:

Download a model
from huggingface_hub import snapshot_download
snapshot_download(repo_id="keyfan/grok-1-hf")
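If you prefer the shell, a rough equivalent is the following (this assumes a huggingface_hub release recent enough to ship the download command):
# assumes huggingface_hub provides the `download` command
huggingface-cli download keyfan/grok-1-hf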
Convert to GGUF F16
python -u convert-hf-to-gguf.py \
  ~/.cache/huggingface/hub/models--keyfan--grok-1-hf/snapshots/64e7373053c1bc7994ce427827b78ec11c181b3e/ \
  --outfile grok-1-f16.gguf \
  --outtype f16

NOTE: Follow the llama.cpp build instructions to generate all tools/CLIs: make.
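A minimal sketch of that step on a Unix-like system (tool names as they appear in the llama.cpp tree referenced here):
# from the llama.cpp repository root
make
./gguf-split -h   # sanity check that the CLI was built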

Quantize (optional)
quantize grok-1-f16.gguf grok-1-q4_0.gguf q4_0

Build model shards

Two sharding strategies are available:

  • Max tensors per file: --split-max-tensors 256
  • Max file size: --split-max-size 48G (see the size-based example below)
gguf-split --split --split-max-tensors 256 grok-1-q4_0.gguf grok-1-q4_0

This produces 9 files with at most 256 tensors each.
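
The size-based strategy uses the same invocation with the other flag; a sketch using the 48G limit from the list above:
# caps each shard by file size instead of tensor count
gguf-split --split --split-max-size 48G grok-1-q4_0.gguf grok-1-q4_0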

You can then upload the sharded model to your HF Repo:

huggingface-cli upload [repo_id] [local_path] [path_in_repo]

Files produced by gguf-split are valid GGUFs, so you can visualize them in HF.
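
For instance, with a hypothetical repo my-org/grok-1-GGUF and the shards written to ./shards, the upload might look like:
# repo id and local directory below are illustrative
huggingface-cli upload my-org/grok-1-GGUF ./shards grok-1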

Load sharded model

llama_load_model_from_file will detect the number of shards and load the additional tensors from the remaining files.

main --model grok-1-q4_0-00001-of-00009.gguf -ngl 64

You may notice:

llama_model_loader: additional 8 GGUFs metadata loaded.

Load sharded model from a remote URL
main --hf-repo ggml-org/models \
  --hf-file grok-1/grok-1-q4_0-00001-of-00009.gguf \
  --model   models/grok-1-q4_0-00001-of-00009.gguf \
  -ngl 64

Replies: 4 comments · 20 replies


Maybe I am wrong, but I couldn't make --split-max-size 48G work.

12 replies
@christianazinn

@ngxson Here it is.
[screenshot: 2024-04-11 080953]

The really strange part is that it looks like changing the parameter very slightly results in drastically different outputs:
[screenshot: 2024-04-11 081135]

@christianazinn

Ah, and I'm not as lucky as @maziyarpanahi with splitting Mixtral 8x22B. The only possible difference is that I quantized each model with an imatrix, but I'm not getting nicely sized splits without the flags.

@4cecoder

Please, for the love of GPT, how do you merge GGUF files?

There aren't many docs on the net.

I tried compiling llama.cpp on Windows and I've got that built, but now what?

I just want to merge the five sharded GGUF files of Command R+!

@christianazinn

@4cecoder gguf-split --merge [path-to-first-shard] [path-to-outfile] should do the trick.
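
For the Command R+ case above, that might look like this (shard and output file names are illustrative):
# reads the first shard, follows the remaining splits, writes a single GGUF
gguf-split --merge command-r-plus-q4_0-00001-of-00005.gguf command-r-plus-q4_0.gguf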



On Windows you can compile llama.cpp by opening a VS native tools command prompt (e.g. x64 Native Tools Command Prompt for VS 2022) and running the commands below. Make sure your VS install has the C++ development tools installed.

git clone --recursive https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -S . -B build && cmake --build build --config Release
build\bin\Release\gguf-split.exe -h
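
From there, a merge on Windows would follow the same pattern (file names below are illustrative):
rem file names are illustrative
build\bin\Release\gguf-split.exe --merge model-q4_0-00001-of-00005.gguf model-q4_0.gguf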
0 replies

@dranger003 @phymbert May I ask how to compile gguf-split on Mac?

(llamacpp) taozhiyu@603e5f4a42f1 llama.cpp-master % gguf-split
zsh: command not found: gguf-split

5 replies
@phymbert (Collaborator, Author), Apr 17, 2024

Hmm, I am Mac-less. @ggerganov, could you please help?

Is simply calling make not working?

@ggerganov

make
./gguf-split
@phymbert (Collaborator, Author), Apr 17, 2024

Summary updated ^)

@taozhiyuai

Works on Mac, thanks.

make
./gguf-split

@matbeedotcom

I don't have to compile this whole project just to merge GGUFs, do I?


Could you also include clearer and more specific instructions for --merge?

3 replies
@phymbert (Collaborator, Author), Apr 17, 2024

Merging is not necessary anymore in most use cases.

@4cecoder

Merging in LLM studio is no longer necessary, so just use LLM studio on Windows, @bennmann.

@q5sys

@phymbert "most cases" is not all cases, and it would be nice if this was clearly documented with full example for those cases that fall outside of "most". Front ends that use llama.cpp as a backend are sometimes not coded to operate with split files. and so people using those need to merge themselves. I dont know if it still is, but oobabooga is an example (at least a few weeks ago when I last tested it) where a UI was unable to deal with split files.

People are still running into issues trying to use this tool. As an example, here are two from the past week: https://huggingface.co/leafspark/Meta-Llama-3.1-405B-Instruct-GGUF/discussions/2
