
Context

Distributing and storing GGUFs is difficult for 70B+ models, especially in F16. Many issues can happen during file transfers, for example:

  • temporary disk full
  • network interruption

Typically, GGUFs need to be transferred from Hugging Face to internal storage such as S3, MinIO, Git LFS, Nexus, or Artifactory, then downloaded by the inference server and stored locally (or on a Kubernetes PVC, for example).

Storage solutions and filesystems handle large GGUF files poorly: Hugging Face, for example, does not accept files larger than 50 GB, and similar limits exist on Artifactory.

Solution

We recently introduced the gguf-split CLI and added support for loading sharded GGUF models in llama.cpp:

Download a model
from huggingface_hub import snapshot_download
snapshot_download(repo_id="keyfan/grok-1-hf")
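If you prefer the shell, a rough equivalent is the following (this assumes a huggingface_hub release recent enough to ship the download command):
# assumes huggingface_hub provides the `download` command
huggingface-cli download keyfan/grok-1-hf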
Convert to GGUF F16
python -u convert-hf-to-gguf.py \
  ~/.cache/huggingface/hub/models--keyfan--grok-1-hf/snapshots/64e7373053c1bc7994ce427827b78ec11c181b3e/ \
  --outfile grok-1-f16.gguf \
  --outtype f16

NOTE: Follow the llama.cpp build instructions to generate all tools/CLIs: make.
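A minimal sketch of that step on a Unix-like system (tool names as they appear in the llama.cpp tree referenced here):
# from the llama.cpp repository root
make
./gguf-split -h   # sanity check that the CLI was built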

Quantize (optional)
quantize grok-1-f16.gguf grok-1-q4_0.gguf q4_0

Build model shards

Two sharding strategies are available:

  • Max tensors per file: --split-max-tensors 256
  • Max file size: --split-max-size 48G (see the size-based example below)
gguf-split --split --split-max-tensors 256 grok-1-q4_0.gguf grok-1-q4_0

This produces 9 files with at most 256 tensors each.
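
The size-based strategy uses the same invocation with the other flag; a sketch using the 48G limit from the list above:
# caps each shard by file size instead of tensor count
gguf-split --split --split-max-size 48G grok-1-q4_0.gguf grok-1-q4_0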

You can then upload the sharded model to your HF Repo:

huggingface-cli upload [repo_id] [local_path] [path_in_repo]

Files produced by gguf-split are valid GGUFs, so you can visualize them in HF.
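
For instance, with a hypothetical repo my-org/grok-1-GGUF and the shards written to ./shards, the upload might look like:
# repo id and local directory below are illustrative
huggingface-cli upload my-org/grok-1-GGUF ./shards grok-1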

Load sharded model

llama_load_model_from_file will detect the number of shards and load the additional tensors from the remaining files.

main --model grok-1-q4_0-00001-of-00009.gguf -ngl 64

You may notice:

llama_model_loader: additional 8 GGUFs metadata loaded.

Load sharded model from a remote URL
main --hf-repo ggml-org/models \
  --hf-file grok-1/grok-1-q4_0-00001-of-00009.gguf \
  --model   models/grok-1-q4_0-00001-of-00009.gguf \
  -ngl 64

Replies: 4 comments · 20 replies


Maybe I am wrong, but I couldn't make --split-max-size 48G work.

12 replies
@christianazinn

@ngxson Here it is.
[screenshot: 2024-04-11 080953]

The really strange part is that it looks like changing the parameter very slightly results in drastically different outputs:
[screenshot: 2024-04-11 081135]

@christianazinn

Ah, and I'm not as lucky as @maziyarpanahi with splitting Mixtral 8x22B. The only possible difference is that I quantized each model with an imatrix, but I'm not getting nicely sized splits without the flags.

@4cecoder

Please, for the love of GPT, how do you merge GGUF files?

There aren't many docs on the net.

I tried compiling llama.cpp on Windows and I've got that built, but now what?

I just want to merge the five sharded GGUF files of Command R+!

@christianazinn

@4cecoder gguf-split --merge [path-to-first-shard] [path-to-outfile] should do the trick.
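
For the Command R+ case above, that might look like this (shard and output file names are illustrative):
# reads the first shard, follows the remaining splits, writes a single GGUF
gguf-split --merge command-r-plus-q4_0-00001-of-00005.gguf command-r-plus-q4_0.gguf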



On Windows you can compile llama.cpp by opening a VS native tools command prompt (e.g. x64 Native Tools Command Prompt for VS 2022) and running the commands below. Make sure your VS install has the C++ development tools installed.

git clone --recursive https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -S . -B build && cmake --build build --config Release
build\bin\Release\gguf-split.exe -h
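
From there, a merge on Windows would follow the same pattern (file names below are illustrative):
rem file names are illustrative
build\bin\Release\gguf-split.exe --merge model-q4_0-00001-of-00005.gguf model-q4_0.gguf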
0 replies

@dranger003 @phymbert May I ask how to compile gguf-split on Mac?

(llamacpp) taozhiyu@603e5f4a42f1 llama.cpp-master % gguf-split
zsh: command not found: gguf-split

5 replies
@phymbert (Collaborator, Author), Apr 17, 2024

Hmm, I am Mac-less. @ggerganov, could you please help?

Is simply calling make not working?

@ggerganov

make
./gguf-split
@phymbert (Collaborator, Author), Apr 17, 2024

Summary updated ^)

@taozhiyuai

Works on Mac, thanks.

make
./gguf-split

@matbeedotcom

I don't have to compile this whole project just to merge GGUFs, do I?


Could you also include clearer and more specific instructions for --merge?

3 replies
@phymbert (Collaborator, Author), Apr 17, 2024

Merging is not necessary anymore in most use cases.

@4cecoder

Merging in LLM studio is no longer necessary, so just use LLM studio on Windows, @bennmann.

@q5sys

@phymbert "most cases" is not all cases, and it would be nice if this was clearly documented with full example for those cases that fall outside of "most". Front ends that use llama.cpp as a backend are sometimes not coded to operate with split files. and so people using those need to merge themselves. I dont know if it still is, but oobabooga is an example (at least a few weeks ago when I last tested it) where a UI was unable to deal with split files.

People are still running into issues trying to use this tool. As an example, here are two from the past week: https://huggingface.co/leafspark/Meta-Llama-3.1-405B-Instruct-GGUF/discussions/2
