ggml-quants : ternary packing for TriLMs and BitNet b1.58 #8151


Merged: 33 commits, merged Sep 6, 2024

Changes from 1 commit

Commits (33):
bd80749  ggml-quants : 1.625 bpw ternary packing for BitNet 1.58b (compilade, Jun 19, 2024)
7ef4254  ggml-quants : faster 1.625 bpw AVX2 vec_dot (compilade, Jun 19, 2024)
48b73b8  ggml-quants : subtract 1 when back in epi8 (compilade, Jun 19, 2024)
ef1e345  ggml-quants : Q2_2 now faster than Q4_K with AVX2 (compilade, Jun 20, 2024)
638ad52  ggml-quants : clean up Q1_3 code formatting (compilade, Jun 23, 2024)
9465ec6  ggml-quants : ARM NEON vec_dot for q2_2 and q1_3 (compilade, Jun 25, 2024)
89dc3b2  ggml-quants : use ceiling division when quantizing q1_3 (compilade, Jun 26, 2024)
961e293  convert-hf : simplify BitNet pre-quantization (compilade, Jun 26, 2024)
0996149  convert-hf : allow converting the weird BitNet 1.3B (compilade, Jun 27, 2024)
bfd2f21  bitnet : replace 1.58b with b1.58, as in the paper (compilade, Jun 29, 2024)
ec50944  ggml-quants : fix build failure on Windows (compilade, Jun 29, 2024)
8fbd593  ggml-quants : attempt to fix Arm 32-bit support (compilade, Jun 29, 2024)
dd3e62a  ggml : add some informative comments in q1_3 vec_dot (compilade, Jul 29, 2024)
79a278e  Merge branch 'master' into compilade/bitnet-ternary (compilade, Jul 29, 2024)
77b8f84  ggml : add TQ1_0 and TQ2_0 ternary quantization types (compilade, Jul 30, 2024)
560873f  ggml : even faster TQ2_0 (compilade, Jul 31, 2024)
e971957  ggml : also faster TQ1_0 (compilade, Jul 31, 2024)
a6dd699  ggml : fix build issues in certain environments (compilade, Aug 1, 2024)
5417089  ggml : add NEON vec_dot implementation for TQ1_0 and TQ2_0 (compilade, Aug 1, 2024)
45719a2  ggml : avoid directly using vmlal_high_s8, for 32-bit ARM compat (compilade, Aug 1, 2024)
04eec58  ggml : remove q1_3 and q2_2 (compilade, Aug 2, 2024)
f034aa1  ggml-quants : rename fields of TQ1_0 and TQ2_0 structs for consistency (compilade, Aug 3, 2024)
96b3d41  ggml-quants : allow using vdotq_s32 in TQ2_0 vec_dot (compilade, Aug 7, 2024)
d911cd1  Merge branch 'master' into compilade/bitnet-ternary (compilade, Aug 11, 2024)
3a0bf17  gguf-py : Numpy (de)quantization for TQ1_0 and TQ2_0 (compilade, Aug 12, 2024)
895004f  convert : allow direct conversion to TQ1_0 and TQ2_0 (compilade, Aug 13, 2024)
69f7726  ggml-quants : allow using ARM dot product instructions for TQ1_0 (compilade, Aug 13, 2024)
82b2404  Merge branch 'master' into compilade/bitnet-ternary (compilade, Aug 13, 2024)
35cc556  ggml-quants : deduplicate TQ1_0 and TQ2_0 __ARM_FEATURE_DOTPROD support (compilade, Aug 13, 2024)
cb6d996  Merge branch 'master' into compilade/bitnet-ternary (compilade, Aug 22, 2024)
7f3a619  Merge branch 'master' into compilade/bitnet-ternary (compilade, Sep 4, 2024)
8d61607  ggml : remove unused ggml_mul special case (compilade, Sep 4, 2024)
75b3a09  test-backend-ops : add TQ1_0 and TQ2_0 comments for later (compilade, Sep 4, 2024)

convert-hf : allow converting the weird BitNet 1.3B
Its FFN size is 5460, which is not convenient: it is not a multiple of the Q1_3 block size.
The offending tensors are kept in F16,
which makes the final model 5.01 bpw.
compilade committed Jun 27, 2024
commit 0996149911458ce9821aa49e10db4e7c1187486d
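For context, here is the arithmetic behind "not convenient", as a sketch (it assumes the 64-element, 13-byte Q1_3 block implied by the 1.625 bpw packing earlier in this PR; the constant name below is illustrative, gguf-py actually reads the sizes from GGML_QUANT_SIZES):

# Q1_3 packs 64 ternary weights into 13 bytes: 13 * 8 / 64 = 1.625 bpw,
# so a row is only packable when its length is a multiple of 64.
Q1_3_BLOCK_SIZE = 64  # illustrative; see GGML_QUANT_SIZES in gguf-py
print(5460 % Q1_3_BLOCK_SIZE)  # 20 -> rows of length 5460 cannot be packed
print(5440 % Q1_3_BLOCK_SIZE)  # 0  -> a nearby "convenient" size divides evenly
# The offending tensors stay in F16, which is why the converted model
# averages 5.01 bpw rather than ~1.625 bpw.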
convert-hf-to-gguf.py: 16 changes (10 additions, 6 deletions)
@@ -301,12 +301,16 @@ def write_tensors(self):
         if self.ftype != gguf.LlamaFileType.ALL_F32 and extra_f16 and not extra_f32:
             # TODO: cleaner model-specific per-tensor types
             # NOTE: Q1_3 is only relevant for BitNet 1.58b
-            if self.ftype == gguf.LlamaFileType.MOSTLY_Q1_3 and not any(
-                self.match_model_tensor_name(new_name, key, None)
-                for key in [
-                    gguf.MODEL_TENSOR.TOKEN_EMBD,
-                    gguf.MODEL_TENSOR.OUTPUT,
-                ]
+            if (
+                self.ftype == gguf.LlamaFileType.MOSTLY_Q1_3
+                and gguf.can_quantize_to_q1_3(data)
+                and not any(
+                    self.match_model_tensor_name(new_name, key, None)
+                    for key in [
+                        gguf.MODEL_TENSOR.TOKEN_EMBD,
+                        gguf.MODEL_TENSOR.OUTPUT,
+                    ]
+                )
             ):
                 data = gguf.quantize_q1_3(data)
                 assert data.dtype == np.uint8
gguf-py/gguf/quants.py: 4 changes (4 additions, 0 deletions)
@@ -126,6 +126,10 @@ def quantize_q8_0(data: np.ndarray):
 __q1_3_block_size, __q1_3_type_size = GGML_QUANT_SIZES[GGMLQuantizationType.Q1_3]
 
 
+def can_quantize_to_q1_3(n: np.ndarray) -> bool:
+    return n.shape[-1] % __q1_3_block_size == 0
+
+
 def __quantize_q1_3_shape_change(s: tuple[int, ...]) -> tuple[int, ...]:
     return (*s[:-1], s[-1] // __q1_3_block_size * __q1_3_type_size)

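A minimal usage sketch of the helper added above (hypothetical shapes; it assumes gguf-py is importable as gguf and that quantize_q1_3 takes a float32 ndarray, as the convert script above does):

import numpy as np
import gguf

good = np.zeros((2048, 5440), dtype=np.float32)  # 5440 % 64 == 0: packable
bad = np.zeros((2048, 5460), dtype=np.float32)   # 5460 % 64 == 20: not packable

print(gguf.can_quantize_to_q1_3(good))  # True
print(gguf.can_quantize_to_q1_3(bad))   # False; write_tensors() keeps such tensors F16

# Per __quantize_q1_3_shape_change, every 64 floats become one 13-byte block:
packed = gguf.quantize_q1_3(good)
print(packed.shape)  # (2048, 5440 // 64 * 13) == (2048, 1105), dtype uint8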