
I'm pretty new to the whole LLM thing and am trying to do some comparisons across different processors (e.g. CPU Max, Skylake, EPYC, NVIDIA GPUs, etc.). For the CPU part I'm focused on Intel Extension for Transformers (with neural-speed underneath). I'd like to try different quantizations (INT8, FP16, FP32, etc.) with different "acceleration" paths (AVX2, AVX-512, AMX, etc.), but I'm having a hard time getting my head around how to actually run with a particular quantization and/or acceleration. I'd like to start from the safetensors at https://huggingface.co/bigscience/bloom (due to limitations placed on me).
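
For concreteness, here's the kind of usage I'm picturing, adapted from the Intel Extension for Transformers README (treat it as a sketch; I may have argument names wrong for the version I'm on, and I'd obviously test on a smaller BLOOM variant like bigscience/bloom-560m before the full 176B model):

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "bigscience/bloom"  # the safetensors checkpoint I'm required to start from

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Once upon a time,", return_tensors="pt").input_ids

# INT4 weight-only quantization applied on the fly (README-style usage);
# presumably there are analogous knobs for INT8/FP16/FP32?
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```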

Is the quantization implicit in the model file I run with? Are there tools to convert from *.safetensors to model files that Intel Extension for Transformers / neural-speed can then use?
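
From poking at the neural-speed repo, it looks like there's a scripts/convert.py and scripts/quantize.py pair for turning a Hugging Face checkpoint into its own binary format, and also a Python Model API that seems to do the conversion implicitly. Is something like this (my best reading of the README, quite possibly wrong in the details) the intended path?

```python
from transformers import AutoTokenizer
from neural_speed import Model

model_name = "bigscience/bloom-560m"  # smaller BLOOM variant while I experiment

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Once upon a time,", return_tensors="pt").input_ids

# My (unverified) understanding: init() converts the Hugging Face checkpoint
# into neural-speed's own binary format and quantizes it in one step, so the
# quantization ends up baked into the converted file, not the safetensors.
model = Model()
model.init(model_name, weight_dtype="int4", compute_dtype="int8")
outputs = model.generate(inputs, max_new_tokens=32)
```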

Are there arguments to the Intel Extension for Transformers / neural-speed functions that do the conversion internally? And how can I see what's actually being used inside the Intel Extension code? (I've poked around in the source, but there's a lot of it and I haven't really gotten my head around it yet.)
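
One thing I stumbled on while reading: there seems to be a NEURAL_SPEED_VERBOSE environment variable that dumps timing/kernel information at runtime. Am I right that this is the intended way to see what's actually being dispatched (e.g. whether AMX kernels are in use)? A sketch of what I mean:

```python
import os

# Assumption on my part: the neural-speed README mentions NEURAL_SPEED_VERBOSE
# (levels 0/1/2) for dumping timing/kernel information; I haven't confirmed
# exactly what each level prints.
os.environ["NEURAL_SPEED_VERBOSE"] = "1"  # set before loading the model

# ...then load and run the model as above and watch the console output.
```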

I'd appreciate any discussion that helps me get on the right track.

Thanks.
