
I'm pretty new to the whole LLM thing and am trying to do some comparisons across different processors (e.g. CPU Max, Skylake, EPYC, NVIDIA GPUs, etc.). For the CPU part I'm focused on Intel Extension for Transformers (with neural-speed underneath). I'd like to try different quantizations (INT8, FP16, FP32, etc.) with different "acceleration" paths (AVX2, AVX-512, AMX, etc.), but I'm having a hard time getting my head around how to actually run with a particular quantization and/or acceleration. I'd like to start from the safetensors at https://huggingface.co/bigscience/bloom (due to limitations placed on me).
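
For concreteness, here's the kind of usage I'm picturing, adapted from the Intel Extension for Transformers README (treat it as a sketch; I may have argument names wrong for the version I'm on, and I'd obviously test on a smaller BLOOM variant like bigscience/bloom-560m before the full 176B model):

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "bigscience/bloom"  # the safetensors checkpoint I'm required to start from

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Once upon a time,", return_tensors="pt").input_ids

# INT4 weight-only quantization applied on the fly (README-style usage);
# presumably there are analogous knobs for INT8/FP16/FP32?
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```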

Is the quantization implicit in the model file I run with? Are there tools to convert from *.safetensors to model files that Intel Extension for Transformers / neural-speed can then use?
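
From poking at the neural-speed repo, it looks like there's a scripts/convert.py and scripts/quantize.py pair for turning a Hugging Face checkpoint into its own binary format, and also a Python Model API that seems to do the conversion implicitly. Is something like this (my best reading of the README, quite possibly wrong in the details) the intended path?

```python
from transformers import AutoTokenizer
from neural_speed import Model

model_name = "bigscience/bloom-560m"  # smaller BLOOM variant while I experiment

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Once upon a time,", return_tensors="pt").input_ids

# My (unverified) understanding: init() converts the Hugging Face checkpoint
# into neural-speed's own binary format and quantizes it in one step, so the
# quantization ends up baked into the converted file, not the safetensors.
model = Model()
model.init(model_name, weight_dtype="int4", compute_dtype="int8")
outputs = model.generate(inputs, max_new_tokens=32)
```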

Are there arguments to the Intel Extension for Transformers / neural-speed functions that do the conversion internally? And how can I see what's actually being used inside the Intel Extension code? (I've poked around in the source, but there's a lot of it and I haven't really gotten my head around it yet.)
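
One thing I stumbled on while reading: there seems to be a NEURAL_SPEED_VERBOSE environment variable that dumps timing/kernel information at runtime. Am I right that this is the intended way to see what's actually being dispatched (e.g. whether AMX kernels are in use)? A sketch of what I mean:

```python
import os

# Assumption on my part: the neural-speed README mentions NEURAL_SPEED_VERBOSE
# (levels 0/1/2) for dumping timing/kernel information; I haven't confirmed
# exactly what each level prints.
os.environ["NEURAL_SPEED_VERBOSE"] = "1"  # set before loading the model

# ...then load and run the model as above and watch the console output.
```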

I'd appreciate any discussion that helps me get on the right track.

Thanks.
