This repository was archived by the owner on Oct 25, 2024. It is now read-only.

I would like to run model inference using Intel AMX. I have run inference on Llama-2 with the code provided in the LLM Runtime section, and I noticed that the inference times with the repository code, which uses the "intel_extension_for_transformers.transformers" library, are similar to the inference times with the base transformers library.

Therefore, I was wondering whether I need to enable Intel AMX on my machine somehow. I am currently using m7i.2xlarge and m7i.4xlarge instances on AWS. Any suggestions? Thank you!
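
A quick way to sanity-check that an instance exposes AMX is to look for the amx_tile / amx_bf16 / amx_int8 CPU flags. The small Python sketch below is only illustrative (Linux-only, reads /proc/cpuinfo):

from pathlib import Path

# Read the kernel-reported CPU flags and report whether the AMX features
# (tile registers, BF16 and INT8 tile ops) are visible to this machine.
cpuinfo = Path("/proc/cpuinfo").read_text()
flags_line = next(line for line in cpuinfo.splitlines() if line.startswith("flags"))
flags = set(flags_line.split(": ", 1)[1].split())

for feature in ("amx_tile", "amx_bf16", "amx_int8"):
    print(f"{feature}: {'yes' if feature in flags else 'no'}")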


Replies: 3 comments


We do support it on AWS instances.
Can you share how you execute LLM Runtime?


Hi, and thank you for the response!
The code I used is the following:

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig
import time
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

while True:
    print("> ", end="")
    prompt = input().strip()
    if prompt == "quit":
        break
    start = time.time()
    b_prompt = "[INST]{}[/INST]".format(prompt)
    inputs = tokenizer(b_prompt, return_tensors="pt").input_ids
    outputs = model.generate(inputs, streamer=streamer,
                num_beams=1, max_new_tokens=512, do_sample=True, repetition_penalty=1.1)
    inference_time = time.time() - start
    print(f'\n> Total Inference Time: {inference_time} seconds')

Compared to the code in the LLM Runtime example, I removed the line that quantizes the model:

woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")

So, am I required to quantize the model?
Thanks for the help! :)


So, am I required to quantize the model?

I'm afraid so. We only enable AMX with quantized weights, since unquantized weights benefit little due to the intensive runtime conversion. In fact, you are not even running our optimized LLM Runtime if WeightOnlyQuantConfig is not provided:

if isinstance(quantization_config, WeightOnlyQuantConfig):
    logger.info("Applying Weight Only Quantization.")
    if use_llm_runtime:
        logger.info("Using LLM runtime.")
        quantization_config.post_init_runtime()
        from intel_extension_for_transformers.llm.runtime.graph import Model
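
For completeness, re-adding the quantization config is what routes from_pretrained through that branch. A minimal sketch, assuming the int8/int4 settings from the LLM Runtime example:

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Passing WeightOnlyQuantConfig selects the optimized LLM Runtime
# (and the AMX int8 kernels) in the code path quoted above.
woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=woq_config, trust_remote_code=True
)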
