This repository was archived by the owner on Oct 25, 2024. It is now read-only.

I would like to run model inference using Intel AMX. I have run inference on Llama-2 with the code provided in the LLM Runtime section, and I noticed that the inference times with the repository code, which uses the "intel_extension_for_transformers.transformers" library, are similar to the inference times with the base transformers library.

Therefore, I was wondering whether I need to enable Intel AMX on my machine somehow. I am currently using m7i.2xlarge and m7i.4xlarge instances on AWS. Any suggestions? Thank you!
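
A quick way to sanity-check that an instance exposes AMX is to look for the amx_tile / amx_bf16 / amx_int8 CPU flags. The small Python sketch below is only illustrative (Linux-only, reads /proc/cpuinfo):

from pathlib import Path

# Read the kernel-reported CPU flags and report whether the AMX features
# (tile registers, BF16 and INT8 tile ops) are visible to this machine.
cpuinfo = Path("/proc/cpuinfo").read_text()
flags_line = next(line for line in cpuinfo.splitlines() if line.startswith("flags"))
flags = set(flags_line.split(": ", 1)[1].split())

for feature in ("amx_tile", "amx_bf16", "amx_int8"):
    print(f"{feature}: {'yes' if feature in flags else 'no'}")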


Replies: 3 comments


We do support it on AWS instances.
Can you share how you execute LLM Runtime?


Hi, and thank you for the response!
The code I used is the following:

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig
import time
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

while True:
    print("> ", end="")
    prompt = input().strip()
    if prompt == "quit":
        break
    start = time.time()
    b_prompt = "[INST]{}[/INST]".format(prompt)
    inputs = tokenizer(b_prompt, return_tensors="pt").input_ids
    outputs = model.generate(inputs, streamer=streamer,
                num_beams=1, max_new_tokens=512, do_sample=True, repetition_penalty=1.1)
    inference_time = time.time() - start
    print(f'\n> Total Inference Time: {inference_time} seconds')

Compared to the code in the LLM Runtime example, I removed the line that quantizes the model:

woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")

So, am I required to quantize the model?
Thanks for the help! :)


So, am I required to quantize the model?

I'm afraid so. We only enable AMX with quantized weights, since unquantized weights benefit little due to the intensive runtime conversion. In fact, you are not even running our optimized LLM Runtime if WeightOnlyQuantConfig is not provided:

if isinstance(quantization_config, WeightOnlyQuantConfig):
    logger.info("Applying Weight Only Quantization.")
    if use_llm_runtime:
        logger.info("Using LLM runtime.")
        quantization_config.post_init_runtime()
        from intel_extension_for_transformers.llm.runtime.graph import Model
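
For completeness, re-adding the quantization config is what routes from_pretrained through that branch. A minimal sketch, assuming the int8/int4 settings from the LLM Runtime example:

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Passing WeightOnlyQuantConfig selects the optimized LLM Runtime
# (and the AMX int8 kernels) in the code path quoted above.
woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=woq_config, trust_remote_code=True
)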
