Description
Prerequisites
- I have read the ServerlessLLM documentation.
- I have searched the Issue Tracker to ensure this hasn't been reported before.
System Information
OS: Ubuntu 22.04
Python Version: 3.10.16
GPU: NVIDIA GeForce RTX 4060 Ti
Problem Description
The dtype of the weights in the sllm-loaded Qwen/Qwen2.5-1.5B model is torch.bfloat16, while the weights loaded via transformers are torch.float32.
sllm output for Qwen/Qwen2.5-1.5b
transformers output for Qwen/Qwen2.5-1.5b
This doesn't happen for facebook/opt-1.3b:
sllm output for facebook/opt-1.3b
transformers output for facebook/opt-1.3b
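A plausible factor (my assumption; not confirmed): Qwen2.5's config.json declares torch_dtype: bfloat16, whereas transformers' from_pretrained upcasts to torch.float32 unless a dtype is requested, so the two load paths may simply be honoring different dtype sources. The declared dtypes can be checked directly:

from transformers import AutoConfig

# Inspect the dtype each checkpoint declares in its config.json.
# (Assumption: Qwen2.5 declares bfloat16; opt-1.3b declares a different value.)
for name in ("Qwen/Qwen2.5-1.5B", "facebook/opt-1.3b"):
    cfg = AutoConfig.from_pretrained(name)
    print(name, cfg.torch_dtype)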
Steps to Reproduce
Code snippets:
import os
import torch
from transformers import AutoTokenizer
from sllm_store.transformers import load_model, save_model

# model_name = "facebook/opt-1.3b"
model_name = "Qwen/Qwen2.5-1.5b"
model_folder = os.getenv("MODEL_FOLDER")  # change this to your own path
model_path = os.path.join(model_folder, model_name)

# =======================================================================================================================
torch.cuda.empty_cache()

# warm up the GPU
num_gpus = torch.cuda.device_count()
for i in range(num_gpus):
    torch.ones(1).to(f"cuda:{i}")
torch.cuda.synchronize()

model = load_model(
    model_name,
    device_map="auto",
    storage_path=model_folder,
    fully_parallel=True,
)
# =======================================================================================================================

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# print the dtype of every parameter
for name, param in model.named_parameters():
    print(f"{name}: shape={param.shape}, dtype={param.dtype}")
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# model_name = "facebook/opt-1.3b"
model_name = "Qwen/Qwen2.5-1.5b"

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
model = model.to('cuda')

# =======================================================================================================================
# warm up the GPU
num_gpus = torch.cuda.device_count()
for i in range(num_gpus):
    torch.ones(1).to(f"cuda:{i}")
torch.cuda.synchronize()

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# print the dtype of every parameter
for name, param in model.named_parameters():
    print(f"{name}: shape={param.shape}, dtype={param.dtype}")
Steps to reproduce:
- Start sllm-store as described in the quickstart guide: sllm-store start --storage-path $PWD/models --mem-pool-size 4GB
- Run the two snippets above and compare the model parameter outputs. A sketch of the missing save step follows this list.
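Note the first snippet imports save_model but never calls it, so the model is assumed to already be in the store. If it isn't, a save step roughly like the quickstart's (hedged, since the exact call used isn't shown in the report) is needed first:

import os
from transformers import AutoModelForCausalLM
from sllm_store.transformers import save_model

model_name = "Qwen/Qwen2.5-1.5b"
model_folder = os.getenv("MODEL_FOLDER")

# Save the checkpoint into the sllm-store storage path. The dtype written
# here is whatever from_pretrained produced (float32 by default; the
# config's bfloat16 if torch_dtype="auto" was used).
model = AutoModelForCausalLM.from_pretrained(model_name)
save_model(model, os.path.join(model_folder, model_name))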
Expected Behavior
Both load paths are expected to produce the same dtype (torch.float32).
Additional Context
No response
Usage Statistics (Optional)
No response