Description
Prerequisites
- I have read the ServerlessLLM documentation.
- I have searched the Issue Tracker to ensure this hasn't been reported before.
System Information
OS: Ubuntu 22.04
Python Version: 3.10.16
GPU: NVIDIA GeForce RTX 4060 Ti
Problem Description
The dtype of the weights in the sllm-loaded Qwen/Qwen2.5-1.5B model is torch.bfloat16, while the weights loaded via transformers are torch.float32.
sllm output for Qwen/Qwen2.5-1.5b
transformers output for Qwen/Qwen2.5-1.5b
This doesn't happen for facebook/opt-1.3b:
sllm output for facebook/opt-1.3b
transformers output for facebook/opt-1.3b
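A plausible factor (my assumption; not confirmed): Qwen2.5's config.json declares torch_dtype: bfloat16, whereas transformers' from_pretrained upcasts to torch.float32 unless a dtype is requested, so the two load paths may simply be honoring different dtype sources. The declared dtypes can be checked directly:

from transformers import AutoConfig

# Inspect the dtype each checkpoint declares in its config.json.
# (Assumption: Qwen2.5 declares bfloat16; opt-1.3b declares a different value.)
for name in ("Qwen/Qwen2.5-1.5B", "facebook/opt-1.3b"):
    cfg = AutoConfig.from_pretrained(name)
    print(name, cfg.torch_dtype)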
Steps to Reproduce
Code snippets:
import os
import torch
from transformers import AutoTokenizer
from sllm_store.transformers import load_model, save_model

# model_name = "facebook/opt-1.3b"
model_name = "Qwen/Qwen2.5-1.5b"
model_folder = os.getenv("MODEL_FOLDER")  # change this to your own path
model_path = os.path.join(model_folder, model_name)

# =======================================================================================================================
torch.cuda.empty_cache()

# warm up the GPU
num_gpus = torch.cuda.device_count()
for i in range(num_gpus):
    torch.ones(1).to(f"cuda:{i}")
torch.cuda.synchronize()

model = load_model(
    model_name,
    device_map="auto",
    storage_path=model_folder,
    fully_parallel=True,
)
# =======================================================================================================================

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# print the dtype of every parameter
for name, param in model.named_parameters():
    print(f"{name}: shape={param.shape}, dtype={param.dtype}")
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# model_name = "facebook/opt-1.3b"
model_name = "Qwen/Qwen2.5-1.5b"

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
model = model.to('cuda')

# =======================================================================================================================
# warm up the GPU
num_gpus = torch.cuda.device_count()
for i in range(num_gpus):
    torch.ones(1).to(f"cuda:{i}")
torch.cuda.synchronize()

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# print the dtype of every parameter
for name, param in model.named_parameters():
    print(f"{name}: shape={param.shape}, dtype={param.dtype}")
Steps to reproduce:
- Start sllm-store as described in the quickstart guide: sllm-store start --storage-path $PWD/models --mem-pool-size 4GB
- Run the two snippets above and compare the model parameter outputs. A sketch of the missing save step follows this list.
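Note the first snippet imports save_model but never calls it, so the model is assumed to already be in the store. If it isn't, a save step roughly like the quickstart's (hedged, since the exact call used isn't shown in the report) is needed first:

import os
from transformers import AutoModelForCausalLM
from sllm_store.transformers import save_model

model_name = "Qwen/Qwen2.5-1.5b"
model_folder = os.getenv("MODEL_FOLDER")

# Save the checkpoint into the sllm-store storage path. The dtype written
# here is whatever from_pretrained produced (float32 by default; the
# config's bfloat16 if torch_dtype="auto" was used).
model = AutoModelForCausalLM.from_pretrained(model_name)
save_model(model, os.path.join(model_folder, model_name))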
Expected Behavior
Both load paths are expected to produce the same dtype (torch.float32).
Additional Context
No response
Usage Statistics (Optional)
No response