[BUG] dtype of loaded model is torch.bfloat16 instead of torch.float32 like transformers for Qwen/Qwen2.5-1.5B #209

@dalongbao

Description


System Information

OS: Ubuntu 22.04
Python Version: 3.10.16
GPU: NVIDIA GeForce RTX 4060 Ti

Problem Description

The dtype of the weights in the Qwen/Qwen2.5-1.5B model loaded with sllm is torch.bfloat16, while the same weights loaded with transformers are torch.float32.

[Screenshot: sllm output for Qwen/Qwen2.5-1.5b]

[Screenshot: transformers output for Qwen/Qwen2.5-1.5b]

This doesn't happen for facebook/opt-1.3b:

[Screenshot: sllm output for facebook/opt-1.3b]

[Screenshot: transformers output for facebook/opt-1.3b]

Steps to Reproduce

Code snippets:

First snippet (loading with sllm_store):

import os
import torch
from transformers import AutoTokenizer
from sllm_store.transformers import load_model, save_model

# model_name = "facebook/opt-1.3b"
model_name = "Qwen/Qwen2.5-1.5b"
model_folder = os.getenv("MODEL_FOLDER") # change this to your own path
model_path = os.path.join(model_folder, model_name)
# =======================================================================================================================
torch.cuda.empty_cache()
# warm up the GPU
num_gpus = torch.cuda.device_count()
for i in range(num_gpus):
    torch.ones(1).to(f"cuda:{i}")
    torch.cuda.synchronize()

model = load_model(
    model_name,
    device_map="auto",
    storage_path=model_folder, 
    fully_parallel=True,
)
# =======================================================================================================================
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

for name, param in model.named_parameters():
    print(f"{name}: shape={param.shape}, dtype={param.dtype}")

Second snippet (loading directly with transformers):

import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# model_name = "facebook/opt-1.3b"
model_name = "Qwen/Qwen2.5-1.5b"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
model = model.to('cuda')
# =======================================================================================================================
num_gpus = torch.cuda.device_count()
for i in range(num_gpus):
    torch.ones(1).to(f"cuda:{i}")
    torch.cuda.synchronize()

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

for name, param in model.named_parameters():
    print(f"{name}: shape={param.shape}, dtype={param.dtype}")

Steps:

  1. Start sllm-store as described in the quickstart guide: sllm-store start --storage-path $PWD/models --mem-pool-size 4GB
  2. Run the two snippets and compare the printed parameter dtypes (a combined comparison sketch follows below).
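
For convenience, here is a sketch that combines the two snippets and prints any parameter whose dtype differs between the two loaders. It makes the same assumptions as above (sllm-store running and the model already saved under MODEL_FOLDER); the variable names sllm_model / hf_model are just for this example:

import os
from transformers import AutoModelForCausalLM
from sllm_store.transformers import load_model

model_name = "Qwen/Qwen2.5-1.5b"
model_folder = os.getenv("MODEL_FOLDER")  # same storage path passed to sllm-store

# Load the same checkpoint through both code paths.
sllm_model = load_model(
    model_name,
    device_map="auto",
    storage_path=model_folder,
    fully_parallel=True,
)
hf_model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Compare dtypes parameter by parameter.
hf_dtypes = {name: p.dtype for name, p in hf_model.named_parameters()}
mismatches = [
    (name, p.dtype, hf_dtypes[name])
    for name, p in sllm_model.named_parameters()
    if name in hf_dtypes and p.dtype != hf_dtypes[name]
]
for name, sllm_dtype, hf_dtype in mismatches:
    print(f"{name}: sllm={sllm_dtype}, transformers={hf_dtype}")
print(f"{len(mismatches)} parameters with mismatched dtypes")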

Expected Behavior

Both loaders are expected to produce the same dtype (torch.float32).
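
As a possible interim workaround (just a sketch using a standard PyTorch cast, not a fix for the loader itself), the sllm-loaded model can be cast back to float32 after loading. Note that if the weights were already truncated to bfloat16 during save or load, casting does not recover the lost precision:

import torch

# model is the object returned by sllm_store.transformers.load_model above.
model = model.to(torch.float32)  # casts floating-point parameters and buffers
print(next(model.parameters()).dtype)  # torch.float32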

Additional Context

No response

Usage Statistics (Optional)

No response
