Prerequisites
- I have read the ServerlessLLM documentation.
- I have searched the Issue Tracker to ensure this hasn't been reported before.
System Information
OS: Ubuntu 22.04
Python Version: 3.10.16
GPU: NVIDIA GeForce RTX 4060 Ti
Problem Description
When a model is loaded with transformers, the lm_head weight isn't actually loaded as a separate tensor - it is tied to the input embeddings, so one fewer weight has to be loaded. Functionally there is no difference; sllm just stores and loads one extra parameter into memory.
https://discuss.huggingface.co/t/why-is-the-lm-head-layer-in-gpt2lmheadmodel-not-a-parameter/639/5
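For reference, the weight tying can be checked directly in transformers. The sketch below is not part of the original report; it assumes the facebook/opt-1.3b checkpoint used later in the snippet and shows that lm_head shares its storage with the input embeddings, so it never appears as a separate entry in named_parameters().

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

# With tied weights, the output head and the input embeddings are the same tensor.
lm_head_w = model.get_output_embeddings().weight
embed_w = model.get_input_embeddings().weight
print(lm_head_w.data_ptr() == embed_w.data_ptr())  # True when the weights are tied

# named_parameters() deduplicates shared tensors, so no separate lm_head entry appears,
# which is why the transformers model reports one fewer parameter tensor.
print(any("lm_head" in name for name, _ in model.named_parameters()))  # expected: False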
Steps to Reproduce
Code snippets:
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from sllm_store.transformers import load_model, save_model

model_name = "facebook/opt-1.3b"
# model_name = "Qwen/Qwen2.5-1.5b"
model_folder = os.getenv("MODEL_FOLDER")
model_path = os.path.join(model_folder, model_name)

torch.cuda.empty_cache()

# warm up the GPU
num_gpus = torch.cuda.device_count()
for i in range(num_gpus):
    torch.ones(1).to(f"cuda:{i}")
    torch.cuda.synchronize()

# sllm model
model = load_model(
    model_name,
    device_map="auto",
    storage_path=model_folder,
    fully_parallel=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

count1 = 0
for name, param in model.named_parameters():
    # print(f"{name}: {param.dtype}, {param.device}")
    count1 += 1

# transformers model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
)
model = model.to('cuda')
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

count2 = 0
for name, param in model.named_parameters():
    # print(f"{name}: shape={param.shape}, dtype={param.dtype}")
    count2 += 1

# the sllm parameter count is the transformers count + 1 due to the extra lm_head weight
print(f"sllm model: {count1} parameters, transformers model: {count2} parameters")
Steps to reproduce:
- Start sllm as in the quickstart
- Save the model (see the sketch after this list)
- Run the code above
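For completeness, a minimal sketch of the save step, assuming the save_model(model, storage_path) signature shown in the sllm-store quickstart and the same MODEL_FOLDER layout as the snippet above:

import os
from transformers import AutoModelForCausalLM
from sllm_store.transformers import save_model

model_name = "facebook/opt-1.3b"
model_folder = os.getenv("MODEL_FOLDER")

# Load the checkpoint with transformers, then persist it in sllm-store's on-disk format.
model = AutoModelForCausalLM.from_pretrained(model_name)
save_model(model, os.path.join(model_folder, model_name))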
Expected Behavior
No response
Additional Context
No response
Usage Statistics (Optional)
No response