Prerequisites
- I have read the ServerlessLLM documentation.
- I have searched the Issue Tracker to ensure this hasn't been reported before.
System Information
OS: Ubuntu 22.04
Python Version: 3.10.16
GPU: NVIDIA GeForce RTX 4060 Ti
Problem Description
When a model is loaded with transformers, the lm_head weight isn't actually loaded as a separate tensor - it is tied to the input embeddings, so one fewer weight has to be loaded. Functionally there is no difference; sllm just stores and loads one extra parameter into memory.
https://discuss.huggingface.co/t/why-is-the-lm-head-layer-in-gpt2lmheadmodel-not-a-parameter/639/5
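For reference, the weight tying can be checked directly in transformers. The sketch below is not part of the original report; it assumes the facebook/opt-1.3b checkpoint used later in the snippet and shows that lm_head shares its storage with the input embeddings, so it never appears as a separate entry in named_parameters().

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

# With tied weights, the output head and the input embeddings are the same tensor.
lm_head_w = model.get_output_embeddings().weight
embed_w = model.get_input_embeddings().weight
print(lm_head_w.data_ptr() == embed_w.data_ptr())  # True when the weights are tied

# named_parameters() deduplicates shared tensors, so no separate lm_head entry appears,
# which is why the transformers model reports one fewer parameter tensor.
print(any("lm_head" in name for name, _ in model.named_parameters()))  # expected: False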
Steps to Reproduce
Code snippets:
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from sllm_store.transformers import load_model, save_model

model_name = "facebook/opt-1.3b"
# model_name = "Qwen/Qwen2.5-1.5b"
model_folder = os.getenv("MODEL_FOLDER")
model_path = os.path.join(model_folder, model_name)

torch.cuda.empty_cache()

# warm up the GPU
num_gpus = torch.cuda.device_count()
for i in range(num_gpus):
    torch.ones(1).to(f"cuda:{i}")
    torch.cuda.synchronize()

# sllm model
model = load_model(
    model_name,
    device_map="auto",
    storage_path=model_folder,
    fully_parallel=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

count1 = 0
for name, param in model.named_parameters():
    # print(f"{name}: {param.dtype}, {param.device}")
    count1 += 1

# transformers model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
)
model = model.to('cuda')
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

count2 = 0
for name, param in model.named_parameters():
    # print(f"{name}: shape={param.shape}, dtype={param.dtype}")
    count2 += 1

# the sllm parameter count is the transformers count + 1 due to the extra lm_head weight
print(f"sllm model: {count1} parameters, transformers model: {count2} parameters")
Steps to reproduce:
- Start sllm as in the quickstart
- Save the model (see the sketch after this list)
- Run the code above
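For completeness, a minimal sketch of the save step, assuming the save_model(model, storage_path) signature shown in the sllm-store quickstart and the same MODEL_FOLDER layout as the snippet above:

import os
from transformers import AutoModelForCausalLM
from sllm_store.transformers import save_model

model_name = "facebook/opt-1.3b"
model_folder = os.getenv("MODEL_FOLDER")

# Load the checkpoint with transformers, then persist it in sllm-store's on-disk format.
model = AutoModelForCausalLM.from_pretrained(model_name)
save_model(model, os.path.join(model_folder, model_name))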
Expected Behavior
No response
Additional Context
No response
Usage Statistics (Optional)
No response