[BUG] lm_head parameter is loaded as a weight when transformers doesn't #217

Open
@dalongbao

Description


Prerequisites

System Information

OS: Ubuntu 22.04
Python Version: 3.10.16
GPU: NVIDIA GeForce RTX 4060 Ti

Problem Description

When you load these models in transformers, the lm_head weight isn't loaded as a separate parameter - it is tied to the input embedding weights, so one fewer weight has to be loaded.

Functionally there's no difference in the output; sllm just stores and loads one extra parameter into memory.

https://discuss.huggingface.co/t/why-is-the-lm-head-layer-in-gpt2lmheadmodel-not-a-parameter/639/5
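
For reference, a minimal check (not from the report itself) of the weight tying described above, using the same facebook/opt-1.3b checkpoint as the snippet below; it assumes the standard tie_word_embeddings mechanism in transformers.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
# With weight tying enabled, the lm_head reuses the input embedding tensor,
# so no separate lm_head weight is stored or loaded.
print(model.config.tie_word_embeddings)                                              # expected: True
print(model.get_output_embeddings().weight is model.get_input_embeddings().weight)   # expected: True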

Steps to Reproduce

Code snippets:

import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

from sllm_store.transformers import load_model, save_model

model_name = "facebook/opt-1.3b"
#model_name = "Qwen/Qwen2.5-1.5b"
model_folder = os.getenv("MODEL_FOLDER")
model_path = os.path.join(model_folder, model_name)

torch.cuda.empty_cache()
# warm up the GPU
num_gpus = torch.cuda.device_count()
for i in range(num_gpus):
    torch.ones(1).to(f"cuda:{i}")
    torch.cuda.synchronize()


# sllm model
model = load_model(
    model_name,
    device_map="auto",
    storage_path=model_folder,
    fully_parallel=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

count1 = 0
for name, param in model.named_parameters():
    # print(f"{name}: {param.dtype}, {param.device}")
    count1 += 1


# transformers model
model = AutoModelForCausalLM.from_pretrained(model_name)
model = model.to("cuda")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

count2 = 0
for name, param in model.named_parameters():
    # print(f"{name}: shape={param.shape}, dtype={param.dtype}")
    count2 += 1


print(f"sllm model: {count1} parameters, transformers model: {count2} parameters") # the no. parameters in sllm is transformers + 1 due to the additional lm_head weight being loaded

Steps to reproduce:

  1. Start sllm-store as described in the quickstart
  2. Save the model (see the sketch after this list)
  3. Run the code snippet above
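
A rough sketch of step 2 (not part of the original report): saving the checkpoint into sllm-store's format, assuming the save_model(model, path) interface shown in the sllm-store quickstart and the same MODEL_FOLDER layout as the snippet above.

import os
import torch
from transformers import AutoModelForCausalLM
from sllm_store.transformers import save_model

model_name = "facebook/opt-1.3b"
model_folder = os.getenv("MODEL_FOLDER")

# Load the checkpoint with transformers, then write it out where sllm-store expects it
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
save_model(model, os.path.join(model_folder, model_name))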

Expected Behavior

No response

Additional Context

No response

Usage Statistics (Optional)

No response
